
Meta datasource/feature refactoring for revscoring
Closed, Resolved (Public)

Description

The revscoring library has a lot of code duplication for similar features -- e.g. duplication between features.revision and features.parent_revision.

We also have some features hard-coded for enwiki (e.g. features.revision.image_links and features.revision.cite_links).

These should be re-implemented as meta-features that require parameterization.

E.g. in /datasources/meta/wikitext_parsing.py

from ..datasource import Datasource

class WikiTextParseTree(Datasource):
  def __init__(self, text_datasource, name=None):
    ...
  def process(self, text):
    ...

In datasources/revision.py:

from .datasource import Datasource
from .meta import WikiTextParseTree

text = Datasource("revision.text")

parse_tree = WikiTextParseTree(text, name="revision.parse_tree")

In datasources/parent_revision.py:

from .datasource import Datasource
from .meta import WikiTextParseTree

text = Datasource("parent_revision.text")

parse_tree = WikiTextParseTree(text, name="parent_revision.parse_tree")

There is a large set of features that look like this, and we can clean them all up the same way.
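To make the idea concrete, here is a minimal, self-contained sketch of the meta-datasource pattern. The `Datasource` base class, the `depends_on` argument, and the `solve` helper are toy stand-ins invented for this example, not the real revscoring API, and the "parser" is just a whitespace split standing in for a real wikitext parser.

```python
class Datasource:
    """Toy stand-in for a datasource; NOT the real revscoring class."""

    def __init__(self, name, depends_on=()):
        self.name = name
        self.depends_on = tuple(depends_on)

    def process(self, *args):
        # Root datasources compute nothing; their value must be supplied.
        raise NotImplementedError(self.name + " must be provided in the cache")


class WikiTextParseTree(Datasource):
    """Meta-datasource: parameterized by whichever text datasource it parses."""

    def __init__(self, text_datasource, name=None):
        name = name or "parse_tree(" + text_datasource.name + ")"
        super().__init__(name, depends_on=[text_datasource])

    def process(self, text):
        # Placeholder "parse": a whitespace split, assumed for illustration.
        return text.split()


def solve(datasource, cache):
    """Hypothetical helper: resolve a datasource from cached root values."""
    if datasource.name not in cache:
        args = [solve(dep, cache) for dep in datasource.depends_on]
        cache[datasource.name] = datasource.process(*args)
    return cache[datasource.name]


# The same class serves both the revision and its parent.
text = Datasource("revision.text")
parse_tree = WikiTextParseTree(text, name="revision.parse_tree")

parent_text = Datasource("parent_revision.text")
parent_parse_tree = WikiTextParseTree(parent_text,
                                      name="parent_revision.parse_tree")

cache = {"revision.text": "Hello [[world]]", "parent_revision.text": "Hello"}
print(solve(parse_tree, cache))
print(solve(parent_parse_tree, cache))
```

The point is only that the parsing logic lives in one place and each module supplies the text datasource it cares about.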

Event Timeline

Halfak created this task. Dec 9 2015, 8:59 PM
Halfak updated the task description.
Halfak raised the priority of this task to Needs Triage.
Halfak moved this task to Active on the Scoring-platform-team (Current) board.
Halfak added a subscriber: Halfak.
Restricted Application added subscribers: StudiesWorld, Aklapper. Dec 9 2015, 8:59 PM
He7d3r updated the task description. Dec 12 2015, 5:17 PM
He7d3r set Security to None.

OK. So I think I've worked out something better. I think that we should nest datasources and their related features based on how they are oriented. Right now, they are all oriented to a particular revision (e.g. previous_user_revision means, in relation to the user who saved the current revision, their last revision before this one). That's complex and, honestly, a mouthful. I think that we can make the import structure reflect this relationship and orientation.

Here's what I've got:

<revision>

  • id (Datasource)
  • revision.timestamp (Datasource)
  • revision.comment (Datasource)
  • revision.byte_len (Datasource)
  • revision.minor (Datasource)
  • revision.content_model (Datasource)
  • revision.content_format (Datasource)
  • revision.text (Datasource)
  • revision.bytes (Datasource)
  • <parent>
  • <page>
  • <user>

<page>

  • id (Datasource)
  • title (Datasource)
  • <namespace>
  • <creation>

<namespace>

  • id (Datasource)
  • name (Datasource)

<parent>

  • id (Datasource)
  • revision.timestamp (Datasource)
  • revision.comment (Datasource)
  • revision.byte_len (Datasource)
  • revision.minor (Datasource)
  • revision.content_model (Datasource)
  • revision.content_format (Datasource)
  • revision.text (Datasource)
  • revision.bytes (Datasource)
  • <user>

<user>

  • id (Datasource)
  • text (Datasource)
  • editcount (Datasource)
  • registration (Datasource)
  • groups (Datasource)
  • emailable (Datasource)
  • gender (Datasource)
  • block_id (Datasource)
  • blocked_by (Datasource)
  • blocked_by_id (Datasource)
  • blocked_timestamp (Datasource)
  • block_reason (Datasource)
  • block_expiry (Datasource)
  • <last_revision>

<creation>

  • id (Datasource)
  • revision.timestamp (Datasource)
  • revision.comment (Datasource)
  • revision.byte_len (Datasource)
  • revision.minor (Datasource)
  • revision.content_model (Datasource)
  • revision.content_format (Datasource)
  • <user>

<last_revision>

  • id (Datasource)
  • revision.timestamp (Datasource)
  • revision.comment (Datasource)
  • revision.byte_len (Datasource)
  • revision.minor (Datasource)
  • revision.content_model (Datasource)
  • revision.content_format (Datasource)
  • <page>

It turns out that, by nesting these datasources, I can describe them with very few lines of code and re-use quite a bit. Most of the items are just a revision with a few fields missing (because they are irrelevant -- like revision.page.creation.page).
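A rough sketch of how this nesting could look in code, under some loud assumptions: the `Datasource` class here is a toy that only carries a dotted name, the `include_*` flags are hypothetical, and only a handful of the fields listed above are shown (`<namespace>` and `<last_revision>` are left out to keep it short). It is not the actual revscoring implementation.

```python
class Datasource:
    """Toy stand-in that only carries a dotted name (not revscoring's class)."""

    def __init__(self, name):
        self.name = name


class User:
    def __init__(self, prefix):
        self.id = Datasource(prefix + ".id")
        self.text = Datasource(prefix + ".text")


class Page:
    def __init__(self, prefix, include_creation=True):
        self.id = Datasource(prefix + ".id")
        self.title = Datasource(prefix + ".title")
        if include_creation:
            # The creating revision re-uses Revision, minus irrelevant parts.
            self.creation = Revision(prefix + ".creation",
                                     include_parent=False,
                                     include_page=False,
                                     include_content=False)


class Revision:
    def __init__(self, prefix, include_parent=True, include_page=True,
                 include_content=True):
        self.id = Datasource(prefix + ".id")
        self.timestamp = Datasource(prefix + ".timestamp")
        self.comment = Datasource(prefix + ".comment")
        if include_content:
            self.text = Datasource(prefix + ".text")
        self.user = User(prefix + ".user")
        if include_page:
            self.page = Page(prefix + ".page")
        if include_parent:
            # <parent> is just a Revision without its own parent or page.
            self.parent = Revision(prefix + ".parent",
                                   include_parent=False,
                                   include_page=False)


revision = Revision("revision")
print(revision.parent.text.name)
print(revision.page.creation.user.id.name)
```

Each nested item is the same `Revision` class with a different prefix and a few flags switched off, which is where the re-use comes from.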

I've been working on making the feature sets work with this structure too and that's resulted in a lot of code re-use there too. The last trick that I'm struggling with is how to get the same code re-use benefit in APIExtractor. Will have to think more about that and come back to it again.

He7d3r added a subscriber: He7d3r. Dec 23 2015, 2:50 PM

Just as an example of a feature collection that mimics this structure (partially), we have features.wikitext.tokenized:

<revision>

  • tokens (int)
  • whitespaces (int)
  • markups (int)
  • cjks (int)
  • urls (int)
  • entities (int)
  • words (int)
  • uppercase_words (int)
  • punctuations (int)
  • breaks (int)
  • <parent>

<parent>

  • tokens (int)
  • whitespaces (int)
  • markups (int)
  • cjks (int)
  • urls (int)
  • entities (int)
  • words (int)
  • uppercase_words (int)
  • punctuations (int)
  • breaks (int)

<diff>

  • token_delta_sum (int)
  • token_delta_increase (int)
  • token_delta_decrease (int)
  • token_prop_delta_sum (float)
  • token_prop_delta_increase (float)
  • token_prop_delta_decrease (float)
  • whitespace_delta_sum (int)
  • whitespace_delta_increase (int)
  • ...

Now that I am looking at this, it seems like <diff> should be part of any <revision> where both it and its <parent> have content. It doesn't really make sense to have <diff> be part of the datasource tree. Or maybe it can be, but it just doesn't have any datasources and just acts as a placeholder.

I forgot to make an important point about the structure described above. <revision> and <parent> are nearly identical. This is where the code re-use comes in.
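As an illustration of that re-use on the feature side, here is a small sketch in which one class produces the token-count features for both <revision> and <parent>. Everything in it is assumed for the example: the `TokenizedFeatures` class, the `token_counts` helper, the "revision.parent" prefix, and the toy regex tokenizer, which covers only a few of the counts listed above and is much cruder than a real wikitext tokenizer.

```python
import re
from collections import Counter

# Toy tokenizer (an assumption): whitespace runs, word runs, single punctuation.
TOKEN_RE = re.compile(
    r"(?P<whitespaces>\s+)|(?P<words>\w+)|(?P<punctuations>[^\w\s])"
)


def token_counts(text):
    """Count tokens by kind for a piece of text."""
    counts = Counter(tokens=0, whitespaces=0, words=0,
                     uppercase_words=0, punctuations=0)
    for match in TOKEN_RE.finditer(text):
        counts["tokens"] += 1
        counts[match.lastgroup] += 1  # name of the group that matched
        if match.lastgroup == "words" and match.group().isupper():
            counts["uppercase_words"] += 1
    return counts


class TokenizedFeatures:
    """Hypothetical feature set; the prefix is the only thing that varies."""

    def __init__(self, prefix):
        self.prefix = prefix

    def extract(self, text):
        return {self.prefix + "." + kind: count
                for kind, count in token_counts(text).items()}


# <revision> and <parent> share all of the counting code.
revision = TokenizedFeatures("revision")
parent = TokenizedFeatures("revision.parent")

print(revision.extract("Hello WORLD, hi"))
print(parent.extract("Hello"))
```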

Halfak closed this task as Resolved. Jan 21 2016, 3:44 PM