
Meta datasource/feature refactoring for revscoring
Closed, Resolved (Public)

Description

The revscoring library has a lot of code duplication for similar features -- e.g. duplication between features.revision and features.parent_revision.

We also have some features hard-coded for enwiki (e.g. features.revision.image_links and features.revision.cite_links).

Both kinds should be re-implemented as meta-features that take the varying part (the source datasource or the wiki-specific pattern) as a parameter.

E.g., in datasources/meta/wikitext_parsing.py:

from ..datasource import Datasource

class WikiTextParseTree(Datasource):
  def __init__(self, text_datasource, name=None):
    ...
  def process(self, text):
    ...

In datasources/revision.py:

from .datasource import Datasource
from .meta import WikiTextParseTree

text = Datasource("revision.text")

parse_tree = WikiTextParseTree(text, name="revision.parse_tree")

In datasources/parent_revision.py:

from .datasource import Datasource
from .meta import WikiTextParseTree

text = Datasource("parent_revision.text")

parse_tree = WikiTextParseTree(text, name="parent_revision.parse_tree")

There is a large set of features that look like this, and we can clean them up the same way.
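The same parameterization could cover the enwiki-specific features. The following is only a minimal sketch: `Datasource` here is a hypothetical stand-in rather than the real revscoring class, and `solve`, `RegexMatches`, and the image-link pattern are all assumptions made for illustration.

```python
import re


class Datasource:
    """Hypothetical stand-in for a revscoring datasource (not the real API):
    a named value, optionally computed from other datasources."""

    def __init__(self, name, process=None, depends_on=()):
        self.name = name
        self.process = process
        self.depends_on = list(depends_on)


def solve(datasource, cache):
    """Return the datasource's value, solving its dependencies first.
    `cache` maps datasource names to already-known values."""
    if datasource.name not in cache:
        if datasource.process is None:
            raise KeyError("no value provided for " + datasource.name)
        args = [solve(d, cache) for d in datasource.depends_on]
        cache[datasource.name] = datasource.process(*args)
    return cache[datasource.name]


class RegexMatches(Datasource):
    """Meta-datasource: every match of `pattern` in a text datasource."""

    def __init__(self, text_datasource, pattern, name=None):
        regex = re.compile(pattern, re.IGNORECASE)
        if name is None:
            name = "regex_matches({0})".format(text_datasource.name)
        super().__init__(name, process=regex.findall,
                         depends_on=[text_datasource])


# The wiki-specific part becomes a parameter instead of being hard-coded.
# This pattern is an illustrative guess at enwiki image links.
ENWIKI_IMAGE_LINK = r"\[\[(?:File|Image):[^\]|]+"

revision_text = Datasource("revision.text")
parent_text = Datasource("parent_revision.text")

revision_image_links = RegexMatches(
    revision_text, ENWIKI_IMAGE_LINK, name="revision.image_links")
parent_image_links = RegexMatches(
    parent_text, ENWIKI_IMAGE_LINK, name="parent_revision.image_links")
```

With this shape, another wiki would only need to supply its own pattern, and the revision and parent_revision variants share one class.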

Event Timeline

Halfak raised the priority of this task from to Needs Triage.
Halfak updated the task description.
Halfak moved this task to Parked on the Machine-Learning-Team (Active Tasks) board.
Halfak subscribed.
He7d3r set Security to None.

OK, I think I've worked out something better: we should nest datasources and their related features according to how they are oriented. Right now, they are all oriented to a particular revision (e.g. previous_user_revision means, in relation to the user who saved the current revision, their last revision before this one). That's complex and, honestly, a mouthful. I think we can make the import structure reflect this relationship and orientation.

Here's what I've got:

<revision>

  • id (Datasource)
  • revision.timestamp (Datasource)
  • revision.comment (Datasource)
  • revision.byte_len (Datasource)
  • revision.minor (Datasource)
  • revision.content_model (Datasource)
  • revision.content_format (Datasource)
  • revision.text (Datasource)
  • revision.bytes (Datasource)
  • <parent>
  • <page>
  • <user>

<page>

  • id (Datasource)
  • title (Datasource)
  • <namespace>
  • <creation>

<namespace>

  • id (Datasource)
  • name (Datasource)

<parent>

  • id (Datasource)
  • revision.timestamp (Datasource)
  • revision.comment (Datasource)
  • revision.byte_len (Datasource)
  • revision.minor (Datasource)
  • revision.content_model (Datasource)
  • revision.content_format (Datasource)
  • revision.text (Datasource)
  • revision.bytes (Datasource)
  • <user>

<user>

  • id (Datasource)
  • text (Datasource)
  • editcount (Datasource)
  • registration (Datasource)
  • groups (Datasource)
  • emailable (Datasource)
  • gender (Datasource)
  • block_id (Datasource)
  • blocked_by (Datasource)
  • blocked_by_id (Datasource)
  • blocked_timestamp (Datasource)
  • block_reason (Datasource)
  • block_expiry (Datasource)
  • <last_revision>

<creation>

  • id (Datasource)
  • revision.timestamp (Datasource)
  • revision.comment (Datasource)
  • revision.byte_len (Datasource)
  • revision.minor (Datasource)
  • revision.content_model (Datasource)
  • revision.content_format (Datasource)
  • <user>

<last_revision>

  • id (Datasource)
  • revision.timestamp (Datasource)
  • revision.comment (Datasource)
  • revision.byte_len (Datasource)
  • revision.minor (Datasource)
  • revision.content_model (Datasource)
  • revision.content_format (Datasource)
  • <page>

It turns out that, by nesting these datasources, I can describe them with very few lines of code and re-use quite a bit. Most of the items are just a revision with a few fields missing because they are irrelevant (like revision.page.creation.page).
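To illustrate the kind of re-use I mean, here is a rough sketch of the nesting with only a few fields per node. The class names, the `include_*` flags, and the points where the tree is cut off are all assumptions made for this example, not the actual implementation.

```python
class Datasource:
    """Hypothetical stand-in: only carries a dotted name."""

    def __init__(self, name):
        self.name = name


class User:
    def __init__(self, prefix, include_last_revision=True):
        self.id = Datasource(prefix + ".id")
        self.text = Datasource(prefix + ".text")
        self.editcount = Datasource(prefix + ".editcount")
        if include_last_revision:
            # A revision without content; nesting stops here in this sketch.
            self.last_revision = Revision(prefix + ".last_revision",
                                          include_content=False)


class Page:
    def __init__(self, prefix, include_creation=True):
        self.id = Datasource(prefix + ".id")
        self.title = Datasource(prefix + ".title")
        if include_creation:
            # The creating revision: has a user, but no content and no page.
            self.creation = Revision(prefix + ".creation",
                                     include_user=True,
                                     include_content=False)


class Revision:
    """One class serves revision, parent, creation and last_revision;
    the flags switch off the fields that are irrelevant for a node."""

    def __init__(self, prefix, include_parent=False, include_page=False,
                 include_user=False, include_content=True):
        self.id = Datasource(prefix + ".id")
        self.timestamp = Datasource(prefix + ".timestamp")
        self.comment = Datasource(prefix + ".comment")
        if include_content:
            self.text = Datasource(prefix + ".text")
        if include_parent:
            self.parent = Revision(prefix + ".parent", include_user=True)
        if include_page:
            self.page = Page(prefix + ".page")
        if include_user:
            self.user = User(prefix + ".user")


revision = Revision("revision", include_parent=True,
                    include_page=True, include_user=True)

print(revision.parent.user.id.name)
print(revision.page.creation.user.last_revision.timestamp.name)
```

Each node builds its datasource names from the prefix it is given, so the attribute path and the datasource name stay in step.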

I've been working on making the feature sets work with this structure as well, and that has resulted in a lot of code re-use there too. The last trick that I'm struggling with is how to get the same code re-use benefit in APIExtractor. I will have to think more about that and come back to it.

Just as an example of a feature collection that mimics this structure (partially), we have features.wikitext.tokenized:

<revision>

  • tokens (int)
  • whitespaces (int)
  • markups (int)
  • cjks (int)
  • urls (int)
  • entities (int)
  • words (int)
  • uppercase_words (int)
  • punctuations (int)
  • breaks (int)
  • <parent>

<parent>

  • tokens (int)
  • whitespaces (int)
  • markups (int)
  • cjks (int)
  • urls (int)
  • entities (int)
  • words (int)
  • uppercase_words (int)
  • punctuations (int)
  • breaks (int)

<diff>

  • token_delta_sum (int)
  • token_delta_increase (int)
  • token_delta_decrease (int)
  • token_prop_delta_sum (float)
  • token_prop_delta_increase (float)
  • token_prop_delta_decrease (float)
  • whitespace_delta_sum (int)
  • whitespace_delta_increase (int)
  • ...
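For anyone unfamiliar with the <diff> names, here is one plausible reading of them, sketched in plain Python. This is an assumption, not a confirmed definition: it treats the delta as the per-token frequency change from parent to revision, "increase" and "decrease" as the sums of the positive and negative deltas, and "prop" as the delta divided by the parent frequency plus one. The function names are made up for the example.

```python
from collections import Counter


def token_deltas(parent_tokens, revision_tokens):
    """Per-token frequency change from parent to revision.
    Tokens whose frequency did not change are left out."""
    parent_counts = Counter(parent_tokens)
    revision_counts = Counter(revision_tokens)
    deltas = {}
    for token in parent_counts.keys() | revision_counts.keys():
        delta = revision_counts[token] - parent_counts[token]
        if delta != 0:
            deltas[token] = delta
    return deltas


def prop_deltas(parent_tokens, deltas):
    """Deltas scaled by the parent frequency. The +1 is an assumed
    smoothing term so that brand-new tokens do not divide by zero."""
    parent_counts = Counter(parent_tokens)
    return {token: delta / (parent_counts[token] + 1)
            for token, delta in deltas.items()}


def summarize(values):
    """Aggregate deltas into the sum / increase / decrease triple."""
    values = list(values)
    return {
        "sum": sum(values),
        "increase": sum(v for v in values if v > 0),
        "decrease": sum(v for v in values if v < 0),
    }


parent = ["a", "b", "b", "c"]
revision_tokens = ["a", "a", "b", "d"]

deltas = token_deltas(parent, revision_tokens)
token_delta = summarize(deltas.values())
token_prop_delta = summarize(prop_deltas(parent, deltas).values())
```

Under this reading, the same three aggregates would be repeated for each token type (whitespace, markup, and so on), which is another place a meta-feature would save code.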

Now that I am looking at this, it seems like <diff> should be part of any <revision> where both it and its <parent> have content. It doesn't really make sense to have <diff> be part of the datasource tree. Or maybe it can be, but it just doesn't have any datasources and just acts as a placeholder.

I forgot to make an important point about the structure described above. <revision> and <parent> are nearly identical. This is where the code re-use comes in.
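A small sketch of that re-use on the feature side, with only a handful of the counts. The `Feature` and `RevisionTokenFeatures` classes and the regex tokenizer are placeholders invented for this example, not the real revscoring code.

```python
import re

# Crude illustrative tokenizer: runs of word characters, runs of
# whitespace, or single punctuation characters.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text):
    return TOKEN_RE.findall(text)


class Feature:
    """Hypothetical stand-in: a named function of the extracted values."""

    def __init__(self, name, compute):
        self.name = name
        self.compute = compute

    def __call__(self, values):
        return self.compute(values)


class RevisionTokenFeatures:
    """The same definition is used for <revision> and for <parent>;
    only the name prefix (and so the text it reads) differs."""

    def __init__(self, prefix, include_parent=False):
        text_name = prefix + ".text"

        def count(predicate):
            return lambda values: sum(
                1 for token in tokenize(values[text_name])
                if predicate(token))

        self.tokens = Feature(prefix + ".tokens", count(lambda t: True))
        self.whitespaces = Feature(prefix + ".whitespaces",
                                   count(lambda t: t.isspace()))
        self.words = Feature(prefix + ".words",
                             count(lambda t: t[0].isalnum()))
        self.uppercase_words = Feature(
            prefix + ".uppercase_words",
            count(lambda t: len(t) > 1 and t.isupper()))
        if include_parent:
            self.parent = RevisionTokenFeatures(prefix + ".parent")


revision = RevisionTokenFeatures("revision", include_parent=True)

values = {"revision.text": "Hello WORLD, ok",
          "revision.parent.text": "Hello"}
print(revision.tokens(values), revision.parent.tokens(values))
```

The parent features cost one extra line here, and a <diff> node could be attached in the same `if include_parent:` branch, since it only makes sense when both texts exist.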