Meta datasource/feature refactoring for revscoring
Closed, ResolvedPublic

Authored By

	Halfak
	Dec 9 2015, 8:59 PM

Description

The revscoring library has a lot of code duplication for similar features -- e.g. duplication between features.revision and features.parent_revision.

We also have some features hard-coded for enwiki (e.g. features.revision.image_links and features.revision.cite_links).

These should be re-implemented as meta-features that require parameterization.

E.g. in datasources/meta/wikitext_parsing.py:

from ..datasource import Datasource

class WikiTextParseTree(Datasource):
  def __init__(self, text_datasource, name=None):
    ...
  def process(self, text):
    ...

In datasources/revision.py:

from .datasource import Datasource
from .meta import WikiTextParseTree

text = Datasource("revision.text")

parse_tree = WikiTextParseTree(text, name="revision.parse_tree")

In datasources/parent_revision.py:

from .datasource import Datasource
from .meta import WikiTextParseTree

text = Datasource("parent_revision.text")

parse_tree = WikiTextParseTree(text, name="parent_revision.parse_tree")

There is a large set of features that look like this that we can clean up.
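To make the parameterization idea concrete, here is a minimal, self-contained sketch. It does not use the real revscoring classes: the `Datasource` below is a simplified stand-in written for this example, `WhitespaceTokens` is a hypothetical meta-datasource, and the `solve(cache)` method and the names `revision.tokens` / `parent_revision.tokens` are assumptions made for illustration. The point is only that one class, given a different text datasource, serves both the revision and its parent.

```python
class Datasource:
    """Simplified stand-in for the library's Datasource (hypothetical)."""

    def __init__(self, name, process=None, depends_on=None):
        self.name = name
        self.process = process
        self.depends_on = depends_on or []

    def solve(self, cache):
        # Injected or previously computed values are returned directly.
        if self.name in cache:
            return cache[self.name]
        if self.process is None:
            raise KeyError("No value provided for " + self.name)
        # Otherwise, solve the dependencies and compute this value.
        args = [dependency.solve(cache) for dependency in self.depends_on]
        cache[self.name] = self.process(*args)
        return cache[self.name]


class WhitespaceTokens(Datasource):
    """Hypothetical meta-datasource: parameterized by the text it reads."""

    def __init__(self, text_datasource, name=None):
        name = name or "whitespace_tokens({0})".format(text_datasource.name)
        super().__init__(name, process=self.process,
                         depends_on=[text_datasource])

    def process(self, text):
        # Naive whitespace split; a real implementation would parse wikitext.
        return text.split()


# The same class is reused for both the revision and its parent.
revision_text = Datasource("revision.text")
parent_text = Datasource("parent_revision.text")
revision_tokens = WhitespaceTokens(revision_text, name="revision.tokens")
parent_tokens = WhitespaceTokens(parent_text, name="parent_revision.tokens")

cache = {"revision.text": "Hello wiki world",
         "parent_revision.text": "Hello world"}
print(revision_tokens.solve(cache))  # ['Hello', 'wiki', 'world']
print(parent_tokens.solve(cache))    # ['Hello', 'world']
```

Passing the text datasource in as an argument is what removes the duplication: nothing in `WhitespaceTokens` knows whether it is looking at the revision or its parent.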

Related Objects

Status	Assigned	Task
Resolved	Halfak	T120138 [Epic] Explore disparate impacts of damage detection and goodfaith prediction on anons and newcomers.
Resolved	Halfak	T122269 [epic] revscoring 1.0.0
Resolved	Halfak	T121005 Meta datasource/feature refactoring for revscoring

Event Timeline

Halfak created this task. · Dec 9 2015, 8:59 PM

Halfak set the priority of this task to Needs Triage.

Halfak updated the task description.

Halfak added a project: Machine-Learning-Team (Active Tasks).

Halfak moved this task to Parked on the Machine-Learning-Team (Active Tasks) board.

Halfak subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. · Dec 9 2015, 8:59 PM

He7d3r updated the task description. · Dec 12 2015, 5:17 PM

He7d3r set Security to None.

Halfak moved this task from Parked to Backlog on the Machine-Learning-Team (Active Tasks) board. · Dec 23 2015, 3:58 AM

OK. So I think I've worked out something better. I think that we should be nesting datasources and related features based on how they are oriented. Right now, they are all oriented to a particular revision (e.g. previous_user_revision means, in relation to the user who saved the current revision, their last revision before this one). That's complex and honestly a mouthful. I think that we can make the import structure reflect this relationship and orientation.

Here's what I've got:

id (Datasource)
revision.timestamp (Datasource)
revision.comment (Datasource)
revision.byte_len (Datasource)
revision.minor (Datasource)
revision.content_model (Datasource)
revision.content_format (Datasource)
revision.text (Datasource)
revision.bytes (Datasource)
<parent>
<page>
<user>

<page>

id (Datasource)
title (Datasource)
<namespace>
<creation>

id (Datasource)
name (Datasource)

id (Datasource)
revision.timestamp (Datasource)
revision.comment (Datasource)
revision.byte_len (Datasource)
revision.minor (Datasource)
revision.content_model (Datasource)
revision.content_format (Datasource)
revision.text (Datasource)
revision.bytes (Datasource)
<user>

<user>

id (Datasource)
text (Datasource)
editcount (Datasource)
registration (Datasource)
groups (Datasource)
emailable (Datasource)
gender (Datasource)
block_id (Datasource)
blocked_by (Datasource)
blocked_by_id (Datasource)
blocked_timestamp (Datasource)
block_reason (Datasource)
block_expiry (Datasource)
<last_revision>

id (Datasource)
revision.timestamp (Datasource)
revision.comment (Datasource)
revision.byte_len (Datasource)
revision.minor (Datasource)
revision.content_model (Datasource)
revision.content_format (Datasource)
<user>

<last_revision>

id (Datasource)
revision.timestamp (Datasource)
revision.comment (Datasource)
revision.byte_len (Datasource)
revision.minor (Datasource)
revision.content_model (Datasource)
revision.content_format (Datasource)
<page>

It turns out that, by nesting these datasources, I can describe them with very few lines of code and re-use quite a bit. Most of the items are just a revision with a few fields missing (because they are irrelevant -- like revision.page.creation.page).
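As a rough illustration of why nesting keeps this short, here is a sketch under stated assumptions. The `Datasource` class is a name-only stand-in, the `Revision`, `Page`, `Namespace` and `User` classes and the `build_revision` helper are hypothetical, and the dotted naming scheme (e.g. `revision.page.creation.id`) is assumed for the example rather than taken from the library. The user fields are truncated to three for brevity.

```python
class Datasource:
    """Name-only stand-in for the library's Datasource (hypothetical)."""

    def __init__(self, name):
        self.name = name


class Revision:
    """A revision-shaped group of datasources, named under a prefix."""

    BASE_FIELDS = ("id", "timestamp", "comment", "byte_len", "minor",
                   "content_model", "content_format")
    CONTENT_FIELDS = ("text", "bytes")

    def __init__(self, prefix, include_content=True):
        fields = self.BASE_FIELDS
        if include_content:
            fields = fields + self.CONTENT_FIELDS
        for field in fields:
            setattr(self, field, Datasource(prefix + "." + field))


class Namespace:
    def __init__(self, prefix):
        self.id = Datasource(prefix + ".id")
        self.name = Datasource(prefix + ".name")


class Page:
    def __init__(self, prefix):
        self.id = Datasource(prefix + ".id")
        self.title = Datasource(prefix + ".title")
        self.namespace = Namespace(prefix + ".namespace")
        # The page's first revision is just another Revision.
        self.creation = Revision(prefix + ".creation")


class User:
    def __init__(self, prefix):
        for field in ("id", "text", "editcount"):  # truncated field list
            setattr(self, field, Datasource(prefix + "." + field))
        # The user's last revision re-uses Revision, minus the content.
        self.last_revision = Revision(prefix + ".last_revision",
                                      include_content=False)


def build_revision(prefix="revision"):
    """Assemble the top-level tree from the re-usable pieces."""
    revision = Revision(prefix)
    revision.parent = Revision(prefix + ".parent")
    revision.page = Page(prefix + ".page")
    revision.user = User(prefix + ".user")
    return revision


revision = build_revision()
print(revision.page.creation.id.name)          # revision.page.creation.id
print(revision.user.last_revision.timestamp.name)
```

Irrelevant branches simply never get attached: `revision.page.creation` has no `page` attribute in this sketch, so a name like `revision.page.creation.page` cannot be referenced by accident.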

I've been working on making the feature sets work with this structure too and that's resulted in a lot of code re-use there too. The last trick that I'm struggling with is how to get the same code re-use benefit in APIExtractor. Will have to think more about that and come back to it again.

Halfak added a parent task: T122269: [epic] revscoring 1.0.0. · Dec 23 2015, 4:13 AM

Halfak claimed this task. · Dec 23 2015, 4:15 AM

Halfak mentioned this in T121003: Implement word frequency diff features.

Halfak added a project: revscoring.

He7d3r subscribed. · Dec 23 2015, 2:50 PM

Just as an example of a feature collection that mimics this structure (partially), we have features.wikitext.tokenized:

tokens (int)
whitespaces (int)
markups (int)
cjks (int)
urls (int)
entities (int)
words (int)
uppercase_words (int)
punctuations (int)
breaks (int)
<parent>

tokens (int)
whitespaces (int)
markups (int)
cjks (int)
urls (int)
entities (int)
words (int)
uppercase_words (int)
punctuations (int)
breaks (int)

<diff>

token_delta_sum (int)
token_delta_increase (int)
token_delta_decrease (int)
token_prop_delta_sum (float)
token_prop_delta_increase (float)
token_prop_delta_decrease (float)
whitespace_delta_sum (int)
whitespace_delta_increase (int)
...

Now that I am looking at this, it seems like <diff> should be part of any <revision> where both it and its <parent> have content. It doesn't really make sense to have <diff> be part of the datasource tree. Or maybe it can be, but it just doesn't have any datasources and just acts as a placeholder.

I forgot to make an important point about the structure described above. <revision> and <parent> are nearly identical. This is where the code re-use comes in.
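A small sketch of what that re-use could look like on the feature side, with the caveat that everything here is illustrative: `TokenCounts` is a hypothetical class, features are returned as a plain dict keyed by name, the word-matching regex is a naive stand-in for real tokenization, and only two of the counts from the list above are computed.

```python
import re


class TokenCounts:
    """Hypothetical feature collection built from one text value."""

    def __init__(self, prefix, text_key):
        self.prefix = prefix      # e.g. "revision" or "revision.parent"
        self.text_key = text_key  # key of the text in the input values

    def extract(self, values):
        text = values[self.text_key]
        # Naive word matching: runs of letters only.
        words = re.findall(r"[^\W\d_]+", text)
        uppercase = [w for w in words if len(w) > 1 and w.isupper()]
        return {
            self.prefix + ".words": len(words),
            self.prefix + ".uppercase_words": len(uppercase),
        }


# One definition, instantiated once per revision-shaped node.
revision_counts = TokenCounts("revision", "revision.text")
parent_counts = TokenCounts("revision.parent", "revision.parent.text")

values = {"revision.text": "This is SPAM SPAM",
          "revision.parent.text": "This is fine"}
print(revision_counts.extract(values))
# {'revision.words': 4, 'revision.uppercase_words': 2}
print(parent_counts.extract(values))
# {'revision.parent.words': 3, 'revision.parent.uppercase_words': 0}
```

A `<diff>` collection would differ in that it needs both texts at once, which is consistent with treating it as something attached only where a revision and its parent both have content.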

See https://github.com/wiki-ai/revscoring/pull/233

Ladsgroup moved this task from Review to Backlog on the Machine-Learning-Team (Active Tasks) board. · Dec 30 2015, 9:41 PM

ToAruShiroiNeko removed a project: revscoring. · Jan 1 2016, 2:36 PM

Halfak moved this task from Backlog to Review on the Machine-Learning-Team (Active Tasks) board. · Jan 1 2016, 6:04 PM

Halfak moved this task from Review to Completed on the Machine-Learning-Team (Active Tasks) board. · Jan 15 2016, 5:57 PM

Halfak closed this task as Resolved. · Jan 21 2016, 3:44 PM
