Refactor revscoring to handle session-orientation
Open, LowestPublic
Actions

Assigned To

None

Authored By

	Halfak
	Aug 26 2019, 4:39 PM

Related Objects
Search...

Status	Assigned	Task
Declined	• Capt_Swing	T127511 Start sending TWA invites again through HostBot
Resolved	• Capt_Swing	T127259 Integrate ORES into HostBot
Open	None	T123095 Train and test newcomer quality models for deployment in ORES
Open	None	T231214 Refactor revscoring to handle session-orientation

Event Timeline

Halfak created this task.Aug 26 2019, 4:39 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 26 2019, 4:39 PM

Halfak claimed this task.Aug 26 2019, 4:46 PM

Halfak moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board.Sep 6 2019, 9:21 PM

Halfak moved this task from Review to Parked on the Machine-Learning-Team (Active Tasks) board.Sep 10 2019, 3:02 PM

I'm way overdue in updating this task.

So I have a PR: https://github.com/wikimedia/revscoring/pull/450

I've taken two major steps.

Write a function called list_of_tree that converts any dependency to a "list_of" dependency and rewrites a DependentSet (actually a tree/map structure) with these "list_of" dependencies.
Implemented native vectorized operations across revscoring for meta features (e.g. min, max, add, div, sum, mean, etc.)

list_of_tree

This function converts a DependentSet (actually a tree/map structure) from a bunch of singleton datasources and features into "list_of" features. E.g., datasource.revision.text is a string. datasource.session.revisions.text is a list of strings. The "list_of" conversion function recursively converts all dependencies and uses memoization to make sure we don't re-convert any repeat dependencies. The one weird bit of this function is a name_rewrite method. It rewrites a dependency's name

vectorized operations

When everything becomes a list, all of the Features we have defined become FeatureVectors. In the past, we didn't do very much with FeatureVectors so they were generally not as well supported. But now that everything is a FeatureVector (at least in the case of session_oriented), we need to have vectors make more sense.

So for example, a singleton operation might look like this:

1 / 4 == 0.25

A vector operation might look like this

[1, 4, 7, 0] / [5, 10, 8, 2] == [0.2, 0.4, 0.875, 0]

And similarly, we need other operators to work as expected. E.g. with vector division we need to avoid divide by zero. So you might do something like this:

[1, 4, 7, 0] / max([5, 10, 8, 0], 1) = [0.2, 0.4, 0.875, 0]

because:

max([5, 10, 8, 0], 1) = [5, 10, 8, 1]

The way I achieved this is to make it so all Modifiers in revscoring.features understand vectors and will operate in the way you would expect. I built two decorators: function_applier() and binary_operator() that build functions that look for the presence of a FeatureVector in the argument list and either return a SingletonFunctionApplier or VectorFunctionApplier as appropriate.

As of right now, revscoring.datasources.session_oriented is complete and works as expected. revscoring.features.bytes.session is complete and works as expected. I'm hoping to have what exists now reviewed before I move forward with applying list_of_tree() to everything else.

Halfak moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board.Oct 4 2019, 8:57 PM

Halfak added a project: revscoring.

Restricted Application added a project: artificial-intelligence. · View Herald TranscriptOct 4 2019, 8:57 PM

@Halfak, everything looks great so far. I think you are good to go with applying list_of_tree() elsewhere. It seems well documented & tested, I didn't see anything that needed changing, although I am curious about the performance of the native vectorized operations.

Halfak moved this task from Review to Parked on the Machine-Learning-Team (Active Tasks) board.Oct 15 2019, 3:04 PM

Halfak moved this task from Parked to Review on the Machine-Learning-Team (Active Tasks) board.Nov 7 2019, 9:44 PM

New changes! See https://github.com/wikimedia/revscoring/compare/9ff5ac176d6fb71f4ccc88bd43a1e36439cb4968...04bb9dfd6acc7da1c908e02d77ff2729a4cb3875 for what is new.

User problem

So the user problem is that -- for a session, there is only one user and many revisions. It used to be that any revision could have a different user. So, I moved user into the session object.

Old revision orientation:
revision
- diff
- parent
- page
- user
  - info
  - last_revision

New session orientation structure
session
- revisions
  - diff
  - parent
  - page
- user
  - info
  - last_revision

This looks great until you need to apply features that require user.info.registration to be compared against revision.timestamp (e.g. for user.seconds_since_registration). So, in order to get around this, I first started trying to be clever and passing around "user_datasources" to the old feature constructors. Then I realized that I should just re-implement a few feature constructors. See SessionUser as an example of these re-implementations. I figured that it was more important to be direct than to be DRY to a fault.

The listy-functiony problem

So we have this concept inside of revscoring called a "Meta Dependency". It's essentially a function that returns a new Dependency -- usually a Feature or a Datasource. When we run the list_of_tree() method over the feature trees, we get all of the members that are defined as Feature and Datasource, but we don't get the functions. The functions still return the same non-listified Features and Datasources they used to. So in order to deal with this, I added a python decorator called DependentSet.meta_dependency (since the only place where these weird meta-dependecies need to exist is as a method of a DependentSet). It adds a cute little attribute to the method that we can look for later to see if the method is a regular method or if it is intended to behave as a Meta Dependency. In the list_of_tree method, I use python's inspect module to look for methods and check if they are a meta_dependency. If they are, I follow a decorator pattern to convert them into a method that returns a list_of_ify()'d Dependency. And it works as expected. Biggest downside is that I need to go everywhere in the feature tree looking for these methods and adding @DependentSet.meta_depencency. Honestly, I kind of like that they get flagged as such though.

Big remaining todo:

I haven't applied any of this to any of the language-based features yet. That's up next. The language based features will be interesting!

Halfak added a parent task: T123095: Train and test newcomer quality models for deployment in ORES.Mar 18 2020, 7:27 PM

@Halfak I finally took a look at your PR,. It looks good so far, I like the @DependentSet.meta_depencencydecorator approach, although not sure how much work it will be to search through the feature tree to place the decorator. If you don't think that will be a burden, then I say go ahead and apply this to the language-based features.

Halfak moved this task from Review to Parked on the Machine-Learning-Team (Active Tasks) board.Apr 13 2020, 4:48 PM

Halfak removed Halfak as the assignee of this task.Jun 22 2020, 4:51 PM

Halfak edited projects, added Machine-Learning-Team; removed Machine-Learning-Team (Active Tasks).

Halfak triaged this task as Lowest priority.Jul 13 2020, 4:39 PM

Halfak moved this task from Unsorted to New development on the Machine-Learning-Team board.

• ACraze moved this task from New development to Backlog/Revscoring on the Machine-Learning-Team board.Jan 19 2021, 10:23 PM