Page MenuHomePhabricator

Refactor revscoring to handle session-orientation
Open, Needs TriagePublic

Event Timeline

Halfak created this task.Aug 26 2019, 4:39 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 26 2019, 4:39 PM
Halfak claimed this task.Aug 26 2019, 4:46 PM
Halfak added a comment.Oct 4 2019, 8:56 PM

I'm way overdue in updating this task.

So I have a PR: https://github.com/wikimedia/revscoring/pull/450

I've taken two major steps.

  1. Write a function called list_of_tree that converts any dependency to a "list_of" dependency and rewrites a DependentSet (actually a tree/map structure) with these "list_of" dependencies.
  2. Implemented native vectorized operations across revscoring for meta features (e.g. min, max, add, div, sum, mean, etc.)

list_of_tree

This function converts a DependentSet (actually a tree/map structure) from a bunch of singleton datasources and features into "list_of" features. E.g., datasource.revision.text is a string. datasource.session.revisions.text is a list of strings. The "list_of" conversion function recursively converts all dependencies and uses memoization to make sure we don't re-convert any repeat dependencies. The one weird bit of this function is a name_rewrite method. It rewrites a dependency's name

vectorized operations

When everything becomes a list, all of the Features we have defined become FeatureVectors. In the past, we didn't do very much with FeatureVectors so they were generally not as well supported. But now that everything is a FeatureVector (at least in the case of session_oriented), we need to have vectors make more sense.

So for example, a singleton operation might look like this:

1 / 4 == 0.25

A vector operation might look like this

[1, 4, 7, 0] / [5, 10, 8, 2] == [0.2, 0.4, 0.875, 0]

And similarly, we need other operators to work as expected. E.g. with vector division we need to avoid divide by zero. So you might do something like this:

[1, 4, 7, 0] / max([5, 10, 8, 0], 1) = [0.2, 0.4, 0.875, 0]

because:

max([5, 10, 8, 0], 1) = [5, 10, 8, 1]

The way I achieved this is to make it so all Modifiers in revscoring.features understand vectors and will operate in the way you would expect. I built two decorators: function_applier() and binary_operator() that build functions that look for the presence of a FeatureVector in the argument list and either return a SingletonFunctionApplier or VectorFunctionApplier as appropriate.

Halfak added a comment.Oct 4 2019, 8:57 PM

As of right now, revscoring.datasources.session_oriented is complete and works as expected. revscoring.features.bytes.session is complete and works as expected. I'm hoping to have what exists now reviewed before I move forward with applying list_of_tree() to everything else.

Halfak added a project: revscoring.
Restricted Application added a project: artificial-intelligence. · View Herald TranscriptOct 4 2019, 8:57 PM
ACraze added a subscriber: ACraze.Oct 8 2019, 5:40 PM

@Halfak, everything looks great so far. I think you are good to go with applying list_of_tree() elsewhere. It seems well documented & tested, I didn't see anything that needed changing, although I am curious about the performance of the native vectorized operations.

New changes! See https://github.com/wikimedia/revscoring/compare/9ff5ac176d6fb71f4ccc88bd43a1e36439cb4968...04bb9dfd6acc7da1c908e02d77ff2729a4cb3875 for what is new.

User problem

So the user problem is that -- for a session, there is only one user and many revisions. It used to be that any revision could have a different user. So, I moved user into the session object.

  1. Old revision orientation:
  2. revision
    • diff
    • parent
    • page
    • user
      • info
      • last_revision
  1. New session orientation structure
  2. session
    • revisions
      • diff
      • parent
      • page
    • user
      • info
      • last_revision

This looks great until you need to apply features that require user.info.registration to be compared against revision.timestamp (e.g. for user.seconds_since_registration). So, in order to get around this, I first started trying to be clever and passing around "user_datasources" to the old feature constructors. Then I realized that I should just re-implement a few feature constructors. See SessionUser as an example of these re-implementations. I figured that it was more important to be direct than to be DRY to a fault.

The listy-functiony problem

So we have this concept inside of revscoring called a "Meta Dependency". It's essentially a function that returns a new Dependency -- usually a Feature or a Datasource. When we run the list_of_tree() method over the feature trees, we get all of the members that are defined as Feature and Datasource, but we don't get the functions. The functions still return the same non-listified Features and Datasources they used to. So in order to deal with this, I added a python decorator called DependentSet.meta_dependency (since the only place where these weird meta-dependecies need to exist is as a method of a DependentSet). It adds a cute little attribute to the method that we can look for later to see if the method is a regular method or if it is intended to behave as a Meta Dependency. In the list_of_tree method, I use python's inspect module to look for methods and check if they are a meta_dependency. If they are, I follow a decorator pattern to convert them into a method that returns a list_of_ify()'d Dependency. And it works as expected. Biggest downside is that I need to go everywhere in the feature tree looking for these methods and adding @DependentSet.meta_depencency. Honestly, I kind of like that they get flagged as such though.

Big remaining todo:

I haven't applied any of this to any of the language-based features yet. That's up next. The language based features will be interesting!