
Decide whether we will include raw features
Open, Lowest, Public

Description

I'd like to include the extracted features in Hadoop. This is a lot of additional data, so I wanted to raise it with Analytics as an open question. For each score (0.4–3KB), this would roughly double the amount of data we need to store.

Potential path:

/wmf/data/ores/feature/wiki=enwiki/model=editquality/feature_name=feature.wikitext.revision.parent.external_links

The raw features will be valuable for several reasons, including:

  • Potentially useful to researchers and tool authors.
  • Could dramatically reduce our time to retrain and adjust existing models.
  • Would create exciting possibilities for training or scoring directly in Hadoop.

A detail worth mentioning is that feature_name might have hundreds of values for each model. These features are "upstream" of the model_version, so they have their own time window for validity. One complication is that feature directories would need to be deleted in the rare event that we change how a feature's value is calculated without changing the feature's name.
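To make the layout concrete, here's a rough sketch (untested, and the example row and values are made up, not real data) of how a Spark job could write features into that partition scheme:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ores-feature-export").getOrCreate()

# Hypothetical example row: (wiki, model, feature_name, rev_id, value).
features = spark.createDataFrame(
    [
        ("enwiki", "editquality",
         "feature.wikitext.revision.parent.external_links", 123456789, 42.0),
    ],
    ["wiki", "model", "feature_name", "rev_id", "value"],
)

# partitionBy() produces directories shaped like the path above:
# .../wiki=enwiki/model=editquality/feature_name=feature.wikitext.revision.parent.external_links/
(features.write
    .mode("overwrite")
    .partitionBy("wiki", "model", "feature_name")
    .parquet("/wmf/data/ores/feature"))
```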

Event Timeline

Offsite chat suggests that there's value in actually storing the raw "root" data sources that we build the feature tree from. Let's estimate the storage requirements for doing that.

Sounds ok to us! Just to be sure, can you provide an estimate of the total increase in size? How often will that get updated and archived? Keep in mind that to update data you have to overwrite the whole directory.

I think the best way to produce an estimate would be to generate a dataset with root datasources based on a random sample of revisions. It's important to note that root data for the average edit likely grows over time, because revision content in a wiki also grows over time. So I think the best strategy would be to gather a purely random sample across all time and then extrapolate from that.

We can even use the random samples we use to train our models. E.g. for English Wikipedia's editquality models, we can use the 20k sample used in training and testing.
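The extrapolation itself would be trivial, something like this (the per-revision sizes below are placeholders, not measurements):

```python
# Serialized root-datasource size (bytes) per sampled revision; placeholder values.
sample_bytes = [5_200, 18_400, 2_300, 9_800]
mean_bytes = sum(sample_bytes) / len(sample_bytes)

total_revisions = 815_000_000  # approximate enwiki revision count
print(f"~{mean_bytes * total_revisions / 1024 ** 4:.1f} TB")
```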

One other note: there is quite a lot of overlap in root datasources across all models relevant to a single revision. So it would be beneficial if the schema allowed storing one set of root datasources per rev_id alongside multiple scores, one per model.
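For example, the row schema could look roughly like this (just a sketch; the field names are illustrative, only meant to show the one-datasource-set-per-rev_id idea):

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, LongType, MapType, DoubleType
)

row_schema = StructType([
    StructField("wiki", StringType(), nullable=False),
    StructField("rev_id", LongType(), nullable=False),
    # Raw "root" datasources, stored once per revision, keyed by datasource name.
    StructField("datasources", MapType(StringType(), StringType()), nullable=True),
    # One entry per model, e.g. {"editquality": {"damaging": 0.03, "goodfaith": 0.96}}.
    StructField("scores",
                MapType(StringType(), MapType(StringType(), DoubleType())),
                nullable=True),
])
```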

Just for fun, I elaborated on the quick estimate based on existing w_cache files. Note that these are not the "root" data sources; these are the final, calculated values, ready to be fed into a model. Storing the calculated values is one alternative we might consider. The tradeoff is that the values are compact and can be used by existing models, but they lack flexibility and completeness, so features added in the future would probably require another full MW API extraction.

I did a crude wc on cached features for each model, and got the following average bytes per record. Note that this is for a pickled format of the features, and includes all of the non-feature fields for each row—which might be a pretty realistic equivalent of a row in the proposed Hive table.

editquality: 8kB / row
articlequality: 3kB / row
draftquality: 0.5kB / row
drafttopic: 4kB / row

The features overlap to some degree between models, but I'll ignore that and sum the sizes (excluding page-based drafttopic) to get a conservative overestimate of about 12kB / revision. For enwiki, 815M revisions x 12kB ≈ 9.1TB.
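For the record, the arithmetic behind that total, treating 1kB as 1024 bytes:

```python
bytes_per_rev = 12 * 1024      # conservative overestimate per revision
revisions = 815_000_000        # approximate enwiki revision count
total_tb = bytes_per_rev * revisions / 1024 ** 4
print(f"~{total_tb:.1f} TB")   # ~9.1 TB
```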

This increase in data sounds fine, and the proposed example path looks fine too. Hundreds of subfolders are only annoying when it comes to repairing Hive tables to make them aware of new partitions, but if you're accessing from Spark it won't matter.
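To illustrate the difference (the ores.feature table name below is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# A Hive table has to be told about partitions added by writing files directly,
# which is where hundreds of feature_name values become tedious:
spark.sql("MSCK REPAIR TABLE ores.feature")

# Reading the files from Spark discovers partition directories on the fly:
df = spark.read.parquet("/wmf/data/ores/feature")
df.where("wiki = 'enwiki' AND model = 'editquality'").show(5)
```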

Harej triaged this task as Lowest priority. Mar 26 2019, 9:19 PM

Here's an industry precedent for building a shared feature store (disclaimer: I've only read the transcript, so I don't know whether the video is any good):

https://www.infoq.com/presentations/michelangelo-palette-uber

The section "Feature Store Organization" has some good ideas, and suggests this hierarchy:

  • entity being analyzed, e.g. revision
  • feature group, e.g. word embeddings
  • feature name
  • join key, e.g. revision ID

I like this better than our original plan, which couples the layout tightly to ORES usage; it would give us more room to grow and to share common features between various initiatives. For example, the path from the task description would become something like,

/wmf/data/feature/wiki=enwiki/entity=revision/feature_group=links/feature_name=feature.wikitext.revision.parent.external_links

and knowledge about how this feature is used in the ORES editquality model would be encapsulated within ORES. The same would go for any other consumer of this feature.
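As a sketch of what I mean (all names and the mapping below are illustrative, not an agreed layout):

```python
# Hypothetical helper showing how an ORES-side consumer could resolve features
# from the generic store.
FEATURE_STORE = "/wmf/data/feature"

def feature_path(wiki, entity, feature_group, feature_name):
    """Build the partition path for one feature under the proposed hierarchy."""
    return (f"{FEATURE_STORE}/wiki={wiki}/entity={entity}"
            f"/feature_group={feature_group}/feature_name={feature_name}")

# The editquality model's mapping from its own inputs to shared features would
# live in ORES, so the store itself stays model-agnostic:
EDITQUALITY_FEATURES = {
    "parent_external_links": feature_path(
        "enwiki", "revision", "links",
        "feature.wikitext.revision.parent.external_links"),
}
```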

The talk goes on to describe some useful ideas about synchronizing between the offline training data store and the online, realtime scoring store.

Here's another article with more background description of the platform, in case it's interesting:
https://eng.uber.com/michelangelo/

Another production feature store framework we might learn from,
https://www.logicalclocks.com/feature-store/

Using (or extending) the existing Data Lake Hadoop cluster for this might be a win-win, since Analytics folks might be able to get use out of some of the features too, depending on what they are.

Hopsworks looks really awesome; at least its claims do!

+1, either hops is doing some incredible organic marketing or their ideas and libraries are good and people are using it. I also like their plan to use https://github.com/uber/petastorm, and I think we should consider incorporating parts of their stack even if we don't stand up a Hops cluster.
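For example, the basic petastorm reading pattern would look roughly like this (untested sketch; the dataset path is illustrative and assumes the Parquet layout proposed above):

```python
from petastorm import make_batch_reader

# Read a plain Parquet dataset in batches suitable for feeding a trainer.
with make_batch_reader("hdfs:///wmf/data/feature/wiki=enwiki") as reader:
    for batch in reader:
        # Each batch is a named tuple of numpy arrays, one field per column.
        print(batch._fields)
        break
```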

I was thinking exactly the same about reusing the Data Lake cluster: putting the features directly into HDFS will let us use our existing, heterogeneous tools and avoid vendor lock-in.

It's such a small amount of data that we're probably also fine duplicating it behind a proprietary store if that would unlock some amazing use case, of course. But I'm suspicious by default, after recently reading ceejbot's reminder that open source doesn't mean open ownership or control. Let's start by borrowing just the ideas :-)

Please do let me know what the current consensus is around postponing any feature pipeline work until WMF has a dedicated machine learning team to get the architecture right. I'm excited to try some prototyping, but can see how premature decisions might hamstring us later. Alternatively, if we get lucky then an early prototype might provide helpful experiences which feed into the longer-term plan.

Basically, you said it: we understand that this is an important use case, but the people who would work on it still need to be hired onto the ML team, so it is unlikely that any design or prototyping will happen in the next six months. The bulk of Analytics' effort is dedicated to the new event platform, the mediacounts API, and GPU infrastructure for the next couple of quarters.