
Decide whether we will include raw features
Open, Lowest, Public

Description

I'd like to include the extracted features in Hadoop. This is a lot of additional data, so I wanted to raise it with Analytics as an open question: each score is already 0.4–3KB, and storing its features would roughly double the amount of data we need to store.

Potential path:

/wmf/data/ores/feature/wiki=enwiki/model=editquality/feature_name=feature.wikitext.revision.parent.external_links
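
As a rough sketch of how that layout could be consumed, assuming the data were written as Parquet under that path with a rev_id and a feature value column (both column names are hypothetical), Spark would pick up the wiki/model/feature_name partitions from the directory names:

  # Minimal PySpark sketch; the path is from the example above, column names are illustrative only.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  features = (
      spark.read.parquet("/wmf/data/ores/feature")
      .where("wiki = 'enwiki' AND model = 'editquality'")
      .where("feature_name = 'feature.wikitext.revision.parent.external_links'")
  )
  features.select("rev_id", "feature_value").show(10)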

The raw features will be valuable for several reasons, including:

  • Potentially useful to researchers and tool authors.
  • Could dramatically reduce our time to retrain and adjust existing models.
  • Would create exciting possibilities for training or scoring directly in Hadoop.

A detail worth mentioning is that feature_name might have hundreds of values for each model. These features are "upstream" of the model_version, so they have their own time window for validity. One complication is that feature directories would need to be deleted in the rare event that we change how a feature's value is calculated without changing the feature's name.
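
If the data ends up registered as an external Hive table, invalidating such a directory could look roughly like the sketch below (the table name ores.feature is hypothetical, and for an external table the files under the partition path would still need to be removed separately):

  # Hypothetical cleanup when a feature's calculation changes but its name does not.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  spark.sql("""
      ALTER TABLE ores.feature
      DROP IF EXISTS PARTITION (
          wiki = 'enwiki',
          model = 'editquality',
          feature_name = 'feature.wikitext.revision.parent.external_links'
      )
  """)
  # For an external table this only updates the metastore; the underlying
  # partition directory still needs an hdfs dfs -rm -r.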

Event Timeline

awight created this task. Dec 3 2018, 11:41 PM
awight added subscribers: Halfak, Ladsgroup.
awight updated the task description. Dec 3 2018, 11:43 PM
awight added a comment. Dec 7 2018, 3:57 PM

Offsite chat suggests that there's value in actually storing the raw "root" data sources that we build the feature tree from. Let's estimate the storage requirements for doing that.

fdans added a subscriber: fdans. Edited Dec 10 2018, 4:55 PM

Sounds ok to us! Just to be sure, can you provide an estimate of the total increase in size? How often will that get updated and archived? Keep in mind that to update data you have to overwrite the whole directory.
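
To illustrate the overwrite constraint, refreshing one wiki/model directory from Spark would be something like the following sketch (the data and column names are invented for illustration; by default mode("overwrite") replaces the whole target directory rather than only the touched partitions):

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Hypothetical recomputed rows; column names are illustrative only.
  new_features = spark.createDataFrame(
      [(12345, "feature.wikitext.revision.parent.external_links", 7.0)],
      ["rev_id", "feature_name", "feature_value"],
  )

  # "overwrite" wipes and rewrites the target directory in one go.
  (new_features
      .write
      .mode("overwrite")
      .partitionBy("feature_name")
      .parquet("/wmf/data/ores/feature/wiki=enwiki/model=editquality"))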

fdans moved this task from Incoming to Radar on the Analytics board. Dec 10 2018, 4:57 PM
fdans added a subscriber: Milimetric.

I think the best way to produce an estimate would be to generate a dataset of root datasources from a random sample of revisions. It's worth noting that root data for the average edit will likely grow over time, because revision content in a wiki tends to grow. So the best strategy would be to gather a purely random sample across all time and extrapolate from that.

We can even use the random samples we use to train our models. E.g. for English Wikipedia's editquality models, we can use the 20k sample used in training and testing.

One other note: there is quite a lot of overlap in the root datasources used by the models relevant to a single revision. So it would be beneficial if the schema allowed storing one set of root datasources per rev_id alongside multiple scores, one per model.
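
A sketch of what that could look like as a row schema, with all field names hypothetical: one row per rev_id holding the shared root datasources, plus a nested map of per-model scores.

  # Hypothetical Spark schema: root datasources stored once per revision,
  # scores keyed by model name (model_version could be nested in as well).
  from pyspark.sql.types import (
      StructType, StructField, LongType, StringType, MapType, DoubleType
  )

  row_schema = StructType([
      StructField("rev_id", LongType(), nullable=False),
      StructField("root_datasources", MapType(StringType(), StringType())),
      StructField("scores", MapType(StringType(),                           # model name
                                    MapType(StringType(), DoubleType()))),  # outcome -> probability
  ])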

Just for fun, I elaborated on the quick estimate based on existing w_cache files. Note that these are not the "root" data sources; they are the final, calculated values ready to be fed into a model. Storing the calculated values is one alternative we might consider: the values are compact and can be used with the existing models, but they lack flexibility and completeness, so features added in the future would probably require another full MW API extraction.

I did a crude wc on cached features for each model, and got the following average bytes per record. Note that this is for a pickled format of the features, and includes all of the non-feature fields for each row—which might be a pretty realistic equivalent of a row in the proposed Hive table.

editquality: 8kB / row
articlequality: 3kB / row
draftquality: 0.5kB / row
drafttopic: 4kB / row

The features overlap to some degree between models, but I'll ignore that and sum the sizes (excluding page-based drafttopic) to get a conservative overestimate of about 12kB / revision. For enwiki, 815M revisions x 12kB ≈ 9.8TB.
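
Spelling out that arithmetic (per-model sizes from the list above; drafttopic excluded because it is page-based rather than revision-based):

  # Back-of-the-envelope check of the estimate above.
  bytes_per_row = {"editquality": 8_000, "articlequality": 3_000, "draftquality": 500}
  per_revision = sum(bytes_per_row.values())   # 11.5 kB, rounded up to 12 kB
  enwiki_revisions = 815_000_000
  total_bytes = enwiki_revisions * 12_000
  print(total_bytes / 1e12)                    # ~9.8 TB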

This increase in data sounds fine, and the proposed example path looks fine too. Hundreds of subfolders are only annoying when it comes to repairing Hive tables to make them aware of new partitions, but if you're accessing from Spark it won't matter.
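
For reference, the repair step being referred to is roughly the following (assuming a hypothetical external Hive table ores.feature over the directory layout above); it rescans the directory tree, which is where hundreds of feature_name values become annoying:

  # Hypothetical: make the Hive metastore aware of newly written partitions.
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()
  spark.sql("MSCK REPAIR TABLE ores.feature")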

awight removed a subscriber: awight. Thu, Mar 21, 4:05 PM
Harej triaged this task as Lowest priority. Tue, Mar 26, 9:19 PM