
Choose HDFS paths and partitioning for ORES scores
Closed, Resolved · Public

Description

Currently, I'm thinking:

/wmf/data/ores/score/wiki=enwiki/model=editquality/ – individual scores, one per revision.  Partitioned by wiki and model.

Conclusion

We have an agreement to partition on wiki, model, and model_version like this:

/wmf/data/ores/score/wiki=enwiki/model=editquality/model_version=2.1/
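
A minimal sketch of what the Hive DDL for this layout could look like (the table name, column names, and storage format here are assumptions for illustration, not the final schema):

CREATE EXTERNAL TABLE ores_score (
  rev_id        BIGINT              COMMENT 'Revision that was scored',
  prediction    STRING              COMMENT 'Predicted class, e.g. damaging / goodfaith',
  probabilities MAP<STRING, DOUBLE> COMMENT 'Probability per class'
)
PARTITIONED BY (
  wiki          STRING,
  model         STRING,
  model_version STRING
)
STORED AS PARQUET
LOCATION '/wmf/data/ores/score';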

Event Timeline

It is worth looking at already existing event data. If we want to reuse the logic that reads events and persists them to Hive, partitions cannot be schema dependent; at this time the partitions are:

`datacenter` string,
`year` bigint,
`month` bigint,
`day` bigint,
`hour` bigint

It is worth looking at already existing event data. If we want to reuse the logic that reads events and persists them to Hive, partitions cannot be schema dependent.

This is true if data is to be read as events. Since data might be fully backfilled (on a model upgrade, for instance), maybe events are not the best fit?

It is worth looking at already existing event data. If we want to reuse the logic that reads events and persists them to Hive, partitions cannot be schema dependent.

This is true if data is to be read as events. Since data might be fully backfilled (on a model upgrade, for instance), maybe events are not the best fit?

Yes, backfilling is a major part of what we'll be doing to maintain the resulting table. I'm thinking that the event way is not a great fit, and planning to "tee" the stream into a specialized table. The main use case is: "Get me all scores for model X on wiki Y". Maybe we should also partition on model version, if that makes it easier to drop the old data during a model upgrade?
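
For illustration, a sketch of that main use case against the hypothetical ores_score table from the description (column names are assumptions); with wiki and model as partition columns, the query only reads the matching directories:

SELECT rev_id, prediction
FROM ores_score
WHERE wiki = 'enwiki'         -- partition predicate: only wiki=enwiki directories are scanned
  AND model = 'editquality';  -- partition predicate: only model=editquality directories are scanned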

I support the idea of using model name and version as partitions. Wiki_db would possibly be another good fit if requests will most often be on individual projects. Finally, if we partition on model name and version, time is not needed, I assume.

Rather than backfilling, which implies you are "filling a hole in the data", this is a complete model recalculation. As in: you are recalculating scores because either the model or the feature calculation was updated.

I'm thinking that the event way is not a great fit, and planning to "tee" the stream into a specialized table.

I think in this scenario almost 100% of the events are not in a stream; rather, you are reading them from MediaWiki revisions, that is understood. What I was trying to say is that you can persist revision information in a language that is already used to describe that data; an example of that can be found in the mediawiki_revision_score table in the event database (and newer incarnations of it).

I support the idea of using model name and version as partitions. Wiki_db would possibly be another good fit if requests will most often be on individual projects. Finally, if we partition on model name and version, time is not needed, I assume.

Great, thanks for the sanity check. Agreed that time shouldn't be needed.

Will the order of partitions make a difference? For example, if consumers are more likely to fetch scores from multiple models for a single wiki than from multiple wikis for a single model, would we put the wiki partition before the model partition in the HDFS path, or would this have no effect?

@awight: Can you explain the consumer use cases a bit? Data in Hive of this nature is mostly consumed by automated processes that create derived datasets. Do you have any other consumers in mind?

Rather than backfilling, which implies you are "filling a hole in the data", this is a complete model recalculation. As in: you are recalculating scores because either the model or the feature calculation was updated.

Yes, that's a good point. The "tee" is just for continuing to sync scores for recent changes.

I'm thinking that the event way is not a great fit, and planning to "tee" the stream into a specialized table.

I think in this scenario almost 100% of the events are not in a stream; rather, you are reading them from MediaWiki revisions, that is understood. What I was trying to say is that you can persist revision information in a language that is already used to describe that data; an example of that can be found in the mediawiki_revision_score table in the event database (and newer incarnations of it).

I might be misunderstanding this. Once we've piped a mediawiki_revision_score event into our new table, we'll never need to read that event again. Does that answer your point about using the existing format for revision events?

@awight: Can you explain the consumer use cases a bit? Data in Hive of this nature is mostly consumed by automated processes that create derived datasets. Do you have any other consumers in mind?

I have two use cases in mind,

  • The main use will be to produce regular dumps of all scores, one file for each <model, wiki> pair that has an ORES model. For example: get all itemquality scores for Wikidata in order to study the impact of bot contributions on data quality.
  • Some processes, such as the recommendation API, are already running on Hive and might, for example, benefit from the new scores table by finding the top 1% of articles by quality on a wiki (sorry, I'm not clear on how this training works right now); a rough sketch of such a query follows this list.
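
As a rough sketch of that second use case, again using the hypothetical ores_score table from the description; the articlequality model name and its 'FA' class are assumptions about how the scores are structured:

SELECT rev_id, quality_score
FROM (
  SELECT rev_id,
         probabilities['FA'] AS quality_score,  -- assumed: probability of the highest quality class
         PERCENT_RANK() OVER (ORDER BY probabilities['FA'] DESC) AS pct
  FROM ores_score
  WHERE wiki = 'enwiki'
    AND model = 'articlequality'
) ranked
WHERE pct <= 0.01;  -- keep roughly the top 1% by predicted quality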

Will the order of partitions make a difference? For example, if consumers are more likely to fetch scores from multiple models for a single wiki than from multiple wikis for a single model, would we put the wiki partition before the model partition in the HDFS path, or would this have no effect?

I think it makes no difference (or one so small compared to the actual computation time that it doesn't matter). We should focus on user-oriented readability, I assume :)

Some processes, such as the recommendation API, are already running on Hive and might, for example, benefit from the new scores table by finding the top 1% of articles by quality on a wiki (sorry, I'm not clear on how this training works right now).

This is probably a batch process running asynchronously, no different from the one you described before that would produce regular dumps, as Hadoop is not accessed by real-time processes at all. It seems, then, that the consumers of your data are just jobs creating files or derived tables, which is what we would expect.

Once we've piped a mediawiki_revision_score event into our new table, we'll never need to read that event again. Does that answer your point about using the existing format for revision events?

Hmm... maybe a conversation on IRC will clarify a bit? Revisions are deleted all the time, so having a set of "existing revisions" requires that you maintain that set. Makes sense?

Once we've piped a mediawiki_revision_score event into our new table, we'll never need to read that event again. Does that answer your point about using the existing format for revision events?

Hmm... maybe a conversation on IRC will clarify a bit? Revisions are deleted all the time, so having a set of "existing revisions" requires that you maintain that set. Makes sense?

I'll definitely bring this up on IRC so I understand better. The recalculation job is still a bit nebulous, but currently we're thinking it will work like this:

  • When we update a model version or introduce a new model, we run the recalculation job as a maintenance script from a dedicated box.
  • The job queries MediaWiki for a list of all revisions in a wiki. It hits the ORES API for each revision (in batches).
  • Probably uses an endpoint like "precache" in order to reuse the existing pipeline, but with an extra parameter to prevent caching in Redis, and maybe to prevent a mediawiki_revision_score event from being emitted...

The process above seems a bit error-prone (you do not want to hit your live pipeline to recalculate scores for events 15 years behind); those two calculations should probably have different priorities and not go against the same system. I could see how a model recalculation for enwiki would completely clog up your pipeline so that you would not be able to score newly created incoming revisions. Anyway, this is on your end of the system and it is up to your team. The partitions outlined by @JAllemandou above regarding model/wiki seem best.

Thanks for all the help!

@JAllemandou Can you confirm that we should partition on model_version? Will that make it possible to efficiently purge all data from an old model version?

Will that make it possible to efficiently purge all data from an old model version?

Yes, it would. Purging means you are going to drop the whole "model version" partition. A partition is a location in HDFS, which translates to a directory in HDFS.

Example (simplified):

if data looks like:
/wmf/model=some-model/version=10/wiki=fawiki/

then dropping version=10 is equivalent to running:

hdfs dfs -rm -r /wmf/model=some-model/version=10

If you partition per wiki and "model version", you can selectively delete data per "model version" per wiki. The data (not the table on top of it) will get dropped, so the same table will still exist on top of the remaining data, which could be /wmf/model=some-model/version=20/wiki=fawiki/
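
The same purge can also be expressed at the Hive level; a sketch, assuming the hypothetical ores_score table above (note that for an external table this removes only the partition metadata, and the directory would still be deleted with the hdfs dfs -rm -r shown above):

-- Drop one model version for a single wiki:
ALTER TABLE ores_score DROP IF EXISTS PARTITION (wiki = 'fawiki', model = 'some-model', model_version = '10');

-- Or drop that model version across all wikis at once (partial partition spec):
ALTER TABLE ores_score DROP IF EXISTS PARTITION (model = 'some-model', model_version = '10');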

Wonderful, that answers all the questions I have from our perspective. I'll update the task description…

awight updated the task description.

Another comment about folders that I hadn't thought of before reading your update in the description: I actually think that the chosen layout is not the most efficient.
In terms of data retrieval, using /wmf/data/ores/score/wiki=enwiki/model=editquality/model_version=2.1/ or /wmf/data/ores/score/model=editquality/model_version=2.1/wiki=enwiki/ is very similar.
However, it is different at the deletion step: the latter is a lot easier, as it involves only a single delete for the whole model.

Another comment about folders that I hadn't thought of before reading your update in the description: I actually think that the chosen layout is not the most efficient.
In terms of data retrieval, using /wmf/data/ores/score/wiki=enwiki/model=editquality/model_version=2.1/ or /wmf/data/ores/score/model=editquality/model_version=2.1/wiki=enwiki/ is very similar.
However, it is different at the deletion step: the latter is a lot easier, as it involves only a single delete for the whole model.

Oh, really! I'm pleasantly surprised to hear that retrieval is similar.

A specific model and version is actually specific to each wiki, however, so I think the wiki-first path might still be a better conceptual fit. Sometimes we do regenerate all models of a given type (i.e. model=editquality), but that's just a weird artifact of our model-training framework... I think we might as well run each model delete separately. Also, the model versions aren't necessarily consistent between wikis, nor do we necessarily update all models at once. As an artificial example, we might update <enwiki, drafttopic, 2.1> and <frwiki, drafttopic, 2.0> to <enwiki, drafttopic, 3.0> and <frwiki, drafttopic, 3.0>, but if we don't have some new NLP library for German, we can't update <dewiki, drafttopic, 2.2> to the 3.x API.