
Merge ORES precaching with ORESFetchScoreJob
Closed, Declined (Public)

Description

Right now we have ChangeProp rules for precaching ORES and we also have a job that does pretty much the same. Can we merge them?

Questions:

  • CP is precaching both datacenters. Do we need that? Can we send requests to both DCs from the JobQueue?
  • What the JobQueue requests is a subset of what CP requests to cache. Can we request all from the JobQueue?
  • Can we get ORES into storing things in Cassandra so that we have automagic multi-dc replication?

Event Timeline

Restricted Application added a subscriber: Aklapper.

CP is precaching both datacenters. Do we need that?

Yup, ORES is active/active

Can we send requests to both DCs from the JobQueue?

No. The job writes to the MediaWiki database, and codfw only holds replicas; we can't (and shouldn't) write to replicas in the secondary datacenter.

What the JobQueue requests is a subset of what CP requests to cache. Can we request all from the JobQueue?

The models and wikis that have the ORES extension enabled. $wgOresModels and $wgOresUiEnabled have the data:

'wgOresUiEnabled' => [
	'default' => false,
	'arwiki' => true, // T192498
	'bswiki' => true, // T197010
	'cawiki' => true, // T192501
	'cswiki' => true, // T151611
	'enwiki' => true, // T140003
	'eswiki' => true, // T130279
	'eswikibooks' => true, // T145394
	'etwiki' => true, // T159609
	'fawiki' => true, // T130211
	'fiwiki' => true, // T163011
	'frwiki' => true,
	'hewiki' => true, // T161621
	'huwiki' => true, // T192496
	'lvwiki' => true, // T192499
	'nlwiki' => true, // T139432
	'plwiki' => true, // T140005
	'ptwiki' => true, // T139692
	'rowiki' => true, // T170723
	'ruwiki' => true,
	'simplewiki' => true, // T182012
	'sqwiki' => true, // T170723
	'srwiki' => true, // T197012
	'svwiki' => true, // T174560
	'trwiki' => true, // T139992
	'wikidatawiki' => true, // T130212
	'testwiki' => true, // T199913
	'test2wiki' => true, // T200412
],
'wgOresModels' => [
	'default' => [
		'damaging' => [ 'enabled' => true ],
		'goodfaith' => [ 'enabled' => true ],
		'reverted' => [ 'enabled' => false ],
		'wp10' => [ 'enabled' => false, 'namespaces' => [ 0 ], 'cleanParent' => true ],
		'draftquality' => [ 'enabled' => false, 'namespaces' => [ 0 ], 'types' => [ 1 ] ],
	],
	'enwiki' => [
		'damaging' => [ 'enabled' => true, 'excludeBots' => true ],
		'goodfaith' => [ 'enabled' => true, 'excludeBots' => true ],
		'reverted' => [ 'enabled' => false ],
		'wp10' => [ 'enabled' => true, 'namespaces' => [ 0, 118 ], 'cleanParent' => true ],
		'draftquality' => [ 'enabled' => true, 'namespaces' => [ 0, 118 ], 'types' => [ 1 ], 'excludeBots' => true, 'cleanParent' => true ],
	],
	'arwiki' => [
		'damaging' => [ 'enabled' => true ],
		// goodfaith is disabled for arwiki (T192498, T193905)
		'goodfaith' => [ 'enabled' => false ],
		'reverted' => [ 'enabled' => false ],
		'wp10' => [ 'enabled' => false, 'namespaces' => [ 0 ], 'cleanParent' => true ],
		'draftquality' => [ 'enabled' => false, 'namespaces' => [ 0 ], 'types' => [ 1 ] ],
	],
	'srwiki' => [
		'damaging' => [ 'enabled' => true ],
		// goodfaith is disabled for srwiki (T197012)
		'goodfaith' => [ 'enabled' => false ],
		'reverted' => [ 'enabled' => false ],
		'wp10' => [ 'enabled' => false, 'namespaces' => [ 0 ], 'cleanParent' => true ],
		'draftquality' => [ 'enabled' => false, 'namespaces' => [ 0 ], 'types' => [ 1 ] ],
	],
	'euwiki' => [
		'damaging' => [ 'enabled' => false ],
		'goodfaith' => [ 'enabled' => false ],
		'reverted' => [ 'enabled' => false ],
		'wp10' => [ 'enabled' => true, 'namespaces' => [ 0 ], 'cleanParent' => true ],
		'draftquality' => [ 'enabled' => false, 'namespaces' => [ 0 ], 'types' => [ 1 ] ],
	],
	'testwiki' => [
		'damaging' => [ 'enabled' => true, 'excludeBots' => true ],
		'goodfaith' => [ 'enabled' => true, 'excludeBots' => true ],
		'reverted' => [ 'enabled' => false ],
		// wp10 and draftquality are disabled until ORES is configured to allow these
		// for testwiki. See T198997
		'wp10' => [ 'enabled' => true, 'namespaces' => [ 0, 118 ], 'cleanParent' => true ],
		'draftquality' => [ 'enabled' => true, 'namespaces' => [ 0, 118 ], 'types' => [ 1 ], 'excludeBots' => true, 'cleanParent' => true ],
	],
	'test2wiki' => [
		'damaging' => [ 'enabled' => true, 'excludeBots' => true ],
		'goodfaith' => [ 'enabled' => true, 'excludeBots' => true ],
		'reverted' => [ 'enabled' => false ],
		// wp10 and draftquality are disabled until ORES is configured to allow these
		// for test2wiki. See T198997
		'wp10' => [ 'enabled' => false, 'namespaces' => [ 0, 118 ], 'cleanParent' => true ],
		'draftquality' => [ 'enabled' => false, 'namespaces' => [ 0, 118 ], 'types' => [ 1 ], 'excludeBots' => true, 'cleanParent' => true ],
	],
],
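As a rough illustration of how a per-wiki lookup against this config might work, here is a minimal Python sketch over an excerpt of the settings above. It assumes a per-wiki entry replaces `default` wholesale rather than being merged with it; `enabled_models` is a hypothetical helper, not part of the ORES extension.

```python
# Excerpt of the wgOresModels config above, transcribed to Python dicts.
ORES_MODELS = {
    "default": {
        "damaging": {"enabled": True},
        "goodfaith": {"enabled": True},
        "reverted": {"enabled": False},
        "wp10": {"enabled": False, "namespaces": [0], "cleanParent": True},
        "draftquality": {"enabled": False, "namespaces": [0], "types": [1]},
    },
    "enwiki": {
        "damaging": {"enabled": True, "excludeBots": True},
        "goodfaith": {"enabled": True, "excludeBots": True},
        "reverted": {"enabled": False},
        "wp10": {"enabled": True, "namespaces": [0, 118], "cleanParent": True},
        "draftquality": {"enabled": True, "namespaces": [0, 118], "types": [1],
                         "excludeBots": True, "cleanParent": True},
    },
    "arwiki": {
        "damaging": {"enabled": True},
        "goodfaith": {"enabled": False},
        "reverted": {"enabled": False},
        "wp10": {"enabled": False, "namespaces": [0], "cleanParent": True},
        "draftquality": {"enabled": False, "namespaces": [0], "types": [1]},
    },
}


def enabled_models(wiki):
    """Return the sorted names of models enabled for a wiki.

    Assumption: a wiki-specific entry overrides 'default' entirely.
    """
    models = ORES_MODELS.get(wiki, ORES_MODELS["default"])
    return sorted(name for name, opts in models.items() if opts["enabled"])
```

Under that assumption, a wiki with no entry of its own (frwiki in this excerpt) simply gets the default pair of damaging and goodfaith.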

Can we get ORES into storing things in Cassandra so that we have automagic multi-dc replication?

If you mean the ORES service, that sounds good to me. I personally don't know how redis-agnostic ORES is, but regardless, it should be. I also don't know how much of a hassle it is to change, but @akosiaris knows that way better than me.

We've had the ORES-->Restbase discussion a few times and I can't remember why we decided not to go that direction last time.

Generally, for ORES' caching concerns, we need to be able to store blobs (JSON) behind multi-part keys, and we need either practically unlimited storage or an LRU with at least 8GB.

ORES has a basic ScoreCache abstraction. We have implemented a dummy cache and a simple in-memory LRU as well as our redis connector LRU. So long as we can store and retrieve JSON blobs behind arbitrary keys, we should be able to build a Cassandra connector too.

See https://github.com/wiki-ai/ores/blob/master/ores/score_caches/score_cache.py for the interface we expect to implement.
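To make the "JSON blobs behind multi-part keys" requirement concrete, here is a minimal Python sketch of such an abstraction with an in-memory LRU behind it. The class names, method signatures, and key layout are assumptions for illustration; they are not copied from the linked ORES module.

```python
import json
from collections import OrderedDict


class ScoreCache:
    """Hypothetical minimal interface: JSON blobs behind multi-part keys."""

    def lookup(self, wiki, model, rev_id, version):
        raise NotImplementedError

    def store(self, score, wiki, model, rev_id, version):
        raise NotImplementedError


class InMemoryLRU(ScoreCache):
    """Simple LRU that evicts the least recently used entry when full."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def lookup(self, wiki, model, rev_id, version):
        key = (wiki, model, rev_id, version)
        blob = self._data[key]          # raises KeyError on a miss
        self._data.move_to_end(key)     # mark as most recently used
        return json.loads(blob)

    def store(self, score, wiki, model, rev_id, version):
        key = (wiki, model, rev_id, version)
        self._data[key] = json.dumps(score)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the oldest entry
```

A connector for another backend would only need to provide the same two methods over whatever key/value storage it has.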

What the JobQueue requests is a subset of what CP requests to cache. Can we request all from the JobQueue?

The models and wikis that have the ORES extension enabled. $wgOresModels and $wgOresUiEnabled have the data

Sorry for bothering again, I might have misunderstood - how does this compare to what /precache endpoint is doing right now?

We have implemented a dummy cache and a simple in-memory LRU as well as our redis connector LRU

Ok, Cassandra is not the best fit for LRU semantics; it's practically impossible to make that efficient there. All we can do is a LARGE TTL-based cache with multi-master replication. Obviously, truly unlimited storage is out of the question :) So if LRU is strictly required, we can't do that. If TTL-based is OK, then we can finally get rid of double-pre-processing every revision in both DCs and either split the pre-generation load between datacenters, or make more use of the non-primary DC, etc.

TLDR: not a big fan of double-processing everything and melting ice caps.
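For comparison with an LRU, the TTL semantics mentioned in this comment can be sketched in a few lines of Python. This is an in-memory illustration only: in a real deployment the storage backend would handle expiry itself, and `TTLScoreCache` with its injectable `clock` is a hypothetical name chosen for the example.

```python
import time


class TTLScoreCache:
    """Entries expire a fixed time after they are written, regardless of use."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock   # injectable so expiry can be tested
        self._data = {}

    def store(self, key, score):
        # Record the expiry time alongside the value at write time.
        self._data[key] = (self._clock() + self.ttl, score)

    def lookup(self, key):
        expires_at, score = self._data[key]   # KeyError on a miss
        if self._clock() >= expires_at:
            del self._data[key]               # lazily drop expired entries
            raise KeyError(key)
        return score
```

Unlike the LRU, nothing here depends on read patterns: an entry's lifetime is fixed when it is written, which is what makes this easier to replicate across datacenters.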

Pchelolo renamed this task from Merge ORES precaching to Merge ORES precaching with ORESFetchScoreJob.Aug 18 2018, 1:07 AM

Sorry for bothering again, I might have misunderstood - how does this compare to what /precache endpoint is doing right now?

It is way more complex. The precache endpoint is on the server side; this is the client, and it only hits the ORES service when needed. For example, it only sends a request for the draftquality model when a page is created on English Wikipedia. It doesn't ask for the model in other cases, but the precache endpoint gives out everything. Maybe we can talk about it later on IRC?
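The client-side filtering described here might look roughly like the Python sketch below, using the enwiki entry of wgOresModels quoted earlier in the thread. The meaning of each option is inferred from its name: `types => [1]` is assumed to mean page creations, `namespaces` to restrict by namespace, and `excludeBots` to skip bot edits; `cleanParent` is ignored. `models_to_request` is a hypothetical helper, not the extension's actual code.

```python
RC_EDIT = 0  # assumed change type for ordinary edits
RC_NEW = 1   # assumed change type for page creations ('types' => [1])

# enwiki entry from the wgOresModels config quoted earlier in the thread.
ENWIKI_MODELS = {
    "damaging": {"enabled": True, "excludeBots": True},
    "goodfaith": {"enabled": True, "excludeBots": True},
    "reverted": {"enabled": False},
    "wp10": {"enabled": True, "namespaces": [0, 118], "cleanParent": True},
    "draftquality": {"enabled": True, "namespaces": [0, 118], "types": [1],
                     "excludeBots": True, "cleanParent": True},
}


def models_to_request(models, namespace, rc_type, is_bot):
    """Return the sorted model names worth requesting for one change."""
    wanted = []
    for name, opts in models.items():
        if not opts.get("enabled", False):
            continue
        if "namespaces" in opts and namespace not in opts["namespaces"]:
            continue
        if "types" in opts and rc_type not in opts["types"]:
            continue
        if opts.get("excludeBots", False) and is_bot:
            continue
        wanted.append(name)
    return sorted(wanted)
```

With these assumptions, draftquality is only requested for a non-bot page creation in namespace 0 or 118, while an ordinary article edit asks for damaging, goodfaith, and wp10.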

TLDR: not a big fan of double-processing everything and melting ice caps.

<3 <3 <3 I want this to be fixed, that would be fantastic!

It is way more complex. The precache endpoint is on the server side; this is the client, and it only hits the ORES service when needed. For example, it only sends a request for the draftquality model when a page is created on English Wikipedia. It doesn't ask for the model in other cases, but the precache endpoint gives out everything. Maybe we can talk about it later on IRC?

Ok, interesting... So having the hook we're discussing in T201869 will not really be very useful to emit the revision-scored event - with the precache endpoint we can add much more ORES data into the event, which is better I guess. @Ottomata ? And requesting all the non-needed info from the job seems like very bad design.

TLDR: not a big fan of double-processing everything and melting ice caps.
<3 <3 <3 I want this to be fixed, that would be fantastic!

This is a bit separate from merging the job and change-prop rule, more related to cache replication.

HM ok. So if we can't do it in the MW JobQueue, then it's either from ORES /precache (or some endpoint!) or in change-prop. I think from our previous discussion, Aaron didn't want us to change the response of /precache sooooo? What should we do?

I'd like to get back to the topic of ORES score caches and RESTBase for a second :P My understanding is that precaching is done for new revisions. CP sends the req to ORES, which computes the result and stores it into Redis (in both DCs). Is that correct? If so, perhaps we don't need an LRU. If we key the data on the page title and rewrite scores when new revisions come in, then we are bounded by the number of articles (if we need to have separate records for different models, that can be accommodated as well). If Redis is used as an LRU, then using RESTBase would effectively achieve the same goal, with the added benefit that the last computed scores for each page would always be available (regardless of the last revision's age and/or number of scores that came after it). And, ofc, this would also implicitly deal with needing to compute the scores in each DC.

Keying on page title doesn't work because we store scores for revisions historically. Thus revision IDs are necessary. Also, it is important to note that page titles are not a durable identifier. It's not uncommon to rename pages.
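A small Python sketch of the difference, with made-up scores and titles: the same two scores are written to one store keyed by title and one keyed by revision ID. Every value here is hypothetical and only illustrates the keying argument.

```python
title_keyed = {}  # (wiki, model, title)  -> latest score only
rev_keyed = {}    # (wiki, model, rev_id) -> one score per revision


def store(wiki, model, title, rev_id, score):
    """Write the same score under both keying schemes."""
    title_keyed[(wiki, model, title)] = score
    rev_keyed[(wiki, model, rev_id)] = score


# Two revisions of the same page, scored in order (hypothetical values).
store("enwiki", "damaging", "Old title", 100, {"prediction": False})
store("enwiki", "damaging", "Old title", 101, {"prediction": True})

# The title-keyed store has overwritten the score for revision 100,
# and a lookup under the page's new name after a rename finds nothing.
history_lost = len(title_keyed) == 1
found_after_rename = ("enwiki", "damaging", "New title") in title_keyed
```

The revision-keyed store keeps both scores and is unaffected by the rename, at the cost of growing with the number of revisions rather than the number of pages.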

It seems to me that it would make perfect sense to store the most recent article quality and topic scores in RESTBase along with the most recent content, rendering, etc. However, this makes less sense for damaging/goodfaith. E.g. just because the most recent edit is scored as "not damaging" doesn't mean there is no damage in the page. It just means the last edit likely didn't cause any new damage.

Ladsgroup triaged this task as Medium priority.Nov 28 2018, 6:14 AM

As we're finally emitting the revision-score events and will make them publicly accessible soon, I would like to bring this up again.

So to summarize, it seems we actually have 3 different sets of models in different contexts:

  1. The /v3/precache uses the config from ORES itself to find out which models should be precached - that's used by ChangeProp
  2. The wgOresModels config, which specifies what needs to be fetched and stored in the DB by the FetchScoreJob
  3. (Not sure about this) the config of all available models, which is returned by the ORES API when only the rev_id is specified.

Since the revision-score event is created by change-prop, which uses the precache endpoint, only the precached models are available in the stream. Ideally, it seems, we want all the models to be in the stream. This would mean calling ORES for all models after the revision-create event (possibly only for the needed namespaces). I'm not proposing to change anything, just trying to figure out what we want to have eventually.
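The gap between the three sets can be expressed as simple set arithmetic. In the Python sketch below, only the wgOresModels set comes from the enwiki config quoted earlier in the thread; the precached and available sets are hypothetical placeholders, since their real contents are not listed in this task.

```python
# Set 1: what /v3/precache scores (HYPOTHETICAL placeholder contents).
precached = {"damaging", "goodfaith"}

# Set 2: models enabled for enwiki in wgOresModels (from the config above).
fetch_score_job = {"damaging", "goodfaith", "wp10", "draftquality"}

# Set 3: everything ORES can score for the wiki (HYPOTHETICAL placeholder).
available = fetch_score_job | {"reverted"}

# Models that exist but never reach the revision-score stream.
missing_from_stream = available - precached

# Models the job stores in the DB that the stream does not carry.
job_only = fetch_score_job - precached
```

Extending precaching to close the gap would amount to making `missing_from_stream` empty, possibly restricted to the namespaces where each model applies.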

Right now, the only reason we limit /v3/precache is to save cycles (and icebergs). If we want to extend what gets scored as part of precaching in order to fill out the revision-score stream, that is OK. We're essentially trying to target scores that will be important and useful. If there's a use-case for adding to revision-score, then it belongs in /v3/precache too -- so no conflict there.

elukey subscribed.

We are moving to Lift Wing: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing

I am closing old tasks related to ORES since it is being deprecated. Please re-open if you feel that any work could be done on Lift Wing.