
ores sends duplicate page text over the wire
Closed, Resolved · Public

Description

These are the current datasources that ORES sends over the wire from uwsgi to Celery:

{'datasource.revision.user.info.doc': {'editcount': 1247, 'registration': None, 'name': 'E.B.', 'userid': 67, 'gender': 'unknown', 'groups': ['*', 'user', 'autoconfirmed']},
 'datasource.revision.user.id': 67,
 'datasource.revision.parent.text': 'lots of text',
 'datasource.revision.user.info.registration': None,
 'datasource.revision.id': 4569,
 'datasource.revision.user.info.groups': {'*', 'user', 'autoconfirmed'},
 'datasource.extractor.dependents': {<datasource.revision.user.id>, <datasource.revision.parent.text>, <datasource.revision.user.info.registration>, <datasource.revision.user.info.groups>, <datasource.revision.comment>, <datasource.revision.text>, <datasource.revision.timestamp>, <datasource.revision.page.namespace.id>},
 'datasource.revision.doc': {'slots': {'main': {'*': 'lots of text', 'contentformat': 'text/x-wiki', 'contentmodel': 'wikitext'}}, 'revid': 4569, 'size': 6192, 'userid': 67, 'page': {'pageid': 105, 'ns': 0, 'title': 'Književnost'}, 'comment': '', 'parentid': 4562, 'user': 'E.B.', 'timestamp': '2004-08-22T06:12:32Z'},
 'datasource.revision.text': 'lots of text',
 'datasource.revision.parent.doc': {'slots': {'main': {'*': 'lots of text', 'contentformat': 'text/x-wiki', 'contentmodel': 'wikitext'}}, 'revid': 4562, 'size': 6196, 'userid': 67, 'page': {'pageid': 105, 'ns': 0, 'title': 'Književnost'}, 'comment': 'sintaksa, gramatika...', 'parentid': 4560, 'user': 'E.B.', 'timestamp': '2004-08-22T05:56:51Z'},
 'datasource.revision.timestamp': Timestamp('2004-08-22T06:12:32Z'),
 'datasource.revision.page.namespace.id': 0,
 'datasource.revision.comment': ''}

I replaced the actual page text with "lots of text". As you can see, both the revision text and the parent revision text are duplicated: each appears once as a standalone datasource and again inside the corresponding doc. Fixing this issue will free up Redis capacity and save some cute polar bears from dying.
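
To make the duplication concrete, here's a minimal sketch against a toy cache shaped like the dump above (the keys are real; "lots of text" stands in for the actual page text):

# A toy cache with the same shape as the dump above.
cache = {
    'datasource.revision.text': 'lots of text',
    'datasource.revision.doc': {
        'slots': {'main': {'*': 'lots of text'}}},
    'datasource.revision.parent.text': 'lots of text',
    'datasource.revision.parent.doc': {
        'slots': {'main': {'*': 'lots of text'}}},
}

# The same text travels twice for the revision and twice for its parent:
assert cache['datasource.revision.text'] == \
    cache['datasource.revision.doc']['slots']['main']['*']
assert cache['datasource.revision.parent.text'] == \
    cache['datasource.revision.parent.doc']['slots']['main']['*']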

Event Timeline

I just looked into this… I'm not sure how to tackle this: I guess the cleanest would be to just remove datasource.revision(.parent).text in favor of the text in datasource.revision(.parent).doc… but I doubt that's feasible now.

@Ladsgroup Can we change datasource.revision(.parent).text so that it's just a "pointer" to datasource.revision(.parent).doc["slots"]["main"]["*"] that is not fully serialized in JSON? Maybe using Dependent?
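
Roughly what I have in mind, as a minimal sketch: revscoring's Datasource is a Dependent subclass that takes a process function and a depends_on list, so the text could be derived from the doc instead of being serialized alongside it. The revision_doc and revision_text names here are illustrative, not the real datasource definitions:

from revscoring.datasources import Datasource

# Stand-in for the real datasource.revision.doc, which holds the full
# API revision document.
revision_doc = Datasource("revision.doc")

# The "pointer": revision.text is computed from the doc on demand, so
# only the doc would need to travel over the wire.
revision_text = Datasource(
    "revision.text",
    process=lambda doc: doc['slots']['main']['*'],
    depends_on=[revision_doc])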

It sounds good to me!

We shouldn't really be having this issue. dig() should only ever return the base datasources. I don't believe that datasource.revision.doc should exist in what is being sent over the wire.

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mwapi
>>> from revscoring.dependencies import dig
>>> from editquality.feature_lists import enwiki
>>> list(dig(enwiki.damaging))
[<datasource.revision.page.namespace.id>, <datasource.revision.parent.text>, <feature.1>, <datasource.revision.text>, <datasource.revision.user.info.groups>, <datasource.revision.user.id>, <datasource.revision.user.info.registration_str>, <datasource.revision.timestamp_str>, <datasource.revision.comment>]
>>> from revscoring.extractors import api
>>> extractor = api.Extractor(mwapi.Session("https://en.wikipedia.org"))
Sending requests with default User-Agent.  Set 'user_agent' on mwapi.Session to quiet this message.
>>> root_datasources = list(dig(enwiki.damaging))
>>> values = extractor.extract(123456, root_datasources)
>>> for ds, value in zip(root_datasources, values):
...     print(ds, str(value)[:100])
... 
datasource.revision.page.namespace.id 0
datasource.revision.parent.text '''Basic Training''' or bootcamp is a short intensive program for induction of recruits into an [[ar
feature.1 1
datasource.revision.text '''Basic Training''' or bootcamp is a short intensive program for induction of recruits into an [[ar
datasource.revision.user.info.groups ['*', 'user', 'autoconfirmed']
datasource.revision.user.id 744
datasource.revision.user.info.registration_str 2002-02-28T06:13:53Z
datasource.revision.timestamp_str 2002-07-25T05:02:10Z
datasource.revision.comment 

Here you can see that the doc is not included. I'm guessing that ORES is doing something more complicated and somehow preserving the cache.

OK, so it looks like we re-use the root_cache when we should really be rebuilding it from the output of our extract() call.

See https://github.com/wikimedia/ores/blob/master/ores/scoring_context.py#L171 where we assume that root_cache is updated in place. I think the solution here is to not reuse root_cache and instead build a new cache from the values returned by .extract(), as in the sketch below.
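
A minimal sketch of what I mean, reusing the session from above (fresh_cache is an illustrative name, not existing ORES code):

import mwapi
from revscoring.dependencies import dig
from revscoring.extractors import api
from editquality.feature_lists import enwiki

extractor = api.Extractor(mwapi.Session("https://en.wikipedia.org"))
root_datasources = list(dig(enwiki.damaging))
values = extractor.extract(123456, root_datasources)

# Build a brand-new cache containing only the root datasource values,
# instead of reusing root_cache, which extraction has polluted with
# derived entries like datasource.revision.doc.
fresh_cache = dict(zip(root_datasources, values))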

Halfak triaged this task as Medium priority. Feb 15 2019, 8:34 PM
Halfak moved this task from Maintenance/cleanup to Ready to go on the Machine-Learning-Team board.
This comment was removed by Ladsgroup.

The PR didn't quite do what it needed to do, so I pushed a follow-up commit and cleaned up some of the tests.

This is waiting on @Ladsgroup's review.

Which has apparently happened; now @Halfak needs to fix an error.