
ores sends duplicate page text over the wire
Closed, Resolved · Public

Description

These are the current datasources that ORES sends over the wire from uwsgi to Celery:

{'datasource.revision.user.info.doc': {'editcount': 1247, 'registration': None, 'name': 'E.B.', 'userid': 67, 'gender': 'unknown', 'groups': ['*', 'user', 'autoconfirmed']},
 'datasource.revision.user.id': 67,
 'datasource.revision.parent.text': 'lots of text',
 'datasource.revision.user.info.registration': None,
 'datasource.revision.id': 4569,
 'datasource.revision.user.info.groups': {'*', 'user', 'autoconfirmed'},
 'datasource.extractor.dependents': {<datasource.revision.user.id>, <datasource.revision.parent.text>, <datasource.revision.user.info.registration>, <datasource.revision.user.info.groups>, <datasource.revision.comment>, <datasource.revision.text>, <datasource.revision.timestamp>, <datasource.revision.page.namespace.id>},
 'datasource.revision.doc': {'slots': {'main': {'*': 'lots of text', 'contentformat': 'text/x-wiki', 'contentmodel': 'wikitext'}}, 'revid': 4569, 'size': 6192, 'userid': 67, 'page': {'pageid': 105, 'ns': 0, 'title': 'Književnost'}, 'comment': '', 'parentid': 4562, 'user': 'E.B.', 'timestamp': '2004-08-22T06:12:32Z'},
 'datasource.revision.text': 'lots of text',
 'datasource.revision.parent.doc': {'slots': {'main': {'*': 'lots of text', 'contentformat': 'text/x-wiki', 'contentmodel': 'wikitext'}}, 'revid': 4562, 'size': 6196, 'userid': 67, 'page': {'pageid': 105, 'ns': 0, 'title': 'Književnost'}, 'comment': 'sintaksa, gramatika...', 'parentid': 4560, 'user': 'E.B.', 'timestamp': '2004-08-22T05:56:51Z'},
 'datasource.revision.timestamp': Timestamp('2004-08-22T06:12:32Z'),
 'datasource.revision.page.namespace.id': 0,
 'datasource.revision.comment': ''}

I replaced the actual page text with "lots of text". As you can see, both the revision text and the parent revision text are duplicated: each appears once as a standalone datasource and again inside the corresponding doc. Fixing this issue will free up Redis capacity and save some cute polar bears from dying.
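
To make the duplication concrete, here's a minimal sketch against a toy cache shaped like the dump above (the keys are real; "lots of text" stands in for the actual page text):

# A toy cache with the same shape as the dump above.
cache = {
    'datasource.revision.text': 'lots of text',
    'datasource.revision.doc': {
        'slots': {'main': {'*': 'lots of text'}}},
    'datasource.revision.parent.text': 'lots of text',
    'datasource.revision.parent.doc': {
        'slots': {'main': {'*': 'lots of text'}}},
}

# The same text travels twice for the revision and twice for its parent:
assert cache['datasource.revision.text'] == \
    cache['datasource.revision.doc']['slots']['main']['*']
assert cache['datasource.revision.parent.text'] == \
    cache['datasource.revision.parent.doc']['slots']['main']['*']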

Event Timeline

I just looked into this… I'm not sure how to tackle this: I guess the cleanest would be to just remove datasource.revision(.parent).text in favor of the text in datasource.revision(.parent).doc… but I doubt that's feasible now.

@Ladsgroup Can we change datasource.revision(.parent).text so that it's just a "pointer" to datasource.revision(.parent).doc["slots"]["main"]["*"] that is not fully serialized in JSON? Maybe using Dependent?
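
Roughly what I have in mind, as a minimal sketch: revscoring's Datasource is a Dependent subclass that takes a process function and a depends_on list, so the text could be derived from the doc instead of being serialized alongside it. The revision_doc and revision_text names here are illustrative, not the real datasource definitions:

from revscoring.datasources import Datasource

# Stand-in for the real datasource.revision.doc, which holds the full
# API revision document.
revision_doc = Datasource("revision.doc")

# The "pointer": revision.text is computed from the doc on demand, so
# only the doc would need to travel over the wire.
revision_text = Datasource(
    "revision.text",
    process=lambda doc: doc['slots']['main']['*'],
    depends_on=[revision_doc])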

It sounds good to me!

We shouldn't really be having this issue. dig() should only ever return the base datasources. I don't believe that datasource.revision.doc should exist in what is being sent over the wire.

$ python
Python 3.5.1+ (default, Mar 30 2016, 22:46:26) 
[GCC 5.3.1 20160330] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mwapi
>>> from revscoring.dependencies import dig
>>> from editquality.feature_lists import enwiki
>>> list(dig(enwiki.damaging))
[<datasource.revision.page.namespace.id>, <datasource.revision.parent.text>, <feature.1>, <datasource.revision.text>, <datasource.revision.user.info.groups>, <datasource.revision.user.id>, <datasource.revision.user.info.registration_str>, <datasource.revision.timestamp_str>, <datasource.revision.comment>]
>>> from revscoring.extractors import api
>>> extractor = api.Extractor(mwapi.Session("https://en.wikipedia.org"))
Sending requests with default User-Agent.  Set 'user_agent' on mwapi.Session to quiet this message.
>>> root_datasources = list(dig(enwiki.damaging))
>>> values = extractor.extract(123456, root_datasources)
>>> for ds, value in zip(root_datasources, values):
...     print(ds, str(value)[:100])
... 
datasource.revision.page.namespace.id 0
datasource.revision.parent.text '''Basic Training''' or bootcamp is a short intensive program for induction of recruits into an [[ar
feature.1 1
datasource.revision.text '''Basic Training''' or bootcamp is a short intensive program for induction of recruits into an [[ar
datasource.revision.user.info.groups ['*', 'user', 'autoconfirmed']
datasource.revision.user.id 744
datasource.revision.user.info.registration_str 2002-02-28T06:13:53Z
datasource.revision.timestamp_str 2002-07-25T05:02:10Z
datasource.revision.comment 

Here you can see that the doc is not included. I'm guessing that ORES is doing something more complicated and somehow preserving the cache.

OK, so it looks like we re-use the root_cache when we should really be rebuilding it from the output of our extract() call.

See https://github.com/wikimedia/ores/blob/master/ores/scoring_context.py#L171 where we assume that root_cache is updated in place. I think the solution here is to not reuse root_cache and instead build a new cache from the values returned by .extract(), as in the sketch below.
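
A minimal sketch of what I mean, reusing the session from above (fresh_cache is an illustrative name, not existing ORES code):

import mwapi
from revscoring.dependencies import dig
from revscoring.extractors import api
from editquality.feature_lists import enwiki

extractor = api.Extractor(mwapi.Session("https://en.wikipedia.org"))
root_datasources = list(dig(enwiki.damaging))
values = extractor.extract(123456, root_datasources)

# Build a brand-new cache containing only the root datasource values,
# instead of reusing root_cache, which extraction has polluted with
# derived entries like datasource.revision.doc.
fresh_cache = dict(zip(root_datasources, values))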

Halfak triaged this task as Medium priority. Feb 15 2019, 8:34 PM
Halfak moved this task from Maintenance/cleanup to Ready to go on the Machine-Learning-Team board.
This comment was removed by Ladsgroup.

The PR didn't quite do what it needed to do, so I pushed a follow-up commit and cleaned up some of the tests.

This is waiting on @Ladsgroup's review.

Which has apparently happened; now @Halfak needs to fix an error.