
revscoring feature extraction error for wikitext pages in Wikidata
Closed, Resolved, Public

Description

With the new feature extraction logging I see a lot of the following feature extraction errors (note - the output of the error has been fixed manually, see https://github.com/wikimedia/ores/pull/357):

Feature extraction error for model itemquality and revision 1334902099 due to: JSONDecodeError: Failed to process datasource.wikibase.revision.entity_doc: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/srv/deployment/ores/deploy-cache/revs/29de1cc854a8226d657002d5d44ffa39382276cc/venv/lib/python3.5/site-packages/revscoring/dependencies/functions.py", line 244, in _solve
    value = dependent(*args)
  File "/srv/deployment/ores/deploy-cache/revs/29de1cc854a8226d657002d5d44ffa39382276cc/venv/lib/python3.5/site-packages/revscoring/dependencies/dependent.py", line 54, in __call__
    return self.process(*args, **kwargs)
  File "/srv/deployment/ores/deploy-cache/revs/29de1cc854a8226d657002d5d44ffa39382276cc/venv/lib/python3.5/site-packages/revscoring/features/wikibase/datasources/revision_oriented.py", line 117, in _process_entity_doc
    return json.loads(text)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Feature extraction error for model itemquality and revision 1585637588 due to: JSONDecodeError: Failed to process datasource.wikibase.revision.entity_doc: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
Traceback (most recent call last):
  File "/srv/deployment/ores/deploy-cache/revs/29de1cc854a8226d657002d5d44ffa39382276cc/venv/lib/python3.5/site-packages/revscoring/dependencies/functions.py", line 244, in _solve
    value = dependent(*args)
  File "/srv/deployment/ores/deploy-cache/revs/29de1cc854a8226d657002d5d44ffa39382276cc/venv/lib/python3.5/site-packages/revscoring/dependencies/dependent.py", line 54, in __call__
    return self.process(*args, **kwargs)
  File "/srv/deployment/ores/deploy-cache/revs/29de1cc854a8226d657002d5d44ffa39382276cc/venv/lib/python3.5/site-packages/revscoring/features/wikibase/datasources/revision_oriented.py", line 117, in _process_entity_doc
    return json.loads(text)
  File "/usr/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.5/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

Examples:

https://ores.wikimedia.org/v3/scores/wikidatawiki/1334902099/itemquality
https://ores.wikimedia.org/v3/scores/wikidatawiki/1585637588/itemquality

Event Timeline

I was able to repro with the following minimal code:

import mwapi

from revscoring import Model
from revscoring.extractors import api

with open("/srv/deployment/ores/deploy/submodules/articlequality/models/wikidatawiki.item_quality.gradient_boosting.model") as f:
    model = Model.load(f)

extractor = api.Extractor(mwapi.Session(host="https://wikidata.org",
                                         user_agent="revscoring demo"))
values = extractor.extract(1334902099, model.features)
print(model.score(values))

I then ran it on ores1005 for convenience with:

PYTHONPATH=/srv/deployment/ores/deploy python3 scoring.py

I used pdb to trace the content of the text variable that the code is trying to json.loads, and I got:

'I edit Wikipedia articles using this same account name. I focus mainly on data about particle physicists.'

The other similar error for 1585637588 is related to the following text value:

'{{Property documentation}}\n{{China properties}}'
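Both failures can be reproduced without revscoring by passing the two text values straight to json.loads. This is only a minimal sketch; the helper name describe_json_failure and the variable names are made up for illustration:

```python
import json


def describe_json_failure(text):
    """Return (message, position) of the JSONDecodeError for text, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.msg, err.pos
    return None


# The two revision texts found with pdb above.
plain_text = ("I edit Wikipedia articles using this same account name. "
              "I focus mainly on data about particle physicists.")
wikitext = "{{Property documentation}}\n{{China properties}}"

for text in (plain_text, wikitext):
    print(describe_json_failure(text))
```

Plain text fails at the very first character, while wikitext starting with {{ gets one character further because the first { looks like the start of a JSON object.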

If I try the same code with 1334912099, the text value is something like:

'{"type":"item","id":"Q104665876","labels":{"pl":{"language":"pl", ....

That, of course, is parsed correctly by json.loads. So maybe some extra defensive code should be added to revscoring to handle a broader set of revisions? Or possibly just emit a better error message, something more helpful for the user. For example, anything that is not a type: item should lead to an error message like "cannot score a revision that doesn't contain an item" or similar. Thoughts?

Halfak renamed this task from revscoring feature extraction error for Wikidata to revscoring feature extraction error for wikitext pages in Wikidata. Mar 9 2022, 5:52 PM

Agreed that the error message could be better. The feature extraction should work on anything that is an entity (Properties, Items, and Lexemes), but will fail for wikitext.

It looks like this is a good place to catch the error: https://github.com/wikimedia/revscoring/blob/master/revscoring/features/wikibase/datasources/revision_oriented.py#L115

I might suggest a general message like UnexpectedContentType("Expected entity JSON but content can't be parsed as JSON: <snippet of content for debugging>") since we don't know at the feature extraction level what exactly the intended use of the model is.

If we go that direction, we'd want to add an error like that to https://github.com/wikimedia/revscoring/blob/master/revscoring/errors.py#L41 and make it a subtype of DependencyError
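A rough sketch of that suggestion, not the actual revscoring patch. DependencyError below is a local stand-in for the class in revscoring/errors.py so the example runs on its own, and the function name process_entity_doc and the snippet_len parameter are assumptions made for illustration:

```python
import json


class DependencyError(RuntimeError):
    """Stand-in for revscoring's DependencyError (assumption, not the real class)."""


class UnexpectedContentType(DependencyError):
    """Raised when revision content is not the entity JSON that was expected."""


def process_entity_doc(text, snippet_len=50):
    """Parse entity JSON, turning a decode failure into a descriptive error."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # Include a short snippet of the content to help with debugging.
        raise UnexpectedContentType(
            "Expected entity JSON but content can't be parsed as JSON: "
            + repr(text[:snippet_len])
        ) from err
```

Keeping the message generic means the same error fits Items, Properties, and Lexemes alike, and chaining with "from err" preserves the original JSONDecodeError for anyone reading the traceback.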

@Halfak I like the idea, and it should be relatively easy to implement in revscoring. We'll need to do the dance of updating revscoring and deploying again, but it should be a minor change, so I am ok with it. I'll try to come up with a proposal in the next few days, or I'll review anything you have in mind if you already have something ready to submit :)

Thanks!

Been overloaded recently. Don't wait for me, but I'll put something together if I get inspired.

@Halfak: created https://github.com/wikimedia/revscoring/pull/517. It still needs some testing, but it should work as intended.

Current status:

Next steps:

  • release revscoring 2.11.2 to pypi and deploy it.

https://pypi.org/project/revscoring/2.11.2/ is published. The next step is to prepare the changes to deploy ORES and to test them in deployment-prep :)

Change 791576 had a related patch set uploaded (by AikoChou; author: AikoChou):

[research/ores/wheels@python37] Update revscoring to 2.11.2

https://gerrit.wikimedia.org/r/791576

I realized that the copy of the revscoring repository from which I published 2.11.2 may not have had the correct commit from Aiko, so I created https://github.com/wikimedia/revscoring/pull/520 to release 2.11.3 and be sure. Sorry for the trouble, I'll update the docs once done.

Published https://pypi.org/project/revscoring/2.11.4/ from a Python 3.7 environment (just to be extra sure). The size of the wheel seems to be the same as that of 2.11.2, so probably some compression changes happened with Python 3.7 (the 2.11.1 version, IIRC, was the last one released with Python 3.5).

Change 791576 merged by Elukey:

[research/ores/wheels@python37] Update revscoring to 2.11.4

https://gerrit.wikimedia.org/r/791576

Next steps:

  1. Create a change to the ores-deploy repo to bump the wheels submodule
  2. Cherry pick the change in deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud's /srv/deployment/ores/deploy
  3. Deploy the change to deployment-ores02.deployment-prep.eqiad1.wikimedia.cloud
  4. Run the httpbb test suite to confirm that all models are working.
  5. Try to reproduce the errors highlighted in this task, and see if the fix works.

Finally, if all of the above goes well, merge the ores-deploy change and deploy to prod :)

Note for ores-deploy repo:

When you clone a project with submodules for the first time, the submodule directories are empty by default. You need to run git submodule init and git submodule update to fetch all the data from those projects.

To bump the wheels submodule, cd into the submodule and run git pull origin python37 (the new wheels go into the python37 branch). Then go back to the main directory, where git diff should show a change in the wheels submodule sha.

Yep, sorry, I forgot a few details. Nice :) Before finishing, let's expand https://wikitech.wikimedia.org/wiki/ORES/Deployment#Deploy_to_the_test_server with the steps to follow!

Change 800025 had a related patch set uploaded (by Elukey; author: Elukey):

[machinelearning/liftwing/inference-services@main] articlequality: update dependencies to use revscoring 2.11.4

https://gerrit.wikimedia.org/r/800025

Aiko deployed the change to deployment-prep, it looks very good:

elukey@deployment-ores02:~$ curl localhost:8081/v3/scores/wikidatawiki/1334902099/itemquality -i
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 482
Cache-Control: no-store, no-cache, max-age=0
Pragma: no-cache
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Access-Control-Allow-Origin: *
Server: deployment-ores02.deployment-prep.eqiad1.wikimedia.cloud
Access-Control-Allow-Headers: X-Wikimedia-Debug

{
  "wikidatawiki": {
    "models": {
      "itemquality": {
        "version": "0.5.0"
      }
    },
    "scores": {
      "1334902099": {
        "itemquality": {
          "error": {
            "message": "<class 'revscoring.errors.UnexpectedContentType'>([\"Expected content of type JSON, but the following can't be parsed (max 50 chars showed): I edit Wikipedia articles using this same account \"])",
            "type": "Exception"
          }
        }
      }
    }
  }
}
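Because the failure comes back with HTTP 200, a client has to look inside the body to notice it. A small sketch of doing so, assuming the response layout shown above; score_error is a hypothetical helper, not part of ORES, and the error message in the sample body is elided:

```python
import json


def score_error(response_doc, wiki, rev_id, model):
    """Return the error message for one score in a v3 response, or None if absent."""
    score = response_doc[wiki]["scores"][str(rev_id)][model]
    error = score.get("error")
    return error["message"] if error else None


# Trimmed copy of the response body above, with the message replaced by a placeholder.
body = json.loads("""
{
  "wikidatawiki": {
    "models": {"itemquality": {"version": "0.5.0"}},
    "scores": {
      "1334902099": {
        "itemquality": {
          "error": {"message": "(elided)", "type": "Exception"}
        }
      }
    }
  }
}
""")

print(score_error(body, "wikidatawiki", 1334902099, "itemquality"))
```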

We still have the following entry in the logs, but it is much clearer than what was outlined in the description:

2022-05-27 14:05:51,028 ERROR ores.scoring_systems.scoring_system: Feature extraction error for model 1334902099 and revision itemquality due to: <class 'revscoring.errors.UnexpectedContentType'>(["Expected content of type JSON, but the following can't be parsed (max 50 chars showed): I edit Wikipedia articles using this same account "])