upload issues with a mismatch file
Closed, Resolved (Public)

Description

Problem:
We received a file of enwp mismatches from @Mike_Peel. After uploading it to the production store, the import fails with "unexpected error". We need to investigate the cause.

Event Timeline

The unexpected errors turned out to be the result of the server's job queue timing out, which is resolved in PR#304.

Other than that, URLs in the file that contain a long hyphen (e.g. line 2092) fail validation, so any such URLs might need to be URL-encoded. In addition, I noticed that some external values are missing from the file, and that some are not valid for the property type (such as in lines 2568 - 2625).
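To locate lines like 2092 before uploading, something along these lines could help. It is only a rough sketch: it flags every line containing a non-ASCII character, which assumes that long hyphens are the main non-ASCII content in the file. The function name and the sample rows are made up for illustration and do not reflect the real file layout.

```python
def find_non_ascii_lines(lines):
    """Return the 1-based numbers of lines containing any non-ASCII character."""
    return [number for number, line in enumerate(lines, start=1)
            if not line.isascii()]


# Hypothetical sample rows; the second URL contains an en dash (U+2013).
sample = [
    "a,https://example.org/A-B\n",
    "b,https://example.org/A\u2013B\n",
]
flagged = find_non_ascii_lines(sample)
```

Legitimate non-ASCII text in other columns would be flagged too, so the result is a list of candidates to inspect rather than a list of errors.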

Something definitely went wrong with the extraction of the P1030 constraints - I'll look into that and provide an updated file, but for now you could just remove those lines (any containing 'P1030').
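A minimal sketch of that stopgap, assuming the file can be handled line by line. The helper name and the sample rows are hypothetical, and the column layout shown is not the real one.

```python
def drop_lines_containing(lines, needle="P1030"):
    """Return only the lines that do not contain `needle` anywhere."""
    return [line for line in lines if needle not in line]


# Hypothetical sample rows, for illustration only.
sample = [
    "Q1,P1030,some value\n",
    "Q2,P214,another value\n",
]
kept = drop_lines_containing(sample)
```

Note that a plain substring test would also drop a line mentioning a longer ID that starts with the same digits (such as "P10301"), so a stricter match on the property column may be preferable if the layout is known.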

OK, the problem is with https://en.wikipedia.org/wiki/Category:Light_characteristic_different_from_Wikidata , where the template is an infobox rather than an authority control template. So the simplest thing is to just skip any categories where the template name contains 'infobox', which I've implemented at https://github.com/mpeel/wikicode/commit/78a5957a504a4ea5c99eabefeba3594e0bf5095d . That should solve this in the longer term. For now, I suggest just removing those lines from the import file or, if possible, coding up something that catches bad lines like these and skips over them.
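The idea of skipping infobox templates can be sketched roughly as below. This is not the code from the linked commit; the function name is made up, the template names are illustrative, and the case-insensitive comparison is an assumption.

```python
def is_infobox_template(template_name):
    """True if the template name contains 'infobox' (case-insensitive; an assumption)."""
    return "infobox" in template_name.lower()


# Illustrative template names, not taken from the actual category list.
templates = ["Authority control", "Infobox lighthouse"]
kept = [name for name in templates if not is_infobox_template(name)]
```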

\o/

And we are ready for launch tomorrow. I have now imported everything but the P1030 mismatches.
As Itamar mentioned, I had to fix a few more things related to URL encoding. Here are two examples where the importer was struggling with the different hyphens:

@ItamarWMDE Is this something we can handle on the importer side?

Unfortunately, that might be a bit too much overhead, since these characters are not allowed by the URL specification. We currently use a prebuilt validator to check the validity of URLs. To accept them, we would either have to write some custom regex rules or add another step to the process that URL-encodes all URL inputs (which might result in some unexpectedly invalid URLs appearing to be valid).

For this particular CSV, I would advise @Mike_Peel to percent-encode the constructed URL in this line, e.g. with urllib.parse.quote() (urllib.quote() in Python 2; urlencode() is meant for query-string parameters rather than whole URLs): https://github.com/mpeel/wikicode/blob/78a5957a504a4ea5c99eabefeba3594e0bf5095d/wikidata_enwiki_mismatch.py#L86
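As a rough illustration of encoding the constructed URL, here is a sketch using Python 3's urllib.parse.quote(), which percent-encodes non-ASCII characters as UTF-8. The helper name, the choice of safe characters, and the example URL are assumptions and are not taken from the linked script.

```python
from urllib.parse import quote


def percent_encode_url(url):
    """Percent-encode characters that are not allowed in a URL.

    URL delimiters and '%' are kept as-is so that the URL structure and
    any existing escapes survive (an assumption about the desired behaviour).
    """
    return quote(url, safe=":/?#[]@!$&'()*+,;=%")


# Hypothetical URL containing an en dash (U+2013).
encoded = percent_encode_url("https://example.org/wiki/A\u2013B")
```

Keeping '%' in the safe set means already-encoded URLs pass through unchanged, at the cost of not escaping a literal percent sign.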