
In Wikimedia Enterprise HTML Dumps, categories and templates are not always extracted
Open, Needs Triage · Public · 8 Estimated Story Points · BUG REPORT

Description

In the enwiki-NS0-20220101-ENTERPRISE-HTML.json.tar.gz file, there are JSON entries where the article has categories and templates, but those are not listed in the JSON itself. For example, https://en.wikipedia.org/wiki/Jaroslav_Olesnitsky

Event Timeline

I checked enwiki-NS0-20220120-ENTERPRISE-HTML.json.tar.gz, and it has the same issue. Example entry: https://en.wikipedia.org/wiki/Egli_Trimi

Thanks for letting us know, I'll take a look at this.

FYI, both example entries are close to the beginning of the first ndjson file.

Are you using the MediaWiki API to obtain categories and templates? I am betting you are not processing continue properly, i.e. merging multiple API responses when one batch of data is spread across several responses. You have to merge the data, otherwise some pages look like they have no templates/categories. I ran into this just now while using the API to populate templates/categories manually (because the dumps are missing them randomly). I used the API query below, and with some luck you can see that some of the returned pages are missing templates/categories: once you follow the continue parameters, other pages are missing instead. Only when batchcomplete is true do you know you got everything (but you have to merge everything you received before that).

https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&generator=allpages&gapnamespace=0&gapfilterredir=nonredirects&gaplimit=500&prop=categories|templates|redirects&cllimit=500&tllimit=500&rdlimit=500

Yes, we are using the MediaWiki API, and the missing categories problem is something we are working on at the moment. I was just looking into that myself. We are not using generators in our API calls, we query specific titles every time, but I would assume we might have a similar problem with our processing. Will keep you posted on this. Thanks for this, I'll let you know if this was the problem for us.

Even if you request a single title, I think you still might get a continue param.
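To illustrate the merging: here is a minimal sketch for a titles-based query. This is not anyone's actual code, and the helper name is made up; the point is just to keep requesting with the returned continue parameters and merge every partial response, stopping only once batchcomplete is set.

```python
# Minimal sketch (hypothetical helper, not the Enterprise pipeline's code):
# follow "continue" and merge partial results until "batchcomplete" appears,
# so no page ends up looking like it has no templates/categories.
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_categories_and_templates(titles):
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "titles": "|".join(titles),  # at most 50 titles per request
        "prop": "categories|templates",
        "cllimit": "500",
        "tllimit": "500",
    }
    merged = {}  # title -> {"categories": set, "templates": set}
    cont = {}
    while True:
        resp = requests.get(API, params={**params, **cont}).json()
        for page in resp.get("query", {}).get("pages", []):
            entry = merged.setdefault(
                page["title"], {"categories": set(), "templates": set()}
            )
            entry["categories"].update(c["title"] for c in page.get("categories", []))
            entry["templates"].update(t["title"] for t in page.get("templates", []))
        # Intermediate responses only carry part of each page's data;
        # the batch is complete only when batchcomplete is present.
        if resp.get("batchcomplete"):
            break
        cont = resp.get("continue", {})
        if not cont:
            break
    return merged
```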

Thanks for looking into this. Good luck finding the issue.

Protsack.stephan raised the priority of this task from Medium to High.
Protsack.stephan set the point value for this task to 8.
Lena.Milenko changed the task status from Open to In Progress. May 6 2022, 3:22 PM
Lena.Milenko changed the task status from In Progress to Open. May 26 2022, 12:07 PM

We did some updates and raised the limits on the data that we fetch, so it should be better than it was before. But it is not 100% solved yet.

Protsack.stephan raised the priority of this task from High to Needs Triage. Oct 12 2022, 9:43 AM

What is the value in (incorrectly) obtaining these from the MediaWiki API rather than parsing them out of the Parsoid HTML?

How can you parse all the templates used from Parsoid HTML? Some templates do not produce any output in the HTML.

Regardless of whether there is any visible output, all templates do generate a span marker because they need to be editable in e.g. VisualEditor.

See https://test.wikipedia.org/api/rest_v1/page/html/Void, which just contains {{void}} and outputs: <span about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"Void","href":"./Template:Void"},"params":{},"i":0}}]}' id="mwAg"></span>
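Extracting those back out is mostly mechanical. A rough sketch with BeautifulSoup: the typeof="mw:Transclusion"/data-mw markers and the rel="mw:PageProp/Category" links are what Parsoid emits, but the function itself is only illustrative and not something from the Enterprise pipeline or tested against the dump HTML.

```python
# Rough sketch: pull top-level transclusion targets and categories out of
# Parsoid HTML. Illustrative only; untested against the actual dumps.
import json
from bs4 import BeautifulSoup

def extract_templates_and_categories(parsoid_html):
    soup = BeautifulSoup(parsoid_html, "html.parser")

    templates = set()
    # The first node of each top-level transclusion carries a data-mw
    # attribute listing the templates that produced it.
    for node in soup.select('[typeof~="mw:Transclusion"]'):
        data_mw = json.loads(node.get("data-mw", "{}"))
        for part in data_mw.get("parts", []):
            if not isinstance(part, dict):
                continue  # parts can also contain plain wikitext strings
            target = part.get("template", {}).get("target", {})
            if "href" in target:
                templates.add(target["href"].lstrip("./"))

    # Parsoid represents categories as <link rel="mw:PageProp/Category">.
    categories = {
        link["href"].lstrip("./")
        for link in soup.select('link[rel~="mw:PageProp/Category"]')
        if link.has_attr("href")
    }
    return templates, categories
```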

But this wouldn't include subtemplates, which I guess is needed here? It'll work for categories though.

So there is an API call to get all categories/templates, which are then included in the dumps. The issue here is that sometimes that call seems to fail and the data does not get included.