
In Wikimedia Enterprise HTML Dumps, categories and templates are not always extracted
Open, Needs Triage · Public · 8 Estimated Story Points · BUG REPORT

Description

In the enwiki-NS0-20220101-ENTERPRISE-HTML.json.tar.gz file, there are JSON entries where the article has categories and templates, but those are not listed in the JSON itself. For example, https://en.wikipedia.org/wiki/Jaroslav_Olesnitsky

Event Timeline

I checked enwiki-NS0-20220120-ENTERPRISE-HTML.json.tar.gz, and it has the same issue. Example entry: https://en.wikipedia.org/wiki/Egli_Trimi

Thanks for letting us know, I'll take a look at this.

FYI, both example entries are close to the beginning of the first ndjson file.

Are you using the MediaWiki API to obtain categories and templates? I am betting you are not processing continue properly, i.e. merging multiple API responses when one batch of data is spread across several responses. You have to merge the data, otherwise some pages look like they have no templates/categories. I ran into this just now while using the API to populate templates/categories manually (because the dumps are missing them randomly). I used the API query below, and with some luck you can see that some of the returned pages are missing templates/categories: once you follow the continue parameters, other pages are missing instead. Only when batchcomplete is true do you know you got everything (but you have to merge everything you received before that).

https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&generator=allpages&gapnamespace=0&gapfilterredir=nonredirects&gaplimit=500&prop=categories|templates|redirects&cllimit=500&tllimit=500&rdlimit=500

Yes, we are using the MediaWiki API, and the missing categories problem is something we are working on at the moment. I was just looking into that myself. We are not using generators in our API calls, we query specific titles every time, but I would assume we might have a similar problem with our processing. Will keep you posted on this. Thanks for this, I'll let you know if this was the problem for us.

Even if you request a single title, I think you still might get a continue param.
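To illustrate the merging: here is a minimal sketch for a titles-based query. This is not anyone's actual code, and the helper name is made up; the point is just to keep requesting with the returned continue parameters and merge every partial response, stopping only once batchcomplete is set.

```python
# Minimal sketch (hypothetical helper, not the Enterprise pipeline's code):
# follow "continue" and merge partial results until "batchcomplete" appears,
# so no page ends up looking like it has no templates/categories.
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_categories_and_templates(titles):
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "titles": "|".join(titles),  # at most 50 titles per request
        "prop": "categories|templates",
        "cllimit": "500",
        "tllimit": "500",
    }
    merged = {}  # title -> {"categories": set, "templates": set}
    cont = {}
    while True:
        resp = requests.get(API, params={**params, **cont}).json()
        for page in resp.get("query", {}).get("pages", []):
            entry = merged.setdefault(
                page["title"], {"categories": set(), "templates": set()}
            )
            entry["categories"].update(c["title"] for c in page.get("categories", []))
            entry["templates"].update(t["title"] for t in page.get("templates", []))
        # Intermediate responses only carry part of each page's data;
        # the batch is complete only when batchcomplete is present.
        if resp.get("batchcomplete"):
            break
        cont = resp.get("continue", {})
        if not cont:
            break
    return merged
```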

Thanks for looking into this. Good luck finding the issue.

Protsack.stephan raised the priority of this task from Medium to High.
Protsack.stephan set the point value for this task to 8.
Lena.Milenko changed the task status from Open to In Progress. May 6 2022, 3:22 PM
Lena.Milenko changed the task status from In Progress to Open. May 26 2022, 12:07 PM

We did some updates and raised the limits on the data that we fetch, so it should be better than it was before. But it is not 100% solved yet.

Protsack.stephan raised the priority of this task from High to Needs Triage. Oct 12 2022, 9:43 AM

What is the value in (incorrectly) obtaining these from the MediaWiki API rather than parsing them out of the Parsoid HTML?

How can you parse all the templates used from Parsoid HTML? Some templates do not produce any output in the HTML.

Regardless of whether there is any visible output, all templates do generate a span marker because they need to be editable in e.g. VisualEditor.

See https://test.wikipedia.org/api/rest_v1/page/html/Void, which just contains {{void}} and outputs: <span about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"Void","href":"./Template:Void"},"params":{},"i":0}}]}' id="mwAg"></span>
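Extracting those back out is mostly mechanical. A rough sketch with BeautifulSoup: the typeof="mw:Transclusion"/data-mw markers and the rel="mw:PageProp/Category" links are what Parsoid emits, but the function itself is only illustrative and not something from the Enterprise pipeline or tested against the dump HTML.

```python
# Rough sketch: pull top-level transclusion targets and categories out of
# Parsoid HTML. Illustrative only; untested against the actual dumps.
import json
from bs4 import BeautifulSoup

def extract_templates_and_categories(parsoid_html):
    soup = BeautifulSoup(parsoid_html, "html.parser")

    templates = set()
    # The first node of each top-level transclusion carries a data-mw
    # attribute listing the templates that produced it.
    for node in soup.select('[typeof~="mw:Transclusion"]'):
        data_mw = json.loads(node.get("data-mw", "{}"))
        for part in data_mw.get("parts", []):
            if not isinstance(part, dict):
                continue  # parts can also contain plain wikitext strings
            target = part.get("template", {}).get("target", {})
            if "href" in target:
                templates.add(target["href"].lstrip("./"))

    # Parsoid represents categories as <link rel="mw:PageProp/Category">.
    categories = {
        link["href"].lstrip("./")
        for link in soup.select('link[rel~="mw:PageProp/Category"]')
        if link.has_attr("href")
    }
    return templates, categories
```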

But this wouldn't include subtemplates, which I guess is needed here? It'll work for categories though.

So there is an API call to get all categories/templates, which are then included in the dumps. The issue here is that sometimes that call seems to fail and the data does not get included.