In the enwiki-NS0-20220101-ENTERPRISE-HTML.json.tar.gz file, there are JSON entries for articles that have categories and templates, but these are not listed in the JSON itself. For example, https://en.wikipedia.org/wiki/Jaroslav_Olesnitsky
I checked enwiki-NS0-20220120-ENTERPRISE-HTML.json.tar.gz and it has the same issue. Example entry: https://en.wikipedia.org/wiki/Egli_Trimi
Are you using the MediaWiki API to obtain categories and templates? I am betting you are not processing continue properly, that is, not merging multiple API responses when the data for one batch of pages is spread across several of them. You have to merge the data, otherwise some pages look like they have no templates/categories. I ran into this just now when I was using the API to populate templates/categories myself (because the dumps are missing them seemingly at random). I used the following API query. With some luck you can see that some of the returned pages are missing templates/categories. You have to follow the continue params to get them, but in those follow-up responses other pages come back without their data. Only when batchcomplete is true do you know you have everything for the batch (and you have to merge everything you received before that).
https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&generator=allpages&gapnamespace=0&gapfilterredir=nonredirects&gaplimit=500&prop=categories|templates|redirects&cllimit=500&tllimit=500&rdlimit=500
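To make the merging concrete, here is a minimal Python sketch of how the continue handling could look. It is not how the dumps are actually produced. It assumes formatversion=2 responses (where query.pages is a list), and the names http_fetch, iter_complete_batches, LIST_PROPS and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator, List

API_URL = "https://en.wikipedia.org/w/api.php"
# List-valued properties that can be split across several responses.
LIST_PROPS = ("categories", "templates", "redirects")


def http_fetch(params: Dict[str, str]) -> dict:
    """Perform one API request and decode the JSON body."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent (an assumption); use one that identifies your tool.
    req = urllib.request.Request(url, headers={"User-Agent": "continue-merge-sketch/0.1"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def iter_complete_batches(
    params: Dict[str, str],
    fetch: Callable[[Dict[str, str]], dict] = http_fetch,
) -> Iterator[List[dict]]:
    """Yield lists of fully merged pages, one list per completed batch."""
    cont: Dict[str, str] = {}
    merged: Dict[int, dict] = {}
    while True:
        # Send the original params plus whatever "continue" returned last time.
        data = fetch({**params, **cont})
        for page in data.get("query", {}).get("pages", []):
            entry = merged.setdefault(
                page["pageid"], {"pageid": page["pageid"], "title": page["title"]}
            )
            for prop in LIST_PROPS:
                # A page may carry only part of its data in this response.
                entry.setdefault(prop, []).extend(page.get(prop, []))
        if data.get("batchcomplete"):
            # Only now is the data for these pages known to be complete.
            yield list(merged.values())
            merged = {}
        if "continue" not in data:
            if merged:
                yield list(merged.values())
            return
        cont = data["continue"]
```

Calling iter_complete_batches with the query parameters from the URL above (action=query, format=json, formatversion=2, generator=allpages, and so on) would then yield each group of pages only after all of its categories, templates and redirects have been collected.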
Yes, we are using the MediaWiki API, and the missing categories problem is something we are working on at the moment. I was just looking into that myself. We are not using generators in our API calls; we query specific titles every time, but I would assume we might have a similar problem in our processing. Thanks for this, I'll keep you posted and let you know if this was the problem for us.
Even if you request a single title, I think you might still get a continue param.
Thanks for looking into this. Good luck finding the issue.
We made some updates and updated the limits on the data that we fetch, so it should be better than it was before, but it is not 100% solved yet.
What is the value in (incorrectly) obtaining these from the MediaWiki API rather than parsing them out of the Parsoid HTML?
How can you parse all the templates used from Parsoid HTML? Don't some templates produce no output in the HTML?
Regardless of whether there is any visible output, all templates do generate a span marker because they need to be editable in e.g. VisualEditor.
See https://test.wikipedia.org/api/rest_v1/page/html/Void which just contains {{void}} and outputs: <span about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"Void","href":"./Template:Void"},"params":{},"i":0}}]}' id="mwAg"></span>
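As a rough illustration of reading those markers, the Python sketch below collects the href of every directly transcluded template from the data-mw attribute, using only the standard library. The names TransclusionCollector and direct_templates are invented for this example, and it assumes that parts which are not templates (plain wikitext strings, template arguments, parser functions without an href) can simply be skipped.

```python
import json
from html.parser import HTMLParser
from typing import List


class TransclusionCollector(HTMLParser):
    """Collects template hrefs from elements marked typeof="mw:Transclusion"."""

    def __init__(self) -> None:
        super().__init__()
        self.templates: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # typeof may hold several space-separated values.
        if "mw:Transclusion" not in (attr.get("typeof") or "").split():
            return
        data_mw = json.loads(attr.get("data-mw") or "{}")
        for part in data_mw.get("parts", []):
            # Skip interleaved wikitext strings and non-template parts.
            if not isinstance(part, dict) or "template" not in part:
                continue
            href = part["template"].get("target", {}).get("href")
            if href:
                self.templates.append(href)


def direct_templates(html: str) -> List[str]:
    """Return hrefs of templates transcluded directly in the given HTML."""
    collector = TransclusionCollector()
    collector.feed(html)
    collector.close()
    return collector.templates


# The output quoted above for the Void page.
SAMPLE = """<span about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"Void","href":"./Template:Void"},"params":{},"i":0}}]}' id="mwAg"></span>"""

print(direct_templates(SAMPLE))  # ['./Template:Void']
```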
But this wouldn't include subtemplates, which I guess is needed here? It'll work for categories though.
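For the categories part, a similar sketch is possible. This one assumes that Parsoid HTML represents each category as a link element with rel="mw:PageProp/Category" and an href such as ./Category:Name#Sortkey; check that assumption against real dump entries before relying on it. CategoryCollector and categories_from_html are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import unquote


class CategoryCollector(HTMLParser):
    """Collects category titles from <link rel="mw:PageProp/Category"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.categories: List[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag != "link":
            return
        if "mw:PageProp/Category" not in (attr.get("rel") or "").split():
            return
        href = attr.get("href") or ""
        # Drop the sort key fragment and the leading "./" (assumed href shape).
        title = href.split("#", 1)[0]
        if title.startswith("./"):
            title = title[2:]
        self.categories.append(unquote(title).replace("_", " "))


def categories_from_html(html: str) -> List[str]:
    """Return the category titles found in the given Parsoid HTML."""
    collector = CategoryCollector()
    collector.feed(html)
    collector.close()
    return collector.categories
```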
So there is an API call to get all categories/templates, and its result is then included in the dumps. The issue here is that sometimes that call seems to fail and the data is not included.