To take a concrete example, we look for "body" inside the footnote marker's data-mw. Find the corresponding Parsoid+Cite code that generates the footnote marker, and determine whether "body" is added in exactly the case we hope it is: when a ref tag appears outside of the references list and includes child content.
Repeat this investigation for each of the assumptions made by our scraper.
Assumptions
- All inline <ref> tags produce a sup.mw-ref - Not quite. The test done inside of Parsoid matches on sup[typeof~='mw:Extension/ref'], so the scraper should use that selector instead.
- Reference lists match [typeof~="mw:Extension/references"] - True. Although there seems to be an exception to this rule when the references tag is explicitly written in wikitext, I've confirmed that articles with and without explicit references tags all include this typeof on the generated list. There's also an internal condition using this selector, so it must somehow be guaranteed on all reflists.
- Reference reuse will not have data-mw.body - Not quite. Any self-closing ref tag will have no body (1, 2). However:
  - An exact duplicate ref tag is rendered the same as its predecessor, so the only way we can detect this case is to group by identical data-mw.body.id values pointing to the same block. (Verified empirically.) We won't count this case specially.
  - The major exception is that template-produced content (e.g. {{Cite web}}) also has no body attribute. Try a totally different approach: count how many uses are attached to each name. This exception also throws off the LDR detection.
- Cite error will have data-mw.errors - True. See errors stored in that field.
  - Error codes are exposed, so we can tally how many we see of each code.
- Ref tag with a name will have data-mw.attrs.name - True. "name" comes from the wikitext. It seems to never be unset in any case.
- Transclusions match [typeof~="mw:Transclusion"] - True, according to this comment and this code.
- Template transclusion has data-mw.parts, and one part in that array has part.template.target.wt which is the template name. - Doesn't feel safe. We should scan all named parts for potential ref-producing templates.
  - List all template parts
  - We're only capturing the outer template name (this is accidental, but good).
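
The checks above can be sketched in a few lines. This is a minimal illustration in Python using only the standard library, not the scraper's actual code. The names RefScanner, has_type and scan are hypothetical, and SAMPLE is hand-written HTML that only imitates the shape of Parsoid output. In particular, the "key" field inside data-mw.errors and the exact layout of data-mw.parts are assumptions here and should be confirmed against the Parsoid+Cite code.

```python
import json
from collections import Counter
from html.parser import HTMLParser


def has_type(attrs: dict, wanted: str) -> bool:
    """Emulate [typeof~="wanted"]: match one whitespace-separated token."""
    return wanted in (attrs.get("typeof") or "").split()


class RefScanner(HTMLParser):
    """Hypothetical tally of the assumptions listed above."""

    def __init__(self):
        super().__init__()
        self.uses_by_name = Counter()  # uses attached to each data-mw.attrs.name
        self.error_keys = Counter()    # tally of error codes (assumed "key" field)
        self.bodiless = 0              # footnote markers without data-mw.body
        self.reflists = 0              # elements typed mw:Extension/references
        self.template_names = []       # every template part, not only the first

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        data_mw = json.loads(attrs.get("data-mw") or "{}")
        if tag == "sup" and has_type(attrs, "mw:Extension/ref"):
            name = data_mw.get("attrs", {}).get("name")
            if name is not None:
                self.uses_by_name[name] += 1
            if "body" not in data_mw:
                self.bodiless += 1
            for err in data_mw.get("errors", []):
                self.error_keys[err.get("key")] += 1
        if has_type(attrs, "mw:Extension/references"):
            self.reflists += 1
        if has_type(attrs, "mw:Transclusion"):
            # Assumption: parts mixes plain strings with template objects.
            for part in data_mw.get("parts", []):
                if isinstance(part, dict) and "template" in part:
                    self.template_names.append(part["template"]["target"]["wt"])


def scan(html: str) -> RefScanner:
    scanner = RefScanner()
    scanner.feed(html)
    scanner.close()
    return scanner


# Hand-written sample: a named ref, its reuse, a template-produced ref,
# a ref carrying an error, and a references list.
SAMPLE = """
<p><sup typeof="mw:Extension/ref" data-mw='{"name":"ref","attrs":{"name":"a"},"body":{"id":"mw-reference-text-cite_note-a-1"}}'>[1]</sup>
<sup typeof="mw:Extension/ref" data-mw='{"name":"ref","attrs":{"name":"a"}}'>[1]</sup>
<sup typeof="mw:Transclusion mw:Extension/ref" data-mw='{"parts":[{"template":{"target":{"wt":"Cite web"},"params":{},"i":0}}]}'>[2]</sup>
<sup typeof="mw:Extension/ref mw:Error" data-mw='{"name":"ref","attrs":{},"errors":[{"key":"cite_error_ref_no_input"}]}'>[3]</sup></p>
<div typeof="mw:Extension/references" data-mw='{"name":"references","attrs":{}}'><ol></ol></div>
"""

result = scan(SAMPLE)
print(result.uses_by_name, result.bodiless, result.error_keys)
print(result.reflists, result.template_names)
```

Matching on typeof tokens rather than on the whole attribute value matters for the third marker in the sample, which carries two types at once. That same marker is counted as bodiless even though it is not a reuse, which is the template exception noted above and the reason for counting uses per name instead.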
Interesting edge cases in the code (which are not fully understood yet):
- "Nested refs" can optionally be enabled; they have fun special behaviors.
- refs embedded in a data attribute
Code to review: