
Document whether our scraper patterns are based on fictitious or real semantics
Closed, Resolved (Public)

Description

To take a concrete example, we look for "body" inside the footnote marker's mw-data. Find the corresponding Parsoid+Cite code which is generating the footnote marker, and determine whether "body" is added in exactly the case we hope it is: when a ref tag appears outside of the references list, and includes child content.

Repeat this investigation for each of the assumptions made by our scraper.

Assumptions

  • All inline <ref> tags produce a sup.mw-ref: Not quite. The test done inside of Parsoid matches on sup[typeof~='mw:Extension/ref'], so we should change to use that.
  • Reference lists match [typeof~="mw:Extension/references"] - True. Although there seems to be an exception to this rule when the references tag is explicitly written in wikitext, I've confirmed that articles with and without explicit references tags all include this typeof on the generated list. There's also an internal condition using this selector, so it must somehow be guaranteed on all reflists.
  • Reference reuse will not have data-mw.body - Not quite. Any self-closing ref tag will have no body (1, 2). However:
    • An exact duplicate ref tag is rendered the same as its predecessor. The only way we can detect this case is to group by identical data-mw.body.id values pointing to the same block. (Verified empirically.) We won't count this case specially.
    • The major exception is that template-produced content (e.g. {{Cite web}}) also has no body attribute. We should try a totally different approach: counting how many uses are attached to each name. This exception also throws off the LDR detection.
  • Cite error will have data-mw.errors - True. See errors stored in that field.
    • Error codes are exposed, so we can tally how many we see of each code.
  • Ref tag with a name will have data-mw.attrs.name - True. "name" comes from the wikitext. It seems to never be unset in any case.
  • Transclusions match [typeof~="mw:Transclusion"] - True, according to this comment and this code.
  • Template transclusion has data-mw.parts, and one part in that array has part.template.target.wt which is the template name. - Doesn't feel safe. We should scan all named parts for potential ref-producing templates.
    • List all template parts
    • We're only capturing the outer template name (this is accidental, but good).
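
The checks above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the scraper's actual code: the names RefMarkerScanner and summarize are hypothetical, and the sketch assumes data-mw is a JSON object stored in the element's data-mw attribute, with the fields named in the list above (body, errors, attrs.name, parts[].template.target.wt).

```python
import json
from html.parser import HTMLParser


class RefMarkerScanner(HTMLParser):
    """Hypothetical helper: collect data-mw from footnote markers and
    template names from transclusions."""

    def __init__(self):
        super().__init__()
        self.refs = []       # parsed data-mw of each sup[typeof~='mw:Extension/ref']
        self.templates = []  # target.wt of every template part seen

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # typeof is a space-separated token list, so split it to emulate "~=".
        types = (attrs.get("typeof") or "").split()
        # Assumption: data-mw, when present, holds a JSON object.
        data_mw = json.loads(attrs.get("data-mw") or "{}")
        if tag == "sup" and "mw:Extension/ref" in types:
            self.refs.append(data_mw)
        if "mw:Transclusion" in types:
            for part in data_mw.get("parts", []):
                # parts can mix plain strings with {"template": ...} objects.
                if isinstance(part, dict) and "template" in part:
                    self.templates.append(part["template"]["target"]["wt"])


def summarize(html):
    """Tally the properties that the assumptions above rely on."""
    scanner = RefMarkerScanner()
    scanner.feed(html)
    return {
        "refs": len(scanner.refs),
        "with_body": sum("body" in r for r in scanner.refs),
        "with_errors": sum(bool(r.get("errors")) for r in scanner.refs),
        "named": sum("name" in r.get("attrs", {}) for r in scanner.refs),
        "templates": scanner.templates,
    }
```

Matching on typeof tokens rather than on the mw-ref class keeps the sketch in line with the selector Parsoid itself tests. Collecting every template part, rather than only the first, follows the suggestion to scan all named parts.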

Interesting edge cases in the code (which are not fully understood yet):

  • "Nested refs" can be optionally enabled; they have fun special behaviors.
  • refs embedded in a data attribute

Code to review:

Event Timeline

Glad we talked about this. I've found three different versions of Parsoid HTML, depending on how we fetch the data:

Parse API:

curl 'https://ha.wikipedia.org/w/api.php?action=parse&oldid=193821&format=json' | jq '.parse.text["*"]' -r

Dumps:

jq '.article_body.html' test/data/fixture1.ndjson -r

REST:

https://ha.wikipedia.org/api/rest_v1/page/html/Fantasy_Black_Channel/193821

The dumps HTML is most similar to the REST output and, according to https://github.com/protsack-stephan/mediawiki-api-client, seems to derive from it. The differences seem to be mostly whitespace and canonical link href replacement. For now I won't chase down the source of these differences.

However, the Parsoid / "parse" API response is significantly different in ways that affect us, such as the ref tags rendering with different class names.

awight updated the task description.
awight moved this task from Doing to Tech Review on the WMDE-TechWish-Sprint-2023-03-14 board.
awight closed this task as Resolved. Edited Apr 2 2023, 5:11 PM
awight claimed this task.
awight moved this task from Tech Review to Done on the WMDE-TechWish-Sprint-2023-03-14 board.

Two details came up in conversation:

  • HTML comments may appear in the wikitext we use to pull the template name (now addressed by a patch).
  • Ref name is not necessarily unique; the full key is {group, name}. This can be addressed in follow-up work: T333672: Include "group" in unique ref key
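
Both details can be illustrated with a short sketch. This is not the patch mentioned above, only one plausible way to handle each point: clean_template_name and ref_key are hypothetical names, and reading the group from data-mw.attrs.group is an assumption made by analogy with data-mw.attrs.name.

```python
import re
from collections import Counter

# Matches HTML comments, including ones that span several lines.
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def clean_template_name(wt):
    """Drop HTML comments and surrounding whitespace from a template target."""
    return _COMMENT.sub("", wt).strip()


def ref_key(data_mw):
    """Build the {group, name} key for one footnote marker's data-mw.

    Assumption: the group sits at attrs.group; a missing group is
    treated as the empty default group.
    """
    attrs = data_mw.get("attrs", {})
    return (attrs.get("group", ""), attrs.get("name"))


def count_uses(markers):
    """Count how many markers share each (group, name) key, skipping unnamed refs."""
    keys = (ref_key(m) for m in markers)
    return Counter(k for k in keys if k[1] is not None)
```

With this key, two refs that share a name but sit in different groups are counted separately, which is the distinction T333672 asks for.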