⚓ T332058 Document whether our scraper patterns are based on fictitious or real semantics

Status	Assigned	Task
Resolved	None	T345411 Scraper: destroy Cloud VPS runner instance
Resolved	None	T341751 Publish dump scraper reports
Resolved	None	T335411 Scraper: produce spreadsheet of scraped statistics for comparing wikis
Resolved	awight	T332032 Create baseline statistics for reference usage (2023)
Resolved	awight	T332058 Document whether our scraper patterns are based on fictitious or real semantics
Resolved	thiemowmde	T333672 Include "group" in unique ref key

awight created this task. Mar 14 2023, 4:52 PM

lilients_WMDE moved this task from Incoming to In progress on the WMDE-References-FocusArea board. Mar 15 2023, 2:14 PM

awight claimed this task. Mar 24 2023, 9:30 AM

awight moved this task from Sprint Backlog to Doing on the WMDE-TechWish-Sprint-2023-03-14 board.

Glad we talked about this. I've found three different versions of Parsoid HTML, depending on how we fetch the data:

Parse API:

curl 'https://ha.wikipedia.org/w/api.php?action=parse&oldid=193821&format=json' | jq '.parse.text["*"]' -r

Dumps:

jq '.article_body.html' test/data/fixture1.ndjson -r

REST:

https://ha.wikipedia.org/api/rest_v1/page/html/Fantasy_Black_Channel/193821
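For repeating the comparison on other pages, the three fetches above can be sketched in Python. This is only an illustration: the function names are made up here, and the JSON paths are copied from the jq expressions in the commands above. Nothing in this sketch performs a network request; it only builds the URLs and unpacks response bodies that were fetched some other way.

```python
import json
from urllib.parse import quote, urlencode


def parse_api_url(host: str, oldid: int) -> str:
    """Build the action=parse URL used in the curl command above."""
    query = urlencode({"action": "parse", "oldid": oldid, "format": "json"})
    return f"https://{host}/w/api.php?{query}"


def rest_html_url(host: str, title: str, revision: int) -> str:
    """Build the REST page/html URL for one revision of a page."""
    return f"https://{host}/api/rest_v1/page/html/{quote(title, safe='')}/{revision}"


def parse_api_html(response_body: str) -> str:
    """Equivalent of jq '.parse.text["*"]' on a Parse API response."""
    return json.loads(response_body)["parse"]["text"]["*"]


def dump_html(ndjson_line: str) -> str:
    """Equivalent of jq '.article_body.html' on one line of the dump file."""
    return json.loads(ndjson_line)["article_body"]["html"]


if __name__ == "__main__":
    print(parse_api_url("ha.wikipedia.org", 193821))
    print(rest_html_url("ha.wikipedia.org", "Fantasy_Black_Channel", 193821))
```

The dump file is newline-delimited JSON, so `dump_html` would be called once per line.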

The dumps HTML is most similar to the REST output and, according to https://github.com/protsack-stephan/mediawiki-api-client, seems to derive from it. The differences seem to be mostly whitespace and canonical link href replacement. For now I won't chase down the source of these differences.
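One way to check the "mostly whitespace" impression is to normalize whitespace in both documents and list whatever still differs. The sketch below is an assumption about how such a check could look, not what was actually run for this task. The helper names are hypothetical, and href differences are deliberately left visible rather than normalized away, since their exact form was not investigated.

```python
import difflib
import re
from typing import List, Tuple


def normalize_whitespace(html: str) -> List[str]:
    """Split between adjacent tags and collapse whitespace runs on each line."""
    one_tag_per_line = re.sub(r">\s*<", ">\n<", html)
    lines = (re.sub(r"\s+", " ", line).strip() for line in one_tag_per_line.splitlines())
    return [line for line in lines if line]


def remaining_differences(html_a: str, html_b: str) -> List[Tuple[List[str], List[str]]]:
    """Return (lines_in_a, lines_in_b) pairs that differ after normalization."""
    a = normalize_whitespace(html_a)
    b = normalize_whitespace(html_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[a1:a2], b[b1:b2])
        for tag, a1, a2, b1, b2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

If the two sources really differ only in whitespace, `remaining_differences` returns an empty list; anything left over (for example a changed link href) shows up as a pair of differing lines.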

However, the Parsoid / "parse" API response is significantly different in ways that affect us, such as the ref tags rendering with different class names.
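To make the class name difference concrete, one could collect the class names on footnote markers in each HTML variant and compare the sets. The sketch below uses only the Python standard library. It assumes that ref markers are `<sup>` elements, which is an assumption made for illustration, and any HTML fragments passed to it as samples are stand-ins rather than actual output from these endpoints.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Set


class ClassCollector(HTMLParser):
    """Count class names on every start tag of one element type."""

    def __init__(self, tag: str = "sup") -> None:
        super().__init__()
        self.tag = tag  # assumption: ref markers are rendered as <sup>
        self.classes: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def class_names(html: str, tag: str = "sup") -> Counter:
    """Class name counts for the given element type in one HTML document."""
    collector = ClassCollector(tag)
    collector.feed(html)
    collector.close()
    return collector.classes


def classes_only_in_first(html_a: str, html_b: str, tag: str = "sup") -> Set[str]:
    """Class names that occur in html_a but never in html_b."""
    return set(class_names(html_a, tag)) - set(class_names(html_b, tag))
```

Running `classes_only_in_first` in both directions on the same revision from two sources would show which selectors a scraper pattern can rely on for only one of them.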

awight updated the task description. Mar 27 2023, 10:56 AM

awight updated the task description. Mar 27 2023, 4:01 PM

awight updated the task description. Mar 27 2023, 4:05 PM

awight updated the task description. Mar 27 2023, 4:59 PM

awight updated the task description. Mar 28 2023, 10:14 AM

awight updated the task description. Mar 28 2023, 1:46 PM

awight updated the task description. Mar 29 2023, 12:05 PM

awight removed awight as the assignee of this task. Mar 29 2023, 1:45 PM

awight updated the task description.

awight moved this task from Doing to Tech Review on the WMDE-TechWish-Sprint-2023-03-14 board.

awight updated the task description. Mar 31 2023, 11:04 AM

Two details came up in conversation:

HTML comments may appear in the wikitext we use to pull the template name (now addressed by a patch).
Ref name is not necessarily unique; the full key is {group, name}. This can be addressed in follow-up work: T333672: Include "group" in unique ref key
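Both details can be illustrated with a short Python sketch. It is not the scraper's actual code or the patch mentioned above: the regular expressions, function names, and the dictionary shape used for refs are all assumptions chosen for the example.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Hypothetical pattern: the text between "{{" and the first "|" or "}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def template_name(wikitext: str) -> Optional[str]:
    """Name of the first template call, with HTML comments stripped first."""
    without_comments = HTML_COMMENT.sub("", wikitext)
    match = TEMPLATE_NAME.search(without_comments)
    return match.group(1) if match else None


def ref_key(group: Optional[str], name: str) -> Tuple[str, str]:
    """Full ref key: a missing group is treated as the empty default group."""
    return (group or "", name)


def count_unique_named_refs(refs: Iterable[Dict[str, Optional[str]]]) -> int:
    """Count distinct refs by {group, name} rather than by name alone."""
    return len({ref_key(ref.get("group"), ref["name"]) for ref in refs})
```

With this key, two refs that share the name "a" but sit in different groups are counted separately, whereas keying on the name alone would merge them.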

thiemowmde closed subtask T333672: Include "group" in unique ref key as Resolved. Apr 5 2023, 8:56 AM

awight moved this task from In progress to Done on the WMDE-References-FocusArea board. Oct 23 2024, 7:07 AM

Document whether our scraper patterns are based on fictitious or real semantics
Closed, Resolved (Public)