[SPIKE] Redo loot-style content analysis but with the MediaWiki parser
Closed, Resolved · Public

Description

As part of the 2015-16 Q2 experimental goal, we analysed the HTML content of a sample of articles. That HTML was mostly generated by Parsoid, however, so we should redo the analysis with the MediaWiki parser as the backend.

 Acceptance criteria (AC)

  • Change joakin/loot-content-analysis to use the MediaWiki parser
  • Publish the results to mobile-l
  • Use the results to prioritise any future engineering work
  • Explore whether we can do this sitewide using a database dump
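
For anyone reproducing this kind of analysis, the core measurement can be sketched roughly as below. This is a minimal illustration, not the code in joakin/loot-content-analysis: it assumes the article HTML has already been fetched (for example via the MediaWiki action API's `action=parse`), that the markup is well nested, and that navboxes and reference lists can be recognised by the `navbox` and `references` class names. The class and function names (`ClassSizeParser`, `measure_class_bytes`) are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassSizeParser(HTMLParser):
    """Sums the UTF-8 bytes of each subtree whose root has a tracked class."""

    def __init__(self, class_names):
        # Keep entities unconverted so byte counts reflect the raw markup.
        super().__init__(convert_charrefs=False)
        self.class_names = set(class_names)
        self.sizes = {name: 0 for name in class_names}
        # One entry per open element: the tracked class it matched, or None.
        self._stack = []

    def _match(self, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return next((c for c in classes if c in self.class_names), None)

    def _count(self, raw, fallback=None):
        # Attribute bytes to the outermost tracked element that is open.
        owner = next((n for n in self._stack if n is not None), fallback)
        if owner is not None:
            self.sizes[owner] += len(raw.encode("utf-8"))

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        if tag in VOID_TAGS:
            self._count(raw, fallback=self._match(attrs))
            return
        self._stack.append(self._match(attrs))
        self._count(raw)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax: count the tag once and never push it.
        self._count(self.get_starttag_text(), fallback=self._match(attrs))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._stack:
            return
        self._count(f"</{tag}>")  # approximation of the original end tag
        self._stack.pop()

    def handle_data(self, data):
        self._count(data)

    def handle_entityref(self, name):
        self._count(f"&{name};")

    def handle_charref(self, name):
        self._count(f"&#{name};")


def measure_class_bytes(html, class_names):
    """Return {class_name: bytes} for the tracked classes in an HTML string."""
    parser = ClassSizeParser(class_names)
    parser.feed(html)
    parser.close()
    return parser.sizes


if __name__ == "__main__":
    sample = ('<div><p>Intro</p><div class="navbox"><a href="/x">X</a></div>'
              '<ol class="references"><li>Ref</li></ol></div>')
    total = len(sample.encode("utf-8"))
    for name, size in measure_class_bytes(sample, ["navbox", "references"]).items():
        print(f"{name}: {size} bytes ({size / total:.0%} of payload)")
```

Running the same function over a larger sample of pages and averaging the per-class share would give a rough picture of how much payload stripping navboxes or references could save. The sketch ignores comments and mis-nested markup, so its numbers are approximate.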

Event Timeline

phuedx created this task. Jan 12 2016, 12:10 AM
phuedx raised the priority of this task to Needs Triage.
phuedx updated the task description.
phuedx added a project: MobileFrontend.
phuedx added a subscriber: phuedx.
Restricted Application added subscribers: StudiesWorld, Aklapper. Jan 12 2016, 12:10 AM
Jdlrobson updated the task description. Jan 12 2016, 7:32 PM
Jdlrobson set Security to None.
Jhernandez triaged this task as Normal priority.
Jhernandez added a subscriber: Jhernandez.

As discussed, @Jhernandez, I took a look at using MobileFormatter, and the results are somewhat worrying (see https://phabricator.wikimedia.org/T110613#1946285), but the trade-off for first paint may be worth pursuing.

Analysis published: http://chimeces.com/loot-content-analysis/

No surprises: the results are very similar to the RESTBase ones (the payload size of references is generally a bit smaller).

I will send an email to mobile-l soon and consider how to run this kind of analysis on a bigger sample set (it doesn't need to be the whole of Wikipedia, just a much bigger sample).

@Jhernandez: What's needed to sign this off? I think AC #1 and #2 are done.

phuedx closed this task as Resolved. Jan 26 2016, 6:59 PM

This has been in Ready for Signoff since last Wednesday. The actionable AC – all but the last – have been met: we're prioritising stripping navboxes and we're investigating stripping references.