
Run comparison of html extracts again
Closed, Resolved · Public

Description

I ran the comparison script, which compares the extract_html fields, again.
Here are the English results:


For the new run I added a 'b' to the file names. The en.v2.txt is from the old run we did a while ago.
(http://jdlrobson.com/summaries/en.2b.html)
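For readers who want to reproduce this kind of old-run versus new-run comparison, here is a minimal sketch. It is not the actual comparison script: the function name `compare_runs` and the input shape (a dict mapping page title to its extract_html string) are assumptions made for illustration.

```python
def compare_runs(old, new):
    """Return the titles whose extract differs between two runs.

    Both arguments are assumed to map page title -> extract_html string.
    A title missing from one run shows up with None on that side.
    """
    changed = {}
    for title in sorted(set(old) | set(new)):
        before, after = old.get(title), new.get(title)
        if before != after:
            changed[title] = (before, after)
    return changed
```

Only the changed titles are returned, so the output stays small enough to review by hand.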

So far I've noticed a bunch more issue classes:

  • showing coordinates in Qatar and United States (probably the order of operations is to blame)
  • Escaped HTML causes issues in Transformers:_The_Last_Knight: <i id=\"mwCQ\">Transformers</i> and & inside Ariana Grande. See also Logan (Film), DJ Khaled, Keanu Reeves, Beyoncé, Clint Eastwood, Emma Watson and Chris Pine.
  • undefined in Donald Trump, Barack Obama and Botulism
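The three issue classes above could be flagged automatically with something like the sketch below. The regular expressions are rough heuristics and are assumptions, not the checks used by the real script or by mobileapps; `classify_extract` is a hypothetical helper name.

```python
import re

# Hypothetical heuristics for the three issue classes listed above.
# Coordinates such as 25°30′N; temperatures like 25°C do not match.
COORDINATE_RE = re.compile(r"\d{1,3}°\s*(?:\d{1,2}′\s*)?(?:\d{1,2}(?:\.\d+)?″\s*)?[NSEW]\b")
# Markup that was escaped and would be shown as literal text, e.g. &lt;i ...
ESCAPED_HTML_RE = re.compile(r"&lt;/?[a-zA-Z]|&amp;(?:amp|lt|gt|quot);")
# The literal word "undefined" leaking into the output.
UNDEFINED_RE = re.compile(r"\bundefined\b")


def classify_extract(extract_html):
    """Return the names of the issue classes an extract appears to show."""
    issues = []
    if COORDINATE_RE.search(extract_html):
        issues.append("coordinates")
    if ESCAPED_HTML_RE.search(extract_html):
        issues.append("escaped-html")
    if UNDEFINED_RE.search(extract_html):
        issues.append("undefined")
    return issues
```

A heuristic like this will produce false positives (for example, an article that legitimately uses the word "undefined"), so flagged pages still need a manual look.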

Event Timeline

bearND triaged this task as High priority. Jan 16 2018, 11:17 PM
bearND created this task.
Restricted Application added a subscriber: Aklapper. Jan 16 2018, 11:17 PM
Jdlrobson updated the task description. Jan 16 2018, 11:20 PM
Jdlrobson updated the task description. Jan 16 2018, 11:24 PM

Change 404603 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/services/mobileapps@master] Summaries should not contain coordinates

https://gerrit.wikimedia.org/r/404603

Change 404609 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/services/mobileapps@master] Do not completely flatten links with child nodes

https://gerrit.wikimedia.org/r/404609

Change 404610 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/services/mobileapps@master] Test case: Nested spans should be removed...

https://gerrit.wikimedia.org/r/404610

After the above 2 patches I re-ran the script: http://jdlrobson.com/summaries/en.2b.jon.html
Those first 2 fixes eradicate all the problems listed in the description.
The 3rd patch adds a test which captures the Ariana Grande empty parenthetical issue, but this is probably lower importance.
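As an illustration of what an empty-parenthetical check might look like, here is a small sketch. It is not the test from the patch: `has_empty_parenthetical` is a hypothetical helper, and stripping tags with a regular expression is a simplification that assumes well-formed extract markup.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")           # naive tag stripper (assumption)
EMPTY_PARENS_RE = re.compile(r"\(\s*\)")  # "()" or "(   )"


def has_empty_parenthetical(extract_html):
    """Return True if the visible text contains parentheses with nothing inside."""
    text = TAG_RE.sub("", extract_html)
    return bool(EMPTY_PARENS_RE.search(text))
```

With a made-up input such as `<p><b>Example</b> (<span><span></span></span>) is a page.</p>`, the nested empty spans vanish when tags are stripped, which leaves `()` in the visible text and trips the check.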

Change 404609 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Do not completely flatten links with child nodes

https://gerrit.wikimedia.org/r/404609

Change 404603 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Summaries should not contain coordinates

https://gerrit.wikimedia.org/r/404603

@bearND I think all but the space in Gal Gadot are fixed now?
I should note I'm aware of this happening, but consider it pretty low priority and minor. In the interest of deploying sooner, I'd be keen to open a new task for it and deal with it post-deploy rather than add to the list of changes we're making. What do you think?

@Jdlrobson I think we can deploy what we have merged so far to MCS tomorrow. The switchover has to happen in RESTBase, which should happen after the DevSummit week anyway. There are a few more MCS tasks outstanding, too. I believe there is a good chance we will have the extra space eradicated by then as well, but I'm OK if that happens after the switchover. Same for the nested <span></span>.

Is http://jdlrobson.com/summaries/en.2b.jon.html still the latest and the same as what's on master? I haven't run it again but could do so if you haven't.

Jdlrobson closed this task as Resolved. Jan 17 2018, 11:30 PM
Jdlrobson claimed this task.

Being bold. I've opened up T185161 for the remaining issue...

Jdlrobson updated the task description. Jan 17 2018, 11:30 PM

Mentioned in SAL (#wikimedia-operations) [2018-01-18T18:47:35Z] <bsitzmann@tin> Finished deploy [mobileapps/deploy@669fb5b]: Update mobileapps to 2690899 (T184328 T184557 T177007 T184669 T177430 T185050) (duration: 07m 03s)

Change 404610 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Summary: remove nested spans

https://gerrit.wikimedia.org/r/404610