Content service shifts wrong lead paragraph to the top in certain cases
Closed, Resolved · Public · 1 Story Points

Description

The English Wikipedia page Maubeuge begins with the following paragraphs:

{{Use dmy dates|date=October 2014}}

{{Infobox French commune...}}

'''Maubeuge''' is a [[Communes of France|commune]]...

It is situated on both banks of the [[Sambre]]...

However, in version 2.1.144-r-2016-05-09 of the Android app, the paragraph "It is situated..." is displayed as the lead, with the true lead after the infobox.

Note: this only happens if Restbase is enabled. When using MW, the correct paragraph is shifted.

Restricted Application added subscribers: Zppix, Aklapper. May 14 2016, 7:35 PM
Dbrant renamed this task from Android app displays paragraphs in incorrect order to Content service shifts wrong lead paragraph to the top in certain case. May 14 2016, 9:02 PM
Dbrant added a project: Mobile-Content-Service.
Dbrant updated the task description.
neilpquinn updated the task description. May 14 2016, 9:12 PM
Niedzielski set the point value for this task to 1. May 25 2016, 7:42 PM
bearND claimed this task. May 25 2016, 7:49 PM

Change 292384 had a related patch set uploaded (by BearND):
Reduce minimum lead paragraph length

https://gerrit.wikimedia.org/r/292384

Change 292384 merged by jenkins-bot:
Reduce minimum lead paragraph length

https://gerrit.wikimedia.org/r/292384

bearND renamed this task from Content service shifts wrong lead paragraph to the top in certain case. to Content service shifts wrong lead paragraph to the top in certain cases. Jun 3 2016, 5:32 PM

@Niedzielski My comment about Asian languages in the commit message was more along the lines that for those I think we should have an even smaller minimum number of characters, maybe like 10 or 15. The harder part is figuring out which domains are affected. One idea is to hard-code the list of Asian domains. Maybe there is a better way.

Getting a list of pages that require purging for this case is tricky.

  • One idea is to look at the wikitext and see if the number of characters of the first paragraph after any templates is between 40 and 80.
  • Another option is to look at the mobile-sections output and look in sections[0] for any paragraphs anywhere that are between 40 and 80 characters long. This will have some false positives but is probably easier to automate.
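
For illustration, here is a rough sketch of the first (wikitext) idea. The function names firstParagraphAfterTemplates and needsPurge are hypothetical, not part of the content service. It assumes templates are delimited by {{ and }} (possibly nested) and that paragraphs are separated by a blank line. It also counts raw wikitext characters, markup included, so the measured length will differ from the rendered text length.

```javascript
// Hypothetical helper (not part of the content service): skips whitespace and
// leading {{...}} templates, including nested ones, then returns the first
// paragraph, i.e. the text up to the next blank line.
function firstParagraphAfterTemplates(wikitext) {
  let i = 0;
  while (i < wikitext.length) {
    if (/\s/.test(wikitext[i])) {
      i += 1;
    } else if (wikitext.startsWith('{{', i)) {
      // Walk to the matching close, tracking nesting depth.
      let depth = 0;
      while (i < wikitext.length) {
        if (wikitext.startsWith('{{', i)) {
          depth += 1;
          i += 2;
        } else if (wikitext.startsWith('}}', i)) {
          depth -= 1;
          i += 2;
          if (depth === 0) break;
        } else {
          i += 1;
        }
      }
    } else {
      break;
    }
  }
  const rest = wikitext.slice(i);
  const end = rest.search(/\n\s*\n/);
  return (end === -1 ? rest : rest.slice(0, end)).trim();
}

// Assumed bounds: the 40 and 80 characters discussed in this task.
function needsPurge(wikitext, min = 40, max = 80) {
  const len = firstParagraphAfterTemplates(wikitext).length;
  return len >= min && len <= max;
}
```

This does not handle comments, tables, or files that precede the lead paragraph, so it would need refinement before being used to build a real purge list.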

@Dbrant This is in Tracking on the backlog. Should it be?

Mholloway added a subscriber: mobrovac. Edited Jun 15 2016, 3:36 PM

I think this might be blocked on reaching final agreement on a regeneration strategy. I'm not sure exactly what we have to work with in constructing queries for purging/regeneration purposes (wikitext, a DOM object, or just the final, stored JSON response). I've been assuming we only have the latter, i.e. the stored JSON response to work with. (@mobrovac, is this correct?)

With that in mind, I'd vote for the second strategy @bearND suggests above: flag a page when some pair of <p></p> tags encloses content that is 40 to 80 characters long once converted to plain text. That seems fine-grained enough to provide a meaningful limit but simple enough to execute correctly.

In formal terms, for /mobile-sections/, I guess that would be where
var regex = new RegExp('<p>.{40,80}</p>', 'g');
and
regex.test(lead.sections[0].text.replace(/<(?!\s*\/?\s*p\b)[^>]*>/g, '')); [1]

@bearND @Niedzielski @Dbrant does this seem reasonable?

@mobrovac, do we need to do both mobile-sections and mobile-sections-lead for this one (and all mobile-sections* for the other task) or can we somehow get by with doing only one endpoint? (In both cases, adapting to -lead or -remaining endpoints just involves omitting the lead. or remaining. prefix from the response object being tested as appropriate).

[1] The replacement regex is intended to strip all markup except <p></p> tags and works as expected based on my testing. It's inspired by http://stackoverflow.com/a/828636. Regex improvement/bikeshedding welcome.
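
As a possible refinement, here is a sketch that measures each paragraph separately instead of using one pattern over the whole string. The function name hasShortParagraph is hypothetical, and the 40 and 80 bounds are the ones discussed in this task. It is motivated by two properties of the snippet above: `.{40,80}` can match across a `</p><p>` boundary, so two short adjacent paragraphs may be reported as one, and a regex created with the 'g' flag keeps its lastIndex between test() calls, which matters if the same object is reused across pages. This version also accepts <p> tags that carry attributes.

```javascript
// Hypothetical helper: returns true if any <p>...</p> in the HTML string
// contains between min and max characters once inner markup is stripped.
function hasShortParagraph(html, min = 40, max = 80) {
  // Non-greedy capture so each match stops at the first closing </p>.
  const paragraphRe = /<p\b[^>]*>([\s\S]*?)<\/p>/gi;
  for (const match of html.matchAll(paragraphRe)) {
    // Strip any remaining tags inside the paragraph. HTML entities are not
    // decoded here, so "&amp;" still counts as five characters.
    const text = match[1].replace(/<[^>]*>/g, '').trim();
    if (text.length >= min && text.length <= max) {
      return true;
    }
  }
  return false;
}
```

For the /mobile-sections/ response this would be called as hasShortParagraph(lead.sections[0].text), assuming the response shape used in the snippet above.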

"Another option is to look at the mobile-sections output and look in sections[0] for any paragraphs anywhere that are between 40 and 80. This will have some false positives but probably easier to automate."

Correct HTML regexes are tricky. I like this approach of looking at the plain text length. I'm missing the context on why we limit the length to 80, but I'm sure there's a good reason.

@Dbrant lol

If we have access to the plain text, then for sure we should use that and not HTML regexes, but if my guess at how this works is correct, we don't have that luxury; all we have access to is the service's cached final output, i.e., strings of HTML content broken into sections and served in a JSON structure.

Dbrant closed this task as Resolved. Jun 27 2016, 4:37 PM