Content service shifts wrong lead paragraph to the top in certain cases
Closed, Resolved · Public · 1 Estimated Story Points

Description

The English Wikipedia page Maubeuge begins with the following paragraphs:

{{Use dmy dates|date=October 2014}}

{{Infobox French commune...}}

'''Maubeuge''' is a [[Communes of France|commune]]...

It is situated on both banks of the [[Sambre]]...

However, in version 2.1.144-r-2016-05-09 of the Android app, the paragraph "It is situated..." is displayed as the lead, with the true lead after the infobox.

Screenshot_2016-05-14-12-32-17.png (1×1 px, 1 MB)

Note: this only happens when RESTBase is enabled. When using MediaWiki (MW) instead, the correct paragraph is shifted.

Details

	Subject	Repo	Branch	Lines +/-
	Reduce minimum lead paragraph length	mediawiki/services/mobileapps	master	+1 -1


Related Objects

		Status	Subtype	Assigned	Task
		Resolved		• bearND	T135315 Content service shifts wrong lead paragraph to the top in certain cases
		Resolved		• mobrovac	T136964 Pre-generate/purge mobile-sections endpoints to fix page links inside image captions

Event Timeline

nshahquinn-wmf created this task. May 14 2016, 7:35 PM

Restricted Application added subscribers: Zppix, Aklapper. May 14 2016, 7:35 PM

Dbrant renamed this task from "Android app displays paragraphs in incorrect order" to "Content service shifts wrong lead paragraph to the top in certain case." May 14 2016, 9:02 PM

Dbrant added a project: Mobile-Content-Service.

Dbrant updated the task description.

nshahquinn updated the task description. May 14 2016, 9:12 PM

• Niedzielski set the point value for this task to 1. May 25 2016, 7:42 PM

Dbrant moved this task from Needs Triage to Tracking on the Wikipedia-Android-App-Backlog board. May 25 2016, 7:45 PM

• bearND moved this task from Incoming to Backlog on the Mobile-Content-Service board. May 25 2016, 7:45 PM

• bearND claimed this task. May 25 2016, 7:49 PM

Change 292384 had a related patch set uploaded (by BearND):
Reduce minimum lead paragraph length

https://gerrit.wikimedia.org/r/292384

gerritbot added a project: Patch-For-Review. Jun 2 2016, 4:33 PM

Change 292384 merged by jenkins-bot:
Reduce minimum lead paragraph length

https://gerrit.wikimedia.org/r/292384

• bearND added projects: Mobile-App-Android-Sprint-83-Bismuth, Unplanned-Sprint-Work. Jun 2 2016, 5:30 PM

• bearND moved this task from To Do to Ready for Signoff on the Mobile-App-Android-Sprint-83-Bismuth board.

• bearND moved this task from Backlog to To Deploy on the Mobile-Content-Service board. Jun 3 2016, 2:55 AM

• bearND renamed this task from "Content service shifts wrong lead paragraph to the top in certain case." to "Content service shifts wrong lead paragraph to the top in certain cases". Jun 3 2016, 5:32 PM

• bearND mentioned this in T136964: Pre-generate/purge mobile-sections endpoints to fix page links inside image captions.

• Mholloway mentioned this in T136997: Add class to shifted first paragraphs for easy identification after the fact. Jun 3 2016, 9:51 PM

@Niedzielski My comment about Asian languages in the commit message was more along the lines that for those I think we should have an even smaller minimum number of characters, maybe like 10 or 15. The harder part is figuring out which domains are affected. One idea is to hard-code the list of Asian domains. Maybe there is a better way.
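The hard-coded list idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function name, the domain list, the default of 40 characters, and the smaller value of 15 (picked from the "10 or 15" range mentioned above). It is not taken from the mobileapps code.

```javascript
// Illustrative only: the domain list and both thresholds are assumptions.
const DEFAULT_MIN_LENGTH = 40;
const SHORT_TEXT_MIN_LENGTH = 15;

// Hypothetical hard-coded list of domains that get the smaller minimum.
const SHORT_TEXT_DOMAINS = new Set([
  'zh.wikipedia.org',
  'ja.wikipedia.org',
  'ko.wikipedia.org',
]);

// Returns the minimum lead paragraph length to use for a given domain.
function minLeadParagraphLength(domain) {
  return SHORT_TEXT_DOMAINS.has(domain.toLowerCase())
    ? SHORT_TEXT_MIN_LENGTH
    : DEFAULT_MIN_LENGTH;
}
```

A lookup like this is easy to read and review, but the list has to be maintained by hand, which is the "harder part" mentioned above.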

• Mholloway subscribed. Jun 3 2016, 11:34 PM

Getting a list of pages that require purging for this case is tricky.

One idea is to look at the wikitext and see if the number of characters of the first paragraph after any templates is between 40 and 80.
Another option is to look at the mobile-sections output and look in sections[0] for any paragraphs anywhere that are between 40 and 80. This will have some false positives but probably easier to automate.
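A minimal sketch of the second option, assuming a mobile-sections response shaped like `{ lead: { sections: [{ text }] } }` where `text` is an HTML string. The helper names `plainTextLength` and `needsPurge` are hypothetical, and the tag stripping is deliberately crude (entities are not decoded), so the counts are approximate.

```javascript
// Hypothetical helpers, not part of the mobileapps service.
const MIN_LEN = 40; // bounds taken from the comment above
const MAX_LEN = 80;

// Plain-text length of an HTML fragment: drop tags, collapse whitespace.
// HTML entities are not decoded in this sketch.
function plainTextLength(html) {
  return html.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim().length;
}

// True if any <p>...</p> in the lead section has 40 to 80 characters of text.
function needsPurge(response) {
  const html = response.lead.sections[0].text || '';
  const paragraphs = html.match(/<p\b[^>]*>[\s\S]*?<\/p>/gi) || [];
  return paragraphs.some((p) => {
    const len = plainTextLength(p);
    return len >= MIN_LEN && len <= MAX_LEN;
  });
}
```

Because this looks at every paragraph in the lead section rather than only the one that would be shifted, it flags more pages than strictly necessary, which matches the false positives mentioned above.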

Dbrant added a project: Mobile-App-Android-Sprint-84-Polonium. Jun 6 2016, 1:06 PM

Dbrant moved this task from To Do to Ready for Signoff on the Mobile-App-Android-Sprint-84-Polonium board. Jun 6 2016, 1:06 PM

MBinder_WMF removed a project: Unplanned-Sprint-Work. Jun 6 2016, 5:22 PM

@Dbrant This is in Tracking on the backlog. Should it be?

Dbrant added a subtask: T136964: Pre-generate/purge mobile-sections endpoints to fix page links inside image captions. Jun 8 2016, 8:17 PM

I think this might be blocked on reaching final agreement on a regeneration strategy. I'm not sure exactly what we have to work with in constructing queries for purging/regeneration purposes (wikitext, a DOM object, or just the final, stored JSON response). I've been assuming we only have the latter, i.e. the stored JSON response to work with. (@mobrovac, is this correct?)

With that in mind, I'd vote for the second strategy @bearND suggests above: flag a page where there exists content between a set of <p> tags with a length of 40 to 80 characters when converted to plain text. That seems fine-grained enough to provide a meaningful limit but simple enough to execute correctly.

In formal terms, for /mobile-sections/, I guess that would be where
var regex = new RegExp('.{40,80}', 'g');
and
regex.test(lead.sections[0].text.replace(/<(?!\s*\/?\s*p\b)[^>]*>/g, '')); [1]

@bearND @Niedzielski @Dbrant does this seem reasonable?

@mobrovac, do we need to do both mobile-sections and mobile-sections-lead for this one (and all mobile-sections* for the other task) or can we somehow get by with doing only one endpoint? (In both cases, adapting to -lead or -remaining endpoints just involves omitting the lead. or remaining. prefix from the response object being tested as appropriate).
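As a sketch of the "omit the prefix" point, a single accessor could cover both response shapes. The shapes shown in the comments are assumptions based on the description above (a combined response with a `lead` object, and a -lead response that is the lead object itself), and `leadSections` is a hypothetical name.

```javascript
// Hypothetical helper: returns the lead section list for either response shape.
// Assumed shapes:
//   mobile-sections:      { lead: { sections: [...] }, remaining: { sections: [...] } }
//   mobile-sections-lead: { sections: [...] }
function leadSections(response) {
  const lead = response.lead || response;
  return lead.sections || [];
}
```

With an accessor like this, the same length test can run against stored output from either endpoint.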

[1] The replacement regex is intended to strip all markup except <p> tags and works as expected based on my testing. It's inspired by http://stackoverflow.com/a/828636. Regex improvement/bikeshedding welcome.
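To make the behaviour of the two regexes concrete, here is a small runnable sketch on made-up input. The stripping regex is copied from the comment above; the function names and sample strings are hypothetical. One deliberate deviation: the length regex is written without the `g` flag, because `RegExp.prototype.test` on a global regex keeps `lastIndex` state between calls.

```javascript
// Strip every tag except <p> and </p> (regex copied from the comment above).
function stripAllButP(html) {
  return html.replace(/<(?!\s*\/?\s*p\b)[^>]*>/g, '');
}

// The proposed check, using a non-global regex so that test() is stateless.
function matchesProposal(html) {
  return /.{40,80}/.test(stripAllButP(html));
}

const shortPara = '<p>Too <i>short</i>.</p>';
const mediumPara = '<p>' + 'x'.repeat(60) + '</p>';
const longPara = '<p>' + 'x'.repeat(200) + '</p>';

// Prints: <p>The quick fox.</p>
console.log(stripAllButP('<p>The <b>quick</b> <a href="/wiki/Fox">fox</a>.</p>'));

// Prints: false true true
console.log(matchesProposal(shortPara), matchesProposal(mediumPara), matchesProposal(longPara));
```

Note that the 200-character paragraph also matches: `.{40,80}` is unanchored, so it succeeds on any run of 40 or more characters (the remaining `<p>` tags included) and effectively acts as a lower bound only. Measuring each paragraph's text separately would be needed to enforce the upper limit of 80.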

Another option is to look at the mobile-sections output and look in sections[0] for any paragraphs anywhere that are between 40 and 80. This will have some false positives but probably easier to automate.

Correct HTML regexes are tricky. I like this approach of looking at the plain text length. I'm missing the context on why we limit the length to 80 but I'm sure there's a good reason.

MBinder_WMF unsubscribed. Jun 15 2016, 4:32 PM

obligatory:
http://stackoverflow.com/a/1732454/67808

@Dbrant lol

If we have access to the plain text, then for sure we should use that and not HTML regexes, but if my guess at how this works is correct, we don't have that luxury; all we have access to is the service's cached final output, i.e., strings of HTML content broken into sections and served in a JSON structure.

Dbrant added a project: Mobile-App-Android-Sprint-85-Astatine. Jun 20 2016, 3:46 PM

Dbrant moved this task from To Do to Ready for Signoff on the Mobile-App-Android-Sprint-85-Astatine board. Jun 20 2016, 3:48 PM

Dbrant moved this task from Ready for Signoff to Done on the Mobile-App-Android-Sprint-85-Astatine board. Jun 27 2016, 4:35 PM

Dbrant closed this task as Resolved. Jun 27 2016, 4:37 PM

• mobrovac closed subtask T136964: Pre-generate/purge mobile-sections endpoints to fix page links inside image captions as Resolved. Jul 13 2016, 11:07 AM