First paragraph comes second
Open, Needs Triage, Public

Description

On some ja.wp articles, the first paragraph is displayed second when viewed in the iOS or Android app.
This does not affect all articles, only some (I am not sure what triggers it).

Occurs on:
Android app: 2.7.269-beta-2018-12-11
iOS app: 6.1.4 (1537)

Examples:
チョー (俳優)
https://ja.wikipedia.org/wiki/%E3%83%81%E3%83%A7%E3%83%BC_(%E4%BF%B3%E5%84%AA)
大阪王将
https://ja.wikipedia.org/wiki/%E5%A4%A7%E9%98%AA%E7%8E%8B%E5%B0%86
真藤順丈
https://ja.wikipedia.org/wiki/%E7%9C%9F%E8%97%A4%E9%A0%86%E4%B8%88

iOS bug example: [screenshot]

Android bug example: [screenshot]

Takot created this task. Jan 18 2019, 4:12 PM

@Mholloway @bearND

The section data 0.text in https://ja.wikipedia.org/api/rest_v1/page/mobile-sections-lead/大阪王将 returns the following data:

"<p>類似する名称を持つ店舗チェーンとして<a href=\"/wiki/餃子の王将\" title=\"餃子の王将\">餃子の王将</a>があるが、こちらは株式会社<a href=\"/wiki/王将フードサービス\" title=\"王将フードサービス\">王将フードサービス</a>が<a href=\"/wiki/京阪神\" title=\"京阪神\">京阪神</a>地区を中心に展開している<b>全く別の</b>チェーンであり、業務上の関係は一切存在しない。</p>\n\n<p><b>大阪王将</b>(おおさかおうしょう)は、株式会社<a href=\"/wiki/イートアンド\" title=\"イートアンド\">イートアンド</a>が展開している中華料理店チェーン。</p>\n\n"

which moves the actual first paragraph (<p><b>大阪王将</b>(おおさかおうしょう)は、株式会社<a href=\"/wiki/イートアンド\" title=\"イートアンド\">イートアンド</a>が展開している中華料理店チェーン。</p>) into second position.

Could this be an issue with the endpoint?

I haven't tested this, but it looks like the true first paragraph is being skipped by the LeadIntroductionTransform in wikimedia-page-library because it contains fewer than 50 (minEligibleTextLength) characters.

The assumption of a 50-character minimum for a reasonable paragraph is probably not valid for languages written in non-Latin scripts.
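
To illustrate the suspected cause, here is a minimal sketch of a length-based eligibility check. It is not the actual wikimedia-page-library code: `firstEligibleParagraph` is a hypothetical helper, the paragraphs are plain strings (the text of the two paragraphs from the response above, with markup stripped) rather than DOM elements, and only the 50-character threshold named above is modeled.

```javascript
// Hypothetical stand-in for the length check; not the library's real code.
const MIN_ELIGIBLE_TEXT_LENGTH = 50;

// Text content of the two lead paragraphs of 大阪王将, in article order.
const paragraphs = [
  '大阪王将(おおさかおうしょう)は、株式会社イートアンドが展開している中華料理店チェーン。',
  '類似する名称を持つ店舗チェーンとして餃子の王将があるが、こちらは株式会社王将フードサービスが京阪神地区を中心に展開している全く別のチェーンであり、業務上の関係は一切存在しない。'
];

// Returns the first text that is at least minLength characters long,
// or undefined if none qualifies.
function firstEligibleParagraph(texts, minLength = MIN_ELIGIBLE_TEXT_LENGTH) {
  return texts.find(text => text.length >= minLength);
}

const chosen = firstEligibleParagraph(paragraphs);
console.log(paragraphs[0].length < MIN_ELIGIBLE_TEXT_LENGTH); // true: too short
console.log(chosen === paragraphs[1]); // true: the second paragraph is picked
```

With a lower threshold such as 20, the same helper would pick the true first paragraph instead.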

bearND added a project: Language-Team. Edited Jan 18 2019, 11:35 PM
bearND added subscribers: Mhurd, Niedzielski.

Yes, that is done in the page library.
The page library could detect whether the content language is Asian and then choose a smaller minimum, say 20 or even lower (any suggestions?).

This could be done in one of two ways:
In head: <meta http-equiv="content-language" content="ja">
Or lang attribute on the body element: <body lang="ja" ...>
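
A rough sketch of the two lookups, assuming a standard DOM `Document`. `contentLanguageOf` is a hypothetical helper name, not part of the page library, and the order in which the two sources are tried is an arbitrary choice.

```javascript
// Hypothetical helper: read the content language from a Document-like object.
function contentLanguageOf(doc) {
  // lang attribute on the body element, e.g. <body lang="ja">
  const bodyLang = doc.body && doc.body.getAttribute('lang');
  if (bodyLang) return bodyLang;
  // <meta http-equiv="content-language" content="ja"> in head
  const meta = doc.querySelector('meta[http-equiv="content-language"]');
  return meta ? meta.getAttribute('content') : null;
}
```

In a browser this would be called as `contentLanguageOf(document)`; it returns null when neither source is present.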

Or it could detect it based on the lead section itself:
/[\u0600-\uFFFF]/.test(paragraphElement.textContent).

I slightly prefer the latter version since we wouldn't have to hard-code a list of languages and wouldn't need to look at other areas of the DOM. We could choose to include all characters >= \u0600, which means starting with the Arabic Unicode block[1]. Or at the very least start with \u3000.

[1] https://en.wikipedia.org/wiki/Unicode_block
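
A minimal sketch of the text-based detection, assuming the check runs on the paragraph's text content. `minEligibleLengthFor` and both constant names are hypothetical, and 20 is only the value floated above, not a settled number.

```javascript
const DEFAULT_MIN_LENGTH = 50;   // current minimum, per this thread
const NON_LATIN_MIN_LENGTH = 20; // assumption: the value suggested above

// Picks a smaller minimum when the text contains any character matched
// by `pattern` (by default, anything from \u0600 upward).
function minEligibleLengthFor(text, pattern = /[\u0600-\uFFFF]/) {
  return pattern.test(text) ? NON_LATIN_MIN_LENGTH : DEFAULT_MIN_LENGTH;
}

console.log(minEligibleLengthFor('大阪王将は中華料理店チェーン。'));             // 20
console.log(minEligibleLengthFor('A restaurant chain.'));                       // 50
console.log(minEligibleLengthFor('It\u2019s a chain.'));                        // 20
console.log(minEligibleLengthFor('It\u2019s a chain.', /[\u3000-\uFFFF]/));     // 50
```

The third call shows a side effect of starting at \u0600: typographic punctuation such as the right single quotation mark (U+2019) also falls in that range, so Latin-script text containing it would get the smaller minimum. Starting at \u3000 avoids that particular case, at the cost of excluding scripts encoded below it.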

Even 10 or 15 might be safer for CJK if the article is a single sentence. A sentence may need only 6 characters.
Languages vary a lot. But if the only motivation is to avoid the <br> tag, why not build that condition into the detection?

Another idea is to use entropy.
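
The comment does not say how entropy would be used. One possible reading, offered only as an assumption, is to estimate a paragraph's information content from its character distribution rather than from its raw length. `shannonEntropy` below is a hypothetical helper that computes bits per character.

```javascript
// Shannon entropy of the character distribution, in bits per character.
// Array.from iterates by code point, so surrogate pairs count once.
function shannonEntropy(text) {
  const chars = Array.from(text);
  const counts = new Map();
  for (const ch of chars) {
    counts.set(ch, (counts.get(ch) || 0) + 1);
  }
  let entropy = 0;
  for (const count of counts.values()) {
    const p = count / chars.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

console.log(shannonEntropy('aaaa')); // 0
console.log(shannonEntropy('abcd')); // 2
```

Multiplying the result by the character count gives a rough total in bits, which could be compared against a threshold in place of a fixed character minimum. Whether that separates real lead paragraphs from short fragments across languages would need testing.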