
First paragraph comes second
Closed, ResolvedPublic

Description

The first paragraph of some ja.wp articles is displayed second when viewed in the iOS/Android apps.
This affects only some articles; I am not sure what triggers it.

Occurs on:
Android app: 2.7.269-beta-2018-12-11
iOS app: 6.1.4 (1537)

Examples:
チョー (俳優)
https://ja.wikipedia.org/wiki/%E3%83%81%E3%83%A7%E3%83%BC_(%E4%BF%B3%E5%84%AA)
大阪王将
https://ja.wikipedia.org/wiki/%E5%A4%A7%E9%98%AA%E7%8E%8B%E5%B0%86
真藤順丈
https://ja.wikipedia.org/wiki/%E7%9C%9F%E8%97%A4%E9%A0%86%E4%B8%88

iOS bug example:

IMG_2142.PNG (1×640 px, 189 KB)

Android bug example:
Screenshot_20190119-011135.png (1×1 px, 360 KB)

Event Timeline

Takot updated the task description. (Show Details)

@Mholloway @bearND

The section data 0.text in https://ja.wikipedia.org/api/rest_v1/page/mobile-sections-lead/大阪王将 returns the following data:

"<p>類似する名称を持つ店舗チェーンとして<a href=\"/wiki/餃子の王将\" title=\"餃子の王将\">餃子の王将</a>があるが、こちらは株式会社<a href=\"/wiki/王将フードサービス\" title=\"王将フードサービス\">王将フードサービス</a>が<a href=\"/wiki/京阪神\" title=\"京阪神\">京阪神</a>地区を中心に展開している<b>全く別の</b>チェーンであり、業務上の関係は一切存在しない。</p>\n\n<p><b>大阪王将</b>(おおさかおうしょう)は、株式会社<a href=\"/wiki/イートアンド\" title=\"イートアンド\">イートアンド</a>が展開している中華料理店チェーン。</p>\n\n"

which actually moves the true first paragraph (<p><b>大阪王将</b>(おおさかおうしょう)は、株式会社<a href=\"/wiki/イートアンド\" title=\"イートアンド\">イートアンド</a>が展開している中華料理店チェーン。</p>) into second position.

Could this be an issue with the endpoint?

I haven't tested this, but it looks like the true first paragraph is being skipped by LeadIntroductionTransform in wikimedia-page-library because it contains fewer than 50 (minEligibleTextLength) characters.

The assumption of a 50-character minimum for a reasonable paragraph is probably not valid for languages with non-Latin alphabets.
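To illustrate the suspected cause, here is a minimal sketch of a length-based paragraph picker. It is not the actual LeadIntroductionTransform code; firstEligibleIndex and the paragraphs array are hypothetical names, and the strings are the two 大阪王将 paragraphs with tags stripped, in article order.

```javascript
// Hypothetical simplification, not the real wikimedia-page-library logic.
// The constant mirrors the 50-character threshold mentioned above.
const MIN_ELIGIBLE_TEXT_LENGTH = 50;

// Returns the index of the first paragraph whose trimmed text is at least
// minLength UTF-16 code units long, or -1 if no paragraph qualifies.
function firstEligibleIndex(paragraphTexts, minLength = MIN_ELIGIBLE_TEXT_LENGTH) {
  return paragraphTexts.findIndex((text) => text.trim().length >= minLength);
}

// Plain text of the two lead paragraphs, in the order they appear in the article.
const paragraphs = [
  '大阪王将(おおさかおうしょう)は、株式会社イートアンドが展開している中華料理店チェーン。',
  '類似する名称を持つ店舗チェーンとして餃子の王将があるが、こちらは株式会社王将フードサービスが京阪神地区を中心に展開している全く別のチェーンであり、業務上の関係は一切存在しない。',
];

console.log(paragraphs[0].length);                // shorter than 50
console.log(firstEligibleIndex(paragraphs));      // picks the second paragraph
console.log(firstEligibleIndex(paragraphs, 20));  // picks the true first paragraph
```

With the 50-character threshold the short but complete first sentence is passed over, which matches the ordering seen in the endpoint output.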

bearND added subscribers: Mhurd, Niedzielski.

Yes, that is done in the page library.
The page library could detect whether the content language is Asian and then choose a smaller minimum, say 20 or even lower (any suggestions?).

This could be done in one of two ways:
In head: <meta http-equiv="content-language" content="ja">
Or lang attribute on the body element: <body lang="ja" ...>

Or it could detect it based on the lead section itself:
/[\u0600-\uFFFF]/.test(paragraphElement.textContent).

I slightly prefer the latter version since we wouldn't have to hard-code a list of languages and wouldn't need to look at other areas of the DOM. We could choose to include all characters >= \u0600, which means starting with the Arabic Unicode block[1]. Or at the very least start with \u3000.

[1] https://en.wikipedia.org/wiki/Unicode_block
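A rough sketch of the text-based variant, assuming the thresholds 50 and 20 floated above. The function and constant names are hypothetical and this is not the page library's API; it takes the paragraph's text (for example paragraphElement.textContent) as a plain string.

```javascript
// Assumed values taken from this discussion, not from the library.
const DEFAULT_MIN_LENGTH = 50;
const NON_LATIN_MIN_LENGTH = 20;

// Matches any character from the Arabic block (U+0600) up to U+FFFF.
const NON_LATIN_RANGE = /[\u0600-\uFFFF]/;

// Picks the minimum length based on the characters found in the text itself.
function minEligibleLength(text) {
  return NON_LATIN_RANGE.test(text) ? NON_LATIN_MIN_LENGTH : DEFAULT_MIN_LENGTH;
}

// A paragraph is eligible when its trimmed text reaches the chosen minimum.
function isEligibleParagraph(text) {
  return text.trim().length >= minEligibleLength(text);
}

console.log(minEligibleLength('大阪王将は中華料理店チェーン。'));
console.log(minEligibleLength('A short Latin-script sentence.'));
```

One caveat with starting at \u0600: the General Punctuation block (for example the curly apostrophe U+2019) also falls in that range, so a Latin-script paragraph containing typographic quotes would get the lower threshold too. Starting at \u3000 avoids that particular case.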

Even 10 or 15 might be safer for CJK if the article is a single sentence. A sentence can take as few as 6 characters.
Languages vary a lot. But if the only motivation is to avoid a <br> tag, why not check for that condition directly in the detection?

Another idea is to use entropy.

LGoto triaged this task as Low priority. Aug 27 2019, 8:34 PM
Nikerabbit subscribed.

One could also use Unicode script property escapes in regular expressions. In Content Translation we normally count whitespace-separated words, but for CJK languages we just count characters instead.
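As a sketch of that idea (not Content Translation's actual code; contentSize and the choice of scripts are assumptions for illustration), script property escapes need the u flag and are available since ES2018:

```javascript
// Assumed set of scripts treated as "count characters"; Han, Hiragana and
// Katakana cover Japanese and Chinese text. Other scripts could be added.
const CHARACTER_COUNTED_SCRIPTS = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}]/u;

// Returns a size for the text: code points if it contains any character from
// the scripts above, otherwise the number of whitespace-separated words.
function contentSize(text) {
  const trimmed = text.trim();
  if (trimmed === '') {
    return 0;
  }
  if (CHARACTER_COUNTED_SCRIPTS.test(trimmed)) {
    // Array.from iterates by code point, so astral characters count once.
    return Array.from(trimmed).length;
  }
  return trimmed.split(/\s+/).length;
}

console.log(contentSize('the quick brown fox')); // 4 words
console.log(contentSize('大阪王将'));             // 4 characters
```

A minimum expressed in these units would then need separate thresholds for word-counted and character-counted text.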

JMinor claimed this task.
JMinor subscribed.

Just checked https://ja.wikipedia.org/wiki/%E3%83%81%E3%83%A7%E3%83%BC_(%E4%BF%B3%E5%84%AA) and this is working as expected on mobile web and iOS. I think this has been fixed, but the task was never closed.