First paragraph comes second
Closed, Resolved, Public

Assigned To

Authored By

	Takot
	Jan 18 2019, 4:12 PM

Description

The first paragraph of some specific (ja.wp) articles appears second when viewed in the iOS/Android app.
This affects only some articles, not all (I am not sure what triggers it).

Occurs on:
Android app: 2.7.269-beta-2018-12-11
iOS app: 6.1.4 (1537)

Examples:
チョー (俳優)
https://ja.wikipedia.org/wiki/%E3%83%81%E3%83%A7%E3%83%BC_(%E4%BF%B3%E5%84%AA)
大阪王将
https://ja.wikipedia.org/wiki/%E5%A4%A7%E9%98%AA%E7%8E%8B%E5%B0%86
真藤順丈
https://ja.wikipedia.org/wiki/%E7%9C%9F%E8%97%A4%E9%A0%86%E4%B8%88

iOS bug example:

Android bug example:

Screenshot_20190119-011135.png (1×1 px, 360 KB)

Event Timeline

Takot created this task. Jan 18 2019, 4:12 PM

Restricted Application added projects: Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog. Jan 18 2019, 4:12 PM

Restricted Application added a subscriber: Aklapper.

Takot updated the task description. Jan 18 2019, 4:15 PM

Takot updated the task description.

Takot updated the task description. Jan 18 2019, 4:22 PM

Takot updated the task description.

Takot updated the task description. Jan 18 2019, 4:29 PM

@Mholloway @bearND

The section data 0.text from https://ja.wikipedia.org/api/rest_v1/page/mobile-sections-lead/大阪王将 contains the following:

"<p>類似する名称を持つ店舗チェーンとして<a href=\"/wiki/餃子の王将\" title=\"餃子の王将\">餃子の王将</a>があるが、こちらは株式会社<a href=\"/wiki/王将フードサービス\" title=\"王将フードサービス\">王将フードサービス</a>が<a href=\"/wiki/京阪神\" title=\"京阪神\">京阪神</a>地区を中心に展開している<b>全く別の</b>チェーンであり、業務上の関係は一切存在しない。</p>\n\n<p><b>大阪王将</b>（おおさかおうしょう）は、株式会社<a href=\"/wiki/イートアンド\" title=\"イートアンド\">イートアンド</a>が展開している中華料理店チェーン。</p>\n\n"

which actually moves the article's first paragraph (<p><b>大阪王将</b>（おおさかおうしょう）は、株式会社<a href="/wiki/イートアンド" title="イートアンド">イートアンド</a>が展開している中華料理店チェーン。</p>) into second position.

Could this be an issue with the endpoint?

I haven't tested this, but it looks like the true first paragraph is being skipped by the LeadIntroductionTransform in wikimedia-page-library because it does not contain at least 50 (minEligibleTextLength) characters.

The assumption of a 50-character minimum for a reasonable paragraph is probably not valid for languages with non-Latin alphabets.
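To make the suspected failure mode concrete, here is a minimal sketch of a length-based eligibility check. It is not the actual wikimedia-page-library code: `firstEligibleIndex` and the plain-string input are hypothetical simplifications, and only the 50-character threshold (minEligibleTextLength) comes from the comment above. It assumes the transform picks the first paragraph whose text meets the minimum.

```javascript
// Hypothetical simplification of the eligibility check discussed above.
// The real transform works on DOM paragraph elements; plain strings are used here.
const MIN_ELIGIBLE_TEXT_LENGTH = 50; // threshold named in the comment above

// Returns the index of the first paragraph whose text is long enough, or -1.
function firstEligibleIndex(paragraphTexts, minLength = MIN_ELIGIBLE_TEXT_LENGTH) {
  return paragraphTexts.findIndex((text) => text.trim().length >= minLength);
}

// Text content of the two 大阪王将 lead paragraphs, in article order (tags stripped).
const leadParagraphs = [
  '大阪王将（おおさかおうしょう）は、株式会社イートアンドが展開している中華料理店チェーン。',
  '類似する名称を持つ店舗チェーンとして餃子の王将があるが、こちらは株式会社王将フードサービスが京阪神地区を中心に展開している全く別のチェーンであり、業務上の関係は一切存在しない。',
];

console.log(firstEligibleIndex(leadParagraphs));     // index chosen with the 50-character minimum
console.log(firstEligibleIndex(leadParagraphs, 20)); // index chosen with a lower minimum
```

Under these assumptions, the 50-character minimum skips the short first paragraph and selects the second one, which matches the order seen in the API response. A minimum of 20 would select the true first paragraph.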

Yes, that is done in the page library.
The page library could detect whether the content language is Asian and then choose a smaller minimum, say 20 or even lower (any suggestions?).

This could be done in one of two ways:
In head: <meta http-equiv="content-language" content="ja">
Or lang attribute on the body element: <body lang="ja" ...>

Or it could detect it based on the lead section itself:
/[\u0600-\uFFFF]/.test(paragraphElement.textContent).

I slightly prefer the latter version since we wouldn't have to hard-code a list of languages and wouldn't need to look at other areas of the DOM. We could choose to include all characters >= \u0600, which means starting with the Arabic Unicode block[1]. Or at the very least start with \u3000.
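A rough sketch of the content-based variant, assuming the threshold is chosen per paragraph from its text. The function names and the two constants are hypothetical; 50 is the existing minimum and 20 is only the value floated above, not a settled choice.

```javascript
// Regex suggested above: any BMP character from U+0600 (Arabic block) upward.
const NON_LATIN_RANGE = /[\u0600-\uFFFF]/;

// Hypothetical thresholds: 50 is the current minimum, 20 is the value floated above.
const DEFAULT_MIN_LENGTH = 50;
const REDUCED_MIN_LENGTH = 20;

// Picks the minimum length based on whether the text contains a character in the range.
function minEligibleLengthFor(text) {
  return NON_LATIN_RANGE.test(text) ? REDUCED_MIN_LENGTH : DEFAULT_MIN_LENGTH;
}

// Returns true when the trimmed text meets the minimum chosen for it.
function isParagraphTextEligible(text) {
  const trimmed = text.trim();
  return trimmed.length >= minEligibleLengthFor(trimmed);
}

console.log(isParagraphTextEligible('大阪王将（おおさかおうしょう）は、株式会社イートアンドが展開している中華料理店チェーン。'));
console.log(isParagraphTextEligible('Short English text.'));
```

One thing to keep in mind with this sketch: a single matching character switches the threshold, and the range starting at \u0600 also covers general punctuation such as U+2013 (en dash). Starting at \u3000 instead would narrow that.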

[1] https://en.wikipedia.org/wiki/Unicode_block

For CJK, even 10 or 15 might be safer if the lead is a single sentence; six characters can be enough to form a sentence.
Languages can vary a lot. But if the only motivation is to avoid <br> tags, why not build that condition into the detection itself?

Another idea is to use entropy.

• JMinor moved this task from Needs Triage to Tracking on the Wikipedia-iOS-App-Backlog board. Jan 22 2019, 7:31 PM

• Charlotte moved this task from Needs Triage to Bug Backlog on the Wikipedia-Android-App-Backlog board. Feb 7 2019, 4:55 PM

LGoto triaged this task as Low priority. Aug 27 2019, 8:34 PM

• Charlotte moved this task from Bug Backlog to Tracking on the Wikipedia-Android-App-Backlog board. Nov 27 2019, 5:41 PM

One could also use Unicode script property escapes in regular expressions. In Content Translation we normally count whitespace-separated words, but for CJK languages we just count characters instead.
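As an illustration of that idea, here is a small sketch using script property escapes. It is not Content Translation's code; `countTextUnits` and the choice of scripts are assumptions made for the example.

```javascript
// Matches a character in one of the main CJK scripts (property escapes need the `u` flag).
const CJK_CHAR = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u;

// Hypothetical helper: counts characters for CJK text, whitespace-separated words otherwise.
function countTextUnits(text) {
  const trimmed = text.trim();
  if (trimmed === '') return 0;
  if (CJK_CHAR.test(trimmed)) {
    // Spreading iterates by code point, so supplementary-plane ideographs count once.
    return [...trimmed.replace(/\s+/gu, '')].length;
  }
  return trimmed.split(/\s+/u).length;
}

console.log(countTextUnits('大阪王将は中華料理店チェーン。')); // counted per character
console.log(countTextUnits('A chain of restaurants'));        // counted per word
```

A minimum-length check built on a helper like this would need separate thresholds for the two kinds of unit, since a word count and a character count are not comparable.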

Just checked https://ja.wikipedia.org/wiki/%E3%83%81%E3%83%A7%E3%83%BC_(%E4%BF%B3%E5%84%AA) and this is working as expected on mobile web and iOS. I think this has been fixed, but the task was never closed.

I also checked these pages on iOS app 6.8.2 (1868), and they seem to be working as expected.

チョー (俳優)
https://ja.wikipedia.org/wiki/%E3%83%81%E3%83%A7%E3%83%BC_(%E4%BF%B3%E5%84%AA)
大阪王将
https://ja.wikipedia.org/wiki/%E5%A4%A7%E9%98%AA%E7%8E%8B%E5%B0%86
真藤順丈
https://ja.wikipedia.org/wiki/%E7%9C%9F%E8%97%A4%E9%A0%86%E4%B8%88

