Page MenuHomePhabricator

[Bug report] Removing parentheses breaks chemical formulas
Closed, ResolvedPublic

Description

Extension:Popups removes texts in parentheses from the extract.
https://phabricator.wikimedia.org/diffusion/EPOP/browse/master/src/formatter.js$97

Some chemical formulas have parentheses (e.g. Acetic anhydride).

Thus, if an article contains chemical formulas with parentheses in the head of the article, popup of the link to the article shows broken chemical formulas.

Steps to Reproduce

  1. Enable Extension:Popups.
  2. Create an article with chemical formulas parentheses with in the head. e.g. https://en.wikipedia.org/wiki/Acetic_anhydride which has formula (CH3CO)2O.
  3. Make a link to the article. e.g. https://en.wikipedia.org/wiki/Acetic_acid
  4. Move mouse to the link and show popup.

Actual Results

The popup only shows the chemical formulas without parentheses (e.g. '2O') wrongly.

Expected Results

The popup shows the chemical formulas (e.g. '(CH3CO)2O') correctly.

Related resource

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Vachovec1 reopened this task as Open.Feb 22 2018, 10:16 PM

@Jdlrobson : This is probably not resolved yet. Try to get a summary for Kyselina myristová. On my comp, the central group of the formula is omitted from the summary.

bearND added a subscriber: bearND.Feb 22 2018, 10:23 PM
Vachovec1 renamed this task from [Bug report]Removing parentheses breaks chemical formulas to [Bug report] Removing parentheses breaks chemical formulas.Feb 22 2018, 10:36 PM
Jdlrobson added a comment.EditedFeb 22 2018, 10:36 PM

@Vachovec1 when I visit https://cs.wikipedia.org/api/rest_v1/page/summary/Kyselina_myristov%C3%A1
it looks like the summary is

<p><b>Kyselina myristová</b> je běžná nasycená mastná kyselina se vzorcem CH<sub>3</sub> <sub>12</sub>COOH. Její soli a estery se nazývají <b>myristáty</b>.</p>

@bearND it looks like "(CH2)" is being stripped from:

Kyselina myristová (systematický název kyselina tetradekanová) je běžná nasycená mastná kyselina se vzorcem CH3(CH2)12COOH. Její soli a estery se nazývají myristáty (systematicky tetradekanoáty).

but the spec clearly says do not strip a parenthetical which doesn't contain spaces...

@Vachovec1 when I visit https://cs.wikipedia.org/api/rest_v1/page/summary/Kyselina_myristov%C3%A1
it looks like the summary is

<p><b>Kyselina myristová</b> je běžná nasycená mastná kyselina se vzorcem CH<sub>3</sub> <sub>12</sub>COOH. Její soli a estery se nazývají <b>myristáty</b>.</p>

Correct.

Change 413628 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/services/mobileapps@master] Add new test case to match spec

https://gerrit.wikimedia.org/r/413628

^ @bearND can you take it from there? (I've added a test case to show how this should behave.)

@Jdlrobson Sure. Will look into it. Thanks for the test case.

Ugh, this is going to get ugly. The reason why (CH<sub id="mwCw">2</sub>) is stripped is because it does include a space -- in the HTML, before the id attribute. I can strip the id attributes beforehand but this doesn't guarantee that there wouldn't be other attributes leading to a strip of the parenthetical.

We really need to remove all these attributes... :/

Alternatively, the regex could take the matching string and use a replace function - throwing it into a dom node and using textContent.

Change 413628 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Summary: remove some attributes before removing parentheticals

https://gerrit.wikimedia.org/r/413628

We don't want to be too aggressive when removing attributes since we don't want to remove needed attributes for other elements, like img, maybe figure and more.

bearND claimed this task.

I'm running the compare script. How urgent is this to get deployed?

Some more complicated testcases:

Nefrit (cs) / Nephrite (en)
Olivín (cs) / Olivine (en)

There is a comma inside the formula. In Czech Wikipedia there are no spaces, in English Wikipedia there is a space after the comma. How will the summary behave?

bearND added a comment.EditedFeb 23 2018, 12:57 AM

@Vachovec1 Thanks. These are even better test cases.

Looks like Nefrit is fine:

'<b id="mwCQ">Nefrit</b> <a rel="mw:WikiLink" href="./Vápník" title="Vápník" id="mwCg">Ca</a><sub id="mwCw">2</sub>(<a rel="mw:WikiLink" href="./Hořčík" title="Hořčík" id="mwDA">Mg</a>,<a rel="mw:WikiLink" href="./Železo" title="Železo" id="mwDQ">Fe</a>)<sub id="mwDg">5</sub><a rel="mw:WikiLink" href="./Křemík" title="Křemík" id="mwDw">Si</a><sub id="mwEA">8</sub><a rel="mw:WikiLink" href="./Kyslík" title="Kyslík" id="mwEQ">O</a><sub id="mwEg">22</sub>(<a rel="mw:WikiLink" href="./Kyslík" title="Kyslík" id="mwEw">O</a><a rel="mw:WikiLink" href="./Vodík" title="Vodík" id="mwFA">H</a>)<sub id="mwFQ">2</sub>',

turns into

'<b>Nefrit</b> Ca<sub>2</sub>(Mg,Fe)<sub>5</sub>Si<sub>8</sub>O<sub>22</sub>(OH)<sub>2</sub>'

Here's the text version: Nefrit Ca2(Mg,Fe)5Si8O22(OH)2

@Vachovec1 I'm extremely grateful for all your help with finding these errors. Thank you so much!

The other ones are missing several components. Here are the text versions of the formulas:

  • Nefrit: Ca2(Mg,Fe)5Si8O22(OH)2
  • Nephrite: Ca2 5Si8O22(OH)2
  • Olivín: (Mg,Fe)2[SiO4]
  • Olivine: 2SiO4

Still needs more work.

Change 413658 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Summary: add more tests for chemical formulas

https://gerrit.wikimedia.org/r/413658

Change 413669 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Summary: keep more (chemical) formulas

https://gerrit.wikimedia.org/r/413669

Change 413658 abandoned by BearND:
Summary: add more tests for chemical formulas

Reason:
superseeded by Idbae8d0743ca5f35dc37ea243eaa0819ed6a328a

https://gerrit.wikimedia.org/r/413658

Change 413669 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Summary: keep more (chemical) formulas intact

https://gerrit.wikimedia.org/r/413669

ovasileva triaged this task as High priority.Feb 23 2018, 4:21 PM
ovasileva moved this task from Backlog to For Review on the Page-Previews board.Feb 26 2018, 4:05 PM

Change 415455 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Summary: fix more complex chemical formulas

https://gerrit.wikimedia.org/r/415455

bearND added a comment.EditedFeb 28 2018, 9:49 PM

@Jdlrobson There is another issue with chemical formulas. see the test case in the patch above or compare https://en.wikipedia.org/api/rest_v1/page/html/Organic_acid_anhydride with https://en.wikipedia.org/api/rest_v1/page/summary/Organic_acid_anhydride or http://localhost:6927/en.wikipedia.org/v1/page/summary/Organic_acid_anhydride.
(RC(O))2O turns into (RC )2O.
That one is harder to solve since this breaks during the removal of nested parentheticals.

Ok, got a workaround for this. Not super proud of it but it seems to get it done. Check out the updated patch.

Change 415455 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Summary: fix more complex chemical formulas

https://gerrit.wikimedia.org/r/415455

Change 415593 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Summary: don't strip any parentheses if complex formula in intro

https://gerrit.wikimedia.org/r/415593

Change 415593 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Summary: don't strip any parentheses if complex formula in intro

https://gerrit.wikimedia.org/r/415593

Change 415607 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Summary: but keep stripping empty ()

https://gerrit.wikimedia.org/r/415607

Change 415607 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Summary: but keep stripping empty ()

https://gerrit.wikimedia.org/r/415607

Mentioned in SAL (#wikimedia-operations) [2018-03-01T18:07:25Z] <bsitzmann@tin> Started deploy [mobileapps/deploy@ada38aa]: Update mobileapps to 0db4a60 (T183833)

Mentioned in SAL (#wikimedia-operations) [2018-03-01T18:13:26Z] <bsitzmann@tin> Finished deploy [mobileapps/deploy@ada38aa]: Update mobileapps to 0db4a60 (T183833) (duration: 06m 01s)

bearND added a comment.Mar 1 2018, 6:22 PM

@ovasileva So, I think most formulas should stay intact now. We're also not touching any parentheticals if the intro paragraph has something like km<sup>2</sup>, which affects a lot of geographical locations.
But this comes at a price, that some IPAs in parentheticals don't get stripped anymore. Is that ok for now. I think we really need to push the communities to use the noexcerpt class (T188134).

Example pages where the IPA shows now on enwiki are Iran, Italy, Iceland, Los_Angeles, Malta.

Change 415623 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Summary: make regex more specific about chemical formulas

https://gerrit.wikimedia.org/r/415623

@bearND - that sounds good to me! Thank you!

Note: IPA s do not use templates so without a bit editing every page that is going to be a little impractical and unlikely.
Getting this right for a 100% articles is also impractical without editor assistance.
We may have to live with this unless we can solve the community problem :-/

bearND added a comment.EditedMar 1 2018, 7:11 PM

@Jdlrobson ok, I'm made a change to this chem formula detection so that it only applies when right before the <sup> or <sub> there's a ), or an upper case with optional lowercase letter (for O and Br, ...), or an digit for a math expression. See patch ^. I think this should help with most locations because they have km<sup>2</sup>.

Change 415623 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Summary: make regex more specific about certain formulas

https://gerrit.wikimedia.org/r/415623

Mentioned in SAL (#wikimedia-operations) [2018-03-01T21:28:05Z] <bsitzmann@tin> Started deploy [mobileapps/deploy@bd9924e]: Update mobileapps to 1056fde (T183833)

Mentioned in SAL (#wikimedia-operations) [2018-03-01T21:33:19Z] <bsitzmann@tin> Finished deploy [mobileapps/deploy@bd9924e]: Update mobileapps to 1056fde (T183833) (duration: 05m 15s)

Dvorapa moved this task from For Review to Backlog on the Page-Previews board.EditedMar 2 2018, 4:30 PM
Dvorapa added a subscriber: Dvorapa.

Notice the CH_322 (ashould be CH_3(CH_2)_22)

bearND added a comment.EditedMar 2 2018, 4:37 PM

I purged this page (it was at 1.3.2 before) and now it (with 1.3.4) has the complete formula: (C6H5)2CO. You may need to refresh your browser cache if you had recently opened the page preview for it.

For me purge, blank edit or page reload didn't help. And this looks broken for so many chemical articles across Czech Wikipedia

Dvorapa added a comment.EditedMar 2 2018, 4:43 PM

For example and testing purposes: https://cs.wikipedia.org/wiki/%C5%A0ablona:Monokarboxylov%C3%A9_kyseliny shows many similar buggy page previews

bearND added a comment.Mar 2 2018, 5:19 PM

You're right. I still see it in the UI, even though the response for that summary has changed to the correct string. I can see it in the debugger that is has the full formula but the actual popup still shows the old values.

[...] vzorcem (C<sub>6</sub>H<sub>5</sub>)<sub>2</sub>CO, zkracovaným [...].

Maybe one of web devs can look at it? Not sure where the old string gets cached.

ovasileva moved this task from Backlog to Needs Analysis on the Page-Previews board.
ovasileva moved this task from To Triage to Needs Analysis on the Readers-Web-Backlog board.

Change 397919 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/Popups@master] Remove client side formatters in Popups code base

https://gerrit.wikimedia.org/r/397919

There are some client side formatters that look like they are still running.
See

@phuedx @pmiazga @Sniedzielski ^

Change 415926 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/Popups@master] Remove client side formatters in the REST formatter

https://gerrit.wikimedia.org/r/415926

Change 415926 merged by jenkins-bot:
[mediawiki/extensions/Popups@master] Remove client side formatters in the REST formatter

https://gerrit.wikimedia.org/r/415926

Change 415934 had a related patch set uploaded (by Jdlrobson; owner: Jdlrobson):
[mediawiki/extensions/Popups@wmf/1.31.0-wmf.23] Remove client side formatters in the REST formatter

https://gerrit.wikimedia.org/r/415934

^ Not here Monday but anybody able to SWAT that?

Change 415934 merged by jenkins-bot:
[mediawiki/extensions/Popups@wmf/1.31.0-wmf.23] Remove client side formatters in the REST formatter

https://gerrit.wikimedia.org/r/415934

Mentioned in SAL (#wikimedia-operations) [2018-03-05T14:06:03Z] <hashar@tin> Started scap: Popups: Remove client side formatters in the REST formatter - T183833

Mentioned in SAL (#wikimedia-operations) [2018-03-05T14:06:20Z] <hashar@tin> scap aborted: Popups: Remove client side formatters in the REST formatter - T183833 (duration: 00m 16s)

Mentioned in SAL (#wikimedia-operations) [2018-03-05T14:08:40Z] <hashar@tin> Started scap: Popups: Remove client side formatters in the REST formatter - T183833

Mentioned in SAL (#wikimedia-operations) [2018-03-05T14:31:48Z] <hashar@tin> Finished scap: Popups: Remove client side formatters in the REST formatter - T183833 (duration: 23m 08s)

ABorbaWMF added a subscriber: ABorbaWMF.

Looks good on production now



@Jdlrobson I also noticed that the plain text version uses these formatters:

export function formatPlainTextExtract( plainTextExtract, title ) {
	var extract = plainTextExtract;
	if ( plainTextExtract === undefined ) {
		return [];
	}

	extract = removeParentheticals( extract );
	extract = removeTrailingEllipsis( extract );
[...]

Isn't the plain text version still used somewhere or is that all old code that should go away?

Isn't the plain text version still used somewhere or is that all old code that should go away?

@Jdlrobson can definitely speak to whether this is code that should go away. I suspect that this might now just be a misnomer though, perhaps formatMediaWikiApiExtract might be a better name?

@bearND formatPlainTestExtract is used only when using a MediaWiki API, not restbase. At this moment it's used only on private instances, as default data gateway is restbaseHTML. Currently we're discussing what to do with the old MediawikiGateway - Do we want to support that, if yes do we want to keep the old logic with parentheticals stripping etc).

bearND added a comment.Mar 6 2018, 3:50 PM

I see. Sounds like keeping the code but renaming it to something MW API sounds good to me.

bearND removed bearND as the assignee of this task.Mar 6 2018, 3:51 PM

yup, previously it was used both for restbasePlain and mediawikiPlain, now it's mw only.

Jdlrobson closed this task as Resolved.Mar 6 2018, 7:06 PM
Jdlrobson claimed this task.

Let's break the conversation about mediawikiPlain into T189042.