Page MenuHomePhabricator

Hovercard text extract is broken for `* ` sequence in parenthesis
Closed, DuplicatePublic

Description

Steps to reproduce

  1. Create a page with first sentence like this: '''Marek Eben''' (* [[18. prosinec|18. prosince]] [[1957]] [[Praha]]) je ... (or copy this page)
  2. Link to it from another article
  3. See its hovercard

Expected behavior
The first sentence should be shown in hovercard excluding parenthesis.

Current behavior
Only Marek Eben (* 18. is shown in a hovercard. I suppose this is some problem with * (asterisk & space meaning borndate) in (), because the same problem is also in hovercard for another similar page.

Snímek z 2017-08-19 22-37-01.png (741×1 px, 276 KB)

Event Timeline

Dvorapa updated the task description. (Show Details)
Dvorapa renamed this task from Hovercard text extract is broken for `* ` sequence in () to Hovercard text extract is broken for `* ` sequence in parenthesis.Aug 19 2017, 10:32 PM
Jdlrobson subscribed.

Thanks for the bug report! We're working on a new API to improve our summaries (T113094). I've made sure to include this example in there: https://gerrit.wikimedia.org/r/364906 Strip any content in brackets and IPA

With regards to fixing this in the TextExtracts extension, that's unlikely, once we've got this new API in production we are going to be supporting and encouraging its use and TextExtracts will be in maintenance mode.

@Jdlrobson: when would be the new API (with this error presumably fixed) in production? This bug is very annoying. On cs-wiki it affects a considerable part of biographical articles (possibly thousands of pages).

Btw. I would like to know, why Justin Timberlake preview shows no bug, when for example Vladimír Országh preview is bugged. Only difference I see is a space after the star/asterisk.

Cc @ovasileva ^ current timeline is December as far as I understand. New endpoint is built but needs thorough testing and I'm not sure if we can enable an endpoint on a single wiki. @bearND do you know?

@Vachovec1 Yes, it is in the space after asterisk. This difference brokes the old system completely.
@Jdlrobson Maybe our wiki could help with testing for other wikis. We can ask cswiki users, if they want Czech Wikipedia to participate on testing, and ackowledge them, that this functionality could be broken during next months (not more than before I hope), and we will collect all issue reports with new system for you. What do you think @Vachovec1?

@Jdlrobson I think it should be doable in the RESTBase layer but not sure how complicated that would be. @Pchelolo from the Services team would know better than me.
If it's December for the other wikis then I hope it would be early December since there are no deployments allowed during the last two weeks.

@bearND unfortunately we can't do that in RESTBase layer. The bug is in TextExtracts extension itself - for the request we're making to it from RB we only get content up to the asterix: https://cs.wikipedia.org/w/api.php?action=query&prop=extracts&exintro=true&exsentences=5&titles=Marek_Eben

@Pchelolo I was talking about exposing the MCS summary API from the RB summary endpoint for a single will project, not modifying the output coming from TextExtracts.

Oh, ok, didn't understand that. It's completely doable, will make a PR later today

Cc @ovasileva ^ current timeline is December as far as I understand. New endpoint is built but needs thorough testing and I'm not sure if we can enable an endpoint on a single wiki. @bearND do you know?

Yup - December for the endpoint, January for the deployments.

I think the idea of cs-wiki as a "test wiki" for the new API is feasible. We can give you feedback (e. g. we can look for errors and report them here), to help with the development.

I've uploaded a side by side comparison here for old versus new summary endpoint to give an idea of what you'd be trading in doing this:
http://jdlrobson.com/summaries/cs.html

@Jdlrobson From the first 200 only 22nd and 61st look broken

I've uploaded a side by side comparison here for old versus new summary endpoint to give an idea of what you'd be trading in doing this:
http://jdlrobson.com/summaries/cs.html

Ah, very interesting. Yes, #22 and #61 are missing (omission? breakage?).

Some observations (first 200):

#38 (this bug) – fixed.

#9, #15, #23, #62, #149 – new versions unnecesarilly stripped. Why?

#125, #127, #132, #173, #177 – old versions bugged (different cases of "unexpected" dots), now fixed.

Overall, looks promising.

61 is also broken in my report, but 22 is not. BTW, when I run the page number 61 through the current master code it seems to me to return the expected results. I guess we should run this report again.

@Vachovec1
#9, #15, #23, #62, #149 – new versions unnecesarilly stripped. Why?

That's usually because the new implementation only considers text from the first paragraph.

To clarify blank boxes do not mean they are broken. We are handling redirects differently for the new endpoint so you can ignore those in your comparisons. 22nd is that way because of the way the content is edited. It has no full stop and is a separate paragraph so behaving as expected!

@Vachovec1
#9, #15, #23, #62, #149 – new versions unnecesarilly stripped. Why?

That's usually because the new implementation only considers text from the first paragraph.

Ah, I see. Perhaps there should be a requirement for a minimum number of characters?

Btw. #888 is showing blue stripes. What's that?

@Vachovec1 The blue stripes just mean that the HTML is exactly the same between old and new version.

#808 is definitely bugged (in the new version).

#808 is definitely bugged (in the new version).

@Vachovec1 Agreed. Created a task for this.