Page MenuHomePhabricator

Do side by side comparison of old summary endpoint against new summary endpoint
Closed, ResolvedPublic

Description

To give us more confidence in swapping out the existing summary endpoint with the newly built summary endpoint (T168848) we'll want to compare results side by side, to determine if the new summary endpoint is as good or better than the existing one.

@bearND volunteered to comparing the output of the top 1000 pages in a few wikis.

Once this is done and associated parties are happy (Web, Apps), we will swap out the existing summary endpoint with the new.

Outcome

  • A document showing the HTML and text summaries of the top 1000 pages in a few wikis for old and new endpoint side by side. The purpose of this document is for product owners/QA to review the quality of summaries at their leisure.
  • Any bugs are flagged where the old summary endpoint is superior to the new one.

Related Objects

Event Timeline

@Jdlrobson is development on the new end point still stalled while OCG is worked out? Or is it moving forward? Are we still also replacing the text extracts in the old summary end point as well as a first step?

@Jdlrobson is development on the new end point still stalled while OCG is worked out? Or is it moving forward?

I'm working on it part time, but you should think of it as stalled. That said, I strongly believe the new endpoint is superior to the old and that's what I'd like to prove.

Are we still also replacing the text extracts in the old summary end point as well as a first step?

Yes that's what I'm hoping and doing this side by side comparison will accelerate that.

@Jdlrobson is development on the new end point still stalled while OCG is worked out?

Yes.

@bearND: If you're willing to compile the data for the side-by-side comparison as well as take a pass, then, sincerely, thank you. However, @ovasileva, the PM in charge of the Page Previews project, must have the final say.

Or is it moving forward?

Apparently so.

Just to confirm, work on the endpoint is stalled from our side until we can complete OCG. We will probably be getting back to this sometime next quarter.

However, @ovasileva, the PM in charge of the Page Previews project, must have the final say.

Sorry. This sentence was overzealous. Both Apps/Web PM's will need to sign off on this change.

Change 378368 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Add compare script for old and new extracts

https://gerrit.wikimedia.org/r/378368

@Jdlrobson @ovasileva I've compiled some files that help compare the extract implementations, old vs new. (Old = MW API TextExtract, New = summary endpoint on MCS master as of today).
I've done this for many languages (almost top 30-40 languages by number of users). If you want this run for another language let me know.

Instructions:

  1. Download and unzip the zip file with the results (~30 MB). This should result in an extracts folder. Go to that folder.
  2. Open the HTML files in your browser to see a simple overview of how the appearance of the HTML extract changes from the old to the new endpoint implementation.
  3. Use the two txt files with a good diff tool to see the differences in more detail.

Issues:

  • Sometimes images (figure, img), tables, and other extra elements (blockquote, dl) appear in new endpoint, making the "New" column wider than the "Old" column. This makes it harder to compare the text. I have (commented out) added a few CSS rules at the end of the css file to hide them but it doesn't alway catch all. Maybe it's better to add some jQuery magic to make the columns resizable? In any case I believe those elements should actually not appear in the extract_html anyways.

Notes:

  • For the HTML table I replace the image URLs from protocol-relative ("//") to "https://" so that they and the local CSS file still get resolved when viewing the HTML file locally (from a "file://" base URL)
  • RTL handling for ar, fa, he. I manually added dir="rtl" to the comparison table to make the RTL content appear correctly. The "Old" version is always in the middle.
  • If old and new HTML value are identical then the styling of the two cells are changed to have a funny, striped pattern. This is actually quite rare since the length is usually different. I just wanted to highlight those occurrences to avoid straining my eyes to find any diffs if there are none.
  • !! STATUS = 204 !! is expected and means the new endpoint returns an HTTP response of 204 ("No content").
  • undefined: This means that the extract_html property in the summary response is missing. This is often due to the article in question being a redirect or a stub. Examples: en.html#322, de.html#24, de.html#247, de.html#301. We should keep an eye on redirect handling since that is hard to test with a local MCS instance.
  • Another thing that could be improved in the new endpoint is to flatten some <span> elements which have no attributes besides id or about.

I've run script http://jdlrobson.com/summaries/

Will open bugs for the following shortly:

  • Charles Darwin and Charles Manson do not strip references for some reason (they should)
  • The Shakira edge case.
  • Netherlands and Azerbaijan shows small text
  • BDSM and List of highest-grossing films in China shows an image
  • Kylian Mbappé contains a table

@Jdlrobson Thanks for hosting this. I didn't know a good place to host the output files. The zip file I provided has the output for a bunch more languages to save you some time if you'd like to upload more. The cases I would like to see some discussion about is where images or tables appear in the new extract. Some examples:

  • bg.html#7 (Списък на страните по телефонен код) has an image and a table
  • bg.html#156 (Паметник на Бузлуджа), bg.html#702 (Санторини): just images (figure tag) with captions, no real text.
  • de.html#402 (Dave Gahan), de.html#461 (Siebenschläfer) has multiple images
  • de.html#927 (Boeing 747) has multiple embedded videos

bg.html#7 (Списък на страните по телефонен код) has an image and a table

bg.html#156 (Паметник на Бузлуджа), bg.html#702 (Санторини): just images (figure tag) with captions, no real text.
de.html#402 (Dave Gahan), de.html#461 (Siebenschläfer) has multiple images
de.html#927 (Boeing 747) has multiple embedded videos

@bearND filed as https://phabricator.wikimedia.org/T176522 - pretty sure that will fix all those problems. Seems our definition of "intro" needs some work.

@bearND There's some minor nitpicks on https://gerrit.wikimedia.org/r/#/c/378368/ - am happy to fix those up myself and merge if you don't have time? Looks like this is pretty much done though.

Change 378368 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Add compare script for old and new extracts

https://gerrit.wikimedia.org/r/378368