
Improve media endpoint performance
Closed, Resolved · Public

Description

The media endpoint (T177430) is deployed, but it is quite slow. This task is to improve its performance.

ab -c 10 -n 100 http://mobileapps.svc.codfw.wmnet:8888/en.wikipedia.org/v1/page/media/Barack_Obama

Concurrency Level:      10
Time taken for tests:   19.425 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      10906700 bytes
HTML transferred:       10818200 bytes
Requests per second:    5.15 [#/sec] (mean)
Time per request:       1942.543 [ms] (mean)
Time per request:       194.254 [ms] (mean, across all concurrent requests)
Transfer rate:          548.31 [Kbytes/sec] received
ab -c 10 -n 100 http://mobileapps.svc.codfw.wmnet:8888/en.wikipedia.org/v1/page/media/Russia

Concurrency Level:      10
Time taken for tests:   18.794 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      18715500 bytes
HTML transferred:       18627000 bytes
Requests per second:    5.32 [#/sec] (mean)
Time per request:       1879.434 [ms] (mean)
Time per request:       187.943 [ms] (mean, across all concurrent requests)
Transfer rate:          972.47 [Kbytes/sec] received

Event Timeline

@bearND I've tried a few different ideas:

  • simplifying the MW API query
  • simplifying the querySelectorAll selectors
  • using getElementsByTagName rather than querySelectorAll to find media items on the page
  • restructuring the processing to fetch imageinfo from the API with generator=images while processing the HTML document rather than waiting to get the images list from the HTML (note: this one has some other challenges)

...but none made a dent in the latency when I tested with ab. I'm out of ideas for the moment. Do you have any others?
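For reference, the getElementsByTagName variant looked roughly like this (a minimal sketch with domino; the tag names and variable names are illustrative, not the actual MCS code):

    const domino = require('domino');

    // Stand-in for the Parsoid page HTML string fetched from RESTBase.
    const parsoidHtml = '<html><body>...</body></html>';

    // Building the DOM is the expensive step either way.
    const doc = domino.createDocument(parsoidHtml);

    // Original approach: one CSS query for media-bearing elements.
    const viaSelector = doc.querySelectorAll('img, video, audio');

    // Tried alternative: per-tag lookups that bypass the CSS selector engine.
    const imgs = doc.getElementsByTagName('img');
    const videos = doc.getElementsByTagName('video');
    const audios = doc.getElementsByTagName('audio');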

If you look at the latency on https://grafana.wikimedia.org/dashboard/db/mobileapps (and if I'm interpreting things correctly), response times dropped dramatically once this endpoint was behind RESTBase. It's now somewhat faster than mobile-sections-lead. Given that, I wonder how urgently performance improvements are needed.

It did improve. I don't have any suggestions for further improvement at the moment. BTW, I've updated the dashboard to move from median (p50) to p75. The drop in the last 7 days doesn't look as big anymore but it's still visible.

Mholloway triaged this task as High priority.
Mholloway added a subscriber: LGoto.

@Fjalapeno @Jhernandez @bearND

It might be that we're just trying to do too much in this one endpoint. Here's the flow of a single request:

  1. Get the Parsoid HTML and MW siteinfo for the wiki in two promises joined with BBPromise.join() (I/O-bound but should be pretty fast—note that MCS caches siteinfo in memory);
  2. then parse the HTML and build a DOM with Domino to parse image filenames and some other data from the page (a fairly heavy/slow operation);
  3. then query the MW API again for extended Commons metadata for the titles parsed from the page and merge the response with the data parsed from the page (I/O bound, possibly slow on the MediaWiki side).

We're kind of locked into this flow by the product requirements we've defined, in particular presenting the results in the order they appear on the page and filtering based on size and usage on the page. I'm not seeing any easy wins in the code, and from node profiling it looks like the majority of the processing time is spent parsing the HTML and building a DOM, which we'd expect.
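In code, the flow is roughly this (a sketch only; getParsoidHtml, getSiteInfo, parseMediaItems, getExtMetadata, and mergeMetadata are placeholder names, not the actual MCS helpers):

    const BBPromise = require('bluebird');
    const domino = require('domino');

    function buildMediaResponse(req) {
        // 1. Fetch the Parsoid HTML and the (memory-cached) siteinfo in parallel.
        return BBPromise.join(getParsoidHtml(req), getSiteInfo(req), (html, siteinfo) => {
            // 2. Build a DOM and parse the media items from the page
            //    (the CPU-heavy step, per profiling).
            const doc = domino.createDocument(html);
            const items = parseMediaItems(doc, siteinfo);
            // 3. Only now can we ask the MW API for extended Commons metadata
            //    for the parsed titles, then merge it into the items.
            return getExtMetadata(req, items.map(item => item.title))
                .then(meta => mergeMetadata(items, meta));
        });
    }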

Something to think about would be splitting the data parsed from the page and the extended metadata from Commons into two separate endpoints. The latter would be a new endpoint accepting a single title at a time; something like

https://en.wikipedia.org/api/rest_v1/page/media-details/File:Purple-throated_carib_hummingbird_feeding.jpg

I think this would fit the page preview and image gallery use cases in the apps.

@Mholloway Can't we request the images on the page from the MediaWiki API in parallel at the beginning, and merge the info once we have the order from the DOM? That way all the network I/O happens in parallel, which should speed things up a bit.
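Roughly something like this (same kind of placeholder helpers as the sketch above; getImageInfo would be the generator=images query), so the imageinfo request no longer has to wait for the DOM parse:

    function buildMediaResponseParallel(req) {
        // Start both network requests up front; the DOM parse no longer
        // gates the imageinfo query.
        return BBPromise.join(getParsoidHtml(req), getImageInfo(req), (html, imageInfo) => {
            const doc = domino.createDocument(html);
            const items = parseMediaItems(doc);
            // The DOM still determines ordering and filtering; the metadata
            // is merged in afterwards.
            return mergeMetadata(items, imageInfo);
        });
    }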

@Jhernandez I seem to remember there being some catch around that approach, but it's worth another look. I'll give it a shot.

I had thought an API query with generator=images would only return images and not audio/video files, so we'd end up having to make a follow-up API call anyway, but it looks like it actually does return all files (so the images generator name is a bit misleading).
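For example, an imageinfo query of roughly this shape (the parameter values are illustrative, not exactly what we'd use) returns every file used on the page, audio and video included:

    // Illustrative MW API parameters; generator=images yields all File:
    // usages on the page, not just bitmap images.
    const params = {
        action: 'query',
        format: 'json',
        titles: 'Hummingbird',
        generator: 'images',
        gimlimit: 'max',
        prop: 'imageinfo',
        iiprop: 'url|mime|size'
    };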

Change 444322 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] Media: Call RESTBase and MW API in parallel to improve performance

https://gerrit.wikimedia.org/r/444322

Change 444322 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Media: Call RESTBase and MW API in parallel to improve performance

https://gerrit.wikimedia.org/r/444322

Stashbot subscribed.

Mentioned in SAL (#wikimedia-operations) [2018-07-11T20:16:33Z] <bsitzmann@deploy1001> Started deploy [mobileapps/deploy@03fa731]: Update mobileapps to b5e152d (T195325 T189830 T177619 T196523)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T20:20:04Z] <bsitzmann@deploy1001> Finished deploy [mobileapps/deploy@03fa731]: Update mobileapps to b5e152d (T195325 T189830 T177619 T196523) (duration: 03m 30s)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T22:18:16Z] <bsitzmann@deploy1001> Started deploy [mobileapps/deploy@03fa731]: Update mobileapps to b5e152d (T195325 T189830 T177619 T196523)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T22:25:00Z] <bsitzmann@deploy1001> Finished deploy [mobileapps/deploy@03fa731]: Update mobileapps to b5e152d (T195325 T189830 T177619 T196523) (duration: 06m 44s)

@Mholloway @bearND Are we happy with the performance? Should we keep this open until we can perf-test in prod to check that the continuation queries perform well enough?

I've checked https://grafana.wikimedia.org/dashboard/db/mobileapps?panelId=9&fullscreen&orgId=1&from=now-2d&to=now, looking at page_media_--title_--revision-_--tid, but I don't see any significant changes so far.

Yeah, sadly, the change doesn't seem to have improved things much. I don't think we should revert, since the code is cleaner now and the performance isn't worse.

To memorialize some conversation on the ticket, the update just deployed:

  • performs slightly worse for pages large enough to require query continuation to gather imageinfo for all images (e.g., Barack Obama);
  • performs better for substantial but not extremely large pages (e.g., Hummingbird);
  • appears to perform slightly worse for pages with few or no images (e.g., Foo (disambiguation)), perhaps because of the added complexity of the underlying MW API query.

Apparently all of this washes out to performing about the same as before.
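For context on the continuation point: imageinfo results for large pages come back in batches, and each batch is an extra round trip. Roughly (mwApiGet is a placeholder for the MW API request helper, assumed to resolve with a { body } response):

    function getAllImageInfo(req, params) {
        const results = [];
        const fetchBatch = (cont) =>
            mwApiGet(req, Object.assign({}, params, cont)).then((res) => {
                results.push(res.body.query);
                // The API keeps returning a 'continue' object until the result
                // set is complete; pages like Barack Obama need several of
                // these round trips.
                return res.body.continue ? fetchBatch(res.body.continue) : results;
            });
        return fetchBatch({});
    }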

And sadly this has the side effect of making it much harder (impossible?) to allow referencing media of older page revisions. See the discussion on https://github.com/wikimedia/restbase/pull/1021.

FTR I was poking at the dashboard again and looking at the past 30d I saw this:

Screen Shot 2018-07-13 at 13.39.49.png (325×954 px, 56 KB)

What happened there? The external latencies seem consistent, though.

Sad to see we couldn't improve this :( Is there anything else you think we could do?

@Jhernandez Nice find. That dashboard has a few interesting things going on:

https://grafana.wikimedia.org/dashboard/db/mobileapps?panelId=9&fullscreen&orgId=1&from=now-90d&to=now

First, when looking at all endpoints rather than isolating the view to the media endpoint, I notice that references and mobile-sections-lead had very similar latency drops at precisely the same time on June 25. Also, feed/onthisday had a similar latency drop a few weeks earlier, on June 6. Finally, both of these events occurred at oddly precise points in time: 12:00:00 UTC on June 25 for the media/references/mobile-sections-lead improvement, and 00:00:00 on June 6 for the earlier onthisday improvement. There were MCS deployments around both of those times, a few hours earlier for the June 6 drop and most of a day later for the June 25 drop, but they don't really match up in time and don't contain anything relevant to the endpoints seeing the improvement. Even the node module updates in those deployments don't seem to have anything very interesting going on, and I don't see anything relevant around those times in the Server Admin Log either.

So honestly, I'm at a loss as to how those improvements happened. It's great that they did, and I guess in practical terms that resolves this ticket, but I'd like to understand how it happened. Maybe @mobrovac or @Pchelolo have ideas that aren't occurring to me.

Keeping this open to see if we can find out what happened. I'll ping Services once irccloud works again for me.