
Improve media endpoint performance
Closed, Resolved · Public

Description

The media endpoint (T177430) is deployed, but it is quite slow. This task is to improve its performance.

ab -c 10 -n 100 http://mobileapps.svc.codfw.wmnet:8888/en.wikipedia.org/v1/page/media/Barack_Obama

Concurrency Level:      10
Time taken for tests:   19.425 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      10906700 bytes
HTML transferred:       10818200 bytes
Requests per second:    5.15 [#/sec] (mean)
Time per request:       1942.543 [ms] (mean)
Time per request:       194.254 [ms] (mean, across all concurrent requests)
Transfer rate:          548.31 [Kbytes/sec] received
ab -c 10 -n 100 http://mobileapps.svc.codfw.wmnet:8888/en.wikipedia.org/v1/page/media/Russia

Concurrency Level:      10
Time taken for tests:   18.794 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      18715500 bytes
HTML transferred:       18627000 bytes
Requests per second:    5.32 [#/sec] (mean)
Time per request:       1879.434 [ms] (mean)
Time per request:       187.943 [ms] (mean, across all concurrent requests)
Transfer rate:          972.47 [Kbytes/sec] received

Event Timeline

@bearND I've tried a few different ideas:

  • simplifying the MW API query
  • simplifying the querySelectorAll selectors
  • using getElementsByTagName rather than querySelectorAll to find media items on the page
  • restructuring the processing to fetch imageinfo from the API with generator=images while processing the HTML document rather than waiting to get the images list from the HTML (note: this one has some other challenges)

...but none made a dent in the latency when I tested with ab. I'm out of ideas for the moment. Do you have any others?
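For reference, the getElementsByTagName variant looked roughly like this (a minimal sketch with domino; the tag names and variable names are illustrative, not the actual MCS code):

    const domino = require('domino');

    // Stand-in for the Parsoid page HTML string fetched from RESTBase.
    const parsoidHtml = '<html><body>...</body></html>';

    // Building the DOM is the expensive step either way.
    const doc = domino.createDocument(parsoidHtml);

    // Original approach: one CSS query for media-bearing elements.
    const viaSelector = doc.querySelectorAll('img, video, audio');

    // Tried alternative: per-tag lookups that bypass the CSS selector engine.
    const imgs = doc.getElementsByTagName('img');
    const videos = doc.getElementsByTagName('video');
    const audios = doc.getElementsByTagName('audio');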

If you look at the latency on https://grafana.wikimedia.org/dashboard/db/mobileapps (and if I'm interpreting things correctly), response times dropped dramatically once this endpoint was behind RESTBase. It's now somewhat faster than mobile-sections-lead. Given that, I wonder how urgently performance improvements are needed.

It did improve. I don't have any suggestions for further improvement at the moment. BTW, I've updated the dashboard to move from median (p50) to p75. The drop in the last 7 days doesn't look as big anymore but it's still visible.

Mholloway triaged this task as High priority.
Mholloway added a subscriber: LGoto.

@Fjalapeno @Jhernandez @bearND

It might be that we're just trying to do too much in this one endpoint. Here's the flow of a single request:

  1. Get the Parsoid HTML and MW siteinfo for the wiki in two promises joined with BBPromise.join() (I/O-bound but should be pretty fast—note that MCS caches siteinfo in memory);
  2. then parse the HTML and build a DOM with Domino to parse image filenames and some other data from the page (a fairly heavy/slow operation);
  3. then query the MW API again for extended Commons metadata for the titles parsed from the page and merge the response with the data parsed from the page (I/O bound, possibly slow on the MediaWiki side).

We're kind of locked into this flow by the product requirements we've defined, in particular presenting the results in the order they appear on the page and filtering based on size and usage on the page. I'm not seeing any easy wins in the code, and from node profiling it looks like the majority of the processing time is spent parsing the HTML and building a DOM, which we'd expect.
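In code, the flow is roughly this (a sketch only; getParsoidHtml, getSiteInfo, parseMediaItems, getExtMetadata, and mergeMetadata are placeholder names, not the actual MCS helpers):

    const BBPromise = require('bluebird');
    const domino = require('domino');

    function buildMediaResponse(req) {
        // 1. Fetch the Parsoid HTML and the (memory-cached) siteinfo in parallel.
        return BBPromise.join(getParsoidHtml(req), getSiteInfo(req), (html, siteinfo) => {
            // 2. Build a DOM and parse the media items from the page
            //    (the CPU-heavy step, per profiling).
            const doc = domino.createDocument(html);
            const items = parseMediaItems(doc, siteinfo);
            // 3. Only now can we ask the MW API for extended Commons metadata
            //    for the parsed titles, then merge it into the items.
            return getExtMetadata(req, items.map(item => item.title))
                .then(meta => mergeMetadata(items, meta));
        });
    }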

Something to think about would be splitting the data parsed from the page and the extended metadata from Commons into two separate endpoints. The latter would be a new endpoint accepting a single title at a time; something like

https://en.wikipedia.org/api/rest_v1/page/media-details/File:Purple-throated_carib_hummingbird_feeding.jpg

I think this would fit the page preview and image gallery use cases in the apps.

@Mholloway Can't we request the images on the page from the MediaWiki API in parallel at the beginning, and merge the info once we have the order from the DOM? That way all the network I/O happens in parallel, which should speed things up a bit.
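Roughly something like this (same kind of placeholder helpers as the sketch above; getImageInfo would be the generator=images query), so the imageinfo request no longer has to wait for the DOM parse:

    function buildMediaResponseParallel(req) {
        // Start both network requests up front; the DOM parse no longer
        // gates the imageinfo query.
        return BBPromise.join(getParsoidHtml(req), getImageInfo(req), (html, imageInfo) => {
            const doc = domino.createDocument(html);
            const items = parseMediaItems(doc);
            // The DOM still determines ordering and filtering; the metadata
            // is merged in afterwards.
            return mergeMetadata(items, imageInfo);
        });
    }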

@Jhernandez I seem to remember there being some catch around that approach, but it's worth another look. I'll give it a shot.

I had thought an API query with generator=images would only return images and not audio/video files, so we'd end up having to make a follow-up API call anyway, but it looks like it actually does return all files (so the images generator name is a bit misleading).
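For example, an imageinfo query of roughly this shape (the parameter values are illustrative, not exactly what we'd use) returns every file used on the page, audio and video included:

    // Illustrative MW API parameters; generator=images yields all File:
    // usages on the page, not just bitmap images.
    const params = {
        action: 'query',
        format: 'json',
        titles: 'Hummingbird',
        generator: 'images',
        gimlimit: 'max',
        prop: 'imageinfo',
        iiprop: 'url|mime|size'
    };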

Change 444322 had a related patch set uploaded (by Mholloway; owner: Mholloway):
[mediawiki/services/mobileapps@master] Media: Call RESTBase and MW API in parallel to improve performance

https://gerrit.wikimedia.org/r/444322

Change 444322 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Media: Call RESTBase and MW API in parallel to improve performance

https://gerrit.wikimedia.org/r/444322

Stashbot subscribed.

Mentioned in SAL (#wikimedia-operations) [2018-07-11T20:16:33Z] <bsitzmann@deploy1001> Started deploy [mobileapps/deploy@03fa731]: Update mobileapps to b5e152d (T195325 T189830 T177619 T196523)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T20:20:04Z] <bsitzmann@deploy1001> Finished deploy [mobileapps/deploy@03fa731]: Update mobileapps to b5e152d (T195325 T189830 T177619 T196523) (duration: 03m 30s)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T22:18:16Z] <bsitzmann@deploy1001> Started deploy [mobileapps/deploy@03fa731]: Update mobileapps to b5e152d (T195325 T189830 T177619 T196523)

Mentioned in SAL (#wikimedia-operations) [2018-07-11T22:25:00Z] <bsitzmann@deploy1001> Finished deploy [mobileapps/deploy@03fa731]: Update mobileapps to b5e152d (T195325 T189830 T177619 T196523) (duration: 06m 44s)

@Mholloway @bearND Are we happy with the performance? Should we keep this open until we can perf-test in prod to check that the continuation queries perform well enough?

I've checked https://grafana.wikimedia.org/dashboard/db/mobileapps?panelId=9&fullscreen&orgId=1&from=now-2d&to=now, looking at page_media_--title_--revision-_--tid, but I don't see any significant changes so far.

Yeah, sadly, the change doesn't seem to have improved things much. I don't think we should revert, since the code is cleaner now and the performance isn't worse.

To memorialize some conversation on the ticket, the update just deployed:

  • performs slightly worse for pages large enough to require query continuation to gather imageinfo for all images (e.g., Barack Obama);
  • performs better for substantial but not extremely large pages (e.g., Hummingbird);
  • appears to perform slightly worse for pages with few or no images (e.g., Foo (disambiguation)), perhaps because of the added complexity of the underlying MW API query.

Apparently all of this washes out to performing about the same as before.
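For context on the continuation point: imageinfo results for large pages come back in batches, and each batch is an extra round trip. Roughly (mwApiGet is a placeholder for the MW API request helper, assumed to resolve with a { body } response):

    function getAllImageInfo(req, params) {
        const results = [];
        const fetchBatch = (cont) =>
            mwApiGet(req, Object.assign({}, params, cont)).then((res) => {
                results.push(res.body.query);
                // The API keeps returning a 'continue' object until the result
                // set is complete; pages like Barack Obama need several of
                // these round trips.
                return res.body.continue ? fetchBatch(res.body.continue) : results;
            });
        return fetchBatch({});
    }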

And sadly this has the side effect of making it much harder (impossible?) to allow referencing media of older page revisions. See the discussion on https://github.com/wikimedia/restbase/pull/1021.

FTR I was poking at the dashboard again and looking at the past 30d I saw this:

Screen Shot 2018-07-13 at 13.39.49.png (325×954 px, 56 KB)

What happened there? The external latencies seem consistent, though.

Sad to see we couldn't improve this :( Is there anything else you think we could do?

@Jhernandez Nice find. That dashboard has a few interesting things going on:

https://grafana.wikimedia.org/dashboard/db/mobileapps?panelId=9&fullscreen&orgId=1&from=now-90d&to=now

First, when looking at all endpoints rather than isolating the view to the media endpoint, I notice that references and mobile-sections-lead had very similar latency drops at precisely the same time on June 25. Also, feed/onthisday had a similar latency drop a few weeks earlier, on June 6. Finally, both of these events occurred at oddly precise points in time: 12:00:00 UTC on June 25 for the media/references/mobile-sections-lead improvement, and 00:00:00 on June 6 for the earlier onthisday improvement. There were MCS deployments around both of those times, a few hours earlier for the June 6 drop and most of a day later for the June 25 drop, but they don't really match up in time and don't contain anything relevant to the endpoints seeing the improvement. Even the node module updates in those deployments don't seem to have anything very interesting going on, and I don't see anything relevant around those times in the Server Admin Log either.

So honestly, I'm at a loss as to how those improvements happened. It's great that they did, and I guess in practical terms that resolves this ticket, but I'd like to understand how it happened. Maybe @mobrovac or @Pchelolo have ideas that aren't occurring to me.

Keeping this open to see if we can find out what happened. I'll ping Services once irccloud works again for me.