
Pageviews Analysis: API requests to backend are failing or timing out since around 2022-02-10
Closed, Resolved · Public · 5 Estimated Story Points

Description

For some reason, the queries against the replicas are much slower than they used to be. Until very recently (around February 10), they were more or less instantaneous for most articles with the default date range. Consequently, some tools in the Pageviews suite, like Topviews, which query for edit data on 100+ articles, take a very long time to load or fail completely. There have been several reports of this from users on IRC, the discussion page, and elsewhere.

Initial investigation suggests this is unique to the Pageviews tool, as I don't see many errors popping up about query timeouts for XTools and other tools I maintain.

Event Timeline

MusikAnimal renamed this task from "Querying the replicas very slow starting around February 10, 2022" to "Querying the replicas is very slow starting around February 10, 2022". Feb 14 2022, 4:20 AM
MusikAnimal claimed this task.

As critical maintenance I'm filing this under Community-Tech. Investigation in progress.

MusikAnimal renamed this task from "Querying the replicas is very slow starting around February 10, 2022" to "Pageviews Analysis: Querying the replicas is very slow starting around February 10, 2022". Feb 14 2022, 4:23 AM

So, Pageviews Analysis is mostly written in JavaScript, and makes API requests to a PHP script that queries the replicas for editing data (a rough sketch of the request shape follows the log excerpt below). It seems that, presumably due to increased traffic, we're running out of local ports to serve requests:

2022-02-13 09:43:45: (gw_backend.c.240) establishing connection failed: Resource temporarily unavailable socket: unix:/var/run/lighttpd/php.socket.pageviews-0 
2022-02-13 09:43:45: (gw_backend.c.257) backend error; we'll disable for 1 secs and send the request to another backend instead: load: 263 
2022-02-13 09:43:45: (gw_backend.c.264) If this happened on Linux: You have run out of local ports. Check the manual, section Performance how to handle this. 
2022-02-13 09:43:45: (gw_backend.c.956) all handlers for /pageviews/api.php?pages=San+Diego&project=en.wikipedia.org&start=2022-01-23&end=2022-02-12&totals=true on .php are down.
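
For context, the endpoint being hit looks roughly like this. The parameter names come straight from the logged URL above; the handler body is only an illustrative placeholder (including the '|' separator assumption for multiple pages), not the real api.php:

```php
<?php
// Illustrative sketch only, not the real api.php: the rough shape of the
// endpoint hit in the log above. Parameter names (pages, project, start,
// end, totals) come from the logged URL; the response body is a placeholder,
// and '|' as a multi-page separator is an assumption.
header('Content-Type: application/json');

$pages   = explode('|', $_GET['pages'] ?? '');
$project = $_GET['project'] ?? 'en.wikipedia.org';
$start   = $_GET['start'] ?? date('Y-m-d', strtotime('-20 days'));
$end     = $_GET['end'] ?? date('Y-m-d');
$totals  = ($_GET['totals'] ?? 'false') === 'true';

// Each requested page normally means at least one query against the Wiki
// Replicas, which is why Topviews-style requests for 100+ articles are the
// heavy case.
$response = [];
foreach ($pages as $page) {
    $response[$page] = [
        'project' => $project,
        'start'   => $start,
        'end'     => $end,
        'totals'  => $totals,
    ];
}

echo json_encode($response);
```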

My gut reaction was to increase the number of Kubernetes pod replicas, so I've done that (it's now at 4, up from 2). I think that's helping? No new errors so far, but Topviews still seems slow-ish, even with its APCu caching. If ports are running out, then it's probably not even hitting that cache. So anyway, I don't think the queries themselves are slow, as I originally wrote in the task.
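
For anyone unfamiliar, the caching I'm referring to follows roughly this read-through pattern; the function names, key format, and TTL here are illustrative, not the exact api.php code:

```php
<?php
// Sketch of the read-through APCu cache pattern (key format and TTL are
// illustrative). A cache hit never touches the replicas, so a request that
// actually reaches PHP should return nearly instantly for cached pages.

// Hypothetical stand-in for the real replica query; returns dummy data here
// so the sketch is self-contained.
function run_replica_query(string $project, string $page, string $start, string $end): array
{
    return ['project' => $project, 'page' => $page, 'start' => $start, 'end' => $end];
}

function cached_edit_data(string $project, string $page, string $start, string $end): array
{
    $key = "pv:$project:$page:$start:$end";

    $hit = apcu_fetch($key, $success);
    if ($success) {
        return $hit; // served from APCu, no replica connection needed
    }

    $data = run_replica_query($project, $page, $start, $end); // the expensive part
    apcu_store($key, $data, 3600); // illustrative one-hour TTL

    return $data;
}
```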

MusikAnimal renamed this task from "Pageviews Analysis: Querying the replicas is very slow starting around February 10, 2022" to "Pageviews Analysis: API requests to backend are failing or timing out since around 2022-02-10". Feb 14 2022, 5:37 AM

As far as I can tell, the issue with unavailable sockets / not enough ports seems to have gone away, or at least is no longer appearing in error.log, but I'm still seeing a lot of failed requests in the UI. I do see we're exceeding our connection quota for the replicas at times:

2022-02-15 03:37:06: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  mysqli::__construct(): (HY000/1226): User 's52877' has exceeded the 'max_user_connections' resource (current value: 10) in /data/project/pageviews/public_html/pageviews/api.php on line 56

but this is not as frequent as the requests I see timing out in the network log, so I don't think that's it. (It's not unusual to occasionally hit the max_user_connections error).
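
For reference, the connection pattern involved is roughly the following. The replica host, credentials path, and the 503 handling are a sketch of how the 1226 quota error could be surfaced, not necessarily what api.php actually does:

```php
<?php
// Sketch: connect to the Wiki Replicas with the tool's credentials and surface
// the per-user connection quota error (MySQL error 1226, as seen in the warning
// above for s52877). Host, credentials path, and error handling are illustrative.
$cnf = parse_ini_file(getenv('HOME') . '/replica.my.cnf'); // standard Toolforge credentials file

mysqli_report(MYSQLI_REPORT_OFF); // check connect_errno manually instead of relying on warnings/exceptions

$db = @new mysqli(
    'enwiki.analytics.db.svc.wikimedia.cloud', // replica host, varies by wiki
    $cnf['user'],
    $cnf['password'],
    'enwiki_p'
);

if ($db->connect_errno === 1226) {
    // 'max_user_connections' exceeded: too many concurrent replica connections.
    http_response_code(503);
    header('Retry-After: 5');
    exit(json_encode(['error' => 'Replica connection quota exceeded; please retry shortly.']));
} elseif ($db->connect_errno) {
    http_response_code(500);
    exit(json_encode(['error' => 'Could not connect to the replicas.']));
}
```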

I still think there's some kind of issue with connecting to the backend on Toolforge in general. On my local machine, I can reload Topviews over and over with instant results from the APCu cache. On Toolforge, sometimes 9 out of 10 requests to the backend get an instant response (from cache) but the 10th hangs. Then I reload the page, and it loads instantly. This intermittence tells me the 10th request must never have hit the backend at all, because we had a cache hit just before and immediately after the request that timed out. I will reach out to Cloud Services for advice.

Many thanks to @Majavah and jeremy_b on IRC for helping debug this. An excerpt from our convo:

taavi> this chart shows the response timings as measured by the proxy that sits directly before the webservice pods:
12:47:38 <taavi> https://tools-prometheus.wmflabs.org/tools/graph?g0.range_input=1w&g0.expr=sum(nginx_ingress_controller_request_duration_seconds_sum%7Bnamespace%3D%22tool-pageviews%22%7D)%20by%20(status)%20%2F%20sum(nginx_ingress_controller_request_duration_seconds_count%7Bnamespace%3D%22tool-pageviews%22%7D)%20by%20(status)&g0.tab=0
12:48:13 <taavi> 499 is a nginx-specific code for "Used when the client has closed the request before the server could send a response.", so browser timeout I think
12:48:14 <musikanimal> ah! that is very telling
12:48:56 <taavi> so I'm wondering if you're suddenly getting hit by some misbehaving crawler
12:49:00 <musikanimal> 499 sounds exactly what I'm experiencing, and I see this started on the 10th
…
12:49:48 <taavi> I'm grepping the front proxy logs atm
12:54:34 <musikanimal> I just turned on the access.log. In my experience with VPS tools that have the XFF header enabled, most web crawlers have human-like UAs, so it may be difficult to infer which traffic is organic and which is not
12:55:06 <taavi> nothing obvious that I can see :( even the most active individual IPs are only at couple thousand requests per day, those are obviously crawlers too per the IP addresses but don't explain the sudden jump in the graphs
12:55:09 <musikanimal> also most web crawlers I don't think load JavaScript, right? that was my theory as to why Pageviews was immune to the whole web crawler problem, up until now
12:56:33 <taavi> one thing you could try to reduce load on the kubernetes pods would be to move images, localization files, etc static content from tools-static, https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web#Static_file_server

I did turn on the access.log briefly, but as expected it doesn't tell me much without actual IPs. I don't think it's bots, though... unless they recently all decided to execute JavaScript, including XMLHttpRequests. Majavah reports the most active unique IPs are only hitting us a few thousand times a day. So it's either a very misbehaving, IP-hopping crawler or bot, or it's real traffic. In either case, for such a lightweight tool with caching, I'd like to serve the traffic, since it should be cheap to do so. The problem is that requests time out somewhere in transmission, or the browser cancels them.

Here's a more telling chart: https://tools-prometheus.wmflabs.org/tools/graph?g0.range_input=4w&g0.end_input=2022-02-16%2011%3A52&g0.expr=sum(nginx_ingress_controller_request_duration_seconds_sum%7Bnamespace%3D%22tool-pageviews%22%7D)%20by%20(status)%20%2F%20sum(nginx_ingress_controller_request_duration_seconds_count%7Bnamespace%3D%22tool-pageviews%22%7D)%20by%20(status)&g0.tab=0

As far back as the data goes we've had no problems, except for a few days of 499s starting Feb 3, then the big and ever-growing jump since Feb 10; you can see it go down after I increased the Kubernetes pod replica count to 4. I'd like to increase that more, but it appears 4 is the maximum. Maybe we can make a special request for the pageviews account to have more?

Solved by moving to VPS! Our new home is at https://pageviews.wmcloud.org

We're immediately seeing phenomenal performance, even better than before 2022-02-10. Topviews, which normally takes at least 5-10 seconds to load for uncached pages, now loads in just a few seconds at most.

It's been a good 30 minutes with traffic switched over, and I don't see any issues with the new installation, so I'm resolving this. Thanks again to everyone who helped debug the Toolforge issue.