
XTools is down or very slow due to bot scraping
Open, Medium, Public

Description

It looks like XTools is down and when attempting to access it I'm just directed to an error page from Wikimedia's webserver proxy.

image.png (1,364×715 px, 69 KB)

Event Timeline

Devnull renamed this task from XTools is down with a proxy error to XTools is down.
fnegri claimed this task.
fnegri added subscribers: Andrew, fnegri.

The VM xtools-prod08 had two separate issues:

  1. a soft reboot failed (probably because of T383583: VM nova records attached to incorrect cloudcephmon IPs), which @Andrew fixed with virsh destroy followed by openstack server migrate.
  2. the machine came back online, but the disk was full. I fixed that by removing old logs from /var/log/apache2 with rm *1?.gz. That freed up 493M but it will require more clean-ups to prevent the disk from filling again.
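For anyone repeating this kind of cleanup, here is a minimal sketch of a dry-run check that reports disk usage and lists rotated, compressed logs older than a given age, largest first. It only reads metadata and deletes nothing. The function names and the age threshold are hypothetical and are not part of XTools or of the commands actually run above.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional, Tuple


def disk_used_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def old_rotated_logs(
    log_dir: str, min_age_days: float, now: Optional[float] = None
) -> List[Tuple[Path, int]]:
    """List (path, size_in_bytes) for *.gz files in log_dir whose
    modification time is more than min_age_days ago, largest first.

    `now` is a Unix timestamp; it defaults to the current time and is
    only a parameter so the function is easy to test.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    found = []
    for p in Path(log_dir).glob("*.gz"):
        st = p.stat()
        if st.st_mtime < cutoff:
            found.append((p, st.st_size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Calling `old_rotated_logs("/var/log/apache2", min_age_days=7)` would list candidates for removal; reviewing that list before deleting anything is safer than a shell glob.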

Seems to be working on my end. @Nemoralis did you see the same error or something different? I did notice we seem to be getting pounded by bots which may produce the "service overloaded" error. I am traveling right now but will arrive at my destination in a few hours, and I'll look into this more.

Thanks so much to @fnegri and @Andrew for looking into this while I was in the air without internet! Very much appreciated! This is the longest outage XTools has ever experienced, and it would have been much longer had we not had your help :)

Seems to be working on my end. @Nemoralis did you see the same error or something different?

Yes, it was the same error.

Repeat of T384711: XTools is down or very slow due to bot scraping except this time I'm around to investigate!

Looks like we've had some abusive bots pounding the service lately. 3.7 million requests from the same bot alone in the last 24 hours! Fortunately the automated throttling was doing its job, and few if any XTools requests from these bots were actually fulfilled. However, we were still logging the requests, and due to the sheer volume that led to the disk becoming full.
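A rough sketch of how request counts per bot could be pulled from an access log, assuming the log uses Apache's "combined" format (the actual XTools log format is not stated here). The regular expression and function name are illustrative only, and lines with escaped quotes inside the request or user agent field may not match.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Apache "combined" format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def top_user_agents(lines: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n most frequent user agents as (agent, request_count).

    Lines that do not match the assumed format are skipped.
    """
    counts: Counter = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[m.group("agent")] += 1
    return counts.most_common(n)
```

Passing an open log file to `top_user_agents` streams it line by line, so memory use depends on the number of distinct user agents rather than on the size of the log.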

I have blocked the bots and reduced the log rotation to go back only 7 days instead of 14 days. Things should be OK now.

Happened again today. I blocked one more bot, and deleted a bunch more logs. We're now down to 72% disk usage.

The bots that are blocked won't log anymore, so unless we get new bots going this crazy with DDoS-ish requests, this situation where the disk is full from log files should be resolved now.

I also turned off the log_forensic Apache module, which logged the same requests (with mostly the same data) twice in addition to the normal Apache logs. After deleting those logs, we're down to 50% disk usage. I think someone else enabled that module at some point for debugging purposes. I've never noticed it until now.

Alien333 reopened this task as Open. Apr 21 2025, 10:07 AM
Alien333 subscribed.

I don't think this is the fault of last week's patches, because these only affected blame and editcounter, but the homepage also errors.

The only thing indicating trouble on Grafana is this (we're getting about twice as many 3XX errors as usual). Apparently started with a spike at about 5:00 UTC.

image.png (1,910×941 px, 167 KB)

@MusikAnimal: Possibly bots eating up space again? Spike of failed requests maybe means spike of requests.

There's some IP hopping scraper going wild. I managed to block it by user agent and some other heuristics, but there's a chance we're blocking genuine traffic, too.

The "XTools/ArticleInfo" script works again, thank you! ⭐

I don't think this is the fault of last week's patches, because these only affected blame and editcounter, but the homepage also errors.

Those patches haven't been deployed yet, anyway.

The only thing indicating trouble on Grafana is this (we're getting about twice as many 3XX errors as usual). Apparently started with a spike at about 5:00 UTC.

That's the Toolforge tool, which only redirects to VPS. For the VPS project, we'd want to look at xtools-prod08 (app server) and xtools-prod09 (API server) at https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=xtools. I can see RAM and network usage spiked on the app server during the incident.

Anyway, if XTools ever goes down, it's almost always the same problem: abusive bots. The best way to mitigate this is a login wall. We already require login for requests deemed "expensive", but thousands of rapid requests for smaller bits of data not subject to login will have the same effect of hogging our database quota (and/or Apache connections). I think the long-term solution here is to just require login for everyone, but T224382 needs to be fixed first.

I also wonder if we can get API requests routed directly to xtools-prod09, as opposed to having them go through the app server's Apache config first. If we did that, during outages like this where Apache stops taking incoming requests, the Pageinfo (aka "ArticleInfo") script that so many use would continue to function normally. Currently we're forwarding traffic using a ProxyPass directive, meaning the PHP application on prod08 is never touched for API requests (good), but prod08's Apache still has to make room for all the API server's traffic. I'm not aware of a solution for this apart from adding a new dedicated web proxy, i.e. xtools-api.wmcloud.org.

Noting I did remove the UA block, so hopefully no more collateral damage. I've got some potential short-term solutions for the bad bots that I'm going to explore.

taavi removed fnegri as the assignee of this task. Apr 21 2025, 8:24 PM
taavi subscribed.

(Resetting assignee since this task seems to have been repurposed for another issue. In general, please open new tasks instead of re-opening unrelated ones!)

Yes, it should have been a new task, but the outage this task originally documented was also due to bad bots (though that outage did result in different symptoms).

I've had to re-add the UA block temporarily while I work on a better solution.

MusikAnimal renamed this task from XTools is down to XTools is down or very slow due to bot scraping. Apr 22 2025, 4:01 AM
MusikAnimal lowered the priority of this task from High to Medium. Apr 22 2025, 8:31 PM

So, just as with the previous incident (T384711#10492861), the issue is actually more about the flood of traffic itself (up to 300+ req/sec!). XTools was actually doing its job of detecting web crawlers and rejecting the requests, for the few requests that actually made it to the application layer.
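As an illustration of how a peak rate like this could be measured, here is a small sketch that finds the busiest second in an access log. It assumes each line carries a bracketed timestamp with one-second resolution, as in Apache's common and combined formats; the function name is hypothetical and this is not necessarily how the figure above was obtained.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# First bracketed field on the line, e.g. [21/Apr/2025:05:00:01 +0000]
TIMESTAMP_RE = re.compile(r"\[([^\]]+)\]")


def peak_requests_per_second(lines: Iterable[str]) -> Optional[Tuple[str, int]]:
    """Return (timestamp, request_count) for the busiest second in the log,
    or None if no line contains a bracketed timestamp."""
    per_second: Counter = Counter()
    for line in lines:
        m = TIMESTAMP_RE.search(line)
        if m:
            # Timestamps have one-second resolution, so the raw string
            # can serve directly as the bucket key.
            per_second[m.group(1)] += 1
    if not per_second:
        return None
    return per_second.most_common(1)[0]
```

Because this counts only requests that Apache accepted and logged, it would understate the true load whenever Apache is saturated and refusing connections.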

I think what I have in place now will mostly solve our issue for this specific bot attack. I do not have a general long-term solution yet. Noting again that our problem seems to be DDoS-style flooding of Apache so that it can't accept any new requests, so no amount of bot detection is going to work unless it can be run before Apache handles the request.

Noting that it's still possible we are blocking some genuine traffic, but the collateral damage should be considerably less than it was yesterday.

In addition, many users will now have to log in when they didn't before, even as a first-time user.

Now it seems to be very laggy and in the page returned, if it does not emit directly an error, the graphs are not loaded.

Now it seems to be very laggy and in the page returned, if it does not emit directly an error, the graphs are not loaded.

Vital signs look okay and things seem to be running normally for me.

Could you file a separate task and give more details, including links?

Now it seems to be very laggy and in the page returned, if it does not emit directly an error, the graphs are not loaded.

Now it seems to work better. Maybe in the last quarter of an hour there was a spike in requests that overloaded the infrastructure behind the tool.

Now it seems to be very laggy and in the page returned, if it does not emit directly an error, the graphs are not loaded.

Vital signs look okay and things seem to be running normally for me.

Could you file a separate task and give more details, including links?

I have no more detail beyond the fact that the tool was laggy and error prone.

Details about the errors that would help, if you remember them:

  • which parts didn't work (which pages, which chart, which section, &c)?
  • how precisely did the charts not work? Did they just show a grey background? Did the legend appear?

Without more precise information we can't really try to fix the bug.

I was trying to view my contributions (here) and Emanuele676's (here). The pages took a long time to load, so I reloaded both pages several times with Firefox's reload button. Sometimes an error like the one at the top appeared; sometimes the page finished loading half-styled and without graphs. I seem to remember that the legend was present, though I don't think that detail is useful. Soon after I finished writing the message above, everything seemed to go back to normal.

I thought that this happened because graphs are heavier and animated, so maybe they were configured to be added later asynchronously.

The title of this report is "XTools is down or very slow due to bot scraping" and it seems reasonable to me to think that the tool was either very slow or was refusing incoming connections because it was flooded by bot scraping activity.