
High load on deployment-mediawiki14 and slow responses
Closed, Resolved · Public · BUG REPORT

Description

[17:24]  < MatmaRex> is beta cluster having a bad time, or is it just me?
[17:24]  < MatmaRex> pages are taking forever to load
[17:24]  <    bd808> beta.wmflabs.org seems slow/stuck from my laptop
[17:26]  <    bd808> there are some tall spikes on the aggregated load graph for the Cloud VPS project in total
[17:27]  <    bd808> I finally got a timeout from deployment-cache-text08 trying to get data from the backing MediaWiki.
[17:30]  <    bd808> load on deployment-mediawiki14 is ~6. https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&var-project=deployment-prep&var-instance=All&from=now-2d&to=now&viewPanel=902
[17:30]  <    bd808> !log `shutdown -r now` on deployment-mediawiki14. Load has been growing for ~2 days.
[17:30]  < stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[17:31]  <    bd808> MatmaRex: maybe better for a bit. ¯\_(ツ)_/¯
[17:32]  <    bd808> it's popping right back up though. I need to find hashar's notes on looking at the inbound traffic.
[17:32]  < MatmaRex> huh, interesting
[17:34]  <    bd808> yeah, it's right back up to 6 again with 9 parallel php processes at the top of the %CPU
[17:35]  < MatmaRex> so, scraping, surely?
[17:36]  <    bd808> zuul is very quiet, so yeah I would guess some bots being aggressive

Event Timeline

{T389181} was a prior round of overload (sorry, private task because of lots of IPv4 addresses in it)

The top 10 clients per `grep -oP '"X-Client-IP": "\d+\.\d+\.\d+\.\d+' /var/log/apache2/other_vhosts_access-json.log|sort|uniq -c|sort -nr|head -n10` on deployment-mediawiki14 are coming from large IPv4 allocations registered to Microsoft. I assume these are Azure addresses. I am going to block the CIDRs connected to these clients at our varnish layer via Horizon managed hiera.
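
For anyone who would rather do the same count outside of a shell pipeline, here is a rough Python sketch of it. It reuses the regular expression from the grep call and assumes nothing about the log format beyond that pattern. The function name `top_clients` and the sample lines (which use documentation-only addresses) are made up for illustration.

```python
import re
from collections import Counter

# Same pattern as the grep -oP call: the X-Client-IP key followed by an IPv4 address.
CLIENT_IP_RE = re.compile(r'"X-Client-IP": "(\d+\.\d+\.\d+\.\d+)')


def top_clients(lines, n=10):
    """Return the n most frequent client IPs as (ip, count) pairs."""
    counts = Counter()
    for line in lines:
        # grep -o prints every match on a line, so count every match here too.
        for ip in CLIENT_IP_RE.findall(line):
            counts[ip] += 1
    return counts.most_common(n)


# Hypothetical log lines; the real access log has many more fields per entry.
sample = [
    '{"X-Client-IP": "192.0.2.1", "uri": "/wiki/A"}',
    '{"X-Client-IP": "192.0.2.2", "uri": "/wiki/B"}',
    '{"X-Client-IP": "192.0.2.1", "uri": "/wiki/C"}',
    '{"X-Client-IP": "198.51.100.7", "uri": "/wiki/D"}',
    '{"X-Client-IP": "192.0.2.1", "uri": "/wiki/E"}',
    '{"X-Client-IP": "192.0.2.2", "uri": "/wiki/F"}',
]
for ip, count in top_clients(sample, n=2):
    print(f"{count:7d} {ip}")
```

On the host itself the `lines` argument could be the open log file object, since iterating a file yields one line at a time.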

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/f2d68f95b1405955014af32d4eb6b7c83d201083%5E%21/

diff --git a/deployment-prep/_.yaml b/deployment-prep/_.yaml
index f96f58f..acc36ea 100644
--- a/deployment-prep/_.yaml
+++ b/deployment-prep/_.yaml

@@ -10,6 +10,19 @@
     - 47.80.0.0/13
     - 47.74.0.0/15
     - 47.76.0.0/14
+    - 52.152.0.0/13
+    - 52.160.0.0/11
+    - 52.145.0.0/16
+    - 52.148.0.0/14
+    - 52.146.0.0/15
+    - 40.112.0.0/13
+    - 40.76.0.0/14
+    - 40.120.0.0/14
+    - 40.125.0.0/17
+    - 40.124.0.0/16
+    - 40.74.0.0/15
+    - 40.96.0.0/12
+    - 40.80.0.0/12
 acmechief_host: deployment-acme-chief05.deployment-prep.eqiad1.wikimedia.cloud
 apt::use_experimental: true
 aptly::group: wikidev
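
A quick way to sanity-check a list like this is to test whether a given client address falls inside any of the blocked networks. The sketch below uses Python's standard `ipaddress` module with a handful of the CIDRs added in the diff above. The helper name `matching_block` and the probe addresses are illustrative only, and this says nothing about how varnish itself evaluates the list.

```python
import ipaddress

# A few of the networks added in the diff above.
BLOCKED_NETS = [
    ipaddress.ip_network(cidr)
    for cidr in ("52.152.0.0/13", "52.160.0.0/11", "40.112.0.0/13", "40.74.0.0/15")
]


def matching_block(ip, networks=BLOCKED_NETS):
    """Return the first network that contains ip, or None if none does."""
    addr = ipaddress.ip_address(ip)
    for net in networks:
        if addr in net:
            return net
    return None


# Illustrative probes: one address inside 52.152.0.0/13, one outside every listed range.
print(matching_block("52.153.1.1"))
print(matching_block("192.0.2.1"))
```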

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/290e8a5c6939740c79abbe71546be60178ac8f1c%5E%21/#F0

diff --git a/deployment-prep/_.yaml b/deployment-prep/_.yaml
index acc36ea..212ff33 100644
--- a/deployment-prep/_.yaml
+++ b/deployment-prep/_.yaml

@@ -23,6 +23,10 @@
     - 40.74.0.0/15
     - 40.96.0.0/12
     - 40.80.0.0/12
+    - 13.64.0.0/11
+    - 13.104.0.0/14
+    - 13.96.0.0/13
+    - 23.96.0.0/13
 acmechief_host: deployment-acme-chief05.deployment-prep.eqiad1.wikimedia.cloud
 apt::use_experimental: true
 aptly::group: wikidev

Mentioned in SAL (#wikimedia-releng) [2025-04-15T18:06:15Z] <bd808> sudo puppet agent -tv on deployment-cache-text08 to update varnish deny list (T392003)

Mentioned in SAL (#wikimedia-releng) [2025-04-15T18:11:26Z] <bd808> bd808@deployment-cache-text08:~$ sudo service varnish-frontend restart (T392003)

https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/44875f6c7725d413a0071da0f18cec6f1bbf749d%5E%21/#F0

diff --git a/deployment-prep/_.yaml b/deployment-prep/_.yaml
index 212ff33..8a6f878 100644
--- a/deployment-prep/_.yaml
+++ b/deployment-prep/_.yaml

@@ -27,6 +27,17 @@
     - 13.104.0.0/14
     - 13.96.0.0/13
     - 23.96.0.0/13
+    - 45.89.148.0/23
+    - 45.93.184.0/23
+    - 91.124.117.0/24
+    - 140.228.23.0/24
+    - 146.104.0.0/14
+    - 146.110.0.0/16
+    - 146.100.0.0/14
+    - 146.108.0.0/15
+    - 96.62.0.0/16
+    - 154.16.246.0/24
+    - 102.129.130.0/24
 acmechief_host: deployment-acme-chief05.deployment-prep.eqiad1.wikimedia.cloud
 apt::use_experimental: true
 aptly::group: wikidev

The intertubes are full of bots and they all want that juicy betawiki content I guess. :/

Mentioned in SAL (#wikimedia-releng) [2025-04-15T19:40:10Z] <bd808> Forced puppet run and restarted varnish on deployment-cache-text08 to pick up new blocks (T392003)

The last hiera change I made for this is https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/8697c499f4c8e91428329ea98b4e519de03e3507%5E%21/#F0. I was just sorting the blocked_nets list with `:sort n` in vim.

diff --git a/deployment-prep/_.yaml b/deployment-prep/_.yaml
index 8a6f878..8c432a1 100644
--- a/deployment-prep/_.yaml
+++ b/deployment-prep/_.yaml

@@ -2,42 +2,42 @@
   blocked_nets:
     networks:
     - 8.208.0.0/12
-    - 47.240.0.0/14
-    - 47.244.0.0/15
-    - 47.236.0.0/14
-    - 47.246.0.0/16
-    - 47.235.0.0/16
-    - 47.80.0.0/13
-    - 47.74.0.0/15
-    - 47.76.0.0/14
-    - 52.152.0.0/13
-    - 52.160.0.0/11
-    - 52.145.0.0/16
-    - 52.148.0.0/14
-    - 52.146.0.0/15
-    - 40.112.0.0/13
-    - 40.76.0.0/14
-    - 40.120.0.0/14
-    - 40.125.0.0/17
-    - 40.124.0.0/16
-    - 40.74.0.0/15
-    - 40.96.0.0/12
-    - 40.80.0.0/12
-    - 13.64.0.0/11
     - 13.104.0.0/14
+    - 13.64.0.0/11
     - 13.96.0.0/13
     - 23.96.0.0/13
+    - 40.112.0.0/13
+    - 40.120.0.0/14
+    - 40.124.0.0/16
+    - 40.125.0.0/17
+    - 40.74.0.0/15
+    - 40.76.0.0/14
+    - 40.80.0.0/12
+    - 40.96.0.0/12
     - 45.89.148.0/23
     - 45.93.184.0/23
+    - 47.235.0.0/16
+    - 47.236.0.0/14
+    - 47.240.0.0/14
+    - 47.244.0.0/15
+    - 47.246.0.0/16
+    - 47.74.0.0/15
+    - 47.76.0.0/14
+    - 47.80.0.0/13
+    - 52.145.0.0/16
+    - 52.146.0.0/15
+    - 52.148.0.0/14
+    - 52.152.0.0/13
+    - 52.160.0.0/11
     - 91.124.117.0/24
-    - 140.228.23.0/24
-    - 146.104.0.0/14
-    - 146.110.0.0/16
-    - 146.100.0.0/14
-    - 146.108.0.0/15
     - 96.62.0.0/16
-    - 154.16.246.0/24
     - 102.129.130.0/24
+    - 140.228.23.0/24
+    - 146.100.0.0/14
+    - 146.104.0.0/14
+    - 146.108.0.0/15
+    - 146.110.0.0/16
+    - 154.16.246.0/24
 acmechief_host: deployment-acme-chief05.deployment-prep.eqiad1.wikimedia.cloud
 apt::use_experimental: true
 aptly::group: wikidev
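
The order in the diff above is numeric on the first octet only, so 13.104.0.0/14 still sits before 13.64.0.0/11. As far as I know, vim's numeric sort keys on the first decimal number in each line. If a full numeric ordering were ever wanted, a sketch with Python's standard `ipaddress` module could look like this. The helper name `sort_networks` is made up, and the input is a subset of the list above.

```python
import ipaddress


def sort_networks(cidrs):
    """Sort CIDR strings by network address, then by prefix length."""
    # IPv4Network objects compare on network address first, then netmask.
    return [str(net) for net in sorted(ipaddress.ip_network(c) for c in cidrs)]


subset = [
    "13.104.0.0/14",
    "13.64.0.0/11",
    "13.96.0.0/13",
    "8.208.0.0/12",
    "102.129.130.0/24",
    "96.62.0.0/16",
]
for cidr in sort_networks(subset):
    print(cidr)
```

This only changes readability of the list; the set of blocked networks is the same either way.
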
bd808 claimed this task.

dancy renamed this task from HIgh load on deployment-mediawiki14 and slow responses to High load on deployment-mediawiki14 and slow responses. (May 5 2025, 5:32 PM)