We attempted to fix this with a change to the config file: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570892
However, this fix had to be reverted shortly after it was applied to production: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570901
We suspect the revert was necessary because the change itself caused problems, and not just because of unfortunate coincidence. It caused only a small number of logged errors:
https://logstash.wikimedia.org/goto/c9affc21521d0b19a2ff129c918496d9
However, about 40 Icinga alerts fired for app servers; see this copy of the log from #wikimedia-operations:
# | #wikimedia-operations log |
---|---|
1 | (CR) Hoo man: [C: +2] Wikibase Client: Fix setting name typo [mediawiki-config] - https://gerrit.wikimedia.org/r/570892 (https://phabricator.wikimedia.org/T244529) (owner: Hoo man) |
2 | 2:36 PM (Merged) jenkins-bot: Wikibase Client: Fix setting name typo [mediawiki-config] - https://gerrit.wikimedia.org/r/570892 (https://phabricator.wikimedia.org/T244529) (owner: Hoo man) |
3 | 2:38 PM <•logmsgbot> !log hoo@deploy1001 Synchronized wmf-config/Wikibase.php: Wikibase Client: Fix setting name typo (T244529) (duration: 01m 20s) |
4 | 2:38 PM <•wikibugs> (CR) Jhedden: [C: +2] icinga: update sms contact for jhedden [puppet] - https://gerrit.wikimedia.org/r/570896 (owner: Jhedden) |
5 | 2:38 PM <•stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log |
6 | 2:38 PM T244529: mw.wikibase.getLabelByLang not return item label for some items - https://phabricator.wikimedia.org/T244529 |
7 | 2:40 PM <•wikibugs> (PS1) Elukey: Add presto_clusters_secrets in common.yaml [labs/private] - https://gerrit.wikimedia.org/r/570900 |
8 | 2:40 PM <•logmsgbot> !log hoo@deploy1001 Scap failed!: 9/11 canaries failed their endpoint checks(http://en.wikipedia.org) |
9 | 2:40 PM <•stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log |
10 | 2:40 PM <•icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops |
11 | 2:40 PM <•wikibugs> (CR) Elukey: [V: +2 C: +2] Add presto_clusters_secrets in common.yaml [labs/private] - https://gerrit.wikimedia.org/r/570900 (owner: Elukey) |
12 | 2:40 PM <•icinga-wm> PROBLEM - PHP7 rendering on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
13 | 2:40 PM PROBLEM - PHP7 rendering on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
14 | 2:40 PM PROBLEM - Apache HTTP on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
15 | 2:40 PM PROBLEM - PHP7 rendering on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
16 | 2:40 PM PROBLEM - PHP7 rendering on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
17 | 2:40 PM PROBLEM - Nginx local proxy to apache on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
18 | 2:40 PM PROBLEM - PHP7 rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
19 | 2:40 PM PROBLEM - Nginx local proxy to apache on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
20 | 2:40 PM PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
21 | 2:41 PM PROBLEM - Apache HTTP on mw1250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
22 | 2:41 PM PROBLEM - Nginx local proxy to apache on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
23 | 2:41 PM PROBLEM - Nginx local proxy to apache on mw1256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
24 | 2:41 PM PROBLEM - Apache HTTP on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
25 | 2:41 PM PROBLEM - Apache HTTP on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
26 | 2:41 PM PROBLEM - Apache HTTP on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
27 | 2:41 PM PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
28 | 2:41 PM PROBLEM - PHP7 rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
29 | 2:41 PM <hoo> How do I force the deploy |
30 | 2:41 PM <•icinga-wm> PROBLEM - Apache HTTP on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
31 | 2:41 PM PROBLEM - Apache HTTP on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
32 | 2:41 PM PROBLEM - Nginx local proxy to apache on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
33 | 2:41 PM <hoo> it's a revert |
34 | 2:41 PM <•icinga-wm> PROBLEM - PHP7 rendering on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
35 | 2:41 PM <hauskatze> somebody unplugged the wrong cable |
36 | 2:41 PM <•icinga-wm> PROBLEM - Apache HTTP on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
37 | 2:41 PM PROBLEM - Nginx local proxy to apache on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
38 | 2:41 PM PROBLEM - Apache HTTP on mw1254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
39 | 2:41 PM PROBLEM - Apache HTTP on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
40 | 2:41 PM PROBLEM - PHP7 rendering on mw1255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
41 | 2:41 PM PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
42 | 2:41 PM <hoo> not sure why but my last change broke it |
43 | 2:41 PM <•icinga-wm> PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
44 | 2:41 PM PROBLEM - Nginx local proxy to apache on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
45 | 2:41 PM PROBLEM - Nginx local proxy to apache on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
46 | 2:41 PM PROBLEM - Apache HTTP on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
47 | 2:41 PM PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
48 | 2:41 PM PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
49 | 2:41 PM PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers |
50 | 2:41 PM ⇐ •icinga-wm quit (~icinga-wm@wikimedia/bot/icinga-wm) Excess Flood |
51 | 2:41 PM <hoo> I guess |
52 | 2:41 PM <Cohaf> sorry if wrong channel, I can't visit all wikis now |
53 | 2:41 PM <elukey> Cohaf: yep it is, we are working on it :) |
54 | 2:42 PM <hauskatze> Cohaf: that's what happens when the Apache server crashes :) |
55 | 2:42 PM <hoo> Got it, reverting with --force now |
56 | 2:42 PM <Cohaf> thanks, I had 502 all round |
57 | 2:42 PM <_joe_> hoo: damnit yes |
58 | 2:42 PM <elukey> hoo: there was an occurrence of the same problem before, it is probably not your change |
59 | 2:42 PM <Cohaf> I recalled then I was able to access |
60 | 2:42 PM from Singapore |
61 | 2:42 PM <elukey> but let's revert in any case |
62 | 2:43 PM <•wikibugs> (CR) Jcrespo: [C: +1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/570792 (https://phabricator.wikimedia.org/T240094) (owner: Marostegui) |
63 | 2:43 PM <Praxidicae> #rip |
64 | 2:43 PM <Amir1> !log ladsgroup@mwmaint1002:~$ mwscript createAndPromote.php --wiki=zhwiki --force "Amir Sarabadani (WMDE)" --sysop (T244578) |
65 | 2:43 PM <godog> hoo: how's the revert ? |
66 | 2:43 PM <•stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log |
67 | 2:43 PM T244578: Tracking task: 2020-02-07 MW API server outage(s) - https://phabricator.wikimedia.org/T244578 |
68 | 2:43 PM <•wikibugs> (PS1) Hoo man: Revert "Wikibase Client: Fix setting name typo" [mediawiki-config] - https://gerrit.wikimedia.org/r/570901 |
69 | 2:43 PM <•logmsgbot> !log hoo@deploy1001 Synchronized wmf-config/Wikibase.php: REVERT: Wikibase Client: Fix setting name typo (T244529) (duration: 01m 40s) |
70 | 2:43 PM <•stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log |
71 | 2:43 PM T244529: mw.wikibase.getLabelByLang not return item label for some items - https://phabricator.wikimedia.org/T244529 |
72 | 2:43 PM <hoo> godog: Done |
73 | 2:44 PM <•wikibugs> (CR) Hoo man: [C: +2] "For consistency" [mediawiki-config] - https://gerrit.wikimedia.org/r/570901 (owner: Hoo man) |
74 | 2:44 PM <_joe_> we're back |
75 | 2:44 PM <godog> hoo: thank you |
76 | 2:44 PM <Cohaf> thanks |
77 | 2:44 PM <hoo> Seems that typo actually hid a very nasty bug :S |
78 | 2:44 PM <•wikibugs> (Merged) jenkins-bot: Revert "Wikibase Client: Fix setting name typo" [mediawiki-config] - https://gerrit.wikimedia.org/r/570901 (owner: Hoo man) |
79 | 2:45 PM <_joe_> also it's friday :) |
80 | 2:45 PM → AmandaNP joined (uid1203@wikipedia/DeltaQuad) |
81 | 2:45 PM <hoo> Yes, that calls for bad luck :S |
82 | 2:45 PM → icinga-wm joined (~icinga-wm@wikimedia/bot/icinga-wm) |
83 | 2:45 PM <icinga-wm> RECOVERY - Nginx local proxy to apache on mw1266 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.714 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
84 | 2:45 PM RECOVERY - phpfpm_up reduced availability on icinga1001 is OK: (C)0.8 le (W)0.9 le 0.9534 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets |
85 | 2:45 PM •icinga-wm was voiced (+v) by •ChanServ |
86 | 2:45 PM <•icinga-wm> RECOVERY - Apache HTTP on mw1266 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.679 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
87 | 2:45 PM RECOVERY - Apache HTTP on mw1272 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
88 | 2:45 PM RECOVERY - PHP7 rendering on mw1319 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
89 | 2:45 PM RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase |
90 | 2:45 PM RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase |
91 | 2:45 PM RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase |
92 | 2:45 PM RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase |
93 | 2:45 PM RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase |
94 | 2:45 PM RECOVERY - Apache HTTP on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.315 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
95 | 2:45 PM RECOVERY - Nginx local proxy to apache on mw1271 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.412 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
96 | 2:45 PM RECOVERY - Nginx local proxy to apache on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.132 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
97 | 2:45 PM RECOVERY - Apache HTTP on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.232 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
98 | 2:45 PM RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST |
99 | 2:45 PM RECOVERY - PHP7 rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
100 | 2:45 PM RECOVERY - PHP7 rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
101 | 2:46 PM RECOVERY - PHP7 rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
102 | 2:46 PM RECOVERY - Nginx local proxy to apache on mw1275 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.613 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
103 | 2:46 PM RECOVERY - Nginx local proxy to apache on mw1328 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
104 | 2:46 PM RECOVERY - PHP7 rendering on mw1272 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
105 | 2:46 PM RECOVERY - Nginx local proxy to apache on mw1267 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.572 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
106 | 2:46 PM PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204,205} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-met |
107 | 2:46 PM RECOVERY - PHP7 rendering on mw1327 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
108 | 2:46 PM RECOVERY - PHP7 rendering on mw1332 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
109 | 2:46 PM RECOVERY - Nginx local proxy to apache on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
110 | 2:46 PM RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 71.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 |
111 | 2:46 PM RECOVERY - Apache HTTP on mw1326 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
112 | 2:46 PM RECOVERY - Apache HTTP on mw1256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
113 | 2:46 PM RECOVERY - Nginx local proxy to apache on mw1258 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
114 | 2:46 PM RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
115 | 2:46 PM RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.716 second response time https://wikitech.wikimedia.org/wiki/Application_servers |
116 | 2:46 PM RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal |
117 | 2:46 PM RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal |
118 | 2:46 PM RECOVERY - PHP7 rendering on mw1330 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering |
119 | 2:46 PM ⇐ hauskatze and kevinbazira quit |
120 | 2:47 PM <•icinga-wm> RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.6208 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash |
121 | 2:48 PM <•wikibugs> (PS3) Muehlenhoff: Switch logstash hosts to standard Partman recipe [puppet] - https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955) |
122 | 2:48 PM <•icinga-wm> RECOVERY - ATS TLS has reduced HTTP availability #page on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1 |
123 | 2:48 PM RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST |
124 | 2:48 PM RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET |
125 | 2:48 PM RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal |
126 | 2:48 PM RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase |
127 | 2:49 PM PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX |
128 | 2:49 PM RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops |
129 | 2:49 PM RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds |
130 | 2:49 PM RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX |
131 | 2:49 PM RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase |
132 | 2:49 PM RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase |
133 | 2:49 PM RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton |
134 | 2:49 PM RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton |
135 | 2:49 PM RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton |
136 | 2:49 PM RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton |
137 | 2:50 PM RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET |
138 | 2:50 PM RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase |
139 | 2:50 PM RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase |
140 | 2:51 PM PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX |
141 | 2:52 PM RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX |
And, according to reports, it took the wikis down.
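The mechanism is worth spelling out. Below is a minimal sketch, not the actual wmf-config code: the setting name `tmpItemTermsMigrationStage` and the post-fix values come from the table further down, but the misspelled key and the default value are illustrative assumptions. A settings key that is misspelled in wmf-config is simply never read by Wikibase, so the default from WikibaseLib.default.php silently applies; fixing the typo therefore flips every affected wiki from the default to the wmf-config value in a single sync.

```php
<?php
// Illustrative sketch only; the misspelled key and concrete values are assumptions,
// not the actual wmf-config contents.

// Stub values so the sketch runs standalone; the real MIGRATION_* constants
// come from MediaWiki core.
if ( !defined( 'MIGRATION_OLD' ) ) {
    define( 'MIGRATION_OLD', 0 );
    define( 'MIGRATION_WRITE_BOTH', 1 );
    define( 'MIGRATION_WRITE_NEW', 2 );
}

// Default shipped with the extension (WikibaseLib.default.php) -- value assumed:
$wikibaseDefaults = [
    'tmpItemTermsMigrationStage' => [ 'max' => MIGRATION_WRITE_NEW ],
];

// Intended override in wmf-config, but written under a misspelled key
// (the misspelling shown here is hypothetical; the point is only that it
// differs from the name Wikibase actually reads):
$wgWBClientSettings = [];
$wgWBClientSettings['tmpItemTermsMigrationStages'] = [
    100000   => MIGRATION_WRITE_NEW,
    74000000 => MIGRATION_WRITE_BOTH,
    'max'    => MIGRATION_OLD,
];

// Wikibase looks the setting up under its correct name, so the typo'd entry is
// ignored and the extension default silently wins:
$effective = $wgWBClientSettings['tmpItemTermsMigrationStage']
    ?? $wikibaseDefaults['tmpItemTermsMigrationStage'];

var_dump( $effective ); // the default -- until the typo is fixed, at which point
                        // the wmf-config array above takes effect everywhere at once
```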
Below is a table of what we presume the config was actually set to at different times as a result of this typo.
The values shown are those of $wgWBClientSettings['tmpItemTermsMigrationStage'].
Wiki | Value | 2020-02-03 (all on wmf.16) | 2020-02-04 22:35 (group0 on .18; group1, group2 on .16) | 2020-02-05 20:44 (group0, group1 on .18; group2 on .16) | 2020-02-06 20:25 (all on wmf.18) | 2020-02-06 20:44 (group0, group1 on .18; group2 on .16) | 2020-02-07 14:38 (group0, group1 on .18; group2 on .16) + typo fix |
---|---|---|---|---|---|---|---|
wikidatawiki | Source of value | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | wmf-config wikidatawiki |
wikidatawiki | Actual Value | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 100000 => MIGRATION_WRITE_NEW, 74000000 => MIGRATION_WRITE_BOTH, 'max' => MIGRATION_OLD, ] |
testwikidatawiki | Source of value | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | wmf-config testwikidatawiki |
testwikidatawiki | Actual Value | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] |
random group 0 wiki | Source of value | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | wmf-config default |
random group 0 wiki | Actual Value | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 8000000 => MIGRATION_WRITE_NEW, 'max' => MIGRATION_OLD, ] |
random group 1 wiki | Source of value | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | wmf-config default |
random group 1 wiki | Actual Value | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 8000000 => MIGRATION_WRITE_NEW, 'max' => MIGRATION_OLD, ] |
random group 2 wiki | Source of value | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | wmf-config default |
random group 2 wiki | Actual Value | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 8000000 => MIGRATION_WRITE_NEW, 'max' => MIGRATION_OLD, ] |
Key for the table above:
* reading from potentially unmigrated rows
* heavy load on the new term store; suspected to have partially brought down the wikis
* very heavy load on the new term store; suspected to have brought down the wikis
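For reading the staged values in the table: we interpret the arrays as mapping numeric item-ID thresholds to migration stages, with each numeric key acting as an upper bound and 'max' covering every remaining item. The sketch below encodes that reading; it is our interpretation for this report, not a copy of the Wikibase implementation, and the helper function name is made up.

```php
<?php
// Sketch of how we read the staged arrays above: numeric keys are treated as
// upper item-ID bounds (assumed to be in ascending order) and 'max' covers
// everything else. This is our reading of the config, not Wikibase's code.

// Stub values so the sketch runs standalone; the real constants come from MediaWiki core.
if ( !defined( 'MIGRATION_OLD' ) ) {
    define( 'MIGRATION_OLD', 0 );
    define( 'MIGRATION_WRITE_BOTH', 1 );
    define( 'MIGRATION_WRITE_NEW', 2 );
}

/**
 * Return the migration stage that applies to a given numeric item ID (hypothetical helper).
 */
function stageForItemId( array $stages, int $numericItemId ): int {
    foreach ( $stages as $maxId => $stage ) {
        if ( $maxId !== 'max' && $numericItemId <= $maxId ) {
            return $stage;
        }
    }
    return $stages['max'];
}

// The wikidatawiki value that took effect once the typo was fixed (from the table):
$stages = [
    100000   => MIGRATION_WRITE_NEW,
    74000000 => MIGRATION_WRITE_BOTH,
    'max'    => MIGRATION_OLD,
];

var_dump( stageForItemId( $stages, 50000 ) );    // MIGRATION_WRITE_NEW (items up to Q100000)
var_dump( stageForItemId( $stages, 1000000 ) );  // MIGRATION_WRITE_BOTH (items up to Q74000000)
var_dump( stageForItemId( $stages, 80000000 ) ); // MIGRATION_OLD (everything above)
```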