
Typo in wmf-config for tmpItemTermsMigrationStages name
Closed, Resolved · Public

Description

We attempted to fix this with a change to the config file: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570892
But this fix had to be reverted after applying it to production: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/570901

We suspect the revert was needed because the fix itself caused problems, rather than this being an unfortunate coincidence. The fix caused only a small number of logged errors:
https://logstash.wikimedia.org/goto/c9affc21521d0b19a2ff129c918496d9

However, it triggered about 40 Icinga alerts for app servers; see this copy of the log from #wikimedia-operations:

(CR) Hoo man: [C: +2] Wikibase Client: Fix setting name typo [mediawiki-config] - https://gerrit.wikimedia.org/r/570892 (https://phabricator.wikimedia.org/T244529) (owner: Hoo man)
2:36 PM (Merged) jenkins-bot: Wikibase Client: Fix setting name typo [mediawiki-config] - https://gerrit.wikimedia.org/r/570892 (https://phabricator.wikimedia.org/T244529) (owner: Hoo man)
2:38 PM <•logmsgbot> !log hoo@deploy1001 Synchronized wmf-config/Wikibase.php: Wikibase Client: Fix setting name typo (T244529) (duration: 01m 20s)
2:38 PM <•wikibugs> Wikibugs v2.1, https://tools.wmflabs.org/wikibugs/ (CR) Jhedden: [C: +2] icinga: update sms contact for jhedden [puppet] - https://gerrit.wikimedia.org/r/570896 (owner: Jhedden)
2:38 PM <•stashbot> https://tools.wmflabs.org/stashbot/ Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2:38 PM T244529: mw.wikibase.getLabelByLang not return item label for some items - https://phabricator.wikimedia.org/T244529
2:40 PM <•wikibugs> Wikibugs v2.1, https://tools.wmflabs.org/wikibugs/ (PS1) Elukey: Add presto_clusters_secrets in common.yaml [labs/private] - https://gerrit.wikimedia.org/r/570900
2:40 PM <•logmsgbot> !log hoo@deploy1001 Scap failed!: 9/11 canaries failed their endpoint checks(http://en.wikipedia.org)
2:40 PM <•stashbot> https://tools.wmflabs.org/stashbot/ Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2:40 PM <•icinga-wm> IRC echo bot PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
2:40 PM <•wikibugs> Wikibugs v2.1, https://tools.wmflabs.org/wikibugs/ (CR) Elukey: [V: +2 C: +2] Add presto_clusters_secrets in common.yaml [labs/private] - https://gerrit.wikimedia.org/r/570900 (owner: Elukey)
2:40 PM <•icinga-wm> IRC echo bot PROBLEM - PHP7 rendering on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:40 PM PROBLEM - PHP7 rendering on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:40 PM PROBLEM - Apache HTTP on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:40 PM PROBLEM - PHP7 rendering on mw1271 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:40 PM PROBLEM - PHP7 rendering on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:40 PM PROBLEM - Nginx local proxy to apache on mw1322 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:40 PM PROBLEM - PHP7 rendering on mw1270 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:40 PM PROBLEM - Nginx local proxy to apache on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:40 PM PROBLEM - Nginx local proxy to apache on mw1316 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Apache HTTP on mw1250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Nginx local proxy to apache on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Nginx local proxy to apache on mw1256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Apache HTTP on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Apache HTTP on mw1330 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Apache HTTP on mw1320 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - PHP7 rendering on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:41 PM <hoo> How do I force the deploy
2:41 PM <•icinga-wm> IRC echo bot PROBLEM - Apache HTTP on mw1324 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Apache HTTP on mw1319 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Nginx local proxy to apache on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM <hoo> it's a revert
2:41 PM <•icinga-wm> IRC echo bot PROBLEM - PHP7 rendering on mw1323 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:41 PM <hauskatze> devnull somebody unplugged the wrong cable
2:41 PM <•icinga-wm> IRC echo bot PROBLEM - Apache HTTP on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Nginx local proxy to apache on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Apache HTTP on mw1254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Apache HTTP on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - PHP7 rendering on mw1255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:41 PM PROBLEM - Apache HTTP on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM <hoo> not sure why but my last change broke it
2:41 PM <•icinga-wm> IRC echo bot PROBLEM - Apache HTTP on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Nginx local proxy to apache on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Nginx local proxy to apache on mw1234 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Apache HTTP on mw1344 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Apache HTTP on mw1289 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Apache HTTP on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM PROBLEM - Apache HTTP on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
2:41 PM ⇐ •icinga-wm quit (~icinga-wm@wikimedia/bot/icinga-wm) Excess Flood
2:41 PM <hoo> I guess
2:41 PM <Cohaf> sorry if wrong channel, I can't visit all wikis now
2:41 PM <elukey> Cohaf: yep it is, we are working on it :)
2:42 PM <hauskatze> devnull Cohaf: that's what happens when the Apache server crashes :)
2:42 PM <hoo> Got it, reverting with --force now
2:42 PM <Cohaf> thanks, I had 502 all round
2:42 PM <_joe_> Giuseppe Lavagetto hoo: damnit yes
2:42 PM <elukey> hoo: there was an occurrence of the same problem before, it is probably not your change
2:42 PM <Cohaf> I recalled then I was able to access
2:42 PM from Singapore
2:42 PM <elukey> but let's revert in any case
2:43 PM <•wikibugs> Wikibugs v2.1, https://tools.wmflabs.org/wikibugs/ (CR) Jcrespo: [C: +1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/570792 (https://phabricator.wikimedia.org/T240094) (owner: Marostegui)
2:43 PM <Praxidicae> Adrestia #rip
2:43 PM <Amir1> Amir Sarabadani !log ladsgroup@mwmaint1002:~$ mwscript createAndPromote.php --wiki=zhwiki --force "Amir Sarabadani (WMDE)" --sysop (T244578)
2:43 PM <godog> Filippo Giunchedi 0x99D49B6B00CAD1E5 hoo: how's the revert ?
2:43 PM <•stashbot> https://tools.wmflabs.org/stashbot/ Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2:43 PM T244578: Tracking task: 2020-02-07 MW API server outage(s) - https://phabricator.wikimedia.org/T244578
2:43 PM <•wikibugs> Wikibugs v2.1, https://tools.wmflabs.org/wikibugs/ (PS1) Hoo man: Revert "Wikibase Client: Fix setting name typo" [mediawiki-config] - https://gerrit.wikimedia.org/r/570901
2:43 PM <•logmsgbot> !log hoo@deploy1001 Synchronized wmf-config/Wikibase.php: REVERT: Wikibase Client: Fix setting name typo (T244529) (duration: 01m 40s)
2:43 PM <•stashbot> https://tools.wmflabs.org/stashbot/ Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2:43 PM T244529: mw.wikibase.getLabelByLang not return item label for some items - https://phabricator.wikimedia.org/T244529
2:43 PM <hoo> godog: Done
2:44 PM <•wikibugs> Wikibugs v2.1, https://tools.wmflabs.org/wikibugs/ (CR) Hoo man: [C: +2] "For consistency" [mediawiki-config] - https://gerrit.wikimedia.org/r/570901 (owner: Hoo man)
2:44 PM <_joe_> Giuseppe Lavagetto we're back
2:44 PM <godog> Filippo Giunchedi 0x99D49B6B00CAD1E5 hoo: thank you
2:44 PM <Cohaf> thanks
2:44 PM <hoo> Seems that typo actually hid a very nasty bug :S
2:44 PM <•wikibugs> Wikibugs v2.1, https://tools.wmflabs.org/wikibugs/ (Merged) jenkins-bot: Revert "Wikibase Client: Fix setting name typo" [mediawiki-config] - https://gerrit.wikimedia.org/r/570901 (owner: Hoo man)
2:45 PM <_joe_> Giuseppe Lavagetto also it's friday :)
2:45 PM → AmandaNP joined (uid1203@wikipedia/DeltaQuad)
2:45 PM <hoo> Yes, that calls for bad luck :S
2:45 PM → icinga-wm joined (~icinga-wm@wikimedia/bot/icinga-wm)
2:45 PM <icinga-wm> IRC echo bot RECOVERY - Nginx local proxy to apache on mw1266 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.714 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:45 PM RECOVERY - phpfpm_up reduced availability on icinga1001 is OK: (C)0.8 le (W)0.9 le 0.9534 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
2:45 PM •icinga-wm was voiced (+v) by •ChanServ
2:45 PM <•icinga-wm> IRC echo bot RECOVERY - Apache HTTP on mw1266 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.679 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:45 PM RECOVERY - Apache HTTP on mw1272 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.054 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:45 PM RECOVERY - PHP7 rendering on mw1319 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:45 PM RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2:45 PM RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2:45 PM RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2:45 PM RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2:45 PM RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2:45 PM RECOVERY - Apache HTTP on mw1274 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.315 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:45 PM RECOVERY - Nginx local proxy to apache on mw1271 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.412 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:45 PM RECOVERY - Nginx local proxy to apache on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 629 bytes in 1.132 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:45 PM RECOVERY - Apache HTTP on mw1270 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 1.232 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:45 PM RECOVERY - High average POST latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=POST
2:45 PM RECOVERY - PHP7 rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:45 PM RECOVERY - PHP7 rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:46 PM RECOVERY - PHP7 rendering on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:46 PM RECOVERY - Nginx local proxy to apache on mw1275 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.613 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:46 PM RECOVERY - Nginx local proxy to apache on mw1328 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.175 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:46 PM RECOVERY - PHP7 rendering on mw1272 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.182 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:46 PM RECOVERY - Nginx local proxy to apache on mw1267 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.572 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:46 PM PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204,205} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-met
2:46 PM RECOVERY - PHP7 rendering on mw1327 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:46 PM RECOVERY - PHP7 rendering on mw1332 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:46 PM RECOVERY - Nginx local proxy to apache on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:46 PM RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 71.45 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
2:46 PM RECOVERY - Apache HTTP on mw1326 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:46 PM RECOVERY - Apache HTTP on mw1256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.032 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:46 PM RECOVERY - Nginx local proxy to apache on mw1258 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 627 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:46 PM RECOVERY - PHP7 rendering on mw1269 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:46 PM RECOVERY - Nginx local proxy to apache on mw1263 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 628 bytes in 0.716 second response time https://wikitech.wikimedia.org/wiki/Application_servers
2:46 PM RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
2:46 PM RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
2:46 PM RECOVERY - PHP7 rendering on mw1330 is OK: HTTP OK: HTTP/1.1 200 OK - 79818 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
2:46 PM ⇐ hauskatze and kevinbazira quit
2:47 PM <•icinga-wm> IRC echo bot RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.6208 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
2:48 PM <•wikibugs> Wikibugs v2.1, https://tools.wmflabs.org/wikibugs/ (PS3) Muehlenhoff: Switch logstash hosts to standard Partman recipe [puppet] - https://gerrit.wikimedia.org/r/570600 (https://phabricator.wikimedia.org/T156955)
2:48 PM <•icinga-wm> IRC echo bot RECOVERY - ATS TLS has reduced HTTP availability #page on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=13&fullscreen&refresh=1m&orgId=1
2:48 PM RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
2:48 PM RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
2:48 PM RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
2:48 PM RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
2:49 PM PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
2:49 PM RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
2:49 PM RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
2:49 PM RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
2:49 PM RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2:49 PM RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2:49 PM RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
2:49 PM RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
2:49 PM RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
2:49 PM RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
2:50 PM RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
2:50 PM RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
2:50 PM RECOVERY - Restbase edge eqsin on text-lb.eqsin.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
2:51 PM PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
2:52 PM RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX

According to reports, the change also took the wikis down.

The table below shows what we presume the effective value of $wgWBClientSettings['tmpItemTermsMigrationStage'] was at different times as a result of this typo:

| Wiki | Date | 3rd Feb (all on wmf.16) | 2020-02-04 22:35 (group0 on .18; group1, group2 on .16) | 2020-02-05 20:44:00 (group0, group1 on .18; group2 on .16) | 2020-02-06 20:25 (all on wmf.18) | 2020-02-06 20:44:00 (group0, group1 on .18; group2 on .16) | 2020-02-07 14:38:00 (group0, group1 on .18; group2 on .16) + typo fix |
| --- | --- | --- | --- | --- | --- | --- | --- |
| wikidatawiki | Source of value | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | wmf-config wikidatawiki |
| | Actual value | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 100000 => MIGRATION_WRITE_NEW, 74000000 => MIGRATION_WRITE_BOTH, 'max' => MIGRATION_OLD ] |
| testwikidatawiki | Source of value | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | wmf-config testwikidatawiki |
| | Actual value | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] |
| random group 0 wiki | Source of value | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | wmf-config default |
| | Actual value | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 8000000 => MIGRATION_WRITE_NEW, 'max' => MIGRATION_OLD ] |
| random group 1 wiki | Source of value | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | wmf-config default |
| | Actual value | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 8000000 => MIGRATION_WRITE_NEW, 'max' => MIGRATION_OLD ] |
| random group 2 wiki | Source of value | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | WikibaseLib.default.php | wmf-config default |
| | Actual value | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 'max' => MIGRATION_WRITE_NEW ] | [ 'max' => MIGRATION_WRITE_BOTH ] | [ 8000000 => MIGRATION_WRITE_NEW, 'max' => MIGRATION_OLD ] |

= reading from potentially unmigrated rows
= heavy load on new termstore; suspected to have partially brought down wikis
= very heavy load on new termstore suspected to have brought down the wikis
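
A minimal sketch of why the "Source of value" only switches from the extension default to wmf-config once the setting name is spelled correctly, and of how a staged value such as the wikidatawiki one would apply to individual items. It is written in Python rather than PHP for brevity and is not the Wikibase implementation. The misspelled key, the helper names, and the string stand-ins for the MIGRATION_* constants are all hypothetical, and the lookup rule (each integer key is an inclusive upper bound on the numeric item ID, with 'max' covering everything above) is an assumption.

```python
# Hypothetical stand-ins for MediaWiki's MIGRATION_* constants.
MIGRATION_OLD = "MIGRATION_OLD"
MIGRATION_WRITE_BOTH = "MIGRATION_WRITE_BOTH"
MIGRATION_WRITE_NEW = "MIGRATION_WRITE_NEW"

SETTING = "tmpItemTermsMigrationStages"
# Hypothetical misspelling, used only for illustration.
MISSPELLED = "tmpItemTermsMigratoinStages"

# Stand-in for the extension default (the "WikibaseLib.default.php" cells).
DEFAULTS = {SETTING: {"max": MIGRATION_WRITE_NEW}}

# The wikidatawiki value from the last column of the table.
SITE_VALUE = {
    100000: MIGRATION_WRITE_NEW,
    74000000: MIGRATION_WRITE_BOTH,
    "max": MIGRATION_OLD,
}


def effective_settings(defaults, overrides):
    """Overlay site overrides on the defaults.

    A misspelled key is merged in as well, but nothing ever reads it,
    so the default for the real setting silently stays in effect.
    """
    merged = dict(defaults)
    merged.update(overrides)
    return merged


def unknown_keys(defaults, overrides):
    """Return override keys that have no matching default (likely typos)."""
    return sorted(set(overrides) - set(defaults))


def stage_for_item(stages, numeric_id):
    """Pick the stage for an item under the assumed upper-bound semantics."""
    for bound in sorted(key for key in stages if key != "max"):
        if numeric_id <= bound:
            return stages[bound]
    return stages["max"]


with_typo = effective_settings(DEFAULTS, {MISSPELLED: SITE_VALUE})
with_fix = effective_settings(DEFAULTS, {SETTING: SITE_VALUE})

for item_id in (5, 200000, 80000000):
    print(
        item_id,
        stage_for_item(with_typo[SETTING], item_id),
        stage_for_item(with_fix[SETTING], item_id),
    )
print(unknown_keys(DEFAULTS, {MISSPELLED: SITE_VALUE}))
```

Under these assumptions every item resolves to the default stage while the key is misspelled, and the per-range stages only take effect after the fix. A check along the lines of `unknown_keys` is one way a misspelled setting name could be flagged before deployment.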

Details

Related Gerrit Patches:
mediawiki/extensions/Wikibase : master - Revert "Revert "wbterms: Set default for the term store to read new""
operations/mediawiki-config : master - Fix typo in the config name
operations/mediawiki-config : master - Stop reading for the new term store as the default of client wikis

Event Timeline

Tarrow created this task. Feb 10 2020, 9:23 AM
Restricted Application added a subscriber: Aklapper. Feb 10 2020, 9:23 AM
Tarrow renamed this task from Typo in config for tmpItemTermsMigrationStages name to Typo in wmf-config for tmpItemTermsMigrationStages name. Feb 10 2020, 9:23 AM
Tarrow updated the task description. Feb 10 2020, 11:46 AM
Tarrow updated the task description. Feb 10 2020, 1:25 PM
Tarrow updated the task description. Feb 10 2020, 1:46 PM

Not sure whether test would have had unmigrated item terms.

Tarrow updated the task description. Feb 10 2020, 1:49 PM

Change 571338 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/mediawiki-config@master] Stop reading for the new term store as the default of client wikis

https://gerrit.wikimedia.org/r/571338

Change 571339 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[operations/mediawiki-config@master] Fix typo in the config name

https://gerrit.wikimedia.org/r/571339

Change 571340 had a related patch set uploaded (by Ladsgroup; owner: Ladsgroup):
[mediawiki/extensions/Wikibase@master] Revert "Revert "wbterms: Set default for the term store to read new""

https://gerrit.wikimedia.org/r/571340

Restricted Application added a project: User-Ladsgroup. Feb 10 2020, 7:31 PM
Maintenance_bot moved this task from Incoming to In progress on the User-Ladsgroup board.

Change 571338 merged by jenkins-bot:
[operations/mediawiki-config@master] Stop reading for the new term store as the default of client wikis

https://gerrit.wikimedia.org/r/571338

Mentioned in SAL (#wikimedia-operations) [2020-02-11T12:10:24Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:571338|Stop reading for the new term store as the default of client wikis (T244697)]] (duration: 01m 11s)

Mentioned in SAL (#wikimedia-operations) [2020-02-11T12:12:18Z] <ladsgroup@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:571338|Stop reading for the new term store as the default of client wikis (T244697)]], Second round, cache issue (duration: 01m 07s)

Change 571339 merged by jenkins-bot:
[operations/mediawiki-config@master] Fix typo in the config name

https://gerrit.wikimedia.org/r/571339

Mentioned in SAL (#wikimedia-operations) [2020-02-11T12:26:29Z] <ladsgroup@deploy1001> Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:571339|Fix typo in the config name (T244697)]] (duration: 01m 05s)

Mentioned in SAL (#wikimedia-operations) [2020-02-11T12:28:07Z] <ladsgroup@deploy1001> Synchronized wmf-config/Wikibase.php: SWAT: [[gerrit:571339|Fix typo in the config name (T244697)]], take II, cache (duration: 01m 06s)

Change 571340 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Revert "Revert "wbterms: Set default for the term store to read new""

https://gerrit.wikimedia.org/r/571340

Addshore closed this task as Resolved. Feb 18 2020, 10:34 AM