During the morning SWAT, @Ottomata deployed https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/492770/. @Ottomata thought that this patch did not depend on wmf.19 being deployed, since support for multiple EventBus instances via the EventServices config had been deployed the previous week. With that config already in place, the EventBus extension should not have been using the legacy wgEventServiceUrl config. Apparently it still was, for reasons not yet understood.
The EventBus POST request rate shows a drop when the SWAT config change was pushed at 19:40, and a recovery at 20:10 when wmf.19 was deployed to all wikis.
During the time between the morning SWAT and the final push of wmf.19 to group2 wikis, many app servers (probably all those in group2) failed to produce to EventBus due to misconfiguration:
https://logstash.wikimedia.org/goto/a1781b78f692e030d2e775d08572ca9f
https://grafana.wikimedia.org/d/000000102/production-logging?orgId=1&from=1551378582537&to=1551388309667&panelId=13&fullscreen&edit
https://grafana.wikimedia.org/d/000000201/eventbus?from=1551369945820&to=1551391545820&orgId=1&var-site=eqiad&var-rule=All
I've been able to capture the failed events from the logstash error data in Kafka:
$ kafkacat -C -b localhost:9092 -t udp_localhost-err -o 12688179 | grep '"channel": "EventBus"' > eventbus-outage-logstash.2019-02-28.json
$ cat eventbus-outage-logstash.2019-02-28.json | jq -rc .events[0] > eventbus-outage-events.2019-02-28.json
$ wc -l eventbus-outage-events.2019-02-28.json
269995 eventbus-outage-events.2019-02-28.json
Counting by topic:
147405 "mediawiki.job.wikibase-addUsagesForPage"
37177 "mediawiki.job.RecordLintJob"
17266 "mediawiki.job.cirrusSearchLinksUpdatePrioritized"
12189 "mediawiki.job.cirrusSearchLinksUpdate"
8831 "resource_change"
5848 "mediawiki.job.cdnPurge"
5420 "mediawiki.job.refreshLinksPrioritized"
5224 "mediawiki.job.htmlCacheUpdate"
4897 "mediawiki.job.recentChangesUpdate"
4419 "mediawiki.job.categoryMembershipChange"
4277 "mediawiki.revision-create"
2958 "mediawiki.page-links-change"
2444 "mediawiki.job.ORESFetchScoreJob"
2351 "mediawiki.job.EchoNotificationDeleteJob"
2146 null
1288 "mediawiki.revision-tags-change"
1285 "mediawiki.job.refreshLinks"
1281 "mediawiki.job.enotifNotify"
882 "mediawiki.job.flaggedrevs_CacheUpdate"
637 "mediawiki.job.wikibase-InjectRCRecords"
535 "mediawiki.page-properties-change"
325 "mediawiki.job.cirrusSearchIncomingLinkCount"
264 "mediawiki.page-create"
164 "mediawiki.job.CentralAuthCreateLocalAccountJob"
135 "mediawiki.job.LoginNotifyChecks"
126 "mediawiki.user-blocks-change"
51 "mediawiki.job.cirrusSearchDeletePages"
43 "mediawiki.page-delete"
33 "mediawiki.page-move"
32 "mediawiki.job.UpdateRepoOnMove"
31 "mediawiki.job.updateBetaFeaturesUserCounts"
6 "mediawiki.job.UpdateRepoOnDelete"
5 "mediawiki.job.ThumbnailRender"
5 "mediawiki.job.cirrusSearchCheckerJob"
4 "mediawiki.page-restrictions-change"
4 "mediawiki.job.compileArticleMetadata"
3 "mediawiki.revision-visibility-change"
2 "mediawiki.job.userGroupExpiry"
1 "mediawiki.job.cirrusSearchOtherIndex"
1 "mediawiki.job.cirrusSearchDeleteArchive"
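The command used to produce the per-topic counts is not recorded here. As a rough sketch of how they could be reproduced, the following assumes that each line of eventbus-outage-events.2019-02-28.json is a single JSON event carrying its topic in meta.topic (an assumption about the event layout, not something confirmed above). Events without that field, or lines that are a bare null, are counted under null. The function names count_by_topic and format_counts are hypothetical helpers, not part of any existing tooling.

```python
import json
from collections import Counter
from typing import Iterable, List


def count_by_topic(lines: Iterable[str]) -> Counter:
    """Count events per topic from an iterable of JSON lines.

    Assumes the topic lives at event["meta"]["topic"]; anything
    without that field is counted under None.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        meta = event.get("meta") if isinstance(event, dict) else None
        topic = meta.get("topic") if isinstance(meta, dict) else None
        counts[topic] += 1
    return counts


def format_counts(counts: Counter) -> List[str]:
    """Render counts as '<count> <json-encoded topic>', largest first."""
    return [f"{n} {json.dumps(topic)}" for topic, n in counts.most_common()]
```

Passing an open file handle for eventbus-outage-events.2019-02-28.json to count_by_topic and printing each row returned by format_counts should give output in the same shape as the list above, with missing topics rendered as null.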
I could replay these events back to EventBus...but I'm not sure that I should!