
EventBus mediawiki outage 2019-02-28
Closed, Resolved · Public · 3 Estimated Story Points

Description

During the morning SWAT, @Ottomata deployed https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/492770/. @Ottomata thought that this patch did not depend on wmf.19 being deployed, since last week we had deployed support for multiple EventBus instances via the EventServices config. That config had already been deployed, so the EventBus extension should not have been using the legacy wgEventServiceUrl config. Apparently it still was, somehow.

The EventBus POST request rate dropped when the SWAT config change was pushed at 19:40 and recovered at 20:10, when wmf.19 was deployed to all wikis:

[Image: eventbus_rate.png]

During the time between the morning SWAT and the final push of wmf.19 to group2 wikis, many app servers (probably all those in group2) failed to produce to EventBus due to misconfiguration:

https://logstash.wikimedia.org/goto/a1781b78f692e030d2e775d08572ca9f
https://grafana.wikimedia.org/d/000000102/production-logging?orgId=1&from=1551378582537&to=1551388309667&panelId=13&fullscreen&edit
https://grafana.wikimedia.org/d/000000201/eventbus?from=1551369945820&to=1551391545820&orgId=1&var-site=eqiad&var-rule=All

I've been able to capture the failed events from the Logstash error data in Kafka:

$ kafkacat -C -b localhost:9092 -t udp_localhost-err -o 12688179 | grep '"channel": "EventBus"' > eventbus-outage-logstash.2019-02-28.json
$ jq -rc '.events[0]' eventbus-outage-logstash.2019-02-28.json > eventbus-outage-events.2019-02-28.json

$ wc -l eventbus-outage-events.2019-02-28.json
269995 eventbus-outage-events.2019-02-28.json

Counting by topic:

147405 "mediawiki.job.wikibase-addUsagesForPage"
 37177 "mediawiki.job.RecordLintJob"
 17266 "mediawiki.job.cirrusSearchLinksUpdatePrioritized"
 12189 "mediawiki.job.cirrusSearchLinksUpdate"
  8831 "resource_change"
  5848 "mediawiki.job.cdnPurge"
  5420 "mediawiki.job.refreshLinksPrioritized"
  5224 "mediawiki.job.htmlCacheUpdate"
  4897 "mediawiki.job.recentChangesUpdate"
  4419 "mediawiki.job.categoryMembershipChange"
  4277 "mediawiki.revision-create"
  2958 "mediawiki.page-links-change"
  2444 "mediawiki.job.ORESFetchScoreJob"
  2351 "mediawiki.job.EchoNotificationDeleteJob"
  2146 null
  1288 "mediawiki.revision-tags-change"
  1285 "mediawiki.job.refreshLinks"
  1281 "mediawiki.job.enotifNotify"
   882 "mediawiki.job.flaggedrevs_CacheUpdate"
   637 "mediawiki.job.wikibase-InjectRCRecords"
   535 "mediawiki.page-properties-change"
   325 "mediawiki.job.cirrusSearchIncomingLinkCount"
   264 "mediawiki.page-create"
   164 "mediawiki.job.CentralAuthCreateLocalAccountJob"
   135 "mediawiki.job.LoginNotifyChecks"
   126 "mediawiki.user-blocks-change"
    51 "mediawiki.job.cirrusSearchDeletePages"
    43 "mediawiki.page-delete"
    33 "mediawiki.page-move"
    32 "mediawiki.job.UpdateRepoOnMove"
    31 "mediawiki.job.updateBetaFeaturesUserCounts"
     6 "mediawiki.job.UpdateRepoOnDelete"
     5 "mediawiki.job.ThumbnailRender"
     5 "mediawiki.job.cirrusSearchCheckerJob"
     4 "mediawiki.page-restrictions-change"
     4 "mediawiki.job.compileArticleMetadata"
     3 "mediawiki.revision-visibility-change"
     2 "mediawiki.job.userGroupExpiry"
     1 "mediawiki.job.cirrusSearchOtherIndex"
     1 "mediawiki.job.cirrusSearchDeleteArchive"
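
A per-topic count like the one above could be produced roughly as follows. This is only a sketch, not necessarily the command that was actually run: it assumes each line of the events file is one JSON event with its topic at .meta.topic, and count_by_topic is a hypothetical helper name. Under that assumption, events that lack a meta object are counted as null.

```shell
# Count events per topic, most frequent first.
# Assumption: one JSON event per line, topic stored at .meta.topic.
# Events with no meta object (or no topic) are printed by jq as null.
count_by_topic() {
  jq -c '.meta.topic' "$1" | sort | uniq -c | sort -rn
}

# Usage:
#   count_by_topic eventbus-outage-events.2019-02-28.json
```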

I could replay these events back to EventBus...but I'm not sure that I should!

Event Timeline

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2019-02-28T22:29:36Z] <ottomata> replaying events from mediawki eventbus config outage - T217385

Not all events had their meta objects in Logstash. Not sure why. I had to filter those out:

Replaying:

grep '"meta":' eventbus-outage-events.2019-02-28.json > eventbus-outage-events_with_meta.2019-02-28.json
while IFS= read -r line
do
   curl -H 'Content-Type: application/json' -d"$line" http://eventbus.discovery.wmnet:8085/v1/events
done < eventbus-outage-events_with_meta.2019-02-28.json

I think this should take about an hour to replay unparallelized.
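
The replay loop above does not look at the responses. A variant that keeps track of events the service did not accept might look like the sketch below. The helper names post_event and replay_events are hypothetical, and treating any 2xx status as accepted is an assumption about the service, not something confirmed in this task.

```shell
# Endpoint taken from the replay loop in this task; override via the environment.
EVENTBUS_URL=${EVENTBUS_URL:-http://eventbus.discovery.wmnet:8085/v1/events}

# Hypothetical helper: POST one event and print only the HTTP status code.
# curl prints 000 if the request could not be completed at all.
post_event() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H 'Content-Type: application/json' -d "$1" "$EVENTBUS_URL"
}

# Replay every line of $1; append lines that were not accepted to $2
# so they can be inspected or retried later.
replay_events() {
  replay_infile=$1
  replay_failed=$2
  : > "$replay_failed"
  while IFS= read -r line; do
    status=$(post_event "$line")
    case $status in
      2??) ;;                                            # assumed accepted
      *)   printf '%s\n' "$line" >> "$replay_failed" ;;  # keep for a retry
    esac
  done < "$replay_infile"
}

# Usage:
#   replay_events eventbus-outage-events_with_meta.2019-02-28.json failed-events.json
```

An empty failure file after the run would be a stronger check than only confirming that the last event was produced.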

Ottomata added a project: Analytics-Kanban.
Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.

After this finished, I verified that the last event in the file was produced.

Ottomata set the point value for this task to 3. (Mar 4 2019, 4:09 PM)
Milimetric moved this task from Incoming to Event Platform on the Analytics board.

May I ask for help completing the documentation (when possible, it doesn't have to be now) at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190228-logstash ? The logstash incident seems bad enough, but (please correct me if I am wrong) this one seems more user-facing and probably more interesting to end users.