
EventBus mediawiki outage 2019-02-28
Closed, Resolved · Public · 3 Story Points

Description

During the morning SWAT, @Ottomata deployed https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/492770/. @Ottomata thought that this patch did not depend on wmf.19 being deployed, since support for multi-instance EventBus via the EventServices config had been deployed the previous week. That config was already live, so the EventBus extension should not have been using the legacy wgEventServiceUrl config. Apparently it somehow still was.

The EventBus POST request rate dropped when the SWAT config change was pushed at 19:40 and recovered at 20:10, when wmf.19 was deployed on all wikis.

During the time between the morning SWAT and the final push of wmf.19 to group2 wikis, many app servers (probably all those in group2) failed to produce to EventBus due to misconfiguration:

https://logstash.wikimedia.org/goto/a1781b78f692e030d2e775d08572ca9f
https://grafana.wikimedia.org/d/000000102/production-logging?orgId=1&from=1551378582537&to=1551388309667&panelId=13&fullscreen&edit
https://grafana.wikimedia.org/d/000000201/eventbus?from=1551369945820&to=1551391545820&orgId=1&var-site=eqiad&var-rule=All

I've been able to capture the failed events from the logstash error data in Kafka:

$ kafkacat -C -b localhost:9092 -t udp_localhost-err -o 12688179 | grep '"channel": "EventBus"' > eventbus-outage-logstash.2019-02-28.json
$ jq -rc '.events[0]' eventbus-outage-logstash.2019-02-28.json > eventbus-outage-events.2019-02-28.json

$ wc -l eventbus-outage-events.2019-02-28.json
269995 eventbus-outage-events.2019-02-28.json

Counting by topic:

147405 "mediawiki.job.wikibase-addUsagesForPage"
 37177 "mediawiki.job.RecordLintJob"
 17266 "mediawiki.job.cirrusSearchLinksUpdatePrioritized"
 12189 "mediawiki.job.cirrusSearchLinksUpdate"
  8831 "resource_change"
  5848 "mediawiki.job.cdnPurge"
  5420 "mediawiki.job.refreshLinksPrioritized"
  5224 "mediawiki.job.htmlCacheUpdate"
  4897 "mediawiki.job.recentChangesUpdate"
  4419 "mediawiki.job.categoryMembershipChange"
  4277 "mediawiki.revision-create"
  2958 "mediawiki.page-links-change"
  2444 "mediawiki.job.ORESFetchScoreJob"
  2351 "mediawiki.job.EchoNotificationDeleteJob"
  2146 null
  1288 "mediawiki.revision-tags-change"
  1285 "mediawiki.job.refreshLinks"
  1281 "mediawiki.job.enotifNotify"
   882 "mediawiki.job.flaggedrevs_CacheUpdate"
   637 "mediawiki.job.wikibase-InjectRCRecords"
   535 "mediawiki.page-properties-change"
   325 "mediawiki.job.cirrusSearchIncomingLinkCount"
   264 "mediawiki.page-create"
   164 "mediawiki.job.CentralAuthCreateLocalAccountJob"
   135 "mediawiki.job.LoginNotifyChecks"
   126 "mediawiki.user-blocks-change"
    51 "mediawiki.job.cirrusSearchDeletePages"
    43 "mediawiki.page-delete"
    33 "mediawiki.page-move"
    32 "mediawiki.job.UpdateRepoOnMove"
    31 "mediawiki.job.updateBetaFeaturesUserCounts"
     6 "mediawiki.job.UpdateRepoOnDelete"
     5 "mediawiki.job.ThumbnailRender"
     5 "mediawiki.job.cirrusSearchCheckerJob"
     4 "mediawiki.page-restrictions-change"
     4 "mediawiki.job.compileArticleMetadata"
     3 "mediawiki.revision-visibility-change"
     2 "mediawiki.job.userGroupExpiry"
     1 "mediawiki.job.cirrusSearchOtherIndex"
     1 "mediawiki.job.cirrusSearchDeleteArchive"
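
The exact command used for the per-topic counts above is not recorded in this task. A minimal sketch of one way to get output in that shape, assuming each line of the events file is a single JSON event with its topic at .meta.topic (an assumption, not confirmed here), so that events without a meta object are counted as null. The function name count_by_topic is hypothetical:

```shell
# Hypothetical helper: count events per topic in a file holding one JSON event per line.
# Assumes the topic lives at .meta.topic; events with no meta object are printed as null.
count_by_topic() {
    # jq prints one topic (or null) per event; uniq -c counts each distinct value;
    # the final sort puts the most frequent topics first.
    jq -c '.meta.topic' "$1" | sort | uniq -c | sort -rn
}

# Usage: count_by_topic eventbus-outage-events.2019-02-28.json
```

Leaving out jq's -r flag keeps the quotes around topic names, which is why a missing topic shows up as a bare null rather than an empty line.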

I could replay these events back to EventBus...but I'm not sure that I should!

Event Timeline

Restricted Application added a project: Analytics. Feb 28 2019, 10:09 PM
Restricted Application added a subscriber: Aklapper.
Ottomata updated the task description. Feb 28 2019, 10:11 PM
hashar added a subscriber: hashar. Feb 28 2019, 10:11 PM
hashar updated the task description. Feb 28 2019, 10:17 PM

Mentioned in SAL (#wikimedia-operations) [2019-02-28T22:29:36Z] <ottomata> replaying events from mediawki eventbus config outage - T217385

Not all events from logstash had their meta objects. Not sure why. I had to filter those out before replaying:

# Keep only the events that still carry a "meta" object.
grep '"meta":' eventbus-outage-events.2019-02-28.json > eventbus-outage-events_with_meta.2019-02-28.json
# POST the remaining events back to EventBus, one per request.
while IFS= read -r line
do
   curl -H 'Content-Type: application/json' -d "$line" http://eventbus.discovery.wmnet:8085/v1/events
done < eventbus-outage-events_with_meta.2019-02-28.json

I think this should take about an hour to replay unparallelized.
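
The loop above does not record which POSTs fail. A minimal sketch of a variant that saves rejected events for a retry, assuming the service answers an accepted event with a 2xx status (not confirmed in this task). The function name replay_events, the failure file, and the CURL override variable are all hypothetical:

```shell
# Hypothetical helper: POST each line of a file to an EventBus-style endpoint and
# save every event that does not get a 2xx response so it can be retried later.
# CURL may be set to a stub command for dry runs; it defaults to curl.
replay_events() {
    events_file=$1
    url=$2
    failed_file=$3
    curl_cmd=${CURL:-curl}
    : > "$failed_file"    # start with an empty failure file
    while IFS= read -r line; do
        # -s: quiet, -o /dev/null: discard the body, -w: print only the HTTP status.
        # A transport error (non-zero exit) is treated as status 000.
        status=$("$curl_cmd" -s -o /dev/null -w '%{http_code}' \
            -H 'Content-Type: application/json' -d "$line" "$url") || status=000
        case $status in
            2??) ;;                                           # accepted
            *)   printf '%s\n' "$line" >> "$failed_file" ;;   # keep for retry
        esac
    done < "$events_file"
}
```

It could be called as `replay_events eventbus-outage-events_with_meta.2019-02-28.json http://eventbus.discovery.wmnet:8085/v1/events eventbus-outage-failed.json`, where the last file name is made up for illustration. An empty failure file at the end would suggest every event was accepted.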

Ottomata claimed this task. Mar 4 2019, 4:08 PM
Ottomata added a project: Analytics-Kanban.
Ottomata moved this task from Next Up to Done on the Analytics-Kanban board.

After this finished, I verified that the last event in the file was produced.

Ottomata set the point value for this task to 3. Mar 4 2019, 4:09 PM
Milimetric triaged this task as High priority. Mar 4 2019, 4:34 PM
Milimetric moved this task from Incoming to Modern Event Platform on the Analytics board.
jcrespo added a subscriber: jcrespo. Mar 4 2019, 5:51 PM

May I ask for help completing the documentation (when possible, doesn't have to be now) at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190228-logstash ? The logstash incident seems bad enough, but (please correct me if I am wrong) this one seems more user-facing and probably more interesting to end users.

Ya will do today.

herron added a subscriber: herron. Mar 4 2019, 6:02 PM
Ottomata moved this task from Backlog to Done on the Event-Platform board. Mar 4 2019, 10:29 PM
Nuria closed this task as Resolved. Mar 11 2019, 6:12 PM