
RESTBase, Change-Prop and MobileApps got in a loop
Closed, Resolved · Public

Description

On July 1, at approximately 12:55 AM UTC, several Cassandra nodes died because of the usual tombstones issue. However, the root cause this time was very unusual.

First, a rerender of the https://ru.wikipedia.org/wiki/Портал%3AГерпетология page came in, probably because of a transclusion update. This page uses flagged revisions, so according to MediaWiki the latest revision of the page is 68650450, while RESTBase also has revision 85886331 in storage.

After this rerender, hundreds of events like these were emitted by RESTBase:

{
  "meta": {
    "domain": "ru.wikipedia.org",
    "dt": "2017-07-01T09:45:26.813Z",
    "id": "02ee94c5-5e42-11e7-8769-b51cc5d64d49",
    "request_id": "01b614bc-5e42-11e7-8cc8-835a59932741",
    "schema_uri": "resource_change/1",
    "topic": "resource_change",
    "uri": "http://ru.wikipedia.org/api/rest_v1/page/html/%D0%9F%D0%BE%D1%80%D1%82%D0%B0%D0%BB%3A%D0%93%D0%B5%D1%80%D0%BF%D0%B5%D1%82%D0%BE%D0%BB%D0%BE%D0%B3%D0%B8%D1%8F/68650450"
  },
  "tags": [
    "restbase"
  ]
}
{
  "meta": {
    "domain": "ru.wikipedia.org",
    "dt": "2017-07-01T09:45:26.827Z",
    "id": "02f0a545-5e42-11e7-b23d-1eb168871649",
    "request_id": "01b936b1-5e42-11e7-9acf-d81c366b200a",
    "schema_uri": "resource_change/1",
    "topic": "resource_change",
    "uri": "http://ru.wikipedia.org/api/rest_v1/page/html/%D0%9F%D0%BE%D1%80%D1%82%D0%B0%D0%BB%3A%D0%93%D0%B5%D1%80%D0%BF%D0%B5%D1%82%D0%BE%D0%BB%D0%BE%D0%B3%D0%B8%D1%8F"
  },
  "tags": [
    "restbase"
  ]
}

These events should only be emitted when a new render of the revision, different from the previously stored render, is saved. Although Cassandra storage has about a thousand renders of that revision, their TIDs do not align with the incident timing, and they all look like legitimate results of template rerenders (why they were not removed by the revision retention policy is a big separate question of its own).
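The intended rule (emit a resource_change event only when the newly saved render differs from the stored one) can be sketched roughly as follows. This is an illustration only, not RESTBase's actual code: the function names and the use of a SHA-256 digest for the comparison are assumptions made for the example.

```python
import hashlib
from typing import Optional


def content_digest(html: str) -> str:
    """Return a stable digest of a render's HTML (hypothetical helper)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def should_emit_resource_change(stored_html: Optional[str], new_html: str) -> bool:
    """Decide whether a resource_change event should be emitted.

    Assumption: an event is warranted only if nothing is stored yet for
    this revision, or the new render differs from the stored one.
    """
    if stored_html is None:
        # No previous render for this revision: the content is new.
        return True
    # Identical renders must not produce another event; otherwise any
    # consumer that re-requests the page can feed a loop.
    return content_digest(stored_html) != content_digest(new_html)
```

With a check like this, a rerender that produces byte-identical HTML would be a no-op for downstream consumers, which is the behaviour that appears to have been violated in this incident.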

All of these events were picked up by ChangeProp and triggered mobile content updates, which in turn came back to RESTBase and probably somehow caused even more html-change events to be emitted. Eventually, the Cassandra nodes responsible for this partition died.

After almost a day of investigating, I still haven't identified the root cause. The issue is pretty serious: if the condition that caused it occurs again, it can bring down the whole RESTBase cluster.

Event Timeline

Pchelolo created this task. · Jul 6 2017, 5:32 PM
Restricted Application added a subscriber: Aklapper. · Jul 6 2017, 5:32 PM
mobrovac triaged this task as High priority. · Jul 6 2017, 8:59 PM

Mentioned in SAL (#wikimedia-operations) [2017-07-06T21:51:28Z] <ppchelko@tin> Started deploy [changeprop/deploy@e1230e6]: Extend automatic blacklisting T169911

Mentioned in SAL (#wikimedia-operations) [2017-07-06T21:52:37Z] <ppchelko@tin> Finished deploy [changeprop/deploy@e1230e6]: Extend automatic blacklisting T169911 (duration: 01m 09s)

We've enabled automatic blacklisting for all the derived content updates, so even if this issue happens again it won't kill RESTBase. For now I'll stop investigating the root cause, as we clearly don't have enough information.
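For readers unfamiliar with the idea, automatic blacklisting can be thought of as a per-URI rate guard: once a URI is rerendered too often within a time window, further updates for it are dropped. The sketch below is a minimal illustration under that assumption; it is not ChangeProp's implementation, and the class name, thresholds, and sliding-window approach are all hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Set


class AutoBlacklist:
    """Hypothetical sliding-window guard against runaway rerender loops."""

    def __init__(self, max_events: int = 100, window_seconds: float = 60.0) -> None:
        self.max_events = max_events          # assumed threshold, not from the task
        self.window_seconds = window_seconds  # assumed window, not from the task
        self._seen: Dict[str, Deque[float]] = defaultdict(deque)
        self._blacklisted: Set[str] = set()

    def allow(self, uri: str, now: Optional[float] = None) -> bool:
        """Return True if an update for `uri` should be processed."""
        if uri in self._blacklisted:
            return False
        now = time.monotonic() if now is None else now
        timestamps = self._seen[uri]
        timestamps.append(now)
        # Forget events that fell out of the sliding window.
        while timestamps and now - timestamps[0] > self.window_seconds:
            timestamps.popleft()
        if len(timestamps) > self.max_events:
            # Too many updates in the window: stop processing this URI.
            self._blacklisted.add(uri)
            return False
        return True
```

A consumer would call `allow(event["meta"]["uri"])` before triggering a derived content update and skip the event when it returns False. This does not fix whatever causes the loop; it only bounds the damage.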

Krinkle updated the task description. · Jul 6 2017, 10:28 PM

Setting this to blocked until we get more info, when (or if) this happens again.

mobrovac changed the task status from Open to Stalled. · Jul 7 2017, 3:24 PM
Pchelolo closed this task as Resolved. · Mar 14 2018, 8:46 PM
Pchelolo edited projects, added Services (done); removed Services (blocked).

Didn't happen for a year, time to resolve.