## Timeline
2016-05-04T15:02:45 first requests to parsoid for oldId 106801025 (from ChangePropagation/WMF)
<date change>
2016-05-05T02:16:10 first INFO [HintedHandoff:2] 2016-05-03 02:16:10,523 HintedHandOffManager.java:486 - Timed out replaying hints to /10.192.32.137; aborting (25847 delivered) (on restbase1014)
2016-05-05T02:28 org.apache.cassandra.db.compaction.CompactionTask log events elevated
2016-05-05T02:3X bytes_out and cpu_user start to spike on rb cluster
2016-05-05T02:3X load spikes on parsoid cluster
2016-05-05T02:34:16 first "Retry count exceeded" error in /srv/log/changeprop/main.log (for http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE)
2016-05-05T02:38:14 first alert: <icinga-wm> PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-05-05T03:20:00 (ca.) Ori stops changeprop service
2016-05-05T03:30:00 (ca.) Load on Parsoid and RESTBase drops, service recovers.
## Analysis
1) The lower-case version of http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE was actually an illegal title according to the wiki's title capitalization setting, but this was not detected as a side-effect of an mbstring / HHVM bug {T91056}. When interpreted according to unicode & normalization rules of the wiki, it was a redirect to itself.
2) The uppercase title was edited, and change-propagation picked up the edit and requests updates in restbase with a no-cache request for the page summary end point.
* Note: The same update is also triggered by normal job processing.
3) RESTBase checked whether the page is a redirect. While the uppercase version of the page was not a redirect, the lower-case version was. RESTBase always stores content and revision information at normalized titles, which for this wiki means an upper-cased first char. The lower-case version of the article happened to have a higher revision number, so the redirect flag was picked up. To find the redirect target, the access / redirect filter then requested HTML to parse redirect target, but forwarded all the headers, including no-cache **[bug]**(https://github.com/wikimedia/restbase/blob/master/lib/revision_table_access_check_filter.js#L37).
4) In parsoid.js (in the RESTBase codebase), the missing revision parameter then triggered no-cache revision fetching, which retrieved the latest revision of the *uppercase* title from MW API, yielding 106801025.
5) This revision was then requested from parsoid. The content returned from parsoid was classified as differing & was saved.
* Reason for always-diff: The contentReq at https://github.com/wikimedia/restbase/blob/master/sys/parsoid.js#L420 was only based on the request parameters, which in this case *did not include the revision*. The latest stored revision was actually 125284517 (corresponding to the lowercase title), thus causing the content to mis-match when compared to the re-render of 106801025.
* Secondary updates **emitted a resource change event for HTML**
6) Changeprop **picked up the resource change event from #5, going back to #1 of the loop**.
This by itself would have just caused a sequential repeat of a single request, and would not have lead to an outage. However, several other issues caused an **exponential retry amplification**, which in turn created enough load on RESTBase and Parsoid to make this an outage:
* RESTBase secondary updates (temporarily inline) also perform an update request to the Mobile Content Service (MCS)
* MCS in turn requests HTML from RESTBase without no-cache but with ?redirect=false, gets HTML with self-redirect
* **Inline redirect extraction was still active in MCS, causing it to repeat requests 10x** (all served from storage) until redirect limit reached, then returned 400 from [parsoid-access.js](https://github.com/wikimedia/mediawiki-services-mobileapps/blob/b9f99a40711fc09eea7573cdc67c7ea3ef12880e/lib/parsoid-access.js#L106) (lots of log entries for this in RB logs).
* Since this was a no-cache request, secondary updates are done synchronously, and errors are reported. A generic 500 was returned from RESTBase.
* The change-propagation service received the 500 response (triggered by MCS redirect limit) and, after a delay of ~500ms, retried **(5x)** until the retry count was exceeded.
* ChangeProp processed retries eagerly & in parallel with `resource_change` events. There was no limit on processing concurrency apart form CPU, and delays for retries were short (500ms). As a result, parallelism of retries & event processing ramped up until the CPU was saturated, sending out at least 500+ un-cacheable requests per second.
## Actionables
### RESTBase
* Don't propagate the cache-control header for access checks: https://github.com/wikimedia/restbase/pull/604
* change contentReq at https://github.com/wikimedia/restbase/blob/master/sys/parsoid.js#L420 to always use a / the same revision as is used for the parsoid request
* look into stopping no-cache revision fetching as restriction update mechanism (avoids mis-match between revision & content storage)
* Improve logging / error reporting from temporary dependency updates
### ChangeProp service
* {T134456}
* {T134483}
### Mobile Content Service
* {T134538}
### Parsoid
* investigate slow render of the pages involved here
* consider setting up rate limits for repeat parses of the same domain / title / revision: {T134516}
## Alerts
* Set up mail alerts to services@ for service_checker failures
* In [node.pp](https://github.com/wikimedia/operations-pupp2et/blob/024f70d32a28a05237b109ccfc48103d3fa8ee04/modules/service/manifests/node.pp#L277) or [restbase.pp](https://github.com/wikimedia/operations-puppet/blob/b1a85262d3d6f6f526a66240125acb720b3bf9b3/manifests/role/restbase.pp#L22) ?
* Set up alert / monitoring against LVS end point, and make it page services *and* ops; should only trigger if service overall is down (not individual hosts)