## Timeline
2016-05-04T15:02:45 first requests to parsoid for oldId 106801025 (from ChangePropagation/WMF)
<date change>
2016-05-05T02:16:10 first INFO [HintedHandoff:2] 2016-05-03 02:16:10,523 HintedHandOffManager.java:486 - Timed out replaying hints to /10.192.32.137; aborting (25847 delivered) (on restbase1014)
2016-05-05T02:28 org.apache.cassandra.db.compaction.CompactionTask log events elevated
2016-05-05T02:3X bytes_out and cpu_user start to spike on rb cluster
2016-05-05T02:3X load spikes on parsoid cluster
2016-05-05T02:34:16 first "Retry count exceeded" error in /srv/log/changeprop/main.log (for http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE)
2016-05-05T02:38:14 first alert: <icinga-wm> PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2016-05-05T03:20:00 (ca.) Ori stops changeprop service
2016-05-05T03:30:00 (ca.) Load on Parsoid and RESTBase drops, service recovers.
## Analysis
1) The lower-case version of http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE is actually an illegal title according to the wiki's title capitalization setting, but this is not detected as a side-effect of an mbstring / HHVM bug <[[phab:T91056|https://phabricator.wikimedia.org/T91056]]>. When interpreted according to unicode & normalization rules of the wiki, it is a redirect to itself.
2) The uppercase title is edited, and change-propagation picks up the edit and requests updates in restbase with a no-cache request for the page summary end point.
* Note: The same update is also triggered by normal job processing.
3) RESTBase checks whether the page is a redirect. While the uppercase version of the page is not a redirect, the lower-case version is. RESTBase always stores content and revision information at normalized titles, which for this wiki means an upper-cased first char. The lower-case version of the article has a higher revision number, so the redirect flag is picked up. To find the redirect target, the access / redirect filter then requests HTML to parse redirect target, but forwards all the headers, including no-cache ***[bug]***(https://github.com/wikimedia/restbase/blob/master/lib/revision_table_access_check_filter.js#L37).
4) In parsoid.js, the missing revision parameter then triggers no-cache revision fetching, which retrieves the latest revision of the *uppercase* title from MW API, yielding 106801025.
5) This revision is then requested from parsoid. The content returned from parsoid differs & is saved.
* Reason for always-diff: The contentReq at https://github.com/wikimedia/restbase/blob/master/sys/parsoid.js#L420 is only based on the request parameters, which in this case *do not include the revision*. The latest stored revision was actually 125284517 (corresponding to the lowercase title), thus causing the content to mis-match when compared to the re-render of 106801025.
* Secondary updates ***emit a resource change event for HTML***
* Secondary updates (temporarily inline) perform an update request to the Mobile Content Service (MCS)
* MCS in turn requests HTML from RESTBase without no-cache but with ?redirect=false, gets HTML with self-redirect
* ***Inline redirect extraction is still active in MCS, repeats request 20x*** (all served from storage) unitl redirect limit reached, then returns 400 from https://github.com/wikimedia/mediawiki-services-mobileapps/blob/b9f99a40711fc09eea7573cdc67c7ea3ef12880e/lib/parsoid-access.js#L106 (lots of log entries for this in RB logs)
* Since this is a no-cache request, secondary updates are done synchronously, and errors are reported. A generic 500 is returned from RESTBase.
6) The change-propagation service gets the 500 response (triggered by MCS redirect limit) and, after a delay of ~500ms, retries ***(5x)*** until the retry count is exceeded. After logging an error, the event is no longer retried.
7) Changeprop ***picks up the resource change event from #5, going back to #1 of the loop***.
## Actionables
### RESTBase
* Don't propagate the cache-control header for access checks: https://github.com/wikimedia/restbase/pull/604
* change contentReq at https://github.com/wikimedia/restbase/blob/master/sys/parsoid.js#L420 to always use a / the same revision as is used for the parsoid request
* look into stopping no-cache revision fetching as restriction update mechanism (avoids mis-match between revision & content storage)
* Improve logging / error reporting from temporary dependency updates
### ChangeProp service
* {T134456}
* {T134483}
### Mobile Content Service
* {T134538}
### Parsoid
* investigate slow render of the pages involved here
* consider setting up rate limits for repeat parses of the same domain / title / revision: {T134516}
## Alerts
* Set up mail alerts to services@ for service_checker failures
* In [node.pp](https://github.com/wikimedia/operations-pupp2et/blob/024f70d32a28a05237b109ccfc48103d3fa8ee04/modules/service/manifests/node.pp#L277) or [restbase.pp](https://github.com/wikimedia/operations-puppet/blob/b1a85262d3d6f6f526a66240125acb720b3bf9b3/manifests/role/restbase.pp#L22) ?
* Set up alert / monitoring against LVS end point, and make it page services *and* ops; should only trigger if service overall is down (not individual hosts)