Change Details

## Timeline 2016-05-04T15:02:45 first requests to parsoid for oldId 106801025 (from ChangePropagation/WMF) <date change> 2016-05-05T02:16:10 first INFO [HintedHandoff:2] 2016-05-03 02:16:10,523 HintedHandOffManager.java:486 - Timed out replaying hints to /10.192.32.137; aborting (25847 delivered) (on restbase1014) 2016-05-05T02:28 org.apache.cassandra.db.compaction.CompactionTask log events elevated 2016-05-05T02:3X bytes_out and cpu_user start to spike on rb cluster 2016-05-05T02:3X load spikes on parsoid cluster 2016-05-05T02:34:16 first "Retry count exceeded" error in /srv/log/changeprop/main.log (for http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE) 2016-05-05T02:38:14 first alert: <icinga-wm> PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2016-05-05T03:20:00 (ca.) Ori stops changeprop service 2016-05-05T03:30:00 (ca.) Load on Parsoid and RESTBase drops, service recovers. ## Analysis 1) The lower-case version of http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE was actually an illegal title according to the wiki's title capitalization setting, but this was not detected as a side-effect of an mbstring / HHVM bug {T91056}. When interpreted according to unicode & normalization rules of the wiki, it was a redirect to itself. 2) The uppercase title was edited, and change-propagation picked up the edit and requests updates in restbase with a no-cache request for the page summary end point. * Note: The same update is also triggered by normal job processing. 3) RESTBase checked whether the page is a redirect. While the uppercase version of the page was not a redirect, the lower-case version was. RESTBase always stores content and revision information at normalized titles, which for this wiki means an upper-cased first char. The lower-case version of the article happened to have a higher revision number, so the redirect flag was picked up. To find the redirect target, the access / redirect filter then requested HTML to parse redirect target, but forwarded all the headers, including no-cache **[bug]**(https://github.com/wikimedia/restbase/blob/master/lib/revision_table_access_check_filter.js#L37). 4) In parsoid.js (in the RESTBase codebase), the missing revision parameter then triggered no-cache revision fetching, which retrieved the latest revision of the *uppercase* title from MW API, yielding 106801025. 5) This revision was then requested from parsoid. The content returned from parsoid was classified as differing & was saved. * Reason for always-diff: The contentReq at https://github.com/wikimedia/restbase/blob/master/sys/parsoid.js#L420 was only based on the request parameters, which in this case *did not include the revision*. The latest stored revision was actually 125284517 (corresponding to the lowercase title), thus causing the content to mis-match when compared to the re-render of 106801025. * Secondary updates **emitted a resource change event for HTML** 6) Changeprop **picked up the resource change event from #5, going back to #1 of the loop**. This by itself would have just caused a sequential repeat of a single request, and would not have lead to an outage. However, several other issues caused an **exponential retry amplification**, which in turn created enough load on RESTBase and Parsoid to make this an outage: * RESTBase secondary updates (temporarily inline) also perform an update request to the Mobile Content Service (MCS) * MCS in turn requests HTML from RESTBase without no-cache but with ?redirect=false, gets HTML with self-redirect * **Inline redirect extraction was still active in MCS, causing it to repeat requests 10x** (all served from storage) until redirect limit reached, then returned 400 from [parsoid-access.js](https://github.com/wikimedia/mediawiki-services-mobileapps/blob/b9f99a40711fc09eea7573cdc67c7ea3ef12880e/lib/parsoid-access.js#L106) (lots of log entries for this in RB logs). * Since this was a no-cache request, secondary updates are done synchronously, and errors are reported. A generic 500 was returned from RESTBase. * The change-propagation service received the 500 response (triggered by MCS redirect limit) and, after a delay of ~500ms, retried **(5x)** until the retry count was exceeded. * ChangeProp processed retries eagerly & in parallel with `resource_change` events. There was no limit on processing concurrency apart form CPU, and delays for retries were short (500ms). As a result, parallelism of retries & event processing ramped up until the CPU was saturated, sending out at least 500+ un-cacheable requests per second. ## Actionables ### RESTBase * Don't propagate the cache-control header for access checks: https://github.com/wikimedia/restbase/pull/604 * change contentReq at https://github.com/wikimedia/restbase/blob/master/sys/parsoid.js#L420 to always use a / the same revision as is used for the parsoid request * look into stopping no-cache revision fetching as restriction update mechanism (avoids mis-match between revision & content storage) * Improve logging / error reporting from temporary dependency updates ### ChangeProp service * {T134456} * {T134483} ### Mobile Content Service * {T134538} ### Parsoid * investigate slow render of the pages involved here * consider setting up rate limits for repeat parses of the same domain / title / revision: {T134516} ## Alerts * Set up mail alerts to services@ for service_checker failures * In [node.pp](https://github.com/wikimedia/operations-pupp2et/blob/024f70d32a28a05237b109ccfc48103d3fa8ee04/modules/service/manifests/node.pp#L277) or [restbase.pp](https://github.com/wikimedia/operations-puppet/blob/b1a85262d3d6f6f526a66240125acb720b3bf9b3/manifests/role/restbase.pp#L22) ? * Set up alert / monitoring against LVS end point, and make it page services *and* ops; should only trigger if service overall is down (not individual hosts)

## Timeline 2016-05-04T15:02:45 first requests to parsoid for oldId 106801025 (from ChangePropagation/WMF) <date change> 2016-05-05T02:16:10 first INFO [HintedHandoff:2] 2016-05-03 02:16:10,523 HintedHandOffManager.java:486 - Timed out replaying hints to /10.192.32.137; aborting (25847 delivered) (on restbase1014) 2016-05-05T02:28 org.apache.cassandra.db.compaction.CompactionTask log events elevated 2016-05-05T02:3X bytes_out and cpu_user start to spike on rb cluster 2016-05-05T02:3X load spikes on parsoid cluster 2016-05-05T02:34:16 first "Retry count exceeded" error in /srv/log/changeprop/main.log (for http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE) 2016-05-05T02:38:14 first alert: <icinga-wm> PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 2016-05-05T03:20:00 (ca.) Ori stops changeprop service 2016-05-05T03:30:00 (ca.) Load on Parsoid and RESTBase drops, service recovers. ## Analysis 1) The lower-case version of http://fr.wikipedia.org/api/rest_v1/page/html/%EA%9D%AE was actually an illegal title according to the wiki's title capitalization setting, but this was not detected as a side-effect of an mbstring / HHVM bug {T91056}. When interpreted according to unicode & normalization rules of the wiki, it was a redirect to itself. 2) The uppercase title was edited, and change-propagation picked up the edit and requests updates in restbase with a no-cache request for the page summary end point. * Note: The same update is also triggered by normal job processing. 3) RESTBase checked whether the page is a redirect. While the uppercase version of the page was not a redirect, the lower-case version was. RESTBase always stores content and revision information at normalized titles, which for this wiki means an upper-cased first char. The lower-case version of the article happened to have a higher revision number, so the redirect flag was picked up. To find the redirect target, the access / redirect filter then requested HTML to parse redirect target, but forwarded all the headers, including no-cache **[bug]**(https://github.com/wikimedia/restbase/blob/master/lib/revision_table_access_check_filter.js#L37). 4) In parsoid.js (in the RESTBase codebase), the missing revision parameter then triggered no-cache revision fetching, which retrieved the latest revision of the *uppercase* title from MW API, yielding 106801025. 5) This revision was then requested from parsoid. The content returned from parsoid was classified as differing & was saved. * Reason for always-diff: The contentReq at https://github.com/wikimedia/restbase/blob/master/sys/parsoid.js#L420 was only based on the request parameters, which in this case *did not include the revision*. The latest stored revision was actually 125284517 (corresponding to the lowercase title), thus causing the content to mis-match when compared to the re-render of 106801025. * Secondary updates **emitted a resource change event for HTML** 6) Changeprop **picked up the resource change event from #5, going back to #1 of the loop**. This by itself would have just caused a sequential repeat of a single request, and would not have lead to an outage. However, several other issues caused an **exponential retry amplification**, which in turn created enough load on RESTBase and Parsoid to make this an outage: * RESTBase secondary updates (temporarily inline) also perform an update request to the Mobile Content Service (MCS) * MCS in turn requests HTML from RESTBase without no-cache but with ?redirect=false, gets HTML with self-redirect * **Inline redirect extraction was still active in MCS, causing it to repeat requests 10x** (all served from storage) until redirect limit reached, then returned 400 from [parsoid-access.js](https://github.com/wikimedia/mediawiki-services-mobileapps/blob/b9f99a40711fc09eea7573cdc67c7ea3ef12880e/lib/parsoid-access.js#L106) (lots of log entries for this in RB logs). * Since this was a no-cache request, secondary updates are done synchronously, and errors are reported. A generic 500 was returned from RESTBase. * The change-propagation service received the 500 response (triggered by MCS redirect limit) and, after a delay of ~500ms, retried **(5x)** until the retry count was exceeded. * ChangeProp processed retries eagerly & in parallel with `resource_change` events. There was no limit on processing concurrency apart form CPU, and delays for retries were short (500ms). As a result, parallelism of retries & event processing ramped up until the CPU was saturated, sending out at least 500+ un-cacheable requests per second. ## Actionables ### RESTBase * Don't propagate the cache-control header for access checks: https://github.com/wikimedia/restbase/pull/604 * change contentReq at https://github.com/wikimedia/restbase/blob/master/sys/parsoid.js#L420 to always use a / the same revision as is used for the parsoid request * look into stopping no-cache revision fetching as restriction update mechanism (avoids mis-match between revision & content storage) * Improve logging / error reporting from temporary dependency updates ### ChangeProp service * {T134456} * {T134483} ### Mobile Content Service * {T134538} ### Parsoid * investigate slow render of the pages involved here * consider setting up rate limits for repeat parses of the same domain / title / revision: {T134516} ## Alerts * Set up mail alerts to services@ for service_checker failures: {T134551} * In [node.pp](https://github.com/wikimedia/operations-pupp2et/blob/024f70d32a28a05237b109ccfc48103d3fa8ee04/modules/service/manifests/node.pp#L277) or [restbase.pp](https://github.com/wikimedia/operations-puppet/blob/b1a85262d3d6f6f526a66240125acb720b3bf9b3/manifests/role/restbase.pp#L22) ? * Set up alert / monitoring against LVS end point, and make it page services *and* ops; should only trigger if service overall is down (not individual hosts): {T134551}