Page MenuHomePhabricator

Investigate 2017-08-02 Save Timing regression (+40-60%)
Closed, ResolvedPublic

Description

On 2017-08-02 around 19:00 UTC, Backend Save Timing (p75) rapidly started regressing by 40-60% from a stable ~450ms (±50ms) to 700-900ms.

https://grafana.wikimedia.org/dashboard/db/save-timing?refresh=5m&orgId=1&from=1500879593680&to=1501791224192


2017-08-02
18:11 thcipriani@tin: Synchronized wmf-config/InitialiseSettings.php: SWAT: JobQueueEventBus: Enable job events in group0 wikis T163380 Part I (duration: 00m 47s)
18:13 thcipriani@tin: Synchronized wmf-config/jobqueue.php: SWAT: JobQueueEventBus: Enable job events in group0 wikis T163380 Part II (duration: 00m 47s)
18:18 gehel: un-banning and repooling elastic102[4567] - T168816
18:18 thcipriani@tin: Synchronized php-1.30.0-wmf.12/includes/specials/SpecialUndelete.php: SWAT: Fix Special:Undelete search - use variable and not request param (duration: 00m 46s)
18:20 gehel@puppetmaster1001: conftool action : set/pooled=yes; selector: name=elastic10(24|25|26|27).eqiad.wmnet
18:21 thcipriani@tin: Synchronized wmf-config/InitialiseSettings-labs.php: SWAT: beta-only change Enable HTML5 sections in betalabs (duration: 00m 46s)
18:41 thcipriani@tin: Synchronized php-1.30.0-wmf.12/includes/specials/SpecialRecentchanges.php: SWAT: Follow-up 31be7d0: send tags list if experimental mode is disabled (duration: 00m 47s)
19:23 twentyafterfour: group1 wikis to 1.30.0-wmf.12 refs T168053
19:24 twentyafterfour@tin: rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.12 refs refs T168053
19:50 twentyafterfour@tin: rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.11 refs T168053 - rollback due to T172320

The roll-out of 1.30.0-wmf.12 seems like the obvious cause, but note that the regression hasn't reversed, despite the branch having been rolled back within 30 minutes.

Event Timeline

Krinkle created this task.Aug 3 2017, 8:16 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 3 2017, 8:16 PM
Krinkle updated the task description. (Show Details)Aug 3 2017, 8:22 PM
mmodell added a subscriber: mmodell.Aug 4 2017, 7:58 AM

strange.. the rollback should have had an effect, I would think..

greg added a subscriber: greg.Aug 4 2017, 9:27 PM

@Krinkle Let us know on Monday if you have any new causal theories and whether or not we should go forward with wmf.13 next week.

Krinkle added a comment.EditedAug 4 2017, 10:49 PM

@Krinkle Let us know on Monday if you have any new causal theories and whether or not we should go forward with wmf.13 next week.

There's a problem with Wikibase/Wikidata that has gotten worse two weeks in a row. I'd suggest we not move forward until that situation is resolved given its impact on all of JobQueue, Database load/lag, and cross-wiki interactions. The problems in question relate to InjectRCJob (which is how Wikidata makes recent change rows appear across all wikis) and ChangeNotification (which is how it triggers mass re-parses and purges for lots of pages across wikis). I am not sure of the potential impact that is having.

And the additional regressions this week:

Meanwhile, correlation is harder at the moment given other on-going issues, such as T171371 (and the inability to deploy code changes to Jobrunner due to T129148).

As for the actual save timing regression, it seems quite severe. I haven't been able to find an exact correlation with something that would explain it, but I'll keep looking.

Recording a profile to XHGui for a simple edit on www.mediawiki.org., shows a significant amount of time being spent in AbuseFilter (1s out of 4s).

https://performance.wikimedia.org/xhgui/run/view?id=598529873f3dfafe4714c38c

functioncall countinclusive wall time
[total] main()14.2s
MediaWiki::performAction13.1s
EditPage::edit13.1s
- EditPage::internalAttemptSave13.0s
-- WikiPage::doEditContent1800ms
-- EditPage::runPostMergeFilters11850ms
--- AbuseFilterHooks::onEditFilterMergedContent11322ms
functioncall countinclusive wall time
AbuseFilterHooks::onEditFilterMergedContent11322ms
- AbuseFilterHooks::filterEdit11311ms
-- AbuseFilter::filterAction11162ms
--- AbuseFilter::checkAllFilters11019ms
---- AbuseFilter::checkFilter45959ms
----- AbuseFilter::checkConditions45948ms
------ AbuseFilterParser::parse45931ms
------- AbuseFilterParser::intEval45929ms

When comparing aggregated Flame Graphs from Aug 4 to Aug 1, however, this doesn't stand out at all, so might be a red herring.

@Krinkle: Any update? I'm treating this as a blocker for wmf.13 unless I'm told otherwise.

Another profile

Another profile, this time for an edit on en.wikipedia.org instead of mediawiki.org, and logged-out.

Again. So it's wasn't perhaps the red-herring I thought it to be. Given no recent code changes in mediawiki/extensions/AbuseFilter.git, and no seemingly-relevant configuration changes in operations/mediawiki-config.git, I decided to take a look at recent modifications to AbuseFilters that apply globally to all wikis (maintained on meta.wikimedia.org). More about that later.

Re-cap timeline


https://grafana.wikimedia.org/dashboard/db/save-timing?refresh=5m&orgId=1&from=1501603584703&to=1502146344491

2017-08-02 18:30 UTCRegression starts sharply.
2017-08-05 12:15 UTCMetric is recovering...
2017-08-05 13:15 UTCRegression gone entirely – RECOVERY
2017-08-07 10:30 UTCRegression starts again.
2017-08-07 17:00 UTCRegression gets worse (same level as Aug 2-Aug 5).

2017-08-02:
[..]
18:11 thcipriani@tin: Synchronized wmf-config/InitialiseSettings.php: SWAT: JobQueueEventBus: Enable job events in group0 wikis Part I (duration: 00m 47s)
[..]
19:23 twentyafterfour: group1 wikis to 1.30.0-wmf.12 refs
19:50 twentyafterfour@tin: rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.11 refs - rollback due to
[..]
2017-08-05:
14:40 Reedy: created oauth tables on foundationwiki
14:13 reedy@tin: Synchronized php-1.30.0-wmf.12/extensions/WikimediaMaintenance/createExtensionTables.php: add oauth (duration: 00m 48s)
2017-08-05:
[..]
16:44 marostegui: Restart s7 instance on db1069 to pick up new replication filters
17:56 jynus: stopping slave and reparitioning db1098
17:10 marostegui: Restart s7 instance on db1102 to pick up new replication filters
17:05 gehel@tin: Finished deploy [wdqs/wdqs@da33919]: (no justification provided) (duration: 02m 28s)
17:02 gehel@tin: Started deploy [wdqs/wdqs@da33919]: (no justification provided)
[..]

There are events happening around these points in time (copied above from SAL), and almost no events outside these times. However, given the unrelated nature of these events, unless the mere act of touching the servers in any way causes the regression to turn itself off and on, I'm gonna conclude this isn't deployment-initiated.

Looking at the percentiles and edit counts more closely:

Looks like these regressions may just be the result of a particular bot process rapidly (large) submitting a number of potentially slow-parsing edits.

AbuseFilter

There certainly appears to be correlation to the Aug 2 and Aug 7 times where regressions started:

At this point I'm pondering between two three conclusions:

  1. The vandalism burst was composed of edits that were slow to parse. The burst explains the increase in edit-count. If we assume the vandalism edits were mostly slow to parse, we could explain the regression in p75 and p95 backend save timing as being a skew caued by those edits. This would mean other "regular" edits were unaffected.
  2. Or, the vandalism burst caused only the edit count increase. The introduction of a new AbuseFilter is what caused edits to be slowed down as a result of the overhead AbuseFilter imposes on needing to inspecting every single edit. In this scenario, AbuseFilter is responsible for the regression and affected (is affecting) all edits.
  3. Or, neither is solely responsible. The increases edit rates, and burst of vandalism edits that trigger AbuseFilter, cause an increased app server or database load that indirectly causes other edits to be slowed down. Number 3 is unlikely given I see no significant load increase (CPU, loadavg, Procs) on the appservers cluster.

Given I see a significant amount of time being spent in AbuseFilter for my normal test edits as well, I'm inclined to assume number 2.
The problem is AbuseFilter and it affects all edits.

Krinkle triaged this task as High priority.Aug 7 2017, 11:33 PM

I wonder how difficult it would be to add some instrumentation to the abusefilter global rules editor. I would think that it would be really helpful if any changes to the global abusefilter rules would get marked with a vertical line on performance graphs.

Removed from blocking status given that it appears to be unrelated to code deployments.

Krinkle closed this task as Resolved.Aug 17 2017, 1:32 AM
Krinkle claimed this task.

The individual AbuseFilter that was slow and toggled on and off at various points during the last 3 weeks, has currently been disabled again it seems. While that's good for performance, it does not address the wider problem which is that AbuseFilter has the capacity to double the time it takes to save an edit, which is unacceptable.

There are two problems here:

  • Time it takes to parse abuse filters.
  • Time it takes to execute abuse filters.

The parsing is currently taking up the majority of time (931ms spent in AbuseFilter parsing, out of of 1322ms spent in AbuseFilter overall, out of 3.0s spent saving the edit). It turns out this is not a new problem. A new "cacheable" parser was implemented over a year ago, but remains disabled in production due to various bugs found in its logic. Tracked at T156095: Re-enable AbuseFilterCachingParser once we are sure it's safe.

The execution time is still pretty significant at 400ms of this sample request (for a simple edit, at a time the the slow filter was enabled). The good news is that we already have complexity limits and those are enforced on a global level, not per-filter. Which means we do already account for various filters combined adding up to high complexity (even if individual filters have low complexity). The down side is that this isn't currently validated at the time filters are modified, but rather at run-time when edits are saved. It also doesn't appear to communicate this problem in any way to admins or editors that maintain the filters. It just abruptly ends abusefilter execution, letting the edit pass through (?), and writing something to the debug log for developers (not enabled in prod?).

Validating it at the time AbuseFilter saves filter modifications is tricky, I suppose, given global filters existing as well. (When editing local filters, we can include global filter combined complexity, but when editing global filters we can't). Might be worth giving them separate thresholds.

Related AbuseFilter tasks that touch on performance:

Main short-term priority and benefit seems to be at T156095: Re-enable AbuseFilterCachingParser once we are sure it's safe,

Krinkle moved this task from Inbox to Doing on the Performance-Team board.Aug 17 2017, 1:32 AM