
Ensure CAS errors aren't causing problems / trending up
Closed, ResolvedPublic

Description

We're seeing a few hundred CAS errors every day in production; make sure these aren't a symptom of a deeper problem.

csteipp@fluorine:/a/mw-log/archive$ zgrep -c CAS exception.log-201510[23]*.gz
exception.log-20151020.gz:146
exception.log-20151021.gz:134
exception.log-20151022.gz:158
exception.log-20151023.gz:313
exception.log-20151024.gz:1486
exception.log-20151025.gz:368
exception.log-20151026.gz:630
exception.log-20151027.gz:659
exception.log-20151028.gz:350
exception.log-20151029.gz:414
exception.log-20151030.gz:291
exception.log-20151031.gz:394

=> Logstash query: https://logstash.wikimedia.org/goto/ee0b4bf7c773c18981cae04d43a3e2d9

Event Timeline

csteipp raised the priority of this task to Needs Triage.
csteipp updated the task description.
csteipp added a project: Performance-Team.
csteipp added subscribers: csteipp, aaron.

@bd808, how hard would it be to add alerting for this log bucket, like you did for memcached?

The memcached monitor is set up in role::graphite::production::alerts, based on the data that Logstash adds to logstash.rate.mediawiki.memcached.ERROR.sum.
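
For reference, that kind of check is a graphite threshold alert. A rough sketch, assuming the monitoring::graphite_threshold define from the Wikimedia puppet repo; the resource title, thresholds, and time window here are made up, not the production values:

    # Hypothetical sketch modelled on the memcached error-rate check;
    # the title, thresholds, and window are illustrative only.
    monitoring::graphite_threshold { 'mediawiki-memcached-error-rate':
        description => 'MediaWiki memcached error rate',
        metric      => 'logstash.rate.mediawiki.memcached.ERROR.sum',
        from        => '10min',
        warning     => 1000,
        critical    => 5000,
    }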

We have a logstash.rate.mediawiki.exception.ERROR.sum graphite metric for exceptions in general, but nothing more granular than that for this use case. To use the existing alerting tools, we would need to either set up a new log channel for these on the MediaWiki side and start feeding it to Logstash, or add a special rule to our Logstash config to count this subset of the exception channel traffic in graphite.

Adding counting to logstash.rate.mediawiki.exception-CAS.ERROR.sum would look something like this in ::role::logstash:

    logstash::output::statsd { 'MW_exception_CAS_rate':
        host            => $statsd_host,
        # Only count mediawiki exception-channel events whose message mentions CAS.
        guard_condition => '[type] == "mediawiki" and [channel] == "exception" and [message] =~ "CAS"',
        namespace       => 'logstash.rate',
        sender          => 'mediawiki',
        # Yields counters like logstash.rate.mediawiki.exception-CAS.ERROR
        increment       => [ 'exception-CAS.%{level}' ],
    }
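
With that counting in place, the alerting half would presumably just be a matter of pointing a graphite threshold check, like the memcached one sketched above, at logstash.rate.mediawiki.exception-CAS.ERROR.sum.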
ori triaged this task as Low priority. Nov 9 2015, 7:49 PM
ori moved this task from Inbox to Backlog: Maintenance on the Performance-Team board.
ori set Security to None.
Krinkle added a subscriber: Krinkle.

Per https://logstash.wikimedia.org/goto/ee0b4bf7c773c18981cae04d43a3e2d9, there are currently about 100 hits per day of exceptions with "CAS update failed".

A few examples:

  • /wiki/Special:ConfirmEmail/** MWException from line 4045 of .../User.php: CAS update failed on user_touched for user ID '**' (read from replica);
  • /w/api.php MWException from line 4045 of .../User.php: CAS update failed on user_touched for user ID '**' (read from master);
  • /wiki/Special:Enroll/** MWException from line 4045 of .../User.php: CAS update failed on user_touched for user
  • /wiki/Special:ContentTranslation?page=**&from=**&to=**&targettitle=&campaign=interlanguagelink MWException from line 4045 of
  • /wiki/**?veaction=edit MWException from line 4045 of .../User.php: CAS update failed on user_touched for user ID '**' (read from replica);

What's the current status on this?

I'm well aware of that. I mean what's the status on somebody actually caring and doing something about it.

Krinkle claimed this task.

Many tasks around CAS errors have been filed and fixed. Aside from that, (as far as I know) the trend was due to an increase in strictness, with existing code violating the new, stricter rule. It wasn't a widespread regression.

As for this task, it's mainly phrased as "detect regressions in the future", which we already do through Icinga alerts for spikes in MediaWiki exceptions, as well as through Scap during deployments.

I've also recently filed T199479, which is about adding Icinga alerts for MediaWiki errors (non-fatal errors) as well. CAS errors, however, do result in exceptions, and those are already tracked by an Icinga alert.