Setup automated topk wide row reporting
Closed, Declined (Public)

Description

T94121 tracks the larger problem of making high growth revision histories more tractable. In the meantime, the remedy has been to reactively (manually) identify these partitions (typically after a query triggers an OOM), remove them, and (optionally) blacklist the title. Until a more permanent solution is in place, we should automate the aforementioned procedure, and provide early (earlier) warning of impending issues.

This issue proposes creating a script that periodically (weekly, tentatively) queries logstash for compaction-generated wide-row warnings, summarizes them, and publishes the results via email to the Services list.
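To illustrate the summarization step only, here is a minimal Python sketch. It is not the actual script. It assumes the warnings have already been fetched from logstash as plain message strings, and that each looks roughly like `Compacting large partition <keyspace>/<table>:<key> (<N> bytes)`; the exact wording depends on the Cassandra version. The regex and the names `top_k_partitions` and `format_report` are hypothetical.

```python
import heapq
import re
from typing import Dict, Iterable, List, Tuple

# ASSUMPTION: message shape of the compaction warning; adjust to match
# what the cluster actually logs.
WARNING_RE = re.compile(
    r"large partition (?P<table>\S+?):(?P<key>.+) \((?P<size>\d+) bytes\)"
)


def top_k_partitions(messages: Iterable[str], k: int = 10) -> List[Tuple[int, str, str]]:
    """Return the k largest partitions as (size, table, key), largest first.

    A partition that is warned about several times is counted once,
    using the largest size seen.
    """
    largest: Dict[Tuple[str, str], int] = {}
    for message in messages:
        match = WARNING_RE.search(message)
        if match is None:
            continue  # not a wide-row warning
        partition = (match.group("table"), match.group("key"))
        size = int(match.group("size"))
        largest[partition] = max(size, largest.get(partition, 0))
    ranked = ((size, table, key) for (table, key), size in largest.items())
    return heapq.nlargest(k, ranked)


def format_report(rows: List[Tuple[int, str, str]]) -> str:
    """Render the ranked rows as a plain-text table suitable for an email body."""
    lines = ["{:>15}  {}".format("SIZE (BYTES)", "PARTITION")]
    for size, table, key in rows:
        lines.append("{:>15}  {}:{}".format(size, table, key))
    return "\n".join(lines)
```

Keeping only the maximum size per partition means a row that is compacted repeatedly during the week shows up once in the report rather than crowding out other entries.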

Event Timeline

Eevans added subscribers: bd808, greg.

To summarize an IRC conversation with @bd808 and @greg, terbium would seem to be the best candidate to generate this report from.

Change 314772 had a related patch set uploaded (by Eevans):
Update firewall to allow terbium access to elasticsearch

https://gerrit.wikimedia.org/r/314772

Change 314772 merged by Dzahn:
logstash: let maintenance hosts connect to elasticsearch

https://gerrit.wikimedia.org/r/314772

A first stab at this can be found here: https://github.com/eevans/services-adhoc-reports/blob/master/report-topk-partion-sizes. It's still a bit rough, but WFM; review welcome.

I've sent one report out to services@wikimedia.org already (the result of a manual run on Friday), and I've since set it up to run via cron from terbium on Sunday evenings.

NOTE: This is currently running from a Git checkout in my home directory, and is kicked off by an entry in my crontab. This should be considered temporary; long-term this isn't acceptable. Pending review/feedback, this should be properly deployed, documented, and Puppetized.

mobrovac added a subscriber: mobrovac.

@Eevans should we close this or are you planning more work here?

Well, there was some question as to our level of commitment to this; ideally we'd solve the underlying issues, and save ourselves the noise, but that may still be a ways off.

Currently it's running out of my home directory and crontab on terbium. If it's something we plan to rely on for a while, then the code should probably be reviewed, moved over, and its invocation Puppetized.

@mobrovac wdyt?

Reviewing and puppetising the script shouldn't take too long. I will review the script a bit more, but at first glance it looks good :)

Change 328660 had a related patch set uploaded (by Mobrovac):
RESTBase-Cassandra: Add the topk reporter

https://gerrit.wikimedia.org/r/328660

We don't need this any more, so declining.

Change 328660 abandoned by Mobrovac:
RESTBase-Cassandra: Add the topk reporter

https://gerrit.wikimedia.org/r/328660