Ingest logs from scheduled maintenance scripts at WMF in Logstash
Open, Needs Triage, Public

Description

Why

Some benefits:

  • Telemetry into failures and progress of jobs, such as would have been useful during T282761.
  • Less disruption to human consumers after DCs are switched. Same place as before.
  • Not losing logs when the server is upgraded/re-imaged.
  • Not losing logs when server is lost.
  • Not losing logs when we ensure => absent a job, as happened today.
From #wikimedia-sre

19:48 <Krinkle> legoktm: huh, so.. /var/log/mediawiki/mediawiki_job_parser_cache_purging/ got rm'rfed by puppet I think.
[…]
19:53 <Krinkle> ugh, I was hoping these logs were just rotated plainly with the last entries kept as-is when nothing new is added.
19:57 <kormat> oh damn. the loss of the old log is unfortunate.

How

Not sure. I believe we use rsyslog for this in other places, which seems like a good fit. It would mean the scripts can continue to work "as expected" via stdout when invoked manually (including in production with production configuration), with no runtime coupling to Kafka or Elasticsearch, and with all the batching/connectivity/retrying handled by a local buffer first.
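
As a rough illustration of that split (and not a description of any existing WMF wrapper), here is a Python-flavoured sketch of a job logger that writes to stdout as before and also hands each line to the local syslog socket, leaving forwarding, buffering and retrying to rsyslog; the job name and message below are made up:

```python
# Hypothetical sketch: emit job output both to stdout (unchanged behaviour when
# run manually) and to the local syslog socket, leaving forwarding to rsyslog.
import logging
import logging.handlers
import sys

def build_logger(job_name):
    logger = logging.getLogger(job_name)
    logger.setLevel(logging.INFO)
    # Terminal/cron output stays exactly as it is today.
    logger.addHandler(logging.StreamHandler(sys.stdout))
    # Local syslog socket; rsyslog decides where messages go from here, so the
    # script itself has no runtime coupling to Kafka or Elasticsearch.
    syslog = logging.handlers.SysLogHandler(address="/dev/log")
    syslog.setFormatter(logging.Formatter(job_name + ": %(message)s"))
    logger.addHandler(syslog)
    return logger

log = build_logger("mediawiki_job_example")
log.info("purged 10000 parser cache rows")
```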

What

  • Decide how.
  • Make it work.
  • Update/write documentation on Wikitech for how to tail logs of scheduled maintenance scripts.

Event Timeline

Aklapper renamed this task from Ingest logs from scheduled maintenance scripts at WMF in Logtash to Ingest logs from scheduled maintenance scripts at WMF in Logstash. Jun 30 2021, 7:29 PM

I believe we use rsyslog for this in other places, which seems like a good fit.

Would that result in a single entry in Logstash, or one per output line? I sometimes run into that issue with rsyslog, but I'm not sure whether it's an inherent limitation or just something being misconfigured. Breaking up stack traces into separate log entries makes debugging very cumbersome.

@Tgr I'm not sure what the default behaviour is when these various layers are combined as-is. It wouldn't surprise me if it weren't able to distinguish between separate logical messages or chunks. I'll also note that making that distinction based on stdout/stderr alone can be quite tricky. For a lot of maintenance scripts, it may suffice to consider the entire process as producing one message. Certainly for error reporting I imagine that would suffice in all cases, since by default we're talking only about fatal errors.

However, this task as written has a wider scope than error reporting. For longer-running scripts it would be useful to have the standard output stream in Logstash as well (e.g. the progress of the parser cache purge), and that is by definition only useful if the stream is flushed regularly; line breaks would be the de facto default way to split the log stream into messages on its way from rsyslog to Logstash.

For MediaWiki core and php-wmerrors we currently "accomplish" line breaks by using JSON-formatted syslog entries (@cee), which naturally encode them. It seems impractical, though, to intervene on stderr and translate it into such messages; I imagine this would instead be handled at a higher level somewhere. Perhaps this is something systemd can do based on its knowledge of a process having ended, providing the stderr buffer as a single log message. Or, alternatively, the rsyslog collector may be able to use something other than a line break as the message separator: look for the standard prefix of each logical error-log entry and fold subsequent lines into the same message. (I hope this is a solved problem in 2022, but it also wouldn't surprise me if the internals of these layers involve fixed-size, undebounced write/read buffer loops that make it difficult to use a separator other than a line break without producing incomplete messages in race conditions.)
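
For illustration, a hedged sketch of what such an @cee-style entry could look like (the tag, field names and trace below are made up, not the schema MediaWiki actually emits); because the payload is JSON-encoded, the embedded newlines travel inside one syslog message instead of being split into separate Logstash entries:

```python
# Illustrative only: an "@cee:"-prefixed, JSON-formatted syslog message whose
# multi-line stack trace stays inside a single logical log entry.
import json
import logging
import logging.handlers

trace = "PHP Fatal error: something broke\n#0 /srv/mediawiki/maintenance/example.php\n#1 {main}"
payload = {"message": "maintenance script failed", "trace": trace}

logger = logging.getLogger("cee-example")
logger.setLevel(logging.ERROR)
handler = logging.handlers.SysLogHandler(address="/dev/log")
handler.setFormatter(logging.Formatter("mediawiki_job_example: @cee: %(message)s"))
logger.addHandler(handler)
logger.error(json.dumps(payload))
```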

We (o11y) discussed this task at the team meeting yesterday. My understanding is that at present the unattended job runs (i.e. via systemd timers) are logged only locally in /var/log/mediawiki/*/syslog.log due to systemd::timer::job usage by profile::mediawiki::periodic_job (?)

The simplest path forward (and please excuse me if I'm stating the obvious) would be to start sending all programs prefixed with mediawiki_job to Kafka/Logstash (by setting $.log_outputs; see for example modules/trafficserver/files/rsyslog.conf). The multiline issue would still be there, but there would be no data loss and the logs would be available in Logstash.

Output from running jobs interactively (i.e. stdout to the terminal) still wouldn't be captured, though; for that I think (off the top of my head) we could tee to syslog in the wrapper that runs the jobs.
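
A rough sketch of that teeing idea, assuming a hypothetical Python wrapper (the name, syslog tag and invocation are illustrative only): each line of the job's output still reaches the terminal and is also handed to the local syslog socket, one message per line:

```python
# Hypothetical wrapper: run a job, echo its output to the terminal as usual,
# and also forward each line to the local syslog socket for rsyslog to pick up.
import logging
import logging.handlers
import subprocess
import sys

def run_and_tee(job_name, command):
    syslog = logging.handlers.SysLogHandler(address="/dev/log")
    syslog.setFormatter(logging.Formatter(job_name + ": %(message)s"))
    logger = logging.getLogger(job_name)
    logger.setLevel(logging.INFO)
    logger.addHandler(syslog)

    proc = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in proc.stdout:
        sys.stdout.write(line)           # unchanged interactive output
        logger.info(line.rstrip("\n"))   # one syslog message per output line
    return proc.wait()

if __name__ == "__main__":
    sys.exit(run_and_tee("mediawiki_job_example", sys.argv[1:]))
```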

HTH!

Change 800683 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[operations/puppet@production] Log output of scheduled MediaWiki maintenance scripts

https://gerrit.wikimedia.org/r/800683

Change 800683 merged by RLazarus:

[operations/puppet@production] Log output of scheduled MediaWiki maintenance scripts

https://gerrit.wikimedia.org/r/800683

The patch does not work; see @Joe's comment there.

@RLazarus are you able to help us get a working patch for this? I am not well-versed enough in the puppet config to know where to make the fix.

Random ping! Would someone be interested in pushing this forward?
Absolutely not urgent, just asking in case it fell off people's radars by accident.

Not sure why this task was marked "Incident followup".

Very much in support of this Task. It can help prevent issues such as those described at T187101 (although the alerts would still be great for the most sensitive ones).