
various weekly and daily dumps run from systemd timers are broken
Open, High, Public

Description

Lots of errors in the logs like:

Apr 27 08:10:01 snapshot1008 systemd[1]: Started Regular jobs to build snapshot of page titles of main namespace.
Apr 27 08:10:01 snapshot1008 python3[1945]: usage: systemd-timer-mail-wrapper [-h] [-T MAIL_TO] [--only-on-error]
Apr 27 08:10:01 snapshot1008 python3[1945]: systemd-timer-mail-wrapper: error: unrecognized arguments: --configfile /etc/dumps/confs/wikidump.conf.dumps:monitor --filenameformat {w}-{d}-all-titles-in-ns-0.gz --outdir /mnt/dumpsdata/otherdumps/pagetitles/{d} --query 'select page_title from page where page_namespace=0;'
Apr 27 08:10:01 snapshot1008 systemd[1]: pagetitles-ns0.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 27 08:10:01 snapshot1008 systemd[1]: pagetitles-ns0.service: Failed with result 'exit-code'.

Reported by @Protsack.stephan who noticed that the daily page titles were not being produced; last day was Apr 23rd.

Event Timeline

ArielGlenn created this task.

This was caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/679292 which added arg parsing to the systemd timer wrapper script.

This means the wikidata entity dumps and the commons mediainfo dumps did not run this week, cc @hoo and @dcausse for a heads up.

jbond claimed this task.
jbond added a subscriber: jbond.

This has been fixed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/682922. systemd-timer-mail-wrapper has been updated so that its command argument uses nargs=argparse.REMAINDER. This means that the $command parameter needs to come last.
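
For context, a minimal sketch of why nargs=argparse.REMAINDER forces the wrapped command to come last. This is illustrative only, not the actual systemd-timer-mail-wrapper source; the script path is made up, but the wrapper options are the ones shown in the usage message above.

import argparse

parser = argparse.ArgumentParser(prog="systemd-timer-mail-wrapper")
parser.add_argument("-T", "--mail-to")
parser.add_argument("--only-on-error", action="store_true")
# REMAINDER collects everything from the first positional onward, flags included,
# so the wrapped command keeps options like --configfile intact.
parser.add_argument("command", nargs=argparse.REMAINDER)

# Wrapper options first, wrapped command last: works.
ok = parser.parse_args(
    ["--only-on-error", "/usr/local/bin/some-dump-script",
     "--configfile", "/etc/dumps/confs/wikidump.conf.dumps"])
print(ok.command)        # ['/usr/local/bin/some-dump-script', '--configfile', ...]

# Wrapped command first: REMAINDER also swallows --only-on-error,
# so the wrapper never sees its own option.
bad = parser.parse_args(
    ["/usr/local/bin/some-dump-script", "--configfile",
     "/etc/dumps/confs/wikidump.conf.dumps", "--only-on-error"])
print(bad.only_on_error) # False

In other words, any wrapper option placed after the command is silently absorbed into the command rather than flagged as an error, which is why the argument order in the systemd units matters.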

Folks cc-ed on this should decide whether their jobs ought to start later today rather than not run at all this week, and either kick them off or poke me to do it.

Note that the only way we found out about these usage errors from the systemd timer wrapper script was that a vigilant user of one of the output datasets happened to notice they weren't being produced. The error messages themselves just silently went into syslog with no one the wiser. We might want to think about better reporting that nonetheless doesn't generate piles of cronspam.

After discussion with jbond, reopening for further discussion on better alerting in case of failures.

@fgiunchedi there is a requirement to forward a subset of icinga alerts to a different set of users, either by sending to an email address or by something fancier like push notifications.

As a starting point it would be good to forward "Check systemd state" alerts relating to snapshot servers to ops-dumps@wikimedia.org. Before I start digging into puppet I thought I would ping you, as I think this may be something that's better handled in alertmanager?

Thanks for reaching out! We can definitely route the icinga alerts that show up in AM (i.e. alertname: Icinga/Check systemd state on snapshot hosts) to email ops-dumps@wikimedia.org. The "authoritative source" for the alert would still be icinga (and puppet, of course) for now. Happy to help with the review/setup.

What are the next steps on this? Should I be tweaking a manifest someplace?

jbond removed jbond as the assignee of this task. Jun 1 2021, 2:53 PM
jbond added a project: User-jbond.

@fgiunchedi I notice that in some cases phab tasks are autocreated when systemd units fail. Is that true for systemd jobs on snapshot hosts? Could we get tagged on those (Dumps-Generation) or could we get emails from those (ops-dumps@wm.o)?

Yes you can! The easiest would be to add a section to Alertmanager routing for team=core-platform alerts, and decide what to do depending on the alert and/or its severity. A good starting point is this: https://wikitech.wikimedia.org/wiki/Alertmanager#I'm_part_of_a_new_team_that_needs_onboarding_to_Alertmanager,_what_do_I_need_to_do? and please reach out if you run into any snags!
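
For reference, a hedged sketch of what such a route could look like in plain upstream Alertmanager YAML. The receiver name and matchers here are illustrative; the actual change would go through the puppet/Alertmanager onboarding process described on the wikitech page above.

# Illustrative upstream-style Alertmanager fragment, not the actual Wikimedia config.
route:
  receiver: default
  routes:
    - match:
        team: core-platform
      match_re:
        alertname: 'Icinga/Check systemd state.*'
        instance: 'snapshot.*'          # snapshot hosts only
      receiver: ops-dumps-email

receivers:
  - name: default                       # existing default receiver, kept as-is
  - name: ops-dumps-email
    email_configs:
      - to: 'ops-dumps@wikimedia.org'

A similar route could instead open Phabricator tasks tagged Dumps-Generation if that receiver type is available in the Wikimedia setup, per the autocreated-task question above.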