
Restoring the daily traffic anomaly reports
Closed, Resolved · Public

Description

(Following up on some recent offline conversations with @Slaporte and @elukey :)

  • These reports have been generated and emailed by a daily cronjob running on stat1005 under the user account "jdcc", which formerly belonged to @Jdcc-berkman (the contractor who implemented this project back in 2016).
  • The last email we received dates from November 9, around the time when stat1005 was deactivated and its contents moved to stat1007.
  • According to @elukey, it should not be too hard to restore this cronjob on stat1007, preferably moving it to the home directory of an active user, e.g. @Slaporte or myself.
  • Furthermore, @elukey noted that best practice for a regular data job such as this would be to add it to our Puppet repo, which would also have made it easier for Analytics Engineering to take care of it during the stat1005 --> stat1007 migration. @Slaporte / the Legal team supports making the code public (large parts might already be), but might want to review its public description for communications purposes.

(Note that this task is simply about making these daily internal email reports work again, not about the separate endeavour to create a public version that was discussed in e.g. T183291.)

Event Timeline

elukey triaged this task as Medium priority. Feb 6 2019, 6:35 AM
elukey added a project: User-Elukey.
elukey added a subscriber: Nuria.

I think that the first step, if this job is important, could be to restore the cronjob under the crontab of a user who will actively maintain it. The second step is to discuss how important this job is in the longer term, and whether a different approach is worth investigating (for example, having the data end up on a Superset dashboard rather than being sent via email).

@Tbayer IIRC you had the crontab commands that were executed on stat1005; can you post them here? If you want, I can copy jdcc's files to your home directory so we can restart the cron.

I think that the first step, if this job is important, could be to restore the cronjob under the crontab of a user who will actively maintain it. The second step is to discuss how important this job is in the longer term, and whether a different approach is worth investigating (for example, having the data end up on a Superset dashboard rather than being sent via email).

Agreed, but that second step should probably be a separate ticket. For now we're just focusing on restoring the status quo from before the server migration, while limiting the effort required from everyone involved.

@Tbayer IIRC you had the crontab commands that were executed on stat1005; can you post them here?

No, I don't have them, and I did not seem to have the necessary user rights to view them when I checked on stat1005 earlier. I was assuming this would be easy for someone with root (using crontab -u jdcc -l or similar)?
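
For reference, someone with root should be able to list and save that crontab with something like the following (a hypothetical example, not actual output from the servers):

sudo crontab -u jdcc -l
sudo crontab -u jdcc -l > /tmp/jdcc-crontab-backup.txt    # keep a copy for reference; the path is arbitrary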

If you want, I can copy jdcc's files to your home directory so we can restart the cron.

It actually seems the cronjob is still running on stat1007 and generating reports every day - the latest can be found at /home/jdcc/project_monitoring/reports/2019-02-06.html and looks fine. Apparently the script just fails to email these out as it used to do on stat1005.

Can we try to find out what broke during the server migration, and fix it? Just a wild guess: Does the OS (Linux version) on stat1007 differ from the one we had on stat1005? (I seem to recall having to replace mailx with heirloom-mailx in some of my own scripts a while ago for such a reason.)
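
If it helps narrow things down, a quick check on stat1007 could look something like this (just a sketch; I don't know which mail client the report script actually relies on, and the address below is a placeholder):

which mail mailx heirloom-mailx bsd-mailx    # see which mail clients are installed
echo "test from stat1007" | mail -s "stat1007 mail test" someone@example.org    # basic mailx-style test message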

This is jdcc's crontab (I thought it was no longer an active user, but I was wrong):

0 15 * * * USER=jdcc /home/jdcc/anaconda3/bin/python /home/jdcc/project_monitoring/scripts/check_projects.py > /home/jdcc/project_monitoring/reports/`date -Idate`.html 2>> /home/jdcc/errors.log
0 10 * * * USER=jdcc /home/jdcc/anaconda3/bin/python /home/jdcc/article_monitoring/scripts/check_articles.py > /home/jdcc/article_monitoring/reports/`date -Idate`.html 2>> /home/jdcc/errors.log

I don't see any mail command used in either script, so maybe it was configured differently on stat1005?

I only set up the reports to be dumped to disk. I didn't know they were getting emailed out, so that must have been handled somewhere else.

@Slaporte: if these scripts are important, let's please find a person responsible for maintaining them. They are very sparsely documented, and as you can see it is not even clear who set up the emailing, when, and how. I would suggest documenting them better on Wikitech (or GitHub) and naming a maintainer for issues like this one going forward.

This is jdcc's crontab (I thought it was no longer an active user, but I was wrong):

Me too (based on T183291)...

0 15 * * * USER=jdcc /home/jdcc/anaconda3/bin/python /home/jdcc/project_monitoring/scripts/check_projects.py > /home/jdcc/project_monitoring/reports/`date -Idate`.html 2>> /home/jdcc/errors.log
0 10 * * * USER=jdcc /home/jdcc/anaconda3/bin/python /home/jdcc/article_monitoring/scripts/check_articles.py > /home/jdcc/article_monitoring/reports/`date -Idate`.html 2>> /home/jdcc/errors.log

I don't see any mail command used in either script, so maybe it was configured differently on stat1005?

I just did some further digging, and it seems that the part that broke in the server migration was actually this script: /home/slaporte/send_report.sh.
It still works fine when launched manually (I just ran a copy from my account and received the resulting email report), so we may want to look at slaporte's crontab instead.
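
(For illustration only, since the actual contents of send_report.sh are not posted here: a wrapper like that could be as simple as the following sketch, assuming a mailx-style client and a placeholder recipient address.)

#!/bin/sh
# Hypothetical sketch of a report-mailing wrapper; not the actual send_report.sh.
REPORT=/home/jdcc/project_monitoring/reports/$(date -Idate).html
RECIPIENT="someone@example.org"    # placeholder; the real recipient list is not shown in this task
if [ -f "$REPORT" ]; then
    mail -s "Daily traffic anomaly report $(date -Idate)" "$RECIPIENT" < "$REPORT"
fi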

So slaporte's crontab on stat1007 is the following:

46 23 * * * sh /home/slaporte/send_report.sh

That is every day at 23:46. I copied it to my home directory and tested it, and it seems to work fine (namely, I receive emails), even when executed as user slaporte. So I am not sure where the problem is...

I did some tests swapping in me and Francisco as recipients, and we both received the email. I am not sure whether @Slaporte added the cron yesterday or not, but everything should work fine. I just added myself to the recipients so I can see whether I get the email, and I also added a "MAILTO" with my address so that I'll be notified in case of errors.
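
For reference, a crontab with MAILTO set for error notifications looks roughly like this (placeholder address; the actual entry may differ):

MAILTO=someone@example.org
46 23 * * * sh /home/slaporte/send_report.sh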

I did some tests swapping in me and Francisco as recipients, and we both received the email. I am not sure whether @Slaporte added the cron yesterday or not,

Yes, that is what happened - yesterday @Slaporte and I sat down to look into this. He managed to retrieve the instructions he had received from @ZhouZ back when this was handed over, including the exact crontab data with which we restored the job. @Slaporte was going to post an update here but I think he wanted to wait until after the first successful run (which has now occurred). Sorry that we ended up duplicating some work on this.

Tbayer claimed this task.

Closing this task now as the first regular daily report has arrived again and looks fine.

As mentioned in the task description, and given the importance of the report as assessed by the Legal team, it would certainly be preferable to productionize it at some point, apart from the other modernizations that have been envisaged. But that should be a different ticket.

Even though the report was just based on two users' cronjobs, I would be interested in understanding what caused the disruption here and how it could have been avoided - especially considering that one of the two jobs continued to run without problems after the migration while the other one failed.

I did some tests swapping in me and Francisco as recipients, and we both received the email. I am not sure whether @Slaporte added the cron yesterday or not,

Yes, that is what happened - yesterday @Slaporte and I sat down to look into this. He managed to retrieve the instructions he had received from @ZhouZ back when this was handed over, including the exact crontab data with which we restored the job. @Slaporte was going to post an update here but I think he wanted to wait until after the first successful run (which has now occurred). Sorry that we ended up duplicating some work on this.

Yes, that's right. Sorry for the late update. I created the crontab yesterday, and it all appears to run fine. Let me know if there is anything I can do to have my crontab moved in future server changes. Thanks for the help, @Tbayer!

No problem! Thanks for clarifying :)

Even though the report was just based on two users' cronjobs, I would be interested in understanding what caused the disruption here and how it could have been avoided - especially considering that one of the two jobs continued to run without problems after the migration while the other one failed.

I think I know the answer. When we (Analytics) migrated users from stat1005 to stat1007, I sent several emails to alert people about the move, especially users with entries in their crontabs. I sent an email with the subject "Crons running on stat1005 likely to be moved to stat1007", but I didn't verify that all recipients answered me. I believe that @Slaporte didn't move his crontab to stat1007, but @Jdcc-berkman did. That would explain the mystery :)

No problem! Thanks for clarifying :)

Even though the report was just based on two users' cronjobs, I would be interested in understanding what caused the disruption here and how it could have been avoided - especially considering that one of the two jobs continued to run without problems after the migration while the other one failed.

I think I know the answer. When we (Analytics) migrated users from stat1005 to stat1007, I sent several emails to alert people about the move, especially users with entries in their crontabs. I sent an email with the subject "Crons running on stat1005 likely to be moved to stat1007", but I didn't verify that all recipients answered me. I believe that @Slaporte didn't move his crontab to stat1007, but @Jdcc-berkman did. That would explain the mystery :)

I see; that sounds like a reasonable way to notify cronjob owners. IIRC @Slaporte was also reaching out to @Jdcc-berkman back in December about the missing reports, but I guess there were too many moving parts.

After seeing a related ticket, I would still highly recommend puppetizing this cron job. That will make sure that the next time the server is moved there will be no need for reminder emails and no confusion about who moved it; it will just work. It would also be more compatible with L3.

These reports will be turned off in favor of the new ones running distributed on the cluster; please see: https://gerrit.wikimedia.org/r/#/c/analytics/refinery/source/+/561674/

Once the new reports are active, I think this directory can probably be deleted entirely. Traffic team to confirm.