Page MenuHomePhabricator

Reportupdater: do not write execution control files in source directories
Closed, ResolvedPublic

Description

Currently, reportupdater writes a set of execution control files to the "query folder" passed as a parameter.
The problem is that folder usually belongs to another repository and it's not a good practice to write files inside code-versioned folders.

Instead, reportupdater should write those files maybe in:

  • a home folder
  • a tmp folder
  • ???

For the record, the mentioned control files are:

  • A .reportupdater.pid file
  • A .reruns directory that contains one file for each call to rerun_reports.py that hasn't been read by update_reports.py yet.

Event Timeline

Milimetric moved this task from Dashiki to Incoming on the Analytics board.
Milimetric moved this task from Dashiki to Incoming on the Analytics board.
Nuria triaged this task as Medium priority.Apr 5 2018, 4:51 PM
Nuria lowered the priority of this task from Medium to Low.
Nuria moved this task from Incoming to Backlog (Later) on the Analytics board.

This looks fun. One quick improvement would be to rewrite the write and delete methods as a context manager, so that Python is able to release the PID lock in most cases, even after an unexpected exception.

I think .reportupdater.history logic was removed in 2851780a7594, I'll update the task description accordingly.

I'd like to understand better how this tool is used, so will read the docs. If it's actually run by cron, and production access is controlled by the Analytics team, then it might make sense to setuid, and store the PID file in /var/run. If people are free to run their own clone of the repo on production stat machines, then we would want per-user lockfiles rather than systemwide, amirite?

Hi @mforns , I am a first time contributor. I was looking at this task. While looking at the code, I saw that writing control files to home directory is pretty doable. So should I go ahead with this approach? Or do you want me to explore any further. Please opine.

mforns lowered the priority of this task from Low to Lowest.Aug 10 2020, 4:27 PM
mforns moved this task from Backlog (Later) to Smart Tools for Better Data on the Analytics board.

Change 623470 had a related patch set uploaded (by Paul Kernfeld; owner: Paul Kernfeld):
[analytics/reportupdater@master] Use pid library to manage Pidfile

https://gerrit.wikimedia.org/r/623470

paulkernfeld subscribed.

I saw this was marked with "good first issue," so I put together a patch for the pidfile portion of this.

@awight the patch does use /var/run and work as a context manager, as you suggested. I didn't put the files in the home directory because that looked like a behavior change from the previous behavior. My logic there was, if end users aren't requesting the ability for multiple OS users to be able to run the same query simultaneously, it would be safer to disallow it.

Nuria subscribed.

given that the missing pid package is now debianized let's deploy this code?

Change 623470 merged by Mforns:
[analytics/reportupdater@master] Use pid library to manage Pidfile

https://gerrit.wikimedia.org/r/623470

@paulkernfeld thanks for your work, woudl you be interested in doing more wikistats work (javascript) or you prefer these type of python projects?

No problem!

I would be happy to take a shot at Python or Wikistats work. If it helps in your planning, my goal is to spend like 3 hours per week on Wikimedia code.