
Make a nag system to email maintainers of tools still running on precise grid hosts
Closed, ResolvedPublic

Description

We need to create a nag system that looks at the precise job runners, makes a list of running processes, maps them to tools, and emails maintainers.

Just looking at the running jobs may not be the most effective way to notice cron jobs, so we may instead want to data mine the EventLogging data (do we still have that?) or add some new instrumentation to jsub to find out who to pester.

Event Timeline

bd808 created this task. Oct 26 2016, 5:17 PM
Restricted Application added a subscriber: Aklapper. Oct 26 2016, 5:17 PM
scfc added a subscriber: scfc. Dec 1 2016, 10:13 PM

The information is available in /var/lib/gridengine/default/common/accounting:

scfc@tools-bastion-03:~$ tail -5 /var/lib/gridengine/default/common/accounting 
task:tools-exec-1413.tools.eqiad.wmflabs:tools.apersonbot:tools.apersonbot:apersonbot-botreq-status:588087:sge:0:1480629612:1480629613:1480629719:0:0:106:6.895261:3.411877:34568.000000:0:0:0:0:47195:0:0:0.000000:40:0:0:0:42569:5580:NONE:defaultdepartment:NONE:1:0:10.307138:0.867204:0.008516:-u tools.apersonbot -q task -l h_vmem=524288k,release=trusty:0.000000:NONE:194379776.000000:0:0
task:tools-exec-1416.tools.eqiad.wmflabs:tools.hazard-bot:tools.hazard-bot:d-rfbotstatus:588147:sge:0:1480629620:1480629621:1480629722:0:0:101:1.871000:5.699957:24120.000000:0:0:0:0:16186:0:0:0.000000:32:0:0:0:51896:1394:NONE:defaultdepartment:NONE:1:0:7.570957:0.583231:0.004743:-u tools.hazard-bot -q task -l h_vmem=524288k,release=trusty:0.000000:NONE:202141696.000000:0:0
task:tools-exec-1408.eqiad.wmflabs:tools.fiwiki-tools:tools.fiwiki-tools:rvv:588228:sge:0:1480629721:1480629722:1480629722:0:0:0:0.052594:0.070787:18636.000000:0:0:0:0:6306:0:0:0.000000:16:0:0:0:99:40:NONE:defaultdepartment:NONE:1:0:0.123381:0.000000:0.000000:-u tools.fiwiki-tools -q task -l h_vmem=786432k,release=trusty:0.000000:NONE:0.000000:0:0
task:tools-exec-1415.tools.eqiad.wmflabs:tools.perfectbot:tools.perfectbot:ListSpeedyDeletions:588225:sge:0:1480629721:1480629722:1480629723:0:0:1:0.355476:0.058414:25120.000000:0:0:0:0:7663:0:0:0.000000:16:0:0:0:109:204:NONE:defaultdepartment:NONE:1:0:0.413890:0.027514:0.001640:-u tools.perfectbot -q task -l h_vmem=524288k,release=trusty:0.000000:NONE:184647680.000000:0:0
task:tools-exec-1405.eqiad.wmflabs:tools.fiwiki-tools:tools.fiwiki-tools:wikidatakuvat:588229:sge:0:1480629721:1480629722:1480629724:0:0:2:0.093074:0.075216:17956.000000:0:0:0:0:6122:0:0:0.000000:24:0:0:0:109:19:NONE:defaultdepartment:NONE:1:0:0.168290:0.022568:0.000129:-u tools.fiwiki-tools -q task -l h_vmem=786432k,release=trusty:0.000000:NONE:372801536.000000:0:0
scfc@tools-bastion-03:~$

The second field seems to be the execution node, the third and fourth the tool account, and the fifth the job name. There were 201 Precise jobs in the last 10000 jobs:
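As a rough illustration of the field positions described above, the following Python sketch splits one accounting record on colons and picks the fields out by index. The function name `parse_record` is made up for this example, and the sample line is a shortened copy of one of the records shown above, so treat it as a sketch rather than a complete parser.

```python
def parse_record(line):
    """Return host, group, owner and job name from one accounting line."""
    parts = line.rstrip("\n").split(":")
    return {
        "hostname": parts[1],  # second field: execution node
        "group": parts[2],     # third field: group of the tool account
        "owner": parts[3],     # fourth field: tool account that owns the job
        "job_name": parts[4],  # fifth field: job name
    }


# Shortened copy of a record from the output above (trailing fields dropped).
sample = (
    "task:tools-exec-1408.eqiad.wmflabs:tools.fiwiki-tools:"
    "tools.fiwiki-tools:rvv:588228:sge:0:1480629721"
)
record = parse_record(sample)
print(record["hostname"], record["owner"], record["job_name"])
```

Indexing by position keeps the example short; a mapping from field names to positions would be easier to maintain if more fields are needed.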

scfc@tools-bastion-03:~$ tail -10000 /var/lib/gridengine/default/common/accounting | grep tools-.*-12 | cut -d : -f 4 | sort | uniq -c | sort -n
      1 tools.tsreports
      1 tools.ytcleaner
      2 tools.dexbot
      7 tools.wikitrends
      9 tools.nlwikibots
     10 tools.merlbot2
     12 tools.veblenbot
     13 tools.dplbot
     14 tools.dewikinews-rss
     14 tools.random-featured
     17 tools.toolschecker
     25 tools.vcat
     76 tools.avicbot
scfc@tools-bastion-03:~$

(24107 in the last million. Picking third field gives www-data instead of tools.toolschecker.)
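The same count can be sketched in Python, matching the pattern against the hostname field only instead of the whole line, so that a job name that happens to contain `-12` is not counted by mistake. This assumes, as the grep pattern above implies, that Precise hosts are the ones numbered 12xx; the function name and the sample records (including the host and tool names in them) are hypothetical.

```python
import collections
import re

# Assumption: Precise exec hosts are the ones numbered 12xx, as implied by
# the grep pattern used in the shell pipeline.
PRECISE_HOST = re.compile(r"^tools-.*-12")


def count_precise_owners(lines):
    """Count jobs per owner (fourth field) that ran on a Precise host."""
    counts = collections.Counter()
    for line in lines:
        parts = line.split(":")
        if len(parts) < 4:
            continue  # skip truncated or malformed records
        if PRECISE_HOST.search(parts[1]):
            counts[parts[3]] += 1
    return counts


# Hypothetical, shortened records for illustration only.
sample = [
    "task:tools-exec-1201.eqiad.wmflabs:tools.example:tools.example:job-a:1",
    "task:tools-exec-1413.tools.eqiad.wmflabs:tools.other:tools.other:tools-x-12:2",
    "task:tools-exec-1210.eqiad.wmflabs:www-data:tools.example:job-b:3",
]
for owner, n in sorted(count_precise_owners(sample).items(), key=lambda kv: kv[1]):
    print(n, owner)
```

The second sample record would match a grep over the whole line because of its job name, but it is skipped here because its hostname does not match. The third record shows why the owner field is used rather than the group field.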

scfc moved this task from Triage to Backlog on the Toolforge board. Dec 4 2016, 8:09 PM
bd808 added a comment. Jan 6 2017, 2:31 AM

This looks like a promising start based on @scfc's research:

tools-bastion-02.tools:~/projects/T149214
bd808$ tail -500000 /data/project/.system/accounting | ./precise_nag.py
2014 tools.avicbot
383 tools.toolschecker
360 tools.dplbot
336 tools.veblenbot
251 tools.random-featured
236 tools.dewikinews-rss
207 tools.nlwikibots
203 tools.wikitrends
57 tools.suggestbot
31 tools.russbot
22 tools.pltools
14 tools.ytcleaner
12 tools.lists
9 tools.wiwosm
7 tools.giftbot
2 tools.yadkard
1 tools.persondata
1 tools.congressedits
1 tools.pb
1 tools.file-reuse
1 tools.drtrigonbot
precise_nag.py
#!/usr/bin/env python3
# Read lines from OGE's accounting file in stdin and look for jobs that
# executed on precise hosts (release=precise in 'category').
#
# Note: the accounting file only records completed jobs, so continuous jobs or
# very long running jobs will not be caught with this method of examination.
#
import collections
import datetime
import operator
import sys

DAYS = 7
FIELD_NAMES = [
    'qname', 'hostname', 'group', 'owner', 'job_name', 'job_number', 'account',
    'priority', 'submission_time', 'start_time', 'end_time', 'failed',
    'exit_status', 'ru_wallclock', 'ru_utime', 'ru_stime', 'ru_maxrss',
    'ru_ixrss', 'ru_ismrss', 'ru_idrss', 'ru_isrss', 'ru_minflt', 'ru_majflt',
    'ru_nswap', 'ru_inblock', 'ru_oublock', 'ru_msgsnd', 'ru_msgrcv',
    'ru_nsignals', 'ru_nvcsw', 'ru_nivcsw', 'project', 'department',
    'granted_pe', 'slots', 'task_number', 'cpu', 'mem', 'io', 'category',
    'iow', 'pe_taskid', 'maxvmem', 'arid', 'ar_submission_time',
]

cutoff = (datetime.datetime.now() - datetime.timedelta(days=DAYS)).timestamp()
jobs = collections.defaultdict(int)

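for line in sys.stdin:
    # Skip blank lines and '#' comment lines, which would otherwise break
    # the integer conversion of end_time below.
    if not line.strip() or line.startswith('#'):
        continue
    parts = line.rstrip('\n').split(':')
    job = dict(zip(FIELD_NAMES, parts))
    if int(job['end_time']) < cutoff:
        continue
    if 'release=precise' in job['category']:
        jobs[job['owner']] += 1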

sorted_jobs = sorted(jobs.items(), key=operator.itemgetter(1), reverse=True)
for owner, count in sorted_jobs:
    print(count, owner)

With some additions this could be used to run a weekly cron to send nag emails. The live grid info could be mixed in too, to deal with the continuous and long-running jobs that will be missed in the accounting file. I could also output a "wall of shame" report that we could look at to see how things change over time. The job that tools.avicbot is running fires off once every 5 minutes, so the 2014 count pretty much matches up with the number of 5-minute increments in a 7-day period (2016).
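A minimal sketch of how the per-tool counts might be turned into a nag message, using Python's standard `email.message.EmailMessage`. The function name `build_nag`, the sender and recipient addresses, the tool name, and the wording of the message are all placeholders invented for this example; it only builds the message and does not send it.

```python
from email.message import EmailMessage


def build_nag(tool, count, recipients, days=7):
    """Build (but do not send) a reminder email for one tool."""
    msg = EmailMessage()
    msg["Subject"] = "{} is still running jobs on Precise hosts".format(tool)
    msg["From"] = "noreply@example.org"  # placeholder sender address
    msg["To"] = ", ".join(recipients)
    msg.set_content(
        "{} completed {} job(s) on Precise grid hosts in the last {} days.\n"
        "Please migrate these jobs before Precise is deprecated.\n".format(
            tool, count, days
        )
    )
    return msg


# Hypothetical tool name, count and maintainer address.
msg = build_nag("tools.example", 3, ["maintainer@example.org"])
print(msg)
```

Keeping message construction separate from delivery makes it easy to print the messages for review first, for example as a "wall of shame" report, before anything is actually mailed.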

I used this and a bit more to write a script that maps users to precise tools, and vice versa - https://phabricator.wikimedia.org/P4805. I've also dumped the JSON for the mappings in both directions in comments there.
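For illustration, going from a tool-to-maintainers mapping to the reverse direction can be sketched as below. This is not taken from the linked paste; the function name `invert` and the tool and user names are hypothetical.

```python
import collections


def invert(tool_to_users):
    """Turn {tool: [users]} into {user: [tools]}, with tools in sorted order."""
    user_to_tools = collections.defaultdict(list)
    for tool, users in sorted(tool_to_users.items()):
        for user in users:
            user_to_tools[user].append(tool)
    return dict(user_to_tools)


# Hypothetical tools and maintainers.
tools = {"tools.alpha": ["alice", "bob"], "tools.beta": ["bob"]}
users = invert(tools)
print(users)
```

The per-user direction is the useful one for sending mail, since a maintainer of several tools can then receive one message listing all of them.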

zhuyifei1999 renamed this task from Make a nag system to email maintainers of tools still running on precise gird hosts to Make a nag system to email maintainers of tools still running on precise grid hosts. Jan 30 2017, 11:59 AM

Change 335488 had a related patch set uploaded (by Madhuvishy):
toollabs: Add temp role and cron to send weekly tools precise deprecation reminders

https://gerrit.wikimedia.org/r/335488

Change 335488 merged by Madhuvishy:
toollabs: Add temp role and cron to send weekly tools precise deprecation reminders

https://gerrit.wikimedia.org/r/335488

This is done. Emails were sent out today and will be sent out weekly. The cron is applied via the role::toollabs::precise_reminder role on tools-bastion-03 (on Horizon), which can be reversed after the Precise deprecation.

scfc closed this task as Resolved. Feb 7 2017, 3:42 PM

I've sent a manual email to most of the maintainers of the tools still listed as running, reminding them of the deadline on Tuesday, just in case they somehow missed the notices sent via this task plus the mailing list announcements.

Change 342658 had a related patch set uploaded (by Madhuvishy):
[operations/puppet] tools: Deprecate precise_reminder role and clean up related script

https://gerrit.wikimedia.org/r/342658

Change 342658 abandoned by Madhuvishy:
tools: Deprecate precise_reminder role and clean up related script

Reason:
Looks like this already happened during precise deprecation!

https://gerrit.wikimedia.org/r/342658