
Hashtags tool creating a very large number of concurrent grid jobs
Closed, Resolved · Public

Description

From https://tools.wmflabs.org/grid-jobs/ at 2018-06-28T00:00:

Tool      Unique jobs  Active jobs  Jobs seen
hashtags  35           368          20716

Drilling down shows that there are 99 enHashtagUpdate, 48 esHashtagUpdate, 73 ruHashtagUpdate, etc. jobs running. I know that I have seen the concurrent job count for this tool spike before, but this particular spike seems more extreme.

The total number of jobs active on the grid seems to have started climbing sharply around UTC midnight on 2018-06-27. That is probably not all attributable to the hashtags tool, but some correlation seems likely.

Event Timeline

MariaDB [(none)]> select count(1) as procs, max(time) as longest, state from INFORMATION_SCHEMA.PROCESSLIST where user = 's52467' group by state;
+-------+---------+------------------------------+
| procs | longest | state                        |
+-------+---------+------------------------------+
|     8 |      52 | Copying to tmp table         |
|     2 |      86 | Copying to tmp table on disk |
|     1 |       0 | Filling schema table         |
|   385 |    8400 | Waiting for table level lock |
+-------+---------+------------------------------+
4 rows in set (0.02 sec)

There seems to be some serious lock contention writing to the tool's ToolsDB table.

The contention is around statements that look like:

UPDATE hashtags
SET ht_update_timestamp = '20180627192453'
WHERE ht_text = 'iabot'

At this instant there are actually 220 of these, all trying to record the last-seen time for ht_text = 'iabot'. Pretty obviously only the most recent one matters. I wonder if there is something that can be done in the job to de-duplicate these writes and reduce the lock contention?
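For illustration only, here is a minimal sketch of one way such writes could be de-duplicated, assuming the updater is a Python script talking to ToolsDB via pymysql (the function and variable names here are hypothetical, not the tool's actual code): keep only the newest timestamp per hashtag in memory, then issue a single UPDATE per hashtag instead of one per sighting.

# Hypothetical sketch: batch "last seen" updates so each hashtag is written
# at most once per run, instead of once per matching edit.
import pymysql

def flush_last_seen(connection, sightings):
    # sightings: iterable of (ht_text, timestamp) pairs, possibly with many
    # duplicates for popular tags like 'iabot'.
    latest = {}
    for ht_text, ts in sightings:
        # Keep only the most recent timestamp seen for each hashtag.
        if ht_text not in latest or ts > latest[ht_text]:
            latest[ht_text] = ts

    with connection.cursor() as cursor:
        cursor.executemany(
            "UPDATE hashtags SET ht_update_timestamp = %s WHERE ht_text = %s",
            [(ts, ht_text) for ht_text, ts in latest.items()],
        )
    connection.commit()

Something along these lines would collapse the 220 competing writes for 'iabot' into a single statement per run, which should take most of the pressure off the table lock.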

@mahmoud can you give me some help in figuring out what is going wrong here?

Sorry for the delay in replying; wedding planning has done a number on my open-source contribs.

I think the most likely culprit here is that some external slowdown (maybe a big influx of hashtags, or just non-hashtag hardware contention) caused the scheduled jobs to pile up, waiting on MySQL to kill deadlocks.

We can alleviate the problem in the future by having the jobs exit early if another job is already running. I've done a bit of design work there, but implementing our own locks is bound to be a little tricky. For instance, NFS has proven too flaky to rely on for a file-based lock, so the locks themselves will probably live in the DB.
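As a sketch of what a DB-hosted lock could look like (not an implementation of the tool's actual design), MySQL's advisory GET_LOCK()/RELEASE_LOCK() functions are one option, since the server also drops the lock automatically if the holding connection dies. The lock name and helper below are hypothetical, again assuming pymysql.

# Hypothetical sketch: use a MySQL advisory lock so a scheduled run exits
# early if the previous run for the same job is still working.
import sys
import pymysql

LOCK_NAME = "hashtags.enHashtagUpdate"  # hypothetical per-job lock name

def run_once(connection, job):
    with connection.cursor() as cursor:
        # GET_LOCK(name, 0) returns 1 if acquired, 0 if someone else holds it.
        cursor.execute("SELECT GET_LOCK(%s, 0)", (LOCK_NAME,))
        (acquired,) = cursor.fetchone()
        if acquired != 1:
            print("previous run still active, exiting early")
            sys.exit(0)
        try:
            job()
        finally:
            cursor.execute("SELECT RELEASE_LOCK(%s)", (LOCK_NAME,))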

I'd be very curious to see how other tools have solved this issue with scheduled jobs, if you happen to know of any. :)

> I'd be very curious to see how other tools have solved this issue with scheduled jobs, if you happen to know of any. :)

The -once flag to jsub is the easiest way to ensure that cron does not start a named job if there is already a copy of the same named job running. See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Grid#Running_a_job_only_once
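As a rough illustration of that setup (the schedule, job name, and script path below are placeholders, not the tool's real crontab), a -once crontab entry looks roughly like:

# jsub -once refuses to submit if a job with the same name (-N) is already
# queued or running, so overlapping runs are skipped rather than stacked.
*/10 * * * * jsub -once -N enHashtagUpdate python3 $HOME/hashtags/update.py en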

I agree that NFS file locks are not a reliable solution.

fwiw @bd808 I put the -once in. Not sure if that's enough info to close this ticket, but hopefully we won't see this happen again. Thanks!

bd808 closed this task as Resolved. Jul 30 2018, 3:15 AM
bd808 assigned this task to mahmoud.

Things look good for now. I'll reopen this if I happen to catch things running away again, but hopefully -once will do the trick.