
Job-queue gets multiple identical entries / db load too high
Closed, Declined (Public)

Description

Author: gunter.schmidt

Description:
When you edit a template, the job queue gets filled. If you edit the same template multiple times within
a short period of time, the job queue will contain identical entries.

This does not result in an error, but in unnecessary database load.

I propose that, before a new job is inserted, the job table be checked for existing entries with the same
job_cmd, job_namespace and job_title.
Since the job queue is only updated after the edit is finished, entries that exist at check time but have
already been completed by the time the table is updated should not be a problem.
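
A rough sketch of the kind of check meant, assuming MySQL and placeholder values for the command, namespace and title, would be:

  SELECT job_id
  FROM job
  WHERE job_cmd = 'refreshLinks'
    AND job_namespace = 10
    AND job_title = 'Some_template'
  LIMIT 1;
  -- insert the new job only if this returns no row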

The benefit would be:

  • less DB load
  • faster processing of long queues, because the queue would be considerably shorter

While writing this, the English Wikipedia job queue has 82,305 entries, so I am making this a high-priority bug.


Version: 1.9.x
Severity: minor

Details

Reference
bz9096

Event Timeline

bzimport set the priority of this task to Lowest. Nov 21 2014, 9:38 PM
bzimport set Reference to bz9096.
bzimport added a subscriber: Unknown Object (MLST).

gunter.schmidt wrote:

I am sorry, I just found that the job cleanup seems to check this already: it deletes all similar jobs in a single step.

Thus the DB load is not significantly higher than if the duplicates had never been inserted in the first place.

The only drawback is that you do not see the actual length of the job queue. One could add a grouped count in
SpecialStatistics.php.

gunter.schmidt wrote:

One would need to add, in SpecialStatistics.php:

  $numJobs = $dbr->selectField( 'job', 'COUNT(*)', '', $fname );
+ $numJobsGrouped = $dbr->selectField( 'job', 'COUNT(DISTINCT job_title, job_cmd, job_namespace)', '', $fname );

  ...

  $wgLang->formatNum( $images ),
+ $wgLang->formatNum( $numJobsGrouped )

and add some text to the Sitestatstext message.

I am not sure about the database load of COUNT(DISTINCT ...) on large systems, so it might not be a good idea.

Another possible SELECT would be:

SELECT COUNT(*) AS C FROM job WHERE job_id IN (SELECT job_id FROM job GROUP BY job_cmd, job_namespace, job_title);
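
A variant that avoids the IN subquery, counting one row per distinct (job_cmd, job_namespace, job_title) group via a derived table, might be (untested sketch):

SELECT COUNT(*) AS C FROM (SELECT 1 FROM job GROUP BY job_cmd, job_namespace, job_title) AS grouped;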

robchur wrote:

We could just add a unique index on those three columns and use an INSERT IGNORE
when stuffing rows into the job queue, but I'd like another opinion on whether
or not the duplicates are, in fact, causing load that we need to be worried about.

The duplicates are used because the original checking on add was very expensive
(the inserts must be very fast, while the processing can take as long as it needs).

An INSERT IGNORE might not do too bad, though, dunno.
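
A sketch of that approach, assuming MySQL (the index name and values are placeholders):

  ALTER TABLE job
    ADD UNIQUE INDEX job_cmd_namespace_title (job_cmd, job_namespace, job_title);

  INSERT IGNORE INTO job (job_cmd, job_namespace, job_title, job_params)
    VALUES ('refreshLinks', 10, 'Some_template', '');
  -- a second identical INSERT IGNORE is silently skipped instead of creating a duplicate row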

I didn't use a unique index in the original code because I imagined that at some stage in the future, we may want to add job types that require execution of duplicates. For example, a job type with no attached title, defined entirely by the last few bytes of a large job_params blob, would create duplicates in a (job_cmd,job_namespace,job_title) key. The current method is good enough for now, although I would like to switch to a specialised non-MySQL data structure at some stage.
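
As a concrete sketch of that scenario (the job type and parameter values here are hypothetical), two jobs sharing the proposed key but doing different work would look like:

  INSERT IGNORE INTO job (job_cmd, job_namespace, job_title, job_params)
    VALUES ('sendBatch', 0, '', 'offset=0;limit=1000');
  INSERT IGNORE INTO job (job_cmd, job_namespace, job_title, job_params)
    VALUES ('sendBatch', 0, '', 'offset=1000;limit=1000');
  -- under a unique (job_cmd, job_namespace, job_title) key, the second row would be
  -- dropped by INSERT IGNORE even though it is a distinct job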

  • Tim Starling