Linter jobs are running slowly
Closed, Resolved, Public

Description

Splitting into a separate task for better tracking.

@ShakespeareFan00 is reporting that linting reports are not updating. They give one example: this report still contains a report of an error after page83's transclusion 83, which should have been fixed after this change.

Looking at the jobqueue, I can see a significant change in the recordLintJob rates after Friday, Jan 14, 02:00 AM (UTC).

Screenshot 2022-01-15 at 20.46.06.png (688×1 px, 311 KB)

The train hit group2 at 00:23, so it's probably related. I suspect the Linter job increase is because of T297443: Add a linter category for inline images with captions, which added a new lint category, so instead of most Linter jobs being fast no-ops they're now writing rows to the database. Over time this will recover by itself, but in the meantime we can add more RecordLintJob runners in changeprop (Monday I guess).
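To illustrate why a new lint category would turn no-op jobs into database writes, here is a minimal sketch. It assumes the job compares the lints already stored for a page with the lints reported by the latest parse; that is an assumption for illustration, not the actual Linter extension code. The names `Lint` and `plan_update` and the sample locations are hypothetical.

```python
from typing import NamedTuple, Set, Tuple


class Lint(NamedTuple):
    """Hypothetical record: one lint error on a page."""
    category: str
    location: Tuple[int, int]  # (start, end) offsets, illustrative only


def plan_update(stored: Set[Lint], reported: Set[Lint]) -> Tuple[Set[Lint], Set[Lint]]:
    """Return (rows_to_insert, rows_to_delete) for one page.

    Assumption: the job only touches the database when the two sets differ.
    """
    rows_to_insert = reported - stored
    rows_to_delete = stored - reported
    return rows_to_insert, rows_to_delete


existing = {Lint("obsolete-tag", (10, 20))}

# Before the new category: a reparse reports the same lints, so nothing is written.
unchanged = plan_update(existing, existing)

# After the new category is added: the same page now needs a row inserted.
with_new_category = existing | {Lint("inline-media-caption", (30, 45))}
after_deploy = plan_update(existing, with_new_category)

# If the category is later removed, the work flips to deletions.
after_removal = plan_update(with_new_category, existing)

print(unchanged)      # (set(), set())
print(len(after_deploy[0]), len(after_deploy[1]))    # 1 0
print(len(after_removal[0]), len(after_removal[1]))  # 0 1
```

Under this assumption, every page that contains an inline image with a caption stops being a no-op the first time it is reparsed after the deploy.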

Tagging Platform and serviceops to see if we can get more runners for the Linter job queue temporarily.

Event Timeline

Legoktm triaged this task as High priority. Jan 16 2022, 7:59 AM
Legoktm created this task.

I think the task title is inaccurate.

Specifically:

  • The processing rate remained the same
  • The processing *time* remained the same, even went down a bit
  • The insert rate of linting jobs clearly went through the roof.

So now we have a backlog of 9 million items. I'll double the concurrency of this job to try and reduce the strain.

Change 754096 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/deployment-charts@master] cpjobqueue: double the concurrencty for recordlintjob

https://gerrit.wikimedia.org/r/754096

Change 754096 merged by Giuseppe Lavagetto:

[operations/deployment-charts@master] cpjobqueue: double the concurrencty for recordlintjob

https://gerrit.wikimedia.org/r/754096

We're now reducing the number of backlogged items at a rate of 25k/minute. At this pace, the backlog should be back near zero in 6 hours. I think this is a reasonable time for resolution. Leaving the task open so we can come back and assess further.
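The six-hour estimate follows from the figures quoted in this task (a backlog of about 9 million items, shrinking by 25k per minute). A quick check of the arithmetic, assuming the net drain rate stays constant; the function name `minutes_to_drain` is just for illustration:

```python
def minutes_to_drain(backlog_items: int, net_drain_per_minute: int) -> float:
    """Time to empty a backlog that shrinks at a constant net rate."""
    if net_drain_per_minute <= 0:
        raise ValueError("backlog never drains at a non-positive net rate")
    return backlog_items / net_drain_per_minute


backlog_items = 9_000_000      # backlog size reported earlier in this task
net_drain_per_minute = 25_000  # net reduction observed after doubling concurrency

minutes = minutes_to_drain(backlog_items, net_drain_per_minute)
hours = minutes / 60
print(f"{minutes:.0f} minutes, about {hours:.0f} hours")  # 360 minutes, about 6 hours
```

If the insert rate rises again, the net drain rate drops and the estimate stretches accordingly.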

Some graphs for posterity:

Specifically:

  • The processing rate remained the same

Screenshot 2022-01-16 at 00-40-59 JobQueue - Grafana.png (664×3 px, 220 KB)

  • The processing *time* remained the same, even went down a bit

Screenshot 2022-01-16 at 00-41-36 JobQueue - Grafana.png (1×3 px, 350 KB)

  • The insert rate of linting jobs clearly went through the roof.

Kind of. The peaks are a bit higher, but there are also none of the significant gaps that used to happen, which probably gave it room to process any backlog? Off the top of my head I'm not really sure why this would happen, short of editing patterns changing.

Screenshot 2022-01-16 at 00-31-38 JobQueue - Grafana.png (766×1 px, 83 KB)

Change 754110 had a related patch set uploaded (by Arlolra; author: Arlolra):

[mediawiki/services/parsoid@master] Stop emitting "inline-media-caption" lints

https://gerrit.wikimedia.org/r/754110

Joe claimed this task.

The backlog has been recovered; tomorrow I'll lower the concurrency for such jobs.

Change 754110 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Stop emitting "inline-media-caption" lints

https://gerrit.wikimedia.org/r/754110

The backlog has been recovered; tomorrow I'll lower the concurrency for such jobs.

Given that the lint is being removed, I think we're going to see the same activity in jobs just in the other direction, deleting rows rather than inserting, so it would be nice to keep the extra concurrency through this week too.

I'd also suggest that the above Parsoid patch be backported to wmf.17, so that way if the train ends up getting rolled back again, even temporarily, we don't have unnecessary churn with wmf.18 deleting lints, wmf.17 re-adding them, and then wmf.18 deleting them again, etc.

The backlog has been recovered; tomorrow I'll lower the concurrency for such jobs.

Given that the lint is being removed, I think we're going to see the same activity in jobs just in the other direction, deleting rows rather than inserting, so it would be nice to keep the extra concurrency through this week too.

My idea was to take the concurrency a bit higher than it is today in normal conditions,

I'd also suggest that the above Parsoid patch be backported to wmf.17, so that way if the train ends up getting rolled back again, even temporarily, we don't have unnecessary churn with wmf.18 deleting lints, wmf.17 re-adding them, and then wmf.18 deleting them again, etc.

That's for the Release-Engineering-Team and Parsoid to decide. @ssastry @thcipriani what do you think?

As train runners for this week, please be aware of ^

I'd also suggest that the above Parsoid patch be backported to wmf.17, so that way if the train ends up getting rolled back again, even temporarily, we don't have unnecessary churn with wmf.18 deleting lints, wmf.17 re-adding them, and then wmf.18 deleting them again, etc.

That's for the Release-Engineering-Team and Parsoid to decide. @ssastry @thcipriani what do you think?

I think backporting the linter category suppression to wmf.17 is a good idea to avoid the back-and-forth in the event of train rollbacks.

But, practically, instead of trying to backport the Parsoid change to wmf.17 (which is tricky since Parsoid goes through vendor, although in this instance the change is to a single file, so maybe we can scap a single file), I think it might be simpler to write a Linter extension patch to drop lint events against this category and then backport that to wmf.17. @Arlolra thoughts?

Change 754558 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/extensions/Linter@master] Drop 'inline-media-caption' lint requests

https://gerrit.wikimedia.org/r/754558
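The idea behind dropping lint requests for one category can be sketched as a filter applied before anything is recorded. This is a Python illustration only; the real patch lives in the PHP Linter extension and is not reproduced here. The names `DROPPED_CATEGORIES` and `drop_ignored` and the `"type"` field are hypothetical.

```python
from typing import Dict, Iterable, List

# Categories whose lint requests should be discarded before recording.
DROPPED_CATEGORIES = frozenset({"inline-media-caption"})


def drop_ignored(lint_events: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Keep only events whose category is not in the drop list.

    Assumption: each event carries its category under a "type" key.
    """
    return [event for event in lint_events if event.get("type") not in DROPPED_CATEGORIES]


incoming = [
    {"type": "obsolete-tag", "page": "Example"},
    {"type": "inline-media-caption", "page": "Example"},
]
kept = drop_ignored(incoming)
print([event["type"] for event in kept])  # ['obsolete-tag']
```

Filtering on the receiving side means the behaviour is the same regardless of which Parsoid version produced the events, which is what makes it attractive for a wmf.17 backport.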

Change 754564 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.15.0-a15

https://gerrit.wikimedia.org/r/754564

Change 754144 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/extensions/Linter@wmf/1.38.0-wmf.17] Drop 'inline-media-caption' lint requests

https://gerrit.wikimedia.org/r/754144

Change 754558 merged by jenkins-bot:

[mediawiki/extensions/Linter@master] Drop 'inline-media-caption' lint requests

https://gerrit.wikimedia.org/r/754558

Change 754564 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.15.0-a15

https://gerrit.wikimedia.org/r/754564

Change 754144 merged by jenkins-bot:

[mediawiki/extensions/Linter@wmf/1.38.0-wmf.17] Drop 'inline-media-caption' lint requests

https://gerrit.wikimedia.org/r/754144

Mentioned in SAL (#wikimedia-operations) [2022-01-18T08:12:50Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.17/extensions/Linter/includes/RecordLintJob.php: Backport: [[gerrit:754144|Drop 'inline-media-caption' lint requests (T297443 T299302)]] (duration: 00m 52s)