Due to an issue in 1.34.0-wmf.10, some jobs (some important, some less so) were not being queued for many hours. Because the real impact is unclear, we believe this deserves a postmortem so that we can gather as much information as needed and track actionables. For instance, we need to add an alert so we are notified if such a problem comes up again.
Description
Details
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | Release | mmodell | T220735 1.34.0-wmf.10 deployment blockers |
| Resolved | Release | jeena | T220736 1.34.0-wmf.11 deployment blockers |
| Resolved | | Pchelolo | T226109 Write incident report for jobs not being executed on 1.34.0-wmf.10 |
Event Timeline
Looking at mwlog1001:/srv/mw-log/JobExecutor.log, there are no massmessage jobs for .10, only for .8.
Change 517876 had a related patch set uploaded (by Reedy; owner: Reedy):
[operations/mediawiki-config@master] Revert "group1 wikis to 1.34.0-wmf.10 refs T220735"
Change 517876 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "group1 wikis to 1.34.0-wmf.10 refs T220735"
Mentioned in SAL (#wikimedia-operations) [2019-06-19T14:46:36Z] <reedy@deploy1001> rebuilt and synchronized wikiversions files: group1 back to .8 T226109
A lot of jobs weren't even being executed/queued. Just looking at today's jobs:
$ fgrep commonswiki JobExecutor.log | fgrep wmf.10 | perl -ne 'print "$1\n" if /(\{.*)$/' | jq .job_type | sort | uniq -c | sort -n
      2 "enotifNotify"
      4 "globalUsageCachePurge"
      6 "EchoNotificationDeleteJob"
     10 "cdnPurge"
     12 "categoryMembershipChange"
     16 "recentChangesUpdate"
     16 "ThumbnailRender"
     40 "htmlCacheUpdate"
     54 "cirrusSearchLinksUpdatePrioritized"
    478 "wikibase-addUsagesForPage"
    822 "RecordLintJob"
   1613 "LocalGlobalUserPageCacheUpdateJob"
 201159 "refreshLinks"
While after the rollback we find:
fgrep commonswiki JobExecutor.log | fgrep wmf.8 | perl -ne 'print "$1\n" if /(\{.*)$/' | jq .job_type | sort | uniq -c | sort -n
      2 "AssembleUploadChunks"
      2 "cirrusSearchDeletePages"
      2 "TTMServerMessageUpdateJob"
      4 "refreshLinksPrioritized"
      8 "CentralAuthCreateLocalAccountJob"
      8 "LoginNotifyChecks"
     10 "newUserMessageJob"
     12 "cirrusSearchLinksUpdate"
     38 "EchoNotificationDeleteJob"
     50 "activityUpdateJob"
     78 "globalUsageCachePurge"
    274 "ThumbnailRender"
    322 "cirrusSearchIncomingLinkCount"
    590 "enotifNotify"
   1132 "categoryMembershipChange"
   1190 "cdnPurge"
   1257 "recentChangesUpdate"
   1371 "refreshLinks"
   2352 "htmlCacheUpdate"
   2498 "cirrusSearchLinksUpdatePrioritized"
   5849 "wikibase-addUsagesForPage"
   7396 "RecordLintJob"
So it looks like the jobs were not even being enqueued and are now lost.
TL;DR: this probably means badly formatted messages were sent to event-something, and we should have an alert on an elevated error rate there.
Change 517885 had a related patch set uploaded (by Reedy; owner: Reedy):
[operations/mediawiki-config@master] Revert "group0 wikis to 1.34.0-wmf.10 refs T220735"
Change 517885 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "group0 wikis to 1.34.0-wmf.10 refs T220735"
Mentioned in SAL (#wikimedia-operations) [2019-06-19T15:43:37Z] <reedy@deploy1001> rebuilt and synchronized wikiversions files: group0 back to .8 T226109
They will be. Based on the latest metrics, UBN tasks were open for 12 days, but that doesn't mean they were UBN for the full time. (This figure may also be inflated, and for a train blocker it is probably much lower.)
See https://lists.wikimedia.org/pipermail/wikitech-l/2019-June/092186.html
Asking without adding information doesn't speed it up.
<Edited to clarify per @Aklapper's comments>
If you are still experiencing issues today, please report them in a new ticket.
Yesterday we rolled back production to Monday's deployment, which is before the issue was introduced. If something is still causing problems today, it is likely a separate issue we do not know about yet.
The issue of this ticket should be resolved soon, but even if it isn't, it shouldn't cause any problems, as it was already disabled.
[offtopic] No, "~12 days" is incorrect and misleading. (Discussing this would be off-topic for this task; feel free to bring it up wherever you found this number.)
That is exactly what this task is about, so there is no need to explicitly ask for investigation as this applies to basically every task. :)
@Pchelolo and I looked through all the logs and graphs and all the patchsets between .8 and .10 for EventBus and MassMessage (in detail) and core (DB- and JQ-related) and found nothing of note that could have caused this. Could it have been just a hiccup for MassMessage?
Nope, unfortunately not. It also broke at least Global Rename.
From memory, .10 was live for around 17 hours. See also joe's comment, and all the other types of jobs that were seemingly affected.
We're again experiencing global renames becoming stuck: https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress
This time they started, but stopped in the middle of the process.
Shall I file a new task?
Thanks.
It looks like it might actually be related; for https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/TRXX-TRXX it has done testwiki but then stopped (same as the other one I already fixed)...
I'm guessing it didn't enqueue the job for the next wiki in the list? In that case, after doing thwiki, the rest continued fine.
reedy@mwlog1001:/srv/mw-log$ grep "チルノ" JobExecutor.log | grep testwiki
2019-06-21 15:29:16 [XQz3oApAMF0AAHI6IgEAAAAR] mw1336 testwikidatawiki 1.34.0-wmf.10 JobExecutor DEBUG: Beginning job execution {"job":"LocalGlobalUserPageCacheUpdateJob User:チルノ namespace=2 requestId=XQz3oApAMF0AAHI6IgEAAAAR title=チルノ touch= username=チルノ","job_type":"LocalGlobalUserPageCacheUpdateJob"}
2019-06-21 15:29:16 [XQz3oApAMF0AAHI6IgEAAAAR] mw1311 testwiki 1.34.0-wmf.10 JobExecutor DEBUG: Beginning job execution {"job":"LocalGlobalUserPageCacheUpdateJob User:チルノ namespace=2 requestId=XQz3oApAMF0AAHI6IgEAAAAR title=チルノ touch= username=チルノ","job_type":"LocalGlobalUserPageCacheUpdateJob"}
2019-06-21 15:29:16 [XQz3oApAMF0AAHI6IgEAAAAR] mw1336 testwikidatawiki 1.34.0-wmf.10 JobExecutor INFO: Finished job execution {"job":"LocalGlobalUserPageCacheUpdateJob User:チルノ namespace=2 requestId=XQz3oApAMF0AAHI6IgEAAAAR title=チルノ touch= username=チルノ","job_type":"LocalGlobalUserPageCacheUpdateJob","job_status":true,"job_duration":0.01779317855835}
2019-06-21 15:29:16 [XQz3oApAMF0AAHI6IgEAAAAR] mw1311 testwiki 1.34.0-wmf.10 JobExecutor INFO: Finished job execution {"job":"LocalGlobalUserPageCacheUpdateJob User:チルノ namespace=2 requestId=XQz3oApAMF0AAHI6IgEAAAAR title=チルノ touch= username=チルノ","job_type":"LocalGlobalUserPageCacheUpdateJob","job_status":true,"job_duration":0.032263994216919}
2019-06-21 15:29:22 [XQz3oApAMF0AAHI6IgEAAAAR] mw1336 testwiki 1.34.0-wmf.10 JobExecutor DEBUG: Beginning job execution {"job":"LocalRenameUserJob Global_rename_job force= from=New visitor movepages=1 namespace=0 promotetoglobal= reason=per [[Special:Permalink/19164089|request]] reattach=array(69) renamer=1997kB requestId=XQz3oApAMF0AAHI6IgEAAAAR session={\"headers\":\"array(...)\",\"ip\":\"157.36.249.217\",\"sessionId\":\"\",\"userId\":0} suppressredirects=1 title=Global_rename_job to=チルノ","job_type":"LocalRenameUserJob"}
2019-06-21 15:29:25 [XQz3oApAMF0AAHI6IgEAAAAR] mw1336 testwiki 1.34.0-wmf.10 JobExecutor INFO: Finished job execution {"job":"LocalRenameUserJob Global_rename_job force= from=New visitor movepages=1 namespace=0 promotetoglobal= reason=per [[Special:Permalink/19164089|request]] reattach=array(69) renamer=1997kB requestId=XQz3oApAMF0AAHI6IgEAAAAR session={\"headers\":\"array(...)\",\"ip\":\"157.36.249.217\",\"sessionId\":\"\",\"userId\":0} suppressredirects=1 title=Global_rename_job to=チルノ","job_type":"LocalRenameUserJob","job_status":true,"job_duration":2.9513418674469}
I do not have a solid theory of what's happening yet, but just throwing in some bits and pieces.
Regarding the LocalRenameUserJob for TRXX-TRXX:
kafkacat -b localhost:9092 -p 0 -t "eqiad.mediawiki.job.LocalRenameUserJob" -o beginning | grep TRXX-TRXX | jq .meta
gives us the metadata of the jobs that ran in production. The three events of interest there are:
{ "domain": "test.wikipedia.org", "dt": "2019-06-21T16:51:35+00:00", "id": "d42e547d-9444-11e9-a689-1866da993d2e", "request_id": "XQ0KvApAICAAAA9lpPcAAAAS", "schema_uri": "mediawiki/job/3", "topic": "mediawiki.job.LocalRenameUserJob", "uri": "https://placeholder.invalid/wiki/Special:Badtitle" } { "domain": "tr.wikipedia.org", "dt": "2019-06-21T16:51:37+00:00", "id": "d55f72bd-9444-11e9-9691-141877614309", "request_id": "XQ0KvApAICAAAA9lpPcAAAAS", "schema_uri": "mediawiki/job/3", "topic": "mediawiki.job.LocalRenameUserJob", "uri": "https://placeholder.invalid/wiki/Special:Badtitle" } { "domain": "uk.wikipedia.org", "dt": "2019-06-21T21:30:08+00:00", "id": "bde7c2dd-946b-11e9-8649-1866da5fcf1f", "request_id": "99012b73681a3ced8c965182", "schema_uri": "mediawiki/job/3", "topic": "mediawiki.job.LocalRenameUserJob", "uri": "https://placeholder.invalid/wiki/Special:Badtitle" }
Notice that the trwiki job timestamp immediately follows the testwiki timestamp, but the next one, ukwiki, has a 5-hour gap. I assume that's when @Reedy manually continued the execution loop.
So, execution on wmf.10 did create the subsequent job event, but the next execution broke and did not create one.
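For reference, the chained pattern described above looks roughly like the following. This is only a minimal sketch with made-up names (ChainedRenameJob, the remainingWikis param), not CentralAuth's actual implementation; the point is that each per-wiki job enqueues the job for the next wiki, so if one execution dies silently, no follow-up event is produced and the whole chain stalls.

```
<?php
// Hypothetical illustration of a chained per-wiki job (names are made up).
// Each job processes one wiki and, on success, pushes the job for the next
// wiki onto that wiki's queue. If an execution fails silently, the follow-up
// event is never created and the rename appears "stuck" mid-chain.
class ChainedRenameJob extends Job {
	public function __construct( Title $title, array $params ) {
		parent::__construct( 'ChainedRenameJob', $title, $params );
	}

	public function run() {
		$this->renameOnCurrentWiki();

		// 'remainingWikis' is a hypothetical param: the wikis still to do.
		$remaining = $this->params['remainingWikis'];
		if ( $remaining ) {
			$nextWiki = array_shift( $remaining );
			$nextParams = [ 'remainingWikis' => $remaining ] + $this->params;
			// Push the next link of the chain onto the next wiki's job queue.
			JobQueueGroup::singleton( $nextWiki )->push(
				new self( $this->getTitle(), $nextParams )
			);
		}
		return true;
	}

	private function renameOnCurrentWiki() {
		// ... actual per-wiki rename work elided ...
	}
}
```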
Change 518751 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] RunSingleJob: Don't silently fail if page_title is not provided in job.
The above patch to mediawiki-config will fix it. This is another instance of having code in multiple places biting us hard. Given that it affects production, this config change has to be deployed ASAP.
Change 518751 merged by jenkins-bot:
[operations/mediawiki-config@master] RunSingleJob: Don't silently fail if page_title is not provided in job.
Mentioned in SAL (#wikimedia-operations) [2019-06-24T16:13:03Z] <mobrovac@deploy1001> Synchronized rpc/RunSingleJob.php: RunSingleJob: check that only the database param is set and leave the rest to JobExecutor - T226109 (duration: 00m 55s)
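For context, the gist of that change, as described in the commit message and the sync log above, is that rpc/RunSingleJob.php should only check for the database parameter, leave everything else to the JobExecutor, and fail loudly instead of silently when the event is malformed. A rough sketch of the idea (not the exact deployed code; the surrounding details are assumptions):

```
<?php
// Rough sketch of the RunSingleJob.php change described above (not the exact
// deployed code). The endpoint receives a job event; previously it bailed out
// silently when fields such as page_title were missing. The idea of the fix:
// only require 'database', hand everything else to the JobExecutor, and
// return an explicit error instead of dropping the job on the floor.
$event = json_decode( file_get_contents( 'php://input' ), true );

if ( !isset( $event['database'] ) ) {
	// Fail loudly so the problem is visible in logs and metrics.
	http_response_code( 400 );
	echo "Invalid event received: no 'database' parameter\n";
	exit;
}

// Job-specific parameters (title, page_title, ...) are validated and handled
// by the JobExecutor itself.
```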
Am I alone in feeling like this probably deserves an incident report?
I'd be very happy to be wrong about this, but it seems like there are no mechanisms in place to prevent this kind of error from happening again, or even to detect it if/when it does.
I don't know what I've read, but I'm certain I've read an incident report, and I'm sure it referenced @Elitre and mass messages.
It was https://wikitech.wikimedia.org/wiki/Incident_documentation/20190417-Jobqueue from April!! Shows you how well I read it!
Reopening this because I think it's important we take some preventative measures here.
I've made a preliminary incident report at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190619-JobQueue
A lot of time has passed, so we lack some details. Also, help on filing followups would be much appreciated.