Write incident report for jobs not being executed on 1.34.0-wmf.10
Closed, ResolvedPublic

Description

Due to an issue in 1.34.0-wmf.10, there were jobs (some important, some less important) that were not being queued for many hours. Given that it is unclear what the real impact of this was, we believe this deserves a postmortem so as to gather as much information as possible, as well as to track actionables. For instance, we need to add an alert if such a problem comes up again.

Event Timeline

Looking at mwlog1001:/srv/mw-log/JobExecutor.log, there are no massmessage jobs for .10, only for .8.
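
(For reference, a rough way to confirm this, mirroring the pipelines used further down; the case-insensitive match is an assumption about how the job type is spelled in the log:)

$ fgrep -i massmessage /srv/mw-log/JobExecutor.log | fgrep -c wmf.10   # no matches on .10
$ fgrep -i massmessage /srv/mw-log/JobExecutor.log | fgrep -c wmf.8    # matches only on .8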

Change 517876 had a related patch set uploaded (by Reedy; owner: Reedy):
[operations/mediawiki-config@master] Revert "group1 wikis to 1.34.0-wmf.10 refs T220735"

https://gerrit.wikimedia.org/r/517876

Change 517876 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "group1 wikis to 1.34.0-wmf.10 refs T220735"

https://gerrit.wikimedia.org/r/517876

Reedy renamed this task from MassMessages apparently not being sent on 1.34.0-wmf.10 to Jobs not being executed on 1.34.0-wmf.10.Jun 19 2019, 2:45 PM

Mentioned in SAL (#wikimedia-operations) [2019-06-19T14:46:36Z] <reedy@deploy1001> rebuilt and synchronized wikiversions files: group1 back to .8 T226109

Reedy triaged this task as High priority.Jun 19 2019, 2:47 PM
Reedy removed a project: Patch-For-Review.
Krinkle raised the priority of this task from High to Unbreak Now!.Jun 19 2019, 2:47 PM
Krinkle added a subscriber: Krinkle.

Train blockers are UBN.

A lot of jobs weren't even being executed/queued. Just looking at today's jobs:

$ fgrep commonswiki JobExecutor.log | fgrep wmf.10 | perl -ne 'print "$1\n" if /(\{.*)$/' | jq .job_type | sort | uniq -c | sort -n
      2 "enotifNotify"
      4 "globalUsageCachePurge"
      6 "EchoNotificationDeleteJob"
     10 "cdnPurge"
     12 "categoryMembershipChange"
     16 "recentChangesUpdate"
     16 "ThumbnailRender"
     40 "htmlCacheUpdate"
     54 "cirrusSearchLinksUpdatePrioritized"
    478 "wikibase-addUsagesForPage"
    822 "RecordLintJob"
   1613 "LocalGlobalUserPageCacheUpdateJob"
 201159 "refreshLinks"

While after the rollback we find:

$ fgrep commonswiki JobExecutor.log | fgrep wmf.8 | perl -ne 'print "$1\n" if /(\{.*)$/' | jq .job_type | sort | uniq -c | sort -n
      2 "AssembleUploadChunks"
      2 "cirrusSearchDeletePages"
      2 "TTMServerMessageUpdateJob"
      4 "refreshLinksPrioritized"
      8 "CentralAuthCreateLocalAccountJob"
      8 "LoginNotifyChecks"
     10 "newUserMessageJob"
     12 "cirrusSearchLinksUpdate"
     38 "EchoNotificationDeleteJob"
     50 "activityUpdateJob"
     78 "globalUsageCachePurge"
    274 "ThumbnailRender"
    322 "cirrusSearchIncomingLinkCount"
    590 "enotifNotify"
   1132 "categoryMembershipChange"
   1190 "cdnPurge"
   1257 "recentChangesUpdate"
   1371 "refreshLinks"
   2352 "htmlCacheUpdate"
   2498 "cirrusSearchLinksUpdatePrioritized"
   5849 "wikibase-addUsagesForPage"
   7396 "RecordLintJob"

So it looks like the jobs were not even being enqueued and are now lost.

TL;DR: this probably means badly formatted messages were sent to event-something, and we should have an alert on an elevated rate of errors there.
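
As a very rough sketch of what such an alert check could look like (the log path /srv/mw-log/EventBus.log, its one-line-per-error format, and the threshold are all assumptions; a real alert would more likely be driven from Logstash or Prometheus):

#!/bin/bash
# Hypothetical Icinga-style check: go critical if today's enqueue errors exceed a threshold.
THRESHOLD=100
TODAY=$(date -u +%Y-%m-%d)
ERRORS=$(grep -c "^$TODAY" /srv/mw-log/EventBus.log)   # assumes one log line per failed enqueue
if [ "$ERRORS" -gt "$THRESHOLD" ]; then
    echo "CRITICAL: $ERRORS EventBus errors logged today (threshold $THRESHOLD)"
    exit 2
fi
echo "OK: $ERRORS EventBus errors logged today"
exit 0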

Change 517885 had a related patch set uploaded (by Reedy; owner: Reedy):
[operations/mediawiki-config@master] Revert "group0 wikis to 1.34.0-wmf.10 refs T220735"

https://gerrit.wikimedia.org/r/517885

Change 517885 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "group0 wikis to 1.34.0-wmf.10 refs T220735"

https://gerrit.wikimedia.org/r/517885

Mentioned in SAL (#wikimedia-operations) [2019-06-19T15:43:37Z] <reedy@deploy1001> rebuilt and synchronized wikiversions files: group0 back to .8 T226109

@Reedy manually ran the global renames that were never queued properly.

When will the fix be ready?

Sometime after someone works out what's broken

Could someone investigate what's broken?

They will be. Based on the last metrics, tasks at UBN were open for 12 days, but that doesn't mean they were UBN for the full time. (This figure may also be inflated, and for a train blocker it is probably much lower.)

See https://lists.wikimedia.org/pipermail/wikitech-l/2019-June/092186.html

Asking without adding information doesn't speed it up.

<Edited to clarify per @Aklapper's comments>

This comment was removed by RhinosF1.

When will the fix be ready?

If you are still experiencing issues today, please report them in a new ticket.

Yesterday we rolled back production to Monday's deployment, which is before the issue was introduced. If something is still causing problems today, it is likely a separate issue we do not know about yet.

The issue in this ticket should be resolved soon, but even if it isn't, it shouldn't cause any problems, as it was already disabled.

edit: I didn't see @Krinkle's message, this was replying to @RhinosF1:

Given that this is a train blocker, 12 days isn't acceptable.

They will be. Based on the last metrics it takes ~12 days for a UBN to be fixed.

[offtopic] No, "~12 days" is incorrect and misleading. (Discussing this would be off-topic for this task; feel free to bring it up wherever you found this number.)

Could someone investigate what's broken?

That is exactly what this task is about, so there is no need to explicitly ask for investigation as this applies to basically every task. :)

This comment was removed by Agusbou2015.
This comment was removed by Krinkle.

@Agusbou2015: Please stop posting unhelpful repetitive messages. Thanks.

@Pchelolo and I looked through all the logs and graphs and all the patchsets between .8 and .10 for EventBus and MassMessage (in detail) and core (DB- and JQ-related) and found nothing of note that could have caused this. Could it have been just a hiccup for MassMessage?

Nope, unfortunately not. It also broke at least Global Rename.

From memory, .10 was live for around 17 hours. See also Joe's comment above, and all the other types of jobs that were seemingly affected.

We're experiencing stuck global renames again: https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress

This time they started, but stopped in the middle of the process.

Shall I file a new task?

Thanks.

It looks like it might actually be related; for https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/TRXX-TRXX it did testwiki but then stopped (same for the other one I already fixed)...

I'm guessing it didn't enqueue the job for the next wiki in the list? In that case, after doing thwiki, the rest continued fine.

reedy@mwlog1001:/srv/mw-log$ grep "チルノ" JobExecutor.log | grep testwiki
2019-06-21 15:29:16 [XQz3oApAMF0AAHI6IgEAAAAR] mw1336 testwikidatawiki 1.34.0-wmf.10 JobExecutor DEBUG: Beginning job execution {"job":"LocalGlobalUserPageCacheUpdateJob User:チルノ namespace=2 requestId=XQz3oApAMF0AAHI6IgEAAAAR title=チルノ touch= username=チルノ","job_type":"LocalGlobalUserPageCacheUpdateJob"} 
2019-06-21 15:29:16 [XQz3oApAMF0AAHI6IgEAAAAR] mw1311 testwiki 1.34.0-wmf.10 JobExecutor DEBUG: Beginning job execution {"job":"LocalGlobalUserPageCacheUpdateJob User:チルノ namespace=2 requestId=XQz3oApAMF0AAHI6IgEAAAAR title=チルノ touch= username=チルノ","job_type":"LocalGlobalUserPageCacheUpdateJob"} 
2019-06-21 15:29:16 [XQz3oApAMF0AAHI6IgEAAAAR] mw1336 testwikidatawiki 1.34.0-wmf.10 JobExecutor INFO: Finished job execution {"job":"LocalGlobalUserPageCacheUpdateJob User:チルノ namespace=2 requestId=XQz3oApAMF0AAHI6IgEAAAAR title=チルノ touch= username=チルノ","job_type":"LocalGlobalUserPageCacheUpdateJob","job_status":true,"job_duration":0.01779317855835} 
2019-06-21 15:29:16 [XQz3oApAMF0AAHI6IgEAAAAR] mw1311 testwiki 1.34.0-wmf.10 JobExecutor INFO: Finished job execution {"job":"LocalGlobalUserPageCacheUpdateJob User:チルノ namespace=2 requestId=XQz3oApAMF0AAHI6IgEAAAAR title=チルノ touch= username=チルノ","job_type":"LocalGlobalUserPageCacheUpdateJob","job_status":true,"job_duration":0.032263994216919} 
2019-06-21 15:29:22 [XQz3oApAMF0AAHI6IgEAAAAR] mw1336 testwiki 1.34.0-wmf.10 JobExecutor DEBUG: Beginning job execution {"job":"LocalRenameUserJob Global_rename_job force= from=New visitor movepages=1 namespace=0 promotetoglobal= reason=per [[Special:Permalink/19164089|request]] reattach=array(69) renamer=1997kB requestId=XQz3oApAMF0AAHI6IgEAAAAR session={\"headers\":\"array(...)\",\"ip\":\"157.36.249.217\",\"sessionId\":\"\",\"userId\":0} suppressredirects=1 title=Global_rename_job to=チルノ","job_type":"LocalRenameUserJob"} 
2019-06-21 15:29:25 [XQz3oApAMF0AAHI6IgEAAAAR] mw1336 testwiki 1.34.0-wmf.10 JobExecutor INFO: Finished job execution {"job":"LocalRenameUserJob Global_rename_job force= from=New visitor movepages=1 namespace=0 promotetoglobal= reason=per [[Special:Permalink/19164089|request]] reattach=array(69) renamer=1997kB requestId=XQz3oApAMF0AAHI6IgEAAAAR session={\"headers\":\"array(...)\",\"ip\":\"157.36.249.217\",\"sessionId\":\"\",\"userId\":0} suppressredirects=1 title=Global_rename_job to=チルノ","job_type":"LocalRenameUserJob","job_status":true,"job_duration":2.9513418674469}

I do not have a solid theory of what's happening yet, but just throwing in some bits and pieces.

Regarding the LocalRenameUserJob for TRXX-TRXX:

kafkacat -b localhost:9092 -p 0 -t "eqiad.mediawiki.job.LocalRenameUserJob" -o beginning | grep TRXX-TRXX | jq .meta

gives us the metadata of the jobs running in production. The three events of interest to us there are:

{
  "domain": "test.wikipedia.org",
  "dt": "2019-06-21T16:51:35+00:00",
  "id": "d42e547d-9444-11e9-a689-1866da993d2e",
  "request_id": "XQ0KvApAICAAAA9lpPcAAAAS",
  "schema_uri": "mediawiki/job/3",
  "topic": "mediawiki.job.LocalRenameUserJob",
  "uri": "https://placeholder.invalid/wiki/Special:Badtitle"
}
{
  "domain": "tr.wikipedia.org",
  "dt": "2019-06-21T16:51:37+00:00",
  "id": "d55f72bd-9444-11e9-9691-141877614309",
  "request_id": "XQ0KvApAICAAAA9lpPcAAAAS",
  "schema_uri": "mediawiki/job/3",
  "topic": "mediawiki.job.LocalRenameUserJob",
  "uri": "https://placeholder.invalid/wiki/Special:Badtitle"
}
{
  "domain": "uk.wikipedia.org",
  "dt": "2019-06-21T21:30:08+00:00",
  "id": "bde7c2dd-946b-11e9-8649-1866da5fcf1f",
  "request_id": "99012b73681a3ced8c965182",
  "schema_uri": "mediawiki/job/3",
  "topic": "mediawiki.job.LocalRenameUserJob",
  "uri": "https://placeholder.invalid/wiki/Special:Badtitle"
}

Notice that the trwiki job timestamp immediately follows the testwiki timestamp, but the next one, ukwiki, has a 5-hour gap. I assume that's when @Reedy manually continued the execution loop.

So, execution on wmf.10 did create the subsequent job event once, but the next execution then broke and did not create a subsequent event of its own.
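
For the record, the gap is easy to eyeball by extracting just the timestamp and domain from the same topic. This is only a sketch (TRXX-TRXX is kept as the placeholder used above; -e makes kafkacat exit at the end of the partition instead of streaming):

kafkacat -b localhost:9092 -p 0 -t "eqiad.mediawiki.job.LocalRenameUserJob" -o beginning -e | grep TRXX-TRXX | jq -r '.meta | "\(.dt)  \(.domain)"' | sort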

Change 518751 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] RunSingleJob: Don't silently fail if page_title is not provided in job.

https://gerrit.wikimedia.org/r/518751

The above patch to mediawiki-config will fix it. This is another instance of having code in multiple places biting us hard. Given that it affects production, this config change has to be deployed ASAP.

Change 518751 merged by jenkins-bot:
[operations/mediawiki-config@master] RunSingleJob: Don't silently fail if page_title is not provided in job.

https://gerrit.wikimedia.org/r/518751

Mentioned in SAL (#wikimedia-operations) [2019-06-24T16:13:03Z] <mobrovac@deploy1001> Synchronized rpc/RunSingleJob.php: RunSingleJob: check that only the database param is set and leave the rest to JobExecutor - T226109 (duration: 00m 55s)

Pchelolo claimed this task.

The fix to mw-config has been deployed; we should be good now.

Am I alone in feeling like this probably deserves an incident report?

I'd be very happy to be wrong about this, but it seems like there are no mechanisms in place to prevent this kind of error from happening again, or even to detect it if/when it does.

<strike>I think there was one at some point</strike> Wrong month!

I think there was one at some point

Pretty sure there wasn't

I don't know what I've read, but I'm certain I've read an incident report, and I'm sure it referenced @Elitre and mass messages.

That might've been the last breakage, but the cause is different.

It was https://wikitech.wikimedia.org/wiki/Incident_documentation/20190417-Jobqueue from April!! Shows you how well I read it!

Am I alone in feeling like this probably deserves an incident report?

I'd be very happy to be wrong about this, but it seems like there are no mechanisms in place to prevent this kind of error from happening again, or even to detect it if/when it does.

Reopening this because I think it's important we take some preventative measures here.

CDanis renamed this task from Jobs not being executed on 1.34.0-wmf.10 to Write incident report for jobs not being executed on 1.34.0-wmf.10.Jul 22 2019, 3:46 PM
CDanis lowered the priority of this task from Unbreak Now! to Medium.

I've made a preliminary incident report at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190619-JobQueue

A lot of time has passed, so we lack some details. Also, help with filing follow-ups would be much appreciated.

Report written. Please reopen if it's not sufficient.