
Jobs not being executed on 1.34.0-wmf.10
Closed, ResolvedPublic

Description

I'm experiencing this again :/ Help? It's been several hours. https://meta.wikimedia.org/wiki/Special:Log/massmessage

Event Timeline

Reedy created this task. · Wed, Jun 19, 2:38 PM
Restricted Application added a subscriber: Aklapper. · Wed, Jun 19, 2:38 PM

Looking at mwlog1001:/srv/mw-log/JobExecutor.log, there are no massmessage jobs for .10, only for .8.

Change 517876 had a related patch set uploaded (by Reedy; owner: Reedy):
[operations/mediawiki-config@master] Revert "group1 wikis to 1.34.0-wmf.10 refs T220735"

https://gerrit.wikimedia.org/r/517876

Change 517876 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "group1 wikis to 1.34.0-wmf.10 refs T220735"

https://gerrit.wikimedia.org/r/517876

Reedy renamed this task from "MassMessages apparently not being sent on 1.34.0-wmf.10" to "Jobs not being executed on 1.34.0-wmf.10". · Wed, Jun 19, 2:45 PM

Mentioned in SAL (#wikimedia-operations) [2019-06-19T14:46:36Z] <reedy@deploy1001> rebuilt and synchronized wikiversions files: group1 back to .8 T226109

Krinkle added a subscriber: Pchelolo.
Krinkle added a subscriber: mobrovac.
Reedy triaged this task as High priority. · Wed, Jun 19, 2:47 PM
Reedy removed a project: Patch-For-Review.
Krinkle raised the priority of this task from High to Unbreak Now!. · Wed, Jun 19, 2:47 PM
Krinkle added a subscriber: Krinkle.

Train blockers are UBN.

Restricted Application added a subscriber: Liuxinyu970226. · Wed, Jun 19, 2:47 PM
CDanis added a subscriber: CDanis. · Wed, Jun 19, 2:59 PM
Joe added a subscriber: Joe. · Wed, Jun 19, 3:36 PM

A lot of jobs weren't even being executed/queued. Just looking at today's jobs:

$ fgrep commonswiki JobExecutor.log | fgrep wmf.10 | perl -ne 'print "$1\n" if /(\{.*)$/' | jq .job_type | sort | uniq -c | sort -n
      2 "enotifNotify"
      4 "globalUsageCachePurge"
      6 "EchoNotificationDeleteJob"
     10 "cdnPurge"
     12 "categoryMembershipChange"
     16 "recentChangesUpdate"
     16 "ThumbnailRender"
     40 "htmlCacheUpdate"
     54 "cirrusSearchLinksUpdatePrioritized"
    478 "wikibase-addUsagesForPage"
    822 "RecordLintJob"
   1613 "LocalGlobalUserPageCacheUpdateJob"
 201159 "refreshLinks"

While after the rollback we find:

fgrep commonswiki JobExecutor.log | fgrep wmf.8 | perl -ne 'print "$1\n" if /(\{.*)$/' | jq .job_type | sort | uniq -c | sort -n
      2 "AssembleUploadChunks"
      2 "cirrusSearchDeletePages"
      2 "TTMServerMessageUpdateJob"
      4 "refreshLinksPrioritized"
      8 "CentralAuthCreateLocalAccountJob"
      8 "LoginNotifyChecks"
     10 "newUserMessageJob"
     12 "cirrusSearchLinksUpdate"
     38 "EchoNotificationDeleteJob"
     50 "activityUpdateJob"
     78 "globalUsageCachePurge"
    274 "ThumbnailRender"
    322 "cirrusSearchIncomingLinkCount"
    590 "enotifNotify"
   1132 "categoryMembershipChange"
   1190 "cdnPurge"
   1257 "recentChangesUpdate"
   1371 "refreshLinks"
   2352 "htmlCacheUpdate"
   2498 "cirrusSearchLinksUpdatePrioritized"
   5849 "wikibase-addUsagesForPage"
   7396 "RecordLintJob"

So it looks like the jobs were not even being enqueued and are now lost.

TL;DR: this probably means badly formatted messages were sent to event-something, and we should have an alert on an elevated error rate there.
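
A minimal Python equivalent of the fgrep/perl/jq pipeline above, for reference; it assumes each matching JobExecutor.log line ends with a JSON object carrying a job_type field (exactly what the perl step extracts), and uses the same path and filters:

# Count job_type occurrences per MediaWiki version in JobExecutor.log,
# mirroring the fgrep/perl/jq pipeline above. Assumes each matching log
# line ends with a JSON object containing a "job_type" field.
import json
from collections import Counter

LOG = "/srv/mw-log/JobExecutor.log"   # path as used on mwlog1001
WIKI = "commonswiki"
VERSIONS = ("1.34.0-wmf.8", "1.34.0-wmf.10")

counts = {v: Counter() for v in VERSIONS}
with open(LOG, errors="replace") as fh:
    for line in fh:
        if WIKI not in line:
            continue
        version = next((v for v in VERSIONS if v in line), None)
        brace = line.find("{")
        if version is None or brace == -1:
            continue
        try:
            counts[version][json.loads(line[brace:])["job_type"]] += 1
        except (ValueError, KeyError):
            continue  # line without a parsable job payload

for version, counter in counts.items():
    print(f"== {version} ==")
    for job_type, n in counter.most_common():
        print(f"{n:>8}  {job_type}")

Comparing counts like these between the old and new version is essentially what an automated check would need to do to catch a drop like the one above.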

Change 517885 had a related patch set uploaded (by Reedy; owner: Reedy):
[operations/mediawiki-config@master] Revert "group0 wikis to 1.34.0-wmf.10 refs T220735"

https://gerrit.wikimedia.org/r/517885

Change 517885 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "group0 wikis to 1.34.0-wmf.10 refs T220735"

https://gerrit.wikimedia.org/r/517885

Mentioned in SAL (#wikimedia-operations) [2019-06-19T15:43:37Z] <reedy@deploy1001> rebuilt and synchronized wikiversions files: group0 back to .8 T226109

@Reedy manually ran the global renames that were never queued properly.

When will the fix be ready?

Reedy added a comment. · Thu, Jun 20, 1:12 PM

When will the fix be ready?

Sometime after someone works out what's broken

When will the fix be ready?

Sometime after someone works out what's broken

Could someone investigate what's broken?

Jpita added a subscriber: Jpita. · Thu, Jun 20, 2:20 PM
RhinosF1 added a comment (edited). · Thu, Jun 20, 4:20 PM

When will the fix be ready?

Sometime after someone works out what's broken

Could someone investigate what's broken?

They will be. Based on the last metrics, tasks at UBN were open for 12 days, but that doesn't mean they were UBN for the full time. (This figure may also be inflated, and for a train blocker it is probably much lower.)

See https://lists.wikimedia.org/pipermail/wikitech-l/2019-June/092186.html

Asking without adding information doesn't speed it up.

<Edited to clarify per @Aklapper's comments>

This comment was removed by RhinosF1.
Krinkle added a comment (edited). · Thu, Jun 20, 4:22 PM

When will the fix be ready?

If you are still experiencing issues today, please report them in a new ticket.

Yesterday we rolled back production to Monday's deployment, which is before the issue was introduced. If something is still causing problems today, it is likely a separate issue we do not know about yet.

The issue in this ticket should be resolved soon, but even if it isn't, it shouldn't cause any further problems, as the affected version has already been rolled back.

mmodell added a subscriber: mmodell.
mmodell added a comment (edited). · Thu, Jun 20, 4:25 PM

Edit: I didn't see @Krinkle's message; this was replying to @RhinosF1:

Given that this is a train blocker, 12 days isn't acceptable.

They will be. Based on the last metrics it takes ~12 days for a UBN to be fixed.

[offtopic] No, "~12 days" is incorrect and misleading. (Discussing this would be off-topic for this task; feel free to bring it up wherever you found this number.)

Could someone investigate what's broken?

That is exactly what this task is about, so there is no need to explicitly ask for investigation as this applies to basically every task. :)

This comment was removed by Agusbou2015.
This comment was removed by Krinkle.

@Agusbou2015: Please stop posting unhelpful repetitive messages. Thanks.

@Pchelolo and I looked through all the logs and graphs and all the patchsets between .8 and .10 for EventBus and MassMessage (in detail) and core (DB- and JQ-related) and found nothing of note that could have caused this. Could it have been just a hiccup for MassMessage?

Reedy added a comment (edited). · Thu, Jun 20, 10:50 PM

@Pchelolo and I looked through all the logs and graphs and all the patchsets between .8 and .10 for EventBus and MassMessage (in detail) and core (DB- and JQ-related) and found nothing of note that could have caused this. Could it have been just a hiccup for MassMessage?

Nope, unfortunately not. It also broke at least Global Rename.

.10 was live for around 17 hours, from memory. See also Joe's comment and all the other types of jobs that were seemingly affected.

A lot of jobs weren't even being executed/queued. Just looking at today's jobs:

$ fgrep commonswiki JobExecutor.log | fgrep wmf.10 | perl -ne 'print "$1\n" if /(\{.*)$/' | jq .job_type | sort | uniq -c | sort -n
      2 "enotifNotify"
      4 "globalUsageCachePurge"
      6 "EchoNotificationDeleteJob"
     10 "cdnPurge"
     12 "categoryMembershipChange"
     16 "recentChangesUpdate"
     16 "ThumbnailRender"
     40 "htmlCacheUpdate"
     54 "cirrusSearchLinksUpdatePrioritized"
    478 "wikibase-addUsagesForPage"
    822 "RecordLintJob"
   1613 "LocalGlobalUserPageCacheUpdateJob"
 201159 "refreshLinks"

While after the rollback we find:

fgrep commonswiki JobExecutor.log | fgrep wmf.8 | perl -ne 'print "$1\n" if /(\{.*)$/' | jq .job_type | sort | uniq -c | sort -n
      2 "AssembleUploadChunks"
      2 "cirrusSearchDeletePages"
      2 "TTMServerMessageUpdateJob"
      4 "refreshLinksPrioritized"
      8 "CentralAuthCreateLocalAccountJob"
      8 "LoginNotifyChecks"
     10 "newUserMessageJob"
     12 "cirrusSearchLinksUpdate"
     38 "EchoNotificationDeleteJob"
     50 "activityUpdateJob"
     78 "globalUsageCachePurge"
    274 "ThumbnailRender"
    322 "cirrusSearchIncomingLinkCount"
    590 "enotifNotify"
   1132 "categoryMembershipChange"
   1190 "cdnPurge"
   1257 "recentChangesUpdate"
   1371 "refreshLinks"
   2352 "htmlCacheUpdate"
   2498 "cirrusSearchLinksUpdatePrioritized"
   5849 "wikibase-addUsagesForPage"
   7396 "RecordLintJob"

So it looks like the jobs were not even being enqueued and are now lost.
TL;DR: this probably means badly formatted messages were sent to event-something, and we should have an alert on an elevated error rate there.

We're again experiencing global renames becoming stuck: https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress

This time they started, but stopped in the middle of the process.

Shall I file a new task?

Thanks.

Reedy added a comment. · Fri, Jun 21, 9:28 PM

We're again experiencing global renames becoming stuck: https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress
This time they started, but stopped in the middle of the process.
Shall I file a new task?
Thanks.

It looks like it might actually be related; for https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/TRXX-TRXX it has done testwiki but then stopped (same for the other one I already fixed)...

I'm guessing it didn't enqueue the job for the next wiki in the list? In that case, manually doing thwiki got the rest to continue fine (see the sketch of this chained-job pattern after the log excerpt below).

Reedy added a comment. · Fri, Jun 21, 9:29 PM
reedy@mwlog1001:/srv/mw-log$ grep "チルノ" JobExecutor.log | grep testwiki
2019-06-21 15:29:16 [XQz3oApAMF0AAHI6IgEAAAAR] mw1336 testwikidatawiki 1.34.0-wmf.10 JobExecutor DEBUG: Beginning job execution {"job":"LocalGlobalUserPageCacheUpdateJob User:チルノ namespace=2 requestId=XQz3oApAMF0AAHI6IgEAAAAR title=チルノ touch= username=チルノ","job_type":"LocalGlobalUserPageCacheUpdateJob"} 
2019-06-21 15:29:16 [XQz3oApAMF0AAHI6IgEAAAAR] mw1311 testwiki 1.34.0-wmf.10 JobExecutor DEBUG: Beginning job execution {"job":"LocalGlobalUserPageCacheUpdateJob User:チルノ namespace=2 requestId=XQz3oApAMF0AAHI6IgEAAAAR title=チルノ touch= username=チルノ","job_type":"LocalGlobalUserPageCacheUpdateJob"} 
2019-06-21 15:29:16 [XQz3oApAMF0AAHI6IgEAAAAR] mw1336 testwikidatawiki 1.34.0-wmf.10 JobExecutor INFO: Finished job execution {"job":"LocalGlobalUserPageCacheUpdateJob User:チルノ namespace=2 requestId=XQz3oApAMF0AAHI6IgEAAAAR title=チルノ touch= username=チルノ","job_type":"LocalGlobalUserPageCacheUpdateJob","job_status":true,"job_duration":0.01779317855835} 
2019-06-21 15:29:16 [XQz3oApAMF0AAHI6IgEAAAAR] mw1311 testwiki 1.34.0-wmf.10 JobExecutor INFO: Finished job execution {"job":"LocalGlobalUserPageCacheUpdateJob User:チルノ namespace=2 requestId=XQz3oApAMF0AAHI6IgEAAAAR title=チルノ touch= username=チルノ","job_type":"LocalGlobalUserPageCacheUpdateJob","job_status":true,"job_duration":0.032263994216919} 
2019-06-21 15:29:22 [XQz3oApAMF0AAHI6IgEAAAAR] mw1336 testwiki 1.34.0-wmf.10 JobExecutor DEBUG: Beginning job execution {"job":"LocalRenameUserJob Global_rename_job force= from=New visitor movepages=1 namespace=0 promotetoglobal= reason=per [[Special:Permalink/19164089|request]] reattach=array(69) renamer=1997kB requestId=XQz3oApAMF0AAHI6IgEAAAAR session={\"headers\":\"array(...)\",\"ip\":\"157.36.249.217\",\"sessionId\":\"\",\"userId\":0} suppressredirects=1 title=Global_rename_job to=チルノ","job_type":"LocalRenameUserJob"} 
2019-06-21 15:29:25 [XQz3oApAMF0AAHI6IgEAAAAR] mw1336 testwiki 1.34.0-wmf.10 JobExecutor INFO: Finished job execution {"job":"LocalRenameUserJob Global_rename_job force= from=New visitor movepages=1 namespace=0 promotetoglobal= reason=per [[Special:Permalink/19164089|request]] reattach=array(69) renamer=1997kB requestId=XQz3oApAMF0AAHI6IgEAAAAR session={\"headers\":\"array(...)\",\"ip\":\"157.36.249.217\",\"sessionId\":\"\",\"userId\":0} suppressredirects=1 title=Global_rename_job to=チルノ","job_type":"LocalRenameUserJob","job_status":true,"job_duration":2.9513418674469}
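
To make the suspected failure mode concrete, here is a hypothetical sketch of the chained per-wiki job pattern described above (the class and helper names are illustrative, not the actual CentralAuth LocalRenameUserJob code): each execution renames the user on one wiki and then enqueues the job for the next wiki, so a run that dies, or is silently dropped, before the enqueue step stalls the whole rename at that wiki.

# Hypothetical sketch of a chained per-wiki rename job (illustrative names,
# not the real MediaWiki/CentralAuth code): each execution must enqueue the
# job for the next wiki, so a silent failure before that point stalls the chain.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class LocalRenameJob:
    old_name: str
    new_name: str
    wikis: List[str]  # remaining wikis; the current one is first

    def run(self, enqueue: Callable, rename_on: Callable) -> None:
        wiki = self.wikis[0]
        rename_on(wiki, self.old_name, self.new_name)
        # If execution dies (or the event is silently discarded) before this
        # point, no job is ever created for the next wiki and the global
        # rename appears "stuck" at this wiki, as seen on GlobalRenameProgress.
        if self.wikis[1:]:
            enqueue(LocalRenameJob(self.old_name, self.new_name, self.wikis[1:]))

# Toy queue that just runs each job in turn.
if __name__ == "__main__":
    queue: List[LocalRenameJob] = []
    queue.append(LocalRenameJob("New visitor", "チルノ", ["testwiki", "trwiki", "ukwiki"]))
    while queue:
        queue.pop(0).run(queue.append,
                         lambda wiki, old, new: print(f"renamed {old} -> {new} on {wiki}"))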

I do not have a solid theory of what's happening yet, but just throwing in some bits and pieces.

Regarding the LocalRenameUserJob for TRXX-TRXX:

kafkacat -b localhost:9092 -p 0 -t "eqiad.mediawiki.job.LocalRenameUserJob" -o beginning | grep TRXX-TRXX | jq .meta

gives us the metadata of the jobs run in production. The three events of interest here are:

{
  "domain": "test.wikipedia.org",
  "dt": "2019-06-21T16:51:35+00:00",
  "id": "d42e547d-9444-11e9-a689-1866da993d2e",
  "request_id": "XQ0KvApAICAAAA9lpPcAAAAS",
  "schema_uri": "mediawiki/job/3",
  "topic": "mediawiki.job.LocalRenameUserJob",
  "uri": "https://placeholder.invalid/wiki/Special:Badtitle"
}
{
  "domain": "tr.wikipedia.org",
  "dt": "2019-06-21T16:51:37+00:00",
  "id": "d55f72bd-9444-11e9-9691-141877614309",
  "request_id": "XQ0KvApAICAAAA9lpPcAAAAS",
  "schema_uri": "mediawiki/job/3",
  "topic": "mediawiki.job.LocalRenameUserJob",
  "uri": "https://placeholder.invalid/wiki/Special:Badtitle"
}
{
  "domain": "uk.wikipedia.org",
  "dt": "2019-06-21T21:30:08+00:00",
  "id": "bde7c2dd-946b-11e9-8649-1866da5fcf1f",
  "request_id": "99012b73681a3ced8c965182",
  "schema_uri": "mediawiki/job/3",
  "topic": "mediawiki.job.LocalRenameUserJob",
  "uri": "https://placeholder.invalid/wiki/Special:Badtitle"
}

Notice that the trwiki job timestamp immediately follows the testwiki timestamp, but the next one, for ukwiki, comes after a gap of roughly five hours. I assume that's when @Reedy manually continued the execution loop.

So execution on wmf.10 did create the subsequent job event, but the execution of that next job then broke and did not create a subsequent event of its own.
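
For spotting such gaps without eyeballing timestamps, a small sketch that reads .meta objects, one compact JSON object per line on stdin (for example the kafkacat command above piped through grep TRXX-TRXX and jq -c .meta), and prints the time between consecutive events:

#!/usr/bin/env python3
# Sketch: read job-event .meta objects, one JSON object per line on stdin
# (e.g. kafkacat ... | grep TRXX-TRXX | jq -c .meta), and print the gap
# between consecutive "dt" timestamps to spot a stalled chain like the one above.
import json
import sys
from datetime import datetime

previous = None
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    meta = json.loads(line)
    dt = datetime.fromisoformat(meta["dt"])  # e.g. "2019-06-21T16:51:37+00:00"
    if previous is not None:
        print(f"{previous[1]:<20} -> {meta['domain']:<20} gap: {dt - previous[0]}")
    previous = (dt, meta["domain"])

Fed the three events above, it shows the two-second hop from testwiki to trwiki and then the multi-hour pause before the ukwiki event, i.e. exactly where the chain broke.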

Change 518751 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/mediawiki-config@master] RunSingleJob: Don't silently fail if page_title is not provided in job.

https://gerrit.wikimedia.org/r/518751

The above patch to mediawiki-config will fix it. This is another instance of having code in multiple places biting us hard. Given that it affects production, this config change has to be deployed ASAP. (A sketch of the validation idea follows the sync entry below.)

Change 518751 merged by jenkins-bot:
[operations/mediawiki-config@master] RunSingleJob: Don't silently fail if page_title is not provided in job.

https://gerrit.wikimedia.org/r/518751

Mentioned in SAL (#wikimedia-operations) [2019-06-24T16:13:03Z] <mobrovac@deploy1001> Synchronized rpc/RunSingleJob.php: RunSingleJob: check that only the database param is set and leave the rest to JobExecutor - T226109 (duration: 00m 55s)
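
For illustration only, a Python sketch of the validation idea described by the patch subject and the sync entry above; the actual change is PHP in mediawiki-config's rpc/RunSingleJob.php, and the event layout and names below are assumptions rather than the real code. The point is the error handling: require only the database parameter at this layer, leave the remaining job parameters to the JobExecutor, and fail loudly instead of silently dropping an event that is missing something like page_title.

# Sketch of the validation idea only (the real fix is PHP in rpc/RunSingleJob.php);
# the event layout and helper names here are illustrative assumptions.
class JobValidationError(Exception):
    """Raised instead of silently dropping a malformed job event."""

def run_single_job(event: dict, executor) -> None:
    # Only the target database is required at this layer; interpreting the
    # rest of the job parameters (page_title, namespace, ...) is left to the
    # job executor, so a missing optional field must not silently end here.
    database = event.get("database")
    if not database:
        raise JobValidationError(
            "job event %s has no 'database' set"
            % event.get("meta", {}).get("id", "<unknown>"))
    executor.execute(wiki=database, event=event)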

Pchelolo closed this task as Resolved. · Mon, Jun 24, 4:16 PM
Pchelolo claimed this task.

The fix to mw-config has been deployed; we should be good now.

Restricted Application added a project: Analytics. · Mon, Jun 24, 4:32 PM

Am I alone in feeling like this probably deserves an incident report?

I'd be very happy to be wrong about this, but it seems like there are no mechanisms in place to prevent this kind of error from happening again, or even to detect it if/when it does.

RhinosF1 added a comment (edited). · Mon, Jun 24, 7:19 PM

Am I alone in feeling like this probably deserves an incident report?
I'd be very happy to be wrong about this, but it seems like there are no mechanisms in place to prevent this kind of error from happening again, or even to detect it if/when it does.

<strike>I think there was one at some point</strike> Wrong month!

Reedy added a comment. · Mon, Jun 24, 7:25 PM

I think there was one at some point

Pretty sure there wasn't

I think there was one at some point

Pretty sure there wasn't

I don't know what I've read, but I'm certain I've read an incident report, and I'm sure it referenced @Elitre and mass messages.

Reedy added a comment. · Mon, Jun 24, 7:31 PM

I think there was one at some point

Pretty sure there wasn't

I don't know what I've read, but I'm certain I've read an incident report, and I'm sure it referenced @Elitre and mass messages.

That might've been the last breakage, but the cause is different.

RhinosF1 added a comment (edited). · Mon, Jun 24, 7:32 PM

I think there was one at some point

Pretty sure there wasn't

I don't know what I've read, but I'm certain I've read an incident report, and I'm sure it referenced @Elitre and mass messages.

That might've been the last breakage, but the cause is different.

It was https://wikitech.wikimedia.org/wiki/Incident_documentation/20190417-Jobqueue from April!! Shows you how well I read it!