Maniphest T194376

GlobalRename stuck again at Beta
Closed, Resolved · Public
Assigned To

Authored By

	MarcoAurelio
	May 10 2018, 10:25 AM

Description

16:58, 9 May 2018 Jianhui67 (talk | contribs) globally renamed Hosiryuhosi to Rxy (Requested)

Rename stuck "queued" at all wikis at https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:GlobalRenameProgress/Rxy

Event Timeline

MarcoAurelio created this task. May 10 2018, 10:25 AM

Restricted Application added a subscriber: Aklapper. May 10 2018, 10:25 AM

cpjobqueue: KafkaConsumer is not connected

I see lots of those at Logstash:

cpjobqueue: KafkaConsumer is not connected
    at Function.createLibrdkafkaError [as create] (/srv/deployment/cpjobqueue/deploy-cache/revs/5c1dcb96e0539f63ec033a845d2150283c211493/node_modules/node-rdkafka/lib/error.js:260:10)
    at /srv/deployment/cpjobqueue/deploy-cache/revs/5c1dcb96e0539f63ec033a845d2150283c211493/node_modules/node-rdkafka/lib/kafka-consumer.js:442:29

If the job queue is not working, it makes sense that the global rename isn't working either.

Lots == 4,302,490 in the last 24 hours.
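To put that count in perspective, here is a rough sketch that converts it to an average rate. The only inputs are the figure quoted above and the assumption that the Logstash window was exactly 24 hours; the function name is made up for illustration.

```python
# Figure quoted in the comment above; the window is assumed to be exactly 24 hours.
EVENTS = 4_302_490
WINDOW_SECONDS = 24 * 60 * 60


def rate_per_second(events: int, window_seconds: int) -> float:
    """Average number of events per second over the window."""
    return events / window_seconds


per_second = rate_per_second(EVENTS, WINDOW_SECONDS)
per_minute = per_second * 60

print(f"{per_second:.1f} errors/s, {per_minute:.0f} errors/min")
# prints "49.8 errors/s, 2988 errors/min"
```

An average of roughly 50 errors per second suggests a tight reconnect or retry loop rather than occasional failures, though the log data alone does not confirm that.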

The deployment-kafka-jumbo-{1,2} machines are failing their Puppet runs as well:

deployment-kafka-jumbo-1

The last Puppet run was at Wed May  9 16:40:22 UTC 2018 (1083 minutes ago).
maurelio@deployment-kafka-jumbo-1:~$ sudo puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find data item profile::kafka::mirror::source_cluster_name in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/kafka/mirror.pp:48:33 on node deployment-kafka-jumbo-1.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run

deployment-kafka-jumbo-2

The last Puppet run was at Wed May  9 16:37:39 UTC 2018 (1088 minutes ago).
maurelio@deployment-kafka-jumbo-2:~$ sudo puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find data item profile::kafka::mirror::source_cluster_name in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/kafka/mirror.pp:48:33 on node deployment-kafka-jumbo-2.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
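Both hosts fail on the same missing Hiera key. When checking several hosts, a small helper can pull the key, manifest location, and node name out of each error. The following is only a sketch: the regular expression is written against the exact wording of the message quoted above, other Puppet versions may phrase it differently, and the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Pattern matched against the wording of the error quoted above (an assumption:
# other Puppet versions may word this message differently).
MISSING_KEY = re.compile(
    r"Could not find data item (?P<key>\S+) in any Hiera data file"
    r".* at (?P<manifest>\S+?):(?P<line>\d+)(?::\d+)? on node (?P<node>\S+)"
)


def parse_missing_hiera_key(message: str) -> Optional[Tuple[str, str, int, str]]:
    """Return (key, manifest, line, node), or None if the message does not match."""
    m = MISSING_KEY.search(message)
    if m is None:
        return None
    return m["key"], m["manifest"], int(m["line"]), m["node"]


# Tail of the error reported on deployment-kafka-jumbo-1, copied from the output above.
sample = (
    "Could not find data item profile::kafka::mirror::source_cluster_name "
    "in any Hiera data file and no default supplied at "
    "/etc/puppet/modules/profile/manifests/kafka/mirror.pp:48:33 "
    "on node deployment-kafka-jumbo-1.deployment-prep.eqiad.wmflabs"
)

result = parse_missing_hiera_key(sample)
print(result)
```

For the sample, the helper reports the key profile::kafka::mirror::source_cluster_name, referenced at line 48 of mirror.pp, which is the value that has no Hiera entry for these hosts.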

@Ottomata Hi. I see at https://github.com/wikimedia/puppet/commits/production that you made some commits yesterday with 'kafka' in the title. Could any of those be the reason? Thanks :)

In T194376#4196737, @MarcoAurelio wrote:
cpjobqueue: KafkaConsumer is not connected

I see lots of those at Logstash:
cpjobqueue: KafkaConsumer is not connected
    at Function.createLibrdkafkaError [as create] (/srv/deployment/cpjobqueue/deploy-cache/revs/5c1dcb96e0539f63ec033a845d2150283c211493/node_modules/node-rdkafka/lib/error.js:260:10)
    at /srv/deployment/cpjobqueue/deploy-cache/revs/5c1dcb96e0539f63ec033a845d2150283c211493/node_modules/node-rdkafka/lib/kafka-consumer.js:442:29
if the jobqueue is not working, then it's logical globalrename isn't working.

It got stuck indeed. A restart fixed it.

Mentioned in SAL (#wikimedia-releng) [2018-05-10T13:19:30Z] <Hauskatze> maurelio@deployment-tin:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=deploymentwiki --logwiki=deploymentwiki 'Hosiryuhosi' 'Rxy' | T194376

I ran the fix script for the stuck global rename after the job queue restart. The rename finished, but none of the accounts ended up attached, so I think I'll have to run an attachment script. Nevertheless, the reported error, cpjobqueue: KafkaConsumer is not connected, continues to flood Logstash. As we discussed on IRC, let's see if @Ottomata can help stop Kafka from misbehaving (plus the Puppet errors on the kafka-jumbo machines above). Thanks.

Ah, the jumbo nodes are failing because I changed a prod Puppet class but did not uninclude it in Horizon for deployment-prep. Should be unrelated. Will fix.

The commits yesterday were about upgrading Kafka from 0.9.0.1 to 1.1.0 in production. The deployment-prep upgrade was done last week.

I'll look, though we might need Petr's help if this is about job queue clients.

• Vvjjkkii renamed this task from GlobalRename stuck again at Beta to z7caaaaaaa. Jul 1 2018, 1:10 AM

• Vvjjkkii triaged this task as High priority.

• Vvjjkkii added projects: CheckUser, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), Tamil-Sites, Gamepress, Hashtags, Jade, KartoEditor, Language-2018-Apr-June, New-Editor-Experiences, Mail, TCB-Team (now WMDE-TechWish).

• Vvjjkkii updated the task description.

• Vvjjkkii removed subscribers: MarcoAurelio, Aklapper.

• mobrovac renamed this task from z7caaaaaaa to GlobalRename stuck again at Beta. Jul 1 2018, 10:35 AM

• mobrovac raised the priority of this task from High to Needs Triage.

• mobrovac edited projects, added MediaWiki-Core-JobQueue, Event-Platform; removed TCB-Team (now WMDE-TechWish), Mail, New-Editor-Experiences, Language-2018-Apr-June, KartoEditor, Jade, Hashtags, Gamepress, Tamil-Sites, Connected-Open-Heritage-Batch-uploads (RAÄ-KMB_1_2017-02), CheckUser.

• mobrovac updated the task description.

• mobrovac added a subscriber: MarcoAurelio.

Restricted Application added a project: Analytics. Jul 1 2018, 10:35 AM

• fdans moved this task from Incoming to Radar on the Analytics board. Jul 2 2018, 4:17 PM

I believe that's not an issue any more?

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:44 AM
