GlobalRename stuck again at Beta
Closed, Resolved (Public)

Description

16:58, 9 May 2018 Jianhui67 (talk | contribs) globally renamed Hosiryuhosi to Rxy (Requested)

The rename is stuck at "queued" on all wikis; see https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:GlobalRenameProgress/Rxy

Event Timeline

cpjobqueue: KafkaConsumer is not connected

I see lots of those at Logstash:

cpjobqueue: KafkaConsumer is not connected
    at Function.createLibrdkafkaError [as create] (/srv/deployment/cpjobqueue/deploy-cache/revs/5c1dcb96e0539f63ec033a845d2150283c211493/node_modules/node-rdkafka/lib/error.js:260:10)
    at /srv/deployment/cpjobqueue/deploy-cache/revs/5c1dcb96e0539f63ec033a845d2150283c211493/node_modules/node-rdkafka/lib/kafka-consumer.js:442:29

If the job queue is not working, then it's logical that the global rename isn't working either.

Lots == 4,302,490 in the last 24 hours.
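To put that count in perspective, here is a minimal sketch that converts it into an average rate. The count is the figure quoted above; the function and constant names are illustrative, and the real error rate was probably not uniform over the day.

```python
# Count quoted above: "KafkaConsumer is not connected" errors in the last 24 hours.
ERRORS_LAST_24H = 4_302_490
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(count, seconds=SECONDS_PER_DAY):
    """Return the average number of events per second over the window."""
    return count / seconds


rate = per_second(ERRORS_LAST_24H)
print(f"{rate:.1f} errors per second on average")  # prints "49.8 errors per second on average"
```

That is roughly 50 errors per second on average, which is consistent with a consumer retrying in a tight loop rather than failing occasionally.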

The deployment-kafka-jumbo-{1,2} machines are failing their Puppet runs as well:

deployment-kafka-jumbo-1
The last Puppet run was at Wed May  9 16:40:22 UTC 2018 (1083 minutes ago).
maurelio@deployment-kafka-jumbo-1:~$ sudo puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find data item profile::kafka::mirror::source_cluster_name in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/kafka/mirror.pp:48:33 on node deployment-kafka-jumbo-1.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
deployment-kafka-jumbo-2
The last Puppet run was at Wed May  9 16:37:39 UTC 2018 (1088 minutes ago).
maurelio@deployment-kafka-jumbo-2:~$ sudo puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find data item profile::kafka::mirror::source_cluster_name in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/kafka/mirror.pp:48:33 on node deployment-kafka-jumbo-2.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
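Both hosts fail on the same missing Hiera key. As a rough sketch of how to pull the relevant parts out of such a message when comparing several hosts, the snippet below extracts the key, manifest location, and node name. The regular expression is an assumption based only on the two messages above, not on a documented Puppet error format, and the function name is made up for illustration.

```python
import re

# Excerpt of the error reported by deployment-kafka-jumbo-1 above.
ERROR = (
    "Evaluation Error: Error while evaluating a Function Call, "
    "Could not find data item profile::kafka::mirror::source_cluster_name "
    "in any Hiera data file and no default supplied at "
    "/etc/puppet/modules/profile/manifests/kafka/mirror.pp:48:33 "
    "on node deployment-kafka-jumbo-1.deployment-prep.eqiad.wmflabs"
)

# Assumed message shape, inferred from the output above.
PATTERN = re.compile(
    r"Could not find data item (?P<key>\S+) in any Hiera data file"
    r".* at (?P<manifest>\S+):(?P<line>\d+):(?P<col>\d+)"
    r" on node (?P<node>\S+)"
)


def parse_missing_hiera_key(message):
    """Return the key, manifest, line, col and node as a dict, or None if no match."""
    match = PATTERN.search(message)
    return match.groupdict() if match else None


info = parse_missing_hiera_key(ERROR)
print(info["key"], "is referenced at", info["manifest"], "line", info["line"])
```

Running the same function over the jumbo-2 message should yield the same key and manifest with a different node, which confirms that the two failures share one cause.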

@Ottomata Hi. I see at https://github.com/wikimedia/puppet/commits/production that you made some commits yesterday with 'kafka' in the title. Might any of those be the cause? Thanks :)

The job queue did indeed get stuck. A restart fixed it.

Mentioned in SAL (#wikimedia-releng) [2018-05-10T13:19:30Z] <Hauskatze> maurelio@deployment-tin:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=deploymentwiki --logwiki=deploymentwiki 'Hosiryuhosi' 'Rxy' | T194376

I've fixed the stuck global rename after the job queue restart. Although the rename finished, none of the accounts ended up attached, so I think I'll have to run an attachment script. Nevertheless, the reported error, cpjobqueue: KafkaConsumer is not connected, continues to flood Logstash, so as we discussed on IRC, let's see if @Ottomata can help here and stop Kafka from misbehaving (plus the Puppet errors on the kafka-jumbo machines above). Thanks.

Ah, the jumbo nodes are failing because I changed a prod Puppet class but did not uninclude it from deployment-prep in Horizon. That should be unrelated. Will fix.

The commits yesterday were about upgrading Kafka from 0.9.0.1 to 1.1.0 in production. The deployment-prep upgrade was done last week.

Will look, though; we might need Petr's help if this is about job queue clients.

mobrovac renamed this task from z7caaaaaaa to GlobalRename stuck again at Beta. Jul 1 2018, 10:35 AM
mobrovac raised the priority of this task from High to Needs Triage.
mobrovac updated the task description.
mobrovac added a subscriber: MarcoAurelio.
Pchelolo claimed this task.
Pchelolo subscribed.

I believe that's not an issue any more?