GlobalRename stuck again at Beta
Closed, Resolved, Public

Description

16:58, 9 May 2018 Jianhui67 (talk | contribs) globally renamed Hosiryuhosi to Rxy (Requested)

The rename is stuck at "queued" on all wikis; see https://deployment.wikimedia.beta.wmflabs.org/wiki/Special:GlobalRenameProgress/Rxy

Event Timeline

cpjobqueue: KafkaConsumer is not connected

I see lots of those at Logstash:

cpjobqueue: KafkaConsumer is not connected
    at Function.createLibrdkafkaError [as create] (/srv/deployment/cpjobqueue/deploy-cache/revs/5c1dcb96e0539f63ec033a845d2150283c211493/node_modules/node-rdkafka/lib/error.js:260:10)
    at /srv/deployment/cpjobqueue/deploy-cache/revs/5c1dcb96e0539f63ec033a845d2150283c211493/node_modules/node-rdkafka/lib/kafka-consumer.js:442:29

If the jobqueue is not working, it makes sense that the global rename isn't working either.

Lots == 4,302,490 in the last 24 hours.
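
To put that number in perspective, here is a minimal sketch that turns the 24-hour count into an average rate. It assumes the errors are spread evenly across the window, which the Logstash count alone does not confirm; the function and constant names are only illustrative.

```python
# Count taken from the Logstash figure quoted above.
ERRORS_LAST_24H = 4_302_490
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(count, window_seconds):
    """Average events per second over the window (assumes an even spread)."""
    return count / window_seconds


per_second = average_rate(ERRORS_LAST_24H, SECONDS_PER_DAY)
per_minute = per_second * 60

print(f"{per_second:.1f} errors/s, {per_minute:.0f} errors/min")
```

That works out to roughly 50 errors per second on average, which fits a consumer that keeps retrying without ever connecting rather than an occasional failure.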

Puppet runs are failing on the deployment-kafka-jumbo-{1,2} machines as well:

deployment-kafka-jumbo-1
The last Puppet run was at Wed May  9 16:40:22 UTC 2018 (1083 minutes ago).
maurelio@deployment-kafka-jumbo-1:~$ sudo puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find data item profile::kafka::mirror::source_cluster_name in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/kafka/mirror.pp:48:33 on node deployment-kafka-jumbo-1.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
deployment-kafka-jumbo-2
The last Puppet run was at Wed May  9 16:37:39 UTC 2018 (1088 minutes ago).
maurelio@deployment-kafka-jumbo-2:~$ sudo puppet agent -tv
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find data item profile::kafka::mirror::source_cluster_name in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/kafka/mirror.pp:48:33 on node deployment-kafka-jumbo-2.deployment-prep.eqiad.wmflabs
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
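
Both hosts fail on the same missing Hiera key. As a rough sketch (not an existing tool), the key, manifest location, and node can be pulled out of the agent output like this; the regular expression is an assumption based only on the error text pasted above, and other Puppet versions may word the message differently.

```python
import re

# Pattern for the "missing Hiera key" failure as it appears in the output above.
# Assumption: the wording matches this Puppet version; adjust if it differs.
MISSING_KEY = re.compile(
    r"Could not find data item (?P<key>\S+) in any Hiera data file"
    r".*? at (?P<manifest>\S+):(?P<line>\d+):(?P<col>\d+)"
    r" on node (?P<node>\S+)"
)


def find_missing_hiera_keys(agent_output):
    """Return one dict per missing-key error found in `puppet agent` output."""
    results = []
    for text_line in agent_output.splitlines():
        match = MISSING_KEY.search(text_line)
        if match:
            info = match.groupdict()
            info["line"] = int(info["line"])
            info["col"] = int(info["col"])
            results.append(info)
    return results


# Relevant tail of the error line from deployment-kafka-jumbo-1.
sample = (
    "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: "
    "Server Error: Evaluation Error: Error while evaluating a Function Call, "
    "Could not find data item profile::kafka::mirror::source_cluster_name "
    "in any Hiera data file and no default supplied at "
    "/etc/puppet/modules/profile/manifests/kafka/mirror.pp:48:33 "
    "on node deployment-kafka-jumbo-1.deployment-prep.eqiad.wmflabs"
)

for error in find_missing_hiera_keys(sample):
    print(error["node"], "is missing", error["key"],
          "needed by", f'{error["manifest"]}:{error["line"]}')
```

Running this over the output from several hosts makes it easy to confirm that they all trip over the same key rather than separate problems.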

@Ottomata Hi. I see at https://github.com/wikimedia/puppet/commits/production that you made some commits yesterday with 'kafka' in the title. Could any of those be the cause? Thanks :)

The jobqueue did indeed get stuck. A restart fixed it.

Mentioned in SAL (#wikimedia-releng) [2018-05-10T13:19:30Z] <Hauskatze> maurelio@deployment-tin:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=deploymentwiki --logwiki=deploymentwiki 'Hosiryuhosi' 'Rxy' | T194376

I've fixed the stuck global rename after the jobqueue restart. Although it finished, none of the accounts ended up attached, so I think I'll have to run an attachment script. However, the reported error, cpjobqueue: KafkaConsumer is not connected, continues to flood Logstash. As we discussed on IRC, let's see if @Ottomata can help here and stop Kafka from misbehaving (plus the Puppet errors on the kafka-jumbo machines above). Thanks.

Ah, the jumbo nodes are failing because I changed a prod puppet class but did not uninclude it from deployment-prep in Horizon. That should be unrelated. Will fix.

The commits yesterday were about upgrading Kafka from 0.9.0.1 to 1.1.0 in production. The deployment-prep upgrade was done last week.

I will look, though we might need Petr's help if this is about job queue clients.

Pchelolo claimed this task.
Pchelolo added a subscriber: Pchelolo.

I believe that's not an issue any more?