Located in Germany: CET (UTC+1)/CEST (UTC+2)
GPG: 84E7 0489 0F69 0544 0A59 86D8 C485 27CF 7D40 8CDE
Per the answer in the Discourse discussion, see https://secure.phabricator.com/T10448#186240 for why upstream probably won't move $whatever (here: task creation) into its own notification setting, and https://secure.phabricator.com/T13069 for the preferred approach. I agree with their interpretation that this is just part of a more general problem, and the proposed modular solution seems better suited to solve that general problem, allowing very fine-grained notification control using the new system of mail stamps.
The downside is that this is not a trivial thing to fix, so it might take a while until we see the current system replaced by the new one.
Indeed, the jobqueue on beta is still broken, although it's unrelated to the logspam.
Thanks to Antoine, I now know that jobs are queued in Kafka. Fiddling a bit on deployment-kafka04 gives me the impression that all the jobs made it there and are sitting happily in the queue; somehow they just don't get consumed. Judging by the description of profile::cpjobqueue, I think deployment-cpjobqueue consumes the messages from Kafka and triggers the jobs on deployment-jobrunner03. There are some errors from deployment-cpjobqueue in Logstash; I might take a look at those tomorrow - or rather, later today in my TZ.
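If someone wants to double-check the "queued but not consumed" theory, consumer lag can be inspected with the stock Kafka tooling. This is a sketch only: the broker address and the consumer group name below are guesses, not verified against deployment-kafka04.

```shell
# Hypothetical sketch: describe the consumer group that cpjobqueue should be
# using. Broker address and group name are assumptions for illustration.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group cpjobqueue
# A growing LAG column with no active members would match the symptom:
# messages sitting in the queue without ever being consumed.
```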
I didn't run that exact command above. Actually, I did this for deploymentwiki first - Special:GlobalRenameProgress then said deploymentwiki was "Done" and everything else was "Queued". Then I did the same on enwiki, and it still only moved enwiki to "Done" and left everything else at "Queued". That was the point where I ran it for the rest of the wikis in the for-loop.
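For the record, the loop was just the obvious shell pattern below; 'unstick-rename' is a stand-in name I made up for this sketch, not the actual command from above.

```shell
# Illustrative only: repeat the same step for every remaining wiki.
# 'unstick-rename' is a placeholder, not a real script.
for wiki in dewiki frwiki itwiki; do
  echo "unstick-rename on $wiki"
done
```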
Alright. Please remember to !log such things - if there's no log entry, I'll assume the puppet breakage is unintended and try to fix it, when it could actually just be ignored for a few days.
Per Aaron's comment, just logspam. The actual problem for renames seems to have been nutcracker, which I fixed in T192473#4141169. Both are now unstuck.
Resolved per the comments above. The error is just logspam, which is tracked in the parent task.
Also note that most of the new PHP 7 syntax won't be available until we stop supporting HHVM. (T176370)
nutcracker now starts on deployment-jobrunner03, see my comment on T178457. The error in the task description still happens though.
Saw the same on deployment-jobrunner03 today (fun note: I found this task because Google listed it when I searched for what 'run, rabbit run' refers to) and had to create the directory manually there to get nutcracker to start.
According to the manual, specifying nothing is equivalent to specifying localhost, and both use the socket file rather than the network:
A Unix socket file is used if you do not specify a host name or if you specify the special host name localhost.
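In other words (illustrative invocations only, assuming a stock client configuration):

```shell
mysql -u root                 # no host given: uses the Unix socket file
mysql -h localhost -u root    # special name 'localhost': also the socket
mysql -h 127.0.0.1 -u root    # explicit IP: forces a TCP connection instead
```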
I agree with the previous comments. Horizon's prefix functionality seems to cover about everything this was expected to provide. Doing this would neither remove the need for people to remember to set hiera values for the cloud services, nor would it reduce the mess that cloud-services hiera currently is. I'm thus boldly declining this.
I poked WMDE about what I found this morning on IRC:
Since T188288 (fr.wikipedia.beta.wmflabs.org uses an invalid security certificate), frwiki is already part of our unified Let's Encrypt cert. JFTR for whoever will tackle this task.
Wontfix until T135427 (Beta puppetmaster cherry-pick process) is resolved. We did, do, and will just ignore that check, or drop it if it gets too annoying.
Do you still have trouble logging into these instances or can this task be closed?
Yeah, we could do that by setting up an instance (or just applying the relevant class to deployment-tin). But then we probably want to make the scap scripts (those run every ten minutes) not trigger the usual "scap started" and "scap ended"/"scap aborted"/... messages. There's a full scap on beta every ten minutes - we don't want an extra message in -releng every ten minutes. If we can convince scap not to log anything for the auto-update, no objections.
Well, the deployment-mediawiki-07 backend was the cause of today's 503s. I changed the appserver backend in hiera to deployment-mediawiki05.
I'm sorry, does this mean this is now "live"? So there will be a full recount on April 15th?
Yes, it'll run at 05:39 UTC on the 1st and 15th of each month from now on, so the first run will be this Sunday.
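For reference, that schedule corresponds to a crontab line like the following; the script path is a placeholder, not the actual entry.

```
# m   h  dom   mon dow  command
  39  5  1,15  *   *    /path/to/recount-script
```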
If that first run fails for whatever reason, the cron will be reverted until the problems are fixed. If that happens, you'll hear about it on this task.
All of beta is currently down.
Indeed. We just checked the last log entry that worked for him and saw that the review action already behaved this way before and the autopatrol action was gone as expected. The API seems fine, the tool needs to be updated.
Alright, ignore me then. :-)
The only concern about doing it directly on the master could be the fact that this table is being updated almost every second.
The puppet errors are fixed and fr.wikipedia.beta.wmflabs.org seems to now be part of the unified cert. Thus I'm closing this as resolved.
That's why I didn't mark the task resolved. But I guess we now have an idea of where this comes from and what to report to upstream: "Please move task creation notifications out of 'Other Notification'". Unfortunately I don't seem to be able to create a task on secure.phabricator.com any more.
That instance no longer exists. I think this is resolved.
According to openstack browser, both instances are gone. Is this resolved?
Just for the record, this table has fewer than 5 rows on all of the wikis.
Puppet run on this instance succeeds. Can this be closed now?
Puppet is fine on both hosts now, so this seems resolved. Thanks @Ottomata!
deployment-fluorine no longer exists and deployment-fluorine02 doesn't have puppet broken. Seems resolved to me.
I didn't receive a mail for the creation of T191701, found this task, and ran a few tests on a local Phabricator instance:
Fixed by changing hiera, should recover with the next puppet run.
Puppet broken on all the appservers, jobrunners and deployment servers in beta.