I do lots of things. See https://www.mediawiki.org/wiki/User:Majavah.
Fediverse: https://mastodon.technology/@taavi
This new instance is failing to run Puppet:
taavi@deployment-termbox-ssr:~$ sudo run-puppet-agent
Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs]
Info: Retrieving pluginfacts
Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs]
Error: /File[/var/lib/puppet/facts.d]: Could not evaluate: Could not retrieve file metadata for puppet:///pluginfacts: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs]
Info: Retrieving plugin
Error: /File[/var/lib/puppet/lib]: Failed to generate additional resources using 'eval_generate': SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs]
Error: /File[/var/lib/puppet/lib]: Could not evaluate: Could not retrieve file metadata for puppet:///plugins: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs]
Error: Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs]
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Error: Could not send report: SSL_connect returned=1 errno=0 state=error: certificate verify failed (self signed certificate in certificate chain): [self signed certificate in certificate chain for /CN=Puppet CA: deployment-puppetmaster03.deployment-prep.eqiad.wmflabs]
Please fix. All deployment-prep instances must be fully configured via Puppet and not by hand / separate Ansible cookbooks.
If the worker is using a custom k8s deployment, consider configuring liveness/readiness probes to make Kubernetes restart the container when it gets stuck.
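For example, a rough sketch of what that could look like in the Deployment's container spec (the container name, port and /healthz path are placeholders you'd need to adapt to whatever the worker actually exposes):

containers:
  - name: worker                 # placeholder container name
    # restart the container when it stops answering health checks
    livenessProbe:
      httpGet:
        path: /healthz           # hypothetical health endpoint
        port: 8000
      initialDelaySeconds: 10
      periodSeconds: 30
      failureThreshold: 3
    # stop routing traffic to the pod while it isn't ready
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8000
      periodSeconds: 10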
Fixed so maintain-kubeusers won't generate any new broken configs.
This looks very similar to T304180 and T291129.
2022-05-23 13:13:42,815 - irc3.wikibugs - DEBUG - Register plugin 'irc3.plugins.ctcp.CTCP'
2022-05-23 13:13:42,826 - irc3.wikibugs - DEBUG - Register plugin 'irc3.plugins.autojoins.AutoJoins'
2022-05-23 13:13:42,859 - irc3.wikibugs - DEBUG - Register plugin 'irc3.plugins.sasl.Sasl'
2022-05-23 13:13:42,924 - irc3.wikibugs - DEBUG - Starting wikibugs...
2022-05-23 13:13:43,207 - irc3.wikibugs - DEBUG - Connected
2022-05-23 13:13:43,208 - irc3.wikibugs - DEBUG - CONNECT ping-pong ()
2022-05-23 13:14:24,806 - irc3.wikibugs - CRITICAL - connection lost (139787222388544): None
2022-05-23 13:14:24,809 - irc3.wikibugs - CRITICAL - closing old transport (139787222388544)
2022-05-23 13:14:26,812 - irc3.wikibugs - DEBUG - Starting wikibugs...
2022-05-23 13:14:27,360 - irc3.wikibugs - DEBUG - Connected
2022-05-23 13:14:27,361 - irc3.wikibugs - DEBUG - CONNECT ping-pong ()
2022-05-23 13:15:10,604 - irc3.wikibugs - CRITICAL - connection lost (139787222129536): None
2022-05-23 13:15:10,606 - irc3.wikibugs - CRITICAL - closing old transport (139787222129536)
2022-05-23 13:15:12,608 - irc3.wikibugs - DEBUG - Starting wikibugs...
2022-05-23 13:15:12,884 - irc3.wikibugs - DEBUG - Connected
2022-05-23 13:15:12,885 - irc3.wikibugs - DEBUG - CONNECT ping-pong ()
2022-05-23 13:15:56,808 - irc3.wikibugs - CRITICAL - connection lost (139787222129824): None
2022-05-23 13:15:56,810 - irc3.wikibugs - CRITICAL - closing old transport (139787222129824)
2022-05-23 13:15:58,813 - irc3.wikibugs - DEBUG - Starting wikibugs...
2022-05-23 13:15:59,034 - irc3.wikibugs - DEBUG - Connected
2022-05-23 13:15:59,035 - irc3.wikibugs - DEBUG - CONNECT ping-pong ()
2022-05-23 13:16:40,593 - irc3.wikibugs - CRITICAL - connection lost (139787222129920): None
2022-05-23 13:16:40,594 - irc3.wikibugs - CRITICAL - closing old transport (139787222129920)
2022-05-23 13:16:42,597 - irc3.wikibugs - DEBUG - Starting wikibugs...
2022-05-23 13:16:42,869 - irc3.wikibugs - DEBUG - Connected
2022-05-23 13:16:42,870 - irc3.wikibugs - DEBUG - CONNECT ping-pong ()
Surprisingly, this doesn't seem to have caused any unattached local accounts, since the renameuser_status protections in LocalRenameJob prevented the actual renames from happening on the wiki where the rename was requested.
@Zabe I wonder if this can be caused by https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/774972? That could explain the Username@somewiki suffix that we're seeing in the rename logs?
Useful reading:
In T308381#7939420, @bd808 wrote: Poking around to learn how this is handled in the production k8s clusters might be helpful? There are some teaser docs at https://wikitech.wikimedia.org/wiki/Kubernetes/Metrics. Those docs also point to https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config
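For reference, a minimal sketch of what a Prometheus scrape job built on kubernetes_sd_configs can look like (the job name and the annotation-based filtering here are illustrative, not necessarily what the production config does):

scrape_configs:
  - job_name: kubernetes-pods          # illustrative job name
    kubernetes_sd_configs:
      - role: pod                      # discover scrape targets from the Kubernetes API
    relabel_configs:
      # only keep pods that opt in via a prometheus.io/scrape=true annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # copy namespace and pod name onto the scraped series as labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod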
I'm happy to fix it myself if it helps, but thought it might be best to simply create a ticket and tag it with LDAP and SRE to begin with.
Those are not invalid values, those are just people whose usernames contain non-ASCII characters. Our existing stack fully supports them, and I'd argue that any software that does not like non-ASCII values in usernames is buggy and should be fixed.
How was the above list generated? It's missing the WMCS paging alerts; AIUI everything sent to the wmcs-team contact group pages even if it has page => false set.
Anyhow, our work in T308601: Puppet fails on new cloud-vps VMs (with new base images) due to wanting /usr/local/lib/nagios/plugins should fix this issue entirely, so it's probably not worth figuring out what caused this.
Was this on a Cloud VPS VM or a production host?
Neither Toolforge nor Cloud VPS currently offers a managed Prometheus-style monitoring or alerting service. There has been talk of building one for a while, but so far other projects have consumed our very limited engineering resources.
Looks good:
MariaDB [(none)]> show slave status\G
[...]
Using_Gtid: No
Gtid_IO_Pos:
[...]
The fix will be included in this week's train.
This is clouddb1002 (secondary) after setting gtid_domain_id on both servers:
MariaDB [(none)]> SELECT @@GLOBAL.gtid_slave_pos;
+-------------------------------------------------------+
| @@GLOBAL.gtid_slave_pos                               |
+-------------------------------------------------------+
| 0-2886731673-33519859088,2886731673-2886731673-18688 |
+-------------------------------------------------------+
1 row in set (0.00 sec)
Is it intentional that there are two entries and the first one starts with 0-?
In T308388#7929115, @TheresNoTime wrote: In T308388#7929110, @Majavah wrote: I don't think this needs to block the account renaming, manual debugging on production confirms that CentralAuthUser::listAttached() includes those accounts so they shouldn't cause issues during the rename. (This is also why using Special:MergeAccount did nothing.)
@Majavah to confirm, I am okay to attempt to process this rename?
This seems to be a display bug with Special:CentralAuth: we have some 'corrupted' localuser rows for this specific account that look otherwise good but have lu_attached_method set as NULL. SpecialCentralAuth uses the centralauth-admin-unattached message as a placeholder for missing data.
A simple automation for step 3 would be to simply replace the 3 control nodes with new ones, as the cluster join process creates the manifest files from what's stored in the config maps.
Looks like the Kubernetes cronjob scheduler may be getting overloaded at midnight, given how many tools are running jobs at that time. I've increased the scheduling tolerance of your jobs to make Kubernetes start them even if that means they'll run a little after the scheduled time. However, a better solution would be, if possible, to run hourly/daily jobs at a random time (say, 18:37 daily) instead of exactly at midnight or at the top of the hour.
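If your jobs are defined as raw Kubernetes CronJob manifests, the tolerance in question presumably maps to the startingDeadlineSeconds field; a rough sketch of what such a manifest can look like (the name, image, command and values here are illustrative only, not the exact ones used for your jobs):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report                   # hypothetical job name
spec:
  schedule: "37 18 * * *"              # run at 18:37 instead of exactly midnight
  # allow the job to start up to 10 minutes late instead of being skipped
  startingDeadlineSeconds: 600
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: example/report:latest   # placeholder image
              command: ["./run-report.sh"]   # placeholder command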
In T308283#7928181, @thcipriani wrote: You are reading that right.
Am I reading this correctly that the proposed Beta replacement would be directly using and updating production data and also share the production bottlenecks (such as databases)?
Done.
Looks like this was caused by an unrelated Puppet change. Fixed by updating the hiera data in Horizon.
Everything seems to be working again properly. I've filed some actionables and marked those as subtasks, so closing this task.
Seems to have worked fine. Tentatively closing.
Hey! I still see a VM in the ores project, is it ok to delete that too?
taavi@cloudcontrol1004 ~ $ os server list --project ores
+--------------------------------------+-------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
| ID                                   | Name        | Status | Networks                               | Image                                      | Flavor                |
+--------------------------------------+-------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
| c0252f8f-6953-4d08-9ffc-57f8dcf7ba18 | calbon-test | ACTIVE | lan-flat-cloudinstances2b=172.16.0.200 | debian-10.0-buster (deprecated 2020-10-16) | g2.cores1.ram2.disk20 |
+--------------------------------------+-------------+--------+----------------------------------------+--------------------------------------------+-----------------------+
In T290494#7917947, @Majavah wrote: Oh, the kernel pinnings don't work at all. This is an older host that does not use the cloud image
taavi@tools-k8s-worker-42:~ $ uname -a
Linux tools-k8s-worker-42 4.19.0-14-amd64 #1 SMP Debian 4.19.171-2 (2021-01-30) x86_64 GNU/Linux
taavi@tools-k8s-control-2:~ $ kubectl sudo get node tools-k8s-worker-42 -o wide
NAME                  STATUS                     ROLES    AGE     VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                       KERNEL-VERSION    CONTAINER-RUNTIME
tools-k8s-worker-42   Ready,SchedulingDisabled   <none>   2y72d   v1.20.11   172.16.1.74   <none>        Debian GNU/Linux 10 (buster)   4.19.0-14-amd64   docker://20.10.8
Just enabled unattended-upgrades. That still leaves the apt pinnings. Also note that the kernel pinning does not work for new hosts: all bullseye hosts and some buster ones use the cloud kernel variants, which our pinnings don't seem to apply to.
taavi@mwmaint1002 ~ $ mwscript namespaceDupes.php --wiki rowiki --fix
0 pages to fix, 0 were resolvable.
Done. Sorry for the delay!
In T307648#7909052, @Ladsgroup wrote:
- Its schema is not optimal, the block reason and actor can be normalized.
- You could normalize the actor name and comment to the actor id in metawiki but that would couple this database to metawiki's database.
- You could probably instead normalize to the global user id of the actor in central auth and make comment a set of pre-defined values (1 = 'Open proxy', or something like that)
Looks like I managed to sneak a bug into the last webservice update which broke starting all Java web services. Fixed now, sorry about that!