$ sudo /usr/local/sbin/maintain-replica-indexes --database shywiktionary --debug
$ sudo /usr/local/sbin/maintain-views --databases shywiktionary --debug
$ sudo /usr/local/sbin/maintain-meta_p --databases shywiktionary

$ sudo /usr/local/sbin/maintain-replica-indexes --database gcrwiki --debug
$ sudo /usr/local/sbin/maintain-views --databases gcrwiki --debug
$ sudo /usr/local/sbin/maintain-meta_p --databases gcrwiki

$ sudo /usr/local/sbin/maintain-replica-indexes --database szywiki --debug
$ sudo /usr/local/sbin/maintain-views --databases szywiki --debug
$ sudo /usr/local/sbin/maintain-meta_p --databases szywiki

$ sudo /usr/local/sbin/maintain-replica-indexes --database minwiktionary --debug
$ sudo /usr/local/sbin/maintain-views --databases minwiktionary --debug
$ sudo /usr/local/sbin/maintain-meta_p --databases minwiktionary

$ sudo /usr/local/sbin/maintain-replica-indexes --database gewikimedia --debug
$ sudo /usr/local/sbin/maintain-views --databases gewikimedia --debug
$ sudo /usr/local/sbin/maintain-meta_p --databases gewikimedia
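Every new wiki above goes through the same three commands in the same order. A minimal sketch of a wrapper for that sequence, assuming it runs on the same host with sudo rights; the function names setup_wiki_replica and run_step and the DRY_RUN switch are hypothetical and not part of the maintain-* tooling:

```shell
# Sketch only: wraps the three per-wiki commands from the log.
# setup_wiki_replica, run_step and DRY_RUN are hypothetical helpers.

# Run a command under sudo, or only print it when DRY_RUN=1.
run_step() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "sudo $*"
    else
        sudo "$@"
    fi
}

# For each database name given, run maintain-replica-indexes, then
# maintain-views, then maintain-meta_p, stopping at the first failing step.
setup_wiki_replica() {
    local db
    for db in "$@"; do
        run_step /usr/local/sbin/maintain-replica-indexes --database "$db" --debug || return 1
        run_step /usr/local/sbin/maintain-views --databases "$db" --debug || return 1
        run_step /usr/local/sbin/maintain-meta_p --databases "$db" || return 1
    done
}
```

For example, DRY_RUN=1 setup_wiki_replica shywiktionary gcrwiki szywiki minwiktionary gewikimedia prints the 15 commands without running them; dropping DRY_RUN executes them.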
Tue, Nov 19
@Jclark-ctr replaced the drive in slot 4 with a new 1.9TB drive today. I've confirmed that the system and RAID set are both healthy.
@Jclark-ctr replaced this today with a new 1.9TB drive. No host errors were seen and the MegaRAID card looks clean.
Mon, Nov 18
Fri, Nov 15
I've updated the task details with the current status. Should I leave the NetBox status as staged or set it to active? These systems will be testing non-production workloads for the near future, and active seems to imply production status.
Looks like it's an issue with the virtual disk not getting assigned /dev/sda. Checking to see if I can work around this with our installation process and partman, but I may need to switch the controller to HBA mode and use software RAID for the operating system.
I'm having an issue on cloudcephosd1002 and 1003.
Thu, Nov 14
Wed, Nov 13
Tue, Nov 12
Fri, Nov 8
OS packages for Debian GNU/Linux 10 (buster), pulled from http://apt.wikimedia.org/wikimedia stretch-wikimedia thirdparty/kubeadm-k8s:
containerd.io cri-tools docker-ce docker-ce-cli ipset kubeadm kubectl kubernetes-cni
@Jclark-ctr Could you help me with the cloudcephmon1002 and cloudcephmon1003 servers? I'm unable to power them on through iDRAC SSH, IPMI, or the web interface. I see the event logged in the lifecycle log but they never power on.
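For reference, the IPMI attempt can be scripted so that it both requests power-on and then checks whether the chassis actually comes up. This is a sketch under assumptions: the BMC accepts IPMI over LAN (lanplus), the password is exported as IPMI_PASSWORD (which ipmitool's -E flag reads), and the default user is root. The function name power_on_and_wait and the IPMITOOL, IPMI_USER and POLL_SECONDS overrides are hypothetical. It only automates the request and the check; it does not address why a host stays off.

```shell
# Sketch: request power-on over IPMI, then poll until the chassis reports on.
# Assumes IPMI_PASSWORD is exported (read by ipmitool -E) and the BMC user is
# root unless IPMI_USER is set. IPMITOOL may point at a stub for testing.
power_on_and_wait() {
    local host="$1" tries="${2:-10}" i=0
    # Request power-on; ipmitool's confirmation message is discarded.
    "${IPMITOOL:-ipmitool}" -I lanplus -H "$host" -U "${IPMI_USER:-root}" -E \
        chassis power on >/dev/null || return 1
    # Poll "chassis power status" until it says "is on" or tries run out.
    while [ "$i" -lt "$tries" ]; do
        if "${IPMITOOL:-ipmitool}" -I lanplus -H "$host" -U "${IPMI_USER:-root}" -E \
            chassis power status | grep -q 'is on'; then
            echo "$host: powered on"
            return 0
        fi
        i=$((i + 1))
        sleep "${POLL_SECONDS:-5}"
    done
    echo "$host: still off after $tries checks" >&2
    return 1
}
```

Usage would be power_on_and_wait "$MGMT_HOST" 12, where MGMT_HOST is the server's management address. A non-zero exit after all checks matches the symptom described above: the request is accepted but the host never powers on.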
Tue, Nov 5
@Jclark-ctr Thanks! I've confirmed that it's working on my end now.
Mon, Nov 4
The management interface for cloudcephosd1001.mgmt is currently unavailable. Could we get someone to take a look at it, please?
Fri, Nov 1
Confirmed that this warning can safely be ignored.
Tue, Oct 29
Mon, Oct 28
This was originally noticed while working on T235627
Oct 23 2019
Grafana dashboard for HAproxy metrics: https://grafana.wikimedia.org/d/tanisM2Zz/wmcs-openstack-eqiad1-api-stats
Record deletes are working as expected now; this was likely resolved by the OpenStack upgrades and service improvements.
This is no longer the case; it was most likely resolved by the OpenStack upgrades. The form to associate a floating IP address now shows IP addresses correctly.
Oct 22 2019
Patch has been merged and openstack zone list --all-projects is now working as expected.
Oct 21 2019
Oct 18 2019
@dom_walden I verified that your account is now working again, please let us know if you're still unable to connect to the replica databases.
root@cloudcontrol1004:~# openstack zone list --all-projects
Unexpected exception for http://openstack.eqiad1.wikimediacloud.org:9001/v2/zones?: Header value True must be of type str or bytes, not <type 'bool'>
Oct 17 2019
All HA changes are now implemented in both regions, codfw1dev and eqiad1.
Updating the OpenStack endpoints in eqiad1 to the new domain.
Oct 16 2019
Changes pushed. The full list of accounts that were out of sync and have now been created is at: https://phabricator.wikimedia.org/P9366
maintain-dbusers --account-type user harvest-replicas
$ mysql -h labsdb1011.eqiad.wmnet -u labsdbadmin -p -e 'SELECT COUNT(User) from mysql.user where User = "u21436"\G'
*************************** 1. row ***************************
COUNT(User): 1
and I'm able to connect from Toolforge now :)
mysql --defaults-file=$HOME/replica.my.cnf -h enwiki.analytics.db.svc.eqiad.wmflabs enwiki_p ...
I tested the above patch with my user account and verified it's working as expected.
Harvest has two functions:
- harvest_cnf_files finds the replica.my.cnf files of tools and users and inserts or updates the corresponding rows in m5's labsdbaccounts.account table.
- harvest_replica_accts determines, on each replica, the state of the users defined in m5's labsdbaccounts.account table, then sets each user's state to either absent or present in the labsdbaccounts.account_host table for that replica.
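The per-replica check in harvest_replica_accts boils down to the same query used above against labsdb1011: count the mysql.user rows for an account and map the result to present or absent. A hypothetical sketch of that logic, not the maintain-dbusers code; it assumes the mysql client reads labsdbadmin credentials from an option file (such as ~/.my.cnf) instead of prompting with -p, and that account names are plain identifiers like u21436. The function names and the MYSQL_CMD override are illustrative only.

```shell
# Hypothetical sketch of the per-replica state check; not maintain-dbusers.
# Assumes credentials come from a mysql option file and that account names
# are plain identifiers (no quoting is done). MYSQL_CMD may point at a stub.

# Print the number of mysql.user rows for account $2 on replica host $1.
user_count_on_replica() {
    "${MYSQL_CMD:-mysql}" -h "$1" -u labsdbadmin -N -B \
        -e "SELECT COUNT(User) FROM mysql.user WHERE User = '$2'"
}

# Map the row count to the present/absent state used in account_host.
replica_state() {
    local count
    count=$(user_count_on_replica "$1" "$2") || return 1
    if [ "${count:-0}" -gt 0 ]; then echo present; else echo absent; fi
}

# Print "user<TAB>host<TAB>state" for every replica host given; the real
# tool writes these states to labsdbaccounts.account_host instead.
harvest_states() {
    local user="$1" host state
    shift
    for host in "$@"; do
        state=$(replica_state "$host" "$user") || state=unknown
        printf '%s\t%s\t%s\n' "$user" "$host" "$state"
    done
}
```

For example, harvest_states u21436 labsdb1011.eqiad.wmnet prints one line per replica; a host that cannot be queried is reported as unknown rather than absent, so a connection failure is not mistaken for a missing account.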
Related task T235382
I noticed there's a harvest function in maintain-dbusers https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/files/wmcs/nfs/maintain-dbusers.py#19
Oct 15 2019
$ sudo /usr/local/sbin/maintain-replica-indexes --database banwiki --debug
$ sudo /usr/local/sbin/maintain-views --databases banwiki --debug
$ sudo /usr/local/sbin/maintain-meta_p --databases banwiki
Oct 11 2019
The toolforge.org domain has been set up in Designate.
Assigned roles for the tools-dns-manager user in eqiad1
Oct 10 2019
After the Newton upgrade, the Stretch hosts are all running the distro version: systemd-sysv 232-25+deb9u11
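Confirming that a host is on the distro version can be done with dpkg-query. A minimal sketch, to be run on each Stretch host; pkg_version_is is a hypothetical helper, and DPKG_QUERY is an assumed override that lets the check run against a stub:

```shell
# Sketch: check that a Debian package is installed at an exact version.
# pkg_version_is is hypothetical; DPKG_QUERY may point at a stub for testing.
pkg_version_is() {
    local pkg="$1" want="$2" have
    # Ask dpkg for the installed version string only.
    have=$("${DPKG_QUERY:-dpkg-query}" -W --showformat='${Version}' "$pkg") || return 2
    if [ "$have" = "$want" ]; then
        echo "$pkg $have OK"
    else
        echo "$pkg is $have, expected $want" >&2
        return 1
    fi
}
```

On a host, pkg_version_is systemd-sysv 232-25+deb9u11 exits 0 when the version matches exactly, 1 on a mismatch, and 2 if the package is not installed.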
Oct 9 2019
This was a false alarm... The domain_id column was removed in a later revision: https://github.com/openstack/keystone/blob/master/keystone/common/sql/migrate_repo/versions/091_migrate_data_to_local_user_and_password_tables.py#L82
Oct 3 2019
Puppet is not managing the permissions on the secondary controller's image directory.
cloudcontrol2003-dev:~# ls -lad /srv/glance/images
drwxr-xr-x 2 glance glance 4096 Oct 1 21:29 /srv/glance/images
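Until Puppet manages this directory, drift can be spotted by comparing owner, group and mode on each controller. A minimal sketch using GNU stat; check_dir_perms is a hypothetical helper, and the expected values should come from whatever the primary controller uses, which is an assumption here:

```shell
# Sketch: compare a directory's owner:group and octal mode to expected values.
# check_dir_perms is hypothetical; stat -c is GNU coreutils syntax.
check_dir_perms() {
    local path="$1" want_owner="$2" want_mode="$3" have
    have=$(stat -c '%U:%G %a' "$path") || return 2
    if [ "$have" = "$want_owner $want_mode" ]; then
        echo "OK $path $have"
    else
        echo "MISMATCH $path: have '$have', want '$want_owner $want_mode'" >&2
        return 1
    fi
}
```

The listing above (drwxr-xr-x, glance glance) corresponds to check_dir_perms /srv/glance/images glance:glance 755; running the same call on each controller shows which ones differ.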