Just to clarify, we have lowered the priority because the slaves are no longer lagging.
A few minutes ago the master went back to normal INSERT values - normal meaning the levels from before the upgrade.
Resolving this - thanks @mmodell!
Jul 5 2019
In T227251#5306948, @mmodell wrote: Now the graphs look better. Unfortunately, puppet will set the config back to 10 taskmasters unless we make a commit to rOPUP Wikimedia Puppet.
In T226952#5295368, @Marostegui wrote: Note: db2044 needs upgrading
This host is ready for DC-Ops to decommission
Jul 4 2019
From what I can see now, the UPDATEs have stopped, but the INSERT rate is still at the same level on the master: https://grafana.wikimedia.org/d/000000273/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1072&var-port=9104&from=now-24h&to=now
I have restored the defaults after db2065 caught up
In T227251#5306333, @Stashbot wrote: Mentioned in SAL (#wikimedia-operations) [2019-07-04T10:47:29Z] <marostegui> Ease replication consistency option on db2065 to allow it to catch a bit - T227251
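For context, a minimal sketch of what easing replication durability on a replica to speed up catch-up can look like; the exact settings changed on db2065 are not stated in the SAL entry, so the variables below are an assumption:

-- Assumption: relaxing durability is one common way to let a lagging replica catch up.
SET GLOBAL innodb_flush_log_at_trx_commit = 0;  -- flush the redo log roughly once per second instead of at every commit
SET GLOBAL sync_binlog = 0;                     -- stop fsyncing the binlog on every write
-- ...and the defaults restored once the replica has caught up (as noted above):
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL sync_binlog = 1;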
@mmodell there has not been any significant change in the number of INSERTs the master is getting:
https://grafana.wikimedia.org/d/000000273/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1072&var-port=9104&from=1562198027198&to=1562237067984
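As a side note, the rate that Grafana panel graphs can also be eyeballed directly on the host by sampling the server's own counter; a minimal sketch, assuming only the standard MySQL/MariaDB status counters and nothing host-specific:

-- Com_insert counts INSERT statements executed since server start;
-- sample it twice and divide the difference by the interval to get inserts/second.
SHOW GLOBAL STATUS LIKE 'Com_insert';
-- wait e.g. 60 seconds, run it again, and compare the two values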
The etherpad with the procedure is ready for review.
The patch is also ready for review: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/519975/
Thank you!
labsdb1011 is fully done:
root@labsdb1011:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs    12T  6.2T  5.5T  53% /srv
All good now - thanks!
root@db2049:~# hpssacli controller all show config
Jul 3 2019
All done
centralauth progress
@RobH @Papaul I have merged: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520379/
The only changes pending from your side to be able to install these hosts once they arrive would be:
All codfw is now running 10.1.39 (which is the version the new master will run) - will keep upgrading eqiad now.
@Ladsgroup do you want to run the script?
This was done.
Read only start: 06:00:36 UTC
Read only stop: 06:01:56 UTC
Total read only time: 1 min 20 s
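For reference, a minimal sketch of what bounds that read-only window during a master switchover; the actual runbook steps are not listed in this comment, only the timestamps, so treat this as an assumption about the general procedure:

-- On the old master, at the start of the window (06:00:36 UTC):
SET GLOBAL read_only = 1;
-- ...replication catches up and traffic is repointed to the new master...
-- On the new master, at the end of the window (06:01:56 UTC):
SET GLOBAL read_only = 0;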
Let's replace the disk please!
Thanks
Jul 2 2019
Adding Analytics as they are interested in knowing when this wiki finally gets created, so they can sqoop data from it (T227030: hi.wikisource added to labs replicas?).
As @Reedy points out, hi.wikisource isn't created yet, not even its database (T219374: Prepare and check storage layer for hi.wikisource).
As the wiki is marked as a public wiki, the process is as follows:
In T226050#5298926, @Ahecht wrote: How long do you expect labsdb1011 to be depooled for? Is this going to be a regular thing?
Jul 1 2019
Cool, I will give it another 24h - so far nothing on logstash for db1094.
Should we define new HW specs for these hosts?
Current es1015 specs:
128GB RAM
12x1.819 TB SATA HDD
In T169440#5295834, @MarcoAurelio wrote: Well, that's the question. Do we need to keep asking on Phabricator before renaming users with big edit counts locally and/or globally? I don't think any of us wants to break the sites doing a heavy rename (DBA tag or not).
In T169440#5295698, @MarcoAurelio wrote: In T169440#5295547, @Marostegui wrote: Renames probably don't need DBA monitoring anymore - since we replaced our old hardware we haven't seen any replication delay showing up.
Thanks @Marostegui - Could you please discuss this with DBA/Ops/etc. and get back to us with a definitive answer on this issue? (That is: whether renames for global accounts with more than X edits locally or globally need a Phabricator ticket.) Thank you!
Renames probably don't need DBA monitoring anymore - since we replaced our old hardware we haven't seen any replication delay showing up.
This doesn't really need a DBA - there is no replication lag showing up since we replaced all the old hardware.
Note: db2044 needs upgrading
Thanks, I am going to stall this until then.
In T226952#5295258, @Volans wrote: For debmonitor it connects to m2-master.eqiad.wmnet and I'm not sure if Django's connection pooling would be smart enough to reconnect given that the old one will still work, just RO. It might need a:
sudo cumin 'A:debmonitor' 'systemctl restart uwsgi-debmonitor.service'
just after the switch.
Let's leave it aside for now :-)
In T226952#5295201, @jcrespo wrote: Because of the TTL mention, are you planning a failover of proxy at the same time?
Also changed on an eqiad host (so we can check if there is something reading from those):
I have altered this table on db2054 on centralauth and will leave the columns renamed for a few days to make sure nothing uses it.
root@db2054.codfw.wmnet[centralauth]> alter table oathauth_users change secret TO_DROP_secret varbinary(255) DEFAULT NULL, change scratch_tokens TO_DROP_scratch_tokens varbinary(511) DEFAULT NULL;
Query OK, 0 rows affected (0.09 sec)
Records: 0  Duplicates: 0  Warnings: 0
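The eventual cleanup is not part of the comment above, but assuming nothing turns up still reading the renamed columns after a few days, the follow-up would presumably look something like this:

-- Hedged follow-up sketch, not taken from the task: drop the columns once confirmed unused.
ALTER TABLE oathauth_users
  DROP COLUMN TO_DROP_secret,
  DROP COLUMN TO_DROP_scratch_tokens;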
Not sure what's the actionable here for the DBAs.
This looks like another case of the optimizer not doing what is expected. I have tested the original two queries on the new 10.3 and they also filesort (and considering this is a 5-year-old ticket and nothing has changed from 5.5 to 10.3... I guess we cannot have much hope in MariaDB's optimizer doing the right thing).
We can always report it as a bug, but I don't think it will make much difference.
Thoughts?
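To illustrate the behaviour being discussed (the original queries live in the task and are not quoted here, so the query and table below are purely hypothetical): when the ORDER BY cannot be satisfied from an index, EXPLAIN reports a filesort.

-- Hypothetical example, not one of the queries from this task:
-- assume a table 'events' indexed only on (type), so sorting by 'created' needs a sort pass.
EXPLAIN SELECT id FROM events WHERE type = 3 ORDER BY created DESC LIMIT 50;
-- Extra column in the resulting plan: "Using where; Using filesort"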
@Daimona as per https://tools.wmflabs.org/versions/ we are on .11 everywhere - is this safe to proceed, then?
@Anomie might be able to give us more information about the timeline for the deployment.
Yes, the host that was out for maintenance, labsdb1011, was repooled. However, we still need to continue with the maintenance for T222978: Compress and defragment tables on labsdb hosts, so I am going to depool labsdb1011 again. I know this is unfortunate, but there is nothing else we can do to reduce disk space; we have to do it no matter what, or else the replicas will fill up completely.
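For those following along, a minimal sketch of the per-table work this kind of maintenance involves; the exact statements are tracked in T222978, and the table name below is only a placeholder:

-- Assumption: compress-and-rebuild is the general approach; 'revision' is a placeholder table name.
ALTER TABLE revision ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;  -- rebuilds the table using the InnoDB compressed row format
-- where a table is already compressed, a plain rebuild still defragments it and reclaims space:
OPTIMIZE TABLE revision;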