The ALTER TABLE statements to convert the Aria tables to InnoDB are still in progress :)
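For reference, a quick way to check how many Aria tables are left to convert (a minimal sketch; the schema scope is an assumption):

    # Hypothetical progress check: count the tables still on the Aria engine.
    mysql --skip-ssl -e "SELECT table_schema, COUNT(*) FROM information_schema.tables WHERE engine = 'Aria' GROUP BY table_schema;"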
Wed, Jan 16
Today I created all the partitions on the old hosts (wiping the old content), so we are now ready to bootstrap the cluster by merging https://gerrit.wikimedia.org/r/484374
@Marostegui I now have everything gzipped in my home directory; I'll move it to stat1007 and HDFS asap:
Ok for me for the analytics nodes, but I'd need a bit of a heads up to properly stop them, if possible :)
@Marostegui I don't see any other "easy" way to ping people about those tables - do you think that this cleanup was ok (and the task can be closed), or would we need something more?
Leaving also the following aside for a moment:
This seems to be happening for the .ores_classification tables:
Skipping this table for the moment:
Thanks all for your cleanup! So the remaining tables to drop are:
Tue, Jan 15
Took a mysqldump of the staging database and moved it to two places (rough commands sketched after the list):
- on dbstore1002's /srv/elukey_backup
- on stat1007's /home/elukey home dir (chown root:root, chmod 700)
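For the record, a rough sketch of the kind of commands involved (file names and flags are illustrative, not the exact ones used):

    # Hypothetical backup commands: dump the staging db, gzip it, and lock down the copy.
    mysqldump --skip-ssl staging | gzip > /srv/elukey_backup/staging.sql.gz   # on dbstore1002
    # copy on stat1007, readable only by root:
    chown root:root /home/elukey/staging.sql.gz && chmod 700 /home/elukey/staging.sql.gz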
We decided to try a simple systemd timer for the moment, which will alert if reportupdater runs and returns a non-zero exit code. This is currently tracked in T172532
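As a sketch of what the timer could run (the script path, the reportupdater invocation and the alert destination are assumptions, just to illustrate the idea):

    #!/bin/bash
    # Hypothetical wrapper executed by the systemd timer: run reportupdater and
    # send an email alert if it exits with a non-zero code.
    if ! /srv/reportupdater/update_reports.py; then
        echo "reportupdater failed on $(hostname)" | mail -s "reportupdater failure" analytics-alerts@wikimedia.org
        exit 1
    fi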
As an FYI, I can see the following for Dec 17th in my inbox for analytics-alerts@:
Another question - should we back up the staging database just in case something goes wrong?
I have created /home/elukey/aria_tables_alter.sql on dbstore1002; if you can review it quickly as a sanity check, that would be great. Then I'd just execute mysql --skip-ssl < /home/elukey/aria_tables_alter.sql from a root tmux and monitor the status over the next few hours.
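If it helps, this is roughly how I'd run and monitor it (the log file path is just an illustration):

    # Run inside a root tmux and keep a log of the output to monitor progress.
    tmux new -s aria_alter
    mysql --skip-ssl < /home/elukey/aria_tables_alter.sql 2>&1 | tee /home/elukey/aria_tables_alter.log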
So as far as I can understand I'd need to grab the list of tables and produce a list of:
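Presumably that means one ALTER TABLE ... ENGINE=InnoDB statement per Aria table; a hedged sketch of how such a list could be generated (the engine filter and the output path are assumptions):

    # Hypothetical generator: emit one ALTER statement per Aria table for later review.
    mysql --skip-ssl -N -e "
      SELECT CONCAT('ALTER TABLE \`', table_schema, '\`.\`', table_name, '\` ENGINE=InnoDB;')
      FROM information_schema.tables
      WHERE engine = 'Aria';
    " > /home/elukey/aria_tables_alter.sql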
So the staging db is kinda problematic to clean up, since it is difficult to figure out owners and reach out to people. I have already started asking people to review/drop the old tables, but as a precautionary step I'd migrate everything to one of the new dbstore hosts and then figure out later on if something else can be dropped.
Mon, Jan 14
Update for all the users of dbstore1002:
@Nikerabbit looking forward to seeing it deployed, thanks to both you and Aaron for this work.
Update for all the users of dbstore1002:
racadm getsel output for today (remember that one disk had already failed; we have a task about it):
Executed bmc-device --debug --cold-reset on localhost, since the mgmt interface was not available ("No more sessions available").
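For reference, the rough sequence (the mgmt hostname is a placeholder):

    # Hypothetical recovery sequence: cold-reset the BMC locally (FreeIPMI), then
    # the system event log can be read again through the mgmt interface.
    bmc-device --debug --cold-reset
    ssh root@HOSTNAME.mgmt.eqiad.wmnet racadm getsel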
Fri, Jan 11
Thu, Jan 10
The issue seems to have been a one-off, and the new alarms have been very reliable over the past months. Closing this task since there seems to be no action left; will re-open if necessary.
It is fine in here Daniel, thanks! In theory kafka1012->23 should be decommissioned once EventGate (part of the Modern Event Platform) is up and running, since the MediaWiki Avro Monolog traffic will be migrated to it and at that point nothing will be pushing data to the old Kafka Analytics cluster anymore. So I wouldn't spend much time or energy on this if possible; worst case scenario we can shrink the cluster down to 5/4 hosts and decom the ones (like this one) that are not healthy anymore.
Wed, Jan 9
I'd prefer 1.5.6 too; we'd be really close to upstream (atm 1.5.12), and getting help from them would surely be easier if needed. I'd also love to be able to provide feedback to the memcached project about scalability and/or bottlenecks of running a recent version at scale (for example, the LRU special use case, etc.).
As mentioned before these nodes will become a new testing cluster, more info in T212256
Nodes completely removed (a rough command sketch follows after the list):
- removed from the network topology and restarted namenodes
- assigned role::spare::system and removed the datanode/nodemanager packages from the hosts, to prevent the chance of a node coming back alive.
- cleaned up the hosts.exclude list
- is done :)
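For the record, a rough sketch of the commands behind those steps (the service unit, package names and refresh command are assumptions about this setup, not an exact transcript):

    # Hypothetical decommission steps for the Hadoop workers.
    # 1) after dropping the hosts from the network topology config, restart the Namenodes
    sudo systemctl restart hadoop-hdfs-namenode
    # 2) on each decommissioned host, remove the worker packages so it cannot rejoin by accident
    sudo apt-get remove --purge hadoop-hdfs-datanode hadoop-yarn-nodemanager
    # 3) with hosts.exclude cleaned up, make the Namenode re-read the include/exclude lists
    sudo -u hdfs hdfs dfsadmin -refreshNodes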
Tue, Jan 8
We bumped the Xmx/Xms settings of the HDFS Namenode to 12G (it was 8G) as part of unrelated changes, and I haven't seen any more pauses since then. The increased Zookeeper timeout has also helped. For the moment I don't see any more action items, so I think we are good to close.
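For reference, the heap bump boils down to something like this in hadoop-env.sh (the exact file and variable name are assumptions; in our case it is presumably templated by configuration management):

    # Hypothetical snippet from /etc/hadoop/conf/hadoop-env.sh: give the Namenode a 12G heap.
    export HADOOP_NAMENODE_OPTS="-Xms12g -Xmx12g ${HADOOP_NAMENODE_OPTS}"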
So I tried it today and I have a couple of notes:
Mon, Jan 7
When we discussed this use case I was not aware (shame! shame!) that ssh -L could be used in the following way:
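Something along these lines (host names and ports are placeholders, not the exact command):

    # Hypothetical example: forward a local port to the MySQL instance on the db host,
    # then connect to it from the local machine as if it were running locally.
    ssh -L 3307:localhost:3306 dbstore1002.eqiad.wmnet
    # in another terminal:
    mysql -h 127.0.0.1 -P 3307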
Enclosure Device ID: 32
Slot Number: 1
Drive's position: DiskGroup: 2, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 1
WWN: 500003964b700233
Sequence Number: 3
Media Error Count: 0
Other Error Count: 3
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
@Cmjohnson so I got a different output than usual from:
Nevermind then, I can easily use only analytics1028->41, we are good to decom. Thanks!
Going to close this task and open another one to track the upgrade to buster or stretch; this one is full of information and I wouldn't like to overload it (more than it already is).