
Loss of access to parsing-qa-01.eqiad.wmflabs
Closed, ResolvedPublic

Description

TLDR of discussion from Cloud-Services

I lost access to parsing-qa-01.eqiad.wmflabs, and @bd808 says it is because of expired certs. Since it was originally a Debian Stretch instance, I believe it is hard to fix up because of its split identity between the Stretch and Buster distros.

This Phab task is for us to work through spinning up a new equivalent instance and the work needed to make that happen. We can edit the description and create todo items here, or create subtasks as appropriate.

Proposed fix:

  • T292265 bump quota for wikitextexp to allow for another very large instance and 350GiB of cinder disk
  • create a new parsing-qa-02 instance in the Cloud VPS project that will take over for parsing-qa-01 using the g3.cores16.ram36.disk20 flavor
  • create a 350GiB cinder volume
  • attach the cinder volume to parsing-qa-01
  • fill cinder volume with data that needs to be preserved
  • reattach the cinder volume to parsing-qa-02.wikitextexp.eqiad1.wikimedia.cloud
  • shut down parsing-qa-01
  • hand the new instance over to @ssastry to finish setup
  • delete parsing-qa-01 and reduce quota of project
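The plan above might translate to something like the following with the `openstack` CLI. This is a sketch only: the image name and exact flags are assumptions, and in practice some of these steps were done through Horizon.

```shell
# Hypothetical sketch of the proposed fix; the image argument is an
# assumption and will differ in a real Cloud VPS deployment.

# New instance on the large flavor (after the T292265 quota bump)
openstack server create --flavor g3.cores16.ram36.disk20 \
  --image debian-10.0-buster parsing-qa-02

# Create the 350GiB cinder volume and attach it to the old instance
openstack volume create --size 350 parsing
openstack server add volume parsing-qa-01 parsing

# ... copy the data that needs to be preserved onto the volume ...

# Move the volume over to the new instance
openstack server remove volume parsing-qa-01 parsing
openstack server add volume parsing-qa-02 parsing

# Shut down (but keep) the old instance until the handover is confirmed
openstack server stop parsing-qa-01
```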

Event Timeline

The problem with ssh into the instance is an NSS failure. Specifically this host is using nslcd to talk to the LDAP directory and that communication is failing with a TLS error that is almost certainly caused by the LE root signing cert expiration on 2021-09-30 (T283164: Let's Encrypt issuance chains update).
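A rough way to confirm that diagnosis from a working session might look like this (the LDAP host below is a placeholder, not the real directory server):

```shell
# Sketch of checks for an nslcd/LDAP TLS failure; ldap.example.org is
# a placeholder for the real directory host.

# An NSS lookup of an LDAP-backed account fails if nslcd is broken
getent passwd some-ldap-user

# nslcd's log should show the TLS handshake / certificate error
sudo journalctl -u nslcd --since "2021-09-30"

# Inspect the certificate chain the LDAP server presents; an expired
# root in the chain shows up in the verify output
openssl s_client -connect ldap.example.org:636 -showcerts </dev/null
```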

Will the right puppet classes, etc. be configured via horizon? Or can I do that after I get the instance from you all?

Mentioned in SAL (#wikimedia-cloud) [2021-09-30T22:45:59Z] <bd808> Attaching volume "parsing" to parsing-qa-01 (T292264)


As far as I can tell the current parsing-qa-01 instance is basically un-puppetized. The only puppet config I can find for it is the class that was used to provision and format the extended disk.

Alright then. :) I'll worry about puppetization later.

Data copied from local disk on parsing-qa-01 instance to "parsing" cinder volume (mounted at /srv-cinder) with:

$ mysqldump --all-databases > /srv-cinder/parsing-qa-01-all-databases-20210930.sql
$ cd /srv
$ tar --exclude='./lost+found' -cf - . | (cd /srv-cinder; tar xvf -)

The tar is still running, but I will check back on it later this evening.
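Once the tar finishes, a quick consistency check between the two trees could look like this (a hypothetical helper, not something from the task):

```shell
# Hypothetical helper: succeed only if two directory trees have the
# same file count and identical file contents.
verify_copy() {
  local src=$1 dst=$2
  [ "$(find "$src" -type f | wc -l)" -eq "$(find "$dst" -type f | wc -l)" ] &&
    diff -r "$src" "$dst" >/dev/null
}

# e.g. verify_copy /srv /srv-cinder && echo "copy looks complete"
```

Note that because the tar above excluded ./lost+found, a real check of /srv against /srv-cinder would need to exclude it as well.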

bd808 changed the task status from Open to In Progress. Sep 30 2021, 11:34 PM
bd808 claimed this task.
bd808 triaged this task as Medium priority.
bd808 moved this task from Inbox to Clinic Duty on the cloud-services-team (Kanban) board.

@ssastry Your new instance is parsing-qa-02.wikitextexp.eqiad1.wikimedia.cloud (note the new style domain). The data from the mysql database and the /srv mount on parsing-qa-01 is on parsing-qa-02 in a cinder volume mounted at /srv.

I have shut down, but not deleted, the parsing-qa-01 instance. We can spin it up to fetch more files if needed.
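For the handover, a hedged sketch of how the volume's contents might be brought back into service on parsing-qa-02. The device name, filesystem, and the dump path under the new /srv mount are assumptions based on the steps above:

```shell
# Hypothetical sketch; device name (sdb), filesystem (ext4), and paths
# are assumptions -- check lsblk/blkid output on the actual instance.

# Find the cinder volume's block device and mount it at /srv
lsblk
sudo mount /dev/sdb /srv

# Persist the mount across reboots by UUID (nofail so boot survives
# a detached volume)
uuid=$(sudo blkid -s UUID -o value /dev/sdb)
echo "UUID=${uuid} /srv ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab

# Reload the dumped databases (dump filename from earlier in the task,
# now under the volume's new mount point)
sudo mysql < /srv/parsing-qa-01-all-databases-20210930.sql
```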

If possible, copying the config files in /etc/nginx, /etc/testreduce, and /etc/visualdiff, as well as /lib/systemd/system/(parsoid*|mw*|diff*) (or all files, if that is simpler and I can easily sift through what I want), would save me a few hours of setting them up again (since I hadn't puppetized them). You can put them in my home directory.


I tarred up all those directories and put them in /home/ssastry/parsing-qa-01.config.tbz on parsing-qa-02.
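Unpacking that archive into a scratch directory keeps the sifting easy; a small hedged helper (the names here are illustrative):

```shell
# Hypothetical helper: unpack a .tbz archive into a scratch directory.
restore_config() {
  local archive=$1 dest=$2
  mkdir -p "$dest"
  tar -xjf "$archive" -C "$dest"
}

# e.g. restore_config ~/parsing-qa-01.config.tbz ~/config-restore
```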

Awesome ... almost there ... https://parsoid-vs-core.wmflabs.org/ is now operational and the services run.

I feel silly not thinking upfront about all the places where the bodies were buried ... but, I also need my home directory transferred. I had a stats script lying around there I used once a week.

I should probably use this as an excuse to puppetize some of the steps.

Looks like we got a performance upgrade as a result: a number of tests that were failing to complete because of time / memory limits are no longer crashing (probably about 0.2-0.3% of tests). And maybe the whole test run will finish a bit faster, and we can expand the test set to include pages from other wikis. So, not a bad outcome in the end. :)


Happens to all of us, so no reason to feel silly. Maybe a good motivation to build a more repeatable instance though. :)

I tarred up most of /home from parsing-qa-01 and placed the archive in parsing-qa-02.wikitextexp.eqiad1.wikimedia.cloud:~ssastry/parsing-qa-01-home.tbz for you.

Thanks! Yes, I should puppetize or something. Anyway, all set now - that script in my home directory was the last missing piece.


Just to double check, you are now ready for me to delete the old parsing-qa-01 instance and cleanup the project quotas?

Mentioned in SAL (#wikimedia-cloud) [2021-10-05T00:36:10Z] <bd808> Deleted instance parsing-qa-01 (T292264)

Mentioned in SAL (#wikimedia-cloud) [2021-10-05T00:48:28Z] <bd808> Remove self (BryanDavis) from project after helping resolve T292264