Sat, Dec 3
Loading with mydumper 0.10 into MariaDB 10.6.10 fails almost immediately:
$ myloader --version
~> myloader 0.10.0, built against MySQL 10.5.8
$ date
~> Sat 03 Dec 2022 08:03:39 AM UTC
$ /usr/bin/myloader --directory /srv/backups/dumps/latest/dump.s5.2022-11-29--00-00-02 --threads 16 --host db1133.eqiad.wmnet <credential options>
date
s5 (a best-case scenario, being our smallest wiki section with a balanced number of tables) took 10h30 with the one-liners:
root@dbprov1003:~$ ./mini_loader.sh /srv/backups/dumps/latest/dump.s5.2022-11-29--00-00-02/
Starting recovery at 2022-12-02 12:57:56+00:00
Creating database amiwiki...
Database amiwiki created successfully
...
Table altwiki.slot_roles imported successfully
Finishing recovery at 2022-12-02 23:27:59+00:00
Fri, Dec 2
I will now do a load comparison between both recovery methods on a more realistic scenario, with an s* section on MariaDB 10.6, while also checking definitively whether the issue is on the myloader side or on the MariaDB server side.
The current script is able to reload a dump of the Phabricator databases (434GB) in 12 hours (https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=db1133&var-port=9104&from=1669880505155&to=1669929579334), although all tables except one very tall one were imported in the first 1h15min. That is probably comparable in time to myloader, given the limited same-table concurrency, the unaltered server configuration, and the fact that the host was still replicating s1:
root@dbprov1003:~$ ./mini_loader.sh /srv/backups/dumps/latest/dump.m3.2022-11-29--03-58-42/
Starting recovery at 2022-12-01 08:33:17+00:00
Creating database phabricator_almanac...
...
Importing data to table phabricator_repository.repository_commit_fngrams.00016...
Table phabricator_repository.repository_commit_fngrams.00016 imported successfully
Finishing recovery at 2022-12-01 20:09:32+00:00
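As an aside, the .00016 suffix means the dump was chunked: mydumper's row-based chunking is what splits tall tables into multiple data files that can later be imported in parallel. For reference (the flags are real mydumper options, the values illustrative, not our production settings):

# Split tables into ~1M-row chunks at dump time, producing .NNNNN-suffixed
# data files that a loader can import concurrently for the same table
mydumper --host db1133.eqiad.wmnet --rows 1000000 --threads 16 --outputdir /srv/backups/dumps/latest/dump.m3.2022-11-29--03-58-42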
This should do: T313582
Wed, Nov 30
Summary so far: I think this would be a good thing to have when flexibility is needed and for debugging. It could also solve some of the issues we have with the lack of TLS support and other security concerns. However, good error handling won't be easy (it was not easy for myloader either; that is why metadata tracking and production recovery testing were needed). We'll see tomorrow about performance, especially same-table loading.
I am now running a remote import test of a full s2 backup from a screen on dbprov1003, with 16 parallel threads, into db1133:
It didn't take long to create a prototype to load dumps in the mydumper format. Parallelization is an xargs away.
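A minimal sketch of the idea (not the actual mini_loader.sh; the host, thread count and file handling are illustrative), assuming the standard mydumper file layout:

#!/bin/bash
# Sketch only: load a mydumper-format dump with plain mysql clients.
# Assumed layout: <db>-schema-create.sql, <db>.<table>-schema.sql, and
# <db>.<table>.sql (possibly .NNNNN-chunked) data files.
DIR="$1"
export HOST="db1133.eqiad.wmnet"   # illustrative target host
THREADS=16

# Databases first, then table definitions, serially
for f in "$DIR"/*-schema-create.sql; do
    mysql -h "$HOST" < "$f"
done
for f in "$DIR"/*.*-schema.sql; do
    db=$(basename "$f"); db=${db%%.*}
    mysql -h "$HOST" "$db" < "$f"
done

# Data files in parallel, $THREADS at a time: "an xargs away"
find "$DIR" -name '*.sql' ! -name '*schema*' -print0 |
    xargs -0 -P "$THREADS" -I{} sh -c '
        db=$(basename "{}"); db=${db%%.*}
        mysql -h "$HOST" "$db" < "{}"
    '

Error handling, retries and metadata tracking (the hard part mentioned above) are deliberately left out of the sketch.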
Hopefully this is the right place to ask. I am not a native English speaker, so while I can sometimes get the technical terms, I don't necessarily get the nuances, context, and history of the words. Hopefully you can give me some advice:
Tue, Nov 29
This is the backup on eqiad; there is another one on codfw too:
10G is also not absolutely required at the moment. I personally would like to eventually have all dbs on 10G for fast backup recovery (that is why we buy 10G cards), but there is no formal plan for it yet (I believe the network upgrade will be required for that).
Mon, Nov 28
@hashar @Vgutierrez Please review my summary of the incident at: https://wikitech.wikimedia.org/wiki/Incidents/2022-08-26_Phabricator_login_issues . I left some things as guesses, as I am unsure what the best actionables are for ATS/Phabricator, but please feel invited to edit it and file any followup tickets if necessary.
As soon as I finish the wikitech description I intend to resolve it.
Fri, Nov 25
All issues fixed in release 0.1.5. This is no longer a blocker.
Last minor issue: there were outdated indexes on the codfw database, leading to full scans on deletion.
Issues solved now, with one last blocker: metadata is correctly updated on the files table (the file is set as "hard-deleted"), but the row on the backups table is not deleted, so this is detected as a "metadata deletion failure". Investigating.
I know what the issue is: this automation was initially made just to list and restore files, never to modify them (deletion was a later addition). For this reason, these scripts use the read-only mediabackups account, which fails when performing the actual deletion.
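For illustration only, the fix amounts to running the deletion path under an account with write grants; a hedged sketch (the read-write account name and exact grant list are assumptions, not the actual production grants):

# Hypothetical example: give a dedicated read-write account the privileges
# the deletion path needs, instead of reusing the read-only one
mysql -h <metadata-db-host> -e "GRANT SELECT, UPDATE, DELETE ON mediabackups.* TO 'mediabackups_rw'@'%';"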
The fix worked and the deletion was sent this time, but we have an issue with permissions:
This is something that bit other people, so maybe it was working before, even if deprecated, but the library upgrade made it fail:
This is blocking deletion of files from backups.
Wed, Nov 23
We realized it was due to the cpu flag sha_ni:
According to openssl speed, SHA-256 isn't slower than MD5.
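For reference, this is easy to verify by hand with standard commands (nothing environment-specific assumed):

# Does the CPU expose the SHA extensions? Empty output means no sha_ni flag
grep -m1 -o sha_ni /proc/cpuinfo

# Compare raw digest throughput; with sha_ni, SHA-256 matches or beats MD5
openssl speed md5 sha256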
Thank you, I am going to productionize this as is, after validating that it has no regressions (and that the warning no longer shows up).
Tue, Nov 22
I am filling in: https://wikitech.wikimedia.org/wiki/Incidents/2022-08-26_Phabricator_login_issues (Still WIP)
Thanks, that is all I needed to understand the context! I will create a draft doc on Wikitech and link it here for review.
Ignore my previous comments, I was comparing with the wrong transfer. The transfer times were:
The new encryption parameters seem to make the transfer 20% slower. I will have more details after a deeper analysis and more testing, to make sure it is not caused by external factors.
Mon, Nov 21
@Papaul, I wonder if we could do a "simple" test of the power supply redundancy by "pulling the plug" (literally, or just pushing the on/off button) to check that it works as expected (maybe you already did that).
The first timeout matches that log entry:
[2022-11-21 15:11:00] SERVICE ALERT: db2174;Check for large files in client bucket;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
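To tell whether the check is genuinely slow or merely exceeding NRPE's 10-second default, it can be reproduced by hand with a longer timeout (the -c command name below is a guess; the real one is whatever nrpe.cfg defines for this check):

# Run the same check from the monitoring host with a 60s timeout instead of the 10s default
/usr/lib/nagios/plugins/check_nrpe -H db2174 -t 60 -c check_large_files_client_bucket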
Thanks Ottomata, please use the template with the checklist I linked to you; otherwise I think there is not enough visibility and clarity to follow the process as documented without forgetting any step.
I can confirm the warning was showing up in verbose mode:
100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/mkdir /tmp/...eqiad.wmnet_4401'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/bash -c "/u...qiad.wmnet 4401"' -----
*** WARNING : deprecated key derivation used.
Using -iter or -pbkdf2 would be better.
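The warning comes from openssl enc defaulting to the legacy key derivation; a sketch of the kind of change involved (cipher, iteration count and passphrase source are illustrative, not the actual production parameters):

# Legacy invocation: derives the key with the old, deprecated KDF and warns
openssl enc -aes-256-cbc -pass env:PASS -in dump.tar -out dump.tar.enc

# Updated invocation: explicit PBKDF2 with an iteration count
openssl enc -aes-256-cbc -pbkdf2 -iter 100000 -pass env:PASS -in dump.tar -out dump.tar.enc

Note the KDF only runs once per invocation, so by itself it should not explain a sustained throughput drop.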
To clarify- there is no blocker from SRE team ops to proceed with this, we are eager and waiting for the template to be added on this ticket to formally kickstart the process.
Fri, Nov 18
Thu, Nov 17
According to Namely, Will and Guillome should approve for each, plus either Otto or Olja from your side (let me know if that is up to date).
Do we need any additional approval from elsewhere in SRE, or can we just go ahead and make the change?
If an incident is planned to be written, let me add the corresponding tag for tracking purposes only (I'm on clinic duty this week).
Thank you for bearing with me! Old account is now disabled on Phabricator and everywhere else. Further requests should go much smoother! Sorry for the complications. Have a nice day and enjoy your new privileges! :-)
Could I please request you to disable the other Phab account (Username: AbhasT)?
Wed, Nov 16
For the record, the UID/CN on LDAP associated with the corporate LDAP/email is Abhas; I updated it on the request.
Tempted to mark this as a duplicate of T255124
What was the specific change that was deployed? And what was the specific change that caused the issue?
I contacted Abhas in private and confirmed the request was legitimate. Thank you, and apologies for any problem caused!
Hello, @Atripathi. Privileged access to LDAP is provided to people according to certain rules and needs. I hope this doesn't sound disrespectful, but I am not sure who the requester is (this is an account created today with little to no information or verification) nor who you are requesting the access for (@Atripathi? another account created today with no context).
Ah, so you mean they are temporary during the maintenance and won't happen once all migrations are done? Then please keep up the good work :-P
I am seeing a couple of non-fatal errors on ganeti. I wonder if they could be artifacts of the bullseye upgrade (in particular, of a ganeti upgrade), as I don't see them on the not-yet-upgraded hosts, but they start exactly on the same day those were upgraded. FYI:
hmmm that would trigger a few seconds of downtime every time that Apache is restarted automatically by puppet
I am marking this as an incident, as lists were down for around 2.5h, although it could also be considered a Sustainability (Incident Followup).
I think what we already know about it is that the server has one power supply on the left and the other one on the right.
My suggestion is purely practical: to avoid having SREs ask you every week (processing access requests rotates every 7 days), remove the SRE and SRE-Access-Requests tags, as we have no actionables yet. Probably add a personal tag or that of the sponsor team. When ready, it can always be retagged with no issue. This is mostly to avoid the stress of access requests, which are treated with maximum urgency on our side.
Tue, Nov 15
@Papaul You are probably not asking me, but the work on the server is scheduled for next quarter, so feel free to do more tests/work with this server. We actually chose to order the new model for this expansion because it was ok for the setup to take longer (and it was just 1 per dc), but in the future other servers like this will be ordered, if that helps with the answer.
I agree, but I think the second point should be communicated more widely. I think many people would prefer to have a task closed rather than open forever and never worked on, but that requires a common understanding (e.g. declining now doesn't mean it is a bad idea, and it could be reopened later on/pushed for in the future).
Patch that should help: https://gerrit.wikimedia.org/r/856927 (assuming the HDs' RAID is sda)
Would it be possible to introduce a revision that fixes the missing reference? If you document the process, I can fix future cases (or at least avoid errors).
@Papaul Yes, the RAID with the HDs should contain the OS, using the same custom recipe as db hosts (ideally that is sda, the first hw raid virtual disk). The SSDs are used for fast scratch space; I can set up the partitions on those after installation, but ideally you could set up the RAID 0 for the SSDs for me, before or after installation.
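For reference, a quick post-install sanity check (device names are expectations, not guarantees; the controller decides the enumeration order):

# sda should be the HD virtual disk holding the OS; the SSD RAID 0 should
# appear as a separate, second virtual disk (sdb is an assumption)
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT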
Mon, Nov 14
Priority-wise I am not giving a judgment call, just reflecting that this has been untouched since 2019 and looks like a (nice) feature request, not a bug.
I'll answer myself with a potential solution: I am thinking of trying to hide SRE tasks for triage in the "(Acknowledged)" columns, but I wonder if that is just delaying the inevitable: how to handle tasks that are correct but that no one is going to work on? (especially when several teams have different policies: some close them, some remove their tag, some just leave them untouched or on a separate column).