jcrespo (Jaime Crespo)
Sr Database Administrator

User Details

User Since
May 11 2015, 8:31 AM (395 w, 3 h)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF) [ Global Accounts ]

Recent Activity

Today

jcrespo added a comment to T319383: Mydumper incompatibility with MariaDB 10.6 (was: Logical recoveries (myloader) to db2098:s7 are failing with "Lock wait timeout exceeded; try restarting transaction").

Do you remember, more or less, how long it used to take with myloader?

Mon, Dec 5, 10:12 AM · Patch-For-Review, Data-Persistence-Backup, DBA, database-backups

Sat, Dec 3

jcrespo added a comment to T319383: Mydumper incompatibility with MariaDB 10.6 (was: Logical recoveries (myloader) to db2098:s7 are failing with "Lock wait timeout exceeded; try restarting transaction").

Loading with mydumper 0.10 into 10.6.10 fails almost immediately:

$ myloader --version ~> myloader 0.10.0, built against MySQL 10.5.8
$ date ~> Sat 03 Dec 2022 08:03:39 AM UTC
$ /usr/bin/myloader --directory /srv/backups/dumps/latest/dump.s5.2022-11-29--00-00-02 --threads 16 --host db1133.eqiad.wmnet <credential options>
date
Sat, Dec 3, 8:18 AM · Patch-For-Review, Data-Persistence-Backup, DBA, database-backups
jcrespo added a comment to T319383: Mydumper incompatibility with MariaDB 10.6 (was: Logical recoveries (myloader) to db2098:s7 are failing with "Lock wait timeout exceeded; try restarting transaction").

s5 (a best-case scenario: our smallest wiki section, with a balanced number of tables too) took 10h30 with the one-liners:

root@dbprov1003:~$ ./mini_loader.sh /srv/backups/dumps/latest/dump.s5.2022-11-29--00-00-02/
Starting recovery at 2022-12-02 12:57:56+00:00
Creating database amiwiki...
Database amiwiki created successfully
...
Table altwiki.slot_roles imported successfully
Finishing recovery at 2022-12-02 23:27:59+00:00
Sat, Dec 3, 7:41 AM · Patch-For-Review, Data-Persistence-Backup, DBA, database-backups

Fri, Dec 2

jcrespo added a comment to T319383: Mydumper incompatibility with MariaDB 10.6 (was: Logical recoveries (myloader) to db2098:s7 are failing with "Lock wait timeout exceeded; try restarting transaction").

I will now do a load comparison between both recovery methods on a more realistic scenario, an s* section with MariaDB 10.6, while I check definitively whether the issue is on the myloader side or on the MariaDB server side.

Fri, Dec 2, 12:09 PM · Patch-For-Review, Data-Persistence-Backup, DBA, database-backups
jcrespo added a comment to T319383: Mydumper incompatibility with MariaDB 10.6 (was: Logical recoveries (myloader) to db2098:s7 are failing with "Lock wait timeout exceeded; try restarting transaction").

The current script is able to reload a dump of the Phabricator databases (434GB) in about 12 hours (https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=db1133&var-port=9104&from=1669880505155&to=1669929579334), although all tables except one very tall one were imported in the first 1h15min. That is probably comparable in time to myloader, given the limited table concurrency and given that it had an unaltered configuration and was still replicating s1:

root@dbprov1003:~$ ./mini_loader.sh /srv/backups/dumps/latest/dump.m3.2022-11-29--03-58-42/
Starting recovery at 2022-12-01 08:33:17+00:00
Creating database phabricator_almanac...
...
Importing data to table phabricator_repository.repository_commit_fngrams.00016...
Table phabricator_repository.repository_commit_fngrams.00016 imported successfully
Finishing recovery at 2022-12-01 20:09:32+00:00
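The wall-clock durations quoted in these comments can be checked directly from the "Starting recovery" and "Finishing recovery" timestamps. The following is a minimal Python sketch; the `elapsed` helper and the `runs` labels are names introduced here for illustration, and the timestamps are copied from the two mini_loader.sh runs in this feed.

```python
from datetime import datetime


def elapsed(start, finish):
    """Return the time between two ISO-8601 timestamps as a timedelta."""
    return datetime.fromisoformat(finish) - datetime.fromisoformat(start)


# Start/finish pairs copied from the mini_loader.sh output in the comments.
runs = {
    "m3 (Phabricator)": ("2022-12-01 08:33:17+00:00", "2022-12-01 20:09:32+00:00"),
    "s5": ("2022-12-02 12:57:56+00:00", "2022-12-02 23:27:59+00:00"),
}

for name, (start, finish) in runs.items():
    print(name, elapsed(start, finish))
# m3 (Phabricator) 11:36:15
# s5 10:30:03
```

So the m3 reload took a little under 12 hours and the s5 reload about 10h30, matching the figures given in the comments.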
Fri, Dec 2, 11:58 AM · Patch-For-Review, Data-Persistence-Backup, DBA, database-backups
jcrespo added a comment to T313978: Q1:rack/setup/install db1204, db1205.

This should do: T313582

Fri, Dec 2, 8:39 AM · SRE, Data-Persistence-Backup, ops-eqiad, DC-Ops

Wed, Nov 30

jcrespo added a comment to T319383: Mydumper incompatibility with MariaDB 10.6 (was: Logical recoveries (myloader) to db2098:s7 are failing with "Lock wait timeout exceeded; try restarting transaction").

Summary so far: I think this would be a good thing to have when flexibility and debugging are needed. It could also solve some of the issues we have with lack of TLS support and other security concerns. However, good error handling won't be easy (it was not easy to do for myloader either; that is why metadata tracking and production recovery testing were needed). We'll see tomorrow about performance, especially same-table loading.

Wed, Nov 30, 5:05 PM · Patch-For-Review, Data-Persistence-Backup, DBA, database-backups
jcrespo added a comment to T319383: Mydumper incompatibility with MariaDB 10.6 (was: Logical recoveries (myloader) to db2098:s7 are failing with "Lock wait timeout exceeded; try restarting transaction").

I am running now a remote import test of a full s2 backup from a screen on dbprov1003 with 16 parallel threads into db1133:

Wed, Nov 30, 4:37 PM · Patch-For-Review, Data-Persistence-Backup, DBA, database-backups
jcrespo added a comment to T319383: Mydumper incompatibility with MariaDB 10.6 (was: Logical recoveries (myloader) to db2098:s7 are failing with "Lock wait timeout exceeded; try restarting transaction").

It would be interesting to see the difference in length, in a sX section between that script and myloader. Just to have some rough idea...

Wed, Nov 30, 3:14 PM · Patch-For-Review, Data-Persistence-Backup, DBA, database-backups
jcrespo added a comment to T319383: Mydumper incompatibility with MariaDB 10.6 (was: Logical recoveries (myloader) to db2098:s7 are failing with "Lock wait timeout exceeded; try restarting transaction").

It didn't take long to create a prototype to load in the mydumper format:

./mini_loader.sh /srv/backups/dumps/latest/dump.m2-scholarships.2022-11-29--10-13-38
Creating database scholarships...
ERROR 1007 (HY000) at line 1: Can't create database 'scholarships'; database exists
Database scholarships failed to be created
Creating table scholarships.iso_countries...
Table scholarships.iso_countries created successfully
Creating table scholarships.language_communities...
Table scholarships.language_communities created successfully
Creating table scholarships.rankings...
Table scholarships.rankings created successfully
Creating table scholarships.scholarships...
Table scholarships.scholarships created successfully
Creating table scholarships.settings...
Table scholarships.settings created successfully
Creating table scholarships.users...
Table scholarships.users created successfully
Importing data to table scholarships.iso_countries...
Table scholarships.iso_countries imported successfully
Importing data to table scholarships.language_communities...
Table scholarships.language_communities imported successfully
Importing data to table scholarships.settings...
Table scholarships.settings imported successfully
Importing data to table scholarships.users...
Table scholarships.users imported successfully
✔️
Parallelization is an xargs away.
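
The phase ordering visible in the log above (create databases, then create tables, then import data) can be sketched as a small planner. This is a hedged Python illustration, not the contents of mini_loader.sh, which the comment does not show. The file-name patterns are an assumption based on the usual mydumper dump layout, and `plan_load` is a hypothetical name.

```python
import re

# Assumed mydumper naming scheme (an assumption of this sketch):
#   <db>-schema-create.sql[.gz]       -> CREATE DATABASE statement
#   <db>.<table>-schema.sql[.gz]      -> CREATE TABLE statement
#   <db>.<table>[.<chunk>].sql[.gz]   -> row data, optionally split into chunks
_SUFFIX = r"\.sql(?:\.gz)?$"
_CREATE_DB = re.compile(r"^[^.]+-schema-create" + _SUFFIX)
_CREATE_TABLE = re.compile(r"^[^.]+\.[^.]+-schema" + _SUFFIX)
_DATA = re.compile(r"^[^.]+\.[^.]+(?:\.\d+)?" + _SUFFIX)


def plan_load(filenames):
    """Group dump file names into the three phases seen in the log.

    Files that match none of the patterns (for example a metadata file)
    are ignored. The patterns are checked from most to least specific,
    because the data pattern would also match table-schema files.
    """
    plan = {"create_database": [], "create_table": [], "load_data": []}
    for name in sorted(filenames):
        if _CREATE_DB.match(name):
            plan["create_database"].append(name)
        elif _CREATE_TABLE.match(name):
            plan["create_table"].append(name)
        elif _DATA.match(name):
            plan["load_data"].append(name)
    return plan


example = plan_load([
    "metadata",
    "scholarships-schema-create.sql.gz",
    "scholarships.users-schema.sql.gz",
    "scholarships.users.sql.gz",
    "phabricator_repository.repository_commit_fngrams.00016.sql.gz",
])
```

Each phase has to finish before the next one starts, but the entries in `load_data` do not depend on each other, so that list is the part that could in principle be handed to parallel workers.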

Wed, Nov 30, 3:06 PM · Patch-For-Review, Data-Persistence-Backup, DBA, database-backups
jcrespo created P41887 mini_loader.sh test.
Wed, Nov 30, 3:03 PM
jcrespo added a comment to T254646: Reconsidering how we name things.

Hopefully this is the right place to ask. I am not a native English speaker, so while I can sometimes get technical terms, I don't necessarily get the nuances, context, and history of the words. Hopefully you can give me some advice:

Wed, Nov 30, 10:24 AM · MW-1.39-notes (1.39.0-wmf.21; 2022-07-18), MW-1.38-notes (1.38.0-wmf.12; 2021-12-06), MW-1.37-notes (1.37.0-wmf.23; 2021-09-13), User-brennen, MediaWiki-extensions-General, WMF-General-or-Unknown, MediaWiki-General, Patch-For-Review, Voice & Tone

Tue, Nov 29

jcrespo added a comment to T243037: Shutdown scholarships.wikimedia.org and archive project.

This is the backup on eqiad, there is another one on codfw too:

Screenshot_20221129_123514.png (1×2 px, 164 KB)

Tue, Nov 29, 11:36 AM · Patch-For-Review, Wikimedia-GitHub, Diffusion-Repository-Administrators, Projects-Cleanup, Wikimedia-Wikimania-Scholarships
jcrespo added a comment to T243037: Shutdown scholarships.wikimedia.org and archive project.

@Marostegui We have an open checkbox "Delete scholarships database on m2-master.eqiad.wmnet" here, but I am not sure if that is actionable by DBA or, if it is, whether it should be assigned as a separate ticket.

We can do it yeah. I would like to ping @jcrespo first to make sure we have one last backup, before dropping it.
But once it is done, we can totally go ahead and drop it

Tue, Nov 29, 11:34 AM · Patch-For-Review, Wikimedia-GitHub, Diffusion-Repository-Administrators, Projects-Cleanup, Wikimedia-Wikimania-Scholarships
jcrespo awarded T323397: PHP Warning: Config page does not exist: title=[title], query= [Called from JsonConfig\JCUtils::warn in /srv/mediawiki/php-1.40.0-wmf.10/extensions/JsonConfig/includes/JCUtils.php at line 47] a Like token.
Tue, Nov 29, 10:54 AM · MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), Product Infrastructure Roadmap, JsonConfig, Wikimedia-production-error
jcrespo added a comment to T313978: Q1:rack/setup/install db1204, db1205.

10G is also not absolutely required at the moment. I personally would like to eventually have all dbs in a 10G for a fast backup recovery- and that is why we buy 10G cards, but there is no formal plan for it yet (I believe it will require the network upgrade for that).

Tue, Nov 29, 9:46 AM · SRE, Data-Persistence-Backup, ops-eqiad, DC-Ops

Mon, Nov 28

jcrespo updated the task description for T323903: Transfer.py: Deprecate md5 hashing in favour of the more secure sha256 for checksum.
Mon, Nov 28, 10:49 AM · database-backups, Data-Persistence-Backup
jcrespo updated the task description for T323903: Transfer.py: Deprecate md5 hashing in favour of the more secure sha256 for checksum.
Mon, Nov 28, 10:47 AM · database-backups, Data-Persistence-Backup
jcrespo triaged T323903: Transfer.py: Deprecate md5 hashing in favour of the more secure sha256 for checksum as Medium priority.
Mon, Nov 28, 10:45 AM · database-backups, Data-Persistence-Backup
jcrespo created T323903: Transfer.py: Deprecate md5 hashing in favour of the more secure sha256 for checksum.
Mon, Nov 28, 10:45 AM · database-backups, Data-Persistence-Backup
jcrespo closed T316337: Phabricator was logging out users repeatedly (2022-08-26) as Resolved.

@hashar @Vgutierrez Please review my summary of the incident at: https://wikitech.wikimedia.org/wiki/Incidents/2022-08-26_Phabricator_login_issues . I left some things as guesses, as I am unsure of what the best actionables are for ATS/Phabricator, but you are welcome to edit it and file any followup tickets if necessary.

Mon, Nov 28, 9:40 AM · Wikimedia-Incident, SRE, Phabricator, Traffic
jcrespo closed T316337: Phabricator was logging out users repeatedly (2022-08-26), a subtask of T316338: strip non session cookies before cache lookup in ATS, as Resolved.
Mon, Nov 28, 9:40 AM · SRE, Traffic
jcrespo updated the task description for T316337: Phabricator was logging out users repeatedly (2022-08-26).
Mon, Nov 28, 9:40 AM · Wikimedia-Incident, SRE, Phabricator, Traffic
jcrespo added a comment to T316337: Phabricator was logging out users repeatedly (2022-08-26).

As soon as I finish the wikitech description I intend to resolve it.

Mon, Nov 28, 9:18 AM · Wikimedia-Incident, SRE, Phabricator, Traffic

Fri, Nov 25

jcrespo committed rOSMBefdf42d38ed1: deletion: Fix bug in query for metadata deletion (authored by jcrespo).
deletion: Fix bug in query for metadata deletion
Fri, Nov 25, 2:12 PM
jcrespo committed rOSMBbc65264667a2: Prepare for release 0.1.4 (authored by jcrespo).
Prepare for release 0.1.4
Fri, Nov 25, 2:12 PM
jcrespo committed rOSMB9c9145c844bb: Fix parameter naming on deletion of an S3 object (authored by jcrespo).
Fix parameter naming on deletion of an S3 object
Fri, Nov 25, 2:11 PM
jcrespo committed rOSMB7b720dd8cb89: Fix minor syntax issue while rising an exception (authored by jcrespo).
Fix minor syntax issue while rising an exception
Fri, Nov 25, 2:11 PM
jcrespo closed T323796: Exception TypeError when trying to delete a file from media backups as Resolved.

All issues fixed in release 0.1.5. This is no longer a blocker.

Fri, Nov 25, 11:37 AM · Data-Persistence, Data-Persistence-Backup, media-backups
jcrespo added a comment to T323796: Exception TypeError when trying to delete a file from media backups.

Last minor issue: there were outdated indexes on the codfw database, leading to full scans on deletion.

Fri, Nov 25, 11:16 AM · Data-Persistence, Data-Persistence-Backup, media-backups
jcrespo added a comment to T323796: Exception TypeError when trying to delete a file from media backups.

Issues solved now, with a last blocker: metadata is correctly updated on the files table (file is set as "hard-deleted") but the row on the backups table is not deleted, so this is detected as a "metadata deletion failure". Investigating.

Fri, Nov 25, 10:10 AM · Data-Persistence, Data-Persistence-Backup, media-backups
jcrespo added a comment to T323796: Exception TypeError when trying to delete a file from media backups.

I know what the issue is- this automation was made initially just to list and restore files, never to modify them (deletion was a later addition). For this reason, these scripts use the read-only mediabackups account, which fails at performing the actual deletion.

Fri, Nov 25, 9:22 AM · Data-Persistence, Data-Persistence-Backup, media-backups
jcrespo added a comment to T323796: Exception TypeError when trying to delete a file from media backups.

The fix worked, deletion was sent this time, but we have an issue with permissions:

Fri, Nov 25, 9:11 AM · Data-Persistence, Data-Persistence-Backup, media-backups
jcrespo added a comment to T323796: Exception TypeError when trying to delete a file from media backups.

This is something that bit other people, so maybe it was working before, even if deprecated, but the library upgrade made it fail:

Fri, Nov 25, 8:37 AM · Data-Persistence, Data-Persistence-Backup, media-backups
jcrespo claimed T323796: Exception TypeError when trying to delete a file from media backups.

This is blocking deletion of files from backups.

Fri, Nov 25, 8:33 AM · Data-Persistence, Data-Persistence-Backup, media-backups
jcrespo created T323796: Exception TypeError when trying to delete a file from media backups.
Fri, Nov 25, 8:33 AM · Data-Persistence, Data-Persistence-Backup, media-backups

Wed, Nov 23

jcrespo added a comment to T323485: Transferpy: Enable PBKDF2 usage.

We realized it was due to the cpu flags sha_ni:

Wed, Nov 23, 11:25 AM · Data-Persistence, Data-Persistence-Backup, database-backups
jcrespo added a comment to T323485: Transferpy: Enable PBKDF2 usage.

Thanks, I will give it a look and test it on the right context (piping content on a single thread) and explore implementing it for the next release.

Consider testing openssl dgst besides coreutils' /usr/bin/sha256sum
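
For the "piping content on a single thread" context mentioned above, a checksum is computed incrementally over a stream rather than over a whole file in memory. The following is a minimal Python sketch of that pattern with the standard `hashlib` module. It does not reproduce transfer.py's implementation, and `stream_digest` and its parameters are hypothetical names.

```python
import hashlib
import io


def stream_digest(stream, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a binary stream chunk by chunk and return the hex digest.

    Reading fixed-size chunks keeps memory use constant, which matters
    when the data arrives through a pipe and its size is unknown.
    """
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Same input hashed with both algorithms under discussion.
payload = b"abc"
sha256_hex = stream_digest(io.BytesIO(payload), "sha256")
md5_hex = stream_digest(io.BytesIO(payload), "md5")
```

Timing this function over the same large input with `"md5"` and `"sha256"` would give a rough single-thread comparison, although results depend on the CPU and on how the underlying library was built.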

Wed, Nov 23, 10:50 AM · Data-Persistence, Data-Persistence-Backup, database-backups
jcrespo closed T323485: Transferpy: Enable PBKDF2 usage as Resolved.

According to openssl speed SHA-256 isn't slower than MD5

Wed, Nov 23, 10:26 AM · Data-Persistence, Data-Persistence-Backup, database-backups
jcrespo committed rOSTPaac9253dcca1: Use the shlex.quote method to escape hosts and paths (authored by jcrespo).
Use the shlex.quote method to escape hosts and paths
Wed, Nov 23, 10:24 AM
jcrespo committed rOSTP82fd3f0efe27: Add man page for transfer.py executable (authored by jcrespo).
Add man page for transfer.py executable
Wed, Nov 23, 10:24 AM
jcrespo committed rOSTPb656e727a258: Update changelog for release 1.1 (authored by jcrespo).
Update changelog for release 1.1
Wed, Nov 23, 10:24 AM
jcrespo committed rOSTPd9a48b6061f7: Transferer: Enable PBKDF2 usage with 310000 iterations (authored by jcrespo).
Transferer: Enable PBKDF2 usage with 310000 iterations
Wed, Nov 23, 10:24 AM
jcrespo added a comment to T323485: Transferpy: Enable PBKDF2 usage.

Thank you, I am going to productionize this as is after I validated it had no regressions (and there was no warning anymore).

Wed, Nov 23, 9:28 AM · Data-Persistence, Data-Persistence-Backup, database-backups

Tue, Nov 22

jcrespo added a comment to T316337: Phabricator was logging out users repeatedly (2022-08-26).

I am filling in: https://wikitech.wikimedia.org/wiki/Incidents/2022-08-26_Phabricator_login_issues (Still WIP)

Tue, Nov 22, 2:34 PM · Wikimedia-Incident, SRE, Phabricator, Traffic
jcrespo added a comment to T316337: Phabricator was logging out users repeatedly (2022-08-26).

Thanks, that is all I needed to understand the context! I will create a draft doc on Wikitech and link it here for review.

Tue, Nov 22, 2:13 PM · Wikimedia-Incident, SRE, Phabricator, Traffic
jcrespo added a comment to T323485: Transferpy: Enable PBKDF2 usage.

Ignore my previous comments, I was comparing with the wrong transfer. The speed times were:

Tue, Nov 22, 12:58 PM · Data-Persistence, Data-Persistence-Backup, database-backups
jcrespo added a comment to T323485: Transferpy: Enable PBKDF2 usage.

The new encryption parameters seem to make the transfer 20% slower. I will have more details after a deeper analysis and more testing, to make sure it is not caused by external factors.

Tue, Nov 22, 12:23 PM · Data-Persistence, Data-Persistence-Backup, database-backups
jcrespo assigned T323512: db2174 lost power to Papaul.
Tue, Nov 22, 9:13 AM · SRE, DBA, ops-codfw

Mon, Nov 21

jcrespo triaged T323512: db2174 lost power as High priority.
Mon, Nov 21, 6:00 PM · SRE, DBA, ops-codfw
jcrespo added a comment to T323512: db2174 lost power.

@Papaul, I wonder if we could do a "simple" test of checking the power supply redundancy by "pulling the plug" (literally or just pushing the on/off button) to check the power redundancy is working as it is expected. (maybe you already did that).

Mon, Nov 21, 5:59 PM · SRE, DBA, ops-codfw
jcrespo added a comment to T323280: Grant ssh access to analytics-admins to dcausse and gmodena.

Done, I removed irrelevant parts, if that is okay.

Mon, Nov 21, 5:53 PM · SRE, SRE-Access-Requests, Data-Engineering
jcrespo added a comment to T323512: db2174 lost power.

First timeout matches that log:

Service Unknown[2022-11-21 15:11:00] SERVICE ALERT: db2174;Check for large files in client bucket;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
Mon, Nov 21, 4:57 PM · SRE, DBA, ops-codfw
jcrespo added a comment to T323280: Grant ssh access to analytics-admins to dcausse and gmodena.

Thanks Ottomata, please use the template with the checklist I linked to you; otherwise I think there is not enough visibility and clarity to follow the process as documented and not forgetting any step.

Mon, Nov 21, 2:41 PM · SRE, SRE-Access-Requests, Data-Engineering
jcrespo changed the status of T323485: Transferpy: Enable PBKDF2 usage from Open to In Progress.
Mon, Nov 21, 1:00 PM · Data-Persistence, Data-Persistence-Backup, database-backups
jcrespo added a comment to T323485: Transferpy: Enable PBKDF2 usage.

I can confirm the warning was showing up on verbose mode:

100.0% (1/1) success ratio (>= 100.0% threshold) for command: '/bin/mkdir /tmp/...eqiad.wmnet_4401'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
----- OUTPUT of '/bin/bash -c "/u...qiad.wmnet 4401"' -----                                                                   
*** WARNING : deprecated key derivation used.                                                                                 
Using -iter or -pbkdf2 would be better.
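
The warning above is about how the encryption key is derived from the password. As a hedged illustration of what PBKDF2 does, here is a minimal Python sketch using `hashlib.pbkdf2_hmac`. It is not the openssl invocation used by transfer.py. The 310000 iteration count comes from the related commit title in this feed, while the SHA-256 digest, the 8-byte salt, and the 32-byte key length are assumptions of this sketch, and `derive_key` is a hypothetical name.

```python
import hashlib
import os

ITERATIONS = 310_000  # iteration count named in the transfer.py commit title


def derive_key(password, salt, iterations=ITERATIONS, length=32):
    """Derive a fixed-length key from a password with PBKDF2-HMAC-SHA256.

    A high iteration count makes each password guess expensive, which is
    what the deprecated single-pass derivation in the warning lacks.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=length
    )


salt = os.urandom(8)  # random salt; it must be kept alongside the ciphertext
key = derive_key("example password", salt)
```

The derivation runs once per transfer, so its cost does not grow with the amount of data sent.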
Mon, Nov 21, 1:00 PM · Data-Persistence, Data-Persistence-Backup, database-backups
jcrespo claimed T323485: Transferpy: Enable PBKDF2 usage.
Mon, Nov 21, 12:12 PM · Data-Persistence, Data-Persistence-Backup, database-backups
jcrespo edited projects for T323485: Transferpy: Enable PBKDF2 usage, added: Data-Persistence-Backup, Data-Persistence; removed SRE.
Mon, Nov 21, 12:11 PM · Data-Persistence, Data-Persistence-Backup, database-backups
jcrespo moved T323280: Grant ssh access to analytics-admins to dcausse and gmodena from Untriaged to Awaiting User Input on the SRE-Access-Requests board.

To clarify- there is no blocker from SRE team ops to proceed with this, we are eager and waiting for the template to be added on this ticket to formally kickstart the process.

Mon, Nov 21, 10:58 AM · SRE, SRE-Access-Requests, Data-Engineering
jcrespo moved T323280: Grant ssh access to analytics-admins to dcausse and gmodena from Backlog to Acknowledged on the SRE board.
Mon, Nov 21, 10:57 AM · SRE, SRE-Access-Requests, Data-Engineering

Fri, Nov 18

jcrespo moved T263220: Limit concurrency of DPL queries from Backlog to Acknowledged on the SRE board.
Fri, Nov 18, 11:45 AM · serviceops-radar, Wikimedia-Slow-DB-Query, SecTeam-Processed, Security, Vuln-DoS, Sustainability (Incident Followup), SRE, Platform Team Workboards (Clinic Duty Team), MW-1.36-notes (1.36.0-wmf.18; 2020-11-17), Performance Issue, Patch-For-Review, DynamicPageList (Wikimedia)

Thu, Nov 17

jcrespo added a comment to T323280: Grant ssh access to analytics-admins to dcausse and gmodena.

According to Namely, Will and Guillome should approve for each + either Otto or Olja from your side (let me know if that is up to date).

Thu, Nov 17, 7:38 PM · SRE, SRE-Access-Requests, Data-Engineering
jcrespo added a comment to T323280: Grant ssh access to analytics-admins to dcausse and gmodena.

Do we need any additional approval from elsewhere in SRE or can we just go ahead and make the change?

Thu, Nov 17, 7:34 PM · SRE, SRE-Access-Requests, Data-Engineering
jcrespo moved T323262: gerrit1001 running out of space on / from Active investigation to Awaiting report on the Wikimedia-Incident board.
Thu, Nov 17, 12:17 PM · Release-Engineering-Team (GitLab III: GitLab in LA 🪃), Gerrit, SRE-OnFire, Sustainability (Incident Followup), Wikimedia-Incident, serviceops-collab
jcrespo added a project to T323262: gerrit1001 running out of space on /: Wikimedia-Incident.

If an incident is planned to be written, let me add the corresponding tag for tracking purposes only (I'm on clinic duty this week).

Thu, Nov 17, 12:17 PM · Release-Engineering-Team (GitLab III: GitLab in LA 🪃), Gerrit, SRE-OnFire, Sustainability (Incident Followup), Wikimedia-Incident, serviceops-collab
jcrespo added a comment to T323207: Grant Access to wmf for Atripathi.

Thank you for bearing with me! Old account is now disabled on Phabricator and everywhere else. Further requests should go much smoother! Sorry for the complications. Have a nice day and enjoy your new privileges! :-)

Thu, Nov 17, 9:49 AM · SRE, LDAP-Access-Requests
jcrespo added a comment to T323207: Grant Access to wmf for Atripathi.

Could I please request you to disable the other Phab account (Username: AbhasT)?

Thu, Nov 17, 9:21 AM · SRE, LDAP-Access-Requests

Wed, Nov 16

jcrespo closed T323207: Grant Access to wmf for Atripathi as Resolved.

@Abhas: you have been added to the WMF LDAP group, which should provide you access to superset. Please check that access is working for you. You were also added to the NDA group here on Phabricator.

Wed, Nov 16, 5:34 PM · SRE, LDAP-Access-Requests
jcrespo added a member for WMF-NDA: Abhas.
Wed, Nov 16, 5:19 PM
jcrespo changed the status of T323207: Grant Access to wmf for Atripathi from Open to In Progress.
Wed, Nov 16, 4:37 PM · SRE, LDAP-Access-Requests
jcrespo added a comment to T323207: Grant Access to wmf for Atripathi.

For the record, the UID/CN on LDAP associated with the corporate LDAP/email is: Abhas, I updated it on the request.

Wed, Nov 16, 4:16 PM · SRE, LDAP-Access-Requests
jcrespo updated the task description for T323207: Grant Access to wmf for Atripathi.
Wed, Nov 16, 4:15 PM · SRE, LDAP-Access-Requests
jcrespo added a comment to T323208: lists apache config change should trigger an apache reload.

Tempted to mark this as a duplicate of T255124

Wed, Nov 16, 2:27 PM · Wikimedia-Incident, SRE, Wikimedia-Mailing-lists
jcrespo added a comment to T323208: lists apache config change should trigger an apache reload.

What was the specific change that was deployed? What was the specific change that caused the issue?

Wed, Nov 16, 2:25 PM · Wikimedia-Incident, SRE, Wikimedia-Mailing-lists
jcrespo updated the task description for T323207: Grant Access to wmf for Atripathi.
Wed, Nov 16, 2:09 PM · SRE, LDAP-Access-Requests
jcrespo claimed T323207: Grant Access to wmf for Atripathi.
Wed, Nov 16, 2:07 PM · SRE, LDAP-Access-Requests
jcrespo moved T323207: Grant Access to wmf for Atripathi from Awaiting User Input to Backlog on the LDAP-Access-Requests board.
Wed, Nov 16, 2:07 PM · SRE, LDAP-Access-Requests
jcrespo added a comment to T323207: Grant Access to wmf for Atripathi.

I contacted Abhas in private, proving the request was legitimate. Thank you and apologies for any problem caused!

Wed, Nov 16, 2:07 PM · SRE, LDAP-Access-Requests
jcrespo triaged T323207: Grant Access to wmf for Atripathi as High priority.
Wed, Nov 16, 1:28 PM · SRE, LDAP-Access-Requests
jcrespo moved T323207: Grant Access to wmf for Atripathi from Backlog to Awaiting User Input on the LDAP-Access-Requests board.
Wed, Nov 16, 1:24 PM · SRE, LDAP-Access-Requests
jcrespo updated subscribers of T323207: Grant Access to wmf for Atripathi.

Hello, @Atripathi. Privileged access to LDAP is provided to people according to certain rules and needs. I hope this doesn't sound disrespectful, but I am not sure who the requester is (this is an account created today with little to no information or verification) or who you are requesting the access for (@Atripathi?), another account created today with no context.

Wed, Nov 16, 1:12 PM · SRE, LDAP-Access-Requests
jcrespo added a comment to T311687: Upgrade ganeti/eqiad to Bullseye.

Ah, so you mean they are temporary during the maintenance, and won't happen once all migrations are done? Then please keep up the good work :-P

Wed, Nov 16, 12:58 PM · Ganeti, Infrastructure-Foundations, SRE
jcrespo added a comment to T311687: Upgrade ganeti/eqiad to Bullseye.

I am seeing a couple of non-fatal errors on ganeti. I wonder if they could be artifacts of the bullseye upgrade (in particular, of a ganeti upgrade), as I don't see them on the not-yet-upgraded hosts, but they start exactly on the same day the hosts were upgraded, FYI:

Wed, Nov 16, 12:52 PM · Ganeti, Infrastructure-Foundations, SRE
jcrespo added a comment to T323208: lists apache config change should trigger an apache reload.

hmmm that would trigger a few seconds of downtime every time that Apache is restarted automatically by puppet

Wed, Nov 16, 11:14 AM · Wikimedia-Incident, SRE, Wikimedia-Mailing-lists
jcrespo renamed T323208: lists apache config change should trigger an apache reload from lists apache config change should trigger an apache restart to lists apache config change should trigger an apache reload.
Wed, Nov 16, 11:14 AM · Wikimedia-Incident, SRE, Wikimedia-Mailing-lists
jcrespo added a comment to T323208: lists apache config change should trigger an apache reload.

I am marking this as an incident, as lists were down for around 2.5h, although it could also be considered a Sustainability (Incident Followup).

Wed, Nov 16, 11:06 AM · Wikimedia-Incident, SRE, Wikimedia-Mailing-lists
jcrespo added a project to T323208: lists apache config change should trigger an apache reload: Wikimedia-Incident.
Wed, Nov 16, 11:03 AM · Wikimedia-Incident, SRE, Wikimedia-Mailing-lists
jcrespo updated subscribers of T321128: Q1:rack/setup/install dbprov2004.

I think what we already know about it is that the server has one power supply on the left and the other one on the right

Wed, Nov 16, 9:47 AM · Data-Persistence-Backup, SRE, ops-codfw, DC-Ops
jcrespo added a comment to T318058: Requesting access to Eventlogs, Stats for Simulo-wikitech.

My suggestion is purely practical- to avoid having every week SREs asking you (because processing access requests rotates every 7 days)- removing the SRE and SRE-Access-Requests tags, as we have no actionables yet. Probably add a personal tag or that of the sponsor team. When ready, it can always be retagged with no issue. This is mostly to avoid the stress of access requests, which are treated with maximum urgency on our side.

Wed, Nov 16, 9:36 AM · User-awight

Tue, Nov 15

jcrespo added a project to T323094: asw1-eqsin: VC mastership change: Wikimedia-Incident.
Tue, Nov 15, 4:41 PM · Wikimedia-Incident, SRE, netops, ops-eqsin, Infrastructure-Foundations
jcrespo added a comment to T321128: Q1:rack/setup/install dbprov2004.

@Papaul You probably are not asking me, but the work on the server is scheduled for next quarter, so feel free to do more tests/work with this server. We actually chose to order the new mode for this expansion because it was OK for the setup to take longer (and it was just 1 per DC), but in the future other servers like this will be ordered, if that helps the answer.

Tue, Nov 15, 4:33 PM · Data-Persistence-Backup, SRE, ops-codfw, DC-Ops
jcrespo added a comment to T320959: SRE Clinic duty - triage query review.

I agree, but I think the second should be communicated more widely. I think many people would prefer to have a task closed than open forever and never worked on, but that requires a common understanding (e.g. declining now doesn't mean it is a bad idea, and it could be reopened later on or pushed for in the future).

Tue, Nov 15, 12:38 PM · SRE Program Management, PM
jcrespo added a comment to T321128: Q1:rack/setup/install dbprov2004.

Patch that should help: https://gerrit.wikimedia.org/r/856927 (assuming the HDs RAID is sda)

Tue, Nov 15, 11:36 AM · Data-Persistence-Backup, SRE, ops-codfw, DC-Ops
jcrespo added a comment to T299387: Database corruption due to compressOld array plus bug, April 2006.

Would it be possible to introduce a revision that fixes the missing reference? If you document the process I can fix future cases (or at least avoid errors).

Tue, Nov 15, 10:26 AM · Patch-For-Review, Wikimedia-database-issue (Bad data), Platform Engineering, MediaWiki-Core-Revision-backend, Wikimedia-production-error
jcrespo added a comment to T321128: Q1:rack/setup/install dbprov2004.

@Papaul Yes, the RAID with the HDs should contain the OS, using the same custom recipe as db hosts (ideally that is sda, the first hw RAID virtual disk). The SSDs are used for fast scratch space. I can set the partitions of those after installation, but ideally you could set up the RAID 0 for the SSDs for me before or after installation.

Tue, Nov 15, 10:08 AM · Data-Persistence-Backup, SRE, ops-codfw, DC-Ops

Mon, Nov 14

jcrespo updated the task description for T322591: Requesting access to analytics-privatedata-users for Dasm.
Mon, Nov 14, 4:19 PM · SRE, SRE-Access-Requests
jcrespo created T323040: db2166 crashed several times.
Mon, Nov 14, 4:17 PM · Wikimedia-production-error, DBA
jcrespo moved T223319: URL shortener subdomains for useful Wikimedia infrastructure from Backlog to Acknowledged on the SRE board.
Mon, Nov 14, 1:15 PM · SRE
jcrespo lowered the priority of T223319: URL shortener subdomains for useful Wikimedia infrastructure from Medium to Low.

Priority-wise not giving a judgment call, but reflecting this has been untouched since 2019 and looks like a (nice) feature request, not a bug.

Mon, Nov 14, 1:15 PM · SRE
jcrespo added a comment to T320959: SRE Clinic duty - triage query review.

I answer myself with a potential solution: I am thinking of trying to hide SRE tasks for triage in the "(Acknowledged)" columns, but I wonder if that is just delaying the inevitable: how to handle tasks that are correct but that no one is going to work on? (Especially when several teams have different policies: some close them, some remove their tag, some just leave them untouched or on a separate column.)

Mon, Nov 14, 1:11 PM · SRE Program Management, PM