
jcrespo (Jaime Crespo)
Sr Database Administrator

User Details

User Since
May 11 2015, 8:31 AM (232 w, 2 d)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF) [ Global Accounts ]

Recent Activity

Yesterday

jcrespo added a comment to T223151: Review special replica partitioning of certain tables by `xx_user`.

Addendum, there is also logging of long running queries killed:

Wed, Oct 23, 1:57 PM · mariadb-optimizer-bug, Core Platform Team, MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), Performance Issue, DBA
jcrespo added a comment to T224589: Migrate dbmonitor hosts to Stretch/Buster.

Normally the fix for the above would be trivial, but because the sql class was designed as a singleton, I don't think it is worth fixing: it would force either a deeper refactoring or a global-scope hack. I would rather spend the time refactoring tendril so that it does not use the Google API: T96499

Wed, Oct 23, 10:46 AM · Operations
jcrespo merged T139765: dbtree.wikimedia.org: Replace Google Charts usage with something we can host into T96499: dbtree loads third party resources (from jquery.com and google.com).
Wed, Oct 23, 10:45 AM · Privacy, Traffic, HTTPS, Operations, Patch-For-Review, DBA, WMF-Legal
jcrespo merged task T139765: dbtree.wikimedia.org: Replace Google Charts usage with something we can host into T96499: dbtree loads third party resources (from jquery.com and google.com).
Wed, Oct 23, 10:45 AM · Privacy, Wikimedia-General-or-Unknown
jcrespo added a comment to T224589: Migrate dbmonitor hosts to Stretch/Buster.
[Wed Oct 23 10:17:48.055752 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_select_db() expects exactly 2 parameters, 1 given in /srv/dbtree/index.php on line 33
[Wed Oct 23 10:17:48.055976 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_real_escape_string() expects exactly 2 parameters, 1 given in /srv/dbtree/inc/sanity.php on line 286
[Wed Oct 23 10:17:48.056122 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_query() expects parameter 1 to be mysqli, string given in /srv/dbtree/inc/sanity.php on line 783
[Wed Oct 23 10:17:48.056213 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_errno() expects exactly 1 parameter, 0 given in /srv/dbtree/inc/sanity.php on line 788
[Wed Oct 23 10:17:48.056297 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_error() expects exactly 1 parameter, 0 given in /srv/dbtree/inc/sanity.php on line 789
[Wed Oct 23 10:17:48.056377 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_fetch_array() expects parameter 1 to be mysqli_result, null given in /srv/dbtree/inc/sanity.php on line 852
[Wed Oct 23 10:17:48.060413 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_select_db() expects exactly 2 parameters, 1 given in /srv/dbtree/index.php on line 33
[Wed Oct 23 10:17:48.060750 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_real_escape_string() expects exactly 2 parameters, 1 given in /srv/dbtree/inc/sanity.php on line 286
[Wed Oct 23 10:17:48.060941 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_query() expects parameter 1 to be mysqli, string given in /srv/dbtree/inc/sanity.php on line 783
[Wed Oct 23 10:17:48.061083 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_errno() expects exactly 1 parameter, 0 given in /srv/dbtree/inc/sanity.php on line 788
[Wed Oct 23 10:17:48.061236 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_error() expects exactly 1 parameter, 0 given in /srv/dbtree/inc/sanity.php on line 789
[Wed Oct 23 10:17:48.061352 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_fetch_array() expects parameter 1 to be mysqli_result, null given in /srv/dbtree/inc/sanity.php on line 852
[Wed Oct 23 10:17:48.062785 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_select_db() expects exactly 2 parameters, 1 given in /srv/dbtree/index.php on line 33
[Wed Oct 23 10:17:48.062926 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_real_escape_string() expects exactly 2 parameters, 1 given in /srv/dbtree/inc/sanity.php on line 286
[Wed Oct 23 10:17:48.063107 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_query() expects parameter 1 to be mysqli, string given in /srv/dbtree/inc/sanity.php on line 783
[Wed Oct 23 10:17:48.063231 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_errno() expects exactly 1 parameter, 0 given in /srv/dbtree/inc/sanity.php on line 788
[Wed Oct 23 10:17:48.063341 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_error() expects exactly 1 parameter, 0 given in /srv/dbtree/inc/sanity.php on line 789
[Wed Oct 23 10:17:48.063453 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_fetch_array() expects parameter 1 to be mysqli_result, null given in /srv/dbtree/inc/sanity.php on line 852
[Wed Oct 23 10:17:48.064791 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_select_db() expects exactly 2 parameters, 1 given in /srv/dbtree/index.php on line 33
[Wed Oct 23 10:17:48.064942 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_real_escape_string() expects exactly 2 parameters, 1 given in /srv/dbtree/inc/sanity.php on line 286
[Wed Oct 23 10:17:48.065104 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_real_escape_string() expects exactly 2 parameters, 1 given in /srv/dbtree/inc/sanity.php on line 286
[Wed Oct 23 10:17:48.065229 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_real_escape_string() expects exactly 2 parameters, 1 given in /srv/dbtree/inc/sanity.php on line 286
[Wed Oct 23 10:17:48.065380 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_query() expects parameter 1 to be mysqli, string given in /srv/dbtree/inc/sanity.php on line 783
[Wed Oct 23 10:17:48.065489 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_errno() expects exactly 1 parameter, 0 given in /srv/dbtree/inc/sanity.php on line 788
[Wed Oct 23 10:17:48.065596 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_error() expects exactly 1 parameter, 0 given in /srv/dbtree/inc/sanity.php on line 789
[Wed Oct 23 10:17:48.065707 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_fetch_array() expects parameter 1 to be mysqli_result, null given in /srv/dbtree/inc/sanity.php on line 852
[Wed Oct 23 10:17:48.067187 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_select_db() expects exactly 2 parameters, 1 given in /srv/dbtree/index.php on line 33
[Wed Oct 23 10:17:48.067320 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_real_escape_string() expects exactly 2 parameters, 1 given in /srv/dbtree/inc/sanity.php on line 286
[Wed Oct 23 10:17:48.067507 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_query() expects parameter 1 to be mysqli, string given in /srv/dbtree/inc/sanity.php on line 783
[Wed Oct 23 10:17:48.067633 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_errno() expects exactly 1 parameter, 0 given in /srv/dbtree/inc/sanity.php on line 788
[Wed Oct 23 10:17:48.067751 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_error() expects exactly 1 parameter, 0 given in /srv/dbtree/inc/sanity.php on line 789
[Wed Oct 23 10:17:48.067896 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_fetch_array() expects parameter 1 to be mysqli_result, null given in /srv/dbtree/inc/sanity.php on line 852
[Wed Oct 23 10:17:48.070537 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_select_db() expects exactly 2 parameters, 1 given in /srv/dbtree/index.php on line 33
[Wed Oct 23 10:17:48.070809 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_query() expects parameter 1 to be mysqli, string given in /srv/dbtree/inc/sanity.php on line 783
[Wed Oct 23 10:17:48.070914 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_errno() expects exactly 1 parameter, 0 given in /srv/dbtree/inc/sanity.php on line 788
[Wed Oct 23 10:17:48.071025 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_error() expects exactly 1 parameter, 0 given in /srv/dbtree/inc/sanity.php on line 789
[Wed Oct 23 10:17:48.071139 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_fetch_array() expects parameter 1 to be mysqli_result, null given in /srv/dbtree/inc/sanity.php on line 852
Wed, Oct 23, 10:19 AM · Operations
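
The warnings above repeat the same few call sites many times, so tallying them shows how many distinct places in the PHP actually need attention. A minimal sketch, assuming the Apache error-log format shown above; the function name `summarize_warnings` and the `sample` list are illustrative and not part of dbtree:

```python
import re
from collections import Counter

# Matches the "PHP Warning:  func() expects ... in /path on line N" tail
# of an Apache error-log line like the ones quoted above.
WARNING_RE = re.compile(
    r"PHP Warning:\s+(?P<func>\w+)\(\) expects .*? in (?P<path>\S+) on line (?P<line>\d+)"
)


def summarize_warnings(log_lines):
    """Count PHP warnings per (function, file, line) call site."""
    counts = Counter()
    for text in log_lines:
        match = WARNING_RE.search(text)
        if match:
            counts[(match["func"], match["path"], int(match["line"]))] += 1
    return counts


# Three lines copied from the log above, used only as sample input.
sample = [
    "[Wed Oct 23 10:17:48.055752 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_select_db() expects exactly 2 parameters, 1 given in /srv/dbtree/index.php on line 33",
    "[Wed Oct 23 10:17:48.056122 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_query() expects parameter 1 to be mysqli, string given in /srv/dbtree/inc/sanity.php on line 783",
    "[Wed Oct 23 10:17:48.060413 2019] [:error] [pid 10017] [client 10.64.32.67:64543] PHP Warning:  mysqli_select_db() expects exactly 2 parameters, 1 given in /srv/dbtree/index.php on line 33",
]

if __name__ == "__main__":
    # Print the call sites, most frequent first.
    for (func, path, line), hits in summarize_warnings(sample).most_common():
        print(hits, func, path, line)
```

Every warning in the log points at a `mysqli_*` call that is missing, or has the wrong type for, its connection or result argument, so the distinct call sites are the useful unit to count.
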
jcrespo added a comment to T224589: Migrate dbmonitor hosts to Stretch/Buster.

Now we "only" need to fix the PHP, which I would prefer not to do, not because it would be difficult, but because it would be a waste of time. I would prefer to create a simple flash + d3 microsite, specifically for dbtree:

Wed, Oct 23, 9:24 AM · Operations
jcrespo added a comment to T224589: Migrate dbmonitor hosts to Stretch/Buster.

I manually ran a2dismod mpm_event and it worked. I will check whether this happens again on a clean install of dbmonitor1001 and add code to handle it.

Wed, Oct 23, 9:04 AM · Operations
jcrespo added a comment to T224589: Migrate dbmonitor hosts to Stretch/Buster.

That was more or less what I tried before, but it installs the event version rather than prefork. Just to be sure, I tried your exact purges again, and I got the same error:

Wed, Oct 23, 9:01 AM · Operations
jcrespo added a comment to T224589: Migrate dbmonitor hosts to Stretch/Buster.

🤔

Notice: /Stage[main]/Httpd/Httpd::Conf[defaults]/File[/etc/apache2/conf-enabled/00-defaults.conf]/ensure: created
Info: /Stage[main]/Httpd/Httpd::Conf[defaults]/File[/etc/apache2/conf-enabled/00-defaults.conf]: Scheduling refresh of Service[apache2]
Notice: /Stage[main]/Httpd/Httpd::Mod_conf[rewrite]/Exec[ensure_present_mod_rewrite]/returns: executed successfully
Info: /Stage[main]/Httpd/Httpd::Mod_conf[rewrite]/Exec[ensure_present_mod_rewrite]: Scheduling refresh of Service[apache2]
Notice: /Stage[main]/Httpd/Httpd::Mod_conf[headers]/Exec[ensure_present_mod_headers]/returns: executed successfully
Info: /Stage[main]/Httpd/Httpd::Mod_conf[headers]/Exec[ensure_present_mod_headers]: Scheduling refresh of Service[apache2]
Notice: /Stage[main]/Httpd/Httpd::Mod_conf[ssl]/Exec[ensure_present_mod_ssl]/returns: executed successfully
Info: /Stage[main]/Httpd/Httpd::Mod_conf[ssl]/Exec[ensure_present_mod_ssl]: Scheduling refresh of Service[apache2]
Notice: /Stage[main]/Httpd/Httpd::Mod_conf[php7.3]/Exec[ensure_present_mod_php7.3]/returns: ERROR: Module mpm_event is enabled - cannot proceed due to conflicts. It needs to be disabled first!
Notice: /Stage[main]/Httpd/Httpd::Mod_conf[php7.3]/Exec[ensure_present_mod_php7.3]/returns: ERROR: Could not enable dependency mpm_prefork for php7.3, aborting
Notice: /Stage[main]/Httpd/Httpd::Mod_conf[php7.3]/Exec[ensure_present_mod_php7.3]/returns: Considering dependency mpm_prefork for php7.3:
Notice: /Stage[main]/Httpd/Httpd::Mod_conf[php7.3]/Exec[ensure_present_mod_php7.3]/returns: Considering conflict mpm_event for mpm_prefork:
Notice: /Stage[main]/Httpd/Httpd::Mod_conf[php7.3]/Exec[ensure_present_mod_php7.3]/returns: Considering conflict mpm_worker for mpm_prefork:
Error: '/usr/sbin/a2enmod php7.3' returned 1 instead of one of [0]
Error: /Stage[main]/Httpd/Httpd::Mod_conf[php7.3]/Exec[ensure_present_mod_php7.3]/returns: change from 'notrun' to ['0'] failed: '/usr/sbin/a2enmod php7.3' returned 1 instead of one of [0]
Notice: /Stage[main]/Httpd/Httpd::Mod_conf[authnz_ldap]/Exec[ensure_present_mod_authnz_ldap]/returns: executed successfully
Wed, Oct 23, 8:31 AM · Operations
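
The Puppet output above shows `a2enmod php7.3` refusing to proceed because `mpm_event` is enabled. A minimal sketch of how that conflict could be picked out of the output automatically, assuming the error wording shown above; the helper names are hypothetical and this is not the handling code mentioned in the task, only an illustration that prints the matching `a2dismod` command without running it:

```python
import re

# Wording taken from the a2enmod error line in the Puppet output above.
CONFLICT_RE = re.compile(
    r"ERROR: Module (?P<module>\S+) is enabled - cannot proceed due to conflicts"
)


def find_conflicting_modules(output_lines):
    """Return the modules a2enmod reports as blocking, in first-seen order."""
    seen = []
    for text in output_lines:
        match = CONFLICT_RE.search(text)
        if match and match["module"] not in seen:
            seen.append(match["module"])
    return seen


def suggested_commands(output_lines):
    """Build (but do not execute) one a2dismod command per blocking module."""
    return [f"a2dismod {name}" for name in find_conflicting_modules(output_lines)]


# Two lines copied from the Puppet run above, used only as sample input.
puppet_output = [
    "Notice: /Stage[main]/Httpd/Httpd::Mod_conf[php7.3]/Exec[ensure_present_mod_php7.3]/returns: ERROR: Module mpm_event is enabled - cannot proceed due to conflicts. It needs to be disabled first!",
    "Notice: /Stage[main]/Httpd/Httpd::Mod_conf[php7.3]/Exec[ensure_present_mod_php7.3]/returns: ERROR: Could not enable dependency mpm_prefork for php7.3, aborting",
]

if __name__ == "__main__":
    for command in suggested_commands(puppet_output):
        print(command)
```
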
jcrespo added a comment to T236152: wmf-auto-reimage, decommission & Server_lifecycle documentation for virtual machines reimage confusing.

Here are my diffs:

Wed, Oct 23, 8:08 AM · SRE-tools, Documentation

Tue, Oct 22

jcrespo added a comment to T236152: wmf-auto-reimage, decommission & Server_lifecycle documentation for virtual machines reimage confusing.

I will ask @RobH and @akosiaris if I can mess with those pages; a one-line addition with a warning would probably suffice. I didn't understand the state, so thanks for the clarifications. I wanted to stress that I didn't need the functionality, just a clarification of the current status.

Tue, Oct 22, 4:08 PM · SRE-tools, Documentation
jcrespo added a comment to T236166: scap sync failed, database error: RevisionStore::fetchRevisionRowFromConds Error: 1146 Table 'labtestwiki.revision' doesn't exist.

context: https://phabricator.wikimedia.org/T233236#5585066

Tue, Oct 22, 2:42 PM · wikitech.wikimedia.org, cloud-services-team, Wikimedia-production-error
jcrespo edited projects for T236166: scap sync failed, database error: RevisionStore::fetchRevisionRowFromConds Error: 1146 Table 'labtestwiki.revision' doesn't exist, added: cloud-services-team, wikitech.wikimedia.org; removed DBA.
Tue, Oct 22, 2:41 PM · wikitech.wikimedia.org, cloud-services-team, Wikimedia-production-error
jcrespo added a comment to T224589: Migrate dbmonitor hosts to Stretch/Buster.

Thanks, joe, I didn't see your comment so it took me more time than I thought to find it. The above 2 patches should fix it?

Tue, Oct 22, 2:04 PM · Operations
jcrespo added a comment to T224589: Migrate dbmonitor hosts to Stretch/Buster.

2 blockers:

Tue, Oct 22, 1:35 PM · Operations
jcrespo updated subscribers of T236152: wmf-auto-reimage, decommission & Server_lifecycle documentation for virtual machines reimage confusing.

Sorry, wrong person.

Tue, Oct 22, 12:43 PM · SRE-tools, Documentation
jcrespo created T236152: wmf-auto-reimage, decommission & Server_lifecycle documentation for virtual machines reimage confusing.
Tue, Oct 22, 12:43 PM · SRE-tools, Documentation
jcrespo claimed T224589: Migrate dbmonitor hosts to Stretch/Buster.
Tue, Oct 22, 11:48 AM · Operations
jcrespo added a comment to T227142: a6-eqiad pdu refresh (Tuesday 10/22 @11am UTC).

db1115 is now down, I took the opportunity to upgrade all its system packages, but didn't touch mariadb.

Tue, Oct 22, 10:53 AM · DC-Ops, Operations, ops-eqiad
jcrespo added a comment to T235356: Fatal from ApiGraph->getGraphSpec: "Call to a member function getExtensionData() on boolean".

Large spike just happened minutes ago, enough to notify high exception rate on -operations: https://logstash.wikimedia.org/goto/a82a49a001c2f38904cbfbc9fb390292

Tue, Oct 22, 9:36 AM · Editing-team, MediaWiki-extensions-Graph, Wikimedia-production-error

Mon, Oct 21

jcrespo placed T232446: Compress new Wikibase tables up for grabs.
Mon, Oct 21, 10:16 AM · DBA
jcrespo added a comment to T234900: Setup bacula backup monitoring.

In nagios format:

Mon, Oct 21, 8:50 AM · Patch-For-Review, Availability, observability, Goal, Operations

Fri, Oct 18

jcrespo added a comment to T234900: Setup bacula backup monitoring.

With this first version I get:

Fri, Oct 18, 4:22 PM · Patch-For-Review, Availability, observability, Goal, Operations
jcrespo added a comment to T229209: Strengthen backup infrastructure and support.

This may be interesting for our physical migration, in a worst-case scenario:

Fri, Oct 18, 10:53 AM · Patch-For-Review, Goal, DBA, serviceops, Operations
jcrespo added a comment to T235838: Backups on buster hosts fail to run.

There are mixed experiences with compatibility of clients and storage daemons between 5.X and higher versions: https://serverfault.com/questions/837241/bacula-versions-compatibility

Fri, Oct 18, 10:30 AM · Patch-For-Review, DBA, serviceops, Operations
jcrespo added a comment to T235838: Backups on buster hosts fail to run.

It is not the second case:

Fri, Oct 18, 10:17 AM · Patch-For-Review, DBA, serviceops, Operations
jcrespo added a comment to T235838: Backups on buster hosts fail to run.

Bacula advice on the issue:

Fri, Oct 18, 10:14 AM · Patch-For-Review, DBA, serviceops, Operations
jcrespo removed a project from T235838: Backups on buster hosts fail to run: Goal.
Fri, Oct 18, 10:11 AM · Patch-For-Review, DBA, serviceops, Operations
jcrespo updated the task description for T235838: Backups on buster hosts fail to run.
Fri, Oct 18, 10:09 AM · Patch-For-Review, DBA, serviceops, Operations
jcrespo triaged T235838: Backups on buster hosts fail to run as High priority.
Fri, Oct 18, 10:08 AM · Patch-For-Review, DBA, serviceops, Operations
jcrespo created T235838: Backups on buster hosts fail to run.
Fri, Oct 18, 10:08 AM · Patch-For-Review, DBA, serviceops, Operations

Thu, Oct 17

jcrespo added a comment to T234900: Setup bacula backup monitoring.

I've created a quick and dirty script that extracts backup job status without needing to query the database. For example, in 6 lines it can get the backup host whose last execution was not successful:

Thu, Oct 17, 6:11 PM · Patch-For-Review, Availability, observability, Goal, Operations
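
The script itself is not reproduced in this feed. As an illustration only, here is a minimal sketch of the underlying idea (keep the latest job per host, then report hosts whose latest job did not succeed). The record shape `(host, finished_at, succeeded)` and the host names are hypothetical and are not taken from the author's script or from Bacula's own output format:

```python
from datetime import datetime


def hosts_with_failed_last_job(jobs):
    """Return hosts whose most recent job did not succeed.

    jobs: iterable of (host, finished_at, succeeded) tuples (hypothetical shape).
    """
    latest = {}
    for host, finished_at, succeeded in jobs:
        # Keep only the newest record seen so far for each host.
        if host not in latest or finished_at > latest[host][0]:
            latest[host] = (finished_at, succeeded)
    return sorted(host for host, (_, ok) in latest.items() if not ok)


# Made-up sample records; the host names are placeholders.
example_jobs = [
    ("host-a", datetime(2019, 10, 15, 4, 0), False),
    ("host-a", datetime(2019, 10, 16, 4, 0), True),   # recovered on the next run
    ("host-b", datetime(2019, 10, 15, 4, 0), True),
    ("host-b", datetime(2019, 10, 16, 4, 0), False),  # latest run failed
]

if __name__ == "__main__":
    print(hosts_with_failed_last_job(example_jobs))
```

Only the latest run per host matters here, so an earlier failure that was followed by a successful run is not reported.
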
jcrespo added a comment to T229209: Strengthen backup infrastructure and support.

The copy finished correctly and actually found a bug in transfer.py:

Thu, Oct 17, 1:18 PM · Patch-For-Review, Goal, DBA, serviceops, Operations
jcrespo updated subscribers of T227355: DBA review for the MachineVision extension.

Sadly I am not in charge of databases anymore, @Marostegui will have to do the work.

Thu, Oct 17, 7:48 AM · DBA, Product-Infrastructure-Team-Backlog, Machine vision

Wed, Oct 16

jcrespo added a comment to T229209: Strengthen backup infrastructure and support.

I have discussed a plan with alex; there is a preliminary but timid suggestion of steps in the design (more like diary) document.

Wed, Oct 16, 4:44 PM · Patch-For-Review, Goal, DBA, serviceops, Operations
jcrespo added a comment to T229209: Strengthen backup infrastructure and support.

Reminder:

# TODO The IPv6 IP should be converted into a DNS AAAA resolve once we
# enabled the DNS record on the director
Wed, Oct 16, 10:51 AM · Patch-For-Review, Goal, DBA, serviceops, Operations

Tue, Oct 15

jcrespo added a comment to T215183: Redundant bootloaders for software RAID.

Jaime, if you have a spare host in this hardware configuration, I could try reimaging and gathering logs.

@CDanis I was able to install it in the end; it was a conflict with another drive (hw RAID, in addition to the sw one) that caused the issues, not the recipe. Sorry for the misreporting.

Tue, Oct 15, 2:41 PM · Operations
jcrespo added a comment to T234900: Setup bacula backup monitoring.

Just to be clear, the above was not a concrete proposal, more like a brainstorming of everything I could think of off the top of my head. There are probably more things :-D
Thank you for your input. I will certainly seek your opinion, and that of other stakeholders, before implementing this (which, as you suggested, will probably come as a prioritized queue), as I previously announced.

Tue, Oct 15, 12:20 PM · Patch-For-Review, Availability, observability, Goal, Operations
jcrespo updated subscribers of T234900: Setup bacula backup monitoring.

It also helps to have a look at the global status. For example, for bugzilla and rt, which are in read-only mode, monthly backups don't make sense; a proper long-term one to the archive fileset would, unless that was decided for a reason. CC @Dzahn

Tue, Oct 15, 10:41 AM · Patch-For-Review, Availability, observability, Goal, Operations
jcrespo added a comment to T234900: Setup bacula backup monitoring.

I will look at them; this was mostly an excuse to get familiar with the current status. Independently of the prepackaged ones, we should come up with a list of things to monitor that we think could fail or have failed in the past:

Tue, Oct 15, 10:33 AM · Patch-For-Review, Availability, observability, Goal, Operations
jcrespo added a comment to T234900: Setup bacula backup monitoring.

Global retention queries:

Tue, Oct 15, 10:24 AM · Patch-For-Review, Availability, observability, Goal, Operations
jcrespo added a comment to T234900: Setup bacula backup monitoring.

Everything with a correct full backup in the last month, ordered by size:

Tue, Oct 15, 10:13 AM · Patch-For-Review, Availability, observability, Goal, Operations

Mon, Oct 14

jcrespo added a comment to T222472: Investigate gerrit session expiration.

This happens to me, sometimes multiple times a day.

Mon, Oct 14, 11:50 AM · Patch-For-Review, Release-Engineering-Team-TODO (201907), Release-Engineering-Team (Development services), Gerrit

Fri, Oct 11

jcrespo added a comment to T229209: Strengthen backup infrastructure and support.

@akosiaris We have reached an impasse. We should:

Fri, Oct 11, 9:25 AM · Patch-For-Review, Goal, DBA, serviceops, Operations
jcrespo added a comment to T234900: Setup bacula backup monitoring.

@fgiunchedi I will either start with such brainstorming or maybe with some of the technical foundation layers first (a script for checking automation). Please feel free to unsubscribe if many of the boring bits end up spamming you here; I will be sure to add or talk to you about the bits more relevant to you later, or through other channels. I am not asking you to unsubscribe, just warning you that there could be some spam as part of this ticket, and I don't want to bother you with it.

Fri, Oct 11, 9:22 AM · Patch-For-Review, Availability, observability, Goal, Operations

Thu, Oct 10

jcrespo added a comment to T224422: Implement logic to filter bogus GTIDs.

So this is the latest news:

Thu, Oct 10, 1:58 PM · MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), MW-1.34-notes (1.34.0-wmf.25; 2019-10-01), Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Multi-DC (TEC1)), Performance-Team, Services (watching), Wikimedia-Rdbms
jcrespo added a comment to T234948: New Wikibase deadlocks on Wikidata wiki since 2019-10-08T00:00:02: Wikibase\Lib\Store\Sql\Terms\{closure} Deadlock found when trying to get lock; try restarting transaction.

We found a possible cause, even if partial: maintenance queries for Special page updates, like SpecialMostLinked::reallyDoQuery, were causing lag and a general slowdown on other servers. You can probably check what those do and whether they break the wikidata migration or make it more difficult. For now manuel killed the ongoing queries.

Thu, Oct 10, 1:49 PM · User-Ladsgroup, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikimedia-production-error, Wikimedia-database-error, Wikidata
jcrespo added a comment to T234948: New Wikibase deadlocks on Wikidata wiki since 2019-10-08T00:00:02: Wikibase\Lib\Store\Sql\Terms\{closure} Deadlock found when trying to get lock; try restarting transaction.

Now there seems to be also many duplicate insert errors.

Thu, Oct 10, 1:33 PM · User-Ladsgroup, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikimedia-production-error, Wikimedia-database-error, Wikidata
jcrespo added a comment to T234948: New Wikibase deadlocks on Wikidata wiki since 2019-10-08T00:00:02: Wikibase\Lib\Store\Sql\Terms\{closure} Deadlock found when trying to get lock; try restarting transaction.

It came back https://logstash.wikimedia.org/goto/de0f6611dde37e811edcba4f530131c0

Thu, Oct 10, 9:48 AM · User-Ladsgroup, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikimedia-production-error, Wikimedia-database-error, Wikidata
jcrespo added a comment to T224422: Implement logic to filter bogus GTIDs.

I think I need more metrics (e.g. percentage of executions of chronology protector vs. successful executions) because either this didn't work, its measurements are not accurate for what we really want to measure, or it worked and displayed worse lag issues than we thought.

Thu, Oct 10, 9:39 AM · MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), MW-1.34-notes (1.34.0-wmf.25; 2019-10-01), Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Multi-DC (TEC1)), Performance-Team, Services (watching), Wikimedia-Rdbms
jcrespo added a comment to T224422: Implement logic to filter bogus GTIDs.

I checked db1084 execution and local execution seem to work as intended:

Thu, Oct 10, 9:34 AM · MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), MW-1.34-notes (1.34.0-wmf.25; 2019-10-01), Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Multi-DC (TEC1)), Performance-Team, Services (watching), Wikimedia-Rdbms
jcrespo added a comment to T224422: Implement logic to filter bogus GTIDs.

So this is my proposal: GTIDs get "infected" from the master, but at least they can be cleared. I want to depool a host where this happens a lot, db1121, remove its GTIDs, reconnect it to the master, and see if this keeps happening (I was able to reproduce it on db1121 with the current gtid pos). The rationale is that for gtid positions not waited, it waits on some existing only on a replica but not on the new master. Thoughts?

Thu, Oct 10, 8:02 AM · MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), MW-1.34-notes (1.34.0-wmf.25; 2019-10-01), Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Multi-DC (TEC1)), Performance-Team, Services (watching), Wikimedia-Rdbms

Wed, Oct 9

jcrespo renamed T133523: [RFC] improve parsercache replication, sharding and HA from [RFC] improve parsercache replication and sharding handling to [RFC] improve parsercache replication, sharding and HA.
Wed, Oct 9, 11:13 AM · Patch-For-Review, Operations, codfw-rollout, DBA
jcrespo updated the task description for T133523: [RFC] improve parsercache replication, sharding and HA.
Wed, Oct 9, 11:10 AM · Patch-For-Review, Operations, codfw-rollout, DBA
jcrespo added a comment to T229062: Look into a simple way to have global keys with db-replicated.

Rewording Manuel's words as actionables:

Wed, Oct 9, 11:08 AM · Patch-For-Review, Performance-Team (Radar), MediaWiki-Cache
jcrespo added a comment to T234948: New Wikibase deadlocks on Wikidata wiki since 2019-10-08T00:00:02: Wikibase\Lib\Store\Sql\Terms\{closure} Deadlock found when trying to get lock; try restarting transaction.

@jcrespo what rate should we be concerned about? (to plan/monitor ahead)

Wed, Oct 9, 6:56 AM · User-Ladsgroup, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikimedia-production-error, Wikimedia-database-error, Wikidata
jcrespo added a subtask for T30599: Deadlock tracking bug (tracking): T214035: DBError "Error: 1213 Deadlock found when trying to get lock" on WikiPage::doUpdateRestrictions.
Wed, Oct 9, 6:45 AM · MediaWiki-General, Tracking-Neverending
jcrespo added a parent task for T214035: DBError "Error: 1213 Deadlock found when trying to get lock" on WikiPage::doUpdateRestrictions: T30599: Deadlock tracking bug (tracking).
Wed, Oct 9, 6:45 AM · Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), MediaWiki-Revision-backend, Wikimedia-production-error

Tue, Oct 8

jcrespo added a parent task for T234948: New Wikibase deadlocks on Wikidata wiki since 2019-10-08T00:00:02: Wikibase\Lib\Store\Sql\Terms\{closure} Deadlock found when trying to get lock; try restarting transaction: T30599: Deadlock tracking bug (tracking).
Tue, Oct 8, 4:01 PM · User-Ladsgroup, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikimedia-production-error, Wikimedia-database-error, Wikidata
jcrespo added a subtask for T30599: Deadlock tracking bug (tracking): T234948: New Wikibase deadlocks on Wikidata wiki since 2019-10-08T00:00:02: Wikibase\Lib\Store\Sql\Terms\{closure} Deadlock found when trying to get lock; try restarting transaction.
Tue, Oct 8, 4:01 PM · MediaWiki-General, Tracking-Neverending
jcrespo created T234948: New Wikibase deadlocks on Wikidata wiki since 2019-10-08T00:00:02: Wikibase\Lib\Store\Sql\Terms\{closure} Deadlock found when trying to get lock; try restarting transaction.
Tue, Oct 8, 3:59 PM · User-Ladsgroup, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikimedia-production-error, Wikimedia-database-error, Wikidata
jcrespo added a comment to T229209: Strengthen backup infrastructure and support.

I finally got the director running, but sadly it won't start with no devices or clients provisioned, so I created a duplicate of the ones puppet may create:

Tue, Oct 8, 11:12 AM · Patch-For-Review, Goal, DBA, serviceops, Operations
jcrespo triaged T234900: Setup bacula backup monitoring as High priority.
Tue, Oct 8, 10:47 AM · Patch-For-Review, Availability, observability, Goal, Operations
jcrespo created T234900: Setup bacula backup monitoring.
Tue, Oct 8, 10:47 AM · Patch-For-Review, Availability, observability, Goal, Operations

Mon, Oct 7

jcrespo added a comment to T224422: Implement logic to filter bogus GTIDs.

See my comment on patch, this may not have worked?

Mon, Oct 7, 7:04 PM · MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), MW-1.34-notes (1.34.0-wmf.25; 2019-10-01), Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Multi-DC (TEC1)), Performance-Team, Services (watching), Wikimedia-Rdbms
jcrespo added a comment to T229686: #dbctl: manage 'externalLoads' data.

es1: currently read-only clusters, all <24
es2 es3: currently rw clusters containing cluster24 and 25 respectively (but distribution may change in the future); they will become read-only after the following are set up:
es4 es5: new rw clusters, presumably cluster 26 and cluster 27 respectively, yet to be purchased and set up

Mon, Oct 7, 6:47 AM · Performance-Team, DBA, conftool

Fri, Oct 4

jcrespo added a comment to T233589: Create: mariadb-optimizer-bug tag.

@Marostegui If it takes too much time, and the general one has been created, you can use that + Upstream in the meantime.

Fri, Oct 4, 7:22 AM · Project-Admins

Mon, Sep 30

jcrespo closed T234152: snapshot for s6/s7 at eqiad taken more than 4 days ago as Resolved.
root@db1115.eqiad.wmnet[zarcillo]> SELECT * FROM backups WHERE section in ('s6', 's7') ORDER BY id desc LIMIT 5;
+------+----------------------------------+----------+-------------------------+------------------------+----------+---------+---------------------+---------------------+--------------+
| id   | name                             | status   | source                  | host                   | type     | section | start_date          | end_date            | total_size   |
+------+----------------------------------+----------+-------------------------+------------------------+----------+---------+---------------------+---------------------+--------------+
| 2980 | snapshot.s7.2019-09-30--11-40-55 | finished | db1116.eqiad.wmnet:3317 | dbprov1002.eqiad.wmnet | snapshot | s7      | 2019-09-30 13:05:21 | 2019-09-30 14:15:43 | 927424383014 |
| 2977 | snapshot.s6.2019-09-30--03-30-02 | finished | db1139.eqiad.wmnet:3316 | dbprov1001.eqiad.wmnet | snapshot | s6      | 2019-09-30 04:17:00 | 2019-09-30 05:04:01 | 539671316158 |
| 2976 | snapshot.s7.2019-09-30--02-53-40 | finished | db1116.eqiad.wmnet:3317 | dbprov1002.eqiad.wmnet | snapshot | s7      | 2019-09-30 04:15:08 | 2019-09-30 06:37:21 | 926946625574 |
| 2973 | snapshot.s7.2019-09-30--00-24-45 | finished | db2100.codfw.wmnet:3317 | dbprov2002.codfw.wmnet | snapshot | s7      | 2019-09-30 01:45:05 | 2019-09-30 04:40:41 | 939256940768 |
| 2971 | snapshot.s6.2019-09-30--00-48-20 | finished | db2097.codfw.wmnet:3316 | dbprov2001.codfw.wmnet | snapshot | s6      | 2019-09-30 01:34:45 | 2019-09-30 02:24:45 | 543299025716 |
+------+----------------------------------+----------+-------------------------+------------------------+----------+---------+---------------------+---------------------+--------------+
5 rows in set (0.00 sec)
Mon, Sep 30, 8:00 PM · DBA, Operations
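
The freshness check behind the task title ("taken more than 4 days ago") can be illustrated with a minimal sketch over the rows shown above. This is not the production alerting code: the function name `stale_sections`, the grouping by section and datacenter, and the use of `end_date` as the reference timestamp are all assumptions made for illustration. Only the 4-day threshold and the row values come from the output above.

```python
from datetime import datetime, timedelta

# (section, source, end_date) copied from the query output above.
ROWS = [
    ("s7", "db1116.eqiad.wmnet:3317", "2019-09-30 14:15:43"),
    ("s6", "db1139.eqiad.wmnet:3316", "2019-09-30 05:04:01"),
    ("s7", "db1116.eqiad.wmnet:3317", "2019-09-30 06:37:21"),
    ("s7", "db2100.codfw.wmnet:3317", "2019-09-30 04:40:41"),
    ("s6", "db2097.codfw.wmnet:3316", "2019-09-30 02:24:45"),
]


def stale_sections(rows, now, max_age=timedelta(days=4)):
    """Return sorted (section, datacenter) pairs whose newest snapshot is older than max_age.

    Hypothetical helper: the real check may use different columns or grouping.
    """
    newest = {}
    for section, source, end_date in rows:
        # Hostnames look like db1116.eqiad.wmnet:3317, so the second label is the datacenter.
        datacenter = source.split(".")[1]
        finished = datetime.strptime(end_date, "%Y-%m-%d %H:%M:%S")
        key = (section, datacenter)
        if key not in newest or finished > newest[key]:
            newest[key] = finished
    return sorted(key for key, finished in newest.items() if now - finished > max_age)
```

Called with the time the task was closed, for example `stale_sections(ROWS, datetime(2019, 9, 30, 20, 0, 0))`, the sketch returns an empty list, which matches the task being resolved once the snapshots were rerun.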
jcrespo added a comment to T234152: snapshot for s6/s7 at eqiad taken more than 4 days ago.

There is some amount of "self healing", but I will manually rerun some backups so as not to miss the window. Thanks @jijiki for the report. These alerts are like RAID alerts: they are actionable and very worrying if on for a long time, but if they happen to go off during the weekend they can wait until the following week to be corrected.

Mon, Sep 30, 6:25 AM · DBA, Operations
jcrespo claimed T234152: snapshot for s6/s7 at eqiad taken more than 4 days ago.
Mon, Sep 30, 5:58 AM · DBA, Operations

Thu, Sep 26

jcrespo added a comment to T89707: Encourage users to publish old inactive unpublished translations.

Will this table be public? Can it be replicated without any restrictions to our wiki replicas? (https://wikitech.wikimedia.org/wiki/Portal:Data_Services#Wiki_Replicas)
If this table is expected to be accessed by our Wiki Replicas users, a view will be needed, and for that a ticket to cloud-services-team will be required.

Thu, Sep 26, 7:59 AM · MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), Language-Team (Language-2019-October-December), Schema-change, CX-boost, Growth-Team, WorkType-NewFunctionality, Collaboration-Team-Triage, Notifications
jcrespo awarded T230784: Switchover s4 (commonswiki) primary database master db1081 -> db1138 - 26th Sept @05:00 UTC a Like token.
Thu, Sep 26, 6:30 AM · DBA, Operations
jcrespo added a comment to T89707: Encourage users to publish old inactive unpublished translations.

From what I see in the patch, the implementation is at the moment planned as an extra table. If that is right, and such a small table is the only SQL change, we just need a heads up before production deployment for it to be added to the list of private tables on the operations/puppet repository (but it is not technically considered a schema change). After that, we encourage developers to add the table themselves so they are not blocked on us, as it is not a dangerous process.

Thu, Sep 26, 4:45 AM · MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), Language-Team (Language-2019-October-December), Schema-change, CX-boost, Growth-Team, WorkType-NewFunctionality, Collaboration-Team-Triage, Notifications

Wed, Sep 25

jcrespo added a comment to T231638: db1074 crashed: Broken BBU.

Reminder to move sanitarium (T231638#5453802) back here (or somewhere else on eqiad) before closing this ticket.

Wed, Sep 25, 5:45 PM · ops-eqiad, Operations, DBA
jcrespo added a comment to T233766: labsdb1011 mariadb crashed.

Out of curiosity, did you run the checks against their master, between replicas or something else?

Wed, Sep 25, 1:24 PM · Data-Services, cloud-services-team (Kanban)
jcrespo awarded T233534: db1075 (s3 master) crashed - BBU failure a Like token.
Wed, Sep 25, 1:17 PM · Wikimedia-Incident, ops-eqiad, Operations, DBA
jcrespo created P9172 cumin?.
Wed, Sep 25, 8:00 AM

Tue, Sep 24

jcrespo created P9165 analytics backups.
Tue, Sep 24, 2:17 PM
jcrespo closed T233701: No grafana dashboard with working disk writes and reads in bytes as Resolved.

Main blocker fixed, I am resolving this, but please note my suggestions I added at the end. Thank you!

Tue, Sep 24, 2:12 PM · observability
jcrespo added a comment to T233701: No grafana dashboard with working disk writes and reads in bytes.

I just found one working, at the end of https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1114&var-port=9104 I will use that for now.

Tue, Sep 24, 9:21 AM · observability
jcrespo created T233701: No grafana dashboard with working disk writes and reads in bytes.
Tue, Sep 24, 9:16 AM · observability
jcrespo created T233698: es1019 IPMI and its management interface are unresponsive (again2).
Tue, Sep 24, 8:52 AM · ops-eqiad, Operations, DBA
jcrespo added a comment to T233534: db1075 (s3 master) crashed - BBU failure.

whether that needs changing on the desired thresholds is a different discussion.

Tue, Sep 24, 4:53 AM · Wikimedia-Incident, ops-eqiad, Operations, DBA

Sep 23 2019

jcrespo added a comment to T231858: Archive data on eventlogging MySQL to analytics replica before decomisioning .

Regarding the machines, I would suggest either decommissioning both or keeping both. Keeping a service with redundancy of 0 is undesirable, even with low or no usage. There is space on backups for offline long-term archiving of important data, although not for large datasets.

Sep 23 2019, 11:11 AM · Analytics-Kanban, Analytics, Analytics-EventLogging
jcrespo created P9149 DB restore avoids overwriting existing server.
Sep 23 2019, 9:50 AM
jcrespo added a comment to T233589: Create: mariadb-optimizer-bug tag.

I also thought in the past about a query-optimization tracking project, which could be used in addition or instead (e.g., as a column; it would be broader than this) to group all WMF query performance issues in production.

Sep 23 2019, 9:17 AM · Project-Admins

Sep 20 2019

jcrespo added a comment to T229209: Strengthen backup infrastructure and support.

Almost there:

Sep 20 2019, 10:46 AM · Patch-For-Review, Goal, DBA, serviceops, Operations

Sep 19 2019

jcrespo awarded T202367: Productionize dbproxy101[2-7].eqiad.wmnet and dbproxy200[1-4] a Like token.
Sep 19 2019, 2:03 PM · DBA
jcrespo added a comment to T223151: Review special replica partitioning of certain tables by `xx_user`.

To give more context: if partitioning were still needed, query timeouts or long-running queries would show up almost immediately for queries like recentchanges/user contributions, I believe especially on enwiki, commonswiki and wikidata. Refactoring, index hints, the MariaDB version and schema changes may have made it obsolete (that would be great news: 14 fewer special servers to maintain).

Sep 19 2019, 1:53 PM · mariadb-optimizer-bug, Core Platform Team, MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), Performance Issue, DBA
jcrespo added a comment to T233281: Check/remove unused databases following labpuppetmaster deprecation.

Reminder: Let's check grants too.

Sep 19 2019, 9:19 AM · DBA, Operations
jcrespo added a comment to T229209: Strengthen backup infrastructure and support.

We may need some firmware updates, but hw is ready to go as soon as background raid initialization finishes on array2 of backup1001. Hosts installed with buster.

Sep 19 2019, 8:36 AM · Patch-For-Review, Goal, DBA, serviceops, Operations
jcrespo updated the task description for T229209: Strengthen backup infrastructure and support.
Sep 19 2019, 8:35 AM · Patch-For-Review, Goal, DBA, serviceops, Operations
jcrespo awarded T232882: backup1001 failed disk (degraded RAID) a Love token.
Sep 19 2019, 7:41 AM · ops-eqiad, Operations
jcrespo closed T232882: backup1001 failed disk (degraded RAID), a subtask of T229209: Strengthen backup infrastructure and support, as Resolved.
Sep 19 2019, 7:41 AM · Patch-For-Review, Goal, DBA, serviceops, Operations
jcrespo closed T232882: backup1001 failed disk (degraded RAID) as Resolved.

I can see now 24, thanks!

Sep 19 2019, 7:41 AM · ops-eqiad, Operations

Sep 18 2019

jcrespo added a comment to T223151: Review special replica partitioning of certain tables by `xx_user`.

Two options: a) the partitioning is no longer needed due to the new schema (preferred), or b) the partitioning is still needed, in which case partition by actor id. Check on a large enwiki host with low weight.

Sep 18 2019, 3:20 PM · mariadb-optimizer-bug, Core Platform Team, MW-1.34-notes (1.34.0-wmf.24; 2019-09-24), Performance Issue, DBA
jcrespo added a comment to T229209: Strengthen backup infrastructure and support.

backup1001 was also set up; however, there is still a missing disk: T232882#5502241. Separating enclosures into different logical drives is going to pay off earlier than anticipated, as it may require rebuilding the virtual disk.

Sep 18 2019, 9:20 AM · Patch-For-Review, Goal, DBA, serviceops, Operations
jcrespo reopened T232882: backup1001 failed disk (degraded RAID), a subtask of T229209: Strengthen backup infrastructure and support, as Open.
Sep 18 2019, 8:21 AM · Patch-For-Review, Goal, DBA, serviceops, Operations
jcrespo reopened T232882: backup1001 failed disk (degraded RAID) as "Open".

Now, instead of a failed disk, I can only see 23/24 disks; one disk of the second enclosure is gone. See:

Sep 18 2019, 8:21 AM · ops-eqiad, Operations

Sep 17 2019

jcrespo awarded T232882: backup1001 failed disk (degraded RAID) a Love token.
Sep 17 2019, 8:35 PM · ops-eqiad, Operations