
jcrespo (Jaime Crespo)
Sr Database Administrator

User Details

User Since
May 11 2015, 8:31 AM (218 w, 1 d)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF) [ Global Accounts ]

Recent Activity

Today

jcrespo awarded T169440: Pending global renames in need of sysadmin supervision (tracking) a Party Time token.
Tue, Jul 16, 11:25 AM · MediaWiki-extensions-CentralAuth, GlobalRename, Tracking-Neverending
jcrespo added a comment to T198939: Decommission servermon.

Also the passwords have to be removed from the private repo (and possibly from labs/private).

Tue, Jul 16, 8:40 AM · Patch-For-Review, Operations
jcrespo added a comment to T198939: Decommission servermon.

What about the puppet database on m1?

Tue, Jul 16, 8:37 AM · Patch-For-Review, Operations
jcrespo added a comment to T225713: CPU scaling governor audit.

FYI, after applying the above change, I expected a huge shift in reported load (even if performance didn't change) or in temperatures, given that these hosts (wikireplicas on labs) are our busiest databases in terms of CPU resources due to long-running queries. However, unlike other reporters, I didn't see much difference: only the temperatures of labsdb1011 changed, with nothing on the load or the temperatures of the others. Maybe CPU was already a problem in scaling for database load, or something else? https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=1563087571551&to=1563260371552&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&var-instance=labsdb1009&var-instance=labsdb1010&var-instance=labsdb1011

Tue, Jul 16, 7:02 AM · User-fgiunchedi, Operations

Yesterday

jcrespo added a comment to T228089: Logstash down for MediaWiki.

Once the backlog is processed (https://grafana.wikimedia.org/d/000000102/production-logging?refresh=5m&panelId=8&fullscreen&orgId=1), this can be lowered to high, but something should be put in place to prevent another logs outage, even if it is a rough measure, such as an alert to identify it and a runbook to drop a source of logs like above.

Mon, Jul 15, 8:12 PM · Wikimedia-Incident, observability, Operations, Wikimedia-Logstash
jcrespo added a comment to T228089: Logstash down for MediaWiki.

I never talked about this issue, and have no idea why @Urbanecm thought I was talking about it while I was having a private conversation with another person.

Mon, Jul 15, 6:48 PM · Wikimedia-Incident, observability, Operations, Wikimedia-Logstash
jcrespo updated the task description for T225713: CPU scaling governor audit.
Mon, Jul 15, 4:44 PM · User-fgiunchedi, Operations
jcrespo added a comment to T196055: Remove table `math` from the database.

So I have left a dump of the math table on public wikis at: https://people.wikimedia.org/~jynus/math.tar There is an example of the format of the tar at: https://people.wikimedia.org/~jynus/aawiki-math.sql I have also backed up the private and closed ones for the short term. What would the next steps be: comparing the hashes with existing images?

Mon, Jul 15, 4:03 PM · DBA, Math
jcrespo removed a project from T227838: Obsessive serverIsReadOnly() checking in MySQL: DBA.

I will remove DBA but remain subscribed, as everything indicates this should be a decision about MediaWiki optimization for third-party usage, not WMF.

Mon, Jul 15, 7:57 AM · Patch-For-Review, Performance-Team, Performance, Core Platform Team, MediaWiki-Database
jcrespo added a comment to T194125: [RFC] Future of charset and collation for mediawiki on mysql .

Because T135969 has spilled over here: I can see the annoyance with someone using a "bad" encoding, and I will try to be a bit conciliatory. However, I don't see any way to solve that other than documenting it (maybe that was done already):

Mon, Jul 15, 7:49 AM · Core Platform Team (Security, stability, performance and scalability (TEC1)), MediaWiki-Database

Fri, Jul 12

jcrespo moved T196055: Remove table `math` from the database from Blocked external/Not db team to In progress on the DBA board.
Fri, Jul 12, 2:15 PM · DBA, Math
jcrespo added a comment to T196055: Remove table `math` from the database.

Thanks, that is great info, I will come back with a list of images and/or a dump at the beginning of next week.

Fri, Jul 12, 2:15 PM · DBA, Math
jcrespo added a comment to T196055: Remove table `math` from the database.

See, we had a bit of a communication loss here, but we may be able to move forward :-D Each person only has a limited amount of information. I will take a look and do some tests, and (if you can) I may ask additional questions. Asking for help is always ok, and we are here to help. :-) Thanks for yours!

Fri, Jul 12, 2:11 PM · DBA, Math
jcrespo updated the task description for T196055: Remove table `math` from the database.
Fri, Jul 12, 2:09 PM · DBA, Math
jcrespo reopened T196055: Remove table `math` from the database, a subtask of T54921: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking), as Open.
Fri, Jul 12, 2:07 PM · Epic, DBA, Tracking-Neverending
jcrespo reopened T196055: Remove table `math` from the database, a subtask of T195847: Clean up artifacts from LaTeX based math rendering, as Open.
Fri, Jul 12, 2:07 PM · Operations, Math
jcrespo reopened T196055: Remove table `math` from the database as "Open".

Thanks. If you allow me, I will copy that and reopen so this is still tracked. And maybe I can check or delete on a low-traffic db and see what the results are. If you have time, you can help us figure it out with your knowledge and my access :-)

Fri, Jul 12, 2:07 PM · DBA, Math
jcrespo added a comment to T196055: Remove table `math` from the database.

Sorry, but I may not be understanding your comments. Is the extension on WMF ready to have them deleted? If yes, please say so ("please delete ASAP") in the summary. Deleting things carefully should take less than a week. If you don't have time to do the preparation needed, or there is a blocker, I am ok with closing it, although I would also mention the pending work in the summary, because maybe someone else can do it in the future. :-D

Fri, Jul 12, 1:58 PM · DBA, Math
jcrespo added a comment to T226704: Setup es4 and es5 replica sets for new read-write external store service.

Thanks, that was probably it. Thanks for clarifying it!

Fri, Jul 12, 1:54 PM · Epic, DBA
jcrespo added a comment to T196055: Remove table `math` from the database.

I don't know what I could do to make the task more ready. I will just close the task. Maybe it's best to keep the table forever.

Fri, Jul 12, 1:53 PM · DBA, Math
jcrespo added a comment to T196055: Remove table `math` from the database.

Why decline it? We are waiting on the blocking work before we can own this, but nobody seems to be progressing on it.

Fri, Jul 12, 1:49 PM · DBA, Math
jcrespo added a comment to T227862: (OoW) db2045 failed battery.
root@db1115.eqiad.wmnet[zarcillo]> update masters set instance='db2069' where section='x1' and dc='codfw';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
Fri, Jul 12, 11:16 AM · ops-codfw, Operations, DBA
jcrespo added a comment to T227862: (OoW) db2045 failed battery.

Everything went well except:

Updating tendril...
[WARNING] Old master not found on tendril server list
Updating zarcillo...
[WARNING] Old master not found on zarcillo master list

And here is probably the issue:

Fri, Jul 12, 11:13 AM · ops-codfw, Operations, DBA
jcrespo moved T227862: (OoW) db2045 failed battery from Triage to In progress on the DBA board.
Fri, Jul 12, 11:11 AM · ops-codfw, Operations, DBA
jcrespo added a comment to T227838: Obsessive serverIsReadOnly() checking in MySQL.

Based on your feedback, my guess is that because you are reading from the "master" (there is no other host, really), this effect happens. Probably this doesn't happen on WMF production where reads are from replicas and they are already read only, and lag is (I believe) cached. I am guessing the lack of caching infrastructure + single master topology is causing this. That doesn't mean it is not an issue, there is probably a way to optimize this, but I will let others comment if/how, as I am more familiar with WMF use case than mediawiki in general.

Fri, Jul 12, 11:02 AM · Patch-For-Review, Performance-Team, Performance, Core Platform Team, MediaWiki-Database
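To illustrate the caching idea raised in the comment above, here is a minimal sketch, assuming the read-only state can be fetched by some probe function (for example one running SELECT @@read_only). It is not MediaWiki's implementation: ReadOnlyCache, probe, probe_calls and the 5-second TTL are hypothetical names and values picked only for this example.

```python
import time
from typing import Callable, Optional


class ReadOnlyCache:
    """Remember the answer of a read-only probe for a short time.

    Hypothetical helper for illustration; not part of MediaWiki.
    """

    def __init__(self, probe: Callable[[], bool], ttl: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe          # callable that asks the server (assumed)
        self._ttl = ttl              # seconds a cached answer stays valid (assumed)
        self._clock = clock          # injectable clock, handy for testing
        self._value: Optional[bool] = None
        self._expires = 0.0
        self.probe_calls = 0         # kept only to make the saving visible

    def is_read_only(self) -> bool:
        now = self._clock()
        # Ask the server only when nothing is cached or the entry has expired.
        if self._value is None or now >= self._expires:
            self._value = self._probe()
            self._expires = now + self._ttl
            self.probe_calls += 1
        return self._value
```

With a cache like this, repeated checks within the TTL cost one query instead of one per call, at the price of reacting to a read-only switch up to ttl seconds late.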
jcrespo added a comment to T152080: Frequent duplicate key errors by page assessments.

I saw a few occurrences (5) of this on trwiki: https://logstash.wikimedia.org/goto/d9c3d20188039cd82d1d9367270f842c

Fri, Jul 12, 10:53 AM · Community-Tech, MW-1.29-release (WMF-deploy-2017-01-03_(1.29.0-wmf.7)), Patch-For-Review, Wikimedia-production-error, MediaWiki-extensions-PageAssessments
jcrespo triaged T227862: (OoW) db2045 failed battery as Normal priority.
Fri, Jul 12, 10:27 AM · ops-codfw, Operations, DBA
jcrespo added a comment to T227862: (OoW) db2045 failed battery.

Everything went well except:

Updating tendril...
[WARNING] Old master not found on tendril server list
Updating zarcillo...
[WARNING] Old master not found on zarcillo master list
Fri, Jul 12, 10:27 AM · ops-codfw, Operations, DBA
jcrespo committed rOSMDa2c2eaec25fe: switchover.py: Fix small formatting bug when printing ROW format (authored by jcrespo).
Fri, Jul 12, 10:26 AM
jcrespo added a comment to T227862: (OoW) db2045 failed battery.

Based on https://noc.wikimedia.org/conf/highlight.php?file=db-codfw.php&1 and T184888 I will switchover codfw master to db2069.

Fri, Jul 12, 10:14 AM · ops-codfw, Operations, DBA
jcrespo added a comment to T227862: (OoW) db2045 failed battery.

I will try to force a relearn process/reboot, in case that works.

Fri, Jul 12, 10:06 AM · ops-codfw, Operations, DBA
jcrespo created T227862: (OoW) db2045 failed battery.
Fri, Jul 12, 10:06 AM · ops-codfw, Operations, DBA
jcrespo added a comment to T227717: Drop DB tables for now-deleted zerowiki from production.

I'd say it's safe to store a backup

Fri, Jul 12, 7:21 AM · Release-Engineering-Team-TODO, DBA, Reading-Infrastructure-Team-Backlog
jcrespo added a subtask for T227829: Degraded RAID on db2044: Unknown Object (Task).
Fri, Jul 12, 7:11 AM · Operations, ops-codfw
jcrespo added a comment to T227829: Degraded RAID on db2044.

@Papaul We will ask you to replace a disk here from T226406, when they arrive.

Fri, Jul 12, 7:11 AM · Operations, ops-codfw
jcrespo added a comment to T149077: Certain ApiQueryRecentChanges::run api query is too slow, slowing down dewiki.

@Anomie I strongly support closing old tasks like this that can no longer be reproduced; too much changes in the code and infrastructure for them to stay relevant in an open status. I believe MediaWiki improvements, MySQL state improvements (e.g. ANALYZE + reboot, SSDs) and MariaDB code improvements may have left many of these obsolete.

Fri, Jul 12, 7:05 AM · Core Platform Team Workboards (Clinic Duty Team), Core Platform Team (Security, stability, performance and scalability (TEC1)), Wikimedia-production-error, MediaWiki-API, DBA
jcrespo added a comment to T222050: db1107 (eventlogging db master) possibly memory issues.

Chris, you will need to coordinate principally with @elukey, as he is the person directly in touch with the affected users, to agree on a date.

Fri, Jul 12, 7:00 AM · Analytics, Operations, ops-eqiad, Analytics-EventLogging, DBA
jcrespo added a comment to T226704: Setup es4 and es5 replica sets for new read-write external store service.

If I recall correctly, the maintenance was along the lines of running MediaWiki's recompression scripts and other similar scripts, which would first require making sure the scripts still work right.

Fri, Jul 12, 6:58 AM · Epic, DBA
jcrespo added a comment to T71222: list=logevents slow for users with last log action long time ago.

@Anomie it is probably the old MediaWiki vs. Wikimedia decision. I don't have visibility into the impact outside of WMF, but I would suggest setting up a lower-priority, generic task to review optimizer hints and document them or remove the unnecessary ones, starting with this one.

Fri, Jul 12, 6:54 AM · User-Marostegui, MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), DBA, Performance, MediaWiki-API
jcrespo added a comment to T227739: Contention on User::getActorId ?.

It is ok to close it if it is a duplicate, or if you think it is unlikely to happen again or is a very rare occurrence. I just report, FYI, when I see something out of the ordinary in the logs, but I lack the knowledge for a deep analysis.

Fri, Jul 12, 6:49 AM · Wikimedia-production-error, MediaWiki-User-management, Core Platform Team, MediaWiki-Database
jcrespo added a comment to T227838: Obsessive serverIsReadOnly() checking in MySQL.

May I ask where you tested this and, if it was on your own installation, for more data about it (version, topology, configuration, etc.)? In any case, how did you profile the queries executed (did you just set up debug logging for all queries)? Also, please point us to the code entry for that function (is it using a master or a replica to perform the reads?).

Fri, Jul 12, 6:46 AM · Patch-For-Review, Performance-Team, Performance, Core Platform Team, MediaWiki-Database
jcrespo updated the task description for T208323: Predictive failures on disk S.M.A.R.T. status.
Fri, Jul 12, 6:37 AM · Operations, DBA
jcrespo added a comment to T227829: Degraded RAID on db2044.

See also recent T217755

Fri, Jul 12, 6:36 AM · Operations, ops-codfw

Thu, Jul 11

jcrespo edited projects for T88084: Using both rvuser and rvcontinue with prop=revisions causes database error on pages with a lot of revisions, added: MediaWiki-Database; removed Performance.

I think your classification was already right; I was proposing to add a new yellow one on top, as work on those normally requires a combination of performance, DBAs and either core or another team on the product side, depending on the code module. The reasoning is that it would help avoid duplicate reports, in the same spirit as Wikimedia-production-error. It was just a suggestion; I understand if you consider it not useful.

Thu, Jul 11, 4:27 PM · MediaWiki-Database, MediaWiki-API
jcrespo added a comment to T88084: Using both rvuser and rvcontinue with prop=revisions causes database error on pages with a lot of revisions.

@Krinkle It took me some time to understand your comment and classification. Wouldn't it be nice to have a specific tag such as #query-performance, #wikimedia-query-performance or #slow-database-queries for discovery reasons (e.g. finding duplicates and previous examples)? What do you think?

Thu, Jul 11, 3:38 PM · MediaWiki-Database, MediaWiki-API
jcrespo triaged T227717: Drop DB tables for now-deleted zerowiki from production as Low priority.

Needs some research to see if it is safe. Low priority because it shouldn't block any other task.

Thu, Jul 11, 3:34 PM · Release-Engineering-Team-TODO, DBA, Reading-Infrastructure-Team-Backlog
jcrespo added a comment to T222224: Normalizing *links tables.

Manuel is on vacation ATM. I am glad to answer any questions, although DBAs need more concrete questions (e.g. we can answer how much space and IOPS would be saved for a particular wiki or table), as costs such as development time would be better estimated by the people involved in the MediaWiki-Database code.

Thu, Jul 11, 2:17 PM · Patch-For-Review, MediaWiki-Database, TechCom-RFC
jcrespo updated the task description for T143896: MySQL metrics monitoring.
Thu, Jul 11, 11:33 AM · observability, DBA, Patch-For-Review, Operations, Prometheus-metrics-monitoring
jcrespo added a comment to T143896: MySQL metrics monitoring.
root@prometheus2003:/srv/prometheus/ops/targets$ ls -la mysql-*
-r--r--r-- 1 root       root  2592 Jul 11 11:27 mysql-core_codfw.yaml
-r--r--r-- 1 root       root   612 Jul 11 11:27 mysql-dbstore_codfw.yaml
-r--r--r-- 1 root       root   544 Jul 10 10:57 mysql-labs_codfw.yaml
-rw-r--r-- 1 root       root   544 Jul 10 10:48 mysql-labsdb_codfw.yaml
-r--r--r-- 1 root       root   621 Jul 11 11:27 mysql-misc_codfw.yaml
-r--r--r-- 1 root       root   275 Jul 11 11:27 mysql-parsercache_codfw.yaml
root@prometheus2003:/srv/prometheus/ops/targets$ date
Thu Jul 11 11:29:19 UTC 2019
root@prometheus2003:/srv/prometheus/ops/targets$ run-puppet-agent 
Warning: Downgrading to PSON for future requests
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for prometheus2003.codfw.wmnet
Info: Applying configuration version '1562844569'
Notice: /Stage[main]/Profile::Prometheus::Ops_mysql/Exec[generate-mysqld-exporter-config]/returns: executed successfully
Notice: Applied catalog in 18.99 seconds
root@prometheus2003:/srv/prometheus/ops/targets$ ls -la mysql-*
-r--r--r-- 1 root root 2592 Jul 11 11:27 mysql-core_codfw.yaml
-r--r--r-- 1 root root  612 Jul 11 11:27 mysql-dbstore_codfw.yaml
-r--r--r-- 1 root root  544 Jul 10 10:57 mysql-labs_codfw.yaml
-rw-r--r-- 1 root root  544 Jul 10 10:48 mysql-labsdb_codfw.yaml
-r--r--r-- 1 root root  621 Jul 11 11:27 mysql-misc_codfw.yaml
-r--r--r-- 1 root root  275 Jul 11 11:27 mysql-parsercache_codfw.yaml
Thu, Jul 11, 11:32 AM · observability, DBA, Patch-For-Review, Operations, Prometheus-metrics-monitoring
jcrespo added a reverting change for rLPRIc73947e0daaf: Revert "prometheus: move prometheus secrets back to the original role": rLPRI224100e43026: Revert "Revert "prometheus: move prometheus secrets back to the original role"".
Thu, Jul 11, 11:17 AM
jcrespo committed rLPRI224100e43026: Revert "Revert "prometheus: move prometheus secrets back to the original role"" (authored by jcrespo).
Thu, Jul 11, 11:17 AM
jcrespo added a reverting change for rLPRI0cc83bae3ad3: prometheus: move prometheus secrets back to the original role: rLPRIc73947e0daaf: Revert "prometheus: move prometheus secrets back to the original role".
Thu, Jul 11, 11:16 AM
jcrespo committed rLPRIc73947e0daaf: Revert "prometheus: move prometheus secrets back to the original role" (authored by jcrespo).
Thu, Jul 11, 11:16 AM
jcrespo committed rLPRI0cc83bae3ad3: prometheus: move prometheus secrets back to the original role (authored by jcrespo).
Thu, Jul 11, 9:25 AM
jcrespo created T227739: Contention on User::getActorId ?.
Thu, Jul 11, 7:50 AM · Wikimedia-production-error, MediaWiki-User-management, Core Platform Team, MediaWiki-Database
jcrespo committed rLPRI6aa78168423c: prometheus-mysqld-exporter: move variable to profile (authored by jcrespo).
Thu, Jul 11, 7:35 AM

Wed, Jul 10

jcrespo committed rLPRI3e95207086fa: prometheus: Add fake prometheus labs password (authored by jcrespo).
Wed, Jul 10, 9:44 AM
jcrespo added a comment to T68025: [Story] Monitor size of some Wikidata database tables.

We already have sizes of all uncompressed and compressed tables on zarcillo, and those are planned to be shown in a dashboard. The reason those are not more public is that security told us not to put them on public prometheus, as they could compromise the anonymity of certain users on smaller wikis. Please talk to security before doing it. Please talk to us DBAs before reimplementing an existing feature.

Wed, Jul 10, 8:55 AM · WMDE-Analytics-Engineering, DBA, Story, Wikidata, Wikidata.org

Tue, Jul 9

jcrespo added a comment to T226952: Failover m2 master db1065 to db1132.
$ ./replication_tree.py db1065
db1065, version: 10.1.33, up: 1y, RO: OFF, binlog: MIXED, lag: None, processes: None, latency: 0.0991
+ db1117:3322, version: 10.1.39, up: 32d, RO: ON, binlog: MIXED, lag: 0, processes: 15, latency: 0.0423
+ db1132, version: 10.1.39, up: 14h, RO: ON, binlog: MIXED, lag: 0, processes: 16, latency: 0.0416
+ db2044, version: 10.1.39, up: 4d, RO: ON, binlog: MIXED, lag: 0, processes: None, latency: 0.0046
  + db2078:3322, version: 10.1.39, up: 47d, RO: ON, binlog: MIXED, lag: 0, processes: 14, latency: 0.0056
Tue, Jul 9, 5:21 AM · Operations-Software-Development, OTRS, Recommendation-API, Operations, DBA
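Before a failover like the one above, it can help to turn the tree listing into data that can be checked. The following is a minimal sketch that assumes the output format is exactly what the listing shows (a host name, then comma-separated "key: value" fields, with "+ " and two spaces of indentation per extra level). parse_tree, writable_hosts and SAMPLE are hypothetical names for this example and are not part of replication_tree.py.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, str, Dict[str, str]]  # (depth, host, attributes)


def parse_tree(text: str) -> List[Node]:
    """Parse a replication tree listing into (depth, host, attributes) tuples."""
    nodes: List[Node] = []
    for raw in text.splitlines():
        if not raw.strip():
            continue
        stripped = raw.lstrip(" ")
        indent = len(raw) - len(stripped)
        if stripped.startswith("+ "):
            depth = indent // 2 + 1      # "+ " marks a replica of the line above
            stripped = stripped[2:]
        else:
            depth = 0                    # the root of the tree (current master)
        host, *fields = [part.strip() for part in stripped.split(", ")]
        attrs = dict(field.split(": ", 1) for field in fields)
        nodes.append((depth, host, attrs))
    return nodes


def writable_hosts(nodes: List[Node]) -> List[str]:
    """Return the hosts that report RO: OFF."""
    return [host for _, host, attrs in nodes if attrs.get("RO") == "OFF"]


# Three lines copied from the listing above.
SAMPLE = "\n".join([
    "db1065, version: 10.1.33, up: 1y, RO: OFF, binlog: MIXED, lag: None, processes: None, latency: 0.0991",
    "+ db2044, version: 10.1.39, up: 4d, RO: ON, binlog: MIXED, lag: 0, processes: None, latency: 0.0046",
    "  + db2078:3322, version: 10.1.39, up: 47d, RO: ON, binlog: MIXED, lag: 0, processes: 14, latency: 0.0056",
])
```

On SAMPLE, writable_hosts(parse_tree(SAMPLE)) lists only db1065, the single host with RO: OFF, which is what one would expect from the master before a switchover.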

Mon, Jul 8

jcrespo removed a project from T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08: ops-codfw.
Mon, Jul 8, 3:40 PM · Operations, DBA
jcrespo reopened T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08, a subtask of T206203: Implement database binary backups into the production infrastructure, as Open.
Mon, Jul 8, 3:39 PM · Goal, DBA
jcrespo reopened T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 as "Open".
Mem:         515690

HW seems to be fixed, owning for the followup (software) steps.

Mon, Jul 8, 3:39 PM · Operations, DBA
jcrespo created P8723 switchover process.
Mon, Jul 8, 3:09 PM

Thu, Jul 4

jcrespo created P8709 (An Untitled Masterwork).
Thu, Jul 4, 1:58 PM
jcrespo committed rOSMD23173d6419e6: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Thu, Jul 4, 12:14 PM

Wed, Jul 3

Restricted Application added a project to T227197: [BUG] Reading Lists Not Syncing — 11 lists synced: Reading-Infrastructure-Team-Backlog.

Just to rule out any kind of database anomaly, I ran:

Wed, Jul 3, 3:36 PM · Reading-Infrastructure-Team-Backlog (Kanban), Reading List Service, iOS-app-Bugs, Wikipedia-iOS-App-Backlog
jcrespo committed rOSMD9df778c4ca37: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Wed, Jul 3, 3:16 PM
jcrespo committed rOSMD0e577bd04f3a: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Wed, Jul 3, 3:14 PM
jcrespo committed rOSMD8bf12d499727: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Wed, Jul 3, 2:10 PM
jcrespo committed rOSMDdd810002a640: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Wed, Jul 3, 2:01 PM
jcrespo committed rOSMD1657b539100d: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Wed, Jul 3, 2:01 PM
jcrespo reassigned T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 from jcrespo to Papaul.

A memory stick of db2097 is literally broken:

Wed, Jul 3, 1:44 PM · Operations, DBA
jcrespo added a comment to T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08.
462 - Uncorrectable Memory Error Threshold Exceeded (Processor 1, DIMM 3).  The
DIMM is mapped out and is currently not available.
Action: Take corrective action for the failing DIMM. Re-map all DIMMs back into
the memory map in RBSU. If the issue persists, contact support.
Wed, Jul 3, 1:33 PM · Operations, DBA
jcrespo added a comment to T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08.
[6498437.928368] mce: [Hardware Error]: Machine check events logged
[6498437.928393] EDAC skx MC1: HANDLING MCE MEMORY ERROR
[6498437.928395] EDAC skx MC1: CPU 1: Machine Check Exception: 5 Bank 16: fd004780001000c1
[6498437.928396] EDAC skx MC1: TSC 5342fa81aa1c04 
[6498437.928396] EDAC skx MC1: ADDR d8229cf80 
[6498437.928399] EDAC skx MC1: MISC 908400100000086 
[6498437.928402] EDAC skx MC1: PROCESSOR 0:50654 TIME 1561517951 SOCKET 0 APIC 4
[6498437.928408] EDAC MC1: 1 UE memory scrubbing error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0xd8229c offset:0xf80 grain:32 -  OVERFLOW recoverable err_code:0010:00c1 socket:0 imc:1 rank:1 bg:2 ba:0 row:6432 col:138)
[6510762.575543] Process accounting resumed
[6596958.282298] Process accounting resumed
[6683155.141534] Process accounting resumed
[6769351.182001] Process accounting resumed
[6855547.716695] Process accounting resumed
[6941743.671751] Process accounting resumed
[7022624.571104] mce: [Hardware Error]: Machine check events logged
[7022624.571161] EDAC skx MC1: HANDLING MCE MEMORY ERROR
[7022624.571163] EDAC skx MC1: CPU 1: Machine Check Exception: 5 Bank 16: fd004940001000c1
[7022624.571164] EDAC skx MC1: TSC 59f742acdaddae 
[7022624.571165] EDAC skx MC1: ADDR dfa29cf80 
[7022624.571165] EDAC skx MC1: MISC 908400200000086 
[7022624.571167] EDAC skx MC1: PROCESSOR 0:50654 TIME 1562043375 SOCKET 0 APIC 4
[7022624.571175] EDAC MC1: 1 UE memory scrubbing error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0xdfa29c offset:0xf80 grain:32 -  OVERFLOW recoverable err_code:0010:00c1 socket:0 imc:1 rank:1 bg:2 ba:0 row:6bb2 col:138)
[7027940.094961] Process accounting resumed
[7107442.018032] mce: [Hardware Error]: Machine check events logged
[7107442.018086] EDAC skx MC1: HANDLING MCE MEMORY ERROR
[7107442.018087] EDAC skx MC1: CPU 1: Machine Check Exception: 5 Bank 16: fd000c40001000c1
[7107442.018088] EDAC skx MC1: TSC 5b0cf7d688dc44 
[7107442.018089] EDAC skx MC1: ADDR daa29cf80 
[7107442.018090] EDAC skx MC1: MISC 908400100000086 
[7107442.018090] EDAC skx MC1: PROCESSOR 0:50654 TIME 1562128393 SOCKET 0 APIC 4
[7107442.018097] EDAC MC1: 1 UE memory scrubbing error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0xdaa29c offset:0xf80 grain:32 -  OVERFLOW recoverable err_code:0010:00c1 socket:0 imc:1 rank:1 bg:2 ba:0 row:66b2 col:138)
[7107691.803948] mce: [Hardware Error]: Machine check events logged
[7107691.804005] EDAC skx MC1: HANDLING MCE MEMORY ERROR
[7107691.804007] EDAC skx MC1: CPU 1: Machine Check Exception: 5 Bank 16: fd000500001000c1
[7107691.804008] EDAC skx MC1: TSC 5b0dc934c6be34 
[7107691.804009] EDAC skx MC1: ADDR dda2dcf80 
[7107691.804009] EDAC skx MC1: MISC 908400800000086 
[7107691.804010] EDAC skx MC1: PROCESSOR 0:50654 TIME 1562128643 SOCKET 0 APIC 4
[7107691.804019] EDAC MC1: 1 UE memory scrubbing error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0xdda2dc offset:0xf80 grain:32 -  OVERFLOW recoverable err_code:0010:00c1 socket:0 imc:1 rank:1 bg:2 ba:0 row:69b3 col:138)
[7114136.812186] Process accounting resumed
Wed, Jul 3, 1:13 PM · Operations, DBA
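To see at a glance which DIMM the kernel keeps blaming in a log like the one above, the EDAC summary lines can be counted per label. This is a minimal sketch that assumes the summary lines look like the "EDAC MC1: 1 UE ... on <label> (" lines shown here; EDAC_RE, count_memory_errors and SAMPLE are hypothetical names, and the sample lines are shortened copies of the log above.

```python
import re
from collections import Counter

# Matches summary lines such as:
#   EDAC MC1: 1 UE memory scrubbing error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 ...
# Detail lines starting with "EDAC skx MC1:" are intentionally not matched.
EDAC_RE = re.compile(r"EDAC MC\d+: (\d+) (UE|CE) .*? on (\S+) \(")


def count_memory_errors(log_text: str) -> Counter:
    """Count reported errors per (kind, DIMM label) in kernel log text."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = EDAC_RE.search(line)
        if match:
            number, kind, label = match.groups()
            counts[(kind, label)] += int(number)
    return counts


# Shortened lines from the log above (trailing fields dropped).
SAMPLE = "\n".join([
    "[6498437.928393] EDAC skx MC1: HANDLING MCE MEMORY ERROR",
    "[6498437.928408] EDAC MC1: 1 UE memory scrubbing error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0xd8229c)",
    "[6510762.575543] Process accounting resumed",
    "[7022624.571175] EDAC MC1: 1 UE memory scrubbing error on CPU_SrcID#0_MC#1_Chan#1_DIMM#1 (channel:1 slot:1 page:0xdfa29c)",
])
```

Run over the full log above, a counter like this would attribute every uncorrectable (UE) error to the same label, CPU_SrcID#0_MC#1_Chan#1_DIMM#1, which points at a single module.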
jcrespo committed rOSMD879f3da97fbf: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Wed, Jul 3, 11:41 AM
jcrespo committed rOSMDa0e8fe7f687b: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Wed, Jul 3, 10:29 AM
jcrespo committed rOSMD9c3af935efc0: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Wed, Jul 3, 10:26 AM
jcrespo committed rOSMD7a66f857d48d: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Wed, Jul 3, 10:14 AM
jcrespo committed rOSMD420b83f8e5d7: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Wed, Jul 3, 10:11 AM
jcrespo committed rOSMD338e8800d132: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Wed, Jul 3, 10:10 AM
jcrespo committed rOSMD91165fc5393c: switchover.py: Add new options --replicating-master & --read-only-master (authored by jcrespo).
Wed, Jul 3, 9:33 AM
jcrespo committed rOSMDad4ef6f13c91: switchover.py: Add new options --replicating-master & --read-only-master (authored by jcrespo).
Wed, Jul 3, 9:27 AM
jcrespo committed rOSMDa2e287420312: switchover.py: Add new options --replicating-master & --read-only-master (authored by jcrespo).
Wed, Jul 3, 9:24 AM
jcrespo committed rOSMD8e3e86194f2d: switchover.py: Add new options --replicating-master & --read-only-master (authored by jcrespo).
Wed, Jul 3, 9:20 AM
jcrespo committed rOSMD1b76c55dc196: switchover.py: Add new options --replicating-master & --read-only-master (authored by jcrespo).
Wed, Jul 3, 9:06 AM
jcrespo committed rOSMD6a8f87e11ae8: switchover.py: Enable new option --replicating-master (authored by jcrespo).
Wed, Jul 3, 8:33 AM
jcrespo committed rOSMD742760112439: switchover.py: Add some extra automations to the script (authored by jcrespo).
Wed, Jul 3, 7:04 AM
jcrespo added a comment to T226787: My sandbox has no save button.

So "publish..." is the button you are looking for. Being in your sandbox, it will be just a draft, not in the main article namespace, but it will be saved for later.

Wed, Jul 3, 5:43 AM

Tue, Jul 2

jcrespo committed rOSMD11a3aa20dd74: switchover.py: Add some extra automations to the script (authored by jcrespo).
Tue, Jul 2, 5:24 PM
jcrespo committed rOSMD537f5f7a440e: switchover.py: Add some extra automations to the script (authored by jcrespo).
Tue, Jul 2, 5:20 PM
jcrespo updated the task description for T120085: Serve Main Page of WMF wikis from a consistent URL.
Tue, Jul 2, 5:07 PM · Patch-For-Review, Core Platform Team Backlog (Watching / External), Performance-Team, Operations, Traffic, TechCom-RFC, SEO, Wikimedia-Site-requests
jcrespo committed rOSMD8661c293b75a: WMFReplication: Make move work for a limited number of cases (authored by jcrespo).
Tue, Jul 2, 2:14 PM
jcrespo committed rOSMD13a8fb4c5c87: WMFReplication: Make move work for a limited number of cases (authored by jcrespo).
Tue, Jul 2, 2:12 PM
jcrespo added a comment to P8698 topology changes.
>>> import WMFMariaDB
>>> import WMFReplication
>>> db1 = WMFMariaDB.WMFMariaDB(host='127.0.0.1')
>>> db2 = WMFMariaDB.WMFMariaDB(host='127.0.0.1', port=3307)
>>> db3 = WMFMariaDB.WMFMariaDB(host='127.0.0.1', port=3308)
>>> db1r = WMFReplication.WMFReplication(db1)
>>> db2r = WMFReplication.WMFReplication(db2)
>>> db3r = WMFReplication.WMFReplication(db3)
>>> db1.name()
'127.0.0.1:3306/(none)'
>>> db2.name()
'127.0.0.1:3307/(none)'
>>> db3.name()
'127.0.0.1:3308/(none)'
>>> db1r.debug()
127.0.0.1:3306/(none)> Not configured as a slave
>>> db2r.debug()
127.0.0.1:3307/(none)> master: 127.0.0.1:3306, io: Yes, sql: Yes, lag: 0, pos: sangai-bin.000003/1683948
>>> db3r.debug()
127.0.0.1:3308/(none)> master: 127.0.0.1:3306, io: Yes, sql: Yes, lag: 0, pos: sangai-bin.000003/1683948
>>> db3r.move(db2)
/usr/lib/python3/dist-packages/pymysql/cursors.py:170: Warning: (1278, "It is recommended to use --skip-slave-start when doing step-by-step replication with START SLAVE UNTIL; otherwise, you will get problems if you get an unexpected slave's mysqld restart")
  result = self._query(query)
{'success': True}
>>> db1r.debug()
127.0.0.1:3306/(none)> Not configured as a slave
>>> db2r.debug()
127.0.0.1:3307/(none)> master: 127.0.0.1:3306, io: Yes, sql: Yes, lag: 0, pos: sangai-bin.000003/1696195
>>> db3r.debug()
127.0.0.1:3308/(none)> master: 127.0.0.1:3307, io: Yes, sql: Yes, lag: 0, pos: sangai-bin.000003/1640009
>>> db2r.move(db3)
{'success': False, 'errno': -1, 'errmsg': 'The topology change cannot be done at the moment- check its relationship, replication status or replication lag'}
>>> db1r.move(db3)
{'success': False, 'errno': -1, 'errmsg': 'The host is not configured as a replica'}
>>> db3r.move(db1)
{'success': True}
>>> db1r.debug()
127.0.0.1:3306/(none)> Not configured as a slave
>>> db2r.debug()
127.0.0.1:3307/(none)> master: 127.0.0.1:3306, io: Yes, sql: Yes, lag: 0, pos: sangai-bin.000003/1721682
>>> db3r.debug()
127.0.0.1:3308/(none)> master: 127.0.0.1:3306, io: Yes, sql: Yes, lag: 0, pos: sangai-bin.000003/1723337
Tue, Jul 2, 1:48 PM
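The session above exercises four moves: a replica moved under its sibling, a host moved under its own replica (refused), a host that is not a replica (refused), and a replica moved back up to its master's master. One possible reading of those rules is sketched below as a toy model. This is not the WMFReplication code: can_move, move and the masters dictionary are hypothetical, and the real checks on replication status and lag are not modelled at all.

```python
from typing import Dict, Optional

# Maps each host to its current master; None means "not a replica".
Topology = Dict[str, Optional[str]]


def can_move(masters: Topology, replica: str, target: str) -> bool:
    """Toy rule: only sibling -> child and child -> grandparent moves are allowed."""
    current = masters.get(replica)
    if current is None:                 # host is not configured as a replica
        return False
    if target in (replica, current):    # nothing to do, or nonsensical
        return False
    is_sibling = masters.get(target) == current       # both replicate from `current`
    is_grandparent = masters.get(current) == target   # target is master of our master
    return is_sibling or is_grandparent


def move(masters: Topology, replica: str, target: str) -> bool:
    """Repoint `replica` to `target` if the toy rule allows it."""
    if not can_move(masters, replica, target):
        return False
    masters[replica] = target
    return True
```

Starting from {"db1": None, "db2": "db1", "db3": "db1"}, the calls move(db3, db2), move(db2, db3), move(db1, db3) and move(db3, db1) succeed, fail, fail and succeed in that order, mirroring the results in the session, and leave the topology as it started.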
jcrespo committed rOSMDdb574972a4c2: WMFReplication: Make move work for a limited number of cases (authored by jcrespo).
Tue, Jul 2, 1:06 PM
jcrespo created P8698 topology changes.
Tue, Jul 2, 12:46 PM
jcrespo added a comment to T224916: Global rename of B dash → A1Cafel: supervision needed.

@revi This is not yet announced, but I have not forgotten about it. I just want the right person (whoever implemented it) to do it and take credit for the improvement.

Tue, Jul 2, 12:44 PM · Wikimedia-Site-requests
jcrespo added a comment to T225370: Global rename of Waldir → Waldyrious: supervision needed.

I am not a developer, but to me T225370#5298483 would seem like an intended thing. You may want to document that on the rename user documentation.

Tue, Jul 2, 8:13 AM · Operations, Wikimedia-Site-requests
jcrespo added a comment to T224916: Global rename of B dash → A1Cafel: supervision needed.

@mys_721tx, there should be no reason to block this anymore. As far as I have been told and can see, renames should be (almost) instant and no longer an error-prone action. Please confirm when performing this rename.

Tue, Jul 2, 8:05 AM · Wikimedia-Site-requests
jcrespo changed the status of T224348: Global rename of Fiona B. → Fiona*: supervision needed from Stalled to Open.

@Itti, there should be no reason to block this anymore. As far as I have been told and can see, renames should be (almost) instant and no longer an error-prone action. Please confirm when performing this rename.

Tue, Jul 2, 8:04 AM · Operations, Wikimedia-Site-requests