jcrespo (Jaime Crespo)
Sr Database Administrator

User Details

User Since
May 11 2015, 8:31 AM (249 w, 1 h)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF) [ Global Accounts ]

Recent Activity

Thu, Feb 13

jcrespo added a comment to T245114: Migrate Cumin hosts to Buster.

Cumin should, for the most part, be able to be upgraded to 10.4 with ease, as it only holds the client, not the server, which is why it is easier to upgrade (and it should be transparent).

Thu, Feb 13, 11:51 AM · Operations
jcrespo added a comment to T244958: db1095 backup source crashed: broken BBU.

Thanks, Jclark-ctl. No need to wait for us in this particular case: it was important enough that the service was immediately moved elsewhere and the data considered irrecoverable (but the service has to return to it). As I said:

Thu, Feb 13, 10:15 AM · ops-eqiad, Operations, DBA
jcrespo renamed T244884: Implement logic to be able to perform full and incremental backups of ES hosts from Implement logic to be able to perform partial/incremental backups of ES hosts to Implement logic to be able to perform full and incremental backups of ES hosts.
Thu, Feb 13, 10:11 AM · Patch-For-Review, Goal, Operations, DBA
jcrespo added a comment to T244884: Implement logic to be able to perform full and incremental backups of ES hosts.

We can also take into consideration reading the binlogs from a given file/position using the coordinates we store on the logical dump, which might be faster.

Thu, Feb 13, 10:10 AM · Patch-For-Review, Goal, Operations, DBA

Wed, Feb 12

jcrespo added a comment to T234900: Setup bacula backup monitoring.

I am wondering if 0-byte incremental backups are purged as an optimization, or if we had another error that caused those to disappear. If that were true, we may have to change our strategy of assuming the existence of recent incrementals.

Wed, Feb 12, 6:08 PM · Patch-For-Review, Availability, observability, Goal, Operations
jcrespo added a comment to T238048: Followup to backup1001 bacula switchover (misc pending tasks).

For some reason, the pool was updated, but not every volume. I ran update pool from resource, and then "all volumes from pool", and it got applied.

Wed, Feb 12, 6:04 PM · Goal, Operations
jcrespo added a comment to T238048: Followup to backup1001 bacula switchover (misc pending tasks).

Apparently, the databases pool got enlarged, but the production one is still on 1 month to purge. We need to check whether to increase it to 3 months too; there is space available for that.

Wed, Feb 12, 5:57 PM · Goal, Operations
jcrespo added a comment to T240772: Prepare and check storage layer for ngwikimedia.

I checked the users table and it looks good, but I will let Manuel close this.

Wed, Feb 12, 5:43 PM · Data-Services, cloud-services-team (Kanban), DBA
jcrespo reassigned T244958: db1095 backup source crashed: broken BBU from jcrespo to wiki_willy.

The battery of db1095, out of warranty, is toasted. It would be nice not to throw away the whole server for just the RAID battery. Could we order one?

Wed, Feb 12, 4:52 PM · ops-eqiad, Operations, DBA
jcrespo added a comment to T244958: db1095 backup source crashed: broken BBU.

eqiad backup service has been restored on a different host, now to handle hw issues.

Wed, Feb 12, 2:17 PM · ops-eqiad, Operations, DBA
jcrespo added a comment to T244238: Upgrade and restart m1 master (db1135).

Let's aim for Thursday 20th at 09:00AM UTC?

Wed, Feb 12, 1:46 PM · Wikimedia-Etherpad, DBA, Operations
jcrespo added a comment to T244238: Upgrade and restart m1 master (db1135).

Sorry, I thought I had answered, but apparently I did not hit submit.

Wed, Feb 12, 1:42 PM · Wikimedia-Etherpad, DBA, Operations
jcrespo added a comment to T244958: db1095 backup source crashed: broken BBU.

I predict s3 will take more time due to filesystem object overhead.

Wed, Feb 12, 1:03 PM · ops-eqiad, Operations, DBA
jcrespo added a comment to T244958: db1095 backup source crashed: broken BBU.

Now running:

transfer.py --type=decompress --no-encrypt --no-checksum dbprov1002.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s3.2020-02-12--05-46-09.tar.gz db1140.eqiad.wmnet:/srv/sqldata.s3
Wed, Feb 12, 11:14 AM · ops-eqiad, Operations, DBA
jcrespo added a comment to T244958: db1095 backup source crashed: broken BBU.

created /srv/sqldata.s2 on db1140 and ran:

Wed, Feb 12, 10:22 AM · ops-eqiad, Operations, DBA

Tue, Feb 11

jcrespo added a project to T79922: Set up backup strategy for es clusters: Goal.
Tue, Feb 11, 4:21 PM · Goal, Operations, DBA
jcrespo added a comment to T79922: Set up backup strategy for es clusters.

While there are full logical dumps of es hosts, those are not integrated with the general metadata and misc backups. This task, once T244884 is done, will integrate those under the same process, although with a slightly different logic.

Tue, Feb 11, 4:06 PM · Goal, Operations, DBA
jcrespo added a project to T244884: Implement logic to be able to perform full and incremental backups of ES hosts: Goal.
Tue, Feb 11, 4:05 PM · Patch-For-Review, Goal, Operations, DBA
jcrespo triaged T244884: Implement logic to be able to perform full and incremental backups of ES hosts as High priority.
Tue, Feb 11, 4:05 PM · Patch-For-Review, Goal, Operations, DBA
jcrespo created T244884: Implement logic to be able to perform full and incremental backups of ES hosts.
Tue, Feb 11, 4:04 PM · Patch-For-Review, Goal, Operations, DBA

Mon, Feb 10

jcrespo added a comment to T195578: Deploy access to performance_schema/sys for the administrative mediawiki account (mediawiki deployers).

Just to be clear, this is not precisely a #1 priority thing, and we didn't make much noise about this for a reason... but it got unearthed lately after a couple of performance-related requests from mw maintainers, as it could help as a self-service model rather than pinging DBAs every time. :-D

Mon, Feb 10, 5:17 PM · Security Related, SecTeam Discussion, Security-Team, Security, Performance Issue, DBA
jcrespo added a comment to T195578: Deploy access to performance_schema/sys for the administrative mediawiki account (mediawiki deployers).

We don't need Security-Team feedback in terms of usefulness; that part of the ticket was for general deployers to comment on whether they found it helpful enough to enable it on all hosts.

Mon, Feb 10, 5:11 PM · Security Related, SecTeam Discussion, Security-Team, Security, Performance Issue, DBA
jcrespo added a comment to T195578: Deploy access to performance_schema/sys for the administrative mediawiki account (mediawiki deployers).

For this task, given its age, is it still relevant?

Mon, Feb 10, 4:16 PM · Security Related, SecTeam Discussion, Security-Team, Security, Performance Issue, DBA

Sat, Feb 8

jcrespo added a comment to T244058: Wiki diffs take over 15s to load.

What about tuning HTTP frontend caching? Last revision is very dynamic, but a hardcoded diff maybe could return stale results more often, as I can only see it changing on deletion/page move?

Sat, Feb 8, 10:34 AM · Core Platform Team Workboards (Clinic Duty Team), Performance-Team (Radar), serviceops, Operations, Wikimedia-production-error

Wed, Feb 5

jcrespo added a comment to T243963: es1019: reseat IPMI.

yes, this seems to be an issue

I hope you understood this was a rant directed towards the machine/vendor only and for background info. I don't think we will get rid of it until we renew the hw, but ofc no problem on trying an upgrade.

Wed, Feb 5, 1:40 AM · DC-Ops, Operations, ops-eqiad, DBA

Tue, Feb 4

Ladsgroup awarded T195578: Deploy access to performance_schema/sys for the administrative mediawiki account (mediawiki deployers) a Love token.
Tue, Feb 4, 6:24 PM · Security Related, SecTeam Discussion, Security-Team, Security, Performance Issue, DBA
jcrespo awarded T244232: acme-chief should be able to refresh OCSP stapling response even if the renewal process fails a Evil Spooky Haunted Tree token.
Tue, Feb 4, 3:31 PM · Operations, Traffic
jcrespo renamed T244058: Wiki diffs take over 15s to load from Page takes over 15s to load: https://en.wikipedia.org/w/index.php?title=European_Union&type=revision&diff=938561921&oldid=938557616 to Wiki diffs take over 15s to load.
Tue, Feb 4, 12:12 PM · Core Platform Team Workboards (Clinic Duty Team), Performance-Team (Radar), serviceops, Operations, Wikimedia-production-error
jcrespo added a comment to T243602: templatelinks table on Commons SQL database is not updating properly.

I might be able to tell you who to contact: [disclaimer- please note DBAs take care only of server maintenance, things like "are databases up?" "Do all nodes have the same data on them?" "Do developers have enough servers?", but cannot help with debugging mediawiki-related bugs (e.g. we don't handle Wikimedia-database-error tickets)].

Tue, Feb 4, 11:09 AM · Contributors-Team, Commons, Wikimedia-database-error

Mon, Feb 3

jcrespo added a comment to T239344: Mobileapps flapping since 2019-11-26 0:00 UTC.

We have been seeing instances of this issue on codfw (soft mobileapp endpoint timeout alerts) specifically in the last several weeks. I can file a new task if this is not the right place to follow up on this.

Mon, Feb 3, 12:34 PM · Product-Infrastructure-Team-Backlog, serviceops, Page Content Service
jcrespo updated the task description for T244127: cp3057 crash (was: network down).
Mon, Feb 3, 12:19 PM · ops-esams, Traffic, Operations
jcrespo renamed T244127: cp3057 crash (was: network down) from cp3057 network down to cp3057 crash (was: network down).
Mon, Feb 3, 12:17 PM · ops-esams, Traffic, Operations
jcrespo added a comment to T244127: cp3057 crash (was: network down).

+1, there were icinga errors as early as 11:15:

[2020-02-03 11:15:57] SERVICE ALERT: cp3057;Webrequests Varnishkafka log producer;UNKNOWN;SOFT;1;CHECK_NRPE STATE UNKNOWN: Socket timeout after 10 seconds.
Mon, Feb 3, 12:00 PM · ops-esams, Traffic, Operations
jcrespo added a project to T244127: cp3057 crash (was: network down): netops.
Mon, Feb 3, 11:32 AM · ops-esams, Traffic, Operations
jcrespo created T244127: cp3057 crash (was: network down).
Mon, Feb 3, 11:31 AM · ops-esams, Traffic, Operations
jcrespo added a comment to T224422: Implement logic to filter bogus GTIDs.

I apologize for assuming gtid changes were to blame- in my defense logging was confusing to me. On the bad news side, I believe there is still some underlying issue showing "fake" lag/creating extra waits, due to some kind of contention on the mysql or application layer. Right now this is most visible on the job queue, so not sure timeout is the right way to go, it could be just some concurrency configuration to "smooth" mysql queries/connections, or maybe some migration or other ongoing issue (e.g. could be a fallout of the application server latency issue we have observed in the past).

Mon, Feb 3, 7:38 AM · MW-1.35-notes (1.35.0-wmf.2; 2019-10-15), MW-1.34-notes (1.34.0-wmf.25; 2019-10-01), Core Platform Team Workboards (Clinic Duty Team), CPT Initiatives (Multi-DC (TEC1)), Performance-Team, Services (watching), Wikimedia-Rdbms

Fri, Jan 31

jcrespo added a comment to T243830: Increased errors in GET https://ar.wikipedia.org/api/rest_v1/page/random/summary since 21 Jan 9:00 UTC.

👍

Fri, Jan 31, 6:35 AM · Mobile-Content-Service, Product-Infrastructure-Team-Backlog, Android-app-Bugs, iOS-app-Bugs, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, Wikimedia-production-error, serviceops

Thu, Jan 30

jcrespo added a comment to T243963: es1019: reseat IPMI.

Strange, not as if it has happened before! T120689 T155691 T187530 T201132 T213422 T233698

Thu, Jan 30, 5:37 PM · DC-Ops, Operations, ops-eqiad, DBA
jcrespo added a comment to T243948: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left.

If I can add a 4th and 5th, with lower priority, and feel free to disagree- the "Ensure acme-chief-backend is running only in the active node" check should not use the -a parameter, but match only the first one or two arguments. Also maybe some extra monitoring related to repeating failures (?) (I believe someone saw errors on execution). These are minor things compared to avoiding an outage here, which is the top priority for now.
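
To illustrate the difference between the two matching strategies, here is a minimal Python sketch. It is not the actual Icinga check: the command lines and paths are hypothetical, and the two helper functions are invented for illustration. It only shows why matching a string anywhere in the arguments can count unrelated processes, while matching the leading arguments does not.

```python
import shlex

# Hypothetical process command lines (assumed for illustration only).
CMDLINES = [
    "/usr/bin/python3 /usr/bin/acme-chief-backend",  # the daemon itself
    "systemctl restart acme-chief-backend",          # a transient helper
]


def count_substring_matches(cmdlines, needle):
    """Count processes whose full argument string contains `needle`."""
    return sum(1 for cmd in cmdlines if needle in cmd)


def count_leading_arg_matches(cmdlines, expected_prefix):
    """Count processes whose first arguments equal `expected_prefix`."""
    n = len(expected_prefix)
    return sum(1 for cmd in cmdlines if shlex.split(cmd)[:n] == expected_prefix)


loose = count_substring_matches(CMDLINES, "acme-chief-backend")
strict = count_leading_arg_matches(
    CMDLINES, ["/usr/bin/python3", "/usr/bin/acme-chief-backend"]
)
print(loose, strict)
```

With these assumed inputs, the substring match counts both processes, whereas the leading-argument match counts only the daemon.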

Thu, Jan 30, 3:22 PM · Operations, Traffic
jcrespo created P10292 race condition on "PROBLEM - Ensure acme-chief-backend is running only in the active node on acmechief1001 is CRITICAL: PROCS CRITICAL: 2 processes with args acme-chief-backend".
Thu, Jan 30, 11:16 AM
jcrespo added a comment to T243884: Strange URL pattern after search https://en.wikipedia.org/w/index.php?sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance ....

Thanks @Jdlrobson I hoped that either Search or Readers could know a source, but it wasn't clear to me on filing.

Thu, Jan 30, 9:30 AM · Readers-Web-Backlog (Tracking), Discovery-Search, Wikimedia-production-error

Wed, Jan 29

jcrespo created T243884: Strange URL pattern after search https://en.wikipedia.org/w/index.php?sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance ....
Wed, Jan 29, 10:25 AM · Readers-Web-Backlog (Tracking), Discovery-Search, Wikimedia-production-error

Tue, Jan 28

jcrespo reopened T243821: mr1-eqiad.oob IPv6 is down as "Open".
[19:41] <icinga-wm> PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
Tue, Jan 28, 7:44 PM · Operations
jcrespo added a comment to T243762: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null.

Thanks, let us know how we can help.

Tue, Jan 28, 5:17 PM · Wikimedia-production-error, Arc-Lamp, Performance-Team, Operations
jcrespo added a comment to T243762: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null.

Assuming no impact on actual functionality, this could have lower priority and drop the Wikimedia-production-error tag; it would just be kept (and eventually be solved or worked around) to try to reduce root cron spam.

Tue, Jan 28, 4:50 PM · Wikimedia-production-error, Arc-Lamp, Performance-Team, Operations
jcrespo added projects to T243830: Increased errors in GET https://ar.wikipedia.org/api/rest_v1/page/random/summary since 21 Jan 9:00 UTC: Mobile, iOS-app-Bugs, Android-app-Bugs.

On clients, randomizer fails quite frequently

Tue, Jan 28, 12:32 PM · Mobile-Content-Service, Product-Infrastructure-Team-Backlog, Android-app-Bugs, iOS-app-Bugs, Wikipedia-iOS-App-Backlog, Wikipedia-Android-App-Backlog, Wikimedia-production-error, serviceops
jcrespo added a comment to T208425: [EPIC] Kill the wb_terms table.

Thanks a lot for the work on this- may I suggest a step before the next step (after rebuilding) of "checking all data, old and new, is consistent". This is a lot of data, and even on well-thought-out processes, missing rows were discovered after I requested a comparison on another well-thought-out migration, which happened due to mistakes/existing inconsistencies/aborts. May I request such a step, which could be as fast as a simple join query (<5m to run) between old and new to check that no rows are missing or extra, and that they have equivalent data?
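
As a rough sketch of what such a consistency check could look like, the Python example below uses an in-memory SQLite database. The table names (`old_terms`, `new_terms`), the columns, and the sample rows are all hypothetical and do not reflect the real wb_terms schema. It finds rows missing from the new table, extra rows in the new table, and rows present in both with different data.

```python
import sqlite3


def compare_tables(conn):
    """Return (missing_in_new, extra_in_new, different) lists of ids."""
    missing = conn.execute(
        "SELECT o.id FROM old_terms o LEFT JOIN new_terms n ON o.id = n.id "
        "WHERE n.id IS NULL ORDER BY o.id"
    ).fetchall()
    extra = conn.execute(
        "SELECT n.id FROM new_terms n LEFT JOIN old_terms o ON o.id = n.id "
        "WHERE o.id IS NULL ORDER BY n.id"
    ).fetchall()
    # "IS NOT" is SQLite's NULL-safe inequality; on MySQL/MariaDB the
    # equivalent would be NOT (o.term <=> n.term).
    different = conn.execute(
        "SELECT o.id FROM old_terms o JOIN new_terms n ON o.id = n.id "
        "WHERE o.term IS NOT n.term ORDER BY o.id"
    ).fetchall()
    return ([r[0] for r in missing], [r[0] for r in extra],
            [r[0] for r in different])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_terms (id INTEGER PRIMARY KEY, term TEXT)")
conn.execute("CREATE TABLE new_terms (id INTEGER PRIMARY KEY, term TEXT)")
# Hypothetical sample data: id 3 was lost, id 4 is extra, id 2 differs.
conn.executemany("INSERT INTO old_terms VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("INSERT INTO new_terms VALUES (?, ?)",
                 [(1, "a"), (2, "B"), (4, "d")])

missing_in_new, extra_in_new, different = compare_tables(conn)
print(missing_in_new, extra_in_new, different)
```

On a real migration the comparison would probably be run in batches by primary-key range, but that is outside the scope of this sketch.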

Tue, Jan 28, 12:10 PM · User-Addshore, wikidata-tech-focus, Wikidata-Ugly-Cat-Trailblaze (wb_terms trail blazing), Wikidata
jcrespo created P10283 ar mobile apps latency.
Tue, Jan 28, 11:47 AM

Mon, Jan 27

jcrespo added a comment to T243701: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service).

API maxlag has been repeatedly declared to always be <5s by several people in the past

Mon, Jan 27, 4:43 PM · Wikidata-Campsite, Traffic, Performance Issue, Operations, Discovery, Wikidata-Query-Service, Wikidata
jcrespo added a comment to T243701: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service).

I am not part of the Wikidata QS team, so I don't have answers, just questions :-D Only chiming in because my team has been tagged on this ticket- please understand we (SREs) are not in direct charge of this service and that someone else should answer with first hand knowledge.

Mon, Jan 27, 4:19 PM · Wikidata-Campsite, Traffic, Performance Issue, Operations, Discovery, Wikidata-Query-Service, Wikidata
jcrespo created T243762: Daily errors on webperf1002 & webperf2002 /usr/local/bin/arclamp-generate-svgs > /dev/null.
Mon, Jan 27, 11:59 AM · Wikimedia-production-error, Arc-Lamp, Performance-Team, Operations
jcrespo lowered the priority of T243713: Time-out error; Babel/WikibaseRepo being somehow uncached, overloading the API, and causing general outage from Unbreak Now! to High.
Mon, Jan 27, 11:24 AM · User-Addshore, MW-1.35-notes (1.35.0-wmf.18; 2020-02-04), Wikimedia-Incident, Traffic, Operations, Performance Issue

Thu, Jan 23

jcrespo added a comment to T243519: il_from not always the same size.

Please note that int(8) unsigned and int(10) unsigned are all the exact same actual data type (4 byte integer, from 0 to 4294967295). https://dev.mysql.com/doc/refman/8.0/en/numeric-type-attributes.html
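
As a small worked check of the range quoted above, the Python sketch below computes the maximum of a 4-byte unsigned integer and confirms it fits in exactly 4 bytes. The constant and helper names are made up for illustration. The number in parentheses in `int(8)` or `int(10)` is only a display width and does not affect this range.

```python
import struct

# A 4-byte unsigned integer, regardless of the declared display width.
INT_UNSIGNED_BYTES = 4
INT_UNSIGNED_MAX = 2 ** (8 * INT_UNSIGNED_BYTES) - 1


def fits_int_unsigned(value: int) -> bool:
    """Return True if `value` is storable in a 4-byte unsigned integer."""
    return 0 <= value <= INT_UNSIGNED_MAX


# struct's "<I" format is a little-endian 4-byte unsigned integer.
packed = struct.pack("<I", INT_UNSIGNED_MAX)
print(INT_UNSIGNED_MAX, len(packed))
```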

Thu, Jan 23, 3:24 PM · DBA
jcrespo added a comment to T243483: Unknown search function "XXX". Supported functions are: all, title, body, core, comment..

Related to T243479 ?

Thu, Jan 23, 10:04 AM · Phabricator (Upstream), Upstream

Tue, Jan 21

jcrespo claimed T79922: Set up backup strategy for es clusters.
Tue, Jan 21, 4:09 PM · Goal, Operations, DBA

Jan 9 2020

jcrespo added a comment to T241638: ContainerDisabledException in ServiceContainer due to failed installation with different language then English.

Sadly, applying https://gerrit.wikimedia.org/r/c/mediawiki/core/+/563241 to current 1.35 gives me:

Jan 9 2020, 6:49 PM · Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), MediaWiki-ServiceContainer, MediaWiki-Internationalization, MediaWiki-Installer
jcrespo added a comment to T241638: ContainerDisabledException in ServiceContainer due to failed installation with different language then English.

Edit: What is a good 1.35 version to verify the issue is not there? HEAD?

Yes, head of master. If it's not a problem in master, we don't need to do anything.

Jan 9 2020, 6:42 PM · Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), MediaWiki-ServiceContainer, MediaWiki-Internationalization, MediaWiki-Installer
jcrespo added a comment to T241638: ContainerDisabledException in ServiceContainer due to failed installation with different language then English.

Not the original reporter, but after applying the patch I can confirm the turkish installation went through correctly (twice):

Jan 9 2020, 4:32 PM · Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), MediaWiki-ServiceContainer, MediaWiki-Internationalization, MediaWiki-Installer
jcrespo added a comment to T235481: Backups for arclamp application data.

helium is no longer the backup server; you could go to https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-job=webperf1002.eqiad.wmnet-Monthly-1st-Wed-production-arclamp-application-data&from=1577127671546&to=1578586171164 to check backup behavior. Icinga will also alert if backups are not fresh and correct beyond the configured period. I can see full backups for webperf taking 215GB. As part of upcoming goal work, we will soon require assistance from service owners to automate and test the recovery of backups.

Jan 9 2020, 4:10 PM · serviceops, Performance-Team, Arc-Lamp
jcrespo added a comment to T240177: backup2001 crashed 2019-12-08.

backup2001 is now down and ready for maintenance (no need to ask again). @Papaul please, when done, just boot it back up and ping here. Thanks.

Jan 9 2020, 12:28 PM · Operations, DBA
jcrespo added a comment to T238296: job queue insert rate metrics gone from Grafana.

As an addendum, could something be improved related to T238296#5662905 ? It seems there are at least 3 dashboards related to the job queue health, and while https://grafana.wikimedia.org/d/000000107/job-queue-health is clearly marked as deprecated, I didn't know about T238296#5662894. I would suggest to clarify current status on wikitech or grafana itself.

Jan 9 2020, 9:54 AM · Core Platform Team Workboards (Clinic Duty Team), serviceops, WMF-JobQueue, MediaWiki-JobQueue, observability
jcrespo awarded T238296: job queue insert rate metrics gone from Grafana a Love token.
Jan 9 2020, 9:51 AM · Core Platform Team Workboards (Clinic Duty Team), serviceops, WMF-JobQueue, MediaWiki-JobQueue, observability

Jan 6 2020

jcrespo added a comment to T240929: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman.

I have forwarded the link to @herron

Jan 6 2020, 4:55 PM · Operations, Wikimedia-Mailing-lists

Jan 3 2020

jcrespo reassigned T241103: Upgrade BIOS and firmware on db2084 from jcrespo to Marostegui.

I have started mysql instances back again, and replication, as on codfw there is low load.

Jan 3 2020, 5:54 PM · ops-codfw, Operations, DBA
jcrespo reassigned T240177: backup2001 crashed 2019-12-08 from jcrespo to Papaul.

@Papaul could you proceed with T240177#5727654, as this is the 3rd crash, and the second since the firmware upgrade?

Jan 3 2020, 11:39 AM · Operations, DBA
jcrespo updated the task description for T241421: Sustained periods (2-4h) of bad latency on production-search eqiad.
Jan 3 2020, 10:10 AM · Discovery-Search (Current work), Patch-For-Review, Operations, Traffic, Performance Issue, Elasticsearch

Jan 2 2020

jcrespo added a comment to T241309: Add more detailed instructions to the "sec-advice" page.

Feel free to edit the body with a complete list of changes, but I believe a single task would be enough to track all improvements requested.

Jan 2 2020, 7:59 PM · Traffic, Operations
jcrespo merged task T241656: sec-warning page uses the term "Wikipedia" incorrectly into T241309: Add more detailed instructions to the "sec-advice" page.
Jan 2 2020, 7:58 PM · Voice & Tone, Operations, Traffic, HTTPS
jcrespo merged T241656: sec-warning page uses the term "Wikipedia" incorrectly into T241309: Add more detailed instructions to the "sec-advice" page.
Jan 2 2020, 7:58 PM · Traffic, Operations
jcrespo added a comment to T240929: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman.

I still have no answer yet :-/.

Jan 2 2020, 4:10 PM · Operations, Wikimedia-Mailing-lists
jcrespo added a comment to T160985: Create an easy to deploy kill switch for every self-contained mediawiki functionality.

I believe this should be renamed to "Approve a policy to not allow new features being deployed to production that don't have a kill switch".

Jan 2 2020, 4:09 PM · MediaWiki-Configuration, MediaWiki-Special-pages, MediaWiki-API

Dec 31 2019

jcrespo added a project to T241648: Special:BrokenRedirects shows displays an incorrect state on srwiki: Serbian-Sites.
Dec 31 2019, 2:09 PM · Serbian-Sites, Wikimedia-Site-requests, Wikimedia-maintenance-script-run
jcrespo added a project to T241648: Special:BrokenRedirects shows displays an incorrect state on srwiki: Wikimedia-Site-requests.
Dec 31 2019, 2:08 PM · Serbian-Sites, Wikimedia-Site-requests, Wikimedia-maintenance-script-run
jcrespo added a comment to T238464: PLURAL magicword not parsed during installation: "Password must be at least {{PLURAL:10|1 character|10 characters}}".

I was able to reproduce this on 1.34. I wonder if the root cause is similar to T241638, translation service not being yet available during the installer.

Dec 31 2019, 1:16 PM · Core Platform Team, Patch-For-Review, MediaWiki-Internationalization, MediaWiki-Installer
jcrespo added a comment to T228613: Re-build db2097 s1 and s6.

I agree we should keep the backup sources either the same version as the master or as >50% of the replicas (aka "upgrade it at the same time as the master"). However, if we decide to change a whole section (e.g. s1 and s6 on codfw only) to a higher version, we could also upgrade the backup source of that dc to test the backup workflow on the new version.

Dec 31 2019, 1:11 PM · DBA
jcrespo added a comment to T241638: ContainerDisabledException in ServiceContainer due to failed installation with different language then English.

I was able to reproduce it quite consistently with the given steps- it happens on final setup, on creation of the sysop account, see screenshot:

Dec 31 2019, 1:04 PM · Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), MediaWiki-ServiceContainer, MediaWiki-Internationalization, MediaWiki-Installer
jcrespo added a comment to T241638: ContainerDisabledException in ServiceContainer due to failed installation with different language then English.

My guess is there may be a dependency cycle in which the translation for the installer requires a service or string that is not yet installed. I believe similar errors happened in the past with non-standard installations, but we need the mw internationalization/dependency experts to confirm.

Dec 31 2019, 12:43 PM · Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), MediaWiki-ServiceContainer, MediaWiki-Internationalization, MediaWiki-Installer
jcrespo added a project to T241638: ContainerDisabledException in ServiceContainer due to failed installation with different language then English: MediaWiki-ServiceContainer.
Dec 31 2019, 12:39 PM · Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), MediaWiki-ServiceContainer, MediaWiki-Internationalization, MediaWiki-Installer
jcrespo awarded T241644: Too Many Requests on loading complex dynamic maps a Like token.
Dec 31 2019, 12:29 PM · Maps (Kartographer)
jcrespo assigned T241644: Too Many Requests on loading complex dynamic maps to MSantos.

Adding maintainers as per https://www.mediawiki.org/wiki/Developers/Maintainers#MediaWiki_extensions_deployed_on_the_Wikimedia_Cluster (please correct tags AND wiki if outdated)

Dec 31 2019, 12:28 PM · Maps (Kartographer)
jcrespo added a comment to T241638: ContainerDisabledException in ServiceContainer due to failed installation with different language then English.

I am no mediawiki developer, but normally one would ask for the Mediawiki version, what steps you were trying to accomplish (you seem to be running the installer?) and a copy of your configuration file.

Dec 31 2019, 9:19 AM · Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), MediaWiki-ServiceContainer, MediaWiki-Internationalization, MediaWiki-Installer
jcrespo added a project to T241638: ContainerDisabledException in ServiceContainer due to failed installation with different language then English: MediaWiki-Internationalization.
Dec 31 2019, 9:15 AM · Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), MediaWiki-ServiceContainer, MediaWiki-Internationalization, MediaWiki-Installer
jcrespo added a project to T241638: ContainerDisabledException in ServiceContainer due to failed installation with different language then English: MediaWiki-Installer.
Dec 31 2019, 9:12 AM · Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), MediaWiki-ServiceContainer, MediaWiki-Internationalization, MediaWiki-Installer
jcrespo added a project to T241638: ContainerDisabledException in ServiceContainer due to failed installation with different language then English: MediaWiki-General.
Dec 31 2019, 9:10 AM · Patch-For-Review, Core Platform Team Workboards (Clinic Duty Team), MediaWiki-ServiceContainer, MediaWiki-Internationalization, MediaWiki-Installer

Dec 30 2019

jcrespo added a comment to T241585: Facing an issue when loading a Tensorflow model.

any suggestions for troubleshooting

Dec 30 2019, 12:01 PM · Toolforge
jcrespo assigned T241535: Degraded RAID on ms-be2035 to fgiunchedi.

See also T241534, I am not sure exactly what was the problem detected.

Dec 30 2019, 11:47 AM · SRE-swift-storage, Operations, ops-codfw
jcrespo assigned T241534: Degraded RAID on ms-be2035 to Papaul.

This being a software raid please coordinate with @fgiunchedi .

Dec 30 2019, 11:45 AM · SRE-swift-storage, Operations, ops-codfw
jcrespo edited projects for T178445: flapping monitoring for recommendation_api on scb, added: Core Platform Team; removed Core Platform Team Legacy (Watching / External).

This is flapping very frequently, but with a 500, not a 429 (scb1002 only, for example, twice per hour). Should I close this and open a new one, or can this be handled here?

Dec 30 2019, 11:32 AM · Core Platform Team Workboards (Clinic Duty Team), Discovery, Recommendation-API, Wikidata, Services (watching), Operations, observability

Dec 27 2019

jcrespo added a comment to T241421: Sustained periods (2-4h) of bad latency on production-search eqiad.

This has continued in the last 7 days: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&from=1576888676529&to=1577452275258&var-cirrus_group=eqiad&var-cluster=eqiad&var-exported_cluster=production-search&var-smoothing=1

Dec 27 2019, 1:12 PM · Discovery-Search (Current work), Patch-For-Review, Operations, Traffic, Performance Issue, Elasticsearch

Dec 26 2019

jcrespo added a comment to T237033: On beta, scap can't clear opcache on some mw servers.

@hashar based on Dzhan's comment, is that something your team could handle, sending a puppet patch for the missing hiera keys there (and I can help reviewing it and deploying it)? Let me know.

Dec 26 2019, 11:20 AM · Release-Engineering-Team (Deployment services), serviceops, Operations, Beta-Cluster-Infrastructure, Scap
jcrespo changed the status of T240341: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org from Open to Stalled.

Stalled based on comments, waiting for T202684#5735025 response.

Dec 26 2019, 11:16 AM · Traffic, Operations, DNS
jcrespo added a parent task for T202684: Import the old per-year Wikimania wikis into the new Wikimania wiki, each under their namespace: T240341: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org.
Dec 26 2019, 11:16 AM · Wikimedia-Site-requests
jcrespo added a subtask for T240341: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org: T202684: Import the old per-year Wikimania wikis into the new Wikimania wiki, each under their namespace.
Dec 26 2019, 11:16 AM · Traffic, Operations, DNS
jcrespo added a comment to T240658: fastnetmon spamming /var/log on netflow hosts leading to disk saturation.

@ayounsi how high a priority would you say this ticket deserves? Spam is annoying but shouldn't be high priority; however, the disk saturation could be dangerous (I don't have enough context to be able to decide).

Dec 26 2019, 11:14 AM · Operations, netops
jcrespo triaged T240667: Ingestion errors for production logs on ELK7 as High priority.

This seems high importance, feel free to tune down if necessary.

Dec 26 2019, 11:11 AM · User-fgiunchedi, Operations, Wikimedia-Logstash
jcrespo added a comment to T240843: Track services without a native systemd unit.

How high a priority would you say this has, so we can remove it from the triage inbox?

Dec 26 2019, 11:10 AM · Operations
jcrespo assigned T240929: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman to herron.

I am personally not familiar with mailman format. Maybe @herron, our mail expert, knows how to proceed with this? Even if he cannot do it, if he knows it is a trivial procedure "copy the file to a path", I could do it myself.

Dec 26 2019, 11:08 AM · Operations, Wikimedia-Mailing-lists
jcrespo added a comment to T241309: Add more detailed instructions to the "sec-advice" page.

See also T240794, which if agreed could be done at the same time.

Dec 26 2019, 11:00 AM · Operations, Traffic
jcrespo triaged T241374: fastnetmon misreports attack type and protocol as Medium priority.
Dec 26 2019, 10:59 AM · Patch-For-Review, Operations, netops
jcrespo added a comment to T241096: Requesting access to analytics-privatedata-users and researchers for Aroraakhil.

@leila researchers typically have time-limited MOUs, is this true in this case? If so, could you share the period of time, so I can add an expiry date for the account?

Dec 26 2019, 10:21 AM · Operations, SRE-Access-Requests, Research