jcrespo (Jaime Crespo)
Sr Database Administrator

User Details

User Since
May 11 2015, 8:31 AM (192 w, 3 d)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF) [ Global Accounts ]

Recent Activity

Today

jcrespo added a comment to T188327: Deploy refactored actor storage.

^So I don't need this, but people really need to think of a way to push fresh configurations to mw tasks- or establish some policy of long-running scripts to avoid https://logstash.wikimedia.org/goto/523d9a64fb0821e25c2e84ca93502c1d

Thu, Jan 17, 12:22 PM · MW-1.32-notes, MW-1.33-notes (1.33.0-wmf.12; 2019-01-08), Patch-For-Review, Core Platform Team Kanban (Blocked Externally), Core Platform Team ( Code Health (TEC13)), Epic
jcrespo added a comment to T213973: Remove frimpressions db from prod mysql.

awight - do you need its contents? Maybe it was archived in the past, we would have to do some research about that.

Thu, Jan 17, 6:53 AM · DBA, Fundraising-Backlog
jcrespo added a comment to T213973: Remove frimpressions db from prod mysql.

Our backup system has a bug where empty dbs are not recovered; maybe it was an empty db and was deleted by accident (with no data loss?).

Thu, Jan 17, 6:51 AM · DBA, Fundraising-Backlog
jcrespo added a comment to T213973: Remove frimpressions db from prod mysql.

It should be on m2: https://wikitech.wikimedia.org/wiki/MariaDB/misc#Current_schemas_2

Thu, Jan 17, 6:45 AM · DBA, Fundraising-Backlog
jcrespo added a project to T213655: Lost file Juan_Guaidó.jpg: MediaWiki-General-or-Unknown.

Thanks, yes, as Filippo said above, it has been deleted (and it is available) on swift, but not on metadata. We can do 2 things- reupload it, or perform a deletion with SQL and recover it in the normal way. I will need help from a mw developer for the second option.

Thu, Jan 17, 6:43 AM · MediaWiki-General-or-Unknown, Operations, media-storage

Yesterday

jcrespo added a comment to T213858: s3 master emergency failover (db1075).

switchover script works as expected (tested on db1111/db1112):

Wed, Jan 16, 5:29 PM · Patch-For-Review, DBA, Operations
jcrespo added a comment to T206504: Create a new endpoint which returns articles in need of a description.

I wonder why redis- I understand the need for caching, but recently x1 section was expanded to accommodate reading lists needs, and 10 GB is small compared to the reading lists and cx-translation (in-progress translation) needs, which is kind of the same amount of data. Not against using other technology- but this looks very similar to the above mentioned features, or the pre-cached Special:* list pages ? Redis has issues with cross-dc replication, and it is slowly being removed (jobqueue was, sessions next).

Wed, Jan 16, 5:08 PM · Growth-Team, MediaWiki-extensions-GettingStarted, Wikipedia-Android-App-Backlog, Reading-Infrastructure-Team-Backlog (Kanban), Mobile-Content-Service
jcrespo closed T213422: es1019 IPMI and its management interface are unresponsive (again) as Resolved.

es1019 is back into service.

Wed, Jan 16, 3:11 PM · Patch-For-Review, Operations, ops-eqiad
jcrespo closed T213422: es1019 IPMI and its management interface are unresponsive (again), a subtask of T167121: Several hosts return "internal IPMI error" in the check_ipmi_temp check, as Resolved.
Wed, Jan 16, 3:11 PM · Patch-For-Review, monitoring, Operations
jcrespo changed the status of T196726: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues from Open to Stalled.

Stalling, no errors so far, but I doubt this is the last time we hear about this. Backups are on dbstore1001 just in case.

Wed, Jan 16, 3:10 PM · Operations, ops-eqiad, DBA
jcrespo added a comment to T213865: Failover dbproxy1003 to dbproxy1008.

So this is solved?

Wed, Jan 16, 12:01 PM · Patch-For-Review, DBA, Operations
jcrespo claimed T213422: es1019 IPMI and its management interface are unresponsive (again).

Taking care of it.

Wed, Jan 16, 11:57 AM · Patch-For-Review, Operations, ops-eqiad
jcrespo added a comment to T213864: Emergency database primary master failover on s3 primary master.

Some wikipedias will be affected; if you want the shorter list of wikis that WON'T be affected, it is at https://noc.wikimedia.org/db.php :

Wed, Jan 16, 9:28 AM · User-Johan, CommRel-Specialists-Support (Jan-Mar-2019), User-notice

Tue, Jan 15

jcrespo added a comment to T213422: es1019 IPMI and its management interface are unresponsive (again).

es1019 is back up and mgmt is working. Not starting mysql though, until Chris confirms everything is done and ok (and he is understandably busy right now).

Tue, Jan 15, 7:46 PM · Patch-For-Review, Operations, ops-eqiad
jcrespo added a comment to T196726: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues.

@Cmjohnson The most likely scenario is that we move the dimm and we keep detecting 96GB of ram, and then we will ask you to ask for a replacement. Otherwise we will reboot it and keep observing.

Tue, Jan 15, 4:43 PM · Operations, ops-eqiad, DBA
jcrespo added a comment to T213422: es1019 IPMI and its management interface are unresponsive (again).

Waiting for Chris to be available to fully shut it down (as otherwise I wouldn't be able to put it back up).

Tue, Jan 15, 4:35 PM · Patch-For-Review, Operations, ops-eqiad
jcrespo added a comment to T213655: Lost file Juan_Guaidó.jpg.

Could you try to restore it @Platonides using the wiki admin tools before trying some SQL?

Tue, Jan 15, 12:04 PM · MediaWiki-General-or-Unknown, Operations, media-storage
jcrespo created P7989 (An Untitled Masterwork).
Tue, Jan 15, 11:24 AM
jcrespo added a comment to T213674: Possible first paint regression on mobile.
2019:01:57
[10:08] <jynus> something happened yesterday at 22:40 that made things 0.5 seconds slower
[10:08] <jynus> on mobile
Tue, Jan 15, 10:59 AM · Performance-Team
jcrespo updated subscribers of T213796: Global rename of SuperVirtual → Dennis Radaelli: supervision needed.

Let's add @Anomie here once so he can verify this didn't affect ongoing actor migration as per https://wikitech.wikimedia.org/wiki/Deployments#Week_of_January_14th

Tue, Jan 15, 8:59 AM · DBA, Wikimedia-Site-requests

Mon, Jan 14

jcrespo renamed T213406: Purchase and setup remaining hosts for database backups from Purchase remaining hosts for database backups to Purchase and setup remaining hosts for database backups.
Mon, Jan 14, 2:10 PM · DBA
jcrespo added a comment to T178690: Better organization for SRE grafana dashboards.

Jaime, going to have to guess here; are you referring to "Prometheus machine stats" (marked for deletion) vs "Host overview"?

Mon, Jan 14, 1:10 PM · User-CDanis, Patch-For-Review, User-fgiunchedi, monitoring, Operations
jcrespo added a comment to T178690: Better organization for SRE grafana dashboards.

I've just seen that a dashboard I use is scheduled for deletion. I don't see the replacement as particularly better, and it is lacking. Could you have a look at how other people are doing those, such as https://pmmdemo.percona.com/graph/d/qyzrQGHmk/system-overview ? They can be downloaded at https://github.com/percona/grafana-dashboards

Mon, Jan 14, 12:45 PM · User-CDanis, Patch-For-Review, User-fgiunchedi, monitoring, Operations
jcrespo added a comment to T197616: Create a production test wiki in group0 to parallel Wikimedia Commons.

First of all, I am only commenting because I have more information, but access handling is owned by the cloud team.

Mon, Jan 14, 10:47 AM · DBA, Release-Engineering-Team (Watching / External), SDC Engineering, Wiki-Setup (Create), Wikidata, SDC General
jcrespo added a comment to T213664: correctable memory errors db1068 (commons primary master database).

I created to track it, it has gone up to 21 since yesterday. We have to consider the possibility of it crashing due to uncorrectable errors and be prepared for a failover.

Mon, Jan 14, 10:18 AM · Patch-For-Review, DBA, Operations

Sun, Jan 13

jcrespo renamed T213664: correctable memory errors db1068 (commons primary master database) from correctable memory errors db1068 (commons primary master database to correctable memory errors db1068 (commons primary master database).
Sun, Jan 13, 7:30 PM · Patch-For-Review, DBA, Operations
jcrespo created T213664: correctable memory errors db1068 (commons primary master database).
Sun, Jan 13, 7:27 PM · Patch-For-Review, DBA, Operations

Fri, Jan 11

jcrespo added a comment to T212346: Mass bigdeletion scheduled for sr.wikinews.

@Zoranzoki21 and @dungodung, as well as other subscribers- Phabricator is not the place for this kind of discussion- that should go to wiki. As @MarcoAurelio said, no deletion will happen until that happens, and even if some deletion happened already, it can be recovered. Please solve disputes on wiki, and only return to this ticket when a consensus is reached with a decision we can apply.

Fri, Jan 11, 5:28 PM · Serbian-Sites, DBA
jcrespo added a comment to T196726: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues.

More logs, confirming the module is probably dead:

Fri, Jan 11, 3:57 PM · Operations, ops-eqiad, DBA
jcrespo renamed T196726: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues from db1115 (tendril DB) had OOM for some processes to db1115 (tendril DB) had OOM for some processes and some hw (memory) issues.
Fri, Jan 11, 3:54 PM · Operations, ops-eqiad, DBA
jcrespo assigned T196726: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues to Cmjohnson.

Asking @Cmjohnson to move around this memory module- either it got disconnected or broken completely- FYI 128 GB of memory should be detected, but only 96 are (aiming for maintenance for Tuesday):

Fri, Jan 11, 3:50 PM · Operations, ops-eqiad, DBA
jcrespo reopened T196726: db1115 (tendril DB) had OOM for some processes and some hw (memory) issues as "Open".

Something else happened on the 17th:

Fri, Jan 11, 3:38 PM · Operations, ops-eqiad, DBA
jcrespo added a comment to T212386: Provide tools for querying MediaWiki replica databases without having to specify the shard.

Can we discuss how to implement these?

Fri, Jan 11, 3:24 PM · Analytics, WMDE-Analytics-Engineering, User-Addshore, User-Elukey, Research
jcrespo added a comment to T212386: Provide tools for querying MediaWiki replica databases without having to specify the shard.

@jcrespo it seems we should be able to deploy (out of the box with a new config) the tool existing in prod to the new and upcoming analytics replicas right? Am I missing something why this would not be possible?

Fri, Jan 11, 2:07 PM · Analytics, WMDE-Analytics-Engineering, User-Addshore, User-Elukey, Research
jcrespo added a comment to T212386: Provide tools for querying MediaWiki replica databases without having to specify the shard.

Even us roots have one for mysql administration:

root@cumin1001:~$ mysql.py --version
/usr/local/sbin/mysql.py  Ver 15.1 Distrib 10.1.36-MariaDB, for Linux (x86_64) using readline 5.2
Fri, Jan 11, 2:01 PM · Analytics, WMDE-Analytics-Engineering, User-Addshore, User-Elukey, Research
jcrespo added a comment to T212386: Provide tools for querying MediaWiki replica databases without having to specify the shard.

@Nuria There are apparently 2 tools (or the same one, reused), one on production, too:

Fri, Jan 11, 1:58 PM · Analytics, WMDE-Analytics-Engineering, User-Addshore, User-Elukey, Research
jcrespo removed a subtask for T213527: Prepare our base system layer for Debian buster: T193226: Test MySQL 8.0 with production data and evaluate its fit for WMF databases.
Fri, Jan 11, 9:44 AM · Patch-For-Review, Operations
jcrespo removed a parent task for T193226: Test MySQL 8.0 with production data and evaluate its fit for WMF databases: T213527: Prepare our base system layer for Debian buster.
Fri, Jan 11, 9:44 AM · Patch-For-Review, DBA
jcrespo added a parent task for T193224: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished: T213527: Prepare our base system layer for Debian buster.
Fri, Jan 11, 9:44 AM · Patch-For-Review, MediaWiki-Database, Operations, DBA
jcrespo added subtasks for T213527: Prepare our base system layer for Debian buster: T193224: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished, T193226: Test MySQL 8.0 with production data and evaluate its fit for WMF databases.
Fri, Jan 11, 9:44 AM · Patch-For-Review, Operations
jcrespo added a parent task for T193226: Test MySQL 8.0 with production data and evaluate its fit for WMF databases: T213527: Prepare our base system layer for Debian buster.
Fri, Jan 11, 9:44 AM · Patch-For-Review, DBA

Thu, Jan 10

jcrespo added a comment to T207258: rack/setup/install pc1007-pc1010.

@Cmjohnson you are the best, the worse Dell is, the more superb you are to cover for their mess. How many beers do I owe you already? XD Thanks again.

Thu, Jan 10, 5:15 PM · Patch-For-Review, Operations, ops-eqiad, DBA
jcrespo added a comment to T213422: es1019 IPMI and its management interface are unresponsive (again).

@Cmjohnson Sorry, cannot today for both organizational reasons (@ at meeting today) and technical ones (cannot depool today due to traffic without being too disruptive). Let's try Tuesday if you are ok with that?

Thu, Jan 10, 4:18 PM · Patch-For-Review, Operations, ops-eqiad
jcrespo updated the task description for T172410: Replace the current multisource analytics-store setup.
Thu, Jan 10, 1:56 PM · Product-Analytics, Analytics, WMDE-Analytics-Engineering, User-Addshore, User-Elukey, Research
jcrespo reassigned T213422: es1019 IPMI and its management interface are unresponsive (again) from jcrespo to Cmjohnson.

@Volans I have no ssh, https or ipmi access, so there is nothing I can do about it. This needs a power drain.

Thu, Jan 10, 1:36 PM · Patch-For-Review, Operations, ops-eqiad
jcrespo added a comment to T213422: es1019 IPMI and its management interface are unresponsive (again).

That was the plan :-)

Thu, Jan 10, 1:25 PM · Patch-For-Review, Operations, ops-eqiad
jcrespo added a comment to T202051: db2042 (m3) master RAID battery failed.

Sorry, I searched but I didn't find the other one; in your comment above you probably meant that one but linked to this task itself by mistake. I am ok with any method, as long as there is at least one task open.

Thu, Jan 10, 12:54 PM · User-Banyek, Operations, ops-codfw, DBA
jcrespo added parent tasks for T213422: es1019 IPMI and its management interface are unresponsive (again): T167121: Several hosts return "internal IPMI error" in the check_ipmi_temp check, T193155: IPMI Audit 2018-04.
Thu, Jan 10, 12:53 PM · Patch-For-Review, Operations, ops-eqiad
jcrespo added a subtask for T167121: Several hosts return "internal IPMI error" in the check_ipmi_temp check: T213422: es1019 IPMI and its management interface are unresponsive (again).
Thu, Jan 10, 12:53 PM · Patch-For-Review, monitoring, Operations
jcrespo added a subtask for T193155: IPMI Audit 2018-04: T213422: es1019 IPMI and its management interface are unresponsive (again).
Thu, Jan 10, 12:53 PM · Operations
jcrespo claimed T213422: es1019 IPMI and its management interface are unresponsive (again).

I will first try remote debugging techniques myself.

Thu, Jan 10, 12:52 PM · Patch-For-Review, Operations, ops-eqiad
jcrespo created T213422: es1019 IPMI and its management interface are unresponsive (again).
Thu, Jan 10, 12:51 PM · Patch-For-Review, Operations, ops-eqiad
jcrespo reopened T205257: BBU problems dbstore2002 as "Open".

Failing again, acking on icinga, reopening to not forget about it.

Thu, Jan 10, 11:56 AM · User-Banyek, DBA
jcrespo added a comment to T202051: db2042 (m3) master RAID battery failed.

Leaving it open and acking it on icinga so we don't forget about it.

Thu, Jan 10, 11:54 AM · User-Banyek, Operations, ops-codfw, DBA
jcrespo reopened T202051: db2042 (m3) master RAID battery failed as "Open".
Thu, Jan 10, 11:53 AM · User-Banyek, Operations, ops-codfw, DBA
jcrespo added a subtask for T213406: Purchase and setup remaining hosts for database backups: T213404: Design the final architecture for the database binary backups.
Thu, Jan 10, 10:40 AM · DBA
jcrespo added a parent task for T213404: Design the final architecture for the database binary backups: T213406: Purchase and setup remaining hosts for database backups.
Thu, Jan 10, 10:40 AM · DBA
jcrespo triaged T213406: Purchase and setup remaining hosts for database backups as High priority.
Thu, Jan 10, 10:39 AM · DBA
jcrespo triaged T213404: Design the final architecture for the database binary backups as High priority.
Thu, Jan 10, 10:25 AM · DBA
jcrespo raised the priority of T206203: Implement database binary backups into the production infrastructure from Normal to High.
Thu, Jan 10, 10:20 AM · Goal, DBA
jcrespo added a comment to T206203: Implement database binary backups into the production infrastructure.

I have modified the wording to reuse the meta task for the new goal, which has already solved the decision part, but still needs some design for the architecture, purchases and final implementation.

Thu, Jan 10, 10:15 AM · Goal, DBA
jcrespo renamed T206203: Implement database binary backups into the production infrastructure from Design and prepare infrastructure for database binary backups to Implement database binary backups into the production infrastructure.
Thu, Jan 10, 10:06 AM · Goal, DBA

Wed, Jan 9

jcrespo added a comment to T212861: Rack A2's hosts alarm for PSU broken.

I rebuilt db1082. We are not a blocker for any maintenance on those servers, but we would prefer to stop mysql if there is a chance of the server losing power: while a power loss does not cause any user-visible outage, it is very time consuming for us to recover a pooled server, and it takes very little time to depool it and stop it.

Wed, Jan 9, 4:26 PM · Analytics, ops-eqiad, Operations
jcrespo closed T213108: db1082 power loss resulted on mysql crash as Resolved.

db1082 is fully repooled; it and db1124 had gtid re-enabled.

Wed, Jan 9, 4:24 PM · Patch-For-Review, Data-Services, DBA, Operations
jcrespo closed T213108: db1082 power loss resulted on mysql crash, a subtask of T212861: Rack A2's hosts alarm for PSU broken, as Resolved.
Wed, Jan 9, 4:24 PM · Analytics, ops-eqiad, Operations
jcrespo added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

Looking at the logs, the issue (lock wait timeout) I see now is with User::loadFromDatabase (SELECT user_id,user_name,user_real_name,user_email,user_touched,user_token,user_email_authenticated,user_email_token,user_email_token_expires,user_registration,user_editcount,user_actor.actor_id FROM user LEFT JOIN actor user_actor ON ((user_actor.actor_user = user_id)) WHERE user_id = '1983946' LIMIT 1 FOR UPDATE) so maybe this can be closed as resolved?

Wed, Jan 9, 1:43 PM · Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
jcrespo added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

I assume the 6 errors correlate to the 6 edits in the same minute, given the update happens post-send (it is already biased towards later). As such, it would seem that 6 out of 6 timed out.

Wed, Jan 9, 1:38 PM · Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database

Tue, Jan 8

jcrespo added a comment to T213108: db1082 power loss resulted on mysql crash.

This is mostly fixed, except gtid must be enabled on db1082 and db1124, plus db1082 must be repooled.

Tue, Jan 8, 7:01 PM · Patch-For-Review, Data-Services, DBA, Operations
jcrespo added a comment to T197616: Create a production test wiki in group0 to parallel Wikimedia Commons.

By the way, are people aware that a shard called "test-s4" has 2 dedicated large hosts and is ready to be used for production? I think it was used by Anomie and DanielK to test MCR; could it be shared for whatever testcommonswiki is being used for?

Tue, Jan 8, 3:50 PM · DBA, Release-Engineering-Team (Watching / External), SDC Engineering, Wiki-Setup (Create), Wikidata, SDC General
jcrespo added a comment to T213154: /api/rest_v1/page/pdf/* service unstable.

If this is a known, ongoing, in-process-of-being-decommissioned issue, you can close this ticket, no reason to keep it open. But I would suggest sending an email to ops@ linking to the above comment and saying so (I didn't know this, and probably more people didn't either, but it sends alerts to icinga).

Tue, Jan 8, 12:47 PM · Core Platform Team Backlog (Attic), Services (attic), Electron-PDFs, Readers-Web-Backlog, RESTBase-API
jcrespo added a comment to T212386: Provide tools for querying MediaWiki replica databases without having to specify the shard.

the best way to accomplish this would probably be a library

Tue, Jan 8, 12:34 PM · Analytics, WMDE-Analytics-Engineering, User-Addshore, User-Elukey, Research
jcrespo added a comment to T212386: Provide tools for querying MediaWiki replica databases without having to specify the shard.

There is already an 'sql' tool that developers who query production use without having to know the underlying mediawiki topology (100 servers)- it could probably be adapted for analytics hosts?

Tue, Jan 8, 12:27 PM · Analytics, WMDE-Analytics-Engineering, User-Addshore, User-Elukey, Research
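
The idea in the comment above, querying by wiki name without knowing the shard, can be sketched as a small lookup library. This is only an illustration, not the existing 'sql' tool: the mapping tables, host names, ports and the `resolve_replica` helper are all hypothetical placeholders, and a real version would read the topology from configuration.

```python
# Hypothetical topology tables; the values are illustrative placeholders only.
SECTION_OF = {
    "commonswiki": "s4",
    "testwiki": "s3",
}
REPLICA_OF = {
    "s3": ("replica-s3.example", 3313),
    "s4": ("replica-s4.example", 3314),
}


def resolve_replica(wiki):
    """Return (section, host, port) for a wiki database name.

    Raises KeyError when the wiki is not in the (hypothetical) mapping,
    so callers never silently query the wrong section.
    """
    try:
        section = SECTION_OF[wiki]
    except KeyError:
        raise KeyError("unknown wiki: %s" % wiki) from None
    host, port = REPLICA_OF[section]
    return section, host, port


# The caller names only the wiki; the section is looked up for them.
section, host, port = resolve_replica("commonswiki")
```

Keeping the wiki-to-section and section-to-host tables separate means a wiki can be moved to another section by changing a single entry, without touching calling code.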
jcrespo created T213154: /api/rest_v1/page/pdf/* service unstable.
Tue, Jan 8, 10:13 AM · Core Platform Team Backlog (Attic), Services (attic), Electron-PDFs, Readers-Web-Backlog, RESTBase-API
jcrespo added a comment to T213108: db1082 power loss resulted on mysql crash.

db1124:s5 stopped at db1082-bin.002490:667685191

Tue, Jan 8, 9:25 AM · Patch-For-Review, Data-Services, DBA, Operations

Mon, Jan 7

jcrespo moved T213108: db1082 power loss resulted on mysql crash from Triage to Next on the DBA board.
Mon, Jan 7, 6:37 PM · Patch-For-Review, Data-Services, DBA, Operations
jcrespo updated subscribers of T213108: db1082 power loss resulted on mysql crash.
Mon, Jan 7, 6:37 PM · Patch-For-Review, Data-Services, DBA, Operations
jcrespo claimed T213108: db1082 power loss resulted on mysql crash.

I plan to take care of this tomorrow morning.

Mon, Jan 7, 6:37 PM · Patch-For-Review, Data-Services, DBA, Operations
jcrespo triaged T213108: db1082 power loss resulted on mysql crash as High priority.
Mon, Jan 7, 6:36 PM · Patch-For-Review, Data-Services, DBA, Operations
jcrespo added a comment to T212861: Rack A2's hosts alarm for PSU broken.

^CC @Marostegui so you know why db1082 + db1124 + labsdb replication (s5) are broken or stopped

Mon, Jan 7, 6:23 PM · Analytics, ops-eqiad, Operations
jcrespo added a comment to T212861: Rack A2's hosts alarm for PSU broken.

I am creating a subtask to fix db1082, which may have to be reimaged because of the power loss.

Mon, Jan 7, 6:22 PM · Analytics, ops-eqiad, Operations
jcrespo added a comment to T197616: Create a production test wiki in group0 to parallel Wikimedia Commons.

I've been told there was some breakage based on assuming s4 ==> commons, or commons ==> s4. I am not too worried, as I said, about a temporary project, but having that assumption in code or configuration is worrying, as it would not be unthinkable that we move commonswiki to a separate group in the future.

Mon, Jan 7, 4:56 PM · DBA, Release-Engineering-Team (Watching / External), SDC Engineering, Wiki-Setup (Create), Wikidata, SDC General
jcrespo added a comment to T212493: Clean up staging db.

May I suggest a different route? Let's migrate the mediawiki replicated tables first- then migrate the staging ones on a per-case basis. After all, it makes no sense to copy them to, which of the new servers? Once we check the replication works as intended, we can ask what to keep and what to remove. In some cases, users may prefer to regenerate them from "fresh data"? Just an idea.

Mon, Jan 7, 4:53 PM · Analytics-Kanban, Analytics
jcrespo added a comment to T197616: Create a production test wiki in group0 to parallel Wikimedia Commons.

I agree with Manuel T197616, I would have preferred creating it on s3 for isolation reasons- enwiki, commons and wikidata require more resources than the typical high-throughput project and they were on purpose set on dedicated hardware. I understand that you want a setup as similar as possible to the actual commonswiki, but from our point of view, s0 deployments are the ones more likely to create outages, and the above wikis, plus metawiki and centralauth, are on purpose separate from group0 ones to minimize impact. Also the above 3 wikis have a large amount of hardware behind them, which makes testcommonswiki overprovisioned in some aspects.

Mon, Jan 7, 3:46 PM · DBA, Release-Engineering-Team (Watching / External), SDC Engineering, Wiki-Setup (Create), Wikidata, SDC General
jcrespo added a comment to T207258: rack/setup/install pc1007-pc1010.

I would like to insist on this issue now that the holiday is over- while the service (parsercache) is not affected at this time, we are in a no-hw redundancy mode on eqiad, and after all it was the vendor that sent faulty hardware in the first place. Please escalate to us or a manager if you need help "fighting". Happy 2019 and thanks!

Mon, Jan 7, 11:00 AM · Patch-For-Review, Operations, ops-eqiad, DBA
jcrespo added a parent task for T134252: Avoid "Lock wait timeout exceeded" from MessageGroupStats::clear: T30499: 1205: Lock wait timeout exceeded; try restarting transaction (tracking).
Mon, Jan 7, 10:44 AM · MW-1.33-notes (1.33.0-wmf.6; 2018-11-27), Language-Team (Language-2018-October-December), Wikimedia-production-error, MediaWiki-extensions-Translate
jcrespo added a subtask for T30499: 1205: Lock wait timeout exceeded; try restarting transaction (tracking): T134252: Avoid "Lock wait timeout exceeded" from MessageGroupStats::clear.
Mon, Jan 7, 10:44 AM · Technical-Debt, Tracking, MediaWiki-Database
jcrespo awarded T208909: [Bug] Update old nonuniformly distributed page_random values a Yellow Medal token.
Mon, Jan 7, 10:41 AM · MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), DBA, MediaWiki-General-or-Unknown
jcrespo awarded T208750: Requesting access to graphite hosts for addshore a Like token.
Mon, Jan 7, 10:35 AM · Patch-For-Review, User-Addshore, Operations, WMDE-Analytics-Engineering, Graphite, SRE-Access-Requests
jcrespo added a comment to T196153: Quarry cannot save queries with emojies.

Regarding the second error- binary strings are not text, so they must be converted to python strings explicitly after driver execution.

Mon, Jan 7, 10:35 AM · I18n, Quarry
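
The point above about binary strings can be illustrated with a minimal sketch. It assumes a database driver that returns `bytes` for binary columns and that the stored data is UTF-8; the `decode_row` helper and the sample row are hypothetical and are not taken from Quarry's code.

```python
def decode_row(row, encoding="utf-8"):
    """Return a copy of a result row with every bytes value decoded to str.

    Non-bytes values (integers, None, already-decoded strings) pass through
    unchanged.
    """
    return tuple(
        value.decode(encoding) if isinstance(value, (bytes, bytearray)) else value
        for value in row
    )


# A made-up row as a driver might return it for binary columns:
# the second field holds UTF-8 bytes, including a 4-byte emoji.
raw_row = (42, b"caf\xc3\xa9 \xf0\x9f\x98\x80", None)
decoded = decode_row(raw_row)
```

Decoding explicitly, right after the driver returns, keeps the choice of encoding in one visible place instead of relying on implicit conversions later on.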

Nov 30 2018

jcrespo added a comment to T210701: ORES 500s since 2018-11-29 6:25.

Thank you, then I guess this can be closed as resolved, or I will let you handle it as you prefer.

Nov 30 2018, 3:51 PM · User-Ladsgroup, Scoring-platform-team (Current), Operations, ORES
jcrespo updated subscribers of T210701: ORES 500s since 2018-11-29 6:25.

Related to T210610 or T210575, or nothing to do? CC @Ladsgroup

Nov 30 2018, 3:48 PM · User-Ladsgroup, Scoring-platform-team (Current), Operations, ORES
jcrespo moved T210223: Post hold because of "invalid headers" in wikimediacz-l from Backlog to Radar on the Operations board.
Nov 30 2018, 3:43 PM · User-Urbanecm, Operations, Wikimedia-Mailing-lists
jcrespo assigned T210223: Post hold because of "invalid headers" in wikimediacz-l to Urbanecm.

Assigning it to you- unclaim it if it doesn't work and you need more help, or close it at a later time if the fix works.

Nov 30 2018, 3:42 PM · User-Urbanecm, Operations, Wikimedia-Mailing-lists
jcrespo assigned T210846: Ship Grafana server logs to ELK to herron.

You seem to be working on this, do you mind if I assign it to you (you can unclaim it if you want, later), that way it is clear someone is actively working on it, for organization purposes?

Nov 30 2018, 3:40 PM · Patch-For-Review, Wikimedia-Logstash, Operations
jcrespo changed the status of T143896: MySQL metrics monitoring from Open to Stalled.
Nov 30 2018, 3:38 PM · monitoring, DBA, Patch-For-Review, Operations, Prometheus-metrics-monitoring
jcrespo added a comment to T209773: RFC: Proposal to add wl_addedtimestamp attribute to the watchlist table.

To further clarify on the Blocked external/Not db team by Manuel, this seems fairly simple and straightforward [famous last words], and not worrying storage-wise; you don't need any previous discussion with us to work on it- if agreed. We would like, however, to review potential new queries on implementation, to make sure indexing is used appropriately (it most likely will need a new index to filter on it, and that may not be that simple except on trivial usages- e.g. T209773#4783873 will need thorough review to support large watchlists). We are obviously always open for questions (ping us)- but we are not leading this work.

Nov 30 2018, 3:34 PM · DBA, TechCom-RFC, MediaWiki-Database
jcrespo added a comment to T176370: Migrate to PHP 7 in WMF production.

Sorry, I found this very interesting case, but I don't know a better ticket to report it. I saw on the mediawiki error logs that every minute, a request to [[User:Acer/Simple1]] on enwiki was done (oldid 844560394, in case it is edited or deleted). It returned 500 errors every time- the page is very large, contains 50000 wikilinks (I am guessing because it took a long time to render).

Nov 30 2018, 3:21 PM · Core Platform Team Kanban (Doing), Core Platform Team (PHP7 (TEC4)), Patch-For-Review, TechCom-RFC (TechCom-Approved), User-ArielGlenn, HHVM, Operations
jcrespo awarded T176370: Migrate to PHP 7 in WMF production a Love token.
Nov 30 2018, 3:21 PM · Core Platform Team Kanban (Doing), Core Platform Team (PHP7 (TEC4)), Patch-For-Review, TechCom-RFC (TechCom-Approved), User-ArielGlenn, HHVM, Operations
jcrespo added a comment to T210824: Barack Obama and other pages significant performance drop.

Thank you, I will ack the alerts on icinga- that was the main trigger of this.

Nov 30 2018, 2:55 PM · Regression, Performance-Team
jcrespo added a comment to T210478: Migrate dbstore1002 to a multi instance setup on dbstore100[3-5].

We can use the new ones

Nov 30 2018, 1:46 PM · Patch-For-Review, User-Banyek, Analytics-Kanban, DBA, Analytics