Marostegui (Manuel Aróstegui)
User

User Details

User Since: Sep 1 2016, 6:48 AM (137 w, 4 d)
Availability: Available
IRC Nick: marostegui
LDAP User: Marostegui
MediaWiki User: MArostegui (WMF) [ Global Accounts ]

TZ: UTC +1/+2

Recent Activity

Today

Marostegui added a comment to T221463: questions about standalone wmf-mariadb103.

We will at some point go for 10.3 but I don't think we are ready for it. It hasn't been properly tested yet, if you want to be an early adopter within WMF that's fine by me, otherwise I would suggest to go for 10.1 and migrate once we are a bit more sure about 10.3 :-)

Mon, Apr 22, 3:49 PM · Patch-For-Review, DBA
Marostegui added a comment to T221159: FY18/19 TEC1.6 Q4: Improve or replace GTID + pt-heartbeat logic for cross-DC.

For those not familiar with the chronology protector (https://www.mediawiki.org/wiki/Manual:MediaWiki_architecture#Database_and_text_storage):

MediaWiki's "chronology protector" ensures that replication lag never causes a user to see a page that claims an action they've just performed hasn't happened yet. This is done by storing the master's position in the user's session if a request they made resulted in a write query. The next time the user makes a read request, the load balancer reads this position from the session, and tries to select a slave that has caught up to that replication position to serve the request. If none is available, it will wait until one is. It may appear to other users as though the action hasn't happened yet, but the chronology remains consistent for each user.
Mon, Apr 22, 2:29 PM · User-mobrovac, Services (watching), Goal, Core Platform Team Backlog (Watching / External), MediaWiki-Database, Core Platform Team (Multi-DC (TEC1)), Performance-Team, DBA
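
The replica selection described in the quote above can be illustrated with a minimal Python sketch. This is not MediaWiki's implementation: the `Replica` class, the integer positions, and the function name are all hypothetical and only show the idea of comparing a session-stored position against what each replica has applied.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    applied_position: int  # hypothetical: last replication position this replica has applied


def pick_caught_up_replica(replicas: List[Replica], session_position: Optional[int]) -> Optional[Replica]:
    """Return a replica that has applied at least the position stored in the session."""
    if session_position is None:
        # No recent write recorded for this user, so any replica will do.
        return replicas[0] if replicas else None
    for replica in replicas:
        if replica.applied_position >= session_position:
            return replica
    # No replica has caught up yet; a caller would wait and retry.
    return None


replicas = [Replica("replica-a", 1040), Replica("replica-b", 1055)]
# Assume the user's last write was recorded at master position 1050.
chosen = pick_caught_up_replica(replicas, 1050)
print(chosen.name if chosen else "wait")
```

Returning `None` stands in for the "wait until one is available" step; a real load balancer would block or poll rather than give up.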
Marostegui edited projects for T221541: Adding tags hstore GIN indexes to the OSM database on osmdb.eqiad.wmnet for performance, added: PostgreSQL; removed DBA.

OSM is PostgreSQL, which is not maintained by the DBA team.

Mon, Apr 22, 1:00 PM · PostgreSQL, Cloud-Services, Maps (Maps-data)
Marostegui added a comment to T221449: Redesign querycache* tables.

The first two tables, querycache and querycachetwo, do not even have a UNIQUE key that we could convert to a PK, so adding a PK there with the existing data would indeed be very tricky, not to mention that we might even have rows that are exactly the same.

Mon, Apr 22, 7:48 AM · MediaWiki-Database
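
One way to check for the exact-duplicate rows mentioned above is to group by every column and keep groups with more than one row. The sketch below is an assumption-laden illustration: it uses SQLite in memory as a stand-in for MariaDB, a simplified four-column version of querycache, and made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for querycache: no PRIMARY KEY and no UNIQUE constraint.
conn.execute(
    "CREATE TABLE querycache (qc_type TEXT, qc_value INTEGER, qc_namespace INTEGER, qc_title TEXT)"
)
conn.executemany(
    "INSERT INTO querycache VALUES (?, ?, ?, ?)",
    [
        ("Ancientpages", 10, 0, "Foo"),
        ("Ancientpages", 10, 0, "Foo"),  # exact duplicate of the previous row
        ("Ancientpages", 7, 0, "Bar"),
    ],
)

# Group by every column; any group with more than one row is an exact duplicate.
duplicates = conn.execute(
    """
    SELECT qc_type, qc_value, qc_namespace, qc_title, COUNT(*) AS copies
    FROM querycache
    GROUP BY qc_type, qc_value, qc_namespace, qc_title
    HAVING COUNT(*) > 1
    """
).fetchall()
print(duplicates)
```

Any row this query returns would violate a primary key built from the existing columns, which is why the duplicates would have to be resolved first.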
Marostegui added a parent task for T219493: Decommission 2 codfw x1 hosts db2033 and db2034: T221533: Decommission old coredb machines (<=db2042).
Mon, Apr 22, 6:53 AM · Patch-For-Review, DBA
Marostegui added a subtask for T221533: Decommission old coredb machines (<=db2042): T219493: Decommission 2 codfw x1 hosts db2033 and db2034.
Mon, Apr 22, 6:53 AM · DBA
Marostegui added a parent task for T220070: Decommission db2033: T38220: The stashimageinfo module shouldn't be a prop querymodule (it doesn't accept titles and doesn't work with generators).
Mon, Apr 22, 6:53 AM · decommission, ops-codfw, Operations
Marostegui added a subtask for T38220: The stashimageinfo module shouldn't be a prop querymodule (it doesn't accept titles and doesn't work with generators): T220070: Decommission db2033.
Mon, Apr 22, 6:53 AM · MediaWiki-API
Marostegui changed the status of T221533: Decommission old coredb machines (<=db2042) from Open to Stalled.

Stalling as this cannot really proceed until the new hosts are installed (T221532)

Mon, Apr 22, 6:52 AM · DBA
Marostegui created T221533: Decommission old coredb machines (<=db2042).
Mon, Apr 22, 6:51 AM · DBA
Marostegui added a comment to T221532: rack/setup/install db2[103-120].codfw.wmnet (18 hosts).

Keep in mind that db2033 can be decommissioned (it is on C6) T220070

Mon, Apr 22, 6:39 AM · ops-codfw, Goal, Operations, DBA
Marostegui triaged T221532: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) as Normal priority.
Mon, Apr 22, 5:53 AM · ops-codfw, Goal, Operations, DBA
Marostegui created T221532: rack/setup/install db2[103-120].codfw.wmnet (18 hosts).
Mon, Apr 22, 5:53 AM · ops-codfw, Goal, Operations, DBA
Marostegui added a comment to T221502: db1099 memory issues.

I rebooted the host to see if the memory errors would clear up, but they did not, so I guess we have to either contact Dell or move the DIMM to a different slot and wait to see if the error reappears in the new location.
@Cmjohnson please advise

Mon, Apr 22, 5:33 AM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui added a comment to T220940: Abstracts dumps for Commons running very slowly.

@Marostegui sorry to ping you again but we'd like your expertise: I can use this workaround of using the query as is and throwing away the revisions we don't want (LIMIT 50000 always), or we can change it to join on page_namespace right in the query (but the LIMIT will still be 50k). The upside of the second is that much less data will be sent, but the downside is that it will take a lot longer to hit that LIMIT. What do you think about this tradeoff? Which is harder on the db servers, and is the difference appreciable either way?

Mon, Apr 22, 5:16 AM · Patch-For-Review, Dumps-Generation
Marostegui triaged T221463: questions about standalone wmf-mariadb103 as Normal priority.
Mon, Apr 22, 5:13 AM · Patch-For-Review, DBA
Marostegui updated subscribers of T221463: questions about standalone wmf-mariadb103.

Is there any specific reason to use 10.3 instead of 10.1?
@jcrespo is our package master, so he can probably provide more info about it. However, I believe 10.3 isn't fully ready yet from a package point of view, configuration+puppet as well.

Mon, Apr 22, 5:12 AM · Patch-For-Review, DBA
Marostegui added a comment to T220246: Session storage service Cassandra schema.

I was also wondering if @Marostegui and @jcrespo could share insights from how we manage schema changes and configuration changes on mysql.

What's the best way to keep track of configuration changes that you apply at runtime?

Mon, Apr 22, 4:58 AM · Core Platform Team (Session Management Service (CDP2)), User-Clarakosi, Core Platform Team Backlog (Next), User-Eevans

Yesterday

Marostegui added a comment to T200297: Introduce a new namespace for collaborative judgements about wiki entities.

As far as I remember (it has been a while) all the stuff that was sent for me to review was reviewed and I believe it was even merged.

Sun, Apr 21, 4:19 PM · MW-1.33-notes (1.33.0-wmf.14; 2019-01-22), Patch-For-Review, Scoring-platform-team (Current), DBA, Operations, Jade, TechCom-RFC
Marostegui added a comment to T221357: Read timeout reached while viewing AbuseLog.

There is no difference, I don't think it is an optimizer bug, it is an issue with CAST.
I personally don't have much experience with casting, but from the mysql manual:

If you convert an indexed column using BINARY, CAST(), or CONVERT(), MySQL may not be able to use the index efficiently.
Sun, Apr 21, 3:59 PM · MW-1.34-notes (1.34.0-wmf.1; 2019-04-16), DBA, AbuseFilter, Wikimedia-production-error
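
The effect the manual describes can be reproduced in miniature. The sketch below uses SQLite in memory, not MySQL or MariaDB, so the planner and its output differ from production; the table `demo_log` and index `demo_filter_idx` are hypothetical names. It only illustrates the general point that wrapping an indexed column in CAST can stop the planner from using that index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_log (filter_id INTEGER, note TEXT)")
conn.execute("CREATE INDEX demo_filter_idx ON demo_log (filter_id)")


def plan(sql: str) -> str:
    """Return the detail column of the first EXPLAIN QUERY PLAN row."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]


# Comparing the bare column lets the planner search the index.
direct_plan = plan("SELECT * FROM demo_log WHERE filter_id = 5")
# Wrapping the column in CAST hides it from the index lookup.
cast_plan = plan("SELECT * FROM demo_log WHERE CAST(filter_id AS TEXT) = '5'")
print(direct_plan)
print(cast_plan)
```

In SQLite the first plan is an index search and the second is a table scan; the exact wording of the plan text depends on the SQLite version.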
Marostegui updated the task description for T208323: Predictive failures on disk S.M.A.R.T. status.
Sun, Apr 21, 7:02 AM · Operations, DBA
Marostegui moved T221512: Degraded RAID on db2037 from Triage to In progress on the DBA board.
Sun, Apr 21, 7:01 AM · DBA, Operations, ops-codfw
Marostegui assigned T221512: Degraded RAID on db2037 to Papaul.

Let's get it replaced
Thanks!

Sun, Apr 21, 7:00 AM · DBA, Operations, ops-codfw
Marostegui created T221511: Possible full scan query ApiQueryUserContribs::execute for revision_actor_temp table on commonswiki.
Sun, Apr 21, 6:51 AM · Performance, Core Platform Team, MediaWiki-Database
Marostegui edited projects for T217481: Slow queries on abuse_filter_log using afl_action or afl_actions, added: User-Marostegui; removed DBA.

Untagging us as there is nothing actionable here for us. I will remain subscribed in case you've got questions or further requests, happy to help!

Sun, Apr 21, 6:35 AM · User-Marostegui, User-Daimona, AbuseFilter
Marostegui moved T221357: Read timeout reached while viewing AbuseLog from Triage to In progress on the DBA board.

It is not really making any difference unfortunately, the optimizer still thinks it is better to do a full scan (I guess the CAST is the culprit here):

SELECT  * FROM `abuse_filter_log` FORCE INDEX(afl_timestamp) LEFT JOIN `abuse_filter` ON ((CAST( af_id AS BINARY )=afl_filter)) WHERE afl_deleted = '0' ORDER BY afl_timestamp DESC LIMIT 51
Sun, Apr 21, 6:31 AM · MW-1.34-notes (1.34.0-wmf.1; 2019-04-16), DBA, AbuseFilter, Wikimedia-production-error
Marostegui moved T221458: Special:Log on commons -- entire web request took longer than 60 seconds and timed out from Triage to In progress on the DBA board.
Sun, Apr 21, 6:09 AM · Performance, Core Platform Team, MediaWiki-Logging, MediaWiki-Database, DBA, Operations, Wikimedia-production-error
Marostegui moved T221502: db1099 memory issues from Triage to In progress on the DBA board.
Sun, Apr 21, 6:09 AM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui added a comment to T221508: webperf2001 is running out of disk space.

This will get full in a matter of minutes again:

root@webperf2001:/var/log# ls -lh messages user.log
-rw-r----- 1 root adm 1.3G Apr 21 05:27 messages
-rw-r----- 1 root adm 989M Apr 21 05:27 user.log
Sun, Apr 21, 5:29 AM · Operations, Performance-Team
Marostegui added a comment to T221508: webperf2001 is running out of disk space.

The host was completely full:

root@webperf2001:/var/log# df -hT
Filesystem     Type      Size  Used Avail Use% Mounted on
udev           devtmpfs  3.9G     0  3.9G   0% /dev
tmpfs          tmpfs     799M   81M  719M  11% /run
/dev/vda1      ext4       49G   49G     0 100% /
tmpfs          tmpfs     4.0G   12K  4.0G   1% /dev/shm
tmpfs          tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs          tmpfs     4.0G     0  4.0G   0% /sys/fs/cgroup
tmpfs          tmpfs     799M     0  799M   0% /run/user/15343
Sun, Apr 21, 5:22 AM · Operations, Performance-Team
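
Output like the `df -hT` listing above is easy to scan automatically. The following is a minimal sketch, assuming the seven-column layout shown in the comment; the function name and the 90% threshold are arbitrary choices for illustration, and the sample is a shortened copy of the listing.

```python
def full_filesystems(df_output: str, threshold: int = 90):
    """Return (mount point, use%) pairs at or above the threshold."""
    results = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Columns: Filesystem, Type, Size, Used, Avail, Use%, Mounted on
        use_percent = int(fields[5].rstrip("%"))
        if use_percent >= threshold:
            results.append((fields[6], use_percent))
    return results


sample = """\
Filesystem     Type      Size  Used Avail Use% Mounted on
udev           devtmpfs  3.9G     0  3.9G   0% /dev
/dev/vda1      ext4       49G   49G     0 100% /
tmpfs          tmpfs     4.0G   12K  4.0G   1% /dev/shm
"""
print(full_filesystems(sample))
```

The parser splits on whitespace, so it would need adjusting for mount points that contain spaces.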

Sat, Apr 20

Marostegui triaged T221502: db1099 memory issues as Normal priority.
Sat, Apr 20, 5:55 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui created T221502: db1099 memory issues.
Sat, Apr 20, 5:55 PM · Patch-For-Review, ops-eqiad, DBA, Operations
Marostegui added a project to T221458: Special:Log on commons -- entire web request took longer than 60 seconds and timed out: Core Platform Team.
Sat, Apr 20, 4:50 PM · Performance, Core Platform Team, MediaWiki-Logging, MediaWiki-Database, DBA, Operations, Wikimedia-production-error
Marostegui updated subscribers of T221458: Special:Log on commons -- entire web request took longer than 60 seconds and timed out.

This is the query I believe - looks like the optimizer is being dumb again:

SELECT /* IndexPager::buildQueryInfo (LogPager) */
  log_id, log_type, log_action, log_timestamp, log_namespace, log_title, log_params, log_deleted,
  user_id, user_name, user_editcount,
  comment_log_comment.comment_text AS `log_comment_text`,
  comment_log_comment.comment_data AS `log_comment_data`,
  comment_log_comment.comment_id AS `log_comment_cid`,
  actor_log_user.actor_user AS `log_user`,
  actor_log_user.actor_name AS `log_user_text`,
  log_actor,
  (SELECT GROUP_CONCAT(ctd_name SEPARATOR ',')
   FROM `change_tag` JOIN `change_tag_def` ON ((ct_tag_id=ctd_id))
   WHERE ct_log_id=log_id) AS `ts_tags`
FROM `logging`
JOIN `comment` `comment_log_comment` ON ((comment_log_comment.comment_id = log_comment_id))
JOIN `actor` `actor_log_user` ON ((actor_log_user.actor_id = log_actor))
LEFT JOIN `user` ON ((user_id=actor_log_user.actor_user))
WHERE (log_type NOT IN ('spamblacklist','titleblacklist','abusefilterprivatedetails','oath','suppress'))
  AND (log_type != 'thanks')
  AND (log_type != 'patrol')
  AND (log_type != 'tag')
ORDER BY log_timestamp DESC
LIMIT 51
Sat, Apr 20, 3:49 PM · Performance, Core Platform Team, MediaWiki-Logging, MediaWiki-Database, DBA, Operations, Wikimedia-production-error
Marostegui added a parent task for T221424: decommission db2014,db2020, db2021, db2022, db2024, db2031: T176243: Decommission database hosts <= db2031 (tracking).
Sat, Apr 20, 6:32 AM · Operations, ops-codfw, DC-Ops, decommission
Marostegui added a subtask for T176243: Decommission database hosts <= db2031 (tracking): T221424: decommission db2014,db2020, db2021, db2022, db2024, db2031.
Sat, Apr 20, 6:32 AM · Patch-For-Review, Goal, DBA
Marostegui removed a project from T221424: decommission db2014,db2020, db2021, db2022, db2024, db2031: DBA.

All those hosts were decommissioned as part of T176243, so probably a leftover from that.
Removing our tag as there is nothing for us to do. I will keep subscribed to this task in case our help is needed to clarify something.
Thanks!

Sat, Apr 20, 6:29 AM · Operations, ops-codfw, DC-Ops, decommission
Marostegui moved T221481: Degraded RAID on db2047 from Triage to In progress on the DBA board.
Sat, Apr 20, 6:25 AM · DBA, Operations, ops-codfw
Marostegui assigned T221481: Degraded RAID on db2047 to Papaul.

Let's replace the failed disk first only, disk #12

Sat, Apr 20, 6:24 AM · DBA, Operations, ops-codfw

Fri, Apr 19

Marostegui triaged T221201: Prepare and check storage layer for initiativeswiki as Normal priority.
Fri, Apr 19, 3:55 PM · Cloud-Services, DBA
Marostegui added a comment to T208323: Predictive failures on disk S.M.A.R.T. status.

db2047 has another failed disk:

logicaldrive 1 (3.3 TB, RAID 1+0, OK)
Fri, Apr 19, 8:07 AM · Operations, DBA
Marostegui added a comment to T213664: correctable memory errors db1068 (commons primary master database).

Thanks for letting us know!
This master will be replaced once the hosts at T211613: rack/setup/install db11[26-38].eqiad.wmnet are racked and installed.

Fri, Apr 19, 8:05 AM · Patch-For-Review, DBA, Operations
Marostegui closed T149670: Predictive disk failure on db2047 as Resolved.

Thanks! We are tracking those at T208323, and as we have many, we are waiting for them to fully fail before replacing them (sometimes that takes months). Closing this again as it is covered by the other task; an automatic task will be created once the disk has fully failed.
Thanks for letting us know though, much appreciated!

Fri, Apr 19, 6:03 AM · ops-codfw, Operations
Marostegui added a comment to T211613: rack/setup/install db11[26-38].eqiad.wmnet.

I have increased the priority because the s4 master is having memory errors again and needs to be replaced as soon as we can.

Fri, Apr 19, 5:15 AM · Goal, Patch-For-Review, DBA, ops-eqiad, User-Marostegui, Operations

Thu, Apr 18

Marostegui moved T221159: FY18/19 TEC1.6 Q4: Improve or replace GTID + pt-heartbeat logic for cross-DC from Triage to Meta/Epic on the DBA board.
Thu, Apr 18, 6:32 PM · User-mobrovac, Services (watching), Goal, Core Platform Team Backlog (Watching / External), MediaWiki-Database, Core Platform Team (Multi-DC (TEC1)), Performance-Team, DBA
Marostegui moved T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout from Triage to In progress on the DBA board.
Thu, Apr 18, 6:32 PM · Patch-For-Review, MediaWiki-API, DBA
Marostegui triaged T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout as Normal priority.
Thu, Apr 18, 6:31 PM · Patch-For-Review, MediaWiki-API, DBA
Marostegui raised the priority of T211613: rack/setup/install db11[26-38].eqiad.wmnet from Normal to High.
Thu, Apr 18, 6:28 PM · Goal, Patch-For-Review, DBA, ops-eqiad, User-Marostegui, Operations
Marostegui added a comment to T221357: Read timeout reached while viewing AbuseLog.

@Daimona if you let me know the query you'd like to try with the FORCE I can try to run it on a core host for you.
Also let me know which wiki would be good for that test

Thu, Apr 18, 6:00 PM · MW-1.34-notes (1.34.0-wmf.1; 2019-04-16), DBA, AbuseFilter, Wikimedia-production-error

Tue, Apr 16

Marostegui updated subscribers of T218985: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves).

So from DC Ops side only missing the production DNS entries?
Thanks Chris!

Tue, Apr 16, 8:19 PM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui added a comment to T219399: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts.

Yaaay!

Tue, Apr 16, 6:56 PM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui added a comment to T220480: Migration Plan 3.

Sounds good to me!

Tue, Apr 16, 2:19 PM · User-Ladsgroup, Wikidata wb_terms Trailblazing
Marostegui added a comment to T210725: Replace parsercache keys to something more meaningful on db-XXXX.php.
pc[2007-2010].codfw.wmnet,pc[1007-1010].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) pc2010.codfw.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.3T  2.2T  51% /srv
===== NODE GROUP =====
(1) pc2009.codfw.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.4T  2.0T  55% /srv
===== NODE GROUP =====
(1) pc2007.codfw.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.1T  2.3T  49% /srv
===== NODE GROUP =====
(1) pc2008.codfw.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.4T  2.0T  54% /srv
===== NODE GROUP =====
(1) pc1007.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.3T  2.1T  52% /srv
===== NODE GROUP =====
(2) pc[1008-1009].eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.5T  2.0T  56% /srv
===== NODE GROUP =====
(1) pc1010.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.5T  1.9T  57% /srv
================
Tue, Apr 16, 6:27 AM · Patch-For-Review, MediaWiki-Cache, Performance-Team (Radar), DBA, User-Marostegui
Marostegui added a comment to T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout.

I guess we need to FORCE index? :(

Tue, Apr 16, 5:33 AM · Patch-For-Review, MediaWiki-API, DBA
Marostegui added a comment to T210725: Replace parsercache keys to something more meaningful on db-XXXX.php.

pc1007 is now back to 56% after the optimization

Tue, Apr 16, 5:00 AM · Patch-For-Review, MediaWiki-Cache, Performance-Team (Radar), DBA, User-Marostegui

Mon, Apr 15

Marostegui added projects to T221035: scap no longer !log'ging to server admin log: Scap, Release-Engineering-Team (Backlog).
Mon, Apr 15, 7:56 PM · Release-Engineering-Team (Watching / External), Stashbot, Scap, Operations
Marostegui added a comment to T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout.

A count on commonswiki is: 8335746
On enwiki it is way larger: 82373996

Mon, Apr 15, 3:43 PM · Patch-For-Review, MediaWiki-API, DBA
Marostegui added a comment to T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout.

There is also the fact that the logging table uses a different index.
On commons it chooses log_actor_type_time whereas on enwiki it uses times (might or might not be relevant). Both hosts I have tested on are on 10.1.37.

Mon, Apr 15, 3:32 PM · Patch-For-Review, MediaWiki-API, DBA
Marostegui added a comment to T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout.

So the analyze table for actor table didn't change the pattern:

root@db2073.codfw.wmnet[(none)]> show explain for 6952971;
+------+--------------------+---------------------+--------+----------------------------------------------------------+-----------------------+---------+-----------------------------------------------------------------+---------+---------------------------------+
| id   | select_type        | table               | type   | possible_keys                                            | key                   | key_len | ref                                                             | rows    | Extra                           |
+------+--------------------+---------------------+--------+----------------------------------------------------------+-----------------------+---------+-----------------------------------------------------------------+---------+---------------------------------+
|    1 | PRIMARY            | actor_log_user      | ALL    | PRIMARY                                                  | NULL                  | NULL    | NULL                                                            | 8335724 | Using temporary; Using filesort |
|    1 | PRIMARY            | user                | eq_ref | PRIMARY                                                  | PRIMARY               | 4       | commonswiki.actor_log_user.actor_user                           |       1 | Using where                     |
|    1 | PRIMARY            | logging             | ref    | type_time,actor_time,log_actor_type_time,log_type_action | log_actor_type_time   | 8       | commonswiki.actor_log_user.actor_id                             |       3 | Using index condition           |
|    1 | PRIMARY            | page                | eq_ref | name_title                                               | name_title            | 261     | commonswiki.logging.log_namespace,commonswiki.logging.log_title |       1 | Using index                     |
|    1 | PRIMARY            | comment_log_comment | eq_ref | PRIMARY                                                  | PRIMARY               | 8       | commonswiki.logging.log_comment_id                              |       1 |                                 |
|    2 | DEPENDENT SUBQUERY | change_tag          | ref    | change_tag_log_tag_id,change_tag_tag_id_id               | change_tag_log_tag_id | 5       | commonswiki.logging.log_id                                      |       1 | Using index                     |
|    2 | DEPENDENT SUBQUERY | change_tag_def      | eq_ref | PRIMARY                                                  | PRIMARY               | 4       | commonswiki.change_tag.ct_tag_id                                |       1 |                                 |
+------+--------------------+---------------------+--------+----------------------------------------------------------+-----------------------+---------+-----------------------------------------------------------------+---------+---------------------------------+
7 rows in set, 1 warning (0.04 sec)
Mon, Apr 15, 3:27 PM · Patch-For-Review, MediaWiki-API, DBA
Marostegui added a comment to T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout.

A quick check on Tendril for slow queries for ApiQueryLogEvents::execute only gets reports for commonswiki.
I can try to run an analyze table on codfw for the actor table and we can check if that changes the query plan.

Mon, Apr 15, 3:20 PM · Patch-For-Review, MediaWiki-API, DBA
Marostegui added a comment to T219850: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%):.

@fsero maybe comment at T207702: contint1001:/var/lib/docker growth?

Mon, Apr 15, 2:57 PM · Patch-For-Review, Release-Engineering-Team (Kanban), Continuous-Integration-Infrastructure, Operations
Marostegui added a comment to T219850: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%):.

Note that this task is about /srv and the current issue is with /

Mon, Apr 15, 2:56 PM · Patch-For-Review, Release-Engineering-Team (Kanban), Continuous-Integration-Infrastructure, Operations
Marostegui added a comment to T220940: Abstracts dumps for Commons running very slowly.

The plan doesn't look different from those explains. However, we might want to check the real plan the optimizer is running, as we have seen before that the explain can differ from the plan it actually runs.
You can run the query and, in a different shell, identify the process with show processlist, then run show explain for ID, where ID is the one show processlist reports for that specific query.

Mon, Apr 15, 2:00 PM · Patch-For-Review, Dumps-Generation
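
The two-step procedure above (find the query in show processlist, then run show explain for its ID) can be sketched in Python. This is only an illustration: the rows are hypothetical dictionaries carrying the Id and Info columns of SHOW PROCESSLIST, the query text is invented, and the helper merely builds the statement rather than talking to a server.

```python
from typing import Optional


def show_explain_statement(processlist_rows, query_fragment: str) -> Optional[str]:
    """Build a SHOW EXPLAIN statement for the first running query containing the fragment."""
    for row in processlist_rows:
        info = row.get("Info") or ""  # Info is NULL for idle connections
        if query_fragment in info:
            return f"SHOW EXPLAIN FOR {row['Id']}"
    return None


# Hypothetical rows shaped like SHOW PROCESSLIST output (Id and Info columns only).
rows = [
    {"Id": 101, "Info": None},
    {"Id": 102, "Info": "SELECT /* hypothetical dump query */ rev_id FROM revision LIMIT 50000"},
]
print(show_explain_statement(rows, "hypothetical dump query"))
```

Matching on a distinctive comment inside the query is a convenient way to pick out the right process when many similar queries are running.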
Marostegui closed T220931: deploy1001 cannot reach cloudweb2001-dev.wikimedia.org when running scap as Resolved.

After merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/504005/ I have done a test deploy and it now looks good, deploys are back to around 50 seconds :)

Mon, Apr 15, 1:36 PM · Patch-For-Review, Scap, Operations, cloud-services-team
Marostegui updated the task description for T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.
Mon, Apr 15, 7:16 AM · DBA
Marostegui created T220931: deploy1001 cannot reach cloudweb2001-dev.wikimedia.org when running scap.
Mon, Apr 15, 5:36 AM · Patch-For-Review, Scap, Operations, cloud-services-team
Marostegui added a comment to T218006: mw1280 crashed.

This server crashed again:

-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/13/2019 12:33:55
Source:      system
Severity:    Critical
Description: CPU 2 has an internal error (IERR).
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      4
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      5
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      27
Date/Time:   04/13/2019 12:37:00
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   04/13/2019 12:37:00
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Mon, Apr 15, 5:07 AM · ops-eqiad, Operations, serviceops

Sun, Apr 14

Marostegui added a comment to T210725: Replace parsercache keys to something more meaningful on db-XXXX.php.

pc1010 (which replicates from pc1007) finished deleting its old rows (rows that were not purged) and its optimization, and now has 300G more free space than pc1007.
I will pool pc1010 instead of pc1007 and optimize pc1007 to free up some space, so we can forget about it over the upcoming Easter days.

(1) pc1007.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.7T  1.7T  62% /srv
===== NODE GROUP =====
(2) pc[1008-1009].eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.5T  2.0T  56% /srv
===== NODE GROUP =====
(1) pc1010.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.4T  2.0T  56% /srv
Sun, Apr 14, 6:26 PM · Patch-For-Review, MediaWiki-Cache, Performance-Team (Radar), DBA, User-Marostegui

Fri, Apr 12

Marostegui added a comment to T210725: Replace parsercache keys to something more meaningful on db-XXXX.php.

parsercache hit ratio values are back to normal after the first key change on the 9th. So it took around 3 days to be fully back (https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=1554620012644&to=1555063147728).
I am going to wait until the 9th of May to see how the disk space trends go before going for the second key change.

Fri, Apr 12, 10:02 AM · Patch-For-Review, MediaWiki-Cache, Performance-Team (Radar), DBA, User-Marostegui
Marostegui added a parent task for T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool: T216175: HP Gen9 onboard controller review.
Fri, Apr 12, 8:04 AM · Patch-For-Review, Operations, Icinga, monitoring
Marostegui added a subtask for T216175: HP Gen9 onboard controller review: T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool.
Fri, Apr 12, 8:04 AM · Operations
Marostegui added a comment to T219461: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups.

@RobH @faidon Re: T219461#5103942 I wonder if we should document this step as one to do for these models.

Or rather buy these without an SD card reader if that's a viable option?

Fri, Apr 12, 7:59 AM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui triaged T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host as Normal priority.
Fri, Apr 12, 7:27 AM · DBA
Marostegui added a comment to P8392 (An Untitled Masterwork).
root@db2102:~# ssacli version
Fri, Apr 12, 7:21 AM
Marostegui created P8392 (An Untitled Masterwork).
Fri, Apr 12, 7:18 AM
Marostegui renamed T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool from Fix RAID handler alert to work with Gen10 hosts and ssacli tool to Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool.
Fri, Apr 12, 6:57 AM · Patch-For-Review, Operations, Icinga, monitoring
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

Great catch @MoritzMuehlenhoff thanks!
I have created T220787 to follow up on the changes our tools and monitoring need in order to adapt to the new Gen10.

Fri, Apr 12, 6:43 AM · DBA
Marostegui added a project to T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool: Operations.
Fri, Apr 12, 6:38 AM · Patch-For-Review, Operations, Icinga, monitoring
Marostegui renamed T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool from Fix RAID handler alert to work with Gen10 hosts to Fix RAID handler alert to work with Gen10 hosts and ssacli tool.
Fri, Apr 12, 6:37 AM · Patch-For-Review, Operations, Icinga, monitoring
Marostegui added subtasks for T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool: T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host, T216043: Sort out which RAID packages are still needed.
Fri, Apr 12, 6:37 AM · Patch-For-Review, Operations, Icinga, monitoring
Marostegui added a parent task for T216043: Sort out which RAID packages are still needed: T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool.
Fri, Apr 12, 6:37 AM · Operations
Marostegui added a parent task for T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host: T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool.
Fri, Apr 12, 6:37 AM · DBA
Marostegui created T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool.
Fri, Apr 12, 6:36 AM · Patch-For-Review, Operations, Icinga, monitoring
Marostegui updated subscribers of T216175: HP Gen9 onboard controller review.

Heh... HP decided to rename the tool on the Gen10, and @MoritzMuehlenhoff found it (T220572#5106204):

HPE renamed the tool, I installed "ssacli" and now "ssacli controller all show config" works fine.
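
Since the tool is called `hpssacli` on older hosts and `ssacli` on the Gen10, a check script could probe for whichever binary is installed. A minimal Python sketch, assuming both tools accept the same `controller all show config` arguments (as the two commands quoted in this feed suggest); `pick_raid_cli`, `show_config_command` and `RAID_CLI_CANDIDATES` are hypothetical names, not part of any existing tooling:

```python
import shutil
from typing import Callable, List, Optional, Sequence

# Candidate binaries, newest name first. Both names appear in this feed;
# the ordering is an assumption.
RAID_CLI_CANDIDATES = ("ssacli", "hpssacli")


def pick_raid_cli(
    candidates: Sequence[str] = RAID_CLI_CANDIDATES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if which(name) is not None:
            return name
    return None


def show_config_command(cli: str) -> List[str]:
    """Build the argv for the 'controller all show config' call."""
    return [cli, "controller", "all", "show", "config"]
```

The `which` lookup is passed in as a parameter so the selection logic can be exercised on a machine that has neither tool installed.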
Fri, Apr 12, 6:24 AM · Operations
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

The RAID controller shows up in early device detection by the kernel:

[    4.385654] smartpqi 0000:5c:00.0: added scsi 0:1:0:0: Direct-Access     HPE      LOGICAL VOLUME   RAID-1(ADM)  SSDSmartPathCap+ En+ Exp+ qd=192
[    4.400970] scsi 0:2:0:0: RAID              HPE      P408i-a SR Gen10 1.98 PQ: 0 ANSI: 5
[    4.437509] smartpqi 0000:5c:00.0: added scsi 0:2:0:0: RAID              HPE      P408i-a SR Gen10              SSDSmartPathCap- En- Exp+ qd=0

But I don't see a RAID controller in "lspci -v" on either db2097 or db2102. This controller should be supported by the smartpqi driver, but even if the current smartpqi driver in Linux 4.9 did not support our model, it should still be listed in lspci.
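
To cross-check the kernel log against lspci, the PCI address that smartpqi attaches to can be pulled out of the dmesg lines. A rough Python sketch; `smartpqi_pci_addresses` is a hypothetical helper, and the regex assumes the log format shown in the lines quoted above:

```python
import re

# Matches lines such as:
#   [    4.385654] smartpqi 0000:5c:00.0: added scsi 0:1:0:0: Direct-Access ...
# and captures the domain:bus:device.function address after the driver name.
SMARTPQI_LINE = re.compile(
    r"smartpqi (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):"
)


def smartpqi_pci_addresses(dmesg_text: str) -> set:
    """Collect the PCI addresses that the smartpqi driver reports in dmesg output."""
    return {match.group("pci") for match in SMARTPQI_LINE.finditer(dmesg_text)}
```

For the three kernel lines quoted above this yields the single address `0000:5c:00.0`, which could then be passed to `lspci -s` to query that slot directly.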

Fri, Apr 12, 5:52 AM · DBA
Marostegui closed T219463: rack/setup/install (5) codfw dedicated dump slaves as Resolved.

All these hosts are now ready to be productionized at T220572.
There is a problem with how the controller is exposed to the OS, which is being discussed on that same task (T220572#5104585).
Thanks @Papaul for being so fast with these and for helping investigate the controller issues on that other task!

Fri, Apr 12, 5:03 AM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui updated the task description for T219463: rack/setup/install (5) codfw dedicated dump slaves.
Fri, Apr 12, 5:02 AM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui updated the task description for T208323: Predictive failures on disk S.M.A.R.T. status.
Fri, Apr 12, 4:58 AM · Operations, DBA
Marostegui added a comment to T208323: Predictive failures on disk S.M.A.R.T. status.

db2044 again:

root@db2044:~# hpssacli controller all show config
Fri, Apr 12, 4:57 AM · Operations, DBA
Marostegui updated the task description for T208323: Predictive failures on disk S.M.A.R.T. status.
Fri, Apr 12, 4:57 AM · Operations, DBA

Thu, Apr 11

Marostegui added a comment to T216175: HP Gen9 onboard controller review.

P408i and Gen 10 might be biting us: T220572#5104134 T220572#5104345 T220572#5104585

Thu, Apr 11, 5:17 PM · Operations
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

@Papaul have you double checked that the RAID controller is not set up to work in HBA mode?

Thu, Apr 11, 5:02 PM · DBA
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

So to sum up.
We can use the storage:

root@db2102:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   3.5T  3.6G  3.5T   1% /srv
root@db2102:~# touch /srv/test
root@db2102:~# rm /srv/test
root@db2102:~#
root@db2102:~# fdisk -l
Disk /dev/sda: 3.5 TiB, 3840699359232 bytes, 7501365936 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 262144 bytes / 524288 bytes
Disklabel type: gpt
Disk identifier: 308170B4-330A-43CA-850C-7B6F344BA9DC
Thu, Apr 11, 4:52 PM · DBA
Marostegui moved T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host from Next to In progress on the DBA board.
Thu, Apr 11, 4:07 PM · DBA
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

Thanks @Papaul!
Feel free to take either db2097 or db2102 down anytime you want to check them. They have no data.

Thu, Apr 11, 4:06 PM · DBA
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

@MoritzMuehlenhoff db2097 is online now and it is one of the new ones, same batch as db2102. You can also check there.
Keep in mind that even if the controller doesn't appear to be there, the storage on /srv looks good on both db2102 and db2097.

Thu, Apr 11, 3:21 PM · DBA
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

We will have them today or tomorrow as Papaul is installing them right now.

Thu, Apr 11, 3:15 PM · DBA
Marostegui updated subscribers of T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

These new hosts have an HP P408i controller, and I have noticed this:

root@db2102:~# hpssacli controller all show config

Error: No controllers detected. Possible causes:
 - The driver for the installed controller(s) is not loaded.
 - On LINUX, the scsi_generic (sg) driver module is not loaded.
 See the README file for more details

root@db2102:~# lsmod | grep sg
sg 32768 0
ipmi_msghandler 49152 2 ipmi_devintf,ipmi_si
scsi_mod 225280 5 smartpqi,sd_mod,ses,scsi_transport_sas,sg
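
The error message lists an unloaded `sg` module as a possible cause, and `lsmod | grep sg` also matches unrelated names such as `ipmi_msghandler`. A small Python sketch that checks for exact module names instead; `parse_lsmod` is a hypothetical helper and assumes the usual `lsmod` column layout (name, size, use count, optional list of users):

```python
def parse_lsmod(lsmod_text: str) -> dict:
    """Map each module name to the list of modules using it, from lsmod-style lines."""
    modules = {}
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Expect: name, size, use count, then an optional comma-separated user list.
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # skips the "Module Size Used by" header and blank lines
        users = fields[3].split(",") if len(fields) > 3 else []
        modules[fields[0]] = users
    return modules
```

Applied to the output above, `sg` is present as its own entry and `smartpqi` shows up as a user of `scsi_mod`, which suggests that neither of the missing-driver causes named in the error applies here.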

@MoritzMuehlenhoff is kindly taking a look :)

Thu, Apr 11, 2:53 PM · DBA
Marostegui edited P8388 (An Untitled Masterwork).
Thu, Apr 11, 2:53 PM
Marostegui updated the task description for T219461: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups.
Thu, Apr 11, 2:46 PM · Patch-For-Review, ops-codfw, Operations, DBA