Marostegui (Manuel Aróstegui)
User

User Details

User Since
Sep 1 2016, 6:48 AM (137 w, 3 d)
Availability
Available
IRC Nick
marostegui
LDAP User
Marostegui
MediaWiki User
MArostegui (WMF) [ Global Accounts ]

TZ: UTC +1/+2

Recent Activity

Yesterday

Marostegui added a comment to T200297: Introduce a new namespace for collaborative judgements about wiki entities.

As far as I remember (it has been a while) all the stuff that was sent for me to review was reviewed and I believe it was even merged.

Sun, Apr 21, 4:19 PM · MW-1.33-notes (1.33.0-wmf.14; 2019-01-22), Patch-For-Review, Scoring-platform-team (Current), DBA, Operations, Jade, TechCom-RFC
Marostegui added a comment to T221357: Read timeout reached while viewing AbuseLog.

There is no difference. I don't think it is an optimizer bug; it is an issue with CAST.
I personally don't have much experience with casting, but from the MySQL manual:

If you convert an indexed column using BINARY, CAST(), or CONVERT(), MySQL may not be able to use the index efficiently.
Sun, Apr 21, 3:59 PM · MW-1.34-notes (1.34.0-wmf.1; 2019-04-16), DBA, AbuseFilter, Wikimedia-production-error
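
The CAST remark in the comment above can be illustrated with a small, self-contained sketch. It uses Python's built-in SQLite, not MySQL or MariaDB, so it only shows the general effect (wrapping an indexed column in CAST stops a plain column index from being used for the lookup); the table and index names are hypothetical stand-ins, not the real abuse_filter schema.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the EXPLAIN QUERY PLAN detail strings for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable plan step is the last column of each row.
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
# Hypothetical demo table with an index on the column we filter by.
conn.execute("CREATE TABLE demo_filter (af_id INTEGER, af_name TEXT)")
conn.execute("CREATE INDEX demo_af_id ON demo_filter (af_id)")
conn.executemany(
    "INSERT INTO demo_filter VALUES (?, ?)",
    [(i, f"filter-{i}") for i in range(1000)],
)

# Comparing the bare column lets the planner search the index.
plain_plan = query_plan(conn, "SELECT * FROM demo_filter WHERE af_id = 42")
# Wrapping the column in CAST hides it from the index, forcing a scan.
cast_plan = query_plan(
    conn, "SELECT * FROM demo_filter WHERE CAST(af_id AS TEXT) = '42'"
)

print("plain:", plain_plan)
print("cast: ", cast_plan)
```

The first plan should report an index search and the second a table scan. The exact wording of the plan steps varies between SQLite versions, and MySQL/MariaDB plans look different again, so treat this as an analogy rather than a reproduction of the production query.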
Marostegui updated the task description for T208323: Predictive failures on disk S.M.A.R.T. status.
Sun, Apr 21, 7:02 AM · Operations, DBA
Marostegui moved T221512: Degraded RAID on db2037 from Triage to In progress on the DBA board.
Sun, Apr 21, 7:01 AM · DBA, Operations, ops-codfw
Marostegui assigned T221512: Degraded RAID on db2037 to Papaul.

Let's get it replaced
Thanks!

Sun, Apr 21, 7:00 AM · DBA, Operations, ops-codfw
Marostegui created T221511: Possible full scan query ApiQueryUserContribs::execute for revision_actor_temp table on commonswiki.
Sun, Apr 21, 6:51 AM · Core Platform Team, MediaWiki-Database
Marostegui edited projects for T217481: Slow queries on abuse_filter_log using afl_action or afl_actions, added: User-Marostegui; removed DBA.

Untagging us as there is nothing actionable here for us. I will remain subscribed in case you've got questions or further requests, happy to help!

Sun, Apr 21, 6:35 AM · User-Marostegui, User-Daimona, AbuseFilter
Marostegui moved T221357: Read timeout reached while viewing AbuseLog from Triage to In progress on the DBA board.

Unfortunately, it is not really making any difference: the optimizer still thinks it is better to do a full scan (I guess the CAST is the culprit here):

SELECT  * FROM `abuse_filter_log` FORCE INDEX(afl_timestamp) LEFT JOIN `abuse_filter` ON ((CAST( af_id AS BINARY )=afl_filter)) WHERE afl_deleted = '0' ORDER BY afl_timestamp DESC LIMIT 51
Sun, Apr 21, 6:31 AM · MW-1.34-notes (1.34.0-wmf.1; 2019-04-16), DBA, AbuseFilter, Wikimedia-production-error
Marostegui moved T221458: Special:Log on commons -- entire web request took longer than 60 seconds and timed out from Triage to In progress on the DBA board.
Sun, Apr 21, 6:09 AM · Performance, Core Platform Team, MediaWiki-Logging, MediaWiki-Database, DBA, Operations, Wikimedia-production-error
Marostegui moved T221502: db1099 memory issues from Triage to In progress on the DBA board.
Sun, Apr 21, 6:09 AM · ops-eqiad, Operations, DBA
Marostegui added a comment to T221508: webperf2001 is running out of disk space.

This will fill up again in a matter of minutes:

root@webperf2001:/var/log# ls -lh messages user.log
-rw-r----- 1 root adm 1.3G Apr 21 05:27 messages
-rw-r----- 1 root adm 989M Apr 21 05:27 user.log
Sun, Apr 21, 5:29 AM · Operations, Performance-Team
Marostegui added a comment to T221508: webperf2001 is running out of disk space.

The root filesystem on the host was completely full:

root@webperf2001:/var/log# df -hT
Filesystem     Type      Size  Used Avail Use% Mounted on
udev           devtmpfs  3.9G     0  3.9G   0% /dev
tmpfs          tmpfs     799M   81M  719M  11% /run
/dev/vda1      ext4       49G   49G     0 100% /
tmpfs          tmpfs     4.0G   12K  4.0G   1% /dev/shm
tmpfs          tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs          tmpfs     4.0G     0  4.0G   0% /sys/fs/cgroup
tmpfs          tmpfs     799M     0  799M   0% /run/user/15343
Sun, Apr 21, 5:22 AM · Operations, Performance-Team
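
The two checks in the comments above (how full a filesystem is, and which files in a directory are taking the space) can also be sketched in Python. This is a minimal sketch using only the standard library; the function names are hypothetical, and the percentage can differ slightly from the Use% column of df, which accounts for reserved blocks.

```python
import os
import shutil


def usage_percent(path):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def largest_files(directory, count=5):
    """Return (size_in_bytes, path) for the biggest regular files directly in `directory`."""
    sized = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Only regular files; do not follow symlinks or descend into subdirectories.
            if entry.is_file(follow_symlinks=False):
                sized.append((entry.stat(follow_symlinks=False).st_size, entry.path))
    # Tuples sort by size first, so the largest files come first.
    return sorted(sized, reverse=True)[:count]
```

For example, usage_percent("/") and largest_files("/var/log") would cover the same ground as the df and ls calls above, assuming the process is allowed to read that directory.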

Sat, Apr 20

Marostegui triaged T221502: db1099 memory issues as Normal priority.
Sat, Apr 20, 5:55 PM · ops-eqiad, Operations, DBA
Marostegui created T221502: db1099 memory issues.
Sat, Apr 20, 5:55 PM · ops-eqiad, Operations, DBA
Marostegui added a project to T221458: Special:Log on commons -- entire web request took longer than 60 seconds and timed out: Core Platform Team.
Sat, Apr 20, 4:50 PM · Performance, Core Platform Team, MediaWiki-Logging, MediaWiki-Database, DBA, Operations, Wikimedia-production-error
Marostegui updated subscribers of T221458: Special:Log on commons -- entire web request took longer than 60 seconds and timed out.

This is the query I believe - looks like the optimizer is being dumb again:

SELECT /* IndexPager::buildQueryInfo (LogPager)  */  log_id,log_type,log_action,log_timestamp,log_namespace,log_title,log_params,log_deleted,user_id,user_name,user_editcount,comment_log_comment.comment_text AS `log_comment_text`,comment_log_comment.comment_data AS `log_comment_data`,comment_log_comment.comment_id AS `log_comment_cid`,actor_log_user.actor_user AS `log_user`,actor_log_user.actor_name AS `log_user_text`,log_actor,(SELECT  GROUP_CONCAT(ctd_name SEPARATOR ',')  FROM `change_tag` JOIN `change_tag_def` ON ((ct_tag_id=ctd_id))   WHERE ct_log_id=log_id  ) AS `ts_tags`  FROM `logging` JOIN `comment` `comment_log_comment` ON ((comment_log_comment.comment_id = log_comment_id)) JOIN `actor` `actor_log_user` ON ((actor_log_user.actor_id = log_actor)) LEFT JOIN `user` ON ((user_id=actor_log_user.actor_user))   WHERE (log_type NOT IN ('spamblacklist','titleblacklist','abusefilterprivatedetails','oath','suppress')) AND (log_type != 'thanks') AND (log_type != 'patrol') AND (log_type != 'tag')  ORDER BY log_timestamp DESC LIMIT 51
Sat, Apr 20, 3:49 PM · Performance, Core Platform Team, MediaWiki-Logging, MediaWiki-Database, DBA, Operations, Wikimedia-production-error
Marostegui added a parent task for T221424: decommission db2014,db2020, db2021, db2022, db2024, db2031: T176243: Decommission database hosts <= db2031 (tracking).
Sat, Apr 20, 6:32 AM · Operations, ops-codfw, DC-Ops, decommission
Marostegui added a subtask for T176243: Decommission database hosts <= db2031 (tracking): T221424: decommission db2014,db2020, db2021, db2022, db2024, db2031.
Sat, Apr 20, 6:32 AM · Patch-For-Review, Goal, DBA
Marostegui removed a project from T221424: decommission db2014,db2020, db2021, db2022, db2024, db2031: DBA.

All those hosts were decommissioned as part of T176243, so probably a leftover from that.
Removing our tag as there is nothing for us to do. I will keep subscribed to this task in case our help is needed to clarify something.
Thanks!

Sat, Apr 20, 6:29 AM · Operations, ops-codfw, DC-Ops, decommission
Marostegui moved T221481: Degraded RAID on db2047 from Triage to In progress on the DBA board.
Sat, Apr 20, 6:25 AM · DBA, Operations, ops-codfw
Marostegui assigned T221481: Degraded RAID on db2047 to Papaul.

Let's replace only the failed disk first: disk #12

Sat, Apr 20, 6:24 AM · DBA, Operations, ops-codfw

Fri, Apr 19

Marostegui triaged T221201: Prepare and check storage layer for initiativeswiki as Normal priority.
Fri, Apr 19, 3:55 PM · Cloud-Services, DBA
Marostegui added a comment to T208323: Predictive failures on disk S.M.A.R.T. status.

db2047 has another failed disk:

logicaldrive 1 (3.3 TB, RAID 1+0, OK)
Fri, Apr 19, 8:07 AM · Operations, DBA
Marostegui added a comment to T213664: correctable memory errors db1068 (commons primary master database).

Thanks for letting us know!
This master will be replaced once the hosts at T211613: rack/setup/install db11[26-38].eqiad.wmnet are racked and installed.

Fri, Apr 19, 8:05 AM · Patch-For-Review, DBA, Operations
Marostegui closed T149670: Predictive disk failure on db2047 as Resolved.

Thanks! We are tracking those at T208323, and as we have many, we are waiting for them to fully fail before replacing them (as sometimes it takes months). I am closing this again, as it is covered by the other task, and an automatic task will be created once the disk has fully failed.
Thanks for letting us know though, much appreciated!

Fri, Apr 19, 6:03 AM · ops-codfw, Operations
Marostegui added a comment to T211613: rack/setup/install db11[26-38].eqiad.wmnet.

I have increased the priority because the s4 master is having memory errors again and needs to be replaced as soon as we can

Fri, Apr 19, 5:15 AM · Goal, Patch-For-Review, DBA, ops-eqiad, User-Marostegui, Operations

Thu, Apr 18

Marostegui moved T221159: FY18/19 TEC1.6 Q4: Improve or replace GTID + pt-heartbeat logic for cross-DC from Triage to Meta/Epic on the DBA board.
Thu, Apr 18, 6:32 PM · User-mobrovac, Services (watching), Goal, Core Platform Team Backlog (Watching / External), MediaWiki-Database, Core Platform Team (Multi-DC (TEC1)), Performance-Team, DBA
Marostegui moved T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout from Triage to In progress on the DBA board.
Thu, Apr 18, 6:32 PM · Patch-For-Review, MediaWiki-API, DBA
Marostegui triaged T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout as Normal priority.
Thu, Apr 18, 6:31 PM · Patch-For-Review, MediaWiki-API, DBA
Marostegui raised the priority of T211613: rack/setup/install db11[26-38].eqiad.wmnet from Normal to High.
Thu, Apr 18, 6:28 PM · Goal, Patch-For-Review, DBA, ops-eqiad, User-Marostegui, Operations
Marostegui added a comment to T221357: Read timeout reached while viewing AbuseLog.

@Daimona if you let me know the query you'd like to try with the FORCE I can try to run it on a core host for you.
Also let me know which wiki would be good for that test

Thu, Apr 18, 6:00 PM · MW-1.34-notes (1.34.0-wmf.1; 2019-04-16), DBA, AbuseFilter, Wikimedia-production-error

Tue, Apr 16

Marostegui updated subscribers of T218985: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves).

So from the DC Ops side, only the production DNS entries are missing?
Thanks Chris!

Tue, Apr 16, 8:19 PM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui added a comment to T219399: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts.

Yaaay!

Tue, Apr 16, 6:56 PM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui added a comment to T220480: Migration Plan 3.

Sounds good to me!

Tue, Apr 16, 2:19 PM · User-Ladsgroup, Wikidata wb_terms Trailblazing
Marostegui added a comment to T210725: Replace parsercache keys to something more meaningful on db-XXXX.php.
pc[2007-2010].codfw.wmnet,pc[1007-1010].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) pc2010.codfw.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.3T  2.2T  51% /srv
===== NODE GROUP =====
(1) pc2009.codfw.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.4T  2.0T  55% /srv
===== NODE GROUP =====
(1) pc2007.codfw.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.1T  2.3T  49% /srv
===== NODE GROUP =====
(1) pc2008.codfw.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.4T  2.0T  54% /srv
===== NODE GROUP =====
(1) pc1007.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.3T  2.1T  52% /srv
===== NODE GROUP =====
(2) pc[1008-1009].eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.5T  2.0T  56% /srv
===== NODE GROUP =====
(1) pc1010.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.5T  1.9T  57% /srv
================
Tue, Apr 16, 6:27 AM · Patch-For-Review, MediaWiki-Cache, Performance-Team (Radar), DBA, User-Marostegui
Marostegui added a comment to T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout.

I guess we need to FORCE index? :(

Tue, Apr 16, 5:33 AM · Patch-For-Review, MediaWiki-API, DBA
Marostegui added a comment to T210725: Replace parsercache keys to something more meaningful on db-XXXX.php.

pc1007 is now back to 56% after the optimization

Tue, Apr 16, 5:00 AM · Patch-For-Review, MediaWiki-Cache, Performance-Team (Radar), DBA, User-Marostegui

Mon, Apr 15

Marostegui added projects to T221035: scap no longer !log'ging to server admin log: Scap, Release-Engineering-Team (Backlog).
Mon, Apr 15, 7:56 PM · Release-Engineering-Team (Watching / External), Stashbot, Scap, Operations
Marostegui added a comment to T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout.

A count on commonswiki is: 8335746
On enwiki it is way larger: 82373996

Mon, Apr 15, 3:43 PM · Patch-For-Review, MediaWiki-API, DBA
Marostegui added a comment to T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout.

There is also the fact that the logging table uses a different index.
On commons it chooses log_actor_type_time whereas on enwiki it uses times (might or might not be relevant). Both hosts I have tested on are on 10.1.37.

Mon, Apr 15, 3:32 PM · Patch-For-Review, MediaWiki-API, DBA
Marostegui added a comment to T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout.

So running ANALYZE TABLE on the actor table didn't change the pattern:

root@db2073.codfw.wmnet[(none)]> show explain for 6952971;
+------+--------------------+---------------------+--------+----------------------------------------------------------+-----------------------+---------+-----------------------------------------------------------------+---------+---------------------------------+
| id   | select_type        | table               | type   | possible_keys                                            | key                   | key_len | ref                                                             | rows    | Extra                           |
+------+--------------------+---------------------+--------+----------------------------------------------------------+-----------------------+---------+-----------------------------------------------------------------+---------+---------------------------------+
|    1 | PRIMARY            | actor_log_user      | ALL    | PRIMARY                                                  | NULL                  | NULL    | NULL                                                            | 8335724 | Using temporary; Using filesort |
|    1 | PRIMARY            | user                | eq_ref | PRIMARY                                                  | PRIMARY               | 4       | commonswiki.actor_log_user.actor_user                           |       1 | Using where                     |
|    1 | PRIMARY            | logging             | ref    | type_time,actor_time,log_actor_type_time,log_type_action | log_actor_type_time   | 8       | commonswiki.actor_log_user.actor_id                             |       3 | Using index condition           |
|    1 | PRIMARY            | page                | eq_ref | name_title                                               | name_title            | 261     | commonswiki.logging.log_namespace,commonswiki.logging.log_title |       1 | Using index                     |
|    1 | PRIMARY            | comment_log_comment | eq_ref | PRIMARY                                                  | PRIMARY               | 8       | commonswiki.logging.log_comment_id                              |       1 |                                 |
|    2 | DEPENDENT SUBQUERY | change_tag          | ref    | change_tag_log_tag_id,change_tag_tag_id_id               | change_tag_log_tag_id | 5       | commonswiki.logging.log_id                                      |       1 | Using index                     |
|    2 | DEPENDENT SUBQUERY | change_tag_def      | eq_ref | PRIMARY                                                  | PRIMARY               | 4       | commonswiki.change_tag.ct_tag_id                                |       1 |                                 |
+------+--------------------+---------------------+--------+----------------------------------------------------------+-----------------------+---------+-----------------------------------------------------------------+---------+---------------------------------+
7 rows in set, 1 warning (0.04 sec)
Mon, Apr 15, 3:27 PM · Patch-For-Review, MediaWiki-API, DBA
Marostegui added a comment to T220999: Slow query "ApiQueryLogEvents::execute" after actor rollout.

A quick check on Tendril for slow queries for ApiQueryLogEvents::execute only gets reports for commonswiki.
I can try to run an analyze table on codfw for the actor table and we can check if that changes the query plan.

Mon, Apr 15, 3:20 PM · Patch-For-Review, MediaWiki-API, DBA
Marostegui added a comment to T219850: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%):.

@fsero maybe comment at T207702: contint1001:/var/lib/docker growth?

Mon, Apr 15, 2:57 PM · Patch-For-Review, Release-Engineering-Team (Kanban), Continuous-Integration-Infrastructure, Operations
Marostegui added a comment to T219850: contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%):.

Note that this task is about /srv and the current issue is with /

Mon, Apr 15, 2:56 PM · Patch-For-Review, Release-Engineering-Team (Kanban), Continuous-Integration-Infrastructure, Operations
Marostegui added a comment to T220940: Abstracts dumps for Commons running very slowly.

The plan doesn't look different from those explains. However, we might want to check the plan the optimizer actually runs, as we have seen before that the EXPLAIN output can differ from the plan that is really executed.
You can run the query and, in a different shell, identify its process with SHOW PROCESSLIST, then run SHOW EXPLAIN FOR <ID>, where <ID> is the Id listed for that query in SHOW PROCESSLIST.

Mon, Apr 15, 2:00 PM · Patch-For-Review, Dumps-Generation
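
The two-step check described in the comment above (find the thread in the process list, then ask for its live plan) could be scripted roughly as follows. This is a sketch under assumptions: the function name is hypothetical, `cursor` is any Python DB-API cursor already connected to a MariaDB server, and the SHOW EXPLAIN FOR statement is MariaDB syntax as used elsewhere on this page (MySQL spells the equivalent EXPLAIN FOR CONNECTION).

```python
def explain_running_query(cursor, needle):
    """Find a running query whose text contains `needle` and return its live plan.

    Returns None if no matching query is currently running.
    """
    # FULL avoids the truncation of the Info column to 100 characters.
    cursor.execute("SHOW FULL PROCESSLIST")
    for row in cursor.fetchall():
        # Column order: Id, User, Host, db, Command, Time, State, Info, ...
        thread_id, info = row[0], row[7]
        # Skip idle connections (Info is NULL) and our own process list query.
        if info and needle in info and "PROCESSLIST" not in info.upper():
            # int() ensures only a numeric thread id ends up in the statement.
            cursor.execute(f"SHOW EXPLAIN FOR {int(thread_id)}")
            return cursor.fetchall()
    return None
```

A distinctive fragment of the query, such as its leading comment, works well as `needle`. The query has to still be running when the second statement is issued, so this only helps with slow queries.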
Marostegui closed T220931: deploy1001 cannot reach cloudweb2001-dev.wikimedia.org when running scap as Resolved.

After merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/504005/ I have done a test deploy and it now looks good, deploys are back to around 50 seconds :)

Mon, Apr 15, 1:36 PM · Patch-For-Review, Scap, Operations, cloud-services-team
Marostegui updated the task description for T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.
Mon, Apr 15, 7:16 AM · DBA
Marostegui created T220931: deploy1001 cannot reach cloudweb2001-dev.wikimedia.org when running scap.
Mon, Apr 15, 5:36 AM · Patch-For-Review, Scap, Operations, cloud-services-team
Marostegui added a comment to T218006: mw1280 crashed.

This server crashed again:

-------------------------------------------------------------------------------
Record:      2
Date/Time:   04/13/2019 12:33:55
Source:      system
Severity:    Critical
Description: CPU 2 has an internal error (IERR).
-------------------------------------------------------------------------------
Record:      3
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Ok
Description: A problem was detected related to the previous server boot.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      4
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_A1.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      5
Date/Time:   04/13/2019 12:36:59
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Record:      27
Date/Time:   04/13/2019 12:37:00
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B1.
-------------------------------------------------------------------------------
Record:      28
Date/Time:   04/13/2019 12:37:00
Source:      system
Severity:    Ok
Description: An OEM diagnostic event occurred.
-------------------------------------------------------------------------------
Mon, Apr 15, 5:07 AM · ops-eqiad, Operations, serviceops

Sun, Apr 14

Marostegui added a comment to T210725: Replace parsercache keys to something more meaningful on db-XXXX.php.

pc1010 (which replicates from pc1007) finished deleting its old rows (rows that were not purged) and its optimization, and has 300G more free space than pc1007.
I will pool pc1010 instead of pc1007 and optimize pc1007 so it gets some extra space and we can forget about it for the upcoming Easter days.

(1) pc1007.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.7T  1.7T  62% /srv
===== NODE GROUP =====
(2) pc[1008-1009].eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.5T  2.0T  56% /srv
===== NODE GROUP =====
(1) pc1010.eqiad.wmnet
----- OUTPUT of 'df -hT /srv' -----
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   4.4T  2.4T  2.0T  56% /srv
Sun, Apr 14, 6:26 PM · Patch-For-Review, MediaWiki-Cache, Performance-Team (Radar), DBA, User-Marostegui

Fri, Apr 12

Marostegui added a comment to T210725: Replace parsercache keys to something more meaningful on db-XXXX.php.

parsercache hit ratio values are back to normal after the first key change on the 9th. So it took around 3 days to be fully back (https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=1554620012644&to=1555063147728).
I am going to wait until the 9th of May to see how the disk space trends go before going for the second key change.

Fri, Apr 12, 10:02 AM · Patch-For-Review, MediaWiki-Cache, Performance-Team (Radar), DBA, User-Marostegui
Marostegui added a parent task for T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool: T216175: HP Gen9 onboard controller review.
Fri, Apr 12, 8:04 AM · Patch-For-Review, Operations, Icinga, monitoring
Marostegui added a subtask for T216175: HP Gen9 onboard controller review: T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool.
Fri, Apr 12, 8:04 AM · Operations
Marostegui added a comment to T219461: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups.

@RobH @faidon Re: T219461#5103942 I wonder if we should document this step as one to do for these models.

Or rather buy these without an SD card reader if that's a viable option?

Fri, Apr 12, 7:59 AM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui triaged T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host as Normal priority.
Fri, Apr 12, 7:27 AM · DBA
Marostegui added a comment to P8392 (An Untitled Masterwork).
root@db2102:~# ssacli version
Fri, Apr 12, 7:21 AM
Marostegui created P8392 (An Untitled Masterwork).
Fri, Apr 12, 7:18 AM
Marostegui renamed T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool from Fix RAID handler alert to work with Gen10 hosts and ssacli tool to Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool.
Fri, Apr 12, 6:57 AM · Patch-For-Review, Operations, Icinga, monitoring
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

Great catch @MoritzMuehlenhoff thanks!
I have created T220787 to follow up on the changes our tools and monitoring need in order to adapt to the new Gen10 hosts

Fri, Apr 12, 6:43 AM · DBA
Marostegui added a project to T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool: Operations.
Fri, Apr 12, 6:38 AM · Patch-For-Review, Operations, Icinga, monitoring
Marostegui renamed T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool from Fix RAID handler alert to work with Gen10 hosts to Fix RAID handler alert to work with Gen10 hosts and ssacli tool.
Fri, Apr 12, 6:37 AM · Patch-For-Review, Operations, Icinga, monitoring
Marostegui added subtasks for T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool: T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host, T216043: Sort out which RAID packages are still needed.
Fri, Apr 12, 6:37 AM · Patch-For-Review, Operations, Icinga, monitoring
Marostegui added a parent task for T216043: Sort out which RAID packages are still needed: T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool.
Fri, Apr 12, 6:37 AM · Operations
Marostegui added a parent task for T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host: T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool.
Fri, Apr 12, 6:37 AM · DBA
Marostegui created T220787: Fix RAID handler alert and puppet facter to work with Gen10 hosts and ssacli tool.
Fri, Apr 12, 6:36 AM · Patch-For-Review, Operations, Icinga, monitoring
Marostegui updated subscribers of T216175: HP Gen9 onboard controller review.

Heh... HP decided to rename the tool on the Gen10, and @MoritzMuehlenhoff found it (T220572#5106204):

HPE renamed the tool, I installed "ssacli" and now "ssacli controller all show config" works fine.
Fri, Apr 12, 6:24 AM · Operations
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

The RAID controller shows up in early device detection by the kernel:

[    4.385654] smartpqi 0000:5c:00.0: added scsi 0:1:0:0: Direct-Access     HPE      LOGICAL VOLUME   RAID-1(ADM)  SSDSmartPathCap+ En+ Exp+ qd=192
[    4.400970] scsi 0:2:0:0: RAID              HPE      P408i-a SR Gen10 1.98 PQ: 0 ANSI: 5
[    4.437509] smartpqi 0000:5c:00.0: added scsi 0:2:0:0: RAID              HPE      P408i-a SR Gen10              SSDSmartPathCap- En- Exp+ qd=0

But I don't see a RAID controller in "lspci -v" on either db2097 or db2102. This controller should be supported by the smartpqi driver, but even if the current smartpqi driver in Linux 4.9 did not support our model, it should still be listed in lspci.

Fri, Apr 12, 5:52 AM · DBA
Marostegui closed T219463: rack/setup/install (5) codfw dedicated dump slaves as Resolved.

All these hosts are now ready to be productionized at T220572.
There is a problem with the controller exposure to the OS which is being discussed at that same task (T220572#5104585)
Thanks @Papaul for being so fast with these and for helping out investigating the controller issues on that other task!

Fri, Apr 12, 5:03 AM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui updated the task description for T219463: rack/setup/install (5) codfw dedicated dump slaves.
Fri, Apr 12, 5:02 AM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui updated the task description for T208323: Predictive failures on disk S.M.A.R.T. status.
Fri, Apr 12, 4:58 AM · Operations, DBA
Marostegui added a comment to T208323: Predictive failures on disk S.M.A.R.T. status.

db2044 again:

root@db2044:~# hpssacli controller all show config
Fri, Apr 12, 4:57 AM · Operations, DBA
Marostegui updated the task description for T208323: Predictive failures on disk S.M.A.R.T. status.
Fri, Apr 12, 4:57 AM · Operations, DBA

Thu, Apr 11

Marostegui added a comment to T216175: HP Gen9 onboard controller review.

P408i and Gen 10 might be biting us: T220572#5104134 T220572#5104345 T220572#5104585

Thu, Apr 11, 5:17 PM · Operations
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

@Papaul have you double checked that the RAID controller is not set up to work as HBA mode?

Thu, Apr 11, 5:02 PM · DBA
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

So to sum up.
We can use the storage:

root@db2102:~# df -hT /srv
Filesystem            Type  Size  Used Avail Use% Mounted on
/dev/mapper/tank-data xfs   3.5T  3.6G  3.5T   1% /srv
root@db2102:~# touch /srv/test
root@db2102:~# rm /srv/test
root@db2102:~#
root@db2102:~# fdisk -l
Disk /dev/sda: 3.5 TiB, 3840699359232 bytes, 7501365936 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 262144 bytes / 524288 bytes
Disklabel type: gpt
Disk identifier: 308170B4-330A-43CA-850C-7B6F344BA9DC
Thu, Apr 11, 4:52 PM · DBA
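
As a quick sanity check on the fdisk figures in the comment above, the sector count times the logical sector size gives the byte count, and dividing by 2**40 explains why a nominal 3.84 TB volume is reported as 3.5 TiB. The numbers are copied from the output above; only the variable names are mine.

```python
sectors = 7501365936        # sector count reported by fdisk for /dev/sda
sector_size = 512           # logical sector size in bytes

total_bytes = sectors * sector_size
print(total_bytes)                      # 3840699359232, matching fdisk
print(round(total_bytes / 2**40, 1))    # 3.5  (TiB, binary units)
print(round(total_bytes / 10**12, 2))   # 3.84 (TB, decimal units)
```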
Marostegui moved T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host from Next to In progress on the DBA board.
Thu, Apr 11, 4:07 PM · DBA
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

Thanks @Papaul!
Feel free to take either db2097 or db2102 down anytime you want to check them. They have no data

Thu, Apr 11, 4:06 PM · DBA
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

@MoritzMuehlenhoff db2097 is online now and it is one of the new ones, same batch as db2102. You can also check there.
Keep in mind that even if the controller doesn't appear to be there, the storage on /srv looks good on both db2102 and db2097.

Thu, Apr 11, 3:21 PM · DBA
Marostegui added a comment to T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

We will have them today or tomorrow as Papaul is installing them right now.

Thu, Apr 11, 3:15 PM · DBA
Marostegui updated subscribers of T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host.

These new hosts have an HP P408i controller and I have noticed this:

root@db2102:~# hpssacli controller all show config

Error: No controllers detected. Possible causes:
 - The driver for the installed controller(s) is not loaded.
 - On LINUX, the scsi_generic (sg) driver module is not loaded.
 See the README file for more details

root@db2102:~# lsmod | grep sg
sg 32768 0
ipmi_msghandler 49152 2 ipmi_devintf,ipmi_si
scsi_mod 225280 5 smartpqi,sd_mod,ses,scsi_transport_sas,sg

@MoritzMuehlenhoff is kindly taking a look :)

Thu, Apr 11, 2:53 PM · DBA
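
The lsmod check in the comment above (is sg loaded, and is smartpqi attached to scsi_mod) can be automated by parsing the lsmod output. A minimal sketch follows; the function and type names are hypothetical, and the sample text is the output pasted above.

```python
from typing import Dict, List, NamedTuple


class Module(NamedTuple):
    size: int
    use_count: int
    used_by: List[str]


def parse_lsmod(text: str) -> Dict[str, Module]:
    """Parse `lsmod` output (with or without its header line) into a dict keyed by module name."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module Size Used by" header.
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        # The optional fourth column is a comma-separated list of dependent modules.
        users = [u for u in fields[3].split(",") if u] if len(fields) > 3 else []
        modules[fields[0]] = Module(int(fields[1]), int(fields[2]), users)
    return modules


sample = """\
sg                     32768  0
ipmi_msghandler        49152  2 ipmi_devintf,ipmi_si
scsi_mod              225280  5 smartpqi,sd_mod,ses,scsi_transport_sas,sg
"""

mods = parse_lsmod(sample)
print("sg loaded:", "sg" in mods)
print("smartpqi uses scsi_mod:", "smartpqi" in mods["scsi_mod"].used_by)
```

On this sample both checks come out true, which matches the observation in the comment: the sg module is loaded, so a missing sg driver is not the reason hpssacli fails to detect the controller.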
Marostegui edited P8388 (An Untitled Masterwork).
Thu, Apr 11, 2:53 PM
Marostegui updated the task description for T219461: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups.
Thu, Apr 11, 2:46 PM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui added a comment to P8388 (An Untitled Masterwork).
root@db2102:~# lsmod | grep sg
sg                     32768  0
ipmi_msghandler        49152  2 ipmi_devintf,ipmi_si
scsi_mod              225280  5 smartpqi,sd_mod,ses,scsi_transport_sas,sg
Thu, Apr 11, 2:43 PM
Marostegui created P8388 (An Untitled Masterwork).
Thu, Apr 11, 2:43 PM
Marostegui closed T219461: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups as Resolved.

Thanks @Papaul!
This server is ready to be productionized at: T220572: Productionize eqiad and codfw source backup hosts & codfw backup test host

Thu, Apr 11, 2:41 PM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui updated the task description for T219461: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups.
Thu, Apr 11, 2:40 PM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui updated the task description for T219461: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups.
Thu, Apr 11, 2:31 PM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui added a comment to T219461: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups.

So disabling the SD card did the trick! :-)
The server looks good now:

root@db2102:~# df -hT
Filesystem            Type      Size  Used Avail Use% Mounted on
udev                  devtmpfs  252G     0  252G   0% /dev
tmpfs                 tmpfs      51G  9.6M   51G   1% /run
/dev/sda1             ext4       37G  899M   34G   3% /
tmpfs                 tmpfs     252G     0  252G   0% /dev/shm
tmpfs                 tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs                 tmpfs     252G     0  252G   0% /sys/fs/cgroup
/dev/mapper/tank-data xfs       3.5T  3.6G  3.5T   1% /srv
root@db2102:~# free -g
              total        used        free      shared  buff/cache   available
Mem:            503           0         502           0           0         500
Swap:             7           0           7
Thu, Apr 11, 2:26 PM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui removed a project from T138562: Improve regular production database backups handling: Wikimedia-Incident.
Thu, Apr 11, 1:23 PM · Epic, DBA
Marostegui removed a project from T172492: Improve database alerting (tracking): Wikimedia-Incident.
Thu, Apr 11, 1:23 PM · Epic, monitoring, DBA
Elitre awarded T220080: Emergency database primary master failover on s3 a Love token.
Thu, Apr 11, 7:52 AM · User-Johan, CommRel-Specialists-Support (Apr-Jun-2019), User-notice
Marostegui added a comment to T219461: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups.

I have been trying to check whether something else is defined at the storage level, but it is impossible to see anything over vsp :(

Thu, Apr 11, 7:19 AM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui added a comment to T219461: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups.

So, I have been checking the RAID menu on the controller, but unfortunately it doesn't show most of the options over vsp.
I can see that a RAID has been created, and I can see the disks, but that is pretty much all: not even the RAID size or anything related to its options (level, strip size, etc.).
Sometimes, if the server has an SD card, the card is assigned sda (although I cannot see it in fdisk) and the logical drive is created as sdb. @Papaul, can you check whether there is another storage device there?
Example of what I "see":

Thu, Apr 11, 5:41 AM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui added a comment to T219461: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups.

The RAID is on sdb, and we need it to be on sda for db.cfg to work:

Disk /dev/sdb: 3.5 TiB, 3840699359232 bytes, 7501365936 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 262144 bytes / 524288 bytes
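
As a quick sanity check on the fdisk figures above, the sector count times the logical sector size should give the byte total, and the byte total should round to the reported 3.5 TiB. A minimal sketch of that arithmetic; the helper names are hypothetical and the numbers are taken from the output above.

```shell
# Hypothetical helper: $1 = sector count, $2 = logical sector size in bytes.
sectors_to_bytes() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: $1 = size in bytes, printed in TiB with one decimal.
bytes_to_tib() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 ^ 4) }'
}

sectors_to_bytes 7501365936 512   # prints 3840699359232
bytes_to_tib 3840699359232        # prints 3.5
```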
Thu, Apr 11, 5:28 AM · Patch-For-Review, ops-codfw, Operations, DBA
Marostegui added a comment to T219115: db1078 s3 primary DB master BBU pre-failure.

@Cmjohnson can we schedule the BBU replacement for Monday 15th? db1078 is no longer a master.

Thu, Apr 11, 5:21 AM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui closed T220080: Emergency database primary master failover on s3 as Resolved.

This was done successfully.

Thu, Apr 11, 5:14 AM · User-Johan, CommRel-Specialists-Support (Apr-Jun-2019), User-notice
Marostegui closed T220080: Emergency database primary master failover on s3, a subtask of T219115: db1078 s3 primary DB master BBU pre-failure, as Resolved.
Thu, Apr 11, 5:14 AM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui added a comment to T219461: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups.

I will check whether the RAID is on sda, because the host is already correctly set up to be re-imaged:

db1114|db112[6-9]|db113[0-9]|db1140|dbprov200[1-2]|db209[7-9]|db210[0-2]) echo partman/db.cfg ;; \
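
To illustrate why db2102 is covered by that line: the pattern is a shell `case` alternation, and `db210[0-2]` matches db2100 through db2102. The sketch below wraps the same pattern in a hypothetical function (`partman_recipe_for` and the "no match" fallback are assumptions for illustration, not part of the real configuration).

```shell
# Hypothetical wrapper; the pattern itself is copied from the line above.
partman_recipe_for() {
    case "$1" in
        db1114|db112[6-9]|db113[0-9]|db1140|dbprov200[1-2]|db209[7-9]|db210[0-2])
            echo partman/db.cfg ;;
        *)
            # Illustrative fallback only.
            echo "no match" ;;
    esac
}

partman_recipe_for db2102   # prints partman/db.cfg
partman_recipe_for db2103   # prints no match
```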
Thu, Apr 11, 4:41 AM · Patch-For-Review, ops-codfw, Operations, DBA

Wed, Apr 10

Marostegui reassigned T220607: Degraded RAID on db2054 from Marostegui to Papaul.
Wed, Apr 10, 5:09 PM · DBA, Operations, ops-codfw
Marostegui closed T220607: Degraded RAID on db2054 as Resolved.
Wed, Apr 10, 5:08 PM · DBA, Operations, ops-codfw