
Jul 5 2018

mark added a comment to T184715: pybal's "can-depool" logic only takes downServers into account.

Nice work! :)

Jul 5 2018, 4:40 PM · Traffic, PyBal
jcrespo awarded T136403: Move cp3030+ from OE14 to OE13 in racktables a Like token.
Jul 5 2018, 8:36 AM · ops-esams, SRE

Jul 4 2018

mark added a comment to T184063: Remove all decommissioned hardware.

All to-be-removed servers have had their cabling removed, except in rack OE12.

Jul 4 2018, 5:16 PM · SRE, Epic, ops-esams
mark updated the task description for T184063: Remove all decommissioned hardware.
Jul 4 2018, 5:16 PM · SRE, Epic, ops-esams
mark updated the task description for T184063: Remove all decommissioned hardware.
Jul 4 2018, 3:12 PM · SRE, Epic, ops-esams
mark updated the task description for T184063: Remove all decommissioned hardware.
Jul 4 2018, 3:07 PM · SRE, Epic, ops-esams
mark moved T198784: Degraded RAID on cp3048 from Backlog to Hardware Failure / Repair on the ops-esams board.
Jul 4 2018, 2:40 PM · Traffic, ops-esams, SRE
mark moved T190607: cp3048 hardware issues from Procurement to Hardware Failure / Repair on the ops-esams board.
Jul 4 2018, 2:40 PM · Traffic, SRE, ops-esams
mark closed T183814: Degraded RAID on bast3002 as Resolved.

The last RAID array (md2) is now resyncing.

Jul 4 2018, 2:33 PM · ops-esams, SRE
mark closed T94819: Audit racktables as Resolved.
Jul 4 2018, 2:32 PM · DC-Ops, SRE, ops-esams
mark added a comment to T94819: Audit racktables.

All esams racks have been audited with their Racktables counterparts, and object location and unusable items are now correct.

Jul 4 2018, 2:31 PM · DC-Ops, SRE, ops-esams
mark closed T169035: bast3002 sdb broken as Resolved.
Jul 4 2018, 1:30 PM · SRE, ops-esams
mark closed T136403: Move cp3030+ from OE14 to OE13 in racktables as Resolved.

This has now been corrected and verified on-site.

Jul 4 2018, 1:28 PM · ops-esams, SRE
mark closed T148422: cp3009: memory scrubbing error as Declined.

After consulting with Ema, we decided this server is not worth repairing: it has been broken for a long time, it is only 1 of 4 misc varnish servers, and the misc cluster is being folded into text anyway.

Jul 4 2018, 12:53 PM · Patch-For-Review, Traffic, ops-esams, SRE
mark moved T184936: install/designate other machine as esams bastion from Procurement to Backlog on the ops-esams board.
Jul 4 2018, 12:51 PM · SRE, ops-esams
mark lowered the priority of T184936: install/designate other machine as esams bastion from High to Low.

ms-be3003 is still connected; eth0 and eth1 are connected to ports 4 and 5 of csw2-oe11-esams (csw2-esams), respectively.

Jul 4 2018, 12:51 PM · SRE, ops-esams
mark added a comment to T169035: bast3002 sdb broken.

bast3002 (aka hooft) sdb has been swapped with amslvs3's sdb (both LFF, non-hotswap). The server has booted back up, and sdb is being repartitioned and added to the RAID arrays.

Jul 4 2018, 12:45 PM · SRE, ops-esams
mark created T198790: Relabel hooft to bast3002.
Jul 4 2018, 12:19 PM · SRE, ops-esams
mark added a comment to T190607: cp3048 hardware issues.

Because cp3048 is out of warranty and it's unlikely we can get it fixed, I've used it as a parts donor for other broken systems:

Jul 4 2018, 12:01 PM · Traffic, SRE, ops-esams
mark closed T189305: cp3034: Uncorrectable Memory Error as Resolved.

Swapped DIMM B3 with DIMM B3 from cp3048 (parts donor). Server booted up just fine afterwards.

Jul 4 2018, 11:52 AM · SRE, Traffic, ops-esams
mark closed T179953: cp3043 disk failure as Resolved.

cp3043 drive 2 (sdb) has been swapped for cp3048's drive 2 (sdb), as cp3048 is out of warranty and unfixable anyway. RAID1 md0 has been restored, and the server is back up and running.

Jul 4 2018, 11:27 AM · Traffic, SRE, ops-esams

Jul 3 2018

mark moved T136403: Move cp3030+ from OE14 to OE13 in racktables from Backlog to Procurement on the ops-esams board.
Jul 3 2018, 1:05 PM · ops-esams, SRE
mark moved T94819: Audit racktables from Backlog to Procurement on the ops-esams board.
Jul 3 2018, 1:05 PM · DC-Ops, SRE, ops-esams
mark moved T169035: bast3002 sdb broken from Hardware Failure / Repair to Procurement on the ops-esams board.
Jul 3 2018, 1:04 PM · SRE, ops-esams
mark moved T189305: cp3034: Uncorrectable Memory Error from Hardware Failure / Repair to Procurement on the ops-esams board.
Jul 3 2018, 1:04 PM · SRE, Traffic, ops-esams
mark moved T190607: cp3048 hardware issues from Hardware Failure / Repair to Procurement on the ops-esams board.
Jul 3 2018, 1:04 PM · Traffic, SRE, ops-esams
mark moved T148422: cp3009: memory scrubbing error from Hardware Failure / Repair to Procurement on the ops-esams board.
Jul 3 2018, 1:04 PM · Patch-For-Review, Traffic, ops-esams, SRE
mark moved T183814: Degraded RAID on bast3002 from Hardware Failure / Repair to Procurement on the ops-esams board.
Jul 3 2018, 1:04 PM · ops-esams, SRE
mark moved T184936: install/designate other machine as esams bastion from Backlog to Procurement on the ops-esams board.
Jul 3 2018, 1:04 PM · SRE, ops-esams
mark moved T179953: cp3043 disk failure from Hardware Failure / Repair to Procurement on the ops-esams board.
Jul 3 2018, 1:04 PM · Traffic, SRE, ops-esams
mark added a comment to T190607: cp3048 hardware issues.

More recently:

Jul 3 2018, 1:02 PM · Traffic, SRE, ops-esams
mark added a comment to T190607: cp3048 hardware issues.

As Chris confirmed, this is either due to CPU or memory.

Jul 3 2018, 12:55 PM · Traffic, SRE, ops-esams
mark raised the priority of T189305: cp3034: Uncorrectable Memory Error from Medium to High.
Jul 3 2018, 12:46 PM · SRE, Traffic, ops-esams
mark raised the priority of T148422: cp3009: memory scrubbing error from Medium to High.
Jul 3 2018, 12:18 PM · Patch-For-Review, Traffic, ops-esams, SRE
mark closed T184522: To purchase for next esams visit as Resolved.

These were purchased by Rob and given to me in January.

Jul 3 2018, 11:59 AM · ops-esams, SRE
mark moved T189305: cp3034: Uncorrectable Memory Error from Backlog to Hardware Failure / Repair on the ops-esams board.
Jul 3 2018, 11:59 AM · SRE, Traffic, ops-esams
mark moved T190607: cp3048 hardware issues from Backlog to Hardware Failure / Repair on the ops-esams board.
Jul 3 2018, 11:59 AM · Traffic, SRE, ops-esams

Jun 25 2018

mark renamed T104459: Detect object, schema and data drifts between mediawiki HEAD, production masters and replicas from Automatize the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves to Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves.
Jun 25 2018, 11:56 AM · User-Ladsgroup, Sustainability (Incident Followup), Datasets-General-or-Unknown, OKR-Work, DBA

Jun 14 2018

mark claimed T136403: Move cp3030+ from OE14 to OE13 in racktables.

It looks like cp3030-cp3039 are in OE13 (despite what Racktables says), and cp3040+ are in OE40. So this is swapped from reality in Racktables. This is confirmed with LLDP info from the switch.

Jun 14 2018, 11:43 AM · ops-esams, SRE
mark raised the priority of T136403: Move cp3030+ from OE14 to OE13 in racktables from Lowest to High.
Jun 14 2018, 11:40 AM · ops-esams, SRE

May 30 2018

mark added a comment to T195623: request to assign wmf6937 (mw1298, former imagescaler) (now: wmf4727) as phab1002.

@mark: Is this something you would want to approve? If it was a permanent allocation, I know it would be. Since it is a temp allocation, I'm not certain.

Please advise.

May 30 2018, 6:05 PM · Patch-For-Review, SRE, hardware-requests

May 25 2018

mark added a project to T85414: wikibase: synchronize schema on production with what is created on install: DBA.
May 25 2018, 9:09 AM · User-Addshore, [DEPRECATED] wdwb-tech, DBA, MediaWiki-extensions-WikibaseRepository, Wikidata
mark reopened T85414: wikibase: synchronize schema on production with what is created on install as "Open".
May 25 2018, 9:08 AM · User-Addshore, [DEPRECATED] wdwb-tech, DBA, MediaWiki-extensions-WikibaseRepository, Wikidata

May 1 2018

mark added a project to T193496: Allocate public v4 IPs for Neutron setup in eqiad: netops.
May 1 2018, 3:03 PM · Patch-For-Review, cloud-services-team (Kanban), Cloud-Services, netops, SRE

Apr 30 2018

mark updated subscribers of T193408: SPF record for canonical domains.
Apr 30 2018, 4:49 PM · Patch-For-Review, Mail, SRE

Mar 21 2018

mark closed T180069: Pybal should be able to advertise to multiple routers as Resolved.

This is now in the latest PyBal releases, so resolving this ticket.

Mar 21 2018, 1:00 PM · PyBal, Traffic, SRE

Mar 20 2018

mark added a comment to T188453: Google Search Console access for Search Platform team.

Unfortunately, we've long had a problem managing access to the admin account for the Google Search Console. Currently the only option is to share this account's single password among multiple people, which, besides the obvious security problems, also causes Google to detect abuse all the time.

Mar 20 2018, 11:26 AM · Search-Console-access-request, Discovery-Search, SRE

Mar 13 2018

mark added a comment to T184666: DBA review for GlobalPreferences schema.

@jcrespo: One thing that hasn't been mentioned is that GlobalPreferences will almost certainly reduce the growth of the user_properties tables in the long run. Instead of users having to duplicate all of their preference selections across all the wikis they use, they will only have to set them once on their home wiki and declare them to be global. So the net effect on db storage should actually be positive in the long run.

Mar 13 2018, 12:37 PM · Community-Tech, MediaWiki-extensions-GlobalPreferences, Schema-change, DBA

Mar 9 2018

mark moved T178151: Add UDP monitor for pybal from Backlog to In Progress on the PyBal board.
Mar 9 2018, 9:59 AM · Patch-For-Review, SRE, Traffic, PyBal

Feb 21 2018

mark added a comment to T187910: Define a special range in constants.pp for the LVS hosts.

You are setting up a publicly accessible web service, right? So you should probably open up port 80 (and/or 443) to the entire world, not just LVS servers.

Feb 21 2018, 5:42 PM · SRE

Feb 6 2018

mark added a comment to T172459: eqiad row D switch upgrade.

@BBlack The thing is, we physically could do this in 2 weeks if we made it our top priority and did nothing else. I don't know how urgent this is: whether it is long-tail maintenance that can wait, or things are literally breaking apart. A manager would know where to put it on our pile and how to prioritize it with more context; that is why I mentioned it, so we can provide you with more accurate timing.

In our case, literally as I write things, mysql is breaking apart... wait for the ticket.

you just need to find an answer to the question from your end

If it was entirely on a reasonable expectation, I would wait until a failover, in 4+ months, as it saves us time, but I do not think it is entirely up to us; we can discuss.

Feb 6 2018, 1:10 PM · Infrastructure-Foundations, Patch-For-Review, SRE, netops, Traffic

Jan 16 2018

mark closed T178325: Operations 2017-18 Q2 Program 6 umbrella task as Resolved.
Jan 16 2018, 1:19 PM · SRE, Kubernetes

Jan 10 2018

fgiunchedi awarded T169518: Decommission esams ms-fe / ms-be a Like token.
Jan 10 2018, 1:34 PM · Patch-For-Review, decommission-hardware, ops-esams, SRE
fgiunchedi awarded T169518: Decommission esams ms-fe / ms-be a Like token.
Jan 10 2018, 9:01 AM · Patch-For-Review, decommission-hardware, ops-esams, SRE

Jan 9 2018

mark added a comment to T184065: Setup new access switches.

asw-oe14-esams, asw-oe15-esams and asw-oe16-esams have all been mounted in their respective racks, all at position 24 (so midway).

Jan 9 2018, 4:08 PM · SRE, ops-esams
mark added a project to T184522: To purchase for next esams visit: ops-esams.
Jan 9 2018, 2:58 PM · ops-esams, SRE
mark created T184522: To purchase for next esams visit.
Jan 9 2018, 1:50 PM · ops-esams, SRE
mark added a comment to T176816: cr2-esams temperature warning.

cr2-esams is mounted facing the hot row (like all network equipment) so this makes sense.

Jan 9 2018, 1:39 PM · DC-Ops, ops-esams, netops, SRE
mark updated the task description for T184063: Remove all decommissioned hardware.
Jan 9 2018, 1:38 PM · SRE, Epic, ops-esams
mark closed T177228: Multiple systems in esams OE10 showing PSU failures as Resolved.

cp3048 had one PSU loosely connected, fixed. All other systems in the rack have redundant power atm.

Jan 9 2018, 1:20 PM · Traffic, ops-esams, DC-Ops, SRE

Jan 4 2018

mark updated the task description for T184063: Remove all decommissioned hardware.
Jan 4 2018, 2:58 PM · SRE, Epic, ops-esams
mark raised the priority of T184179: Missing references to s8 on maintenance and cloud scripts (and potentially others) from Medium to High.
Jan 4 2018, 1:48 PM · cloud-services-team (Kanban), Data-Services, Wikidata, MediaWiki-Maintenance-system, SRE

Jan 3 2018

mark added a comment to T174637: Setup esams atlas anchor.

Have we acquired a new image for AS14907 yet?

Jan 3 2018, 1:41 PM · SRE, netops, ops-esams
mark raised the priority of T176816: cr2-esams temperature warning from Medium to High.
Jan 3 2018, 1:28 PM · DC-Ops, ops-esams, SRE, netops
mark moved T98984: Check power supply balance settings on cp3030+ from Backlog to Hardware Failure / Repair on the ops-esams board.
Jan 3 2018, 1:24 PM · DC-Ops, SRE, ops-esams
mark moved T148422: cp3009: memory scrubbing error from Backlog to Hardware Failure / Repair on the ops-esams board.
Jan 3 2018, 1:24 PM · Patch-For-Review, Traffic, ops-esams, SRE
mark moved T166965: Degraded RAID on lvs3001 from Backlog to Hardware Failure / Repair on the ops-esams board.
Jan 3 2018, 1:24 PM · Traffic, ops-esams, SRE
mark moved T168619: Degraded RAID on lvs3001 from Backlog to Hardware Failure / Repair on the ops-esams board.
Jan 3 2018, 1:24 PM · ops-esams, SRE
mark moved T169035: bast3002 sdb broken from Backlog to Hardware Failure / Repair on the ops-esams board.
Jan 3 2018, 1:24 PM · SRE, ops-esams
mark moved T176816: cr2-esams temperature warning from Backlog to Hardware Failure / Repair on the ops-esams board.
Jan 3 2018, 1:24 PM · DC-Ops, ops-esams, SRE, netops
mark moved T179953: cp3043 disk failure from Backlog to Hardware Failure / Repair on the ops-esams board.
Jan 3 2018, 1:24 PM · Traffic, SRE, ops-esams
mark moved T183814: Degraded RAID on bast3002 from Backlog to Hardware Failure / Repair on the ops-esams board.
Jan 3 2018, 1:24 PM · ops-esams, SRE
mark triaged T184068: Procure and install LVS and miscellaneous servers as Medium priority.
Jan 3 2018, 1:22 PM · Traffic, hardware-requests, SRE, ops-esams
mark added a parent task for T174637: Setup esams atlas anchor: T184061: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking).
Jan 3 2018, 1:18 PM · SRE, netops, ops-esams
mark added a subtask for T184061: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking): T174637: Setup esams atlas anchor.
Jan 3 2018, 1:18 PM · SRE, Epic, ops-esams
mark removed a parent task for T174616: set up cr3-esams: T184061: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking).
Jan 3 2018, 1:17 PM · ops-esams, SRE, netops
mark removed a subtask for T184061: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking): T174616: set up cr3-esams.
Jan 3 2018, 1:17 PM · SRE, Epic, ops-esams
mark added a subtask for T184067: Complete router migration from cr1-esams to cr3-esams: T174616: set up cr3-esams.
Jan 3 2018, 1:17 PM · netops, SRE, ops-esams
mark added a parent task for T174616: set up cr3-esams: T184067: Complete router migration from cr1-esams to cr3-esams.
Jan 3 2018, 1:17 PM · ops-esams, SRE, netops
mark triaged T184067: Complete router migration from cr1-esams to cr3-esams as Medium priority.
Jan 3 2018, 1:17 PM · netops, SRE, ops-esams
mark added a parent task for T174616: set up cr3-esams: T184061: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking).
Jan 3 2018, 1:15 PM · ops-esams, SRE, netops
mark added a subtask for T184061: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking): T174616: set up cr3-esams.
Jan 3 2018, 1:15 PM · SRE, Epic, ops-esams
mark added a project to T174616: set up cr3-esams: ops-esams.
Jan 3 2018, 1:15 PM · ops-esams, SRE, netops
mark triaged T184066: rack/setup/install ps[12]-oe1[456]-esams as Medium priority.
Jan 3 2018, 1:13 PM · SRE, ops-esams
mark triaged T184065: Setup new access switches as Medium priority.
Jan 3 2018, 1:11 PM · SRE, ops-esams
mark triaged T184064: Prepare racks OE14, OE15 and OE16 with new infrastructure as Medium priority.
Jan 3 2018, 1:09 PM · SRE, ops-esams
mark moved T184063: Remove all decommissioned hardware from Backlog to Decommission on the ops-esams board.
Jan 3 2018, 1:07 PM · SRE, Epic, ops-esams
mark added a subtask for T184063: Remove all decommissioned hardware: Unknown Object (Task).
Jan 3 2018, 1:06 PM · SRE, Epic, ops-esams
mark added a parent task for T167376: Decommission cp300[3456]: T184063: Remove all decommissioned hardware.
Jan 3 2018, 1:06 PM · decommission-hardware, SRE, ops-esams
mark added a subtask for T184063: Remove all decommissioned hardware: T167376: Decommission cp300[3456].
Jan 3 2018, 1:06 PM · SRE, Epic, ops-esams
mark added a parent task for T94215: decommission cp3001 & cp3002: T184063: Remove all decommissioned hardware.
Jan 3 2018, 1:04 PM · DC-Ops, SRE, Patch-For-Review, ops-esams
mark added a parent task for T87790: decom amslvs1-4 (dc work): T184063: Remove all decommissioned hardware.
Jan 3 2018, 1:04 PM · Patch-For-Review, DC-Ops, SRE, ops-esams
mark added subtasks for T184063: Remove all decommissioned hardware: T169518: Decommission esams ms-fe / ms-be , T95742: Decomission amssq31-62 (32 hosts), T87790: decom amslvs1-4 (dc work), T94215: decommission cp3001 & cp3002, T130883: decom cp3011-22 (12 machines), T159480: Decommission bast3001.
Jan 3 2018, 1:04 PM · SRE, Epic, ops-esams
mark added a parent task for T95742: Decomission amssq31-62 (32 hosts): T184063: Remove all decommissioned hardware.
Jan 3 2018, 1:04 PM · decommission-hardware, DC-Ops, SRE, ops-esams
mark added a parent task for T130883: decom cp3011-22 (12 machines): T184063: Remove all decommissioned hardware.
Jan 3 2018, 1:04 PM · Patch-For-Review, decommission-hardware, ops-esams, SRE
mark added a parent task for T169518: Decommission esams ms-fe / ms-be : T184063: Remove all decommissioned hardware.
Jan 3 2018, 1:04 PM · Patch-For-Review, decommission-hardware, ops-esams, SRE
mark added a parent task for T159480: Decommission bast3001: T184063: Remove all decommissioned hardware.
Jan 3 2018, 1:04 PM · Patch-For-Review, decommission-hardware, ops-esams, SRE
mark triaged T184063: Remove all decommissioned hardware as Medium priority.
Jan 3 2018, 1:02 PM · SRE, Epic, ops-esams
mark triaged T184061: SRE 2017-18 Q3 goal Cleanup esams and refresh servers and infrastructure (tracking) as Medium priority.
Jan 3 2018, 12:59 PM · SRE, Epic, ops-esams

Dec 20 2017

mark triaged T183341: New item fails (Special and WEF tool) as High priority.
Dec 20 2017, 1:35 PM · User-Daniel, MediaWiki-libs-Rdbms, Wikidata