Replace remaining Samsung SSDs
Closed, Resolved, Public

Description

With the recent refresh of restbase200{1,2,3,4,5,6}, there are 11 hosts remaining that are populated with the problematic Samsung SSDs. Replacing the SSDs themselves is of course an option, but since all of the hosts are either past their supported-until date or within 6 months of it, I propose first replacing the hosts entirely.

Standard host configuration

  • PowerEdge R440
  • 2 @ Intel Xeon Silver 4114 2.2G, 10C/20T, 9.6GT/s, 14M Cache, Turbo, HT (85W) DDR4-2400
  • 4 @ 32GB RDIMM 2666MT/s Dual Rank (128G)
  • 3 @ 1.92TB SSD SATA Read Intensive 6Gbps 512n 2.5in Hot-plug Drive, Hawk-M4R, 1 DWPD, 3504 TBW
See also: {T210884}

Hosts w/ Samsung SSDs

Codfw

Host          Supported until
restbase2007  2019-04-22
restbase2008  2019-04-22

Eqiad

Host          Supported until
restbase1007  2018-05-28
restbase1008  2018-05-28
restbase1009  2018-05-28
restbase1010  2019-02-23
restbase1011  2019-02-23
restbase1012  2019-02-23
restbase1013  2019-02-23
restbase1014  2019-02-23
restbase1015  2019-02-23
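
The claim in the description (every host is either past its supported-until date or within 6 months of it) can be checked against the dates listed above. The following is only a sketch: the function names are made up for illustration, "6 months" is interpreted as six calendar months, and the reference date is assumed to be the day this task was created (2018-10-26).

```python
import calendar
from datetime import date

# Dates copied from the "Supported until" tables in the description.
SUPPORTED_UNTIL = {
    "restbase2007": date(2019, 4, 22),
    "restbase2008": date(2019, 4, 22),
    "restbase1007": date(2018, 5, 28),
    "restbase1008": date(2018, 5, 28),
    "restbase1009": date(2018, 5, 28),
    "restbase1010": date(2019, 2, 23),
    "restbase1011": date(2019, 2, 23),
    "restbase1012": date(2019, 2, 23),
    "restbase1013": date(2019, 2, 23),
    "restbase1014": date(2019, 2, 23),
    "restbase1015": date(2019, 2, 23),
}

# Assumption: the reference date is the task creation date.
AS_OF = date(2018, 10, 26)


def add_months(d, months):
    """Shift d forward by whole calendar months, clamping the day if needed."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def support_status(supported_until, as_of, window_months=6):
    """Classify a host as 'past', 'within window', or 'later'."""
    if supported_until < as_of:
        return "past"
    if supported_until <= add_months(as_of, window_months):
        return "within window"
    return "later"


for host, until in sorted(SUPPORTED_UNTIL.items()):
    print(host, until.isoformat(), support_status(until, AS_OF))
```

Under those assumptions, restbase1007 through restbase1009 come out as past their date and the remaining eight hosts fall inside the six-month window, which agrees with the description.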

Event Timeline

Eevans triaged this task as Medium priority. Oct 26 2018, 6:51 PM
Eevans created this task.

@RobH Can we get information on the hosts listed in the description, vis-a-vis whether they're leased or purchased, and any time remaining before a scheduled refresh?

Ok, most of that date info for each host can be viewed on this Google Sheet: https://docs.google.com/spreadsheets/d/1kxDpjqBKVWixAOyS-YmBwG3bNtB480J45nU34MTjub4/edit?usp=sharing

As for replacement date, that depends on a few factors, but if it's a lease it's 3 years, and if it is owned it is up to 5 years. The owned hosts can vary though, depending on the needs of the service in question, budget, and planning. So I would recommend going with 3 years for leased and 5 years for owned for now.
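
As a rough illustration of the rule of thumb above (3 years for leased hosts, 5 years for owned ones), a planned replacement date could be derived as sketched below. The function name and the in-service dates in the usage lines are hypothetical; the real dates live in the linked sheet and are not reproduced here.

```python
import calendar
from datetime import date

# Rule of thumb from the comment above; owned hosts may be replaced earlier.
LIFETIME_YEARS = {"lease": 3, "own": 5}


def planned_replacement(in_service, ownership):
    """Return the in-service date plus the lifetime for that ownership type."""
    year = in_service.year + LIFETIME_YEARS[ownership]
    # Clamp Feb 29 to Feb 28 when the target year is not a leap year.
    day = min(in_service.day, calendar.monthrange(year, in_service.month)[1])
    return date(year, in_service.month, day)


# Hypothetical in-service dates, for illustration only.
print(planned_replacement(date(2016, 3, 1), "lease"))
print(planned_replacement(date(2016, 3, 1), "own"))
```

For owned hosts the result is best read as an upper bound, since the 5-year figure is "up to" and depends on service needs, budget, and planning.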

RobH mentioned this in Unknown Object (Task). Nov 30 2018, 8:22 PM
Eevans renamed this task from Procure remaining hardware for RESTBase cluster capacity upgrade to Replace remaining Samsung SSDs. Nov 30 2018, 8:40 PM
Eevans updated the task description.
mobrovac added subtasks: Unknown Object (Task), Unknown Object (Task). Jan 18 2019, 2:08 AM
mobrovac mentioned this in Unknown Object (Task). Jan 22 2019, 6:47 PM
mobrovac mentioned this in Unknown Object (Task).
RobH closed subtask Unknown Object (Task) as Resolved. Feb 25 2019, 4:16 PM
Papaul closed subtask Unknown Object (Task) as Resolved. Mar 7 2019, 4:14 PM
Cmjohnson closed subtask Unknown Object (Task) as Resolved. Apr 8 2019, 3:43 PM

Mentioned in SAL (#wikimedia-operations) [2019-04-08T16:31:12Z] <urandom> bootstrapping cassandra-a, restbase2019 -- T208087

Mentioned in SAL (#wikimedia-operations) [2019-04-08T20:24:50Z] <urandom> bootstrapping cassandra-b, restbase2019 -- T208087

Mentioned in SAL (#wikimedia-operations) [2019-04-09T00:45:52Z] <urandom> bootstrapping cassandra-c, restbase2019 -- T208087

Mentioned in SAL (#wikimedia-operations) [2019-04-09T10:09:21Z] <godog> bootstrapping cassandra-a, restbase2020 -- T208087

Mentioned in SAL (#wikimedia-operations) [2019-04-09T14:09:35Z] <godog> bootstrapping cassandra-b, restbase2020 -- T208087

Mentioned in SAL (#wikimedia-operations) [2019-04-09T18:07:13Z] <urandom> bootstrapping cassandra-c, restbase2020 -- T208087

Change 502631 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Add restbase20(19|20) to the list of targets

https://gerrit.wikimedia.org/r/502631

Change 502631 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Add restbase20(19|20) to the list of targets

https://gerrit.wikimedia.org/r/502631

Mentioned in SAL (#wikimedia-operations) [2019-04-09T22:11:20Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@c0a2977]: Bring RB on restbase20(19|20) up to date - T208087

Mentioned in SAL (#wikimedia-operations) [2019-04-09T22:13:52Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@c0a2977]: Bring RB on restbase20(19|20) up to date - T208087 (duration: 02m 32s)

Mentioned in SAL (#wikimedia-operations) [2019-04-10T12:41:46Z] <urandom> decommissioning cassandra-b, restbase2007 -- T208087

Mentioned in SAL (#wikimedia-operations) [2019-04-10T20:43:41Z] <urandom> decommissioning cassandra-c, restbase2007 -- T208087

Mentioned in SAL (#wikimedia-operations) [2019-04-11T23:57:28Z] <urandom> decommissioning cassandra-b, restbase2008 -- T208087

Mentioned in SAL (#wikimedia-operations) [2019-04-12T12:53:29Z] <urandom> decommissioning cassandra-c, restbase2008 -- T208087

Change 503452 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] Set restbase200[78] as spares and remove them from conftool

https://gerrit.wikimedia.org/r/503452

Change 503460 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Remove restbase200[78] from the list of targets

https://gerrit.wikimedia.org/r/503460

Change 503452 merged by Dzahn:
[operations/puppet@production] Set restbase200[78] as spares and remove them from conftool

https://gerrit.wikimedia.org/r/503452

Mentioned in SAL (#wikimedia-operations) [2019-04-16T18:07:56Z] <mutante> restbase2007, restbase2008 - re-enabled puppet which was disabled with reason 'decom'ed' but actually needed to run to decom after they had moved to role::spare::system (T208087)

Change 503460 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Remove restbase200[78] from the list of targets

https://gerrit.wikimedia.org/r/503460

Mentioned in SAL (#wikimedia-operations) [2019-05-21T15:56:08Z] <urandom> decommissioning restbase1007-a -- T208087