
restbase2009 down
Closed, Resolved · Public

Description

The host restbase2009.codfw.wmnet has now been down for more than 24 hours. I have tried to restart it, but it is still not coming up. Nothing on the console.
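
For context on "nothing on the console": the console on these hosts is reached through the iLO on the management network. A minimal sketch of that check (the mgmt hostname convention and the stock hpiLO commands are assumptions, not taken from this task):

ssh restbase2009.mgmt.codfw.wmnet
</>hpiLO-> vsp          # attach to the virtual serial console
</>hpiLO-> power        # query the current power state
</>hpiLO-> power reset  # attempt a cold restart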

Event Timeline

Restricted Application added a subscriber: Aklapper.

show system1/log1 etc. has two telling entries:

</>hpiLO-> show system1/log1/record19

status=0
status_tag=COMMAND COMPLETED
Wed Jul  1 13:24:25 2020



/system1/log1/record19
  Targets
  Properties
    number=19
    severity=Critical
    date=06/30/2020
    time=00:11
    description=Server Critical Fault (Service Information: Runtime Fault, System Board,  AUX/Main EFUSE Regulator 1 (10h))
  Verbs
    cd version exit show

and

status=0
status_tag=COMMAND COMPLETED
Wed Jul  1 13:24:16 2020



/system1/log1/record20
  Targets
  Properties
    number=20
    severity=Critical
    date=06/30/2020
    time=00:11
    description=System Power Supply: General Failure (Power Supply Unknown)
  Verbs
    cd version exit show

ouch.

The server is dead. It has a main board problem.
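
For completeness, the rest of the IML can be walked from the same hpiLO prompt; a sketch based on the commands already shown above (record numbers vary per host):

</>hpiLO-> show system1/log1             # lists the available recordN entries
</>hpiLO-> show system1/log1/record20    # dump an individual record, as above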

Change 610072 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/puppet@production] Remove restbase2009 from RESTBase cassandra seeds

https://gerrit.wikimedia.org/r/610072

Change 610072 merged by Giuseppe Lavagetto:
[operations/puppet@production] Remove restbase2009 from RESTBase cassandra seeds

https://gerrit.wikimedia.org/r/610072

Change 610078 had a related patch set uploaded (by Hnowlan; owner: Hnowlan):
[mediawiki/services/restbase/deploy@master] Remove restbase2009 from canaries

https://gerrit.wikimedia.org/r/610078

Change 610078 merged by Hnowlan:
[mediawiki/services/restbase/deploy@master] Remove restbase2009 from canaries

https://gerrit.wikimedia.org/r/610078

Change 610080 had a related patch set uploaded (by Hnowlan; owner: Hnowlan):
[operations/puppet@production] restbase: fix typo

https://gerrit.wikimedia.org/r/610080

Change 610080 merged by Hnowlan:
[operations/puppet@production] restbase: fix typo

https://gerrit.wikimedia.org/r/610080

Hi @Eevans - it looks like this was originally scheduled to be refreshed this fiscal year during the annual CapEx planning, but then someone decided to push out the refresh until FY21-22. Can this be decommissioned, since we're towards the end of the 5yr server life cycle? Thanks, Willy

Do you mean... indefinitely?

Yeah. Unfortunately, this server is about 5yrs old now and out of warranty. @Eevans - will you be able to get by without having this system in rotation until it's time to refresh it? On line 51 of this year's CapEx budget sheet below, I see there's a comment there to postpone the refresh until FY21-22 along with the rest of the batch.

https://docs.google.com/spreadsheets/d/1YQ1_FwwIEdWRjkdY2GA9euuMt9s_3CNW1exDf3UbKK0/edit?ts=5e4534cc#gid=1630616671

> but then someone decided to push out the refresh until FY21-22.

That was my recommendation, by the way. The reasoning behind it was that restbase201{0,1,2} were bought barely six months after restbase2009 (albeit in the next fiscal year), and bundling them all together made more sense from a budget and cost perspective (easier and cheaper to buy 4 instead of 3+1).

Of course, I did not anticipate that the box would be rendered inoperable 20 days into the new fiscal year. If we can do without the capacity this machine provides us in codfw (now that parsoid has been moved to mediawiki and then parsercache, perhaps we need less capacity space-wise?), we should stick to that plan, as the eventual fate of RESTBase might be different a year from now given all the planning that goes toward it. If we can't, then we need to replace the main board, I guess (or use some spare server in its place?).

> Yeah. Unfortunately, this server is about 5yrs old now and out of warranty. @Eevans - will you be able to get by without having this system in rotation until it's time to refresh it? On line 51 of this year's CapEx budget sheet below, I see there's a comment there to postpone the refresh until FY21-22 along with the rest of the batch.

/cc: @WDoranWMF

Here is what that means:

We have 5 machines in each of 3 rows per data center (replication factor of 6). So losing restbase2009 means that we lose 20% of the storage capacity of row D in codfw. Since we have row/replica parity, this means losing 20% of the capacity of codfw, and since codfw is a mirror of eqiad, it means we're losing 20% of the total capacity of the cluster. IOW, I know this reads like 1 machine in 30, but it in fact means reducing overall cluster storage capacity by 20%.

We might be able to accept that sort of reduction in capacity (more investigation is needed[*]), but we should do so knowing that's what this means. Doing so almost certainly rules out adding any additional use cases between now and the next expansion (any excess capacity is maintained to accommodate such scenarios).

Finally, if we do, we should consider decommissioning another 5 (one in each of the other rows). We won't be able to fully utilize them, and it sounds like they could be used elsewhere.

[*]: Our Grafana dashboards are at odds with actual disk utilization, and we should figure that out.

Edit: T258414: Cassandra Grafana dashboards seem to disagree with actual utilization created to investigate actual utilization.
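
For whoever picks up T258414, a quick way to cross-check the dashboards against what a node itself reports; a sketch assuming c-any-nt wraps nodetool for one of the local instances and that the multi-instance data directories live under /srv (the paths are an assumption):

$ c-any-nt status -r                              # per-instance Load column, as Cassandra sees it
$ df -h /srv                                      # what the filesystem actually reports
$ du -sh /srv/cassandra-*/data 2>/dev/null        # per-instance on-disk usage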

@Eevans @akosiaris we have 2 spares on site that we can use to replace this server: wmf6413 and wmf6414 in Netbox. Both servers are HP ProLiant DL360 Gen9 with 64GB RAM, Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 8 cores, and take SSDs.

Let me know what you think.

Fine by me. If the machine is really needed (blocked on T258414 for figuring out the actual utilization), let's use that spare.

Papaul triaged this task as Medium priority. · Jul 27 2020, 2:38 AM

Change 616535 had a related patch set uploaded (by Papaul; owner: Papaul):
[operations/puppet@production] DHCP: Change MAC address for restbase2009

https://gerrit.wikimedia.org/r/616535

Change 616535 merged by Papaul:
[operations/puppet@production] DHCP: Change MAC address for restbase2009

https://gerrit.wikimedia.org/r/616535

restbase2009 is ready for re-imaging.

Please open a separate task to decommission the old restbase2009: https://netbox.wikimedia.org/dcim/devices/1099/

Thanks.

I'm looking into this today - I see that restbase2009 has been up for 9 days, has been configured by Puppet, and has been added to the Cassandra cluster, but I don't see anything in SAL about who did it. Still investigating.

eevans@restbase2009:~$ c-any-nt status -r | grep 2009
UN  restbase2009-a.codfw.wmnet  554.01 GiB  256          6.9%              c0d7a947-d423-49b3-b307-416a783a722f  d
UN  restbase2009-b.codfw.wmnet  523.81 GiB  256          6.6%              3ec9435b-dc45-46d3-a2ea-d0d5153615a9  d
UN  restbase2009-c.codfw.wmnet  470.2 GiB  256          6.5%              2e4e1268-1e17-476f-aa9e-b8c42035d115  d
eevans@restbase2009:~$

Umm, wow.

TL;DR It wasn't me. :)

I'm confused now. This must be the same machine:

eevans@restbase2009:~$ lastlog | grep -vi "never" | grep -v "Aug  6"
Username         Port     From             Latest
faidon           pts/0    208.80.153.54    Wed Jun 17 23:27:36 +0000 2020
akosiaris        pts/0    91.198.174.60    Thu Jun 18 10:57:39 +0000 2020
jbond            pts/0    208.80.153.54    Mon Jun  1 09:04:44 +0000 2020
ppchelko         pts/0    208.80.154.86    Wed Jun 24 16:27:51 +0000 2020
eevans@restbase2009:~$

So, not reimaged. And not offline, either.
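
For what it's worth, one way to tell whether the chassis itself changed (as opposed to just the disks or the image) is to compare the hardware serials against the Netbox record; a sketch using standard dmidecode keywords (the comparison against Netbox is manual):

$ sudo dmidecode -s system-serial-number      # chassis/system serial
$ sudo dmidecode -s baseboard-serial-number   # main board serial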

@Eevans @hnowlan this is a different machine with the same disks. The disks were taken out of the old machine and placed into the new machine.

@Papaul @wkandek @akosiaris @wiki_willy

Hope everyone is relatively well. I've also sent this as an email.

There is an issue as a result of 2009 coming back online. The rough chronology I have - please correct me - is:

  • restbase2009 fails
  • the restbase2009 nodes are removed from the cluster
  • a replacement machine becomes available
  • the disks are swapped into the new machine
  • the machine rejoined the cluster 9 days ago <- we're not sure how
  • this leaves us in an unexpected and unknown state

Having discussed this with @Eevans and @hnowlan, the current approach from our team is to wait and be vigilant, as we don't have a means to revert.

I checked in with @wiki_willy to discuss who we should loop in - that is how you've come to be added to this email. If I have missed someone, please add them here.

I have a couple of questions:

  1. Are there other escalation paths we need to follow?
  2. Are there any mitigations we can or should take now?
  3. Are there steps or procedures we missed? Or can we improve docs to avoid this in future?

Thanks.

Will, is the unexpected and unknown state due to the Cassandra database state? I remember this being discussed a couple of days ago.

@akosiaris since the HW problem is resolved, can I close this task? Also, can you please open a new decom task for the old restbase2009 with asset tag wmf6412?
https://netbox.wikimedia.org/dcim/devices/1099/

Sure. I've resolved it and filed T261968 for wmf6412. Thanks!