
DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))
Closed, Resolved · Public

Description

I recently started receiving notification alarms from Icinga for contint2001.mgmt/SSH:

Notification Type: PROBLEM

Service: SSH
Host: contint2001.mgmt
Address: 10.193.2.250
State: CRITICAL

That seems rather recent:

contint2001-mgmt-ssh.png (300×900 px, 32 KB)

https://icinga.wikimedia.org/cgi-bin/icinga/histogram.cgi?host=contint2001.mgmt&service=SSH

Kunal stated that it might be the codfw network flapping somehow but we could not find a related task.


| Status | Host | Version | New version | BIOS version | New BIOS version | Comments |
|--------|------|---------|-------------|--------------|------------------|----------|
| [x] | ores2005.mgmt | 2.40 | 2.81 | | | |
| [x] | gerrit2001.mgmt | 2.21 | 2.81 | | | |
| [x] | ms-fe2006.mgmt | 2.40 | 2.81 | | | |
| [x] | wdqs2001.mgmt | 2.30 | 2.81 | | | |
| [x] | wdqs2002.mgmt | 2.30 | 2.81 | | | |
| [ ] | logstash2021.mgmt | offline | | | | |
| [ ] | logstash2022.mgmt | offline | | | | |
| [x] | contint2001.mgmt | 2.21 | 2.81 | 2.3.4 | 2.12 | Reset iDRAC |
| [x] | mw2253.mgmt | 2.40 | 2.81 | 2.3.4 | 2.12 | Reset iDRAC |
| [x] | mw2255.mgmt | 2.40 | 2.81 | 2.3.4 | 2.13 | Reset iDRAC |

Event Timeline


Thanks @hashar. I would agree with @ayounsi's analysis, if considering contint2001.mgmt in isolation.

But given your subsequent reply, and Daniel's response confirming it is spread across racks/rows but confined to codfw, I think it's unlikely to be the result of issues on the unmanaged management switches. It's unlikely we'd have multiple of them faulty, and restricted only to codfw. It's not impossible that we purchased a dodgy batch of Cat5 cables which have been used there (I did experience an issue like that before), but it's extremely unlikely.

So, given the codfw only pattern, I think it's more likely to be something with the mr1-codfw firewall, or msw1-codfw aggregation switch (that connects all the unmanaged ones). I've checked in LibreNMS and the associated links, CPU checks and other relevant metrics look good for both of these devices, so nothing is jumping out at me. But I can't say for sure there is no problem at this layer.

I observe some packet loss when sending high-speed pings from mr1-codfw to any device, for instance pinging the CRs over direct link. However when sending such pings to devices connected via management switches the results are much better towards the CR mgmt interfaces than to server mgmt interfaces:

CR via uplink:

cmooney@mr1-codfw> ping 208.80.153.206 size 1400 do-not-fragment rapid count 1000 
PING 208.80.153.206 (208.80.153.206): 1400 data bytes
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--- 208.80.153.206 ping statistics ---
1000 packets transmitted, 993 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.277/2.028/57.115/2.675 ms

CR via msw1-codfw:

cmooney@mr1-codfw> ping 10.193.0.12 size 1400 do-not-fragment rapid count 1000     
PING 10.193.0.12 (10.193.0.12): 1400 data bytes
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--- 10.193.0.12 ping statistics ---
1000 packets transmitted, 997 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.181/3.004/683.287/22.545 ms

contint2001.mgmt via msw1-codfw:

cmooney@mr1-codfw> ping 10.193.2.250 size 1400 do-not-fragment rapid count 1000    
PING 10.193.2.250 (10.193.2.250): 1400 data bytes
!!!!!!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!..!..!..!..!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!..!.!.
--- 10.193.2.250 ping statistics ---
1000 packets transmitted, 340 packets received, 66% packet loss
round-trip min/avg/max/stddev = 1.354/9.492/404.538/44.024 ms

Being honest, the pattern looks like ICMP rate-limiting. And the Dell iDRAC sub-system is known to have about as much processing power as a Casio watch, so this could be a total red herring.

To that last point though, I wonder if the issue could be the iDRAC modules on these boxes. I've seen issues with those in a past life, with SNMP and SSH connections timing out, so maybe that's it. Looking at the hosts reported above I notice they are all R430 models purchased in 2015/2016:

| Host | Model | Purchase date |
|------|-------|---------------|
| contint2001 | Dell PowerEdge R430 (1U) | 2016-03-24 |
| gerrit2001 | Dell PowerEdge R430 (1U) | 2016-03-24 |
| logstash2021 | Dell PowerEdge R430 | 2015-12-07 |
| logstash2022 | Dell PowerEdge R430 | 2016-11-03 |
| wdqs2001 | Dell PowerEdge R430 | 2016-08-15 |

I've limited time to do more checks right now. But I think it'd be worth trying to pull a longer list of affected servers, and try to confirm the server models and then the iDRAC firmware version on each. Possibly a firmware update, or even just an iDRAC reset on them, would help.

I'll assign this to myself for now. May not have a huge amount of time to work on it in the coming weeks but will see what I can dig up.

That is quite an epic diagnostic @cmooney! It is definitely not trivial to root-cause a specific piece of hardware as the common cause. Well done!

I went to fetch the IRC logs from https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-operations/ which are from May 22nd. For codfw hosts (assuming they match the pattern `2...\.mgmt`):

grep -o '.*icinga.*PROBLEM.*2...\.mgmt is CRITICAL' *|grep -o on.*mgmt|sort|uniq -c|sort -n
      1 on ores2005.mgmt
      3 on gerrit2001.mgmt
      5 on ms-fe2006.mgmt
     24 on wdqs2001.mgmt
     24 on wdqs2002.mgmt
     30 on logstash2021.mgmt
     52 on contint2001.mgmt

wdqs2002 and ms-fe2006 are PowerEdge R430 with purchase dates in 2016 as well.

Non codfw excluding hosts that had only one alarm:

 2 on mw1273.mgmt
 2 on wdqs1006.mgmt
 3 on mw1305.mgmt
 4 on mw1297.mgmt
13 on mw1279.mgmt
14 on mw1303.mgmt
24 on bast5001.mgmt
46 on cp5005.mgmt
46 on mw1284.mgmt
56 on analytics1069.mgmt

analytics1069 is a PowerEdge R730xd

wdqs1006, mw1303, mw1305, bast5001, cp5005 are PowerEdge R430

mw1279 and mw1284 I could not find; they may have been decommissioned.

We are seeing this issue because all those hosts are running an old firmware version for the IDRAC. Upgrading the IDRAC on some of those servers in the past did fix the problem. It is not a management switch issue.

@Papaul Could we schedule a firmware upgrade for gerrit2001 due to this issue? (not high prio)

@cmooney Thank you very much for all the debugging effort you put into this and thanks @Papaul for confirming it is indeed an issue of firmware upgrades.

Cathal, would you still like to keep this ticket assigned to you? Papaul, should we turn this into a tracking ticket for firmware upgrades with checkboxes of affected hosts?

@Dzahn I will go for turning this into a tracking ticket for firmware upgrades with check boxes of affected hosts.

Sorry @Dzahn I should have updated it before now. Makes sense to re-assign to DC-Ops I think.

@Papaul I think we can start with these hosts:

ores2005.mgmt
gerrit2001.mgmt
ms-fe2006.mgmt
wdqs2001.mgmt
wdqs2002.mgmt
logstash2021.mgmt
logstash2022.mgmt
contint2001.mgmt

What might be good is if we can confirm the iDRAC firmware version(s) on these first. Then possibly we can do a query to identify a full list of devices with that version?
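The confirm-then-query step could be sketched like this. This is a hedged sketch, not how anyone on this task actually did it: the inventory lines below are example data, and in practice the per-host versions would come from something like `racadm getversion` or a Redfish query against each mgmt interface (both requiring credentials), which is an assumption on my part.

```shell
#!/bin/sh
# Filter an inventory of "host iDRAC-version" pairs down to those older
# than a target version, using sort -V (GNU version sort) to compare
# dotted version strings.
TARGET=2.81

# true if $1 sorts strictly before $2 under dotted-version ordering
is_older_than() {
  [ "$1" != "$2" ] &&
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

# Example inventory (hypothetical versions, for illustration only);
# a real run would populate this from racadm/Redfish per host.
while read -r host ver; do
  is_older_than "$ver" "$TARGET" && echo "$host $ver"
done <<'EOF'
ores2005.mgmt 2.40.40.40
gerrit2001.mgmt 2.21.21.21
mw2252.mgmt 2.81.81.81
EOF
```

Run on the example data this prints only the two hosts below the 2.81 target; pointing the loop at a fleet-wide inventory would give the full candidate list for upgrades.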

Is the best thing to do now to edit the task subject/description and turn it into the tracking task as suggested?

Dzahn renamed this task from Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ) to DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )).Oct 19 2021, 4:29 PM
Dzahn updated the task description. (Show Details)

affected hosts I am ACKing right now in Icinga:

contint2001.mgmt
ms-fe2006.mgmt
mw2253.mgmt

@Papaul Could you maybe get the current versions for the ones I put in the ticket description now?

@cmooney No worries at all. All sounds good to me. I started editing ticket name and description indeed, added some check boxes and a version table.

| Host | Model |
|------|-------|
| ores2005.mgmt | PER430 |
| gerrit2001.mgmt | PER430 |
| ms-fe2006.mgmt | PER430 |
| wdqs2001.mgmt | PER430 |
| wdqs2002.mgmt | PER430 |
| logstash2021.mgmt | Server is offline |
| logstash2022.mgmt | Server is offline |
| contint2001.mgmt | PER430 |

@Papaul @Dzahn I had a go at enumerating the iDRAC firmware version on our Dell R430s. Some of the connections timed out, but for those that didn't, it resulted in the following list of codfw hosts using firmware 2.4 and below (matching the pattern in the affected hosts).

Even with the API calls I could observe that the more recent versions responded quickly, whereas the older ones took ages. Not very scientific, but it does seem to confirm the older version has issues, even if we don't always see the ping timeouts/alerts for systems on that revision. That said, I'm not sure we've had problems with all of these, so perhaps the issue is slightly more complex than just the firmware version.

Host                                   Version
-------------------------------------------------
cloudcephmon2002-dev.mgmt.codfw.wmnet  2.40.40.40                             
furud.mgmt.codfw.wmnet                 2.40.40.40                             
ganeti2007.mgmt.codfw.wmnet            2.40.40.40                             
ganeti2008.mgmt.codfw.wmnet            2.40.40.40                             
krb2001.mgmt.codfw.wmnet               2.40.40.40                             
kubernetes2001.mgmt.codfw.wmnet        2.40.40.40                             
kubernetes2002.mgmt.codfw.wmnet        2.40.40.40                             
kubernetes2003.mgmt.codfw.wmnet        2.40.40.40                             
ms-fe2005.mgmt.codfw.wmnet             2.40.40.40                             
ms-fe2007.mgmt.codfw.wmnet             2.40.40.40                             
ms-fe2008.mgmt.codfw.wmnet             2.40.40.40                             
mw2252.mgmt.codfw.wmnet                2.40.40.40                             
mw2253.mgmt.codfw.wmnet                2.40.40.40                             
mw2254.mgmt.codfw.wmnet                2.40.40.40                             
mw2255.mgmt.codfw.wmnet                2.40.40.40                             
mw2257.mgmt.codfw.wmnet                2.40.40.40                             
mw2258.mgmt.codfw.wmnet                2.40.40.40                             
ores2001.mgmt.codfw.wmnet              2.40.40.40                             
ores2002.mgmt.codfw.wmnet              2.40.40.40                             
ores2003.mgmt.codfw.wmnet              2.40.40.40                             
ores2004.mgmt.codfw.wmnet              2.40.40.40                             
ores2006.mgmt.codfw.wmnet              2.40.40.40                             
ores2007.mgmt.codfw.wmnet              2.40.40.40                             
ores2008.mgmt.codfw.wmnet              2.40.40.40                             
ores2009.mgmt.codfw.wmnet              2.40.40.40                             
pki2001.mgmt.codfw.wmnet               2.40.40.40                             
prometheus2003.mgmt.codfw.wmnet        2.40.40.40                             
prometheus2004.mgmt.codfw.wmnet        2.40.40.40                             
restbase2010.mgmt.codfw.wmnet          2.40.40.40                             
restbase2012.mgmt.codfw.wmnet          2.40.40.40                             
thumbor2003.mgmt.codfw.wmnet           2.40.40.40                             
thumbor2004.mgmt.codfw.wmnet           2.40.40.40

There are many more in eqiad, but we don't seem to observe the issue there as much (I'm wondering if it's just a race condition and the additional latency from monitoring to codfw is responsible?). Full list in the attached file anyway.

There are many more in eqiad, but we don't seem to observe the issue there as much

Yes, there are some, but not as many as in codfw; at first it looked codfw-only, though it's not. Fewer hosts, and not as often.

Right now I do see these examples though:

contint1001.mgmt
kubernetes1003.mgmt

So the ones alerting in eqiad are one case of 2.30.30.30 and one case of "404: API Endpoint Not Found". I guess let's just get rid of all 2.30.* cases first.

@Dzahn I need mw2253 and contint2001 down for me to reset the IDRAC before upgrading.

Thanks.

Mentioned in SAL (#wikimedia-operations) [2021-10-25T14:49:06Z] <mutante> depooling mw2253 for DRAC upgrade (T283582)

@Papaul mw2253 is not a problem. Done: it's shut down and downtimed.

contint2001 we have to coordinate with @hashar though

@Papaul Let's go ahead with mw2253.

For contint2001 please consider it stalled and do NOT take it down; it is currently the main CI server. I made T294271 for that.

Papaul updated the task description. (Show Details)

@Dzahn mw2253 done

@Papaul. Thank you!

mw2253:

  • scap pulled
  • confirmed icinga green
  • repooled to production

@Papaul Afraid this is a long story. I just saw mw2255.mgmt alerting in Icinga.

Mentioned in SAL (#wikimedia-operations) [2021-10-25T19:47:04Z] <mutante> mw2255 - depooled=inactive (incl "dsh groups"), shut down physically for T283582 - can be worked on anytime

Mentioned in SAL (#wikimedia-operations) [2021-10-27T20:47:43Z] <mutante> mw2255 - scap pull, repooling - after DRAC firmware was upgraded - T283582

Thanks @Papaul ! it's back in service now

I am not sure what is next exactly in this ticket. Currently I see no such alerts in Icinga but that can always change of course.

I also see no more 2.3x in codfw and eqiad upgrades should be treated separately.

So while we have a bunch of 2.4x left now, I am not sure if the goal is to upgrade them ALL now, or if we just go "on demand": keep this ticket open for a while, watch it, and add affected hosts when/if they pop up. It's a bit hard to tell when exactly we should close it.

P.S. except that the contint2001 subtask is still open, yep

@Dzahn thank you. I think it is best to just close this task and go "on demand", since most of those servers were purchased in 2016 and their refresh is next year.

Dzahn claimed this task.

I agree and boldly resolve it, expecting to reopen / add servers if I see them pop up in Icinga. Thanks for all the work on it. Icinga is less noisy already for sure.

I've seen this alert pop up a few times in the last few days, is it related?

[09:16:02] <+icinga-wm> PROBLEM - SSH on thumbor1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook

one case of mw2252.mgmt right now

@AntiCompositeNumber Not for sure, but it seems likely that it could also be fixed with firmware upgrades, yes. That said, it does not technically belong in this ticket: these are tracked per data center, this one covers the hosts in codfw, and the one you report is in eqiad. So if there are more of these, it should technically be a new ticket like this one, but for ops-eqiad. Before we do that, though, we need to weigh the effort of fixing it on old hardware against the actual benefit we get from it. There is also the option of ignoring it and permanently downtiming it. I would say it depends on how often it happens and how many hosts are affected.

I no longer receive alarms from contint2001.mgmt, which was the purpose of this task. When looking at Icinga it returns "Error: Host Not Found!" May we get the management host check back?

Has contint2001.mgmt DRAC been updated?

contint1001.mgmt started alerting a few weeks ago, I got four alarms over the course of the night https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=contint1001.mgmt&service=SSH . So I guess it will need a DRAC update as well.

db2083 and db2086 were affected today

for the record: I have absolutely no idea why contint2001.mgmt disappeared from icinga

Dzahn removed Dzahn as the assignee of this task.Jan 7 2022, 7:32 PM

@Papaul Do you know about contint2001.mgmt status?

Dzahn renamed this task from DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL )) to contint2001.mgmt disappeared from Icinga (was: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))).Jan 7 2022, 7:46 PM
Dzahn added a project: observability.
hashar renamed this task from contint2001.mgmt disappeared from Icinga (was: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) to DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))).Jan 10 2022, 8:29 AM
hashar removed a project: observability.

for the record: I have absolutely no idea why contint2001.mgmt disappeared from icinga

I have spun off a standalone task for that, T298861; it is probably related to some refactoring of the monitoring Puppet classes.

The reason I have reopened this task is to check contint1001 and see whether its IDRAC has been updated, if not we would have to upgrade it as well to prevent the alarm spam ;)

ACK, understood @hashar, then this goes back to ops-codfw

@Papaul it seems that "broken DRAC" is actually the reason for both things: the alerts here, and that it disappeared entirely from Icinga, per T298861#7608653. Could you schedule downtime with @hashar?

@Papaul wrote:

The IDRAC on this server needs reset. Please coordinate a day and time that is best for this server to be taken off line.

The machine hosts the CI servers, and interruptions are quite disruptive to our developers and to deployments. Ideally we should avoid conflicts with the scheduled windows at https://wikitech.wikimedia.org/wiki/Deployments

Anytime in your morning (which is my late afternoon) will be fine, my calendar amusso@ wikimedia.org should be up to date.

@hashar since Monday is a holiday, let us do this on the 18th at 10am CT. Thanks

@Papaul ack, I have sent the following announcement to ops-l and wikitech-l.

The continuous integration server contint2001 will be restarted for a 
hardware maintenance on Tuesday January 18th at 16:00 UTC. During the 
maintenance, the CI systems will be unavailable:

- Jenkins
- Zuul
- https://integration.wikimedia.org/

The out-of-band management system requires an update to address 
intermittent loss of connectivity.  We have to restart the server.


Time conversions:

PST  8:00
CT  10:00
UTC 16:00
CET 17:00

And I have added it to the deployments calendar https://wikitech.wikimedia.org/wiki/Deployments#Tuesday%2C_January_18

@hashar let me know when this is offline so i can take over

Mentioned in SAL (#wikimedia-operations) [2022-01-18T15:59:52Z] <hashar> Shutting down CI for maintenance on contint2001 # T283582

@Papaul the machine is shutting down. I am on IRC if you want to sync up.

Reset the iDRAC, upgraded the BIOS and iDRAC.

hashar assigned this task to Papaul.

I have restarted ferm.

Zuul/Jenkins seems to behave properly. Thank you @Papaul for the upgrade!

@hashar no problem you can close the task once all is back online.

CI had to be restarted after the machine came up due to some oddities. The system is fully back up now. Thank you @Papaul!