
eqord - ulsfo Telia link down - IC-313592
Closed, Resolved · Public

Description

The Telia circuit IC-313592 between eqord and ulsfo went down.

There was no planned maintenance mail or calendar entry for this time frame.

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr2-eqord&service=Router+interfaces

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr3-ulsfo&service=Router+interfaces


CRITICAL: host '208.80.154.198', interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0:
xe-0/1/1: down -> Transport: cr3-ulsfo:xe-0/1/1 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave];

CRITICAL: host '198.35.26.192', interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0:
xe-0/1/1: down -> Transport: cr2-eqord:xe-0/1/1 (Telia, IC-313592, 51ms) {#1062} [10Gbps wave];
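
For reference, the "Transport: ..." description string in these alerts encodes the remote end, provider, circuit ID and latency. A minimal Python sketch of pulling those fields apart (the regex and field names are my own, for illustration only, and not part of the actual monitoring):

import re

# Field layout assumed from the two alert lines above.
DESC_RE = re.compile(
    r"Transport: (?P<remote>\S+) "
    r"\((?P<provider>[^,]+), (?P<circuit>[^,]+), (?P<latency>\d+)ms\)"
)

desc = "Transport: cr3-ulsfo:xe-0/1/1 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]"
m = DESC_RE.search(desc)
if m:
    # -> cr3-ulsfo:xe-0/1/1 Telia IC-313592 51
    print(m.group("remote"), m.group("provider"), m.group("circuit"), m.group("latency"))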

Event Timeline

Time frame: 16:27 UTC, 12:27 PST:

12:26 <+icinga-wm> PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0:

12:27 <+icinga-wm> PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0:

https://wikitech.wikimedia.org

Indeed, just got a notification:
"We have an outage which is suspected to be caused by a cable fault. Our NOC is investigating and activating local resources. We will provide more information as it becomes available."

Seems like a non-issue as redundancy worked as expected. To be monitored.

" Field tech isolated the fault location and is en route to perform a survey of the damage. "

Wed, 17 Apr 2019 23:04:

Field tech is still working at the site to run OTDR and isolate the location of the fault.

Wed, 17 Apr 2019 21:46:
A splice crew has been dispatched to Lovelock with an ETA of 23:00 UTC. A construction crew is also being dispatched. The technician dispatched to take measurements is still en route with an ETA of 20:15 UTC.

Just noting that at 10:41 UTC the circuit was still down, per:

akosiaris@cr3-ulsfo> show interfaces descriptions | match 313592 
xe-0/1/1        up    down Transport: cr2-eqord:xe-0/1/1 (Telia, IC-313592, 51ms) {#1062} [10Gbps wave]
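
The same check can also be scripted; here is a hedged sketch using the junos-eznc (PyEZ) library (hostname and user are placeholders, SSH-key authentication is assumed, and this is not how our monitoring actually polls it):

from jnpr.junos import Device  # junos-eznc (PyEZ)

# Placeholder host/user; authentication via SSH keys is assumed.
with Device(host="cr3-ulsfo.example.net", user="monitor") as dev:
    # Same check as the manual CLI above.
    out = dev.cli("show interfaces descriptions | match 313592", warning=False)
    print(out)
    if " down " in out:
        print("IC-313592 is still oper-down")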

Got an email an hour ago saying the on-site crew was still hard at work splicing.

This is to inform you that splicing activity on the east side is still ongoing and we will keep you updated with work progress.

Techs have completed splicing and are hands off. It may be necessary to reset your services locally at your equipment. We will now proceed to form an official RFO which we will share at a later time. 


Thu, 18 Apr 2019 18:26:

We have seen alarms clear and our transmission restore. We are awaiting confirmation from the field that techs are hands off.
Dzahn claimed this task.
12:01 <+icinga-wm> RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 
                   https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
12:02 <+icinga-wm> RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 
                   https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down

5 hours ago: "We regret to inform you that we are facing a cable [fault] between Denver and Strasburg in the US. We will investigate and update you accordingly."

4 hours ago: "Our provider confirmed that they have a fiber issue outside Denver. They are dispatching a technician to investigate this issue. We appreciate your patience, sincere apologies for the inconvenience. Thanks "

3 hours ago: "Our Provider advised that Field Operations arrived on site and have identified what appears to be a partial damage approximately 3.7 miles from the testing facility. The OSPE has been engaged and is reviewing the results to assist with confirming the exact point of failure. Field Operations will also begin traversing the span in the approximate area of the suspected damage. An ETA is currently unavailable."

2 hours ago: "Field Operations reports that Field Operations are searching the approximate area for the damage looking to confirm the exact point of failure. The OSPE is also currently in transit to the failure site, but with heavy rush hour traffic is not expected to arrive until approximately 00:00 GMT on May 14, 2019. "

1 hour ago: "Field Operations report that the OSPE remains in transit to the failure site. An updated ETA is currently unavailable. "

1 hour ago: "Our provider has confirmed that the OSPE has arrived at the failure site and confirmed that a 96-count buried fiber optic cable has been damaged and is causing the service interruption. A tentative plan has been established to install approximately 700 feet of temporary 96-count fiber to expedite service restoration. Splicing teams, construction teams, and replacement fiber have all been engaged and dispatched to the failure site. Once all necessary repair personnel are on site, the OSPE will confirm if the repair plan will be possible or if another plan of action will be established. All necessary teams are expected to be on site by approximately 02:00 GMT. "

Thanks! Yes it's fine to reuse the ticket. Link is back up now.

and..it is DOWN again

23:03 <+icinga-wm> PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0:

https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down

23:03 <+icinga-wm> PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0:

https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
  • Maintenance window:

Start Date and Time: 2019-May-23 03:00 UTC
End Date and Time: 2019-May-23 07:00 UTC

Action and Reason: Emergency hardware work is needed to restore traffic. We will reset the board, but in the worst case we will need to replace the board. Both actions are covered by the same emergency work.

^ That matches, so it should recover within 4 hours at most. We will see, though...

23:21 <+icinga-wm> RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0

23:22 <+icinga-wm> RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0

This went down again today at 20:12 UTC, per Icinga:

20:12 <+icinga-wm> PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 62, down: 1, dormant: 0, excluded: 0, unused: 0:

https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down

20:13 <+icinga-wm> PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 74, down: 1, dormant: 0, excluded: 0, unused: 0:

https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down

Transport: cr3-ulsfo:xe-0/1/1 (Telia, IC-313592, 51ms 10Gbps wave) {#11372};

Telia: We have opened case 01134611 and are currently investigating.

"Kindly be informed that your circuit involved in major outage in our network , suspecting faulty card , investigation is ongoing. Major case 01134704."

Wed, 18 Mar 2020 01:09:

Our technician performed cleaning of the FMP in Denver and the issue is resolved; traffic is up now.

We apologize for the inconvenience caused.

Wed, 18 Mar 2020 00:18:

We have engaged our HW provider in the case and they have ordered a dispatch to the site with a spare card. No ETR is available yet, but we will keep you posted as much as we can.

Dzahn reopened this task as Open. (Edited May 7 2020, 10:18 AM)

IC-313592 is down again

Transport: cr2-eqord:xe-0/1/1 (Telia, IC-313592, 51ms 10Gbps wave) {#1062};
Transport: cr3-ulsfo:xe-0/1/1 (Telia, IC-313592, 51ms 10Gbps wave) {#11372};

Last State Change: 2020-05-07 10:11:25
Last State Change: 2020-05-07 10:11:11

sent mail to Telia

"we observed some flaps in our WDM equipment and we suspect a card problem in Palo Alto, USA. Investigation is now ongoing with our vendor and we will keep you posted. "

ACKed for 6 more hours while Telia fixes it.

After ticket 01157098 was resolved, the link didn't come back up.
Ticket 01157707 was opened.
Telia set up a loop on the Chicago side towards SF, which brought the SF interface up, but the Chicago-facing loop didn't bring the interface up.

As the problem started with a remote issue, it's unlikely that cabling is at fault, so my guess is that the optic died.

Opened procurement T252188

ayounsi mentioned this in Unknown Object (Task). May 8 2020, 10:01 AM
ayounsi added a project: ops-eqord.

Remote hands replaced the optic yesterday but the link is still down. Light levels are correct.

Emailed Telia 12h ago with:

Remote hands replaced the optic, we're still seeing good light:
    Laser output power                        :  0.7100 mW / -1.49 dBm
    Laser receiver power                      :  0.3358 mW / -4.74 dBm

But link is still down. Can you check the light levels on your side?

Waiting for their answer.

At 07:50 UTC Telia responded stating that this was due to a planned maintenance PWIC110129, despite the circuit having been down for days already.

The maintenance was scheduled to end at 12:00. At 12:40 I emailed them.

At 12:56 they opened case 01160116, and at 13:10 they asked us to bounce both ends of the interface and report light levels.

Replied to Telia:

Thanks, interfaces have been bounced on both ends.

Here are the light levels on the Chicago side:

cdanis@cr2-eqord> show interfaces diagnostics optics xe-0/1/1
Physical interface: xe-0/1/1

Laser bias current                        :  36.480 mA
Laser output power                        :  0.7090 mW / -1.49 dBm
Module temperature                        :  27 degrees C / 81 degrees F
Module voltage                            :  3.3000 V
Laser receiver power                      :  0.3377 mW / -4.71 dBm

No alarms or warnings on the Chicago side.

On the SFO side, we show no light / a low rx power alarm:

cdanis@cr3-ulsfo> show interfaces diagnostics optics xe-0/1/1
Physical interface: xe-0/1/1

Laser bias current                        :  33.886 mA
Laser output power                        :  0.7040 mW / -1.52 dBm
Module temperature                        :  33 degrees C / 91 degrees F
Module voltage                            :  3.2880 V
Laser receiver power                      :  0.0001 mW / -40.00 dBm
Laser rx power low alarm                  :  On
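
For context on these readings: Junos reports optical power in both mW and dBm, where dBm = 10 * log10(P_mW). A small Python sketch reproducing the values shown above (illustrative only):

import math

def mw_to_dbm(p_mw: float) -> float:
    # Convert optical power from milliwatts to dBm.
    return 10 * math.log10(p_mw)

# Readings from the diagnostics output above.
print(round(mw_to_dbm(0.7090), 2))  # cr2-eqord tx: -1.49 dBm (healthy)
print(round(mw_to_dbm(0.3377), 2))  # cr2-eqord rx: -4.71 dBm (healthy)
print(round(mw_to_dbm(0.0001), 2))  # cr3-ulsfo rx: -40.0 dBm, effectively no light (hence the rx low alarm)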

From Telia, after asking them what light levels they're getting:

Looks like we are still at times seeing low light and errors in Chicago and transmitting those to San Francisco:
CHI: Rx -3.25 Tx -3.45
Rx -55.00 @ 11:00 to 11:15 UTC
Rx -55.00 @ 2:15 to 2:30 UTC
Rx PCS: ES for the last week from customer
San Jose: Rx -3.26
Tx -55.00
Tx -55.00 for the last week same as the errors in CHI
Tx errors for last week
Was it on the Chicago side that you changed optics, and can you try a different port there?

To summarize the situation as of now:

  • The link never came back up after a seemingly unrelated Telia maintenance (linecard replacement in CA)
  • A loop set by Telia in Chicago facing ulsfo brings the ulsfo port up
  • A loop set by Telia in Chicago facing eqord doesn't bring the eqord port up
  • Replacing the eqord optic didn't fix the issue
  • Light rx/tx is good between Telia in Chicago and eqord (the -55 readings above seem to be from when I manually bounced the ports)

@faidon Telia is asking us to try a different port in eqord, which would be surprising (that has never happened before, there are no logs, and it's usually the optic that fails). So I'd like a second opinion before having to dispatch remote hands again.

Since we have no visibility into the Telia side, there is also a possibility that the issue is on their end.

From Telia:

Your service was affected by an outage along the transmission path, but the Loss of Signal we saw in Chicago happened after that outage had already started so it is unrelated.
Regarding the Loss of Sync alarm, it is something we see in our Chicago equipment:
This alarm is generated when a Loss Of Sync is detected from the client signal.
This alarm is most likely caused by:
  • A physically severed fiber between the Trib port and the client equipment
  • A physically severed fiber between the local network element and the upstream network element
  • A faulty transmitter in the client equipment
When previously tested, and again re-tested right now, placing a soft loop in Chicago facing the line, the traffic from San Francisco makes it to Chicago and then loops back, so we start transmitting back to you in San Francisco; this tells us the span is good from San Francisco to Chicago. In Chicago, we had previously dispatched our equipment to hard-loop test and replace our optic just in case, and I believe this was after you had already replaced your optic. Since that test was also passing, the next step in isolating the issue is for you to try a different port on your equipment, as well as to verify all cabling.

Following up in T252188 to move the port.

Mentioned in SAL (#wikimedia-operations) [2020-05-15T17:27:48Z] <XioNoX> renumber cr2-eqord:xe-0/1/1 to xe-0/1/3 - T221259

Change 596710 had a related patch set uploaded (by Ayounsi; owner: Ayounsi):
[operations/homer/public@master] Move ulsfo-eqord port to cr2-eqord:xe-0/1/3

https://gerrit.wikimedia.org/r/596710

Change 596710 merged by jenkins-bot:
[operations/homer/public@master] Move ulsfo-eqord port to cr2-eqord:xe-0/1/3

https://gerrit.wikimedia.org/r/596710

ayounsi claimed this task.

Physically moving the optic to a different port solved the issue.
Opened T252988 to troubleshoot that specific issue.

CDanis mentioned this in Unknown Object (Task).May 22 2020, 5:38 PM