
eqord - ulsfo Telia link down - IC-313592
Closed, ResolvedPublic

Description

The Telia circuit IC-313592 between eqord and ulsfo went down.

There was no planned maintenance mail or calendar entry for this time frame.

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr2-eqord&service=Router+interfaces

https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cr3-ulsfo&service=Router+interfaces


CRITICAL: host '208.80.154.198', interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0:
xe-0/1/1: down -> Transport: cr3-ulsfo:xe-0/1/1 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave];

CRITICAL: host '198.35.26.192', interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0:
xe-0/1/1: down -> Transport: cr2-eqord:xe-0/1/1 (Telia, IC-313592, 51ms) {#1062} [10Gbps wave];
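The Icinga status lines above pack the interface counters and the failing interface into a single string. A minimal sketch of pulling them apart — a hypothetical helper for illustration, not part of the actual Icinga check — could look like:

```python
import re

def parse_router_interfaces(status):
    """Split an Icinga 'Router interfaces' status line into its
    counters and the list of interfaces reported down.
    (Illustrative parser, not the real check's code.)"""
    # Counters look like "up: 53, down: 1, dormant: 0, ..."
    counts = {
        key: int(val)
        for key, val in re.findall(
            r"(up|down|dormant|excluded|unused): (\d+)", status
        )
    }
    # Per-interface detail follows, e.g. "xe-0/1/1: down -> Transport: ..."
    down_ifaces = re.findall(r"([\w/-]+): down ->", status)
    return counts, down_ifaces
```

Run against the cr2-eqord line above, this would report 53 interfaces up, 1 down, and identify xe-0/1/1 as the failed interface; the cr3-ulsfo line parses the same way with 66 up.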

Event Timeline

Dzahn created this task. Apr 17 2019, 4:35 PM
Restricted Application added a subscriber: Aklapper. Apr 17 2019, 4:35 PM
Dzahn added a comment (edited). Apr 17 2019, 4:38 PM

Time frame: 16:27 UTC (12:27 PST):

12:26 <+icinga-wm> PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0:

12:27 <+icinga-wm> PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0:

https://wikitech.wikimedia.org

Indeed, just got a notification:
"We have an outage which is suspected to be caused by a cable fault. Our NOC is investigating and activating local resources. We will provide more information as it becomes available."

Seems like a non-issue, as redundancy worked as expected. To be monitored.

Dzahn updated the task description. Apr 17 2019, 4:45 PM
colewhite triaged this task as High priority. Apr 17 2019, 6:47 PM

"Field tech isolated the fault location and is en route to perform a survey of the damage."

Wed, 17 Apr 2019 23:04:

Field tech is still working at the site to run OTDR and isolate the location of the fault.

Wed, 17 Apr 2019 21:46:
A splice crew has been dispatched to Lovelock with an ETA of 23:00 UTC. A construction crew is also being dispatched. The technician dispatched to take measurements is still en route with an ETA of 20:15 UTC.

Just noting that at 10:41 UTC the circuit was still down (admin up, link down) per:

akosiaris@cr3-ulsfo> show interfaces descriptions | match 313592 
xe-0/1/1        up    down Transport: cr2-eqord:xe-0/1/1 (Telia, IC-313592, 51ms) {#1062} [10Gbps wave]

Got an email 1h ago saying the onsite crew was still hard at work splicing.

This is to inform you that splicing activity on the east side is still ongoing and we will keep you updated with work progress.
Dzahn added a comment. Apr 18 2019, 9:19 PM
Techs have completed splicing and are hands off. It may be necessary to reset your services locally at your equipment. We will now proceed to form an official RFO which we will share at a later time. 

..

Thu, 18 Apr 2019 18:26:

We have seen alarms clear and our transmission restore. We are awaiting confirmation from the field that techs are hands off.
Dzahn closed this task as Resolved. Apr 18 2019, 9:21 PM
Dzahn claimed this task.
12:01 <+icinga-wm> RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 
                   https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
12:02 <+icinga-wm> RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 
                   https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
Dzahn reassigned this task from Dzahn to ayounsi. Tue, May 14, 1:48 AM
Dzahn added a comment. Tue, May 14, 1:52 AM

5 hours ago: "We regret to inform you that we are facing a cable issue between Denver and Strasburg in the US. We will investigate and update you accordingly."

4 hours ago: "Our provider confirmed that they have a fiber issue outside Denver. They are dispatching a technician to investigate this issue. We appreciate your patience, sincere apologies for the inconvenience. Thanks "

3 hours ago: "Our Provider advised that Field Operations arrived on site and have identified what appears to be a partial damage approximately 3.7 miles from the testing facility. The OSPE has been engaged and is reviewing the results to assist with confirming the exact point of failure. Field Operations will also begin traversing the span in the approximate area of the suspected damage. An ETA is currently unavailable."

2 hours ago: "Field Operations reports that Field Operations are searching the approximate area for the damage looking to confirm the exact point of failure. The OSPE is also currently in transit to the failure site, but with heavy rush hour traffic is not expected to arrive until approximately 00:00 GMT on May 14, 2019. "

1 hour ago: "Field Operations report that the OSPE remains in transit to the failure site. An updated ETA is currently unavailable. "

1 hour ago: "Our provider has confirmed that the OSPE has arrived at the failure site and confirmed that a 96-count buried fiber optic cable has been damaged and is causing the service interruption. A tentative plan has been established to install approximately 700 feet of temporary 96-count fiber to expedite service restoration. Splicing teams, construction teams, and replacement fiber have all been engaged and dispatched to the failure site. Once all necessary repair personnel are on site, the OSPE will confirm if the repair plan will be possible or if another plan of action will be established. All necessary teams are expected to be on site by approximately 02:00 GMT. "

ayounsi closed this task as Resolved. Tue, May 14, 3:21 PM

Thanks! Yes, it's fine to reuse the ticket. The link is back up now.