
Termbox service is logging timeouts mostly on codfw
Open, Stalled, Needs Triage · Public · BUG REPORT

Description

On 2019-07-24 timeouts were reported by fsero to us from the codfw pods.

They all appeared to be triggered due to the automatic healthcheck requests.

We hoped that deploying new images to codfw would fix the problem. It did not, and we saw intermittent timeouts from all 4 codfw pods in the hour after deployment.

The content of the errors wasn't very helpful, because the request key of the AxiosError was empty when the request failed at the network level (T228885).
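A minimal sketch of the logging problem described above: when an Axios request fails at the network level (timeout, DNS, connection reset), `error.response` is absent and `error.request` can be empty, so `error.config` is the only reliable source of context. The helper name and shapes below are illustrative, not the service's actual code.

```typescript
// Hypothetical error-shaped interface mirroring the relevant AxiosError fields.
interface AxiosLikeError {
  message: string;
  config?: { url?: string; timeout?: number };
  request?: object; // may be an empty object on network-level failures
  response?: { status: number };
}

// Build a log line that stays useful even when request/response are empty,
// by falling back to the request config (url, timeout).
function describeAxiosError(err: AxiosLikeError): string {
  if (err.response) {
    // HTTP-level failure: the server answered with an error status.
    return `HTTP ${err.response.status} from ${err.config?.url ?? "unknown"}`;
  }
  // Network-level failure: response is missing, request may be empty.
  const url = err.config?.url ?? "unknown";
  const timeout = err.config?.timeout ?? "?";
  return `Network failure: ${err.message} (url=${url}, timeout=${timeout}ms)`;
}
```

Logging from `error.config` in this way is roughly what the more detailed logging mentioned below makes possible.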

Now that we have more detailed logging, we can see the failed requests. The majority are still from codfw, and all of those are to the Special:EntityData endpoint.
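For context, this is an illustrative sketch (not the service's actual code) of the kind of Special:EntityData URL the termbox service requests when fetching an entity; the `revision` parameter and base URL here are examples.

```typescript
// Build a Wikibase Special:EntityData URL for a given entity, optionally
// pinned to a specific revision. Illustrative helper, not production code.
function specialEntityDataUrl(base: string, entityId: string, revision?: number): string {
  const url = new URL(`${base}/wiki/Special:EntityData/${entityId}.json`);
  if (revision !== undefined) {
    url.searchParams.set("revision", String(revision));
  }
  return url.toString();
}

// e.g. specialEntityDataUrl("https://www.wikidata.org", "Q64", 123)
```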

Event Timeline

More detailed logging of the request parameters has been available since yesterday.

There are no events to look at in the last 24 hours. I guess we'll put this at the bottom of the pile until we see some errors.

In the last 24 hours there is now some data:
https://logstash.wikimedia.org/goto/aa869fe7c18571f288fbb8afd45d582a

One failure around 3am to content languages. This came from eqiad pod: termbox-production-57bf7c845-d42wc
13 more failures earlier, around 8pm, trying to get the entity from Special:EntityData. These came from a mix of all 4 codfw pods.

Tarrow renamed this task from Service on codfw is logging Timeouts to Termbox service is logging timeouts mostly on codfw.Aug 1 2019, 9:57 AM

From IRC @Joe and I talked:

<_joe_> Giuseppe Lavagetto it might be due to some congestion on the link, or some other cross-dc issue
11:14 AM 100 errors in 3 weeks is well below the limit where I start investigating though :)
11:15 AM <tarrow> Tom _joe_: right; it's just that right now the only traffic hitting it is the healthcheck service. We're wondering if that will spike once we have real traffic
11:16 AM <_joe_> Giuseppe Lavagetto possibly, but when you'll have real traffic there, it will be because mw is active-active, so api-ro will point to codfw :)
11:16 AM so no more cross-dc latencies etc.

Another flurry of errors this afternoon. It's enough to cause us to trigger pages to #-operations.

I'm unclear whether we should bump the timeout across the board, bump it only for the "off" data center, or start investigating whether there is a network issue of some kind.
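A hedged sketch of the two timeout options being weighed above. The values and the notion of a "remote DC" check are assumptions for illustration only; neither reflects the service's actual configuration.

```typescript
// Option 1: bump the timeout globally. Option 2: keep the default for the
// local DC and allow a longer timeout only for cross-DC requests.
// All values below are hypothetical.
const DEFAULT_TIMEOUT_MS = 3000;   // assumed current default
const REMOTE_DC_TIMEOUT_MS = 6000; // assumed bumped value for the "off" DC

function requestTimeoutMs(isRemoteDc: boolean): number {
  return isRemoteDc ? REMOTE_DC_TIMEOUT_MS : DEFAULT_TIMEOUT_MS;
}
```

The per-DC variant avoids masking slow local responses while tolerating the extra cross-DC latency; the trade-off is more configuration to maintain.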

There were 40 (!) timeout errors on Friday (2019-08-02) from health check requests logged by the production services, only one each on Saturday and Sunday, plus one on Sunday from the test service serving test.wikidata.org.

Tarrow changed the task status from Open to Stalled.Aug 15 2019, 9:46 AM

I think the conclusion we have come to is to monitor this occasionally, but per the discussion with Joe we'll not actively work on it unless it becomes more of a pain point.

@Tarrow: This open task is only associated with an archived project tag, so it will not show up on any project workboard and cannot be found when searching by project. Please either associate an active project tag to this task, or change its status. Thanks!

@Tarrow: No reply, hence adding Wikidata-Termbox so this task is displayed on a workboard of an active project.