
High rate of 5XX errors from maps.wikimedia.org since 2022-05-05 ~03:20
Closed, Resolved · Public · BUG REPORT

Assigned To
Authored By
jcrespo
May 5 2022, 11:06 AM
Referenced Files
F35105727: image.png
May 6 2022, 6:52 AM
F35104826: image.png
May 5 2022, 1:20 PM
F35104772: image.png
May 5 2022, 12:49 PM
F35104674: image.png
May 5 2022, 11:26 AM
F35104642: Screenshot_20220505_125737.png
May 5 2022, 11:06 AM
F35104641: Screenshot_20220505_125813.png
May 5 2022, 11:06 AM

Description

Since approximately 03:20 UTC, the rate of 5XX responses as monitored on the Wikimedia CDNs has increased from 1-2 per second to 20-30 per second:

https://grafana.wikimedia.org/d/000000503/varnish-http-errors?orgId=1&viewPanel=7&from=1651705335095&to=1651748535095

(please ignore the spike from 5:40-5:55, that was an unrelated outage T307647)

Looking at the logstash logs, the source of the errors seems to be maps:

(error rates without maps included)

Screenshot_20220505_125737.png (557×1 px, 63 KB)

(only error rates of maps)

Screenshot_20220505_125813.png (833×1 px, 126 KB)

The maps performance dashboards look worrying and abnormal (increased load and I/O):

https://grafana.wikimedia.org/d/W67hDHEGz/maps-performances-jgiannelos?orgId=1&from=1651143918627&to=1651748718627

Event Timeline

It looks like the errors at the service level have stabilized now:
https://logstash.wikimedia.org/goto/b120e6cc14580395c999030f1e16c872

It looks like most of the errors are caused by geoshapes requests that send queries back to the service itself, causing load combined with starvation of DB connections.
This is a known issue. From the dashboards it looks like connection availability is going back up, so things should slowly resolve.

I suggest we monitor it for now.

image.png (1×3 px, 493 KB)

The period where available connections drop to almost zero correlates with the high error count from the kartotherian service.

Jgiannelos triaged this task as Medium priority. May 5 2022, 11:27 AM

The increase in HTTP traffic makes sense because of the way geoshapes retries the HTTP request on failure.
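For illustration, a minimal sketch (in Python, with a hypothetical endpoint and retry count; not the actual kartotherian/geoshapes code) of why retry-on-failure amplifies load: each request that fails against an already-starved database turns into several backend hits, whereas the fail-fast variant corresponds to the behaviour after the patch below.

```python
import requests

# Hypothetical illustration only: the endpoint, parameters and retry count
# are assumptions, not the real kartotherian/geoshapes implementation.
GEOSHAPES_URL = "https://maps.wikimedia.org/geoshape"  # illustrative endpoint

def fetch_geoshape_with_retries(params, retries=2, timeout=5):
    """Old behaviour (sketch): retry on failure, so one incoming request can
    become (1 + retries) queries against an already overloaded backend."""
    last_error = None
    for _ in range(1 + retries):
        try:
            resp = requests.get(GEOSHAPES_URL, params=params, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_error = err  # backend is overloaded; retrying only adds load
    raise last_error

def fetch_geoshape_fail_fast(params, timeout=5):
    """New behaviour (sketch): no retries, so failures don't multiply load."""
    resp = requests.get(GEOSHAPES_URL, params=params, timeout=timeout)
    resp.raise_for_status()
    return resp.json()
```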

Change 789598 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/kartotherian@master] Dont retry geoshape requests on failure

https://gerrit.wikimedia.org/r/789598

Another thing that looks a bit off: among the top 50 requests in the error logs, this tile shows up repeatedly:
GET https://maps.wikimedia.org/osm-intl/6/33/21.png

Turns out this tile was never cached. My guess here is that:

  • the tile is very big
  • envoy timed out the requests
  • tegola never cached it

Which checks out from the tegola logs:

image.png (1×2 px, 572 KB)

I manually hit the tegola pod to bypass envoy and it eventually got cached.
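For reference, a hedged sketch (Python; the pod address and tile path are placeholders, not the real internal endpoints) of what "manually hitting the tegola pod" amounts to: request the problematic tile directly from the pod with a generous timeout, so the render can finish and the backend gets a chance to write the tile to its cache.

```python
import time
import requests

# Illustrative only: the pod address and tile path/format are assumptions.
TEGOLA_POD = "http://tegola-pod.example.internal:8080"  # placeholder address
TILE_PATH = "/osm-intl/6/33/21.png"                     # tile from the error logs

def warm_tile(base_url=TEGOLA_POD, path=TILE_PATH, timeout=300):
    """Fetch the tile directly (bypassing envoy) with a long timeout so the
    backend can finish rendering and cache the result."""
    start = time.monotonic()
    resp = requests.get(base_url + path, timeout=timeout)
    resp.raise_for_status()
    elapsed = time.monotonic() - start
    print(f"{len(resp.content)} bytes in {elapsed:.1f}s (status {resp.status_code})")

if __name__ == "__main__":
    warm_tile()
```

The elapsed time is also a useful data point when deciding what the envoy timeout threshold should be.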

Let's see if this fixes things. One potential action item could be to fine-tune the envoy timeout threshold.

jcrespo renamed this task from High rate of 5XX errors from maps.wikimedia.org since 2022-05-05 ~03:00 to High rate of 5XX errors from maps.wikimedia.org since 2022-05-05 ~03:20. May 5 2022, 12:57 PM
jcrespo updated the task description. (Show Details)

It looks like manually GETing the tile fixed the DB query latency:

image.png (664×1 px, 54 KB)

At the moment:

  • DB connections look healthy
  • The load/IO issue on maps nodes is resolved
  • Tiles DB query latency went back down
  • Tile request rate is back to our regular traffic
  • No errors related to map tiles/snapshots

The only errors we get at the moment are about geoshapes, so my current guess is that we have an issue with the indexing.
We need to enable slow query logging on the maps replicas to understand what's happening.

Change 789638 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] postgres: allow enabling the slow query log on replicas

https://gerrit.wikimedia.org/r/789638
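For context, a hedged sketch of what enabling the slow query log typically means at the postgres level: set log_min_duration_statement so statements slower than a threshold get logged, then reload the config. The actual change is managed via the puppet patch above; the connection details and the 1-second threshold below are assumptions for illustration.

```python
import psycopg2

# Illustrative only: host, database and threshold are placeholders; the real
# change is applied through the puppet patch referenced above.
conn = psycopg2.connect(host="maps-replica.example", dbname="gis", user="postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block

with conn.cursor() as cur:
    # Log any statement that takes longer than 1 second (assumed threshold).
    cur.execute("ALTER SYSTEM SET log_min_duration_statement = '1s'")
    cur.execute("SELECT pg_reload_conf()")  # apply without a restart
```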

From postgres logs around the time errors started on maps1007:

2022-05-05 03:24:04 GMT LOG:  server process (PID 24334) was terminated by signal 9: Killed

There was a corresponding OOM kill at the same time (and some others up until 07:00 or so):

[Thu May  5 03:24:12 2022] Out of memory: Kill process 25314 (postgres) score 46 or sacrifice child
[Thu May  5 03:24:12 2022] Killed process 25314 (postgres) total-vm:9829200kB, anon-rss:1206852kB, file-rss:4kB, shmem-rss:4947668kB
[Thu May  5 03:24:12 2022] oom_reaper: reaped process 25314 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:4947668kB

Maybe the current issue with geoshapes is because of a corrupted index? Still TBD based on the output of the slow query logs, but in the past reindexing has worked.

Maybe it's worth depooling maps1007 temporarily, if it's the only node that had issues, to see if we still get errors.

Mentioned in SAL (#wikimedia-operations) [2022-05-05T16:15:29Z] <akosiaris> T307671 depool maps1007 from traffic per suggestion.

The number of errors seems to have been reduced by around a third (?), but they seem to keep happening: https://logstash.wikimedia.org/goto/89b18804d16f435ed4d0b5438f68615b

Also, postgres on maps1006 was OOM-killed at the same time:

2022-05-05 03:24:10 GMT LOG:  server process (PID 42352) was terminated by signal 9: Killed

It would be useful for debugging to see if only the nodes that were killed by OOM are causing errors. Not sure if we can afford to depool another node though.

This view is very useful for debugging on the maps end:

image.png (1×1 px, 366 KB)

I think what happened is a combination of:

  • Pages requesting the tile from https://phabricator.wikimedia.org/T307671#7906646
  • Tile was too big and timed out the request
  • After timeout tegola never cached it
  • The fact that we never managed to cache this tile caused load on the DB
  • Load on the DB caused geoshapes to fail
  • Failures caused more load because of retries (already tackled in a patch)
  • Load caused OOM kills
  • Postgres went down

Still TBD whether there is an issue with the indexing after postgres was killed.
The drop in query latency is where I manually populated the cache for the faulty tile.

Change 789638 merged by Hnowlan:

[operations/puppet@production] postgres: allow enabling the slow query log on replicas

https://gerrit.wikimedia.org/r/789638

Mentioned in SAL (#wikimedia-operations) [2022-05-06T11:38:55Z] <hnowlan> enabling postgres slow query log on maps replicas T307671

I don't see any slow queries on the replicas after enabling the postgres config. Also, kartotherian-level errors are back to regular rates:
https://logstash.wikimedia.org/goto/933d704ba3f70f30b6999dbc71fa6769

Could it be the case that there is an issue somewhere higher in the stack?

The issue seems to have gone away since ~01:00 UTC today: https://logstash.wikimedia.org/goto/61cf711c8ad63cc8e9520775e9101be2

The number of errors throughout the infrastructure is back to normal/good levels: https://logstash.wikimedia.org/goto/61cf711c8ad63cc8e9520775e9101be2 (20-60 503s per minute).

Jgiannelos claimed this task.

Change 789598 merged by jenkins-bot:

[mediawiki/services/kartotherian@master] Dont retry geoshape requests on failure

https://gerrit.wikimedia.org/r/789598