
High rate of 5XX errors from maps.wikimedia.org since 2022-05-05 ~03:20
Closed, Resolved · Public · BUG REPORT

Assigned To
Authored By
jcrespo
May 5 2022, 11:06 AM
Referenced Files
F35105727: image.png
May 6 2022, 6:52 AM
F35104826: image.png
May 5 2022, 1:20 PM
F35104772: image.png
May 5 2022, 12:49 PM
F35104674: image.png
May 5 2022, 11:26 AM
F35104642: Screenshot_20220505_125737.png
May 5 2022, 11:06 AM
F35104641: Screenshot_20220505_125813.png
May 5 2022, 11:06 AM

Description

Since approximately 03:20 UTC, the rate of 5XX responses as monitored on the Wikimedia CDNs has increased from 1-2 per second to 20-30 per second:

https://grafana.wikimedia.org/d/000000503/varnish-http-errors?orgId=1&viewPanel=7&from=1651705335095&to=1651748535095

(please ignore the spike from 5:40-5:55, that was an unrelated outage T307647)

Looking at the logstash logs, the source of the errors seems to be maps:

(error rates without maps included)

Screenshot_20220505_125737.png (557×1 px, 63 KB)

(only error rates of maps)

Screenshot_20220505_125813.png (833×1 px, 126 KB)

The maps performance dashboards look worrying and abnormal (increased load and I/O):

https://grafana.wikimedia.org/d/W67hDHEGz/maps-performances-jgiannelos?orgId=1&from=1651143918627&to=1651748718627

Event Timeline

It looks like the errors at the service level have stabilized now:
https://logstash.wikimedia.org/goto/b120e6cc14580395c999030f1e16c872

It looks like most of the errors are caused by geoshapes requests that send queries back to the service itself, causing load combined with starvation of DB connections.
This is a known issue. From the dashboards it looks like connection availability is going back up, so things should slowly resolve.

I suggest we monitor it for now.

image.png (1×3 px, 493 KB)

The period where available connections drop to almost zero correlates with the high error count from the kartotherian service.

Jgiannelos triaged this task as Medium priority. May 5 2022, 11:27 AM

The increase in HTTP traffic makes sense because of the way geoshapes retries the HTTP request on failure.
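For illustration, a minimal sketch (in Python, with a hypothetical endpoint and retry count; not the actual kartotherian/geoshapes code) of why retry-on-failure amplifies load: each request that fails against an already-starved database turns into several backend hits, whereas the fail-fast variant corresponds to the behaviour after the patch below.

```python
import requests

# Hypothetical illustration only: the endpoint, parameters and retry count
# are assumptions, not the real kartotherian/geoshapes implementation.
GEOSHAPES_URL = "https://maps.wikimedia.org/geoshape"  # illustrative endpoint

def fetch_geoshape_with_retries(params, retries=2, timeout=5):
    """Old behaviour (sketch): retry on failure, so one incoming request can
    become (1 + retries) queries against an already overloaded backend."""
    last_error = None
    for _ in range(1 + retries):
        try:
            resp = requests.get(GEOSHAPES_URL, params=params, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as err:
            last_error = err  # backend is overloaded; retrying only adds load
    raise last_error

def fetch_geoshape_fail_fast(params, timeout=5):
    """New behaviour (sketch): no retries, so failures don't multiply load."""
    resp = requests.get(GEOSHAPES_URL, params=params, timeout=timeout)
    resp.raise_for_status()
    return resp.json()
```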

Change 789598 had a related patch set uploaded (by Jgiannelos; author: Jgiannelos):

[mediawiki/services/kartotherian@master] Dont retry geoshape requests on failure

https://gerrit.wikimedia.org/r/789598

Another thing that looks a bit off: among the top 50 requests in the error logs, this tile shows up repeatedly:
GET https://maps.wikimedia.org/osm-intl/6/33/21.png

Turns out this tile was never cached. My guess here is that:

  • the tile is very big
  • envoy timed out the requests
  • tegola never cached it

Which checks out from the tegola logs:

image.png (1×2 px, 572 KB)

I manually hit the tegola pod to bypass envoy and it eventually got cached.
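For reference, a hedged sketch (Python; the pod address and tile path are placeholders, not the real internal endpoints) of what "manually hitting the tegola pod" amounts to: request the problematic tile directly from the pod with a generous timeout, so the render can finish and the backend gets a chance to write the tile to its cache.

```python
import time
import requests

# Illustrative only: the pod address and tile path/format are assumptions.
TEGOLA_POD = "http://tegola-pod.example.internal:8080"  # placeholder address
TILE_PATH = "/osm-intl/6/33/21.png"                     # tile from the error logs

def warm_tile(base_url=TEGOLA_POD, path=TILE_PATH, timeout=300):
    """Fetch the tile directly (bypassing envoy) with a long timeout so the
    backend can finish rendering and cache the result."""
    start = time.monotonic()
    resp = requests.get(base_url + path, timeout=timeout)
    resp.raise_for_status()
    elapsed = time.monotonic() - start
    print(f"{len(resp.content)} bytes in {elapsed:.1f}s (status {resp.status_code})")

if __name__ == "__main__":
    warm_tile()
```

The elapsed time is also a useful data point when deciding what the envoy timeout threshold should be.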

Let's see if this fixes things. One potential action item could be to fine-tune the envoy timeout threshold.

jcrespo renamed this task from High rate of 5XX errors from maps.wikimedia.org since 2022-05-05 ~03:00 to High rate of 5XX errors from maps.wikimedia.org since 2022-05-05 ~03:20. May 5 2022, 12:57 PM
jcrespo updated the task description. (Show Details)

It looks like manually GETing the tile fixed the DB query latency:

image.png (664×1 px, 54 KB)

At the moment:

  • DB connections look healthy
  • The load/IO issue on maps nodes is resolved
  • Tiles DB query latency went back down
  • Tile request rate is back to our regular traffic
  • No errors related to map tiles/snapshots

The only errors we get at the moment are about geoshapes, so my current guess is that we have an issue with the indexing.
We need to enable slow query logging on the maps replicas to understand what's happening.

Change 789638 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] postgres: allow enabling the slow query log on replicas

https://gerrit.wikimedia.org/r/789638
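For context, a hedged sketch of what enabling the slow query log typically means at the postgres level: set log_min_duration_statement so statements slower than a threshold get logged, then reload the config. The actual change is managed via the puppet patch above; the connection details and the 1-second threshold below are assumptions for illustration.

```python
import psycopg2

# Illustrative only: host, database and threshold are placeholders; the real
# change is applied through the puppet patch referenced above.
conn = psycopg2.connect(host="maps-replica.example", dbname="gis", user="postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block

with conn.cursor() as cur:
    # Log any statement that takes longer than 1 second (assumed threshold).
    cur.execute("ALTER SYSTEM SET log_min_duration_statement = '1s'")
    cur.execute("SELECT pg_reload_conf()")  # apply without a restart
```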

From postgres logs around the time errors started on maps1007:

2022-05-05 03:24:04 GMT LOG:  server process (PID 24334) was terminated by signal 9: Killed

There was a corresponding OOM kill at the same time (and some others up until 07:00 or so):

[Thu May  5 03:24:12 2022] Out of memory: Kill process 25314 (postgres) score 46 or sacrifice child
[Thu May  5 03:24:12 2022] Killed process 25314 (postgres) total-vm:9829200kB, anon-rss:1206852kB, file-rss:4kB, shmem-rss:4947668kB
[Thu May  5 03:24:12 2022] oom_reaper: reaped process 25314 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:4947668kB

Maybe the current issue with geoshapes is because of a corrupted index? Still TBD based on the output of the slow query logs, but in the past reindexing has worked.

Maybe it's worth depooling maps1007 temporarily, if it's the only node that had issues, to see if we still get errors.

Mentioned in SAL (#wikimedia-operations) [2022-05-05T16:15:29Z] <akosiaris> T307671 depool maps1007 from traffic per suggestion.

The number of errors seems to have been reduced by around a third (?), but they seem to keep happening: https://logstash.wikimedia.org/goto/89b18804d16f435ed4d0b5438f68615b

Also, postgres on maps1006 was OOM-killed at the same time:

2022-05-05 03:24:10 GMT LOG:  server process (PID 42352) was terminated by signal 9: Killed

It would be useful for debugging to see if only the nodes that were killed by OOM are causing errors. Not sure if we can afford to depool another node though.

This view is very useful for debugging on the maps end:

image.png (1×1 px, 366 KB)

I think what happened is a combination of:

  • Pages requesting the tile from https://phabricator.wikimedia.org/T307671#7906646
  • Tile was too big and timed out the request
  • After timeout tegola never cached it
  • The fact that we never managed to cache this tile caused load on the DB
  • Load on the DB caused geoshapes to fail
  • Failures caused more load because of retries (already tackled in a patch)
  • Load caused OOM kills
  • Postgres went down

Still TBD whether there is an issue with the indexing after postgres was killed.
The drop in query latency is where I manually populated the cache for the faulty tile.

Change 789638 merged by Hnowlan:

[operations/puppet@production] postgres: allow enabling the slow query log on replicas

https://gerrit.wikimedia.org/r/789638

Mentioned in SAL (#wikimedia-operations) [2022-05-06T11:38:55Z] <hnowlan> enabling postgres slow query log on maps replicas T307671

I don't see any slow queries on the replicas after enabling the postgres config. Also, kartotherian-level errors are back to regular rates:
https://logstash.wikimedia.org/goto/933d704ba3f70f30b6999dbc71fa6769

Could it be the case that there is an issue somewhere higher in the stack?

The issue seems to have gone away since ~01:00 UTC today: https://logstash.wikimedia.org/goto/61cf711c8ad63cc8e9520775e9101be2

The number of errors throughout the infrastructure is back to normal/good levels: https://logstash.wikimedia.org/goto/61cf711c8ad63cc8e9520775e9101be2 (20-60 503s per minute).

Jgiannelos claimed this task.

Change 789598 merged by jenkins-bot:

[mediawiki/services/kartotherian@master] Dont retry geoshape requests on failure

https://gerrit.wikimedia.org/r/789598