low rate of mw-memcached errors
Closed, Resolved (Public)

Assigned To
Authored By
jijiki
Aug 6 2024, 10:16 AM
Referenced Files
F59360187: image.png
May 16 2025, 5:04 PM
F59360185: image.png
May 16 2025, 5:04 PM
F57113304: image.png
Aug 6 2024, 10:16 AM
F57113246: image.png
Aug 6 2024, 10:16 AM

Description

We have observed a low rate of memcached errors in production, e.g. 61 in 3 hours, which may or may not be worth investigating. While the rate is *very* low, we could dig a little further in case there is an underlying problem, or an mcrouter behaviour, that we should be aware of or mitigate.
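To put the quoted figure in perspective, here is a small worked calculation. It assumes the figure means 61 errors over a 3-hour window; the function name `error_rate` is made up for illustration and is not part of any tooling mentioned in this task.

```python
def error_rate(errors, window_hours):
    """Convert an error count over a window (in hours) into per-unit rates."""
    per_hour = errors / window_hours
    return {
        "per_hour": per_hour,
        "per_minute": per_hour / 60,
        "per_second": per_hour / 3600,
    }


# Assumption: the quoted figure is 61 errors in a 3-hour window.
rates = error_rate(61, 3)
for unit, value in rates.items():
    print(f"{unit}: {value:.5f}")
```

That works out to roughly 20 errors per hour, or about one error every three minutes.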

[image.png]

Zooming in a bit, we get:

[image.png]

6 Aug 2024 - 08:33-08:48
Rack: B5
Host: parse2006
Pod: mcrouter-main-lcqjh
Container: mcrouter-main
Notes: Resource-wise the pod looks alright, but the host itself is experiencing TCP retransmits. https://grafana.wikimedia.org/goto/iDpzTPrIg?orgId=1
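As a rough way to quantify TCP retransmits on a host outside of Grafana, the sketch below computes the share of retransmitted segments between two snapshots of `/proc/net/snmp`. It relies on the standard Linux `Tcp:` counters `OutSegs` and `RetransSegs`; the function names and the sample values are made up for illustration, and this is not how the dashboard linked above derives its numbers.

```python
def parse_tcp_counters(snmp_text):
    """Return the Tcp counters from /proc/net/snmp-formatted text as a dict."""
    tcp_lines = [
        line.split() for line in snmp_text.splitlines() if line.startswith("Tcp:")
    ]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp header/value pair found")
    # The first Tcp: line holds the field names, the second the values.
    names, values = tcp_lines[0][1:], tcp_lines[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_ratio(before, after):
    """Fraction of segments sent between two snapshots that were retransmits."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return retransmitted / sent if sent > 0 else 0.0


# Made-up sample snapshots, trimmed to the fields used here.
SAMPLE_BEFORE = "Tcp: InSegs OutSegs RetransSegs\nTcp: 1000 2000 10\n"
SAMPLE_AFTER = "Tcp: InSegs OutSegs RetransSegs\nTcp: 1800 3000 35\n"

ratio = retransmit_ratio(
    parse_tcp_counters(SAMPLE_BEFORE), parse_tcp_counters(SAMPLE_AFTER)
)
print(f"retransmit ratio: {ratio:.3%}")
```

On a real host, the two snapshots would come from reading `/proc/net/snmp` a few seconds apart and passing the text to `parse_tcp_counters`.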

I will keep adding data and check whether there is some sort of pattern here.

Event Timeline

jijiki triaged this task as Low priority.
jijiki updated the task description.

Adding these graphs as notes, though I do not think they are the cause. I also chatted with @cmooney about whether network issues could be the culprit, and we concluded that they probably are not.

[image.png]

[image.png]

jijiki raised the priority of this task from Low to Medium. (May 16 2025, 5:17 PM)

I have observed that the rate at which these errors surface has increased. I will keep working on it and post updates.

Change #1148900 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] memcached: add option to switch to the performance cpu governor

https://gerrit.wikimedia.org/r/1148900

Change #1148901 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] mediawiki::memcached enable performance cpu governor

https://gerrit.wikimedia.org/r/1148901

Change #1148900 merged by Effie Mouzeli:

[operations/puppet@production] memcached: add option to switch to the performance cpu governor

https://gerrit.wikimedia.org/r/1148900

Change #1148901 merged by Effie Mouzeli:

[operations/puppet@production] mediawiki::memcached enable performance cpu governor

https://gerrit.wikimedia.org/r/1148901
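Judging by their titles, the two changes above enable the performance CPU governor on the memcached hosts. A minimal sketch of how one might check the active governor on a Linux host follows. It assumes the standard cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`); the function names are made up for illustration and are not taken from the puppet changes.

```python
from collections import Counter
from pathlib import Path


def governor_counts(cpu_root="/sys/devices/system/cpu"):
    """Count how many CPUs are using each cpufreq scaling governor."""
    counts = Counter()
    for path in Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor"):
        counts[path.read_text().strip()] += 1
    return counts


def all_performance(cpu_root="/sys/devices/system/cpu"):
    """True only if at least one CPU reports a governor and all say 'performance'."""
    counts = governor_counts(cpu_root)
    return bool(counts) and set(counts) == {"performance"}


print(dict(governor_counts()))
```

On hosts without cpufreq support (many virtual machines, for example) the result is simply empty, so `all_performance` returns `False` there rather than guessing.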

A chunk of the memcached errors we are observing is due to T374366, which we have known about for a long time and have attempted to mitigate (so far to no avail). After chatting with @Muehlenhoff and @JMeybohm, we will first attempt to add rules allowing access from the k8s nodes to memcached, since all nodes are already running the mw-mcrouter ds. We will revisit this as the migration to nftables progresses.

Change #1153981 had a related patch set uploaded (by Muehlenhoff; author: Muehlenhoff):

[operations/puppet@production] memcached: Switch to profile::memcached::firewall_src_sets

https://gerrit.wikimedia.org/r/1153981

Change #1153981 abandoned by Muehlenhoff:

[operations/puppet@production] memcached: Switch to profile::memcached::firewall_src_sets

Reason:

replaced by 1156269

https://gerrit.wikimedia.org/r/1153981

Change #1169104 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] k8s::mediawiki_runner: allow outgoing connections to memcached

https://gerrit.wikimedia.org/r/1169104

Change #1169118 had a related patch set uploaded (by Effie Mouzeli; author: Effie Mouzeli):

[operations/puppet@production] k8s::mediawiki_runner: allow outgoing connections to memcached

https://gerrit.wikimedia.org/r/1169118

Change #1169104 abandoned by Effie Mouzeli:

[operations/puppet@production] k8s::mediawiki_runner: allow outgoing connections to memcached

Reason:

in favour of Icc0f0cfb8832809df6bb64baa015e1831b371718

https://gerrit.wikimedia.org/r/1169104

Change #1169118 merged by Effie Mouzeli:

[operations/puppet@production] k8s::mediawiki_runner: allow outgoing connections to memcached

https://gerrit.wikimedia.org/r/1169118

I am closing this, as the issue has not manifested itself for a long time.