Excimer UI profile lost when requested from mw-on-k8s
Closed, Resolved, Public

Assigned To

Authored By

	Krinkle
	Oct 3 2023, 2:46 AM

Description

WikimediaDebug with "Excimer UI" and "k8s-experimental" selected.
Load a page, e.g. https://test.wikipedia.org/wiki/Special:Blankpage

Actual:

https://performance.wikimedia.org/excimer/profile/dd285078377363ab

[XHR] 404 Not Found
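The 404 above can also be checked outside the browser. A minimal sketch, assuming the profile URL is reachable over HTTPS from wherever the script runs; the helper names `profile_id`, `describe_status` and `fetch_status` are hypothetical and not part of Excimer UI or WikimediaDebug, and the diagnosis strings are only illustrative:

```python
from urllib.error import HTTPError
from urllib.parse import urlparse
from urllib.request import urlopen


def profile_id(url: str) -> str:
    """Return the last path segment of an Excimer UI profile URL."""
    return urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]


def describe_status(status: int) -> str:
    """Map an HTTP status code to a short, illustrative diagnosis."""
    if status == 200:
        return "profile found"
    if status == 404:
        return "profile not found on the server"
    return f"unexpected status {status}"


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Fetch the URL and return its HTTP status, including 4xx/5xx codes."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still what we want.
        return err.code
```

Calling `describe_status(fetch_status(url))` with the profile URL from this report is one way to confirm whether the profile was ever stored, independently of the WikimediaDebug extension.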

Details

	Subject	Repo	Branch	Lines +/-
	mw: allow egress to excimer	operations/deployment-charts	master	+22 -0
	services: fix xenon/arclamp redis egress rules	operations/deployment-charts	master	+16 -6


Related Objects

Mentioned In: T347916: Investigate sharp increase in lost Arc Lamp samples (arclamp_client_error.exception)
T347987: PHP Warning: RedisException: Connection timed out
Mentioned Here: T347916: Investigate sharp increase in lost Arc Lamp samples (arclamp_client_error.exception)

Event Timeline

Krinkle created this task. Oct 3 2023, 2:46 AM

Restricted Application added a subscriber: Aklapper. Oct 3 2023, 2:46 AM

Change 963024 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] services: fix xenon/arclamp redis egress rules

https://gerrit.wikimedia.org/r/963024

gerritbot added a project: Patch-For-Review. Oct 3 2023, 12:50 PM

fgiunchedi mentioned this in T347987: PHP Warning: RedisException: Connection timed out. Oct 3 2023, 1:11 PM

fgiunchedi merged a task: T347987: PHP Warning: RedisException: Connection timed out.

fgiunchedi added subscribers: TheresNoTime, fgiunchedi.

Clement_Goubert moved this task from Backlog to In Progress on the MW-on-K8s board. Oct 3 2023, 1:45 PM

Change 963024 merged by Filippo Giunchedi:

[operations/deployment-charts@master] services: fix xenon/arclamp redis egress rules

https://gerrit.wikimedia.org/r/963024

Maintenance_bot removed a project: Patch-For-Review. Oct 3 2023, 3:10 PM

Krinkle mentioned this in T347916: Investigate sharp increase in lost Arc Lamp samples (arclamp_client_error.exception). Oct 3 2023, 5:22 PM

Change 963274 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mw: allow egress to excimer

https://gerrit.wikimedia.org/r/963274

gerritbot added a project: Patch-For-Review. Oct 4 2023, 9:26 AM

Change 963274 merged by jenkins-bot:

[operations/deployment-charts@master] mw: allow egress to excimer

https://gerrit.wikimedia.org/r/963274

After enabling egress to webperf I was able to get an excimer profile posted, e.g. https://performance.wikimedia.org/excimer/profile/f4865b080f541cf8

@Krinkle, @Clement_Goubert and I were wondering whether excimer egress traffic should be enabled for mw-debug only or for any mw k8s deployment.

Maintenance_bot removed a project: Patch-For-Review. Oct 4 2023, 10:10 AM

@fgiunchedi For baremetal, it is intentional that this is not limited to mwdebug. I can ssh to a random appserver to investigate an issue, and make a local curl request that sets X-Wikimedia-Debug. The WikimediaDebug features available in that context are logging (to a file, or Logstash) and profiling with Excimer. (XHGui is not available because it depends on php-tideways, which is unsuitable for production hosts given its high overhead even in disabled state. Excimer, by contrast, is specifically designed for production sampling.)
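The local request described above can also be scripted. This is a sketch under stated assumptions: the appserver answers plain HTTP on localhost, the target wiki is selected through the `Host` header, and the header value that enables Excimer is passed in by the caller. The `"excimer"` string below is a placeholder, not a confirmed value, and `build_debug_request` is a hypothetical helper name:

```python
from urllib.request import Request


def build_debug_request(path: str, wiki_host: str, debug_value: str,
                        server: str = "http://localhost") -> Request:
    """Build (but do not send) a request carrying X-Wikimedia-Debug."""
    req = Request(server + path)
    # Address the local server while naming the wiki we want served.
    req.add_header("Host", wiki_host)
    # The exact value format is an assumption; consult the WikimediaDebug docs.
    req.add_header("X-Wikimedia-Debug", debug_value)
    return req


req = build_debug_request("/wiki/Special:Blankpage", "test.wikipedia.org", "excimer")
```

Passing `req` to `urllib.request.urlopen` would send it; building and sending are kept apart so the headers can be inspected first.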

For Kubernetes, I imagine it's both harder and less common to need to investigate a specific pod. However, I suppose that from an egress configuration point of view, it makes no difference whether we're talking about a naturally spawned mw-web pod (not mw-debug) or one that's spawned for the purpose of creating a shell and investigating something. Granted, in most cases we'll be spawning mw-debug pods for that purpose, but it seems harmless to allow the possibility, and the fewer unneeded differences between the two, the better, I think. Especially since the failure mode would be hard to detect: it'd likely look like a mysql or webperf host issue rather than an egress issue.

That's my 2c anyway. No strong feelings either way.
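To illustrate why such a failure is hard to tell apart from a host problem, here is a rough sketch of a TCP probe that sorts connection errors into buckets. The function names and bucket labels are hypothetical. A timeout is only a hint that packets are being dropped somewhere, for example by an egress rule, and does not prove it:

```python
import socket


def classify_connect_error(exc: BaseException) -> str:
    """Sort a connection error into a coarse, illustrative bucket."""
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout: no reply, packets may be dropped in transit"
    if isinstance(exc, ConnectionRefusedError):
        return "refused: host reachable, nothing listening on that port"
    if isinstance(exc, socket.gaierror):
        return "dns: hostname did not resolve"
    return f"other: {type(exc).__name__}"


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and describe the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok: TCP connection established"
    except OSError as exc:
        return classify_connect_error(exc)
```

Running `probe` from inside the affected pod and from a host known to have access gives two results to compare: a timeout in one place and success in the other points toward the network path rather than the destination service.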

In reply to T347926#9226441 by @Krinkle:

Thank you for the added context. I agree that special-casing mw-debug isn't worth it unless we absolutely need to. In this case we can keep the non-debug / debug symmetry in place as it is now. I'm resolving the task since excimer now works in k8s.

lmata added a project: SRE Observability (FY2023/2024-Q2). Oct 9 2023, 2:59 PM

lmata moved this task from Inbox to Done on the SRE Observability (FY2023/2024-Q2) board. Jan 26 2024, 1:08 AM
