Phabricator was logging out users repeatedly (2022-08-26)
Closed, Resolved (Public)

Description

We received several user reports of people being repeatedly logged out, even after logging in again.

It seemed to affect only users served by the edge at DRMRS, so it was potentially traffic-related.

A lot of folks seem to get stuck at "Login: Partial Login" in the Phabricator logs.

Incident document: https://wikitech.wikimedia.org/wiki/Incidents/2022-08-26_Phabricator_login_issues

Event Timeline

jcrespo renamed this task from test ticket to Phabricator was logging out users repeteadly. Aug 26 2022, 10:35 AM
jcrespo reopened this task as Open.
jcrespo added projects: Traffic, Phabricator.
jcrespo updated the task description.
Aklapper renamed this task from Phabricator was logging out users repeteadly to Phabricator was logging out users repeatedly (2022-08-26). Aug 26 2022, 12:31 PM

Mentioned in SAL (#wikimedia-operations) [2022-08-29T08:55:40Z] <vgutierrez> test trafficserver: Hide non session cookies during cache lookup in cp6016 - T316338 T316337

Mentioned in SAL (#wikimedia-operations) [2022-08-29T10:09:10Z] <vgutierrez> test trafficserver: Hide non session cookies during cache lookup in drmrs - T316338 T316337

Mentioned in SAL (#wikimedia-operations) [2022-08-29T12:14:20Z] <vgutierrez> rolling restart of ats-be fleet wide to apply "Hide non session cookies during cache lookup" - T316338 T316337

Vgutierrez triaged this task as Medium priority. Sep 6 2022, 2:43 PM

I have posted the few actions I took on the incident documentation. Given that the root cause was found immediately (trafficserver) and Phabricator is fully restored, should we close this task?

> on the incident documentation

Where? There is no incident doc yet (or I couldn't find one on Wikitech)

I apologize for my unclear comment; I was referring to the note-taking document at https://docs.google.com/document/d/1Ka9MQB8OwdzAzJVfZuaIGo5VfnyRNRr_WxLPZ6YFMkE/edit . Does it have to be converted to an incident report on Wikitech? It does not seem to contain any sensitive information, so I am guessing one can copy and paste it. I could do it but could use pairing with someone familiar with the process :)

> Does it have to be converted to an incident report on Wikitech?

It does.

> I could do it but could use pairing with someone familiar with the process :)

I am going to do it, but I am waiting for a one-paragraph summary from @Vgutierrez to understand what actually happened to Varnish (not just the effects and response).

> I am waiting for a one-paragraph summary from @Vgutierrez to understand what actually happened to Varnish (not just the effects and response).

@Vgutierrez: ping? :)

> I am going to do it, but I am waiting for a one-paragraph summary from @Vgutierrez to understand what actually happened to Varnish (not just the effects and response).

Nothing happened to Varnish. ATS was the culprit. https://gerrit.wikimedia.org/r/c/operations/puppet/+/826785 prevented Phabricator session cookies from reaching the Phabricator origin server. A more detailed explanation is included in the commit message for https://gerrit.wikimedia.org/r/c/operations/puppet/+/828002:

474fb2d didn't work as expected because it hit an ATS bug / misdocumented feature. Cookie data was stored in ts.ctx during do_global_post_remap() and restored in do_global_cache_lookup_complete() but for some reason ts.ctx gets wiped in the middle of those two hooks.

474fb2d also missed the step now performed in hide_cookie_store_response(). Upon a server response we need to hide the same cookies that were hidden during the cache lookup stage.
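
The failure mode described in that commit message can be illustrated with a minimal sketch. It is written in Python rather than as an ATS plugin, and it is not the actual puppet change: the function names, the `SESSION_COOKIE` pattern, and the cookie names in the usage note are all hypothetical. It models only the first problem (per-transaction context lost between the hide and restore hooks), not the server-response step.

```python
import re

# Hypothetical pattern for cookies that stay visible during cache lookup.
# The real pattern used in production is not given in this task.
SESSION_COOKIE = re.compile(r"(session|token)", re.IGNORECASE)


def split_cookies(header):
    """Split a Cookie header into (session, other) lists of 'name=value' pairs."""
    session, other = [], []
    for pair in filter(None, (p.strip() for p in header.split(";"))):
        name = pair.split("=", 1)[0]
        (session if SESSION_COOKIE.search(name) else other).append(pair)
    return session, other


def hide_for_cache_lookup(headers, ctx):
    """Stash the full Cookie header in ctx and expose only session cookies."""
    original = headers.get("Cookie")
    if original is None:
        return
    ctx["original_cookie"] = original
    session, _ = split_cookies(original)
    if session:
        headers["Cookie"] = "; ".join(session)
    else:
        del headers["Cookie"]


def restore_after_cache_lookup(headers, ctx):
    """Put the original Cookie header back before contacting the origin."""
    original = ctx.get("original_cookie")
    if original is not None:
        headers["Cookie"] = original


def origin_request(cookie, wipe_ctx):
    """Return the Cookie header the origin would see (None if absent)."""
    headers = {"Cookie": cookie}
    ctx = {}
    hide_for_cache_lookup(headers, ctx)
    if wipe_ctx:
        ctx = {}  # simulates the context being lost between the two hooks
    restore_after_cache_lookup(headers, ctx)
    return headers.get("Cookie")
```

In this sketch, calling `origin_request("phsid=abc; phusr=alice", wipe_ctx=True)` returns `None`: neither illustrative cookie name matches the assumed session pattern, so both are hidden and never restored, which is consistent with the origin treating the user as not fully logged in. With `wipe_ctx=False` the original header comes back unchanged.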

Thanks, that is all I needed to understand the context! I will create a draft doc on Wikitech and link it here for review. As you can see, your feedback was needed, as I would have incorrectly blamed Varnish rather than the ATS layer.

What should we do with this task? Anything left?

As soon as I finish the Wikitech description I intend to resolve it.

jcrespo assigned this task to Vgutierrez.
jcrespo updated the task description.

@hashar @Vgutierrez Please review my summary of the incident at: https://wikitech.wikimedia.org/wiki/Incidents/2022-08-26_Phabricator_login_issues . I left some things as guesses, as I am unsure what the best actionables are for ATS/Phabricator, but you are welcome to edit it and file any follow-up tickets if necessary.