Page MenuHomePhabricator

mw.user.sessionId sometimes returns IDs with 16 characters rather than 20
Closed, ResolvedPublic

Description

Since the start of May, the event error stream shows about 4,400 events in the UniversalLanguageSelector data stream failing validation because '.event.web_session_id' should NOT be shorter than 20 characters, '.event.web_session_id' should match pattern "^[0-9a-z]{20}$". For context, this is only about 0.05% of all UniversalLanguageSelect events over the same period.

In all the invalid events I checked, the web_session_id was 16 characters rather than 20. I don't know why this would be since mw.user.sessionId should always return a 20 character ID.

This has been happening throughout the past 90 days (the dramatic spike around 30 April is almost certainly because the error rate stayed constant as T275762 dramatically increased the overall event volume). Since T275894 was done more than 90 days ago, we can't see if the error started then, but there's no obvious connection.

This error wouldn't have occurred before T267352, since that step added the 20-character validation for web_session_id, but I doubt it caused the underlying problem of the 16 character IDs.

Event Timeline

This is also occurring for the following data streams:

  • DesktopWebUIActionsClickTracking [0]
  • MobileWebUIActionsClickTracking [1]
  • SearchSatisfaction [2]

I expected this to be the case but I'm glad the data backs it up.


[0]
month	day	n	n_16
5	25	3293252	897
5	26	3229161	919
5	27	3106105	883
5	28	2750182	827
5	29	2306607	778
5	30	2538758	921
5	31	3202681	887

select
  month,
  day,
  sum(1) as n,
  sum(if(char_length(event.token) = 16, 1, 0)) as n_16
from
  desktopwebuiactionstracking
where
  year = 2021
  and month = 5
  and day > 24
group by
  month, day
order by
  day asc
limit 10000
;
[1]
month	day	n	n_16
5	25	223859	8
5	26	220843	0
5	27	218718	0
5	28	216208	2
5	29	230081	3
5	30	242252	0
5	31	222744	1

select
  month,
  day,
  sum(1) as n,
  sum(if(char_length(event.token) = 16, 1, 0)) as n_16
from
  mobilewebuiactionstracking
where
  year = 2021
  and month = 5
  and day > 24
group by
  month, day
order by
  day asc
limit 10000
;
[2]
month	day	n	n_16
5	31	25957827	20686

select
  month,
  day,
  sum(1) as n,
  sum(if(char_length(event.mwsessionid) = 16, 1, 0)) as n_16
from
  searchsatisfaction
where
  year = 2021
  and month = 5
  and day = 31
group by
  month, day
order by
  day asc
limit 10000
;

Should we bump the priority of this task?

Should we bump the priority of this task?

In my opinion, no. As a data consumer, I definitely find this annoying but it's a very small proportion of events (less than 0.1% in all the streams discussed here) so there's no need to deal with it soon.

ovasileva subscribed.

I've added a filter to the eventgate-validation Logstash dashboard (Logstashboard?!) that refers to this task. Given T283881#7152241, I'm inclined to either lower the priority of this task to Lowest or decline this task. Regardless, I'm moving this off of Readers Web's kanban board.

I've added a filter to the eventgate-validation Logstash dashboard (Logstashboard?!) that refers to this task.

Thanks! That's very helpful.

Given T283881#7152241, I'm inclined to either lower the priority of this task to Lowest or decline this task.

I wouldn't decline it, since it's helpful to have an open task for reference and discussion even if we are unlikely to ever fix the bug.

Regardless, I'm moving this off of Readers Web's kanban board.

Yes, that makes sense. Since the actual problem seems to be with mw.user.sessionId, the primary responsibility lies with Product-Data-Infrastructure (although, of course, they may not prioritize it).

nshahquinn-wmf renamed this task from A few events in the UniversalLanguageSelector data stream failing validation because web_session_id is too short to mw.user.sessionId sometimes returns IDs with 16 characters rather than 20.Jun 22 2021, 4:15 PM

I didn't know that the faulty values are coming from mw.user.sessionId(). But now that I do, the problem looks clear: it reads the value from the 'mwuser-sessionId' cookie, which is controlled by the user and could be anything.

In fact, we used to save 16-character values into that cookie prior to rMW29c803227f0d: mediawiki.user: Increase entropy of random token generator. That was in 2018, so it's a little bit surprising that a session cookie would persist that long, but only a little bit.

The method should validate the cookie value before using it if anything is going to depend on it being exactly 20 characters.

In fact, we used to save 16-character values into that cookie prior to rMW29c803227f0d: mediawiki.user: Increase entropy of random token generator. That was in 2018, so it's a little bit surprising that a session cookie would persist that long, but only a little bit.

Amazing! We noted that this behaviour (session IDs persisting for a long time) could happen over on https://wikitech.wikimedia.org/wiki/Analytics/Sessions#Web but I didn't connect the proverbial dots…

Change 935767 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@master] mw.user: Validate the cookie in sessionId() before returning

https://gerrit.wikimedia.org/r/935767

Change 935767 merged by jenkins-bot:

[mediawiki/core@master] mediawiki.user: Validate the cookie in sessionId() before returning

https://gerrit.wikimedia.org/r/935767

I'll keep an eye on the error rate and, once it's declined sufficiently, I'll update https://wikitech.wikimedia.org/wiki/Analytics/Sessions#Web and close this task as Resolved.

Screenshot 2023-07-27 at 17.09.10.png (2×3 px, 1 MB)

Note that the other streams that were affected by this also show the same sharp decrease in these kinds of validation errors.

Thanks to @matmarex for the catch and the patch and @Krinkle for merging the patch.