Page MenuHomePhabricator

'.event.pageViewId' should be string, '.event.subTest' should be string, '.event.searchSessionId' should be string
Closed, ResolvedPublicPRODUCTION ERROR

Description

Error
  • mwversion: undefined
  • reqId: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-2021.07.16?id=ukCssHoBmhjHfZRTZezS
  • type: eventgate_validation_error
  • errored_stream_name: eventlogging_SearchSatisfaction
normalized_message
'.event.pageViewId' should be string, '.event.subTest' should be string, '.event.searchSessionId' should be string
exception.trace
{"event":{"action":"checkin","source":"autocomplete","searchSessionId":null,"pageViewId":null,"scroll":false,"mwSessionId":"4e674597ad2f9f4cd4a3","uniqueId":"6f8652699fd6fa6b93c2kr6pbcd8","sampleMultiplier":1,"subTest":null,"skin":"vector","isAnon":true,"userEditBucket":"0 edits","skinVersion":"legacy","checkin":40},"schema":"SearchSatisfaction","webHost":"en.wiktionary.org","wiki":"enwiktionary","$schema":"/analytics/legacy/searchsatisfaction/1.4.0","client_dt":"2021-07-16T18:55:21.692Z","meta":{"stream":"eventlogging_SearchSatisfaction","domain":"en.wiktionary.org","id":"31cd2510-b6b1-49bc-8723-3bf60cad413e","dt":"2021-07-16T18:55:23.576Z","request_id":"c6bbbd0b-4cf1-4956-8e43-2f016a5423c1"},"dt":"2021-07-16T18:55:23.576Z","http":{"client_ip":"83.45.123.223","request_headers":{"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}}}
Impact

~ 2300 occurrences in the last 12 hours.

Notes

Logstash url:
https://logstash.wikimedia.org/app/dashboards#/view/AXN5OoJu3_NNwgAUlbUT?_g=h@865c245&_a=h@d23a0e8

Event Timeline

cjming updated the task description. (Show Details)
mforns subscribed.

Thanks for posting this @cjming.
We have added the Search team to the task,
because the searchsatisfaction schema, which is giving the errors, belongs to them.
Cheers!

mforns added a subscriber: Analytics-Radar.
mforns removed a subscriber: Analytics-Radar.
Aklapper removed a subscriber: Discovery-ARCHIVED.

Please don't add project tags as subscribers but as project tags

Uou, yea, my bad. Thanks for the heads up!

EChetty moved this task from Incoming (new tickets) to Ops Week on the Data-Engineering board.
thcipriani subscribed.

Still happening at the rate of about 50 per hour.

Alertmanager has been firing again for this error as of 2024-07-01: https://mail.google.com/mail/u/0/#inbox/FMfcgzQVxRGDzSKGpXzlvVqVBwtdCSWR

I see an excpliti exclusion filter for this alert in the eventgate valiation dashboard. Is the alert safe to ignore ?

We skipped estimation of this ticket for now, since we have to investigate the issue first to better understand where the deviation from the schema originates. @EBernhardson might be able to help, since he authored some of that javascript.

pfischer changed the task status from Open to In Progress.Jul 8 2024, 3:13 PM
pfischer claimed this task.
pfischer moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

I did a quick test run with the browser implied by the UA (Chrome 91 on Windows 10):

Entering text in the search input field resulted in correct events. Sadly, it's not clear on which page that happened.

Alright, I might have found a way to reproduce it:

  1. Load any wiki page
  2. Click in one of the autocomplete fields
  3. Clear local storage
  4. Enter any key in the focused autocomplete field
  5. Result: The generated event no longer contains neither`pageViewId` nor searchSessionId

searchSatisfaction.js subscribes to events coming from mediawiki.searchSuggest. When using the input as described above, the following actions are fired:

  • 1..1 session-start
  • 0..* render-one (once per autocomplete suggestion)
  • 1..1 impression-results (once all suggestions have been rendered)

impression-results leads to to a reset of the search session during which the cached state (cache on top of local SafeStorage) is cleared and cannot be backfilled since the underlying local storage has been cleared. However, that only happens if the current search session no longer isActive() which in turn looks up __EndTime__ from mv.storage instead of looking into it's local state first. Maybe that could solve it.

Change #1052998 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[mediawiki/extensions/WikimediaEvents@master] Search satisfaction: Make sure tracking remains operable after clearing local storage

https://gerrit.wikimedia.org/r/1052998

Change #1052998 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Search satisfaction: Make sure tracking remains operable after clearing local storage

https://gerrit.wikimedia.org/r/1052998

The patch should be fully rolled out now, wmf.14 is live everywhere. Sadly not seeing any significant change in the error message on a 2 week graph. It looks like there is another source for this error.

Hm. After all, it was bold to assume that so many visitors clear their local storage in the middle of search session. I'll have another look.

Here's what I found so far:

All incomplete events are checkin events, that is, they are triggered delayed at increasing intervals of several seconds/minutes. So their incompleteness could result from a SessionState that has been invalidated (cleared) in the meantime. But that does only happen on page loads/new search. Even if a new tab is opened, that only overrides the underlying local storage (since it's per domain, not tab-isolated) but leaves the SessionState wrapper in the old tab intact. I was not able to enforce a scenario where the local storage is cleared while checkins are in progress.

The other potentially problematic piece I found may be be state[ type ] = null; in SessionState.get(type) since that does not remove the property from state, so future calls to state.hasOwnProperty(type) would return true, consequently null is returned.

To try and get an idea of where these might be coming from I ran some data analysis against the user agents of the most recent 10k logs of this type. Breakdown is as follows (smaller values have been truncated for brevity).

os_familybrowser_familycount
windowschrome4324
windowsedge1790
windowsfirefox1344
mac os xsafari1035
mac os xchrome655

Sadly this didn't result in anything more specific to investigate, but does suggest we should be able to reproduce in many different browsers and os's

By skin:

skincount
vector-20226596
vector3246
monobook158

By event.source:

event.sourcecount
autocomplete9278
full text722

By event.action:

As pointed out by Peter above, they are all checkin events:

event.actioncount
checkin10000

Reproduction:

  • For simplicity set ttl in searchSatisfaction.js to 30s (instead of 10 minute default)
  • Perform a search and click through to one of the result pages to start checkin events firing
  • Wait a minute or so to ensure the ttl has expired
  • Type something that gets results (important, it can't be a zero-result query due to having a test wiki with a single page) into autocomplete

The last step, receiving autocomplete results, triggers a refresh of the session. Because the session is expired that runs the invalidate routine which resets the loaded state. Future checkin events on the same page read pageViewId from that now empty state.

I suspect our best course of action is to stop allowing checkin events after session close. It wasn't a terrible idea, but we don't do much with checkins beyond the first couple. A checkin that says the user was on a page for 10 minutes is mostly ignored.

Change #1059122 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/WikimediaEvents@master] search: Repair checkin events fired after session close

https://gerrit.wikimedia.org/r/1059122

Change #1059122 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] search: Repair checkin events fired after session close

https://gerrit.wikimedia.org/r/1059122

This will ride the train, once deployed we will need to check logstash and verify if the errors have quieted down.

This might be resolved, but might be too early to call. Going to check again on monday. The train rolled forward a few hours ago. Previously we were seeing around 50-100 errors per hour, in the last 2 hours we have seen 4 and 9 errors. I'm willing to guess at the moment those errors either have a different reproduction, or some users browser is still running the old javascript. Giving it a few days should make things more clear.

These are not 100% gone, but close enough that i'm willing to call this complete. Previously we would see 700-1000 per day, we now see ~15 per day.