QA WebUIScroll port to the new metrics platform
Closed, Resolved · Public · 0 Estimated Story Points

Assigned To
Authored By: KSarabia-WMF, Nov 29 2023, 10:48 PM

Description

NOTE: We split this ticket into two for ease of discussion. Here is the twin ticket for *webUIActions.

Once we port WebUIScroll to the new metrics platform, we want to QA the incoming scroll data.

Success Criteria:

  • Note any differences in data collection and querying.
  • Note any major blockers that need to be addressed before deprecating old schema.

Sign-Off Criteria:

QA Instructions

  1. Go to Hue
  2. See event table
  3. You can also query the Hive tables directly if you have private-analytics-group credentials

Now that the config is live, the 2 config patches we deployed should have created a new table named after the stream:

  • mediawiki_web_ui_scroll_migrated

QA summary
Summary: https://docs.google.com/spreadsheets/d/1cR2R2K54tzkHiIdUoQta53YbQapVm3YVN5KIemgmjzA/edit?usp=sharing
Sample rate documentation

Event Timeline

KSarabia-WMF renamed this task from QA Port WebUIScroll schema to the new metrics platform to QA WebUIScroll and *webUIActions schema port to the new metrics platform. Dec 7 2023, 9:32 PM
KSarabia-WMF updated the task description.
KSarabia-WMF renamed this task from QA WebUIScroll and *webUIActions schema port to the new metrics platform to QA WebUIScroll port to the new metrics platform. Dec 7 2023, 10:13 PM
KSarabia-WMF updated the task description.
Summary of data QA for the data collected by Jan 19, 2024.

QAed Schemas:
Old: mediawiki_web_ui_scroll
New: mediawiki_web_ui_scroll_migrated

Questions to confirm with engineers

  1. The number of events, sessions and pages are slightly higher in the new schema. Is it expected?
  2. Which field is to capture Spider user agent?
  3. Is access_method captured in agent.client_platform_family in the new schema?
  4. Please review the field mapping table below and confirm whether all entries are as expected.

| Field in old schema | Field in new schema | Value example |
|---|---|---|
| action | action | scroll-to-top |
| | action_context | NULL |
| | action_source | NULL |
| | action_subtype | NULL |
| web_session_id | performer.session_id | e.g. '2751f1d9e9a0417cbc1x' |
| meta.dt | meta.dt | e.g. "2024-01-16T00:17:25.272Z" |
| page_id | page.id | 59519 |
| access_method | agent.client_platform_family ❓ | access_method = 'desktop'; agent.client_platform_family = 'desktop_browser' |
| is_anon | performer.is_logged_in | true, false. The old schema captures whether the user is anonymous, while the new schema captures whether the user is logged in. |
| skin | mediawiki.skin | vector-2022 |
| user_agent_map['device_family'] | MISSING ❓ | Spider |
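
To make the mapping concrete, here is a minimal Python sketch that reshapes one old-schema row into the new schema's field layout. It is an illustration only, not part of either instrument: the helper name `old_to_new` and the sample row are hypothetical, and the `'mobile web'` value for the old `access_method` is an assumption (the table only shows the `'desktop'` pairing).

```python
# Hypothetical helper for illustration; field names come from the mapping table.
ACCESS_METHOD_TO_PLATFORM_FAMILY = {
    "desktop": "desktop_browser",    # pairing shown in the mapping table
    "mobile web": "mobile_browser",  # assumption: old-schema value for mobile traffic
}


def old_to_new(row: dict) -> dict:
    """Reshape an old-schema (web_ui_scroll) row into the migrated schema's layout."""
    return {
        "action": row["action"],
        "performer": {
            "session_id": row["web_session_id"],
            # is_anon and is_logged_in are logical inverses of each other
            "is_logged_in": not row["is_anon"],
        },
        "meta": {"dt": row["meta"]["dt"]},
        "page": {"id": row["page_id"]},
        "agent": {
            "client_platform_family": ACCESS_METHOD_TO_PLATFORM_FAMILY[row["access_method"]],
        },
        "mediawiki": {"skin": row["skin"]},
    }


# Sample row assembled from the "Value example" column (illustrative only).
old_row = {
    "action": "scroll-to-top",
    "web_session_id": "2751f1d9e9a0417cbc1x",
    "meta": {"dt": "2024-01-16T00:17:25.272Z"},
    "page_id": 59519,
    "access_method": "desktop",
    "is_anon": True,
    "skin": "vector-2022",
}
new_row = old_to_new(old_row)
print(new_row)
```

The main point to watch when comparing the two tables is the inverted boolean: counts grouped by `is_anon = true` in the old table correspond to `performer.is_logged_in = false` in the new one.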

QA details

| What has been checked | Status | Note | Snapshot of the result from the old schema | Snapshot of the result from the new schema |
|---|---|---|---|---|
| Pick one session_id, compare the result | ✅ Pass | Same number of events are captured | image.png | image.png |
| By date | | The events are available since 2024-01-08. The number of events, sessions and pages are slightly higher in the new schema. On average, 0.18% more events daily in the new schema; 0.19% more sessions daily in the new schema. | image.png | image.png |
| By action | ✅ Pass | Both schemas only captured the scroll-to-top action | image.png | image.png |
| By wiki | ✅ Pass | Both schemas captured events from 518 wikis on 2024-01-18 | image.png | image.png |
| By skin name | ✅ Pass | Both schemas only captured events in the vector-2022 skin | | |
| By user type | ✅ Pass | The old schema captures whether the user is anonymous, while the new schema captures whether the user is logged in. | image.png | image.png |
| By access method | ❓ To check with engineer | Is agent.client_platform_family the field in the new schema? | image.png | image.png |
| By agent type | ❓ To check with engineer | Not captured in the new schema. | image.png | |

The above summary is also documented in https://docs.google.com/spreadsheets/d/1cR2R2K54tzkHiIdUoQta53YbQapVm3YVN5KIemgmjzA/edit?usp=sharing .

Questions to confirm with engineers

  1. The number of events, sessions and pages are slightly higher in the new schema. Is it expected?

I see the following discrepancies in counts:

Countweb_ui_scrollweb_ui_scroll_migratedDiscrepancyQueries
Number of events11447909114717970.20%[0], [1]
Number of unique sessions604739360607900.22%[0], [1]
Number of unique pages230372523067910.13%[2], [3]

I would expect both streams to have very nearly the same number of events, sessions, and pages because the original and migrated instruments execute on the same pageviews, one after the other. I would expect a small number of events to be dropped because of variations in network conditions that are out of our control. Would that explain all of the 0.2% discrepancy? I'm not sure. I'd be happy to discuss this further.
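
For reference, the discrepancy figures can be recomputed from the raw counts returned by queries [0] to [3]. The sketch below is a minimal Python illustration; the helper name `relative_increase_pct` is hypothetical, and using the old stream as the baseline (denominator) is an assumption about how the percentages were derived.

```python
def relative_increase_pct(old: int, new: int) -> float:
    """Percentage by which `new` exceeds `old`, relative to `old` (assumed baseline)."""
    return (new - old) / old * 100


# (old stream, migrated stream) counts taken from the query results [0]-[3].
counts = {
    "events": (11447909, 11471797),
    "unique sessions": (6047393, 6060790),
    "unique pages": (2303725, 2306791),
}

discrepancies = {
    name: relative_increase_pct(old, new) for name, (old, new) in counts.items()
}

for name, pct in discrepancies.items():
    print(f"{name}: {pct:.2f}% more in web_ui_scroll_migrated")
```

All three values come out at roughly 0.1% to 0.2%, in line with the table above.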

  1. Which field is to capture Spider user agent?

When I checked, both streams have the same number of events with lower(user_agent_map['device_family']) = 'spider' or lower(user_agent_map['device_family']) = 'bot' [4][5].

  1. Is access_method captured in agent.client_platform_family in the new schema?

Yes. When agent.client_platform=mediawiki_js, agent.client_platform_family will be either desktop_browser or mobile_browser.

  1. Please review the field mapping table below and confirm whether all entries are as expected.

👍


[0]
```sql
select
  count(*) as n,
  count(distinct web_session_id) as n_distinct_sessions
from
  event.mediawiki_web_ui_scroll
where
  year = 2024
  and month = 1
  and day > 16 and day < 24
;
```

|n|n_distinct_sessions|
|---|---|
|11447909|6047393|

[1]
```sql
select
  count(*) as n,
  count(distinct performer.session_id) as n_distinct_sessions
from
  event.mediawiki_web_ui_scroll_migrated
where
  year = 2024
  and month = 1
  and day > 16 and day < 24
;
```

|n|n_distinct_sessions|
|---|---|
|11471797|6060790|

[2]
```sql
select
  count(distinct page_id) as n_distinct_pages
from
  event.mediawiki_web_ui_scroll
where
  year = 2024
  and month = 1
  and day > 16 and day < 24
;
```

|n_distinct_pages|
|---|
|2303725|

[3]
```sql
select
  count(distinct page.id) as n_distinct_pages
from
  event.mediawiki_web_ui_scroll_migrated
where
  year = 2024
  and month = 1
  and day > 16 and day < 24
;
```

|n_distinct_pages|
|---|
|2306791|

[4]
```sql
select
  count(*) as n
from
  event.mediawiki_web_ui_scroll
where
  year = 2024
  and month = 1
  and day > 16 and day < 24
  and (
    lower(user_agent_map['device_family']) = 'spider'
    or lower(user_agent_map['device_family']) = 'bot'
  )
;
```

|n|
|---|
|13|

[5]
```sql
select
  count(1) as n
from
  event.mediawiki_web_ui_scroll_migrated
where
  year = 2024
  and month = 1
  and day > 16 and day < 24
  and (
    lower(user_agent_map['device_family']) = 'spider'
    or lower(user_agent_map['device_family']) = 'bot'
  )
;
```

|n|
|---|
|13|

@phuedx, Thanks for resolving all the questions. I will further investigate the remaining question of why the numbers of events, sessions and pages are slightly higher in the new schema. Will bring it up to you when I have more data.

@phuedx, Here are some findings from my investigation.

Takeaways

  • The data discrepancy is mainly due to the difference in the number of unique sessions. Some sessions are captured by the new instrument but not by the old one, for example session '46261717108c4df8c582'. See the query in cells L3 and M3 of tab by_wiki in [1].
  • By browser family, the discrepancy in unique sessions comes mainly from the Firefox browser family. See the result and query at tab by_browser_family in [1].
  • By OS family, the discrepancy in unique sessions comes mainly from Windows, which is the prevalent OS. Other OS families, such as Linux and Ubuntu, also show a higher difference rate, but they account for smaller groups of sessions. See the result and query at tab by_os_family in [1].
  • By wiki, I did not observe a pattern by wiki or wiki family. See the result and query at tab by_wiki in [1].

[1] https://docs.google.com/spreadsheets/d/18-ZNWXwiIanAZrCFVLaORzNEhfLdHUexqIpo70hpiKE/edit?usp=sharing

Question/hypothesis for discussion:
I remember that in this schema only scroll actions that meet specific conditions are counted as 'scroll-to-top' events, to avoid capturing random scroll actions. Is it possible that these events are more easily triggered in the new instrument than in the old one?
Is it possible that the new instrument is correct, since it captured more sessions?

I am happy to discuss and explore this further.

Mabualruz closed this task as Resolved. Feb 13 2024, 5:39 PM
Mabualruz claimed this task.
Mabualruz subscribed.

In standup we have discussed this.

  • 0.2% discrepancy:

If Jennifer is happy Jon is Happy ☺️

@jwang Please check as we decided to resolve it

Thanks for checking on it. Regarding the 0.2% discrepancy, it can be marked as PASS given that 1) it's within the variance range (2.5% variance for daily events across all wikis) that we defined in Metrics Platform Instrument Migration Data QA Process Description; 2) the new instrumentation is capturing more unique sessions than the old instrumentation.
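
The pass/fail decision above can be written down as a simple threshold check. This is a minimal Python sketch, not the QA process's actual tooling: the function name `qa_status` is hypothetical, treating the 2.5% range as an inclusive bound is an assumption, and the inputs are the daily averages reported earlier in this task (0.18% more events, 0.19% more sessions).

```python
TOLERANCE_PCT = 2.5  # variance range for daily events across all wikis, per the QA process


def qa_status(discrepancy_pct: float, tolerance_pct: float = TOLERANCE_PCT) -> str:
    """Return PASS when the absolute discrepancy is within the tolerance (inclusive, assumed)."""
    return "PASS" if abs(discrepancy_pct) <= tolerance_pct else "FAIL"


# Daily average discrepancies reported in the QA summary for this task.
daily_average_pct = {"events": 0.18, "sessions": 0.19}

statuses = {name: qa_status(pct) for name, pct in daily_average_pct.items()}
print(statuses)
```

Both figures sit more than an order of magnitude below the 2.5% range, so the outcome does not depend on whether the bound is inclusive.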

Hi @KSarabia-WMF, can you confirm whether the sample rate below, as captured in the new schema, is correct?

image.png (228×1 px, 30 KB)

Query

```sql
SELECT DISTINCT sample, performer.is_logged_in
FROM mediawiki_web_ui_scroll_migrated
WHERE year = 2024 AND month = 3 AND day = 1
```

Hi @jwang
Sorry for the delay.
Yes, this is correct, but I don't believe we decided as a group if we wanted to maintain the sampling rates that were set in the old web_ui_scroll schema or not.
Question for you: what would best support your work? Should we make the new migrated event's sampling consistent with the old event or not?
If we want the new migrated scroll event to match the sample rates of the old events, I can make a ticket to make that adjustment. Just let me know.
Currently the new migrated scroll schema is set to 100%.
The old sampling rate (non-migrated) is currently:

  • 'default' configuration: the sampling rate is set to 0.1, so 10% of anonymous users will be captured.
  • 'enwiki' configuration: the sampling rate is set to 0.01, so only 1% of events will be captured.

@phuedx does this sound right? Or should the config vars have applied to both instruments (current + migrated)?

@KSarabia-WMF, thanks for checking. Can you also clarify what's the sample rate for logged-in users?

Based on the sample rates of anonymous users mentioned above, it appears something is not right. The two instruments have different sample rates but captured a similar amount of data.

| | Old | New |
|---|---|---|
| sample rate | 10% of anonymous users | 100% sample rate |
| result | image.png | image.png |

```r
query_old <-
"
SELECT is_anon, COUNT(1) AS events,
COUNT(DISTINCT web_session_id) AS sessions,
COUNT(DISTINCT page_id) AS pages
FROM event.mediawiki_web_ui_scroll
WHERE year=2024 AND month=1 AND day=18
GROUP BY is_anon
"
query_new <-
"
SELECT performer.is_logged_in, COUNT(1) AS events,
COUNT(DISTINCT performer.session_id) AS sessions,
COUNT(DISTINCT page.id) AS pages
FROM event.mediawiki_web_ui_scroll_migrated
WHERE year=2024 AND month=1 AND day=18
GROUP BY performer.is_logged_in
"
```

Based on the sample rates of anonymous users mentioned above, it appears something is not right. The two instruments have different sample rates but captured a similar amount of data.

That appears to be a bug or a shortcoming in the Metrics Platform JS client. The stream is currently configured to sample at a rate of 100% of all pageviews but the instrument is configured to execute on 10% of all pageviews.

This is a fairly common pattern – not executing an instrument for various performance reasons – and so could/should be handled by the Metrics Platform client. I'll write up a task…
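
To illustrate why the two tables ended up with a similar amount of data, here is a minimal Python sketch of how the two rates combine. It is not Metrics Platform code: the function name `effective_rate` is hypothetical, and multiplying the rates assumes the instrument gate and the stream sampling act as independent stages.

```python
def effective_rate(instrument_rate: float, stream_rate: float) -> float:
    """Fraction of pageviews that produce an event when both stages apply (assumed independent)."""
    return instrument_rate * stream_rate


# Current situation described above: instrument runs on 10% of pageviews,
# stream config samples 100% of what the instrument submits.
current = effective_rate(instrument_rate=0.1, stream_rate=1.0)

# Hypothetical alternative: if the stream were also sampled at 10%,
# only about 1% of pageviews would produce an event.
stacked = effective_rate(instrument_rate=0.1, stream_rate=0.1)

print(f"current: {current:.0%}, stacked: {stacked:.0%}")
```

Under these assumptions the migrated stream effectively receives about 10% of pageviews even though its own sampling config says 100%, which would explain why its volume matches the old instrument.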

Change 1009718 had a related patch set uploaded (by Phuedx; author: Phuedx):

[operations/mediawiki-config@master] ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config

https://gerrit.wikimedia.org/r/1009718

Change #1009718 merged by jenkins-bot:

[operations/mediawiki-config@master] ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config

https://gerrit.wikimedia.org/r/1009718

Mentioned in SAL (#wikimedia-operations) [2024-03-21T21:06:18Z] <cjming@deploy1002> Started scap: Backport for [[gerrit:1009718|ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config (T352342)]]

Mentioned in SAL (#wikimedia-operations) [2024-03-21T21:08:45Z] <cjming@deploy1002> cjming and phuedx: Backport for [[gerrit:1009718|ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config (T352342)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-03-21T21:20:42Z] <cjming@deploy1002> Finished scap: Backport for [[gerrit:1009718|ext-EventStreamConfig: Remove mediawiki.web_ui_scroll_migrated sampling config (T352342)]] (duration: 14m 24s)

As a follow-up, I have documented the sample rate at data hub.

The QA code has been cleaned up and uploaded to GitLab. gitlab link