User Details
- User Since
- Jan 6 2022, 11:29 AM (205 w, 4 d)
- Availability
- Available
- LDAP User
- Snwachukwu
- MediaWiki User
- SNwachukwu (WMF)
Mon, Dec 8
@Htriedman I created an MR to add the DDL Scripts to your wmepageview repo. Please review. I have added it to your repo and afterwards we can update the production repository.
Fri, Dec 5
@Htriedman, the database is created and now I want to create the tables. I have prepared the create table statements. I just need you to confirm the column comments before I create them. See patch
Wed, Dec 3
To point (1) — in WME datasets, the page_id is referred to as identifier (inside a JSON object). Because this dataset is going to be used in a WME context, I plan to stick with that convention.
In that case should we rename the page_id column in the hourly table pageview_hourly_proportion to identifier for consistency?
Thanks @Htriedman for all your responses. I'm about to create the tables in the newly created database. Do you have a folder for create table statements? I can't seem to find one in your repos.
Tue, Dec 2
@Htriedman, regarding the request to move the code to a non-user repo, I'm a little confused. It seems it has already been moved, because I can see a similar repo in the WME directory here. Can I say that this task:
Mon, Dec 1
Regarding the schema, @Htriedman, I took a look at the hourly job output table schemas and noticed the following:
- The table htriedman.pageview_combined_analytics has a column identifier which is the page_id. This column should be named page_id to be consistent with other tables.
- namespace_id is in neither of the tables. I suggest adding it, since it is part of a page's metadata.
Hi @Htriedman. Can you help me confirm that this is the full list of output tables that need to be moved to a non-user database?
Hourly updated tables:
- htriedman.pageview_hourly_proportion
- htriedman.pageview_combined_analytics
Daily updated tables:
- htriedman.pageview_geo_distribution
- htriedman.pageview_geo_top10
Monthly updated tables:
- htriedman.pageview_associated_distribution
- htriedman.pageview_associated_top10
Wed, Nov 26
Mon, Nov 17
Nov 7 2025
Thank you @mmartorana
Nov 3 2025
@mmartorana Oh, please, I'd like my access back if it's not too late.
Oct 31 2025
@xcollazo I have linked it.
Oct 29 2025
Yes, for clarity, here is the approach we decided on to define our thresholds:
- Use fixed thresholds and revisit them after a period of time (1 year suggested).
- Use quantiles to define the thresholds.
- The current quantile values were computed from data between March 20th and Oct 15 2025.
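The quantile-based approach above could be sketched roughly as follows. This is only a minimal illustration, not the logic actually deployed: the function names `quantile_thresholds` and `is_outside_thresholds`, the 5th/95th percentile levels, and the sample values are all assumptions made for the example.

```python
from statistics import quantiles


def quantile_thresholds(values, lower_pct=5, upper_pct=95):
    """Return (lower, upper) thresholds taken from percentiles of historical values.

    lower_pct and upper_pct are illustrative defaults, not the values chosen
    on the ticket.
    """
    # quantiles(..., n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def is_outside_thresholds(value, thresholds):
    """True when a new observation falls outside the fixed thresholds."""
    lower, upper = thresholds
    return value < lower or value > upper


# Hypothetical history (e.g. a daily metric over the reference period).
history = list(range(101))
fixed = quantile_thresholds(history)

print(fixed)                              # thresholds computed once, then kept fixed
print(is_outside_thresholds(97, fixed))   # checked against the fixed thresholds
```

Because the thresholds are computed once from a reference window and then held constant, they would need to be recomputed when they are revisited, as the first bullet suggests.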
Oct 27 2025
I'm using this ticket as an opportunity to perform the following upgrades on the Airflow DAG:
Oct 21 2025
@Hghani I think for thresholds, the question is whether we should use a fixed threshold or a rolling threshold. I believe both have advantages and disadvantages, since traffic changes over time. I suggest we use a mix of both.
Oct 17 2025
Do the quantile values capture the spikes before May 28th? I ask because May 28th until the first week of June was the absolute peak of the May incident.
Oct 16 2025
Oct 15 2025
I applied the suggested thresholds above to the old incident data found in wmf.pageview_hourly_backup_2025, but unfortunately they didn't catch the changes in the human-bot ratio. I think this is because the old table only has data from 2025-03-20 onward, by which time the incident had already started. If we had older historical data from before the incident, it would help us create a more trustworthy baseline to calculate our diff.
A quick summary of 5 weeks data:
History table used: Pageview hourly
Proposed monitoring/alerting frequency: Daily
Oct 14 2025
Oct 9 2025
Current plan:
- Understand the variability of pageviews by monitoring the delta change of total pageviews in a day against a baseline over a period of time.
- The suggested baseline is a floating average of pageviews over a period of 7 days.
- We monitor this delta for perhaps a month, and from it pick a threshold for user and bot pageviews.
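The delta-against-baseline idea in the plan above can be sketched as below. This is a rough illustration under stated assumptions: the baseline is taken as the mean of the 7 days preceding each day (excluding the day itself), the delta is expressed relative to the baseline, and the function name `rolling_baseline_deltas` and the sample totals are hypothetical.

```python
from statistics import mean


def rolling_baseline_deltas(daily_totals, window=7):
    """Compare each day's total with the mean of the preceding `window` days.

    Returns a list of (day_index, baseline, relative_delta) tuples, starting
    at the first day that has a full trailing window.
    """
    results = []
    for i in range(window, len(daily_totals)):
        baseline = mean(daily_totals[i - window:i])
        if baseline == 0:
            continue  # a relative delta is undefined for a zero baseline
        relative_delta = (daily_totals[i] - baseline) / baseline
        results.append((i, baseline, relative_delta))
    return results


# Made-up daily pageview totals, for illustration only.
example_totals = [100, 100, 100, 100, 100, 100, 100, 150]
for day, baseline, delta in rolling_baseline_deltas(example_totals):
    print(day, baseline, f"{delta:+.0%}")
```

Running the same computation separately on user and bot pageviews would give two delta series, from which separate thresholds could be picked after the observation period.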
Oct 2 2025
Sep 24 2025
Apr 3 2025
Apr 1 2025
@AndrewTavis_WMDE This is good news. Thank you so much for the effort put into this.
Mar 19 2025
Okay @Andrew. Thank you!
To solve the missing wikis issue, we decided it's best to automate the sqoop list. There are three sources of truth under consideration:
- Canonical_data.wikis table (from Wikimedia NOC website). Note there is ongoing work to automate this table T339928
- Site_creation log website.
- Project_namespace_map table.
Mar 11 2025
Hi @AndrewTavis_WMDE. Thank you for the confirmation. Please feel free to reach out if you need any form of support.
Mar 10 2025
@AndrewTavis_WMDE Can we work with 28th March as the date to finally disable the Wikidata metrics job in Airflow?
Mar 3 2025
The API metrics have been disabled. The Wikidata metrics are pending.
Feb 28 2025
Feb 25 2025
@AndrewTavis_WMDE The plan is to go read-only (Graphite) by the end of Q3-FY24/25. We can hold off turning off the wikidata_metrics_to_graphite_daily_dag.py until towards the end of this quarter.
Feb 20 2025
Hi @AndrewTavis_WMDE, I'm just following up on the wikidata_metrics_to_graphite_daily_dag.py. As part of T372855, can we go ahead and turn this DAG off?
Feb 18 2025
Jan 29 2025
I found no Hive queries or scripts using these tables. The dumps are currently used by the Research and Platform Engineering teams.
Jan 25 2025
I searched the repositories and have pulled together this analysis. It's still a work in progress, as I have yet to identify all the tables/datasets that use the mediawiki wikitext dumps.
Jan 17 2025
Nov 27 2024
Nov 26 2024
Nov 19 2024
Nov 18 2024
Nov 14 2024
Nov 6 2024
Nov 4 2024
Oct 31 2024
Oct 30 2024
Oct 21 2024
Oct 16 2024
Oct 8 2024
The switchover has been done. The Gerrit repositories are deprecated (set to read-only) and the schema servers have all been updated with the GitLab URLs, with @BTullis's support.
Oct 1 2024
We plan to do the switch in one week's time, i.e. 8th October, 2024.
Data-Platform-SRE We would need your support to manage merging this patch next Tuesday, 8th October. We need to make sure the existing checkouts have their git origin changed. Please help confirm availability so I can proceed with notifying everyone of this date.
Plan for EventPlatform Schema Migration.
Sep 25 2024
The following documents have been updated: