User Details
- User Since: Jan 6 2022, 11:29 AM (222 w, 4 d)
- Availability: Available
- LDAP User: Snwachukwu
- MediaWiki User: SNwachukwu (WMF)
Thu, Apr 9
A couple of links to the heuristics for reference (Sandra, please correct me if these are wrong!):
- MediaFileUrlParser.java seems to be where most of the heuristic logic lives, using regex to analyze the upload URLs.
- MediaTypeClassification.java is the enum of available types.
Thu, Apr 2
Wed, Apr 1
Hi @Eevans, I'd like the change to be applied to the production tables.
Mon, Mar 30
AQS / wikistats classifies media files using a simple file-extension-to-type mapping defined in MediaTypeClassification.java in the refinery source repository. The mapping is not 1:1 for container formats like .ogg, which can hold either audio or video, and there are also some file extensions that the list does not cover at all.
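To make this concrete, here is a rough SQL sketch of that mapping; the authoritative logic is the Java code linked above (MediaTypeClassification.java / MediaFileUrlParser.java), so both the extension list and the category names below are illustrative only:

```sql
-- Rough sketch of the extension-to-type mapping; the authoritative logic
-- lives in MediaTypeClassification.java / MediaFileUrlParser.java, and the
-- extension list and category names here are illustrative.
SELECT
  file_extension,  -- hypothetical column holding the lower-cased extension
  CASE
    WHEN file_extension IN ('jpg', 'jpeg', 'png', 'gif', 'svg', 'tif', 'tiff') THEN 'image'
    WHEN file_extension IN ('mp3', 'wav', 'flac', 'mid', 'midi', 'opus')       THEN 'audio'
    WHEN file_extension IN ('webm', 'mpg', 'mpeg')                             THEN 'video'
    -- .ogg is a container format: it can hold audio or video, so the
    -- extension alone cannot classify it unambiguously.
    WHEN file_extension = 'ogg'                                                THEN 'ogg_container'
    ELSE 'other'
  END AS media_type
FROM media_files;  -- hypothetical input table
```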
Fri, Mar 27
- not listing opus
- not listing .mpeg and .mpg
- midi twice (I think one of them should be .mid)
- tiff twice (one should be .tif)
@TheDJ Indeed, the doc doesn't list the extensions correctly. It should be .midi and .mid, .tiff and .tif.
- jpeg, but not jpg
Although jpg is not listed in the doc, we do consider .jpg in our code implementation.
Thu, Mar 26
I ran an investigation on the container case "ogg", and here are my findings for a full-month run (year = 2025, month = 12).
Wed, Mar 25
Fri, Mar 20
Thanks for creating the tables in staging @Eevans.
Thu, Mar 19
Thanks @Eevans. You can go ahead and deploy to staging.
Wed, Mar 18
I can also see that we are doing a split-by on a String column when sqooping the centralauth_localuser table, so that does make sense. What if we split by another column like lu_local_id instead?
https://github.com/wikimedia/analytics-refinery/blob/35e7f416fe4bc9e3aeb194474a4fa803d8983823/python/refinery/sqoop.py#L1355
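For context on why a numeric column helps: Sqoop derives its split ranges from a boundary query over the --split-by column and then divides [min, max] into one range per mapper, which only works evenly for numeric, reasonably uniform columns like lu_local_id. A sketch of the shape of that query (illustrative, not the exact statement Sqoop issues):

```sql
-- Illustrative: Sqoop computes split boundaries with a query of this shape
-- over the --split-by column, then divides [min, max] into numeric ranges,
-- one per mapper. Splitting on a String column instead is awkward and uneven.
SELECT MIN(lu_local_id), MAX(lu_local_id)
FROM centralauth_localuser;
```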
Mon, Mar 16
Thank you @Eevans. I have left you a comment on your patch.
Mar 13 2026
Mar 12 2026
@Eevans thank you for taking a look at the design doc. We decided to reuse the existing mediarequest Cassandra tables to avoid reloading the keys, and instead just add the new columns with the needed values. I will update the design doc with the proposed columns.
Mar 11 2026
The sqoop work is done.
Mar 10 2026
Thanks @AndrewTavis_WMDE
Mar 9 2026
@AndrewTavis_WMDE @Ladsgroup , I’ve put together a design document outlining the proposed endpoints. When you have a chance, please review it—particularly the API design section—and let me know if the proposed endpoints cover your requirements or if there are any additional endpoints we should consider.
cc @Eevans for the cassandra tables in the serving layer
Mar 5 2026
Thank you @Zabe for the explanation. Indeed, I used stale data from the last sqoop run.
Mar 2 2026
Feb 26 2026
We are losing 554641 rows at the point where we do the join to the linktarget table. Not all il_target_id values have a corresponding page_title in the linktarget table: about 366519 target_id values don't have a corresponding page_title in the wmf_raw.mediawiki_private_linktarget table.
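For reference, a sketch of one way to count those unmatched rows; the tables and join columns mirror the query further down in this thread, and the snapshot and filter values are illustrative:

```sql
-- Sketch: count il_target_id values with no matching row in linktarget.
-- Snapshot and filters are illustrative.
SELECT COUNT(DISTINCT il.il_target_id) AS unmatched_target_ids
FROM wmf_raw.mediawiki_imagelinks il
LEFT JOIN wmf_raw.mediawiki_private_linktarget lt
  ON  il.il_target_id = lt.lt_id
  AND il.wiki_db      = lt.wiki_db
  AND il.snapshot     = lt.snapshot
WHERE il.snapshot = '2026-01'
  AND lt.lt_id IS NULL;
```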
Feb 25 2026
With regards to
Sure @xcollazo.
NB: All tests were done using snapshot=2026-01.
Feb 24 2026
Okay, after using the imagelinks table obtained from the manual sqoop run to run a manual CIM, here are the row counts of all the tables below. You will see that all tables are populated, but there is a diff in row counts for common_pageviews_per_category_monthly, commons_pageviews_per_media_file_monthly, and commons_media_file_metrics_snapshot.
@xcollazo Found the culprit. 😔 That new inner join produced no rows. Of course that's because the mediawiki_imagelinks table doesn't have the il_target_id yet. 🙈
spark-sql (default)> DROP TABLE IF EXISTS ebysans.imagelinks_with_title;
Response code
Time taken: 0.037 seconds
spark-sql (default)>
> CREATE TABLE ebysans.imagelinks_with_title
> USING PARQUET AS
> SELECT il.il_from,
> il.wiki_db,
> lt.lt_title
> FROM wmf_raw.mediawiki_imagelinks il
> INNER JOIN wmf_raw.mediawiki_private_linktarget lt
> ON il.il_target_id = lt.lt_id
> AND lt.snapshot = il.snapshot
> AND lt.wiki_db = il.wiki_db
> WHERE il.snapshot = '2026-01'
> AND il.il_from_namespace = 0
> AND il.wiki_db NOT IN ('commonswiki', 'wikidatawiki')
> AND lt.snapshot = '2026-01';
26/02/24 19:00:16 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
26/02/24 19:00:55 WARN DAGScheduler: Broadcasting large task binary with size 1118.6 KiB
26/02/24 19:00:56 WARN DAGScheduler: Broadcasting large task binary with size 1160.5 KiB
Response code
Time taken: 98.699 seconds
spark-sql (default)>
>
> SELECT COUNT(*) AS cnt FROM ebysans.imagelinks_with_title;
cnt
0
Time taken: 5.905 seconds, Fetched 1 row(s)
spark-sql (default)>
@AndrewTavis_WMDE based on the conversation above, let's create another ticket to add the rcshowwikidata property to the existing prefupdate data. Can you please create the request ticket, and we'll close this one.
cc @Ahoelzl
Feb 23 2026
It seems this may have to do with the fact that all usage_map values in the intermediate table category_and_media_with_usage_map_2026_01 are NULL. I'm still trying to figure out why.
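A quick sketch of the kind of check that shows this; the table and column names are taken from the comment above, and you may need to add a database qualifier:

```sql
-- Sketch: confirm that usage_map is NULL on every row of the
-- intermediate table (add a database qualifier if needed).
SELECT COUNT(*)                                           AS total_rows,
       SUM(CASE WHEN usage_map IS NULL THEN 1 ELSE 0 END) AS null_usage_map_rows
FROM category_and_media_with_usage_map_2026_01;
```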
I did a manual CIM run. Comparing the number of rows alone, only wmf_contributors.commons_category_metrics_snapshot and wmf_contributors.commons_edits have the same row counts as their equivalent test tables; the remaining 3 tables didn't even populate after running manually. I'm currently investigating the reason.
Feb 20 2026
using the prefupdate data and adding this to the list of not reconciled datasets (see Andrew's point)
I'd tend towards this option since they only need the rcshowwikidata property; we'd just need to update the whitelist. I would love to hear your thoughts on which approach to use, @Ottomata and @JAllemandou.
@xcollazo Sounds good. Are there particular metrics you’d like us to look out for or compare when validating the previous CIM run against the manual run with the new changes?
Feb 17 2026
Feb 12 2026
We have another ticket that has the same purpose as this one.
After discussion with @xcollazo and @Antoine_Quhen, we will proceed with removing artifact versions older than a cutoff as part of this ticket. This allows us to make progress now rather than waiting for the Spark upgrade. When the Spark upgrade happens, all jobs will be bumped to the appropriate latest artifacts at that time.
Feb 11 2026
Feb 9 2026
@xcollazo When we move all the jobs to the latest artifacts, how do we plan to make sure that all the jobs stay updated whenever a newer artifact is released? This is one thing I worry about.
The patch is ready but not yet deployed. It will be deployed with our deployment train on Tuesday (tomorrow).
Feb 5 2026
Jan 30 2026
I have deleted all artifacts that are no longer used in the airflow repository and have also manually deleted them from the airflow cache location on HDFS.
Here is a list of all artifacts manually deleted from HDFS:
- 42 artifacts from the main airflow cache location: hdfs://analytics-hadoop/wmf/cache/artifacts/airflow/analytics
- 17 artifacts from the analytics test airflow cache location
Jan 29 2026
Jan 20 2026
Jan 12 2026
Yes @BTullis. I still don't have access to Admin -> Variable on the web UI after being added to the LDAP group. Here is my screenshot after a logout -> login cycle.
Dec 8 2025
@Htriedman I created an MR to add the DDL scripts to your wmepageview repo; please review. I have added it to your repo, and afterwards we can update the production repository.
Dec 5 2025
@Htriedman, the database is created and now I want to create the tables. I have prepared the CREATE TABLE statements; I just need you to confirm the column comments before I create them. See the patch.
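To illustrate the shape of what needs confirming, here is a minimal sketch of such a statement; the database, table, columns, and comment text below are hypothetical placeholders, not the actual DDL in the patch:

```sql
-- Hypothetical sketch only: the actual database, table, columns, and
-- comment text are in the patch and still need confirmation.
CREATE TABLE IF NOT EXISTS wmf_wme.pageview_hourly_proportion (
  identifier      BIGINT COMMENT 'Page identifier (page_id in MediaWiki terms)',
  pageview_count  BIGINT COMMENT 'Number of pageviews of the page in the hour',
  proportion      DOUBLE COMMENT 'Share of total pageviews in the hour',
  year            INT    COMMENT 'Partition: year of the hour',
  month           INT    COMMENT 'Partition: month of the hour',
  day             INT    COMMENT 'Partition: day of the hour',
  hour            INT    COMMENT 'Partition: hour of the day'
)
USING PARQUET
PARTITIONED BY (year, month, day, hour);
```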
Dec 3 2025
To point (1) — in WME datasets, the page_id is referred to as identifier (inside a JSON object). Because this dataset is going to be used in a WME context, I plan to stick with that convention.
In that case, should we rename the page_id column in the hourly table pageview_hourly_proportion to identifier for consistency?
Thanks @Htriedman for all your responses. I'm about to create the tables in the newly created database. I was wondering, do you have a folder for CREATE TABLE statements? I can't seem to find any in your repos.
Dec 2 2025
@Htriedman, regarding the request to move the code to a non-user repo, I'm a little confused. It seems it has already been moved, because I can see a similar repo in the WME directory here. Can I say that this task:
Dec 1 2025
Regarding the schema @Htriedman, I took a look at the hourly job output table schemas, and I noticed the following:
- The table htriedman.pageview_combined_analytics has a column identifier, which is the page_id. This column should be named page_id to be consistent with the other tables.
- I noticed namespace_id is not on either of the tables. I would suggest adding it, as it is part of the page metadata.
Hi @Htriedman. Can you help me confirm that this is the full list of output tables that need to be moved to a non-user database?
Hourly updated tables:
- htriedman.pageview_hourly_proportion
- htriedman.pageview_combined_analytics
Daily updated tables:
- htriedman.pageview_geo_distribution
- htriedman.pageview_geo_top10
Monthly updated tables:
- htriedman.pageview_associated_distribution
- htriedman.pageview_associated_top10
Nov 26 2025
Nov 17 2025
Nov 7 2025
Thank you @mmartorana
Nov 3 2025
@mmartorana Oh please, I'd like my access back if it's not too late.
Oct 31 2025
@xcollazo I have linked it.
Oct 29 2025
Yes, for clarity, here is the approach we decided on to define our thresholds:
- Use fixed thresholds and revisit them after a period of time (1 year was suggested).
- Use quantiles to define the thresholds (see the sketch below).
- The current quantile values were derived from data between March 20th and Oct 15 2025.
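A minimal sketch of how such quantile-based thresholds could be computed; the source table, the ratio definition, and the quantile choices here are assumptions for illustration, not the exact query we ran:

```sql
-- Illustrative sketch: derive threshold candidates from quantiles of the
-- hourly human/automated pageview ratio. The table, columns, and the chosen
-- quantiles are assumptions, not the exact production query.
WITH hourly_ratio AS (
  SELECT
    year, month, day, hour,
    SUM(CASE WHEN agent_type = 'user' THEN view_count ELSE 0 END)
      / SUM(CASE WHEN agent_type != 'user' THEN view_count ELSE 0 END) AS human_bot_ratio
  FROM wmf.pageview_hourly
  WHERE year = 2025 AND month BETWEEN 3 AND 10  -- roughly March 20th to Oct 15th
  GROUP BY year, month, day, hour
)
SELECT
  percentile_approx(human_bot_ratio, 0.01) AS p01_lower_threshold,
  percentile_approx(human_bot_ratio, 0.99) AS p99_upper_threshold
FROM hourly_ratio;
```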
Oct 27 2025
I'm using this ticket as an opportunity to perform the following upgrades on the airflow dag:
Oct 21 2025
@Hghani I think for thresholds, the question is whether we should use a fixed threshold or a rolling threshold. I believe they both have advantages and disadvantages, since traffic changes over time. I suggested we use a mix of both.
Oct 17 2025
Do the quantile values capture the spikes before May 28th? I ask because May 28th until the first week of June was the absolute peak of the May incident.
Oct 16 2025
Oct 15 2025
I applied the suggested thresholds above to the old incident data found in wmf.pageview_hourly_backup_2025, but unfortunately they didn't catch the changes in the human-bot ratio. I think that's because the old table only has data from 2025-03-20 onwards, by which time the incident had already started. If we had older historical data without the incident, it would help to create a more trustworthy baseline for calculating our diff.
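For completeness, here is a sketch of the kind of check I mean, run against the backup table; the column names and the threshold band are illustrative assumptions, not the thresholds suggested above:

```sql
-- Sketch: flag hours in the backup table whose human/bot ratio falls
-- outside a fixed threshold band. Column names and the 2.0 / 20.0 band
-- are illustrative assumptions, not the thresholds actually chosen.
WITH hourly_ratio AS (
  SELECT
    year, month, day, hour,
    SUM(CASE WHEN agent_type = 'user' THEN view_count ELSE 0 END)
      / SUM(CASE WHEN agent_type != 'user' THEN view_count ELSE 0 END) AS human_bot_ratio
  FROM wmf.pageview_hourly_backup_2025
  GROUP BY year, month, day, hour
)
SELECT year, month, day, hour, human_bot_ratio
FROM hourly_ratio
WHERE human_bot_ratio < 2.0 OR human_bot_ratio > 20.0
ORDER BY year, month, day, hour;
```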

