
Snwachukwu (Sandra Ebele Nwachukwu)
User

User Details

User Since
Jan 6 2022, 11:29 AM (222 w, 4 d)
Availability
Available
LDAP User
Snwachukwu
MediaWiki User
SNwachukwu (WMF) [ Global Accounts ]

Recent Activity

Thu, Apr 9

Snwachukwu added a comment to T419882: Consider updating our heuristics for media type classification in AQS / wikistats.

A couple of links to the heuristics for reference (Sandra, please correct me if these are wrong!):

  • MediaFileUrlParser.java seems to be where most of the heuristic logic lives, using regex to analyze the upload URLs.
  • MediaTypeClassification.java is the enum of available types.
Thu, Apr 9, 4:21 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), AQS2.0

Thu, Apr 2

Snwachukwu added a comment to T422066: Request for +2 rights on Deployment-chart Repository for Snwachukwu.

Thank you! @BTullis @taavi

Thu, Apr 2, 4:03 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Gerrit-Privilege-Requests

Wed, Apr 1

Snwachukwu created T422066: Request for +2 rights on Deployment-chart Repository for Snwachukwu.
Wed, Apr 1, 5:47 PM · Data-Platform-SRE (2026-03-27 - 2026-04-17), Gerrit-Privilege-Requests
Snwachukwu added a comment to T420008: Alter AQS Cassandra tables in support of video plays endpoints.

Hi @Eevans I'd like for the change to be applied to production tables

Wed, Apr 1, 2:59 PM · User-Eevans, Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th), AQS2.0

Mon, Mar 30

Snwachukwu created T421743: Use transcoding signal to resolve ambiguous extensions.
Mon, Mar 30, 3:45 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), AQS2.0
Snwachukwu added a comment to T419882: Consider updating our heuristics for media type classification in AQS / wikistats.

AQS / wikistats classifies media files using a simple file-extension-to-type mapping defined in MediaTypeClassification.java in the refinery source repository. The mapping is not 1:1 for container formats like .ogg, which can hold audio or video. There are also some file extensions not considered in our list.
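The ambiguity described above can be sketched in a few lines. This is only an illustration in Python, not the code in MediaTypeClassification.java: the `MediaType` enum, the `EXTENSION_TO_TYPE` table, and the `classify` function are hypothetical names, and the extension list is a small sample rather than the real one.

```python
from enum import Enum
from pathlib import PurePosixPath


class MediaType(Enum):
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    AMBIGUOUS = "ambiguous"  # container formats that can hold audio or video
    OTHER = "other"          # extension not present in the mapping


# Hypothetical sample mapping for illustration only.
EXTENSION_TO_TYPE = {
    "jpg": MediaType.IMAGE,
    "jpeg": MediaType.IMAGE,
    "tif": MediaType.IMAGE,
    "tiff": MediaType.IMAGE,
    "opus": MediaType.AUDIO,
    "mpg": MediaType.VIDEO,
    "mpeg": MediaType.VIDEO,
    "ogg": MediaType.AMBIGUOUS,  # extension alone cannot decide audio vs video
}


def classify(file_name: str) -> MediaType:
    """Classify a file by its lower-cased extension; unknown extensions map to OTHER."""
    extension = PurePosixPath(file_name).suffix.lower().lstrip(".")
    return EXTENSION_TO_TYPE.get(extension, MediaType.OTHER)
```

Marking `.ogg` as ambiguous, instead of forcing it into one type, leaves room for a second signal to resolve it later.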

Mon, Mar 30, 3:09 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), AQS2.0

Fri, Mar 27

Snwachukwu added a comment to T419882: Consider updating our heuristics for media type classification in AQS / wikistats.

not listing opus
not listing .mpeg and .mpg

Fri, Mar 27, 2:41 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), AQS2.0
Snwachukwu added a comment to T419882: Consider updating our heuristics for media type classification in AQS / wikistats.

midi twice (I think one of them should be .mid)
tiff twice (one should be .tif)

@TheDJ Indeed, the doc doesn't list the extensions correctly. It should be .midi and .mid, .tiff and .tif.

jpeg, but not jpg

Although .jpg is not listed in the doc, we do handle .jpg in our code implementation.

Fri, Mar 27, 2:19 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), AQS2.0

Thu, Mar 26

Snwachukwu added a comment to T419882: Consider updating our heuristics for media type classification in AQS / wikistats.

I investigated the "ogg" container case, and here are my findings for a full-month run (year = 2025, month = 12):

Thu, Mar 26, 6:25 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), AQS2.0

Wed, Mar 25

Snwachukwu added a comment to T420008: Alter AQS Cassandra tables in support of video plays endpoints.

Hello @Eevans. I would like to alter the `local_group_default_T_mediarequest_top_files.data` table as well. See this patch. Do you mind adding it to staging as well?

Wed, Mar 25, 1:34 PM · User-Eevans, Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th), AQS2.0

Fri, Mar 20

Snwachukwu added a comment to T420008: Alter AQS Cassandra tables in support of video plays endpoints.

Thanks for creating the tables in staging @Eevans.

Fri, Mar 20, 1:55 PM · User-Eevans, Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th), AQS2.0

Thu, Mar 19

Snwachukwu added a comment to T420008: Alter AQS Cassandra tables in support of video plays endpoints.

Thanks @Eevans. You can go ahead to deploy to staging.

Thu, Mar 19, 5:06 PM · User-Eevans, Patch-For-Review, Data-Engineering (Q3 FY25/26 January 1st - March 31th), AQS2.0

Wed, Mar 18

Snwachukwu added a comment to T411116: CentralAuth's localuser table contains many nulls and duplicate mappings.

I can also see that we are doing a split-by on a String column when sqooping the centralauth_localuser table, so it does make sense. What if we split by another column, like lu_local_id?
https://github.com/wikimedia/analytics-refinery/blob/35e7f416fe4bc9e3aeb194474a4fa803d8983823/python/refinery/sqoop.py#L1355

Wed, Mar 18, 4:00 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), MediaWiki-Platform-Team, MediaWiki-extensions-CentralAuth

Mon, Mar 16

Snwachukwu added a comment to T415202: Introduce a new AQS endpoint to expose video plays.

Thank you @Eevans . I have left you a comment on your patch

Mon, Mar 16, 2:25 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), AQS2.0

Mar 13 2026

Snwachukwu updated subscribers of T420046: Add Human-Bot Alert Runbook Link to Alert Email..
Mar 13 2026, 7:10 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu created T420046: Add Human-Bot Alert Runbook Link to Alert Email..
Mar 13 2026, 7:08 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mar 12 2026

Snwachukwu added a comment to T415202: Introduce a new AQS endpoint to expose video plays.

@Eevans thank you for taking a look at the design doc. We decided to reuse the existing mediarequest Cassandra tables to avoid reloading the keys, and to just add new columns with the needed values to them. I will update the design doc with the proposed columns.

Mar 12 2026, 2:05 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), AQS2.0

Mar 11 2026

Snwachukwu added a comment to T416481: Adapt Sqoop for imagelinks schema changes.

The sqoop work is done.

Mar 11 2026, 4:58 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu updated the task description for T416481: Adapt Sqoop for imagelinks schema changes.
Mar 11 2026, 4:58 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mar 10 2026

Snwachukwu added a comment to T415202: Introduce a new AQS endpoint to expose video plays.

Thanks @AndrewTavis_WMDE

Mar 10 2026, 2:54 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), AQS2.0

Mar 9 2026

Snwachukwu updated subscribers of T415202: Introduce a new AQS endpoint to expose video plays.

@AndrewTavis_WMDE @Ladsgroup , I’ve put together a design document outlining the proposed endpoints. When you have a chance, please review it—particularly the API design section—and let me know if the proposed endpoints cover your requirements or if there are any additional endpoints we should consider.
cc @Eevans for the cassandra tables in the serving layer

Mar 9 2026, 6:53 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), AQS2.0

Mar 5 2026

Snwachukwu added a comment to T416481: Adapt Sqoop for imagelinks schema changes.

Thank you @Zabe for the explanation. Indeed, I used stale data from the last sqoop run.

Mar 5 2026, 2:18 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Mar 2 2026

Snwachukwu moved T416481: Adapt Sqoop for imagelinks schema changes from In Review to Ready to Deploy on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 2 2026, 5:06 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu moved T415283: Refactor pingback analytics pipeline from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Mar 2 2026, 4:37 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st)

Feb 26 2026

Snwachukwu updated subscribers of T416481: Adapt Sqoop for imagelinks schema changes.

We are losing 554641 rows at the point where we do the join to the linktarget table. Not all il_target_id values have a corresponding page_title in the linktarget table: about 366519 target_id values don't have a corresponding page_title in the wmf_raw.mediawiki_private_linktarget table.
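One way to reproduce both numbers (rows dropped by the inner join, and distinct target ids with no match) is a left join filtered on the missing side. The sketch below uses Python's built-in sqlite3 with toy tables. The table names `imagelinks` and `linktarget` and the `count_orphan_targets` helper are simplified stand-ins, and the real join also matches on snapshot and wiki_db.

```python
import sqlite3


def count_orphan_targets(conn: sqlite3.Connection) -> tuple[int, int]:
    """Return (rows an inner join would drop, distinct target ids with no match)."""
    lost_rows, orphan_ids = conn.execute(
        """
        SELECT COUNT(*), COUNT(DISTINCT il.il_target_id)
        FROM imagelinks il
        LEFT JOIN linktarget lt
            ON il.il_target_id = lt.lt_id
        WHERE lt.lt_id IS NULL  -- keep only imagelinks rows without a linktarget match
        """
    ).fetchone()
    return lost_rows, orphan_ids


# Toy data: target ids 11 and 12 have no linktarget row.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE imagelinks (il_from INTEGER, il_target_id INTEGER);
    CREATE TABLE linktarget (lt_id INTEGER, lt_title TEXT);
    INSERT INTO imagelinks VALUES (1, 10), (2, 10), (3, 11), (4, 12), (5, 12);
    INSERT INTO linktarget VALUES (10, 'A.jpg');
    """
)
lost_rows, orphan_ids = count_orphan_targets(conn)
```

The two counts differ whenever several imagelinks rows point at the same missing target, which matches the shape of the numbers reported above.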

Feb 26 2026, 2:58 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Feb 25 2026

Snwachukwu added a comment to T416481: Adapt Sqoop for imagelinks schema changes.

With regards to

Feb 25 2026, 7:40 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu added a comment to T416481: Adapt Sqoop for imagelinks schema changes.

Sure @xcollazo .
NB: All tests were done using snapshot=2026-01

Feb 25 2026, 7:30 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Feb 24 2026

Snwachukwu added a comment to T416481: Adapt Sqoop for imagelinks schema changes.

Okay, so after using the imagelinks table from the manual sqoop run to run a manual CIM, here are the row counts of all the tables below. You can see that all tables are populated, but there is a diff in row counts for common_pageviews_per_category_monthly, commons_pageviews_per_media_file_monthly, and commons_media_file_metrics_snapshot.

Feb 24 2026, 7:23 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu added a comment to T416481: Adapt Sqoop for imagelinks schema changes.
Feb 24 2026, 7:19 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu added a comment to T416481: Adapt Sqoop for imagelinks schema changes.

@xcollazo Found the culprit. 😔 That new inner join produced no rows. Of course that's because the mediawiki_imagelinks table doesn't have the il_target_id yet. 🙈

spark-sql (default)> DROP TABLE IF EXISTS ebysans.imagelinks_with_title;
Response code
Time taken: 0.037 seconds
spark-sql (default)>
                   > CREATE TABLE ebysans.imagelinks_with_title
                   > USING PARQUET AS
                   > SELECT il.il_from,
                   >        il.wiki_db,
                   >        lt.lt_title
                   > FROM wmf_raw.mediawiki_imagelinks il
                   > INNER JOIN wmf_raw.mediawiki_private_linktarget lt
                   >     ON il.il_target_id = lt.lt_id
                   >     AND lt.snapshot = il.snapshot
                   >     AND lt.wiki_db = il.wiki_db
                   > WHERE il.snapshot = '2026-01'
                   >     AND il.il_from_namespace = 0
                   >     AND il.wiki_db NOT IN ('commonswiki', 'wikidatawiki')
                   >     AND lt.snapshot = '2026-01';
26/02/24 19:00:16 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
26/02/24 19:00:55 WARN DAGScheduler: Broadcasting large task binary with size 1118.6 KiB
26/02/24 19:00:56 WARN DAGScheduler: Broadcasting large task binary with size 1160.5 KiB
Response code
Time taken: 98.699 seconds
spark-sql (default)>
                   >
                   > SELECT COUNT(*) AS cnt FROM ebysans.imagelinks_with_title;
cnt
0
Time taken: 5.905 seconds, Fetched 1 row(s)
spark-sql (default)>
Feb 24 2026, 7:11 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu added a comment to T323456: NEW FEATURE REQUEST: sqoop (all) user properties from mariadb to wmf_raw.mediawiki_user_properties.

@AndrewTavis_WMDE based on the conversation above, let's create another ticket to add the rcshowwikidata property to the existing prefupdate data. Could you please create the request ticket? We'll then close this one.
cc @Ahoelzl

Feb 24 2026, 1:37 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Wikidata, Wikidata Analytics, Data Pipelines

Feb 23 2026

Snwachukwu added a comment to T416481: Adapt Sqoop for imagelinks schema changes.

It seems this may be because all usage_map values in the intermediate table category_and_media_with_usage_map_2026_01 are NULL. Still trying to figure out why.

Feb 23 2026, 7:45 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu added a comment to T416481: Adapt Sqoop for imagelinks schema changes.

I did a manual CIM run. Comparing the number of rows alone, only wmf_contributors.commons_category_metrics_snapshot and wmf_contributors.commons_edits have the same row counts as their equivalent test tables. The remaining 3 tables didn't even populate after running manually. I'm currently investigating the reason.

image.png (514×2 px, 164 KB)

Feb 23 2026, 7:23 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Feb 20 2026

Snwachukwu added a comment to T323456: NEW FEATURE REQUEST: sqoop (all) user properties from mariadb to wmf_raw.mediawiki_user_properties.

using the prefupdate data and adding this to the list of not reconciled datasets (see Andrew's point)

I'd tend towards this option since they only need the rcshowwikidata property. We'd just need to update the whitelist. @Ottomata and @JAllemandou, I would love to hear your thoughts on which approach to use.

Feb 20 2026, 5:26 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Wikidata, Wikidata Analytics, Data Pipelines
Snwachukwu added a comment to T416481: Adapt Sqoop for imagelinks schema changes.

@xcollazo Sounds good. Are there particular metrics you’d like us to look out for or compare when validating the previous CIM run against the manual run with the new changes?

Feb 20 2026, 2:20 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Feb 17 2026

Snwachukwu claimed T415202: Introduce a new AQS endpoint to expose video plays.
Feb 17 2026, 4:33 PM · Data-Engineering (Q4 FS25/26 April 1st - June 30st), AQS2.0

Feb 12 2026

Snwachukwu added a comment to T415951: Move Sqoop timers to Airflow.

We have another ticket that has the same purpose as this ticket

Feb 12 2026, 7:34 PM · Data-Engineering
Snwachukwu updated the task description for T373693: [Iceberg Migration] Extend Iceberg table maintenance mechanism to support multiple Airflow instances.
Feb 12 2026, 7:26 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu updated subscribers of T405379: Clean up artifacts.yaml.

After discussion with @xcollazo and @Antoine_Quhen, we will proceed with removing artifact versions older than a cutoff as part of this ticket. This allows us to make progress now rather than waiting for the Spark upgrade. When the Spark upgrade happens, all jobs will be bumped to the appropriate latest artifacts at that time.

Feb 12 2026, 7:23 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu updated the task description for T416481: Adapt Sqoop for imagelinks schema changes.
Feb 12 2026, 2:41 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu updated the task description for T416481: Adapt Sqoop for imagelinks schema changes.
Feb 12 2026, 2:40 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu updated the task description for T416481: Adapt Sqoop for imagelinks schema changes.
Feb 12 2026, 2:39 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu updated the task description for T416481: Adapt Sqoop for imagelinks schema changes.
Feb 12 2026, 2:19 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Feb 11 2026

Snwachukwu updated the task description for T416481: Adapt Sqoop for imagelinks schema changes.
Feb 11 2026, 8:31 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu updated the task description for T416481: Adapt Sqoop for imagelinks schema changes.
Feb 11 2026, 8:30 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Feb 9 2026

Snwachukwu added a comment to T405379: Clean up artifacts.yaml.

@xcollazo When we move all the jobs to the latest, how do we plan to make sure that all the jobs are always updated whenever there is a latest artifact? This is one thing I worry about.

Feb 9 2026, 6:06 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu added a comment to T361210: Changes to the cuc_agent column in the cu_changes table.

The patch is ready but not yet deployed. It will be deployed with our deployment train on Tuesday (tomorrow).

Feb 9 2026, 2:19 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Product Safety and Integrity, CheckUser

Feb 5 2026

Snwachukwu updated the task description for T373693: [Iceberg Migration] Extend Iceberg table maintenance mechanism to support multiple Airflow instances.
Feb 5 2026, 3:07 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Jan 30 2026

Snwachukwu updated the task description for T405379: Clean up artifacts.yaml.
Jan 30 2026, 6:08 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu added a comment to T405379: Clean up artifacts.yaml.

I have deleted all artifacts that are not used anymore in the airflow repository and also manually deleted them from the airflow cache location on HDFS.
Here is a list of all artifacts manually deleted from HDFS:
42 artifacts from main airflow cache location

hdfs://analytics-hadoop/wmf/cache/artifacts/airflow/analytics

and 17 from analytics test airflow cache location

Jan 30 2026, 6:08 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Jan 29 2026

Snwachukwu moved T373693: [Iceberg Migration] Extend Iceberg table maintenance mechanism to support multiple Airflow instances from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Jan 29 2026, 3:40 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu moved T373693: [Iceberg Migration] Extend Iceberg table maintenance mechanism to support multiple Airflow instances from Backlog to Q3 FY25/26 January 1st - March 31th on the Data-Engineering board.
Jan 29 2026, 3:39 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu claimed T373693: [Iceberg Migration] Extend Iceberg table maintenance mechanism to support multiple Airflow instances.
Jan 29 2026, 2:37 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Jan 20 2026

Snwachukwu moved T414714: Add data-steward-alerts mail to anomaly_detection_traffic_distribution_daily DAG from Next Up to In progress on the Data-Engineering (Q3 FY25/26 January 1st - March 31th) board.
Jan 20 2026, 2:48 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Jan 12 2026

Snwachukwu added a comment to T414353: Add user ebysans and amastilovic platform-eng airflow instance admins.

Yes @BTullis. I still don't have access to Admin -> Variable in the web UI after being added to the LDAP group. Here is my screenshot after a logout -> login cycle.

image.png (1×2 px, 465 KB)

Jan 12 2026, 4:23 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu updated the task description for T414353: Add user ebysans and amastilovic platform-eng airflow instance admins.
Jan 12 2026, 3:46 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu added a project to T414353: Add user ebysans and amastilovic platform-eng airflow instance admins: Data-Engineering (Q3 FY25/26 January 1st - March 31th).
Jan 12 2026, 3:46 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu created T414353: Add user ebysans and amastilovic platform-eng airflow instance admins.
Jan 12 2026, 3:45 PM · Essential-Work, Data-Platform-SRE (2026.01.05 - 2026.01.23), Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Dec 8 2025

Snwachukwu added a comment to T409601: Review and productionize the WME differential privacy data set.

@Htriedman I created an MR to add the DDL Scripts to your wmepageview repo. Please review. I have added it to your repo and afterwards we can update the production repository.

Dec 8 2025, 7:50 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
Snwachukwu updated the task description for T412035: Upgrade Airflow HdfsEmailOperator to take both a String or a List(String) email addresses..
Dec 8 2025, 4:38 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)
Snwachukwu created T412035: Upgrade Airflow HdfsEmailOperator to take both a String or a List(String) email addresses..
Dec 8 2025, 4:37 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th)

Dec 5 2025

Snwachukwu claimed T411876: Add new data-steward email to Human-Bot Alert email..
Dec 5 2025, 4:47 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu created T411876: Add new data-steward email to Human-Bot Alert email..
Dec 5 2025, 4:47 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu added a comment to T409601: Review and productionize the WME differential privacy data set.

@Htriedman, the database is created and now I want to create the tables. I have prepared the create table statements. I just need you to confirm the column comments before I create them. See patch

Dec 5 2025, 4:14 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review

Dec 3 2025

Snwachukwu added a comment to T409601: Review and productionize the WME differential privacy data set.

To point (1) — in WME datasets, the page_id is referred to as identifier (inside a JSON object). Because this dataset is going to be used in a WME context, I plan to stick with that convention.

In that case should we rename the page_id column in the hourly table pageview_hourly_proportion to identifier for consistency?

Dec 3 2025, 7:30 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
Snwachukwu added a comment to T409601: Review and productionize the WME differential privacy data set.

Thanks @Htriedman for all your responses. I'm about to create the tables in the newly created database. Do you have a folder for create table statements? I can't seem to find one in your repos.

Dec 3 2025, 6:53 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review

Dec 2 2025

Snwachukwu added a comment to T409601: Review and productionize the WME differential privacy data set.

@Htriedman, regarding the request to move the code to a non-user repo, I'm a little confused. It seems it has already been moved, because I can see a similar repo in the WME directory here. Can I say that this task:

Dec 2 2025, 6:34 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review

Dec 1 2025

Snwachukwu added a comment to T409601: Review and productionize the WME differential privacy data set.

Regarding the schema @Htriedman, I took a look at the hourly job output table schemas, and I noticed the following:

  1. The table htriedman.pageview_combined_analytics has a column identifier, which is the page_id. This column should be named page_id to be consistent with other tables.
  2. namespace_id is in neither of the tables. I would suggest adding it, as it is part of a page's metadata.
Dec 1 2025, 8:11 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
Snwachukwu closed T410289: Develop HQL Scripts for Creating Global Editor Metrics Cassandra Tables in Hive., a subtask of T405039: Global Editor Metrics - Data Pipeline, as Resolved.
Dec 1 2025, 8:10 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, OKR-Work, MediaWiki-Page-derived-data
Snwachukwu closed T410289: Develop HQL Scripts for Creating Global Editor Metrics Cassandra Tables in Hive. as Resolved.
Dec 1 2025, 8:10 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu added a comment to T409601: Review and productionize the WME differential privacy data set.

Hi @Htriedman. Can you help me confirm that this is the full list of output tables that need to be moved to a non-user database?
Hourly updated tables:

  • htriedman.pageview_hourly_proportion
  • htriedman.pageview_combined_analytics

Daily updated tables:

  • htriedman.pageview_geo_distribution
  • htriedman.pageview_geo_top10

Monthly updated tables:

  • htriedman.pageview_associated_distribution
  • htriedman.pageview_associated_top10
Dec 1 2025, 7:59 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review
Snwachukwu renamed T411378: Human vs Bot Alerting Email Upgrade from Human vs Bot ALERTING to Human vs Bot Alerting Email Upgrade.
Dec 1 2025, 4:29 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu moved T411378: Human vs Bot Alerting Email Upgrade from Next Up to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Dec 1 2025, 4:28 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu created T411378: Human vs Bot Alerting Email Upgrade.
Dec 1 2025, 4:28 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Nov 26 2025

Snwachukwu claimed T409601: Review and productionize the WME differential privacy data set.
Nov 26 2025, 7:50 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review

Nov 17 2025

Snwachukwu claimed T410289: Develop HQL Scripts for Creating Global Editor Metrics Cassandra Tables in Hive..
Nov 17 2025, 6:39 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu added a subtask for T405039: Global Editor Metrics - Data Pipeline: T410289: Develop HQL Scripts for Creating Global Editor Metrics Cassandra Tables in Hive..
Nov 17 2025, 6:39 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, OKR-Work, MediaWiki-Page-derived-data
Snwachukwu added a parent task for T410289: Develop HQL Scripts for Creating Global Editor Metrics Cassandra Tables in Hive.: T405039: Global Editor Metrics - Data Pipeline.
Nov 17 2025, 6:39 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu changed the status of T410289: Develop HQL Scripts for Creating Global Editor Metrics Cassandra Tables in Hive. from Open to In Progress.
Nov 17 2025, 4:55 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu created T410289: Develop HQL Scripts for Creating Global Editor Metrics Cassandra Tables in Hive..
Nov 17 2025, 4:54 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu moved T401022: Implement the data layout, UI, and documentation for the XML file export from In progress to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Nov 17 2025, 4:34 PM · Patch-For-Review, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu moved T408918: Upgrade mediawiki-event-enrichment jobs to >= Flink 1.20.3 and Java 17 from In progress to Blocked/Paused on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Nov 17 2025, 4:32 PM · Data-Engineering (Q3 FY25/26 January 1st - March 31th), Patch-For-Review, Event-Platform, Essential-Work
Snwachukwu moved T407239: SDS 1.3.2 Implementation from In Review to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Nov 17 2025, 4:29 PM · Patch-For-Review, OKR-Work, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu moved T409099: Iceberg Merge strategies with dbt from In Review to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Nov 17 2025, 4:22 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Nov 7 2025

Snwachukwu added a comment to T397076: Re-enable WMF-NDA access for Miriam and Snwachukwu.

Thank you @mmartorana

Nov 7 2025, 8:46 PM · SecTeam-Processed, Security, Security-Team

Nov 3 2025

Snwachukwu added a comment to T397076: Re-enable WMF-NDA access for Miriam and Snwachukwu.

@mmartorana Oh, please, I'd like my access back if it's not too late.

Nov 3 2025, 4:07 PM · SecTeam-Processed, Security, Security-Team

Oct 31 2025

Snwachukwu added a comment to T407239: SDS 1.3.2 Implementation.

@xcollazo . I have linked it.

Oct 31 2025, 5:29 PM · Patch-For-Review, OKR-Work, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu updated the task description for T407239: SDS 1.3.2 Implementation.
Oct 31 2025, 5:25 PM · Patch-For-Review, OKR-Work, Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Oct 29 2025

Snwachukwu added a comment to T406882: SDS 1.3.2 Conduct Analysis on Alerting for changes in automated traffic distribution.

Yes, for clarity, here is the approach we decided on to define our thresholds:

  • Use fixed thresholds and revisit them after a period of time (1 year suggested).
  • Use quantiles to define the thresholds.
  • The current quantile values were computed using data from March 20th to Oct 15 2025.
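
The fixed-quantile approach in the list above could look roughly like the following Python sketch. It is only an illustration: the function names `fixed_thresholds` and `is_anomalous` and the 5th/95th percentile defaults are assumptions, not the values or code used for the alert.

```python
from statistics import quantiles


def fixed_thresholds(baseline, lower_pct=5, upper_pct=95):
    """Compute fixed (low, high) thresholds once from a historical baseline.

    The percentile choices are assumed here for illustration.
    """
    # With n=100, quantiles() returns 99 cut points; index p-1 is the p-th percentile.
    cuts = quantiles(baseline, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def is_anomalous(value, thresholds):
    """Flag a new observation that falls outside the fixed thresholds."""
    low, high = thresholds
    return value < low or value > high
```

Because the thresholds are computed once and then frozen, they only stay meaningful if the baseline window is representative, which is why revisiting them periodically is part of the approach.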
Oct 29 2025, 1:54 PM · OKR-Work, Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Oct 27 2025

Snwachukwu added a comment to T408400: Unable to change input dump path of Airflow commons_structured_data_dump_to_hive_weekly dag.

I'm using this ticket as an opportunity to perform the following upgrades on the airflow dag:

Oct 27 2025, 3:22 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu updated the task description for T408400: Unable to change input dump path of Airflow commons_structured_data_dump_to_hive_weekly dag.
Oct 27 2025, 3:15 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu claimed T408400: Unable to change input dump path of Airflow commons_structured_data_dump_to_hive_weekly dag.
Oct 27 2025, 3:09 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu moved T408400: Unable to change input dump path of Airflow commons_structured_data_dump_to_hive_weekly dag from Next Up to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Oct 27 2025, 3:09 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu created T408400: Unable to change input dump path of Airflow commons_structured_data_dump_to_hive_weekly dag.
Oct 27 2025, 3:09 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Oct 21 2025

Snwachukwu added a comment to T406882: SDS 1.3.2 Conduct Analysis on Alerting for changes in automated traffic distribution.

@Hghani I think for thresholds, the question is whether we should use a fixed threshold or a rolling threshold. I believe both have advantages and disadvantages. Traffic changes over time. I suggest we use a mix of both.

Oct 21 2025, 3:46 PM · OKR-Work, Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Oct 17 2025

Snwachukwu added a comment to T406882: SDS 1.3.2 Conduct Analysis on Alerting for changes in automated traffic distribution.

Do the quantile values capture the spikes before May 28th? I ask because May 28th until the first week of June was the absolute peak of the May incident.

Oct 17 2025, 3:02 PM · OKR-Work, Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Oct 16 2025

Snwachukwu moved T365203: [Data Quality] Implement wiki completeness check for MediaWiki History from Ready to Deploy to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Oct 16 2025, 10:25 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work

Oct 15 2025

Snwachukwu added a comment to T406882: SDS 1.3.2 Conduct Analysis on Alerting for changes in automated traffic distribution.

I applied the suggested thresholds above to the old incident data found in wmf.pageview_hourly_backup_2025, but unfortunately they didn't catch the changes in the human-bot ratio. I think it's because the old table only has data from 2025-03-20 onward, by which time the incident had already started. If we had older historical data from before the incident, it would help create a more trustworthy baseline to calculate our diff.

Oct 15 2025, 8:27 PM · OKR-Work, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
Snwachukwu changed the status of T406882: SDS 1.3.2 Conduct Analysis on Alerting for changes in automated traffic distribution, a subtask of T407235: SDS 1.3.2 [EPIC] Automated alerting for changes in automated traffic behavior, from Open to In Progress.
Oct 15 2025, 4:59 PM · Data-Engineering, Patch-For-Review, Epic, OKR-Work
Snwachukwu changed the status of T406882: SDS 1.3.2 Conduct Analysis on Alerting for changes in automated traffic distribution from Open to In Progress.
Oct 15 2025, 4:59 PM · OKR-Work, Data-Engineering (Q2 FY25/26 October 1st - December 31th)