JAllemandou (joal)
Data Engineer

User Details

User Since
Feb 11 2015, 6:02 PM (565 w, 6 d)
Availability
Available
IRC Nick
joal
LDAP User
Unknown
MediaWiki User
JAllemandou (WMF) [ Global Accounts ]

Recent Activity

Yesterday

JAllemandou added a project to T412003: Airflow-main scheduler loop sometimes slows down markedly: Data-Engineering (Q2 FY25/26 October 1st - December 31th).
Mon, Dec 15, 4:55 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Data-Platform-SRE (2025.11.07 - 2025.11.28)
JAllemandou moved T412443: Handle `network_flows_internal` data growth from Ready to Deploy to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Mon, Dec 15, 4:54 PM · Infrastructure-Foundations, netops, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou moved T412322: Investigate and ensure Druid snapshot retention is limited to 3 from Next Up to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Mon, Dec 15, 4:40 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou claimed T412322: Investigate and ensure Druid snapshot retention is limited to 3.
Mon, Dec 15, 4:40 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou moved T412443: Handle `network_flows_internal` data growth from Next Up to Ready to Deploy on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Mon, Dec 15, 4:39 PM · Infrastructure-Foundations, netops, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou claimed T412443: Handle `network_flows_internal` data growth.
Mon, Dec 15, 4:39 PM · Infrastructure-Foundations, netops, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou moved T412321: Revert webrequest retention time back to 90 days from In Review to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Mon, Dec 15, 1:50 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T412443: Handle `network_flows_internal` data growth.

I've been spending time on this; here's my latest finding: by reducing retention to 30 days, we keep the data at an acceptable size with the current sampling.
I have set the retention to 30 days, and the Druid data size dropped from 600GB to 80GB.
At the current sampling we store ~500MB per hour, which will lead to ~400GB for one month, lower than the previous 600GB.
I'll keep the config as is (30-day retention) and will remove the daily loading job, as the data is big enough to not require compaction.
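
A back-of-the-envelope check of the monthly-volume estimate above (plain Python; the ~500 MB/hour rate is the figure from the comment):

```python
# Assumed ingest rate, taken from the comment: ~500 MB per hour at current sampling.
mb_per_hour = 500
hours_per_30_days = 24 * 30
total_gb = mb_per_hour * hours_per_30_days / 1000
print(total_gb)  # 360.0 -> on the order of the ~400 GB cited for one month
```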

Mon, Dec 15, 1:33 PM · Infrastructure-Foundations, netops, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T412443: Handle `network_flows_internal` data growth.

In the meantime I'm going to re-enable the indexation job with more resources so that it doesn't fail, but we need to find a solution for disk space: either sampling more aggressively, or keeping only one week of data for now.
Note: if we decide to keep one week of data, Druid can be queried for that week only, but the rest is not deleted yet: it's still available in deep storage on Hadoop, and can be made queryable again with a config change (if disk space is available).

Mon, Dec 15, 7:49 AM · Infrastructure-Foundations, netops, Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Sat, Dec 13

JAllemandou added a comment to T412443: Handle `network_flows_internal` data growth.

That's right, the failure is due to an OOM in the Druid ingestion job. But it's better for us not to ingest while we don't have a decision on how to handle the data size.

Sat, Dec 13, 11:41 AM · Infrastructure-Foundations, netops, Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Thu, Dec 11

JAllemandou moved T412321: Revert webrequest retention time back to 90 days from In progress to In Review on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Thu, Dec 11, 7:05 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou created T412443: Handle `network_flows_internal` data growth.
Thu, Dec 11, 7:01 PM · Infrastructure-Foundations, netops, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T376790: Split the permission to access Logstash from the cn=wmf and cn=nda groups.

Thanks so much @Novem_Linguae, I confirm the link worked for me.

Thu, Dec 11, 4:40 PM · SRE Observability, Infrastructure-Foundations, SRE
JAllemandou added a comment to T376790: Split the permission to access Logstash from the cn=wmf and cn=nda groups.

Hi folks, as part of this ticket I lost access to logstash. Can one of you please add me to the cn=logstash-access ldap group? Many thanks.

Thu, Dec 11, 1:58 PM · SRE Observability, Infrastructure-Foundations, SRE
JAllemandou moved T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table from In progress to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Thu, Dec 11, 11:21 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
JAllemandou moved T412321: Revert webrequest retention time back to 90 days from Urgent to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Thu, Dec 11, 11:21 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Thu, Dec 4

JAllemandou added a comment to T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table.

Another thing I found is that some projects are present in mediawiki_revision_history_v1 that are not in mediawiki_history_reduced:

spark.sql("""
WITH
    namespace_map AS (
        SELECT
            dbname AS wiki_db,
            -- Compatibility with existing AQS project hostnames
            REGEXP_REPLACE(hostname, '(www\\.|\\.org)', '') AS hostname
        FROM wmf_raw.mediawiki_project_namespace_map
        WHERE TRUE
            AND snapshot = '2025-11'
    ),
    mwr_projects AS (
        SELECT DISTINCT hostname
        FROM wmf_content.mediawiki_revision_history_v1 mwr
        INNER JOIN namespace_map ns
            ON (mwr.wiki_id = ns.wiki_db)
        WHERE mwr.revision_dt < '2025-11'
    ),
    mwh_projects AS (
        SELECT DISTINCT project
        FROM wmf.mediawiki_history_reduced
        WHERE snapshot = '2025-10'
          AND event_timestamp < '2025-11'
    )
SELECT
    hostname AS mediawiki_revision_projects,
    project AS mediawiki_history_projects
FROM mwr_projects
FULL OUTER JOIN mwh_projects
    ON (hostname = project)
WHERE (hostname IS NULL OR project IS NULL)
""").show(100, false)
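
For readers less familiar with Spark SQL's REGEXP_REPLACE, the hostname normalization in the query above can be sketched in plain Python (same pattern; the example hostnames are illustrative):

```python
import re

def normalize_hostname(hostname: str) -> str:
    # Mirrors REGEXP_REPLACE(hostname, '(www\\.|\\.org)', '') from the query:
    # strips a leading "www." and the ".org" suffix to match AQS project names.
    return re.sub(r'(www\.|\.org)', '', hostname)

print(normalize_hostname('www.wikipedia.org'))      # wikipedia
print(normalize_hostname('commons.wikimedia.org'))  # commons.wikimedia
```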
Thu, Dec 4, 12:12 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
JAllemandou added a comment to T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table.

First thing: the new snapshot (2025-11) has landed, and I verified: same problem.
The next step I'm trying is to get the MariaDB revisions for the specific day/user/project we picked for the investigation, and to look for them in the event-oriented datasets.

spark.sql("""
SELECT rev.rev_id
FROM wmf_raw.mediawiki_revision rev
WHERE rev.snapshot = '2025-11'
    AND rev.wiki_db = 'bewwiktionary'
    AND rev.rev_timestamp like '20250904%'
    AND rev.rev_actor = 81
""").createOrReplaceTempView("r_81")
spark.table("r_81").cache()
Thu, Dec 4, 11:56 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
JAllemandou added a comment to T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table.

First investigation: I've looked for day/user_central_id/page_type being present in both datasets but with different counts. I took only one month (2025-09) to limit the data volume.

spark.sql("""
  SELECT
      event_timestamp,
      user_central_id,
      page_type,
      events
  FROM daily_aggregated_mediawiki_history_reduced
  WHERE
      event_timestamp < '2025-10'
      AND event_timestamp >= '2025-09'
""").createOrReplaceTempView("damwh")
spark.table("damwh").cache()
Thu, Dec 4, 10:29 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data

Wed, Dec 3

JAllemandou added a comment to T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table.

I have added the aggregated events value to @xcollazo's query above, and indeed we have an issue:

+---------+----------+-----------+                                              
|    count|    events|count_match|
+---------+----------+-----------+
|  3694708| 674767899|       null|
|269857694|4167426208|       true|
|  8951255|2631857251|      false|
+---------+----------+-----------+

There is a big proportion of aggregated events for which rows don't match. We need to investigate before publishing.

Wed, Dec 3, 4:49 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data

Tue, Dec 2

JAllemandou moved T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table from Next Up to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Tue, Dec 2, 1:12 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
JAllemandou moved T410017: Provide a Spark production access for dbt with Airflow from In progress to Blocked/Paused on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Tue, Dec 2, 1:11 PM · OKR-Work, Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou updated the task description for T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table.
Tue, Dec 2, 12:48 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
JAllemandou claimed T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table.
Tue, Dec 2, 12:46 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
JAllemandou created T411484: Create new Druid datasource based on the `mediawiki_revision_history_v1` table.
Tue, Dec 2, 12:45 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data

Thu, Nov 27

JAllemandou added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

I have verified this morning: we had successful Airflow jobs updating the new table, and the number of rows in the new wmf_content.mediawiki_revision_history_v1 table exactly matches the number of rows in the wmf_content.mediawiki_content_history_v1 table. Looks like our strategy of reusing events works!

Thu, Nov 27, 10:09 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Wed, Nov 26

JAllemandou added a comment to T410531: Drop rc_type from recentchanges in wmf production.

@JAllemandou is this relevant for Data Engineering?

Wed, Nov 26, 8:45 PM · Data-Engineering, DBA, Schema-change-in-production
JAllemandou added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

We had a quick chat with @xcollazo and decided that, since time is pressing, we are going to go with the as-simple-as-possible version of the system for page_type.
This is a change from what I stated in Slack yesterday with @Ottomata.

Wed, Nov 26, 7:02 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

Funnily enough, there are now 4.3M more rows on the target table than on the source.

Wed, Nov 26, 10:24 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou created T411079: `central_auth.local_user` table contains corrupted records.
Wed, Nov 26, 10:21 AM · MediaWiki-Platform-Team, MediaWiki-Core-AuthManager

Tue, Nov 25

JAllemandou added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

Something else needed to get the data into Druid with all the required dimensions is a daily version of the project_namespace_map table. The data-generation step is cheap and the data is small, so I'll create a new table.

Tue, Nov 25, 6:33 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T406069: Global Editor Metrics - Druid mediawiki_history_reduced changes.

After talking with @mforns this morning:

  • The 4 metrics defined in the task will always be requested for a single user_central_id.
  • The 4th metric, List of edited pages within a date range, will not be used this year (see this page), so there is no need to embed page information (page_title or page_id) in the new dataset for now.
  • It's ok to drop rows that don't have a defined user_central_id (not NULL and strictly positive) from the new Druid dataset, as the queries always filter for a single user_central_id. (Only a very small number of rows are filtered out, see https://phabricator.wikimedia.org/T410688#11404257.)
  • We wish to have an exact copy of the mediawiki_history_reduced dataset in terms of dataset definition. Endpoints for the first three metrics are already serving data using mediawiki_history_reduced, and we now wish to have this data updated daily from a new datasource while minimizing change.
Tue, Nov 25, 10:07 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
JAllemandou updated subscribers of T410688: Implement a new pipeline and table with reconciled historical revision data.

Some data validation on joining for user_central_id:

WITH centralauth AS (
    SELECT DISTINCT
        lu_wiki,
        lu_local_id,
        lu_global_id
    FROM wmf_raw.centralauth_localuser
    WHERE snapshot = '2025-10'
)
Tue, Nov 25, 8:45 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T410688: Implement a new pipeline and table with reconciled historical revision data.

Since we are going to need to backfill wmf_content.mediawiki_revision_history_v1 from wmf_content.mediawiki_content_history_v1, perhaps it is best if we backfill user_central_id to the latter first: T406515: Add user_central_id to mediawiki_content_history_v1 (and mediawiki_content_current_v1).

Tue, Nov 25, 8:13 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Mon, Nov 24

JAllemandou created T410966: Bump hadoop container maximum memory size.
Mon, Nov 24, 11:39 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering

Thu, Nov 20

JAllemandou added a comment to T406069: Global Editor Metrics - Druid mediawiki_history_reduced changes.

Here is the proposed solution we came up with in today's meeting:

  • We are going to create a new dataset in Druid containing revision-events only, for the purpose of the Editors-metrics.
  • This dataset will be updated daily from a newly created Iceberg table.
  • This Iceberg table will be a replica of the mediawiki_content_history_v1 table, minus the content.
  • This table will be updated daily from the same sources and using the same process (copy/paste) as the mediawiki_content_history_v1 table:
    • pagechange events
    • reconciliation events
  • No reconciliation against MariaDB is needed, as this is already done by the mediawiki_content_history_v1 process handling the same data (minus text!).
Thu, Nov 20, 7:35 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), OKR-Work, MediaWiki-Page-derived-data
JAllemandou renamed T410017: Provide a Spark production access for dbt with Airflow from Provide a Spark thrift-server for dbt with Airflow to Provide a Spark production access for dbt with Airflow.
Thu, Nov 20, 9:35 AM · OKR-Work, Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou moved T410017: Provide a Spark production access for dbt with Airflow from Next Up to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Thu, Nov 20, 9:32 AM · OKR-Work, Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou moved T410378: Reduce the number of refine job calls to the hive metastore from In progress to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Thu, Nov 20, 9:32 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Wed, Nov 19

JAllemandou created T410557: Make blunderbuss synchronize artifacts for the test-cluster.
Wed, Nov 19, 8:12 PM · Data-Engineering
JAllemandou added a comment to T410017: Provide a Spark production access for dbt with Airflow.

Also, if using spark-thrift, I'm not sure at all it would run in k8s: wouldn't it be the same issue as being able to run a job with the spark-master in k8s and executors on yarn?

No, I think it would be fine. All of the executors would be launched on Kubernetes as well.

Wed, Nov 19, 4:28 PM · OKR-Work, Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T410017: Provide a Spark production access for dbt with Airflow.

Also, if using spark-thrift, I'm not sure at all it would run in k8s: wouldn't it be the same issue as being able to run a job with the spark-master in k8s and executors on yarn?

Wed, Nov 19, 3:46 PM · OKR-Work, Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T410017: Provide a Spark production access for dbt with Airflow.

The other option, using neither a Thrift server nor k8s, is to make dbt run in Skein. I need to think more about this one.

Wed, Nov 19, 3:43 PM · OKR-Work, Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou updated subscribers of T410017: Provide a Spark production access for dbt with Airflow.

If we run the dbt-spark job in k8s, there is no need for a Thrift server: dbt could use the "session mode", where it launches its own job with predefined configuration.

Wed, Nov 19, 3:42 PM · OKR-Work, Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Tue, Nov 18

JAllemandou added a comment to T410378: Reduce the number of refine job calls to the hive metastore.

I'm pretty sure I have nailed down the problem.
At write time, the refine job inserts partitions in overwrite mode. To do so, we need to set spark.sql.sources.partitionOverwriteMode to dynamic, otherwise Spark overwrites the entire table rather than only the touched partitions.
Changing this approach to a delete+insert, where we drop the partition, write the files to a folder with overwrite, then recreate the partition, has solved the problem.
My guess is that for Spark to decide which partitions to drop and recreate, it first requests information about all the existing partitions in the table, in order to store that information in its own format (catalog) instead of having it in the metastore only.
Anyhow, I have a patch fixing the issue (see above); it currently runs successfully on the test cluster, and we'll deploy it tomorrow.
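
The static vs dynamic partition-overwrite behavior described above can be illustrated with a toy model (plain Python, not Spark; partitions are modeled as dict entries and the keys/values are hypothetical):

```python
def overwrite_static(table, new_parts):
    # Static mode: the whole table is replaced by the written partitions.
    return dict(new_parts)

def overwrite_dynamic(table, new_parts):
    # Dynamic mode: only the partitions being written are replaced.
    merged = dict(table)
    merged.update(new_parts)
    return merged

table = {"2025-11-17": [1, 2], "2025-11-18": [3]}
incoming = {"2025-11-18": [4]}
print(overwrite_static(table, incoming))   # only the new partition survives
print(overwrite_dynamic(table, incoming))  # untouched partitions are kept
```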

Tue, Nov 18, 7:41 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou moved T410378: Reduce the number of refine job calls to the hive metastore from Next Up to In progress on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Tue, Nov 18, 10:12 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou claimed T410378: Reduce the number of refine job calls to the hive metastore.
Tue, Nov 18, 10:11 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou created T410378: Reduce the number of refine job calls to the hive metastore.
Tue, Nov 18, 10:11 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Nov 13 2025

JAllemandou updated the task description for T410017: Provide a Spark production access for dbt with Airflow.
Nov 13 2025, 10:47 AM · OKR-Work, Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou claimed T410017: Provide a Spark production access for dbt with Airflow.
Nov 13 2025, 10:28 AM · OKR-Work, Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou created T410017: Provide a Spark production access for dbt with Airflow.
Nov 13 2025, 10:28 AM · OKR-Work, Patch-For-Review, Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T409099: Iceberg Merge strategies with dbt.

The solution presented above is completely aligned with our current way of doing things.
We could even think of optimizations allowing us to process bigger portions of data than the regular hourly job does.
Great work!

Nov 13 2025, 8:20 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou claimed T409470: mediawiki_history_dumps failure.
Nov 13 2025, 8:18 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou moved T409470: mediawiki_history_dumps failure from Next Up to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Nov 13 2025, 8:17 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou updated Other Assignee for T409470: mediawiki_history_dumps failure, added: JAllemandou.
Nov 13 2025, 8:17 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T409470: mediawiki_history_dumps failure.

Problem solved!

Nov 13 2025, 8:17 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Nov 11 2025

JAllemandou added a comment to T409676: Request for Hourly Pageview Data for multiple articles– July 18 to September 8, 2025.

Hi folks, you can find the hourly pageview dumps here: https://dumps.wikimedia.org/other/pageview_complete
It'll be a lot to download and filter, but you'll find what you're after in there :)

Nov 11 2025, 6:54 PM · Research, Data-Engineering
JAllemandou added a comment to T409470: mediawiki_history_dumps failure.

This problem is due to the job still using the old jar version after the patches we made changing the schema. I'll provide a patch and deploy it today.

Nov 11 2025, 5:10 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T409099: Iceberg Merge strategies with dbt.

Very cool!
Let's schedule some time to work together on the Iceberg time partitioning.

Nov 11 2025, 4:21 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T409099: Iceberg Merge strategies with dbt.

This pre_hook solution is great :)
If we can build our own materialization defining such a hook, it would be perfect.
The only concern I see here is defining the time period for the deletion and materialization (this can change between jobs); I'm not sure how we can do that :)

Nov 11 2025, 12:16 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Nov 10 2025

JAllemandou moved T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data from In progress to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Nov 10 2025, 3:08 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform
JAllemandou added a comment to T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data.

And the Druid datasources have been backfilled.
This task is considered done; I'll wait a few days for feedback before closing it.

Nov 10 2025, 3:08 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform
JAllemandou added a comment to T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data.

While backfilling I discovered that some fake, very long domains could make the classification take too long (2 days for one webrequest hour!). I have sent the above patch to solve this.

Nov 10 2025, 10:58 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform
JAllemandou added a comment to T409514: Migrate Sqoop jobs to Airflow.

Airflow DAGs that would handle each table-Sqoop job independently

Nov 10 2025, 8:51 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)

Nov 6 2025

JAllemandou added a comment to T409099: Iceberg Merge strategies with dbt.

I looked on the internet, and the idempotent-pipeline problem in DBT is a known and documented issue.
My preferred article is https://techblog.finatext.com/dbt-incremental-strategy-and-idempotency-877993f48448

Nov 6 2025, 2:36 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T409099: Iceberg Merge strategies with dbt.

We met with @JMonton-WMF and found that DBT supports building new materializations: https://docs.getdbt.com/guides/create-new-materializations
From a quick look at it, it feels like this could be the solution to our problem (pre-hooks, etc.).
We have decided to try applying DBT to one of our real examples: reproducing projectview_hourly with DBT and Iceberg (and Airflow when feasible).
I suggested this example as its data is much smaller than mediawiki_content's.

Nov 6 2025, 2:20 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data.

Starting to backfill the pageview_hourly Hive table from 2025-05-10T04:00. I'll backfill one month at a time.

Nov 6 2025, 7:54 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform

Nov 5 2025

JAllemandou added a comment to T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data.

The patch has been deployed, and new data has started to flow:

SELECT
    hour,
    COUNT(1),
    COUNT(DISTINCT referer_data)
FROM wmf.webrequest
WHERE year = 2025 and month = 11 and day = 5 and hour IN (15, 16)
GROUP BY hour
ORDER BY hour
Nov 5 2025, 7:28 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform

Nov 4 2025

JAllemandou added a comment to T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data.

I have updated the regexes a few times, and I have posted the latest list of domains and referer classes in the doc, in place of the previous one.
In my checking of the list, I haven't found false positives, only improvements.
Obviously this will not stay true for long, but it seems good enough for now.
I'll deploy tomorrow if that's ok for you, @nshahquinn-wmf.
Thanks!

Nov 4 2025, 9:05 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform
JAllemandou added a comment to T408939: Fix iceberg table location in hive metastore.

My understanding is that DELETE always removes data whether on managed or EXTERNAL tables, while DROP TABLE deletes data only on managed tables, not on EXTERNAL ones.

Nov 4 2025, 2:01 PM · Data-Engineering

Nov 3 2025

JAllemandou added a comment to T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data.

In addition to all the positive changes, I noticed a few new false positives from removing the requirement, for some sites, that there be a preceding subdomain (which is probably assumed to be "www."). The top examples I see are "https://bing.gugugegu.com/" (now classified as Bing) and "https://startpage.freebrowser.org/" (now classified as Startpage). The numbers are very small, so it's not a big issue, but it also makes me wonder about the "https://bing.com" referrers that are now classified as Bing. If the regex was originally written to require a subdomain because that's what Bing always uses, maybe the hits without it are actually from a bot that messed up when trying to imitate Bing.
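
The subdomain issue can be sketched with illustrative regexes (plain Python; these patterns are not the actual refinery ones, just a demonstration of the loose-vs-strict trade-off):

```python
import re

# Loose pattern: "bing." anywhere in the host - also matches impostor domains.
loose = re.compile(r'bing\.')
# Stricter pattern: the registrable domain itself must be bing.com.
strict = re.compile(r'(^|\.)bing\.com$')

print(bool(strict.search('www.bing.com')))       # True
print(bool(strict.search('bing.gugugegu.com')))  # False - impostor rejected
print(bool(loose.search('bing.gugugegu.com')))   # True - false positive
```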

Nov 3 2025, 10:27 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform
JAllemandou added a comment to T408939: Fix iceberg table location in hive metastore.

prevent data-dropping errors

Does Iceberg delete external table location data on DELETE statements? Or does it just do its metadata file switching stuff?

Nov 3 2025, 3:42 PM · Data-Engineering
JAllemandou added a comment to T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data.

I have vetted the data with the latest version of the change, and it looks good to me.
I have copied some values where the referer has changed, grouped by referer domain and ordered by number of hits descending, here (note: this list doesn't include IP referers that have now been categorized as unknown instead of external).

Nov 3 2025, 1:49 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform
JAllemandou added a comment to T408939: Fix iceberg table location in hive metastore.

Another reason for which we used the external-table mechanism was to prevent data-dropping errors. I think it's worth keeping it as is :)

Nov 3 2025, 8:42 AM · Data-Engineering
JAllemandou added a comment to T408798: Spike: investigate incorrect page_id values in pageview_hourly.

For diffs: could we not just modify the pageview algorithm and add an is_diff or pageview_kind=diff field that indicates if it was a diff pageview? We should know pretty easily by the URI path.

pageview_kind (name to be bikeshed) could be useful, if apps start adding page_ids to their pageview requests (and we can find them in webrequests), we could do e.g. pageview_kind=ios_app.

Nov 3 2025, 8:40 AM · MediaWiki-Platform-Team (Radar), MW-Interfaces-Team, Data-Engineering, MediaWiki-Core-Hooks

Oct 31 2025

JAllemandou added a comment to T408687: Create example dbt models using Iceberg.

I don't think the micro_batch strategy changes anything for our case. It only helps when reprocessing big amounts of temporally partitioned data with DBT: DBT splits the job into smaller instances, similar to what we would do with Airflow.

Oct 31 2025, 10:30 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Movement-Insights, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Epic
JAllemandou added a comment to T408798: Spike: investigate incorrect page_id values in pageview_hourly.

Thank you so much for the very clear summary, @Milimetric!
Some notes:

Oct 31 2025, 10:22 AM · MediaWiki-Platform-Team (Radar), MW-Interfaces-Team, Data-Engineering, MediaWiki-Core-Hooks

Oct 30 2025

JAllemandou added a comment to T408687: Create example dbt models using Iceberg.

To my understanding, the insert_overwrite strategy only works for Iceberg tables when the table's time partitioning matches the loading job's cadence. This is because insert_overwrite in Iceberg overwrites entire partitions whenever any data is written to them.
The related concern is that we wish NOT to make the Iceberg time partitioning match the loading job frequency, for instance to mitigate small-file issues. So in those cases (the unique_devices tables being one of them), we need a different strategy, and the question I had is whether merge can safely replace delete + insert. If it can, great, let's switch to that. Otherwise we need to make DBT replicate our delete + insert approach.
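
The delete + insert pattern mentioned above can be sketched abstractly (plain Python; the row shape and the `hour` window key are hypothetical):

```python
def load_window(table, hour, fresh_rows):
    """Delete+insert for one time window: drop any existing rows for `hour`,
    then append the freshly computed ones. Re-running the same load is a no-op,
    which is what makes the pipeline idempotent."""
    return [r for r in table if r["hour"] != hour] + fresh_rows

table = [{"hour": 0, "v": 1}]
fresh = [{"hour": 1, "v": 2}, {"hour": 1, "v": 3}]
once = load_window(table, 1, fresh)
twice = load_window(once, 1, fresh)
print(once == twice)  # True - rerunning yields the same table state
```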

Oct 30 2025, 5:40 PM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Movement-Insights, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Epic
JAllemandou added a comment to T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data.

The code got pushed and reviewed positively. I'll spend tomorrow doing some analysis on real data, trying to spot blatantly wrong rows; if everything goes well, it should be merged/deployed early next week.

Oct 30 2025, 5:35 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform
JAllemandou added a comment to T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data.

Back from holidays, I'll update my patch to:

  • Not include invalid referer URLs (declined T383088 based on comments)
  • Add the requested AIChatbot referer category
  • Update/add individual referer definitions as suggested
Oct 30 2025, 12:40 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform
JAllemandou placed T408185: Improve referrer tracking/classification using `utm_source` URL parameter up for grabs.
Oct 30 2025, 12:34 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Movement-Insights
JAllemandou closed T383088: webrequest dataset sets referer_class "unknown" instead of "external (search engine)" for origin-based referer values as Declined.

Boldly declining this based on the feedback above. Please re-open as needed.

Oct 30 2025, 12:33 PM · Patch-For-Review, Analytics-Data-Problem, Movement-Insights, Product-Analytics, Data-Engineering
JAllemandou added a comment to T407994: Move Druid realtime configuration out of Refinery into standalone repo on GitLab.

The reason I suggested this task is that Druid realtime jobs are a specific type of job: they are not Airflow-related, while all other batch data movements are. Also, they could benefit from GitLab CI/CD HTTP calls to apply changes automatically, instead of today's manual application.
That being said, I completely agree we should take a broader look at the refinery at large before making a decision here :)

Oct 30 2025, 11:07 AM · SRE, Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T408687: Create example dbt models using Iceberg.

One thing I wish to highlight here is that, for the moment, on most of our Iceberg tables we don't use a merge strategy for our incremental runs, but rather delete + insert, which keeps job reruns idempotent (see below).
I suggest we spend some time spiking to check whether the existing strategy can be replicated with dbt, and think carefully about the impact if we need to change it :)
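For context, a minimal in-memory sketch of why delete + insert makes reruns idempotent (illustrative only, not our Spark/Iceberg code): because every run first deletes all rows for the period being computed, running the same job twice yields exactly the same table.

```python
def delete_insert(table, new_rows, period_key, period):
    """Simulate the delete + insert pattern: drop every row for the
    period being (re)computed, then insert the fresh rows."""
    kept = [r for r in table if r[period_key] != period]
    return kept + new_rows

table = [{"month": "2025-09", "devices": 100}]
fresh = [{"month": "2025-10", "devices": 120}]
once = delete_insert(table, fresh, "month", "2025-10")
# rerunning with the same inputs does not duplicate rows
twice = delete_insert(once, fresh, "month", "2025-10")
```

A plain append-style insert would duplicate the October rows on rerun; merge can match this behavior only if its match condition covers the whole period's rows, which is the thing to verify in the spike.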

Oct 30 2025, 11:04 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Movement-Insights, Data-Engineering (Q2 FY25/26 October 1st - December 31th), Epic

Oct 24 2025

JAllemandou added a comment to T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data.

Thank you everyone for chiming in :)

Oct 24 2025, 8:35 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform
JAllemandou created T408185: Improve referrer tracking/classification using `utm_source` URL parameter.
Oct 24 2025, 8:33 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Movement-Insights
JAllemandou added a comment to T407997: Drop the afl_ip column and the afl_ip_timestamp index from the abuse_filter_log table.

No direct impact on us; we don't sqoop this table.

Oct 24 2025, 8:05 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), DBA, AbuseFilter, Schema-change-in-production
JAllemandou added a comment to T407485: Set up x1 replication to an-redacteddb1001.

The impact on DE seems small IMO. I'm not an expert, but my understanding is that some databases/tables are to be moved to the x1 section, which contains databases used across multiple projects. Those tables, if configured so, should be replicated to the redacted database, one of them being the an-redacteddb1001 host we use to sqoop redacted data.
Two things to keep in mind here:

  • If we're talking about the databases/tables from T395072 and T387419, we currently don't sqoop them
  • Longer term, we wish to stop sqooping from an-redacteddb and only sqoop from analytics-replicas
Oct 24 2025, 8:03 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Data-Engineering, Data-Services, Data-Persistence, cloud-services-team, Privacy Engineering

Oct 23 2025

JAllemandou added a comment to T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data.

@Mayakp.wiki can I get your perspective on Isaac's comment above please? Thank you :)

Oct 23 2025, 4:04 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform
JAllemandou merged T400799: Change RefererClassifier to capture malformed but internal referer strings as internal (wmf.pageview_actor) into T383088: webrequest dataset sets referer_class "unknown" instead of "external (search engine)" for origin-based referer values.
Oct 23 2025, 4:01 PM · Patch-For-Review, Analytics-Data-Problem, Movement-Insights, Product-Analytics, Data-Engineering
JAllemandou merged task T400799: Change RefererClassifier to capture malformed but internal referer strings as internal (wmf.pageview_actor) into T383088: webrequest dataset sets referer_class "unknown" instead of "external (search engine)" for origin-based referer values.
Oct 23 2025, 4:01 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T406531: NEWFEATURE REQUEST: Add new referral sources to pageview data.

I have a question as I'm changing this: in the current classification, referers in the form of IPs with a protocol (for instance http://192.168.2.1) are classified as external. I'm planning to change this to unknown, the same way IPs without protocols are. Is that OK with you?
Pinging @Isaac on this too, because I know you devised the referrer dataset, and this change impacts it.
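A sketch of the proposed rule, to make the change concrete. This is not the actual RefererClassifier (which distinguishes internal, search engine, etc.); the `referer_class` helper and the simplified "everything else is external" fallback are assumptions for illustration.

```python
import re
from urllib.parse import urlparse

# Bare IPv4 host, e.g. "192.168.2.1" (no IPv6 handling in this sketch).
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def referer_class(referer: str) -> str:
    """Hypothetical classifier: IP-host referers, with or without a
    protocol prefix, become 'unknown'; everything else is simplified
    to 'external' here."""
    # urlparse("192.168.2.1") yields an empty netloc, so fall back
    # to the raw string for protocol-less referers.
    host = urlparse(referer).netloc or referer
    host = host.split(":")[0]  # strip an optional port
    return "unknown" if IPV4_RE.match(host) else "external"
```

The point of the change is only the first branch: http://192.168.2.1 and 192.168.2.1 would land in the same bucket.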

Oct 23 2025, 3:34 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th), Patch-For-Review, Essential-Work, Movement-Insights (FY25-26 H1), Data-Platform
JAllemandou added a comment to T406765: Create a new gitlab repository for use with dbt.

Yesterday during a meeting, @amastilovic suggested dbt-models as the name, and I believe we agreed it is a good one. Is everyone else OK with this name?

Oct 23 2025, 10:37 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T406764: Provide a dbt-core development environment and production setup in the data-platform.

Pasting here the Slack comment I wrote yesterday:

Oct 23 2025, 10:24 AM · Patch-For-Review, Data-Engineering-Roadmap, Movement-Insights, Epic, Data-Platform-SRE

Oct 20 2025

JAllemandou merged T406644: Stop sqooping revision.rev_sha1 into T406000: Adapt mediawiki_history to the removal of mediawiki revision.rev_sha1.
Oct 20 2025, 6:51 AM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou merged task T406644: Stop sqooping revision.rev_sha1 into T406000: Adapt mediawiki_history to the removal of mediawiki revision.rev_sha1.
Oct 20 2025, 6:51 AM · Data-Engineering
JAllemandou added a comment to T406644: Stop sqooping revision.rev_sha1.

I did this as part of T406000. See https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196631

Oct 20 2025, 6:50 AM · Data-Engineering

Oct 16 2025

JAllemandou moved T406000: Adapt mediawiki_history to the removal of mediawiki revision.rev_sha1 from Ready to Deploy to Done on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Oct 16 2025, 5:35 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou moved T406000: Adapt mediawiki_history to the removal of mediawiki revision.rev_sha1 from In Review to Ready to Deploy on the Data-Engineering (Q2 FY25/26 October 1st - December 31th) board.
Oct 16 2025, 5:35 PM · Data-Engineering (Q2 FY25/26 October 1st - December 31th)
JAllemandou added a comment to T402145: Create view for betafeatures_user_counts table in wiki replicas.

I have reviewed the table: it contains only a list of features with the number of users of each. It is safe to replicate.

Oct 16 2025, 8:34 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Data-Engineering, BetaFeatures, Data-Services, cloud-services-team