User Details
- User Since
- Feb 11 2015, 6:02 PM (565 w, 6 d)
- Availability
- Available
- IRC Nick
- joal
- LDAP User
- Unknown
- MediaWiki User
- JAllemandou (WMF) [ Global Accounts ]
Yesterday
I've been spending time on this, here's my latest finding: by reducing retention to 30 days, we keep the data at an acceptable size with the current sampling.
I have set the retention to 30 days, and the Druid data size dropped from 600GB to 80GB.
At the current sampling we store ~500MB per hour, which will lead to ~400GB for one month (500MB × 24 × 30 ≈ 360GB), lower than the previous 600GB.
I'll keep the conf as is (30-day retention), and will remove the daily loading job, as the segments are big enough to not require compaction.
In the meantime I'm going to re-enable the indexation job with more resources so that it doesn't fail, but we need to find a solution for disk space: either sampling more aggressively, or keeping only one week of data for now.
Note: if we decide to keep one week of data, it means Druid can be queried for that week only, but the data is not deleted yet. It's still available in deep storage on Hadoop, so it can be made queryable again with a config change (if disk space is available).
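For reference, this kind of retention setup is expressed as Druid coordinator load/drop rules; a 30-day retention would look roughly like the sketch below (the tier name and replicant count are illustrative, not taken from our cluster config):

```json
[
  {
    "type": "loadByPeriod",
    "period": "P30D",
    "includeFuture": true,
    "tieredReplicants": { "_default_tier": 2 }
  },
  { "type": "dropForever" }
]
```

Segments older than 30 days stop being loaded on historicals, but as noted above they remain in deep storage, so widening the `loadByPeriod` period brings them back without re-ingestion.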
Sat, Dec 13
That's right, the failure is due to an OOM in the Druid ingestion job. But it's better for us not to ingest while we don't have a decision on how to handle the data size.
Thu, Dec 11
Thanks so much @Novem_Linguae , I confirm the link worked for me.
Hi folks, as part of this ticket I lost access to logstash. Can one of you please add me to the cn=logstash-access ldap group? Many thanks.
Thu, Dec 4
Another thing I found is that some projects present in mediawiki_revision_history_v1 are missing from mediawiki_history_reduced:
spark.sql("""
WITH
namespace_map AS (
SELECT
dbname AS wiki_db,
-- Compatibility with existing AQS project hostnames
REGEXP_REPLACE(hostname, '(www\\.|\\.org)', '') AS hostname
FROM wmf_raw.mediawiki_project_namespace_map
WHERE TRUE
AND snapshot = '2025-11'
),
mwr_projects AS (
SELECT DISTINCT hostname
FROM wmf_content.mediawiki_revision_history_v1 mwr
INNER JOIN namespace_map ns
ON (mwr.wiki_id = ns.wiki_db)
WHERE mwr.revision_dt < '2025-11'
),
mwh_projects AS (
SELECT DISTINCT project
FROM wmf.mediawiki_history_reduced
WHERE snapshot = '2025-10'
AND event_timestamp < '2025-11'
)
SELECT
hostname as mediawiki_revision_projects,
project as mediawiki_history_projects
FROM mwr_projects
FULL OUTER JOIN mwh_projects
ON (hostname = project)
WHERE (hostname IS NULL OR project IS NULL)
""").show(100,false)
First thing: the new snapshot (2025-11) has landed, I verified: same problem.
Next step I'm trying is to get the MariaDB revisions for the specific day/user/project we picked for the investigation, and try to look for them in the event-oriented datasets.
spark.sql("""
SELECT rev.rev_id
FROM wmf_raw.mediawiki_revision rev
WHERE rev.snapshot = '2025-11'
AND rev.wiki_db = 'bewwiktionary'
AND rev.rev_timestamp like '20250904%'
AND rev.rev_actor = 81
""").createOrReplaceTempView("r_81")
spark.table("r_81").cache()
First investigation: I've looked for day/user_central_id/page_type combinations present in both datasets but with different counts. I took only one month (2025-09) to limit the data volume.
spark.sql("""
SELECT
event_timestamp,
user_central_id,
page_type,
events
FROM daily_aggregated_mediawiki_history_reduced
WHERE
event_timestamp < '2025-10'
AND event_timestamp >= '2025-09'
""").createOrReplaceTempView("damwh")
spark.table("damwh").cache()
Wed, Dec 3
I have added the aggregated events value to @xcollazo query above and indeed we have an issue:
+---------+----------+-----------+
|    count|    events|count_match|
+---------+----------+-----------+
|  3694708| 674767899|       null|
|269857694|4167426208|       true|
|  8951255|2631857251|      false|
+---------+----------+-----------+
There is a big proportion of aggregated events for which rows don't match. We need to investigate before publishing.
Tue, Dec 2
Thu, Nov 27
I have verified this morning: we had successful Airflow jobs updating the new table, and the number of rows in the new wmf_content.mediawiki_revision_history_v1 table exactly matches the number of rows in the wmf_content.mediawiki_content_history_v1 table. Looks like our strategy of reusing events works!
Wed, Nov 26
Funnily enough, there are now 4.3M more rows on the target table than the source
Tue, Nov 25
Something else needed in order to get the data into Druid with all the required dimensions is a daily version of the project_namespace_map table. The data-generation step is cheap and the data small, so I'll create a new table.
After talking with @mforns this morning:
- The 4 metrics defined in the task will always be requested for a single user_central_id.
- The 4th metric, List of edited pages within a date range, will not be used this year (see this page), so there is no need to embed page information (page_title or page_id) in the new dataset for now.
- It's ok to drop rows without a defined user_central_id (i.e. keep only rows where it is not NULL and strictly positive) from the new Druid dataset, since the queries always filter for a single user_central_id. (Only a very small number of rows are filtered out, see https://phabricator.wikimedia.org/T410688#11404257.)
- We wish to have an exact-copy of the mediawiki_history_reduced dataset in terms of dataset definition. Endpoints for the first three metrics are already serving data using mediawiki_history_reduced, and we now wish to have this data updated daily using a new datasource, but minimize change.
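The user_central_id filter described in the points above would amount to something like the following at loading time (a sketch only; the actual loading query is not shown in this task):

```sql
-- Keep only rows with a usable user_central_id,
-- since all endpoint queries filter on a single one.
SELECT *
FROM wmf.mediawiki_history_reduced
WHERE snapshot = '2025-10'
  AND user_central_id IS NOT NULL
  AND user_central_id > 0
```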
Some data validation on joining for user_central_id:
WITH centralauth AS (
SELECT DISTINCT
lu_wiki,
lu_local_id,
lu_global_id
FROM wmf_raw.centralauth_localuser
WHERE snapshot = '2025-10'
)
Mon, Nov 24
Thu, Nov 20
Here is the proposed solution we came up with in today's meeting:
- We are going to create a new dataset in Druid containing revision-events only, for the purpose of the Editors-metrics.
- This dataset will be updated daily from a newly created Iceberg table.
- This Iceberg table will be a replica of the mediawiki_content_history_v1 table, minus the content.
- This table will be updated daily from the same sources and using the same process (copy/paste) as the mediawiki_content_history_v1 table:
- pagechange events
- reconciliation events
- No reconciliation against MariaDB is needed, as this is already done by the mediawiki_content_history_v1 process handling the same data (minus text!)
Wed, Nov 19
Also, if using spark-thrift, I'm not at all sure it would run in k8s: wouldn't we hit the same issue as trying to run a job with the Spark master in k8s and executors on YARN?
The other option, using neither a Thrift server nor k8s, is to make dbt run in Skein. I need to think more about this one.
If we run the dbt-spark job in k8s, there is no need for a Thrift server: dbt could use "session mode", where it launches its own job with a predefined configuration.
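For reference, dbt-spark's session mode is selected in the profile with `method: session`; a minimal sketch might look like this (profile name, target, and schema are illustrative, not our actual config):

```yaml
# hypothetical dbt profiles.yml entry using the in-process Spark session
analytics_profile:
  target: prod
  outputs:
    prod:
      type: spark
      method: session
      schema: analytics
```

With this method dbt runs inside a Spark session it starts itself, so no separate Thrift server needs to be deployed.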
Tue, Nov 18
I'm pretty sure I have nailed down the problem.
At write time, the refine job inserts partitions in overwrite mode. For that, spark.sql.sources.partitionOverwriteMode needs to be set to dynamic, otherwise Spark overwrites the entire table instead of only the touched partitions.
Changing this approach to a delete+insert (drop the partition, write the files to a folder with overwrite, then recreate the partition) has solved the problem.
My guess is that, in order to decide which partitions to drop and recreate, Spark first requests information about all the existing partitions in the table, so that it holds that information in its own catalog instead of only in the metastore.
Anyhow, I have a patch fixing the issue (see above); it currently runs successfully on the test cluster, and we'll deploy it tomorrow.
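For context, the dynamic overwrite mode mentioned above is set like this in Spark SQL (the target table and source view names are illustrative):

```sql
-- With the default "static" mode, INSERT OVERWRITE truncates the
-- whole table; "dynamic" replaces only partitions that receive rows.
SET spark.sql.sources.partitionOverwriteMode = dynamic;

INSERT OVERWRITE TABLE wmf.webrequest
SELECT *
FROM webrequest_refined_hour;  -- hypothetical staging view for one hour
```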
Nov 13 2025
The solution presented above is completely aligned with our current way of doing things.
We could even think of optimizations allowing us to process bigger portions of data than the regular hourly job.
Great work!
Problem solved!
Nov 11 2025
Hi folks, you can find pageview hourly dumps here: https://dumps.wikimedia.org/other/pageview_complete
It'll be a lot to download and filter, but you'll find what you're after in there :)
This problem is due to the job still using the old jar version after the patches we made changing the schema. I'll provide a patch and deploy it today.
Very cool !
Let's schedule some time to work together on the Iceberg time partitioning.
This pre_hook solution is great :)
If we can build our own materialization defining such a hook, it would be perfect.
The only concern I see here is defining the time period for the deletion and materialization (this can change between jobs); I'm not sure how we can do that :)
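As a rough illustration of the pre_hook idea, a dbt model config could delete the target time period before appending it; the `day` variable, strategy, and source name below are hypothetical, not an agreed design:

```sql
-- dbt model sketch: delete the period being recomputed, then re-append it
{{ config(
    materialized='incremental',
    incremental_strategy='append',
    pre_hook="DELETE FROM {{ this }} WHERE day = DATE '{{ var('day') }}'"
) }}

SELECT *
FROM {{ source('event', 'some_source_table') }}
WHERE day = DATE '{{ var('day') }}'
```

Passing the period as a dbt var (e.g. `--vars '{day: 2025-11-12}'` from Airflow) is one possible answer to the time-period concern, though it would need validation.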
Nov 10 2025
And the druid datasources have been backfilled.
This task is considered done, waiting for feedback for a few days before closing it.
While backfilling I discovered that some fake, very long domains could make the classification run far too long (2 days for one webrequest hour!). I have sent the above patch to solve this.
Airflow DAGs that would handle each table-Sqoop job independently
Nov 6 2025
I looked on the internet, and the idempotent-pipeline problem in DBT is a known and documented issue.
My preferred article is https://techblog.finatext.com/dbt-incremental-strategy-and-idempotency-877993f48448
We met with @JMonton-WMF and found that DBT supports building new materializations: https://docs.getdbt.com/guides/create-new-materializations
From a quick look at it, it feels like possibly this should be the solution to our problem (pre-hooks, etc).
We have decided to try applying DBT to one of our real examples: reproduce projectview_hourly with DBT and an Iceberg table (and Airflow when feasible).
I suggested this example as data is way smaller than using mediawiki_content.
Starting backfilling pageview_hourly hive table from 2025-05-10T04:00. I'll backfill one month at a time.
Nov 5 2025
The patch has been deployed, new data started to flow:
SELECT
hour,
COUNT(1),
COUNT(DISTINCT referer_data)
FROM wmf.webrequest
WHERE year = 2025 and month = 11 and day = 5 and hour IN (15, 16)
GROUP BY hour
ORDER BY hour
Nov 4 2025
I have updated the regexes a few times, and I have posted the latest list of domains and referer classes in the doc, replacing the previous one.
In my checking of the list, I haven't found false positives, only improvements.
Obviously this will not stay true for long, but it seems good enough for now.
I'll be deploying tomorrow if ok for you @nshahquinn-wmf .
Thanks!
My understanding is that DELETE always removes data, whether on managed or EXTERNAL tables, while DROP TABLE deletes the data only for managed tables, not for EXTERNAL ones.
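A sketch of that distinction in Hive-style SQL (table names and location are made up for illustration):

```sql
-- Managed table: DROP TABLE removes both metadata and data files.
CREATE TABLE demo_managed (id INT);
DROP TABLE demo_managed;

-- External table: DROP TABLE removes only the metadata;
-- the files under LOCATION are left untouched.
CREATE EXTERNAL TABLE demo_external (id INT)
LOCATION '/tmp/demo_external';
DROP TABLE demo_external;

-- DELETE removes rows (i.e. data) in both cases,
-- on engines/table formats that support row-level deletes.
DELETE FROM demo_external WHERE id = 1;
```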
Nov 3 2025
I have vetted the data with the latest version of the change, and it looks good to me.
I have copied some values where the referer has changed, grouped by referer domain and ordered by number of hits desc, here (note: this list doesn't include IP referers that have now been categorized as unknown instead of external).
Another reason for which we used the external-table mechanism was to prevent data-dropping errors. I think it's worth keeping it as is :)
Oct 31 2025
I don't think the micro_batch strategy changes anything for our case. It only helps when reprocessing big amounts of time-partitioned data with DBT: DBT splits the job into smaller instances, in a similar way to what we would do with Airflow.
Thank you so much for the very clear summary @Milimetric !
Some notes:
Oct 30 2025
To my understanding, the insert_overwrite strategy only works for Iceberg tables when the table's time partitioning matches the loading-job cadence. This is because insert_overwrite on Iceberg overwrites entire partitions whenever any data is written to them.
The related concern is that we explicitly do NOT want Iceberg time partitioning to match loading-job frequency, for instance to mitigate small-file issues. So in those cases (the unique_devices tables being one of them) we need a different strategy, and the question I had is whether merge can safely replace delete + insert. If it can, great, let's switch to that. Otherwise we need to make DBT replicate our delete + insert approach.
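The delete + insert pattern in question is rerun-safe because repeating it leaves the table in the same end state; roughly (the table and staging-view names below are illustrative):

```sql
-- Idempotent reload of one day: delete the period, then re-insert it.
-- Rerunning this pair produces the same final table state.
DELETE FROM wmf.unique_devices
WHERE day = DATE '2025-10-30';

INSERT INTO wmf.unique_devices
SELECT *
FROM unique_devices_staging  -- hypothetical recomputed rows
WHERE day = DATE '2025-10-30';
```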
Code got pushed and reviewed positively. I'll spend tomorrow to do some analysis on real data, trying to spot blatantly wrong rows, and if everything goes right it should be merged/deployed early next week.
Back from holidays, I'll update my patch to:
- not include invalid referer URLS (declined T383088 based on comments)
- Add the requested AIChatbot referer category
- Update/add individual referer definition as suggested
Boldly declining this based on feedback above. please re-open as needed.
The reason I suggested this task is that Druid realtime jobs are a specific type of job: they are not Airflow-related, while all other batch data movements are. Also, they could benefit from GitLab CI/CD HTTP calls to apply changes automatically, instead of today's manual application.
That being said, I completely agree to take a broader look at the refinery at large before making decision here :)
One thing I wish to highlight here is that for the moment on most of our Iceberg tables we don't use a merge strategy for our incremental runs, but rather a delete + insert, that allows us to be more idempotent in case of jobs reruns (see below).
I suggest we spend some time in spiking to check if it could be possible to replicate the existing strategy with DBT, and think carefully of the impact if we need to change the strategy :)
Oct 24 2025
Thank you everyone for chiming in :)
No direct impact on us, we don't sqoop this table.
The impact on DE seems small IMO. I'm not an expert, but my understanding is that some databases/tables are to be moved to the x1 section, which contains databases used across multiple projects. Those tables, if configured so, should be replicated to the redacted database, one of them being the an-redacteddb1001 host we use to sqoop redacted data.
Two things to keep in mind here:
Oct 23 2025
@Mayakp.wiki can I get your perspective on Isaac's comment above please? Thank you :)
I have a question as I'm changing this: in the current classification, referers in the form of IPs with a protocol (for instance http://192.168.2.1) are classified as external. I'm planning to change this to unknown, in the same way IPs without protocols are. Is that ok for you?
Pinging @Isaac on this too, cause I know you devised the referrer dataset, and this change impacts it.
Pasting here the slack comment I wrote yesterday:
Oct 20 2025
I did this as part of T406000. See https://gerrit.wikimedia.org/r/c/operations/puppet/+/1196631
Oct 16 2025
I have reviewed the table, it contains only a features list with how many users use it. It is safe to replicate.