User Details
- User Since
- Oct 9 2014, 4:50 PM (582 w, 1 d)
- Availability
- Available
- IRC Nick
- ottomata
- LDAP User
- Ottomata
- MediaWiki User
- Ottomata
Today
The choice of pyflink was to help solve a problem: to enable teams to build and own their (realtime) derived data pipelines. But, as you say, no one is doing this. So, before we make a decision like this, I’d really like to work with @GGoncalves-WMF on the broader derived data problem from a platform product management perspective. What do our users need and what do we want to provide for them? So, I’d prefer if we moved a bit slowly and carefully on this. There are lots of questions about how to do the data transfer between the data platform and online storage for serving, as well as for streaming enrichment, etc.
If we do this... Java for sure! We built all of our Flink library tooling in Java. I like Scala too, and while it makes coding some things easier, it makes integrating with unexpected things harder.
Tue, Nov 25
For my own (out of the loop) understanding, here are the changes to previously made decisions:
Mon, Nov 24
If we implement it at the PCS level, that would count only the requests that are cache misses at the edge.
Thu, Nov 6
I wanted to understand how multi-DC-ness relates to all the pieces here. Just writing down what I found:
Update for most recent API endpoints:
If these are all caused by imports (are they? we should check for sure), then we should probably model a page_change_kind: import in the mediawiki.page_change.v1 event.
Nov 5 2025
Decision ^ here: T403660#11347022
In meeting today, we decided that "Good enough product" was sufficient for now. If this is not the case, Product will try to let us know as soon as possible.
Also
Speaking of redirects:
Here is a product question about the "top k pages viewed" metric to discuss in today's sync meeting: T401260#11341613
Yep, we'll use mediawiki.page_content_change.v1. I think we just need to change the kafka_topic in change-prop, right?
Nov 4 2025
I've been testing backfilling pageview_per_editor_per_page. Fab repartitioned Alek's test table at fab.edit_per_editor_per_page_daily and it performs better now. I can backfill a month of pageview data using this table in a little over 5 minutes.
For intermediate Data Lake tables, Add HQL for edit_per_editor_per_page_daily and pageview_per_editor_per_page_daily (1196892) should be good to go from a data model and load query perspective.
^^ did we create a new ticket? :)
At T401260#11230961, we decided to not store per editor per page pageviews metrics in cassandra just to support the top K pageviews use case. This wasn't our favorite decision, because it means we have to maintain 2 different cassandra tables and data pipelines, and the top k pageviews metric is no longer an additive timeseries metric. Product teams can't do 'top k in last 30 days', they can only do e.g. 'top k in October'.
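The non-additivity bit is easy to see with a toy example (hypothetical counts, just for illustration; the real tables are per editor per page):

```python
from collections import Counter

# Hypothetical per-page view counts for two months.
october = Counter({"A": 100, "B": 90})
november = Counter({"A": 5, "B": 20, "C": 95})

def top_k(counts, k):
    """Return the k pages with the most views."""
    return [page for page, _ in counts.most_common(k)]

# Top-1 within each month: A in October, C in November.
print(top_k(october, 1))
print(top_k(november, 1))

# Top-1 over the combined two-month window is B (110 views),
# which was never the monthly top page. So you can't derive
# 'top k in last 30 days' by combining stored monthly top-k lists.
print(top_k(october + november, 1))
```

So once only the monthly top-k lists are stored, an arbitrary-window top-k can't be reconstructed from them.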
We've got our first actual daily pageviews per editor per page data lake table record! @amastilovic backfilled and ran the pageviews daily query for 2025-10-25. On that day, we stored 26,637,692 records.
But it's not about pages. It's about paragraphs, which are implicitly part of pages (revisions of pages).
True, but specifically about paragraphs that belong to MediaWiki pages. Paragraphs do not have a corresponding MediaWiki entity concept. Paragraphs do not have a unique id with which they can be referred to alone. They require a page_id (and/or revision_id) to be contextualized.
Nov 3 2025
Here is a suspicious event from October
If that is not the case then I think we have to also consider page_revision_paragraph_tone_scores, or even wiki_page_revision_paragraph_tone_scores.
...This conversation makes me think it would be useful to have a property in the event that indicates if the latest revision_id has changed. IIRC MW DomainEvents are actually named and modeled around this concept.
Ya, move is good.
Both data and metadata get deleted when dropping an Iceberg managed table.
Hm, okay, just asking for my education. This is different from regular Hive external tables then, yes?
@daniel, moving the convo from the patch to this ticket.
Could we go with page_paragraph_tone_scores?
I think it's clear that the data represents paragraphs from MediaWiki pages when we have page_id as part of primary key
@xcollazo do you want content also on page_change_kind == visibility_change and page_change_kind == delete? (Well uh, we can't do delete, because we can't get content after a page has been deleted.)
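If it helps the discussion, the content-emission rule could be sketched like this. The kind names come from this thread; the mapping itself is an assumption for illustration, not the deployed config:

```python
# Sketch: which page_change_kind values could carry page content.
# 'delete' is excluded because content can't be fetched after the
# page has been deleted.
KINDS_WITH_CONTENT = {"create", "edit", "move", "undelete", "visibility_change"}

def should_emit_content(page_change_kind: str) -> bool:
    """True if a content payload could be attached for this kind of change."""
    return page_change_kind in KINDS_WITH_CONTENT

print(should_emit_content("visibility_change"))  # True
print(should_emit_content("delete"))             # False
```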
location would have changed as part of the ALTER TABLE RENAME, and it would have broken the Iceberg table because Iceberg keeps track of fully qualified file names.
prevent data-dropping errors
I have a concern about changing the job name as I don't know what can be affected.
Oct 31 2025
For diffs: could we not just modify the pageview algorithm and add an is_diff or pageview_kind=diff field that indicates if it was a diff pageview? We should know pretty easily by the URI path.
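A minimal sketch of what that URI check might look like, assuming diff views show up as a `diff` query parameter on index.php-style URLs. The real pageview definition lives in refinery; this is illustrative only:

```python
from urllib.parse import parse_qs

def is_diff_pageview(uri_path: str, uri_query: str) -> bool:
    """Heuristic: treat a request as a diff pageview if the query string
    carries a `diff` parameter (e.g. /w/index.php?diff=prev&oldid=123).
    The real rule would need to match MediaWiki's actual URL schemes."""
    params = parse_qs(uri_query.lstrip("?"))
    return "diff" in params

print(is_diff_pageview("/w/index.php", "?title=Foo&diff=prev&oldid=123"))  # True
print(is_diff_pageview("/wiki/Foo", ""))                                   # False
```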
I also wonder if there is a real need to always use a specific (external) location for Hive-managed Iceberg tables. We did this with Hive tables in the past especially because not all tables were created via SQL. Some were directories and files in HDFS before a Hive table was layered on them (e.g. wmf_raw.webrequest, etc.)
When fixing this, we should use a fully qualified URL: not just "hdfs:///..." but "hdfs://analytics-hadoop/...", specifying the Hadoop cluster where the location is. (We do have an analytics-test-hadoop cluster ;) )
Hm, I thought we used https://gitlab.wikimedia.org/repos/maven/wmf-jvm-parent-pom#maven-checkstyle-plugin already?
Oct 30 2025
We won't use a different source unit, so I think including page is unnecessary.
It seems there is more to look into here, but I wrote up the implications for Global Editor Metrics here: T405039#11329322.
@mforns and I were debugging our pageviews/per_editor queries yesterday, and we ran into a very unexpected issue with pageviews_hourly. This issue is explored and (will be) documented in T408798: Spike: investigate incorrect page_id values in pageview_hourly.
Wow nice find. Def high priority and probably a great task for @JMonton-WMF to take on!
I just deployed the pageviews/v3/per_editor endpoint. It will not work because there is no data behind it.
For T405039: Global Editor Metrics - Data Pipeline, we are using pageview_hourly to compute editor impact metrics. We wanted to include the page_title in the output dataset, to make the metrics more useable. Since the same page_id is associated with many page_titles, this won't be possible.
Just wondering what you meant when you say "other usages"
there isn't any reason for restricting things.
Moving a Slack convo here to phab.
Oct 29 2025
I'm not quite sure who would be responsible for figuring this out for sure. I would guess that it was disabled for loginwiki a long, long time ago by the old, now-defunct Services team. I assume there was a reason?
If there are no privacy concerns with just enabling all TYPE_EVENT for loginwiki, that would be the simplest way to accomplish this. Otherwise, we will have to do something similar for loginwiki as was done for private wikis in T346046: [Search Update Pipeline] Source streams for private wikis.
13:26:24 [@stat1011:/home/otto] $ ls -la /srv/published/
total 28
drwxrwxr-x 6 root  wikidev 4096 Oct 31  2024 .
drwxr-xr-x 3 stats wikidev 4096 Jun 21  2024 datasets
...
Oct 28 2025
IIUC, you're okay with not naming this table more specifically about structured tasks?
each type of metric is basically its own microservice
Which raises the question: should it be? ;)
user_central_id is now in Druid mediawiki_history_reduced! Thanks @amastilovic !
could you confirm that the version change is going to be applied for all AQS endpoints?
@Dbrant! Great news! edits/v3/per_editor is live!
@aqu why mergeComments vs e.g mergeFieldsMetadata like in https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/987195/10/refinery-spark/src/main/scala/org/wikimedia/analytics/refinery/spark/sql/HiveExtensions.scala? Since Spark treats comments like a kind of field metadata, shouldn't we make the SparkSqlExtension stuff do the same for all metadata?
Ya exactly.
Ya, currently codfw is the active datacenter, so only its topic will have real data. Try:
@JMonton-WMF thanks for the patch!
I just deployed edit-analytics with the metrics/v3/edits/per_editor endpoint. It does not work externally at https://wikimedia.org/api/rest_v1/metrics/v3/edits/per_editor.
Oct 27 2025
...and also back to the 'is it a cache' discussion!
every change to the model absolutely requires a change to the application code as well
This is probably a good thing. IIUC, model_version rarely changes, but if it does, you probably want to have a managed upgrade path. This also would give you the ability to A/B test serving different model versions. I would expect when this happens that ML could generate and store tasks using both models, until we are sure the new model_version is the one to use for sure.
Yup, but is it otherwise different in any meaningful way?
Technically, maybe not. But in terminology/usage/common understanding, maybe! But yes, agreed that we should sidetrack this discussion for larger stuff; as-is, this is fine!
But also, it kind of is a cache, isn't it?
I'm not sure if it is! At the very least, it is not a read-through cache. But as we discussed in slack, the line is blurry.
consumes the mediawiki.page_content_change.v1 events (triggered by changeprop)
FYI, in case this is useful to you all in T304373: eventgate-logging-external events, like mediawiki.client_error, are now ingested into Hive in the Data Lake. There is now an event.mediawiki_client_error Hive table.
I don't have a strong preference other than not dbt.
I like it! Some field naming suggestions: