User Details
- User Since
- Nov 7 2014, 8:52 PM (464 w, 2 d)
- Availability
- Available
- IRC Nick
- mforns
- LDAP User
- Mforns
- MediaWiki User
- Mforns (WMF)
Today
Looks good to me in general!
Could we have interactionData and customData be the same parameter maybe?
Does it matter for them to be separate?
Fri, Sep 22
@phuedx I like that the performer_id would be unrecognizable for someone capturing the events through the network!
Regarding where to define sanitization rules, I think we should consider it as part of @JAllemandou's idea about centralizing data pipelines configuration, no?
+1 spider
Hi all! Adding some thoughts:
Thu, Sep 21
If I understand correctly, the only missing step here is to import the vendor MetricsPlatform code into the EventLogging extension and deploy that.
Wed, Sep 20
Here's the GitLab MetricsPlatform MR to update the monoschema version:
https://gitlab.wikimedia.org/repos/data-engineering/metrics-platform/-/merge_requests/4
Tue, Sep 12
After a bit of investigation, I think there's good news:
IIUC, the Chrome User Agent reduction has already been rolled out completely.
Mon, Sep 11
I think the short term risk is that the data is not correct.
Since the changes to rev_deleted/rev_actor were applied, I did a short round of data vetting and couldn't find any weird data behaviors.
However, the MediaWikiHistory code is complex and so are the job's data flows; it is possible that some details have escaped me.
Here's a spreadsheet with an analysis of the impact of the UserAgent deprecation so far (as of today).
My summarized takeaway is that, so far, there is no visible degradation of the automated traffic detection.
https://docs.google.com/spreadsheets/d/1y1mxagwM5FI5y1qQdlM76xeQKgfHCUQtvJbszuozMiA/edit#gid=0
Aug 28 2023
I need a background task while I work on other things. I will take this one if it's OK.
@Milimetric let me know if you want to tackle this.
Aug 25 2023
The Event Platform folks responded in Slack to the question: Is it OK if future Metrics Platform events are 10% to 40% bigger?
They agreed that:
- The size of the individual message is completely fine (the limit is 4MB - and will be increased to 10MB).
- The overall size of all Metrics Platform events is a small* share of the events that flow through Event Platform (most streams receive 0-100 evts/sec, some 100-500 evts/sec). This is within the current Event Platform system capabilities.
- (*) Except for the virtual_pageview and session_length instruments. However, streams like these are going to be treated individually and fall outside this question.
- In the unlikely case that EventGate suffers because of the volume of Metrics Platform, we can always add more replicas to it.
- The transition to Metrics Platform (either migration or new streams) will be progressive (we are not going to switch all instruments at once).
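As a rough back-of-envelope sketch of what that increase could mean (purely illustrative, using the ~1116-byte event measured below, the 40% worst case, and 500 events/sec as the upper end of a busy stream; these are assumptions, not measurements of actual Metrics Platform traffic):

event_size_bytes = 1116        # legacy EditAttemptStep event measured below
increase = 0.40                # assumed worst-case Metrics Platform size increase
events_per_sec = 500           # upper end of a typical busy stream

extra_bytes_per_sec = event_size_bytes * increase * events_per_sec
extra_gb_per_day = extra_bytes_per_sec * 86400 / 1e9
print(f"{extra_bytes_per_sec / 1e6:.2f} MB/s extra, {extra_gb_per_day:.1f} GB/day extra")
# roughly 0.22 MB/s and ~19 GB/day extra for such a stream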
Aug 23 2023
Just to make sure that I've understood this correctly:
We shouldn't be overly concerned about performance (data transferred to the event ingestion service, data size on Kafka – both bandwidth used and storage – and data size in Hadoop) when it comes to the design of the fragments that we're creating, e.g. when considering splitting by entity or having a monofragment.
Yes, I think that makes sense!
Maybe with the caveat to confirm with Event Platform folks about the size of the events in network/Kafka.
Aug 22 2023
- It's difficult to compare the 2 datasets since they are quite different, but:
- There seems to be a significant increase (10%-40%) in the size of the events when traveling through the network, and thus an increase in bandwidth consumption (and Kafka storage). We should check with the Event Platform team.
- The increase of the data size in Hadoop, if any, will be compensated mostly by the lack of eventlogging-legacy fields.
- There should not be any adverse effect on the performance of queries to the datasets. Read-all processes such as Refine or sanitization can be affected proportionally to the increment in size, if any.
- Even if we see an increase in bandwidth, storage, or compute, it would only be critical for high-throughput streams such as virtual_pageview or session_length.
I wanted to try and benchmark both datasets against the sanitization Refine process.
However, since they are quite different in structure (see other 2 comparisons), the results would probably not matter.
Event data is stored in 3 places in Hadoop:
- Raw json events (HDFS) [compressed json]
- Refined events (Hive event database) [parquet]
- Refined sanitized events (Hive event_sanitized database) [parquet]
The original dataset events traveling through the network look like this:
{ "event": { "page_id": 10869877, "page_title": "Parlamentswahl_in_Portugal_2019", "page_ns": 0, "revision_id": 220378432, "user_id": 0, "user_is_temp": false, "user_class": "IP", "user_editcount": 0, "mw_version": "1.41.0-wmf.22", "page_token": "9abdbb91213cdcf45484", "session_token": "fada448f54f4f5d2a1b6", "version": 1, "wiki": "dewiki", "skin": "minerva", "is_bot": false, "editing_session_id": "c6bed539485c2118e640", "editor_interface": "wikitext", "integration": "page", "platform": "phone", "action": "ready", "ready_timing": 324, "is_oversample": false }, "schema": "EditAttemptStep", "webHost": "de.wikipedia.org", "wiki": "dewiki", "$schema": "/analytics/legacy/editattemptstep/2.0.0", "client_dt": "2023-08-22T17:20:15.929Z", "meta": { "stream": "eventlogging_EditAttemptStep", "domain": "de.wikipedia.org", "id": "8f80d246-001f-4c15-a93b-36d45d2bfe0e", "dt": "2023-08-22T17:20:18.964Z", "request_id": "8efe119d-c1e2-4c09-b9ba-698c7ff2b422" }, "dt": "2023-08-22T17:20:18.964Z", "http": { "request_headers": { "user-agent": "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Mobile Safari/537.36" }, "client_ip": "2a0a:a546:a098:0:2994:e264:ac1f:ce81" } }
About 1116 bytes in size.
Aug 21 2023
Looks great to me!
Agree with @mpopov that this dataset is limited so far.
We definitely could improve it by adding more dimensions to it.
I can't recall any impediment to (once additional data is collected) splitting by dimensions other than wiki or project family.
The only thing maybe is that this dataset gets sparse in the long tail quickly, so probably we would not be able to break it down by more than 1 dimension at a time.
That said, I think it would be a pity to switch it off; I still think it gives some value, and it is an example of how to collect data in a privacy-aware way.
Aug 18 2023
Looking now at the monoschema dataset. This is the final list of fields:
- $schema
- agent (app_install_id, client_platform, client_platform_family)
- custom_data
- dt
- http (has_cookies, method, protocol, request_headers, response_headers, status_code)
- mediawiki (is_debug_mode, is_production, site_content_language, site_content_language_variant, skin, version, database)
- meta (domain, dt, id, request_id, stream, uri)
- name
- page (title, content_language, id, is_redirect, namespace, namespace_name, revision_id, user_groups_allowed_to_edit, user_groups_allowed_to_move, wikidata_id, wikidata_qid)
- performer (can_probably_edit_page, edit_count, edit_count_bucket, groups, id, is_bot, is_logged_in, language, language_variant, name, pageview_id, registration_dt, session_id)
- user_agent_map
- is_wmf_domain
- normalized_host (project_class, project, qualifiers, tld, project_family)
- datacenter
- year
- month
- day
- hour
Note that custom_data is not exploded. This is because, in Hive, a field of type map<str, ...> does not keep null values for missing keys; it just doesn't store them. Like this:
{"is_bot":{"data_type":"boolean","value":"false"},"integration":{"data_type":"string","value":"page"},"loaded_timing":{"data_type":"number","value":"2486"},"editing_session_id":{"data_type":"string","value":"ed3c724deec25d8809264e0b6d69f4a2"},"editor_interface":{"data_type":"string","value":"wikitext"},"wiki":{"data_type":"string","value":"enwiki"},"skin":{"data_type":"string","value":"vector-2022"}}
This only contains values for 7 of the defined fields; the rest (which would have had null values in a regular schema) are not present. Again, this makes it difficult to compare the sizes and efficiency of both datasets.
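For illustration, here's a minimal sketch of how a custom_data field could be read back in Spark; the table name is hypothetical, and the map values are structs with data_type and value, as shown above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# custom_data is a map<string, struct<data_type, value>>, so absent keys are
# simply not stored rather than appearing as NULL columns.
df = spark.sql("""
    SELECT
        meta.stream,
        custom_data['editor_interface'].value AS editor_interface,
        custom_data['editing_session_id'].value AS editing_session_id
    FROM event.mediawiki_metrics_platform_monoschema_test  -- hypothetical table
    WHERE year = 2023 AND month = 8 AND day = 18
""")
df.show(5, truncate=False)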
Before comparing both datasets I wanted to have a closer look at the original one, to make sure the comparison makes sense.
Aug 14 2023
The sanitization pipeline has a feature with which the developer can specify the fields to hash.
That hash is salted by default, and the salts are rotated every 3 months (and thrown away).
Would the feature described in this task necessarily need to be implemented by the Metrics Platform, too?
Also, should we consider that we might need the ability to salt-hash other fields at some point?
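For context, a minimal sketch of the kind of salt-hashing described above (HMAC-SHA-256 with a per-quarter salt; the salt value and field are made up, and this is not the sanitization pipeline's actual implementation):

import hashlib
import hmac

def salted_hash(value: str, salt: bytes) -> str:
    # Once the quarterly salt is rotated and thrown away, the original value
    # can no longer be recovered or correlated across quarters.
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()

quarterly_salt = b"hypothetical-salt-2023-Q3"  # rotated every 3 months, then discarded
print(salted_hash("some-session-id", quarterly_salt))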
Aug 9 2023
I created this MR for the subtask "Create the instance specific dags folder (ready to merge)"
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/474
Please, check that all is correct :-)
Jul 21 2023
I have reviewed the sources and videos that were passed along (BTW, thank you @ntsako for the videos!) and here's a summary of:
- What I've learned, so we can refer to this in the future and avoid some rework.
- Questions that arose (I reached out today to @ntsako, but I fear it was too late for his time zone).
- Some observations on potential improvements for each section of the project.
Jun 28 2023
@AndrewTavis_WMDE Hi! I think you could simply go with wmde. The analytics prefix in product_analytics exists because that's the team's name; in your case, wmde should work.
BTW this is the task to create the WMDE Airflow instance: T340648
Jun 22 2023
Here's a list of all Airflow analytics tags as of 2023-06-22.
https://docs.google.com/spreadsheets/d/1XtvtLeZUWIWmEYGF9JYukeszZZ1oO0je8FJVbOKyxpA
Since Oozie doesn't have any more jobs running (except for T333218, which will be removed asap), soon there won't be any more Oozie users.
Jun 21 2023
Since nobody has complained about the missing data after 30 days:
Removed the refinery-source code and the airflow-dags code.
Also deleted the data from HDFS. It went to the Trash folder, where it will be irreversibly deleted in 30 days.
I think this can be moved to Done.
Jun 19 2023
There is a _FAILURE flag that gets written, and by default previous failures are excluded from the next refinement. That's why we have to manually rerun those with --ignore-failure-flag=true. We only get one alert per Refine attempt of an hour (unless someone reruns and it fails again).
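(A sketch of that behavior as described above, not Refine's actual code; the function name here is illustrative only.)

def should_refine_hour(has_failure_flag: bool, ignore_failure_flag: bool = False) -> bool:
    # By default, an hour that previously wrote a _FAILURE flag is excluded from
    # the next Refine run; passing --ignore-failure-flag=true forces a retry.
    return not has_failure_flag or ignore_failure_flag

print(should_refine_hour(has_failure_flag=True))                            # False: skipped
print(should_refine_hour(has_failure_flag=True, ignore_failure_flag=True))  # True: manual rerun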
Gotcha! sorry for misleading...
Oh, this looks great! It will simplify the deletion of data so much... 👏👏👏
In T337052#8865357, @mforns wrote:
> Another reason is that we hopefully(2) won't need to execute Refine on a window of hours, like we do now.
Thanks @mforns. Could you elaborate on this a little for me, please?
Just stumbled on this task, I created the missing patch.
It should work regardless of the version of Airflow that is running it.
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/434
Jun 13 2023
Iceberg table that keeps track of Iceberg ingestions
I also like this idea!
A good thing about having it in an Iceberg table is that we can also apply deletes. This would help in case we have some mechanism that triggers cascading re-runs.
Maybe we could implement our own ExternalTaskSensor/ExternalTaskMarker in Airflow that would use this table.
This way we could get cascading re-runs without being limited to 1 Airflow instance, plus the state would not be in the Airflow database. Win-win.
Maybe we could even use this table to aid us in implementing Refine?
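To make the idea a bit more concrete, here's a rough sketch of what such a sensor could look like (the ingestion-state table, its columns, and the run_sql helper are hypothetical, not an existing operator):

from airflow.sensors.base import BaseSensorOperator


class IcebergIngestionSensor(BaseSensorOperator):
    """Pokes a (hypothetical) Iceberg table that records ingestion state,
    instead of relying on the metadata database of a single Airflow instance."""

    def __init__(self, *, dataset: str, partition_ts: str, run_sql, **kwargs):
        super().__init__(**kwargs)
        self.dataset = dataset          # e.g. "wmf.webrequest"
        self.partition_ts = partition_ts
        self.run_sql = run_sql          # callable that executes SQL and returns rows

    def poke(self, context) -> bool:
        rows = self.run_sql(
            "SELECT count(1) FROM ingestion_state "
            "WHERE dataset = %(dataset)s AND partition_ts = %(ts)s AND status = 'done'",
            {"dataset": self.dataset, "ts": self.partition_ts},
        )
        return rows[0][0] > 0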
Jun 9 2023
I like the simplicity of this approach. It got me thinking of the following though: we'd be coupling the availability of Presto with the availability (or progress) of Airflow, which I think is something to consider.
Yes, agree.
Jun 8 2023
If we use Presto, we wouldn't need to launch anything, since we could use Presto's SQL server.
We could even avoid Skein, since we could call Presto SQL from the sensor itself (the Airflow machine).
Joseph and I did some tests today, and it took 4 seconds to sense a full week of daily referrer data.
I also like the idea of using the Java Table API, but at the same time I think Iceberg is intended to be queried via SQL, not by checking internals (manifests, files, partitions...), no?
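As an illustration of calling Presto directly from the sensor process, a minimal sketch with the presto-python-client (host, port, catalog, and table are placeholders, not the real analytics setup):

import prestodb  # presto-python-client

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.org",  # placeholder coordinator
    port=8280,
    user="analytics",
    catalog="analytics_hive",
    schema="wmf",
)
cur = conn.cursor()
# Check that a full week of daily referrer data is present (hypothetical table).
cur.execute(
    "SELECT count(DISTINCT day) FROM referrer_daily "
    "WHERE year = 2023 AND month = 6 AND day BETWEEN 1 AND 7"
)
(days_present,) = cur.fetchone()
print("week complete:", days_present == 7)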
Jun 7 2023
Could we use Presto instead of Spark?
That would be lighter and have less latency, right?
But, I thought not all datasets were queryable in the Presto cluster...?
We discussed in standup whether it would be OK to spawn JVM/Spark for each poke of the Iceberg sensors.
So here's some modelling:
I think if we use PyIceberg, we can use:
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual
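Continuing that thought, a rough sketch of how those imports might be used (the catalog name and table are assumptions, and PyIceberg would still need working HDFS/Kerberos support):

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

catalog = load_catalog("analytics")  # hypothetical catalog, configured elsewhere
table = catalog.load_table("wmf_traffic.referrer_daily")  # hypothetical table

# Plan a scan restricted to the partitions we care about and sum the row counts
# from file metadata, without spinning up a JVM/Spark.
scan = table.scan(row_filter=GreaterThanOrEqual("day", "2023-06-01"))
row_count = sum(task.file.record_count for task in scan.plan_files())
print("rows found:", row_count)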
Jun 6 2023
I received another answer, this time from a committer & PMC member on Apache Avro, Airflow, Druid and Iceberg. He thinks the best way would be to use PyIceberg and implement support for HDFS/Kerberos, since spinning up a full JVM/Spark would be "overkill for just checking the number of rows".
Jun 5 2023
Folks from the community gave me some answers: