mforns (Marcel Ruiz Forns)
Software Engineer @ Analytics

User Details

User Since
Nov 7 2014, 8:52 PM (464 w, 2 d)
Availability
Available
IRC Nick
mforns
LDAP User
Mforns
MediaWiki User
Mforns (WMF)

Recent Activity

Today

mforns placed T346463: Identify and label prefetch proxy data in our traffic up for grabs.
Mon, Oct 2, 5:41 PM · Movement-Insights, Data-Engineering
mforns updated the task description for T346463: Identify and label prefetch proxy data in our traffic.
Mon, Oct 2, 5:40 PM · Movement-Insights, Data-Engineering
mforns added a comment to T345439: [SPIKE] Prototype API for Submitting Core Interaction Events.

Looks good to me in general!
Could we have interactionData and customData be the same parameter maybe?
Does it matter for them to be separate?

Mon, Oct 2, 4:59 PM · Data Products (Sprint 01), Spike, Metrics Platform Backlog

Fri, Sep 22

mforns placed T286793: [EventGate] Failures when getting stream config from MediaWiki API up for grabs.
Fri, Sep 22, 5:10 PM · Data-Engineering
mforns updated subscribers of T310198: Replace performer_id with salted hashed user ID.

@phuedx I like that the performer_id would be unrecognizable for someone capturing the events through the network!
Regarding where to define sanitization rules, I think we should consider it as part of @JAllemandou's idea about centralizing data pipelines configuration, no?

Fri, Sep 22, 5:01 PM · Metrics Platform Backlog
mforns added a comment to T336715: Investigate relation of UA deprecation to increase in automated traffic and reduction in unique devices.

+1 spider

Fri, Sep 22, 4:44 PM · Movement-Insights, Research-Freezer, Data-Engineering, Product-Analytics
mforns awarded T266641: Test Alluxio as cache layer for Presto a Burninate token.
Fri, Sep 22, 4:40 PM · Data-Platform-SRE, Data-Engineering
mforns added a comment to T346890: Windows 11 missing in analytics ?.

Hi all! Adding some thoughts:

Fri, Sep 22, 3:41 PM · Data Products, Data-Engineering-Dashiki, Data-Engineering

Thu, Sep 21

mforns added a comment to T344235: Remove `null` entry from custom_data.[].value enum in monoschema.

If I understand correctly, the only missing step here is to import the vendor MetricsPlatform code into the EventLogging extension and deploy that.

Thu, Sep 21, 1:26 PM · MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), Data Products (Sprint 01), Metrics Platform Backlog
mforns added a comment to T344235: Remove `null` entry from custom_data.[].value enum in monoschema.

https://gitlab.wikimedia.org/repos/data-engineering/metrics-platform/-/merge_requests/4 has been merged.

Thu, Sep 21, 1:26 PM · MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), Data Products (Sprint 01), Metrics Platform Backlog
mforns moved T344867: [SPIKE] Commons Impact Metrics preliminary technical review from Sprint Backlog to In Process on the Data Products (Sprint 01) board.
Thu, Sep 21, 1:23 PM · Data Products (Sprint 01)
mforns claimed T344867: [SPIKE] Commons Impact Metrics preliminary technical review.
Thu, Sep 21, 1:23 PM · Data Products (Sprint 01)

Wed, Sep 20

mforns added a comment to T344235: Remove `null` entry from custom_data.[].value enum in monoschema.

Here's the GitLab MetricsPlatform MR to update the monoschema version:
https://gitlab.wikimedia.org/repos/data-engineering/metrics-platform/-/merge_requests/4

Wed, Sep 20, 3:31 PM · MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), Data Products (Sprint 01), Metrics Platform Backlog

Tue, Sep 19

mforns reassigned T345944: Marcel's review of Data Platform Engineering Maintainership Scope from mforns to VirginiaPoundstone.
Tue, Sep 19, 12:10 PM · Data Products (Sprint 01)
mforns moved T345944: Marcel's review of Data Platform Engineering Maintainership Scope from Done to Sign Off on the Data Products (Sprint 01) board.
Tue, Sep 19, 12:09 PM · Data Products (Sprint 01)

Mon, Sep 18

mforns moved T344235: Remove `null` entry from custom_data.[].value enum in monoschema from Sprint Backlog to In Process on the Data Products (Sprint 01) board.
Mon, Sep 18, 2:55 PM · MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), Data Products (Sprint 01), Metrics Platform Backlog
mforns claimed T344235: Remove `null` entry from custom_data.[].value enum in monoschema.
Mon, Sep 18, 2:55 PM · MW-1.41-notes (1.41.0-wmf.29; 2023-10-03), Data Products (Sprint 01), Metrics Platform Backlog
mforns moved T345944: Marcel's review of Data Platform Engineering Maintainership Scope from In Process to Done on the Data Products (Sprint 01) board.
Mon, Sep 18, 2:52 PM · Data Products (Sprint 01)

Thu, Sep 14

mforns claimed T345944: Marcel's review of Data Platform Engineering Maintainership Scope.
Thu, Sep 14, 12:16 PM · Data Products (Sprint 01)
mforns moved T345944: Marcel's review of Data Platform Engineering Maintainership Scope from Sprint Backlog to In Process on the Data Products (Sprint 01) board.
Thu, Sep 14, 12:16 PM · Data Products (Sprint 01)
mforns moved T345208: [Spike] Identify and mitigate risks associated with MediaWiki History pipeline from In Process to BLOCKED on the Data Products (Sprint 01) board.
Thu, Sep 14, 12:15 PM · Data Products (Sprint 01), Data Engineering and Event Platform Team
mforns moved T336411: AQS 2.0: Geo Analytics service - configure routing in staging and production from In Process to Sign Off on the Data Products (Sprint 01) board.
Thu, Sep 14, 12:14 PM · Data Products (Sprint 01)
mforns moved T346255: Update tests in framework from In Testing to In Process on the Data Products (Sprint 01) board.
Thu, Sep 14, 12:09 PM · Data Products (Sprint 01)

Tue, Sep 12

mforns added a comment to T336084: [SPIKE] Model impact of User-Agent deprecation on top line metrics.

After a bit of investigation, I think there's good news:
IIUC, the Chrome User Agent reduction has already been rolled out completely.

Tue, Sep 12, 5:41 PM · Data Products (Sprint 01), Data Pipelines (Sprint 14), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban), Data-Engineering

Mon, Sep 11

mforns added a comment to T345208: [Spike] Identify and mitigate risks associated with MediaWiki History pipeline.

I think the short term risk is that the data is not correct.
Since the changes applied to rev_deleted/rev_actor, I did a short data vetting and couldn't find any weird data behaviors.
However, the MediaWikiHistory code is complex and so are the job's data flows, so it is possible that details have escaped me.

Mon, Sep 11, 4:30 PM · Data Products (Sprint 01), Data Engineering and Event Platform Team
mforns added a comment to T336084: [SPIKE] Model impact of User-Agent deprecation on top line metrics.

Here's a spreadsheet with an analysis on the impact of the UserAgent deprecation so far (as of today).
My summarized takeaway is that so far there's not a visible degradation of the automated traffic detection.
https://docs.google.com/spreadsheets/d/1y1mxagwM5FI5y1qQdlM76xeQKgfHCUQtvJbszuozMiA/edit#gid=0

Mon, Sep 11, 3:38 PM · Data Products (Sprint 01), Data Pipelines (Sprint 14), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban), Data-Engineering

Aug 28 2023

mforns moved T336084: [SPIKE] Model impact of User-Agent deprecation on top line metrics from Sprint Backlog to In Process on the Data Products (Sprint 00) board.
Aug 28 2023, 12:46 PM · Data Products (Sprint 01), Data Pipelines (Sprint 14), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban), Data-Engineering
mforns added a comment to T336084: [SPIKE] Model impact of User-Agent deprecation on top line metrics.

I need a background task while I work on other things. I will take this one if it's OK.
@Milimetric let me know if you want to tackle this.

Aug 28 2023, 12:45 PM · Data Products (Sprint 01), Data Pipelines (Sprint 14), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban), Data-Engineering

Aug 25 2023

mforns moved T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency from In code review / Tech Input to Sign Off on the Data Products (Sprint 00) board.
Aug 25 2023, 5:35 PM · Data Products (Sprint 00), Metrics Platform Backlog
mforns added a comment to T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency.

The Event Platform folks responded in Slack to the question: Is it OK if future Metrics Platform events are 10% to 40% bigger?
They agreed that:

  • The size of the individual message is completely fine (the limit is 4MB - and will be increased to 10MB).
  • The overall size of all Metrics Platform events is a small* share of the events that flow through Event Platform (most streams receive 0-100 evts/sec, some 100-500 evts/sec). This is within the current Event Platform system capabilities.
  • (*) Except for virtual_pageview and session_length instruments. However, streams like these are going to be treated individually and stay out of the question.
  • In the unlikely case that EventGate suffers because of the volume of Metrics Platform, we can always add more replicas to it.
  • The transition to Metrics Platform (either migration or new streams) will be progressive (we are not going to switch all instruments at once).
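
As a rough sanity check on the first point, the arithmetic can be sketched in Python. This assumes the 10% to 40% growth applies to an event of about 1116 bytes (the legacy EditAttemptStep size measured in this same task) and that "4MB" means 4 MiB; the function names are hypothetical.

```python
# Size of the sample legacy event reported in this task, in bytes.
BASELINE_EVENT_BYTES = 1116
# Assumption: the "4MB" message limit is read as 4 MiB.
MESSAGE_LIMIT_BYTES = 4 * 1024 * 1024


def grown_size(baseline_bytes, growth_fraction):
    """Return the event size after growing by the given fraction (0.10 = 10%)."""
    return round(baseline_bytes * (1 + growth_fraction))


def share_of_limit(size_bytes, limit_bytes=MESSAGE_LIMIT_BYTES):
    """Return the fraction of the per-message limit that one event uses."""
    return size_bytes / limit_bytes


if __name__ == "__main__":
    for growth in (0.10, 0.40):
        size = grown_size(BASELINE_EVENT_BYTES, growth)
        print(f"+{growth:.0%}: {size} bytes, {share_of_limit(size):.4%} of the limit")
```

Under these assumptions even the 40% case stays far below the per-message limit, which is consistent with the conclusion that individual message size is not the concern; aggregate volume is.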
Aug 25 2023, 5:30 PM · Data Products (Sprint 00), Metrics Platform Backlog

Aug 23 2023

mforns moved T344373: AQS 2.0 Cassandra-based services: Explore how to run QA test suite using Docker from Ready for Code Review to In code review / Tech Input on the Data Products (Sprint 00) board.
Aug 23 2023, 1:56 PM · Data Products (Sprint 01)
mforns moved T342276: Review and update test environments data (Cassandra & Druid) from Ready for Code Review to In code review / Tech Input on the Data Products (Sprint 00) board.
Aug 23 2023, 1:55 PM · Data Products (Sprint 00), AQS2.0 (Sprint 10)
mforns added a comment to T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency.

Just to make sure that I've understood correctly:

We shouldn't be overly concerned about performance (data transferred to the event ingestion service, data size on Kafka – both bandwidth used and storage – and data size in Hadoop) when it comes to the design of the fragments that we're creating, e.g. when considering splitting by entity or having a monofragment.

Yes, I think that makes sense!
Maybe with the caveat to confirm with Event Platform folks about the size of the events in network/Kafka.

Aug 23 2023, 1:44 PM · Data Products (Sprint 00), Metrics Platform Backlog

Aug 22 2023

mforns moved T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency from In Process to Ready for Code Review on the Data Products (Sprint 00) board.
Aug 22 2023, 8:25 PM · Data Products (Sprint 00), Metrics Platform Backlog
mforns added a comment to T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency.
  • It's difficult to compare the 2 datasets since they are quite different, but:
  • There seems to be a significant increase (10%-40%) in the size of the events when traveling through the network, and thus an increase in bandwidth consumption (and Kafka storage). We should check with the Event Platform team.
  • The increase of the data size in Hadoop, if any, will be compensated mostly by the lack of eventlogging-legacy fields.
  • There should not be any adverse effect on the performance of queries to the datasets. Read-all processes such as Refine or sanitization can be affected proportionally to the increment in size, if any.
  • Even if we see an increase in bandwidth, storage or compute, it would only be critical for high-throughput streams such as virtual_pageview or session_length.
Aug 22 2023, 8:24 PM · Data Products (Sprint 00), Metrics Platform Backlog
mforns added a comment to T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency.

I wanted to try and benchmark both datasets against the sanitization Refine process.
However, since they are quite different in structure (see other 2 comparisons), the results would probably not matter.

Aug 22 2023, 7:54 PM · Data Products (Sprint 00), Metrics Platform Backlog
mforns added a comment to T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency.

Event data is stored in 3 places in Hadoop:

  • Raw json events (HDFS) [compressed json]
  • Refined events (Hive event database) [parquet]
  • Refined sanitized events (Hive event_sanitized database) [parquet]
Aug 22 2023, 7:33 PM · Data Products (Sprint 00), Metrics Platform Backlog
mforns added a comment to T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency.

The original dataset events traveling through the network look like this:

{
  "event": {
    "page_id": 10869877,
    "page_title": "Parlamentswahl_in_Portugal_2019",
    "page_ns": 0,
    "revision_id": 220378432,
    "user_id": 0,
    "user_is_temp": false,
    "user_class": "IP",
    "user_editcount": 0,
    "mw_version": "1.41.0-wmf.22",
    "page_token": "9abdbb91213cdcf45484",
    "session_token": "fada448f54f4f5d2a1b6",
    "version": 1,
    "wiki": "dewiki",
    "skin": "minerva",
    "is_bot": false,
    "editing_session_id": "c6bed539485c2118e640",
    "editor_interface": "wikitext",
    "integration": "page",
    "platform": "phone",
    "action": "ready",
    "ready_timing": 324,
    "is_oversample": false
  },
  "schema": "EditAttemptStep",
  "webHost": "de.wikipedia.org",
  "wiki": "dewiki",
  "$schema": "/analytics/legacy/editattemptstep/2.0.0",
  "client_dt": "2023-08-22T17:20:15.929Z",
  "meta": {
    "stream": "eventlogging_EditAttemptStep",
    "domain": "de.wikipedia.org",
    "id": "8f80d246-001f-4c15-a93b-36d45d2bfe0e",
    "dt": "2023-08-22T17:20:18.964Z",
    "request_id": "8efe119d-c1e2-4c09-b9ba-698c7ff2b422"
  },
  "dt": "2023-08-22T17:20:18.964Z",
  "http": {
    "request_headers": {
      "user-agent": "Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Mobile Safari/537.36"
    },
    "client_ip": "2a0a:a546:a098:0:2994:e264:ac1f:ce81"
  }
}

About 1116 bytes in size.

Aug 22 2023, 6:57 PM · Data Products (Sprint 00), Metrics Platform Backlog
mforns moved T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency from Sprint Backlog to In Process on the Data Products (Sprint 00) board.
Aug 22 2023, 2:22 PM · Data Products (Sprint 00), Metrics Platform Backlog

Aug 21 2023

mforns added a comment to T343331: Audit existing schemas to identify Core Interactions.

Looks great to me!

Aug 21 2023, 5:08 PM · Data Products (Sprint 00), Metrics Platform Backlog, Spike
mforns added a comment to T295619: Cookie value sent in HTTP requests changes too frequently.

Agree with @mpopov that this dataset is limited so far.
We definitely could improve it by adding more dimensions to it.
I can't recall any impediment to (once additional data is collected) splitting by dimensions other than wiki or project family.
The only thing maybe is that this dataset gets sparse in the long tail quickly, so probably we would not be able to break it down by more than 1 dimension at a time.
That said, I think it would be a pity to switch it off. It still gives some value, and it is an example of how to collect data in a privacy-aware way.

Aug 21 2023, 4:19 PM · Wikimedia-Performance-recommendation, Metrics Platform Backlog, MediaWiki-Platform-Team (Radar), Product-Analytics, MediaWiki-extensions-WikimediaEvents

Aug 18 2023

mforns added a comment to T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency.

Looking now at the monoschema dataset. This is the final list of fields:

$schema:
agent:
    app_install_id:
    client_platform:
    client_platform_family:
custom_data:
dt:
http:
    has_cookies:
    method:
    protocol:
    request_headers:
    response_headers:
    status_code:
mediawiki:
    is_debug_mode:
    is_production:
    site_content_language:
    site_content_language_variant:
    skin:
    version:
    database:
meta:
    domain:
    dt:
    id:
    request_id:
    stream:
    uri:
name:
page:
    title:
    content_language:
    id:
    is_redirect:
    namespace:
    namespace_name:
    revision_id:
    user_groups_allowed_to_edit:
    user_groups_allowed_to_move:
    wikidata_id:
    wikidata_qid:
performer:
    can_probably_edit_page:
    edit_count:
    edit_count_bucket:
    groups:
    id:
    is_bot:
    is_logged_in:
    language:
    language_variant:
    name:
    pageview_id:
    registration_dt:
    session_id:
user_agent_map:
is_wmf_domain:
normalized_host:
    project_class:
    project:
    qualifiers:
    tld:
    project_family:
datacenter:
year:
month:
day:
hour:

Note that the custom_data is not exploded. This is because in Hive, since it's a field of type map<str, ...>, it does not keep null values for the missing fields, it just doesn't store them. Like this:

{"is_bot":{"data_type":"boolean","value":"false"},"integration":{"data_type":"string","value":"page"},"loaded_timing":{"data_type":"number","value":"2486"},"editing_session_id":{"data_type":"string","value":"ed3c724deec25d8809264e0b6d69f4a2"},"editor_interface":{"data_type":"string","value":"wikitext"},"wiki":{"data_type":"string","value":"enwiki"},"skin":{"data_type":"string","value":"vector-2022"}}

This only contains values for 7 of the defined fields; the rest (which would have had null values in a regular schema) are not present. Again, this makes it difficult to compare the sizes and efficiency of both datasets.
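
To make the difference concrete, here is a minimal Python sketch (not Hive or Metrics Platform code). The declared field list and the helper names are hypothetical; only the shape of each custom_data entry ("data_type" plus a string "value") is taken from the example above.

```python
import json

# Hypothetical field list: a flat schema declares every field up front.
DECLARED_FIELDS = [
    "is_bot", "integration", "loaded_timing", "ready_timing",
    "editor_interface", "init_type", "skin", "wiki",
]


def data_type_of(value):
    """Map a Python value to the data_type label used in custom_data."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def encode_value(value):
    """custom_data stores every value as a string."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def to_flat_record(values):
    """Flat schema: every declared field is present, missing ones become null."""
    return {name: values.get(name) for name in DECLARED_FIELDS}


def to_custom_data(values):
    """Map-style field: only the keys that have a value are stored."""
    return {
        name: {"data_type": data_type_of(value), "value": encode_value(value)}
        for name, value in values.items()
        if value is not None
    }


if __name__ == "__main__":
    observed = {"is_bot": False, "integration": "page", "loaded_timing": 2486}
    flat = to_flat_record(observed)
    sparse = to_custom_data(observed)
    print("flat keys:", len(flat), "null values:", sum(v is None for v in flat.values()))
    print("map keys:", len(sparse))
    print("serialized sizes:", len(json.dumps(flat)), len(json.dumps(sparse)))
```

The flat record carries one slot per declared field whether or not it was set, while the map carries only what was sent, so a null count taken over the map tells you nothing about how many fields were left empty.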

Aug 18 2023, 8:43 PM · Data Products (Sprint 00), Metrics Platform Backlog
mforns added a comment to T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency.

Before comparing both datasets I wanted to have a closer look at the original one, to make sure the comparison makes sense.

Aug 18 2023, 7:43 PM · Data Products (Sprint 00), Metrics Platform Backlog
mforns updated the task description for T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency.
Aug 18 2023, 6:33 PM · Data Products (Sprint 00), Metrics Platform Backlog
mforns updated the task description for T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency.
Aug 18 2023, 6:25 PM · Data Products (Sprint 00), Metrics Platform Backlog
mforns created T344512: Evaluate how much null fields in MP fragments affect data pipelines efficiency.
Aug 18 2023, 6:19 PM · Data Products (Sprint 00), Metrics Platform Backlog

Aug 14 2023

mforns added a comment to T310198: Replace performer_id with salted hashed user ID.

The sanitization pipeline has a feature with which the developer can specify the fields to hash.
That hash is salted by default, and the salts are rotated every 3 months (and thrown away).
Would the feature described in this task be necessarily implemented by the Metrics Platform, too?
Also, considering that we might need the ability to salt-hash other fields at some point?
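
For reference, the general idea of salted hashing with rotated salts can be sketched as below. This is only an illustration: the use of HMAC-SHA256 and the function names are assumptions, not the sanitization pipeline's actual implementation.

```python
import hashlib
import hmac
import secrets


def new_salt():
    """Generate a fresh random salt (to be discarded at the next rotation)."""
    return secrets.token_bytes(32)


def salted_hash(value, salt):
    """Hash a field value with the current salt; HMAC-SHA256 is an assumption."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    salt_q1 = new_salt()
    salt_q2 = new_salt()  # salt after a rotation; salt_q1 would be thrown away

    # Within one rotation period the same input maps to the same hash,
    # so events from the same user can still be grouped together.
    print(salted_hash("12345", salt_q1) == salted_hash("12345", salt_q1))

    # After rotation the hashes no longer match, so identifiers cannot be
    # linked across periods once the old salt is gone.
    print(salted_hash("12345", salt_q1) == salted_hash("12345", salt_q2))
```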

Aug 14 2023, 3:48 PM · Metrics Platform Backlog

Aug 9 2023

mforns added a comment to T340648: [Airflow] Setup Airflow instance for WMDE.

I created this MR for the subtask "Create the instance specific dags folder (ready to merge)"
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/474
Please, check that all is correct :-)

Aug 9 2023, 7:44 PM · Patch-For-Review, Data-Platform-SRE

Jul 21 2023

mforns updated subscribers of T342269: [Spike] Review Equity Landscape Pipeline and Data Collection Approach.

I have reviewed the passed sources and videos (BTW, thank you @ntsako for the videos!) and here's a summary of:

  • What I've learned, so we can refer to this in the future and avoid some rework.
  • Questions that arose (I reached out today to @ntsako but I fear it was too late for his time zone).
  • Some observations on potential improvements for each section of the project.
Jul 21 2023, 10:19 PM · Data Products (Sprint 0), Movement-Insights, Equity-Landscape
mforns moved T342269: [Spike] Review Equity Landscape Pipeline and Data Collection Approach from Incoming to Sprint 0 on the Data Products board.
Jul 21 2023, 5:03 PM · Data Products (Sprint 0), Movement-Insights, Equity-Landscape
AndrewTavis_WMDE awarded T340648: [Airflow] Setup Airflow instance for WMDE a Like token.
Jul 21 2023, 1:15 PM · Patch-For-Review, Data-Platform-SRE

Jun 28 2023

mforns moved T337060: Airflow job to load Knowledge Gap metrics into Cassandra from In Review to Ready to Deploy on the Data Pipelines (Sprint 14) board.
Jun 28 2023, 4:06 PM · Patch-For-Review, Data Pipelines (Sprint 14), Data-Engineering
mforns moved T337059: Implement new AQS endpoints for Knowledge Gap metrics from In Review to Ready to Deploy on the Data Pipelines (Sprint 14) board.
Jun 28 2023, 4:06 PM · Data Pipelines (Sprint 14), Data-Engineering
mforns moved T340463: [Airflow] P.O.C. on Iceberg sensor using Iceberg table to keep status of updates from Next Up to In Progress on the Data Pipelines (Sprint 14) board.
Jun 28 2023, 4:04 PM · Data Engineering and Event Platform Team
mforns claimed T340463: [Airflow] P.O.C. on Iceberg sensor using Iceberg table to keep status of updates.
Jun 28 2023, 4:03 PM · Data Engineering and Event Platform Team
mforns added a comment to T334558: [Analytics] Unique user-agents accessing Wikidata's REST API for Q2/2023.

@AndrewTavis_WMDE Hi! I think you could go with simply wmde. The analytics prefix in product_analytics exists because the team is named like that. In your case, you could use wmde I think.
BTW this is the task to create the WMDE Airflow instance: T340648

Jun 28 2023, 1:51 PM · Wikidata Analytics (Kanban), Wikidata
mforns created T340648: [Airflow] Setup Airflow instance for WMDE.
Jun 28 2023, 1:46 PM · Patch-For-Review, Data-Platform-SRE

Jun 27 2023

mforns moved T335306: [SPIKE] Evaluation on iceberg sensor for airflow from In Review to Done on the Data Pipelines (Sprint 14) board.
Jun 27 2023, 3:35 PM · Spike, Data Pipelines (Sprint 14)

Jun 26 2023

mforns added a subtask for T335306: [SPIKE] Evaluation on iceberg sensor for airflow: T340471: [Airflow] P.O.C. on Iceberg sensor using Snapshot metadata to keep status of updates.
Jun 26 2023, 5:02 PM · Spike, Data Pipelines (Sprint 14)
mforns added a parent task for T340471: [Airflow] P.O.C. on Iceberg sensor using Snapshot metadata to keep status of updates: T335306: [SPIKE] Evaluation on iceberg sensor for airflow.
Jun 26 2023, 5:02 PM · Data Engineering and Event Platform Team, Data Pipelines
mforns added a project to T340471: [Airflow] P.O.C. on Iceberg sensor using Snapshot metadata to keep status of updates: Data Pipelines.
Jun 26 2023, 5:02 PM · Data Engineering and Event Platform Team, Data Pipelines
mforns created T340471: [Airflow] P.O.C. on Iceberg sensor using Snapshot metadata to keep status of updates.
Jun 26 2023, 5:02 PM · Data Engineering and Event Platform Team, Data Pipelines
mforns added a subtask for T335306: [SPIKE] Evaluation on iceberg sensor for airflow: T340466: [Airflow] P.O.C. on Iceberg sensor using Postgres table to keep status of updates.
Jun 26 2023, 4:52 PM · Spike, Data Pipelines (Sprint 14)
mforns added a parent task for T340466: [Airflow] P.O.C. on Iceberg sensor using Postgres table to keep status of updates: T335306: [SPIKE] Evaluation on iceberg sensor for airflow.
Jun 26 2023, 4:52 PM · Data Engineering and Event Platform Team, Data Pipelines
mforns created T340466: [Airflow] P.O.C. on Iceberg sensor using Postgres table to keep status of updates.
Jun 26 2023, 4:51 PM · Data Engineering and Event Platform Team, Data Pipelines
mforns added a subtask for T335306: [SPIKE] Evaluation on iceberg sensor for airflow: T340463: [Airflow] P.O.C. on Iceberg sensor using Iceberg table to keep status of updates.
Jun 26 2023, 4:47 PM · Spike, Data Pipelines (Sprint 14)
mforns added a parent task for T340463: [Airflow] P.O.C. on Iceberg sensor using Iceberg table to keep status of updates: T335306: [SPIKE] Evaluation on iceberg sensor for airflow.
Jun 26 2023, 4:47 PM · Data Engineering and Event Platform Team
mforns created T340463: [Airflow] P.O.C. on Iceberg sensor using Iceberg table to keep status of updates.
Jun 26 2023, 4:47 PM · Data Engineering and Event Platform Team
mforns moved T333218: Product Analytics ETL Migration: WikipediaPreview stats from Blocked/Paused to Done on the Data Pipelines (Sprint 14) board.
Jun 26 2023, 3:11 PM · Product-Analytics (Kanban), Data Pipelines (Sprint 14)

Jun 22 2023

mforns added a comment to T336744: Harmonize tags across Airflow dags.

Here's a list of all Airflow analytics tags as of 2023-06-22.
https://docs.google.com/spreadsheets/d/1XtvtLeZUWIWmEYGF9JYukeszZZ1oO0je8FJVbOKyxpA

Jun 22 2023, 8:42 PM · Data Engineering and Event Platform Team (Sprint 2), Data Products (Sprint 00), Data Pipelines (Sprint 14)
mforns reassigned T336744: Harmonize tags across Airflow dags from mforns to JEbe-WMF.
Jun 22 2023, 4:14 PM · Data Engineering and Event Platform Team (Sprint 2), Data Products (Sprint 00), Data Pipelines (Sprint 14)
mforns moved T336744: Harmonize tags across Airflow dags from Next Up to In Progress on the Data Pipelines (Sprint 14) board.
Jun 22 2023, 3:37 PM · Data Engineering and Event Platform Team (Sprint 2), Data Products (Sprint 00), Data Pipelines (Sprint 14)
mforns claimed T336744: Harmonize tags across Airflow dags.
Jun 22 2023, 3:37 PM · Data Engineering and Event Platform Team (Sprint 2), Data Products (Sprint 00), Data Pipelines (Sprint 14)
mforns added a comment to T333011: Define Migration/Deprecation Plan for Hue.

Since Oozie doesn't have any more jobs running (except for T333218, which will be removed asap), soon there won't be any more Oozie users.

Jun 22 2023, 3:27 PM · Data Engineering and Event Platform Team, Data Pipelines (Sprint 14)
mforns moved T333011: Define Migration/Deprecation Plan for Hue from Next Up to In Progress on the Data Pipelines (Sprint 14) board.
Jun 22 2023, 2:30 PM · Data Engineering and Event Platform Team, Data Pipelines (Sprint 14)
mforns claimed T333011: Define Migration/Deprecation Plan for Hue.
Jun 22 2023, 2:30 PM · Data Engineering and Event Platform Team, Data Pipelines (Sprint 14)
mforns moved T335306: [SPIKE] Evaluation on iceberg sensor for airflow from In Progress to In Review on the Data Pipelines (Sprint 14) board.
Jun 22 2023, 1:13 PM · Spike, Data Pipelines (Sprint 14)

Jun 21 2023

mforns moved T329310: Deprecate old mobile datasets from In Progress to Done on the Data Pipelines (Sprint 14) board.
Jun 21 2023, 3:49 PM · Data Pipelines (Sprint 14), Data-Engineering-Planning
mforns added a comment to T329310: Deprecate old mobile datasets.

Since after 30 days nobody has complained about the data missing:
Removed the refinery-source code and the airflow-dags code.
Also deleted the data from HDFS. It went to the Trash folder, where it will be irreversibly deleted in 30 days.
I think this can be moved to Done.

Jun 21 2023, 3:49 PM · Data Pipelines (Sprint 14), Data-Engineering-Planning
mforns updated the task description for T329310: Deprecate old mobile datasets.
Jun 21 2023, 3:49 PM · Data Pipelines (Sprint 14), Data-Engineering-Planning
mforns updated the task description for T329310: Deprecate old mobile datasets.
Jun 21 2023, 3:47 PM · Data Pipelines (Sprint 14), Data-Engineering-Planning
mforns updated the task description for T329310: Deprecate old mobile datasets.
Jun 21 2023, 3:42 PM · Data Pipelines (Sprint 14), Data-Engineering-Planning
mforns updated the task description for T329310: Deprecate old mobile datasets.
Jun 21 2023, 3:31 PM · Data Pipelines (Sprint 14), Data-Engineering-Planning
mforns updated the task description for T329310: Deprecate old mobile datasets.
Jun 21 2023, 2:58 PM · Data Pipelines (Sprint 14), Data-Engineering-Planning
mforns moved T329310: Deprecate old mobile datasets from Next Up to In Progress on the Data Pipelines (Sprint 14) board.
Jun 21 2023, 2:00 PM · Data Pipelines (Sprint 14), Data-Engineering-Planning

Jun 19 2023

mforns added a comment to T337052: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager.

There is a _FAILURE flag that gets written, and by default previous failures are excluded from the next refinement. That's why we have to manually rerun those with --ignore-failure-flag=true. We only get one alert per Refine attempt of an hour (unless someone reruns and it fails again).

Gotcha! sorry for misleading...
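
The behavior described in the quote can be sketched roughly as follows. This is not the actual Refine code; the function name and directory layout are hypothetical, and only the _FAILURE flag name and the meaning of --ignore-failure-flag come from the quoted explanation.

```python
from pathlib import Path

# Flag file name as mentioned in the quoted comment.
FAILURE_FLAG = "_FAILURE"


def partitions_to_refine(partition_dirs, ignore_failure_flag=False):
    """Select the partitions that the next refinement run should process.

    Partitions that already carry a failure flag are skipped by default,
    which is why a failed hour alerts once and then needs a manual rerun
    with the ignore option set.
    """
    selected = []
    for directory in partition_dirs:
        failed_before = (Path(directory) / FAILURE_FLAG).exists()
        if failed_before and not ignore_failure_flag:
            continue
        selected.append(directory)
    return selected
```

Calling partitions_to_refine(dirs) mimics the scheduled run, and partitions_to_refine(dirs, ignore_failure_flag=True) mimics the manual rerun.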

Jun 19 2023, 4:59 PM · Data-Platform-SRE, Data-Engineering, Observability-Alerting
mforns added a comment to T338065: Implement mechanism for automatic Iceberg data deletion and optimization.

Oh, this looks great! It will simplify the deletion of data so much... 👏👏👏

Jun 19 2023, 4:57 PM · Data Engineering and Event Platform Team, Data Pipelines
mforns added a comment to T337052: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager.

In T337052#8865357, @mforns wrote:
Another reason is that we hopefully(2) won't need to execute Refine on a window of hours, like we do now.

Thanks @mforns. Could you elaborate on this a little for me, please?

Jun 19 2023, 3:31 PM · Data-Platform-SRE, Data-Engineering, Observability-Alerting
mforns added a comment to T336286: Upgrade Airflow to version 2.6.3.

Just stumbled on this task, I created the missing patch.
It should work regardless of the version of Airflow that is running it.
https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/434

Jun 19 2023, 3:10 PM · Data Pipelines, Patch-For-Review, Data-Platform-SRE

Jun 13 2023

mforns added a comment to T335306: [SPIKE] Evaluation on iceberg sensor for airflow.

Iceberg table that keeps track of Iceberg ingestions

I also like this idea!
Good thing about having it in an Iceberg table is that we can also apply deletes. This would help in case we have some mechanism that triggers cascading re-runs.
Maybe we could implement our own ExternalTaskSensor/ExternalTaskMarker in Airflow that would use this table.
This way we could get cascading re-runs without being limited to 1 Airflow instance, plus the state would not be in the Airflow database. Win-win.
Maybe we could even use this table to aid us implementing Refine?
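
A very rough sketch of the state-table idea, using an in-memory SQLite database purely as a stand-in for the Iceberg table (the table name, columns and function names are all hypothetical, and no Airflow or Iceberg API is used here):

```python
import sqlite3


def create_state_table(conn):
    """Create the table that records which partitions have been ingested."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ingestion_state ("
        "table_name TEXT, partition_dt TEXT, "
        "PRIMARY KEY (table_name, partition_dt))"
    )


def mark_ingested(conn, table_name, partition_dt):
    """Record a finished ingestion for one partition."""
    conn.execute(
        "INSERT OR REPLACE INTO ingestion_state VALUES (?, ?)",
        (table_name, partition_dt),
    )


def mark_for_rerun(conn, table_name, partition_dt):
    """Delete the state row, so downstream sensors wait for the re-run."""
    conn.execute(
        "DELETE FROM ingestion_state WHERE table_name = ? AND partition_dt = ?",
        (table_name, partition_dt),
    )


def poke(conn, table_name, partition_dts):
    """Sensor check: True only when every requested partition is recorded."""
    wanted = sorted(set(partition_dts))
    if not wanted:
        return True
    placeholders = ",".join("?" for _ in wanted)
    query = (
        "SELECT COUNT(*) FROM ingestion_state "
        f"WHERE table_name = ? AND partition_dt IN ({placeholders})"
    )
    (count,) = conn.execute(query, (table_name, *wanted)).fetchone()
    return count == len(wanted)
```

Because the state lives outside the Airflow database, any Airflow instance that can query the table could run the same check, and deleting a row is enough to make dependent sensors wait again.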

Jun 13 2023, 2:32 PM · Spike, Data Pipelines (Sprint 14)

Jun 9 2023

mforns added a comment to T335306: [SPIKE] Evaluation on iceberg sensor for airflow.

I like the simplicity of this approach. It got me thinking of the following though: we'd be coupling the availability of Presto with the availability (or progress) of Airflow, which I think is something to consider.

Yes, agree.

Jun 9 2023, 4:34 PM · Spike, Data Pipelines (Sprint 14)

Jun 8 2023

mforns added a comment to T335306: [SPIKE] Evaluation on iceberg sensor for airflow.

If we use Presto, we wouldn't need to launch anything, since we could use Presto's SQL server. We could even avoid Skein, since we could call Presto SQL from the sensor itself (Airflow machine). Joseph and I did some tests today, and it took 4 seconds to sense for a full week of daily referrer data. I also like the idea of using Java Table API, but at the same time I think the way that Iceberg is intended to be used is via SQL, and not checking internals (manifests, files, partitions...), no?

Jun 8 2023, 8:14 PM · Spike, Data Pipelines (Sprint 14)

Jun 7 2023

mforns added a comment to T335306: [SPIKE] Evaluation on iceberg sensor for airflow.

Could we use presto instead of spark?

That would be lighter and with less latency, right?
But I thought not all datasets were queryable in the Presto cluster...?

Jun 7 2023, 4:48 PM · Spike, Data Pipelines (Sprint 14)
mforns added a comment to T335306: [SPIKE] Evaluation on iceberg sensor for airflow.

We discussed in standup whether it would be OK to spawn JVM/Spark for each poke of the Iceberg sensors.
So here's some modelling:

Jun 7 2023, 3:55 PM · Spike, Data Pipelines (Sprint 14)
mforns added a comment to T335306: [SPIKE] Evaluation on iceberg sensor for airflow.

I think if we use PyIceberg, we can use:

from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual
Jun 7 2023, 3:25 PM · Spike, Data Pipelines (Sprint 14)

Jun 6 2023

mforns added a comment to T335306: [SPIKE] Evaluation on iceberg sensor for airflow.

I received another answer, this time from a committer & PMC on Apache Avro, Airflow, Druid and Iceberg. He thinks the best way would be to use PyIceberg and implement support for HDFS/Kerberos, since spinning up a full JVM/Spark would be "overkill for just checking the number of rows".

Jun 6 2023, 3:02 PM · Spike, Data Pipelines (Sprint 14)

Jun 5 2023

mforns added a comment to T335306: [SPIKE] Evaluation on iceberg sensor for airflow.

Folks from the community gave me some answers:

Jun 5 2023, 8:12 PM · Spike, Data Pipelines (Sprint 14)

Jun 2 2023

mforns created T338036: [Airflow] Simplify application and java_class parameters in SparkSqlOperator.
Jun 2 2023, 2:25 PM · Data Engineering and Event Platform Team, Data Pipelines

Jun 1 2023

mforns set the point value for T337984: [Airflow] cassandra_monthly_load::load_pageview_top_articles_to_cassandra task needs more Spark resources to 1.
Jun 1 2023, 7:55 PM · Data Pipelines (Sprint 14)
mforns moved T337984: [Airflow] cassandra_monthly_load::load_pageview_top_articles_to_cassandra task needs more Spark resources from Next Up to Done on the Data Pipelines (Sprint 14) board.
Jun 1 2023, 7:55 PM · Data Pipelines (Sprint 14)