gmodena (GModena (WMF))
User

User Details

User Since
Nov 2 2020, 1:15 PM (186 w, 6 d)
Availability
Available
IRC Nick
gmodena
LDAP User
Gmodena
MediaWiki User
GModena (WMF) [ Global Accounts ]

Recent Activity

Fri, May 24

gmodena added a comment to T346611: [JVM Stewardship] To be discussed: SDK Man.

Will adopting SDKMan be a prescriptive change or just a default option? Would this change affect only a user's development environment, or also impact CI?

Fri, May 24, 1:32 PM · Java-Scala-Standardization

Wed, May 15

gmodena moved T365005: Evaluate ESC and explore an alternative design. from Incoming (new tickets) to Q4 2024 April 1st - June 30th on the Data-Engineering board.
Wed, May 15, 2:32 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Event-Platform
gmodena moved T361853: [Datasets Config][Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator from In progress to In Review on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Wed, May 15, 2:32 PM · Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena moved T365005: Evaluate ESC and explore an alternative design. from Next Up to In progress on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Wed, May 15, 2:32 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Event-Platform
gmodena claimed T365005: Evaluate ESC and explore an alternative design..
Wed, May 15, 2:29 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Event-Platform
gmodena set the point value for T365005: Evaluate ESC and explore an alternative design. to 5.
Wed, May 15, 2:28 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Event-Platform
gmodena renamed T361094: Orchestrate gobblin ingestion task with Airflow and config store. from [NEEDS GROOMING] Orchestrate gobblin ingestion task with Airflow to Orchestrate gobblin ingestion task with Airflow and config store..
Wed, May 15, 1:24 PM · Event-Platform, Data-Engineering
gmodena created T365005: Evaluate ESC and explore an alternative design..
Wed, May 15, 1:05 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Event-Platform

Tue, May 14

gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

adopt topic names that follow EP conventions: <dc>.<topic_name>

I'm sorry for not thinking about this earlier. There is a bit of a design flaw in the use of data center as a topic prefix, and really, for topics that are never mirrored to other Kafka clusters, there is no need for topic prefixes at all.

I just added documentation about this here:
https://wikitech.wikimedia.org/wiki/Kafka#Data_center_topic_prefixing_design_flaw

Given that, and the ever expanding list of data centers, and the fact that webrequest is the only stream we have that is produced to from non main data centers, I think we should not use topic prefixing for webrequest.

All producers should use the same topic name, independent of which data center they are in.

Thanks for clarifying @Ottomata.

@Ottomata @Fabfur If we remove prefixing, there is a potential clash between varnishkafka and benthos topics.
How about we name the production Haproxy/benthos topics as follows?

  • webrequest_frontend_text
  • webrequest_frontend_text.error
  • webrequest_frontend_upload
  • webrequest_frontend_upload.error

No problem for us to rename these topics, even with or without "variable" part...

Tue, May 14, 12:58 PM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

adopt topic names that follow EP conventions: <dc>.<topic_name>

I'm sorry for not thinking about this earlier. There is a bit of a design flaw in the use of data center as a topic prefix, and really, for topics that are never mirrored to other Kafka clusters, there is no need for topic prefixes at all.

I just added documentation about this here:
https://wikitech.wikimedia.org/wiki/Kafka#Data_center_topic_prefixing_design_flaw

Given that, and the ever expanding list of data centers, and the fact that webrequest is the only stream we have that is produced to from non main data centers, I think we should not use topic prefixing for webrequest.

All producers should use the same topic name, independent of which data center they are in.

Tue, May 14, 10:06 AM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena updated subscribers of T361853: [Datasets Config][Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator.

FWIW, with this outcome, then dynamic ESC as implemented is fine with me :)

Tue, May 14, 9:38 AM · Data-Engineering (Q4 2024 April 1st - June 30th)

Mon, May 13

gmodena added a comment to T361853: [Datasets Config][Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator.

This comment is the result of some time spent collecting info from various stakeholders, and reviewing documentation and decision records.
It was initially shared as a Google doc (now moved to read only to preserve comment history).

Mon, May 13, 9:46 AM · Data-Engineering (Q4 2024 April 1st - June 30th)

Apr 30 2024

gmodena added a comment to T361017: [SPIKE] Can we express Event Platform configs in Datasets Config?.

IMHO it should be explicitly stated that the system we are building is the Airflow Dataset Config store/service, not just a generic configuration repository.

@gmodena @JAllemandou, if this is the case, do we need an external service and datastore? The config is all in git.

Apr 30 2024, 6:33 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Spike, Event-Platform
gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

@Fabfur f/up from our chat earlier; these would be the pending config bits that we'll need to finalize when moving to prod topics:

Apr 30 2024, 2:28 PM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic

Apr 29 2024

gmodena moved T361017: [SPIKE] Can we express Event Platform configs in Datasets Config? from Next Up to In Review on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Apr 29 2024, 6:36 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Spike, Event-Platform
gmodena claimed T361017: [SPIKE] Can we express Event Platform configs in Datasets Config?.
Apr 29 2024, 6:36 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Spike, Event-Platform
gmodena updated subscribers of T361017: [SPIKE] Can we express Event Platform configs in Datasets Config?.

We can easily express a stream config as jsonschema, and expose via datasets-config-service.
I am opposed to a monorepo for all configurations and suggest focusing current efforts on Airflow and Airflow-produced datasets. For integration with Metrics, Platform, and Mediawiki, I lean towards a service mesh approach. The service developed by @tchin could serve as a template.

Apr 29 2024, 6:34 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Spike, Event-Platform
gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

I like the overall idea, but I'd prefer to proceed DC-by-DC, switching topics and shutting down VarnishKafka once we are sure about the correctness of the data. I'm afraid having two pieces of software producing (and sending, and storing) the "same" data on 96 hosts (and soon also MAGRU) could be a little bit expensive for us in terms of bandwidth...

Makes sense. This would require some work on our end to generate webrequest data from two "raw" sources at once, but I think as long as we can filter on dc / hostnames, we should manage. Let me take a better look at how this ETL is set up.
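
A minimal sketch of the dc / hostname filtering idea, not the actual ETL. It assumes hostnames follow the `cp3067.esams.wmnet` shape and that the data center is the second dot-separated part. `MIGRATED_DCS`, `dc_of`, `keep_record`, and the `cp1100.eqiad.wmnet` hostname are hypothetical names made up for illustration.

```python
# Hypothetical sketch: pick exactly one "raw" source per data center while
# the migration proceeds DC-by-DC. Names and hostname layout are assumptions.

MIGRATED_DCS = {"esams"}  # assumed: DCs already switched to HAProxy/Benthos


def dc_of(hostname: str) -> str:
    """Return the data center label, assumed to be the second dot-separated part."""
    parts = hostname.split(".")
    if len(parts) < 2:
        raise ValueError(f"unexpected hostname: {hostname!r}")
    return parts[1]


def keep_record(record: dict, source: str) -> bool:
    """Keep a record only if it comes from the source that owns its DC.

    `source` is either "benthos" or "varnishkafka".
    """
    migrated = dc_of(record["hostname"]) in MIGRATED_DCS
    return migrated if source == "benthos" else not migrated


records = [
    {"hostname": "cp3067.esams.wmnet", "sequence": 1},
    {"hostname": "cp1100.eqiad.wmnet", "sequence": 2},  # made-up host
]
from_benthos = [r for r in records if keep_record(r, "benthos")]
from_varnishkafka = [r for r in records if keep_record(r, "varnishkafka")]
```

With a rule like this, every record is taken from exactly one source, so the two raw inputs can be combined without double counting.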

Apr 29 2024, 4:10 PM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.
  • there's a couple of CRs pending (linked to this phab) and I'd like to have a second run on the event schema naming conventions (cc / @Fabfur). We might want to drop the webrequest_source since we don't currently use it in ETL (it's inferred from the HDFS path, not schema).

No problem here, for us it's just a matter of removing a line from the Benthos configuration. Let me know if I can proceed!

Apr 29 2024, 9:40 AM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic

Apr 25 2024

gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

The haproxy_id field has been added to messages.

Apr 25 2024, 2:06 PM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena moved T353940: We should provide DQ integration with Python from In Review to Done on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Apr 25 2024, 9:07 AM · Data-Engineering (Q4 2024 April 1st - June 30th)

Apr 19 2024

gmodena closed T351117: Move analytics log from Varnish to HAProxy as Resolved.

I'm afraid mixing varnishkafka and benthos payloads would break ingestion pipelines, since old/new events have a different schema. We could reuse the current topics, but we'd have to drain them first.

We can do both, for us it's just a matter of changing a string on puppet. I think the decision is more on your side, choose the easiest/best option for you and we'll implement!

Apr 19 2024, 7:00 AM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena changed the point value for T362780: [DQ] Add support for distribution metrics in data quality exporters from 2 to 3.
Apr 19 2024, 6:34 AM · Data-Engineering
gmodena set the point value for T362782: [DQ][NEEDS GROOMING] Add support for deequ's RowLevelSchemaValidator in refinery to 3.
Apr 19 2024, 6:33 AM · Data-Engineering
gmodena set the point value for T362780: [DQ] Add support for distribution metrics in data quality exporters to 2.
Apr 19 2024, 6:33 AM · Data-Engineering

Apr 18 2024

gmodena set the point value for T362783: Add instrumentation for actor signatures to 1.
Apr 18 2024, 1:46 PM · Patch-For-Review, Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena set the point value for T362785: Add host level instrumentation on webrequest to 1.
Apr 18 2024, 1:46 PM · Patch-For-Review, Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena moved T361853: [Datasets Config][Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator from Next Up to In progress on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Apr 18 2024, 1:35 PM · Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

About the sequence issue, that's the most plausible hypothesis. We could append (or prepend) other pieces of information to the sequence number (like the haproxy process id) to avoid duplicates, but we couldn't guarantee the monotonic increase (or the increase, even) in this case. I suggest using the current approach for the moment and eventually reworking it later.
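
A small sketch of the trade-off described above, under the assumption (not confirmed here) that duplicates come from several HAProxy processes each keeping their own counter. The process ids, sequence values, and the `pid-seq` key format are made up.

```python
# Hypothetical sketch: pairing a sequence number with a process id removes
# duplicates, but the resulting keys no longer form one increasing series.

events = [
    # (haproxy_pid, sequence) in arrival order; values are made up
    (101, 1),
    (102, 1),
    (101, 2),
    (102, 2),
]

bare = [seq for _, seq in events]
composite = [f"{pid}-{seq}" for pid, seq in events]

has_duplicates = len(set(bare)) < len(bare)                # bare numbers collide
composite_unique = len(set(composite)) == len(composite)   # composite keys do not
composite_sorted = composite == sorted(composite)          # but order is lost
```

Here the bare sequence numbers repeat, the composite keys are unique, and arrival order no longer matches key order, which is the loss of monotonicity mentioned in the comment.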

Apr 18 2024, 1:12 PM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

Next steps: now that we are starting to collect more logs, we can start comparing current / new webrequest records.

Apr 18 2024, 11:04 AM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic

Apr 17 2024

gmodena moved T362783: Add instrumentation for actor signatures from Next Up to In Review on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Apr 17 2024, 5:57 PM · Patch-For-Review, Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena moved T362785: Add host level instrumentation on webrequest from Next Up to In Review on the Data-Engineering (Q4 2024 April 1st - June 30th) board.
Apr 17 2024, 5:57 PM · Patch-For-Review, Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena created T362785: Add host level instrumentation on webrequest.
Apr 17 2024, 3:18 PM · Patch-For-Review, Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena created T362783: Add instrumentation for actor signatures.
Apr 17 2024, 3:15 PM · Patch-For-Review, Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena created T362782: [DQ][NEEDS GROOMING] Add support for deequ's RowLevelSchemaValidator in refinery.
Apr 17 2024, 3:08 PM · Data-Engineering
gmodena created T362780: [DQ] Add support for distribution metrics in data quality exporters.
Apr 17 2024, 3:03 PM · Data-Engineering

Apr 5 2024

gmodena added a parent task for T361017: [SPIKE] Can we express Event Platform configs in Datasets Config?: T361853: [Datasets Config][Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator.
Apr 5 2024, 6:42 AM · Data-Engineering (Q4 2024 April 1st - June 30th), Spike, Event-Platform
gmodena added a subtask for T361853: [Datasets Config][Spike] Understand and document the details and conflicts between Datasets Config, Refine refactor, Dynamic EventStreamConfig, and Metrics Platform Instrumentation Configurator: T361017: [SPIKE] Can we express Event Platform configs in Datasets Config?.
Apr 5 2024, 6:42 AM · Data-Engineering (Q4 2024 April 1st - June 30th)

Apr 4 2024

gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

@gmodena you should have some more data to play with now, while I work on the performance optimization and on Benthos internal metrics...

Apr 4 2024, 1:16 PM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic

Mar 27 2024

gmodena created T361094: Orchestrate gobblin ingestion task with Airflow and config store..
Mar 27 2024, 11:50 AM · Event-Platform, Data-Engineering

Mar 26 2024

gmodena moved T359051: eventstreams: change default num_workers to 0 from Ready to Deploy to Done on the Data-Engineering (Sprint 9) board.
Mar 26 2024, 3:37 PM · Data-Engineering (Sprint 9)
gmodena moved T359051: eventstreams: change default num_workers to 0 from In Review to Ready to Deploy on the Data-Engineering (Sprint 9) board.
Mar 26 2024, 3:37 PM · Data-Engineering (Sprint 9)
gmodena created T361017: [SPIKE] Can we express Event Platform configs in Datasets Config?.
Mar 26 2024, 1:57 PM · Data-Engineering (Q4 2024 April 1st - June 30th), Spike, Event-Platform

Mar 25 2024

gmodena updated the task description for T353940: We should provide DQ integration with Python.
Mar 25 2024, 8:13 PM · Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena added a comment to T353940: We should provide DQ integration with Python.

I need to add a wrapper to the Alert generation SerDe

Mar 25 2024, 8:04 PM · Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena moved T353940: We should provide DQ integration with Python from In progress to In Review on the Data-Engineering (Sprint 9) board.
Mar 25 2024, 8:01 PM · Data-Engineering (Q4 2024 April 1st - June 30th)

Mar 22 2024

gmodena added a comment to T314956: [Event Platform] Declare webrequest as an Event Platform stream.

Tagging T360642: Remove extra fields currently sent to Kafka

Mar 22 2024, 8:14 AM · Patch-For-Review, Data-Engineering, Event-Platform
gmodena added a comment to T360642: Remove extra fields currently sent to Kafka.

These are the fields that are sent from Benthos that aren't present in the current webrequest stream:

Mar 22 2024, 8:12 AM · Event-Platform, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena updated subscribers of T360642: Remove extra fields currently sent to Kafka.
Mar 22 2024, 8:08 AM · Event-Platform, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena added a project to T360642: Remove extra fields currently sent to Kafka: Event-Platform.
Mar 22 2024, 7:58 AM · Event-Platform, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic

Mar 21 2024

gmodena added a comment to T353940: We should provide DQ integration with Python.

let's maybe pair on it?

I'd love to hack on this at the offsite!!

Mar 21 2024, 2:10 PM · Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena added a comment to T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics.

See https://github.com/wikimedia/service-runner/commit/b9c98eab5398413c16df2317562745f6ffe74439

Mar 21 2024, 11:39 AM · Data-Engineering, observability, ChangeProp, Event-Platform, service-runner

Mar 19 2024

gmodena added a project to T360450: Add $schema key to Benthos payload: Event-Platform.
Mar 19 2024, 4:37 PM · Event-Platform, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena updated subscribers of T360450: Add $schema key to Benthos payload.

For context: this is the approach we follow with other producers, e.g. Java.

Mar 19 2024, 4:33 PM · Event-Platform, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic

Mar 8 2024

gmodena added a comment to T353940: We should provide DQ integration with Python.

IIUC, the necessity for py4j is only tied to the fact that we developed helper code like the case of HivePartition and DeequAnalyzersToDataQualityMetrics that we'd like to reuse, correct?

Mar 8 2024, 2:40 PM · Data-Engineering (Q4 2024 April 1st - June 30th)

Mar 7 2024

gmodena created T359561: Add user fabfur to analytics-privatedata-users.
Mar 7 2024, 4:19 PM · Patch-For-Review, Data-Platform-SRE (2024.03.25 - 2024.04.14), SRE, SRE-Access-Requests
gmodena moved T353940: We should provide DQ integration with Python from Next Up to In progress on the Data-Engineering (Sprint 9) board.
Mar 7 2024, 10:39 AM · Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena updated subscribers of T353940: We should provide DQ integration with Python.

We can integrate our DQ framework with Python by piggybacking on pyspark's py4j gateway. Following is a rudimentary example that produces metrics with data_quality_metrics table format:

Mar 7 2024, 10:36 AM · Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena moved T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics from In progress to In Review on the Data-Engineering (Sprint 9) board.
Mar 7 2024, 8:50 AM · Data-Engineering, observability, ChangeProp, Event-Platform, service-runner
gmodena renamed T353940: We should provide DQ integration with Python from [NEEDS GROOMING] we should provide DQ integration with Python to We should provide DQ integration with Python.
Mar 7 2024, 8:36 AM · Data-Engineering (Q4 2024 April 1st - June 30th)

Mar 6 2024

gmodena moved T353940: We should provide DQ integration with Python from SDS3.3 - Data Quality to Sprint 9 on the Data-Engineering board.
Mar 6 2024, 3:28 PM · Data-Engineering (Q4 2024 April 1st - June 30th)

Mar 5 2024

gmodena claimed T353940: We should provide DQ integration with Python.
Mar 5 2024, 1:42 PM · Data-Engineering (Q4 2024 April 1st - June 30th)

Mar 4 2024

gmodena added a comment to T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics.

@Ottomata @Jdforrester-WMF there's a caveat wrt using collectDefaultMetrics. The method call does not allow setting custom labels. If I understand the doc correctly, we can still define them at registry level. This would clash with the current implementation. I'm not super keen on refactoring current behaviour given the codebase status, so I'd lean towards avoiding custom labels if possible. Are they used at all?

Mar 4 2024, 8:05 PM · Data-Engineering, observability, ChangeProp, Event-Platform, service-runner
gmodena added a comment to T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics.

I spent some time learning this code base, touching base to validate direction. If this makes sense, I'll open a PR. My proposal here would be to add a new collect_default option to the prometheus metrics option block

Mar 4 2024, 2:28 PM · Data-Engineering, observability, ChangeProp, Event-Platform, service-runner
gmodena claimed T356866: [Data Quality] Update data_quality schemas to be compatible with Iceberg tables.
Mar 4 2024, 2:03 PM · Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena set the point value for T359051: eventstreams: change default num_workers to 0 to 1.
Mar 4 2024, 1:06 PM · Data-Engineering (Sprint 9)
gmodena moved T359051: eventstreams: change default num_workers to 0 from In progress to In Review on the Data-Engineering (Sprint 9) board.
Mar 4 2024, 1:05 PM · Data-Engineering (Sprint 9)
gmodena created T359051: eventstreams: change default num_workers to 0.
Mar 4 2024, 1:05 PM · Data-Engineering (Sprint 9)
gmodena added a comment to T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics.

Happy to pair / code review in case.

If you could implement on top of my PR that'd be great.

Mar 4 2024, 12:33 PM · Data-Engineering, observability, ChangeProp, Event-Platform, service-runner

Feb 28 2024

gmodena added a comment to T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics.

FYI: this is a list of metrics reported with a local run:

Feb 28 2024, 3:00 PM · Data-Engineering, observability, ChangeProp, Event-Platform, service-runner
gmodena added a comment to T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable".

I was on PTO last week and am trying to piece together what happened and how the UBN was mitigated.

Feb 28 2024, 12:50 PM · MediaWiki-Engineering, Data-Engineering, Unstewarded-production-error, User-brennen, serviceops, WMF-JobQueue, Wikimedia-production-error
gmodena added a comment to T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics.

Update: We've got access back, and v4.0.0 is finally released. Still happy to break whatever we need to, and help people migrate.

Feb 28 2024, 10:52 AM · Data-Engineering, observability, ChangeProp, Event-Platform, service-runner

Feb 27 2024

gmodena added a comment to T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics.

@Jdforrester-WMF FWIW I saw you started deprecation work in https://github.com/wikimedia/service-runner/pull/249/files.

Feb 27 2024, 11:14 AM · Data-Engineering, observability, ChangeProp, Event-Platform, service-runner
gmodena added a comment to T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics.

I am taking a stab at this task, because we need gc and memory info to help track T357005: eventstreams regularly uses more than 95% of its memory limit.

Feb 27 2024, 11:12 AM · Data-Engineering, observability, ChangeProp, Event-Platform, service-runner
gmodena moved T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics from Next Up to In progress on the Data-Engineering (Sprint 9) board.
Feb 27 2024, 10:27 AM · Data-Engineering, observability, ChangeProp, Event-Platform, service-runner

Feb 26 2024

gmodena claimed T350180: Upgrade prom-client in NodeJS service-runner and enable collectDefaultMetrics.
Feb 26 2024, 7:12 PM · Data-Engineering, observability, ChangeProp, Event-Platform, service-runner

Feb 15 2024

gmodena moved T347586: [Maintenance] Delete sanitized events removed from sanitization list from In progress to Done on the Data-Engineering (Sprint 8) board.
Feb 15 2024, 6:14 PM · Data-Engineering (Sprint 8)
gmodena added a comment to T347586: [Maintenance] Delete sanitized events removed from sanitization list.

Data has been deleted from HDFS. It will be quarantined in hdfs://analytics-hadoop/user/hdfs/.Trash/Current/wmf/data/event_sanitized for a period longer than the one-week grace time required by this task.

@JAllemandou could you ack if it's ok to move ahead and delete related tables from event?

Feb 15 2024, 6:14 PM · Data-Engineering (Sprint 8)
gmodena updated subscribers of T356866: [Data Quality] Update data_quality schemas to be compatible with Iceberg tables.

Spoke a bit about this with @xcollazo.

Feb 15 2024, 1:42 PM · Data-Engineering (Q4 2024 April 1st - June 30th)
gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

Open question: do we want webrequest.frontend (or whatever we settle on) to be a versioned stream? https://wikitech.wikimedia.org/wiki/Event_Platform/Stream_Configuration#Stream_versioning

Feb 15 2024, 11:49 AM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

the currently suggested one is webrequest.frontend. @gmodena, the idea there is to group all webrequest topics into the same stream, by setting topics manually in stream config. Gobblin will ingest the topics configured in stream config.

Feb 15 2024, 11:32 AM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena added a comment to T347586: [Maintenance] Delete sanitized events removed from sanitization list.

Data has been deleted from HDFS. It will be quarantined in hdfs://analytics-hadoop/user/hdfs/.Trash/Current/wmf/data/event_sanitized for a period longer than the one-week grace time required by this task.

Feb 15 2024, 11:08 AM · Data-Engineering (Sprint 8)
gmodena updated the task description for T347586: [Maintenance] Delete sanitized events removed from sanitization list.
Feb 15 2024, 11:06 AM · Data-Engineering (Sprint 8)

Feb 14 2024

gmodena moved T347586: [Maintenance] Delete sanitized events removed from sanitization list from Next Up to In progress on the Data-Engineering (Sprint 8) board.
Feb 14 2024, 11:07 AM · Data-Engineering (Sprint 8)
gmodena added a comment to T347586: [Maintenance] Delete sanitized events removed from sanitization list.

May I proceed with deleting the tables from the Hive metastore for the impacted datasets?

Feb 14 2024, 11:07 AM · Data-Engineering (Sprint 8)
gmodena claimed T347586: [Maintenance] Delete sanitized events removed from sanitization list.
Feb 14 2024, 10:12 AM · Data-Engineering (Sprint 8)

Feb 13 2024

gmodena added a comment to T314956: [Event Platform] Declare webrequest as an Event Platform stream.

@Fabfur and I would like to start some integration tests in the short term. I moved the webrequest schema from GA to development in the primary repo. This follows the same process we adopted with page_change, and should allow for faster iteration speed without messing around with schema versions.

Feb 13 2024, 7:58 PM · Patch-For-Review, Data-Engineering, Event-Platform

Feb 12 2024

gmodena added a comment to T357005: eventstreams regularly uses more than 95% of its memory limit.

Looking at the logs, this seems to coincide with the redaction patch to eventstreams, but looking at the code I'm having a hard time finding where a memory leak could've happened... more confusing that it's just 1 or 2 pods hitting the limit

Feb 12 2024, 1:57 PM · Data-Engineering, Event-Platform, EventStreams, serviceops, Prod-Kubernetes, Kubernetes
gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

TBD on final stream name in T314956: [Event Platform] Declare webrequest as an Event Platform stream, but the currently suggested one is webrequest.frontend

Feb 12 2024, 11:22 AM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic

Feb 9 2024

gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

Both approaches are feasible (also at the same time if we do accept to increase the payload a little)...

Feb 9 2024, 12:34 PM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena updated subscribers of T351117: Move analytics log from Varnish to HAProxy.

@Fabfur here is example payload with added meta, as we'd expect to receive according to the WIP webrequest event schema.

{
  "meta": {
    "dt": "2023-11-23T16:04:17Z", # value set by Benthos
    "stream": "webrequest_text", # value set by Benthos
    "domain": "en.wikipedia.org", # can we get this from HAProxy?
    "request_id": "request-uuid", # can we get this from HAProxy?
    "id": "event-uuid" # value set by Benthos?
  },
  "accept": "application/json; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/Summary/1.2.0\"",
  "accept_language": "en",
  "backend": "ATS/9.1.4",
  "cache_status": "hit-front",
  "content_type": "application/json; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/Summary/1.5.0\"",
  "dt": "2023-11-23T16:04:17Z", # value recorded by HAProxy
  "hostname": "cp3067.esams.wmnet",
  "http_method": "GET",
  "http_status": "200",
  "ip": "<REDACTED>",
  "range": "-",
  "referer": "https://en.wikipedia.org/w/index.php?title=Category:Films_based_on_non-fiction_books&pagefrom=Power+Play+%281978+film%29%0APower+Play+%281978+film%29",
  "response_size": 987,
  "sequence": 10558502962,
  "time_firstbyte": 0.000201,
  "tls": "vers=TLSv1.3;keyx=UNKNOWN;auth=ECDSA;ciph=AES-256-GCM-SHA384;prot=h2;sess=new",
  "uri_host": "en.wikipedia.org",
  "uri_path": "/api/rest_v1/page/summary/Secretariat_(film)",
  "uri_query": "",
  "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0",
  "x_analytics": "WMF-Last-Access=23-Nov-2023;WMF-Last-Access-Global=23-Nov-2023;include_pv=0;https=1;client_port=33126",
  "x_cache": "cp3067 miss, cp3067 hit/5"
}
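
To make the shape of the `meta` block concrete, here is a rough Python sketch of wrapping a HAProxy record with it. This is illustration only: in the setup discussed here the enrichment would happen in Benthos, not Python. `add_meta` is a hypothetical helper, reusing `uri_host` as `meta.domain` is an assumption, and the random UUIDs stand in for values whose real source is still an open question above.

```python
import uuid
from datetime import datetime, timezone


def add_meta(record: dict, stream: str) -> dict:
    """Return a copy of `record` with a `meta` block shaped like the example."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    meta = {
        "dt": now,                          # processing time, not HAProxy's dt
        "stream": stream,
        "domain": record.get("uri_host"),   # assumption: reuse uri_host
        "request_id": str(uuid.uuid4()),    # placeholder, real source is TBD
        "id": str(uuid.uuid4()),            # event id
    }
    # The original record is left untouched; meta is placed first.
    return {"meta": meta, **record}


event = add_meta(
    {"uri_host": "en.wikipedia.org", "http_status": "200"},
    stream="webrequest_text",
)
```

The top-level `dt` recorded by HAProxy stays as it is, while `meta.dt` carries the time at which the wrapper ran.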
Feb 9 2024, 8:18 AM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic

Feb 8 2024

gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

@Fabfur nice!

Feb 8 2024, 7:13 PM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena added a comment to T351117: Move analytics log from Varnish to HAProxy.

Some updates about the ongoing work:

Feb 8 2024, 1:53 PM · Data Products, Patch-For-Review, Data-Engineering, Observability-Logging, Traffic
gmodena added a comment to T349763: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics.

tl;dr: our approach to address this spike is currently documented at https://wikitech.wikimedia.org/wiki/Data_Engineering/Data_Quality.

Feb 8 2024, 1:20 PM · Data-Engineering (Sprint 8), Patch-For-Review
gmodena moved T349763: [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics from Blocked/Paused to Ready to Deploy on the Data-Engineering (Sprint 8) board.
Feb 8 2024, 1:18 PM · Data-Engineering (Sprint 8), Patch-For-Review
gmodena moved T356401: [BUG] webrequest analyzer DQ jobs fails to store data from Ready to Deploy to Done on the Data-Engineering (Sprint 8) board.
Feb 8 2024, 1:18 PM · Data-Engineering (Sprint 8)
gmodena moved T356628: [Data quality] Create database and tables for DQ backend from Ready to Deploy to Done on the Data-Engineering (Sprint 8) board.
Feb 8 2024, 8:56 AM · Data-Engineering (Sprint 8)
gmodena moved T356628: [Data quality] Create database and tables for DQ backend from In Review to Ready to Deploy on the Data-Engineering (Sprint 8) board.
Feb 8 2024, 8:56 AM · Data-Engineering (Sprint 8)
gmodena updated the task description for T356628: [Data quality] Create database and tables for DQ backend.
Feb 8 2024, 8:56 AM · Data-Engineering (Sprint 8)
gmodena added a comment to T356401: [BUG] webrequest analyzer DQ jobs fails to store data.

db and tables have been created:

spark-sql (default)> use wmf_data_ops;
Response code
Time taken: 2.698 seconds
spark-sql (default)> show tables;
database	tableName	isTemporary
wmf_data_ops	data_quality_alerts	false
wmf_data_ops	data_quality_metrics	false
Time taken: 0.569 seconds, Fetched 2 row(s)
Feb 8 2024, 8:55 AM · Data-Engineering (Sprint 8)