
[Data Quality] [SPIKE] Document Current Logging, Monitoring and Data Quality Checks for webrequests
Closed, ResolvedPublic

Description

User Story
As a Data Engineer, I want to document how webrequests are logged, monitored, and checked for data quality, so that we can be transparent about what we have now and understand what we want to have.
Questions to answer (please add more as needed):
  • What are the processing steps involved?
  • Per each processing step:
    • What are the unit tests in place?
    • What is logged during the process step?
    • Where are the logs stored?
    • Are the logs visualized? If so where?
  • On a data set level
    • What other data quality checks do we have?
    • What alerts do we send? Where do they go? What system sends the alert?
    • What are the upstream dependencies?
How do we determine completeness of the dataset?
Deliverable

Document with all the above information:

https://docs.google.com/document/d/1clSe6bnIxJUdd2LaFtQ-_MXGIRFBOzKNVZ2130qfGqI/edit

Event Timeline

Here is the first version of it. Please review in the gdoc: https://docs.google.com/document/d/1clSe6bnIxJUdd2LaFtQ-_MXGIRFBOzKNVZ2130qfGqI/edit

For openness, here is a copy of it:

Webrequest

Webrequest represents requests sent to the caching layer of MediaWiki: Varnish.

Note: Defining what is logged in each process and the log level would require another pass.

VarnishKafka collection

VarnishKafka is a varnish log collector with an integrated Apache Kafka producer.
https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Varnishkafka

Logs: the VarnishKafka code includes some logging mechanisms. Are the logs collected in Logstash, and are they accessible on the machines? What's the log level?

Unit tests: none for the code itself; only integration tests.

Instrumentation:

Checks:

  • Can't find a sanity check between Varnish logs and Kafka messages.
  • Can't find a check that every Varnish instance has a running VarnishKafka beside it.
  • Duplication and completeness checks: included in the webrequest DQ checks.

Alerts:

  • VarnishkafkaNoMessages: we fire a warning if the rate of Kafka messages is less than 20% (10%) of the rate of Varnish requests over a 5-minute period.
  • VarnishKafkaDeliveryErrors: fires when we average more than 1 (5) delivery error(s) per second per caching method, instance, and datacenter. (Threshold logic sketched below.)
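
For illustration, here is a minimal Python sketch of the thresholding logic behind these two alerts, assuming the parenthesized values above are the critical thresholds (that split is my reading; the actual rules live in Prometheus/Alertmanager, not in code like this):

```python
# Hypothetical restatement of the two alert conditions, assuming the
# parenthesized values in the descriptions above are critical thresholds.

def varnishkafka_no_messages(kafka_msg_rate: float, varnish_req_rate: float) -> str | None:
    """Ratio of Kafka messages to Varnish requests over a 5-minute window."""
    if varnish_req_rate == 0:
        return None  # no traffic to compare against
    ratio = kafka_msg_rate / varnish_req_rate
    if ratio < 0.10:
        return "critical"
    if ratio < 0.20:
        return "warning"
    return None

def varnishkafka_delivery_errors(avg_errors_per_second: float) -> str | None:
    """Delivery errors averaged per caching method, instance, and datacenter."""
    if avg_errors_per_second > 5:
        return "critical"
    if avg_errors_per_second > 1:
        return "warning"
    return None
```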

Later downstream, we don't have canary events for webrequest. That means we have no way to detect a producer that is up but not producing any events (e.g. a newly added DC).
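
To make the gap concrete, here is a hypothetical sketch of what such a canary could look like: produce a synthetic marker event and verify it comes back out of Kafka. The topic name is made up; nothing like this exists for webrequest today.

```python
# Hypothetical end-to-end canary; the topic name is a placeholder.
import json, time, uuid
from kafka import KafkaProducer, KafkaConsumer

def canary_round_trip(bootstrap: str, topic: str = "webrequest_canary",
                      timeout_s: float = 30.0) -> bool:
    marker = str(uuid.uuid4())
    consumer = KafkaConsumer(topic, bootstrap_servers=bootstrap,
                             auto_offset_reset="latest",
                             consumer_timeout_ms=int(timeout_s * 1000),
                             value_deserializer=lambda b: json.loads(b))
    consumer.poll(timeout_ms=0)  # force partition assignment before we produce
    producer = KafkaProducer(bootstrap_servers=bootstrap,
                             value_serializer=lambda v: json.dumps(v).encode())
    producer.send(topic, {"canary": marker, "ts": time.time()})
    producer.flush()
    for record in consumer:
        if record.value.get("canary") == marker:
            return True   # the pipeline is demonstrably alive
    return False          # silence: the producer may be up but not producing
```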

Kafka mirroring

Inter-DC replication of Kafka topics.
https://www.openlogic.com/blog/kafka-mirrormaker-overview

The 2023-09-27 Kafka MirrorMaker incident showed us that data coming from other DCs can be delayed or lost.
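
As an illustration (not an existing check), one way to spot replication delay is to compare the end offsets of a topic on the source cluster and its mirror. Bootstrap addresses and the topic name are placeholders, and MirrorMaker does not preserve offsets exactly, so the gap is only a trend signal:

```python
# Hedged sketch: compare high-watermark offsets between source and mirror.
from kafka import KafkaConsumer, TopicPartition

def mirror_offset_gap(source_bootstrap: str, mirror_bootstrap: str,
                      topic: str, partition: int = 0) -> int:
    tp = TopicPartition(topic, partition)
    def end_offset(bootstrap: str) -> int:
        consumer = KafkaConsumer(bootstrap_servers=bootstrap)
        offset = consumer.end_offsets([tp])[tp]
        consumer.close()
        return offset
    # Offsets are not preserved exactly across clusters, so a persistently
    # growing gap suggests delay (or loss), not an exact message count.
    return end_offset(source_bootstrap) - end_offset(mirror_bootstrap)
```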

Tests: by the community.

Logs: probably produced, but are they collected in Logstash?

Instrumentation: https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker

Alerts: some alerts are sent to IRC; they are defined in Puppet, in modules/profile/manifests/kafka/mirror/alerts.pp.

Kafka

Kafka itself is an SRE responsibility. We are reading from a single DC.

Instrumentation: https://grafana.wikimedia.org/d/000000027/kafka

Logs: handled by SREs. Are they collected in Logstash?

Alerts:

  • KafkaBrokerUnavailable
  • KafkaUnderReplicatedPartitions
  • KafkaLagIncreasing

Tests: by the community

Gobblin

Gobblin ingests JSON from Kafka Jumbo into HDFS at hdfs://analytics-hadoop/wmf/data/raw/webrequest (exposed as the wmf_raw.webrequest table).

Unit tests:

  • by the community
  • custom code is tested in analytics/gobblin-wmf
configuration files are not tested

Metrics:

Alerts:

  • GobblinLastSuccessfulRunTooLongAgo
  • GobblinKafkaRecordsExtractedNotEqualRecordsExpected (invariant sketched below)
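
As a reading of what the second alert name implies, here is a hypothetical sketch of the completeness invariant: for every work unit, the records written must equal the size of the Kafka offset range pulled. The WorkUnit shape is made up, not Gobblin's actual API.

```python
# Illustrative completeness invariant behind the alert name; the WorkUnit
# tuple is hypothetical.
from typing import NamedTuple

class WorkUnit(NamedTuple):
    topic: str
    partition: int
    low_offset: int   # first offset pulled (inclusive)
    high_offset: int  # end of the pulled range (exclusive)
    extracted: int    # records actually written to HDFS

def incomplete_work_units(units: list[WorkUnit]) -> list[WorkUnit]:
    """Return the work units whose extracted count misses the expected count."""
    return [u for u in units if u.extracted != u.high_offset - u.low_offset]
```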

Logs:

  • like any YARN application
  • trigger logs are in journalctl on an-launcher
  • No UI

Checks:

Refine webrequest

The Refine webrequest Spark job reads partitions from wmf_raw.webrequest and writes refined partitions to wmf.webrequest. Webrequest itself has many downstream dependencies. Airflow triggers the jobs and monitors them.
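
For orientation, a minimal PySpark sketch of the shape of this step, assuming schema-on-read over the raw JSON; the production job (in analytics/refinery) does much more (user-agent parsing, IP geocoding, request tagging, etc.):

```python
# Minimal sketch of the raw -> refined step; transformations are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("refine_webrequest_sketch").getOrCreate()

# Schema-on-read: rows that do not fit the Hive schema end up with null fields,
# which is the "partial schema validation" mentioned in the checks below.
raw = (spark.read
       .schema(spark.table("wmf_raw.webrequest").schema)
       .json("hdfs://analytics-hadoop/wmf/data/raw/webrequest"))

refined = (raw
           .withColumn("dt", F.to_timestamp("dt"))
           .filter(F.col("dt").isNotNull()))  # drop rows failing basic validation

# Position-based insert into the refined table (illustrative; in production
# many derived columns are added before writing).
refined.write.mode("overwrite").insertInto("wmf.webrequest")
```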

Alerts: by Airflow

Logs:

  • on Airflow
  • on YARN
  • in the dataloss table (see below)

Instrumentation:

  • Airflow
  • YARN
  • Some stats are computed in the wmf.webrequest_sequence_stats(_hourly) tables

Checks:

  • Partial schema validation: happens when reading the JSON files according to the Hive table schema.
  • DataLoss check: a sequence number is attributed to each request by each producer (Varnish? VarnishKafka? Kafka?). It allows us to measure, for each Varnish producer (sketched after this list):
    • whether we lose some webrequests between the producer and HDFS
    • whether we introduce duplicated webrequests between the producer and HDFS
  • We check for some incomplete webrequests (missing attributes)
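
A sketch of that sequence-number check, in the spirit of the wmf.webrequest_sequence_stats tables; column names follow wmf.webrequest ("hostname", "sequence"), and the partition filter is just an example hour:

```python
# Sequence-number data-loss sketch: per source and producer host, compare how
# many sequence numbers we saw against how many the observed range implies.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataloss_check_sketch").getOrCreate()

stats = (spark.table("wmf.webrequest")
         .where("year = 2023 AND month = 10 AND day = 20 AND hour = 15")
         .groupBy("webrequest_source", "hostname")
         .agg(F.count("sequence").alias("count_actual"),
              F.countDistinct("sequence").alias("count_distinct"),
              (F.max("sequence") - F.min("sequence") + 1).alias("count_expected")))

report = (stats
          # rows seen more than once between the producer and HDFS
          .withColumn("duplicates", F.col("count_actual") - F.col("count_distinct"))
          # sequence numbers the range implies but that never arrived
          .withColumn("missing", F.col("count_expected") - F.col("count_distinct")))
```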

Alerts:

  • by Airflow
  • Data loss alert: the stats are aggregated across all Varnish producers, then alerts are sent when a threshold is reached.
  • No check that some columns in the schema are empty (e.g. the missing X-Analytics header problem).
  • No durability check of the resulting dataset (some files could be trashed by mistake).

Unit tests: none

Traffic anomaly detections

Later, downstream, the webrequest dataset is indirectly and partially tested by the traffic anomaly detection job.
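
The actual detection logic lives in refinery/source; purely as a generic illustration of the idea, a simple z-score detector over hourly request counts could look like this:

```python
# Generic illustration only: this is NOT the refinery/source algorithm, just
# the simplest stand-in (z-score against a trailing window) to show the idea.
from statistics import mean, stdev

def anomalous_hours(counts: list[float], window: int = 24, z: float = 4.0) -> list[int]:
    """Return indexes of hours whose count deviates strongly from the window."""
    flagged = []
    for i in range(window, len(counts)):
        hist = counts[i - window:i]
        sigma = stdev(hist)
        if sigma > 0 and abs(counts[i] - mean(hist)) / sigma > z:
            flagged.append(i)
    return flagged
```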

Unit tests: yes, in refinery/source

Logs:

  • Airflow
  • YARN

Metrics:

  • Airflow
  • YARN

Great analysis, added a few follow-up questions.

lbowmaker assigned this task to Antoine_Quhen.
lbowmaker updated the task description.
lbowmaker updated the task description.
Ahoelzl renamed this task from [SPIKE] Document Current Logging, Monitoring and Data Quality Checks for webrequests to [Data Quality] [SPIKE] Document Current Logging, Monitoring and Data Quality Checks for webrequests. Oct 20 2023, 3:56 PM

Joseph and Andreas to review.