
Increase retention of Wikidata RDF Stream (Kafka and/or Hadoop)
Closed, ResolvedPublic1 Estimated Story Points

Description

Given the current issue with the Wikidata dumps (T386401), the recovery mechanism of Wikidata Query Service (WDQS) is at risk. Recovery relies on reloading data from dumps and then catching up from our update stream, but that stream is currently kept for only 30 days.

Increasing retention of the WDQS RDF stream to 60 days would allow more time to fix the Wikidata dumps. This increase should be permanent, to also mitigate future dump issues. The increase could be done on Kafka, or on Hadoop if resources are not available on Kafka. Note that we currently don't have a mechanism to consume the update stream from Hadoop, so some additional implementation would be needed if that is the solution we choose.

Once a strategy is decided on, subtasks will be created to track actual work.

The topics are:

  • [eqiad|codfw].rdf-streaming-updater.mutation
  • [eqiad|codfw].rdf-streaming-updater.mutation-main
  • [eqiad|codfw].rdf-streaming-updater.mutation-scholarly
  • [eqiad|codfw].mediainfo-streaming-updater.mutation

in kafka-main.

If increasing the retention in kafka-main is not an option, we could consider increasing the retention in kafka-jumbo (the automated tooling for wdqs data reloads does not support kafka-jumbo, but in a disaster recovery scenario I believe we could manage to use it relatively quickly).
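If the Kafka route were chosen, per-topic retention can be changed at runtime with Kafka's stock tooling. A minimal sketch, assuming the standard `kafka-configs.sh` CLI; the broker address is a placeholder, and on WMF clusters the change would presumably be made via Puppet rather than an ad-hoc command:

```shell
#!/bin/bash
# 60 days expressed in milliseconds, the unit Kafka's retention.ms setting uses.
RETENTION_MS=$((60 * 24 * 60 * 60 * 1000))
echo "retention.ms=${RETENTION_MS}"

# Hypothetical invocation (broker address is a placeholder, not a real host):
# for dc in eqiad codfw; do
#   for topic in rdf-streaming-updater.mutation \
#                rdf-streaming-updater.mutation-main \
#                rdf-streaming-updater.mutation-scholarly \
#                mediainfo-streaming-updater.mutation; do
#     kafka-configs.sh --bootstrap-server kafka-main.example:9092 \
#       --alter --entity-type topics --entity-name "${dc}.${topic}" \
#       --add-config retention.ms="${RETENTION_MS}"
#   done
# done
```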

Event Timeline

What would be the topics we'd need to change retention for, in kafka?

dcausse subscribed.

@brouberol just updated the task description with those.

Oh, these are kafka-main topics. In that case we probably need to rope in @elukey as well.

This dashboard shows the sizes of the topics under discussion, over the past 60 days: https://grafana.wikimedia.org/goto/wjX_Ui2NR?orgId=1


The combined total is about 400 GB (https://grafana.wikimedia.org/goto/d46Swm2NR?orgId=1), though it has recently been around 560 GB.


Kafka-main eqiad brokers are currently between 35% and 65% allocated in terms of disk space (https://grafana.wikimedia.org/goto/7tjlQihHg?orgId=1). codfw is similar.


In terms of available space, kafka-main1007 has 739 GB free in its /srv volume.

btullis@kafka-main1007:~$ df -h /srv
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/vg0-srv  2.0T  1.2T  739G  62% /srv

The first questions I have are:

  • is there enough capacity on the kafka-main brokers to double this retention period from 30 to 60 days?
  • is this a good use of the kafka-main clusters to provide additional protection against the potential for prolonged failure of the wikidata dumps?

Additionally, as set out in the description:

  • should we increase the retention time on kafka-jumbo instead?
  • should we be looking to migrate the WDQS reload mechanism to process files from Hadoop instead?
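On the first capacity question, a back-of-the-envelope check using the figures above, assuming (roughly) that doubling the retention window doubles the on-disk footprint:

```shell
#!/bin/bash
# Rounded figures from the dashboards and df output above.
CURRENT_GB=560   # recent combined total of the four topic families at 30 days
FREE_GB=739      # free space on kafka-main1007's /srv volume

# Going from 30 to 60 days of retention adds roughly one more 30-day window.
EXTRA_GB=$CURRENT_GB
echo "extra data at 60 days: ~${EXTRA_GB} GB (kafka-main1007 has ${FREE_GB} GB free)"
```

Note this compares a cluster-wide total against a single broker's free space; actual per-broker growth depends on partition placement and replication factor, so it is only a rough sanity check.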

For reference, kafka-jumbo has bigger disks, so 64% usage equates to much more available capacity per broker; e.g. kafka-jumbo1007 has 6.1 TB available.

btullis@kafka-jumbo1007:~$ df -h /srv
Filesystem           Size  Used Avail Use% Mounted on
/dev/mapper/vg1-srv   18T   11T  6.1T  64% /srv

It might be less problematic there to add another ~600 GB combined total for these topics with a doubled retention period.

After discussion, we'll use Hadoop for retention.

Gehel set the point value for this task to 1. Mar 10 2025, 4:48 PM

Where can I go read about this use case? I'd like to see if T388040 could help here long term.

@xcollazo the main use-case here is disaster recovery for wdqs in case the wikidata RDF dumps are down for more than one week.

Change #1131286 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/mediawiki-config@master] wdqs: enable hive/hdfs ingestion for rdf update streams

https://gerrit.wikimedia.org/r/1131286

EBernhardson subscribed.

Patch looks ready to go, will need to ship it in a deploy window.

Change #1131286 merged by jenkins-bot:

[operations/mediawiki-config@master] wdqs: enable hive/hdfs ingestion for rdf update streams

https://gerrit.wikimedia.org/r/1131286

Raw events are available in hdfs:///wmf/data/raw/event/(eqiad|codfw).rdf-streaming-updater.mutation*/ (3 months retention), and in the Hive tables starting with rdf_streaming_updater_mutation (also 3 months).
We could go even further and keep them forever, but I'm not entirely sure that would be justified.
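For anyone wanting to poke at the raw events, a sketch of listing them with the standard HDFS client; the path globs follow the pattern given above, and cluster access with an HDFS client is assumed:

```shell
#!/bin/bash
# Build the raw-event path globs mentioned above, one per datacenter.
PATHS=""
for dc in eqiad codfw; do
  PATHS="${PATHS}hdfs:///wmf/data/raw/event/${dc}.rdf-streaming-updater.mutation*/ "
done
echo "$PATHS"

# Hypothetical listing command (needs an HDFS client and cluster credentials):
# hdfs dfs -ls hdfs:///wmf/data/raw/event/eqiad.rdf-streaming-updater.mutation*/
```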