[Epic] Re-architect the Search Update Pipeline
Closed, Resolved, Public

Authored By

	Gehel
	Sep 5 2022, 2:54 PM

Description

The Search Update Pipeline was architected ~8 years ago. It has served its purpose well, but it is now time to review its architecture and address a few of its long-lived limitations. Design document here.

High-level plan:

Test the updater job on the dse-k8s cluster
- create a namespace for the cirrus-streaming-updater on the dse-k8s cluster: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/951960
- T328675 create a helmfile service using the FlinkDeployment resource via the flink-app helm chart
- T341792 Provision Zookeeper Cluster for storing Flink HA data
- T344614 Add Zookeeper config to 'cirrus-streaming-updater' test service on DSE cluster
- (in progress) test various maintenance operations for the Flink Operator: taking a savepoint, job upgrade, H/A recoveries (kill pods manually), k8s upgrade (wipe out the namespace, T293063), ... (see also T328561)
Enable the k8s-operator on the staging wikikube cluster for the cirrus-streaming-updater namespace (might need a dedicated task)
- test various maintenance operations on the staging wikikube cluster: taking a savepoint, job upgrade, H/A recoveries (kill pods manually), k8s upgrade (wipe out the namespace, T293063), ... (see also T328561)
Enable the k8s-operator on the production wikikube cluster for the cirrus-streaming-updater namespace (might need a dedicated task)

Related Objects

Status	Subtype	Assigned	Task
Resolved		Gehel	T317045 [Epic] Re-architect the Search Update Pipeline
Resolved		EBernhardson	T314063 Create a skeleton project for the Flink part of the Search Update Pipeline
Resolved		Gehel	T317046 Coordinate with Platform Engineering / Data Value Stream Team about a rework of the Search Update Pipeline
Resolved		EBernhardson	T317023 Investigate moving incoming_links computation to a batch job
Resolved		EBernhardson	T265056 Make Cirrus Search dump script more resilient to failures (elasticsearch restarts)
Resolved		• dcausse	T316016 Limit the size of the documents indexed by CirrusSearch
Resolved		pfischer	T317202 Model the update document used by the CirrusSearch Update Pipeline
Resolved		EBernhardson	T322327 CirrusSearch should generate a document consistent to a given schema
Resolved		Gehel	T317283 Coordinate with ServiceOps Team about a rework of the Search Update Pipeline
Resolved		• dcausse	T317309 Create an API that renders the CirrusSearch indexable document based on a page_id and a revision id
Resolved		• dcausse	T317609 Create a schema for fetch failures
Resolved		Gehel	T317611 Implement the enrichment function
Declined		None	T318388 Configure spotless to better match WMF java coding style
Resolved		pfischer	T318396 Use assertj for unit tests
Declined		None	T318649 The enrichment function should call a API endpoint based on the event domain
Resolved		pfischer	T322186 Consume revision based changes from the mediawiki.page-state stream
Resolved		pfischer	T323506 Implement the ingestion job
Resolved		• dcausse	T323508 The preparation job should discover what index to write to
Resolved		Gehel	T323509 Deploy the cirrus-update pipeline to YARN for testing
Resolved		EBernhardson	T323687 Enable the wmf_capped doc size limiter in the mediawiki-config for CirrusSearch
Resolved		pfischer	T323688 The kafka consumer group should be configurable
Resolved		pfischer	T323690 Add support for page deletion
Resolved		pfischer	T325315 Add support for redirects in CirrusSearch
Resolved		bking	T344366 Rollout Elasticsearch extra plugins package and restart cluster to apply
Resolved		• dcausse	T325273 EventRowTypeInfo should support schema evolution of rows serialized in flink application state
Resolved		pfischer	T325565 Add support for page re-renders
Resolved		EBernhardson	T325672 Re-order and optimize change events
Resolved		pfischer	T326318 Create docker images for the cirrus-streaming-updater flink jobs
Resolved		EBernhardson	T326328 Create helm charts for the cirrus-streaming-updater flink jobs
Resolved		RKemper	T328330 Create SLI / SLO on Search update lag
Resolved		• dcausse	T320408 Monitor CirrusSearch update lag
Declined		None	T363795 SUP: Add metric to determine deduplication window length
Resolved	BUG REPORT	EBernhardson	T331127 phantom redirects lingering in incategory searches after page moves
Open		None	T374662 PHP web requests running for multiple hours
Resolved		pfischer	T332763 The search update pipeline should support events compatible with the /mediawiki/page/change/1.0.0 schema
Resolved		Gehel	T340548 [EPIC] Deployment of the Search Update Pipeline on Flink / k8s
Open		• lbowmaker	T328561 [Event Platform] Flink Operations
Resolved		gmodena	T328563 [Flink Operations] How to handle restarting a Flink application
Open		None	T328565 [Flink Operations] Automate Replay of Failed Events
Resolved		gmodena	T328569 [Flink Operation] How to handle app upgrades
Resolved		bking	T344614 Add Zookeeper config to 'rdf-streaming-updater' test service on DSE cluster
Resolved		bking	T345957 Restore dse-k8s' rdf-streaming-updater from savepoint/improve bootstrapping process
Resolved		bking	T346048 Troubleshoot rdf-streaming-updater/dse-k8s cluster
Resolved		MatthewVernon	T342620 Storage request: swift s3 bucket for flink search-update-pipeline checkpointing
Resolved		bking	T347075 Deploy test instance of cirrus updater in k8s
Resolved		bking	T348350 Set requests (not limits) for cirrus-streaming-updater in k8s
Resolved		EBernhardson	T350186 Cirrus-streaming-updater test: validate relforge indices are correctly updated
Resolved		EBernhardson	T350299 EventBus change events involving redirect changes are sometimes incorrect
Resolved		• dcausse	T326914 Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink
Resolved		bking	T349772 Create dashboards/alerts for new Cirrus Streaming Updater
Resolved		bking	T359213 Adapt Flink-related rdf-streaming-updater alerts for Cirrus Streaming Updater
Invalid		None	T349848 Determine and control cirrus streaming updater's usage of MWAPI resources
Invalid		None	T350185 Estimate cirrus streaming updater's usage of MWAPI
Resolved		• dcausse	T350826 Test backfilling for cirrus-streaming-updater
Resolved		EBernhardson	T358518 Deploy streaming updater for 100% of writes to cloudelastic
Resolved		Gehel	T341332 [EPIC] The CirrusSearch streaming updater should support private wikis
Resolved		EBernhardson	T345185 Provide a method for internal services to run api requests for private wikis
Resolved		EBernhardson	T354976 Create new NetworkSession mediawiki extension
Resolved		jhsoby	T356438 Add NetworkSession mediawiki extension to translatewiki.net
Resolved		EBernhardson	T355267 Add extension NetworkSession to all wmf wikis
Resolved		Mstyles	T357353 Application Security Review Request : NetworkSession MediaWiki extension
Resolved		EBernhardson	T346046 [Search Update Pipeline] Source streams for private wikis
Resolved		bking	T341705 eqiad: 3 VMs requested for Zookeeper
Resolved		EBernhardson	T344357 [Search Update Pipeline] avoid duplicate updates (multi DC)
Resolved		Gehel	T341625 Requesting permission to use kafka-main cluster to transport CirrusSearch updates
Resolved		EBernhardson	T345634 [Search Update Pipeline] Add a way to filter input events per wiki
Duplicate		None	T345638 [Search Update Pipeline] Add a way to configure a default http route
Resolved		• dcausse	T346015 [Search Update Pipeline] Consider dropping support for java8
Resolved		EBernhardson	T346717 [Search Update Pipeline] Name and identify operators that have a state
Resolved		EBernhardson	T346718 [Search Update Pipeline] Set max parallelism explicitly on operators with a state
Resolved		pfischer	T347184 [Search Update Pipeline] Rename InputEvent to UpdateEvent
Resolved		pfischer	T346895 [Search Update Pipeline] Reference latest streams/schemas
Resolved		pfischer	T346719 [Search Update Pipeline] Upgrade to flink 1.17.1
Resolved		pfischer	T347543 [Search Update Pipeline] Fetch: Handle Timeout of AsyncAwaitOperator
Resolved		pfischer	T348211 [Search Update Pipeline] logging: add correlating information
Resolved		pfischer	T351503 Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis
Resolved		EBernhardson	T353427 ConsumerApplicationIT should fail when the update request payload changed
Resolved		pfischer	T353430 The elasticsearch client does not properly estimate the size of the bulk requests
Resolved		Gehel	T353460 The consumer job of the SUP does not achieve its expected throughput
Duplicate		None	T353473 The cirrussearch.update_pipeline.update stream should be keyed by wiki and page_id
Resolved		pfischer	T354064 SUP: Partition update_pipeline kafka topic
Resolved		pfischer	T354517 Search Update Pipeline: HTTP client/proxy config
Resolved		bking	T354289 Investigate connection timeouts between Search Update Pipeline and MediaWiki APIs
Resolved		bking	T354595 SUP: Production
Resolved		brouberol	T354794 Requesting permission to enable kafka log compaction for page_rerender on kafka-main
Invalid		None	T355236 SUP: Provide config option for cirrussearch to partially disable writing to elasticsearch
Resolved		EBernhardson	T354793 SUP: Adapt saneitizer to allow SUP to operate next to cirrus jobs
Resolved		• dcausse	T355066 SUP: Process (large) JSON responses non-blocking to save memory
Invalid		bking	T356302 setup production Cirrus Streaming Updater alerts
Resolved		EBernhardson	T356439 [Tracking] Evaluate differences in saneitizer fixes eqiad vs cloudelastic
Resolved		EBernhardson	T356655 Create tool and process to investigate Search update Pipeline failures
Resolved		pfischer	T356933 Streaming Updater should still make forward progress when one index has problems
Resolved		Gehel	T358599 Integrate Saneitizer with SUP