
Re-Ingestion: Execution
Open, High, Public, 5 Estimated Story Points

Description

As an Engineer, I would like to execute the article-bulk DAG, so that I can trigger the re-ingestion process and update the WME dataset with fresh data and the new v2 schema.
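For reference, a run can be triggered from outside the scheduler through Airflow's stable REST API. This is a minimal sketch only: the host, credentials, and conf payload are assumptions; the DAG id "article-bulk" is taken from the description above.

```python
# Minimal sketch: trigger the article-bulk DAG via Airflow's stable REST API.
# The host and credentials are placeholders; the DAG id comes from this task.
import requests

AIRFLOW_URL = "https://airflow.example.org/api/v1"  # hypothetical host

resp = requests.post(
    f"{AIRFLOW_URL}/dags/article-bulk/dagRuns",
    auth=("user", "password"),  # placeholder credentials
    json={"conf": {}},          # add DAG-level config here if the DAG expects any
    timeout=30,
)
resp.raise_for_status()
print("Triggered run:", resp.json()["dag_run_id"])
```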

Acceptance Criteria

  • New projects are added to the system.
  • The ingestion process is executed and the WME dataset is updated with the newest data.

To-Do
Refer to the detailed checklist at Runbooks/Bulk Ingestion Runbook v2, section "Bulk-ingestion using versioning checklist".

  • Send an email to sre-service-ops
  • Deploy pre-ingestion changes to prod for the following services:
    • Infrastructure/services
    • Structured-data
    • On-demand (requires deploying a new version of the service)
    • Scheduler
  • Run ingestion in production
  • Monitor ingestion (3-4 days) for the following (see the count-comparison sketch after this list):
    • Bulk-ingested articles should be landing on the vn+1 compacted topic.
    • Event-based articles should be landing on both the vn and vn+1 compacted topics.
    • On-demand should be updating articles at both S3 key locations.
    • On-demand, batches, and snapshots should still be picking from the vn compacted topic.
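As a rough aid for the count comparison above, the sketch below sums per-partition offset ranges for two topics using kafka-python. The broker address and topic names are hypothetical, and offset ranges over-count on compacted topics (compaction leaves offset gaps), so treat this as a sanity check rather than an exact tally.

```python
# Sanity-check sketch: compare approximate message counts on the vn and vn+1
# compacted topics. Broker address and topic names below are hypothetical.
from kafka import KafkaConsumer, TopicPartition

def approx_message_count(consumer: KafkaConsumer, topic: str) -> int:
    # Sum (end offset - beginning offset) across all partitions of the topic.
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    beginnings = consumer.beginning_offsets(partitions)
    ends = consumer.end_offsets(partitions)
    return sum(ends[tp] - beginnings[tp] for tp in partitions)

consumer = KafkaConsumer(bootstrap_servers="kafka.example.org:9092")  # hypothetical
for topic in ("articles_vn_compacted", "articles_vn1_compacted"):     # hypothetical names
    print(topic, approx_message_count(consumer, topic))
consumer.close()
```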

Acceptance Criteria

  • The vn+1 compacted topics should contain a similar number of messages to the vn compacted topics.
  • In S3, articles should be present at both key locations: articles/project_identifier/article_name.json and articles_v2/project_identifier/article_name.json (see the sketch below).
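A small boto3 sketch for the S3 check; the bucket name and the sample project/article are hypothetical, while the key patterns come from the criteria above.

```python
# Sketch of the S3 acceptance check: confirm an article object exists at both
# key locations. The bucket and sample project/article are hypothetical.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "wme-articles"  # hypothetical bucket name

def key_exists(key: str) -> bool:
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise

project, article = "enwiki", "Example.json"  # hypothetical sample
for prefix in ("articles", "articles_v2"):
    key = f"{prefix}/{project}/{article}"
    print(key, "present" if key_exists(key) else "MISSING")
```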

Additional Context
Currently, we have an Airflow DAG built specifically to populate our system with "baseline" data. This data acts as the initial dataset, which we then routinely refresh and maintain.
Our aim is to execute this DAG so that our Kafka cluster contains a fresh, up-to-date version of this baseline dataset. The sketch below illustrates the general shape of such a DAG.
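This skeleton is illustrative only (assumes Airflow 2.4+): the task id and callable are placeholders, not the real article-bulk implementation.

```python
# Illustrative skeleton: a baseline-population DAG of the shape described above.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_all_articles():
    """Placeholder for the bulk-ingestion logic that repopulates Kafka."""

with DAG(
    dag_id="article-bulk",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # no schedule: triggered manually for re-ingestion runs
    catchup=False,
) as dag:
    PythonOperator(task_id="bulk_ingest", python_callable=ingest_all_articles)
```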

Event Timeline

REsquito-WMF set the point value for this task to 5. (Jan 31 2024, 2:44 PM)