
Re-Ingestion: Execution
Open, High, Public, 5 Estimated Story Points

Description

As an Engineer, I would like to execute the article-bulk DAG, so that I can trigger the re-ingestion process and update the WME dataset with fresh data and the new v2 schema.
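For reference, a run can be triggered from outside the scheduler through Airflow's stable REST API. This is a minimal sketch only: the host, credentials, and conf payload are assumptions; the DAG id "article-bulk" is taken from the description above.

```python
# Minimal sketch: trigger the article-bulk DAG via Airflow's stable REST API.
# The host and credentials are placeholders; the DAG id comes from this task.
import requests

AIRFLOW_URL = "https://airflow.example.org/api/v1"  # hypothetical host

resp = requests.post(
    f"{AIRFLOW_URL}/dags/article-bulk/dagRuns",
    auth=("user", "password"),  # placeholder credentials
    json={"conf": {}},          # add DAG-level config here if the DAG expects any
    timeout=30,
)
resp.raise_for_status()
print("Triggered run:", resp.json()["dag_run_id"])
```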

Acceptance Criteria

  • New projects are added to the system.
  • The ingestion process is executed and the WME dataset is updated with the newest data.

To-Do
Refer to the detailed checklist at Runbooks/Bulk Ingestion Runbook v2, section "Bulk-ingestion using versioning checklist".

  • Send an email to sre-service-ops
  • Deploy pre-ingestion changes to prod for the following services:
    • Infrastructure/services
    • Structured-data
    • On-demand (requires deploying a new version of the service)
    • Scheduler
  • Run ingestion in production
  • Monitor ingestion (3-4 days) for the following (see the count-comparison sketch after this list):
    • Bulk-ingested articles should be landing on the vn+1 compacted topic.
    • Event-based articles should be landing on both the vn and vn+1 compacted topics.
    • On-demand should be updating articles at both S3 key locations.
    • On-demand, batches, and snapshots should still be picking from the vn compacted topic.
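As a rough aid for the count comparison above, the sketch below sums per-partition offset ranges for two topics using kafka-python. The broker address and topic names are hypothetical, and offset ranges over-count on compacted topics (compaction leaves offset gaps), so treat this as a sanity check rather than an exact tally.

```python
# Sanity-check sketch: compare approximate message counts on the vn and vn+1
# compacted topics. Broker address and topic names below are hypothetical.
from kafka import KafkaConsumer, TopicPartition

def approx_message_count(consumer: KafkaConsumer, topic: str) -> int:
    # Sum (end offset - beginning offset) across all partitions of the topic.
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    beginnings = consumer.beginning_offsets(partitions)
    ends = consumer.end_offsets(partitions)
    return sum(ends[tp] - beginnings[tp] for tp in partitions)

consumer = KafkaConsumer(bootstrap_servers="kafka.example.org:9092")  # hypothetical
for topic in ("articles_vn_compacted", "articles_vn1_compacted"):     # hypothetical names
    print(topic, approx_message_count(consumer, topic))
consumer.close()
```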

Acceptance Criteria

  • The vn+1 compacted topics should contain a similar number of messages to the vn compacted topics.
  • In S3, articles should be present at both key locations: articles/project_identifier/article_name.json and articles_v2/project_identifier/article_name.json (see the sketch below).
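A small boto3 sketch for the S3 check; the bucket name and the sample project/article are hypothetical, while the key patterns come from the criteria above.

```python
# Sketch of the S3 acceptance check: confirm an article object exists at both
# key locations. The bucket and sample project/article are hypothetical.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "wme-articles"  # hypothetical bucket name

def key_exists(key: str) -> bool:
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise

project, article = "enwiki", "Example.json"  # hypothetical sample
for prefix in ("articles", "articles_v2"):
    key = f"{prefix}/{project}/{article}"
    print(key, "present" if key_exists(key) else "MISSING")
```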

Additional Context
Currently, we have an Airflow DAG built specifically to populate our system with "baseline" data. This data acts as the initial dataset, which we then routinely refresh and maintain.
Our aim is to execute this DAG so that our Kafka cluster contains a fresh, up-to-date version of this baseline dataset. The sketch below illustrates the general shape of such a DAG.
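This skeleton is illustrative only (assumes Airflow 2.4+): the task id and callable are placeholders, not the real article-bulk implementation.

```python
# Illustrative skeleton: a baseline-population DAG of the shape described above.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_all_articles():
    """Placeholder for the bulk-ingestion logic that repopulates Kafka."""

with DAG(
    dag_id="article-bulk",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # no schedule: triggered manually for re-ingestion runs
    catchup=False,
) as dag:
    PythonOperator(task_id="bulk_ingest", python_callable=ingest_all_articles)
```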

Event Timeline

REsquito-WMF set the point value for this task to 5. (Jan 31 2024, 2:44 PM)