
Create bulk ingestion articles endpoint
Closed, Resolved | Public | 13 Estimated Story Points

Description

We need to create a bulk ingestion articles endpoint to ingest pages from all the projects that we support, so that we can then use that data for the On-demand and Snapshots services.

Acceptance criteria
I can trigger the articles endpoint for a project name and ingest all of the articles in that project.

To-Do

  • add articles method to bulk.proto in protos repository
  • add articles handler to bulk-ingestion service
    • input params should be a project and namespace
    • need to investigate whether the titles dump is still the best way to get a list of articles
    • should produce an event with list of titles to go through
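A minimal sketch of the last To-Do item, assuming the handler already has an iterable of titles for the requested project and namespace. The `TitlesEvent` type, the `build_title_events` function, and the default batch size are hypothetical names and values chosen for illustration; they are not taken from bulk.proto or the bulk-ingestion service.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class TitlesEvent:
    """Hypothetical event payload: one batch of titles for a project/namespace."""
    project: str
    namespace: int
    titles: List[str]


def build_title_events(
    project: str,
    namespace: int,
    titles: Iterable[str],
    batch_size: int = 50,  # assumption: sized to match a 50-title bulk lookup
) -> Iterator[TitlesEvent]:
    """Group titles into fixed-size batches and yield one event per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for title in titles:
        batch.append(title)
        if len(batch) == batch_size:
            yield TitlesEvent(project, namespace, batch)
            batch = []
    # Emit the final, possibly smaller, batch.
    if batch:
        yield TitlesEvent(project, namespace, batch)
```

Because the function is a generator, the titles source can be streamed (for example, read line by line from a dump) without holding the full list in memory.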

Event Timeline

Protsack.stephan triaged this task as High priority.
Protsack.stephan updated the task description.
AnnaMikla changed the task status from Open to In Progress. Oct 26 2022, 11:08 AM
AnnaMikla changed the task status from In Progress to Open. Oct 28 2022, 12:27 PM
AnnaMikla changed the task status from Open to In Progress.
  1. article update - Action API (GetPage), Action API (GetPage - date created), REST API (GetPageHTML), ORES, Action API (GetUser), TextProcessor gRPC
  2. topics: articles, versions -> 48h history
  3. article bulk - Action API (GetPages - bulk, max 50 titles), Action API (GetPages - date created, can we do bulk?), REST API (GetPageHTML - needs concurrency, maybe a semaphore? create GetPagesHTML with concurrency)
  4. topics: articles (compacted)
  5. how do we process the GetPages and GetPagesHTML results to build the Article object?
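The concurrency idea in item 3 (a GetPagesHTML that fans out single-page GetPageHTML calls behind a semaphore) could look roughly like the sketch below. It is only an illustration: `get_pages_html`, the injected `get_page_html` coroutine, and the `max_concurrency` default are hypothetical, and error handling and retries are left out.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def get_pages_html(
    titles: Iterable[str],
    get_page_html: Callable[[str], Awaitable[str]],  # hypothetical single-page fetcher
    max_concurrency: int = 10,  # assumption: tune to what the REST API tolerates
) -> Dict[str, str]:
    """Fetch HTML for many titles, with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(title: str):
        # The semaphore blocks here once max_concurrency fetches are running.
        async with semaphore:
            return title, await get_page_html(title)

    pairs = await asyncio.gather(*(fetch_one(title) for title in titles))
    return dict(pairs)
```

Passing the single-page fetcher in as an argument keeps the sketch independent of any particular HTTP client and makes the concurrency limit easy to test with a fake fetcher.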
Daria_Kevana changed the task status from In Progress to Open. Dec 5 2022, 9:24 AM