
Implement bulk ingestion handler in Structured Data service
Closed, Resolved | Public | 8 Estimated Story Points

Description

We need a handler that consumes from the bulk ingestion articles topic and publishes events to the compacted articles topic, so that other services can consume from that topic and build transformations or serving layers for the dataset.
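As a rough illustration of what downstream consumers can expect from a compacted topic, the sketch below replays a list of keyed records and keeps only the latest value per key. It assumes Kafka-style compaction semantics and that records are keyed by an article identifier such as the title; neither is confirmed in this task. The `compact` helper, the record shape, and the tombstone handling (a `None` value deleting the key) are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

# (key, value) pair; a value of None is treated as a tombstone (assumption).
Record = Tuple[str, Optional[dict]]


def compact(log: Iterable[Record]) -> Dict[str, dict]:
    """Return the latest value per key, dropping keys whose last record is a tombstone."""
    latest: Dict[str, dict] = {}
    for key, value in log:
        if value is None:
            # Tombstone: forget the key entirely.
            latest.pop(key, None)
        else:
            # A newer record for the same key replaces the older one.
            latest[key] = value
    return latest


# Hypothetical events keyed by article title.
events = [
    ("Earth", {"version": 1}),
    ("Mars", {"version": 1}),
    ("Earth", {"version": 2}),
    ("Mars", None),
]
snapshot = compact(events)
```

Under these assumptions, a consumer that reads the topic from the beginning ends up with one current record per article rather than the full update history.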

Acceptance criteria
The articles bulk handler is fully implemented in the Structured Data service.

To-Do

  • create the initial setup for the articlesbulk handler in the structured-data service
  • integrate the new API client
  • should produce almost the same outcome as the articleupdate handler, except for fields from the versions object (the exact fields are TBD; we want to minimize the number of API requests and the data throughput so that processing is fast)

Notes

  1. article update - Actions API (GetPage), Actions API (GetPage - date created), REST API (GetPageHTML), ORES, Actions API (GetUser), TextProcessor gRPC
  2. topics: articles, versions -> 48h history
  3. article bulk - Actions API (GetPages - bulk, max 50 titles), Actions API (GetPages - date created; can we do bulk?), REST API (GetPageHTML - needs concurrency, maybe a semaphore? Create GetPagesHTML with concurrency)
  4. topics: articles (compacted)
  5. how do we process the GetPages and GetPagesHTML results to build the Article object?
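One possible shape for notes 3 and 5, offered as a sketch rather than the service's actual design: split the titles into batches of at most 50, make one bulk metadata call per batch, fetch the HTML for each title behind a semaphore, and join the two results by title. The `Article` fields, the `get_pages` and `get_page_html` callables, and the default concurrency of 10 are all assumptions made for illustration; only the 50-title limit comes from the notes.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, Iterator, List

MAX_TITLES_PER_REQUEST = 50  # bulk limit quoted in the notes


@dataclass
class Article:
    # Hypothetical shape; the real Article object is not described in this task.
    title: str
    metadata: dict
    html: str


def chunk_titles(titles: List[str], size: int = MAX_TITLES_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` titles."""
    for start in range(0, len(titles), size):
        yield titles[start:start + size]


async def get_pages_html(
    titles: List[str],
    get_page_html: Callable[[str], Awaitable[str]],
    max_concurrency: int = 10,
) -> Dict[str, str]:
    """Fetch HTML per title with at most `max_concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch(title: str):
        async with semaphore:  # blocks while max_concurrency fetches are running
            return title, await get_page_html(title)

    return dict(await asyncio.gather(*(fetch(t) for t in titles)))


async def build_articles(
    titles: List[str],
    get_pages: Callable[[List[str]], Awaitable[Dict[str, dict]]],
    get_page_html: Callable[[str], Awaitable[str]],
    max_concurrency: int = 10,
) -> List[Article]:
    """Join bulk metadata and per-title HTML into Article objects."""
    articles: List[Article] = []
    for batch in chunk_titles(titles):
        pages = await get_pages(batch)  # one bulk request: {title: metadata}
        html = await get_pages_html(list(pages), get_page_html, max_concurrency)
        articles.extend(Article(t, pages[t], html[t]) for t in pages)
    return articles
```

In this sketch the per-title HTML fetch remains the slow step, so the semaphore size is the main knob for trading processing speed against load on the REST API. Whether the "date created" lookup can also be batched is left open, as in the notes.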

Event Timeline

Daria_Kevana changed the task status from Open to In Progress. Dec 5 2022, 4:31 PM
Daria_Kevana changed the task status from In Progress to Open. Jan 26 2023, 12:58 PM
Daria_Kevana changed the status of subtask T325294: Add abstract field to the article from In Progress to Open.
Daria_Kevana changed the status of subtask T325269: Text Processor service GetDictWords endpoint is slow from In Progress to Open.
Daria_Kevana changed the status of subtask T325274: Text Processor service GetDictWords endpoint can't sustain load from In Progress to Open.