We need to create a handler that consumes from the bulk ingestion articles topic and publishes events to the compacted articles topic, so that other services can consume from that topic and build transformations or serving layers for the dataset.
Acceptance criteria
The articles bulk handler is fully implemented in the Structured Data service.
To-Do
- create the initial setup for the articlesbulk handler in the structured-data service
- integrate the new API client
- the handler should produce an almost identical outcome to the articleupdate handler, except for fields from the versions object (the exact fields are TBD; we want to minimize the number of API requests and the data throughput so that processing is fast)
Notes
- article update - Actions API (GetPage), Actions API (GetPage - date created), REST API (GetPageHTML), ORES, Actions API (GetUser), TextProcessor gRPC
- topics: articles, versions -> 48h history
- article bulk - Actions API (GetPages - bulk, max 50 titles), Actions API (GetPages - date created; can we do bulk?), REST API (GetPageHTML - needs concurrency, maybe a semaphore? create GetPagesHTML with concurrency)
- topics: articles (compacted)
- how do we process the GetPages and GetPagesHTML results to build the Article object?