
Allow parallel consumption on Realtime API
Closed, Resolved · Public · 13 Estimated Story Points

Description

We need to introduce partition-level consumption so that data re-users can consume streams in parallel, which will allow more efficient scaling.

Acceptance criteria
Ability to consume and resume consumption by partition in Realtime API.

To-Do

  • Schema updates
    • date_published - equal to ROWTIME in ksqlDB
    • offset - equal to ROWOFFSET in ksqlDB
    • partition - equal to ROWPARTITION in ksqlDB
{
 "name": "Albert Einstein",
 ...,
 "event": {
  "identifier": "a9f6d391-c216-48d6-986b-d4763f077fbd",
  "type": "update",
  "date_created": "2021-08-31T04:51:39Z",
  "date_published": "2021-08-31T04:52:39Z",
  "offset": 223123,
  "partition": 100
 } 
}
  • the query will support the following arguments:
    • parts - numbers from 0 to 9, representing the parallel connections that can be made per query
    • offsets - a simple map of the latest consumed offset per partition (map[int]int - the key is the partition number taken from the event.partition field and the value is event.offset)
    • since - similar to offsets, but the value will be event.date_published instead of the offset
curl -X 'POST' \
  'https://realtime.enterprise.wikimedia.com/v2/articles' \
  -H 'accept: text/event-stream' \
  -H 'Content-Type: application/json' \
  -d '{
  "parts": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
  "offsets": {
    "55": 200,
    "48": 1232
  },
  "since": {
    "55": 123123123123123,
    "88": 123123213213211
  },
  "fields": [
    "name",
    "identifier"
  ],
  "filters": [
    {
      "field": "in_language.identifier",
      "value": "en"
    }
  ]
}'
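To resume consumption, a client needs to remember the highest offset it has seen on each partition and send that map back in the `offsets` field. Here is a minimal Go sketch of that bookkeeping; the `article` struct and `latestOffsets` helper are illustrative names, not part of the actual Wikimedia Enterprise SDK:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// article mirrors the fields this task adds to the event schema; the struct
// is an illustrative sketch, not the actual Wikimedia Enterprise SDK type.
type article struct {
	Name  string `json:"name"`
	Event struct {
		Partition int   `json:"partition"`
		Offset    int64 `json:"offset"`
	} `json:"event"`
}

// latestOffsets folds consumed messages into the map[int]int shape the
// "offsets" request field expects: the highest offset seen per partition.
func latestOffsets(messages []string) map[int]int64 {
	offsets := map[int]int64{}
	for _, m := range messages {
		var a article
		if err := json.Unmarshal([]byte(m), &a); err != nil {
			panic(err)
		}
		if cur, ok := offsets[a.Event.Partition]; !ok || a.Event.Offset > cur {
			offsets[a.Event.Partition] = a.Event.Offset
		}
	}
	return offsets
}

func main() {
	// Pretend these arrived on the SSE stream (payloads trimmed to the
	// fields we need).
	offsets := latestOffsets([]string{
		`{"name":"A","event":{"partition":55,"offset":150}}`,
		`{"name":"B","event":{"partition":55,"offset":200}}`,
		`{"name":"C","event":{"partition":48,"offset":1232}}`,
	})

	// Marshal into the resume request body shape from the curl example.
	body, _ := json.Marshal(map[string]any{"offsets": offsets})
	fmt.Println(string(body)) // {"offsets":{"48":1232,"55":200}}
}
```

On the next request, this map goes into the `offsets` field of the POST body, and the service resumes each partition from the recorded offset.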

Notes

Here’s an example of the query that’s going to extract the data described above from the stream:

SELECT
  NAME,
  ROWTIME,
  ROWPARTITION,
  ROWOFFSET
FROM rlt_articles_str
WHERE ROWPARTITION IN (0, 1)
  AND CASE
    WHEN ROWPARTITION = 0 THEN ROWOFFSET >= 2135270
    ELSE ROWOFFSET >= 2117757
  END
EMIT CHANGES
LIMIT 10000;

This query starts consuming from the offsets provided, applying a separate starting offset to each partition.

Event Timeline

Protsack.stephan created this task.
prabhat changed the task status from Open to In Progress. Jun 5 2023, 11:45 PM
prabhat moved this task from Estimated/Discussed to In Progress on the Wikimedia Enterprise board.

QA tests: I had "context cancelled" errors from some of my calls. This could be a problem with my ISP's connection to the US AWS servers. It's best to re-run this QA step with someone else to see whether it's a one-off on my side.

QA was done by observing that the messages received come from the correct partitions on the Kafka topics in dev. When an offset is chosen, no messages with offsets less than the chosen one are received. offset and since_per_partition cannot both be used for the same partition. max_parts cannot be exceeded. When using since_per_partition, no messages with a datetime earlier than the one picked are received. The length of offset and since_per_partition cannot exceed the maximum number of partitions. When offsets are picked for some partitions and not others, messages for the partitions without offsets are received normally. The ticket can be marked as done.

prabhat updated the task description.