As a WME Engineer, I want to synchronize the ondemand state with the snapshots state. At the moment they are out of sync.
Root Cause
A schema change has broken Avro deserialization in the ondemand service. Our complex system of submodules, and the lack of out-of-sync detection…. The ondemand service for the base product has been losing events since then. This means the ondemand API for the base product and structured contents is now out of sync with the snapshots.
The issue doesn't impact structured contents, so the focus should be only on articles.
Solution
As discussed with the team, the decision is to synchronize the ondemand state only, by resetting the ondemand consumer offsets and replaying traffic.
TODO
- Record the "Current offset" of consumer ondemand-articles-compacted-v2 for all topic-partitions. Tip: You can make use of this repo. Update it to report current offsets. You can use the Position API.
- Create new instances of the ondemand service with consumer auto.offset.reset set to "earliest" and a new consumer group name, ondemand-articles-compacted-fix (for instance). Tip: You can use ondemand-<x>-deploy as a template and create 4 new repos in service-deploys. Update the ondemand service so that the consumer auto.offset.reset can be set. Take a look at the KAFKA_AUTO_OFFSET_RESET env variable in the snapshots service as an example.
- After deploying the new instances, monitor their current offset. This is the same process as step 1, but for the new consumer group name. The current offset should grow steadily from 0 up to the offsets you recorded in step 1.
- Keep monitoring occasionally. Once the offsets for the new consumer group exceed the offsets recorded in step 1 for all topic-partitions, we can destroy the new instances.
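Tip: The check in steps 3 and 4 can be scripted. Below is a minimal sketch, assuming the offsets from step 1 and the offsets of the new consumer group have each been collected into a dictionary keyed by (topic, partition). The function names, the topic name, and the offset values in the usage comment are hypothetical and only illustrate the comparison; how the offsets are fetched is left to the tooling from step 1.

```python
from typing import Dict, List, Tuple

# (topic name, partition number)
TopicPartition = Tuple[str, int]


def lagging_partitions(
    recorded: Dict[TopicPartition, int],
    replay: Dict[TopicPartition, int],
) -> List[TopicPartition]:
    """Return topic-partitions where the new group has not yet passed
    the offset recorded in step 1."""
    lagging = []
    for tp, target in sorted(recorded.items()):
        # A partition the new group has not reported yet counts as lagging.
        current = replay.get(tp, -1)
        # The ticket requires the new offset to exceed the recorded one.
        if current <= target:
            lagging.append(tp)
    return lagging


def replay_complete(
    recorded: Dict[TopicPartition, int],
    replay: Dict[TopicPartition, int],
) -> bool:
    """True once every recorded topic-partition has been overtaken."""
    return not lagging_partitions(recorded, replay)


# Hypothetical usage; the topic name and offsets are made up:
# recorded = {("articles-topic", 0): 1200, ("articles-topic", 1): 950}
# replay = {("articles-topic", 0): 1300, ("articles-topic", 1): 400}
# lagging_partitions(recorded, replay) lists the partitions still catching up.
```

The new instances should only be destroyed once replay_complete holds for every topic-partition recorded in step 1.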
Note
Here is the visual representation of current offsets for ondemand-articles-compacted-v2 for these topic-partitions.
Acceptance Criteria
- Old events processed by new ondemand instances (current_offset of new instances > current_offset of old instances)
- No errors during re-processing.
