User story
--
As a developer, I need to comply with the Search team's index update process for weighted tags.
---
>>! In T300045#7708772, @EBernhardson wrote:
> when updating we will always provide a "full update" (...)
> we always provide the full new state and get rid of the old state. (...)
> it would be more resilient if the system could generate the expected data for the set of pages it wants to update, emitting `__DELETE_GROUPING__` when it has calculated that the page doesn't have any of the specified tags
>
> Essentially, I think what the script should output is not the updates it wants to make to the cirrus indices, but rather **the expected final state of the items it wants to update.**
To be implemented in the T300045 script:
- Input = the previous run's output and the current run's output
- Output = the update dataframe, with `__DELETE_GROUPING__` in place of the values for any grouping that existed in the previous run but no longer exists
**Simplified example**
Previous run:
| page_id | tag | values |
| 123 | p373 | [ Q1∣25 ] |
| 666 | p18 | [ Q42∣999 ] |
Current run:
| page_id | tag | values |
| 666 | p18 | [ Q10000∣27 ] |
Output:
| page_id | tag | values |
| 123 | p373 | `__DELETE_GROUPING__` |
| 666 | p18 | [ Q10000∣27 ] |
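The simplified example above can be sketched in pandas (a toy illustration only; the real job runs on the cluster, and the `page_id`/`tag`/`values` column names simply mirror the example tables):

```python
import pandas as pd

DELETE = "__DELETE_GROUPING__"

# Toy data mirroring the simplified example above.
previous = pd.DataFrame({
    "page_id": [123, 666],
    "tag": ["p373", "p18"],
    "values": [["Q1|25"], ["Q42|999"]],
})
current = pd.DataFrame({
    "page_id": [666],
    "tag": ["p18"],
    "values": [["Q10000|27"]],
})

def compute_update(previous: pd.DataFrame, current: pd.DataFrame) -> pd.DataFrame:
    # Groupings (page_id, tag) present in the previous run but absent from
    # the current one must be explicitly marked for deletion in the index.
    keys = ["page_id", "tag"]
    gone = previous.merge(current[keys], on=keys, how="left", indicator=True)
    gone = gone[gone["_merge"] == "left_only"][keys].copy()
    gone["values"] = DELETE
    # The output is the full current state plus the deletion markers,
    # i.e. the expected final state of every item being updated.
    return pd.concat([current, gone], ignore_index=True)

update = compute_update(previous, current)
```

This matches the Output table: the `(666, p18)` grouping carries its new values, while `(123, p373)` is emitted with the deletion marker.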
High-level steps
--
1. Append `.previous` to the path of the previous run's HDFS parquet output
2. Read the previous dataframe
3. Generate the current one
4. Compute the delta between the two (one option could be a //left anti join//, but this needs further investigation)