User story
As a developer, I need to comply with the Search team's indices update process concerning weighted tags.
To be implemented in the T300045 script:
- Input = previous run output, current run output
- Output = update dataframe slice with relevant __DELETE_GROUPING__ values
Simplified example
Previous run:
| page_id | tag | values |
| 123 | p373 | [ Q1∣25 ] |
| 666 | p18 | [ Q42∣999 ] |
Current run:
| page_id | tag | values |
| 666 | p18 | [ Q10000∣27 ] |
Output:
| page_id | tag | values |
| 123 | p373 | [ __DELETE_GROUPING__ ] |
| 666 | p18 | [ Q10000∣27 ] |
High-level steps
- Append the .previous suffix to the HDFS parquet path of the previous dataframe
- read the previous dataframe
- generate the current one
- compute the delta between the two (an option could be through a left anti join, but needs further investigation)
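The delta step can be sketched in plain Python on the simplified example above. This is only an illustration of the anti-join idea, not the T300045 script: rows are modeled as (page_id, tag, values) tuples instead of dataframe rows, and `compute_update` is a hypothetical helper name. It assumes that a row is matched on the (page_id, tag) pair and that every current row is carried over to the output; whether unchanged rows should be skipped is not settled here. In Spark, the same selection would roughly correspond to a join with how="left_anti" on those two columns.

```python
DELETE_GROUPING = "__DELETE_GROUPING__"


def compute_update(previous, current):
    """Build the update rows from two runs (hypothetical helper).

    Both arguments are lists of (page_id, tag, values) tuples.
    Assumption: rows are matched on the (page_id, tag) pair.
    """
    current_keys = {(page_id, tag) for page_id, tag, _ in current}

    # Anti-join equivalent: keep previous rows with no match in the current run
    # and replace their values with the delete marker.
    deletions = [
        (page_id, tag, [DELETE_GROUPING])
        for page_id, tag, _ in previous
        if (page_id, tag) not in current_keys
    ]

    # Assumption: all current rows are emitted as-is, changed or not.
    return sorted(deletions + list(current), key=lambda row: (row[0], row[1]))


previous = [(123, "p373", ["Q1∣25"]), (666, "p18", ["Q42∣999"])]
current = [(666, "p18", ["Q10000∣27"])]

update = compute_update(previous, current)
for row in update:
    print(row)
```

On the example data this yields one delete marker for (123, p373) and the current values for (666, p18), matching the Output table above.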