We've started to plan how we'll calculate the Audiences 2018–19 annual plan metrics, and we've identified some infrastructural needs.
Daily editing data, with tags, in the Data Lake
- Need some type of aggregated and/or denormalized table of edits that contains data about edit tags (to be used for mobile retention, mobile edits and editor counts, or anything else that relies on tagging edits)
- Needs to be granular enough to look at short-term editor retention (daily is probably sufficient)
- Needs to be updated quickly, within a day of the events taking place (if the Data Lake is the only source and the data isn't available until the 10th after the month's end, that is not tenable)
- A one-time Sqoop import of the change_tag tables into the Data Lake to provide past edit tag data to complement current data
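To make the shape of the request concrete, here is a rough sketch of the kind of denormalized table we have in mind and one metric computed from it. The field names (`rev_id`, `user`, `day`, `tag`) and both helper functions are hypothetical, chosen only for illustration; the real table would live in the Data Lake and its schema is still to be decided.

```python
from collections import defaultdict


def denormalize(revisions, change_tags):
    """Attach a list of tags to each revision row (one row per edit).

    Both inputs are lists of dicts; the field names are assumptions.
    """
    tags_by_rev = defaultdict(list)
    for row in change_tags:
        tags_by_rev[row["rev_id"]].append(row["tag"])
    # Revisions without any tag get an empty list.
    return [
        dict(rev, tags=sorted(tags_by_rev.get(rev["rev_id"], [])))
        for rev in revisions
    ]


def daily_editors_with_tag(edits, tag):
    """Count distinct users per day who made at least one edit with `tag`."""
    editors = defaultdict(set)
    for edit in edits:
        if tag in edit["tags"]:
            editors[edit["day"]].add(edit["user"])
    return {day: len(users) for day, users in editors.items()}
```

With daily granularity in the table, short-term retention could be derived the same way, by comparing the sets of users active on different days.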
In mediawiki_history, we'd calculate reverts using the convenient is_reverted field. How do we do it when we simply have a table of revisions with their hashes?
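One possible answer is identity-revert detection: a revision counts as a revert if its content hash matches an earlier revision of the same page, and every revision in between counts as reverted. The sketch below assumes this definition, and we have not checked that it matches how is_reverted is computed. The function name, the field names (`page_id`, `rev_id`, `sha1`), and the `radius` cutoff of 15 are all assumptions for illustration.

```python
def find_identity_reverts(revisions, radius=15):
    """Return (reverting_ids, reverted_ids) using content hashes only.

    `revisions` is a list of dicts with `page_id`, `rev_id`, and `sha1`,
    ordered chronologically within each page. `radius` is the assumed
    maximum number of revisions a single revert can undo.
    """
    reverting, reverted = set(), set()
    history = {}  # page_id -> list of (rev_id, sha1), oldest first

    for rev in revisions:
        past = history.setdefault(rev["page_id"], [])
        # Look back far enough that up to `radius` revisions can be undone.
        window = past[-(radius + 1):]
        # Scan from the most recent revision backwards for a matching hash.
        for i in range(len(window) - 1, -1, -1):
            if window[i][1] == rev["sha1"]:
                between = window[i + 1:]
                # An identical hash with nothing in between is a null
                # edit, not a revert.
                if between:
                    reverting.add(rev["rev_id"])
                    reverted.update(rev_id for rev_id, _ in between)
                break
        past.append((rev["rev_id"], rev["sha1"]))

    return reverting, reverted
```

This needs only the per-page revision order and the hashes, so it could run over a plain revisions table, though doing it per page at Data Lake scale would likely need a windowed or grouped implementation rather than a single loop.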