We've started to plan how we'll calculate the [Audiences 2018–19 annual plan metrics](https://docs.google.com/spreadsheets/d/1SivevFV9IBhzhXQkgzPYf1Od2FIXf1yLJdLKfRK4BBg/edit?ts=5b1e475a#gid=909203420), and we've identified some infrastructural needs.
= Daily editing data, with tags, in the Data Lake =
== Filed ==
* Need some type of aggregated and/or denormalized table of edits that contains data about edit tags (to be used for mobile retention, mobile edits and editor counts, or anything else that relies on tagging edits)
* Needs to be granular enough to look at short-term editor retention (daily is probably sufficient)
* We need it updated relatively quickly, within a day after the events take place (if the Data Lake is the only source and the data isn't available until the 10th of the following month, that's not tenable).
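As a sketch of how such a table would be consumed: with one denormalized row per (edit, tag) pair, distinct editor counts per day per tag fall out of a simple group-by. The field names below (`user_id`, `day`, `tag`) are hypothetical, not a proposed schema.

```python
from collections import defaultdict
from datetime import date

# Hypothetical denormalized edit rows: one row per (edit, tag) pair.
edits = [
    {"user_id": 1, "day": date(2018, 7, 1), "tag": "mobile edit"},
    {"user_id": 1, "day": date(2018, 7, 1), "tag": "visualeditor"},
    {"user_id": 2, "day": date(2018, 7, 1), "tag": "mobile edit"},
    {"user_id": 2, "day": date(2018, 7, 2), "tag": "mobile edit"},
]

def daily_editor_counts(rows, tag):
    """Count distinct editors per day among edits carrying a given tag."""
    editors = defaultdict(set)
    for r in rows:
        if r["tag"] == tag:
            editors[r["day"]].add(r["user_id"])
    return {day: len(users) for day, users in editors.items()}
```

In practice this would be a Hive query over the Data Lake table rather than Python, but the shape of the aggregation is the same.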
== Not yet in separate tasks ==
* A one-time sqoop of the change_tag tables into the Data Lake to provide past edit tag data to complement current data
* Joining this old data, in one schema, to new data in a different schema seems like it will be a pain. Is there an easier way to do this?
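One possible approach, sketched below with hypothetical field names: map both schemas onto a common (rev_id, tag) shape and union them, so downstream queries never see the difference. The `ct_rev_id`/`ct_tag` names follow the MediaWiki `change_tag` table; the "new schema" row shape is invented for illustration.

```python
# Sqooped change_tag rows: one tag per row.
old_rows = [{"ct_rev_id": 100, "ct_tag": "mobile edit"}]
# Hypothetical new-schema rows: an array of tags per revision.
new_rows = [{"rev_id": 200, "tags": ["mobile edit", "visualeditor"]}]

def normalize_old(row):
    """Rename change_tag columns to the common shape."""
    return {"rev_id": row["ct_rev_id"], "tag": row["ct_tag"]}

def normalize_new(row):
    """Explode the tag array into one row per (rev_id, tag) pair."""
    return [{"rev_id": row["rev_id"], "tag": t} for t in row["tags"]]

unified = [normalize_old(r) for r in old_rows]
for r in new_rows:
    unified.extend(normalize_new(r))
```

In Hive this would be a `UNION ALL` over two SELECTs (with a `LATERAL VIEW explode` on the array side), possibly materialized as a view.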
= Shorter-term retention metrics =
* If we have to wait 2 months to see the impact on our retention metric, real practical challenges arise. We probably need something like 2-day and 2-week retention as directional proxies for making product decisions.
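A minimal sketch of what such a proxy could compute, assuming we have each new editor's activity days: the share of editors who edit again within N days of their first edit. The data shape here is hypothetical.

```python
from datetime import date

# Hypothetical activity days per new editor.
activity = {
    1: [date(2018, 7, 1), date(2018, 7, 2)],  # came back the next day
    2: [date(2018, 7, 1)],                    # never returned
}

def short_term_retention(activity, window_days):
    """Share of editors active again within `window_days` of their first edit."""
    retained = 0
    for days in activity.values():
        first = min(days)
        if any(0 < (d - first).days <= window_days for d in days):
            retained += 1
    return retained / len(activity)
```

With `window_days=2` this gives 2-day retention; `window_days=14` gives the 2-week variant.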
= Identifying reverts =
In mediawiki_history, we'd calculate reverts using the convenient is_reverted field. How do we do it when we simply have a table of revisions with their hashes?
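One standard answer is identity reverts: a revision is a revert if it restores content whose hash matches an earlier revision of the same page, in which case everything in between counts as reverted. A minimal sketch, assuming revisions arrive in chronological order per page:

```python
def find_reverted(revisions):
    """revisions: list of (rev_id, sha1) tuples in chronological order
    for a single page. Returns the set of rev_ids that were undone by a
    later revision restoring an earlier content hash (identity revert)."""
    reverted = set()
    seen = {}  # sha1 -> index of the earliest revision with that content
    for i, (rev_id, sha1) in enumerate(revisions):
        if sha1 in seen and seen[sha1] < i - 1:
            # Everything between the restored state and here was reverted.
            for j in range(seen[sha1] + 1, i):
                reverted.add(revisions[j][0])
        seen.setdefault(sha1, i)
    return reverted
```

This only catches exact restorations (which is what the sha1 column can support); partial reverts would still be invisible, which is a limitation compared to mediawiki_history's is_reverted field.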
= A/B testing =
* @JKatzWMF's thought: "for editors the easiest would be if we could use last digit of user id for metrics (0-9) since that sits in mediawiki tables, instead of creating an independently generated variable, but I'm not sure if we can or should use that for bucketing."
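For concreteness, the last-digit scheme would look something like the sketch below (the bucket names and digit split are illustrative, not decided). The caveat stands: since user IDs are assigned sequentially, digit-based buckets are convenient but not a substitute for properly randomized assignment.

```python
def bucket(user_id, test_digits=frozenset({0, 1, 2, 3, 4})):
    """Assign a registered user to an A/B bucket by the last digit of
    their user_id. Deterministic and computable from MediaWiki tables
    alone, but not truly random (IDs correlate with signup time)."""
    return "test" if user_id % 10 in test_digits else "control"
```

Because the assignment is a pure function of user_id, both the client and any later analysis query can recompute it without storing an extra variable.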