We've started to plan how we'll calculate the [Audiences 2018–19 annual plan metrics](https://docs.google.com/spreadsheets/d/1SivevFV9IBhzhXQkgzPYf1Od2FIXf1yLJdLKfRK4BBg/edit?ts=5b1e475a#gid=909203420), and we've identified some infrastructural needs.
= Daily editing data, with tags, in the Data Lake =
== Filed ==
* Need some type of aggregated and/or denormalized table of edits that contains data about edit tags (to be used for mobile retention, mobile edits and editor counts, or anything else that relies on tagging edits)
* Needs to be granular enough to look at short-term editor retention (daily is probably sufficient)
* We need it updated relatively quickly, within a day after the events take place (if the Data Lake is the only source and the data isn't available until the 10th of the following month, that's not tenable).
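As a sketch of how such a table would be consumed: with one denormalized row per (edit, tag) pair, distinct editor counts per day per tag fall out of a simple group-by. The field names below (`user_id`, `day`, `tag`) are hypothetical, not a proposed schema.

```python
from collections import defaultdict
from datetime import date

# Hypothetical denormalized edit rows: one row per (edit, tag) pair.
edits = [
    {"user_id": 1, "day": date(2018, 7, 1), "tag": "mobile edit"},
    {"user_id": 1, "day": date(2018, 7, 1), "tag": "visualeditor"},
    {"user_id": 2, "day": date(2018, 7, 1), "tag": "mobile edit"},
    {"user_id": 2, "day": date(2018, 7, 2), "tag": "mobile edit"},
]

def daily_editor_counts(rows, tag):
    """Count distinct editors per day among edits carrying a given tag."""
    editors = defaultdict(set)
    for r in rows:
        if r["tag"] == tag:
            editors[r["day"]].add(r["user_id"])
    return {day: len(users) for day, users in editors.items()}
```

In practice this would be a Hive query over the Data Lake table rather than Python, but the shape of the aggregation is the same.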
== Not yet in separate tasks ==
* A one-time sqoop of the change_tag tables into the Data Lake to provide past edit tag data to complement current data
* Joining this old data, in one schema, to new data in a different schema seems like it will be a pain. Is there an easier way to do this?
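One possible approach, sketched below with hypothetical field names: map both schemas onto a common (rev_id, tag) shape and union them, so downstream queries never see the difference. The `ct_rev_id`/`ct_tag` names follow the MediaWiki `change_tag` table; the "new schema" row shape is invented for illustration.

```python
# Sqooped change_tag rows: one tag per row.
old_rows = [{"ct_rev_id": 100, "ct_tag": "mobile edit"}]
# Hypothetical new-schema rows: an array of tags per revision.
new_rows = [{"rev_id": 200, "tags": ["mobile edit", "visualeditor"]}]

def normalize_old(row):
    """Rename change_tag columns to the common shape."""
    return {"rev_id": row["ct_rev_id"], "tag": row["ct_tag"]}

def normalize_new(row):
    """Explode the tag array into one row per (rev_id, tag) pair."""
    return [{"rev_id": row["rev_id"], "tag": t} for t in row["tags"]]

unified = [normalize_old(r) for r in old_rows]
for r in new_rows:
    unified.extend(normalize_new(r))
```

In Hive this would be a `UNION ALL` over two SELECTs (with a `LATERAL VIEW explode` on the array side), possibly materialized as a view.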
= Shorter-term retention metrics =
* If we have to wait 2 months to see the impact on our retention metric, real practical challenges arise. We probably need something like 2-day and 2-week retention as directional proxies for making product decisions.
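A minimal sketch of what such a proxy could compute, assuming we have each new editor's activity days: the share of editors who edit again within N days of their first edit. The data shape here is hypothetical.

```python
from datetime import date

# Hypothetical activity days per new editor.
activity = {
    1: [date(2018, 7, 1), date(2018, 7, 2)],  # came back the next day
    2: [date(2018, 7, 1)],                    # never returned
}

def short_term_retention(activity, window_days):
    """Share of editors active again within `window_days` of their first edit."""
    retained = 0
    for days in activity.values():
        first = min(days)
        if any(0 < (d - first).days <= window_days for d in days):
            retained += 1
    return retained / len(activity)
```

With `window_days=2` this gives 2-day retention; `window_days=14` gives the 2-week variant.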
= Identifying reverts =
In mediawiki_history, we'd calculate reverts using the convenient is_reverted field. How do we do it when we simply have a table of revisions with their hashes?
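One standard answer is identity reverts: a revision is a revert if it restores content whose hash matches an earlier revision of the same page, in which case everything in between counts as reverted. A minimal sketch, assuming revisions arrive in chronological order per page:

```python
def find_reverted(revisions):
    """revisions: list of (rev_id, sha1) tuples in chronological order
    for a single page. Returns the set of rev_ids that were undone by a
    later revision restoring an earlier content hash (identity revert)."""
    reverted = set()
    seen = {}  # sha1 -> index of the earliest revision with that content
    for i, (rev_id, sha1) in enumerate(revisions):
        if sha1 in seen and seen[sha1] < i - 1:
            # Everything between the restored state and here was reverted.
            for j in range(seen[sha1] + 1, i):
                reverted.add(revisions[j][0])
        seen.setdefault(sha1, i)
    return reverted
```

This only catches exact restorations (which is what the sha1 column can support); partial reverts would still be invisible, which is a limitation compared to mediawiki_history's is_reverted field.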
= A/B testing =
* @JKatzWMF's thought: "for editors the easiest would be if we could use last digit of user id for metrics (0-9) since that sits in mediawiki tables, instead of creating an independently generated variable, but I'm not sure if we can or should use that for bucketing."
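For concreteness, the last-digit scheme would look something like the sketch below (the bucket names and digit split are illustrative, not decided). The caveat stands: since user IDs are assigned sequentially, digit-based buckets are convenient but not a substitute for properly randomized assignment.

```python
def bucket(user_id, test_digits=frozenset({0, 1, 2, 3, 4})):
    """Assign a registered user to an A/B bucket by the last digit of
    their user_id. Deterministic and computable from MediaWiki tables
    alone, but not truly random (IDs correlate with signup time)."""
    return "test" if user_id % 10 in test_digits else "control"
```

Because the assignment is a pure function of user_id, both the client and any later analysis query can recompute it without storing an extra variable.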