The Product Analytics team has been trying to figure out how we can calculate our annual plan metrics in near-real-time using the Analytics Data Lake. We could use the MariaDB analytics replicas, but we're reluctant to invest time in a system that will be deprecated soon (T172410).
It seems like we can use the EventBus logs (e.g. mediawiki_revision_create) for real-time-ish edit data, but that data still doesn't include edit tags. The ideal would be for those logs to include edit tags natively, but to make things simpler, can we just have the change_tag tables loaded into the Data Lake daily as a separate table?
We analysts would then take care of joining it to other data sources and of adapting to the upcoming schema changes (T185355) when they happen, which should hopefully make this easy to accomplish.
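To make the analyst-side join concrete, here is a minimal sketch in plain Python of how we might attach tags to revision-create events. The field names are assumptions for illustration: `database` and `rev_id` on the event side, and `wikidb`, `ct_rev_id`, and `ct_tag` on the change_tag side. In practice this would be a query against the Data Lake rather than in-memory lists.

```python
from collections import defaultdict


def attach_tags(events, change_tags):
    """Return a copy of each event with a sorted "tags" list added.

    events:      dicts with "database" and "rev_id" (assumed field names).
    change_tags: dicts with "wikidb", "ct_rev_id", "ct_tag" (assumed field names).
    """
    tags_by_rev = defaultdict(list)
    for row in change_tags:
        # Tags attached only to log entries or recent changes have no revision ID.
        if row["ct_rev_id"] is not None:
            tags_by_rev[(row["wikidb"], row["ct_rev_id"])].append(row["ct_tag"])

    return [
        {**event, "tags": sorted(tags_by_rev.get((event["database"], event["rev_id"]), []))}
        for event in events
    ]
```

Keying on the wiki name together with the revision ID matters because revision IDs are only unique within a single wiki.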
Theoretically, a revision's change tags can be changed by users at any time, but the tags we're interested in are software-set, which means we can rely on them to be set initially and not change afterwards. So it would be sufficient to append each day's new rows rather than reloading the entire table every day, except that the schema changes will require a complete reload when they occur.
The schema can be similar to that of the MediaWiki tables in wmf_raw, which is identical to the original schema except for an additional wikidb field.
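As a rough illustration of the append-only load with the extra wikidb field, here is a minimal sketch in plain Python. It assumes that each change_tag row carries an auto-incrementing `ct_id` that can serve as a high-water mark; that column, the function name, and the in-memory lists are all hypothetical stand-ins for whatever the actual import job would use.

```python
def append_new_rows(lake_rows, source_rows, wikidb):
    """Append rows from one wiki's change_tag table that are not yet in the lake.

    lake_rows:   list of dicts already loaded, each with "wikidb" and "ct_id".
    source_rows: list of dicts from the wiki's change_tag table, each with "ct_id".
    Returns the number of rows appended.
    """
    # Highest ct_id already loaded for this wiki (0 if nothing loaded yet).
    high_water = max(
        (row["ct_id"] for row in lake_rows if row["wikidb"] == wikidb),
        default=0,
    )
    # Keep the original columns and add the wikidb field.
    new_rows = [
        {"wikidb": wikidb, **row} for row in source_rows if row["ct_id"] > high_water
    ]
    lake_rows.extend(new_rows)
    return len(new_rows)
```

A full reload after the schema changes would amount to clearing a wiki's rows and running the same function from an empty state.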
Setting up a workflow for calculating our annual plan metrics is an important priority for us, so it would be extremely helpful if this could be done within the next 2 weeks (by August 24).
Note that T161149 is a separate, still-open request; this is a stopgap until (1) change tags are integrated in mediawiki_history and (2) mediawiki_history is updated on a closer to real-time basis.