Drop old WMDEBanner events from Hive
Closed, Resolved (Public)

Description

When sanitization for EventLogging events was added, a job to drop old unsanitized data was added in https://phabricator.wikimedia.org/T209503#4991424. At that time, it looks like the WMDEBanner* tables were somehow excluded from having their data dropped.

T273789: Sanitize and ingest all event tables into the event_sanitized database is refactoring and formalizing sanitization and data purging for all tables in the Hive event database. Can we drop the old data from WMDEBanner* tables? Does it need to be sanitized and kept indefinitely in the event_sanitized database?

Event Timeline

@Addshore can you help this ticket find its way to the right people at WMDE? Thank you!

Ping? :-) If you need to keep this data, I can help in determining what can be kept indefinitely. Thanks!

Thanks for the ping. We have to check if we need the data and get back asap.

hey @GoranSMilovanovic
I am wondering if the raw data mentioned above is relevant for your old reports?

Verena and I are not sure whether your reports are newly generated each time they are viewed. If they are, we need to find a solution; if not, the deletion does not matter for the old reports.

@Merle_von_Wittich_WMDE I don't think so. All the datasets that we need to re-render the old reports in R markdown should still be with me.

A related q as we are figuring this out.

Are these used at all? If not, we would like to stop collecting them as part of T259163: Migrate legacy metawiki schemas to Event Platform. If so, we need to migrate them to Event Platform.

Thanks for the ping, @Ottomata! Please note that the WMDE FUNtech dev team has consisted of Gabriel, me and @AbbanWMDE (instead of @Tim_WMDE) since March 2020.
(you might want to change that in https://phabricator.wikimedia.org/T282131)

@Merle_von_Wittich_WMDE & @GoranSMilovanovic, regarding deletion of historical data:

@Merle_von_Wittich_WMDE I don't think so. All the datasets that we need to re-render the old reports in R markdown should still be with me.

Can you confirm, then, that we can delete data older than 90 days? :-)

Can you confirm, then, that we can delete data older than 90 days? :-)

And/or can we just stop collecting this data altogether? :)

@mforns @Ottomata @Merle_von_Wittich_WMDE

  • We have all our Campaign reports for the WMDE New Editors team already rendered as R Markdown Notebooks in HTML, and they were accepted as such at some point in the past;
  • I am certain that I have all the aggregated datasets that were used to render the reports in place; raw data are not even used for reporting/analytics purposes;
  • So yes, help yourself and get rid of the tables : )

@Ottomata

And/or can we just stop collecting this data altogether? :)

: )

Can you confirm, then, that we can delete data older than 90 days? :-)

And/or can we just stop collecting this data altogether? :)

That is a decision I have to leave to @GoranSMilovanovic and the FunTech Team of @CorinnaHillebrand_WMDE.

@Merle_von_Wittich_WMDE @Ottomata @mforns

For the WMDE New Editors campaigns I am using event.wmdebannerinteractions for some time already.

Will that table survive? If not, for the next campaign I will need someone to let me know where the WMDE banner interactions logging will move. Thanks!

As for the older tables: I have the aggregated datasets used to render the past WMDE New Editors Campaign Reports with me. So the deletion of any older table would not interfere with my work (unless someone asks for something that implies the processing of raw data in the future, but that never happened when banner data were considered in the past).

Ok let me try to rephrase, there are actually 2 distinct questions here.

  1. Do you need historical data (older than 90 days) in the event.wmdebanner* tables? If so, we can set up special jobs to 'sanitize' those tables into the event_sanitized database. If not, we will apply the same purging rules we apply for event tables by default and just drop data older than 90 days. Docs here.
  2. Do you need to continue collecting this data going forward? That is what is being asked in T282131: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned. If not, we will disable the instrumentation and stop collecting this data. If so, we will handle the migration to Event Platform. The migration should be transparent to users of data in Hive; but if you need to make schema changes things are a little different after the migration. You can read more about legacy EventLogging support here.

Perhaps you only need wmdebannerinteractions? If so we can keep collecting that and remove the rest? Just let us know! :)

@Ottomata We definitely need to have the data gathered in event.wmdebannerinteractions in the future.

Do you need historical data (older than 90 days) in the event.wmdebanner* tables?

WMDE, we need a decision here ^^ @Merle_von_Wittich_WMDE @Verena @Christine_Domgoergen_WMDE @WMDE-leszek

If we ever need to re-analyze the banner interactions data beyond the level that is typically present in my campaign reports, the answer to the above question is yes, otherwise it is no. Which one? Thanks.

@Ottomata @Milimetric

Do you need historical data (older than 90 days) in the event.wmdebanner* tables?

I can see the task is now a High priority. Then, my answer is: no.

Please check with the WMDE FUN team if they agree too.

Thank you for your patience.

@Ottomata Sorry for keeping you waiting.

The schemas prefixed by WMDE are used for either fundraising (event.wmdebannersizeissue, event.wmdebannerevents) or by editor engagement campaigns (event.wmdebannerinteractions and also event.wmdebannerevents). We want to continue using the schemas (thanks for linking to the EventLogging legacy doc). We do not need raw data older than 90 days.

@kai.nissen, @GoranSMilovanovic and @Merle_von_Wittich_WMDE
Thanks all for the details and confirmation!

For each of the schemas prefixed with WMDEBanner*, we will proceed and:

  1. Periodically delete all its data older than 90 days.
  2. Migrate its event collection from EventLogging to Event Platform.

Number 1. will happen in the following days as part of this task. There's no action item on your side. We'll comment here when the data is gone.
Number 2. will happen in the following weeks as part of T282562. There will be some questions for you there, but no work to do.

Cheers!
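
For readers unfamiliar with this kind of retention rule, here is a minimal sketch of how "drop data older than 90 days" can be expressed. It is an illustration only, not the actual purging job: the function name `partitions_to_drop`, the idea of date-keyed partitions, and the sample dates are all assumptions made for the example. Only the 90-day window comes from this thread.

```python
from datetime import date, timedelta

# The 90-day window is taken from the thread; everything else is hypothetical.
RETENTION_DAYS = 90


def partitions_to_drop(partition_dates, today, retention_days=RETENTION_DAYS):
    """Return, in ascending order, the partition dates strictly older than the cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


if __name__ == "__main__":
    # Made-up partition dates, purely for illustration.
    sample = [date(2021, 1, 15), date(2021, 3, 2), date(2021, 3, 3), date(2021, 5, 30)]
    for d in partitions_to_drop(sample, today=date(2021, 6, 1)):
        print(f"would drop partition {d.isoformat()}")
```

Passing `today` explicitly, rather than reading the clock inside the function, keeps the rule easy to check against known dates.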

Oh! If you ever want to keep WMDEBanner* data (or any other Event Platform stream) for longer than 90 days, please add the schema to the Event Platform analytics sanitization allowlist.
We will review the change to ensure no privacy-sensitive fields are persisted, and once it's merged, Event Platform will keep the specified data indefinitely.
Thanks!
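
As a rough illustration of what an allowlist-based sanitization step does conceptually, the sketch below keeps only explicitly listed fields of an event and discards the rest. This is not the real allowlist format or the real sanitization code; the function `sanitize_event` and the field names `banner_name`, `action` and `user_agent` are hypothetical and chosen only for the example.

```python
def sanitize_event(event, allowed_fields):
    """Return a copy of the event containing only the allowlisted fields."""
    return {key: value for key, value in event.items() if key in allowed_fields}


if __name__ == "__main__":
    # Hypothetical event and allowlist; real schemas and field names will differ.
    raw_event = {"banner_name": "example", "action": "click", "user_agent": "..."}
    allowlist = {"banner_name", "action"}
    print(sanitize_event(raw_event, allowlist))
```

The point of the allowlist approach is that fields are dropped by default, so a new privacy-sensitive field is not retained unless someone deliberately adds it.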

The data older than 90 days has been deleted. Cheers!
