
Improve purging for analytics-slave data on Eventlogging
Closed, Resolved · Public · 55 Estimated Story Points

Description

Temporary solution until we move backend away from MySQL

Purging policy defined in [1].

NOTE: pause the replication from production while this is running.

[1] https://gerrit.wikimedia.org/r/#/c/298721/5/files/mariadb/eventlogging_purging_whitelist.tsv

Event Timeline

Nuria edited projects, added Analytics; removed Analytics-Kanban.
Nuria added a subscriber: elukey.

Let's re-task this in light of @elukey being interested in doing these changes; we might be able to incorporate them into the upgraded replication script:

  • Delete old data from the master (purging rules apply to slaves only); this should be a SEPARATE script from replication
  • Create a script that runs on the slaves and applies the purging rules

Alternative:

  • Apply purging on the master; the issue is that the master might then have old data, but master and slave will be identical

Also, please document the chosen option

Nuria changed the point value for this task from 13 to 21 (Apr 10 2017, 4:02 PM).
Nuria edited projects, added Analytics-Kanban; removed Analytics.
Nuria removed a subscriber: kevinator.

As far as I know, master only holds data for the last 45 days. So, slaves and analytics-storage are the only ones that keep the historical data. I may be wrong, but that is my understanding.

As far as I know, master only holds data for the last 45 days.

I think this is incorrect; we recently deleted data from master that was older than 45 days.

Useful comment from Marcel about the whitelist: https://phabricator.wikimedia.org/T108850#2454271

Addendum from previous comments:

  • If some fields in a table are featured in the white-list but others are not: the fields that are not in the white-list will be set to a garbage-value as soon as possible for events older than 90 days.

https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging

Useful comment from Marcel about the whitelist: https://phabricator.wikimedia.org/T108850#2454271

Addendum from previous comments:

  • If some fields in a table are featured in the white-list but others are not: the fields that are not in the white-list will be set to a garbage-value as soon as possible for events older than 90 days.

Can we set them to NULL instead of an arbitrary value? That would make it much easier to avoid the accidental inclusion of garbage data in analyses.

https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging

BTW that page seems outdated in some parts (e.g. it refers to the clientIP field - removed from the capsule long ago) and draft-like in others (e.g. unfinished sentence: "Note that this strategy includes the ".)

Please note that there is an open task about the whitelist at T164125.

https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging

BTW that page seems outdated in some parts (e.g. it refers to the clientIP field - removed from the capsule long ago) and draft-like in others (e.g. unfinished sentence: "Note that this strategy includes the ".)

On second thought, the talk page there is probably a better venue to discuss issues with the page - I've left comments there and also made an edit trying to clarify the dangling sentence.

@Tbayer

Can we set them to NULL instead of an arbitrary value? That would make it much easier to avoid the accidental inclusion of garbage data in analyses.

Sure! When I wrote garbage-value, I didn't mean an arbitrary value; I generically meant a distinct value that is convenient for us, for example NULL, an empty string, 0, or... The first thing that comes to my mind is NULL as well.
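
For illustration, this is roughly what NULL-based sanitization could look like for a single table. Everything below (table name, whitelist structure, helper name) is a hypothetical sketch, not the actual purging code:

from datetime import datetime, timedelta

# Hypothetical whitelist: table -> columns that must be preserved.
WHITELIST = {
    'SomeSchema_123456': {'uuid', 'timestamp', 'wiki'},
}

def build_sanitize_statement(table, all_columns, retention_days=90):
    """Build an UPDATE that sets every non-whitelisted column to NULL for old events."""
    keep = WHITELIST.get(table, set())
    to_null = [c for c in all_columns if c not in keep]
    cutoff = (datetime.utcnow() - timedelta(days=retention_days)).strftime('%Y%m%d%H%M%S')
    assignments = ', '.join('{} = NULL'.format(c) for c in to_null)
    return "UPDATE {} SET {} WHERE timestamp < '{}'".format(table, assignments, cutoff)

# Example:
# build_sanitize_statement('SomeSchema_123456', ['uuid', 'timestamp', 'wiki', 'userAgent'])
# -> UPDATE SomeSchema_123456 SET userAgent = NULL WHERE timestamp < '<90 days ago>'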

BTW that page seems outdated in some parts (e.g. it refers to the clientIP field - removed from the capsule long ago) and draft-like in others (e.g. unfinished sentence: "Note that this strategy includes the ".)

On second thought, the talk page there is probably a better venue to discuss issues with the page - I've left comments there and also made an edit trying to clarify the dangling sentence.

I updated the page and responded to your comments in the talk page, thanks!

Please note that there is an open task about the whitelist at T164125.

Also commented on that task.

Cheers!

@elukey I commented on @Tbayer 's task T164125
There might be some changes that can affect this task.
Can you read them and give your opinion?
THX!

Some notes taken while researching:

  • Running the purge script and eventlogging_sync.sh on the same host might generate contention while INSERTing/DELETEing rows on the same table, so we'd need to be careful to schedule them properly.
  • eventlogging_sync.sh should only replicate data from the EL master to the slaves when max(timestamp) (or max(id), it depends) is not the same. This means that deleting old data on the slaves and not on the master should not cause eventlogging_sync.sh to re-replicate any data.
  • To simplify the overall architecture it would be great if the new purge script could run locally on the slaves (before or after eventlogging_sync.sh), while a simpler purge policy would run on the master (delete rows whose timestamp is older than 90 days).
  • A good precaution would be to prefix the deletes executed on the master with SET sql_log_bin=0; in order to skip the binlog. This ensures that if in the future, for some reason (a mistake, etc.), a slave is attached to the master, no deletes will be synced (see the sketch after this list).
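
To make the last two points concrete, a rough sketch of what the separate master-side purge could look like (the function name and the plain DB-API cursor are made up for illustration; this is not the actual script):

def purge_master_table(cursor, table, cutoff_ts, batch_size=1000):
    # Skip the binlog so these deletes can never replicate to a slave
    # that gets attached by mistake in the future.
    cursor.execute("SET sql_log_bin = 0")
    while True:
        # Delete rows older than the cutoff (e.g. 90 days ago) in small batches
        # to avoid long-running transactions and contention with eventlogging_sync.sh.
        cursor.execute(
            "DELETE FROM {} WHERE timestamp < %s LIMIT %s".format(table),
            (cutoff_ts, batch_size),
        )
        if cursor.rowcount < batch_size:
            break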

Just adding some notes for the sake of documentation:

We encountered an issue when trying to apply the white-list to json-formatted fields. As the field is stored as a string (it cannot be handled as a json object by mysql), we'd have to update its value one record at a time, as opposed to cleaning a bunch of records with a single update statement that sets the non-white-listed fields to NULL. There might be solutions to that, but right now it is a significant performance complication that would take more time than we initially intended for this task.

We decided to not implement white-listing of json-formatted fields for now, and do it only for "full" fields. For the use cases that need json-format field sanitization (see T164125), we'll temporarily add the whole fields to the white-list, so that no data is lost. In the future, we'll implement a solution that handles this data properly.
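
For the record, the difference between the two cases looks roughly like this: "full" fields can be cleared with one set-based UPDATE per time range, while a json-string field would need a per-row round trip. The helper below is purely illustrative, not part of eventlogging_cleaner.py:

import json

# Full field: a single statement sanitizes a whole time range, e.g.
#   UPDATE SomeSchema_123 SET userAgent = NULL WHERE timestamp < '20170315000000';
#
# JSON-string field: MySQL cannot address keys inside the string here, so each
# row would have to be read, rewritten in Python and written back individually.

def strip_json_field(raw_value, keep_keys):
    """Drop every key of a json-string column that is not in the white-list."""
    data = json.loads(raw_value)
    return json.dumps({k: v for k, v in data.items() if k in keep_keys})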

Change 353265 abandoned by Elukey:
[WIP] First prototype of the EventLogging purge script

Reason:
Creating a new repository instead

https://gerrit.wikimedia.org/r/353265

Change 355604 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/software/analytics-eventlogging-maintenance@master] [WIP] Add the eventlogging_cleaner script and base package

https://gerrit.wikimedia.org/r/355604

Change 355604 abandoned by Elukey:
[WIP] Add the eventlogging_cleaner script and base package

https://gerrit.wikimedia.org/r/355604

I did some performance tests. I executed (by hand) the mentioned SELECT/LIMIT/OFFSET query on analytics-store.eqiad.wmnet the exact same way the script would execute it. The script would execute this command once per updated batch; depending on the batch size and the number of events in the table, that can be a few times or a LOT. I used the biggest table that I know of (Edit) to get an idea of the worst case.

Experiment A: Execute it as if the script was running on the whole date-range that needs to be sanitized.

a1) Big table, data since beginning of table (2015-09-07), small offset:

SELECT uuid from Edit_13457736_15423246 WHERE timestamp < '20170315000000' LIMIT 1000 OFFSET 0;
1000 rows in set (0.28 sec)

a2) Big table, data since beginning of table (2015-09-07), medium offset:

SELECT uuid from Edit_13457736_15423246 WHERE timestamp < '20170315000000' LIMIT 1000 OFFSET 135000000;
1000 rows in set (3 min 58.80 sec)

a3) Big table, data since beginning of table (2015-09-07), biggest offset:

SELECT uuid from Edit_13457736_15423246 WHERE timestamp < '20170315000000' LIMIT 1000 OFFSET 269324118;
1000 rows in set (6 min 2.13 sec)

This date range has around 170M events. With batches of 1000 events, we'll need 170K update commands. On average, it seems that the SELECT/LIMIT/OFFSET will take around 3 minutes: 170K * 3 minutes = 8500 hours :'( Even if we tried larger batches (5000, 10000), it would be too much.

Experiment B: Execute it as if the script was executed one month at a time

b1) Big table, one month of data, small offset:

SELECT uuid from Edit_13457736_15423246 WHERE timestamp >= '20170101000000' AND timestamp < '20170201000000' LIMIT 1000 OFFSET 0;
1000 rows in set (0.02 sec)

b2) Big table, one month of data, medium offset:

SELECT uuid from Edit_13457736_15423246 WHERE timestamp >= '20170101000000' AND timestamp < '20170201000000' LIMIT 1000 OFFSET 7000000;
1000 rows in set (2.66 sec)
1000 rows in set (3.07 sec)

b3) Big table, one month of data, biggest offset:

SELECT uuid from Edit_13457736_15423246 WHERE timestamp >= '20170101000000' AND timestamp < '20170201000000' LIMIT 1000 OFFSET 14000000;
1000 rows in set (6.35 sec)
1000 rows in set (5.60 sec)
1000 rows in set (5.27 sec)

This date range has around 14M events. With batches of 1000 events, we'll need 14K update commands. On average, it seems that the SELECT/LIMIT/OFFSET will take around 2.5 seconds: 14K * 2.5 seconds = 10 hours :''( A lot better, but still bad. Now, if we tried larger batches (5000, 10000), we could reduce the SELECT/LIMIT/OFFSET time to something feasible, like 1 hour maybe? But this would still be only 1 month of purging for just 1 (big) table, and the fact that the uuid list would have 10000 elements would make the update statement slower for sure :/ It looks like this would also not be feasible overall, though if it's our only option, we could try!

Experiment C: Execute it as if the script was executed one day at a time

c1) Big table, one day of data, small offset:

SELECT uuid from Edit_13457736_15423246 WHERE timestamp >= '20170101000000' AND timestamp < '20170102000000' LIMIT 1000 OFFSET 0;
1000 rows in set (0.01 sec)

c2) Big table, one day of data, medium offset:

SELECT uuid from Edit_13457736_15423246 WHERE timestamp >= '20170101000000' AND timestamp < '20170102000000' LIMIT 1000 OFFSET 180000;
1000 rows in set (0.08 sec)

c3) Big table, one day of data, biggest offset:

SELECT uuid from Edit_13457736_15423246 WHERE timestamp >= '20170101000000' AND timestamp < '20170102000000' LIMIT 1000 OFFSET 360000;
1000 rows in set (0.16 sec)

This date range has around 366000 events. With batches of 1000 events, we'll need 366 update commands. On average, it seems that the SELECT/LIMIT/OFFSET will take around 0.1 seconds: 366 * 0.1 seconds = 36 seconds :-) This demonstrates that the script as it is today would work perfectly for day-to-day purging, provided we execute it once per day!
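
The arithmetic above, redone as a quick back-of-the-envelope script (the per-query times are the rough averages observed in the experiments, not exact measurements):

def total_hours(events, batch_size, avg_select_seconds):
    # Number of paginating SELECTs times the average time each one takes.
    return (events / batch_size) * avg_select_seconds / 3600.0

print(total_hours(170000000, 1000, 180))  # whole date range: ~8500 hours
print(total_hours(14000000, 1000, 2.5))   # one month:        ~10 hours
print(total_hours(366000, 1000, 0.1))     # one day:          ~0.01 hours (~36 seconds)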

Conclusions

  • The script would work perfectly for 1-day date range execution (normal execution)
  • It totally looks like it would fail if we use it for the one-off initial historical purging, with whole date range and batch_size=1000
  • If we can not find any other option for the one-off historical purging, we can try to execute it in monthly chunks with batch_size=10000?

I don't have much to add; both Riccardo and I warned about the performance penalty of large offsets, and that a different strategy may be needed there (a scan by id and then the filter, for example). SQL tables are not ordered but unordered sets, so they are not optimized for that (OLTP systems are optimized for point selects and updates).
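
For reference, the id-based scan mentioned here is usually done as keyset pagination: remember the last id seen and start the next batch from there, so the server never has to skip rows the way OFFSET does. A rough sketch, assuming the tables have an auto-increment id column and a plain DB-API cursor (names are illustrative, not the script's code):

def iter_uuid_batches_by_id(cursor, table, cutoff_ts, batch_size=1000):
    """Yield batches of uuids older than cutoff_ts, paginating on the id column."""
    last_id = 0
    while True:
        cursor.execute(
            "SELECT id, uuid FROM {} WHERE id > %s AND timestamp < %s "
            "ORDER BY id LIMIT %s".format(table),
            (last_id, cutoff_ts, batch_size),
        )
        rows = cursor.fetchall()
        if not rows:
            break
        yield [uuid for (_id, uuid) in rows]
        last_id = rows[-1][0]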

What @jcrespo said, see also my comment on https://gerrit.wikimedia.org/r/#/c/356383/12/modules/role/files/mariadb/eventlogging_cleaner.py@206 regarding the addition of an ORDER BY.

If this ends up being the chosen approach, please monitor/log carefully the timings in the real environment where the whole script runs with deletes, selects and updates.

Thanks a lot for all the suggestions and feedback.

Marcel changed the logic that we were using a bit, replacing LIMIT/OFFSET with something simpler. A batch of uuids, limited to a fixed number, is retrieved together with its timestamps. Once we UPDATE them according to the whitelist, we retrieve another batch starting from the last timestamp returned by the previous batch. This has the consequence of updating/sanitizing the last rows of the batch twice (the ones sharing the last timestamp), since we don't want to miss any row due to the LIMIT boundary.

I think that this is a good enough solution for this use case; it solves the big performance problem that Riccardo and Jaime pointed out without adding much complexity to the script.
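
In rough terms, the new batching looks like this (a simplified sketch of the approach described above, not the script's exact code):

def iter_uuid_batches_by_timestamp(cursor, table, start_ts, end_ts, batch_size=1000):
    """Yield batches of uuids, re-starting each batch from the last timestamp seen."""
    current_ts = start_ts
    while True:
        cursor.execute(
            "SELECT uuid, timestamp FROM {} "
            "WHERE timestamp >= %s AND timestamp < %s "
            "ORDER BY timestamp LIMIT %s".format(table),
            (current_ts, end_ts, batch_size),
        )
        rows = cursor.fetchall()
        if not rows:
            break
        yield [uuid for (uuid, _ts) in rows]
        if len(rows) < batch_size:
            break
        # Using >= on the last timestamp means rows sharing it are fetched (and
        # sanitized) twice, but none are skipped at the LIMIT boundary.
        # Assumes fewer than batch_size rows share a single timestamp, otherwise
        # the window would never advance.
        current_ts = rows[-1][1]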

Please let me know your thoughts :)

@elukey thanks for the update. Which one is the final change to be reviewed then?

@Volans: still need to finalize some late requirements but we should be close; I'll ask Marcel for a final review and then I'll let you know :)

Change 364949 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_purging_whitelist.tsv: remove unnecessary schemas

https://gerrit.wikimedia.org/r/364949

Change 364949 merged by Elukey:
[operations/puppet@production] eventlogging_purging_whitelist.tsv: remove unnecessary schemas

https://gerrit.wikimedia.org/r/364949

Change 364956 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_cleaner.py: separate logs between stdout/stderr

https://gerrit.wikimedia.org/r/364956

Change 364956 merged by Elukey:
[operations/puppet@production] eventlogging_cleaner.py: split logs between stdout/stderr

https://gerrit.wikimedia.org/r/364956

Change 364961 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_cleaner.py: set default loglevel for the main logger

https://gerrit.wikimedia.org/r/364961

Change 364961 merged by Elukey:
[operations/puppet@production] eventlogging_cleaner.py: set default loglevel for the main logger

https://gerrit.wikimedia.org/r/364961

Change 365007 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mariadb::misc::eventlogging: remove the readonly constraint for slaves

https://gerrit.wikimedia.org/r/365007

Change 365007 merged by Elukey:
[operations/puppet@production] role::mariadb::misc::eventlogging: remove the readonly constraint for slaves

https://gerrit.wikimedia.org/r/365007

Change 365027 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Remove 2 schemas from EventLogging purging white-list

https://gerrit.wikimedia.org/r/365027

Change 365027 abandoned by Mforns:
Remove 2 schemas from EventLogging purging white-list

Reason:
Already done, sorry.

https://gerrit.wikimedia.org/r/365027

Nuria changed the point value for this task from 21 to 55 (Jul 19 2017, 12:32 AM).

So today I did some calculations on the update timings for the MobileWebUIClickTracking_10742159_15423246 table (500GB in size), and this is the outcome:

  • We have 1475415359 records in there.
  • ~14754 batches of 100k updates to execute.
  • A batch on dbstore1002 takes ~380s to complete.
  • It would take ~64 days to complete (see the quick check below).
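
A quick check of that estimate:

records = 1475415359
batch_size = 100000
seconds_per_batch = 380

batches = records / float(batch_size)        # ~14754 batches
days = batches * seconds_per_batch / 86400   # ~64.9 days
print(round(batches), round(days, 1))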

The current logic selects uuids and applies the UPDATE statement only to them, but it really takes a lot of time to complete. We ran the following experiment, comparing a query that uses uuids with one that only uses timestamps:

elukey@dbstore1002:~$ cat MobileWebUIClickTrackingUUIDS_CSV.txt
UPDATE MobileWebUIClickTracking_10742159_15423246 SET userAgent = NULL WHERE uuid IN ('92394923042344234', [...5k uuids...] )

elukey@dbstore1002:~$ time sudo mysql --skip-ssl -Dlog < MobileWebUIClickTrackingUUIDS_CSV.txt

real    0m31.098s
user    0m0.000s
sys     0m0.009s

elukey@dbstore1002:~$ sudo mysql --skip-ssl

MariaDB [(none)]> use log
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [log]> UPDATE MobileWebUIClickTracking_10742159_15423246 SET userAgent = NULL WHERE timestamp >= '20150101000000' AND timestamp < '20150102041922';
Query OK, 0 rows affected (0.07 sec)
Rows matched: 4999  Changed: 0  Warnings: 0

MariaDB [log]> UPDATE MobileWebUIClickTracking_10742159_15423246 SET userAgent = "test" WHERE timestamp >= '20150101000000' AND timestamp < '20150102041922';
Query OK, 4999 rows affected (0.22 sec)
Rows matched: 4999  Changed: 4999  Warnings: 0

MariaDB [log]> UPDATE MobileWebUIClickTracking_10742159_15423246 SET userAgent = NULL WHERE timestamp >= '20150101000000' AND timestamp < '20150102041922';
Query OK, 4999 rows affected (0.23 sec)
Rows matched: 4999  Changed: 4999  Warnings: 0

MariaDB [log]> UPDATE MobileWebUIClickTracking_10742159_15423246 SET userAgent = NULL WHERE timestamp >= '20150101000000' AND timestamp < '20150102041922';
Query OK, 0 rows affected (0.09 sec)
Rows matched: 4999  Changed: 0  Warnings: 0

The speed difference is huge. The whole point of using uuids instead of timestamps directly was that we couldn't guarantee that a batch would be of exactly size X (because some elements might lie on the boundary of the end timestamp).

We are going to work on a code change that forces the script to use only timestamps, and will ask for reviews later on.
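
The direction we have in mind is roughly the following: one set-based UPDATE per timestamp window, which MariaDB can satisfy from the timestamp index instead of matching a long uuid IN (...) list. Again a rough sketch (column handling and names are illustrative), not the actual patch:

def sanitize_window(cursor, table, columns_to_null, start_ts, end_ts):
    """NULL the given columns for all rows in [start_ts, end_ts)."""
    assignments = ', '.join('{} = NULL'.format(c) for c in columns_to_null)
    cursor.execute(
        "UPDATE {} SET {} WHERE timestamp >= %s AND timestamp < %s".format(table, assignments),
        (start_ts, end_ts),
    )
    return cursor.rowcount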

Change 374823 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] [WIP] Optimize EventLogging purging script using timestamps

https://gerrit.wikimedia.org/r/374823

Change 374823 merged by Elukey:
[operations/puppet@production] eventlogging_cleaner.py: improve sanitize method

https://gerrit.wikimedia.org/r/374823

The new script is running on db1047, restarting from 2009 and going up to mid-2016 data. Everything seems to work fine, and afaics the UPDATE statements are much quicker. For example, Edit* table updates are taking ~10-15 seconds to complete. This would mean that even the super huge tables would take at most a couple of days to be sanitized completely.

So the plan is to:

  1. finish the test on db1047
  2. completely sanitize dbstore1002 and db1047 up to the start of 2017
  3. vet/check the data with Marcel
  4. deploy the script as cron

Change 375995 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_cleaner.py: force timestamp to CHAR

https://gerrit.wikimedia.org/r/375995

Change 375995 merged by Elukey:
[operations/puppet@production] eventlogging_cleaner.py: force timestamp to CHAR

https://gerrit.wikimedia.org/r/375995

The run of the script was too aggressive: the dbs started to thrash due to disk saturation:

[Attachment: Screen Shot 2017-09-07 at 4.03.31 PM.png - disk saturation graph]

I started the script again on db1047 only, with batch_size 10k and sleep_between_batches 4s, which should reach ~1.4 billion rows sanitized per day if I got my calculations right. The current disk saturation is in the 80% range and seems stable.

[Attachment: Screen Shot 2017-09-11 at 2.56.32 PM.png - disk saturation graph for db1047]

The above graph shows, more or less, the disk saturation on db1047 (analytics-slave) caused by:

  • eventlogging_cleaner.py
  • eventlogging_sync
  • mysql replication of wikis

I checked with pidstat -dlr 10 before stopping each script/daemon/sync/etc., and mysqld was responsible for the IO.

Started eventlogging_cleaner.py again on both slaves, with batch-size 1000 and sleep 4s. This seems to work in a reasonable amount of time, but completing the huge tables will still take a long time.

Imagine we have purged everything except the giant tables.
Would we want to consider stopping EL and replication for like 1 day,
to be able to purge them (with initial 10-100k batch size and 1 second sleep)?

Imagine we have purged everything except the giant tables.
Would we want to consider stopping EL and replication for like 1 day,
to be able to purge them (with initial 10-100k batch size and 1 second sleep)?

Probably not, the huge UPDATE batches are affecting too much disk usage..

Probably not, the huge UPDATE batches are affecting too much disk usage..

But maybe, if we stop replication from both the EL master and the MediaWiki dbs, we'll have enough disk headroom to use 10-100K batches (1 sec sleep), no?
The disk utilization is already at 80% when the purging script is launched, so it's very likely that the performance of the script would be a lot higher if the other scripts were paused, no?

We could maybe perform a 30 minutes test on a big table, and see how many rows were updated?

Change 378236 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_cleaner.py: add feature to pick start_ts from file

https://gerrit.wikimedia.org/r/378236

Change 378236 merged by Elukey:
[operations/puppet@production] eventlogging_cleaner.py: add feature to pick start_ts from file

https://gerrit.wikimedia.org/r/378236

Change 391828 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mariadb::misc::eventlogging:replication: add EL sanitization cron

https://gerrit.wikimedia.org/r/391828

Probably I am stating the obvious, but some throttling could be disabled if it only affects the new hosts (not dbstore1002), as they should have much less io impact (larger memory) and much higher io capacity.

Probably I am stating the obvious, but some throttling could be disabled if it only affects the new hosts (not dbstore1002), as they should have much less io impact (larger memory) and much higher io capacity.

Definitely not stating the obvious - I have no idea how throttling is set up for those dbs :(

Change 391828 merged by Elukey:
[operations/puppet@production] profile::mariadb::misc::eventlogging:replication: add EL sanitization cron

https://gerrit.wikimedia.org/r/391828

With throttling I meant making either replication or sanitization faster- I think it created load problems in the past according to this ticket.

With throttling I meant making either replication or sanitization faster- I think it created load problems in the past according to this ticket.

Ah yes, sure. I've already run two batches of sanitization with 100k rows per batch and db1108 didn't show any issue (so data is now sanitized up to 90 days ago, as expected). With the daily volume, I think that 10k is fine.

Replication should be fine as well; we had only one problem in the past due to a very high-volume eventlogging schema, where even with 10k batches we were not able to keep up with the data, so we documented that limitation of the current replication script.

Change 392788 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mariadb::misc::el::replication: fix el cleaner cronjob

https://gerrit.wikimedia.org/r/392788

Change 392788 merged by Elukey:
[operations/puppet@production] profile::mariadb::misc::el::replication: fix el cleaner cronjob

https://gerrit.wikimedia.org/r/392788

Change 392807 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mariadb::misc::eventlogging: use /var/log/eventlogging in el_sync

https://gerrit.wikimedia.org/r/392807

Change 392807 merged by Elukey:
[operations/puppet@production] profile::mariadb::misc::eventlogging: use /var/log/eventlogging in el_sync

https://gerrit.wikimedia.org/r/392807

Verified that two runs of the eventlogging cleaner ran on db1108 without any issues.