Temporary solution until we move backend away from MySQL
Purging policy defined in [1].
[1] https://gerrit.wikimedia.org/r/#/c/298721/5/files/mariadb/eventlogging_purging_whitelist.tsv
• Nuria | Feb 1 2017, 5:14 PM
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Resolved | | Milimetric | T102224 {tick} Schema Audit |
| Duplicate | | mforns | T104877 Enforce policy for each schema: Sanitize {tick} [8 pts] |
| Resolved | | mforns | T108857 Delete obsolete schemas {tick} [5 pts] |
| Resolved | | madhuvishy | T110247 Archive obsolete schema pages {tick} [5 pts] |
| Resolved | | None | T108850 Set up auto-purging after 90 days {tick} |
| Resolved | | jcrespo | T108856 Set up bucketization of editCount fields {tick} |
| Resolved | | jcrespo | T104748 db1046 innodb signal 6 abort and restart |
| Resolved | | elukey | T156933 Improve purging for analytics-slave data on Eventlogging |
Let's re-task this in light of @elukey being interested in doing these changes; we might be able to incorporate them into the upgraded replication script.
Alternative:
As far as I know, the master only holds data for the last 45 days. So the slaves and analytics-storage are the only ones that keep the historical data. I may be wrong, but that is my understanding.
> As far as I know, master only holds data for the last 45 days.
I think this is incorrect; we recently deleted data from the master that was older than 45 days.
Useful comment from Marcel about the whitelist: https://phabricator.wikimedia.org/T108850#2454271
Addendum from previous comments:
https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging
Can we set them to NULL instead of an arbitrary value? That would make it much easier to avoid the accidental inclusion of garbage data in analyses.
https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging
BTW that page seems outdated in some parts (e.g. it refers to the clientIP field, removed from the capsule long ago) and draft-like in others (e.g. the unfinished sentence "Note that this strategy includes the ").
> https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Data_retention_and_auto-purging
> BTW that page seems outdated in some parts (e.g. it refers to the clientIP field, removed from the capsule long ago) and draft-like in others (e.g. the unfinished sentence "Note that this strategy includes the ").
On second thought, the talk page there is probably a better venue to discuss issues with the page; I've left comments there and also made an edit trying to clarify the dangling sentence.
> Can we set them to NULL instead of an arbitrary value? That would make it much easier to avoid the accidental inclusion of garbage data in analyses.
Sure! When I wrote "garbage value", I didn't mean an arbitrary value; I meant any distinct value that is convenient for us, for example NULL, the empty string, or 0. The first thing that comes to my mind is NULL as well.
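As a sketch of what NULL-based purging could look like for regular (non-JSON) columns. Table and column names below are hypothetical, and the real whitelist is the TSV linked in the task description; this is only an illustration of the set-based approach, not the production script:

```python
# Hypothetical sketch: build a single set-based UPDATE that NULLs every
# column NOT in the whitelist for one schema table. All names here are
# illustrative, not the real script's API.

def build_purge_query(table, all_columns, whitelisted, cutoff_ts):
    """Return an UPDATE statement setting non-whitelisted columns to NULL
    for all rows older than cutoff_ts (a yyyymmddHHMMSS string)."""
    to_null = [c for c in all_columns if c not in whitelisted]
    if not to_null:
        return None  # everything is whitelisted: nothing to purge
    assignments = ", ".join("`{}` = NULL".format(c) for c in to_null)
    return ("UPDATE `{}` SET {} WHERE timestamp < '{}'"
            .format(table, assignments, cutoff_ts))

# Example with made-up columns:
q = build_purge_query(
    "Edit_13457736",
    ["uuid", "timestamp", "userAgent"],
    {"uuid", "timestamp"},
    "20170315000000",
)
```

One statement covers the whole table (or a time window of it), which is what makes NULL-ing "full" fields cheap compared to per-row rewrites.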
> BTW that page seems outdated in some parts (e.g. it refers to the clientIP field, removed from the capsule long ago) and draft-like in others (e.g. the unfinished sentence "Note that this strategy includes the ").
> On second thought, the talk page there is probably a better venue to discuss issues with the page; I've left comments there and also made an edit trying to clarify the dangling sentence.
I updated the page and responded to your comments in the talk page, thanks!
Please note that there is an open task about the whitelist at T164125.
Also commented on that task.
Cheers!
Some notes taken while researching:
Just adding some notes for the sake of documentation:
We encountered an issue when trying to apply the white-list to json-formatted fields. Since the field is a string (it cannot be handled as a JSON object by MySQL), we would have to update its value one record at a time, as opposed to cleaning a whole batch of records with a single UPDATE statement that sets the non-white-listed fields to NULL. There might be solutions to that, but right now it is a significant performance complication that would take more time than we initially intended for this task.
We decided not to implement white-listing of json-formatted fields for now, and to do it only for "full" fields. For the use cases that need json-format field sanitization (see T164125), we'll temporarily add the whole fields to the white-list, so that no data is lost. In the future, we'll implement a solution that handles this data properly.
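To illustrate why the JSON-formatted fields are expensive: MySQL stores them as plain strings, so each row's blob has to be parsed and rewritten client-side, then written back with its own per-row UPDATE. A hypothetical sketch of the per-row step (not the production script):

```python
import json

def sanitize_json_blob(blob, whitelisted_keys):
    """Keep only whitelisted keys of a JSON-encoded event field.
    Unlike a set-based UPDATE on whole columns, this has to run
    once per row, client-side."""
    try:
        data = json.loads(blob)
    except (ValueError, TypeError):
        return None  # unparseable blob: drop its contents entirely
    kept = {k: v for k, v in data.items() if k in whitelisted_keys}
    return json.dumps(kept, sort_keys=True)

# Each sanitized row would then need its own statement, e.g.:
#   UPDATE <table> SET event = %s WHERE uuid = %s
```

With hundreds of millions of rows, that per-row round trip is what makes the approach infeasible for now.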
Change 353265 abandoned by Elukey:
[WIP] First prototype of the EventLogging purge script
Reason:
Creating a new repository instead
Change 355604 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/software/analytics-eventlogging-maintenance@master] [WIP] Add the eventlogging_cleaner script and base package
Change 355604 abandoned by Elukey:
[WIP] Add the eventlogging_cleaner script and base package
I did some performance tests. I executed (by hand) the mentioned SELECT/LIMIT/OFFSET query on analytics-store.eqiad.wmnet, the exact same way the script would execute it. The script would run this command once per updated batch; depending on the batch size and the number of events in the table, that can be a few times or a LOT. I used the biggest table that I know of (Edit) to get an idea of the worst case.
a1) Big table, data since beginning of table (2015-09-07), small offset:
```
SELECT uuid from Edit_13457736_15423246 WHERE timestamp < '20170315000000' LIMIT 1000 OFFSET 0;
1000 rows in set (0.28 sec)
```
a2) Big table, data since beginning of table (2015-09-07), medium offset:
```
SELECT uuid from Edit_13457736_15423246 WHERE timestamp < '20170315000000' LIMIT 1000 OFFSET 135000000;
1000 rows in set (3 min 58.80 sec)
```
a3) Big table, data since beginning of table (2015-09-07), biggest offset:
```
SELECT uuid from Edit_13457736_15423246 WHERE timestamp < '20170315000000' LIMIT 1000 OFFSET 269324118;
1000 rows in set (6 min 2.13 sec)
```
This date range has around 170M events. With batches of 1000 events, we'll need 170K update commands. On average, it seems that the SELECT/LIMIT/OFFSET will take around 3 minutes. 170K * 3 minutes = 8500 hours :'( Even if we tried larger batches (5000, 10000), it would be too much.
b1) Big table, one month of data, small offset:
```
SELECT uuid from Edit_13457736_15423246 WHERE timestamp >= '20170101000000' AND timestamp < '20170201000000' LIMIT 1000 OFFSET 0;
1000 rows in set (0.02 sec)
```
b2) Big table, one month of data, medium offset:
```
SELECT uuid from Edit_13457736_15423246 WHERE timestamp >= '20170101000000' AND timestamp < '20170201000000' LIMIT 1000 OFFSET 7000000;
1000 rows in set (2.66 sec)
1000 rows in set (3.07 sec)
```
b3) Big table, one month of data, biggest offset:
```
SELECT uuid from Edit_13457736_15423246 WHERE timestamp >= '20170101000000' AND timestamp < '20170201000000' LIMIT 1000 OFFSET 14000000;
1000 rows in set (6.35 sec)
1000 rows in set (5.60 sec)
1000 rows in set (5.27 sec)
```
This date range has around 14M events. With batches of 1000 events, we'll need 14K update commands. On average, it seems that the SELECT/LIMIT/OFFSET will take around 2.5 seconds. 14K * 2.5 seconds = 10 hours :''( A lot better, but still bad. Now, if we tried larger batches (5000, 10000), we could reduce the total SELECT/LIMIT/OFFSET time to something feasible, like 1 hour maybe? But this would still be 1 month of purging for just 1 (big) table, and a uuid list with 10000 elements would surely make the update statement slower :/ It looks like this would not be feasible overall either, though if it's our only option, we could try!
c1) Big table, one day of data, small offset:
```
SELECT uuid from Edit_13457736_15423246 WHERE timestamp >= '20170101000000' AND timestamp < '20170102000000' LIMIT 1000 OFFSET 0;
1000 rows in set (0.01 sec)
```
c2) Big table, one day of data, medium offset:
```
SELECT uuid from Edit_13457736_15423246 WHERE timestamp >= '20170101000000' AND timestamp < '20170102000000' LIMIT 1000 OFFSET 180000;
1000 rows in set (0.08 sec)
```
c3) Big table, one day of data, biggest offset:
```
SELECT uuid from Edit_13457736_15423246 WHERE timestamp >= '20170101000000' AND timestamp < '20170102000000' LIMIT 1000 OFFSET 360000;
1000 rows in set (0.16 sec)
```
This date range has around 366000 events. With batches of 1000 events, we'll need 366 update commands. On average, it seems that the SELECT/LIMIT/OFFSET will take around 0.1 seconds. 366 * 0.1 seconds = 36 seconds :-) This demonstrates that the script as it is today would work perfectly for day-to-day purging, provided we execute it once per day!
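The arithmetic behind the three estimates above can be condensed into a small back-of-the-envelope model (numbers taken from the measurements; this is only a sketch, not the script itself):

```python
def estimated_runtime_hours(total_events, batch_size, avg_select_seconds):
    """Rough cost of the SELECT/LIMIT/OFFSET strategy: one paging query
    per batch, each paying the (offset-dependent) average latency."""
    batches = total_events // batch_size
    return batches * avg_select_seconds / 3600.0

whole_table = estimated_runtime_hours(170_000_000, 1000, 180)  # ~8500 hours
one_month = estimated_runtime_hours(14_000_000, 1000, 2.5)     # ~10 hours
one_day = estimated_runtime_hours(366_000, 1000, 0.1)          # ~36 seconds
```

The model ignores the UPDATE cost itself; it only shows how the offset-dependent paging latency dominates once the scanned range grows.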
I have not much to add; both Riccardo and I warned about the performance penalty of large offsets, and that a different strategy may be needed there (a scan by id and then the filter, for example). SQL tables are not ordered, but unordered sets, so they are not optimized for that (OLTP systems are optimized for point selects and updates).
What @jcrespo said; see also my comment on https://gerrit.wikimedia.org/r/#/c/356383/12/modules/role/files/mariadb/eventlogging_cleaner.py@206 regarding the addition of an ORDER BY.
In case this will be the chosen approach, please monitor/log carefully the timings in the real environment where the whole script is run with deletes, selects and updates.
Thanks a lot for all the suggestions and feedback.
Marcel changed the logic a bit, replacing LIMIT/OFFSET with something simpler. A batch of uuids, limited to a fixed number, is retrieved together with its timestamps. Once we UPDATE them according to the whitelist, we retrieve the next batch starting from the last timestamp returned by the previous one. This has the consequence of updating/sanitizing the last rows of the batch twice (the ones sharing the last timestamp), since we don't want to miss any row due to the LIMIT boundary.
I think this is a good enough solution for this use case: it solves the big performance problem that Riccardo and Jaime pointed out without adding much complexity to the script.
Please let me know your thoughts :)
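The batching scheme described above can be sketched roughly like this (a simplified model using a generic DB-API cursor; names and details are illustrative, not the actual eventlogging_cleaner.py code):

```python
# Keyset-style batching sketch: instead of OFFSET, each batch restarts
# from the last timestamp seen, so paging cost stays constant. Rows that
# share the boundary timestamp are fetched (and sanitized) twice on
# purpose, so nothing is skipped at the LIMIT boundary.

def fetch_batch(cursor, table, start_ts, end_ts, batch_size):
    """Return one batch of (uuid, timestamp) rows plus the timestamp
    the next batch should restart from."""
    cursor.execute(
        "SELECT uuid, timestamp FROM `{}` "
        "WHERE timestamp >= %s AND timestamp < %s "
        "ORDER BY timestamp LIMIT %s".format(table),
        (start_ts, end_ts, batch_size),
    )
    rows = cursor.fetchall()
    # Boundary timestamp; rows carrying it get re-scanned next iteration.
    next_start = rows[-1][1] if rows else None
    return rows, next_start
```

The ORDER BY makes the "restart from last timestamp" invariant hold, which is exactly what the OFFSET-based version lacked.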
@Volans: still need to finalize some late requirements, but we should be close. I'll ask Marcel for a final review and then I'll let you know :)
Change 364949 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_purging_whitelist.tsv: remove unnecessary schemas
Change 364949 merged by Elukey:
[operations/puppet@production] eventlogging_purging_whitelist.tsv: remove unnecessary schemas
Change 364956 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_cleaner.py: separate logs between stdout/stderr
Change 364956 merged by Elukey:
[operations/puppet@production] eventlogging_cleaner.py: split logs between stdout/stderr
Change 364961 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_cleaner.py: set default loglevel for the main logger
Change 364961 merged by Elukey:
[operations/puppet@production] eventlogging_cleaner.py: set default loglevel for the main logger
Change 365007 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] role::mariadb::misc::eventlogging: remove the readonly constraint for slaves
Change 365007 merged by Elukey:
[operations/puppet@production] role::mariadb::misc::eventlogging: remove the readonly constraint for slaves
Change 365027 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] Remove 2 schemas from EventLogging purging white-list
Change 365027 abandoned by Mforns:
Remove 2 schemas from EventLogging purging white-list
Reason:
Already done, sorry.
So today I did a bit of calculation of the UPDATE timings for the MobileWebUIClickTracking_10742159_15423246 table (500GB in size), and this is the outcome:
The current logic selects uuids and applies the UPDATE statement only to them, but it really takes a lot of time to complete. We ran the following experiment, comparing a query that only uses timestamps with one that uses uuids:
```
elukey@dbstore1002:~$ cat MobileWebUIClickTrackingUUIDS_CSV.txt
UPDATE MobileWebUIClickTracking_10742159_15423246
    SET userAgent = NULL
    WHERE uuid IN ('92394923042344234', [...5k uuids...] )

elukey@dbstore1002:~$ time sudo mysql --skip-ssl -Dlog < MobileWebUIClickTrackingUUIDS_CSV.txt
real    0m31.098s
user    0m0.000s
sys     0m0.009s

elukey@dbstore1002:~$ sudo mysql --skip-ssl
MariaDB [(none)]> use log
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed

MariaDB [log]> UPDATE MobileWebUIClickTracking_10742159_15423246 SET userAgent = NULL
    WHERE timestamp >= '20150101000000' AND timestamp < '20150102041922';
Query OK, 0 rows affected (0.07 sec)
Rows matched: 4999  Changed: 0  Warnings: 0

MariaDB [log]> UPDATE MobileWebUIClickTracking_10742159_15423246 SET userAgent = "test"
    WHERE timestamp >= '20150101000000' AND timestamp < '20150102041922';
Query OK, 4999 rows affected (0.22 sec)
Rows matched: 4999  Changed: 4999  Warnings: 0

MariaDB [log]> UPDATE MobileWebUIClickTracking_10742159_15423246 SET userAgent = NULL
    WHERE timestamp >= '20150101000000' AND timestamp < '20150102041922';
Query OK, 4999 rows affected (0.23 sec)
Rows matched: 4999  Changed: 4999  Warnings: 0

MariaDB [log]> UPDATE MobileWebUIClickTracking_10742159_15423246 SET userAgent = NULL
    WHERE timestamp >= '20150101000000' AND timestamp < '20150102041922';
Query OK, 0 rows affected (0.09 sec)
Rows matched: 4999  Changed: 0  Warnings: 0
```
The speed difference is huge. The whole point of using uuids instead of timestamps directly was that we couldn't guarantee a batch was exactly of size X (because some elements might lie on the boundary of the end timestamp).
We are going to work on a code change that makes the script use only timestamps, and will ask for reviews later on.
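Under a timestamp-only scheme, each batch becomes a single set-based statement, roughly like this (illustrative sketch with a generic DB-API cursor, not the actual patch):

```python
# Timestamp-only strategy sketch: one set-based UPDATE per time window
# instead of a huge 'WHERE uuid IN (...)' list. The set_clause would be
# derived from the whitelist (e.g. "userAgent = NULL"); all names are
# illustrative.

def sanitize_range(cursor, table, set_clause, start_ts, end_ts):
    """Sanitize all rows in [start_ts, end_ts) with a single statement."""
    cursor.execute(
        "UPDATE `{}` SET {} "
        "WHERE timestamp >= %s AND timestamp < %s".format(table, set_clause),
        (start_ts, end_ts),
    )
    return cursor.rowcount  # rows matched in this window
```

The trade-off, as noted above, is that the window boundaries no longer guarantee an exact batch size; the measurements suggest that's well worth it.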
Change 374823 had a related patch set uploaded (by Mforns; owner: Mforns):
[operations/puppet@production] [WIP] Optimize EventLogging purging script using timestamps
Change 374823 merged by Elukey:
[operations/puppet@production] eventlogging_cleaner.py: improve sanitize method
The new script is running on db1047, re-running from 2009 up to mid-2016 data. Everything seems to work fine, and afaics the UPDATE statements are much quicker. For example, the Edit* table updates are taking ~10/15 seconds to complete. This means even the super huge tables would take at most a couple of days to be sanitized completely.
So the plan is to:
Change 375995 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_cleaner.py: force timestamp to CHAR
Change 375995 merged by Elukey:
[operations/puppet@production] eventlogging_cleaner.py: force timestamp to CHAR
The run of the script was too aggressive; the dbs started to thrash due to disk saturation:
I started the script again, only on db1047, with batch_size 10k and sleep_between_batches 4s, which should sanitize ~1.4 billion rows per day if I made my calculations correctly. The current disk saturation is in the 80% range and seems stable.
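For what it's worth, the throughput ceiling implied by a batch size and sleep interval can be estimated as below. The real rate also depends heavily on how long each UPDATE actually takes, so the per-UPDATE time here is a hypothetical assumption and actual figures have to be read off the host's metrics:

```python
def max_rows_per_day(batch_size, sleep_seconds, update_seconds):
    """Upper bound on rows sanitized per day by one sequential worker:
    each cycle costs one batch UPDATE plus the inter-batch sleep."""
    cycle = sleep_seconds + update_seconds
    return int(86400 / cycle * batch_size)

# e.g. 10k-row batches, 4 s sleep, assuming (hypothetically) 1 s per UPDATE:
rate = max_rows_per_day(10_000, 4, 1)  # 172,800,000 rows/day ceiling
```

Shrinking the sleep or growing the batch raises the ceiling linearly, at the cost of more sustained IO pressure, which is exactly the trade-off being tuned here.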
The graph above shows more or less the disk saturation on db1047 (analytics-slave) caused by:
I checked with pidstat -dlr 10 before stopping each script/daemon/sync/etc., and mysqld was responsible for the IO.
Started eventlogging_cleaner.py again on both slaves, with batch-size 1000 and sleep 4s. This seems to work in a reasonable amount of time, but completing the huge tables will take a long time..
Imagine we have purged everything except the giant tables. Would we want to consider stopping EL and replication for like 1 day, to be able to purge them (with the initial 10-100k batch size and 1 second sleep)?
Probably not; the huge UPDATE batches affect disk usage too much..
But maybe, if we stop replication from both the EL master and the MediaWiki dbs, we'll have enough disk headroom to use 10-100K batches (1 sec sleep), no?
Disk usage is already at 80% when the purging script is launched, so it's very likely that the script's performance would be a lot higher if the other scripts were paused, no?
We could maybe perform a 30-minute test on a big table and see how many rows get updated?
Change 378236 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] eventlogging_cleaner.py: add feature to pick start_ts from file
Change 378236 merged by Elukey:
[operations/puppet@production] eventlogging_cleaner.py: add feature to pick start_ts from file
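The "start_ts from file" feature presumably behaves like a checkpoint: read the last sanitized timestamp at startup and persist the new high-water mark afterwards. A minimal sketch, assuming a plain-text file holding a yyyymmddHHMMSS string (the path and format are assumptions, not the script's actual interface):

```python
import os

# Hypothetical checkpoint path, not the script's real location.
CHECKPOINT = "/var/run/eventlogging_cleaner.start_ts"

def read_start_ts(default_ts):
    """Resume from the last persisted timestamp, or from a default."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            ts = f.read().strip()
        if len(ts) == 14 and ts.isdigit():  # yyyymmddHHMMSS sanity check
            return ts
    return default_ts

def write_start_ts(ts):
    """Persist the new high-water mark atomically via rename."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        f.write(ts)
    os.replace(tmp, CHECKPOINT)
```

The atomic rename matters for a cron-driven cleaner: a crash mid-write must not leave a truncated timestamp that would make the next run re-scan from the wrong point.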
Change 391828 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mariadb::misc::eventlogging:replication: add EL sanitization cron
Probably I am stating the obvious, but some throttling could be disabled if it only affects the new hosts (not dbstore1002), as they should have much less IO impact (larger memory) and much higher IO capacity.
Change 391828 merged by Elukey:
[operations/puppet@production] profile::mariadb::misc::eventlogging:replication: add EL sanitization cron
With throttling I meant making either replication or sanitization faster; I think it created load problems in the past, according to this ticket.
Ah yes, sure. I've already run two batches of sanitization with 100k rows per batch and db1108 didn't show any issue (so data is now sanitized up to 90 days ago, as expected). With the daily volume I think that 10k is fine.
Replication should be fine as well; we had only one problem in the past, due to a very high-frequency eventlogging schema, where even with 10k batches we were not able to catch up with the data, so we documented that limitation of the current replication script.
Change 392788 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mariadb::misc::el::replication: fix el cleaner cronjob
Change 392788 merged by Elukey:
[operations/puppet@production] profile::mariadb::misc::el::replication: fix el cleaner cronjob
Change 392807 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] profile::mariadb::misc::eventlogging: use /var/log/eventlogging in el_sync
Change 392807 merged by Elukey:
[operations/puppet@production] profile::mariadb::misc::eventlogging: use /var/log/eventlogging in el_sync