
Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx}
Closed, Resolved · Public · 8 Story Points

Description

Edit schema needs purging, table is too big
https://meta.wikimedia.org/wiki/Schema_talk:Edit

Talk to the schema owners; the table has grown too large for feasible queries in our current system (250G in a month or so). We need to discuss how to deal with this going forward.

Details

Related Gerrit Patches:
mediawiki/extensions/MobileFrontend (wmf/1.27.0-wmf.12): Sample MobileFrontend edit events at 6.25%
mediawiki/extensions/MobileFrontend (master): Sample MobileFrontend edit events at 6.25%

Event Timeline

Nuria raised the priority of this task to Needs Triage. Jan 25 2016, 5:47 PM
Nuria added a project: Analytics-Kanban.
Nuria added a subscriber: Nuria.
Restricted Application added subscribers: StudiesWorld, Aklapper. Jan 25 2016, 5:47 PM
Nuria set Security to None. Jan 25 2016, 5:48 PM
Nuria added a subscriber: Neil_P._Quinn_WMF.
Milimetric renamed this task from Edit schema needs purging, table is too big for queries to run (500G before conversion) to Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx} [8 pts]. Jan 25 2016, 5:57 PM
Milimetric updated the task description.

I imagine purging refers to the deletion of old rows? And I imagine you have the tools to do automatic purging already, so it's just a matter of us deciding how long we want to retain the logs?

jcrespo added a subscriber: jcrespo. Edited Jan 25 2016, 6:42 PM

it's just a matter of us deciding how long we want to retain the logs

Exactly that. Or filtering more (sampling), etc.

Nuria added a comment. Jan 25 2016, 6:45 PM

@Neil_P._Quinn_WMF: Correct; for most schemas we purge after 90 days.

I talked with @madhuvishy today, and she suggested another possibility: storing all the data in Hive, and only a subset or none at all in MariaDB. This would require sacrificing the ability to easily join the Edit logs to MediaWiki tables, although we could likely work around that by importing subsets of data into MariaDB tables as needed.
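
As a rough illustration of that workaround (everything here is an assumption for the sake of example; the staging database, file path, column list, and types are not an agreed design), the Hive-side subset could be exported to a TSV and loaded into a scratch MariaDB table that can then be joined against MediaWiki tables:

-- Hypothetical sketch only: load a Hive-exported subset into a scratch table.
CREATE TABLE staging.edit_sample (
    event_editingSessionId VARBINARY(191),
    timestamp              VARBINARY(14),
    wiki                   VARBINARY(191)
);

LOAD DATA LOCAL INFILE '/tmp/edit_sample.tsv'
INTO TABLE staging.edit_sample
FIELDS TERMINATED BY '\t';
-- The scratch table can then be joined against MediaWiki tables as usual.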

It looks like we could reduce the logging rate to a quarter of what it is now just by reducing the sampling rate for mobile edits from 100% to 6.25%, the same as it is for desktop edits. I think I had implicitly classed mobile editing as a minor factor, so I wasn't expecting it to have that kind of impact.

So that takes care of part of the issue. I'll keep thinking about purging policies and such.

@Nuria, given that we have a lot more mobile events than we want or need, what if (rather than purging the table in a time-based fashion) we go back through the mobile events and delete 15 out of every 16, in effect retroactively applying 6.25% sampling? I don't know how hard that would be to do, or how easily I could help. I could certainly make a list of rows to delete based on the primary key.

Also, I'm working on reworking the schema entirely so at some point in the next couple of weeks we should be deploying a new version of the schema. That'll freeze the table at its current size; I don't know how much that'll help.

Nuria added a comment. Edited Feb 1 2016, 11:48 PM

@Neil_P._Quinn_WMF: if this were a flat file it would be really easy to remove 15 out of 16; in the db I am not so sure.

cc-ing @jcrespo for good ideas on how to remove data from the table in a uniform fashion.

We can delete every nth row in batches using pt-archiver. For example, we can key off the "seconds" digits of the timestamp column, which seem to be distributed uniformly. You only need to say what percentage of rows to delete. But I really recommend applying the sampling first, so that this is not done on a live table.

mysql> SELECT count(*) FROM Edit_13457736 WHERE timestamp BETWEEN '20160101' and '20160102';
+----------+
| count(*) |
+----------+
|  2440650 |
+----------+
1 row in set (2.40 sec)

mysql> SELECT count(*) FROM Edit_13457736 WHERE timestamp BETWEEN '20160101' and '20160102' AND timestamp like '%0';
+----------+
| count(*) |
+----------+
|   244829 |
+----------+
1 row in set (3.12 sec)

I think you'd want to delete a percentage of distinct editing sessions rather than rows.

Nuria added a comment. Feb 2 2016, 4:47 PM

@Neil_P._Quinn_WMF: can you chime in on whether @jcrespo's approach would work?

I think the core idea is sound, but as Krenair points out it would make more sense to use the condition event_editingSessionId NOT LIKE '0%', which is the same condition we would use to sample events on the client.
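
For context, a quick sanity check of that condition (a sketch only, assuming session IDs are uniformly distributed hexadecimal strings): rows whose session ID starts with '0' should be roughly 1/16 of the total, so deleting the NOT LIKE '0%' rows retroactively applies ~6.25% sampling.

-- Sketch: the fraction of rows that would survive the purge condition.
-- Expected to be close to 1/16 = 0.0625 if session IDs are uniform hex strings.
SELECT
    SUM(event_editingSessionId LIKE '0%') / COUNT(*) AS kept_fraction
FROM Edit_13457736
WHERE timestamp BETWEEN '20160101' AND '20160102';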

But I really recommend the sampling to be applied first so it is not a live table.

@jcrespo, I'm working on implementing more sampling (T124845) right now. That won't cause a new table to be created but I'm also working on a more general revision of the schema (T118063) that will. Should this kind of purge just wait for the sampling, or should it wait for the general revision as well?

Nuria added a comment. Feb 2 2016, 9:39 PM

@Neil_P._Quinn_WMF: let's do the purge before you are done with your schema revision. The purge is a solution for data volume, not correctness, so both things can happen in parallel.

Neil_P._Quinn_WMF added a comment. Edited Feb 2 2016, 10:07 PM

@Nuria, I'm fine with that. But it sounded like Jaime was advising against it. Let's wait until we've clarified that.

Also, unless this is more urgent than I realize, I think it makes sense to wait until I've applied 6.25% sampling across all the event sources, so we only have to run the purge once. I'm hoping that will be done in next week's deploy.

Nuria added a comment. Feb 4 2016, 4:35 PM

Also, please see: https://phabricator.wikimedia.org/T124383

Once the size issues are fixed, these jobs can be re-enabled.

Nuria moved this task from Next Up to In Progress on the Analytics-Kanban board. Feb 4 2016, 5:39 PM

Change 268568 had a related patch set uploaded (by Neil P. Quinn-WMF):
Sample MobileFrontend edit events at 6.25%

https://gerrit.wikimedia.org/r/268568

Change 268568 merged by jenkins-bot:
Sample MobileFrontend edit events at 6.25%

https://gerrit.wikimedia.org/r/268568

Change 269293 had a related patch set uploaded (by Jforrester):
Sample MobileFrontend edit events at 6.25%

https://gerrit.wikimedia.org/r/269293

Change 269293 merged by jenkins-bot:
Sample MobileFrontend edit events at 6.25%

https://gerrit.wikimedia.org/r/269293

Jdforrester-WMF added a subscriber: Jdforrester-WMF.

This is now good to go, from 2016-02-08 backwards (inclusive)

FYI this caused a regression. See T126497

Yes, @jcrespo, @Nuria, we're ready to purge using the following logic. Let us know what your timeline looks like.

DELETE FROM Edit_13457736
WHERE 
event_editingSessionId NOT LIKE '0%' AND
timestamp < "20160209014000";

The time restriction is because that's when the new sampling rates took effect on the clients; we don't want to sample the same data twice.
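
(A dry-run count with the same conditions, sketched below, could be used to see how many rows the purge would touch before anything is removed.)

-- Sketch: count the rows the DELETE above would remove, without removing anything.
SELECT COUNT(*)
FROM Edit_13457736
WHERE 
event_editingSessionId NOT LIKE '0%' AND
timestamp < "20160209014000";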

jcrespo claimed this task. Feb 11 2016, 8:15 AM

Please do not run that, as is. I will take care of creating a batched job to do that.
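
(For illustration only: the actual job was run with pt-archiver, but the batching idea it implements can be sketched in plain SQL as repeated small deletes, so replication and concurrent inserts are never blocked by one enormous transaction. The batch size below is an arbitrary assumption.)

-- Sketch of a batched purge: repeat (with a pause between runs) until zero rows
-- are affected. 50000 is an arbitrary example batch size.
DELETE FROM Edit_13457736
WHERE
event_editingSessionId NOT LIKE '0%' AND
timestamp < "20160209014000"
LIMIT 50000;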

jcrespo moved this task from Triage to Backlog on the DBA board. Feb 11 2016, 6:24 PM
Neil_P._Quinn_WMF added a comment. Edited Feb 11 2016, 6:27 PM

@jcrespo, don't worry! I meant "we're ready [for you] to purge" using pt-archiver; just give me an idea of when you might get to it so I know when I can get to T126058.

I would like to say this week; if not, early next week.

jcrespo triaged this task as Medium priority. Feb 11 2016, 7:23 PM
jcrespo moved this task from Backlog to In progress on the DBA board. Feb 12 2016, 12:23 PM

This is running now (and during the weekend) on the analytics slave. I am doing it slowly so it doesn't affect the rest of the data being inserted.

I am doing it on the analytics slave, so if something goes wrong, we can revert it with data from the master. After it is complete, I will ask you to validate it for ~ a week, then I will irreversibly delete it (at least, from mysql) on all machines.

jcrespo removed jcrespo as the assignee of this task. Feb 16 2016, 1:07 PM

@Neil_P._Quinn_WMF This is done; it took 40 hours to purge those rows. Please check that it was done correctly against the analytics slave dbstore1002, and then I will perform it on the master and the other slaves, too.

@jcrespo, I don't totally understand. If the purged data is in dbstore1002 (= analytics-store), what other data should I compare it with? The only databases I've ever accessed are analytics-store and x1-analytics-slave.

With "before"? Do you see anything strange on your graphs? Too many missing points? Errors, etc. Anything "strange" before I delete everything forever.

@jcrespo, oh, I see. I thought you wanted me to compare it with the master database.

I delete everything forever.

"Before I apply the same change to the other hosts that you cannot see, but that would otherwise allow me to recover if this had been done incorrectly." :-)

@jcrespo, I just found one problem: the data in the Edit logs (on analytics-store) stops on 29 January. I've spot checked some other event log tables, and they don't have the same issue. The number of events on 29 January is also much higher than on previous days.

29 January was about a week before we made the sampling changes in mobile and VE.

SELECT 
	LEFT(timestamp, 8) as day,
	COUNT(*)
FROM Edit_13457736
WHERE
	timestamp > "20160125"
GROUP BY day;

day		COUNT(*)
20160125	537607
20160126	524772
20160127	563500
20160128	612954
20160129	1184511
jcrespo added a comment. Edited Feb 17 2016, 8:33 AM

There is a problem I hadn't realized: the automatic syncing process detected missing rows and has started refilling them like crazy, delaying proper replication of the data. As you can see here:

Master
$ mysql -BN -A log -e "SHOW TABLES like 'Edit\_%'" | while read table; do echo -n "$table: "; mysql -BN -A log -e "SELECT MAX(timestamp) FROM $table"; done
Edit_10604157: NULL
Edit_10676603: NULL
Edit_11319708: NULL
Edit_11448630: 20160127011701
Edit_13457736: 20160217081942
Edit_5563071: NULL
Edit_5570274: NULL
Slave
$ mysql -BN -A log -e "SHOW TABLES like 'Edit\_%'" | while read table; do echo -n "$table: "; mysql -BN -A log -e "SELECT MAX(timestamp) FROM $table"; done
Edit_10604157: 20150312123640
Edit_10676603: 20150430164114
Edit_11319708: 20150506153533
Edit_11448630: 20160127011701
Edit_13457736: 20160129150029
Edit_5563071: 20130618000918
Edit_5570274: 20150506071731

Data up to the latest day is on the master, only delayed by the automatic backfilling and by the purging. So, aside from this, is there anything else, or can I delete this (again, on both the slave and the master this time)? I can give you the result of that query on the master, which, except for the purging, the slave will soon match:

MariaDB EVENTLOGGING m4 localhost log > SELECT 
    -> LEFT(timestamp, 8) as day,
    -> COUNT(*)
    -> FROM Edit_13457736
    -> WHERE
    -> timestamp > "20160125"
    -> GROUP BY day;
+----------+----------+
| day      | COUNT(*) |
+----------+----------+
| 20160125 |  2798327 |
| 20160126 |  2712766 |
| 20160127 |  2796390 |
| 20160128 |  2768918 |
| 20160129 |  2560945 |
| 20160130 |  2662054 |
| 20160131 |  2917958 |
| 20160201 |  2785474 |
| 20160202 |  2716151 |
| 20160203 |  2711656 |
| 20160204 |  2627985 |
| 20160205 |  2493990 |
| 20160206 |  2548298 |
| 20160207 |  2819553 |
| 20160208 |  2747151 |
| 20160209 |   692748 |
| 20160210 |   538933 |
| 20160211 |   592494 |
| 20160212 |   539869 |
| 20160213 |   511462 |
| 20160214 |   550950 |
| 20160215 |   576825 |
| 20160216 |   582528 |
| 20160217 |   152598 |
+----------+----------+

The numbers seem to match between the purged table on the slave and the filtered counts on the master?
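
(The comparison being made, roughly sketched: on the not-yet-purged master, the per-day count of rows whose session ID starts with '0' should approximate the purged slave's per-day counts shown earlier.)

-- Sketch: per-day counts on the master of rows that would survive the purge,
-- for comparison against the purged slave's numbers.
SELECT
    LEFT(timestamp, 8) AS day,
    SUM(event_editingSessionId LIKE '0%') AS rows_kept_by_purge,
    COUNT(*) AS rows_before_purge
FROM Edit_13457736
WHERE timestamp BETWEEN "20160125" AND "20160209"
GROUP BY day;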

@jcrespo, ignoring the cut-off data and spike on 29 January, the data on the slave looks reasonable to me (I'm attaching the spreadsheet I used to check for posterity's sake). You can go ahead and delete the old data permanently.

jcrespo claimed this task. Feb 19 2016, 8:45 AM
jcrespo moved this task from Blocked external/Not db team to In progress on the DBA board.

Applying to all servers now.

This is still ongoing on the master (it ended on the slaves). Purged 121 million rows out of approximately 265 million.

Milimetric renamed this task from Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx} [8 pts] to Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx}. Feb 22 2016, 9:05 PM
Milimetric set the point value for this task to 8.
Neil_P._Quinn_WMF renamed this task from Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx} to Edit schema needs purging, table is too big for queries to run (500G before conversion) {oryx}. Feb 22 2016, 10:32 PM

205 million rows purged.

Neil_P._Quinn_WMF added a comment. Edited Feb 23 2016, 5:08 PM

Thank you, @jcrespo! One question: it looks like the data on the slave still cuts off on 30 January. It seems to be refilling, but only very slowly. Do you know how long it should take to catch up, or how I can check the replication lag for that table?

I will manually make sure to resync it. A side effect of the purging is that it slows down replication. However, if I manually resync it now, it may copy events that I would then have to re-purge :-)

Let it finish (it should be soon) and I promise it will go back to a good state, with the additional improvement of now going faster due to the lower number of events.

jcrespo added a comment. Edited Feb 23 2016, 5:34 PM

Regarding replication lag, I shared a script I use (which basically compares the latest timestamp of each table) with one of your operators. I think he is working (or will work) on setting up something public for that, or an alarm, on T124306.

There are a few tables affected by this:

$ bash check_eventlogging_lag.sh dbstore1002
[Warning] DeprecatedUsage_7906187 lag: 02:49:20
[CRITICAL] EchoInteraction_15180901 lag > 1 day (134661 seconds)
[CRITICAL] Edit_13457736 lag > 1 day (2073306 seconds)
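
(For reference, a minimal sketch of the kind of per-table check such a script can perform, under the assumption that lag is approximated as the gap between now and the newest event timestamp in the table:)

-- Sketch: approximate lag of one table from its newest 14-digit event timestamp.
SELECT TIMESTAMPDIFF(
    SECOND,
    STR_TO_DATE(MAX(timestamp), '%Y%m%d%H%i%s'),
    UTC_TIMESTAMP()
) AS lag_seconds
FROM Edit_13457736;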

@jcrespo, ah, okay, I misunderstood your message yesterday and thought the purge had finished on the master. I'll be patient :)

It should have finished now, after deleting 275047166 rows. I am rerunning the script to make sure that is true, and tomorrow I will resync all tables to the master.

The purging has finished on the master; I am running it once more on all slaves now that deleted events cannot be sent anymore. I am also resyncing the tables that got behind so they are as up to date as possible. I will ping one last time when the lag is close to 0 on all schemas, and then I will re-enable the purging.

jcrespo moved this task from In progress to Done on the DBA board. Feb 25 2016, 7:41 PM

Lag should be ok now. I just need a last ok from you to resolve this:

mysql> SELECT 
    -> LEFT(timestamp, 8) as day,
    -> COUNT(*)
    -> FROM Edit_13457736
    -> WHERE
    -> timestamp > "20160125"
    -> GROUP BY day;
+----------+----------+
| day      | COUNT(*) |
+----------+----------+
| 20160125 |   537607 |
| 20160126 |   524772 |
| 20160127 |   595237 |
| 20160128 |   555843 |
| 20160129 |   513733 |
| 20160130 |   504448 |
| 20160131 |   556729 |
| 20160201 |   550647 |
| 20160202 |   535601 |
| 20160203 |   534936 |
| 20160204 |   521187 |
| 20160205 |   497161 |
| 20160206 |   471641 |
| 20160207 |   499208 |
| 20160208 |   534005 |
| 20160209 |   571428 |
| 20160210 |   538933 |
| 20160211 |   592494 |
| 20160212 |   539869 |
| 20160213 |   511462 |
| 20160214 |   550950 |
| 20160215 |   576825 |
| 20160216 |   582528 |
| 20160217 |   557550 |
| 20160218 |   532826 |
| 20160219 |   522700 |
| 20160220 |   516598 |
| 20160221 |   543912 |
| 20160222 |   553830 |
| 20160223 |   550432 |
| 20160224 |   554612 |
| 20160225 |   370185 |
+----------+----------+
32 rows in set (35.25 sec)
jcrespo removed jcrespo as the assignee of this task. Feb 25 2016, 7:45 PM

@jcrespo, yes, the data on the slave looks good. I think you can close this!

jcrespo closed this task as Resolved. Feb 26 2016, 7:36 AM
Milimetric moved this task from In Progress to Done on the Analytics-Kanban board. Mar 2 2016, 5:41 PM
Neil_P._Quinn_WMF raised the priority of this task from Medium to Needs Triage. Mar 29 2018, 9:06 AM
Neil_P._Quinn_WMF moved this task from Blocked to Radar on the Contributors-Analysis board.