
Create purging script for mediawiki-history data
Closed, Resolved · Public · 8 Estimated Story Points

Description

We import full datasets into Hadoop regularly, and it doesn't make sense to keep many snapshots:

  • /wmf/data/raw/mediawiki/[tables|project_namespace_map]/*/snapshot=YYYY-MM
  • /wmf/data/wmf/mediawiki/[user_history|page_history|history|metrics]/snapshot=YYYY-MM

We should maintain 2 snapshots: one with the current month's imported data, and another with the previous month's. Older ones should be removed (after the current month's import succeeds).
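
A minimal sketch of that retention logic, assuming a hypothetical standalone script that drives the hdfs CLI (the paths and helper names below are illustrative, not the actual refinery script):

import subprocess

# Hypothetical base directories, mirroring the layout described above.
SNAPSHOT_BASES = [
    '/wmf/data/wmf/mediawiki/user_history',
    '/wmf/data/wmf/mediawiki/page_history',
    '/wmf/data/wmf/mediawiki/history',
    '/wmf/data/wmf/mediawiki/metrics',
]
KEEP = 2  # the current-month snapshot plus the previous one

def list_snapshots(base):
    """Return the snapshot=YYYY-MM subdirectories of base, oldest first."""
    out = subprocess.check_output(['hdfs', 'dfs', '-ls', '-C', base], text=True)
    # YYYY-MM names sort chronologically as plain strings.
    return sorted(line for line in out.splitlines() if '/snapshot=' in line)

def purge_old_snapshots(base, keep=KEEP):
    """Delete all but the keep most recent snapshot directories."""
    # Run only after the current month's import has succeeded.
    for path in list_snapshots(base)[:-keep]:
        subprocess.check_call(['hdfs', 'dfs', '-rm', '-r', '-skipTrash', path])

for base in SNAPSHOT_BASES:
    purge_old_snapshots(base)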

Event Timeline

The reason to purge is space: we do not need to keep redundant information.

We can create a new script and schedule the dropping code via puppet.

Nuria set the point value for this task to 5. Apr 3 2017, 4:05 PM

Question to solve: In some tables, the snapshot partition has sub-partitions (for instance, the wmf_raw.mediawiki_user table is partitioned by (snapshot, wikidb)). Can we drop partitions that contain sub-partitions, or do we need to iterate over every sub-partition?

Nuria moved this task from In Progress to Next Up on the Analytics-Kanban board.
Nuria added a subscriber: fdans.
Milimetric triaged this task as Medium priority. May 8 2017, 2:02 PM
Milimetric moved this task from In Progress to Paused on the Analytics-Kanban board.

@JAllemandou

Can we drop partitions that contain subpartitions, or do we need to iterate over every subpartition?

Thanks for the heads-up!
In my ignorance, though, I don't understand what the potential problem of deleting the whole snapshot directory (with sub-partitions in it) could be. Do you think Hadoop won't permit it? Or is it more a data-size problem?

Also, another question :]
I assume that the script only has to remove data directories, and NOT touch the metastore, because the table schemas won't change. Is that correct? Thanks :)

As you've guessed, it would be better to drop the Hive partitions in addition to deleting the data.
The problem would not be with the directory, but with the Hive partition (and sub-partitions in the case of raw tables).
I actually had a case where I needed to try it, so now I know: for wmf_raw.mediawiki_[TABLENAME] tables (user, page, etc.), you can do:

ALTER TABLE wmf_raw.mediawiki_[TABLENAME] DROP PARTITION(snapshot = '[SNAPSHOTNAME]', wiki != '');

The wmf history tables have no sub-partitions, so:

ALTER TABLE wmf.mediawiki_HISTORY DROP PARTITION(snapshot = '[SNAPSHOTNAME]');

And the last one is wmf.mediawiki_metrics, with 3 partition layers:

ALTER TABLE wmf.mediawiki_metrics DROP PARTITION(snapshot = '[SNAPSHOTNAME]', metric != '', wikidb != '');

This should work I think :)
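
For illustration, a hedged sketch of how a purging script might issue those three statements through the hive CLI; the DROP IF EXISTS guard and the example snapshot value are assumptions on top of Joseph's commands:

import subprocess

# (table, extra sub-partition predicates), following the statements above.
DROP_SPECS = [
    ('wmf_raw.mediawiki_user', "wiki != ''"),                 # raw: snapshot + wiki
    ('wmf.mediawiki_history', None),                          # history: snapshot only
    ('wmf.mediawiki_metrics', "metric != '', wikidb != ''"),  # metrics: 3 layers
]

def drop_snapshot(table, snapshot, extra=None):
    """Drop one snapshot partition (and its sub-partitions) from a Hive table."""
    spec = "snapshot = '%s'" % snapshot
    if extra:
        spec += ', ' + extra
    statement = 'ALTER TABLE %s DROP IF EXISTS PARTITION(%s);' % (table, spec)
    subprocess.check_call(['hive', '-e', statement])

for table, extra in DROP_SPECS:
    drop_snapshot(table, '2016-12', extra)  # '2016-12' is only an example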

Change 355601 had a related patch set uploaded (by Mforns; owner: Mforns):
[analytics/refinery@master] Add script to purge old mediawiki data snapshots

https://gerrit.wikimedia.org/r/355601

mforns changed the point value for this task from 5 to 8. May 29 2017, 12:00 PM

After looking into this a bit more, I think we should keep 6 months of snapshots if we can afford the space. A simple error-fixing job could lead to deleting a past snapshot, so I think having two is too few. Since the data is public, the consideration that matters is space.
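
If the retention count may change again, it is better kept configurable than hard-coded; a hypothetical argparse front end for the sketch in the description could look like:

import argparse

parser = argparse.ArgumentParser(
    description='Purge old mediawiki-history snapshots from HDFS and Hive.')
parser.add_argument('--keep-snapshots', type=int, default=6,
                    help='number of most recent snapshots to keep (default: 6)')
args = parser.parse_args()

for base in SNAPSHOT_BASES:  # from the earlier sketch
    purge_old_snapshots(base, keep=args.keep_snapshots)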

We can merge this change as soon as we have tested it; just noting here that we need a companion puppet change for it to be effective.

@Nuria
I applied the changes that Joseph suggested, and also the ones to make the logging compatible with our cronjob scheme.
I also tested it in Hadoop, and everything works fine. I think this can be given a final review and be merged if OK.

(I had no time to do the puppet patch to add the cronjob, my bad)

Change 376640 had a related patch set uploaded (by Nuria; owner: Nuria):
[operations/puppet@production] Add cron to purge old mediawiki data snapshots

https://gerrit.wikimedia.org/r/376640

Change 355601 merged by Nuria:
[analytics/refinery@master] Add script to purge old mediawiki data snapshots

https://gerrit.wikimedia.org/r/355601

Change 376640 merged by Elukey:
[operations/puppet@production] Add cron to purge old mediawiki data snapshots

https://gerrit.wikimedia.org/r/376640