
Sanitize and ingest all event tables into the event_sanitized database
Open, High, Public

Description

We are slowly creating more streams in Event Platform, but we don't currently sanitize and save them in the event_sanitized database like we do for legacy EventLogging data.

For example: mediawiki_mediasearch_interaction

Acceptance criteria:

  • Allowlist system – file(s) where users can specify retention policies for streams
  • Sanitization job that sanitizes event data from Event Platform streams according to retention policies specified with the allowlist system

TODO:

  • Refactor EventLoggingSanitization job to something more generic: RefineSanitize
  • Move all event_sanitized partitions to lowercased directory names to avoid re-refining data
  • Apply backwards compatible usage of RefineSanitize for eventlogging
  • Create new generic RefineSanitize job to be able to sanitize any data
  • Copy all existing data that the generic RefineSanitize job targets into event_sanitized
  • Create data purge job to remove data after 90 days from all tables in event
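
The 90-day purge in the last TODO item comes down to selecting time partitions older than a cutoff date. A minimal sketch of that selection follows. It assumes Hive-style year=/month=/day= partition paths, and the function name partitions_older_than and the sample usage are hypothetical. It only illustrates the idea and is not the purge job itself.

```shell
# Hypothetical helper: read partition paths on stdin and print those whose
# year=/month=/day= date is strictly older than the cutoff (YYYY-MM-DD).
# Assumes Hive-style partition directories; lines without all three parts are skipped.
partitions_older_than() {
    awk -v cutoff="$1" '
    {
        year = ""; month = ""; day = ""
        n = split($0, parts, "/")
        for (i = 1; i <= n; i++) {
            if (parts[i] ~ /^year=/)  year  = substr(parts[i], 6)
            if (parts[i] ~ /^month=/) month = substr(parts[i], 7)
            if (parts[i] ~ /^day=/)   day   = substr(parts[i], 5)
        }
        if (year == "" || month == "" || day == "") next
        # Zero-pad so that string comparison matches date order.
        date = sprintf("%04d-%02d-%02d", year, month, day)
        if ((date "") < (cutoff "")) print $0
    }'
}

# Example (the cutoff would normally be computed, e.g. with GNU date:
#   cutoff=$(date -u -d '90 days ago' +%F)  # assumes GNU coreutils):
# some_command_listing_partition_paths | partitions_older_than "$cutoff"
```

The cutoff is passed in explicitly so the selection stays deterministic and easy to dry-run before anything is deleted.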

Details

Project                    Branch      Lines +/-
operations/puppet          production  +1 -1
operations/puppet          production  +0 -13
operations/puppet          production  +2 -0
operations/puppet          production  +1 -0
operations/puppet          production  +23 -24
analytics/refinery         master      +1 -1
analytics/refinery         master      +3 -2
operations/puppet          production  +3 -3
operations/puppet          production  +2 -4
analytics/refinery/source  master      +3 -2
operations/puppet          production  +14 -1
analytics/refinery         master      +16 -0
analytics/refinery         master      +21 -0
operations/puppet          production  +51 -24
operations/puppet          production  +13 -64
operations/puppet          production  +6 -2
operations/puppet          production  +184 -100
operations/puppet          production  +69 -30
analytics/refinery         master      +1 -0
operations/puppet          production  +42 -4
analytics/refinery/source  master      +12 -5
operations/puppet          production  +21 -0
operations/puppet          production  +3 -0
operations/puppet          production  +1 -1
operations/puppet          production  +149 -45
operations/puppet          production  +5 -5
operations/puppet          production  +2 -2
operations/puppet          production  +1 -1
operations/puppet          production  +12 -1
operations/puppet          production  +5 -0
operations/puppet          production  +2 -2
operations/puppet          production  +96 -0
operations/puppet          production  +1 -1
operations/puppet          production  +20 -20
operations/puppet          production  +84 -126
operations/puppet          production  +10 -0
operations/puppet          production  +4 -4
analytics/refinery/source  master      +487 -523
analytics/refinery/source  master      +1 K -805
analytics/refinery/source  master      +194 -175

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 675936 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] refine: rename EventLoggingSanitization to RefineSanitize

https://gerrit.wikimedia.org/r/675936

Change 675939 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Use versioned refinery-job.jar in refine sanitize job

https://gerrit.wikimedia.org/r/675939

Change 675939 merged by Ottomata:

[operations/puppet@production] Use versioned refinery-job.jar in refine sanitize job

https://gerrit.wikimedia.org/r/675939

Change 676380 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Set up refine_sanitize jobs in analytics test cluster.

https://gerrit.wikimedia.org/r/676380

@JAllemandou https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/676131 fixes a bug in the validate() stuff we added to Refine: the default Config used to build help messages did not validate!

We'll need that before we can use the new Refine stuff.

Change 678607 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refine_job - remove support for RefineFailuresChecker and use 0.1.4 in test/refine

https://gerrit.wikimedia.org/r/678607

Change 678608 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refine - use refinery 0.1.4

https://gerrit.wikimedia.org/r/678608

Change 678607 merged by Ottomata:

[operations/puppet@production] refine_job - remove RefineFailuresChecker and use 0.1.4 in test/refine

https://gerrit.wikimedia.org/r/678607

Change 678608 merged by Ottomata:

[operations/puppet@production] refine - use refinery 0.1.4

https://gerrit.wikimedia.org/r/678608

Change 678836 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Refine - fix typo in job_config

https://gerrit.wikimedia.org/r/678836

Change 678836 merged by Ottomata:

[operations/puppet@production] Refine - fix typo in job_config

https://gerrit.wikimedia.org/r/678836

Change 676380 merged by Ottomata:

[operations/puppet@production] Set up refine_sanitize jobs in analytics test cluster.

https://gerrit.wikimedia.org/r/676380

Change 678857 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_sanitize - use proper refinery path

https://gerrit.wikimedia.org/r/678857

Change 678857 merged by Ottomata:

[operations/puppet@production] test/refine_sanitize - use proper refinery path

https://gerrit.wikimedia.org/r/678857

Change 678863 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Ensure refinery/python on PYTHONPATH for refinery-eventlogging-saltrotate in test

https://gerrit.wikimedia.org/r/678863

Change 678863 merged by Ottomata:

[operations/puppet@production] Ensure refinery/python on PYTHONPATH for refinery-eventlogging-saltrotate

https://gerrit.wikimedia.org/r/678863

Change 678867 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_sanitize - ensure hdfs salts dir exists, and use -f when running saltrotate rm

https://gerrit.wikimedia.org/r/678867

Change 678867 merged by Ottomata:

[operations/puppet@production] test/refine_sanitize - ensure hdfs salts dir exists

https://gerrit.wikimedia.org/r/678867

Change 678876 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_santize - Use normalized lowercase table name

https://gerrit.wikimedia.org/r/678876

Change 678876 merged by Ottomata:

[operations/puppet@production] test/refine_santize - Use normalized lowercase table name

https://gerrit.wikimedia.org/r/678876

Change 678941 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Refactor EventLoggingSanitization using RefineSanitize

https://gerrit.wikimedia.org/r/678941

Change 679376 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refine - lowercase eventlogging legeacy table names in include/exclude regexes

https://gerrit.wikimedia.org/r/679376

Change 679846 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine - use 0.1.5 and lowercase table regexes

https://gerrit.wikimedia.org/r/679846

Change 679846 merged by Ottomata:

[operations/puppet@production] test/refine - use 0.1.5 and lowercase table regexes

https://gerrit.wikimedia.org/r/679846

Change 679376 merged by Ottomata:

[operations/puppet@production] refine - use 0.1.5 and lowercase table names in include/exclude regexes

https://gerrit.wikimedia.org/r/679376

Change 679939 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_sanitize - make salt rotation more generic

https://gerrit.wikimedia.org/r/679939

Change 679939 merged by Ottomata:

[operations/puppet@production] test/refine_sanitize - make salt rotation more generic

https://gerrit.wikimedia.org/r/679939

Change 679941 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refine_sanitize_salt_rotate - use local_salts_prefix properly

https://gerrit.wikimedia.org/r/679941

Change 679941 merged by Ottomata:

[operations/puppet@production] refine_sanitize_salt_rotate - use local_salts_prefix properly

https://gerrit.wikimedia.org/r/679941

Change 680382 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] SanitizeTransformation - Just some simple logging improvements.

https://gerrit.wikimedia.org/r/680382

Change 681069 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_sanitize - absent sanitize_eventlogging_analytics_delayed_test until June

https://gerrit.wikimedia.org/r/681069

Change 681069 merged by Ottomata:

[operations/puppet@production] test/refine_sanitize - absent sanitize delayed_test until June

https://gerrit.wikimedia.org/r/681069

Change 681092 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] analytics test - import mediawiki.page-data with camus and refine it

https://gerrit.wikimedia.org/r/681092

Change 681092 merged by Ottomata:

[operations/puppet@production] analytics test - import mediawiki.page-delete with camus and refine it

https://gerrit.wikimedia.org/r/681092

Change 681099 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] event_sanitized_allowlist - mediawiki_page_delete: keep_all

https://gerrit.wikimedia.org/r/681099

Change 680382 merged by Ottomata:

[analytics/refinery/source@master] SanitizeTransformation - Just some simple logging improvements.

https://gerrit.wikimedia.org/r/680382

Change 679961 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_sanitized - add a general purpose event_sanitized job

https://gerrit.wikimedia.org/r/679961

Change 681105 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/refine_sanitized - add a general purpose event_sanitized job

https://gerrit.wikimedia.org/r/681105

Change 679961 abandoned by Ottomata:

[operations/puppet@production] test/refine_sanitized - add a general purpose event_sanitized job

Reason:

Iae3a76a37169a852e9698157b6b06e5b1e99f47a

https://gerrit.wikimedia.org/r/679961

Change 681099 merged by Ottomata:

[analytics/refinery@master] event_sanitized_main_allowlist - mediawiki_page_delete: keep_all

https://gerrit.wikimedia.org/r/681099

Change 681105 merged by Ottomata:

[operations/puppet@production] test/refine_sanitized - add a event_sanitized_main job

https://gerrit.wikimedia.org/r/681105

Change 678941 merged by Ottomata:

[operations/puppet@production] Move sanitize_eventlogging_analytics jobs from data_purge to refine_santizie

https://gerrit.wikimedia.org/r/678941

Change 681784 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] sanitize_eventlogging_analytics_immediate - ensure absent during switchover

https://gerrit.wikimedia.org/r/681784

Change 681784 merged by Ottomata:

[operations/puppet@production] sanitize_eventlogging_analytics_immediate - ensure absent during switchover

https://gerrit.wikimedia.org/r/681784

Change 681991 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refine_sanitize - use refinery 0.1.6 with RefineSanitize job

https://gerrit.wikimedia.org/r/681991

Change 681991 merged by Ottomata:

[operations/puppet@production] refine_sanitize - use refinery 0.1.6 with RefineSanitize job

https://gerrit.wikimedia.org/r/681991

Change 682175 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] refine - allow configuring RefineMonitors since and until params

https://gerrit.wikimedia.org/r/682175

Change 682175 merged by Ottomata:

[operations/puppet@production] refine - allow configuring RefineMonitors since and until params

https://gerrit.wikimedia.org/r/682175

Change 682986 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Rename and symlink sanitization eventlogging/whitelist.yaml

https://gerrit.wikimedia.org/r/682986

Change 683043 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Add more tables to sanitize in event_sanitized_main_allowlist

https://gerrit.wikimedia.org/r/683043

Change 683053 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] test/data_purge - add drop_event job

https://gerrit.wikimedia.org/r/683053

Change 682986 merged by Ottomata:

[analytics/refinery@master] Rename and symlink sanitization eventlogging/whitelist.yaml

https://gerrit.wikimedia.org/r/682986

Change 683043 merged by Ottomata:

[analytics/refinery@master] Add more tables to sanitize in event_sanitized_main_allowlist

https://gerrit.wikimedia.org/r/683043

Change 683053 merged by Ottomata:

[operations/puppet@production] test/data_purge - add drop_event job

https://gerrit.wikimedia.org/r/683053

Mentioned in SAL (#wikimedia-analytics) [2021-04-28T12:50:06Z] <ottomata> applied data_purge jobs in analytics test cluster; old data will now be dropped there - T273789

Change 683275 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery/source@master] RefineSanitizeMonitor - pass keep_all_enabled to SanitizeTransformation.loadAllowlist

https://gerrit.wikimedia.org/r/683275

Change 683275 merged by Ottomata:

[analytics/refinery/source@master] RefineSanitizeMonitor - pass keep_all_enabled to SanitizeTransformation.loadAllowlist

https://gerrit.wikimedia.org/r/683275

To prep for moving historical event data to event_sanitized, we need to copy over the tables that we'll be keeping forever. Procedure:

# On an-launcher1002 in /home/otto
mkdir event_sanitized_main_T280813
cd event_sanitized_main_T280813

tables="
mediawiki_centralnotice_campaign_change \
mediawiki_centralnotice_campaign_create \
mediawiki_page_create \
mediawiki_page_delete \
mediawiki_page_links_change \
mediawiki_page_move \
mediawiki_page_restrictions_change \
mediawiki_page_suppress \
mediawiki_page_undelete \
mediawiki_revision_create \
mediawiki_revision_recommendation_create \
mediawiki_revision_score \
mediawiki_revision_tags_change \
mediawiki_revision_visibility_change \
mediawiki_user_blocks_change"

distcp_source_file=./distcp_source.txt
truncate -s 0 $distcp_source_file

for table in $tables; do
    path_orig="hdfs://analytics-hadoop/wmf/data/event/$table"
    path_new="hdfs://analytics-hadoop/wmf/data/event_sanitized/$table"

    # Append path_orig to the distcp source file.
    echo "$path_orig" >> "$distcp_source_file"

    # Start a fresh Hive script for this table.
    hive_sql_file="./${table}_create.hql"
    truncate -s 0 "$hive_sql_file"

    # The script creates event_sanitized.$table with the same schema as event.$table,
    # points it at the copied data, and repairs it so Hive discovers the copied partitions.
    echo "SET mapred.child.java.opts=-Xmx8G -XX:+UseConcMarkSweepGC -XX:-UseGCOverheadLimit;" >> "$hive_sql_file"
    echo "SET hive.msck.repair.batch.size=5000;" >> "$hive_sql_file"
    echo "CREATE EXTERNAL TABLE IF NOT EXISTS event_sanitized.${table} LIKE event.${table};" >> "$hive_sql_file"
    echo "ALTER TABLE event_sanitized.${table} SET LOCATION '$path_new';" >> "$hive_sql_file"
    echo "MSCK REPAIR TABLE event_sanitized.${table};" >> "$hive_sql_file"
done


# do it!
mkdir ./done

# First run the distcp to copy all table directories into /wmf/data/event_sanitized
sudo -u analytics kerberos-run-command analytics hdfs dfs -put $distcp_source_file /tmp/distcp_event_sanitized_main_T280813.txt
time sudo -u analytics kerberos-run-command analytics hadoop distcp -m 60 -p -f /tmp/distcp_event_sanitized_main_T280813.txt hdfs://analytics-hadoop/wmf/data/event_sanitized
mv $distcp_source_file ./done/

for table in $tables; do
    echo "Creating and repairing event_sanitized.$table"
    time sudo -u analytics kerberos-run-command analytics hive -f ./${table}_create.hql && mv -v ./${table}_create.hql ./done/
    echo ""
done
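
Because each Hive script is moved into ./done only when its hive -f run succeeds, a script that is missing from ./done marks a table that still needs to be created and repaired. A small sketch for listing those tables follows; the helper name pending_tables is hypothetical and not part of the procedure above.

```shell
# Hypothetical helper: print the tables whose Hive script has not been moved
# into <work_dir>/done, i.e. tables for which `hive -f` still has to succeed.
# Usage: pending_tables <work_dir> <table>...
pending_tables() {
    work_dir="$1"
    shift
    for pending_table in "$@"; do
        if [ ! -e "$work_dir/done/${pending_table}_create.hql" ]; then
            echo "$pending_table"
        fi
    done
}

# Example, run from the working directory used above:
# pending_tables . $tables
```

Re-running the final loop over only the names this prints avoids repairing tables that already completed.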

Change 683375 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Update refine_sanitize job to use event_sanitized_analytics_allowlist

https://gerrit.wikimedia.org/r/683375

Change 683376 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] canary_events and refine_sanitize - use refinery 0.1.9

https://gerrit.wikimedia.org/r/683376

Change 683375 merged by Ottomata:

[operations/puppet@production] Update refine_sanitize job to use event_sanitized_analytics_allowlist

https://gerrit.wikimedia.org/r/683375

Change 683376 merged by Ottomata:

[operations/puppet@production] canary_events and refine_sanitize - use refinery 0.1.9

https://gerrit.wikimedia.org/r/683376

Change 683421 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Update event_sanitized_main_allowlist

https://gerrit.wikimedia.org/r/683421

For reference, the current sizes of the tables to move to event_sanitized are:

9.3 M  /wmf/data/event/mediawiki_centralnotice_campaign_change
1.4 M  /wmf/data/event/mediawiki_centralnotice_campaign_create
39.7 G  /wmf/data/event/mediawiki_page_create
4.4 G  /wmf/data/event/mediawiki_page_delete
309.5 G  /wmf/data/event/mediawiki_page_links_change
2.5 G  /wmf/data/event/mediawiki_page_move
2.5 K  /wmf/data/event/mediawiki_page_properties_change
428.5 M  /wmf/data/event/mediawiki_page_restrictions_change
10.4 M  /wmf/data/event/mediawiki_page_suppress
375.9 M  /wmf/data/event/mediawiki_page_undelete
392.1 G  /wmf/data/event/mediawiki_revision_create
13.9 M  /wmf/data/event/mediawiki_revision_recommendation_create
212.6 G  /wmf/data/event/mediawiki_revision_score
123.5 G  /wmf/data/event/mediawiki_revision_tags_change
1.1 G  /wmf/data/event/mediawiki_revision_visibility_change
3.8 G  /wmf/data/event/mediawiki_user_blocks_change
4.1 T  /wmf/data/event/resource_change

FYI: We are not going to sanitize and keep resource_change.
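
To get a rough total for the copy, the human-readable sizes above can be converted to bytes and summed. The sketch below assumes the "size unit path" layout shown in the listing and treats the units as powers of 1024; the helper name sum_sizes is hypothetical.

```shell
# Hypothetical helper: read lines of the form "<number> <K|M|G|T> <path>"
# on stdin and print the total size in bytes (units taken as powers of 1024).
# Lines with an unrecognized unit are ignored.
sum_sizes() {
    awk '
    BEGIN {
        mult["K"] = 1024
        mult["M"] = 1024 ^ 2
        mult["G"] = 1024 ^ 3
        mult["T"] = 1024 ^ 4
    }
    ($2 in mult) { total += $1 * mult[$2] }
    END { printf "%.0f\n", total }
    '
}

# Example: save the listing to a file, drop the tables that are not being
# copied (such as resource_change), then run:
# grep -v resource_change sizes.txt | sum_sizes
```

Since the listed sizes are rounded to one decimal place, the result is only an estimate of what a copy job should report.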

distcp complete, application_id: application_1619507802557_6586

DistCp Counters
        Bytes Copied=1170743820170
        Bytes Expected=1170743820170
        Files Copied=1955833
        DIR_COPY=331299
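
Matching "Bytes Copied" and "Bytes Expected" counters are the quick signal that the copy moved everything it planned to. A sketch of checking this from saved job output follows, assuming counter lines in the Name=value form shown above; the helper name check_distcp_bytes is hypothetical.

```shell
# Hypothetical helper: read DistCp output on stdin and succeed (exit 0) only
# if both "Bytes Copied" and "Bytes Expected" are present and equal.
# Exit 1 means the counters differ; exit 2 means a counter is missing.
check_distcp_bytes() {
    awk -F= '
    { gsub(/^[ \t]+/, "", $1) }          # strip leading indentation from the name
    $1 == "Bytes Copied"   { copied = $2 }
    $1 == "Bytes Expected" { expected = $2 }
    END {
        if (copied == "" || expected == "") exit 2
        exit (((copied "") == (expected "")) ? 0 : 1)
    }'
}

# Example, with the job output saved to a file:
# check_distcp_bytes < distcp_output.txt && echo "byte counts match"
```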

Moving on to creating and repairing hive tables.

Change 683421 merged by Ottomata:

[analytics/refinery@master] Update event_sanitized_main_allowlist

https://gerrit.wikimedia.org/r/683421

Mentioned in SAL (#wikimedia-operations) [2021-04-29T13:37:36Z] <otto@deploy1002> Started deploy [analytics/refinery@b3c5820]: update event_sanitized_main allowlst on an-launcher1002 - T273789

Mentioned in SAL (#wikimedia-operations) [2021-04-29T13:40:35Z] <otto@deploy1002> Finished deploy [analytics/refinery@b3c5820]: update event_sanitized_main allowlst on an-launcher1002 - T273789 (duration: 02m 59s)

Change 683614 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Enable event_sanitized_main_immediate job

https://gerrit.wikimedia.org/r/683614

Change 683620 had a related patch set uploaded (by Ottomata; author: Ottomata):

[analytics/refinery@master] Add comment in allowlist about mediawiki_page_properties_change

https://gerrit.wikimedia.org/r/683620

Change 683620 merged by Ottomata:

[analytics/refinery@master] Add comment in allowlist about mediawiki_page_properties_change

https://gerrit.wikimedia.org/r/683620

Change 683622 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Add comment in refine about mediawiki_page_properties_change

https://gerrit.wikimedia.org/r/683622

Change 683614 merged by Ottomata:

[operations/puppet@production] Enable event_sanitized_main_immediate job

https://gerrit.wikimedia.org/r/683614

Change 683622 merged by Ottomata:

[operations/puppet@production] Add comment in refine about mediawiki_page_properties_change

https://gerrit.wikimedia.org/r/683622

Mentioned in SAL (#wikimedia-analytics) [2021-04-29T15:38:10Z] <ottomata> enabling event_sanitized_main jobs - T273789

Mentioned in SAL (#wikimedia-operations) [2021-04-29T16:15:37Z] <otto@deploy1002> Started deploy [analytics/refinery@b3c5820] (hadoop-test): update event_sanitized_main allowlst on an-launcher1002 - T273789

Mentioned in SAL (#wikimedia-operations) [2021-04-29T16:18:17Z] <otto@deploy1002> Finished deploy [analytics/refinery@b3c5820] (hadoop-test): update event_sanitized_main allowlst on an-launcher1002 - T273789 (duration: 02m 39s)

Backfilling the hours since yesterday's distcp was started:

sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event_sanitized_main_immediate --since=72

Change 683698 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Remove refinery-drop-webrequest-sampled-druid job from test cluster

https://gerrit.wikimedia.org/r/683698

Change 683698 merged by Ottomata:

[operations/puppet@production] Remove refinery-drop-webrequest-sampled-druid job from test cluster

https://gerrit.wikimedia.org/r/683698

Change 683890 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/puppet@production] Remove absented druid data drop job from test cluster

https://gerrit.wikimedia.org/r/683890

Change 683890 merged by Ottomata:

[operations/puppet@production] Remove absented druid data drop job from test cluster

https://gerrit.wikimedia.org/r/683890

Ah, I need to reset the _REFINED flag mtimes for the hours I distcp-ed over, just like I did in https://phabricator.wikimedia.org/T280813#7028014. There's more data this time. Here's what I'm running now:


import org.wikimedia.analytics.refinery.job.refine._
import com.github.nscala_time.time.Imports._
import org.joda.time.format.ISODateTimeFormat
import org.apache.hadoop.fs.Path
import org.joda.time.Days

val isoDateTimeFormatter = ISODateTimeFormat.dateTimeParser()

// Have to split this into smaller chunks; default non yarn spark shell couldn't handle all partitions in this range at once.
val begin = DateTime.parse("2021-03-14T00:00:00Z", isoDateTimeFormatter)
val end = DateTime.parse("2021-05-01T00:00:00Z", isoDateTimeFormatter)


def daysInBetween(d1: DateTime, d2: DateTime): Seq[DateTime] = {
    val oldestDay = new DateTime(d1, DateTimeZone.UTC).hourOfDay.roundCeilingCopy
    val youngestDay = new DateTime(d2, DateTimeZone.UTC).hourOfDay.roundFloorCopy

    for (d <- 0 until Days.daysBetween(oldestDay, youngestDay).getDays) yield {
        oldestDay + d.days
    }
}

val days = daysInBetween(begin, end)
val outputPath = new Path("/wmf/data/event_sanitized")

for (i <- 0 until days.size - 1) yield {
    val since = days(i)
    val until = days(i + 1)

    println(s"==== $since $until ====")

    val rts = RefineTarget.find(
        spark,
        "event",
        outputPath,
        "event_sanitized",
        since,
        until,
        None,
        None
    )

    val doneRts = rts.filter(_.doneFlagExists).par
    doneRts.foreach(_.writeDoneFlag)
    println(s"==== DONE $since $until ${doneRts.size} ====")
    println("")
}

Change 684427 had a related patch set uploaded (by Mforns; author: Mforns):

[operations/puppet@production] analytics:refinery:job:test:data_purge: remove -skipTrash from drop_event

https://gerrit.wikimedia.org/r/684427

Change 684427 merged by Ottomata:

[operations/puppet@production] analytics:refinery:job:test:data_purge: remove -skipTrash from drop_event

https://gerrit.wikimedia.org/r/684427