
Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers
Open, High, Public

Description

As of today, the data-reload cookbook performs multiple tasks on the wdqs host being reloaded:

  • copy the dumps from the snapshot machines to a local folder
  • munge
  • import into blazegraph
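
For context, the import step relies on loadData.sh from the wdqs deployment; a typical invocation looks roughly like this (paths illustrative):

  ./loadData.sh -n wdq -d /srv/wdqs/munged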

It would be interesting to have a more flexible process that sources its data from hdfs/hive directly (or indirectly via swift?) so that we could reuse the data computed by some jobs running in hadoop (munging, graph splitting).

It is not yet clear precisely how to achieve this, but the goal would be a set of tools that, given a wdqs host, a target blazegraph port/namespace, and a hive partition for the source data, can schedule a data-reload.
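
For illustration only, the end state could look something like the following invocation (a hypothetical command line; all flag names here are invented and the real cookbook interface may differ):

  sudo cookbook sre.wdqs.data-reload wdqs2023.codfw.wmnet \
      --reload-data wikidata_full \
      --source hdfs \
      --hdfs-path hdfs:///wmf/discovery/some-nt-dump/ \
      --stat-host stat1009.eqiad.wmnet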

Design:
The data-reload cookbook will be adapted to do the following steps:

  • From a stat machine: Generate (or re-use) a .nt dataset in hdfs using org.wikidata.query.rdf.spark.transform.structureddata.dumps.NTripleGenerator on an existing hive table containing triples
  • From a stat machine: copy the generated partitions to a local folder (possibly renaming the partitions to match what loadData.sh expects)
  • Transfer the files to the destination wdqs host
  • Run the data-reload, skipping the munge step
  • Determine the kafka timestamp by sending a SPARQL query (see the example query after this list)
  • Restart the updater
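
As an example of the kafka-timestamp step, the date the dump was taken can be read back from the loaded namespace, assuming the dataset carries the same schema:dateModified triple as the classic munged dumps (port and namespace below are the wdqs blazegraph defaults; adjust if the HDFS-produced dataset stores this differently):

  curl -s 'http://localhost:9999/bigdata/namespace/wdq/sparql' \
      -H 'Accept: application/sparql-results+json' \
      --data-urlencode 'query=SELECT ?date WHERE { <http://www.wikidata.org> <http://schema.org/dateModified> ?date }'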

AC:

  • the system is designed and this phab task updated
  • a wdqs host can be loaded using the triples stored in a hive partition
  • the load process can be resumed if it fails (unless blazegraph has corrupted its journal)
  • the loading time must be shorter than the classic data-reload's, by roughly one day, since the munge step is skipped

Event Timeline

Gehel triaged this task as High priority. Oct 18 2023, 8:33 AM

@BTullis @bking I plan to use a cookbook to transfer some data out of hdfs to blazegraph machines. A naive approach I thought about was to use a temp folder somewhere in /srv of a stat100x machine, populated using hdfs dfs or hdfs-rsync, and then re-use the transferpy python module.
The current dumps are about 200G; do you think that this option is viable? Can we use a folder in /srv as a temp folder for such transfers? This data is only useful for the transfer and should be deleted by the cookbook when it ends.
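
For reference, the on-HDFS footprint can be checked up front before committing to a copy (the path here is hypothetical):

  hdfs dfs -du -s -h hdfs:///wmf/data/discovery/wdqs/dumps/$SNAPSHOT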

@dcausse It looks like there's plenty of disk space on /srv:

bking@stat1007:~$ df -h | grep srv
/dev/mapper/stat1007--vg-data   7.2T  4.4T  2.5T  65% /srv

I don't own the stats servers though. @BTullis are there any other concerns besides disk space we need to consider?

Another approach could be to use the /mnt/hdfs mountpoint. I have been told that it might not be stable enough, but perhaps it's OK for doing a copy?

Possible options I see so far:

  1. Run hdfs-rsync directly from the blazegraph hosts
    • cons: requires installing its dependencies
    • cons: opens a hole between blazegraph and the hadoop cluster
  2. Schedule hdfs-rsync on a stat machine copying the ttl dumps from hdfs to /srv/analytics-search/wikibase_processed_dumps/wikidata/$SNAPSHOT
    • cons: consumes some space on a stat machine
  3. Run hdfs-rsync on-demand to copy the ttl dump from hdfs to /srv/analytics-search/wikibase_processed_dumps/temp and clean up this folder once done (see the sketch after this list)
    • cons: slows the process down a bit
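
A minimal sketch of option 3 (the hdfs source path, the destination host, and the destination path are all illustrative, and hdfs-rsync could stand in for the plain hdfs dfs copy shown here):

  # temp folder on the stat machine, removed even if the transfer fails
  TMP=/srv/analytics-search/wikibase_processed_dumps/temp
  mkdir -p "$TMP"
  trap 'rm -rf "$TMP"' EXIT
  # pull the ttl dump out of hdfs onto the stat machine
  hdfs dfs -copyToLocal "hdfs:///wmf/data/discovery/wdqs/dumps/$SNAPSHOT/*" "$TMP/"
  # hand the files to transferpy for the stat -> wdqs hop
  transfer.py "$(hostname -f):$TMP" wdqs2023.codfw.wmnet:/srv/wdqs/dumps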

I was planning on doing option 3; any objections to this approach?

No objection :) I'd have gone for option 1 as it seems the easiest to maintain, but I agree, it means installing some stuff on the blazegraph machines.

Change #1030897 had a related patch set uploaded (by DCausse; author: DCausse):

[wikidata/query/rdf@master] Allow setting the format of dump files

https://gerrit.wikimedia.org/r/1030897

Change #1030897 merged by jenkins-bot:

[wikidata/query/rdf@master] Allow setting the format of dump files

https://gerrit.wikimedia.org/r/1030897

Change #1031933 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/cookbooks@master] wdqs.data-reload: support HDFS as a source

https://gerrit.wikimedia.org/r/1031933

Change #1031933 merged by jenkins-bot:

[operations/cookbooks@master] wdqs.data-reload: support HDFS as a source

https://gerrit.wikimedia.org/r/1031933

Change #1038904 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] wdqs.data-reload: fix regex escaping

https://gerrit.wikimedia.org/r/1038904

Mentioned in SAL (#wikimedia-operations) [2024-06-10T18:11:45Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-10T19:02:06Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-10T19:02:49Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-10T19:22:50Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-10T20:30:03Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-10T20:30:26Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-11T22:56:06Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T08:12:30Z] <brouberol@cumin2002> START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T08:12:42Z] <brouberol@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T08:14:59Z] <brouberol@cumin2002> START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T08:15:18Z] <brouberol@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T08:24:03Z] <brouberol@cumin2002> START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T08:28:07Z] <brouberol@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T13:30:02Z] <brouberol@cumin2002> START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T13:33:55Z] <brouberol@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T17:49:58Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T17:56:20Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T17:58:08Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T18:04:27Z] <ryankemper@cumin2002> END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T21:05:11Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-12T21:11:29Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Change #1042965 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/puppet@production] wdqs: remove wdqs2023 from the public cluster and enable the updaters

https://gerrit.wikimedia.org/r/1042965

Change #1042965 merged by Ryan Kemper:

[operations/puppet@production] wdqs: remove wdqs2023 from the public cluster and enable the updaters

https://gerrit.wikimedia.org/r/1042965

Mentioned in SAL (#wikimedia-operations) [2024-06-13T16:11:50Z] <ryankemper@cumin2002> START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)

Mentioned in SAL (#wikimedia-operations) [2024-06-13T16:18:23Z] <ryankemper@cumin2002> END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)