As of today, the data-reload cookbook performs several tasks on the wdqs host being reloaded:
- copy the dumps from the snapshot machines to a local folder
- munge
- import into blazegraph
It would be interesting to have a more flexible process that sources its data from hdfs/hive directly (or indirectly via swift?) so that we could reuse the data computed by some jobs running in hadoop (munging, graph splitting).
It is not yet clear how to achieve this precisely, but the goal would be a set of tools that, given a wdqs host, a target blazegraph port/namespace, and a hive partition for the source data, can schedule a data-reload.
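As a rough illustration of that interface (parameter names and defaults are assumptions, not a final design, and the real cookbook would follow the existing spicerack cookbook conventions), the entry point could look like:

```python
# Hypothetical argument set for the new data-reload cookbook.
# Names and defaults are illustrative only.
import argparse


def argument_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Reload a wdqs host from triples stored in a hive partition")
    parser.add_argument("host", help="wdqs host to reload")
    parser.add_argument("--blazegraph-port", type=int, default=9999,
                        help="port of the blazegraph instance to load into")
    parser.add_argument("--namespace", default="wdq",
                        help="blazegraph namespace to load into")
    parser.add_argument("--hive-partition",
                        help="hive table/partition holding the pre-munged triples")
    return parser
```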
Design:
The data-reload cookbook will be adapted to do the following steps:
- From a stat machine: generate (or re-use) a .nt dataset in hdfs using org.wikidata.query.rdf.spark.transform.structureddata.dumps.NTripleGenerator on an existing hive table containing triples
- From a stat machine: copy the generated partitions to a local folder (possibly renaming them to match what loadData.sh expects)
- Transfer the files to the destination wdqs host (see the copy/transfer sketch after this list)
- Run the data-reload skipping the munge step
- Determine the kafka start timestamp by sending a SPARQL query (see the query sketch after this list)
- Restart the updater
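A minimal sketch of the copy/rename/transfer steps, assuming the partition is written as part files that need to be renamed to the sequential naming loadData.sh expects (the hdfs path, file naming pattern, and destination directory are all assumptions to be confirmed; only `hdfs dfs -get` and `rsync` are relied upon):

```python
# Sketch: fetch the generated partition from hdfs, rename for loadData.sh,
# and ship it to the wdqs host. All paths and names are placeholders.
import subprocess
from pathlib import Path

HDFS_PATH = "hdfs:///user/analytics/wikidata/nt_dump/snapshot=2021-03-01"  # hypothetical
LOCAL_DIR = Path("/srv/nt-dump")                                           # hypothetical
WDQS_HOST = "wdqs1004.eqiad.wmnet"                                         # hypothetical


def fetch_partition() -> None:
    """Copy the generated partition files from hdfs to a local folder."""
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    subprocess.run(["hdfs", "dfs", "-get", HDFS_PATH + "/*", str(LOCAL_DIR)], check=True)


def rename_for_loaddata() -> None:
    """Rename part files to sequentially numbered files; the exact pattern
    loadData.sh expects must be checked against the script itself."""
    for i, part in enumerate(sorted(LOCAL_DIR.glob("part-*"))):
        part.rename(LOCAL_DIR / f"wikidump-{i:09d}.ttl.gz")


def transfer_to_host() -> None:
    """Ship the renamed files to the destination wdqs host."""
    subprocess.run(
        ["rsync", "-av", f"{LOCAL_DIR}/", f"{WDQS_HOST}:/srv/wdqs/munged/"],
        check=True)
```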
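For the kafka start timestamp, a sketch of the SPARQL probe against the freshly loaded namespace. It assumes the loaded triples carry the dump date as schema:dateModified on wikibase:Dump (as written by the munger and normally used by the updater to pick its start time); the endpoint URL, port, and namespace are also placeholders:

```python
# Sketch: read the dump timestamp from the loaded namespace so the updater
# can be started from the matching kafka offset. The wikibase:Dump /
# schema:dateModified triple and the endpoint URL are assumptions.
import requests

SPARQL_ENDPOINT = "http://localhost:9999/bigdata/namespace/wdq/sparql"  # hypothetical

QUERY = """
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX schema: <http://schema.org/>
SELECT ?date WHERE { wikibase:Dump schema:dateModified ?date } ORDER BY ?date LIMIT 1
"""


def dump_timestamp() -> str:
    """Return the oldest dateModified recorded for the dump node."""
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["date"]["value"]
```

Taking the oldest date is the conservative choice: starting the updater slightly earlier than needed only causes redundant updates, while starting too late would lose edits.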
AC:
- the system is designed and this phab task updated
- a wdqs host can be loaded using the triples stored in a hive partition
- the load process can be resumed if it fails (unless blazegraph has corrupted its journal)
- the loading time must be shorter than the classic data-reload (roughly one day less, since the munge step is skipped)