
Document process for getting JNL files/consider automation
Closed, Resolved (Public)

Description

Blazegraph (the application that serves WDQS) stores all its data in a single JNL file. WDQS' JNL file is very large (~1.2TB), so moving it on and off the hosts tends to be difficult (see T344732 and this blog post).

We've had to do this more than once, and my general rule is that if you have to do something more than twice, you need to automate it.

Creating this ticket to:

  • Document the process of extracting a JNL file from a wdqs host
  • Solicit feedback from co-workers/community members, and make a decision on whether to automate this process. Note that this does not mean we'll run this process on a schedule, like we do for the TTL dumps; just that we'll have a ready-made script to run that starts with a JNL file on a wdqs server and ends up with a JNL file in a place where it can be publicly downloaded (a rough sketch follows this list).
  • There is a separate discussion on whether or not to include the JNL files in our regular dumps; see T344905 for more details.
  • There is a separate discussion on whether or not to include the JNL files in our regular dumps; see T344905 for more details.
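
As a starting point for that discussion, here is a minimal sketch of what the server-side half of such a script might look like, assuming the host is depooled and Blazegraph stopped first so the JNL file is quiescent. The service unit name, data path, and pool/depool helpers below are illustrative assumptions rather than confirmed values.

depool                                           # take the host out of rotation (assumed helper)
sudo systemctl stop wdqs-blazegraph              # assumed unit name; stop writes so the JNL is consistent
cd /srv/wdqs                                     # assumed data directory
sha1sum wikidata.jnl > wikidata.jnl.sha1         # checksum before any copying
zstd -T0 -19 wikidata.jnl                        # produces wikidata.jnl.zst (~12 hours; see notes below)
sha1sum wikidata.jnl.zst > wikidata.jnl.zst.sha1
sudo systemctl start wdqs-blazegraph
pool                                             # put the host back in rotation (assumed helper)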

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2023-09-29T16:22:46Z] <inflatador> bking@wdqs1016 depooling to compress JNL file T347605

A few notes on this process:

  • I used zstd to compress the JNL file, as it supposedly offers the best speed. My compression command was zstd -T0 -19 wikidata.jnl (all cores, maximum compression), run on wdqs1016. Despite the host having 32 cores, I never saw the load average go past 16 during compression, which took ~12 hours.
  • I uploaded to Cloudflare R2 object storage using @Addshore's rclone command as described here. The upload (with the command optimizations in the linked post) took about an hour.
  • I'm downloading the file over my 1Gbps connection using axel with 16 concurrent connections (axel -a -n 16), getting ~320Mbps transfer speed; see the sketch after this list.
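
A sketch of the full axel invocation, with a placeholder URL standing in for the actual R2 location:

axel -a -n 16 https://example.r2.dev/wikidata.jnl.zst    # placeholder URL

At ~320Mbps, the ~340GB .zst works out to a little under 2.5 hours if nothing stalls.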

For me, the first 300 GB of the file went really, really fast, but axel kept dropping connections, similar to when I downloaded the large 1 TB file. So this download took about 5 hours; I'm pretty sure it could be done in 1-3 hours if everything were working well.

Now, I encountered an error, and it was reproducible across two separate downloads. @bking, does a test on the file yield the same "corrupted block detected" warning for you, by any chance, if you download the ZST? What about with your already existing copy?

/mnt/x $ time unzstd --output-dir-flat /mnt/y/ wikidata.jnl.zst
wikidata.jnl.zst     : 649266 MB...     wikidata.jnl.zst : Decoding error (36) : Corrupted block detected

real    124m59.115s
user    17m44.647s
sys     7m24.509s

/mnt/x $ ls -l wikidata.jnl.zst
-rwxrwxrwx 1 adam adam 342189138219 Oct  3 02:32 wikidata.jnl.zst

I've kicked off a sha1sum, but this will take a while to run.

/mnt/x$ time sha1sum wikidata.jnl.zst

Here's the sha1sum for the latest file I had downloaded:

/mnt/x$ time sha1sum wikidata.jnl.zst
62327feb2c6ad5b352b5abfe9f0a4d3ccbeebfab  wikidata.jnl.zst

real    77m16.215s
user    8m39.726s
sys     2m42.932s

Checksum does not match the version from wdqs1016, which is:

sha1sum wikidata.jnl.zst
e3197eb5177dcd1aa0956824cd8dc4afc2d8796c  wikidata.jnl.zst

I also downloaded the file locally after putting it up on Cloudflare, and it has a different checksum as well (shasum is a macOS utility that defaults to SHA-1 output):

shasum wikidata.jnl.zst
d9b3d3729a9a2dce3242e756807411f945cfd824  wikidata.jnl.zst

And I'm also getting wikidata.jnl.zst : Decoding error (36) : Data corruption detected. Will try redownloading with wget and hope for better results.
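
For anyone retracing this, a sketch of the retry, again with a placeholder URL; wget -c resumes a partial download instead of restarting it:

wget -c https://example.r2.dev/wikidata.jnl.zst    # placeholder URL
sha1sum wikidata.jnl.zst                           # compare against the wdqs1016 checksum before decompressing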

Following your lead, I downloaded with wget overnight, and the sha1sum now matches the one from wdqs1016. Decompressing now; will update with results.

Addressing @Addshore's comment in T344905#9210122...

I think the amount of time taken to decompress the JNL file should also be taken into consideration on varying hardware if compression is being considered.

Here's what I saw for performance:

/mnt/x $ time unzstd --output-dir-flat /mnt/y/ wikidata.jnl.zst
wikidata.jnl.zst    : 1265888788480 bytes

real    219m10.733s
user    29m51.350s
sys     12m53.425s

This was on an i7-8700 CPU @ 3.20GHz. Watching it with top, it seemed to be using about 0.8-1.6 cores at any given time, hovering around 1 core. From what I can see, unzstd doesn't support multi-threaded decompression.
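
For what it's worth, one possible avenue for future archives (untested here, so treat it as an assumption): pzstd, which ships alongside zstd, splits the data into independent frames, so decompression can also be parallelized, but only for archives that pzstd itself produced.

pzstd -p 16 -19 wikidata.jnl       # parallel compress into a multi-frame, zstd-compatible .zst
pzstd -d -p 16 wikidata.jnl.zst    # parallel decompress (only works on pzstd-created archives)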

Gehel moved this task from Incoming to In Progress on the Data-Platform-SRE board.

@bking just wanted to express my gratitude for the support on this ticket and its friends T344905: Publish WDQS JNL files to dumps.wikimedia.org and T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`. FWIW I do think it would be good to automate this. As a matter of getting to a functional WDQS local environment replete with BlazeGraph data, it would accelerate things a lot. I think my only reservations are that:

  1. It takes time to automate. Any rough guess on level of effort for that? I understand that'd inform relative prioritization against the large pile of other things.
  2. The energy savings are possibly unclear, at least in the current case (partly because it's hard to know how much energy is being expended; it could be guessed at from the number of dump downloads, though I'm not sure how easy it is to get those stats; this is different from the bandwidth transfer on Cloudflare R2).

However, I would probably err on the side of assuming that, ultimately, the automation will boost the technical community's interest in and ability to trial things locally (right now the barriers are somewhat prohibitive), and that the energy savings will roughly net out. Ironically, if it attracts more people, they will consume more energy in aggregate, but they will also be vastly more efficient energy-wise because they won't have to ETL, which takes a lot of compute resources. For potential reusers (e.g., Enterprise or other institutions) it might help smooth things along a bit, although this is mostly just my conjecture.

Thinking ahead a little, we'd probably want to generalize anything so that it can take arbitrary .jnls, for example for split graphs.
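
As a strawman, the generalization might be as small as parametrizing the file name; the paths and rclone remote below are placeholders:

JNL="${1:-wikidata.jnl}"                  # e.g. a split-graph JNL instead of the full one
sha1sum "$JNL" > "$JNL.sha1"
zstd -T0 -19 "$JNL"
rclone copy "$JNL.zst" r2:wdqs-dumps/     # placeholder remote:bucket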

The energy savings are possibly unclear, at least in the current case (partly because it's hard to know how much energy is being expended; it could be guessed at from the number of dump downloads, though I'm not sure how easy it is to get those stats; this is different from the bandwidth transfer on Cloudflare R2).

Are you talking about number of dump downloads of the JNL or of triples?
I believe the only way to really tell # of dump downloads of JNL on R2 is to infer this from the bytes downloaded
For example on my large JNL file I have had many connections / download starts, but only a few people downloaded the full file size.
As the file size is so big the number is probably fairly accurate though.

For my JNL file in the past 30 days...
829 connection requests, 141 unique visitors
8TB data served, so probably 6-7 full downloads in 30 days
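
To spell out the inference (assuming each full download is the ~1.2TB uncompressed JNL):

echo "8 / 1.2" | bc -l    # ≈ 6.7 full-download equivalents out of 8TB served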

Thinking ahead a little, we'd probably want to generalize anything so that it can take arbitrary .jnls, for example for split graphs.

I think the generalized approach here should probably just be taking large files from WDQS land and dumping them into a space like R2
Information that would be great to flow with that data would be a checksum of the thing prior to all of the copying, and also a timestamp the thing was taken from.
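
A minimal sketch of that, assuming an rclone remote named r2 and a bucket named wdqs-dumps (both placeholders):

date --utc +%Y-%m-%dT%H:%M:%SZ > wikidata.jnl.timestamp          # when the snapshot was taken
sha1sum wikidata.jnl wikidata.jnl.zst > wikidata.jnl.sha1sums    # checksums from before any copying
rclone copy wikidata.jnl.zst       r2:wdqs-dumps/
rclone copy wikidata.jnl.sha1sums  r2:wdqs-dumps/
rclone copy wikidata.jnl.timestamp r2:wdqs-dumps/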

One of the next steps I want to try along this journey is downloading the file from R2 and adding it to a volume image for an EC2 machine or a compute machine on another cloud provider.
This should also remove the "download" step for those who want to use this file in cloud land and provide a fairly instant "your own WDQS" experience on whatever hardware.
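
Roughly, on AWS that could look like the following (all IDs, sizes, device names, and URLs are placeholders; other providers have equivalents):

aws ec2 create-volume --availability-zone us-east-1a --size 2000 --volume-type gp3
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/sdf
sudo mkfs.ext4 /dev/nvme1n1 && sudo mount /dev/nvme1n1 /mnt/wdqs    # device naming varies by instance type
axel -a -n 16 https://example.r2.dev/wikidata.jnl.zst               # placeholder URL
unzstd wikidata.jnl.zst -o /mnt/wdqs/wikidata.jnl
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "wikidata.jnl snapshot"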

Good question: I meant the contrast with respect to the .ttl.gz dumps and everything that goes into munging and importing (in aggregate across all downloaders of those files), versus the same if this were done with the .jnl, where they don't have to munge and import. Napkin-mathsing it, the thought was that the energy savings accrue about as soon as the 16 cores x 12 hours of compression time on the .jnl has been "saved" by people in aggregate not needing to run the import process. (I'm waving away the client-side decompression, which technically happens twice for the .ttl.gz user but only once for the .jnl.zst user, and any other disk or network transfer pieces, as those are all close enough, I suppose.)
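
To make the break-even explicit with made-up numbers (the per-import figure is purely hypothetical):

# 16 cores x 12h = 192 core-hours of one-off compression; if a single munge+import
# costs, say, 100 core-hours (hypothetical), break-even comes after ~2 avoided imports.
echo "(16 * 12) / 100" | bc -l    # ≈ 1.92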

I'll go check on what stats may be readily available on dumps downloads.

Good point on having a checksum and timestamp. Yeah, it would be nice to have it in an on-demand place without the need for extra data transfer!

bking triaged this task as Low priority.

@bking just wanted to express my gratitude for the support on this ticket and its friends T344905: Publish WDQS JNL files to dumps.wikimedia.org and T347647: 2023-09-18 latest-all.ttl.gz WDQS dump `Fatal error munging RDF org.openrdf.rio.RDFParseException: Expected '.', found 'g'`. FWIW I do think it would be good to automate this. As a matter of getting to a functional WDQS local environment replete with BlazeGraph data, it would accelerate things a lot. I think my only reservations are that:

  1. It takes time to automate. Any rough guess on level of effort for that? I understand that'd inform relative prioritization against the large pile of other things.

I appreciate your appreciation! I only wish I'd gotten something useful up in R2. For a truly reliable process, we'd need to implement @Addshore's suggestions around "...checksum[ming] of the thing prior to all of the copying, and also a timestamp the thing was taken from."

Unfortunately, because the initial process took a lot longer than expected and we have a lot of other things on our plate, this has been deprioritized for the time being. I'll try to work on it in my spare time, and we can revisit the discussion next quarter for sure. Sorry for the trouble!

@bking Would it be possible to get me access to an R2 bucket that is paid for by the WMF in some way?
I'll happily continue my manual process of putting a JNL file in a bucket every few months for folks to use until this is more automated.

UPDATE from previous comment: reducing to GETs, it's closer to 100 (a bunch of the requests were HEAD requests). Also, it seems that there may be some range requests going on in there, so it's messier than it appears at first glance.

zgrep wikidatawiki /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20231024.gz | grep latest-all | grep " 200 " | grep GET | cut -f1,10 -d" " | less

INITIALLY I said the following, but wanted to clarify here with an update.

Looking at yesterday's downloads with a rudimentary grep we're not far from 1K downloads, and that's just for the latest-all ones. That also doesn't consider mirrors.

stat1007:~$ zgrep wikidatawiki /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20231024.gz | grep latest-all | grep " 200 " | wc -l

I also said this, which still holds:

Now, it's good to keep in mind that some of these downloads are mirror jobs themselves, but looking at some of the source IPs it's clear that a good number of them also are not obviously mirror jobs.

I also see https://grafana.wikimedia.org/d/000000264/wikidata-dump-downloads?orgId=1&refresh=5m&from=now-2y&to=now which I noticed from some tickets involving @Addshore (cf. T280678: Crunch and delete many old dumps logs and friends) and a pointer from a colleague.

As I noted, there are some complications around the 200s: more specifically, wildly varying response sizes for the same URL from the same source IP and UA even within a day, suggestive of something cleverer going on, at least from some IPs. It's not clear that those are strictly correlated to 206 range behavior, but some may be. I also see, from T280678's pointer to the source processing at https://github.com/wikimedia/analytics-wmde-scripts/blob/master/src/wikidata/dumpDownloads.php#L12, that both 206s and 200s are considered there. Future TODO in case we want to figure out how to deal with the different-sized 200s and the apparent downloader utilities.
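
A rough sketch of that TODO, assuming the same field layout as the cut above (field 1 = client, field 10 = bytes sent) and with DUMP_BYTES standing in for the size of the dump file being measured:

DUMP_BYTES=100000000000    # placeholder: set to the actual dump size in bytes
zgrep wikidatawiki /srv/log/webrequest/archive/dumps.wikimedia.org/access.log-20231024.gz \
  | grep latest-all | grep " 200 " | grep GET \
  | awk -v size="$DUMP_BYTES" '{bytes[$1] += $10} END {for (c in bytes) printf "%s %.2f\n", c, bytes[c]/size}' \
  | sort -k2 -nr | head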