wikidata blazegraph journal file
Open, Low, Public

Description

We would like to get a copy of the Blazegraph journal file to seed
an updating Wikidata service again.

Since the graph split, three journal files might be needed.

Estimated size:

du -sm data.jnl 
1328046	data.jnl

This is an example of a full graph with a journal file size of 1.3 TB.

The preferred transport mode is a hard disk, which may be a rotating one. In our experience, copying to a hard disk and shipping it express is actually faster and more stable than trying to copy via an online service. md5sums should be attached so that the integrity of the copies can be checked.
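As a minimal sketch of the integrity check mentioned above, the following computes an MD5 digest of a large file in fixed-size chunks, so memory use stays flat even for a journal file in the terabyte range. The function names and the chunk size are illustrative assumptions, not part of any tooling referenced in this task.

```python
import hashlib


def md5_of_file(path, chunk_size=8 * 1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops at the empty bytes object returned at EOF
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(path, expected_md5):
    """Return True if the file's MD5 matches the expected hex digest."""
    # Hypothetical helper: expected_md5 would come from the attached md5sums
    return md5_of_file(path) == expected_md5.strip().lower()
```

Running `verify_copy` against both the source and the shipped disk would show whether the copy is intact; on files of this size the read itself is expected to dominate the runtime.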

The estimated cost for a 7200 rpm 6 TB disk to facilitate the copy is about 100-250 USD.

The shipping cost is 40 to 100 USD, with a shipping delay of 3-5 days.

Sponsors for disk and shipping cost will be available.

Event Timeline

BTracy-WMF subscribed.

Hi @Seppl2013 ,

We won't be able to work on this request in the near future (1-2 months). The utility of a journal file is tied to the use of Blazegraph, which we are currently evaluating alternatives for. Until we have more clarity on next steps for an architectural migration, we can't commit to feature requests like these.

We will update this ticket in the next quarter with next steps and/or alternative solutions once we have had time to begin scoping.

It would be great if we could get this before the legacy full graph is switched off.

I believe I'm still technically capable of helping out here, if there is still a full graph test node of some sort / something that is / can be depooled.

Per the process followed at https://addshore.com/2023/08/wikidata-query-service-blazegraph-jnl-file-on-cloudflare-r2-and-internet-archive/

I could likely give this a go sometime towards the end of the year

@Addshore Adam - I would really appreciate this. What do you imagine the transfer logistics would be?

Please note that for mediawiki downloads I get the following error more often than not:

curl: (56) OpenSSL SSL_read: error:0A000126:SSL routines::unexpected eof while reading, errno 0

We also had a hard time in June getting a journal file of some 2 TB transferred reliably.

@Seppl2013 (other Adam here, big fan of @Addshore's works :) ) - you'll probably want to use wget instead of curl, as it tends to be more reliable. I wrote up some notes at https://techblog.wikimedia.org/2025/04/08/wikidata-query-service-graph-database-reload-at-home-2025-edition/ for some pieces of this, after learning of various approaches folks have taken in the past (including the very cool approach of @Addshore) and having spent some time staring at this sort of challenge.

@dr0ptp4kt - even wget is IMHO not up to the task. For file transfers of 1 TB and up you need a reliable line for multiple hours, and for larger datasets potentially even days. Even a simple MD5 check already takes hours, even if the source and target disks are SSDs. I will happily try things out if the file to be tested with is juicy enough. The three Blazegraph journal files certainly are :-)

A Wikidata-only data.jnl file (gzip compressed) has been uploaded to my file server in 5 GB segments. It is available for download here: https://files.scatter.red/orb/2025/12/

It is based on Wikidata dumps and uses Wikimedia Foundation-developed tooling to build a Wikidata Query Service. This does not include Categories RDF, lexemes, Commons Structured Data, or other data sources. If you can forgive those absences, this will otherwise give you the same unified query service that the Wikimedia Foundation provides.

If there is a strict provenance requirement for the data.jnl to be published by the Wikimedia Foundation, I unfortunately cannot help with that. Otherwise, this should work.

You can use this in conjunction with https://github.com/scatter-llc/private-wikidata-query using the pre-built data.jnl instructions.

@Harej - thanks. The download with 4 parallel threads is underway, currently at 6 MByte/s, with an ETA of +18 h (tomorrow afternoon our time). I will report on the reassembly and md5 check. Do I need an update of https://github.com/scatter-llc/private-wikidata-query, which I already forked to get the streaming update problem fixed?

Unfortunately, the problems we had in June are showing up again. The download is very unreliable and slow. aria2c and wget have major bugs, and I also goofed again by starting dozens of downloads in parallel, all of which I had to cancel this morning. So far I have only managed to download some 15 chunks. Any ideas how I can get a higher download throughput than the current less than 1 MBit/s?

Is this constrained by the host or client download speed?
What speed can https://files.scatter.red/orb/2025/12/ allow?
And what's your connection speed?

@Addshore I am not sure - the speed has increased a lot this afternoon.

wf@wikidata:/hd/gamma/wikidata2025-12-11$ date;du -sm . 
Sat Dec 13 04:57:26 PM CET 2025
12701	.
wf@wikidata:/hd/gamma/wikidata2025-12-11$ date;du -sm . 
Sat Dec 13 04:59:24 PM CET 2025
16953	.

That is 36 MB/s, which is good enough.
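The 36 MB/s figure follows from the two `du -sm` samples above. A small sketch of the arithmetic, using the timestamps and sizes from the terminal output (the function name is an illustrative assumption):

```python
from datetime import datetime


def throughput_per_second(start, end, start_size, end_size):
    """Average growth per second between two size samples."""
    seconds = (end - start).total_seconds()
    if seconds <= 0:
        raise ValueError("end must be later than start")
    return (end_size - start_size) / seconds


# Values copied from the two samples above (CET, 4:57:26 PM and 4:59:24 PM).
# GNU du -sm reports sizes in units of 1,048,576 bytes.
first = datetime(2025, 12, 13, 16, 57, 26)
second = datetime(2025, 12, 13, 16, 59, 24)
rate = throughput_per_second(first, second, 12701, 16953)
print(f"{rate:.1f} MiB/s")  # prints: 36.0 MiB/s
```

The samples are 118 seconds apart and differ by 4252 units, which gives roughly 36 per second.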

./jnlget --cat
✓ md5sums.txt exists
➜ reassembling downloads into 2025-12-07-wikidata-data.jnl.gz
57.7GiB 0:01:06 [ 110MiB/s] [ 895MiB/s] [======>                                                  ] 14% ETA 0:06:40

See https://wiki.bitplan.com/index.php/Wikidata_Import_2025-12-13 for the full story.
We now get the following error:

22:52:44.153 [main] ERROR org.wikidata.query.rdf.tool.Update - Error during updater run.
java.lang.RuntimeException: com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'Please': was expecting ('true', 'false' or 'null
 at [Source: (org.apache.http.conn.EofSensorInputStream); line: 1, column: 8]
        at org.wikidata.query.rdf.tool.wikibase.WikibaseRepository.fetchRecentChanges(WikibaseRepository.java:237)
        at org.wikidata.query.rdf.tool.change.RecentChangesPoller.doFetchRecentChanges(RecentChangesPoller.java:325)
        at org.wikidata.query.rdf.tool.change.RecentChangesPoller.fetchRecentChanges(RecentChangesPoller.java:314)
        at org.wikidata.query.rdf.tool.change.RecentChangesPoller.batch(RecentChangesPoller.java:338)
        at org.wikidata.query.rdf.tool.change.RecentChangesPoller.firstBatch(RecentChangesPoller.java:162)
        at org.wikidata.query.rdf.tool.change.RecentChangesPoller.firstBatch(RecentChangesPoller.java:38)
        at org.wikidata.query.rdf.tool.Updater.run(Updater.java:152)
        at org.wikidata.query.rdf.tool.Update.run(Update.java:174)
        at org.wikidata.query.rdf.tool.Update.main(Update.java:98)

Wasn't there a fix for this?

That isn't a problem with the data, it's a problem with the updater.

Add this to docker-compose.yml, in the environment block under the wdqs service:

- UPDATER_OPTS=-Dhttp.userAgent=OrbUpdateBot/0.3.97-wmde.8

curl -H "User-Agent: TestBot/1.0 (test@example.com)" \
  "https://www.wikidata.org/w/api.php?action=query&list=recentchanges&format=json&rcprop=title|timestamp|ids&rclimit=1"
{"batchcomplete":"","continue":{"rccontinue":"20251213230327|2517185297","continue":"-||"},"query":{"recentchanges":[{"type":"edit","ns":0,"title":"Q8589960","pageid":8563160,"revid":2442092221,"old_revid":2390459973,"rcid":2517185298,"timestamp":"2025-12-13T23:03:27Z"}]}}

This works from inside the Docker container.
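For comparison, here is a hedged Python sketch of the same recentchanges request with an explicit User-Agent header. The function names are illustrative assumptions; the query parameters mirror the curl call above. The request is only built here, and sending it is left to the caller.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_recentchanges_request(user_agent, limit=1):
    """Build (but do not send) a recentchanges request with a User-Agent."""
    if not 1 <= limit <= 500:
        # The API warns when rclimit is outside 1..500
        raise ValueError("limit must be between 1 and 500")
    params = {
        "action": "query",
        "list": "recentchanges",
        "format": "json",
        "rcprop": "title|timestamp|ids",
        "rclimit": str(limit),
    }
    url = API_URL + "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_recent_changes(request, timeout=30):
    """Send the request and parse the JSON body (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_recent_changes(build_recentchanges_request("TestBot/1.0 (test@example.com)"))` should return a JSON structure like the one shown above. If the body is not JSON, `json.load` raises, which is comparable to the parse failure in the updater stack trace.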

https://wiki.bitplan.com/index.php/Wikidata_Import_2025-12-13#updater has a fix for https://github.com/scatter-llc/private-wikidata-query/issues/10

The brute-force version, _JAVA_OPTIONS=-Dhttp.userAgent=OrbUpdateBot/0.3.97-wmde.8, seems more reliable. There is also a script, test_updater:

./test_updater -2 | jq .

{
  "batchcomplete": "",
  "continue": {
    "rccontinue": "20251214100812|2517351786",
    "continue": "-||"
  },
  "warnings": {
    "recentchanges": {
      "*": "The value \"-2\" for parameter \"rclimit\" must be between 1 and 500."
    }
  },
  "query": {
    "recentchanges": [
      {
        "type": "edit",
        "ns": 0,
        "title": "Q124667703",
        "pageid": 118818608,
        "revid": 2442255157,
        "old_revid": 2088291448,
        "rcid": 2517351787,
        "user": "Kristbaumbot",
        "bot": "",
        "oldlen": 2211,
        "newlen": 2284,
        "timestamp": "2025-12-14T10:08:12Z",
        "comment": "/* wbeditentity-update-languages-short:0||mul */ Added 'mul' label: Yao Shu-Ping"
      }
    ]
  }
}

The test_updater script is in the https://github.com/WolfgangFahl/get-your-own-wdqs fork of James Hare's solution.

A second copy on my server fur is also running. The tools ran smoothly. Note that the size of the main file is not available remotely, so one has to fill in the value manually and rerun jnlget --cat.

./jnlget --cat
✓ md5sums.txt exists
➜ reassembling downloads into 2025-12-07-wikidata-data.jnl.gz
⚠ 2025-12-07-wikidata-data.jnl.gz exists will not override
✓ reassembled 2025-12-07-wikidata-data.jnl.gz (438103694405 bytes)
➜ checking md5 for 2025-12-07-wikidata-data.jnl.gz of size 438103694405 to be ad0006a38103efd715c782a539a6f482
✓ 2025-12-07-wikidata-data.jnl.gz size and hash
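The reassembly and check shown in the output above can be sketched roughly as follows. This is not the jnlget script itself, only an illustration of the idea: concatenate the segments in order, refuse to overwrite an existing target, and compare size and MD5 against expected values. The function name and the way segments are passed in are assumptions.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 8 * 1024 * 1024  # read 8 MiB at a time to keep memory use flat


def reassemble(segments, target, expected_size, expected_md5):
    """Concatenate segments (already in order) into target and verify it."""
    target = Path(target)
    if target.exists():
        raise FileExistsError(f"{target} exists, will not overwrite")
    digest = hashlib.md5()
    total = 0
    with open(target, "wb") as out:
        for segment in segments:
            with open(segment, "rb") as src:
                while True:
                    chunk = src.read(CHUNK_SIZE)
                    if not chunk:
                        break
                    out.write(chunk)
                    # Hash while writing so the data is read only once
                    digest.update(chunk)
                    total += len(chunk)
    if total != expected_size:
        raise ValueError(f"size mismatch: {total} != {expected_size}")
    if digest.hexdigest() != expected_md5.strip().lower():
        raise ValueError("md5 mismatch")
    return total
```

Hashing during the write avoids a second full pass over a file of more than 400 GB, at the cost of only detecting a mismatch after the target has been written.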

pv -petrab /hd/tepig/wd2025-12/2025-12-07-wikidata-data.jnl.gz|gunzip > data/data.jnl 
 408GiB 2:04:19 [56,0MiB/s] [56,0MiB/s] [================================================================>] 100%    
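Where `pv` and `gunzip` are not available, the decompression step can be approximated in Python. This is a minimal sketch with an assumed function name and buffer size; it streams the data, so the compressed file is never held in memory, but it is likely slower than the pipeline above and shows no progress bar.

```python
import gzip
import os
import shutil


def gunzip_stream(source_gz, target, buffer_size=16 * 1024 * 1024):
    """Stream-decompress a gzip file to target and return the bytes written."""
    with gzip.open(source_gz, "rb") as src, open(target, "wb") as dst:
        # copyfileobj moves the data in buffer_size pieces
        shutil.copyfileobj(src, dst, buffer_size)
    return os.path.getsize(target)
```

A call such as `gunzip_stream("2025-12-07-wikidata-data.jnl.gz", "data/data.jnl")` would correspond to the command above, assuming the `data` directory already exists.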

20:24:33.120 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2025-12-07T02:05:39Z (next: 20251207020539|2513996789) at (10.1, 3.4, 1.2) updates per second and (8669.7, 2936.2, 1082.4) milliseconds per second
20:24:33.372 [main] INFO  o.w.q.r.t.change.RecentChangesPoller - Got 82 changes, from Q106782689@2439028013@20251207020542|2513996799 to Q2055057@2439028104@20251207020617|2513996895