
Request creation of wikiqlever VPS project
Closed, ResolvedPublic

Description

Project Name: wikiqlever

Developer account usernames of requestors: evomri physikerwelt

Purpose: Evaluate QLever as a Blazegraph alternative for the Wikidata Query Service graph database backend.

Brief description: We would like to test QLever as an alternative to Blazegraph in a shared WMF-hosted infrastructure. However, it seems that this is not possible in Toolforge.

A standard Toolforge environment will not cut it because it is limited to 64 GB RAM and rotating disk. We need at least 128 GB RAM and SSDs. 256 GB RAM and 3 x 4 TB SSD is a decent environment to start with. DBIS at RWTH Aachen even runs on 512 GB RAM and 20 GB SSD space.

https://www.wikidata.org/wiki/Wikidata_talk:Scholia/Events/Hackathon_October_2024#Infrastructure_project

How soon you are hoping this can be fulfilled: Long enough before the "graph split" happens.

Event Timeline

You mentioned 3 different RAM/disk quotas. The 512 GB RAM quota is a big ask. Could we start with the lower one and go from there?

Sure, we can start with 128 GB. However, if the experiment is a success and there are many users (on the order of the number of visitors of Scholia), this won't be enough.

Could you please clarify the initial disk quota as well?

We need at least 128 GB RAM and SSDs.

There are no direct access SSD devices available to normal Cloud VPS projects. Cloud VPS uses Ceph volumes for storage which generally provide spinning rust level IOPS. We have made some custom instance flavors available to the "integration" project (Beta-Cluster-Infrastructure), but even for those it is questionable if SSD level IOPS performance is possible.

Hello @Physikerwelt! I am an SRE on the Search Platform team, and my responsibilities include the current WDQS infrastructure. While I can't estimate the exact resource needs of the WDQS graph under QLever, I can give you some info on its current resource usage under Blazegraph.

This dashboard can give you some idea of the system requirements for running WDQS under Blazegraph. Unfortunately, the disk reporting appears broken, but the current Blazegraph installation takes up ~1.2 TB. The amount of memory used hovers around ~50 GB.

As far as the disk type, SSD would be nice to have, but mechanical drives in RAID-10 with lots of spindles can perform adequately for these types of workloads in my experience. If WMCS is able to offer 1.2+ TB disk, I would go ahead and try the experiment that way.

Thanks for running this experiment. Feel free to reach out on IRC (#wikimedia-search room for my team, inflatador for me) if we can do anything to help.

@bking thank you. That sounds all right. https://wiki.bitplan.com/index.php/Wikidata_Import_2024-10-17 suggests similar figures, if I read it correctly. I understood that @WolfgangFahl suggested that SSDs would be needed. I don't know the exact disk usage patterns of QLever, but as the script is ready, we might even use Puppet and test with different hardware configurations.
The script includes running
curl -LRC - --remote-name-all https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2 2>&1
Do you know if the latest Wikibase dumps are available via NFS?

Do you know if the latest Wikibase dumps are available via NFS?

Yes, everything on dumps.wikimedia.org is also available on NFS. Before you can mount the dumps NFS share on your cloud servers, I believe you'll need to follow the procedure in this article. Good luck and let us know if you have any other questions.
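As a hedged sketch of what that means in practice: the dumps NFS share mirrors the URL layout of dumps.wikimedia.org, so a dump URL maps mechanically to its path under the mount point (the /public/dumps/public prefix is the one used later in this thread; verify it on your own instance).

```shell
# Map a dumps.wikimedia.org URL to its (assumed) NFS path by swapping
# the host prefix for the mount point.
url='https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2'
nfs="/public/dumps/public${url#https://dumps.wikimedia.org}"
echo "$nfs"
```

Reading the dump from NFS avoids an 8-hour, 100+ GB download per import run.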

I think we need to clearly specify again the required disk and RAM quotas.

@aborrero sorry for the confusion. I believe we are talking about a single server, as opposed to a project-wide quota.

Based on the resources Blazegraph needs in production (and keeping in mind that qlever may well have different resource needs), I'd recommend the following (again, for a single server)

  • Memory: At least 128 GB RAM (revised from 64 GB)
  • CPUs: more is better, but heavy CPU isn't needed once the initial conversion from dump to graph database file has happened. 8 cores would be better, but 4 is fine.
  • Disk space: @Physikerwelt or @WolfgangFahl, or anyone who's run the qlever import script lately, can you let us know how much disk we might need?

@aborrero sorry for the confusion. I believe we are talking about a single server, as opposed to a project-wide quota.

Yeah, the only quotas that exist in Cloud VPS are project-wide.

So, you all can do the math of how much RAM, CPU, disk, and how many instances you need in total for the project, even if only a single VM will be used, and that would be the project quota that we would need to evaluate/approve/set.

Memory: At least 64 GB RAM

QLever seems to use quite a bit more memory. Even with 128 GB we saw out-of-memory errors for queries that run fine on Blazegraph.

2.5 TB: The current QLever Wikidata import uses 1.1 TB (compressed dump file + indexes). If we need to decompress the file for the import, you need >1 TB for the dump and ~1 TB for the index.

It's 4 TB of SSD and 128 GB RAM for starters, which will give you a single indexable QLever instance. If you want multiple instances, e.g. QLever and Blazegraph in parallel, the footprint for indexing gets bigger. The runtime requirements are lower, so it might be possible to just copy the index result over, but that would not be the point of the exercise.
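To make the arithmetic behind these figures explicit, here is a minimal sketch; the dump and index sizes are the ones quoted above, while the 25% headroom is my own assumption, not a measured number.

```shell
# Rough disk budget for a single QLever Wikidata instance (values in GB).
dump_compressed=1100   # latest-all + lexemes dumps, ~1.1 TB (quoted above)
index=1000             # QLever index, ~1 TB (quoted above)
headroom=25            # percent safety margin (assumed)

required_gb=$(( (dump_compressed + index) * (100 + headroom) / 100 ))
echo "required: ${required_gb} GB"
```

That lands well above 2 TB before any second instance or decompressed dump, which is why the 4 TB SSD figure is a sensible starting point.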

I have two servers with 384 GB of RAM each, plus a workstation with 256 GB of RAM that can be used for testing. (I have tested QLever successfully on it before.) One of the servers has a bit over 2 TB free; the other has over 7 TB free. (Capacity can be added with funding.) I think either or both could be used to run QLever. What I would want to know is:

  • What will RAM usage be when building the index vs. regular operation?
  • What will the rebuild process be? How often will you rebuild?
  • Will you need ongoing access to my systems?

I ask because QLever will be a co-tenant with Blazegraph, so I will want to make sure not too much memory is committed at any one time.

@Harej
thank you for pointing out your infrastructure availability, which is sort of already "priced in" in the calculation. We are seeking a third mirror environment sponsored by the Wikimedia Foundation. As for the technical aspects, https://wiki.bitplan.com/index.php/Wikidata_Import_2024-10-17

has the description of how we intend to have a rotating regular update. Our assumption that we get daily refreshed dumps is currently wrong; I think it is more like a weekly thing. E.g. we just got a fresh dump in https://dumps.wikimedia.org/wikidatawiki/entities/ - so I can now start another import, which should give us the RAM usage.

qlv -h
Usage: /home/wf/bin/qlv [OPTIONS]
Options:
  -h, --help             Show this help message
  -c, --current          Show the disk currently used by QLever
  -d, --debug            Enable debug output
  -ir, --index-run       Run QLever wikidata indexing on today's disk
  -p, --pull             Pull QLever Docker images
  -qc, --qlever-control  setup qlever-control
  -s, --space            Show free disk space
  -t, --today            Show disk to be used today
  -v, --version          Show version information
qlv -s
Directory  Device           Available      Total Type
alpha      /dev/sdb1             2.0T       3.5T  SSD
beta       /dev/sdc1             2.2T       3.5T  SSD
delta      /dev/sde1             2.8T       3.5T  SSD
eneco      /dev/sda1             8.7T        11T  HDD
gamma      /dev/sdd1             3.2T       3.5T  SSD
mantax     /dev/nvme0n1p1        1.1T       5.8T  SSD
qlv -t
/hd/delta
qlv -ir
✅:Created directory /hd/delta/qlever/wikidata_20241024
✅:Started screen session qlever_wikidata_20241024.
✅:Logging to /hd/delta/qlever/wikidata_20241024/screen.log

 tail -f /hd/delta/qlever/wikidata_20241024/screen.log
eval "$(register-python-argcomplete qlever)" && export QLEVER_ARGCOMPLETE_ENABLED=1


Command: get-data

curl -LRC - --remote-name-all https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2 2>&1

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  105G    0  642M    0     0  4105k      0  7:30:57  0:02:40  7:28:17 4043k

The download does not need much RAM, so we'll have to wait a few hours to see the CPU and RAM load.

@aborrero sorry for the confusion. I believe we are talking about a single server, as opposed to a project-wide quota.

Yeah, the only quotas that exist in Cloud VPS are project-wide.

So, you all can do the math of how much RAM, CPU, disk, and how many instances you need in total for the project, even if only a single VM will be used, and that would be the project quota that we would need to evaluate/approve/set.

Thanks, I was mainly wondering about the flavor size. I guess y'all are willing/able to create large flavors for this? The largest flavor I see on my account is 32 GB RAM.

Do you know a tool that will track CPU and memory load of a server over time (on an Ubuntu server)?
Currently we have 3% memory of 512 GB and 300% CPU load (of 16 cores) while the indexing runs. Hannah Bast reported that the memory needs are no longer at 128 GB these days, but I do not know the peak. If we want to go by trial and error, we could go with the current max and run things for a test.

@Seppl2013 I recommend sysstat (also known as sar) for tracking memory and load. sysstat takes 10 minute samples by default, and you can see the memory stats with sar -r. Let us know if you have any other questions.
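A hedged sketch of pulling the peak %memused out of `sar -r` output with awk. The sample lines below are made up, and the field position ($6) assumes a 12-hour-clock layout; adjust it for your sysstat version and locale.

```shell
# Extract the highest %memused seen in a sar -r report.
# `sample` stands in for the real output of `sar -r` (illustrative only).
sample='12:00:01 AM kbmemfree kbavail kbmemused %memused
12:10:01 AM 100 200 300 42.5
12:20:01 AM 100 200 300 61.2'
peak=$(printf '%s\n' "$sample" | awk 'NR > 1 { print $6 }' | sort -n | tail -1)
echo "peak %memused: $peak"
```

On a real run you would replace the sample with `sar -r` itself, e.g. `sar -r | awk ...`, once sysstat has collected data across an import.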

@bking - thanks for the hint, I'll try it out on the upcoming imports.

Maybe we can create the project with the default quotas, and you can ask for whatever quota bump later once the data is clear to you?

Maybe we can create the project with the default quotas, and you can ask for whatever quota bump later once the data is clear to you?

I'd support this. If it's generally possible to create flavors with more RAM, I don't see anything that speaks against it.

aborrero claimed this task.

Done; please create a separate ticket for quota requests.

What is the next step here? I do not know what it means that this ticket has been worked on.

What is the next step here? I do not know what it means that this ticket has been worked on.

The task has been resolved. We now have a dedicated wikiqlever cluster and are advised to request more resources when we hit limits.

Still, I am in limbo about what to do and what the next steps are. How can we make use of this environment?

We now have a Cloud VPS cluster that anyone with a Wikitech account can use, with as many resources as necessary to test QLever. I was trying to reproduce your setup there with Puppet, but it didn't work on the first try because I wasn't able to understand the documentation (T379501). However, there is no need to use Puppet; we can also do everything manually, it is then just harder to reproduce.

We have set up a basic QLever test instance called qlever1 with Debian 13. Project members can ssh into it with the following ssh config:

Host qlever1
    Hostname qlever1.wikiqlever.eqiad1.wikimedia.cloud
    ProxyCommand ssh -a -W %h:%p YourUserName@primary.bastion.wmflabs.org
    User YourUserName
    RequestTTY force
    ServerAliveInterval 240
    ServerAliveCountMax 3
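A hedged way to try the stanza before editing ~/.ssh/config: stage it in a throwaway file and point ssh at it with -F. YourUserName is a placeholder, exactly as in the stanza above.

```shell
# Write the stanza to a throwaway config so it can be tested in isolation.
cat > ./qlever1.sshconf <<'EOF'
Host qlever1
    Hostname qlever1.wikiqlever.eqiad1.wikimedia.cloud
    ProxyCommand ssh -a -W %h:%p YourUserName@primary.bastion.wmflabs.org
    User YourUserName
    RequestTTY force
EOF
# Once YourUserName is filled in, this should drop you on the instance:
# ssh -F ./qlever1.sshconf qlever1
grep -c '^Host qlever1' ./qlever1.sshconf
```

When it works, merge the stanza into ~/.ssh/config and plain `ssh qlever1` will do.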

First, we needed to solve a problem with the environment. Setting the locale in /etc/environment as follows

cat /etc/environment 
LC_CTYPE="en_US.UTF-8"
LANG=en_US.utf-8
LC_ALL=en_US.utf-8

did the trick. Then we installed Docker following https://docs.docker.com/engine/install/debian/ and https://docs.docker.com/engine/install/linux-postinstall/ until we could run

2025-12-18 14:29:15 docker run hello-world

Then we installed qlever (the first attempts via pip failed; pipx worked):

2025-12-18 14:17:14 sudo apt-get install pip
2025-12-18 14:17:27 pip install qlever
2025-12-18 14:17:37 pipx install qlever
2025-12-18 14:17:51 sudo apt-get install pipx
2025-12-18 14:18:02 pipx install qlever
2025-12-18 14:18:43 pipx ensurepath
2025-12-18 14:18:53 source ~/.bashrc

Afterwards we started setting up the olympics example
https://github.com/qlever-dev/qlever-control?tab=readme-ov-file#usage

2025-12-18 14:19:04 qlever setup-config olympics 
2025-12-18 14:19:12 qlever get-data
2025-12-18 14:30:38 qlever index                   # Build index data structures for this dataset
2025-12-18 14:31:22 qlever start                   # Start a QLever server using that index
2025-12-18 14:31:34 qlever query                   # Launch an example query
2025-12-18 14:31:51 qlever ui                      # Launch the QLever UI
2025-12-18 14:33:11 curl http://qlever1:8176/olympics

Now we had to set up web proxies (there is no public IP):

https://qlever-backend-demo1.wmcloud.org/ (backend) pointing to qlever1 port 7019
https://qlever-ui-demo1.wmcloud.org (frontend) pointing to qlever1 port 8176

and configure the frontend to use the backend

physikerwelt@qlever1:~$ cat Qleverfile-ui.yml
config:
  backend:
    name: Olympics
    slug: olympics
    sortKey: 1
    baseUrl: https://qlever-backend-demo1.wmcloud.org
...

Now you can access https://qlever-ui-demo1.wmcloud.org

For the next step (importing Wikidata) we need more hardware resources.

We now have 32 GB RAM and a 1 TB disk. I resized the qlever1 instance to 32 GB memory and attached the disk following https://wikitech.wikimedia.org/wiki/Help:Adding_disk_space_to_Cloud_VPS_instances

sudo wmcs-prepare-cinder-volume
This tool will partition, format, and mount a block storage device.


Attached storage devices:

    sda:  (the primary volume containing /)
    sdb: new volume, will be formatted before mounting

The only block device available to mount is sdb.  Selecting.

Where would you like to mount it? </srv>  
Ready to prepare and mount sdb on /srv. OK to continue? <Y|n>Y
Formatting as ext4...
mke2fs 1.47.2 (1-Jan-2025)
Discarding device blocks:   8912896/268435456

Preparing qlever data

physikerwelt@qlever1:/srv$ sudo mkdir qlever
physikerwelt@qlever1:/srv$ sudo chown physikerwelt qlever
physikerwelt@qlever1:/srv$ cd qlever/
physikerwelt@qlever1:/srv/qlever$ qlever setup-config wikidata
physikerwelt@qlever1:/srv/qlever$ screen

physikerwelt@qlever1:/srv/qlever$ qlever get-data

Command: get-data

curl -LRC - -O https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 -O https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2 2>&1 | tee wikidata.download-log.txt && curl -sL https://dumps.wikimedia.org/wikidatawiki/entities/dcatap.rdf | docker run -i --rm -v $(pwd):/data stain/jena riot --syntax=RDF/XML --output=NT /dev/stdin > dcatap.nt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0  112G    0  170M    0     0  4101k      0  8:01:22  0:00:42  8:00:40 4077k

Running the same step from another instance (with access to NFS) for comparison:

/data/scratch/qlever$ time docker run -i --rm -v /public/dumps/public/wikidatawiki/entities:/data stain/jena riot --syntax=RDF/XML --output=NT /dev/stdin > dcatap.nt

Now starting the indexing

physikerwelt@qlever1:/srv/qlever$ qlever get-data

Command: get-data

curl -LRC - -O https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 -O https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.bz2 2>&1 | tee wikidata.download-log.txt && curl -sL https://dumps.wikimedia.org/wikidatawiki/entities/dcatap.rdf | docker run -i --rm -v $(pwd):/data stain/jena riot --syntax=RDF/XML --output=NT /dev/stdin > dcatap.nt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  112G  100  112G    0     0  4073k      0  8:04:42  8:04:42 --:--:-- 2849k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  575M  100  575M    0     0  4036k      0  0:02:26  0:02:26 --:--:-- 4011k
Download successful, total file size: 121,912,642,826 bytes

physikerwelt@qlever1:/srv/qlever$ ls
dcatap.nt  latest-all.ttl.bz2  latest-lexemes.ttl.bz2  Qleverfile  wikidata.download-log.txt
physikerwelt@qlever1:/srv/qlever$ ls -lah
total 114G
drwxr-xr-x 2 physikerwelt root    4.0K Jan  8 07:06 .
drwxr-xr-x 4 root         root    4.0K Jan  7 22:56 ..
-rw-r--r-- 1 physikerwelt wikidev 109K Jan  8 07:06 dcatap.nt
-rw-r--r-- 1 physikerwelt wikidev 113G Dec 31 18:04 latest-all.ttl.bz2
-rw-r--r-- 1 physikerwelt wikidev 576M Jan  2 23:35 latest-lexemes.ttl.bz2
-rw-r--r-- 1 physikerwelt wikidev 2.1K Jan  7 22:57 Qleverfile
-rw-r--r-- 1 physikerwelt wikidev 2.3M Jan  8 07:06 wikidata.download-log.txt
physikerwelt@qlever1:/srv/qlever$ qlever index

Command: index

echo '{ "languages-internal": [], "prefixes-external": [""], "locale": { "language": "en", "country": "US", "ignore-punctuation": true }, "ascii-prefixes-only": true, "num-triples-per-batch": 5000000 }' > wikidata.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --name qlever.index.wikidata --init --entrypoint bash adfreiburg/qlever -c 'ulimit -Sn 500000 && IndexBuilderMain -i wikidata -s wikidata.settings.json --vocabulary-type on-disk-compressed -f <(lbzcat -n 4 latest-all.ttl.bz2) -g - -F ttl -p true -f <(lbzcat -n 1 latest-lexemes.ttl.bz2) -g - -F ttl -p false -f <(cat dcatap.nt) -g - -F nt -p false --stxxl-memory 10G | tee wikidata.index-log.txt'

2026-01-08 11:15:01.717 - INFO: QLever IndexBuilder, compiled on Mon Dec 15 23:24:16 UTC 2025 using git hash 959c50
2026-01-08 11:15:01.732 - INFO: You specified "locale = en_US" and "ignore-punctuation = 1"
2026-01-08 11:15:01.733 - INFO: You specified "ascii-prefixes-only = true", which enables faster parsing for well-behaved TTL files
2026-01-08 11:15:01.733 - INFO: You specified "num-triples-per-batch = 5,000,000", choose a lower value if the index builder runs out of memory
2026-01-08 11:15:01.733 - INFO: By default, integers that cannot be represented by QLever will throw an exception
2026-01-08 11:15:01.733 - INFO: Processing triples from 3 input streams ...
2026-01-08 11:15:01.738 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
2026-01-08 11:16:06.650 - INFO: Triples parsed: 70,000,000 [average speed 1.1 M/s, last batch 1.2 M/s, fastest 1.2 M/s, slowest
2026-01-08 11:16:58.941 - INFO: Triples parsed: 130,000,000 [average speed 1.1 M/s, last batch 1.2 M/s, fastest 1.2 M/s, slowest 0.8 M/s]
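A back-of-envelope estimate from the log above: at the ~1.1 M triples/s average it reports, parsing the full graph takes on the order of hours. The 19 billion triple count below is an assumed ballpark for the complete Wikidata graph, not a figure taken from this log.

```shell
# Estimate parse time from the average rate shown in the index log.
triples=19000000000   # assumed ballpark for the full Wikidata graph
rate=1100000          # triples per second, from the log above
seconds=$(( triples / rate ))
echo "~$(( seconds / 3600 )) hours of parsing"
```

Parsing is only the first indexing phase; vocabulary merging and permutation building add more time on top.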

This task was the request to create the Cloud VPS project. This task is resolved as the project has been created. Please use some other task or place to track follow-ups not directly related to creating the project.

@taavi, sorry, as per T379030, one can now log progress via IRC again. I'll use that instead.