User Details
- User Since
- Jan 22 2026, 1:46 PM (20 w, 4 d)
- Availability
- Available
- LDAP User
- Lerickson
- MediaWiki User
- LErickson-WMF
Today
Claiming what remains of this: registering the stream, integrating with the proxy, finalizing the schema release when we're ready to move it out of dev, and whatever else comes up.
Yesterday
This will run from the indexing pod, right? Do we need to bundle rclone with the qlever docker image?
Update: this doesn't seem to make a difference.
I'm going to dump a couple thoughts and questions here for grooming purposes (we can discuss more on Tuesday):
Fri, Jun 12
Thu, Jun 11
An index and metadata file with files & timestamp now exists in the S3 bucket, and latest.json points to it.
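For illustration only, the layout described above (a metadata file listing files and a timestamp, plus a latest.json that points to it) could be sketched as follows. The key names (`files`, `timestamp`, `metadata_key`), the object key, and the helper functions are all hypothetical, and the bucket is replaced by an in-memory dict. The real layout may differ.

```python
import json
from datetime import datetime, timezone


def build_index_metadata(files, timestamp=None):
    """Describe one index build: its files and when it was produced."""
    ts = timestamp or datetime.now(timezone.utc)
    return {"timestamp": ts.isoformat(), "files": sorted(files)}


def build_latest_pointer(metadata_key):
    """Body of latest.json: points readers at the newest metadata file."""
    return {"metadata_key": metadata_key}


def resolve_latest(objects):
    """Follow latest.json to the metadata it references.

    `objects` is an in-memory stand-in for the bucket: key -> JSON text.
    """
    pointer = json.loads(objects["latest.json"])
    return json.loads(objects[pointer["metadata_key"]])


if __name__ == "__main__":
    meta = build_index_metadata(["index.part-1", "index.part-0"])
    key = "indexes/example/metadata.json"  # hypothetical object key
    bucket = {
        key: json.dumps(meta),
        "latest.json": json.dumps(build_latest_pointer(key)),
    }
    print(resolve_latest(bucket)["files"])
```

Keeping the pointer separate from the metadata means a consumer only ever needs to know one fixed key, and a new index can be published by uploading its files first and rewriting latest.json last.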
Mon, Jun 8
Update on these error codes. I believe that we don't need to worry about many of them.
The first dump, kicked off last week, is complete, with data ending up in S3. Marking as complete.
Fri, Jun 5
I've been investigating to see which errors can be returned from qlever. I actually could not find documentation about this, but searching the code worked pretty well.
Added server logging wherever exceptions are created. The MDC context object I mentioned above is not relevant yet. I expect we will use one later, once we have specific fields in mind to add to the logs, but for now there is nothing left to do here.
Thu, Jun 4
Wed, Jun 3
Tue, Jun 2
Confirming that I can connect to our S3 bucket from wdqs1030.
Fri, May 29
There was a lot of unexpected complexity involved in this task because connecting to S3 in a DAG turned out to be highly nontrivial config-wise. Thanks to a lot of help from SRE in setting up secrets, debugging issues with accessing them, and resolving network access issues preventing data from going to S3, I have a working DAG now that can do a lexemes dump. Sent back the MR for review again, hoping to merge next week.
Thu, May 28
Wed, May 27
Tue, May 26
Update: I've indexed a shard of the import_wikidata_ttl DAG's output and put it in our S3 bucket, along with a tiny metadata file. This was a manual process; I'm working on a script. We also would like S3 access from the WDQS nodes to make this easy to do before the DAG setup is ready: T427319
Fri, May 22
Thu, May 21
Wed, May 20
Thank you! I am trying to use the connection right now in a DAG-in-progress running on my dev instance. When I do s3 = get_s3_client("wikidata_platform_s3_dpe"), I get this error:
The proxy now rejects updates (well, it actually already did) and diagnoses them as being disallowed writes. Right now it just returns a plain old 400 Bad Request, but later on in T424218 we will add better info. This one is done, though.
Tue, May 19
I filed T426764 to track getting the creds into airflow, since that's really a separate question from what this phab was tracking. This one is all done, IMO.
Mon, May 18
We now have a PVC in mediawiki-dumps-legacy. It is called wdqs-update-pipeline.
I have successfully interacted with this PVC in a test Airflow instance (created a directory and performed a wikibase lexemes dump).
Good point @trueg that the HTTP method isn't the issue, so 405 isn't as close as I thought it was. I think I'm OK with using 400 given all this context.
May 13 2026
Thank you! I ended up doing something very similar. I like your idea of passing along the cause with an UpdateOperationException and then returning a better HTTP response. I was originally planning to update the response and had 403 in mind, but this one (405) is better.
- We likely will not need our own PVC in our namespace. We can probably use S3 for all cross-task storage in our own DAG tasks.
- We have an S3 bucket!
- @BTullis will create a PVC for us in the mediawiki-dumps-legacy namespace to use for the dump.
May 12 2026
@RKemper Thank you so much!
In order to interact with this in a DAG, I understand that these credentials will need to be in an airflow secret, and that is something only SRE can do. Does that belong in a separate ticket? This is the "wikidata" airflow instance.
It turns out that the proxy was already rejecting updates. It created and executed a Query object, and in Jena a Query represents only the read-only operations (see the Query javadoc and its different QueryTypes). UpdateRequest is what holds a request to insert, delete, or perform other non-read-only actions.
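The read/update distinction can be illustrated with a rough sketch. This is not how the proxy works: the proxy relies on Jena's parser, whereas the code below is a naive keyword check written in plain Python, with hypothetical function names, that does not handle SPARQL comments. It maps non-read operations to 400, the status mentioned in the May 20 entry.

```python
import re

# SPARQL query forms (read-only) and update operations, by leading keyword.
READ_FORMS = {"SELECT", "ASK", "CONSTRUCT", "DESCRIBE"}
UPDATE_FORMS = {
    "INSERT", "DELETE", "LOAD", "CLEAR", "CREATE",
    "DROP", "COPY", "MOVE", "ADD", "WITH",
}

# Leading PREFIX/BASE declarations, which may precede either kind of request.
_PROLOGUE = re.compile(
    r"^\s*(?:(?:PREFIX\s+[^\s:]*:\s*<[^>]*>|BASE\s*<[^>]*>)\s*)*",
    re.IGNORECASE,
)


def first_keyword(sparql):
    """Return the first keyword after the prologue, upper-cased."""
    body = _PROLOGUE.sub("", sparql, count=1)
    match = re.match(r"[A-Za-z]+", body)
    return match.group(0).upper() if match else ""


def classify(sparql):
    """Classify a request as 'read', 'update', or 'unknown'."""
    keyword = first_keyword(sparql)
    if keyword in READ_FORMS:
        return "read"
    if keyword in UPDATE_FORMS:
        return "update"
    return "unknown"


def status_for(sparql):
    """Hypothetical mapping: reads pass, everything else is a 400."""
    return 200 if classify(sparql) == "read" else 400


if __name__ == "__main__":
    print(classify("SELECT * WHERE { ?s ?p ?o }"))
    print(status_for("INSERT DATA { <a> <b> <c> }"))
```

A real parser is the safer choice, since a keyword check can be fooled by comments or unusual formatting. The sketch only shows why a component that accepts nothing but query forms rejects updates as a side effect.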
Update: after talking with Ben, it seems we might use Ceph RGW/S3 for the munge/split step and for the side-effect tables as well. I will come back with a different storage estimate when I have more info, but I wanted to note this here for posterity.
May 11 2026
@gmodena I was not planning to use this bucket for the output of the data processing task. I was planning to store that using a new CephFS PVC. I was, however, planning to use it for the index files. So I guess it depends on what you mean by main/scholarly splits. If you mean the output of munge/split, no. If you mean the index files themselves, yes.
I filed T425973 for the S3 bucket bullet above.
Update on what I've learned so far. The tasks are numbered as follows: 1) Dump, 2) Quality check, 3) Spark processing, 4) Index, 5) Orchestrate reload
May 8 2026
Hi again @Yirba, I believe it's all ready now (I don't see the items in WDQS anymore). Thanks again for the report.
Thanks @Yirba for the report! I attempted to fix these entities in the same way, but unfortunately I am still seeing them returned in WDQS. I will investigate and get back to you.
May 6 2026
Thanks @gmodena, I will close this then!
@gmodena are we pulling 2009 into the k8s cluster, in which case we can close this?
Closing this - no longer relevant with the plan for an entirely new data pipeline. See T422179
