
[TIMEBOX: 0hrs] Review allowlist check script
Closed, ResolvedPublic

Description

Context

📍KR: Less than 5% of endpoints on the WDQS Allowlist are non-operational

This is a vibe-coded script that @Anton.Kokh uses to check whether endpoints on the WDQS allowlist are operational.

The logic is supposed to be the following:

  1. Get all endpoints from the allowlist.
  2. Normalize them (remove trailing slashes and wildcards).
  3. Remove duplicates.
  4. Make 3 rounds of simple federated calls from WDQS to each endpoint (experimentation showed that results for some endpoints differ between rounds, so this allows us to get a more accurate picture of which endpoints are consistently failing).
  5. Generate an output dataframe and CSV that shows the original list of endpoints and the results for each of them. In the results, we can see which endpoints were removed through de-duplication and how many times out of 3 they failed, including the errors that occurred.
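The steps above can be sketched as follows. This is a hedged illustration of the intended logic only, not the actual script: the function and column names are assumptions, and `probe` stands in for whatever issues the federated SPARQL call via WDQS.

```python
# Illustrative sketch of the described pipeline (normalize, de-duplicate,
# probe each endpoint 3 times). Names are assumptions, not the real script.

def normalize(url: str) -> str:
    """Strip trailing slashes and wildcard characters from an allowlist entry."""
    return url.rstrip("/").rstrip("*").rstrip("/")

def check_endpoints(endpoints, probe, rounds=3):
    """Probe each unique normalized endpoint `rounds` times.

    `probe` is assumed to be a callable that makes one federated call via
    WDQS and returns an error string on failure, or None on success.
    """
    seen = set()
    results = []
    for original in endpoints:
        url = normalize(original)
        if url in seen:
            # Entry collapses onto one we already probed: mark as duplicate.
            results.append({"endpoint": original, "status": "duplicate",
                            "fails": None, "errors": []})
            continue
        seen.add(url)
        errors = [e for e in (probe(url) for _ in range(rounds)) if e]
        results.append({"endpoint": original,
                        "status": "ok" if not errors else "failing",
                        "fails": len(errors),
                        "errors": errors})
    return results
```

The per-endpoint dicts map directly onto the rows of the output dataframe/CSV the description mentions.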

The information in the status and fails columns provides visibility into redundant and non-operational endpoints (including those that fail only on some occasions). Running the script on different days and comparing results allows us to get even more accurate insights.

Attached is an example of the most recent output of this script.

Task

Since the script was vibe-coded by the PM, it would be great if an engineer reviewed it for sanity.

  • Review the script to see if it looks correct
  • Black-box test the script by validating the results via some independent means (e.g. via a bash script)

Event Timeline

Ollie.Shotton_WMDE renamed this task from Review allowlist check script to [TIMEBOX: 8hrs] Review allowlist check script.Mar 3 2026, 3:24 PM
Ollie.Shotton_WMDE updated the task description. (Show Details)

Questions

  1. Do we really want to be stripping characters or removing double slashes from the end of the URLs? I would have thought we want to test exactly what is in the allowlist: if that fails, then we know the URL in the allowlist needs updating; otherwise we are testing something different from what is in the allowlist. E.g. I would expect http://example.com/sparql and http://example.com/sparql/ to be disallowed if http://example.com/sparql// is in the allowlist - we would want to know about this.
  2. In what use cases are wildcard URLs required in the allowlist? I ask because it is difficult to test a wildcard URL, since there is a virtually unlimited number of possible URLs it would allow. What if http://example.com/sparql* is in the allowlist but only the http://example.com/sparql_a and http://example.com/sparql_b URLs are valid?
    • Is an allowlist the best place to be generating "test SPARQL endpoint URLs" from?
    • I think the process for "having a SPARQL endpoint added to the allowlist" now requires the owners of the query service to provide contact information. How about the process also includes a "Test URL" that has to be provided?
  3. How do you want redirects to be handled? Do you want them to be followed? I notice some logic that excludes 3xx status codes and only accepts 2xx status codes; however, by default redirects are followed automatically, so the status code will never be a 3xx (see the requests docs).
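To make question 3 concrete, here is a small sketch of how `requests` behaves. The helper name is illustrative; the point is that with the library's default `allow_redirects=True`, `status_code` is the final response's code and any 3xx hops only appear in `response.history`, so a "reject 3xx" check on the returned code can never fire.

```python
import requests

def fetch_final(url: str, follow_redirects: bool = True, timeout: float = 10.0):
    """Return (final_status, redirect_chain) for a GET request.

    With follow_redirects=True (the requests default), status_code belongs
    to the final response; any intermediate 3xx responses are collected in
    resp.history. Set follow_redirects=False to see the raw 3xx itself.
    """
    resp = requests.get(url, allow_redirects=follow_redirects, timeout=timeout)
    return resp.status_code, [r.status_code for r in resp.history]
```

So a check like `if 300 <= status < 400: fail()` on the default-followed response is dead code.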

Changes made (in Ollie's Copy of Check allowlist)

  • Changed the User-Agent to WMDE-FederationCheckBot/1.0 (anton.kokh@wikimedia.de) as per the WMF User-Agent Policy
  • Moved the SPARQL specific headers to the _post_wdqs() method
  • Simplified the allowlist URL variables
  • Simplified the ?format=TEXT logic in `fetch_gitiles_file()`
  • Simplified confusing logic and error handling in probe_attempt()
  • Simplified how the WDQS request timeout value was passed around the script
  • Removed duplicate de-duplication data structure
  • Added datetime stamp to the output CSV file

I then created a new script (Ollie's Rewrite of Check allowlist) using the better parts of the original script and a new, consistent data structure throughout. I gave up tracking changes at this point, as I just wanted a script that I could be mostly confident worked. This script is a lot more readable, and easier to follow and verify, but is still pretty rough around the edges. I tried to keep the output CSV similar to the one in the task description.

The biggest issue with all of these scripts is the lack of handling for 429 "Too Many Requests" status codes. We could retry after the period of time specified in the Retry-After header, or check whether any of the libraries (e.g. Pywikibot, SPARQLWrapper, etc.) have built-in support for this.
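A minimal Retry-After sketch could look like this. To be clear, none of the reviewed scripts implement it yet; the helper name and defaults are made up, and only the delay-in-seconds form of Retry-After is handled (the header may also carry an HTTP date, per RFC 9110).

```python
import time

def request_with_retry(session, url, max_retries=3, default_wait=5.0, **kwargs):
    """GET with naive 429 handling: sleep for the Retry-After delay and retry.

    Hypothetical sketch. Only the integer-seconds form of Retry-After is
    parsed; anything else (e.g. an HTTP date) falls back to default_wait.
    """
    for attempt in range(max_retries + 1):
        resp = session.get(url, **kwargs)
        if resp.status_code != 429 or attempt == max_retries:
            # Either we got a usable response, or we are out of retries.
            return resp
        retry_after = resp.headers.get("Retry-After", "")
        wait = float(retry_after) if retry_after.isdigit() else default_wait
        time.sleep(wait)
```

Alternatively, `urllib3.util.Retry` (which requests can mount via an `HTTPAdapter`) has built-in support for respecting Retry-After, which may be less code than rolling our own.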

I found this review task quite difficult due to the confusing and superfluous code from the LLM. It's also hard to know what functionality is required by the PM versus what the LLM has made up. I suspect it would have been just as quick for me to write the script from scratch, and I would have had more confidence in the final code.

I would really like for this script to be placed in a git repository where it will have version control, and can be validated with tests, linting, and static analysis. We could also create a GitHub action to have it run periodically and email you the results.

Ollie.Shotton_WMDE renamed this task from [TIMEBOX: 8hrs] Review allowlist check script to [TIMEBOX: 0hrs] Review allowlist check script.Mar 5 2026, 1:52 AM
Anton.Kokh added a subscriber: dcausse.

Thank you, Ollie!
Having the CSV arrive in my email weekly would be useful, but not mandatory. If it's easy to do, let's do it! :)

As for the questions

1 - When it comes to slash symbols, you are right, we should be testing exactly the link that is on the allowlist. Is this already fixed in your version of the script?

2 -

In what use cases are wildcard URLs required in the allowlist?

I don't have clear information about this. :)
Here's the info I got from @dcausse

So the logic lives in blazegraph at https://gerrit.wikimedia.org/r/plugins/gitiles/wikidata/query/blazegraph/+/refs/heads/m[…]va/com/bigdata/rdf/sparql/ast/service/ServiceRegistry.java
My understanding is that there are two modes:
path wildcard: you define http://foo.bar/* -> you allow any call on the foo.bar host, e.g. SERVICE http://foo.bar/something/sparql is allowed
query wildcard: you define http://foo.bar/sparql* -> you allow any call on this URL but with an additional query string, e.g. SERVICE http://foo.bar/sparql?foo=bar is allowed
related ticket: https://phabricator.wikimedia.org/T196858

I previously created T417573, where I asked why one of the endpoints appears twice in the allowlist - with and without the wildcard. This resulted in a Slack thread where our WMF colleagues tried to figure out why it's there. I think the outcome was: it's easier to just keep it; no one knows what we're gonna break if we remove it.
Since I have no idea how to properly test wildcarded endpoints, I just ignore the wildcard for simplicity.
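The two wildcard modes dcausse describes, and the "ignore the wildcard" simplification, could be sketched like this. The function names and mode labels are illustrative, not taken from ServiceRegistry.java.

```python
def classify_wildcard(url: str) -> str:
    """Classify an allowlist entry by the two Blazegraph wildcard modes
    described above: 'path' for entries like http://foo.bar/* (any path on
    the host), 'query' for entries like http://foo.bar/sparql* (any query
    string on that URL), 'exact' otherwise. Labels are hypothetical."""
    if not url.endswith("*"):
        return "exact"
    return "path" if url.endswith("/*") else "query"

def strip_wildcard(url: str) -> str:
    """The check script's simplification: probe the URL with the trailing
    wildcard removed, since the set of URLs a wildcard admits is unbounded."""
    return url.rstrip("*")
```

Note that for a path wildcard like http://foo.bar/*, stripping the `*` leaves only the bare host root, which may not itself be a SPARQL endpoint, so a probe failure there does not necessarily mean the wildcarded entry is broken.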

3 - Redirects should be followed.

Thanks for the info, Anton!

1 and 2 - my version of the script only removes the * characters from the URL; everything else is left in place so that we validate the URL exactly as it appears in the list.

3 - Great, that is how it currently is.

I suggest we have a 4hr timebox task to move this script into a git repository and create an action.