
Toolforge CDNjs mirror no longer lists older versions of libraries
Closed, Resolved · Public · BUG REPORT

Assigned To
Authored By
LucasWerkmeister
Jul 24 2023, 8:09 AM
Referenced Files
F37633935: image.png
Aug 27 2023, 11:10 AM
F37632035: image.png
Aug 26 2023, 11:08 AM
F37153819: image.png
Jul 28 2023, 6:36 PM
F37148172: Screen Shot 2023-07-24 at 10.07.51.png
Jul 24 2023, 8:09 AM

Description

Steps to replicate the issue:

What happens?:
Only the latest version is available.

Screen Shot 2023-07-24 at 10.07.51.png (480×640 px, 40 KB)

What should have happened instead?:
All versions are available.

Other information:
This is probably because the CDNjs API stopped returning assets for all versions of the library about a year ago.

Event Timeline

In theory, it’s possible to fix this by querying the API for the assets of each version, one by one.

diff --git a/generate.py b/generate.py
index 31ced293f2..3cfbfca9ae 100644
--- a/generate.py
+++ b/generate.py
@@ -69,37 +69,52 @@ def main():
 
     github_token = github_token.strip()
 
-    fields = "version,description,homepage,keywords,license,repository,author"
+    fields = "versions,description,homepage,keywords,license,repository,author"
     upstream_url = "https://api.cdnjs.com/libraries"
-    list_url = upstream_url + "?fields={}".format(fields)
-    with requests.get(list_url, stream=True) as resp:
+    with requests.get(upstream_url, stream=True) as resp:
         json_resp = resp.json()
 
     all_packages = json_resp["results"]
 
     libraries = []
     for package in all_packages:
-        logger.info("Processing %s...", package["name"])
+        name = package["name"]
+        logger.info("Processing %s...", name)
+        package_url = (
+            upstream_url
+            + "/"
+            + urllib.parse.quote(name)
+            + "?fields=version,description,homepage,keywords,license,repository,author,versions"
+        )
+        with requests.get(package_url) as resp:
+            package = resp.json()
+
         lib = {
-            "name": package["name"],
+            "name": name,
             "description": package.get("description", None),
             "version": package.get("version", None),
             "homepage": package.get("homepage", None),
             "keywords": package.get("keywords", None),
-            "assets": package.get("assets", None),
+            "assets": [],
         }
 
-        assets_url = (
-            upstream_url
-            + "/"
-            + urllib.parse.quote(lib["name"])
-            + "?fields=assets"
-        )
-        with requests.get(assets_url) as resp:
-            try:
-                lib["assets"] = resp.json()["assets"]
-            except (KeyError, ValueError):
-                logger.exception("Failed to fetch assets using %s", assets_url)
+        for version in package.get("versions", []):
+            assets_url = (
+                upstream_url
+                + "/"
+                + urllib.parse.quote(name)
+                + "/"
+                + urllib.parse.quote(version)
+                + "?fields=files"
+            )
+            with requests.get(assets_url) as resp:
+                try:
+                    lib["assets"].append({
+                        "version": version,
+                        "files": resp.json()["files"],
+                    })
+                except (KeyError, ValueError):
+                    logger.exception("Failed to fetch assets using %s", assets_url)
 
         # Why this is a thing, I don't know. However, it is.
         # so far we don't need

However, this is slow. Just for the first ten libraries, and with the GitHub stars disabled, the time to assemble the output goes from ~10 seconds to ~17 minutes. Now, admittedly, the first ten libraries (crudely implemented as for package in all_packages[:10]: btw) are big ones (React, Vue, Bootstrap, d3), probably with an above-average number of versions. But still, there’s a risk that this could make the build process untenably long.

I guess I’ll try a full run in Toolforge (as the lucaswerkmeister-test tool) and see how long it takes…

Looks like the job managed to run for about four hours, and made it through 363 out of 4420 libraries, before it got Killed for some unknown reason. I’ll try again with more memory.

This time the job made it through 2146 libraries, running for a bit over 9½ hours, before crashing with “Max retries exceeded with url” (“sslv3 alert handshake failure” – random network hiccup, I guess).

I think it’s fairly clear that, even if it works, this approach will take the better part of a full day – not ideal for a daily cronjob. It would be better if we could find a solution that doesn’t hit the API so heavily.
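
That estimate can be sanity-checked with some crude arithmetic from the partial run above (2146 libraries in a bit over 9½ hours), ignoring that library sizes vary a lot:

```python
# Crude extrapolation from the partial run: 2146 libraries in ~9.5 hours.
libraries_done = 2146
hours_taken = 9.5
total_libraries = 4420

estimated_hours = hours_taken / libraries_done * total_libraries
print(round(estimated_hours, 1))  # ≈ 19.6 hours, i.e. most of a day
```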

We know from experience (T182604) that cloning all of cdnjs/cdnjs is not feasible. But maybe we can get the files some other way? Ideas:

  • Can we somehow get git to clone all the commit and tree objects, but not the blob objects, from cdnjs/cdnjs? Or get just a file listing, without contents, from the GitHub API somewhere?
  • cdnjs/logs is smaller, and contains file lists per version as well (mainly kv-publish.log). The format of the log files is a bit opaque, but perhaps we can still use it?

Or get just a file listing, without contents, from the GitHub API somewhere?

I can’t make head or tail of the GraphQL API, but with the REST API this seems doable. List master → list ajax/ → list libs/ → list each lib with ?recursive=true (falling back to non-recursive + listing each version with a separate request if the response was truncated). Assuming that truncation is rare (which seems to be the case – even for vue, which has a fair amount of versions and files, I get an untruncated recursive listing), that ought to be 𝓞(#libraries) rather than 𝓞(#libraries×#versions) network requests.
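
The fallback logic can be sketched independently of the HTTP layer. Here fetch_tree is a hypothetical callable (not part of the actual patch) wrapping GitHub’s “get a tree” endpoint and returning its JSON shape: a "tree" list of {"path", "type", "sha"} entries plus a "truncated" flag:

```python
def list_files(fetch_tree, root_sha):
    """List all blob paths under a tree: try one recursive request first,
    and fall back to walking subtrees one by one if GitHub truncated it.

    fetch_tree(sha, recursive) -> {"tree": [...], "truncated": bool},
    shaped like GitHub's "get a tree" REST response.
    """
    resp = fetch_tree(root_sha, recursive=True)
    if not resp["truncated"]:
        return [e["path"] for e in resp["tree"] if e["type"] == "blob"]

    # Fallback: non-recursive listing, then descend into each subtree
    # with its own request (this is the O(#versions) slow path).
    files = []
    resp = fetch_tree(root_sha, recursive=False)
    for entry in resp["tree"]:
        if entry["type"] == "blob":
            files.append(entry["path"])
        elif entry["type"] == "tree":
            for path in list_files(fetch_tree, entry["sha"]):
                files.append(entry["path"] + "/" + path)
    return files
```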

Okay, I think I have a working version of the recursive fallback for when GitHub truncates the response, but frankly I’m thinking about turning it off. If ?recursive=true works without truncation for big real libraries, but a fallback is needed for this –

many, many, many 0.0.0-timestamp versions of presumably the same library, often with several versions per day

– then I’m not particularly inclined to spend so much runtime on chasing those shitty libraries’ weird versions.

Eh, it turns out oojs-ui is one of the libraries where ?recursive=true results in a truncated response (all those icon files!), so let’s use the fallback implementation after all ;)

So I have two slightly different versions now:

  • Ignore all API errors from GitHub. Steamroll through the rate limit, losing data until it resets on GitHub’s side (every hour). Works.
  • Handle rate limits properly by sleeping. Don’t lose any data. Crashes.
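
The sleeping variant boils down to reading GitHub’s X-RateLimit-* response headers (visible in the output further down). A minimal sketch of that decision – the function name is mine, not the patch’s:

```python
def seconds_to_sleep(headers, now):
    """How long to wait before the next GitHub request, based on the
    X-RateLimit-* response headers; 0 means we can carry on.
    `now` is a Unix timestamp (int)."""
    if int(headers.get("X-RateLimit-Remaining", "1")) > 0:
        return 0
    reset = int(headers.get("X-RateLimit-Reset", str(now)))
    # Sleep until the advertised reset time, plus a second of slack.
    return max(0, reset - now) + 1
```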

Why does the “proper” version crash? Well, because some libraries have too many goddamn versions. In particular, the problem seems to be with openlayers; when I try to build just the output for that library locally, command time --verbose python… reports “Maximum resident set size (kbytes): 4831748” (that’s 4.6 GiB or 4.9 GB), the resulting file is 631 MiB (661 MB) big, and it contains 5364 versions. And on Toolforge, I was running this with -mem 4Gi (and the real job in the cdnjs tool even only has 3Gi) – so, the job got killed.

I think we can just say, fuck this one library in particular, and truncate the list of versions to, say, one thousand versions.
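
The cap itself is simple to sketch (the helper name is mine; which end of the list survives depends on the order the API hands versions back – here the first thousand entries are kept):

```python
MAX_VERSIONS = 1000  # cap on versions fetched per library

def limit_versions(name, versions):
    # Hypothetical helper matching the "Too many versions" log lines
    # quoted below: drop everything beyond the cap.
    if len(versions) > MAX_VERSIONS:
        print(f"Too many versions for {name} ({len(versions)}), "
              "skipping fetching of older versions")
        versions = versions[:MAX_VERSIONS]
    return versions
```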

Alright, with that change, the script finishes in ca. 2 hours and 19 minutes; 10 libraries have too many versions:

tools.lucaswerkmeister-test@tools-sgebastion-10:~$ grep -c 'Too many versions' t342519-20.err
10
tools.lucaswerkmeister-test@tools-sgebastion-10:~$ grep 'Too many versions' t342519-20.err
2023-07-31T19:24:01Z generate     INFO    : Too many versions for react-is (1149), skipping fetching of older versions
2023-07-31T19:24:47Z generate     INFO    : Too many versions for tailwindcss (1439), skipping fetching of older versions
2023-07-31T19:26:34Z generate     INFO    : Too many versions for pdf.js (1531), skipping fetching of older versions
2023-07-31T19:34:44Z generate     INFO    : Too many versions for react-relay (3126), skipping fetching of older versions
2023-07-31T19:35:55Z generate     INFO    : Too many versions for material-components-web (1283), skipping fetching of older versions
2023-07-31T19:40:48Z generate     INFO    : Too many versions for hls.js (1797), skipping fetching of older versions
2023-07-31T19:41:16Z generate     INFO    : Too many versions for Primer (2860), skipping fetching of older versions
2023-07-31T19:42:45Z generate     INFO    : Too many versions for openlayers (2683), skipping fetching of older versions
2023-07-31T19:45:18Z generate     INFO    : Too many versions for aws-amplify (1052), skipping fetching of older versions
2023-07-31T19:47:49Z generate     INFO    : Too many versions for aws-sdk (1629), skipping fetching of older versions

I’ll do one more test run with -mem 3Gi instead of 4Gi to ensure that it’ll still work in the cdnjs job, and then this should be ready for review at last :)

Change 944290 had a related patch set uploaded (by Lucas Werkmeister; author: Lucas Werkmeister):

[labs/tools/cdnjs-index@master] Retry GitHub requests if rate limit is hit

https://gerrit.wikimedia.org/r/944290

Change 944291 had a related patch set uploaded (by Lucas Werkmeister; author: Lucas Werkmeister):

[labs/tools/cdnjs-index@master] Get other versions’ assets from GitHub

https://gerrit.wikimedia.org/r/944291

Change 944292 had a related patch set uploaded (by Lucas Werkmeister; author: Lucas Werkmeister):

[labs/tools/cdnjs-index@master] Write mod*.html files immediately

https://gerrit.wikimedia.org/r/944292

Change 944293 had a related patch set uploaded (by Lucas Werkmeister; author: Lucas Werkmeister):

[labs/tools/cdnjs-index@master] Trim GitHub trees to save memory

https://gerrit.wikimedia.org/r/944293

Change 944294 had a related patch set uploaded (by Lucas Werkmeister; author: Lucas Werkmeister):

[labs/tools/cdnjs-index@master] Limit libraries to 1000 versions

https://gerrit.wikimedia.org/r/944294

Change 944290 merged by jenkins-bot:

[labs/tools/cdnjs-index@master] Retry GitHub requests if rate limit is hit

https://gerrit.wikimedia.org/r/944290

Change 944291 merged by jenkins-bot:

[labs/tools/cdnjs-index@master] Get other versions’ assets from GitHub

https://gerrit.wikimedia.org/r/944291

Change 944292 merged by jenkins-bot:

[labs/tools/cdnjs-index@master] Write mod*.html files immediately

https://gerrit.wikimedia.org/r/944292

Change 944293 merged by jenkins-bot:

[labs/tools/cdnjs-index@master] Trim GitHub trees to save memory

https://gerrit.wikimedia.org/r/944293

Change 944294 merged by jenkins-bot:

[labs/tools/cdnjs-index@master] Limit libraries to 1000 versions

https://gerrit.wikimedia.org/r/944294

Next job should start in five hours or so, let’s see if it works :) thanks for merging @bd808!

And just for the record, prior to these changes, the runtime of the job was some 42 minutes if I’m not mistaken:

tools.cdnjs@tools-sgebastion-10:~$ toolforge-jobs show update-index | grep Status
| Status:     | Last schedule time: 2023-08-01T04:17:00Z |
tools.cdnjs@tools-sgebastion-10:~$ ls -l public_html/index.html
-rw-r--r-- 1 tools.cdnjs tools.cdnjs 42161265 Aug  1 04:59 public_html/index.html

My last test run (in the lucaswerkmeister-test tool) took about 2 hours and 10 minutes; it should be about the same in the cdnjs tool. (cdnjs uses Python 3.9 rather than 3.11, and 3.9 is somewhat slower, but the runtime here should be dominated by network requests anyway.)

Mentioned in SAL (#wikimedia-cloud) [2023-08-02T19:25:53Z] <wm-bot> <lucaswerkmeister> pulled a526d723b7 (T342519), should take effect with next daily run

It’s, uh, still running? o_O

tools.cdnjs@tools-sgebastion-10:~$ toolforge-jobs show update-index
+-------------+-----------------------------------------------------------------+
| Job name:   | update-index                                                    |
+-------------+-----------------------------------------------------------------+
| Command:    | /data/project/cdnjs/update-index.sh                             |
+-------------+-----------------------------------------------------------------+
| Job type:   | schedule: 17 4 * * *                                            |
+-------------+-----------------------------------------------------------------+
| Image:      | python3.9                                                       |
+-------------+-----------------------------------------------------------------+
| File log:   | yes                                                             |
+-------------+-----------------------------------------------------------------+
| Output log: | update-index.out                                                |
+-------------+-----------------------------------------------------------------+
| Error log:  | update-index.err                                                |
+-------------+-----------------------------------------------------------------+
| Emails:     | none                                                            |
+-------------+-----------------------------------------------------------------+
| Resources:  | mem: 3Gi, cpu: default                                          |
+-------------+-----------------------------------------------------------------+
| Retry:      | yes: 1 time(s)                                                  |
+-------------+-----------------------------------------------------------------+
| Status:     | Running for 10h55m38s                                           |
+-------------+-----------------------------------------------------------------+
| Hints:      | Last run at 2023-08-03T04:17:01Z. Pod in 'Running' phase. State |
|             | 'running'. Started at '2023-08-03T04:17:03Z'.                   |
+-------------+-----------------------------------------------------------------+

And there are no entries newer than 2023-06-14 in update-index.err as far as I can tell.

And there are no entries newer than 2023-06-14 in update-index.err as far as I can tell.

Ah, I guess that’s more or less expected because update-index.sh doesn’t run generate.py with -v (unlike the script I tested with), so all the info-level messages I’m used to aren’t there.

Still running, for 18 hours now. If it’s not done by tomorrow, I’ll probably roll back the source code (and kill the job if it’s running), then try to investigate some more in my test tool again.

Mentioned in SAL (#wikimedia-cloud) [2023-08-04T15:20:57Z] <wm-bot> <lucaswerkmeister> rolled ~/cdnjs-index back to fe85853af5, the changes for T342519 aren’t properly working yet

Mentioned in SAL (#wikimedia-cloud) [2023-08-04T15:22:14Z] <wm-bot> <lucaswerkmeister> kubectl delete job.batch/update-index-28183937 # job for T342519 is running too long, abort – next cronjob schedule should use the old code again

Running for 1d11h

That’s way too long, I’ve stopped it now. Let’s hope it works again with the old code; I’ll investigate some more later (might not get around to it this weekend though).

I tried a closer setup back in lucaswerkmeister-test, including using ~tools.cdnjs/venv-py39/ instead of my own Python 3.11 venv, and it still finished successfully in a bit over two hours :/

Mentioned in SAL (#wikimedia-cloud) [2023-08-09T18:32:04Z] <wm-bot> <lucaswerkmeister> added self to maintainers, I need to do some more investigation for T342519 and it feels weird to keep using sudo for that; it seems to be a very low-maintenance tool overall and I don’t mind being on the list of maintainers

Mentioned in SAL (#wikimedia-cloud) [2023-08-09T18:36:25Z] <wm-bot> <lucaswerkmeister> rotated + compressed update-index.err to update-index.err.1.zst, old file is now empty (T342519)

Mentioned in SAL (#wikimedia-cloud) [2023-08-09T18:36:59Z] <wm-bot> <lucaswerkmeister> added -v option to update-index.sh, next output should be more verbose (T342519)

My plan right now:

  1. Make the tool’s update-index.sh verbose (-v option), to verify that verboseness alone doesn’t break anything. See above. Note that this is still on the old code.
  2. If that works, move to Python 3.11. Still on the old code. Just seems like a good thing to do in general.
  3. If that also works, then bump the code again. Hopefully the verbose output should give us more of an idea what’s broken – right now I have no clue why the job could have run for so long without any visible errors.

Looks like -v works fine – the job ran, index.html was updated, there are plenty of info messages in the output now. Onto the next step then.

Mentioned in SAL (#wikimedia-cloud) [2023-08-10T18:12:08Z] <wm-bot> <lucaswerkmeister> set up venv-py311, changed update-index.sh to use it over py39, recreated job using python3.11 image (T342519)

For the record, the command used to recreate the job (based on the previous toolforge-jobs run command in the bash history):

toolforge-jobs run --image python3.11 --mem "3Gi" --command "/data/project/cdnjs/update-index.sh" update-index --schedule "17 4 * * *"

Later™ this should be a YAML file for toolforge-jobs load to consume, I suppose.
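
An untested sketch of what that jobs file might look like – field names are from my reading of the Toolforge jobs framework docs, so double-check them before relying on this:

```yaml
# jobs.yaml — load with: toolforge-jobs load jobs.yaml
- name: update-index
  command: /data/project/cdnjs/update-index.sh
  image: python3.11
  schedule: "17 4 * * *"
  mem: 3Gi
  retry: 1
```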

Looks like the Python 3.11 version also worked. Let’s try updating the code again…

Mentioned in SAL (#wikimedia-cloud) [2023-08-11T19:07:51Z] <wm-bot> <lucaswerkmeister> updated code from fe85853af5 to a526d723b7, try again to include old versions (T342519)

Hm, it’s only processed 68 libraries and already had to sleep 11 times to avoid the rate limit:

tools.cdnjs@tools-sgebastion-10:~$ grep -c Processing update-index.err
68
tools.cdnjs@tools-sgebastion-10:~$ grep -c sleeping update-index.err
11

I wonder if the token in ~tools.cdnjs/cdnjs-index/tokenfile has a much lower rate limit than the one I tested with? It looks like an older format, at least (no github_ prefix).

Mentioned in SAL (#wikimedia-cloud) [2023-08-12T15:59:29Z] <wm-bot> <lucaswerkmeister> kubectl delete job update-index-28196897 # job for T342519 is running too long, abort

For some reason, I can’t use the token on the command line:

tools.cdnjs@tools-sgebastion-10:~$ curl -s -L -H "Accept: application/vnd.github+json" -H "Authorization: Bearer $(<~tools.cdnjs/cdnjs-index/tokenfile)" -H "X-GitHub-Api-Version: 2022-11-28" https://api.github.com/rate_limit
{
  "message": "Bad credentials",
  "documentation_url": "https://docs.github.com/rest"
}

But with a quick hack in generate.py, to let Python make the request, I can see the limits:

{"resources": {"core": {"limit": 60, "used": 67, "remaining": 0, "reset": 1691866264}, "search": {"limit": 10, "used": 0, "remaining": 10, "reset": 1691864361}, "graphql": {"limit": 0, "used": 0, "remaining": 0, "reset": 1691867901}, "integration_manifest": {"limit": 5000, "used": 0, "remaining": 5000, "reset": 1691867901}, "source_import": {"limit": 5, "used": 0, "remaining": 5, "reset": 1691864361}, "code_scanning_upload": {"limit": 60, "used": 0, "remaining": 60, "reset": 1691867901}, "actions_runner_registration": {"limit": 60, "used": 0, "remaining": 60, "reset": 1691867901}, "scim": {"limit": 60, "used": 0, "remaining": 60, "reset": 1691867901}, "dependency_snapshots": {"limit": 60, "used": 0, "remaining": 60, "reset": 1691864361}, "audit_log": {"limit": 0, "used": 0, "remaining": 0, "reset": 1691867901}, "code_search": {"limit": 0, "used": 0, "remaining": 0, "reset": 1691864361}}, "rate": {"limit": 60, "used": 67, "remaining": 0, "reset": 1691866264}}

And for comparison, the token I tested with:

{"resources": {"core": {"limit": 5000, "used": 0, "remaining": 5000, "reset": 1691867922}, "search": {"limit": 30, "used": 0, "remaining": 30, "reset": 1691864382}, "graphql": {"limit": 5000, "used": 0, "remaining": 5000, "reset": 1691867922}, "integration_manifest": {"limit": 5000, "used": 0, "remaining": 5000, "reset": 1691867922}, "source_import": {"limit": 100, "used": 0, "remaining": 100, "reset": 1691864382}, "code_scanning_upload": {"limit": 1000, "used": 0, "remaining": 1000, "reset": 1691867922}, "actions_runner_registration": {"limit": 10000, "used": 0, "remaining": 10000, "reset": 1691867922}, "scim": {"limit": 15000, "used": 0, "remaining": 15000, "reset": 1691867922}, "dependency_snapshots": {"limit": 100, "used": 0, "remaining": 100, "reset": 1691864382}, "audit_log": {"limit": 1750, "used": 0, "remaining": 1750, "reset": 1691867922}, "code_search": {"limit": 10, "used": 0, "remaining": 10, "reset": 1691864382}}, "rate": {"limit": 5000, "used": 0, "remaining": 5000, "reset": 1691867922}}

So the older (legacy?) token has a limit of 60 (per hour?) rather than 5000. Two orders of magnitude :D

And the question now is… whose token do we use instead? Mine? Some general Toolforge GitHub account? Something else?

Wait, what? I replaced the token, and now I get a limit of 5000 with curl, but if I use Python to make what should be the right request, the limit is still 60?!

Mentioned in SAL (#wikimedia-cloud) [2023-08-12T20:35:48Z] <wm-bot> <lucaswerkmeister> replaced tokenfile with a fine-grained one generated for my account, hoping to get a higher rate limit (T342519)

I tried changing the user agent but it’s still returning the lower rate limit… I don’t have more time to investigate this tonight, I’ll roll back the code again.

And I think at some point it will be worth taking a step back to ask, is this really worth the effort? Or should we just rip out all the code for older versions and only have the index page show you the latest version? (URLs for older versions would still work, since we’re just proxying to Cloudflare, you just wouldn’t be able to get the URLs from the index page anymore. Just like it’s already been for a year without anybody complaining.)

Mentioned in SAL (#wikimedia-cloud) [2023-08-12T20:40:17Z] <wm-bot> <lucaswerkmeister> rolled back code from a526d723b7 to fe85853af5 (T342519)

Well, I found out why the rate limit is lower, at least…

lucas.py
import json
import requests


def github_request(url, token):
    """Make a request against the GitHub REST API."""
    print(token)
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": "Bearer {}".format(token),
        "X-GitHub-Api-Version": "2022-11-28",
    }
    print(headers)
    response = requests.get(url, headers=headers)
    print(response.request.url)
    print(response.request.body)
    print(response.request.headers)
    print(response.status_code)
    print(response.headers)
    return response.json()


with open('tokenfile') as token_file:
    github_token = token_file.read()

github_token = github_token.strip()

print(json.dumps(github_request('https://api.github.com/rate_limit', github_token)))
output
REDACTED
{'Accept': 'application/vnd.github+json', 'Authorization': 'Bearer REDACTED', 'X-GitHub-Api-Version': '2022-11-28'}
https://api.github.com/rate_limit
None
{'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': 'application/vnd.github+json', 'Connection': 'keep-alive', 'Authorization': 'Basic dG9vbGxhYnMtY2RuanMtbWlycm9yLWFjY291bnQ6YmxhaGJsYWgx', 'X-GitHub-Api-Version': '2022-11-28'}
200
{'Server': 'GitHub.com', 'Date': 'Sat, 26 Aug 2023 11:02:15 GMT', 'Content-Type': 'application/json; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Cache-Control': 'no-cache', 'X-GitHub-Media-Type': 'github.v3; format=json', 'x-github-api-version-selected': '2022-11-28', 'X-RateLimit-Limit': '60', 'X-RateLimit-Remaining': '55', 'X-RateLimit-Reset': '1693051227', 'X-RateLimit-Used': '5', 'X-RateLimit-Resource': 'core', 'Access-Control-Expose-Headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', 'Access-Control-Allow-Origin': '*', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '0', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Content-Security-Policy': "default-src 'none'", 'Vary': 'Accept-Encoding, Accept, X-Requested-With', 'Content-Encoding': 'gzip', 'X-GitHub-Request-Id': '7897:13F3:1DB7680:3D0BAC1:64E9DBB7'}
{"resources": {"core": {"limit": 60, "used": 5, "remaining": 55, "reset": 1693051227}, "search": {"limit": 10, "used": 0, "remaining": 10, "reset": 1693047795}, "graphql": {"limit": 0, "used": 0, "remaining": 0, "reset": 1693051335}, "integration_manifest": {"limit": 5000, "used": 0, "remaining": 5000, "reset": 1693051335}, "source_import": {"limit": 5, "used": 0, "remaining": 5, "reset": 1693047795}, "code_scanning_upload": {"limit": 60, "used": 0, "remaining": 60, "reset": 1693051335}, "actions_runner_registration": {"limit": 60, "used": 0, "remaining": 60, "reset": 1693051335}, "scim": {"limit": 60, "used": 0, "remaining": 60, "reset": 1693051335}, "dependency_snapshots": {"limit": 60, "used": 0, "remaining": 60, "reset": 1693047795}, "audit_log": {"limit": 0, "used": 0, "remaining": 0, "reset": 1693051335}, "code_search": {"limit": 0, "used": 0, "remaining": 0, "reset": 1693047795}}, "rate": {"limit": 60, "used": 5, "remaining": 55, "reset": 1693051227}}

It correctly reads the token from the tokenfile and puts it into the request headers… but then the actual request is made with a totally different Authorization header???

$ base64 -d <<< dG9vbGxhYnMtY2RuanMtbWlycm9yLWFjY291bnQ6YmxhaGJsYWgx
toollabs-cdnjs-mirror-account:blahblah1

WTF?!?!?

$ cat ~tools.cdnjs/.netrc 
machine api.github.com
        login toollabs-cdnjs-mirror-account
        password blahblah1

i am going to become the joker

That’s actually an actual GitHub account’s real password, too. But I’m not too upset about accidentally leaking it here, GitHub flagged it as weak anyways:

image.png (1×1 px, 108 KB)

I don’t know who set up this account, and I don’t particularly care. I’ll move the .netrc file out of the way and hope that it then finally uses the proper GitHub token I set up, with the higher rate limit.
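
(For the record, this is requests behaving as designed: when no explicit auth argument is given and trust_env is enabled – the default – requests looks the host up in ~/.netrc, and the resulting basic auth overwrites any Authorization header you set yourself. Besides moving the file out of the way, an alternative fix would be to opt out on a Session; a sketch of the idea, not what the tool actually does:)

```python
import requests

session = requests.Session()
session.trust_env = False  # don't consult ~/.netrc (or proxy env vars)

request = requests.Request(
    "GET",
    "https://api.github.com/rate_limit",
    headers={"Authorization": "Bearer REDACTED"},
)
# prepare_request() is where the netrc lookup would normally happen;
# with trust_env off, our header survives. (No network traffic here.)
prepared = session.prepare_request(request)
print(prepared.headers["Authorization"])  # Bearer REDACTED
```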

And then in five years, when maintainership of this tool naturally migrates from me to someone else, they can swear at me too. That’s okay. That’s how it goes.

(Though I hope I’m at least documenting things a little bit better, e.g. the SAL says that the token belongs to my account.)

Mentioned in SAL (#wikimedia-cloud) [2023-08-26T11:10:14Z] <wm-bot> <lucaswerkmeister> mv ~tools.cdnjs/.netrc{,.disabled-2023-08-26-T342519}

Yay, now I’m getting the higher rate limit (5000 requests) from Python too.

Mentioned in SAL (#wikimedia-cloud) [2023-08-26T11:18:07Z] <wm-bot> <lucaswerkmeister> updated code from fe85853af5 to a526d723b7, try once more to include old versions (T342519)

Mentioned in SAL (#wikimedia-cloud) [2023-08-26T11:19:14Z] <wm-bot> <lucaswerkmeister> rotated update-index.err again to have clean logs for the next run (T342519), if it works I’ll remove -v from update-index.sh again

It’s working \o/ \o/ \o/

image.png (1×1 px, 197 KB)

And we only got throttled two times:

tools.cdnjs@tools-sgebastion-10:~$ grep -c 'sleeping for' update-index.err
2

Mentioned in SAL (#wikimedia-cloud) [2023-08-27T11:11:31Z] <wm-bot> <lucaswerkmeister> removed -v from update-index.sh again (T342519, previous message was also for that task but I forgor mention)

Mentioned in SAL (#wikimedia-cloud) [2023-08-27T11:14:20Z] <wm-bot> <lucaswerkmeister> toolforge-jobs run --image bookworm --command 'zstd --rm update-index.err.1 update-index.err.2 update-index.err.3' --wait compress-err # T342519, compressed each file by about a factor of ten

Mentioned in SAL (#wikimedia-cloud) [2023-08-27T11:11:31Z] <wm-bot> <lucaswerkmeister> removed -v from update-index.sh again (T342519, previous message was also for that task but I forgor mention)

(The previous message being rotate update-index.err one more time.)

I think we’re done here!