User Details
- User Since
- Mar 30 2017, 2:19 AM (463 w, 4 d)
- Availability
- Available
- LDAP User
- Audiodude
- MediaWiki User
- Audiodude [ Global Accounts ]
Yesterday
Glad you were able to figure it out, at least somewhat. I think your intuition is correct. There are a lot of layers/middleware in your stack, and one of them is probably dropping the response body and re-wrapping it as a generic 403.
Sat, Feb 14
I think security best practices for 403s are to return as little information as possible. You don't really want to advise bad actors how to get around the 403.
Fri, Feb 13
Tue, Feb 10
My pleasure. How about we take another 24 hours for technical/ops folks on this thread to comment on that meta page and then I'll disseminate a final copy on the Village Pump?
Yes, I was thinking Village Pump. As a technical editor myself, I don't know where the non-technical editors "hang out" other than, potentially, Village Pump.
And one of the reasons the expectation that "etherpad is forever" has been built up is that we haven't ever purged it.
I acknowledge the unwieldy size of the database and support a truncation. However, I think there needs to be a longer timeline (maybe more like 90 days instead of less than 30?) in order to properly communicate to affected users and give them time to retrieve their data.
Tue, Jan 20
Sorry, the code parses pages 1-8, which is where the pages actually are (I forgot how many pages there were, sorry, I wrote it last year!). The 9 only appears in the code as an exclusive boundary (off-by-one error, anyone?).
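To make the boundary concrete, here's a minimal sketch of the paging loop described above. The title pattern comes from this thread; the function name is hypothetical, not the bot's actual code:

```python
BASE_TITLE = "Wikipedia:Reliable_sources/Perennial_sources"

def subpage_titles():
    # range(1, 9) yields 1 through 8 -- the 9 is only an exclusive
    # boundary, not a ninth page, which is why it appears in the code.
    return [f"{BASE_TITLE}/{n}" for n in range(1, 9)]
```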
Oh sorry, I think there was a bit of a misunderstanding. I read "Perennial sources/X" and interpreted X as a variable, like my N from above.
Mon, Jan 19
Currently the bot fetches tables from Wikipedia:Reliable_sources/Perennial_sources/N with N from 1-9. Is that sufficient? It doesn't use the "main table" at all.
Jan 17 2026
As mentioned in this thread, the bot was spuriously adding http:// in front of "domain" fields extracted from the table. As long as we don't do that, this issue is fixed.
I wrote the bot, and it's mostly a prototype for now as we discuss and organize a solution to the larger issue here: https://en.wikipedia.org/wiki/Wikipedia_talk:Reliable_sources/Perennial_sources
Jan 16 2026
Ah okay, thanks for the clarification.
Sorry, I haven't used phabricator before to organize tasks for a specific project inside of the Wikimedia movement. Maybe the assumption is that any task I file is a "general" task, or that this is meant to apply beyond the specific tag and context in which it is filed?
Interesting suggestion. I'm not sure. The bot templates are currently written as wikitext, so my intuition is that it wouldn't work. I also don't know how we would publish in "plain text" while still having things like links, styling, and templates.
I think two separate "co-tasks" that are jointly the parent of all tasks is preferable, yes. What I was getting at is that we don't need to check whether the "Do the whole thing" task is complete to know that the thing is done: if all the tasks with this tag are marked done or "not needed", we know the thing is done. Going further, as long as every task in this project has the project tag, it's not necessary to define a strict or rigid hierarchy/tree of parent tasks and subtasks (and it could even get cumbersome if people think they can't work on a task because its subtasks aren't all done). I don't mean to be overly critical, especially if you find the scheme useful, and I'm not trying to start a yak shaving contest, so of course be bold.
+1 for separate tasks, because it is possible that one person is doing the designing and another the creating.
In my experience, a "root" supertask like this is not useful. It is by definition the parent of all tasks (master of none?) and therefore has no utility. It will only be completed when the project is done, at which point closing it is the equivalent of turning off the light switch after you've finished moving out.
My advice from the perspective of someone who has done many technical/data "migrations" is that it's best to make the first migrated version as similar to the original in terms of scope, functionality and schema as possible.
So just to clarify, what's happening is that the bot is trying to create a "source page" (subpage) for Dotdash Meredith, which is the current owner of the site "about.com". The wiki software sees "about.com" and assumes I'm trying to add a link to that site, which is blacklisted (their term, should be denylisted). Technically (in both senses of the word), I am. But it is clearly a case where an exception is warranted. The error message clearly indicates what to do (asking for an exception on the "spam whitelist" [their term, should be allowlist] talk page).
Jan 12 2026
Thank you so much!
Jan 6 2026
Great idea! Yes I'll migrate the lone issue once we have a phab tag (or I can do it now and open a task, but I don't want it to be confusing and mis-triaged if we don't have the tag). Then I'll disable issues on the project.
Sep 2 2025
Honestly I feel like we can just ignore this task, or delete these objects, or both.
Make sure you have mwoffliner selected as your project:
Sep 1 2025
You're right, it's for database backups originating from Horizon. You're also right about the date it was enabled. Of course, since we have our own database backups, these are useless. The actual mystery to me is why we suddenly got this message. Either way, it's not something we are using so it can be deleted or just ignored.
Jul 23 2025
Thanks again @Scott_French for the extremely helpful analysis! I plan to submit a PR to mwclient to update the docs for that method to indicate which parameters are ignored when pool is set.
Jul 21 2025
Oh, never mind, we do use a connection pool in order to re-use the login cookies! So it seems, according to your analysis (which I just confirmed), that because we are setting the pool, we are ending up with a null UA? And that is causing us to get rate limited, presumably because of new policy changes?
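If that analysis is right, one possible workaround is to set the User-Agent on the pooled session itself, on the assumption that mwclient sends its requests through the session object passed as `pool`, so headers set there are what actually go out on the wire. A sketch (the UA string and helper name are placeholders, not our real code):

```python
import requests

def make_pooled_session(user_agent):
    # Assumption: mwclient routes requests through the session passed as
    # `pool`, so a User-Agent set here survives even though mwclient's own
    # UA-related parameters are ignored when pool is set.
    session = requests.Session()
    session.headers["User-Agent"] = user_agent
    return session

session = make_pooled_session("WP1.0Bot/1.0 (https://example.org; ops@example.org)")
# site = mwclient.Site("en.wikipedia.org", pool=session)  # untested sketch
```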
Thanks for that. I think you might have incorrect assumptions about the wp1 code, though. We do not attempt to set any custom "WP 1.0 Bot" user agent, and we are not using a connection pool. Since 2018, we have relied on the mwclient/0.* UA, which has worked.
Thanks for taking the time and looking into what mwclient version we use in production. Upgrading to 0.11.0 was the first thing I did when attempting to run the test code, but I immediately ran into the issue described above.
This is relatively high priority because the bot is currently offline pending resolution.
May 23 2025
Thank you for carefully considering the issue. There haven't been any recent code changes in the tool so I'm stumped. I'll keep debugging.
May 15 2025
It has the talk page articles in it, which is what we are looking for (WP 1.0 articles are categorized on their talk pages, not their article pages).
Feb 5 2025
As a stakeholder, my main concern is resolving the issue. That said, my (unsolicited) technical advice is that this is a great temporary workaround. Not only do we mitigate the immediate bug, but the logging especially is probably necessary anyway to help properly diagnose the bug in the future.
Jan 16 2025
They only eventually work on retry. Immediate retries, even with a delay, do not work, as the issue seems to persist for multiple days before eventually resolving itself (which as @Benoit74 points out, seems to point to a caching issue).
The latest examples still "work" (i.e., they're broken!)
Jan 8 2025
Here's one that exhibits the initial behavior mentioned in this ticket (blank page from API): https://ca.wikipedia.org/api/rest_v1/page/mobile-html/Maylandia
Jan 7 2025
{
  "type": "https://mediawiki.org/wiki/HyperSwitch/errors/not_found#page_revisions",
  "title": "Not found.",
  "method": "get",
  "detail": "Page was deleted",
  "uri": "/es.wiktionary.org/v1/page/mobile-html/magnetita"
}
These claim that the page was deleted from the API, but they are clearly still there:
Unfortunately, I don't have any special way of finding these other than running full mwoffliner scrapes of suspect wikis.
Dec 11 2024
Thanks for looking into this! Let me know if you need me to keep providing examples. For now, here's another one: https://ca.wikipedia.org/api/rest_v1/page/mobile-html/Dan_Georgiadis
Here's another one on ca:
Dec 10 2024
Couldn't find anything on arz (though mwoffliner failed for a different reason), but just came across this in cawiki:
Dec 6 2024
I'll re-run the scrape of arz and es wiktionary to see if I can find more reproducible cases.
Nov 14 2024
Yes, that page is still broken. The original article that was reported is now working.
Nov 5 2024
Happening with this article from Spanish Wiktionary too: https://es.wiktionary.org/api/rest_v1/page/mobile-html/awalk
Nov 4 2024
Also, we (Kiwix) have observed this issue in the past on arz, but it was with a different article that is no longer broken (see https://github.com/openzim/mwoffliner/issues/2003)
Oct 13 2024
@fnegri Now we're getting a message that we don't have enough RAM quota. The original server was g4.cores2.ram4.disk20, so we'd like to replicate that if possible. Thanks!
Oct 1 2024
Thanks so much!
Sep 29 2024
Jul 23 2024
Thanks for the attention and explanation. We have implemented using the base URL of the wiki being scraped as the Referer header, and https://github.com/openzim/mwoffliner/issues/2067 has been closed.
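For illustration, the fix amounts to attaching the scraped wiki's base URL as the Referer on every API request. A minimal Python sketch (mwoffliner itself is JavaScript; the function name here is hypothetical):

```python
from urllib.request import Request

def api_request(wiki_base_url, path):
    # Referer is set to the base URL of the wiki being scraped,
    # per the change described above.
    return Request(wiki_base_url + path, headers={"Referer": wiki_base_url})

req = api_request("https://es.wiktionary.org", "/api/rest_v1/page/mobile-html/awalk")
```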
Jul 22 2024
mwoffliner issue: https://github.com/openzim/mwoffliner/issues/2061
Jul 17 2024
It would be best for mwoffliner (which runs the WP 1.0 Bot) if the maintenance weren't between 00:00 and 04:00 UTC, because that's when the bot is running.
Jun 24 2024
Just to be completely clear, I don't really feel comfortable editing /etc/fstab on mwcurator. If one of you could do it, that would be great! Thanks!
I can try, but I'm not sure I know what I'm doing. Where do I get the UUIDs from?
Jun 9 2024
May 25 2024
Awesome, thank you so much for the explanation! It might be worth adding that to the README here: https://dumps.wikimedia.org/other/pageview_complete/readme.html
May 18 2024
May 3 2024
Yes it seems to have worked! Thanks so much.
May 1 2024
Mar 8 2024
Awesome, thanks for the feedback. I'll look into Trove.
Mar 7 2024
The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag for this task. Thanks!
Oct 9 2023
I'm completely new to Kubernetes but have been reading through https://wikitech.wikimedia.org/wiki/Kubernetes/Kubernetes_Workshop. Does WM Cloud provide k8s clusters, or is it expected that we would provision our own cluster on individual cloud VPS instances?
Thank you for all the information, it is very helpful! We can stick to asynchronous communication if that's what works best, no problem. I guess we can keep using this ticket for Q&A?
Looking at that wiki page I linked, it seems at least somewhat out of date. I'd like to work on upgrading Python to at least 3.11, since 3.7 has been EOL since June 2023. Of course, this might require upgrading dependencies as well. I see that @Framawiki has some quarry-dev-* instances with a puppet "skip" note of:
Oct 7 2023
@SD0001 I found: https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Quarry, but I don't think I fully understand it
I thought I was going crazy because REPLICA_HOST does in fact exist in default_config.yaml, but it turns out it isn't used anywhere in the repo so it must be a vestige of an old way of calculating the hostname.
Forked discussion to T348364
Shouldn't REPLICA_DOMAIN be set to analytics.db.svc.wikimedia.cloud for this to work? I haven't tried it myself yet. Then you would get enwiki.analytics.db.svc.wikimedia.cloud which would be correct right?
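If that's right, the hostname construction would be as simple as joining the database name to the domain. A sketch (the config name follows the thread; the function is hypothetical, and I haven't verified this against the Quarry code):

```python
REPLICA_DOMAIN = "analytics.db.svc.wikimedia.cloud"

def replica_hostname(dbname, domain=REPLICA_DOMAIN):
    # e.g. "enwiki" -> "enwiki.analytics.db.svc.wikimedia.cloud"
    return f"{dbname}.{domain}"
```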
Oct 6 2023
FWIW I set up the dev environment without any issue and was able to run queries against mywiki.
Oct 5 2023
Confirmed: I got the github invite. I can also access the instances with my wikitech account, thanks!
I'm audiodude on github. Thanks!
I assume we need some kind of access to the Github repo too? https://github.com/toolforge/quarry
Another puzzling part is that MariaDB doesn't appear to be returning results as floats. I exposed the mywiki MariaDB in docker and ran this:
Documenting my investigation (no solution found).
Oct 3 2023
So is it correct that we're looking for a new maintainer, but only in the capacity of migrating all usage of Quarry to Superset? That is, no new features are planned for or expected of Quarry and we expect to turn it down once Superset has feature and use case parity?
Jul 19 2023
My pleasure. Did I save mwdiffs and mwpersistence? That's the goal. If so, should one of us update the announcement to [Cloud-announce]?
Jul 18 2023
I successfully built the singleuser image with jupyterlab=3.6.3 and this PR: https://github.com/toolforge/paws/pull/309
Maybe not quite that. Looks like mwpersistence requires deltas -> yamlconf -> PyYAML 5.4.1:
My guess is that the implicated libraries (mwdiffs and mwpersistence) are written in Python 2 and can't upgrade to the latest version of PyYAML, but that's just a guess.
I think removing those libraries simply causes pip to resolve the dependency to a later version of PyYAML which doesn't have the issue, as seen here: https://github.com/flyteorg/flytekit/pull/1752/files
I tried bumping jupyterlab to 3.6.3 as seen in this commit: https://github.com/toolforge/paws/pull/308/commits/425a53a6449198beca9e3466c32f3604bdfbe31e
