Jul 19 2022
I applied the suggested fix to the template. Based on testing, the issue appears fixed, but some mobile caching means it's not correct everywhere yet.
Mar 5 2022
The trigger condition was opening the page info, so it wouldn't have been a problem if you were opening new articles without looking at that, but I'd still prefer an additional button specifically for the copyvio check.
Aha! I added this error because I was debugging performance issues several weeks ago, and—as it says—I was receiving a lot of API requests coming in with a referer of just "https://en.wikipedia.org/" and no other clues about who was making them. I was hoping someone would notice their tool or script failing and let me know so I could see if there was a problem with its behavior and properly tag their requests in my logs.
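For illustration, here is a minimal sketch of the kind of check described above: spotting a request whose referer is just the bare site root. The function name and the exact matching rules are assumptions for this example, not the tool's actual code.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumption: only the bare en.wikipedia.org root (as quoted above) is treated as "no clues".
BARE_HOST = "en.wikipedia.org"


def has_bare_referer(referer: Optional[str]) -> bool:
    """Return True if the referer is only the site root, with no path, query, or fragment."""
    if not referer:
        return False
    parts = urlsplit(referer)
    return (
        parts.scheme in ("http", "https")
        and parts.netloc == BARE_HOST
        and parts.path in ("", "/")
        and not parts.query
        and not parts.fragment
    )
```

A caller could use this to return a descriptive error for such requests while letting requests with a page-specific referer through.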
Jun 24 2021
May 4 2021
Apr 2 2021
Mar 30 2021
Mar 28 2021
Mar 8 2021
Noting that I was able to reproduce this on test.wikipedia.org with User:The Earwig (test). The problem seems to go away when the user is autoconfirmed, though they might need to log out/log back in to fix it—not sure.
Mar 3 2021
T264765 is restricted so I can't confirm what you are saying. Does it suggest that this should work without a correct title? (But that goes against your follow-up comment?) Or that it's unrelated to the behavior of getArchivedRevisionRecord?
Feb 27 2021
Thanks a lot for the quick fix and deployment! If anyone is interested, I've got my proof of concept working in Lua here: https://en.wikipedia.org/wiki/Module:Bad_title_suggestion.
Feb 23 2021
The patch doesn't work here because it expects the user-provided title to be correct (it calls getArchivedRevisionRecord($oldid) on the PageArchive, which only looks up revisions for that page—the Main Page in Enterprisey's example, because the provided title is blank).
Feb 20 2021
Feb 16 2021
Cool! One (likely) bug: if I press the reply button multiple times, it adds multiple @s.
Considering this fixed.
Jan 22 2021
IMO we should go with background-size: 125px but re-render the commons images to make the globe smaller to match the normal logo. (Comparing the two, the WP20 logo's text is sized appropriately at 125px, but the globe is too large.)
The image size isn't correct—the native resolution is 125px but is being scaled to 100px in the CSS. Setting the background-size on .mw-wiki-logo to 125px and increasing the height by ~15px seems to kinda work, but maybe too big?
Jul 3 2020
By any chance, have you been using PAWS to copyvio-check thousands of articles on the Turkish Wikipedia? See this thread here: https://en.wikipedia.org/wiki/User_talk:The_Earwig#Earwig_doing_the_thing_again
Jun 15 2020
@MusikAnimal I see this is resolved, but to provide an explanation: copyvios already sends Access-Control-Allow-Origin: *. I think the issue was that the tools.wmflabs.org legacy redirect doesn't send it, so CopyPatrol couldn't follow it. Is this something that should be addressed in the legacy redirect?
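To make the theory concrete, here is a rough sketch that walks a recorded redirect chain and reports the first hop that lacks the CORS header. The hop data structure and the URLs in the usage note are hypothetical; real headers would have to be captured separately.

```python
from typing import Dict, List, Optional, Tuple

CORS_HEADER = "access-control-allow-origin"


def first_hop_missing_cors(hops: List[Tuple[str, Dict[str, str]]]) -> Optional[str]:
    """Return the URL of the first response without the CORS header, or None if all have it.

    Each hop is a (url, response_headers) pair; header names are compared case-insensitively.
    """
    for url, headers in hops:
        lowered = {name.lower() for name in headers}
        if CORS_HEADER not in lowered:
            return url
    return None
```

For example, a chain whose first entry is a redirect response with only a Location header, followed by a final response carrying Access-Control-Allow-Origin: *, would return the redirect's URL, which matches the failure described above.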
Apr 15 2020
@MusikAnimal: it seems we're experiencing 403s now from the google-api-proxy that I assume are coming from our end rather than Google's:
Apr 11 2020
Good to hear, though I haven't changed anything on my end and it looks like @bd808's change did not stick due to some unknown issue?
Apr 2 2020
@zhuyifei1999 I didn't comment earlier, but I don't see anything unusual in these graphs.
Mar 9 2020
I also still worry about bots. If we could get some UA logging we could at least get a better idea of what we're up against.
Feb 24 2020
No, not random killing; this is the limit I mentioned earlier on the number of requests a single worker is allowed to handle before uwsgi restarts it.
Feb 23 2020
Nothing unusual in the backtraces. In the one where there is actual activity, there's a PDF being parsed in one thread, ngrams being generated in another, and a URL being opened in another. These are all standard steps as part of a normal check/comparison.
Feb 22 2020
@Diannaa, I'm sincerely sorry for the recent problems. I've tried a few different configuration adjustments to get it to play nicely on the new infrastructure but nothing seems to help much. It seems there simply isn't enough available CPU with the current quota for the tool to do its job properly, or there is some other new limitation I can't figure out. This tells me the only solution will be a much more careful performance analysis that might require some substantial reworking of the code. I want to work on this, but I'm stuck due to some Real Life things that are eating my free time/energy for the moment. I can't promise a specific ETA.
Feb 18 2020
Initial results don't seem especially promising; the spiky CPU usage continues for about half an hour after restarting until reaching 100% and staying there seemingly indefinitely. It's not clear to me why this would happen. One conclusion I can make is that it's not directly related to memory usage; I had a theory earlier that we were bumping into the memory quota and swap thrashing, but I don't think this is happening. Memory usage is only around 600MB out of 4GB (per grafana) when we get stuck.
I have hacked my way to a higher apparent memory/CPU limit by adjusting the container resource requests to match the command-line limits instead of half (so 4 GiB RAM and 1 CPU instead of 2 GiB and 0.5, respectively). While this is certainly not a good long-term solution, I want to see if the tool behaves better under these conditions.
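The arithmetic behind that change can be sketched as follows. The dictionary layout and the function name are made up for illustration and are not the real Kubernetes manifest or the webservice command's internals; only the "half of the limit" default and the 4 GiB / 1 CPU figures come from the comment above.

```python
from typing import Dict


def container_resources(memory_gib: float, cpus: float,
                        request_fraction: float = 0.5) -> Dict[str, Dict[str, float]]:
    """Derive resource requests from limits as a fraction of each limit.

    request_fraction=0.5 mirrors the default described above (requests are half the limits);
    request_fraction=1.0 mirrors the workaround (requests equal the limits).
    """
    return {
        "limits": {"memory_gib": memory_gib, "cpu": cpus},
        "requests": {
            "memory_gib": memory_gib * request_fraction,
            "cpu": cpus * request_fraction,
        },
    }
```

With limits of 4 GiB and 1 CPU, the default yields requests of 2 GiB and 0.5 CPU, while a fraction of 1.0 yields requests equal to the limits.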
Feb 17 2020
The CPU usage being reported by grafana does not seem correct. Inside the container, with ps, I see:
Also, the error Arturo found is not a concern. That will happen for any Python 2 webservice because, as I understand it, the container image only makes Python 2 available but sets uwsgi up to load both the Python 2 and 3 plugins.
I'm not sure how to raise the limits beyond where they currently are (1 vCPU and 4GiB of RAM). Because we didn't have this issue with the old k8s cluster, I'm also not sure what has changed now. @bd808, any ideas?
Does this mean that uwsgi is actually pruning processes to stay within memory limits rather than Kubernetes or the kernel doing the pruning?
Feb 4 2020
Thanks for the help! The steps you followed seem to match what I tried, so my only theory now is that trying to override the memory/CPU caused it to fail. Unfortunately, now that it's up it seems the default memory limit is too low as my workers are getting SIGKILL'd frequently. Could you help me figure out how to raise that properly? Running webservice restart with -m 4 -c 1 doesn't seem to change anything.
Feb 3 2020
@bd808 I don't mind 15-20 minutes of downtime if you would like to try yourself (especially now when activity should be lower).
Jan 30 2020
To the end user, both issues (quota/IP change) appeared the same since copyvios shows the same error message.
Jan 27 2020
Prior discussion in this thread on my talk page.
Aug 16 2019
Hey Anomie. Here's the full request generated by Python's requests:
Aug 15 2019
Jun 9 2019
It seems to be working for me. The usual cause of this is going over the daily quota, which I don't always have good insight into. (The error message should mention the quota as a possible cause.)
May 15 2019
I've just released mwparserfromhell 0.5.4 with a fix for this specific bug (guarding that read with a NULL check and propagating the error instead). The interesting thing is that the conditions that lead to this crash should be very rare: the only situations I can think of are running out of memory or an exception being raised (like a KeyboardInterrupt) while we are in the middle of parsing a heading. I guess the latter is probably the cause, due to the timeout logic mentioned in T206654, which would also explain the reproducibility issue. (On my machine this page parses correctly in a couple seconds, but perhaps it's slow enough on ORES to trigger the timeout?)
Feb 24 2019
I've managed to fix a couple more bugs and poor design choices in the tool, and it looks like memory usage has fallen to more reasonable levels, so I'm closing this ticket. Thanks for the help earlier!
Feb 19 2019
Did some investigating with my tool of choice, guppy, and found a potential "leak" (it really shouldn't be one, but apparently a stack frame was living longer than intended and keeping a bunch of things alive with it). With that cleaned up, the pure-Python tools no longer seem to be reporting any leak candidates, but memory usage still seems kinda high. I'll follow up.
Feb 17 2019
uWSGI logs the following every several hours, which I assume is the OOM-killer:
Feb 16 2019
It's working now.
May 20 2018
Looks like there was no increase in tool usage on the 18th, but I don't have the exact number of Google API queries made readily available.
May 13 2018
OK, I'll try blocking the bot user agents from above and in @MusikAnimal's comment. If that doesn't reduce the rate on the 16th, or we still want to implement additional protections, we'll go for @Niharika's suggestion of requiring logins. (I'm not sure how this would integrate with the API, though.)
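A minimal sketch of what user-agent blocking could look like. The patterns below are placeholders, since the actual user agents being blocked are not listed here, and the function name is invented for this example.

```python
import re
from typing import Optional

# Hypothetical patterns; substitute the real bot user agents identified in the logs.
BLOCKED_UA_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"examplebot", r"examplespider")
]


def is_blocked_user_agent(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header matches any blocked pattern."""
    if not user_agent:
        return False
    return any(pattern.search(user_agent) for pattern in BLOCKED_UA_PATTERNS)
```

This only filters on the header clients choose to send, so it would not stop a crawler that presents a browser-like user agent.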
May 12 2018
OK, so starting around 2018-05-11 at 07:40, someone hammers the tool for two hours copyvio-checking a bit over a thousand AfC drafts. They're not using the API, but the sheer rate definitely makes it look like an automated process. They're checking mostly active drafts, but some declined submissions that haven't been touched in months as well. The URLs all have the same format as the copyvio check link in the submission template, a format which probably wouldn't arise if you were generating the URLs yourself, so I suspect it's some web crawler with a predictable activity pattern. I can't imagine why a person would behave in this manner, nor a real Wikipedia bot.
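One way to spot this kind of burst in request logs is to find the peak number of requests in any sliding window. This is a sketch only, assuming the log timestamps have already been parsed into datetime objects; the function name is made up.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Iterable


def max_requests_in_window(timestamps: Iterable[datetime], window: timedelta) -> int:
    """Return the largest number of requests that fall within any span of length `window`."""
    recent = deque()  # timestamps still inside the current window
    peak = 0
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop requests that are older than the window relative to the newest one.
        while ts - recent[0] > window:
            recent.popleft()
        peak = max(peak, len(recent))
    return peak
```

Comparing the peak for a two-hour window against typical daily usage would make a spike like the one described above stand out.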
May 2 2018
I don't have access to request IPs on Toolforge. Other methods of tracking are creepy/error-prone (or maybe even disallowed?), and I don't want logging in to be required, so it's difficult.
Apr 11 2018
This was fixed in mwparserfromhell v0.5 (latest stable is 0.5.1; this bug existed in versions 0.4.4 and earlier). Please upgrade.
Aug 9 2017
I fixed it. Thanks.
Nov 18 2016
Yep. It might be covered by another ticket. These kinds of things often are. I'm not sure.
It's a database desync issue. (I thought I mentioned that to primefac, guess it got miscommunicated?)
Aug 19 2016
Aug 4 2016
Redirects are followed, and cards are updated (T120695), as long as the new project title is used in wikiproject.json.
Implemented in 64aaa1d. Should work as expected, as long as the new project name is configured in wikiproject.json and the old one isn't (since that's how the bot determines which project names are valid).
We need some form of per-site configuration anyway. For example, sites have custom names for things like the wikiproject.json file, and there are localization questions with the bot's messages.
Added support for category trees. The configuration allows using a list of categories exclusively, mixing them with Wikidata, or using the project index. Should be good enough for most purposes.
It wasn't a user-agent issue, but something else that's hard to explain. Anyway, I fixed it.
Jul 7 2016
I checked my logs from that time, and it turns out the Labs databases were experiencing a bit of replication lag, which I coincidentally happened to notice:
Jun 29 2016
Jun 23 2016
This is done in the schema, and will be deployed as soon as the new update_project_index script is finished.
All documentation is now in the README or module docstrings.
Jun 9 2016
This should work now. Simply pass detail=true when using action=compare.
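As a rough illustration, a request URL with the new flag could be built like this. Only action=compare and detail=true come from the comment above; the endpoint and the other parameter names are assumptions for the sake of the example and should be checked against the API documentation.

```python
from urllib.parse import urlencode

# Assumption: endpoint shown for illustration only.
API_ENDPOINT = "https://copyvios.toolforge.org/api.json"


def build_compare_url(title: str, url: str, lang: str = "en",
                      project: str = "wikipedia") -> str:
    """Build a compare request URL with detail=true set."""
    params = {
        "version": 1,          # assumed parameter
        "action": "compare",
        "lang": lang,          # assumed parameter
        "project": project,    # assumed parameter
        "title": title,        # assumed parameter
        "url": url,            # assumed parameter
        "detail": "true",
    }
    return API_ENDPOINT + "?" + urlencode(params)
```

urlencode takes care of percent-encoding the page title and the source URL, so both can be passed in unescaped.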
@kaldari According to my logs, (human) tool usage has remained normal, but API usage completely stopped after Jun 8 at ~22:45 UTC — does this match with your info? If so, it would indicate that the German API users are responsible for the high usage rate. I don't know why they would suddenly stop using it, though, so we can't assume anything.
There are two links:
Jun 7 2016
I can do the implementation, but it would be helpful to get some suggestions for the output format.
Yes, it looks good now. Cheers.
Jun 6 2016
Google works, but unfortunately, it seems we are having some issues with the results themselves.
May 23 2016
The copyvio text has been deleted so I can't really investigate this.
May 20 2016
This question was asked and answered above for me. I don't think Eranbot uses anything besides Turnitin; did you mean CSB? At the moment, it looks like usage has dropped from the previous estimate, perhaps because people are less satisfied with the current quality of results. Ballpark is between 1,000 and 4,000 per day.
May 17 2016
About half of all queries.
May 13 2016
Is anyone gonna answer my question first?
May 10 2016
@kaldari Probably—it's not a big deal to implement—but what about the API?
@Compassionate727 It's funny, I asked nearly the exact same question...
I've got Yandex up and running for now. I set up a proxy on a personal server, since I can't use the Labs one due to the IP thing.
May 9 2016
I don't think the copyvios tool actually takes advantage of any Labs-specific features (IOW, the DB replicas). It might be cheaper for everyone if I self-host it and do some sketchy stuff on my end—like scraping Bing directly—so the Labs folks aren't held responsible.