@MusikAnimal: it seems we're experiencing 403s now from the google-api-proxy that I assume are coming from our end rather than Google's:
Apr 15 2020
Apr 11 2020
Good to hear, though I haven't changed anything on my end and it looks like @bd808's change did not stick due to some unknown issue?
Apr 2 2020
@zhuyifei1999 I didn't comment earlier, but I don't see anything unusual in these graphs.
Mar 9 2020
I also still worry about bots. If we could get some UA logging we could at least get a better idea of what we're up against.
Feb 24 2020
No, not random killing, this is the limit on the number of requests a single worker is allowed to handle before uwsgi restarts it that I mentioned earlier.
Feb 23 2020
Nothing unusual in the backtraces. In the one where there is actual activity, there's a PDF being parsed in one thread, ngrams being generated in another, and a URL being opened in another. These are all standard steps as part of a normal check/comparison.
Feb 22 2020
@Diannaa, I'm sincerely sorry for the recent problems. I've tried a few different configuration adjustments to get it to play nicely on the new infrastructure but nothing seems to help much. It seems there simply isn't enough available CPU with the current quota for the tool to do its job properly, or there is some other new limitation I can't figure out. This tells me the only solution will be a much more careful performance analysis that might require some substantial reworking of the code. I want to work on this, but I'm stuck due to some Real Life things that are eating my free time/energy for the moment. I can't promise a specific ETA.
Feb 18 2020
Initial results don't seem especially promising; the spiky CPU usage continues for about half an hour after restarting until reaching 100% and staying there seemingly indefinitely. It's not clear to me why this would happen. One conclusion I can make is that it's not directly related to memory usage; I had a theory earlier that we were bumping into the memory quota and swap thrashing, but I don't think this is happening. Memory usage is only around 600MB out of 4GB (per grafana) when we get stuck.
I have hacked my way to a higher apparent memory/CPU limit by adjusting the container resource requests to match the command-line limits instead of half (so 4 GiB RAM and 1 CPU instead of 2 GiB and 0.5, respectively). While this is certainly not a good long-term solution, I want to see if the tool behaves better under these conditions.
Feb 17 2020
The CPU usage being reported by grafana does not seem correct. Inside the container, with ps, I see:
Also, the error Arturo found is not a concern. That will happen for any Python 2 webservice because, as I understand it, the container image only makes Python 2 available but sets uwsgi up to load both the Python 2 and 3 plugins.
I'm not sure how to raise the limits beyond where they currently are (1 vCPU and 4GiB of RAM). Because we didn't have this issue with the old k8s cluster, I'm also not sure what has changed now. @bd808, any ideas?
Does this mean that uwsgi is actually pruning processes to stay within memory limits rather than Kubernetes or the kernel doing the pruning?
Feb 4 2020
Thanks for the help! The steps you followed seem to match what I tried, so my only theory now is that trying to override the memory/CPU caused it to fail. Unfortunately, now that it's up it seems the default memory limit is too low as my workers are getting SIGKILL'd frequently. Could you help me figure out how to raise that properly? Running webservice restart with -m 4 -c 1 doesn't seem to change anything.
Feb 3 2020
@bd808 I don't mind 15-20 minutes of downtime if you would like to try yourself (especially now when activity should be lower).
Jan 30 2020
To the end user, both issues (quota/IP change) appeared the same since copyvios shows the same error message.
Jan 27 2020
Prior discussion in this thread on my talk page.
Aug 16 2019
Hey Anomie. Here's the full request generated by Python's requests:
Aug 15 2019
Jun 9 2019
It seems to be working for me. The usual cause of this is going over the daily quota, which I don't always have good insight into. (The error message should mention the quota as a possible cause.)
May 15 2019
I've just released mwparserfromhell 0.5.4 with a fix for this specific bug (guarding that read with a NULL check and propagating the error instead). The interesting thing is that the conditions that lead to this crash should be very rare: the only situations I can think of are running out of memory or an exception being raised (like a KeyboardInterrupt) while we are in the middle of parsing a heading. I guess the latter is probably the cause, due to the timeout logic mentioned in T206654, which would also explain the reproducibility issue. (On my machine this page parses correctly in a couple seconds, but perhaps it's slow enough on ORES to trigger the timeout?)
Feb 24 2019
I've managed to fix a couple more bugs and poor design choices in the tool, and it looks like memory usage has fallen to more reasonable levels, so I'm closing this ticket. Thanks for the help earlier!
Feb 19 2019
Did some investigating with my tool of choice, guppy, and found a potential "leak" (it really shouldn't be one, but apparently a stack frame was living longer than intended and keeping a bunch of things alive with it). With that cleaned up, the pure-Python tools no longer seem to be reporting any leak candidates, but memory usage still seems kinda high. I'll follow up.
Feb 17 2019
uWSGI logs the following every several hours, which I assume is the OOM-killer:
Feb 16 2019
It's working now.
May 20 2018
Looks like there was no increase in tool usage on the 18th, but I don't have the exact number of Google API queries made readily available.
May 13 2018
OK, I'll try blocking the bot user agents from above and in @MusikAnimal's comment. If that doesn't reduce the rate on the 16th, or we still want to implement additional protections, we'll go for @Niharika's suggestion of requiring logins. (I'm not sure how this would integrate with the API, though.)
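A minimal sketch of what blocking by user agent could look like, assuming a simple pattern match on the request's User-Agent header. The patterns below are hypothetical placeholders, not the actual user agents discussed here, and `is_blocked_user_agent` is an illustrative name rather than anything in the tool:

```python
import re

# Hypothetical patterns for illustration only; the real user agents
# under discussion are not listed in this thread excerpt.
BLOCKED_UA_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"examplebot", r"crawler-x")
]


def is_blocked_user_agent(user_agent):
    """Return True if the User-Agent string matches a blocked pattern."""
    if not user_agent:
        # Policy choice in this sketch: a missing UA is not blocked.
        return False
    return any(p.search(user_agent) for p in BLOCKED_UA_PATTERNS)
```

A request handler would call this before doing any work and return an error response on a match. User agents are trivially spoofed, so this only filters clients that identify themselves honestly.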
May 12 2018
OK, so starting around 2018-05-11 at 07:40, someone hammers the tool for two hours copyvio-checking a bit over a thousand AfC drafts. They're not using the API, but the sheer rate definitely makes it look like an automated process. They're checking mostly active drafts, but some declined submissions that haven't been touched in months as well. The URLs all have the same format as the copyvio check link in the submission template, a format which probably wouldn't arise if you were generating the URLs yourself, so I suspect it's some web crawler with a predictable activity pattern. I can't imagine why a person would behave in this manner, nor a real Wikipedia bot.
May 2 2018
I don't have access to request IPs on Toolforge. Other methods of tracking are creepy/error-prone (or maybe even disallowed?), and I don't want logging in to be required, so it's difficult.
Apr 11 2018
This was fixed in mwparserfromhell v0.5 (latest stable is 0.5.1; this bug existed in versions 0.4.4 and earlier). Please upgrade.
Aug 9 2017
I fixed it. Thanks.
Nov 18 2016
Yep. It might be covered by another ticket. These kinda things often are. I'm not sure.
It's a database desync issue. (I thought I mentioned that to primefac, guess it got miscommunicated?)
Aug 19 2016
Aug 4 2016
Redirects are followed, and cards are updated (T120695), as long as the new project title is used in wikiproject.json.
Implemented in 64aaa1d. Should work as expected, as long as the new project name is configured in wikiproject.json and the old one isn't (since that's how the bot determines which project names are valid).
We need some form of per-site configuration anyway. For example, sites have custom names for things like the wikiproject.json file, and there are localization questions with the bot's messages.
Added support for category trees. The configuration allows using a list of categories exclusively, mixing them with Wikidata, or using the project index. Should be good enough for most purposes.
It wasn't a user-agent issue, but something else that's hard to explain. Anyway, I fixed it.
Jul 7 2016
I checked my logs from that time, and it turns out the Labs databases were experiencing a bit of replication lag, which I had happened to notice at the time:
Jun 29 2016
Jun 23 2016
This is done in the schema, and will be deployed as soon as the new update_project_index script is finished.
All documentation is now in the README or module docstrings.
Jun 9 2016
This should work now. Simply pass detail=true when using action=compare.
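As a rough sketch of building such a request URL: the endpoint is the one mentioned elsewhere in this feed, and only `action=compare` and `detail=true` come from the comment above. `build_compare_url` and the `extra_params` argument are hypothetical helpers; the other parameters a comparison needs are not shown here and must be supplied by the caller.

```python
from urllib.parse import urlencode

# Endpoint as mentioned elsewhere in this feed.
API_BASE = "http://tools.wmflabs.org/copyvios/api"


def build_compare_url(extra_params=None):
    """Build a compare request URL with detailed output enabled.

    extra_params is a hypothetical hook for whatever other parameters
    the comparison requires; none are assumed here.
    """
    params = {"action": "compare", "detail": "true"}
    params.update(extra_params or {})
    return API_BASE + "?" + urlencode(params)
```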
@kaldari According to my logs, (human) tool usage has remained normal, but API usage completely stopped after Jun 8 at ~22:45 UTC — does this match with your info? If so, it would indicate that the German API users are responsible for the high usage rate. I don't know why they would suddenly stop using it, though, so we can't assume anything.
There are two links:
Jun 7 2016
I can do the implementation, but it would be helpful to get some suggestions for the output format.
Yes, it looks good now. Cheers.
Jun 6 2016
Google works, but unfortunately, it seems we are having some issues with the results themselves.
May 23 2016
The copyvio text has been deleted so I can't really investigate this.
May 20 2016
This question was asked and answered above for me. I don't think Eranbot uses anything besides Turnitin; did you mean CSB? At the moment, it looks like usage has dropped from the previous estimate, perhaps because people are less satisfied with the current quality of results. Ballpark is between 1,000 and 4,000 per day.
May 17 2016
About half of all queries.
May 13 2016
Is anyone gonna answer my question first?
May 10 2016
@kaldari Probably—it's not a big deal to implement—but what about the API?
@Compassionate727 It's funny, I asked nearly the exact same question...
I've got Yandex up and running for now. I set up a proxy on a personal server, since I can't use the Labs one due to the IP thing.
May 9 2016
I don't think the copyvios tool actually takes advantage of any Labs-specific features (IOW, the DB replicas). It might be cheaper for everyone if I self-host it and do some sketchy stuff on my end—like scraping Bing directly—so the Labs folks aren't held responsible.
@Ricordisamoa As a service, it seems fairly limited. Maybe in the future? Is there a timeframe?
May 5 2016
Agreed, we can handle this without panic.
May 4 2016
Re point #2, can we argue that CSB is user-initiated on the principle that a user submitting an article implicitly triggers a check? Maybe bury it in an edit notice when you create a page?
May 2 2016
Probably not too crazy, but it depends on the way you want the results presented.
May 1 2016
Yes. Bing was shut off at the end of the month.
Apr 28 2016
If our usage remained at 300,000 queries per month [...]
Apr 20 2016
2.2. Subject to your strict compliance with these Terms, General Policies and Site ToS, Yandex grants you a non-exclusive, non-assignable, non-transferrable right to use the Service for the following purposes: (i) display Yandex search results on your website and in application; (ii) make temporary copies of Yandex search results for the use on your website or in application.
2.5. You shall be entitled to use the Service solely for the purpose of providing Yandex search results at your website or in application without alteration of order of Yandex search results impression, unless otherwise provided herein.
2.7. [...] You hereby further undertake at any time to refrain from, as well assist or permit any third parties performing the following actions:
2.7.7. Reorder, intermix, obscure, filter, replace the text, images or other information in Yandex search results obtained through the Service, unless otherwise required by applicable legislation and provided herein;
2.7.11. Modify the display of any website or webpage accessed by the links through the Service.
I'm hoping there isn't a limit on how many accounts can register the same IP address!
Apr 18 2016
So http://tools.wmflabs.org/copyvios/api, but a solution for caveat #1?
Apr 17 2016
Apr 13 2016
Apr 3 2016
It works. Hallelujah.
Mar 31 2016
EarwigBot doesn't tag revisions at the moment, and hasn't for a while; there's only the web interface which is run on demand. Its log of checks is not public, and cross-referencing at scale would be a bit difficult because it's not designed to retain results for more than a few days.
Mar 30 2016
Yes, it makes multiple queries per check, each with different chunks of sentences distributed somewhat uniformly throughout the article. The chunks have a size limit we can adjust, and the maximum number of queries (currently 10) can be changed as well, though it needs to be large enough to generate useful results when only a portion of an article is copied.
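A simplified sketch of the chunk selection idea, not the tool's actual code: split the text into sentences, pick evenly spaced ones, and cap both the chunk size and the number of queries. The function name, the naive sentence splitter, and the 128-character default are assumptions for illustration; only the default of 10 queries comes from the comment above.

```python
import re


def select_query_chunks(text, max_queries=10, max_chunk_chars=128):
    """Pick up to max_queries sentence chunks spread evenly through text."""
    # Naive sentence split on terminal punctuation; a real implementation
    # would likely use a proper tokenizer.
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()
    ]
    if not sentences:
        return []
    count = min(max_queries, len(sentences))
    # Evenly spaced indices so chunks cover the whole article,
    # not just the beginning.
    step = len(sentences) / count
    indices = [int(i * step) for i in range(count)]
    # Truncate each chunk to the adjustable size limit.
    return [sentences[i][:max_chunk_chars] for i in indices]
```

Lowering `max_queries` saves search quota, but with too few chunks a partially copied article may have no chunk fall inside the copied portion.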
Mar 28 2016
Also, the WMF should have more accurate/long-term information on usage stats through BOSS's own interface, which I can't access myself. I don't know if said info would be per-user or including Coren's stuff, but either way it would be useful information.
Mar 23 2016
The ballpark is 6,000–12,000 queries/day, based on the past few days. We might be hitting up against the CSE limit then, but just barely. There's a (fairly conservative) bound of 300,000 queries/month...
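For reference, the daily ballpark can be projected to a monthly figure and compared against the 300,000 number. This is plain arithmetic under the assumption of a 30-day month; the names are illustrative only.

```python
DAYS_PER_MONTH = 30        # assumption: a 30-day month
MONTHLY_FIGURE = 300_000   # queries/month figure quoted above


def monthly_projection(queries_per_day, days=DAYS_PER_MONTH):
    """Project a steady daily query rate to a monthly total."""
    return queries_per_day * days


low = monthly_projection(6_000)     # 180,000 queries/month
high = monthly_projection(12_000)   # 360,000 queries/month
# Daily rate at which a 30-day month reaches the quoted figure.
break_even_per_day = MONTHLY_FIGURE / DAYS_PER_MONTH  # 10,000 queries/day
```

Under that assumption, the low end of the ballpark sits below the 300,000 figure and the high end above it, with the crossover at 10,000 queries/day.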
Mar 3 2016
Clock is ticking. Any updates?
Feb 27 2016
A null edit fixed it. Here's a screenshot from before:
Feb 7 2016
Experience indicates that search engines, as used by my tool, are miles better than Turnitin at detecting copyright violations.
Feb 2 2016
Google would be ideal, if we can work out a thing with them. I've looked into DuckDuckGo a bit and I'm not sure their setup is right for us; they seem more concerned with providing semantic search results than having a large text database, which is what we really need. Having dealt with Yahoo for (seven?) years at this point, I am not terribly impressed by them and suggest we look elsewhere.
Jan 31 2016
Not a huge fan of this. None of the other admin blocking powers can disable "reader-focused" features that only affect the user directly (e.g., we can't stop people from browsing pages). Also, it doesn't seem particularly useful.