Jun 9 2019
It seems to be working for me. The usual cause of this is going over the daily quota, which I don't always have good insight into. (The error message should mention the quota as a possible cause.)
May 15 2019
I've just released mwparserfromhell 0.5.4 with a fix for this specific bug (guarding that read with a NULL check and propagating the error instead). The interesting thing is that the conditions that lead to this crash should be very rare: the only situations I can think of are running out of memory or an exception being raised (like a KeyboardInterrupt) while we are in the middle of parsing a heading. I guess the latter is probably the cause, due to the timeout logic mentioned in T206654, which would also explain the reproducibility issue. (On my machine this page parses correctly in a couple seconds, but perhaps it's slow enough on ORES to trigger the timeout?)
Feb 24 2019
I've managed to fix a couple more bugs and poor design choices in the tool, and it looks like memory usage has fallen to more reasonable levels, so I'm closing this ticket. Thanks for the help earlier!
Feb 19 2019
Did some investigating with my tool of choice, guppy, and found a potential "leak" (it really shouldn't be one, but apparently a stack frame was living longer than intended and keeping a bunch of things alive with it). With that cleaned up, the pure-Python tools no longer seem to be reporting any leak candidates, but memory usage still seems kinda high. I'll follow up.
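For reference, the investigation above used guppy; the same "which allocation sites are keeping memory alive?" question can be sketched with the stdlib tracemalloc module instead. The `cache` list here is a stand-in for whatever the long-lived stack frame was keeping alive, not anything from the tool itself.

```python
# Sketch: locating suspect allocations with the stdlib tracemalloc module.
# (The actual investigation used guppy; this shows the same idea with a
# tool that ships with Python.)
import tracemalloc

tracemalloc.start()

# Simulate work that retains memory; 'cache' is a stand-in for a stack
# frame or closure keeping objects alive longer than intended.
cache = [bytes(1000) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")
for stat in top[:3]:
    print(stat)  # shows file:line for the largest retained allocations
```

Sorting by `"lineno"` groups allocations by source line, which is usually enough to spot the object graph a stray frame is anchoring.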
Feb 17 2019
uWSGI logs the following every several hours, which I assume is the OOM-killer:
Feb 16 2019
It's working now.
May 20 2018
Looks like there was no increase in tool usage on the 18th, but I don't have the exact number of Google API queries readily available.
May 13 2018
OK, I'll try blocking the bot user agents from above and in @MusikAnimal's comment. If that doesn't reduce the rate on the 16th, or we still want to implement additional protections, we'll go for @Niharika's suggestion of requiring logins. (I'm not sure how this would integrate with the API, though.)
May 12 2018
OK, so starting around 2018-05-11 at 07:40, someone hammered the tool for two hours, copyvio-checking a bit over a thousand AfC drafts. They're not using the API, but the sheer rate definitely makes it look like an automated process. They're checking mostly active drafts, but also some declined submissions that haven't been touched in months. The URLs all have the same format as the copyvio check link in the submission template, a format which probably wouldn't arise if you were generating the URLs yourself, so I suspect it's some web crawler with a predictable activity pattern. I can't imagine why a person would behave in this manner, nor why a real Wikipedia bot would.
May 2 2018
I don't have access to request IPs on Toolforge. Other methods of tracking are creepy/error-prone (or maybe even disallowed?), and I don't want logging in to be required, so it's difficult.
Apr 11 2018
This was fixed in mwparserfromhell v0.5 (latest stable is 0.5.1, this bug existed in versions 0.4.4 and earlier). Please upgrade.
Aug 9 2017
I fixed it. Thanks.
Nov 18 2016
Yep. It might be covered by another ticket. These kinda things often are. I'm not sure.
It's a database desync issue. (I thought I mentioned that to primefac, guess it got miscommunicated?)
Aug 19 2016
Aug 4 2016
Redirects are followed, and cards are updated (T120695), as long as the new project title is used in wikiproject.json.
Implemented in 64aaa1d. Should work as expected, as long as the new project name is configured in wikiproject.json and the old one isn't (since that's how the bot determines which project names are valid).
We need some form of per-site configuration anyway. For example, sites have custom names for things like the wikiproject.json file, and there's localization questions with the bot's messages.
Added support for category trees. The configuration allows using a list of categories exclusively, mixing them with Wikidata, or using the project index. Should be good enough for most purposes.
It wasn't a user-agent issue, but something else that's hard to explain. Anyway, I fixed it.
Jul 7 2016
I checked my logs from that time, and it turns out the Labs databases were experiencing a bit of replication lag, which I coincidentally happened to notice:
Jun 29 2016
Jun 23 2016
This is done in the schema, and will be deployed as soon as the new update_project_index script is finished.
All documentation is now in the README or module docstrings.
Jun 9 2016
This should work now. Simply pass detail=true when using action=compare.
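As an illustration, a compare request with that flag might be built like this. Only `action=compare` and `detail=true` are confirmed above (plus the endpoint mentioned elsewhere in this thread); the remaining parameter names are assumptions for the sake of the example.

```python
# Sketch: building a detail-enabled compare request against the copyvios
# API at http://tools.wmflabs.org/copyvios/api. Parameter names other
# than action and detail are illustrative assumptions.
from urllib.parse import urlencode

base = "http://tools.wmflabs.org/copyvios/api"
params = {
    "action": "compare",        # confirmed above
    "detail": "true",           # confirmed above: enables detailed output
    "project": "wikipedia",     # assumed parameter name
    "lang": "en",               # assumed parameter name
    "title": "Example",         # assumed parameter name
    "url": "http://example.com/",
}
request_url = base + "?" + urlencode(params)
print(request_url)
```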
@kaldari According to my logs, (human) tool usage has remained normal, but API usage completely stopped after Jun 8 at ~22:45 UTC — does this match with your info? If so, it would indicate that the German API users are responsible for the high usage rate. I don't know why they would suddenly stop using it, though, so we can't assume anything.
There are two links:
Jun 7 2016
I can do the implementation, but it would be helpful to get some suggestions for the output format.
Yes, it looks good now. Cheers.
Jun 6 2016
Google works, but unfortunately, it seems we are having some issues with the results themselves.
May 23 2016
The copyvio text has been deleted so I can't really investigate this.
May 20 2016
This question was asked and answered above for me. I don't think Eranbot uses anything besides Turnitin; did you mean CSB? At the moment, it looks like usage has dropped from the previous estimate, perhaps because people are less satisfied with the current quality of results. Ballpark is between 1,000 and 4,000 per day.
May 17 2016
About half of all queries.
May 13 2016
Is anyone gonna answer my question first?
May 10 2016
@kaldari Probably—it's not a big deal to implement—but what about the API?
@Compassionate727 It's funny, I asked nearly the exact same question...
I've got Yandex up and running for now. I set up a proxy on a personal server, since I can't use the Labs one due to the IP thing.
May 9 2016
I don't think the copyvios tool actually takes advantage of any Labs-specific features (IOW, the DB replicas). It might be cheaper for everyone if I self-host it and do some sketchy stuff on my end—like scraping Bing directly—so the Labs folks aren't held responsible.
@Ricordisamoa As a service, it seems fairly limited. Maybe in the future? Is there a timeframe?
May 5 2016
Agreed, we can handle this without panic.
May 4 2016
Re point #2, can we argue that CSB is user-initiated on the principle that a user submitting an article implicitly triggers a check? Maybe bury it in an edit notice when you create a page?
May 2 2016
Probably not too crazy, but it depends on the way you want the results presented.
May 1 2016
Yes. Bing was shut off at the end of the month.
Apr 28 2016
If our usage remained at 300,000 queries per month [...]
Apr 20 2016
2.2. Subject to your strict compliance with these Terms, General Policies and Site ToS, Yandex grants you a non-exclusive, non-assignable, non-transferrable right to use the Service for the following purposes: (i) display Yandex search results on your website and in application; (ii) make temporary copies of Yandex search results for the use on your website or in application.
2.5. You shall be entitled to use the Service solely for the purpose of providing Yandex search results at your website or in application without alteration of order of Yandex search results impression, unless otherwise provided herein.
2.7. [...] You hereby further undertake at any time to refrain from, as well assist or permit any third parties performing the following actions:
2.7.7. Reorder, intermix, obscure, filter, replace the text, images or other information in Yandex search results obtained through the Service, unless otherwise required by applicable legislation and provided herein;
2.7.11. Modify the display of any website or webpage accessed by the links through the Service.
I'm hoping there isn't a limit on how many accounts can register the same IP address!
Apr 18 2016
So http://tools.wmflabs.org/copyvios/api, but a solution for caveat #1?
Apr 17 2016
Apr 13 2016
Apr 3 2016
It works. Hallelujah.
Mar 31 2016
EarwigBot doesn't tag revisions at the moment, and hasn't for a while; there's only the web interface which is run on demand. Its log of checks is not public, and cross-referencing at scale would be a bit difficult because it's not designed to retain results for more than a few days.
Mar 30 2016
Yes, it makes multiple queries per check, each with different chunks of sentences distributed somewhat uniformly throughout the article. The chunks have a size limit we can adjust, and the maximum number of queries (currently 10) can be changed as well, though it needs to be large enough to generate useful results when only a portion of an article is copied.
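The chunking idea described above can be sketched roughly like this. The function and parameter names (`make_chunks`, `max_queries`, `chunk_size`) and the exact selection logic are illustrative, not the tool's actual code.

```python
# Sketch of the chunking described above: split an article into
# sentences, then build up to max_queries chunks spread evenly through
# the text, each capped at chunk_size characters, so partial copying
# anywhere in the article can still be detected.
import re

def make_chunks(text, max_queries=10, chunk_size=256):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Pick evenly spaced starting sentences so the chunks cover the
    # whole article, not just the beginning.
    step = max(1, len(sentences) // max_queries)
    chunks = []
    for i in range(0, len(sentences), step):
        chunk = ""
        for sent in sentences[i:]:
            if len(chunk) + len(sent) + 1 > chunk_size:
                break
            chunk = (chunk + " " + sent).strip()
        if chunk:
            chunks.append(chunk)
        if len(chunks) >= max_queries:
            break
    return chunks
```

Each chunk then becomes one search-engine query; raising `chunk_size` or `max_queries` trades quota for coverage.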
Mar 28 2016
Also, the WMF should have more accurate, longer-term usage stats through BOSS's own interface, which I can't access myself. I don't know whether that info would be per-user or would include Coren's stuff, but either way it would be useful information.
Mar 23 2016
The ballpark is 6,000–12,000 queries/day, based on the past few days. We might be hitting up against the CSE limit then, but just barely. There's a (fairly conservative) bound of 300,000 queries/month...
Mar 3 2016
Clock is ticking. Any updates?
Feb 27 2016
A null edit fixed it. Here's a screenshot from before:
Feb 7 2016
Experience indicates that search engines, as my tool uses them, are miles better than Turnitin at detecting copyright violations.
Feb 2 2016
Google would be ideal, if we can work out a thing with them. I've looked into DuckDuckGo a bit and I'm not sure their setup is right for us; they seem more concerned with providing semantic search results than having a large text database, which is what we really need. Having dealt with Yahoo for (seven?) years at this point, I am not terribly impressed by them and suggest we look elsewhere.
Jan 31 2016
Not a huge fan of this. None of the other admin blocking powers can disable "reader-focused" features that only affect the user directly (e.g., we can't stop people from browsing pages). Also, it doesn't seem particularly useful.
Jan 21 2016
If a file has been uploaded to Commons, it's free (otherwise it would have been deleted already); if a file has been uploaded to a Wikipedia, it's non-free (otherwise it would have been moved to Commons already).
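The rule above reduces to a one-line check on where the file is hosted. The site identifiers here are illustrative, not any particular API's values.

```python
# The heuristic above as a tiny function: the wiki a file lives on
# determines its assumed license status. Site names are illustrative.
def is_free_file(host_site: str) -> bool:
    """True if the file is assumed free (hosted on Commons),
    False if assumed non-free (hosted on a local Wikipedia)."""
    return host_site == "commons"

print(is_free_file("commons"))       # Commons files are assumed free
print(is_free_file("en.wikipedia"))  # local uploads are assumed non-free
```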
Jan 20 2016
This is... done, I think. I want to hack on the visual output further, but it works.
Dec 19 2015
Nov 11 2015
Okay, so Coren has been the point of contact in the past between me and the WMF with regard to managing the Yahoo! BOSS API keys that are necessary to use that service. As far as I know, he still has that role. I was suggesting that he could create a new key for Fhocutt for developing/testing this new feature (since sharing keys doesn't sound like a good idea, although we could do that too, I guess).
Nov 5 2015
@kaldari Still useful to test how the results look when combined with the regular BOSS hits, I guess?
Nov 4 2015
Sorry Coren, I didn't really mean to add you as a subscriber...!
Hmm... I guess you can ask @coren for a BOSS key for testing? Alternatively, disable part of EarwigBot: in earwigbot/wiki/copyvios/__init__.py, comment out line 116 and change 133 to if True:. That should make it just report "no match" for everything. I might add a more graceful fallback in the future.
Nov 3 2015
You probably didn't put it in the "wiki" section.
To be honest, I'm struggling with free time right now. Not sure the best way for you to approach this.
Sep 27 2015
All done now.
Sounds fine. I'm not sure about putting the Turnitin results above the main result summary, but that's a nitpick.
Sep 26 2015
Oh, good point on that last one. I can definitely use posts from my own blog. Will try that.
Now at https://en.wikipedia.org/wiki/User:EarwigBot/Copyvios/Tests. Did some cleanup and added a few new tests.
Sep 22 2015
This is very useful. Thanks!
Sep 21 2015
I will likely work on this on my own over the next couple of weeks. It'll be useful for other improvements that I plan to make to the comparison engine.
Sep 17 2015
I don't understand. What kind of work are you doing that requires so much memory?
Sep 11 2015
For https://en.wikipedia.org/w/index.php?title=Clinoch_of_Alt_Clut&diff=prev&oldid=680314774, I believe the correct parsing is "Clinoch of Alt Clut" rather than "Clut, Clinoch of Alt", per WP:PEER. This is strange to me because WP:AWB/GF indicates it should be doing this already. For the second page, I think "Byzantine Master of the Crucifix of Pisa" without any modification is correct.
Aug 25 2015
Yes, this is a good idea. I already use https://en.wikipedia.org/wiki/User:The_Earwig/Sandbox/CopyvioExample and https://en.wikipedia.org/wiki/User:The_Earwig/Sandbox/CopyvioPDFExample as basic sanity checks, but a more comprehensive suite would be much better.
Aug 22 2015
It is custom-written. You are right that the particular result there is poor; my first thought is to work on the confidence algorithm a bit to value large contiguous blocks more than lots of disjoint trigrams. For quotes, I'm not so sure; if that issue was fixed I think it might not be so important. I can look into that.
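One way the "value large contiguous blocks more than lots of disjoint trigrams" idea could look is to weight each matched run superlinearly in its length. This is purely an illustrative sketch of the direction mentioned above, not the tool's actual confidence algorithm.

```python
# Illustrative sketch (not the tool's actual algorithm): score matched
# regions so that one long contiguous block outweighs many scattered
# trigrams of the same total length, by weighting each block
# superlinearly (here, length ** 1.5).
def confidence(matched_blocks, article_length):
    """matched_blocks: lengths (in words) of contiguous matched runs."""
    if article_length <= 0:
        return 0.0
    weighted = sum(length ** 1.5 for length in matched_blocks)
    # Normalize so a single full-length match scores 1.0.
    return min(1.0, weighted / article_length ** 1.5)

# One 30-word block scores higher than ten scattered 3-word matches,
# even though both cover 30 words of a 100-word article:
print(confidence([30], 100) > confidence([3] * 10, 100))  # True
```

The exponent controls how strongly contiguity is rewarded: 1.0 reduces to plain coverage, while larger values increasingly favor unbroken copied passages.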
Aug 19 2015
Regarding l10n, the tool works fine for non-English content from a technical perspective (logs show many successful requests involving Korean and other non-English wikis; people have added German and Russian mirrors...).
Aug 15 2015
There is only one outstanding bug with the tool that comes to mind. I have a memory leak that I've been unable to get to the bottom of for about a year now. It happens so slowly and unpredictably that progress on it is difficult, especially given the lack of urgency and questions about why Python's internal memory management isn't working. I could probably fix it if I devoted enough time to extra debugging.