Jun 6 2020
Should we not get rid of the word ‘spam’? It is a blocklist for external links, which are often, but certainly not only, spam
Jan 29 2020
Related discussion to show how spamming WikiData has an effect on other wikis: https://en.wikipedia.org/w/index.php?title=Talk:The_Pirate_Bay&oldid=938126260#Official_website_template
Jan 26 2020
Jan 23 2020
Nov 2 2019
The only way I would see is that there is an abusefilter-variety that is
enabled for checkusers (so a separate one). It was however confessed to me
that the AbuseFilter itself needs a serious upgrade, so I can imagine that
a CU-clone of it is not soon going to happen.
Oct 29 2019
(Did not see this earlier)
Oct 22 2019
LinkWatcher has moved to its own instance (VM), so this is no longer an issue on the instances.
Oct 11 2019
A way to circumvent the large index is to turn this into something like an abusefilter for checkusers only. Being alerted when someone in a range uses a recognizable UA is a gazillion times better than finding a sock after 50 edits and waiting for CUs to check and confirm while the sock is already on the next account.
Jul 19 2019
@Bstorm: Thanks! All is working now, except I have to now make explicit to perl where 'toolsdb' is (previously, basically saying 'sqlhost=toolsdb' is enough). What is the full address? - Got it!!
Jul 17 2019
@Bstorm, is there anything you need from my end now? How do I proceed?
Jul 15 2019
Just as a very recent example, see https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Spam/LinkReports/yeyebook.com. That is 1 year worth of very slow addition of external links by a multitude of IPs. If you see one individual IP doing one or two edits on one wiki you would not know that this is part of a larger campaign. You would only see one or two edits on one Wiki and without the db you would have no clue that this is happening on 6 different wikis by 13 IPs.
@bd808 I understand, I do maintain these bots with a ‘fear’ that at some
point a failure will render my db broken (it happened before, and this is
the third place where I started this db from scratch). It is ‘painful’ but
it happens. Thank you for your evaluation.
Jul 13 2019
@bd808 Can you tell me what was the outcome of the 9/7 meeting?
Jul 10 2019
The data is quite valuable, as it enables on-wiki users to see who added what links in the past, and the content allows for statistical spam detection. It is therefore persistent. There is 7.5 years of data there, and given that one tools-sgeexec has difficulties keeping up with current additions and statistics, rebuilding it is a gigantic task (plus, valuable information in the form of deleted articles is invisible and hence cannot be rebuilt without the global admin bit).
Jul 7 2019
Re db size: this db is about 7.5 years worth of data, I expect that this
will be enough for more than 5 years in the future. As MediaWiki starts to
store similar data now itself, I may be able to use MediaWiki’s data in the
future and stop storing it myself (or store less).
Jul 6 2019
I have to see what is needed. It is also something that is useful for me
to learn, but I likely need help
Jun 26 2019
The idea is to move. However, I will need some help (some has already been
offered, and I have been asking around for more). Finding the time to get
this up and running is another issue (as a volunteer)
Jun 25 2019
We handle cases back to 2008 ... I currently have an AfD of a case that is
8.5 years old.
I am looking at this from a spam-detection point of view. The way I see
this, it may result in records under my name, as if I added a spam link,
because a spammer added a link to a template. That would disable a lot of
statistical spam-detection mechanisms (and, e.g., mechanisms like xLinkBot).
May 22 2019
I will have a look and maybe get the ball rolling. I generally do not have a lot of time, but should have 2-3 weeks with more time at the end of July to do more work on it.
Replicas are too slow. Linkwatcher tries to work in real time: it tries to
keep up with the edit feeds (if the capacity of the sgeexec hosts allows,
which currently it doesn’t). Warning or blocking IP/account-hopping
spammers, or blacklisting their links to get the message through, only
makes sense if they are caught in the act. There is info in the wiki db,
but I doubt it is easy to search, even wiki-by-wiki. Trying to find those
additions of porhub.com, and realising that they were all added by
single-edit IPs, is already a good test: write a query from which you can
conclude it is xwiki spam. Two queries (or even one) on my db show you
that there are as many IPs adding the link as there are additions of the
link, an unlikely coincidence.
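As a rough illustration of that check, here is a minimal Python sketch. It assumes a hypothetical list of (wiki, user, domain) records rather than the real linkwatcher tables, and the names `summarize` and `looks_like_hit_and_run` are made up for this example. It counts the additions of a domain, the distinct IPs behind them and the wikis involved, and flags the case where every addition comes from a different IP.

```python
import ipaddress


def is_ip(user: str) -> bool:
    """Return True if the user name parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(user)
        return True
    except ValueError:
        return False


def summarize(additions, domain):
    """Count additions, distinct IP editors and wikis for one domain.

    `additions` is an iterable of (wiki, user, domain) tuples; this layout
    is an assumption of the sketch, not the real db schema.
    """
    rows = [a for a in additions if a[2] == domain]
    return {
        "additions": len(rows),
        "distinct_ips": len({user for _, user, _ in rows if is_ip(user)}),
        "wikis": len({wiki for wiki, _, _ in rows}),
    }


def looks_like_hit_and_run(summary) -> bool:
    """Flag a domain when every addition was made by a different IP."""
    return summary["additions"] > 1 and summary["additions"] == summary["distinct_ips"]


additions = [
    ("enwiki", "192.0.2.1", "spam.example"),
    ("dewiki", "192.0.2.2", "spam.example"),
    ("frwiki", "198.51.100.7", "spam.example"),
    ("enwiki", "Alice", "good.example"),
    ("enwiki", "Alice", "good.example"),
]
spam_summary = summarize(additions, "spam.example")
good_summary = summarize(additions, "good.example")
```

The threshold (more than one addition) is arbitrary; in practice one would probably want a higher minimum before treating the pattern as suspicious.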
That would be a great idea. Note that also my tool coibot would need to go
there, and that both need significant capacity to run (linkwatcher is
struggling on its current instance due to workload, and if coibot needs to
run there as well ...).
Can someone please move all other bots away from the instance that runs linkwatcher? @valhallasw?
@Bstorm: can you provide me with the names/IPs of the editors that were
spamming porhub.com (including the diffs of addition and on which wiki)?
That would result in an immediate loss of functionality for the spam
Apr 7 2018
Jul 3 2017
@bd808 It is not that trivial, the new project would need to run coibot and linkwatcher, as they both do their share of analysis on the created db.
Jul 2 2017
@valhallasw Do you mind clearing the instance that linkwatcher is on .. it does not have enough resources and starts to build up a backlog. It is currently on 1438. Thanks!
Apr 18 2017
@valhallasw Would you mind making sure that linkwatcher is the only bot on 1403? I had to start it this morning, it apparently crashed. Thanks!
Mar 29 2017
@Samwalton9, what do you mean by 'at some point'? Do you mean that this
has an enormous lag? We do see some effect in deterring spammers by acting
in real time (within minutes), many are hit and run editors, and I have
seen 'good faith spammers' with many warnings on many IPs complain that
they were never contacted ..
Feb 15 2017
If there is a rc-feed of edits for testwiki, I could set up LiWa3 to feed
added links to a channel on freenode (have to figure that out, it is a
matter of changing on-wiki settings and some killing on the server, it is a
long time since I added a feed manually). I would say that if the rc-feed
processed an edit, that then the db should also be updated. In my
experience, external link searches on wikis are quickly updated (as fast as
a diff gets saved and reported to rc), and as far as I understand that
search is based on a separate table that gets updated after every diff. I
presume that that same hook would update the list of added and removed
Feb 10 2017
This is probably better handled through T6459, a complete overhaul so it
is also easier to administer
Feb 1 2017
Jan 24 2017
With hundreds of edits per minute to the 800 wikis that are checked ...
there are likely domains in every conceivable range ...
Just to clarify:
That is also one of my suspicions. The other one is that a domainowner
noticed that there is a bot requesting data from their site, and they want
to know whether that is/was legit ... or that the site itself got added a
lot in some tracking template in places where Linkwatcher and coibot would
notice (the latter being odd but not impossible). I would need to know
specific requests that triggered this now ...
I also need to know what they see as harmful. Coibot and Linkwatcher are
checking added links for viability, whether they are redirects, and whether
they are containing typical 'money making schemes'. If they note a lot of
traffic, then those links are added to Wikipedia at a somewhat alarming
Jan 4 2017
hmm. Any idea how long those 3 python scripts will stay? linkwatcher will
munch away its backlog in time. Until the wikimedia linklog system comes
online I don't foresee a way of making linkwatcher smaller.
@valhallasw, would you mind clearing the instance linkwatcher is on? There
are three heavy python scripts there as well, and LiWa3 is building up a
massive backlog. Thanks.
Dec 6 2016
If I understand it correctly, these are basically giving the possibility to search 'by link/domain', 'by username/IP' or 'by pagename', right? That sounds about the most important. The first two are how we generally search: we either know the domain and find the spammers, or we know a spammer and want to find the domains. The third one is useful to monitor typical pages spammers would hit. The system should be designed in such a way that the three searches can link to each other: if I am looking at a list of additions of a certain domain, with 3 different users adding them, I should have for each user a link to the 'search for this user', so I can snowball quickly.
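To make the 'snowballing' concrete, here is a minimal in-memory Python sketch. The `Addition` record and the `LinkIndex` class are hypothetical names for this example only; a real system would do the same with indexed database queries. Each of the three searches returns records that carry the keys needed to start the other two.

```python
from collections import defaultdict
from typing import NamedTuple


class Addition(NamedTuple):
    """One recorded link addition (fields are an assumption of this sketch)."""
    domain: str
    user: str
    page: str


class LinkIndex:
    """Index additions by domain, user and page so searches can cross-link."""

    FIELDS = ("domain", "user", "page")

    def __init__(self, additions):
        self.by = {name: defaultdict(list) for name in self.FIELDS}
        for addition in additions:
            for name in self.FIELDS:
                self.by[name][getattr(addition, name)].append(addition)

    def search(self, field, value):
        """Return all additions where `field` equals `value`."""
        return list(self.by[field].get(value, []))

    def related(self, field, value, target):
        """From one search, list the distinct values to follow up on."""
        return sorted({getattr(a, target) for a in self.search(field, value)})


index = LinkIndex([
    Addition("spam.example", "SpamUser", "Widget"),
    Addition("spam.example", "192.0.2.1", "Gadget"),
    Addition("other.example", "SpamUser", "Widget"),
])
# Start from a domain, find its users, then find what else one of them added.
users = index.related("domain", "spam.example", "user")
domains = index.related("user", "SpamUser", "domain")
```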
Nov 14 2016
The only other thing that changed is that I ran cpan to install 'LWP' - has that changed settings that now make everything run? Or did someone enforce a refresh of the modules server-wide? The regular LWP::UserAgent now works as well ...
I resolved the first three
- the regex problem is a Perl problem: it has apparently been set to be stricter (all my bots complain about those regexes, and it is known for Perl - either it has become stricter in a newer version, or a setting has changed in how the regex module is loaded).
- the BSD::Resource problem disappears when I change the order of the called modules (funny - it suggests that some things get loaded in earlier modules that make later modules fail - already loaded version of older modules that don't get reloaded with next modules and which have a different 'version'?)
- The getrlimit seems to have been resolved as well by reshuffling the module-calling ..
@valhallasw: you say ".. on an older Perl version doesn't work on a newer version anymore" .. there is a newer Perl version on the 14XX hosts (and also a new PHP for that matter)?
Nov 13 2016
Well, obviously, they are not the same as the 12xx nodes (see also a bug about sudden php errors on 14xx nodes that were not there on the 12xx nodes, bug T149810). These issues are also for me impossible to debug, as the perl errors were not there on the 12xx nodes, which apparently had everything correctly installed. Again, I did not change anything, yet everything crashes. Am I now to tell how the 14xx nodes are different from the 12xx nodes .. as you say libbsd-resource-perl is installed yet throws a 'not found' error ..
(I temporarily killed the bot (tools.xlinkbot) that is now non-functional, I expect others to become problematic in time).
Nov 9 2016
I am going to work out some thought experiment here. My suggestion to re-write the current spam-blacklist extension (or better, rewrite another extension):
- take the current AbuseFilter, take out all the code that interprets the rule ('conditions').
- Make 2 fields:
  - one text field for regexes that block added external links (the blacklist). Can contain many rules (one on each line).
  - one text field for regexes that override the block (whitelist overriding this blacklist field; that is generally simpler and cleaner than writing a complex regex, not everybody is a specialist on regexes).
- Add namespace choice (checkboxes, so one can choose not to blacklist something in one particular namespace, with the addition of an 'all', a 'content-namespace only' and a 'talk-namespace only' option).
- Add user status choice (checkboxes for the different roles, or like the page-protection levels)
- Some links are fine in discussions but should not be used in mainspace, others are a total nono
- Some image links are fine in the file-namespace to tell where it came from, but not needed in mainspace
- Leave all the other options:
  - Discussion field for evidence (or better, a talk-page like function)
  - Enabled/disabled/deleted - not needed, turn it off, obsolete then delete
  - 'Flag the edit in the edit filter log' - maybe nice to be able to turn it off, to get rid of the real rubbish that doesn't need to be logged
  - Rate limiting - catch editors that start spamming an otherwise reasonably good link
  - Warn - could be a replacement for en:User:XLinkBot
  - Prevent the action - as is the current blacklist/whitelist function
  - Revoke autoconfirmed - make sure that spammers are caught and checked
  - Tagging - for combining certain rules to be checked by RC patrollers.
- I would consider to add a button to auto-block editors on certain typical spambot-domains.
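To show how the core of such a rule could behave, here is a minimal Python sketch. It is not AbuseFilter or spam-blacklist code; `LinkRule` and its fields are hypothetical names for this thought experiment. It covers only the two regex fields (whitelist overriding blacklist), the namespace choice and the user status choice, and leaves out the actions (warn, rate limit, tag and so on).

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass
class LinkRule:
    """One rule: blacklist regexes, whitelist overrides, and scoping."""
    blacklist: str                              # one regex per line
    whitelist: str = ""                         # one regex per line, overrides the blacklist
    namespaces: Optional[FrozenSet[int]] = None  # None means all namespaces
    exempt_groups: FrozenSet[str] = frozenset()  # user groups the rule skips

    @staticmethod
    def _patterns(text: str):
        """Compile each non-empty line of a text field as a regex."""
        return [re.compile(line.strip(), re.IGNORECASE)
                for line in text.splitlines() if line.strip()]

    def blocks(self, url: str, namespace: int, user_groups: Iterable[str]) -> bool:
        """Return True if adding `url` should be prevented by this rule."""
        if self.namespaces is not None and namespace not in self.namespaces:
            return False
        if self.exempt_groups & set(user_groups):
            return False
        if any(p.search(url) for p in self._patterns(self.whitelist)):
            return False
        return any(p.search(url) for p in self._patterns(self.blacklist))


rule = LinkRule(
    blacklist=r"\bexample\.com\b",
    whitelist=r"example\.com/about",
    namespaces=frozenset({0}),          # mainspace only
    exempt_groups=frozenset({"sysop"}),
)
blocked = rule.blocks("http://www.example.com/buy", 0, ["user"])
```

Keeping the whitelist as a separate field means the blacklist regex can stay simple, which is the point made in the list above.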
Nov 6 2016
Nov 3 2016
@valhallasw The bot moved two days ago again, and I had to restart it now .. it is now on tools-exec-1417 (2 x edited comment).
Oct 26 2016
@valhallasw The bot yesterday moved to 1216. It is not backlogging, but maybe it is good to make sure other tasks do not run on this instance.
Sep 7 2016
@valhallasw - it crashed, and is now on 1213. Do you mind moving the other tasks (it is back making backlogs again)?
Aug 25 2016
@valhallasw can you please resubmit jobs on tools-exec-1203 .. linkwatcher seems to interfere with other scripts running there.
Jun 1 2016
May 19 2016
Feb 24 2016
@valhallasw - I had to move the bot to another instance, it is now on 1205 (if I become linkwatcher I can't ssh to 1209, access denied).
Feb 21 2016
@valhallasw - the bot moved to 1209
Jan 24 2016
I'm working on that @MarcoAurelio - Now back online.
@valhallasw - the bot crashed (no clue why, it seems to have troubles with MySQL). I restarted it this morning, and it is now on 1215
Jan 19 2016
Grr, I noted a bug on one of the counts (resolved) - it is now counting those and filling the proper table to reduce the counts. Re-indexing of the broken index is now done.
Thank you. Not sure if I understand the situation with the privacy, you mean that there is no way to exclude the queries from other people which may contain information that I should not see - as the bot operator, I do know (in principle) which queries the bot runs.
@valhallasw: a good solution would be assigning 200-300% processor to the whole task. I found http://wiki.crc.nd.edu/wiki/index.php/Submitting_Batch/SGE_jobs - which suggests "-pe mpi-# #" would be the option .. (I'm not a specialist in this)
Jan 18 2016
Can this enforce that all sub-spawned processes are running on the same exec-host (or can the spawning command 'enforce' that)? As my bots are currently set up, the sub-processes communicate with the mother process through TCP, which means that they (at the moment) cannot communicate between exec hosts (this would help with T123121). (I could make the communication go through MySQL or files, but that would be quite a task).
Jan 17 2016
The bot is still eating away its (old) backlog, which goes slowly. The bot seems to operate fine now with far fewer processes. Still, it uses 200-250% of processor power, which seems to be necessary for a bot doing all this work. As mentioned earlier, we could consider a rewrite that makes the sub-processes run independently, or I could split the bot into three smaller bots - but both actions require significant rewrites for which I do not have time.
Jan 14 2016
@valhallasw - I have added 2 more parsers (total now 12) - the bot is creating a backlog, likely during the American daytime, which it does not munch away at night.
Jan 12 2016
@valhallasw: taking the number of parsers down from 10 to 8 resulted in formation of a backlog within 10 minutes. Trying 9 .. (the parsers are the processor intensive processes, the others hardly ever take more than 3% each, and often are 0).
@valhallasw - thank you for the lengthy explanation. This bot has now been running on labs for a long time (and sometimes for long uptimes without problems - it has at least once managed to run for more than 6 months in a row), and has been running smoothly here. The main thing that I see from running this system in a multi-bot environment is indeed the interaction with the other bots. When it was privately hosted, sometimes the other bots were 'munching' too much and the bot started lagging - I see that here as well (and obviously, and my apologies for that, the opposite also happens). In the early times of Labs, it has indeed been running on its own instance for some time, both to avoid bringing down other bots and to avoid being brought down by other bots.
Jan 11 2016
Valhallasw, it spawns many subprocesses to be able to keep up with
wikipedia editing. It needs to parse in real time as anti-spam bots and
work depend on it.
Jan 5 2016
@jcrespo. The last upgrade of the bot seems to have brought down the load significantly over the (my) night - doing successive 'show processlist;' statements does not show many queries running longer than 5 seconds, and hardly any longer than 10 seconds (which should now really happen less and less). When this bot ([[:m:User:LiWa3]]) is back up and running in full, I will turn my attention to the second bot ([[:m:User:COIBot]]) that makes heavy use of this db.
Jan 4 2016
Working on it again. Some of the new counting mechanisms were not performing as requested, but that has now been updated.
Dec 24 2015
Is that since this morning (UTC+3)?
Dec 14 2015
It would be great if this could be a real-time IRC feed as well - as then http://en.wikipedia.org/wiki/User:XLinkBot can hook live into the feed and revert when conditions are met.
@jcrespo: I have implemented new counting tables based on the three 'offending' queries above (and will implement if there are more, just tell me here what is being queried and I will devise a solution for it).
@jcrespo: I was notified immediately, but unfortunately at the start of my weekend, with an email that told me hardly anything, just that the number of connections was restricted. I reacted immediately after the weekend, and it still took time to realise why the bots were affected by this. Moreover, the en.wikipedia policy that you quote regards the editing on-wiki, which was not the problem here (the main bot does not even edit on-wiki) - it was the database.
Dec 13 2015
Let me have a manual look at the 'offending queries' one of these days .. see if I can reproduce. When WikiData started I had problems with three bots that ran at hundreds of edits per minute, which brought everything down. Maybe I have a similar problem here now.
Well, with one connection the bots cannot run. LiWa3 uses something like 50 parallel processes (to keep up with the 600+ edits a minute), each with their own connection. COIBot adds a couple more. The first bot's main process takes the only connection, and the rest of the processes that the bot spawns then crash the main bot.
@jcrespo - I receive complaints from other users every time there is a CPU/IO spike - that means that you knew for a long time there was a CPU/IO spike every now and then .. and you could have seen then which bot was causing that, and asked the bot-owner/maintainer
@jcrespo .. what issue? What query is making this happen? It can't be the couple of hundred usual insert queries that the bots do, it must be one (or a couple) of the select queries. Do I have a broken query, do I have a query that is not optimised, or do I have a missing index on a table ...??
@yuvipanda, @jcrespo - with all respect, this has just completely brought the complete Wikipedia anti-spam effort to a near halt (I've taken the bots offline). It is fine that there are problems, and that those need to be solved, but it would be great if we finally would get a bit more consideration from the WMF (this is not the first time that unannounced and undiscussed actions from WMF bring down bots - a couple of months ago my bots went down for days because of an unannounced and very minor change in server output) - your databases will run just fine when there are no bot operators that are willing to use Labs. Thank you.
Please revert this. This is effectively killing the whole anti-spam effort on Wikipedia. The bot needs multiple user connections into the database.
Nov 24 2015
Does this scheme also include a quick-searchable domain (it is unclear to me) - I mean storing the domain 'www.example.com' as 'com.example.www', as that greatly improves search speed for domains.
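To illustrate why the reversed form helps, here is a minimal Python sketch. The function names and the trailing dot on the stored key are assumptions of this example, not part of any existing schema. With the labels reversed, a domain and all its subdomains share a common prefix, so they sit next to each other in a sorted index and can be found with a prefix lookup instead of a scan with a leading wildcard.

```python
import bisect


def reverse_domain(domain: str) -> str:
    """Turn 'www.example.com' into 'com.example.www.' (trailing dot added
    so that 'com.example.' does not also match 'com.examplefoo')."""
    labels = domain.lower().strip(".").split(".")
    return ".".join(reversed(labels)) + "."


def find_subdomains(sorted_keys, domain):
    """Return all stored keys for `domain` and its subdomains.

    `sorted_keys` must be a sorted list of reversed domains; bisect jumps
    to the first candidate, much like a range scan on a database index.
    """
    prefix = reverse_domain(domain)
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


stored = sorted(reverse_domain(d) for d in [
    "www.example.com", "example.com", "shop.example.com",
    "example.org", "notexample.com",
])
hits = find_subdomains(stored, "example.com")
```

In SQL terms the same idea would be a `LIKE 'com.example.%'` on an indexed column, which can use the index, whereas `LIKE '%.example.com'` on the unreversed domain cannot.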