
Request creation of Linkwatcher and COIBot VPS project
Closed, Resolved (Public)


Project Name: LinkWatcher

Wikitech Usernames of requestors: Beetstra, Billinghurst

Purpose: running the anti-spam tools ‘linkwatcher’ and ‘coibot’ with their shared, massive anti-spam database

Brief description: The bots run currently on tools. Linkwatcher does not allow other tools to run on the same instance as that slows it down (which is not really acceptable). COIBot runs just fine on an instance. To run them together I think I would need 2.5-3 times the capacity of a current instance of tools. For storage I need (for now) 2 times the current size of the linkwatcher db on tools.

The bots run under perl, the db needs SQL. PHP might (later) be needed for search capabilities.

How soon you are hoping this can be fulfilled: This week. I have time to work on moving the bots from July 10 until the beginning of August.

I am not familiar with running a VM so I might need some help. Billinghurst has offered to help and has some experience.

Event Timeline

Great that LinkWatcher is going to have its own resources to run. I note this task is assigned to you, @Beetstra, but can you create the VM yourself? I thought only cloud-services-team could do it.

I have to see what is needed. It is also something that is useful for me
to learn, but I likely need help

bd808 added a subscriber: bd808.

Removing @Beetstra as assignee. This request will be reviewed in the 2019-07-09 cloud-services-team meeting.

@Beetstra, I have a few concerns about this request:

For storage I need (for now) 2 times the current size of the linkwatcher db on tools.

The current linkwatcher database on ToolsDB is 393.5 GB. I am concerned about an 800 GB instance. This is 5 times the size of the disk allocated for an "m1.xlarge" instance and 2.66 times the size of the "bigdisk" instance that is used by several projects. The only instances running in Cloud VPS with a larger disk allocation are the custom instances running the ToolsDB, Wikilabels, and Maps databases.

Do you have any historic tracking of the size of this database? When you specify double the current size, what are you basing that need on? Is this your 1 year expected growth? 3 years? 10 years?

I am not familiar with running a VM so I might need some help.

Are you familiar with maintaining and operating Linux servers in general and MySQL servers in particular?

Re db size: this db is about 7.5 years worth of data, I expect that this
will be enough for more than 5 years in the future. As MediaWiki starts to
store similar data now itself, I may be able to use MediaWiki’s data in the
future and stop storing it myself (or store less).
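
The sizing argument above can be checked with some quick arithmetic. This is only a sketch: it uses the 393.5 GB, 7.5 years, and 800 GB figures quoted in this task, and it assumes linear growth, which nobody in the thread has confirmed. The function names are made up for illustration.

```python
# Figures quoted in this task; linear growth is an ASSUMPTION, not a measurement.
CURRENT_SIZE_GB = 393.5   # current linkwatcher db size on ToolsDB
AGE_YEARS = 7.5           # how long this db has been collecting data
REQUESTED_GB = 800.0      # roughly 2x the current size


def linear_growth_rate(size_gb: float, age_years: float) -> float:
    """Average growth in GB per year, assuming the db grew at a constant rate."""
    return size_gb / age_years


def years_until_full(current_gb: float, capacity_gb: float, rate: float) -> float:
    """Years until capacity is reached if growth continues at the same rate."""
    return (capacity_gb - current_gb) / rate


rate = linear_growth_rate(CURRENT_SIZE_GB, AGE_YEARS)
headroom_years = years_until_full(CURRENT_SIZE_GB, REQUESTED_GB, rate)
print(f"{rate:.1f} GB/year, {headroom_years:.1f} years of headroom")
```

Under that linear assumption, 800 GB leaves a bit under 8 years of headroom, which fits the "more than 5 years" estimate. Bursts like a high-speed bot run would shorten that.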

Regarding running VM: I have installed and maintained linux on private
computers, and am running these bots from linux/unix boxes for >10 years

A VM of this size will be quite difficult for us to manage -- among other things, it would take many hours to move off of a hypervisor. Generally when we create large VMs (although so far we have never created one of this size) it's with the understanding that I may need to delete it as part of routine maintenance and leave it to be rebuilt by the users.

Will the data on this VM be valuable, persistent, and only stored on a single VM? Or is it effectively a cache of data from other sources that can be reconstructed when necessary?

The data is quite valuable, as it lets people on-wiki see who added which links in the past, and the content allows for statistical spam detection. It is therefore persistent. There are 7.5 years of data there, and given that one tools-sgeexec node has difficulty keeping up with current additions and statistics, rebuilding it would be a gigantic task (plus, valuable information in the form of deleted articles is invisible and hence cannot be rebuilt without the global admin bit).

You are more familiar with system architecture. What is basically needed is more than 2 times the execution power of a current tools-sgeexec node, with access to about 800 GB (for another 5+ years) of persistent SQL storage. Having the SQL in a more public place so other people can access it would be fine and maybe even desired.

As noted before, MediaWiki is starting to store similar data itself. It may at some point be feasible to store far less data and minimize future db size (though the MediaWiki db will likely not store all data and not be fully cross-wiki, so it may need a lot of queries to the MediaWiki API to get the necessary data).

I will try to write out how this data is valuable for such a long span of time this evening

@bd808 Can you tell me what was the outcome of the 9/7 meeting?

@bd808 Can you tell me what was the outcome of the 9/7 meeting?

The outcome was primarily the question that @Andrew posed in T227377#5319173. Andrew has a legitimate concern that moving your database off of the shared ToolsDB instance and into a project local instance could end up causing problems both for you and for us. The potential problem for the Cloud Services team is moving your very large database server instance(s) from one cloudvirt host to another as we do various maintenance tasks on the cloudvirt hosts themselves. The potential problem for your project is loss of an entire database server instance which could happen because of various problems on the underlying cloudvirt host.

Your response that the data set you are storing in ToolsDB today is extraordinarily hard to recreate makes these potential failure modes more difficult to deal with. We honestly do not have any systems in Cloud Services today that are designed to provide the level of service that you are asking for, including the systems you are currently using.

@bd808 I understand, I do maintain these bots with a ‘fear’ that at some
point a failure will render my db broken (it happened before, and this is
the third place where I started this db from scratch). It is ‘painful’ but
it happens. Thank you for your evaluation.

Linkwatcher is currently running with a parsing backlog of weeks (probably
started during my holiday, I suspect a high-speed WikiData bot that added
an external link property to 1000s of pages in a short burst, and when the
bot starts a significant backlog it tends to only grow bigger as it cannot
keep up). I am now pushing the bot a bit, but it will likely be another
2-4 weeks to solve it. There simply is not enough memory capacity and
computing power left over to parse faster (which is why backlogs start in
the first place).

Starting with a new db means that the first 2-3 months of its work is of no
‘use’ to automated detection (you need a solid background for statistics on
the data). The loss of historic data is more annoying. We run into cases
where the spammers are slow and long term. They create an account and add
one link or one article and leave. Wait and repeat. To any editor, you
revert/delete and move on. But when we encounter that we might/will run
this against the db and find years worth of that behaviour. I recently
worked on a case that goes 8.5 years back (more than the life of this db),
there is an old case that we first encountered years ago that is still
popping up every now and then which is 12 years old. The db serves its
function in some large investigations of paid editing. Maybe @Billinghurst
can tell how stewards use the linkwatcher data

As I said, MediaWiki is working on something similar (though it will miss
data I keep). I may at some point be able to use that data to generate
(and regenerate?) smaller tables of meaningful data that I can do
statistics on and stop storing the complete data. Until then, this db will
only grow.

I hope we can move forward on this in some way.

Just as a very recent example, see That is 1 year worth of very slow addition of external links by a multitude of IPs. If you see one individual IP doing one or two edits on one wiki you would not know that this is part of a larger campaign. You would only see one or two edits on one Wiki and without the db you would have no clue that this is happening on 6 different wikis by 13 IPs.

To find this manually, you would have to linksearch on all wikis, then dig through the histories of all articles having such a link, and collect all the diffs. And the only articles you find are the ones where the link has not been removed yet: if the edit was reverted, you would only be able to find it if the account in question made other edits and you were cross-checking; if the page where the addition was made was deleted, it is hidden to everyone but the local admins (who would also have difficulty finding it).
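
To make the cross-wiki point concrete, here is a rough sketch of the kind of aggregation a central db makes trivial. It is not how linkwatcher is implemented: the record layout, the `summarize_campaigns` name, and all sample rows are hypothetical (reserved example domains and documentation IP ranges), used only for illustration.

```python
from collections import defaultdict

# HYPOTHETICAL sample records: (domain, wiki, editor, page). Not real data.
additions = [
    ("example.org", "enwiki", "192.0.2.1", "Page A"),
    ("example.org", "dewiki", "192.0.2.7", "Seite B"),
    ("example.org", "frwiki", "198.51.100.3", "Page C"),
    ("example.net", "enwiki", "203.0.113.9", "Page D"),
]


def summarize_campaigns(records, min_wikis=2):
    """Map each domain to (distinct wikis, distinct editors), keeping only
    domains that were added on at least `min_wikis` different wikis."""
    wikis = defaultdict(set)
    editors = defaultdict(set)
    for domain, wiki, editor, _page in records:
        wikis[domain].add(wiki)
        editors[domain].add(editor)
    return {
        domain: (len(wikis[domain]), len(editors[domain]))
        for domain in wikis
        if len(wikis[domain]) >= min_wikis
    }


print(summarize_campaigns(additions))
```

Each individual row looks like one harmless edit on one wiki. Only the grouped view shows that a single domain was pushed on several wikis by several IPs, and that view needs the full history, including additions that were later reverted or deleted.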

This was discussed at our weekly meeting. We decided to approve creation of the project to allow build out of the app, but we also agreed to keep the data on toolsdb for now. More consideration is needed to properly manage the data set in the future, but that will be out of scope for this, I think. That will help with toolforge performance concerns, but there is a need to work out more tooling and so forth on our end to provide a reasonable, reliable service for that much data.

Bstorm triaged this task as Medium priority. Jul 16 2019, 7:27 PM
Bstorm moved this task from Inbox to Approved on the Cloud-VPS (Project-requests) board.

@Bstorm, is there anything you need from my end now? How do I proceed?

Nope, basically, I just have to create the project for now. I'll note here when it's ready to go.

Bstorm, is there anything you need from my end now? How do I proceed?

Beetstra: Saw your comment in IRC, I am here to help and to learn. I consider myself a pair of hands.

Thanks for your work and guidance here Bstorm

The project is now available in to spin up virtual machines and such (see
You should be able to make a couple of VMs and start setting up the systems on there. You won't have anywhere near the disk space right now to run the database in the project, so you'll want to continue to connect to the Toolsdb like you do now until we are able to work out another solution. If you need more RAM or CPU, etc. please request more quota and we can take a look!

Bstorm claimed this task.

@Bstorm: Thanks! All is working now, except that I now have to make explicit to perl where 'toolsdb' is (previously, basically saying 'sqlhost=toolsdb' was enough). What is the full address? - Got it!!

@Beetstra tools.db.svc.eqiad.wmflabs (just in case :) )
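
For anyone hitting the same thing after moving off Toolforge: the short name `toolsdb` apparently does not resolve inside the new project, so the settings need the full hostname. The bots themselves are Perl; the Python below is only a sketch of the idea. The `HOST_ALIASES` table and both function names are hypothetical, and only the `sqlhost` key and the hostname come from this thread.

```python
# ASSUMPTION: a small alias table kept next to the bot settings.
# Only the 'toolsdb' entry is taken from this task.
HOST_ALIASES = {"toolsdb": "tools.db.svc.eqiad.wmflabs"}


def parse_settings(text: str) -> dict:
    """Parse simple 'key=value' lines, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def resolve_sql_host(name: str) -> str:
    """Expand a known short alias to its full hostname; pass others through."""
    return HOST_ALIASES.get(name, name)


settings = parse_settings("sqlhost=toolsdb\n")
print(resolve_sql_host(settings["sqlhost"]))
```

Keeping the alias in one place means the existing `sqlhost=toolsdb` setting can stay as it is if the database moves again later.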