
Reduce size of linkwatcher db on toolsdb if at all possible
Open, Needs Triage, Public

Description

Please clean up or reduce size of the database usage of toolsdb for the linkwatcher tool.

The shared space on the server is running low on the database replica, and this tool is the largest consumer of space right now (393.5 GB):
https://tools.wmflabs.org/tool-db-usage/

Event Timeline

Bstorm created this task. May 22 2019, 5:36 PM

That would result in an immediate loss of functionality for the spam-fighting community.

That being said, if there were an easy-to-search cross-wiki database with link-addition information, comparable to the functionality provided by the linkwatchers, then storage of new material could be stopped and statistics could be based on that in the near future.

Can you provide an example, @Beetstra? I might be able to help find something if there is one.

@Bstorm: can you provide me with the names/IPs of the editors that were spamming porhub.com (including the diffs of addition and on which wiki)?

(Because of this db, these spammers were automatically detected at the 4th insertion.)

Bstorm added a comment (edited). May 22 2019, 6:59 PM

How about spinning this out into a VPS project that runs its own DB server? https://phabricator.wikimedia.org/project/view/2875/
That would give it more room to grow on its own without impacting the ToolsDB service. It would need a fairly large instance size, but with a good reason such as this backing it, we grant those.
If you link the project request here, we already have a pretty good idea of the size of DB we are talking about. You'd need help from me or another admin to get the DB dumped out because of the file-size limits on Toolforge, but that could be arranged.
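For illustration, the dump-and-transfer step Bstorm describes might look something like the sketch below. The mysqldump invocation is only indicative (the database name and access path are assumptions, and on ToolsDB an admin would run it); the runnable part simulates a large dump and shows how to split it into chunks that fit under a per-file size limit, then reassemble and verify it.

```shell
#!/bin/sh
set -e

# Illustration only: on ToolsDB the actual dump needs an admin, roughly
#   mysqldump --single-transaction <tool_db_name> | gzip > linkwatcher.sql.gz
# (the database name and access path are assumptions). Below we simulate a
# large dump with generated data instead.

workdir=$(mktemp -d)
cd "$workdir"

seq 1 20000 > dump.sql               # stand-in for a large SQL dump
gzip -c dump.sql > dump.sql.gz       # compress before transfer
split -b 32k dump.sql.gz part_       # split into chunks under a size limit

cat part_* > rejoined.sql.gz         # reassemble on the destination side
gunzip rejoined.sql.gz
cmp dump.sql rejoined.sql && echo "round-trip OK"
```

Because `split` names the chunks in lexical order (`part_aa`, `part_ab`, ...), a plain `cat part_*` restores the original byte stream exactly.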

That would be a great idea. Note that my tool coibot would also need to go there, and that both need significant capacity to run (linkwatcher is struggling on its current instance due to workload, and if coibot needs to run there as well ...).

(Still, even better would be the xwiki db of link additions in an easy-to-search format/API; then I could work with storing less info.)

By any chance, is there no way that data is already in the wiki replicas? They do have link data. It would require queries across dbs, but people do that. I imagine there's a performance barrier to doing that for this tool, though.

Replicas are too slow; linkwatcher tries to work in real time, keeping up with the edit feeds (if the capacity of the sgeexec hosts allows, which it currently doesn't). Warning or blocking IP- or account-hopping spammers, or blacklisting their links to get the message through, only makes sense if they are caught in the act. There is info in the wiki db, but I doubt it is easy to search. Even wiki-by-wiki, try to find those additions of porhub.com and realise that they are all added by single-edit IPs; that alone is a good test. Try to write a query from which you can conclude it is xwiki spam: two queries (or even one ...) on my db show that there are the same number of additions by IPs as there are additions of the link they were adding, an unlikely coincidence.
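The "unlikely coincidence" check Beetstra describes can be sketched against a toy schema. The table and column names here are assumptions for illustration, not linkwatcher's real layout, and SQLite stands in for ToolsDB's MySQL:

```python
import sqlite3

# Toy stand-in for a link-addition log; schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE linklog (wiki TEXT, editor TEXT, is_ip INTEGER,"
    " domain TEXT, diff TEXT)"
)
rows = [
    # Five single-edit IPs adding the same domain on different wikis.
    ("enwiki", "192.0.2.1", 1, "spam.example", "d1"),
    ("dewiki", "192.0.2.2", 1, "spam.example", "d2"),
    ("frwiki", "192.0.2.3", 1, "spam.example", "d3"),
    ("eswiki", "192.0.2.4", 1, "spam.example", "d4"),
    ("itwiki", "192.0.2.5", 1, "spam.example", "d5"),
    # A legitimate-looking addition by a registered account.
    ("enwiki", "GoodEditor", 0, "archive.org", "d6"),
]
conn.executemany("INSERT INTO linklog VALUES (?, ?, ?, ?, ?)", rows)

def xwiki_spam_signal(domain):
    """One query comparing total additions of a domain with the number
    of distinct IP editors adding it, and the number of wikis hit.
    Equality across several wikis is the suspicious coincidence."""
    return conn.execute(
        """SELECT COUNT(*),
                  COUNT(DISTINCT CASE WHEN is_ip = 1 THEN editor END),
                  COUNT(DISTINCT wiki)
           FROM linklog WHERE domain = ?""",
        (domain,),
    ).fetchone()

total, ip_editors, wikis = xwiki_spam_signal("spam.example")
print(total, ip_editors, wikis)  # every addition came from a fresh IP on a fresh wiki
```

For `spam.example` the counts all match (each addition is a distinct IP on a distinct wiki), which is the cross-wiki spam pattern; for `archive.org` they do not.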

Yeah, sounds like you'd benefit quite a bit from moving to Cloud VPS, really.

Would you start a request from that link I posted above with some specs to get the ball rolling? I can help with some notes as well.

You can communicate from Toolforge to another VPS, if that is useful, btw. If some components/tools need to stay on Toolforge, it should be possible to use VPS resources with the right security groups and such.

I will have a look and maybe get the ball rolling. I generally do not have a lot of time, but should have 2-3 weeks with more time at the end of July to do more work on it.

Yeah, if you open the request, it will make it a bit easier. I can help flesh it out.

bd808 added a subscriber: bd808. May 22 2019, 11:08 PM

Related:

These tools provide a useful service to the Wikimedia community, but they are pushing to and beyond the limits of the Toolforge shared environment. Migrating to a dedicated Cloud VPS project would provide the ability to allocate more CPU, RAM, and dedicated database storage to linkwatcher & coibot. The downside of course is that @Beetstra would need to maintain the virtual machine instances rather than relying on the Toolforge admins to do so for them. Ideally one or more new folks could be attracted to the project to help with this administration burden. Getting more help seems like a good idea generally to ensure that the bot services can survive @Beetstra taking a wiki break from time to time.

Billinghurst added a comment (edited). May 23 2019, 12:27 PM

The downside of course is that @Beetstra would need to maintain the virtual machine instances rather than relying on the Toolforge admins to do so for them. Ideally one or more new folks could be attracted to the project to help with this administration burden. Getting more help seems like a good idea generally to ensure that the bot services can survive @Beetstra taking a wiki break from time to time.

I am around and somewhat trainable to do on/off/restart switches, check services, etc. though not a coder. I have been working with COIBot and Linkwatchers for an extended period, and can check and run components via IRC commands. I have operated basic bots within the toolforge environment. I will add that I am around a lot (too much?)

bd808 moved this task from Backlog to ToolsDB on the Data-Services board. May 30 2019, 7:02 PM
bd808 added a comment (edited). Jun 25 2019, 7:15 PM

@Beetstra I'm looking at your linkwatcher_linklog table and see that it has records going back to 2013-12-29 19:33:33 and a total of 528,994,352 rows. This 384.8 GB of storage is an order of magnitude larger than the next largest tool database and 2-3 orders of magnitude larger than the "typical" tools db. Is there no way to prune out old data here? At this point I'm actually not sure if we could even effectively move you to your own database instance in your own Cloud VPS project and handle this much mysql storage.
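If old rows can be pruned at all, deleting them in small batches keeps each transaction short, which matters on a shared server like ToolsDB. A minimal sketch of that pattern, assuming a hypothetical table and timestamp column and using SQLite in place of MySQL:

```python
import sqlite3

# Toy table; name and columns are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE linklog (ts TEXT, domain TEXT)")
conn.executemany(
    "INSERT INTO linklog VALUES (?, ?)",
    [("2014-01-0%d 00:00:00" % (i % 9 + 1), "example.org") for i in range(10)]
    + [("2019-05-01 00:00:00", "example.org")] * 3,
)

def prune_before(cutoff, batch=1000):
    """Delete rows older than `cutoff` in batches so no single
    transaction locks the table for long. Returns rows deleted."""
    deleted = 0
    while True:
        cur = conn.execute(
            "DELETE FROM linklog WHERE rowid IN "
            "(SELECT rowid FROM linklog WHERE ts < ? LIMIT ?)",
            (cutoff, batch),
        )
        conn.commit()
        if cur.rowcount == 0:
            return deleted
        deleted += cur.rowcount

n = prune_before("2015-01-01 00:00:00", batch=4)
print(n)  # the ten 2014-era rows are removed; the 2019 rows remain
```

On a real MySQL table the same idea is usually written as a looped `DELETE ... WHERE ts < ? LIMIT n`; whether any cutoff is acceptable for the spam-tracking use case is exactly what is being debated in this thread.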

We handle cases back to 2008 ... I currently have an AfD of a case that is 8.5 years old.


That certainly may be the case, but the problem is that the amount of data that you are storing is causing a great deal of concern for the long term viability of shared services in Toolforge. "I might need this data someday" is not a use case that our systems are sized to support.

I appreciate that you are a volunteer, and I am not challenging your assessment of the value of this data and tool in the larger Wikimedia ecosystem. I am however asking for you to work with us to find a solution that can be supported before something falls over and we lose all the data for everyone due to a catastrophic event.

In T224154#5206331 you indicated that moving to a dedicated Cloud VPS project would be a possibility and maybe even a net benefit for your tools. Do you have personally or can you find someone with the skills to help you make that move? At the current size of your ToolsDB database we would need to create a custom instance type just to host a ~400GB database in a Cloud VPS project, but that is something that could be done if you have a team that can actually keep the system running. If you want/need to stay in the Toolforge & ToolsDB managed services space instead, we need to find a way to reduce your footprint on the shared systems.

The idea is to move. However, I will need some help (some has already been offered, and I have been asking around for more). Finding the time to get this up and running is another issue (as a volunteer).