Reduce size of linkwatcher db on toolsdb if at all possible
Open, High, Public

Description

Please clean up or reduce size of the database usage of toolsdb for the linkwatcher tool.

Shared space is running low on the database replica server, and this tool is the largest consumer of space right now (393.5 GB):
https://tools.wmflabs.org/tool-db-usage/

Event Timeline

That would result in an immediate loss of functionality for the spam-fighting community.

That being said, if there is an easy-to-search cross-wiki database with link-addition information, comparable to the functionality provided by the linkwatchers, then storage of new material could be stopped and statistics could be based on that in the near future.

Can you provide an example, @Beetstra? I might be able to help find something if there is one.

@Bstorm: can you provide me with the names/IPs of the editors that were spamming porhub.com (including the diffs of addition and on which wiki)?

(Because of this db, these spammers were automatically detected at the 4th insertion.)

How about spinning out into a VPS project that runs your own DB server (https://phabricator.wikimedia.org/project/view/2875/)?
That would give it more room to grow on its own without impact to the ToolsDB service. It would need a fairly large instance size, but with a good reason backing it such as this, we grant those.
If you link the project request here, we already have a pretty good idea of the size of DB we are talking about. You'd need help from me or another admin to get the DB dumped out because of the file size limits on toolforge, but that could be arranged.

That would be a great idea. Note that my tool coibot would also need to go there, and that both need significant capacity to run (linkwatcher is struggling on its current instance due to workload, and if coibot needs to run there as well ...).

(Still, even better would be the xwiki db of link additions in an easy-to-search format/API; then I can work with storing less info.)

By any chance, is that not already in the wiki replicas? They do have link data. It would require queries across dbs, but people do that. I imagine there's a performance barrier to doing that (or similar) for this tool, though.

Replicas are too slow; linkwatcher tries to work in real time, keeping up with the edit feeds (if the capacity of the sgeexec hosts allows, which it currently doesn't). Warning or blocking IP/account-hopping spammers, or blacklisting their links to get the message through, only makes sense if they are caught in the act. There is info in the wiki db, but I doubt it is easy to search (even wiki-by-wiki, trying to find those additions of porhub.com, and realising that it is all single-edit IPs adding it, is already a good test: write a query from which you can conclude it is xwiki spam). Two queries (or even one ..) on my db show that there are the same number of additions by IPs as there are additions of the link they were adding, an unlikely coincidence.
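A minimal sketch of what such a check could look like against the linkwatcher_linklog table described later in this task (the reversed domain format and the meaning of the ip column are assumptions based on the example row shown below; these are not the actual queries the tool runs):

-- Hypothetical queries. Count every logged addition of the domain
-- (domains appear to be stored reversed, e.g. 'uk.org.hantsphere.'):
SELECT COUNT(*) FROM linkwatcher_linklog WHERE domain = 'com.porhub.';

-- Count how many of those additions came from IP editors
-- (assuming ip = 1 flags anonymous editors; the example row below
-- shows ip: 0 for a registered user):
SELECT COUNT(*) FROM linkwatcher_linklog WHERE domain = 'com.porhub.' AND ip = 1;

-- If both counts are equal, every addition came from a (single-edit) IP:
-- exactly the "unlikely coincidence" that signals xwiki spam.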

Yeah, sounds like you'd benefit quite a bit from moving to Cloud VPS, really.

Would you start a request from that link I posted above with some specs to get the ball rolling? I can help with some notes as well.

You can communicate from Toolforge to another VPS, if that is useful, btw. If some components/tools need to stay on Toolforge, it should be possible to use VPS resources with the right security groups and such.

I will have a look and maybe get the ball rolling. I generally do not have a lot of time, but should have 2-3 weeks with more time at the end of July to do more work on it.

Yeah, if you open the request, it will make it a bit easier. I can help flesh it out.

Related:

These tools provide a useful service to the Wikimedia community, but they are pushing to and beyond the limits of the Toolforge shared environment. Migrating to a dedicated Cloud VPS project would provide the ability to allocate more CPU, RAM, and dedicated database storage to linkwatcher & coibot. The downside of course is that @Beetstra would need to maintain the virtual machine instances rather than relying on the Toolforge admins to do so for them. Ideally one or more new folks could be attracted to the project to help with this administration burden. Getting more help seems like a good idea generally to ensure that the bot services can survive @Beetstra taking a wiki break from time to time.

I am around and somewhat trainable to do on/off/restart switches, check services, etc., though not a coder. I have been working with COIBot and Linkwatchers for an extended period, and can check and run components via IRC commands. I have operated basic bots within the Toolforge environment. I will add that I am around a lot (too much?).

@Beetstra I'm looking at your linkwatcher_linklog table and see that it has records going back to 2013-12-29 19:33:33 and a total of 528,994,352 rows. This 384.8 GB of storage is an order of magnitude larger than the next largest tool database and 2-3 orders of magnitude larger than the "typical" tools db. Is there no way to prune out old data here? At this point I'm actually not sure if we could even effectively move you to your own database instance in your own Cloud VPS project and handle this much mysql storage.
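For what it's worth, a minimal, hypothetical pruning sketch, assuming old rows could be expired at all (the cutoff date is purely illustrative, and as the discussion below makes clear, the spam-fighting workflow relies on old data):

-- Delete link-addition rows older than an assumed cutoff, in small batches
-- to avoid long locks on a ~529M-row table; rerun until 0 rows are affected.
DELETE FROM linkwatcher_linklog
WHERE timestamp < '2016-01-01'
LIMIT 10000;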

We handle cases back to 2008 ... I currently have an AfD of a case that is 8.5 years old.

That certainly may be the case, but the problem is that the amount of data that you are storing is causing a great deal of concern for the long term viability of shared services in Toolforge. "I might need this data someday" is not a use case that our systems are sized to support.

I appreciate that you are a volunteer, and I am not challenging your assessment of the value of this data and tool in the larger Wikimedia ecosystem. I am however asking for you to work with us to find a solution that can be supported before something falls over and we lose all the data for everyone due to a catastrophic event.

In T224154#5206331 you indicated that moving to a dedicated Cloud VPS project would be a possibility and maybe even a net benefit for your tools. Do you personally have, or can you find someone with, the skills to help you make that move? At the current size of your ToolsDB database we would need to create a custom instance type just to host a ~400GB database in a Cloud VPS project, but that is something that could be done if you have a team that can actually keep the system running. If you want/need to stay in the Toolforge & ToolsDB managed services space instead, we need to find a way to reduce your footprint on the shared systems.

The idea is to move. However, I will need some help (some has already been offered, and I have been asking around for more). Finding the time to get this up and running is another issue (as a volunteer).

The db needs to stay as-is to have a sufficient historical log. The db could be moved to its own server, though, if it consumes too many resources in its current config (see my previous comment).

aborrero triaged this task as High priority.
aborrero added a subscriber: aborrero.

Hi there,

We had an incident today which was likely caused by high storage usage on ToolsDB; see T301951: toolsdb: full disk on clouddb1001 broke clouddb1002 (secondary) replication.

In T301967: toolsdb: evaluate storage usage by some tools we discovered that this tool's actual filesystem usage is far higher than originally known.
This tool alone seems to be using more than 30% of the total space available for all of ToolsDB.

We need to take action now.

Follow up. I see that a linkwatcher Cloud VPS project exists, created on T227377: Request creation of Linkwatcher and COIBot VPS project

So the question perhaps is: Is the 1.1TB of data stored on toolsdb for linkwatcher required?

Aklapper added a subscriber: Beetstra.

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 6th, 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.

I just wanted to note here that I'm willing to commit a significant amount of (volunteer) time to the migration & maintenance of this tool and respective database — this data is incredibly useful for tracking and identifying long-term spam campaigns.

I've reached out to @Beetstra a few times to offer my help in maintaining and migrating this data. I hope to hear back from them here/via email :-)

Can we start making some initial enquiries as to how we can dump the tool database effectively? I can't imagine moving 1.1TB is going to be easy...

My understanding is that a (custom flavour of) WMCS VPS would still be the most appropriate place to run such a large database (optimisation notwithstanding)?

Thanks @TheresNoTime, really appreciated! I can also help with migrating the data.

Option 1: stop the tool, back up the full linkwatcher db, restore the database on the new server, change the tool configuration to point to the new db, restart the tool. Estimated downtime: 6-12 hours.

Option 2: spin up the new database as a replica of the current one (replicating only the linkwatcher db), promote the replica to primary, change the tool configuration to point to the new db, restart the tool. Estimated downtime: 1 minute.

Would option 1 be acceptable? Option 2 requires a lot more work :)

I would also check if there's any way of optimizing the database size by deleting unnecessary rows, or defragmenting if there's any wasted space.
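A minimal sketch of that size/fragmentation check, assuming the s51230__linkwatcher database name shown later in this task (note that OPTIMIZE TABLE rebuilds the table in place and could run for a very long time at this size):

-- Report data, index, and roughly reclaimable free space per table:
SELECT table_name,
       ROUND(data_length  / 1024 / 1024 / 1024, 1) AS data_gb,
       ROUND(index_length / 1024 / 1024 / 1024, 1) AS index_gb,
       ROUND(data_free    / 1024 / 1024 / 1024, 1) AS free_gb
FROM information_schema.tables
WHERE table_schema = 's51230__linkwatcher';

-- Rebuild the table to reclaim free space (slow, needs temporary disk space):
OPTIMIZE TABLE s51230__linkwatcher.linkwatcher_linklog;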

@Beetstra: Hi 👋. Database architect at the Wikimedia Foundation here. This table has grown to 1.1TB, has been causing a lot of issues, and needs high-priority attention.

I do understand the need for having a good spam-fighting tool and I'm not suggesting to remove linklogs.

But I think the table can be drastically optimized without losing functionality. I made a quick query. Here is an example row:

*************************** 10. row ***************************
        ID: 10
 timestamp: 2013-12-29 19:34:19
   edit_id: -2
      lang: en
  pagename: Southampton and Dorchester Railway
 namespace: 
      diff: http://en.wikipedia.org/w/index.php?diff=588249476&oldid=588229337
     revid: 588249476
     oldid: 588229337
wikidomain: w
      user: Afterbrunel
   fullurl: http://www.hantsphere.org.uk/ixbin/hixclient.exe?a=query&p=hants&f=generic_objectrecord_postsearch.htm&_IXFIRST_=1&_IXMAXHITS_=1&m=quick_sform&tc1=i&text=eling%20junction&tc2=e&partner=hants&s=S3pU1Y61cce
    domain: uk.org.hantsphere.
  resolved: 217.160.95.104
        ip: 0
      date: 2013-12-29
      time: 19:34:19

Table description:

MariaDB [s51230__linkwatcher]> desc linkwatcher_linklog;
+------------+---------------------+------+-----+-------------------+----------------+
| Field      | Type                | Null | Key | Default           | Extra          |
+------------+---------------------+------+-----+-------------------+----------------+
| ID         | bigint(20) unsigned | NO   | PRI | NULL              | auto_increment |
| timestamp  | timestamp           | NO   |     | CURRENT_TIMESTAMP |                |
| edit_id    | int(11)             | YES  |     | NULL              |                |
| lang       | tinytext            | YES  |     | NULL              |                |
| pagename   | text                | YES  | MUL | NULL              |                |
| namespace  | tinytext            | YES  |     | NULL              |                |
| diff       | text                | YES  | MUL | NULL              |                |
| revid      | bigint(20) unsigned | YES  | MUL | NULL              |                |
| oldid      | bigint(20) unsigned | YES  |     | NULL              |                |
| wikidomain | tinytext            | YES  | MUL | NULL              |                |
| user       | text                | YES  | MUL | NULL              |                |
| fullurl    | text                | YES  |     | NULL              |                |
| domain     | text                | YES  | MUL | NULL              |                |
| resolved   | tinytext            | YES  | MUL | NULL              |                |
| ip         | int(11)             | YES  |     | NULL              |                |
| date       | date                | YES  |     | NULL              |                |
| time       | time                | YES  |     | NULL              |                |
+------------+---------------------+------+-----+-------------------+----------------+
17 rows in set (0.00 sec)

Here are some suggestions:

  • delete any rows with an invalid edit_id (here it's -2?)
  • put a maximum size on fullurl. Some URLs can be really, really long and take massive space. Just make it varbinary(255) and truncate anything larger than that. I'm sure the truncated part doesn't carry any value, but the truncation saves an unbelievable amount of data.
  • remove the diff column; the values are already stored in revid and oldid (and wikidomain/lang), so just compute it on the fly
  • drop the date and time columns; they are already stored in timestamp, it seems
  • if resolved is always going to be an IPv4 address, store it as an int and compute it; depending on your code's language, there are many libraries out there.
    • Even if you're using IPv6 or something like that, you can still store it in a more robust way.
  • Drop the user column altogether and use the centralauth global id instead (store it as an int): for example https://meta.wikimedia.org/w/api.php?action=query&list=globalallusers&agufrom=Ladsgroup&aguprop=lockinfo|groups|existslocally (mine is 3349).
    • That way, your tool would keep working if the user renames. Mapping back from global id to username is not hard.
  • Drop any additions done by bots, admins, etc.? Or at least anything older than a year that was done by a bot or an admin.
    • I can assure you that millions and millions of rows will be removed because of this, especially on botpedias. I could give you numbers, but I really can't do a full-table query of a 1TB table without basically bringing everything down.

HTH, let me know if I can help with anything. I suggest doing them one by one and seeing the impact (a rough sketch of some of these changes follows below).
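A rough, hypothetical sketch of some of these changes against the linkwatcher_linklog schema above (the exact statements, the wikidomain-to-project mapping, and whether truncation is acceptable are all assumptions):

-- Drop redundant columns: diff is derivable from lang/wikidomain/revid/oldid,
-- and date/time duplicate timestamp.
ALTER TABLE linkwatcher_linklog
  DROP COLUMN diff,
  DROP COLUMN date,
  DROP COLUMN time;

-- Cap fullurl at 255 bytes; truncate existing longer values first, since a
-- strict-mode ALTER would otherwise reject rows that are too long.
UPDATE linkwatcher_linklog SET fullurl = LEFT(fullurl, 255)
WHERE LENGTH(fullurl) > 255;
ALTER TABLE linkwatcher_linklog MODIFY fullurl VARBINARY(255);

-- Recompute a diff URL on the fly (assuming wikidomain 'w' maps to
-- wikipedia.org; other projects would need a small mapping table):
SELECT CONCAT('https://', lang, '.wikipedia.org/w/index.php?diff=', revid,
              '&oldid=', oldid) AS diff
FROM linkwatcher_linklog LIMIT 1;

-- Store a resolved IPv4 address as an unsigned int:
SELECT INET_ATON('217.160.95.104');  -- 3651166056, from the example row above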

Would option 1 be acceptable? Option 2 requires a lot more work :)

I'd ideally let @Beetstra answer that one, but from my experience with the linkwatcher tool/bot, downtime should be kept to an absolute minimum (6-12 hours of unrecorded link additions across projects would be a significant gap, though I'm not sure whether that's a reasonable loss of data for the benefits of migrating...). I appreciate that setting up a db replica would entail a lot more work, and would advise Beetstra to weigh option 1 carefully :-)

Not speaking for Beetstra, but did want to pass on thanks for these great suggestions @Ladsgroup — I know I certainly appreciate everyone rallying together to find workable solutions to support Beetstra's amazing work with the linkwatchers.

To your suggestions, I especially think

Drop any additions done by bots, admins, etc.? Or at least anything older than a year that was done by a bot or an admin

could be a really useful, safe but high-impact step to take prior to migrating the data.

I think the best low-hanging fruit from my suggestions is dropping the diff column and computing it on the fly; it adds quite a long string to every row (1.4B). The second best is putting a 255 cap on fullurl, and the third is dropping edits by bots. I can even make queries and tell you which ones you should drop once the table gets slightly smaller (after the first two low-hanging fruits).

I know this is likely unanswerable with any accuracy, but I can assume dropping the diff column would take a little while to complete — during this time, am I correct in my understanding that writes to the table would be blocked?

Yes. We can run it on the replica first and then fail over to minimize read-only time, but I don't know how doable that is, given that I don't know the WMCS infra well enough to judge.

Failing over is not really possible with the current setup, because some databases in clouddb1001 are not being replicated to clouddb1002. I was thinking of this possible alternative solution:

  1. create a new VM (let's call it linkwatcher-db-1) with a Ceph volume, replicating from clouddb1002 but configured to replicate only the linkwatcher database
  2. make sure it catches up and stays in sync
  3. promote linkwatcher-db-1 to primary and configure Linkwatcher to use it as its db host
  4. create a second VM linkwatcher-db-2 as a read-only replica of linkwatcher-db-1 and use linkwatcher-db-2 to test the operations described in this task (dropping the diff column, max size for fullurl, etc.)
  5. drop the linkwatcher database from ToolsDB (clouddb1001/clouddb1002) and free up 1TB there

WDYT?

(edit: this would basically be "Option 2" from my previous comment, it's a bit of work, but thinking again I kind of prefer it over "Option 1" because we can verify that the disk performance of the new host is good enough before pointing the tool to it)
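A minimal sketch of step 1's replication filter and step 3's promotion, with assumed host names, credentials, and log coordinates (the real WMCS setup may differ):

-- On linkwatcher-db-1: replicate only the linkwatcher database.
-- Server config ([mysqld] section):
--   replicate_do_db = s51230__linkwatcher
-- Point the replica at clouddb1002 and start replicating:
CHANGE MASTER TO
  MASTER_HOST = 'clouddb1002',         -- assumed host name
  MASTER_USER = 'repl',                -- assumed replication account
  MASTER_PASSWORD = '********',
  MASTER_LOG_FILE = 'log-bin.000001',  -- placeholder; from SHOW MASTER STATUS
  MASTER_LOG_POS = 4;
START SLAVE;

-- Watch Seconds_Behind_Master until it reaches 0:
SHOW SLAVE STATUS\G

-- Step 3: once in sync and writes have stopped, promote linkwatcher-db-1:
STOP SLAVE;
RESET SLAVE ALL;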