
Fix Wikimini beta "eswiki" edition: database full of spam (27G)
Closed, Resolved · Public · 16 Estimated Story Points

Description

There is some maintenance work to be done to deflate the database of the Spanish edition of Wikimini.

The current size of the es version is 27G, but that doesn't make much sense: it + en + fr together are <= 8G, and the es version is just a beta.

Small overview:

$ du -sh /var/lib/mysql/* | sort -hr | grep wikimini
27G     /var/lib/mysql/wikimini_eswiki
5.1G    /var/lib/mysql/wikimini_itwiki
2.8G    /var/lib/mysql/wikimini_frwiki
2.1G    /var/lib/mysql/wikimini_beta_frwiki
1.1G    /var/lib/mysql/wikimini_stockwiki
930M    /var/lib/mysql/wikimini_beta_stockwiki
192M    /var/lib/mysql/wikimini_svwiki
157M    /var/lib/mysql/wikimini_labwiki
30M     /var/lib/mysql/wikimini_enwiki
16M     /var/lib/mysql/wikimini_testwiki
8.3M    /var/lib/mysql/wikimini_arwiki2
8.3M    /var/lib/mysql/wikimini_arwiki

Event Timeline

Hi @Lorangeo! I need your precious opinion. If I understand correctly, the es version of Wikimini was affected by spam, but the situation is now OK since the Wikimini community eliminated it. Is that correct?

If your opinion is "yes", maybe we just need to delete orphan texts from the database, resulting in the permanent removal of the spam (since by default MediaWiki keeps everything, to allow restoring it).
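For reference, core MediaWiki already ships maintenance scripts for exactly this kind of cleanup; a rough sketch of the sequence (untested here; flags to be double-checked against our 1.26):

$ sudo -u www-data HTTP_HOST=es.wikimini.org php ./deleteOrphanedRevisions.php
$ sudo -u www-data HTTP_HOST=es.wikimini.org php ./deleteArchivedRevisions.php --delete
$ sudo -u www-data HTTP_HOST=es.wikimini.org php ./purgeOldText.php --purge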

What do you think about it?

Hi,

The spam content is still present on Wikimini ES. Actually, nothing has been removed. As the project was inactive, we simply locked the database in January 2017 to prevent further damage.

It looks like the massive spam attack began in August 2016. The user account SheritaJewell, created on Wikimini ES on August 11 (see user creation log), seems to be the first spambot of a very long list.

We can assume that everything published on Wikimini ES after July 2016 (thousands of pages with external links) is spam. As far as I can remember, legitimate content was only published in 2014... maybe 2015.

There are 739,227 "Content pages" (articles) on es (see https://es.wikimini.org/wiki/Especial:Estad%C3%ADsticas) while fr has 198,904 pages (all namespaces included). The 27 GB is simply spam; if you do the math it comes to about 38 KiB per page, so the size does make sense. Perhaps DeletePagesForGood could be useful. The extension is archived, but we're on an old version of MediaWiki anyway, soooo… perhaps we could simply take its code and run it on all the pages. It will take a long time.

In any case, something needs to be done, either deleting the spam or putting the wiki offline, because the spam content is still there, including sexual content.

Perhaps it would be easier to export the good pages, throw away the whole wiki, and reimport the pages into a clean one.
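If we went that route, the standard tooling would look something like this (just a sketch: good-pages.txt is a hypothetical hand-curated list of page titles, and dumpBackup.php's --pagelist option should be verified on our MediaWiki 1.26):

$ sudo -u www-data HTTP_HOST=es.wikimini.org php ./dumpBackup.php --current --pagelist=good-pages.txt > good-pages.xml
(set up a fresh wiki, then:)
$ sudo -u www-data HTTP_HOST=es.wikimini.org php ./importDump.php good-pages.xml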

> either putting the wiki offline

For now I've done exactly this: I've removed read permission from everyone, so that the general public and crawlers don't see the nasty spammy links.
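(Concretely, that's the usual LocalSettings.php switch; a minimal sketch, assuming nothing fancier is needed:)

// LocalSettings.php: hide all pages from anonymous visitors and crawlers
$wgGroupPermissions['*']['read'] = false;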

I'm once again a bit surprised that there isn't any maintenance script to delete pages from a list of page IDs; instead you need a complicated and unreliable generation of namespace names from an SQL query.

Forking again. And maybe if we have some time, we can check consensus for:

T357019: deleteBatch.php: accept minimal info (page ID, instead of full title + namespace)
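In the meantime, a minimal workaround sketch (assuming MediaWiki 1.26-era classes, run from the maintenance directory): convert a file of page IDs into the prefixed-title list that deleteBatch.php already accepts, so the namespace names come from Title instead of a hand-written SQL query. The helper name and page-ids.txt are hypothetical.

<?php
// page-ids-to-titles.php: print one prefixed title per line, for deleteBatch.php
require_once __DIR__ . '/commandLine.inc';

$ids = file( 'page-ids.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES );
foreach ( $ids as $id ) {
	$title = Title::newFromID( (int)$id );
	if ( $title ) {
		// deleteBatch.php expects the full title, namespace included
		echo $title->getPrefixedText() . "\n";
	}
}

Then something like:

$ sudo -u www-data HTTP_HOST=es.wikimini.org php ./page-ids-to-titles.php > titles.txt
$ sudo -u www-data HTTP_HOST=es.wikimini.org php ./deleteBatch.php -r "Spam" titles.txt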

BTW the Nuke extension is not feasible, since it only works on recent changes, and this spam is not recent. Other extensions would be lovely, but they DO NOT work on MediaWiki 1.26. And we should really not try to upgrade until after the cleanup.

I'm not a designer but I'm quite good at scripting, so I'm writing a script to find spam (content), use that as a data source to find spammers (users), and then:

  • find pages created by that spammer, and delete them all (with all related revisions)
  • find all revisions authored by that spammer, and delete them all (this also cleans up legitimate pages that were later spammed)

All of this as a PHP command-line script.

This is probably super-simple. Schema:

https://cryptpad.fr/diagram/#/2/diagram/edit/OEUER4I1QH4yKGV6dtLKgbeP/

Interestingly, it's 2024 and there still is no official internal PHP method to block a specific user.

https://www.mediawiki.org/wiki/Manual:Block_abstraction_layer

ValerioBoz-WMCH lowered the priority of this task from High to Low. Feb 9 2024, 9:22 AM
IMPORTANT: I've explained the situation to Ilario and he proposes maybe just dropping the experimental es version and possibly restarting from scratch with fresh initiatives. This way we'd have fewer blockers.

BTW maybe this stub works in plain PHP.

$expiryTime = 'infinity'; // MediaWiki's canonical "never expires" value
$block = new Block(); // pre-1.35 Block class
$block->setTarget( User::newFromId( (int)$blocked_id ) );
$block->setBlocker( User::newFromId( (int)$blocker_id ) );
$block->mReason = "Spammer";
$block->mExpiry = $expiryTime;
$block->isHardblock( true );
$block->prevents( 'editownusertalk', true );
$block->prevents( 'createaccount', true );
if ( !$block->insert() ) {
	throw new Exception( "Cannot block. Why?" );
}

BTW I'm leaving the partial script here, should anyone ever wish to finish it:

/var/www/wikimini.org/www/w/maintenance/boz-obliviate-spammers.php

There has never really been a real community on Wikimini ES. However, a lot of work was done at the beginning to create the whole set of pages/categories/templates and other elements that we use on every version of Wikimini. For example, we use the same standardized set of predefined categories on every Wikimini. We also have some pages that need to be translated and created on each version before its successful launch (e.g. the main page, the Teachers' corner, the children's corner, the help pages and many others; see here). There are also some templates, system messages and images that need to be translated/created on every Wikimini. (The list of elements that need to be translated and created on every new version of Wikimini can be found here: https://lab.wikimini.org/wiki/Translation)

A lot of this work had already been done at the beginning of Wikimini ES, so it would be a pity to lose it. However, I remember that the problems with the spam bots only appeared about 12-18 months after the site was "launched". At that time there was no real active community on Wikimini ES, and the amount of valid contributions after this period is probably very small. So maybe we could try to "restore" the database to its previous state (before the invasion of spam bots), or delete all the contributions that occurred after this period. Or maybe delete all the contributions of user accounts that were created after this same period. Would this be feasible?

I could try to help identify the date when the first spam bots appeared. Maybe I would need access to the wiki (read permission) for this. I could also search my past communications with the previous technical team about this (Olivier Mory from the Nixit Informatique company).

ValerioBoz-WMCH raised the priority of this task from Low to High. Feb 9 2024, 11:21 AM

Thanks Lorangeo. I'm considering "May 2016" as a rough cut line and I'm proceeding to release a wipe script and use it, ASAP.

Note that there are false positives after May 2016. Example:

https://es.wikimini.org/w/index.php?oldid=283

I'm now looking at 2017.
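A quick way to eyeball such false positives (a sketch, assuming the pre-1.31 schema where rev_user still lives on the revision table; the helper name is made up): list post-cutoff edits by accounts registered before the cutoff, which are the ones most likely to be legitimate.

<?php
// check-cutline.php: spot-check revisions after the spam cut line
require_once __DIR__ . '/commandLine.inc';

$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->query( "
	SELECT rev_id, rev_user_text, rev_timestamp
	FROM revision
	JOIN user ON user_id = rev_user
	WHERE rev_timestamp > '20160501000000'
	AND ( user_registration IS NULL OR user_registration < '20160501000000' )
	ORDER BY rev_timestamp
	LIMIT 50
" );
// NULL registration means a very old account: those are kept in the list too.
foreach ( $res as $row ) {
	echo "{$row->rev_id}\t{$row->rev_user_text}\t{$row->rev_timestamp}\n";
}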

BTW the "Wikimini spam killer" script was completed and it's running since minutes \o/

Wikimini spam killer.png (1×961 px, 657 KB)

wiki spam killer output.png (1×848 px, 273 KB)

Status update:

Our brand-new "MediaWiki HIGHLY-AGGRESSIVE SPAMMER OBLIVIATER ®©" is running since Friday 2024-02-09, identifying and killing spam at an average speed of 30 revisions / seconds and 15 pages / seconds.

This means we are killing roughly 100K revisions per hour and 54K pages per hour. Definitely better than doing this operation manually.

Situation of 2024-02-10 morning in wikimini_eswiki:

  • identified 528K spam users (so we are incidentally working on T331110: Cleanup Wikimini users in database stockwiki (now they are 1+ million))
  • deleted 200K pages created by these spam users (and all related revisions)
  • deleted 900K revisions created by these spam users (these include edits to already-existing pages, which are generally kept)
  • we still have a total of 1.3 million pages
  • we still have a total of 3.3 million revisions

Side effects: while searching in wikimini_eswiki, I noticed this error:

Error: 144 Table './wikimini_eswiki/searchindex' is marked as crashed and last (automatic?) repair failed (127.0.0.1)

That error is under control and I will fix it ASAP. I cannot run the repair right now, since I need extra space on the server; to free space I have to rebuild the revision table with OPTIMIZE TABLE, but to run that I need to stop the spammer obliviater, and I have no intention of stopping that bot right now, since I'm having a really GOOD TIME looking at its super-verbose output. CPU capped at 100%.
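(For the record, the usual fix for a crashed MyISAM table is a REPAIR TABLE followed by MediaWiki's search index rebuild; a sketch for later, once there is disk headroom:)

$ mysql wikimini_eswiki -e 'REPAIR TABLE searchindex;'
$ sudo -u www-data HTTP_HOST=es.wikimini.org php ./rebuildtextindex.php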


Situation of 2024-02-10 night:

  • still 783.6K pages
  • still 2.4 million revisions

I will leave the bot running completely without supervision :D good night

$ sudo -u www-data HTTP_HOST=es.wikimini.org php ./deleteArchivedRevisions.php
Found 666888 revisions to delete.

(666-888 means jackpot I think)

The bot is still running.


Situation of 2024-02-11 (Sunday) morning:

  • 391K pages
  • 1.9 million revisions
  • 2.9 million text rows

Situation of 2024-02-11 (Sunday) evening:

  • 200K pages (-190K since morning)
  • 1.28 million revisions (-600K !!)
  • 1 million text rows (-1.9M !!!!)

Bot still running. 'night.

ValerioBoz-WMCH renamed this task from Fix Wikimini beta "eswiki" edition: database too big (27G) to Fix Wikimini beta "eswiki" edition: database full of spam (27G). Feb 11 2024, 8:08 PM
ValerioBoz-WMCH lowered the priority of this task from High to Medium.
ValerioBoz-WMCH changed the point value for this task from 8 to 16.

Wow that's crazy! This bot will deserve a medal... and its developer too (or maybe a pizza) ;-)

Situation of 2024-02-12 (Monday) morning:

  • 90K pages (-110K since last night)
  • 804K revisions (-476K)
  • 777K text rows (-223K)

🍕 WAIT - you shared PIZZA with the bot. You unlocked the source code of the bot \o/

<?php /*

	M E D I A W I K I   S P A M M E R   O B L I V I A T E R   -   by bozz \o/

FEATURES:
	- keyword-based (keywords to be provided by you)
	- finds spammers, not just spam (!)
	- then, OBLIVIATES SPAMMERS >:]
		- dropping all pages created by that spammer
		- dropping all revisions created by that spammer

THINGS THAT ARE MISSING:
	- user ban is done separately (feel free to contribute it here)

RATIONALE
	DON'T ADOPT THIS SPAMMER OBLIVIATER SCRIPT! THIS SCRIPT IS VERY RAW!
	Even if "it worked on my computer", this script can nuke your wiki!
	Batteries not included! Do a full backup of your wiki! Do it twice! Take a full snapshot!
	Lock the garage! Keep children away! Give vitamins to children! And stay hydrated!
	This script is supposed to be saved in the maintenance directory of MediaWiki.
	This script was used to delete 700K+ pages of spam and 1+ million spammed revisions.
	This script does not support any arguments. It does not even support --wiki, since in my case
	I just need to execute something like:
		HTTP_HOST=es.dicoado.org ./boz-obliviate-spammers.php
	So, have fun, at your own risk.
	Source code:
		https://phabricator.wikimedia.org/P56671
	Context:
		https://phabricator.wikimedia.org/T308969
	Author: Valerio Bozzolan, WMCH
	License: GNU GPL v3+
*/
use MediaWiki\MediaWikiServices;

// The script processes this number of pages at a time in a single pagination.
$BATCH = 5000;

//
// SPAM KEYWORDS
//
// This is an input file.
// Format: 1 keyword per line, separated by just one newline.
// For example a line can contain just "xxx", or " xxx " if it must be a standalone word. Spaces are respected.
// Blank lines are ignored.
// Don't let children read this file... :D :D
// Don't contact me to provide this file. I nuked it so as not to hurt my sensitive eyes.
define( 'XXX_SPAM_KEYWORDS', '/tmp/spam-keywords.txt' );

//
// SPAMMERS FOUND
//
// This is an output file.
// Each line is a spammer user ID.
define( 'OBLIVIATED_USERS_LOG', 'obliviated-users.txt' );

// FALSE: you don't want interaction. You can still edit the spam keywords and the bot will reload them.
// TRUE: you want to review every page, manually.
define( 'INTERACTION', false );

// Super important constant.
define( 'OH_NO_A_CHAD', !INTERACTION );

// Spam time cut line. After this timestamp, there is spam.
// TODO: make this an argument.
$AFTER_TIMESTAMP = '20160501000000';

require_once __DIR__ . "/commandLine.inc";

// TODO: create an argument for this.
$SYSTEM_USERNAME = 'WikiSysopMini';
$SYSTEM_USER = User::newFromName( $SYSTEM_USERNAME );
if ( !$SYSTEM_USER ) {
	throw new Exception( "Invalid username" );
}

if ( !INTERACTION ) {
	echo "------------------------------------\n";
	echo "[WARNING] INTERACTION is disabled!!!\n";
	echo "------------------------------------\n";
}

// Show some info.
echo "Current XXX_SPAM_KEYWORDS: " . XXX_SPAM_KEYWORDS . "\n";
echo "Current AFTER_TIMESTAMP: " . $AFTER_TIMESTAMP . "\n";
echo "Current SYSTEM_USERNAME: " . $SYSTEM_USERNAME . "\n";

// Allow a few seconds to change your mind.
if ( OH_NO_A_CHAD ) {
	echo "TIME TO KILL DAMN SPAMMERZ. How? In the only known way: the HARD way.\n";
	echo "Starting...\n";
	echo "[Press CTRL+C to quit]\n";
	sleep( 5 );
}

$paging = 0;
do {
	echo "Querying $BATCH results (paging $paging)...\n";

	$found_something = false;

	// Find spammed revisions. Confirm. Obliviate the authors.
	foreach ( search_some_spam_pages_with_rev_user( $BATCH, $paging ) as $spammed_page ) {

		$found_something = true;

		if ( is_user_trusted( $spammed_page->rev_user ) ) {
			echo "Skip trusted user {$spammed_page->rev_user}\n";
			continue;
		}

		$text_short = trim( $spammed_page->old_text );
		$text_short = mb_strimwidth( $text_short, 0, 500, '...' );

		if ( $text_short === '' ) {
			echo "Skipped empty page\n";
			continue;
		}

		// Show a preview, to confirm one by one.
		echo "\n----------------\n";
		echo "$text_short\n\n";
		echo "----------------\n";
		echo 'https://' . $_SERVER['HTTP_HOST'] . '/wiki/Special:PermaLink/' . $spammed_page->rev_id . "\n";
		echo "\n";

		$obliviate = false;
		if ( $found_term = contains_spam_keyword( $spammed_page->old_text ) ) {
			echo "SUSPICIOUS TERM FOUND FROM UNTRUSTED USER: <$found_term> AND THE DATE MATCHES THE SPECIFIED SPAM TIME WINDOW >= $AFTER_TIMESTAMP\n";
			$obliviate = true;
		} elseif ( INTERACTION ) {
			$obliviate = yesno( "Is this SPAM? Delete its user and ALL their edits?" );
		}

		if ( $obliviate ) {
			echo "OBLIVIATING >:D\n";
			obliviate_user( $spammed_page->rev_user );
		} else {
			echo "Skipping...\nConsider marking the user as autopatrolled if you trust them.\n";
		}
	}

	$paging++;

} while ( $found_something );

echo "DONE! Let's play again soon :D :D :D\n";

echo "Now please run:\n";
echo "  ./deleteOrphanedRevisions.php\n";
echo "  ./deleteArchivedRevisions.php\n";

/**
 * Completely nuke one MediaWiki user by their user ID (numeric).
 * @param int $user_id
 */
function obliviate_user( $user_id ) {

	// Record obliviated users in a log file.
	file_put_contents( OBLIVIATED_USERS_LOG, $user_id . "\n", FILE_APPEND );

	// Find pages where the spammer is the page creator, and delete them one by one.
	foreach ( search_pages_where_user_is_author( $user_id ) as $created_page ) {
		drop_page( $created_page->page_id );
	}

	// Find other edits where the spammer is the author.
	$revisions_to_be_nuked = [];
	foreach ( search_revisions_where_user_is_author( $user_id ) as $created_revision ) {
		$revisions_to_be_nuked[] = $created_revision->rev_id;
	}

	// Delete revisions, in batches.
	while ( $revisions_to_be_nuked ) {
		$some_revisions_to_be_nuked = array_splice( $revisions_to_be_nuked, 0, 50 );
		drop_revisions( $some_revisions_to_be_nuked );
	}
}

function search_pages_where_user_is_author( $user_id ) {
	$user_id_safe = (int)$user_id;
	if ( !$user_id_safe ) {
		throw new Exception( "Not a valid user ID: $user_id" );
	}

	// A page was created by the user if its very first revision belongs to them.
	$sql_first_revision_of_user = "
		SELECT
			page_id,
			rev_page,
			page_namespace,
			page_title
		FROM revision AS mainrev
		JOIN page ON page_id = rev_page
		WHERE rev_user = $user_id_safe
		AND rev_timestamp =
			( SELECT MIN(rev_timestamp)
			  FROM revision AS childrev
			  WHERE childrev.rev_page = mainrev.rev_page
			)
	";

	return query_objects( $sql_first_revision_of_user );
}

function search_revisions_where_user_is_author( $user_id ) {
	$user_id_safe = (int)$user_id;
	if ( !$user_id_safe ) {
		throw new Exception( "Not a valid user ID: $user_id" );
	}

	$sql_revisions_of_user = "
		SELECT rev_id
		FROM revision
		WHERE rev_user = $user_id_safe
	";

	return query_objects( $sql_revisions_of_user );
}

/**
 * Search some pages and return the related revision's user.
 * PLEASE SANITIZE THE $AFTER_TIMESTAMP GLOBAL.
 * @param int $limit How many results for each page.
 * @param int $page From zero to infinity.
 */
function search_some_spam_pages_with_rev_user( $limit, $page ) {

	global $AFTER_TIMESTAMP;

	$offset = $limit * $page;

	$timestamp_safe = wfGetDB( DB_MASTER )->addQuotes( $AFTER_TIMESTAMP );

	/*
	 * At the moment we just inspect all revisions with runtime checks.
	 * But you may want to activate something like this to do SQL-based checks.
	 * (spoiler: probably slower, if you have so much spam) - so, commented.
	$like_queries = [];
	foreach ( $argv as $arg ) {
		$arg = trim( $arg );
		$search_safe = 'old_text LIKE %' . $dbw->addQuotes( $arg ) . '%';
		$like_queries[] = $search_safe;
	}
	$like_queries_safe = implode( ' AND ', $like_queries );
	*/

	// Select the latest version of each page.
	$sql_spammed_pages = "
		SELECT
			rev_id,
			page_id,
			rev_user,
			old_text
		FROM page
		JOIN revision ON (page_latest = rev_id)
		JOIN text ON (rev_text_id = old_id)
		WHERE rev_timestamp > $timestamp_safe
		AND page_namespace IN (0, 2)
		LIMIT $limit
		OFFSET $offset
	";

	return query_objects( $sql_spammed_pages );
}

/**
 * Please do not specify too many arguments or the shell may crash.
 * Please VALIDATE the input. Each revision ID MUST be a numeric integer.
 * @param array $revision_ids
 */
function drop_revisions( $revision_ids ) {

	if ( !$revision_ids ) {
		return;
	}

	// LOL
	$revision_ids_str = implode( " ", $revision_ids );
	echo "Deleting revisions created by known spammer: $revision_ids_str\n";
	$result = passthru( sprintf( 'php ./deleteRevision.php %s', $revision_ids_str ) );
	if ( $result === false ) {
		throw new Exception( "Failed deleting revisions: $revision_ids_str" );
	}
}

function drop_page( $page_id ) {

	global $SYSTEM_USER;

	$page_id = (int)$page_id;
	if ( !$page_id ) {
		throw new Exception( "Not a valid page ID: $page_id" );
	}

	$title = Title::newFromID( $page_id );
	if ( !$title ) {
		throw new Exception( "Wrong page ID $page_id" );
	}

	echo "Dropping page $title\n";

	// Nothing to do.
	if ( !$title->exists() ) {
		return;
	}

	$reason = "Spam";

	// Files also have a file on disk to delete, not just the page.
	if ( $title->getNamespace() == NS_FILE ) {
		$img = wfFindFile( $title, [ 'ignoreRedirect' => true ] );
		if ( $img && $img->isLocal() && !$img->delete( $reason ) ) {
			echo "[WARNING] Failed to delete local file\n";
		}
	}

	$wikipage = WikiPage::factory( $title );
	if ( !$wikipage ) {
		throw new Exception( "Wrong WikiPage $title" );
	}

	$error = '';
	$success = $wikipage->doDeleteArticle( $reason, false, 0, true, $error, $SYSTEM_USER );
	if ( !$success ) {
		throw new Exception( "Cannot drop page $title: $error" );
	}

	echo "Dropping page... done!\n";
}

function is_user_trusted( $user_id ) {
	$user = User::newFromId( $user_id );
	return $user->isAllowed( 'autopatrol' );
}

/**
 * Run a raw SQL query and get an iterable of row objects.
 * @param string $query
 * @return ResultWrapper
 */
function query_objects( $query ) {
	$query = str_replace( "\n", " ", $query );
	return wfGetDB( DB_MASTER )->query( trim( $query ) );
}

/**
 * Check whether an input string contains a well-known spam keyword.
 * It respects your spaces around a keyword.
 * A line consisting of just spaces is ignored.
 * The comparison is case-insensitive.
 * @param string $text
 * @return false|string False if not found, or the found keyword.
 */
function contains_spam_keyword( $text ) {

	$terms_raw = file_get_contents( XXX_SPAM_KEYWORDS );
	$terms = explode( "\n", trim( $terms_raw ) );

	$text = strtolower( $text );
	foreach ( $terms as $term ) {
		// A line consisting of a single space would be a huge harm.
		$term_space = trim( $term );
		if ( $term_space !== '' ) {
			$term = strtolower( $term );
			if ( false !== mb_strpos( $text, $term ) ) {
				return $term;
			}
		}
	}
	return false;
}

/**
 * Ask for a 'Y' or an 'n'. Yes is the default.
 * @param string $question Creative question text.
 * @return bool True if the user answered yes or just pressed enter.
 */
function yesno( $question ) {
	echo "$question [Y/n]\n";
	$a = trim( readline() );
	return $a === '' || $a === 'Y' || $a === 'y';
}

Our cute "spammer-obliviater" is starting to run dry without being able to identify more spam. And I'm not creative enough to put additional keywords. Anyway we gained +20 GB free disk space and that's enough \o/

Now the database of the "es" edition is "just" ~7GB.

The spam is now under control for the Wikimini community.

Let's flag this as closed.