Page MenuHomePhabricator

How many revision comments are exactly the same? Get some stats.
Closed, ResolvedPublic

Event Timeline

Ideally, not just count unique; group them and get the number of re-uses in each group, to get a distribution.

Which wikis will you run this on? I guess the more bots and gadgets are used on a wiki, the more re-usable messages we'll see.

I am running something on enwiki- we can test others depending on the first results. For example, maybe commons and wikidata have more bot-like edits?

17 minutes for a full tables scan, less than I expected:

mysql> SELECT rev_comment FROM revision PROCEDURE ANALYSE(1);
+-----------------------------+-------------------------------------------------------------------------------------------------+--------------+------------+------------+------------------+-------+-------------------------+------+-------------------+
| Field_name                  | Min_value                                                                                       | Max_value    | Min_length | Max_length | Empties_or_zeros | Nulls | Avg_value_or_avg_length | Std  | Optimal_fieldtype |
+-----------------------------+-------------------------------------------------------------------------------------------------+--------------+------------+------------+------------------+-------+-------------------------+------+-------------------+
| enwiki.revision.rev_comment |         ''The factual accuracy of this article is [[Wikipedia:accuracy dispute|disputed]]'' (see talk) | <U+2B7FC> U+2B7FC     |          1 |        255 |        139355534 |     0 | 43.3397                 | NULL | TINYBLOB NOT NULL |
+-----------------------------+-------------------------------------------------------------------------------------------------+--------------+------------+------------+------------------+-------+-------------------------+------+-------------------+
1 row in set (17 min 58.51 sec)

For better clarity:

Field_nameMin_valueMax_valueMin_lengthMax_lengthEmpties_or_zerosNullsAvg_value_or_avg_lengthStdOptimal_fieldtype
enwiki.revision.rev_comment''The factual accuracy of this article is [[Wikipedia:accuracy dispute|disputed]]'' (see talk)<U+2B7FC> U+2B7FC1255139355534043.3397NULLTINYBLOB NOT NULL

I am getting better and more stats soon, hold your breath!

Note: above-Min_value may had some space-like characters for start.

BTW, the avg_value_or_avg_length = 43.3397 means there are approximately 43.3397*745508534 = 30GB only on comment text (probably more due to blob storing inneficiences), which is likely very compressible (even without deduplication).

So I created:

root@dbstore2002[ops]> SHOW CREATE TABLE comment_summary2\G
*************************** 1. row ***************************
       Table: comment_summary2
Create Table: CREATE TABLE `comment_summary2` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `comment` tinyblob,
  `sha256` binary(32) DEFAULT NULL,
  `comment_count` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `sha256` (`sha256`),
  KEY `comment` (`comment`(100)),
  KEY `comment_count` (`comment_count`)
) ENGINE=InnoDB AUTO_INCREMENT=393211 DEFAULT CHARSET=binary ROW_FORMAT=COMPRESSED
1 row in set (0.00 sec)

and then I ran:

root@dbstore2002[ops]> INSERT INTO comment_summary2 SELECT NULL, rev_comment, unhex(sha2(rev_comment, 256)), 1 FROM enwiki.revision WHERE rev_id BETWEEN 700000001 and 701000000 ON DUPLICATE KEY UPDATE comment_count = comment_count + 1;
Query OK, 1551253 rows affected (10 hours 2 min 55.02 sec)
Records: 961408  Duplicates: 589845  Warnings: 0

-I assumed for easy of operation hash collisions cannot exist, something logically incorrect but ok in practice. (check the processing time, BTW!)

And I got this:

root@dbstore2002[ops]> SELECT count(*) FROM enwiki.revision WHERE rev_id BETWEEN 700000001 and 701000000;
+----------+
| count(*) |
+----------+
|   961403 |
+----------+
1 row in set (0.57 sec)

root@dbstore2002[ops]> SELECT max(id) FROM comment_summary2;
+---------+
| max(id) |
+---------+
|  371563 |
+---------+
1 row in set (0.00 sec)

root@dbstore2002[ops]> SELECT comment_count, LEFT(comment, 100) as comment FROM comment_summary2 ORDER BY comment_count DESC LIMIT 30;
+---------------+------------------------------------------------------------------------------------------------------+
| comment_count | comment                                                                                              |
+---------------+------------------------------------------------------------------------------------------------------+
|        170764 |                                                                                                      |
|         11238 | Notification of altered sources needing review #IABot                                                |
|          9664 | Autotagging stub articles // Contact [[User talk:QEDK|Operator]] if any bugs found // AWB 5.8.0.0 (1 |
|          8435 | /* External links */                                                                                 |
|          6810 | add WikiProject tag using [[Project:AWB|AWB]]                                                        |
|          5748 | /* References */                                                                                     |
|          4710 | tag with {{WikiProject Scotland}} using [[Project:AWB|AWB]]                                          |
|          4127 | create using [[Project:AWB|AWB]]                                                                     |
|          3971 | Tagging pages in [[Category:Canoing and Kayaking|]] for [[WP:WikiProject Kayaking|]], Added {{WikiPr |
|          3877 | /* History */                                                                                        |
|          3114 | User warning for unconstructive editing found using [[WP:STiki|STiki]]                               |
|          2902 | Copying assessment table to wiki                                                                     |
|          2050 | most recent scoring                                                                                  |
|          2016 | Welcome to Wikipedia! ([[WP:TW|TW]])                                                                 |
|          1925 | /* See also */                                                                                       |
|          1884 | /* Plot */                                                                                           |
|          1843 | Updating                                                                                             |
|          1794 | /* Current squad */                                                                                  |
|          1791 | /* Cast */                                                                                           |
|          1739 | New Message (simulated automatically as part of [[WP:The Wikipedia Adventure|The Wikipedia Adventure |
|          1624 | /* Career */                                                                                         |
|          1580 | /* Filmography */                                                                                    |
|          1527 | Rescuing 1 sources, flagging 0 as dead, and archiving 0 sources. #IABot                              |
|          1412 | copyedit,refine category structure, general fixes using AWB using [[Project:AWB|AWB]]                |
|          1380 | /* top */                                                                                            |
|          1365 | /* Personal life */                                                                                  |
|          1292 | /* Early life */                                                                                     |
|          1279 | Notification of [[WP:AGF|good faith]] revert found using [[WP:STiki|STiki]]                          |
|          1243 | ce                                                                                                   |
|          1234 | Fixed typo                                                                                           |
+---------------+------------------------------------------------------------------------------------------------------+
30 rows in set (0.21 sec)

root@dbstore2002[ops]> SELECT count(*), sum(comment_count) FROM comment_summary2 WHERE comment_count > 100000
    -> UNION
    -> SELECT count(*), sum(comment_count) FROM comment_summary2 WHERE comment_count BETWEEN \c
root@dbstore2002[ops]> SELECT '> 100000', count(*), sum(comment_count) FROM comment_summary2 WHERE comment_count > 100000
    -> UNION
    -> SELECT '10000-99999', count(*), sum(comment_count) FROM comment_summary2 WHERE comment_count BETWEEN 10000 AND 99999
    -> UNION
    -> SELECT '1000-9999', count(*), sum(comment_count) FROM comment_summary2 WHERE comment_count BETWEEN 1000 AND 9999
    -> UNION
    -> SELECT '100-999', count(*), sum(comment_count) FROM comment_summary2 WHERE comment_count BETWEEN 100 AND 999
    -> UNION
    -> SELECT '10-99', count(*), sum(comment_count) FROM comment_summary2 WHERE comment_count BETWEEN 10 AND 99
    -> UNION
    -> SELECT '2-9', count(*), sum(comment_count) FROM comment_summary2 WHERE comment_count BETWEEN 2 AND 9
    -> UNION
    -> SELECT '1', count(*), sum(comment_count) FROM comment_summary2 WHERE comment_count = 1
    -> ;
+-------------+----------+--------------------+
| > 100000    | count(*) | sum(comment_count) |
+-------------+----------+--------------------+
| > 100000    |        1 |             170764 |
| 10000-99999 |        1 |              11238 |
| 1000-9999   |       32 |              86594 |
| 100-999     |      456 |             109705 |
| 10-99       |     5255 |             125603 |
| 2-9         |    45519 |             137205 |
| 1           |   320299 |             320299 |
+-------------+----------+--------------------+
7 rows in set (5.58 sec)

2/3 of them would repeat at least once, and around half of them at least 10 times, with a general reduction of 1/3 of the original size. Note this is processing a bit less than a million of revisions out of 1000M of them, comment collisions will be more frequent, when there are more of them. Compressed, the table took just over 100MB, that means a projected total size of 70GB (although with this structure, indexes take half of the total size. The original size on revision is probably around 200GB, which is around half of the total revision table size.

The first million of revisions is a bit more diverse, but the results are mostly maintained (probably less bots were active)- more savings in number of rows due to a lot of empty comments, plus the other top 4 comments:

root@dbstore2002[ops]> SELECT count(*) FROM enwiki.revision WHERE rev_id BETWEEN 1 and 1000000;
+----------+
| count(*) |
+----------+
|   933909 |
+----------+
1 row in set (0.90 sec)

root@dbstore2002[ops]> SELECT max(id) FROM comment_summary;
+---------+
| max(id) |
+---------+
|  306485 |
+---------+
1 row in set (0.00 sec)

root@dbstore2002[ops]> SELECT comment_count, LEFT(comment, 100) as comment FROM comment_summary ORDER BY comment_count DESC LIMIT 30;
+---------------+--------------------------------------------+
| comment_count | comment                                    |
+---------------+--------------------------------------------+
|        343115 |                                            |
|         87443 | *                                          |
|         30634 | geography and demographics                 |
|         26204 | Automated conversion                       |
|          4829 | sp                                         |
|          2606 | typo                                       |
|          2414 | stub                                       |
|          2307 | copyedit                                   |
|          2223 | Added coordinates and updated the wording. |
|          2050 | links                                      |
|          1871 | link                                       |
|          1598 | de:                                        |
|          1461 | misc updates                               |
|          1216 | fix link                                   |
|          1125 | nl:                                        |
|          1051 | from [[Federal Standard 1037C]]            |
|          1028 | wikify                                     |
|           956 | formatting                                 |
|           906 | disambiguate                               |
|           875 | spelling                                   |
|           816 | -/Talk                                     |
|           775 | wikified                                   |
|           767 | Stub                                       |
|           742 | disambiguation                             |
|           685 | fm                                         |
|           679 | -/talk                                     |
|           679 | sp.                                        |
|           643 | +pl:                                       |
|           639 | da:                                        |
|           624 | more                                       |
+---------------+--------------------------------------------+
30 rows in set (3.35 sec)

root@dbstore2002[ops]> SELECT '> 100000', count(*), sum(comment_count) FROM comment_summary WHERE comment_count > 100000
    -> UNION
    -> SELECT '10000-99999', count(*), sum(comment_count) FROM comment_summary WHERE comment_count BETWEEN 10000 AND 99999
    -> UNION
    -> SELECT '1000-9999', count(*), sum(comment_count) FROM comment_summary WHERE comment_count BETWEEN 1000 AND 9999
    -> UNION
    -> SELECT '100-999', count(*), sum(comment_count) FROM comment_summary WHERE comment_count BETWEEN 100 AND 999
    -> UNION
    -> SELECT '10-99', count(*), sum(comment_count) FROM comment_summary WHERE comment_count BETWEEN 10 AND 99
    -> UNION
    -> SELECT '2-9', count(*), sum(comment_count) FROM comment_summary WHERE comment_count BETWEEN 2 AND 9
    -> UNION
    -> SELECT '1', count(*), sum(comment_count) FROM comment_summary WHERE comment_count = 1
    -> ;
+-------------+----------+--------------------+
| > 100000    | count(*) | sum(comment_count) |
+-------------+----------+--------------------+
| > 100000    |        1 |             343115 |
| 10000-99999 |        3 |             144281 |
| 1000-9999   |       13 |              25779 |
| 100-999     |      153 |              41035 |
| 10-99       |     1735 |              41306 |
| 2-9         |    17430 |              51243 |
| 1           |   287150 |             287150 |
+-------------+----------+--------------------+
7 rows in set (1.58 sec)

The total size is only 80M, less than the previous chunk analyzed.

2/3 of them would repeat at least once, and around half of them at least 10 times, with a general reduction of 1/3 of the original size. Note this is processing a bit less than a million of revisions out of 1000M of them, comment collisions will be more frequent, when there are more of them. Compressed, the table took just over 100MB, that means a projected total size of 70GB (although with this structure, indexes take half of the total size. The original size on revision is probably around 200GB, which is around half of the total revision table size.

Such reduction would make lots of difference for us when doing maintenance over this table, which can take days to get altered (or analyzed!). If this is indeed the case, it would be such a massive win!
Thanks for taking the time to analyze all this data and putting it together

Wikidata seems to be the one that would benefit the most, due to many bot repeating messages:

root@dbstore2002[ops]> SELECT count(*) FROM wikidatawiki.revision WHERE rev_id BETWEEN 471000001  and 472000001;
+----------+
| count(*) |
+----------+
|   998601 |
+----------+
1 row in set (0.39 sec)

root@dbstore2002[ops]> SELECT max(id) FROM comments_wikidata;
+---------+
| max(id) |
+---------+
|  355947 |
+---------+
1 row in set (0.00 sec)

root@dbstore2002[ops]> SELECT comment_count, LEFT(comment, 100) as comment FROM comments_wikidata ORDER BY comment_count DESC LIMIT 30;
+---------------+------------------------------------------------------------------------------------------------------+
| comment_count | comment                                                                                              |
+---------------+------------------------------------------------------------------------------------------------------+
|        117252 | /* wbsetdescription-add:1|ar */ نوع من الحشرات, #quickstatements                                     |
|         60899 | /* wbeditentity-update:0| */ BOT:Add label (only latin alphabet) and descriptions on disambiguation  |
|         45825 | /* wbeditentity-create:0| */ #petscan                                                                |
|         41753 | /* wbcreateclaim-create:1| */ [[Property:P31]]: [[Q4167410]], #petscan                               |
|         33204 | /* wbeditentity-update:0| */ BOT:Add labels (Upper case)                                             |
|         33165 | /* wbeditentity-update:0| */ nl-description, [[User:Edoderoobot/Set-nl-description|python code]], lo |
|         18255 | /* wbcreateclaim-create:1| */ [[Property:P27]]: [[Q30]], #petscan                                    |
|         15375 | /* wbsetdescription-add:1|ar */ نوع من الرخويات, #quickstatements                                    |
|         15364 | /* wbeditentity-update:0| */ Bot: - Add descriptions: fr                                             |
|         10276 | /* wbeditentity-update:0| */ Bot: - Add descriptions: ar,fr                                          |
|          6857 | /* wbcreateclaim-create:1| */ [[Property:P971]]: [[Q29017630]], #petscan                             |
|          6724 | /* wbsetqualifier-add:1| */ [[Property:P585]]: 2004, #quickstatements                                |
|          6676 | /* wbsetqualifier-add:1| */ [[Property:P459]]: [[Q28878253]], #quickstatements                       |
|          6158 | /* wbeditentity-update:0| */ Bot: - add Arabic descriptions: نوع من القشريات                         |
|          5192 | /* wbsetqualifier-add:1| */ [[Property:P794]]: [[Q486972]], #quickstatements                         |
|          4506 | /* wbeditentity-update:0| */ BOT:Add category description                                            |
|          4213 | /* wbsetdescription-add:1|ru */ категория в проекте Викимедиа                                        |
|          3775 | /* wbeditentity-update:0| */ Bot: - add Arabic descriptions: نوع من المفصليات                        |
|          3410 | /* wbsetreference-add:2| */ [[Property:P127]]: [[Q13646]]                                            |
|          3410 | /* wbcreateclaim-create:1| */ [[Property:P127]]: [[Q13646]]                                          |
|          3363 | /* wbeditentity-update:0| */ Bot: - add Arabic descriptions: نوع من البرمائيات                       |
|          2944 | /* wbsetdescription-add:1|pt */ categoria de um projeto da Wikimedia                                 |
|          2940 | /* wbsetdescription-add:1|es */ categoría de Wikimedia                                               |
|          2939 | /* wbsetdescription-add:1|it */ categoria di un progetto Wikimedia                                   |
|          2936 | /* wbsetdescription-add:1|fr */ page de catégorie d'un projet Wikimedia                              |
|          2885 | /* wbsetdescription-add:1|de */ Wikimedia-Kategorie                                                  |
|          2872 | /* wbsetdescription-add:1|en */ Wikimedia category                                                   |
|          2704 | /* wbeditentity-update:0| */ Bot: - add Arabic descriptions: نوع من القراصات                         |
|          2577 | /* wbsetreference-add:2| */ [[Property:P495]]: [[Q35]]                                               |
|          2577 | /* wbcreateclaim-create:1| */ [[Property:P495]]: [[Q35]]                                             |
+---------------+------------------------------------------------------------------------------------------------------+
30 rows in set (3.21 sec)

root@dbstore2002[ops]> SELECT '> 100000', count(*), sum(comment_count) FROM comments_wikidata WHERE comment_count > 100000
    -> UNION
    -> SELECT '10000-99999', count(*), sum(comment_count) FROM comments_wikidata WHERE comment_count BETWEEN 10000 AND 99999
    -> UNION
    -> SELECT '1000-9999', count(*), sum(comment_count) FROM comments_wikidata WHERE comment_count BETWEEN 1000 AND 9999
    -> UNION
    -> SELECT '100-999', count(*), sum(comment_count) FROM comments_wikidata WHERE comment_count BETWEEN 100 AND 999
    -> UNION
    -> SELECT '10-99', count(*), sum(comment_count) FROM comments_wikidata WHERE comment_count BETWEEN 10 AND 99
    -> UNION
    -> SELECT '2-9', count(*), sum(comment_count) FROM comments_wikidata WHERE comment_count BETWEEN 2 AND 9
    -> UNION
    -> SELECT '1', count(*), sum(comment_count) FROM comments_wikidata WHERE comment_count = 1
    -> ;
+-------------+----------+--------------------+
| > 100000    | count(*) | sum(comment_count) |
+-------------+----------+--------------------+
| > 100000    |        1 |             117252 |
| 10000-99999 |        9 |             274116 |
| 1000-9999   |       40 |             115399 |
| 100-999     |      190 |              55803 |
| 10-99       |     2308 |              52513 |
| 2-9         |    13301 |              43501 |
| 1           |   340098 |             340098 |
+-------------+----------+--------------------+
7 rows in set (2.09 sec)

Final size: 96M

Wikidata seems to be the one that would benefit the most, due to many bot repeating messages:

Final size: 96M

99G -> 96M

:-O

No, 96M is the final size for 1 million comments- including indexes. The projected size for revision is less than the current half (99G compressed-> 45G) plus a comments table of around 30GB (again, half of that are indexes, and half of half is extra columns). If we index the comments on a separate table or on ES, or we just do not index it (we are not doing it now on revision) the table would be only ~15GB.