Allow comments longer than 255 bytes
Closed, ResolvedPublic

Assigned To
Authored By
bzimport, Jan 22 2006

Description

Author: avarab

Description:
To support longer edit summaries, the rev_comment field has to be changed from
TINYBLOB to BLOB and a corresponding database upgrade has to be written.

This change also required bumping the minimum MySQL version to 5.0.3 (documented in the files RELEASE-NOTES-1.25 and INSTALL, brought up on wikitech-l).
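
A minimal sketch of the column change described above (hypothetical; the actual MediaWiki updater patch may differ):

-- Widen the edit-summary column from TINYBLOB (max 255 bytes) to BLOB (max 64 KB).
ALTER TABLE revision
  MODIFY rev_comment BLOB NOT NULL;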


Version: unspecified
Severity: enhancement
URL: http://dev.mysql.com/doc/refman/5.0/en/string-type-overview.html

Details

Reference
bz4715

Related Objects

Status | Assigned | Task
Open | None |
Open | None |
Duplicate | None |
Resolved | None |
Invalid | None |
Open | None |
Resolved | Anomie |
Resolved | kaldari |
Resolved | jcrespo |
Open | Anomie |
Resolved | Anomie |
Open | None |
Resolved | Marostegui |
Open | Bstorm |
Resolved | Marostegui |
Open | None |
Open | Anomie |
Resolved | Anomie |
Resolved | matmarex |
Stalled | None |
There are a very large number of changes, so older changes are hidden.
DannyH added a subscriber: DannyH.Dec 20 2016, 12:08 AM
scfc removed a subscriber: scfc.Dec 20 2016, 12:51 AM

So I suppose one possible solution would be to:

  • Not touch the oldimage/image table (nobody really cares about upload summaries anyways)

Indeed. All upload summaries are also duplicated in the logging table, so we'd really lose nothing at all.

  • Add a table revision_comment with a primary key corresponding to rev_id, and presumably equivalent tables for archive, ipblocks, image, oldimage, filearchive, recentchanges, logging, and protectedtitles.
  • Add a comment table. Then add fields rev_comment_id and so on to the existing tables, referring to this table.

Yeah, an idea along these lines was suggested in the first comments on this task by @Domas ;) but there was no concrete proposal.

From what I remember, the extension of rev_comment and other fields to 767 bytes was always a "temporary" solution, implemented in the schema only because, at the time, it looked like a simple database migration, and moving comments to a separate table looked like a major undertaking. We could actually revert it easily – the MediaWiki software still enforces the maximum of 255 bytes for the comments, and it looks like we consistently do it everywhere (search EditPage.php, LocalFile.php or LogEntry.php for '255'), only the schema was changed.

We could actually revert it easily

Can we? One thing is MediaWiki at WMF; another is other wikis that may have started using such a feature. That is why I don't dare to say "do this"; I do not have the vision or skills that you may have.

Can I interpret that statement as meaning that you think Anomie's proposal of a separate editsummary table, containing just the revision id and the edit summary, and having all future revision entries after the new table is introduced have rev_comment as the empty string, would be a "huge win"?

It will be a win: we will have separated the info into 2 tables, and we will be able to complete this ticket "easily". I cannot say whether that would be enough, or for how long, but that, I assume, is out of the scope of this ticket. One small comment: do not use the empty string; use NULL, which will be packed more efficiently. Ideally, we would eventually move old comments to this system too, as that would deflate the current size. Why do I not say "huge win" for sure? Because there are proposals (like MCR) to double or triple the number of revisions, which would bring us back to square one in the future (they are new features in the planning phase).
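
A rough sketch of the kind of normalization being discussed (the comment table and rev_comment_id names here are illustrative assumptions, not the final design):

-- Illustrative only: move edit summaries into their own table.
CREATE TABLE comment (
  comment_id   bigint unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
  comment_text BLOB NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=binary;

-- New revisions reference the comment table; NULL packs better than ''.
ALTER TABLE revision
  ADD COLUMN rev_comment_id bigint unsigned DEFAULT NULL;

Queries that never touch the summary could then skip the comment table entirely.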

Are you referring to queries from updateSpecialPages.php

Not only those. Currently, there are many API calls that fail because they ask for combinations of filters that require scanning the huge revision table in large chunks (such as requesting all contributors for a page, which can be millions, or requesting all edits for a bot account), bringing down API servers. What is worse, some recentchanges queries have started querying the revision table (instead of just the much smaller and more reasonable recentchanges one), making the problem worse.

Also, I cannot just stop crons and dumps. Database failover can happen at any time (either for server maintenance or for server failure), and despite us providing that service, MediaWiki refuses to change the server it is querying mid-batch, as it does not reload its config.

those with a rev_page equality condition, those with a specific rev_user/rev_user_text, and those with a rev_id condition [Several, but not all of which could be rewritten to have a rev_page equality in them].

That is not what happens. We have 2 million different rc query plans/filters (not different queries, those would be virtually infinite): T101502#1471866. A similar thing happens to revision: if this were a "give me a single row" table, its scalability would be larger than you can imagine, as long as it fits on disk. The problem is when it scans ranges that can be millions of rows, based on many different filters. I agree that a simplification (a couple of well-defined filters that are perfectly indexed) would help, but that is not what happens today.
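
A hypothetical illustration of the difference (the queries below are simplified stand-ins, not actual MediaWiki code):

-- Scales well: a single-row lookup on the primary key.
SELECT rev_timestamp, rev_comment FROM revision WHERE rev_id = 12345;

-- Scales poorly: a range that may touch millions of rows
-- ("all contributors for a page").
SELECT DISTINCT rev_user_text FROM revision WHERE rev_page = 678;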

So would the idea here be to partition (using mysql partitioning) the master based on rev_id, and then have three different groups of slaves that have partitioning based on rev_page, rev_user, and rev_id respectively?

Do you know that this already exists, but apparently nobody uses it or knows about it? The special slaves (watchlist, recentchanges, contributions, logpager) have the following structure:

CREATE TABLE `revision` (
  `rev_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `rev_page` int(8) unsigned NOT NULL DEFAULT '0',
  `rev_text_id` int(8) unsigned NOT NULL DEFAULT '0',
  `rev_comment` varbinary(255) DEFAULT NULL,
  `rev_user` int(5) unsigned NOT NULL DEFAULT '0',
  `rev_user_text` varbinary(255) NOT NULL DEFAULT '',
  `rev_timestamp` varbinary(14) NOT NULL DEFAULT '',
  `rev_minor_edit` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `rev_deleted` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `rev_len` int(8) unsigned DEFAULT NULL,
  `rev_parent_id` int(8) unsigned DEFAULT NULL,
  `rev_sha1` varbinary(32) NOT NULL DEFAULT '',
  `rev_content_model` varbinary(32) DEFAULT NULL,
  `rev_content_format` varbinary(64) DEFAULT NULL,
  PRIMARY KEY (`rev_id`,`rev_user`),
  KEY `rev_timestamp` (`rev_timestamp`,`rev_id`),
  KEY `page_timestamp` (`rev_page`,`rev_timestamp`,`rev_id`),
  KEY `user_timestamp` (`rev_user`,`rev_timestamp`,`rev_id`),
  KEY `usertext_timestamp` (`rev_user_text`,`rev_timestamp`,`rev_id`),
  KEY `page_user_timestamp` (`rev_page`,`rev_user`,`rev_timestamp`,`rev_id`)
) ENGINE=InnoDB AUTO_INCREMENT=755820458 DEFAULT CHARSET=binary
/*!50100 PARTITION BY RANGE (rev_user)
(PARTITION p1 VALUES LESS THAN (1) ENGINE = InnoDB,
 PARTITION p50000 VALUES LESS THAN (50000) ENGINE = InnoDB,
 PARTITION p100000 VALUES LESS THAN (100000) ENGINE = InnoDB,
 PARTITION p200000 VALUES LESS THAN (200000) ENGINE = InnoDB,
 PARTITION p300000 VALUES LESS THAN (300000) ENGINE = InnoDB,
 PARTITION p400000 VALUES LESS THAN (400000) ENGINE = InnoDB,
 PARTITION p500000 VALUES LESS THAN (500000) ENGINE = InnoDB,
 PARTITION p700000 VALUES LESS THAN (750000) ENGINE = InnoDB,
 PARTITION p1000000 VALUES LESS THAN (1000000) ENGINE = InnoDB,
 PARTITION p1200000 VALUES LESS THAN (1500000) ENGINE = InnoDB,
 PARTITION p2000000 VALUES LESS THAN (2000000) ENGINE = InnoDB,
 PARTITION p3000000 VALUES LESS THAN (3000000) ENGINE = InnoDB,
 PARTITION p4000000 VALUES LESS THAN (4000000) ENGINE = InnoDB,
 PARTITION p5000000 VALUES LESS THAN (5000000) ENGINE = InnoDB,
 PARTITION p6000000 VALUES LESS THAN (6000000) ENGINE = InnoDB,
 PARTITION p7000000 VALUES LESS THAN (7000000) ENGINE = InnoDB,
 PARTITION p8000000 VALUES LESS THAN (8000000) ENGINE = InnoDB,
 PARTITION p9000000 VALUES LESS THAN (9000000) ENGINE = InnoDB,
 PARTITION p10000000 VALUES LESS THAN (10000000) ENGINE = InnoDB,
 PARTITION p12000000 VALUES LESS THAN (12000000) ENGINE = InnoDB,
 PARTITION p14000000 VALUES LESS THAN (14000000) ENGINE = InnoDB,
 PARTITION p16000000 VALUES LESS THAN (16000000) ENGINE = InnoDB,
 PARTITION p18000000 VALUES LESS THAN (18000000) ENGINE = InnoDB,
 PARTITION p24000000 VALUES LESS THAN (24000000) ENGINE = InnoDB,
 PARTITION pMAXVALUE VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */

They are partitioned by user, but MediaWiki knows nothing about it. Keeping a special slave at the infrastructure level only is a pain because:

  • Most people do not know about it, it seems, and complain that their queries do not work there because it has different indexes (which are required by MySQL partitioning) (T132416)
  • MySQL partitioning has some restrictions (the partitioning key must be part of every unique key)
  • Maintaining a special slave at the infrastructure level only requires a significant resource investment: we need to have 2 special servers per shard (that is almost 14 new servers), duplicating all resources
  • This solves, for the few queries that use them, the issues with queries per user, but not queries by page or other filters. It also doesn't make maintenance easier; if anything, it makes it more difficult, because partitioned tables cannot be imported in binary format and have further restrictions
  • A logical implementation would always be more efficient than a MySQL one (but would require code changes). Separating comments allows queries not using comments to run faster (in some cases, even all of them). Partitioning by just setting up a hash based on the primary key would make PK lookups faster, but range queries slower (see the sketch after this list). Separating, e.g. by user or by page, would speed up each one's particular queries. In an extreme case we could even duplicate the data on disk (one revision table partitioned by user, and another one partitioned by page) if that helps make more queries faster.
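
As a concrete, purely hypothetical example of that last point, hash-partitioning on the primary key is a one-liner but only helps single-row lookups:

-- Hypothetical: spread the table over 32 partitions by rev_id.
-- PK lookups hit one partition; timestamp/page range scans must check all 32.
ALTER TABLE revision PARTITION BY HASH (rev_id) PARTITIONS 32;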

Sadly, we are nowadays so dependent on partitioning that if I were to pool a regular slave for watchlist, recentchanges, contributions or logpager, a lot of functionality on enwiki, commons and wikidata (like recentchanges and watchlists) would stop working.

I do not have a perfect solution. I have more or less identified the issues, and have some potential ideas (many, not all good) on how to move forward that I want to present at T149534 (because it requires a lot of thinking, I do not have perfect ideas now), and then let you (as in, all developers and contributors) decide, with that information, what is the best way to actually proceed and on what schedule. The summary that Daniel wrote captures it perfectly:

The way MediaWiki stores information in a relational database is not very efficient. In some respects we are reaching the limits of the current design. We need to explore options to address this issue, like normalization, denormalization, and partitioning, as well as alternative storage technologies.

We just discussed 2 of those things: normalizing comments and partitioning. Did I mention this is not an easy topic? 0:-)

TL;DR options:

a) Apply the original patch, knowing it is a time bomb
b) Separate comments into a separate table, do patches to avoid querying comments almost everywhere possible, and be aware that this is a mere quick patch and not a proper fix, but at least it resolves the current ticket while other refactoring may be under way (avoid analysis paralysis)
c) Wait for a proper refactoring of revision, which may take more time, and may even reach the same conclusion as a) for comments; but after all, if we have waited so many years, waiting some months is not that bad

greg removed a subscriber: greg.Dec 20 2016, 8:22 PM
Arbnos added a subscriber: Arbnos.Jan 5 2017, 7:52 PM
Meno25 added a subscriber: Meno25.Jan 17 2017, 11:44 AM
mxn added a subscriber: mxn.Jan 29 2017, 7:49 PM

Was there any discussion related to this at the Developer Summit? Would this be a good topic for an ArchCom discussion?

From talking with Jdforrester, it sounds like ArchCom has already discussed this and the plan is to refactor the revision table (as part of the work on multi-content revisions) and in the process move edit summaries to another table (with a larger field). @daniel, would you say that is accurate?

Base added a subscriber: Base.Mar 6 2017, 12:14 AM
Alsee added a subscriber: Alsee.Mar 8 2017, 4:51 PM

I know next to squat about databases, so I apologize if this is incredibly dumb or obvious. Would it help to: Create the new comment table as write-only, fill it at leisure, and switch all reads&writes to the new table after it fully mirrors the existing data? Then increase the comment limit and clean up the old unused data at leisure.
That at least avoids the need to check both places during reads.

jcrespo added a comment.EditedMar 8 2017, 5:31 PM

Alsee - I do not think anybody here is worried about the deployment, but there were many discussions about design and performance.

In the end, the most important part is "show me the code": someone has to write the code and make it work. *Not* the ALTERs/CREATE TABLEs, but the changes to all core functions (and extensions, if they touch these tables), the fixes for all external labs-based bots and gadgets, plus the modifications to maintenance scripts. In most cases this was delayed because, it not being a simple change, other refactoring was being researched to be done at the same time: https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_table_round_2 .

Your proposal is quite normal, although it is not as simple as you put it. Code deployment to 200 app servers takes 1-2 minutes, plus long-running requests can continue running for minutes, so it is not a transactional change. Filling/altering the revision table can take days (up to 5) per db server (and we have >100 of those); by the time it has been filled in, new revisions have been added, and you will never be able to catch up 100%. But again, nothing that cannot be done in a slightly different way as long as the code is ready. That is why newly deployed code has to work independently of whether a schema change is ready: it takes time to be deployed, and can be deployed asynchronously, even minutes apart, on different servers.

This is not about deployment, this is about architecture and programming. Please join the discussion about that at: T153333

Reedy removed a project: Patch-For-Review.
Reedy removed a subscriber: MediaWiki-Core-Team.
Krinkle moved this task from Backlog to Schema on the MediaWiki-Database board.

This will be done as stated in T174569; it is no longer blocked on a schema change.

Ltrlg added a subscriber: Ltrlg.Dec 26 2017, 7:50 PM
Restricted Application added a subscriber: alanajjar.Jan 27 2018, 11:09 PM
Anomie closed this task as Resolved.Mar 2 2018, 12:34 PM
Anomie claimed this task.

While there's still some backend cleanup work needed (see T166733), comments longer than 255 bytes are now allowed.

I suppose an announcement is due once T185948 is resolved?

Anomie added a comment.Mar 2 2018, 3:46 PM

T185948 pretty much is resolved, other than the vague "anything else" bullet.

Johan added a subscriber: Johan.Mar 2 2018, 4:05 PM

I'll include it in the Tech News issue that goes out on Monday.

While there's still some backend cleanup work needed (see T166733), comments longer than 255 bytes are now allowed.

This shouldn't have been just done on the fly, as this is entirely surprising for most users. Furthermore, I doubt it's even wanted like it works right now; see T6714 for example.

Alsee removed a subscriber: Alsee.Mar 3 2018, 2:01 AM

I believe that such a concern would belong in T6714 since that is the task for edit summary length.

@MGChecker I am no developer, but this task is about the backend, meaning there is now a *possibility* for larger comments (or comments with multi-byte characters) to be stored in the database, something that was incredibly difficult to handle with the previous structure and required lots and lots of redesign and refactoring in a multi-month effort. A priority was set for this after it reached the top of the community's requested features in the yearly survey.

Now, being able to store larger comments doesn't mean you have to. The difficult part was to *allow* them; putting technical or policy restrictions on top is, once this backend work has been completed, in my opinion (and I want to stress I am no developer) very, very easy compared to the work needed to make it possible. The frontend can be adjusted, for example, to allow only 255 characters (which can take up to 1000+ bytes). Or longer text can be stored, but limited or filtered on display (e.g. not counting IPv6 addresses or tags).
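
For reference, a quick way to see why 255 characters can need 1000+ bytes (a throwaway example, not MediaWiki code):

-- 255 four-byte UTF-8 characters (e.g. emoji) occupy 1020 bytes.
SELECT CHAR_LENGTH(CONVERT(REPEAT(0xF09F9880, 255) USING utf8mb4)) AS chars,
       LENGTH(REPEAT(0xF09F9880, 255)) AS bytes;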

Of course, my concern isn't the backend possibility of storing longer comments, but that it's possible right now to actually enter any text up to 1000 bytes in the UI.

jcrespo added a comment.EditedMar 3 2018, 7:27 PM

Actually, from what I see, it can be quickly disabled per site by doing this: https://en.wikipedia.org/wiki/Special:AbuseFilter/904 Ask an admin to apply it until a more definitive UI fix, balancing all the issues (IPv6 addresses, legibility, editor preferences), is put in place.

Please let's centralize further discussion on future improvements at T6714, as this is a closed ticket.

Meno25 removed a subscriber: Meno25.Mar 5 2018, 10:31 AM

Change 192170 abandoned by Brian Wolff:
[WIP] Increase edit summary limit in the UI

https://gerrit.wikimedia.org/r/192170