
Scaling the MediaWiki database schema
Closed, ResolvedPublic

Authored By
daniel
Oct 30 2016, 7:54 PM

Description

Type of activity: Pre-scheduled session. Unconference would also be OK.
Main topic: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/How_to_manage_our_technical_debt / https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Handling_wiki_content_beyond_plaintext

The problem

The way MediaWiki stores information in a relational database is not very efficient. In some respects we are reaching the limits of the current design. We need to explore options to address this issue, like normalization, denormalization, partitioning, as well as alternative storage technologies.

Concrete problems include the size of the revision table and the performance impact of redundant storage of page titles in link tables.
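To illustrate the second problem, here is a minimal sketch (not MediaWiki's actual schema; table and column names are invented) of why repeating full page titles in every link-table row is redundant, and how normalizing them into a lookup table avoids it. It uses sqlite3 as a stand-in for MariaDB:

```python
# Hypothetical sketch: denormalized link storage (title repeated per row)
# vs. normalized storage (title stored once, referenced by id).
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Denormalized: every link row repeats the full target title.
cur.execute("CREATE TABLE pagelinks_denorm (pl_from INTEGER, pl_title TEXT)")

# Normalized: titles live once in a lookup table; links store a small id.
cur.execute("CREATE TABLE link_target (lt_id INTEGER PRIMARY KEY, lt_title TEXT UNIQUE)")
cur.execute("CREATE TABLE pagelinks_norm (pl_from INTEGER, pl_target INTEGER)")

def add_link(frm, title):
    cur.execute("INSERT OR IGNORE INTO link_target (lt_title) VALUES (?)", (title,))
    (lt_id,) = cur.execute(
        "SELECT lt_id FROM link_target WHERE lt_title = ?", (title,)
    ).fetchone()
    cur.execute("INSERT INTO pagelinks_denorm VALUES (?, ?)", (frm, title))
    cur.execute("INSERT INTO pagelinks_norm VALUES (?, ?)", (frm, lt_id))

# A popular title linked from 1000 pages is stored 1000 times vs. once.
for page_id in range(1000):
    add_link(page_id, "United_States_presidential_election,_2016")

(titles_stored,) = cur.execute("SELECT COUNT(*) FROM pagelinks_denorm").fetchone()
(unique_titles,) = cur.execute("SELECT COUNT(*) FROM link_target").fetchone()
print(titles_stored, unique_titles)  # → 1000 1
```

The normalized variant trades a join for the space savings; whether that trade is worth it per table is exactly the kind of question this session is about.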

Expected outcome

  • A roadmap proposal for improving storage scalability

Current status of the discussion

No centralized discussion has taken place, but the issue has been brought up in the context of various discussions; see below.

Links

Related Objects

Event Timeline

I was going to create my own proposal, but can I co-own this? I have lots of things to say: :-) http://www.slideshare.net/jynus/mysql-schema-design-in-practice

@jcrespo Of course! I'd actually be happy to assign this to you, I intentionally left the assignee blank when I created this ticket. I only filed this because I didn't see any proposal from you, and the deadline is today.

So feel free to just take this over, or create your own, as you see fit.

No, I do not want to own it; I just want to work together on it and actively participate.

Ok. I don't plan to present much, this is intended to be a pretty open session. I'll present a few problems, and some ideas of how to approach them, but that's about it.

Is that sufficient for you?

> Is that sufficient for you?

If "you" means me, yes.

Might be worth splitting this in half -- a session on the "beyond wikitext" track would concentrate on *new things* we wanted to add to the schema, and the "manage technical debt" could focus on *old things to purge or refactor*. There could be communication between these two. Depending on how the tracks are arranged, perhaps this is even a "two track session" where both tracks merge temporarily...

@cscott: for putting new things into the storage schema, I think T149532: Why Multi-Content-Revisions? Use cases and requirements. would fit better. Many of the new things I know of are covered by or related to MCR.

@cscott I agree with Daniel. My opinion is that "beyond wikitext" sessions such as T149532 should definitely happen, and they certainly depend on this one, but I think this session should be limited to the issues that block them, such as the ones Daniel correctly identified:

  • Handling of very large tables such as revision, and the difficulty of querying ranges of them efficiently
  • Inefficient denormalization, such as storing titles and user names on page and the *links tables
  • Should a partitioning implementation go into core?
  • How to better handle MediaWiki storage and performance for XL wikis (enwiki, Wikidata, Commons)
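As a rough illustration of the partitioning idea in the list above, here is a hedged sketch of application-level range partitioning: routing rows of a huge revision-style table into per-year tables so that range scans touch only the relevant partition. All names are invented for illustration; MariaDB offers native PARTITION BY RANGE, which sqlite3 (used here as a stand-in) lacks.

```python
# Hypothetical sketch: manual range partitioning by timestamp year.
import sqlite3

con = sqlite3.connect(":memory:")

def partition_for(ts):
    """Route a row to a per-year table, creating it on demand."""
    name = f"revision_{ts[:4]}"
    con.execute(f"CREATE TABLE IF NOT EXISTS {name} (rev_id INTEGER, rev_timestamp TEXT)")
    return name

def insert_revision(rev_id, ts):
    con.execute(f"INSERT INTO {partition_for(ts)} VALUES (?, ?)", (rev_id, ts))

insert_revision(1, "20150103120000")
insert_revision(2, "20160422090000")
insert_revision(3, "20160901180000")

# A 2016-only range query scans one small partition, not the whole history.
rows = con.execute(
    "SELECT rev_id FROM revision_2016 WHERE rev_timestamp BETWEEN ? AND ?",
    ("20160101000000", "20161231235959"),
).fetchall()
print([r[0] for r in rows])  # → [2, 3]
```

Doing this in the application rather than the engine is exactly the kind of design choice ("partitioning implementation into core?") the bullet raises.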

To give you an idea, I (or rather databases/infrastructure in general) am in a way blocking things like T149532 from going forward, not on purpose, but because they depend on this issue having a way to move forward. But once this very issue is """solved""" (triple quotes), I have nothing to say or complain about regarding new features, including MCR; I welcome all of them. Once we have a better framework (in the general sense), we unblock lots of possibilities.

I am CCing @brion and @tstarling, with whom I have briefly discussed some of the issues that are currently solved in hacky, suboptimal ways on the infrastructure side. This is of course not only a WMF issue; any large wiki, and even smaller ones, should benefit from better performance.

Also, this is not a proposal to rewrite MediaWiki; I believe very small improvements can make a huge difference.

Who would be the person facilitating this session? Please assign this task to that person if you are aiming to have this session pre-scheduled. Thank you!

Looks like that will be me.

@daniel Sure, we can cover the content-related portion in T149532. Part of the reason I'm flagging the overlap is for future scheduling reasons -- I want to be sure we schedule this proposal and T149532 in such a way that the necessary crossover discussion can happen.

@cscott seems like I will be running both sessions - that should prevent any scheduling conflicts ;)

Would something like T33863: Fix use of DB schema so RenameUser is trivial be in scope for this discussion? Or at least something to be considered...?

@Legoktm: my own opinion is that T33863: Fix use of DB schema so RenameUser is trivial would be on topic, but it is not the largest issue by far. I consider the rename issue an architecture problem that needs to be addressed with schema changes, but since renaming is not an everyday operation that limits growth, it is not as urgent as other scaling issues on enwiki, Commons and Wikidata regarding revision storage and usage tracking. So it should definitely be on the table of TODO things, but I have the feeling that for that one the direction to go is more or less clear (it is just a very hard problem), while some of the issues above have not been deeply discussed.

So I would definitely like to mention it, but I do not think we can discuss it in depth unless there is enough time; maybe we can do a later, more informal and in-depth brainstorming session for that one in particular?

greg triaged this task as Medium priority.Nov 30 2016, 11:11 PM

Just to follow up on my earlier comment: T149532: Why Multi-Content-Revisions? Use cases and requirements. didn't get pre-scheduled, so I'm hoping that MCR gets at least a brief mention during this session.

To the owner of this session: Here is the link to the session guidelines page: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Session_Guidelines. We encourage you to recruit Note-taker(s) (2 minimum, 3 maximum), a Remote Moderator, and an Advocate (optional) on the spot before the beginning of your session. Instructions for each role are outlined in the guidelines. Physical versions of the role cards will be made available in all the session rooms. Good luck prepping, see you at the summit! :)

Note-taker(s) of this session: Follow the instructions here: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Session_Guidelines#NOTE-TAKER.28S.29 After the session, DO NOT FORGET to copy the relevant notes and summary into a new wiki page following the template here: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Your_Session and also link this from the All Session Notes page: https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/All_Session_Notes. The EtherPad links are also now linked from the Schedule page (https://www.mediawiki.org/wiki/Wikimedia_Developer_Summit/2017/Schedule) for you!

I've also got a few old notes at https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_table which propose using string interning for edit comments and user text, based on some earlier discussions along these lines. I'll use this as a base for writing up a new RFC tomorrow during 'Get Stuff Done Day'.
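The string-interning idea mentioned above can be sketched as follows. This is a minimal illustration under assumed names (comment, comment_text, rev_comment_id are invented here, not necessarily what any RFC will use), again with sqlite3 standing in for MariaDB:

```python
# Hypothetical sketch: intern each distinct edit comment once and
# reference it by id from revisions, instead of storing the text per row.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE comment (comment_id INTEGER PRIMARY KEY, comment_text TEXT UNIQUE)"
)
con.execute(
    "CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_comment_id INTEGER)"
)

def intern_comment(text):
    """Return the id of an existing identical comment, inserting it if new."""
    con.execute("INSERT OR IGNORE INTO comment (comment_text) VALUES (?)", (text,))
    (cid,) = con.execute(
        "SELECT comment_id FROM comment WHERE comment_text = ?", (text,)
    ).fetchone()
    return cid

# Bot edits often repeat the same summary thousands of times.
for rev_id in range(1, 5001):
    con.execute(
        "INSERT INTO revision VALUES (?, ?)",
        (rev_id, intern_comment("Bot: fixing double redirect")),
    )

(distinct,) = con.execute("SELECT COUNT(*) FROM comment").fetchone()
print(distinct)  # → 1  (5000 revisions share one stored comment row)
```

Highly repetitive fields like bot edit summaries and user names are where this saves the most, at the cost of one extra lookup on write.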

Some follow-up, specifically one requested by @aaron: I made an update to https://wikitech.wikimedia.org/wiki/MariaDB/query_performance. I have yet to sanitize it to remove all private data, but here is a sneak peek:

# 1670.2s user time, 3.2s system time, 228.34M rss, 291.58M vsz
# Current date: Wed Feb  8 18:12:21 2017
# Hostname: db1089
# Files: STDIN
# Overall: 10.39M total, 1.14k unique, 120.42 QPS, 0.02x concurrency _____
# Time range: 2017-02-07 14:43:51 to 2017-02-08 14:42:17
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time          1815s     1us      2s   174us   445us   862us   113us
# Lock time           456s       0     8ms    43us    76us   109us    33us
# Rows sent         16.77M       0   3.45k    1.69    3.89   14.11    0.99
# Rows examine      52.55M       0   6.14M    5.30   13.83   1.95k    0.99
# Merge passes           0       0       0       0       0       0       0
# Query size         1.81G       8  81.54k  186.60  441.81  390.17  151.03
# Boolean:
# Filesort      85% yes,  14% no
# Full scan     98% yes,   1% no
# Priority que  84% yes,  15% no
# Tmp table      2% yes,  97% no
# Tmp table on   0% yes,  99% no

# Profile
# Rank Query ID           Response time  Calls   R/Call V/M   Item
# ==== ================== ============== ======= ====== ===== ============
#    1 0xSANITIZED        411.7317 22.7% 2388097 0.0002  0.00 SELECT page
#    2 0xSANITIZED        135.0172  7.4%  688675 0.0002  0.00 SELECT page
#    3 0xSANITIZED        132.4130  7.3%  721116 0.0002  0.00 SELECT heartbeat.heartbeat
#    4 0xSANITIZED         79.1531  4.4%  185390 0.0004  0.00 SELECT page
#    5 0xSANITIZED         77.0136  4.2%  101100 0.0008  0.00 SELECT redirect page
#    6 0xSANITIZED         67.9033  3.7%  100997 0.0007  0.00 SELECT page_props
#    7 0xSANITIZED         44.9982  2.5%   65863 0.0007  0.00 SELECT page page_props
#    8 0xSANITIZED         43.3802  2.4%  162336 0.0003  0.00 SELECT page_props
#    9 0xSANITIZED         40.8492  2.3%  285934 0.0001  0.00 SELECT user
#   10 0xSANITIZED         38.6305  2.1%  731282 0.0001  0.00 SET
#   11 0xSANITIZED         35.4164  2.0%   94114 0.0004  0.00 SELECT page_props

There seems to be a heavy page query; I will investigate what it is. This also probably confirms contention on the heartbeat table (T150474), which will be fixed soon. However, none of this is likely to show the queries I am most worried about, which are related to revision/logging/rc and are part of a specific role. I will analyze an rc/log/contributions slave next.

Are there more action items for this task, or plans to use it for further discussion? If not, I would close this task as resolved!

This page, https://wikitech.wikimedia.org/wiki/MariaDB/query_performance/coreproduction-main-20170207, shows that in roughly a year, with a machine with SSDs and 3x as much memory, our 95th percentile response time has decreased by 50%.

Are there more action items for this task, or plans to use it for further discussion? If not, I would close this task as resolved!

@brion said that he wants to follow up on this with a concrete proposal soon. We (ArchCom) have penciled in an IRC discussion of this for next week, but it is not yet certain that this will happen.

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

Krinkle claimed this task.