
Move database for wikitech (labswiki) to a main cluster section
Open, Medium, Public


Historically wikitech has been deployed as self-contained systems with a local MySQL/MariaDB database server co-located with the MediaWiki software. This was done in an attempt to maintain availability of the technical documentation on wikitech in the event of outages in other parts of the network.

Today, we are monitoring and responding to alerts about staleness on the off-site hosted wikitech-static server. The cloud-services-team is also moving forward with a long term plan to remove LDAP auth and OpenStackManager from wikitech so that it can become a normal SUL wiki. These changes mean that eventually wikitech will be hosted in the common MediaWiki wiki farm (T161859). We will rely on the offsite copy of documentation for severe outage support rather than maintaining a single point of failure machine and hoping that it is unaffected.

Moving the labswiki database from m5 to the s3 slice will benefit us with better maintenance, upgrades, monitoring and high availability. Making this move sooner rather than later will reduce the potential to lag behind in schema updates and cause issues with cross-wiki maintenance like we have seen in the past (e.g. T167961).

This could happen at the same time as the other database moves that are being prepared.

Related Objects

Event Timeline

Andrew renamed this task from move wikitech and labstestwiki to s3 (needs discussion) to move wikitech and labstestwiki to s3.Oct 5 2017, 4:02 PM
bd808 renamed this task from move wikitech and labstestwiki to s3 to Move wikitech and labstestwiki to s3.Oct 5 2017, 4:08 PM
bd808 triaged this task as Medium priority.
bd808 updated the task description.

I think the first thing is to amend the description so that cloud (in particular), or anyone else involved, can agree with the plan.


Is there anything I can do to nudge this along, short of 'clone Jaime'?

Also, don't underestimate how much you can help with communication or trivial patches (especially general ops work, like firewall changes). Everything ends up piling up, and I know the firewall setup, for example, exactly as well as you do; in the case of mediawiki-config, you probably have much more experience than me.

Should we merge this into T184805: Move some wikis to s5? From the cloud-services-team side we don't have a concern about which particular prod slice the two databases end up sitting on. We just want to get into the nice world of redundant DBs that are managed the same as the other sister wikis.

We still don't have a clear idea what will be moved to where, but it is good to know you guys don't really mind. Thanks!

At the moment, we are also waiting for Release-Engineering-Team to provide some feedback (T184805#3896861)

(quoting for future reference)

<jynus> Jaime Crespo I would personally export a copy of labswiki, import it on each s5 node disabling the binary logging, start replication from silver to s5-master-eqiad with some filters, and then do a mostly hot failover
8:09 AM not immediate, that is why we have left it for later
8:09 AM we also need to analyze if there are tables with private data to setup sanitarium filters
8:10 AM setup backups
8:10 AM setup labs grants
8:10 AM etc., which takes some time
<jynus> Jaime Crespo the best way to accelerate that work is *helping* by doing as much work in advance or proposing patches for the whole process and helping paving the way, "we are ready for you to start doing all the work" is less tempting, if you know what I mean 0:-)
Andrew renamed this task from Move wikitech and labstestwiki to s3 to Move wikitech and labstestwiki to s5.Feb 21 2018, 9:47 PM

Proposed checklist:

  • exclude labswiki and labtestwiki from dumps and tools replicas (this may happen by default but I'm not sure)
  • set up replica of the silver labswiki db (read-only) on s5
  • set up replica of the labtestweb labtestwiki db (read-only) on s5
  • create grants for labswiki (permitting access to mediawiki on silver and labweb1001/1002)
  • if all is well, stop replication with silver and labtestweb, convert s5 databases to read/write
  • Figure out about wikitech-static syncing. Can the script that dumps the whole database still run on silver? If so then this should be fairly simple.
  • now we can build out 'newwikitech' on labweb1001 in a way that's mostly decoupled from the above -- they'll be additional db hosts using the same db
  • once 'newwikitech' is safe and sound, rename it to 'wikitech' so that no web traffic is hitting silver anymore
  • move wikitech-static sync logic from silver to labweb1001
  • turn off silver, cheer

Core dbs being accessed from non-core application servers is something that needs checking; we do not have any of those right now, and many things can go wrong: firewall, grants, routing, etc. I am not saying it cannot be done, just that it is the first time, and it requires detailed planning.

If it makes anything any easier, I think we can easily tolerate a multi-hour (probably up to 24?) read-only period for wikitech while the database is moved. Setting up replication from silver to the s5 master seems like unnecessary complication unless the process of importing the data dump from silver into the s5 cluster takes days.

Figure out about wikitech-static syncing. Can the script that dumps the whole database still run on silver? If so then this should be fairly simple.

This should probably move to terbium (or whatever the main cron host is) once it's reading from the main db cluster. Does this need access to local media files, though? That may be a complication for dual wikitech hosts that we haven't fully thought through yet.

See T188029: Move labswiki database to m5 for near term plan to move off of silver. This will satisfy the immediate needs for FY17/18 Q3 goals.

chasemp changed the task status from Open to Stalled.Mar 13 2018, 1:52 PM
chasemp lowered the priority of this task from Medium to Lowest.

Should we maybe change db-codfw.php to point to db1073 instead of db2037 until we come up with a better solution, or would that not fix the issue?

jcrespo added a comment.EditedSep 4 2018, 7:17 AM

Cross-DC queries: I would try to avoid them unless we are in an emergency. I see two options: either create a temporary new section for wikitech (as we planned to do for labstestwiki), or accelerate the import to s5 (for which the DC failover could be a good time), or both.

I would prefer option 2, importing wikitech to s5. However, I am not sure how many blockers we have along the way and whether we could get them resolved before the DC failover (considering that we have other things to get ready before the failover).

We can do it on switchback; that is why I suggested creating a temporary host for wikitech. Of course, that will increase the chances of a split brain, as m5-master will be kept in rw mode. Not an easy problem to solve, honestly.

Yeah, I am not sure whether I would prefer a split brain or cross-DC queries just for wikitech (that's why I suggested changing db-codfw.php to point to db1073).

There is stuff that constantly tries to write to db2037 (even now) but fails because of read-only:

jijiki added a subscriber: jijiki.Sep 5 2018, 3:47 PM
Andrew added a comment.Sep 5 2018, 5:34 PM

Hi all! I'm a bit lost, because I think this task no longer has anything to do with its original post (which was about moving the databases off of the local wikitech server; that was done long ago, but to m5 rather than s5). If I understand it, there are two different issues under discussion:

  1. What happens to wikitech during the DC failover, when m5 becomes read-only?
  2. labtestwikitech not having access to a read/write database, since there aren't any in codfw (except temporarily during failovers).

Is that correct? If so, I propose that we discuss issue 2 under T201082 (as I think that task states #2 as a problem). As for issue 1... I'm hoping that m5 will /not/ go read-only during the switchover, since everything public to do with WMCS is currently eqiad-only.

And, for that matter, it seems like issue 1 is a good argument for /not/ moving wikitech to s5 until it stops being hosted on a special server altogether.

I apologize if I'm missing the point here :)

In the end nothing is needed. m5 will not be read only in eqiad. Wikitech traffic will be redirected to eqiad and thus to m5. So nothing to worry about.

Andrew added a comment.Sep 5 2018, 5:38 PM

In the end nothing is needed. m5 will not be read only in eqiad. Wikitech traffic will be redirected to eqiad and thus to m5. So nothing to worry about.

Great! That's what I thought :) Should we close this bug in favor of the more expansive T161859 ?

I am happy to merge this task with T161859 if @jcrespo is (as he is the original task creator)

Why not just make this task a child of the other one and agree to talk in only a single place? This task is mostly for the DBA work of data migration, which is only a small part of the SUL migration. We can agree not to discuss anything here that is not the literal migration.

@Bugreporter, that merge was incorrect -- Wikitech currently /is/ hosted on M5, the bug I opened is about moving it off of M5. Is it possible to revert a ticket merge, or should I just start another one?

Bugreporter added a comment.EditedNov 10 2019, 8:52 PM

The purpose of this task is also to move the wiki to another cluster.

bd808 renamed this task from Move wikitech and labstestwiki to s5 to Move databases for wikitech (labswiki) and labstestwiki to a main cluster section (s5?).Nov 10 2019, 10:50 PM
bd808 changed the task status from Stalled to Open.Nov 10 2019, 10:56 PM
bd808 raised the priority of this task from Lowest to Medium.

With recent changes to MediaWiki-extensions-OpenStackManager, wikitech and labstestwiki will no longer communicate with the OpenStack APIs (Keystone, Nova, etc.). This unblocks moving into the main wiki hosting cluster. LDAP authentication will still be used, but the LDAP servers are accessible from the mw* hosts.

The full dream of T161859: Make Wikitech an SUL wiki is still a bit further out. The primary blocker for that is T196171: Developer account creation without OpenStackManager and my ideal solution for that is T179463: Create a single application to provision and manage developer (LDAP) accounts.

jcrespo changed the task status from Open to Stalled.Nov 11 2019, 8:38 AM

Just to clarify: does that mean there are blockers to moving wikitech to the production mw application servers, but there is a green light to import it to, e.g., the s5 production database cluster (with some sanity checks, of course), or are there still blockers on that too? OpenStack creates frequent m5 outages, so wikitech would benefit from a more stable db cluster.

jcrespo changed the task status from Stalled to Open.Nov 11 2019, 8:39 AM
bd808 added a comment.Nov 11 2019, 3:25 PM

@jcrespo Once the next MediaWiki deployment train runs, Wikitech's OpenStackManager extension will no longer interact with OpenStack APIs. The extension has been reduced to merely providing additional LDAP integration beyond the LDAPAuthentication extension. At that point we believe that Wikitech can and should move from the labweb* hosts to the main MediaWiki hosting cluster (mw*). We know that connecting to the production MediaWiki database section servers from the current labweb* is not possible.

We do not have a complete checklist of next steps written out yet, but it will need to include:

  • This task to create a labswiki database in a production MediaWiki database section
  • T237889: Install php-ldap on all MW appservers so that Wikitech's LDAP integration works from those hosts
  • A lot of configuration changes in wmf-config to point Wikitech at the proper database servers and other support services

labtestwiki no longer lives in m5; see T233236: Move labtestwikitech database to clouddb2001-dev.

root@cumin1001:/home/marostegui# -hdb1133 -e "show databases like '%wik%'"
| Database (%wik%) |
| labswiki         |
Marostegui renamed this task from Move databases for wikitech (labswiki) and labstestwiki to a main cluster section (s5?) to Move database for wikitech (labswiki) to a main cluster section (s5?).Mar 18 2020, 12:14 PM
Marostegui renamed this task from Move database for wikitech (labswiki) to a main cluster section (s5?) to Move database for wikitech (labswiki) to a main cluster section.
Marostegui updated the task description.
Kormat added a subscriber: Kormat.Jul 20 2020, 8:51 AM

We have picked up this topic in our weekly meeting and we're going to see how, and if, it is possible to do this without having to depend on the DC switchover.
The first step would be to measure times for:

  • mysqldump/mydumper of labswiki
  • importing that data into a single host.

Once we know that, we'll see if it can be done with a read-only window just for wikitech, or if that window would be too long and the move needs to be done in some other way.

Thank you for looking at this! The only issue I can think of with extended read-only time on wikitech is that it will break the SAL; any other edits can easily wait a few hours.

I have done a quick check using mysqldump (and not mydumper), as the test was done from a 10.1 host to a 10.4 one:

  • Extracting the DB locally: 4 minutes
  • Copying it over to the desired host: 8 minutes
  • Importing the database on the final host: 67 minutes.

Using mydumper+myloader:

  • Extracting the DB locally: 1m30s (mydumper -h localhost -t 8 -u root -B labswiki -S /run/mysqld/mysqld.m5.sock)
  • Copying it over to the desired host: 4 minutes
  • Importing the database on the final host: 67 minutes (myloader -h localhost -t 8 -u root -S /run/mysqld/mysqld.sock -d .)

Considering that we'd need to load the data on every host of the section we choose (s6 or s5, I would say): we can do this in parallel, but not all hosts will perform at the same speed (as they'll still be serving production traffic), so we are probably looking at around 2-3 hours of RO time for wikitech.
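The 2-3 hour estimate follows from the dump and copy happening once, while the per-host imports run in parallel and the wiki stays read-only until the slowest production host finishes. A minimal sketch of that arithmetic (the per-host import times other than the measured 67 minutes are hypothetical examples):

```python
# Rough model of the wikitech read-only window: one dump, one copy, then
# parallel imports across the section's production hosts; the window is
# bounded by the slowest production host, not the sum of all imports.

def ro_window_minutes(dump_min, copy_min, import_minutes_per_host):
    """Read-only time = one dump + one copy + the slowest parallel import."""
    return dump_min + copy_min + max(import_minutes_per_host)

# mydumper export (~1.5 min) and copy (~4 min) from the timings above;
# the per-host import times are made-up examples of hosts degrading
# unevenly while they keep serving production traffic.
estimate = ro_window_minutes(1.5, 4, [67, 80, 95, 120])
print(estimate)
```

With a few slow hosts the window creeps toward the quoted 2-3 hour figure, which is why shrinking the worst-case import time matters more than the average.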

Not sure how large wikitech is, but 67 minutes looks to me like a long time, given that I was able to import much larger wikis in less time in the past. Consider tuning the compression and the rows per file for a faster import; maybe there is one large table making things much slower. We have a wrapper for mysqldump (which tries to optimize some parameters for speed). Or maybe I can try? Were you importing with or without binlog?

Yeah, the problem is the text table from what I can see, so we end up with the usual issue: even though we have many threads, we end up just waiting for that one to finish.
Please go ahead and give it a try; much appreciated!

FYI, these are the hosts I used; feel free to use them too (they both have 10.4, even though in production it is likely we'll be using 10.1 on both ends):

  • db1117:3325 for exporting
  • db2093 for testing the import.

binlog was disabled. I did try using -r during the export with several values, but some of them made the export process slower.

jcrespo added a comment.EditedJul 22 2020, 10:05 AM

Yeah, the problem is the text table from what I can see

I was able to export the table on smaller chunks by setting the backups like this:

backup config
    host: 'db2078.codfw.wmnet'
    port: 3325
    rows: 50000
    regex: '^labswiki\.'
    threads: 16

and then just running the standard (taking 4m30s remotely over the network):

sudo -u dump

The issue is not only that the text table is large; its ids are not uniform (they group together at the last ids), so it requires more fine-grained parallelism. After that, it was loaded (again over the network; slower, but it requires no later copy):

time --user=root --host=db2102.codfw.wmnet --password='root password here' --threads=16 labswiki

taking 5m36.074s

Note that I used db2102, not db2093, as I noticed that, aside from concurrency, the main limiting factors were the small buffer pool (20GB) not fitting the data in memory and host performance issues. My point here was not a competition, just to show some tweaks, because the 1h+ import time sounded very different from (and more worrying than) my own benchmarks. There will of course be some extra penalty from ongoing IOPS for writes and replication.
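The effect of the `rows: 50000` setting can be illustrated with a toy model. This is not the wrapper's actual code, just a sketch of why fixed row-count chunks parallelize better than equal id ranges when, as with the text table here, the ids bunch together at the end of the range:

```python
# Toy model: compare two ways of splitting a table into parallel dump
# chunks when the ids are skewed toward the top of the id space.

def equal_id_range_chunks(ids, n_chunks):
    """Split by id value: chunk boundaries are evenly spaced ids."""
    lo, hi = min(ids), max(ids)
    width = (hi - lo + 1) / n_chunks
    sizes = [0] * n_chunks
    for i in ids:
        sizes[min(int((i - lo) / width), n_chunks - 1)] += 1
    return sizes

def fixed_row_chunks(ids, rows_per_chunk):
    """Split by row count: every chunk gets the same number of rows."""
    n = len(ids)
    return [min(rows_per_chunk, n - off) for off in range(0, n, rows_per_chunk)]

# 10k rows, 90% of them crammed into the last ~9% of the id space.
ids = list(range(0, 1000)) + list(range(95_000, 104_000))
print(max(equal_id_range_chunks(ids, 10)))  # one chunk holds nearly everything
print(max(fixed_row_chunks(ids, 1000)))     # chunks stay uniform
```

With equal id ranges one worker ends up owning nearly all the rows, which is exactly the "waiting for that one to finish" problem described earlier; fixed-size row chunks keep every thread busy.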

I also wanted to encourage getting everybody more familiar with the existing python wrappers because they simplify some tasks/pre-optimize them for speed.

That is excellent news!
Good point: db2093 is the tendril host, which has a very low InnoDB buffer pool because of the OOM we had; that's a huge limiting factor indeed.
So with those tweaks, we can probably say that 1h of RO time should be enough (considering that the data needs to be loaded on all the production hosts and we have to wait for the slowest one to finish before we can make the wiki writable again).
Codfw isn't a worry, and neither are the labs, dbstore, or backup hosts, as those won't serve production traffic; we don't need to wait for them to finish, we can just stop replication and let the import finish there.

@Andrew I have double checked with Chris Danis and it looks like we can set wikitech to RO via dbctl like any other section.
Do you think 1h of RO time for wikitech is ok?

Obviously !log would stop working, but we found that we also store those entries on Toolforge:

Wikimedia Toolforge
This service is currently running in Wikimedia Toolforge as the sal tool. It uses an Elasticsearch index maintained by a Python IRC bot which collects messages that start with !log from various freenode channels.

That looks independent from the database itself, so it would keep working? We can even migrate the entries produced by !log back to wikitech manually once it is ready.

bd808 added a comment.Jul 29 2020, 4:18 PM

That looks independent from the database itself, so it would keep working? We can even migrate the entries produced by !log back to wikitech manually once it is ready.

Yes, Stashbot should continue to write !log messages into the Toolforge elasticsearch cluster even with wikitech read-only. I haven't done the dance to recover SAL messages on wiki from the alternate storage for a long time, but it is possible. Worst case the messages can be manually added on wiki using the tool's data as a guide.
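For reference, a hypothetical sketch of that manual recovery: reading !log records out of the tool's Elasticsearch data and formatting them as SAL-style wikitext bullets. The field names and record shape here are assumptions for illustration, not Stashbot's documented schema:

```python
# Sketch: turn !log records (as dicts pulled from the tool's data) into
# SAL-style wikitext bullet lines, newest first. Field names (@timestamp,
# user, message) are assumed, not taken from a documented schema.
from datetime import datetime, timezone

def sal_wikitext(records):
    """Format !log records as '* HH:MM user: message' lines, newest first."""
    lines = []
    for rec in sorted(records, key=lambda r: r["@timestamp"], reverse=True):
        ts = datetime.fromtimestamp(rec["@timestamp"], tz=timezone.utc)
        lines.append(f"* {ts:%H:%M} {rec['user']}: {rec['message']}")
    return "\n".join(lines)

records = [
    {"@timestamp": 1596027600, "user": "andrew", "message": "restarted nova-api"},
    {"@timestamp": 1596031200, "user": "bd808", "message": "wikitech read-only for db move"},
]
print(sal_wikitext(records))
```

The output is plain wikitext, so in the worst case it can be pasted into the day's SAL section by hand.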

One hour RO sounds just fine. Two would also be fine :)

Excellent - thank you guys.
I will get a procedure written and pasted here in a few days!

Marostegui moved this task from Backlog to Next on the DBA board.Aug 11 2020, 6:18 AM
Marostegui moved this task from Next to Ready on the DBA board.Oct 13 2020, 11:19 AM

Checking in -- could we go ahead and make this move after the datacenter switchover?

The DC switchover is tomorrow, so we can try to plan for it in Q2, if we find the time for it.
I will ping you once I've come up with a plan!

@LSobanski, I would very much like to see movement on this task (mostly because it will unblock T237773 which will remove a substantial thorn from my side). My understanding is that this is a fairly small amount of work; if necessary I can schedule downtime for Wikitech during the transition if that makes things easier. Is it possible to get this on the DBA roadmap?


@Andrew I am not sure this is a fairly small task. Even if the process looks simple (RO on wikitech + mysqldump + importing the dump on the destination hosts), it will require quite some babysitting: we need to place all the data on each host of the selected MW section (either s5 or s6). That means we need to depool all the hosts, one by one, and then place this data on the master, which can be very delicate and has the potential to overload the primary master. So this needs quite a bit of planning, and all the steps need to be crystal clear before proceeding.

Regarding the MW side of things, I am also not sure whether these are the right steps; perhaps you've thought about the MW steps and can provide some clarification on the steps below:

  • RO for wikitech on dbctl
  • do all the data importing
  • change db-eqiad.php to move wikitech to either s5 or s6 and deploy it to production
  • change dblists to add labswiki to whatever section we decide to place it and deploy it to production
  • Remove s10 from dbctl? I am not sure about this step (we've never left a MW section empty, so I am not sure how this removal is done or what the implications are)
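On the dblists point: the dblists in mediawiki-config are plain text files with one wiki dbname per line, so that step amounts to moving labswiki from one section list to another. A hedged sketch of that change (the file names and neighboring wikis are illustrative, and whether the target is s5 or s6 is still undecided):

```python
# Sketch: model each dblist file as a list of dbnames and move a wiki
# between two of them, as the dblists step above would do on disk.

def move_wiki(dblists, wiki, src, dst):
    """Return new dblist contents with `wiki` moved from `src` to `dst`."""
    new = {name: list(wikis) for name, wikis in dblists.items()}
    if wiki not in new[src]:
        raise ValueError(f"{wiki} not listed in {src}")
    new[src].remove(wiki)
    new[dst] = sorted(new[dst] + [wiki])  # dblists are kept sorted
    return new

dblists = {
    "s10.dblist": ["labswiki"],          # wikitech's current dedicated section
    "s5.dblist": ["cebwiki", "dewiki"],  # hypothetical target section contents
}
updated = move_wiki(dblists, "labswiki", "s10.dblist", "s5.dblist")
print(updated["s10.dblist"])
print(updated["s5.dblist"])
```

After the move, s10.dblist is empty, which is exactly the "empty MW section" situation the last checklist item asks about.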
LSobanski moved this task from Ready to Refine on the DBA board.Mon, Jan 18, 5:13 PM