Move wikitech and labstestwiki to s5
Open, Stalled, Lowest, Public

Description

Historically wikitech (and labtestwikitech) have been deployed as self-contained systems with a local MySQL/MariaDB database server co-located with the MediaWiki software. This was done in an attempt to maintain availability of the technical documentation on wikitech in the event of outages in other parts of the network.

Today, we are monitoring and responding to alerts about staleness on the off-site hosted wikitech-static server. The cloud-services-team is also moving forward with a long term plan to remove LDAP auth and OpenStackManager from wikitech so that it can become a normal SUL wiki. These changes mean that eventually wikitech will be hosted in the common MediaWiki wiki farm (T161859). We will rely on the offsite copy of documentation for severe outage support rather than maintaining a single point of failure machine and hoping that it is unaffected.

Moving the labswiki and labtestwiki databases from silver/labtestweb2001.wikimedia.org to the s3 slice will benefit us with better maintenance, upgrades, monitoring and high availability. Making this move sooner rather than later will reduce the potential to lag behind in schema updates and cause issues with cross-wiki maintenance like we have seen in the past (e.g. T167961).

This could happen at the same time as the other moves being prepared for the s8 setup (moving several other wikis off s3); see T140746.

Event Timeline

jcrespo created this task. Jun 15 2017, 2:50 PM
Restricted Application added a subscriber: Aklapper. Jun 15 2017, 2:50 PM
Restricted Application added a project: Cloud-Services. Jun 15 2017, 2:51 PM

Something OAuth-related in Horizon seems to be the largest blocker. What do you know about that? How would it fit with the main infrastructure?

bd808 added a subscriber: bd808. Jun 15 2017, 3:19 PM

Something OAuth-related in Horizon seems to be the largest blocker. What do you know about that? How would it fit with the main infrastructure?

We have actually changed how that works in Horizon. The OATH (two-factor) checks now use a MediaWiki Action API call. There is no longer a need to reach into the labswiki database directly.
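
As a rough illustration of what a check made through the Action API (rather than a direct database read) might look like, the sketch below builds, but does not send, such a request. The endpoint URL, the module name `oathvalidate`, and the parameter names `user`, `totp` and `token` are assumptions for illustration; the thread does not say which API module Horizon calls.

```python
from urllib.parse import urlencode

# Assumed endpoint; the thread does not give the URL Horizon uses.
API_URL = "https://wikitech.wikimedia.org/w/api.php"


def build_oath_check_request(user, totp, csrf_token):
    """Return (url, form-encoded body) for a hypothetical two-factor check.

    The module and parameter names are assumptions, not confirmed by the
    thread. Nothing is sent over the network here.
    """
    params = {
        "action": "oathvalidate",  # assumed module name
        "format": "json",
        "user": user,
        "totp": totp,
        "token": csrf_token,
    }
    # Returning a POST body keeps the one-time code out of the URL.
    return API_URL, urlencode(params)


if __name__ == "__main__":
    url, body = build_oath_check_request("ExampleUser", "123456", "abc+\\")
    print(url)
    print(body)
```

The point of the design is that the caller only needs HTTP access to the wiki, so the labswiki database can move without Horizon needing its own database grants.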

Andrew added a subscriber: Andrew. Jun 15 2017, 3:23 PM

I am pretty sure that this is fine. I would like to be present and alert during the switchover, though, in case I'm forgetting about corner cases.

jcrespo added a subscriber: Marostegui. Edited Jun 15 2017, 3:24 PM

That (which is a really good thing for separation of concerns, thank you!) would mean that the generic grants would work (I think), and we should just use them from now on. CC @Marostegui for the other ticket.

Marostegui moved this task from Triage to Backlog on the DBA board. Jun 16 2017, 8:44 AM
bd808 moved this task from Triage to Database on the Cloud-Services board. Jun 17 2017, 7:59 PM
bd808 edited projects, added Data-Services; removed Cloud-Services. Jul 5 2017, 12:40 AM

I cannot think of any reason why wikitech's database (and the similar labstestwiki) is not part of s3.

Historically, wikitech was separate because it held important documentation that ops would want to continue to access if most of the site went down. Now we have wikitech-static, so...

Andrew added a comment. Oct 5 2017, 2:45 PM

Is there anything I can do to nudge this along, short of 'clone Jaime'?

jcrespo added a comment. Edited Oct 5 2017, 2:59 PM

I think the first thing is to amend the description so that cloud (in particular), or anyone else agrees with the plan. Other than that, the s8 setup will probably take precedence, and only then can we migrate the wiki. In theory it is simple, all the more so because it is a small wiki, but because it requires read-only time, coordination, an announcement, checking the firewall, deployment help (with mediawiki-config), application changes, etc., it is a bit involved. My estimated time would be January, assuming everybody is on board with it.

Andrew renamed this task from move wikitech and labstestwiki to s3 (needs discussion) to move wikitech and labstestwiki to s3. Oct 5 2017, 4:02 PM
bd808 renamed this task from move wikitech and labstestwiki to s3 to Move wikitech and labstestwiki to s3. Oct 5 2017, 4:08 PM
bd808 triaged this task as Normal priority.
bd808 updated the task description. (Show Details)

I think the first thing is to amend the description so that cloud (in particular), or anyone else agrees with the plan.

{{done}}

Is there anything I can do to nudge this along, short of 'clone Jaime'?

Also, don't underestimate how much time you can save by helping with communication or trivial patches (especially general ops work, like the firewall). Everything ends up piling up, and I know the firewall, for example, exactly as well as you do; in the case of mediawiki-config, you probably have much more experience than I do.

Should we merge this into T184805: Move some wikis to s5? From the cloud-services-team side we don't have a concern about which particular prod slice the two databases end up sitting on. We just want to get into the nice world of redundant DBs that are managed the same as the other sister wikis.

We still don't have a clear idea what will be moved to where, but it is good to know you guys don't really mind. Thanks!

At the moment, we are also waiting for Release-Engineering-Team to provide some feedback (T184805#3896861).

(quoting for future reference)

<jynus> Jaime Crespo I would personally export a copy of labswiki, import it on each s5 node disabling the binary logging, start replication from silver to s5-master-eqiad with some filters, and then a mostly hot failover
8:09 AM not immediate, that is why we have left it for later
8:09 AM we also need to analyze if there are tables with private data to setup sanitarium filters
8:10 AM setup backups
8:10 AM setup labs grants
8:10 AM etc., which takes some time
<jynus> Jaime Crespo the best way to accelerate that work is *helping* by doing as much work in advance or proposing patches for the whole process and helping pacing the way, "we are ready for you to start doing all the work" is less tempting, if you know what I mean 0:-)
8:15 AM s/pacing/paving/

Andrew renamed this task from Move wikitech and labstestwiki to s3 to Move wikitech and labstestwiki to s5. Feb 21 2018, 9:47 PM
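
The approach quoted from IRC above (export, import with binary logging disabled, filtered replication, then a mostly hot failover) can be sketched as an ordered list of commands. This is only a reviewable outline under assumptions: the function name, the host and user names, and the exact statements are illustrative choices, the empty database is assumed to already exist on the target, and binlog coordinates are omitted. It prints commands; it does not run anything.

```python
def migration_plan(db_name, source_host, repl_user):
    """Return ordered (description, command) pairs for the quoted approach.

    Commands are meant for human review only; nothing is executed here.
    """
    return [
        ("Export a consistent copy on the source host",
         f"mysqldump --single-transaction --master-data=2 {db_name} > {db_name}.sql"),
        ("Import on each target node with binary logging off for the session",
         f"SET SESSION sql_log_bin = 0; USE {db_name}; SOURCE {db_name}.sql;"),
        ("On the target master, filter replication down to this database",
         f"SET GLOBAL replicate_wild_do_table = '{db_name}.%';"),
        ("Point the target master at the source (binlog coordinates omitted)",
         f"CHANGE MASTER TO MASTER_HOST='{source_host}', MASTER_USER='{repl_user}';"),
        ("Start replicating from the source", "START SLAVE;"),
        ("Failover, step 1: make the source read-only",
         "SET GLOBAL read_only = 1;"),
        ("Failover, step 2: once caught up, detach the target from the source",
         "STOP SLAVE; RESET SLAVE ALL;"),
    ]


if __name__ == "__main__":
    # Hypothetical host and user names, for illustration only.
    for number, (what, command) in enumerate(
            migration_plan("labswiki", "source.example", "repl"), start=1):
        print(f"{number}. {what}\n   {command}")
```

Disabling the binary log during the import keeps the bulk load from being replicated a second time to nodes that already received their own copy.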

Proposed checklist:

  • exclude labswiki and labtestwiki from dumps and tools replicas (this may happen by default but I'm not sure)
  • set up replica of the silver labswiki db (read-only) on s5
  • set up replica of the labtestweb labtestwiki db (read-only) on s5
  • create grants for labswiki (permitting access to mediawiki on silver and labweb1001/1002)
  • if all is well, stop replication with silver and labtestweb, convert s5 databases to read/write
  • Figure out wikitech-static syncing. Can the script that dumps the whole database still run on silver? If so then this should be fairly simple.
  • now we can build out 'newwikitech' on labweb1001 in a way that's mostly decoupled from the above -- they'll be additional db hosts using the same db
  • once 'newwikitech' is safe and sound, rename it to 'wikitech' so that no web traffic is hitting silver anymore
  • move wikitech-static sync logic from silver to labweb1001
  • turn off silver, cheer

Core dbs being accessed from non-core application servers is something that needs checking. We do not have any of those right now, and many things can go wrong: firewall, grants, routing, etc. I am not saying it cannot be done, just that it is the first time, and it requires detailed planning.

If it makes anything any easier, I think we can easily tolerate a multi-hour (probably up to 24?) read-only period for wikitech while the database is moved. Setting up replication from silver to the s5 master seems like unnecessary complication unless the process of importing the data dump from silver into the s5 cluster takes days.

Figure out wikitech-static syncing. Can the script that dumps the whole database still run on silver? If so then this should be fairly simple.

This should probably move to terbium (or whatever the main cron host is) once it's reading from the main db cluster. Does this need access to local media files though? That may be a complication for dual wikitech hosts that we haven't fully thought through yet.

See T188029: Move labswiki database to m5 for near term plan to move off of silver. This will satisfy the immediate needs for FY17/18 Q3 goals.

chasemp changed the task status from Open to Stalled. Mar 13 2018, 1:52 PM
chasemp lowered the priority of this task from Normal to Lowest.

Should we maybe change db-codfw.php to use db1073 instead of db2037 until we come up with a better solution, or would that not fix the issue?
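
To make the proposed change concrete, here is a simplified Python model of repointing one section from db2037 to db1073. The real db-codfw.php is PHP and its structure is not shown in this thread, so the dictionary layout, the section name `wikitech`, the load value, and the helper name `repoint_section` are all hypothetical.

```python
import copy

# Hypothetical, simplified stand-in for a per-datacenter section-to-host map.
CODFW_SECTIONS = {"wikitech": {"db2037": 1}}


def repoint_section(sections, section, old_host, new_host):
    """Return a copy of `sections` with `old_host` replaced by `new_host`.

    The input mapping is left untouched so the change is easy to revert.
    """
    updated = copy.deepcopy(sections)
    hosts = updated[section]
    if old_host not in hosts:
        raise KeyError(f"{old_host} is not configured for section {section}")
    # Carry the existing load weight over to the new host.
    hosts[new_host] = hosts.pop(old_host)
    return updated


if __name__ == "__main__":
    print(repoint_section(CODFW_SECTIONS, "wikitech", "db2037", "db1073"))
```

Note that a change like this makes codfw application servers query a host with an eqiad-style name, which is the cross-DC traffic the replies below discuss.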

jcrespo added a comment. Edited Sep 4 2018, 7:17 AM

Cross-DC queries: I would try to avoid them unless we are in an emergency. I see two options: either create a temporary new section for wikitech (as we planned to do for labstestwiki), or accelerate the import to s5 (for which the DC failover could be a good time), or both.

I would prefer option 2, import wikitech to s5. However, I am not sure how many blockers we have along the way and if we could get them resolved before the DC failover (considering the fact that we have some other stuff before the failover to get ready).

We can do it on switchback; that is why I suggested creating a temporary host for wikitech. Of course, that will increase the chances of a split brain, as m5-master will be kept in rw mode. Not an easy problem to solve, honestly.

Yeah, I am not sure whether I would prefer a split brain or cross-DC queries only for wikitech (that's why I suggested changing db-codfw.php to point to db1073).

There is stuff that constantly tries to write to db2037 (even now) but fails because of read-only: https://logstash.wikimedia.org/goto/bd4966ed0f1b5b7d576e239636fbd9fa

jijiki added a subscriber: jijiki. Sep 5 2018, 3:47 PM
Andrew added a comment. Sep 5 2018, 5:34 PM

Hi all! I'm a bit lost because I think this task no longer has anything to do with its original post (which is about moving the databases off of the local wikitech server, long since done but to m5 rather than s5.) If I understand it, there are two different issues under discussion:

  1. What happens to wikitech during the DC failover, when m5 becomes read-only.
  2. labtestwikitech not having access to a read/write database since there aren't any in codfw (except temporarily during failovers).

Is that correct? If so, I propose that we discuss issue 2 under T201082 (as I think that states #2 as a problem). As for issue 1... I'm hoping that m5 will /not/ go read-only during the switch-over since everything public to do with WMCS is currently eqiad-only.

And, for that matter, it seems like issue 1 is a good argument for /not/ moving wikitech to s5 until it stops being hosted on a special server altogether.

I apologize if I'm missing the point here :)

In the end nothing is needed. m5 will not be read only in eqiad. Wikitech traffic will be redirected to eqiad and thus to m5. So nothing to worry about.

Andrew added a comment. Sep 5 2018, 5:38 PM

In the end nothing is needed. m5 will not be read only in eqiad. Wikitech traffic will be redirected to eqiad and thus to m5. So nothing to worry about.

Great! That's what I thought :) Should we close this bug in favor of the more expansive T161859 ?

I am happy to merge this task with T161859 if @jcrespo is (as he is the original task creator).

Why not just make this a child of the other and agree to talk in a single place only? This task is mostly for the DBA work of data migration, which is only a small part of the SUL migration. We can agree not to discuss anything here that is not the literal migration.