Deploy Phabricator with scap
Closed, Resolved. Public. 5 Estimated Story Points

Description

Currently, we deploy Phabricator fairly manually, using the steps outlined here:

https://wikitech.wikimedia.org/wiki/Phabricator/Deployment

We should use scap instead.

Originally, we planned to get that working on phab2001, then use it for the new machines (phab1002, phab2002).

As an intermediate step, we're planning to test the necessary puppet changes and the new scap configuration with the local puppetmaster in devtools.

Event Timeline

Restricted Application added a subscriber: Aklapper. Jul 18 2022, 6:53 PM
brennen changed the task status from Open to In Progress. Jul 21 2022, 5:39 PM
brennen moved this task from To Triage to Infrastructure on the Phabricator board.

Notes so far:

  • Deploy targets live in scap/phabricator-targets in the repo
  • /srv/phab is a symlink to /srv/deployment/phabricator/deployment
    • This is managed in operations/puppet/modules/phabricator/manifests/init.pp - $phabdir linked to $deploy_root
  • /srv/phab/phabricator/conf
    • conf/* is in .gitignore in the phab repo, so presumably anything here disappears after a scap deploy, since the target of the /srv/phab symlink is replaced with a new checkout?
    • But this is $confdir in puppet and there's at least some stuff for writing local/local.json - is this getting applied?
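
One way to answer the two questions above is to compare the symlink target and the config file before and after a deploy. This is only a sketch: the function name check_deploy_conf is made up for illustration, and the paths in the usage line are the ones from the notes, not verified against a deployed host.

```shell
# Hypothetical helper: show where a deploy symlink currently points and
# whether a given untracked config file exists under that checkout.
# Returns 0 if the file is present, 1 if it is missing, 2 if the link
# cannot be resolved. Relies on GNU readlink -f.
check_deploy_conf() {
    link=$1        # e.g. /srv/phab
    conf_rel=$2    # e.g. phabricator/conf/local/local.json
    target=$(readlink -f "$link") || return 2
    printf 'target: %s\n' "$target"
    if [ -e "$target/$conf_rel" ]; then
        printf 'conf: present\n'
    else
        printf 'conf: missing\n'
        return 1
    fi
}
```

Running check_deploy_conf /srv/phab phabricator/conf/local/local.json once before and once after a scap deploy would show whether the target changed and whether puppet's local.json is still there.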

Neglected to add a Bug field, but seeing what happens with this: 816016: scap: target phab2001 for trial run

Change 816027 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[phabricator/deployment@wmf/stable] scap: remove tag plugin & asciitable dependency

https://gerrit.wikimedia.org/r/816027

Change 816027 merged by Brennen Bearnes:

[phabricator/deployment@wmf/stable] scap: remove tag plugin & asciitable dependency

https://gerrit.wikimedia.org/r/816027

scap deploy -v -l 'phab2001.codfw.wmnet' fails from deploy1002:

...
Received disconnect from 10.192.32.147 port 22:2: Too many authentication failures                                                                
Disconnected from 10.192.32.147 port 22

There's a phabricator key:

debug1: Will attempt key: /etc/keyholder.d/phabricator

...but I don't think it gets as far as offering this before erroring out.

10.64.32.28 is deploy1002.

In the logs on phab2001, looking for connections from deploy1002:

Jul 19 14:23:02 phab2001 sshd[13278]: Connection from 10.64.32.28 port 53702 on 10.192.32.147 port 22
Jul 19 14:23:03 phab2001 sshd[13278]: Accepted key ED25519 ... found at /etc/ssh/userkeys/scap:1

Jul 19 14:23:03 phab2001 sshd[13316]: Starting session: command for scap from 10.64.32.28 port 53702 id 0
Jul 19 14:23:04 phab2001 sshd[13316]: Received disconnect from 10.64.32.28 port 53702:11: disconnected by user
Jul 19 14:23:04 phab2001 sshd[13316]: Disconnected from user scap 10.64.32.28 port 53702

Jul 21 22:02:54 phab2001 sshd[27864]: Accepted key ED25519 SHA256:...found at /etc/ssh/userkeys/scap:1
Jul 21 22:02:54 phab2001 sshd[27864]: Accepted publickey for scap from 10.64.32.28 port 54460 ssh2: ED25519 SHA256:...
Jul 21 22:02:54 phab2001 sshd[27864]: pam_unix(sshd:session): session opened for user scap by (uid=0)
Jul 21 22:02:54 phab2001 systemd-logind[967]: New session 34676 of user scap.
Jul 21 22:02:55 phab2001 sshd[27864]: User child is on pid 27887
Jul 21 22:02:55 phab2001 sshd[27887]: Starting session: command for scap from 10.64.32.28 port 54460 id 0
Jul 21 22:02:55 phab2001 sshd[27887]: Received disconnect from 10.64.32.28 port 54460:11: disconnected by user
Jul 21 22:02:55 phab2001 sshd[27887]: Disconnected from user scap 10.64.32.28 port 54460
Jul 21 22:02:55 phab2001 sshd[27864]: pam_unix(sshd:session): session closed for user scap

root@deploy1002:/home/dzahn# ssh -i /etc/keyholder.d/phabricator scap@phab2001.codfw.wmnet

Jul 22 20:12:39 phab2001 sshd[27629]: Failed publickey for scap from 10.64.32.28

For the scap user, the equivalent command uses the scap key: ssh -i /etc/keyholder.d/scap scap@phab2001.codfw.wmnet

Run as root, though, that one fails with:

Load key "/etc/keyholder.d/scap": bad permissions..will be ignored.

This is how it actually works: use the SSH_AUTH_SOCK from keyholder, connect as the correct "phab-deploy" user, and don't try it as root:

[deploy1002:~] $ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -oIdentitiesOnly=yes -oIdentityFile=/etc/keyholder.d/phabricator phab-deploy@phab2001.codfw.wmnet

Linux phab2001 4.19.0-20-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64
Debian GNU/Linux 10 (buster)
phab2001 is a Phabricator (Main) Server (phabricator)
Backed up on this host: srv-repos

login as deployment user from deploy1002 to phab2001, with keyholder

@brennen It seems to me the issue is that it's trying to connect as "scap" when it should use the "phab-deploy" user. It should then work with the keyholder SSH_AUTH_SOCK, which has the /etc/keyholder.d/phabricator key loaded.
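
For repeating the manual check against other targets, the working command can be wrapped in a small function. This is a sketch, not something scap provides: the socket path, ssh options and key directory are copied from the command that worked above, while the function name keyholder_ssh and the SSH_BIN override (for dry runs) are assumptions made for illustration.

```shell
# Run ssh through the keyholder agent socket, offering only the named key.
# Socket path and options mirror the manual command that worked;
# SSH_BIN is a hypothetical override so the call can be dry-run.
keyholder_ssh() {
    user=$1
    host=$2
    key=$3
    shift 3
    SSH_AUTH_SOCK=/run/keyholder/proxy.sock \
        "${SSH_BIN:-ssh}" -oIdentitiesOnly=yes \
        -oIdentityFile="/etc/keyholder.d/$key" \
        "$user@$host" "$@"
}
```

For example, keyholder_ssh phab-deploy phab2001.codfw.wmnet phabricator true should exit 0 if the deploy user and key are accepted, and it has to be run as a user that is allowed to use the keyholder socket, not as root.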

Change 816223 had a related patch set uploaded (by Brennen Bearnes; author: Brennen Bearnes):

[phabricator/deployment@wmf/stable] scap: set keyholder_key: phabricator

https://gerrit.wikimedia.org/r/816223

Change 816223 merged by Brennen Bearnes:

[phabricator/deployment@wmf/stable] scap: set keyholder_key: phabricator

https://gerrit.wikimedia.org/r/816223

Paired with @jeena today to experiment with scap3 command checks for service restarts and database migrations. I'll try to finish bashing out a patch for that tomorrow; after that, I think it's mostly the config files left to worry about.

dduvall updated the task description.

Change 828619 had a related patch set uploaded (by Dduvall; author: Dduvall):

[mediawiki/tools/scap@master] checks: Define environment variables for current/done/revs dirs

https://gerrit.wikimedia.org/r/828619

Change 828619 merged by jenkins-bot:

[mediawiki/tools/scap@master] checks: Define environment variables for current/done/revs dirs

https://gerrit.wikimedia.org/r/828619

Change 830682 had a related patch set uploaded (by Dduvall; author: Dduvall):

[operations/puppet@production] phabricator: Allow deploy user to preserve environment when sudoing

https://gerrit.wikimedia.org/r/830682

Change 831634 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: Allow deploy user to keep scap3 environment variables with sudo

https://gerrit.wikimedia.org/r/831634

Change 831634 merged by Dzahn:

[operations/puppet@production] phabricator: Allow deploy user to keep scap3 environment variables with sudo

https://gerrit.wikimedia.org/r/831634

Change 831637 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: move scap user sudo defaults to file, fix puppet

https://gerrit.wikimedia.org/r/831637

Change 831637 merged by Dzahn:

[operations/puppet@production] phabricator: move scap user sudo defaults to file, fix puppet

https://gerrit.wikimedia.org/r/831637

Change 831638 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] phabricator: use content, not source with a plain file

https://gerrit.wikimedia.org/r/831638

Change 831638 merged by Dzahn:

[operations/puppet@production] phabricator: use content, not source with a plain file

https://gerrit.wikimedia.org/r/831638

Change 831965 had a related patch set uploaded (by Dduvall; author: Dduvall):

[operations/puppet@production] phabricator: Fix sudo env_keep format

https://gerrit.wikimedia.org/r/831965

Change 831965 merged by Dzahn:

[operations/puppet@production] phabricator: Fix sudo env_keep format

https://gerrit.wikimedia.org/r/831965

We now have the following sudo config on all Phabricator servers:

any command as phab-deploy:

cat /etc/sudoers.d/scap_phab-deploy 
# This file is managed by Puppet!

phab-deploy ALL=(phab-deploy) NOPASSWD: ALL

the deploy commands as root run by phab-deploy:

cat /etc/sudoers.d/scap_sudo_rules_phab-deploy_phabricator_deployment 
# This file is managed by Puppet!

phab-deploy ALL=(root) NOPASSWD: /usr/local/sbin/phab_deploy_config_deploy
phab-deploy ALL=(root) NOPASSWD: /usr/local/sbin/phab_deploy_promote
phab-deploy ALL=(root) NOPASSWD: /usr/local/sbin/phab_deploy_rollback
phab-deploy ALL=(root) NOPASSWD: /usr/local/sbin/phab_deploy_finalize

allow keeping the scap related env variables (but only those, not everything!)

cat /etc/sudoers.d/scap_sudo_defaults 
Defaults:phab-deploy env_keep+="SCAP_REVS_DIR SCAP_FINAL_PATH SCAP_REV_PATH SCAP_CURRENT_REV_DIR SCAP_DONE_REV_DIR"

Tested like this:

  • temporarily replaced the content of /usr/local/sbin/phab_deploy_config_deploy with just 'echo $SCAP_REVS_DIR'
  • become phab-deploy and export SCAP_REVS_DIR

phab-deploy@phab2001:/$ export SCAP_REVS_DIR="/tmp"

  • run the "fake deploy config" command with sudo:

phab-deploy@phab2001:/$ sudo /usr/local/sbin/phab_deploy_config_deploy

exported value is here:

/tmp
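
The same idea can be turned into a guard at the top of a deploy check script, so that a missing variable fails loudly instead of producing an empty path. This is a sketch under assumptions: the variable names are the ones from the env_keep line above, but the function name require_scap_env is hypothetical and nothing here says the real phab_deploy_* scripts do this.

```shell
# Fail if any scap variable listed in env_keep did not survive sudo.
# Prints each missing name to stderr and returns 1 if any are missing.
require_scap_env() {
    missing=0
    for name in SCAP_REVS_DIR SCAP_FINAL_PATH SCAP_REV_PATH \
                SCAP_CURRENT_REV_DIR SCAP_DONE_REV_DIR; do
        # Indirect lookup of the variable called $name (names are fixed above).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            printf 'missing: %s\n' "$name" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A script run via sudo could then start with require_scap_env || exit 1 before touching any of those paths.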

I edited the checkboxes a bit: there is no phab1002, but there is phab1004; phab2002 can be used for testing first; and phab2001 can be skipped unless you want more test hosts. I think we can:

  • ignore phab2001 (we could decom it now and it wouldn't make a big difference)
  • deploy to phab2002 (nothing much can go wrong)
  • deploy to phab1004 (nothing much can go wrong)
  • double check everything
  • deploy to phab1001 (actual prod host, carefully)

Here's the result of a recent test deploy to phab2002:

[...]
returned [1]: Executing check 'finalize'
Check 'finalize' failed: 
   ->Running puppet...
Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'phabricator deployment - phab-deploy');                                         
Use 'puppet agent --enable' to re-enable.

   ->Applying storage migrations
[2022-09-14 21:37:44] PHLOG: 'Retrying database connection to "m3-slave.codfw.wmnet" after connection failure (attempt 1; "AphrontConnectionQueryException"; error #2002): Attempt to connect to phabricatorphd@m3-slave.codfw.wmnet failed with error #2002: Connection timed out.' at [/srv/deployment/phabricator/deployment-cache/revs/3137c9217337d2b6fd1312f328f7e366abbf0d75/phabricator/src/infrastructure/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:136]
[2022-09-14 21:37:45] PHLOG: 'Retrying database connection to "m3-slave.codfw.wmnet" after connection failure (attempt 2; "AphrontConnectionQueryException"; error #2002): Attempt to connect to phabricatorphd@m3-slave.codfw.wmnet failed with error #2002: Connection timed out.' at [/srv/deployment/phabricator/deployment-cache/revs/3137c9217337d2b6fd1312f328f7e366abbf0d75/phabricator/src/infrastructure/storage/connection/mysql/AphrontBaseMySQLDatabaseConnection.php:136]
MySQL Credentials Not Configured

Unable to connect to MySQL using the configured credentials. You must
configure standard credentials before you can upgrade storage. Run these
commands to set up credentials:

  $ ./bin/config set mysql.host __host__
  $ ./bin/config set mysql.user __username__
  $ ./bin/config set mysql.pass __password__

These standard credentials are separate from any administrative credentials
provided to this command with __--user__ or __--password__, and must be
configured correctly before you can proceed.

Raw MySQL Error: Attempt to connect to phabricatorphd@m3-slave.codfw.wmnet
failed with error #2002: Connection timed out.

   ->Restarting PHD
Job for phd.service failed because the control process exited with error code.
See "systemctl status phd.service" and "journalctl -xe" for details.

   ->Reloading apache

   ->Enabling puppet agent

   ->Verifying database status

<13>Sep 14 21:38:51 phab-deploy: >>>ERROR: Phabricator storage is in a bad state.


phabricator/deployment: finalize stage(s): 100% (in-flight: 0; ok: 0; fail: 1; left: 0) |                                                                                
21:38:51 1 targets had deploy errors
21:38:51 1 targets failed
21:38:51 Finished deploy [phabricator/deployment@3137c92]: testing phabricator deployment to phab2002 (duration: 01m 48s)                                                
21:38:51 Finished deploy [phabricator/deployment@3137c92] (duration: 01m 48s)

I verified manually that this is indeed a timeout to m3-slave.codfw.wmnet and not invalid credentials (well, we don't know that yet).

Thanks for the report. We had been told to use -slave and that there should now be a misc cluster in codfw, which did not exist in the past.

I will go back to the DBAs and ask why that does not seem to be the case.

It seems like we are not supposed to use default port 3306 but one of 3321 or 3322 (judging by the iptables rules on db2160). Now figuring out which one.

It's port 3323. Determined with ps aux and netstat (there are multiple mysql daemons per host).
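
With several daemons per host, the port list can be pulled out of the netstat output directly rather than read by eye. This is a sketch: it assumes the usual net-tools netstat -ltnp column layout (local address in the fourth column) and that the daemons show up as mysqld or mariadbd; the function name mysql_listen_ports is made up for illustration.

```shell
# Read `netstat -ltnp`-style lines on stdin and print the distinct TCP
# ports that mysqld/mariadbd processes are listening on, sorted numerically.
# The port is whatever follows the last ':' of the local address column,
# which also covers IPv6 entries such as ":::3323".
mysql_listen_ports() {
    awk '/mysqld|mariadbd/ { n = split($4, a, ":"); print a[n] }' | sort -un
}

# Usage (as root, so that process names are shown):
#   netstat -ltnp | mysql_listen_ports
```

Matching a port to a specific instance still needs the ps aux output, since the listing alone does not say which daemon serves which section.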

reopening T315713 because I could confirm that this works:

mysql -h m3-slave.eqiad.wmnet -P 3323 -u phstats -D phabricator_project -p

but the equivalent in codfw does not:

mysql -h m3-slave.codfw.wmnet -P 3323 -u phstats -D phabricator_project -p

ERROR 1045 (28000): Access denied for user 'phstats'@'10.192.32.54' (using password: YES)

It does not work from phab2001 either.

Looks like we need to care about MySQL GRANTs after all, but only in codfw, not in eqiad: using m3-master means going via dbproxy*, while using m3-slave means connecting directly to a db host.
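
Since this task has now hit both failure modes (a connection timeout and an access-denied error), it may help to tell them apart mechanically when testing each host. This is a sketch: 1045 and 2002/2003 are the standard MySQL error codes for access denied and connection failure, but the function name classify_mysql_error and the grants/network labels are made up for illustration.

```shell
# Classify a MySQL client error message as a grants problem (server
# reachable, user/host not allowed) or a network problem (no connection).
classify_mysql_error() {
    case $1 in
        *"ERROR 1045"*|*"Access denied"*)
            echo grants ;;
        *"error #2002"*|*"ERROR 2002"*|*"ERROR 2003"*|*"timed out"*)
            echo network ;;
        *)
            echo unknown ;;
    esac
}
```

Fed the messages from this task, the error from port 3306 classifies as network and the one from port 3323 as grants, which matches the conclusion that the remaining work is in the GRANTs rather than the firewall.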

So modules/profile/templates/mariadb/grants/production-m3.sql.erb has 259 lines, and basically all of them are Phabricator-related GRANTs.

Some are for db hosts, some for dbproxy, with multiple users and multiple databases.

All of that is needed again for the new IPs, and we need to figure out which proxy to use, if any.

Please also see T315713#8243258 and try one more time.

Never mind, I don't think it's going to work yet; reopened T315713#8243271

Change 830682 abandoned by Dduvall:

[operations/puppet@production] scap: Allow deploy user to keep scap3 environment variables with sudo

Reason:

Superseded by Ie44a16e6b6d883d67279f23ddf7adba7509c05b7

https://gerrit.wikimedia.org/r/830682

brennen moved this task from Doing/Involved to Done or Declined on the User-brennen board.

phab1004 is now the production Phabricator instance, deployed from scap.

Related commit f962d0eb pushed by brennen (author: Brennen Bearnes):

[ repos/phabricator/deployment@wmf/stable ] scap: remove tag plugin & asciitable dependency

Related commit 8a7d4bf1 pushed by brennen (author: Brennen Bearnes):

[ repos/phabricator/deployment@wmf/stable ] scap: set keyholder_key: phabricator