
Set up AQS in Beta
Closed, ResolvedPublic

Description

Now that the Analytics Query Service (AQS, wikitech) is in production and RESTBase will soon expose its public API, there should be an AQS instance in the Beta Cluster as well to allow full infrastructure-integration tests.

Event Timeline

mobrovac raised the priority of this task to Medium.
mobrovac updated the task description. (Show Details)
mobrovac added subscribers: mobrovac, Milimetric.
hashar set Security to None.
hashar moved this task from To Triage to Next: Feature on the Beta-Cluster-Infrastructure board.
hashar added a project: OKR-Work.
hashar subscribed.

@mobrovac:

Talking to Release-Engineering-Team, it doesn't seem that they own this item; they help other teams set things up in the Beta Cluster, but the setup itself (Puppet and such) is done by the teams themselves.

As an example, the Analytics team does maintenance of the EventLogging Beta Cluster setup, involving Ops and Release-Engineering-Team as needed. I think this is a fine paradigm to follow, because it seems unrealistic to expect Release-Engineering-Team to set up testing environments for every single service every team creates, and to own them. This needs to be done in a collaborative fashion.

@GWicke: Can we get a commitment from the Services team to work on this, so that Analytics has a better testing environment for the Pageview API for our upcoming changes?

@Nuria, this is part of the effort of creating a new deployment tool (called Scap3). RelEng and Services will make the initial deployment of AQS in Beta with it. That said, since AQS does not produce any data on its own, we'll need some fake data from you (a GB or two) to backfill the Beta Cluster Cassandra instance.

Once the initial deployment is completed, we will show you how to keep it updated, at which point ownership will be transferred to your team. Sounds fair?

I am speaking for myself, but I am pretty sure anyone from Release-Engineering-Team will be happy to assist and provide support. But do not expect us to take the lead on setting up every single software stack in use at the WMF. We are too small and can't realistically cover all the frontend / middleware / backend / infrastructure requirements involved.

In most cases the people best suited to add a new system to the Beta Cluster are members of the team that creates and leads it. That is far more effective than having RelEng try to figure it out. That being said, I think of our role as providing support for the Beta-Cluster-specific environment, such as using Hiera to vary configuration settings or the local puppetmaster to test Puppet changes before they get merged by Ops.

A good communication channel to raise awareness of such tasks, or to help clarify responsibility, is the Scrum-of-Scrums. We have a representative there, and that has already freed up a few tasks that were blocked due to misunderstandings about who was responsible for them.

Once the initial deployment is completed, we will show you how to keep it updated, at which point ownership will be transferred to your team. Sounds fair?

Will this include a full environment, RESTBase + AQS? Because that is what we are after.

What I mean here by Beta Cluster is the deployment-prep project in Labs, used as a testing mirror of production. Services' RESTBase instance is already there, so we need to set up AQS and connect them. We would still be responsible for our instance, while you'd need to take care of AQS.

So it seems AQS on beta is "just" waiting for T116335: Deploy RESTBase with scap3 / T114999: Deploy AQS with scap3, isn't it?

Only the latter; I'll add it as a blocker.

This morning @mobrovac and I set up deployment-aqs01 on Beta. We deployed via Scap3 successfully and without issue.

The AQS service is not 100% operational just yet; there are some Puppet tweaks that need to be made to get it working and resolve this task:

  • Cassandra for AQS needs some Hiera configuration (in hieradata/labs/deployment-prep/host/deployment-aqs01.yaml). At this point Cassandra won't start because it is trying to gossip with the other RESTBase nodes in Labs (likely pulling in the common Hiera configuration for Cassandra); see the sketch after this list.
  • AQS uses /srv/deployment/restbase/deploy as the deployed repository since it uses the restbase module. This should be /srv/deployment/aqs/deploy—less confusing, and actually matches the upstream repo. Possibly to be configured via hiera.
  • The service file for restbase specifies /usr/lib/restbase/deploy as the path for the RESTBase code, since it is symlinked to /srv/deployment/restbase/deploy. This should be updated to match the previous point, i.e. /srv/deployment/aqs/deploy.
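
For reference, a host-level Hiera override along these lines should cover the first point. This is only a sketch: the file path matches the bullet above, but the puppet repo location on the deployment-prep puppetmaster and the key names (cassandra::cluster_name, cassandra::seeds, aqs::deploy_dir) are assumptions and need to be checked against the current puppet tree.

# Sketch only -- run on the deployment-prep puppetmaster; the repo path and
# Hiera key names below are assumptions, not verified against ops/puppet.
sudo tee /var/lib/git/operations/puppet/hieradata/labs/deployment-prep/host/deployment-aqs01.yaml <<'EOF'
# Keep the beta AQS Cassandra instance in its own single-node cluster so it
# does not try to gossip with the RESTBase Cassandra nodes in labs.
cassandra::cluster_name: aqs-beta
cassandra::seeds:
  - deployment-aqs01.deployment-prep.eqiad.wmflabs
# Deploy AQS from its own repository path instead of restbase/deploy
# (hypothetical key name).
aqs::deploy_dir: /srv/deployment/aqs/deploy
EOF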

@mobrovac: those sound like things I can do? Let me know if you've started on them yet, and if not please assign this to me and I'll get started.

I can do the simple ops/puppet patch for setting the right stuff in deployment-prep. What we'd need from you, @Milimetric / @JAllemandou, is some sample or fake data that we can store in Cassandra there so queries succeed. 10 GB of data or less will do.
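
For illustration only, a backfill could look roughly like the following. The keyspace, table, and columns here are hypothetical placeholders (the real AQS Cassandra schema should be taken from production), and an actual multi-GB backfill would of course be loaded in bulk rather than row by row.

# Hypothetical example of inserting one fake pageview row into the beta
# Cassandra instance; keyspace/table/columns are placeholders, not the real
# AQS schema.
cqlsh deployment-aqs01.deployment-prep.eqiad.wmflabs -e "
  INSERT INTO fake_pageviews.per_article (project, article, granularity, dt, views)
  VALUES ('en.wikipedia', 'Main_Page', 'daily', '2016-01-01', 42);"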

Change 257406 had a related patch set uploaded (by Mobrovac):
AQS: Configure Cassandra for AQS in BetaCluster

https://gerrit.wikimedia.org/r/257406

Change 257406 merged by Alexandros Kosiaris:
AQS: Configure Cassandra for AQS in BetaCluster

https://gerrit.wikimedia.org/r/257406

Quick note: we are getting the following failure while trying to deploy:

elukey@deployment-tin:/srv/deployment/analytics/aqs/deploy$ deploy
14:30:43 Started Deploy: analytics/aqs/deploy
Entering 'src'
14:30:43
== DEFAULT ==
:* deployment-aqs01.deployment-prep.eqiad.wmflabs
14:30:43 ['/usr/bin/deploy-local', '-v', '--repo', 'analytics/aqs/deploy', '-g', 'default', 'fetch'] on deployment-aqs01.deployment-prep.eqiad.wmflabs returned [255]: Host key verification failed.

analytics/aqs/deploy: fetch stage(s): 100% (ok: 0; fail: 1; left: 0)
14:30:43 1 targets had deploy errors
Stage 'fetch' failed on group 'default'. Perform rollback? [y]: n
14:30:46 Finished Deploy: analytics/aqs/deploy (duration: 00m 02s)
elukey@deployment-tin:/srv/deployment/analytics/aqs/deploy$ keyholder status
keyholder-agent start/running, process 31934
- 4096 59:58:0c:eb:35:bc:7a:5d:76:6b:bc:ad:18:bc:d1:60 /etc/keyholder.d/eventlogging (RSA)
- 4096 cb:43:c3:76:3a:68:b0:29:1a:8f:3f:31:1f:87:f8:7c /etc/keyholder.d/mwdeploy (RSA)
- 4096 f9:49:1c:4c:4d:4a:b1:f2:20:1f:28:88:d3:27:9c:6f /etc/keyholder.d/phabricator (RSA)
- 4096 5d:ae:1b:24:40:20:89:b3:e1:74:51:9a:e7:64:a3:5d /etc/keyholder.d/servicedeploy (RSA)
keyholder-proxy start/running, process 31939
- 4096 59:58:0c:eb:35:bc:7a:5d:76:6b:bc:ad:18:bc:d1:60 /etc/keyholder.d/eventlogging (RSA)
- 4096 cb:43:c3:76:3a:68:b0:29:1a:8f:3f:31:1f:87:f8:7c /etc/keyholder.d/mwdeploy (RSA)
- 4096 f9:49:1c:4c:4d:4a:b1:f2:20:1f:28:88:d3:27:9c:6f /etc/keyholder.d/phabricator (RSA)
- 4096 5d:ae:1b:24:40:20:89:b3:e1:74:51:9a:e7:64:a3:5d /etc/keyholder.d/servicedeploy (RSA)

Are we missing something?

Blerg. We really need to automate known_hosts for scap targets in beta.

When connecting to a server for the first time, a fingerprint of the server's public key is presented to the user (unless the option StrictHostKeyChecking has been disabled). This authentication method closes security holes due to IP spoofing,
DNS spoofing, and routing spoofing. ssh automatically maintains and checks a database containing identification for all hosts it has ever been used with. Host keys are stored in ~/.ssh/known_hosts in the user's home directory. Additionally, the file /etc/ssh/ssh_known_hosts is automatically checked for known hosts.

The error Host key verification failed means that the client could not verify the server's public key fingerprint. It couldn't prompt to accept it because scap runs ssh in the background with -oBatchMode=yes, i.e. a non-interactive ssh session. You can see the failure by trying to ssh to the host using the keyholder proxy socket:

thcipriani@deployment-tin:~$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -oBatchMode=yes -l deploy-service deployment-aqs01.deployment-prep.eqiad.wmflabs
Host key verification failed.

The fix is to accept the host key in your local ~/.ssh/known_hosts:

thcipriani@deployment-tin:~$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service deployment-aqs01.deployment-prep.eqiad.wmflabs
The authenticity of host 'deployment-aqs01.deployment-prep.eqiad.wmflabs (10.68.18.237)' can't be established.
ECDSA key fingerprint is 1f:2a:34:31:8a:1c:5c:42:8f:96:6c:b4:d3:35:3b:39.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'deployment-aqs01.deployment-prep.eqiad.wmflabs' (ECDSA) to the list of known hosts.
Linux deployment-aqs01 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u4 (2016-02-29) x86_64
Debian GNU/Linux 8.2 (jessie)
deployment-aqs01 is a Puppet client of deployment-puppetmaster.deployment-prep.eqiad.wmflabs (puppetclient)
deployment-aqs01 is a Analytics Query Service Node (role::aqs)
The last Puppet run was at Fri Apr 29 14:52:26 UTC 2016 (21 minutes ago). 
Last login: Fri Apr 29 15:03:02 2016 from deployment-tin.deployment-prep.eqiad.wmflabs
deploy-service@deployment-aqs01:~$

In production this isn't a problem; we need to set something up for Beta to handle this (one possible approach is sketched after the command output below).

It should be fixed now for you since I've run:

root@deployment-tin:~# while read host; do ssh-keyscan -H "$host"; done < /srv/deployment/analytics/aqs/deploy/scap/aqs-deployment-prep >> /etc/ssh/ssh_known_hosts
# deployment-aqs01.deployment-prep.eqiad.wmflabs SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u2
# deployment-aqs01.deployment-prep.eqiad.wmflabs SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u2
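
To make this more permanent, one option would be to wrap that scan in a periodic job on the deploy host. This is just a sketch of the idea, not what was actually deployed; the find pattern for scap target files is a guess about the repo layout.

# Sketch: nightly job that scans every scap target list under /srv/deployment
# and appends any host keys not already present in the system-wide
# known_hosts file.
cat <<'EOF' | sudo tee /etc/cron.daily/scap-known-hosts
#!/bin/bash
# Collect every scap target list (file naming is an assumption) and make
# sure each listed host has an entry in /etc/ssh/ssh_known_hosts.
find /srv/deployment -type f -path '*/scap/*deployment-prep*' |
while read -r targets; do
    while read -r host; do
        # Only scan hosts we do not already know about.
        ssh-keygen -F "$host" -f /etc/ssh/ssh_known_hosts >/dev/null ||
            ssh-keyscan -H "$host" >> /etc/ssh/ssh_known_hosts
    done < "$targets"
done
EOF
sudo chmod +x /etc/cron.daily/scap-known-hosts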

Adding some info after chatting with @thcipriani: to avoid "Agent admitted failure to sign key", we had to add myself and @joal to the deploy-service group; otherwise we wouldn't have had access to the key in keyholder.
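
For the record, the change amounts to something like the following; the usernames are just the two mentioned above, and on a Labs instance group membership is normally managed through the project configuration rather than by hand, so treat this as illustrative only.

# Illustrative only: give the two deployers access to the keyholder key by
# adding them to the deploy-service group, then start a fresh login shell so
# the new group membership takes effect.
sudo usermod -a -G deploy-service elukey
sudo usermod -a -G deploy-service joal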

Puppet on this host is broken: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item analytics_hadoop_hosts in any Hiera data file and no default supplied at /etc/puppet/manifests/role/aqs.pp:46 on node deployment-aqs01.deployment-prep.eqiad.wmflabs

@elukey is on vacation, and I'm not really sure what changed. But if this is urgent for anyone, just ping me on IRC in #wikimedia-analytics

Thanks for reporting, this is my bad since analytics_hadoop_hosts is not in the Labs Hiera. Since this value should be removed soon, I would prefer not to change aqs.pp but to do a proper cleanup instead.

This comment was removed by elukey.

To unbreak puppet I made this hiera edit on wikitech. I have no idea if this means that AQS is not configured properly, but it does get Puppet running again on this host which is important for keeping up with other changes in Labs and the deployment-prep project.

Thanks a lot, and sorry for the delay; I only found a free moment to make the change today (https://gerrit.wikimedia.org/r/#/c/307703/). I tried to remove the hiera page but I don't have permission :(

I deleted the hiera settings page and verified that puppet still runs cleanly on deployment-aqs01. Thanks.

Status on this old task? :)

An AQS instance already exists in Beta for Analytics testing. It is a useful testing environment for Analytics changes, but it does not receive any pageviews from the Beta Cluster.

Note: Puppet has been disabled on deployment-aqs01 since June 8th, though no reason was given.

The last Puppet run was at Thu Jun  8 13:29:40 UTC 2017

Apparently it was disabled by @elukey, according to the output of last.

Fixed! Thanks for the heads up :)

Is there still anything actionable left on this task, or is it time to declare victory?

Mentioned in SAL (#wikimedia-releng) [2018-01-19T15:34:38Z] <elukey> added deployment-eventlog02.deployment-prep.eqiad.wmflabs to /etc/ssh/ssh_known_hosts on deployment-tin (following https://phabricator.wikimedia.org/T116206#2251441) to unblock "Host key verification failed" for Analytics