
Set up AQS in Beta
Closed, ResolvedPublic

Description

Now that the Analytics Query Service (AQS, wikitech) is in production and RESTBase will soon expose its public API, there should be an AQS instance in the Beta Cluster as well to allow full infrastructure-integration tests.

Event Timeline

mobrovac raised the priority of this task to Medium.
mobrovac updated the task description. (Show Details)
mobrovac added subscribers: mobrovac, Milimetric.
hashar set Security to None.
hashar moved this task from To Triage to Next: Feature on the Beta-Cluster-Infrastructure board.
hashar added a project: OKR-Work.
hashar subscribed.

@mobrovac:

Talking to Release-Engineering-Team, it doesn't seem that they own this item; they help other teams set things up in the Beta Cluster, but the setup itself (Puppet and such) is done by the teams themselves.

As an example, the Analytics team does maintenance of the EventLogging Beta Cluster setup, involving Ops and Release-Engineering-Team as needed. I think this is a fine paradigm to follow, because it seems unrealistic to expect Release-Engineering-Team to set up testing environments for every single service every team creates, and to own them. This needs to be done in a collaborative fashion.

@GWicke: Can we get a commitment from the Services team to work on this, so that Analytics has a better testing environment for the Pageview API for our upcoming changes?

@Nuria, this is part of the effort of creating a new deployment tool (called Scap3). RelEng and Services will make the initial deployment of AQS in Beta with it. That said, since AQS does not produce any data on its own, we'll need some fake data from you (a GB or two) to backfill the Beta Cluster Cassandra instance.

Once the initial deployment is completed, we will show you how to keep it updated, at which point ownership will be transferred to your team. Sounds fair?

I am speaking for myself, but I am pretty sure anyone from Release-Engineering-Team will be happy to assist and provide support. But do not expect us to take the lead on setting up every single software stack in use at the WMF. We are too small and can't realistically cover all the frontend / middleware / backend / infrastructure requirements involved.

In most cases the people best suited to add a new system to the Beta Cluster are members of the team that creates and leads it. That is far more effective than having RelEng try to figure it out. That being said, I think of our role as providing support for the Beta-Cluster-specific environment, such as using Hiera to vary configuration settings or the local puppetmaster to test Puppet changes before they get merged by Ops.

A good communication channel to raise awareness of such tasks, or to help clarify responsibility, is the Scrum-of-Scrums. We have a representative there, and that has already freed up a few tasks that were blocked due to misunderstandings about who was responsible for them.

Once the initial deployment is completed, we will show you how to keep it updated, at which point ownership will be transferred to your team. Sounds fair?

Will this include a full environment, RESTBase + AQS? Because that is what we are after.

What I mean here by Beta Cluster is the deployment-prep project in Labs, used as a testing mirror of production. Services' RESTBase instance is already there, so we need to set up AQS and connect them. We would still be responsible for our instance, while you'd need to take care of AQS.

So it seems AQS on beta is "just" waiting for T116335: Deploy RESTBase with scap3 / T114999: Deploy AQS with scap3, isn't it?

Only the latter; I'll add it as a blocker.

This morning @mobrovac and I set up deployment-aqs01 on Beta. We deployed via Scap3 successfully and without issue.

The AQS service is not 100% operational just yet; there are some Puppet tweaks that need to be made to get it working and resolve this task:

  • Cassandra for AQS needs some Hiera configuration (in hieradata/labs/deployment-prep/host/deployment-aqs01.yaml). At this point Cassandra won't start because it is trying to gossip with the other RESTBase nodes in Labs (likely pulling in the common Hiera configuration for Cassandra); see the sketch after this list.
  • AQS uses /srv/deployment/restbase/deploy as the deployed repository since it uses the restbase module. This should be /srv/deployment/aqs/deploy—less confusing, and actually matches the upstream repo. Possibly to be configured via hiera.
  • The service file for restbase specifies /usr/lib/restbase/deploy as the path for the RESTBase code, since it is symlinked to /srv/deployment/restbase/deploy. This should be updated to match the previous point, i.e. /srv/deployment/aqs/deploy.
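
For reference, a host-level Hiera override along these lines should cover the first point. This is only a sketch: the file path matches the bullet above, but the puppet repo location on the deployment-prep puppetmaster and the key names (cassandra::cluster_name, cassandra::seeds, aqs::deploy_dir) are assumptions and need to be checked against the current puppet tree.

# Sketch only -- run on the deployment-prep puppetmaster; the repo path and
# Hiera key names below are assumptions, not verified against ops/puppet.
sudo tee /var/lib/git/operations/puppet/hieradata/labs/deployment-prep/host/deployment-aqs01.yaml <<'EOF'
# Keep the beta AQS Cassandra instance in its own single-node cluster so it
# does not try to gossip with the RESTBase Cassandra nodes in labs.
cassandra::cluster_name: aqs-beta
cassandra::seeds:
  - deployment-aqs01.deployment-prep.eqiad.wmflabs
# Deploy AQS from its own repository path instead of restbase/deploy
# (hypothetical key name).
aqs::deploy_dir: /srv/deployment/aqs/deploy
EOF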

@mobrovac: those sound like things I can do? Let me know if you've started on them yet, and if not please assign this to me and I'll get started.

I can do the simple ops/puppet patch for setting the right stuff in deployment-prep. What we'd need from you, @Milimetric / @JAllemandou, is some sample or fake data that we can store in Cassandra there so queries succeed. 10 GB of data or less will do.
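
For illustration only, a backfill could look roughly like the following. The keyspace, table, and columns here are hypothetical placeholders (the real AQS Cassandra schema should be taken from production), and an actual multi-GB backfill would of course be loaded in bulk rather than row by row.

# Hypothetical example of inserting one fake pageview row into the beta
# Cassandra instance; keyspace/table/columns are placeholders, not the real
# AQS schema.
cqlsh deployment-aqs01.deployment-prep.eqiad.wmflabs -e "
  INSERT INTO fake_pageviews.per_article (project, article, granularity, dt, views)
  VALUES ('en.wikipedia', 'Main_Page', 'daily', '2016-01-01', 42);"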

Change 257406 had a related patch set uploaded (by Mobrovac):
AQS: Configure Cassandra for AQS in BetaCluster

https://gerrit.wikimedia.org/r/257406

Change 257406 merged by Alexandros Kosiaris:
AQS: Configure Cassandra for AQS in BetaCluster

https://gerrit.wikimedia.org/r/257406

Quick note: we are getting the following failure while trying to deploy:

elukey@deployment-tin:/srv/deployment/analytics/aqs/deploy$ deploy
14:30:43 Started Deploy: analytics/aqs/deploy
Entering 'src'
14:30:43
== DEFAULT ==
:* deployment-aqs01.deployment-prep.eqiad.wmflabs
14:30:43 ['/usr/bin/deploy-local', '-v', '--repo', 'analytics/aqs/deploy', '-g', 'default', 'fetch'] on deployment-aqs01.deployment-prep.eqiad.wmflabs returned [255]: Host key verification failed.

analytics/aqs/deploy: fetch stage(s): 100% (ok: 0; fail: 1; left: 0)
14:30:43 1 targets had deploy errors
Stage 'fetch' failed on group 'default'. Perform rollback? [y]: n
14:30:46 Finished Deploy: analytics/aqs/deploy (duration: 00m 02s)
elukey@deployment-tin:/srv/deployment/analytics/aqs/deploy$ keyholder status
keyholder-agent start/running, process 31934
- 4096 59:58:0c:eb:35:bc:7a:5d:76:6b:bc:ad:18:bc:d1:60 /etc/keyholder.d/eventlogging (RSA)
- 4096 cb:43:c3:76:3a:68:b0:29:1a:8f:3f:31:1f:87:f8:7c /etc/keyholder.d/mwdeploy (RSA)
- 4096 f9:49:1c:4c:4d:4a:b1:f2:20:1f:28:88:d3:27:9c:6f /etc/keyholder.d/phabricator (RSA)
- 4096 5d:ae:1b:24:40:20:89:b3:e1:74:51:9a:e7:64:a3:5d /etc/keyholder.d/servicedeploy (RSA)
keyholder-proxy start/running, process 31939
- 4096 59:58:0c:eb:35:bc:7a:5d:76:6b:bc:ad:18:bc:d1:60 /etc/keyholder.d/eventlogging (RSA)
- 4096 cb:43:c3:76:3a:68:b0:29:1a:8f:3f:31:1f:87:f8:7c /etc/keyholder.d/mwdeploy (RSA)
- 4096 f9:49:1c:4c:4d:4a:b1:f2:20:1f:28:88:d3:27:9c:6f /etc/keyholder.d/phabricator (RSA)
- 4096 5d:ae:1b:24:40:20:89:b3:e1:74:51:9a:e7:64:a3:5d /etc/keyholder.d/servicedeploy (RSA)

Are we missing something?

Blerg. We really need to automate known_hosts for scap targets in beta.

When connecting to a server for the first time, a fingerprint of the server's public key is presented to the user (unless the option StrictHostKeyChecking has been disabled). This authentication method closes security holes due to IP spoofing,
DNS spoofing, and routing spoofing. ssh automatically maintains and checks a database containing identification for all hosts it has ever been used with. Host keys are stored in ~/.ssh/known_hosts in the user's home directory. Additionally, the file /etc/ssh/ssh_known_hosts is automatically checked for known hosts.

The error Host key verification failed means that the client could not verify the server's public key fingerprint. It couldn't prompt to accept it because scap runs ssh in the background with -oBatchMode=yes, i.e. a non-interactive ssh session. You can see the failure by trying to ssh to the host using the keyholder proxy socket:

thcipriani@deployment-tin:~$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -oBatchMode=yes -l deploy-service deployment-aqs01.deployment-prep.eqiad.wmflabs
Host key verification failed.

The fix is to accept the host key in your local ~/.ssh/known_hosts:

thcipriani@deployment-tin:~$ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -l deploy-service deployment-aqs01.deployment-prep.eqiad.wmflabs
The authenticity of host 'deployment-aqs01.deployment-prep.eqiad.wmflabs (10.68.18.237)' can't be established.
ECDSA key fingerprint is 1f:2a:34:31:8a:1c:5c:42:8f:96:6c:b4:d3:35:3b:39.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'deployment-aqs01.deployment-prep.eqiad.wmflabs' (ECDSA) to the list of known hosts.
Linux deployment-aqs01 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u4 (2016-02-29) x86_64
Debian GNU/Linux 8.2 (jessie)
deployment-aqs01 is a Puppet client of deployment-puppetmaster.deployment-prep.eqiad.wmflabs (puppetclient)
deployment-aqs01 is a Analytics Query Service Node (role::aqs)
The last Puppet run was at Fri Apr 29 14:52:26 UTC 2016 (21 minutes ago). 
Last login: Fri Apr 29 15:03:02 2016 from deployment-tin.deployment-prep.eqiad.wmflabs
deploy-service@deployment-aqs01:~$

In production this isn't a problem; we need to set something up for Beta to handle this (one possible approach is sketched after the command output below).

It should be fixed now for you since I've run:

root@deployment-tin:~# while read host; do ssh-keyscan -H "$host"; done < /srv/deployment/analytics/aqs/deploy/scap/aqs-deployment-prep >> /etc/ssh/ssh_known_hosts
# deployment-aqs01.deployment-prep.eqiad.wmflabs SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u2
# deployment-aqs01.deployment-prep.eqiad.wmflabs SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u2
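
To make this more permanent, one option would be to wrap that scan in a periodic job on the deploy host. This is just a sketch of the idea, not what was actually deployed; the find pattern for scap target files is a guess about the repo layout.

# Sketch: nightly job that scans every scap target list under /srv/deployment
# and appends any host keys not already present in the system-wide
# known_hosts file.
cat <<'EOF' | sudo tee /etc/cron.daily/scap-known-hosts
#!/bin/bash
# Collect every scap target list (file naming is an assumption) and make
# sure each listed host has an entry in /etc/ssh/ssh_known_hosts.
find /srv/deployment -type f -path '*/scap/*deployment-prep*' |
while read -r targets; do
    while read -r host; do
        # Only scan hosts we do not already know about.
        ssh-keygen -F "$host" -f /etc/ssh/ssh_known_hosts >/dev/null ||
            ssh-keyscan -H "$host" >> /etc/ssh/ssh_known_hosts
    done < "$targets"
done
EOF
sudo chmod +x /etc/cron.daily/scap-known-hosts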

Adding some info after chatting with @thcipriani: to avoid "Agent admitted failure to sign key", we had to add myself and @joal to the deploy-service group; otherwise we wouldn't have had access to the key in keyholder.
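
For the record, the change amounts to something like the following; the usernames are just the two mentioned above, and on a Labs instance group membership is normally managed through the project configuration rather than by hand, so treat this as illustrative only.

# Illustrative only: give the two deployers access to the keyholder key by
# adding them to the deploy-service group, then start a fresh login shell so
# the new group membership takes effect.
sudo usermod -a -G deploy-service elukey
sudo usermod -a -G deploy-service joal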

Puppet on this host is broken: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Could not find data item analytics_hadoop_hosts in any Hiera data file and no default supplied at /etc/puppet/manifests/role/aqs.pp:46 on node deployment-aqs01.deployment-prep.eqiad.wmflabs

@elukey is on vacation, and I'm not really sure what changed. But if this is urgent for anyone, just ping me on IRC in #wikimedia-analytics

Thanks for reporting, this is my bad since analytics_hadoop_hosts is not in the Labs Hiera. Since this value should be removed soon, I would prefer not to change aqs.pp but to do a proper cleanup instead.

This comment was removed by elukey.

To unbreak puppet I made this hiera edit on wikitech. I have no idea if this means that AQS is not configured properly, but it does get Puppet running again on this host which is important for keeping up with other changes in Labs and the deployment-prep project.

Thanks a lot, and sorry for the delay; I only found a free moment to make the change today (https://gerrit.wikimedia.org/r/#/c/307703/). I tried to remove the hiera page but I don't have permission :(

I deleted the hiera settings page and verified that puppet still runs cleanly on deployment-aqs01. Thanks.

Status on this old task? :)

An AQS instance already exists in Beta for Analytics testing. It is a useful testing environment for Analytics changes, but it does not receive any pageviews from the Beta Cluster.

Note: Puppet has been disabled on deployment-aqs01 since June 8th, though no reason was given.

The last Puppet run was at Thu Jun  8 13:29:40 UTC 2017

Apparently it was disabled by @elukey, according to the output of last.

Fixed! Thanks for the heads up :)

Is there still anything actionable left on this task, or is it time to declare victory?

Mentioned in SAL (#wikimedia-releng) [2018-01-19T15:34:38Z] <elukey> added deployment-eventlog02.deployment-prep.eqiad.wmflabs to /etc/ssh/ssh_known_hosts on deployment-tin (following https://phabricator.wikimedia.org/T116206#2251441) to unblock "Host key verification failed" for Analytics