
Send Jenkins build log and results to ElasticSearch
Closed, Declined · Public

Description

Purpose: to give us trend data on test pass/fail, which would give us an easy view to see if, e.g., all browser tests were failing around a given point in time (which would point to a Beta Cluster issue, for instance).

It is getting more and more tedious to dig through Jenkins logs to find regressions in jobs or the infrastructure. The way I handle it is usually to grep the logs on gallium (/var/lib/jenkins/jobs/*/builds/*/log) or look at the XML build results. That requires shell access on the Jenkins master, which is not convenient.

Jenkins has a plugin to emit console log and build data to ElasticSearch: the Logstash Plugin. With Kibana in front of it, that would dramatically improve everyone's experience.

The following is extracted from a conversation between @hashar and @bd808.

Spinning up a single-node or multi-node logstash cluster is easy. It's all Puppet and Hiera. The grunt work has been done. Production has a large setup.

Concerns

Access

For Kibana access, the Beta Cluster instance is public but the Production one is behind LDAP authentication (i.e. requires an NDA).

gallium has a public IP and is already able to reach instances on the beta cluster and should be able to access production.

Mixing sources

@hashar wrote:
If we emit to beta or production, we will have different sources mixed up (e.g. for beta: Beta Cluster logs plus CI build data and logs). That might make Kibana dashboards slightly more complicated to filter.

@bd808 replied:
This is what the log "types" are used for. If you look at the production or beta cluster kibana frontends you can see this in action. It's trivially easy to make a new dashboard that is pre-filtered by type.
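
For illustration, a type-based filter can also be exercised from the command line; the endpoint, index pattern and field below are assumptions based on a standard logstash layout, not taken from the actual clusters:

curl -s 'http://localhost:9200/logstash-*/_search?q=type:jenkins&size=5&pretty'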

Disk usage

@hashar wrote:
The backend ElasticSearch will grow by an order of magnitude. But maybe we can have different indices with different retention policies. Most Jenkins jobs garbage-collect logs after 15 days.

@bd808 wrote:

A different retention policy is a pain but I really don't think disk space will be a problem. deployment-logstash2 has 33G used for the log indices and 99G of free space. In production the numbers are 2.1T used & 683G free.

| Realm | Used | Free |
| --- | --- | --- |
| Production | 2.1T | 683G |
| Beta cluster | 33G | 99G |

Buckets

We may need to add some new config to handle sorting the logs into useful buckets. In effect, rules like the ones we already have (https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/files/logstash/filter-syslog.conf) that take generic inputs like syslog, split the record up, and add type and channel tags to make searching/filtering easier.

Event Timeline

greg raised the priority of this task to Low. (Dec 16 2014, 9:03 PM)
greg updated the task description. (Show Details)
greg changed Security from none to None.
greg added subscribers: greg, bd808.
hashar claimed this task.
hashar subscribed.

For browser tests we now have raita: http://raita.wmflabs.org/ (kudos Dan!).

For CI logs, that is really yet another epic-like project and we definitely have nobody to work on it this year. The idea is floating around but nothing serious has started. That is a whole different story really :-}

hashar renamed this task from "Install the Logstash plugin for Jenkins" to "Send Jenkins build log and results to ElasticSearch". (Feb 26 2016, 9:01 PM)
hashar reopened this task as Open.
hashar raised the priority of this task from Low to Medium.
hashar added a project: Wikimedia-Logstash.
hashar updated the task description. (Show Details)
hashar added a subscriber: bd808.

From the plugin page:

The plugin can write directly to ElasticSearch but also supports RabbitMQ, Redis and Syslog.

Then there is a build wrapper and a post-build action:

Buildwrapper
This component streams individual log lines to the indexer for post-processing, along with any build data that is available at the start (some information such as the build status is unavailable or incomplete).

Post-Build Action
This component pushes the tail of the job's log to the indexer for post-processing, along with all build data at the time the post-build action had started (if any post-build actions are scheduled after this plugin they will not be recorded).

An example JSON payload sent to the indexer:

{
   "data":{
      "id":"2014-10-13_19-51-29",
      "result":"SUCCESS",
      "projectName":"my_example_job",
      "displayName":"#1",
      "fullDisplayName":"My Example Job #1",
      "url":"job/my_example_job/1/",
      "buildHost":"Jenkins",
      "buildLabel":"",
      "buildNum":1,
      "buildDuration":0,
      "rootProjectName":"my_example_job",
      "rootProjectDisplayName":"#1",
      "rootBuildNum":1,
      "buildVariables":{
         "PARAM1":"VALUE1",
         "PARAM2":"VALUE2"
      },
      "testResults":{
         "totalCount":45,
         "skipCount":0,
         "failCount":0,
         "failedTests":[]
      }
   },
   "message":[
      "Started by user anonymous",
      "Building in workspace /var/lib/jenkins/jobs/my_example_job/workspace",
      "Hello, World!"
   ],
   "source":"jenkins",
   "source_host":"http://localhost:8080/jenkins/",
   "@timestamp":"2014-10-13T19:51:29-0700",
   "@version":1
}

With the build wrapper, build results are missing/incomplete and the message array contains a single log line.

The build wrapper is hence meant to send the whole log, line by line, but misses the build result.

The post-build action has the build results, like we did with raita, and a tail of the log. Might be nice for dashboarding.

For disk space: gallium has 273GB used all included (system, website docroots, build results, console logs, etc.).

I looked at the npm job and the console logs use 1.5 GB with a 30-day retention.

Command for the nerd to be run on gallium:
find /var/lib/jenkins/jobs/npm/builds -name log -printf "%s\n"| awk '{total += $1} END { print total }'
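
A hypothetical extension of the same approach, ranking every job by console log usage (a sketch only, not something that was actually run on gallium):

# Sketch: per-job console log usage, largest first.
for d in /var/lib/jenkins/jobs/*/; do
  bytes=$(find "${d}builds" -name log -printf '%s\n' 2>/dev/null | awk '{t += $1} END {print t+0}')
  printf '%12d %s\n' "$bytes" "$(basename "$d")"
done | sort -rn | head -20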

I don't think the 100 GB we have on deployment-logstash would be able to hold everything. But that is probably a good place to experiment with it.

hashar lowered the priority of this task from Medium to Low.

Not much bandwidth to set that up.

I have created a very rough proof of concept in labs, on the instance logstash.integration.eqiad.wmflabs. I have reused the role classes and hiera config from deployment-logstash02.

Still have to:

  • setup authentication
  • expose the ElasticSearch web service via a wmflabs webproxy
  • add a ferm rule
  • configure Jenkins
  • enable the feature on a single job

Change 322488 had a related patch set uploaded (by Hashar):
contint: ElasticSearch role for build logs

https://gerrit.wikimedia.org/r/322488

That is worth a discussion / proof of concept during the hackathon or as part of the Release-Engineering-Team offsite that happens the week before.

As discussed at the hackathon, eventually this should probably be logging to the logstash cluster, where the existing infrastructure can probably do what you need. If you need a place to do some initial testing you can use the relforge100[12].eqiad.wmnet servers. These live in the production network, have about 9TB of free space, and the elasticsearch instances are accessible over ssl on port 9243 from labs (you need to copy over the appropriate production ssl ca certificate though). These servers have no up-time guarantees and are basically our test boxes for a variety of elasticsearch things. This should make it possible to at least see how much data gets generated, and do some initial tests against production-size data, before deciding if we should get it all going into the proper logstash boxes.

Mentioned in SAL (#wikimedia-releng) [2017-06-02T12:12:51Z] <hashar> jenkins: rebuild logstash plugin from HEAD of master for jenkins 2 back compat. logstash-1.2.0-4-gbcbc19e - T78705

Thanks @EBernhardson !!!!

Jenkins on a Labs instance

I have used a pre-existing labs instance, jenkinstest.integration.eqiad.wmflabs.

ssh -L 8080:127.0.0.1:8080 jenkinstest.integration.eqiad.wmflabs tail -F /var/log/jenkins/jenkins.log

Jenkins can then be reached via http://localhost:8080/ci/

The Jenkins logstash plugin 1.2.0 has some issues with Jenkins 2.x. I thus rebuilt a package from https://github.com/jenkinsci/logstash-plugin

In the Global Tool Configuration http://localhost:8080/ci/configureTools/ :

Logstash Plugin
Indexer type: ELASTICSEARCH
Host name: https://relforge1001.eqiad.wmnet
Port: 9243
Key: /jenkins/jenkins

Where Key is described as: The name and type path. Example: "/indexName/type"
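
So with Key set to /jenkins/jenkins, the plugin indexes documents under index "jenkins" and type "jenkins". Roughly the equivalent of the following (a sketch with a trimmed-down payload; the full structure is shown earlier in this task):

curl -XPOST 'https://relforge1001.eqiad.wmnet:9243/jenkins/jenkins?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"source":"jenkins","message":["Hello, World!"],"@timestamp":"2017-06-02T12:00:00Z"}'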

Production CA/cert

openssl s_client -connect relforge1001.eqiad.wmnet:9243
...
issuer=/CN=Puppet CA: palladium.eqiad.wmnet

After checking with ops that it is fine to copy it from a production host to labs, I copied /etc/ssl/certs/Puppet_Internal_CA.pem under /etc/ssl/certs and ran update-ca-certificates --fresh.
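
For reference, the more usual Debian flow for trusting an extra CA is roughly the sketch below (paths assumed; the file has to carry a .crt extension and live in the directory update-ca-certificates scans by default):

# Sketch: trust the production Puppet CA system-wide on Debian.
sudo cp Puppet_Internal_CA.pem /usr/local/share/ca-certificates/Puppet_Internal_CA.crt
sudo update-ca-certificates --fresh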

And that works with: curl --verbose https://relforge1002.eqiad.wmnet:9243/

WARNING: that is broken in Jenkins still

Dummy jobs

I have created two jobs:

  1. http://localhost:8080/ci/job/send-console-to-logstash/configure

Build Environment
[ X ] Send console log to Logstash

  2. http://localhost:8080/ci/job/publish-log-to-logstash/configure

Post-build Actions
Send console log to Logstash
Max lines: 5

Change 356838 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] (DO NOT SUBMIT) jenkins: hack for custom keystore

https://gerrit.wikimedia.org/r/356838

Change 356838 abandoned by Hashar:
(DO NOT SUBMIT) jenkins: hack for custom keystore

Reason:
Cherry picked on CI puppet master but not meant to be merged.

https://gerrit.wikimedia.org/r/356838

Java has its own cert storage system. I have followed the instructions on https://support.cloudbees.com/hc/en-us/articles/203821254-How-to-install-a-new-SSL-certificate

keytool -import -file /etc/ssl/certs/production_puppet_CA.pem
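
A fuller invocation would look something like the sketch below; the keystore path and password are assumptions for a stock Debian/OpenJDK install, not what was actually run on the instance:

# Sketch: import the production Puppet CA into the JVM trust store.
# /etc/ssl/certs/java/cacerts and the "changeit" password are Debian defaults.
sudo keytool -import -trustcacerts \
  -alias production-puppet-ca \
  -file /etc/ssl/certs/production_puppet_CA.pem \
  -keystore /etc/ssl/certs/java/cacerts \
  -storepass changeit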

Plus a puppet hack and Jenkins is able to talk to relforge1001.eqiad.wmnet \o/

I have manually created a new index jenkins on relforge1001:

$ curl -XPUT https://relforge1001.eqiad.wmnet:9243/jenkins?pretty
{
  "acknowledged" : true,
  "shards_acknowledged" : true
}
$ curl -s "https://relforge1001.eqiad.wmnet:9243/_cat/indices/jenkins?v"
health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   jenkins 7FH-BrIDQtC_CLnAvoRUpA   5   1          0            0      1.2kb           650b

I then triggered a few random builds of the jobs. Will need to make them slightly more useful :]
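
To sanity-check that events from those builds actually landed in the index, a query along these lines should work (the field name comes from the example payload earlier in this task; the job name is just one of the two dummy jobs):

curl -s 'https://relforge1001.eqiad.wmnet:9243/jenkins/_search?q=data.projectName:send-console-to-logstash&size=1&pretty'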

I have created a job that attempts to replay the console output and test results of a production job that is fairly large/spammy: http://localhost:8080/ci/job/mediawiki-extensions-hhvm-jessie/

I have enabled the logstash build wrapper (a log entry per console line).

The job cats the production job console and grabs its test results:

curl -s 'https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie/lastSuccessfulBuild/consoleText'
curl -s 'https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie/lastSuccessfulBuild/artifact/log/junit-phpunit-allexts.xml' > junit.xml

Then on completion, the job interprets the junit file and sends the last 30 lines of console output to logstash together with the test results. That last event should have something like:

"testResults":{
   "totalCount":45,
   "skipCount":0,
   "failCount":0,
   "failedTests":[]
}

On the labs instance the job takes THREE MINUTES to run. That is because every time there is a line of output it does an HTTPS request to relforge. A snippet of the console output (times are relative to the start of the build):

00:00:00.789 + curl -s https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie/lastSuccessfulBuild/consoleText
00:00:00.985 Started by user anonymous
00:00:01.122 Building remotely on ci-jessie-wikimedia-685855 (ci-jessie-wikimedia) in workspace /home/jenkins/workspace/mediawiki-extensions-hhvm-jessie
00:00:01.285 [mediawiki-extensions-hhvm-jessie] $ /bin/bash -xe /tmp/hudson8679592491589430907.sh
00:00:01.419 + set -eu +x
00:00:01.544 Defined: CASTOR_NAMESPACE="castor-mw-ext-and-skins/master/mediawiki-extensions-hhvm-jessie"
00:00:01.677 [mediawiki-extensions-hhvm-jessie] $ /bin/bash -xe /tmp/hudson7839898166477548713.sh
00:00:01.808 + . castor-set-namespace.env
00:00:01.935 ++ declare -x CASTOR_NAMESPACE=castor-mw-ext-and-skins/master/mediawiki-extensions-hhvm-jessie
00:00:02.076 + echo Syncing...
00:00:02.212 Syncing...
00:00:02.342 + rsync --archive --compress --contimeout 3 rsync://castor.integration.eqiad.wmflabs:/caches/castor-mw-ext-and-skins/master/mediawiki-extensions-hhvm-jessie/ /home/jenkins
00:00:02.470 + echo -e '\nDone'
00:00:02.624 
00:00:02.624 Done

1.7 seconds for 14 lines of output. So I guess we can dismiss the idea of collecting the full console, or the plugin would need to be seriously reworked.


@EBernhardson is there a Kibana instance to look at relforge1001.eqiad.wmnet data?

No kibana instance currently; it should be pretty painless to install one on your labs instance though. The entirety of the puppet module for kibana is basically just: install the package, put the config in place, and run the service. Kibana defaults to finding elasticsearch at localhost, so it will need an extra option in the yaml to point it at relforge.
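
A minimal sketch of that setup on the labs instance, assuming the Debian kibana package and the elasticsearch.url setting used by Kibana 5.x (package and option names may differ):

sudo apt-get install kibana
# Point Kibana at relforge instead of the default localhost:9200:
echo 'elasticsearch.url: "https://relforge1001.eqiad.wmnet:9243"' | sudo tee -a /etc/kibana/kibana.yml
sudo service kibana start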

Also that is a crazy slow indexing rate. It should really be using bulk indexing requests to push everything at once instead of doing a request per line. After a quick look over the plugin it's not trivial to make that happen sadly.

Thank you for the Kibana setup, that works like a charm. Guess I can start prototyping a few dashboards.

Self note to access kibana:

ssh -L 5601:localhost:5601 jenkinstest.integration.eqiad.wmflabs
http://localhost:5601/

> Also that is a crazy slow indexing rate. It should really be using bulk indexing requests to push everything at once instead of doing a request per line. After a quick look over the plugin it's not trivial to make that happen sadly.

Yup I guess that would require a lot of work. Moreover the plugin is up for adoption on the wiki page :(

Change 322488 abandoned by Hashar:
contint: ElasticSearch role for build logs

Reason:
We are using relforge1001.eqiad.wmnet instead and there is a Kibana frontend on jenkinstest.integration.eqiad.wmflabs instance.

https://gerrit.wikimedia.org/r/322488

Mentioned in SAL (#wikimedia-releng) [2017-06-06T14:34:03Z] <hashar> deleting buildlog.integration.eqiad.wmflabs which was meant to receive Jenkins logs in ElasticSearch. We are experimenting with relforge1001.eqiad.wmnet now - T78705

Quick update:

I have been busy with other duties and bugs. Will come back to it eventually though. The idea would be to generate a few jobs that run continuously and then see what happens on ElasticSearch / Kibana side.

I'd like to revisit this as a way to record build durations along with the useful metadata (job name, result, node, build parameters, etc.), without keeping the entirety of the log output. In fact, if ditching the log output entirely means we can retain durations and a few other fields for an extended (or indefinite) amount of time, then we should do that. (IMHO, the log output's usefulness sharply declines over time, and the current Jenkins build retention policy of 30 days is plenty, but historical durations w/ metadata are always useful.)

Optimizing the CI infrastructure is going to be vitally important with the ongoing CD Pipeline and Docker-ization work. We need this data to make informed optimization decisions.

We also don't necessarily need line-wise metrics, and based on what @hashar has said about its overhead, we can't do that anyway. A better option might be to use the post-action/publisher that the Logstash plugin provides and set the max lines to 0.

@EBernhardson, I'm testing the logstash publisher to see if it'll at least work for our use case of collecting duration stats, and it seems like it executes on the master node (contint1001.wikimedia.org) which appears to be firewalled from accessing relforge1001.eqiad.wmnet:9243. Is that still the most appropriate server to test on? I realize this is a pretty old task, so any guidance on the best elasticsearch endpoint to hit would be great.

And if relforge1001 is still a good target for testing, would it make sense to open up 9243 from contint1001?

17:21:41 [logstash-plugin]: Failed to send log data to ELASTICSEARCH:https://relforge1001.eqiad.wmnet:9243.
17:21:41 [logstash-plugin]: No Further logs will be sent to https://relforge1001.eqiad.wmnet:9243.
17:21:41 org.apache.http.conn.HttpHostConnectException: Connect to relforge1001.eqiad.wmnet:9243 [relforge1001.eqiad.wmnet/10.64.4.13] failed: Connection timed out (Connection timed out)
17:21:41 	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
$ ssh contint1001.wikimedia.org nc -zv -w 3 relforge1001.eqiad.wmnet 9243
relforge1001.eqiad.wmnet [10.64.4.13] 9243 (?) : Connection timed out

That is still the address for our testing servers, yes. The server is set up to be accessible from the production and cloud networks, so I guess contint1001 must be in yet another network? TBH I'm not too familiar with how CI is separated from the rest of things; I would have assumed it was part of cloud but maybe not?

One other option would be that the relforge servers could be dropping the packets. The relevant ferm config is:

ebernhardson@relforge1001:~$ sudo grep -r 9243 /etc/ferm/conf.d/
/etc/ferm/conf.d/10_elastic-https-mwmaint:&R_SERVICE(tcp, 9243, (10.64.16.62 2620:0:861:102:10:64:16:62 10.192.48.45 2620:0:860:104:10:192:48:45));
/etc/ferm/conf.d/10_elastic-https:&R_SERVICE(tcp, 9243, $LABS_NETWORKS);

By chance do you know where the contint servers live?

> That is still the address for our testing servers, yes. The server is set up to be accessible from the production and cloud networks, so I guess contint1001 must be in yet another network? TBH I'm not too familiar with how CI is separated from the rest of things; I would have assumed it was part of cloud but maybe not?

The master Jenkins host is on the production network but the nodes are in labs (the integration project).

> One other option would be that the relforge servers could be dropping the packets. The relevant ferm config is:
>
> ebernhardson@relforge1001:~$ sudo grep -r 9243 /etc/ferm/conf.d/
> /etc/ferm/conf.d/10_elastic-https-mwmaint:&R_SERVICE(tcp, 9243, (10.64.16.62 2620:0:861:102:10:64:16:62 10.192.48.45 2620:0:860:104:10:192:48:45));
> /etc/ferm/conf.d/10_elastic-https:&R_SERVICE(tcp, 9243, $LABS_NETWORKS);
>
> By chance do you know where the contint servers live?

Those ferm rules look like the right place to make a change (if that's appropriate): the contint1001 host is on the production eqiad network, and FWICT only has a public address and no x.eqiad.wmnet name, and looking again it can ping relforge1001.eqiad.wmnet (maybe over multicast or something? I don't see an interface on any 10.x.x.x network). @hashar, do you know more about it?

T140257 has the history.

In short, contint1001 is in the production network and has a public IP (and thus a wikimedia.org domain, not eqiad.wmnet). It has to talk to Nodepool and WMCS instances which are in a different network; there is a firewall between production and WMCS, so rules have been added to allow some traffic.

The relforge1001 ferm rule indicates $LABS_NETWORKS, so we might assume it is in the WMCS network just like Nodepool. If that is the case, an exception has to be granted by netops to allow that production <-> WMCS communication. Otherwise, we would need an ElasticSearch instance in the production network.

> If that is the case, an exception has to be granted by netops to allow that production <-> WMCS communication. Otherwise, we would need an ElasticSearch instance in the production network.

The servers do live in the production network, much like the cloud mysql replicas. There is already a network exception to route traffic from WMCS to relforge. I think we might only need an updated ferm rule? That would likely be accomplished by changing https://github.com/wikimedia/puppet/blob/production/hieradata/role/common/elasticsearch/relforge.yaml#L10

Might still need some form of approval from ops though; the network exception was granted with the understanding that the service was only maintained, not used, from the production networks.

Reviewing my past comments, it looks like I did the experiment on a Jenkins instance on WMCS. It would thus have been able to reach relforge1001.eqiad.wmnet.

contint1001 can reach it (there is a route to WMCS), but indeed some firewall would need a hole to be punched. That is a question for netops I believe, and it most probably requires SRE approval.

Change 479567 had a related patch set uploaded (by Dduvall; owner: Dduvall):
[operations/puppet@production] ci: Permit ES traffic from jenkins masters to relforge

https://gerrit.wikimedia.org/r/479567

My two cents as one of the owners of both metrics and logging in SRE, with the disclaimer that I might be missing something!

> I'd like to revisit this as a way to record build durations along with the useful metadata (job name, result, node, build parameters, etc.), without keeping the entirety of the log output. In fact, if ditching the log output entirely means we can retain durations and a few other fields for an extended (or indefinite) amount of time, then we should do that. (IMHO, the log output's usefulness sharply declines over time, and the current Jenkins build retention policy of 30 days is plenty, but historical durations w/ metadata are always useful.)
>
> Optimizing the CI infrastructure is going to be vitally important with the ongoing CD Pipeline and Docker-ization work. We need this data to make informed optimization decisions.

Agreed with the above; with that in mind it seems to me this is a metrics use case rather than logging (save for build metadata perhaps)? At any rate, let me know if you'd be interested in using the production logging cluster instead (30d retention ATM, though that might get extended in the future when newer hardware is online).

Change 479567 merged by Dzahn:
[operations/puppet@production] ci: Permit ES traffic from jenkins masters to relforge

https://gerrit.wikimedia.org/r/479567

Applied the ferm change. Now we have:

relforge1001:

ACCEPT tcp -- contint1001.wikimedia.org anywhere tcp dpt:9443
ACCEPT tcp -- contint2001.wikimedia.org anywhere tcp dpt:9443

relforge1002:

ACCEPT tcp -- contint1001.wikimedia.org anywhere tcp dpt:9443
ACCEPT tcp -- contint2001.wikimedia.org anywhere tcp dpt:9443

Mentioned in SAL (#wikimedia-releng) [2019-10-21T18:38:18Z] <hashar> Jenkins: upgraded logstash plugin to 2.3.1 , overriding a custom build snapshot of 1.2.1. # T78705

I have upgraded the Jenkins Logstash plugin on https://integration.wikimedia.org/ci/. I previously used a snapshot build as part of an experiment (T78705#3310687) but it never got used.

From the plugin page ( https://plugins.jenkins.io/logstash ):

With version 2.0 the global configuration has been moved from Global Tool Configuration to the regular Jenkins configuration page (Jenkins → Manage Jenkins → Configure System).
There was also a major change in the way the plugin works. It is no longer a BuildWrapper but a ConsoleLogFilter and you enable it via a JobProperty. This is necessary to reliably ensure that passwords are masked when the MaskPasswords plugin is installed and allow to enable log forwarding globally.

So it is all about ticking a job property. Will probably want to check support in JJB. I also have no idea whether the new version does bulk updates; previously it re-established a TCP connection AND an SSL handshake for each line of the console!

Mentioned in SAL (#wikimedia-releng) [2020-03-12T08:17:34Z] <hashar> integration: removing logstash Jenkins plugin from https://integration.wikimedia.org/ci/ | T247036 T78705

Closing this one again since we have no plan to work on it.

Change 893457 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] elastic relforge: rm rules for Jenkins

https://gerrit.wikimedia.org/r/893457

Change 893457 merged by Bking:

[operations/puppet@production] elastic relforge: rm rules for Jenkins

https://gerrit.wikimedia.org/r/893457