Send Jenkins build log and results to ElasticSearch
Open, Low, Public

Description

Purpose: to give us trend data on test failures/passes, which would give us an easy way to see if, e.g., all browser tests were failing around a given point in time (which would point to a Beta Cluster issue, for instance).

It is getting more and more tedious to dig through Jenkins logs to track down regressions in jobs or the infrastructure. The way I handle it is usually to grep the logs on gallium (/var/lib/jenkins/jobs/*/builds/*/log) or look at the XML build results. That requires shell access on the Jenkins master and is not convenient.

Jenkins has a plugin to emit console logs and build data to ElasticSearch: the Logstash Plugin. With Kibana in front of it, that would dramatically improve everyone's experience.

The following is extracted from a conversation between @hashar and @bd808.

Spinning up a single-node or multi-node logstash cluster is easy. It's all Puppet and Hiera. The grunt work has been done. Production has a large setup.

Concerns

Access

For Kibana access, the Beta Cluster one is public but the Production one is gated behind LDAP authentication (i.e. requires an NDA).

gallium has a public IP, is already able to reach instances on the beta cluster, and should be able to access production.

Mixing sources

@hashar wrote:
If we emit to beta or production, we will have different sources mixed up (e.g. for beta: beta logs plus CI build data and logs). That might make Kibana dashboards slightly more complicated to filter.

@bd808 reply:
This is what the log "types" are used for. If you look at the production or beta cluster kibana frontends you can see this in action. It's trivially easy to make a new dashboard that is pre-filtered by type.
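For instance, pre-filtering a dashboard (or an ad-hoc search) down to CI events could use a query on the log type field. A sketch of the ElasticSearch query DSL, assuming the plugin's events are indexed with `type:jenkins` (the field value is an assumption, not confirmed by the plugin docs):

```json
{
  "query": {
    "term": { "type": "jenkins" }
  }
}
```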

Disk usage

@hashar wrote:
The backend ElasticSearch will grow by an order of magnitude. But maybe we can have different indices with different retention policies. Most Jenkins jobs garbage-collect logs after 15 days.

@bd808 wrote:

A different retention policy is a pain, but I really don't think disk space will be a problem. deployment-logstash2 has 33G used for the log indices and 99G of free space. In production the numbers are 2.1T used and 683G free.

| Realm | Used | Free |
| --- | --- | --- |
| Production | 2.1T | 683G |
| Beta cluster | 33G | 99G |
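A per-realm retention policy could be approximated with daily, date-suffixed indices (the usual logstash convention) plus a cron job that drops indices older than the 15-day Jenkins log retention. A rough sketch, assuming `jenkins-YYYY.MM.DD` index names and the relforge host mentioned later in this task, with the actual delete left commented out:

```shell
#!/bin/bash
# Sketch (assumptions: daily date-suffixed indices named jenkins-YYYY.MM.DD,
# following the logstash convention, and the relforge ES URL) of a cron-able
# cleanup matching the 15-day Jenkins log retention.
ES="${ES:-https://relforge1001.eqiad.wmnet:9243}"
RETENTION_DAYS=15
CUTOFF=$(date -d "${RETENTION_DAYS} days ago" +%Y.%m.%d)

# In a real run the index list would come from:
#   curl -s "$ES/_cat/indices/jenkins-*?h=index"
for index in jenkins-2014.10.13 jenkins-2014.12.01; do
    suffix=${index#jenkins-}
    # Zero-padded YYYY.MM.DD sorts lexicographically, so a plain
    # string comparison is enough.
    if [[ "$suffix" < "$CUTOFF" ]]; then
        echo "would delete ${index}"
        # curl -s -XDELETE "$ES/${index}"
    fi
done
```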

Buckets

We may need to add some new config to handle sorting the logs into useful buckets: in effect, rules like the ones we already have in https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/files/logstash/filter-syslog.conf that take generic inputs like syslog, split the record up, and add type and channel tags to make searching/filtering easier.
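A sketch of what such a rule could look like for the Jenkins events, in the same style as filter-syslog.conf (the field names and values here are assumptions based on the plugin's example payload, where events carry `"source":"jenkins"`):

```
filter {
  if [source] == "jenkins" {
    mutate {
      replace   => { "type" => "jenkins" }
      add_field => { "channel" => "ci" }
    }
  }
}
```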

greg created this task.Dec 16 2014, 9:03 PM
greg updated the task description. (Show Details)
greg raised the priority of this task to Low.
greg changed Security from none to None.
greg added subscribers: greg, bd808.
bd808 removed a subscriber: bd808.Dec 16 2014, 9:20 PM
hashar closed this task as Resolved.Jun 9 2015, 8:03 PM
hashar claimed this task.
hashar added a subscriber: hashar.

For browser tests we now have raita: http://raita.wmflabs.org/ (kudos Dan!).

For CI logs, that is really yet another epic-scale project and we definitely have nobody to work on it this year. The idea is floating around but nothing serious has started. That is a whole different story really :-}

hashar renamed this task from Install the Logstash plugin for Jenkins to Send Jenkins build log and results to ElasticSearch.Feb 26 2016, 9:01 PM
hashar reopened this task as Open.
hashar raised the priority of this task from Low to Normal.
hashar added a project: Wikimedia-Logstash.
hashar updated the task description. (Show Details)
hashar added a subscriber: bd808.
hashar added a comment.EditedFeb 26 2016, 9:11 PM

From the plugin page:

The plugin can write directly to ElasticSearch but also supports RabbitMQ, Redis and Syslog.

Then there is a build wrapper and a post-build action:

Buildwrapper
This component streams individual log lines to the indexer for post-processing, along with any build data that is available at the start (some information such as the build status is unavailable or incomplete).

Post-Build Action
This component pushes the tail of the job's log to the indexer for post-processing, along with all build data at the time the post-build action had started (if any post-build actions are scheduled after this plugin they will not be recorded).

An example JSON payload sent to the indexer:

{
   "data":{
      "id":"2014-10-13_19-51-29",
      "result":"SUCCESS",
      "projectName":"my_example_job",
      "displayName":"#1",
      "fullDisplayName":"My Example Job #1",
      "url":"job/my_example_job/1/",
      "buildHost":"Jenkins",
      "buildLabel":"",
      "buildNum":1,
      "buildDuration":0,
      "rootProjectName":"my_example_job",
      "rootProjectDisplayName":"#1",
      "rootBuildNum":1,
      "buildVariables":{
         "PARAM1":"VALUE1",
         "PARAM2":"VALUE2"
      },
      "testResults":{
         "totalCount":45,
         "skipCount":0,
         "failCount":0,
         "failedTests":[]
      }
   },
   "message":[
      "Started by user anonymous",
      "Building in workspace /var/lib/jenkins/jobs/my_example_job/workspace",
      "Hello, World!"
   ],
   "source":"jenkins",
   "source_host":"http://localhost:8080/jenkins/",
   "@timestamp":"2014-10-13T19:51:29-0700",
   "@version":1
}

With the build wrapper, build results are missing or incomplete and the message array contains a single log line.

The build wrapper is hence meant to send the whole log, line by line, but misses the build result.

The post-build action has the build results, like we had with raita, plus a tail of the log. That might be nice for dashboarding.

For disk space: gallium has 273GB used, all included (system, website docroots, build results, console logs, etc.).

I looked at the npm job: its console logs use 1.5 GB with a 30-day retention.

Command for the nerd to be run on gallium:
find /var/lib/jenkins/jobs/npm/builds -name log -printf "%s\n"| awk '{total += $1} END { print total }'
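The same `jobs/<name>/builds/<id>/log` layout also allows a per-job breakdown, largest first. A sketch (the `JOBS_DIR` parameterization is mine, so the command can be pointed elsewhere; on gallium it would be /var/lib/jenkins/jobs):

```shell
# Per-job console-log disk usage, largest first.
JOBS_DIR="${JOBS_DIR:-/var/lib/jenkins/jobs}"
find "$JOBS_DIR" -name log -printf "%P %s\n" |
  awk -F'[/ ]' '{total[$1] += $NF}
       END {for (j in total) printf "%d\t%s\n", total[j], j}' |
  sort -rn | head
```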

I don't think the 100 GB we have on deployment-logstash would be able to hold everything. But that is probably a good place to experiment with it.

hashar lowered the priority of this task from Normal to Low.Apr 19 2016, 9:39 AM
hashar removed hashar as the assignee of this task.

Not much bandwidth to set that up.

I have created a very rough proof of concept in labs: instance logstash.integration.eqiad.wmflabs. I have reused the role classes and Hiera config from deployment-logstash02.

Still have to:

  • setup authentication
  • expose the ElasticSearch web service via a wmflabs webproxy
  • add a ferm rule
  • configure Jenkins
  • enable the feature on a single job

Change 322488 had a related patch set uploaded (by Hashar):
contint: ElasticSearch role for build logs

https://gerrit.wikimedia.org/r/322488

That is worth a discussion / proof of concept during the hackathon, or as part of the Release-Engineering-Team offsite that happens the week before.

greg awarded a token.May 3 2017, 4:17 PM
EBernhardson added a subscriber: EBernhardson.EditedMay 30 2017, 11:35 PM

As discussed at the hackathon, eventually this should probably log to the logstash cluster, where the existing infrastructure can probably do what you need. If you need a place to do some initial testing you can use the relforge100[12].eqiad.wmnet servers. These live in the production network, have about 9TB of free space, and their elasticsearch instances are accessible over SSL on port 9243 from labs (you need to copy over the appropriate production SSL CA certificate though). These servers have no uptime guarantees and are basically our test boxes for a variety of elasticsearch-related things. This should make it possible to at least see how much data gets generated, and to do some initial tests against production-size data, before deciding whether we should send it all to the proper logstash boxes.

Mentioned in SAL (#wikimedia-releng) [2017-06-02T12:12:51Z] <hashar> jenkins: rebuild logstash plugin from HEAD of master for jenkins 2 back compat. logstash-1.2.0-4-gbcbc19e - T78705

hashar added a comment.Jun 2 2017, 1:50 PM

Thanks @EBernhardson !!!!

Jenkins on a Labs instance

I have used a pre-existing labs instance, jenkinstest.integration.eqiad.wmflabs:

ssh -L 8080:127.0.0.1:8080 jenkinstest.integration.eqiad.wmflabs tail -F /var/log/jenkins/jenkins.log

Jenkins can then be reached via http://localhost:8080/ci/

The Jenkins logstash plugin 1.2.0 has some issues with Jenkins 2.x, so I rebuilt a package from https://github.com/jenkinsci/logstash-plugin

In the Global Tool Configuration (http://localhost:8080/ci/configureTools/):

Logstash Plugin
Indexer type: ELASTICSEARCH
Host name: https://relforge1001.eqiad.wmnet
Port: 9243
Key: /jenkins/jenkins

Where "Key" is described as "the name and type path", e.g. "/indexName/type".

Production CA/cert

openssl s_client -connect relforge1001.eqiad.wmnet:9243
...
issuer=/CN=Puppet CA: palladium.eqiad.wmnet

After checking with ops that it is fine to copy it from a production host to labs, I copied /etc/ssl/certs/Puppet_Internal_CA.pem under /etc/ssl/certs and ran update-ca-certificates --fresh.

And that works with: curl --verbose https://relforge1002.eqiad.wmnet:9243/

WARNING: this is still broken in Jenkins

Dummy jobs

I have created two jobs:

  1. http://localhost:8080/ci/job/send-console-to-logstash/configure

Build Environment
[ X ] Send console log to Logstash

  2. http://localhost:8080/ci/job/publish-log-to-logstash/configure

Post-build Actions
Send console log to Logstash
Max lines: 5

Change 356838 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] (DO NOT SUBMIT) jenkins: hack for custom keystore

https://gerrit.wikimedia.org/r/356838

Change 356838 abandoned by Hashar:
(DO NOT SUBMIT) jenkins: hack for custom keystore

Reason:
Cherry picked on CI puppet master but not meant to be merged.

https://gerrit.wikimedia.org/r/356838

hashar claimed this task.Jun 2 2017, 2:22 PM

Java has its own cert storage system. I have followed the instructions on https://support.cloudbees.com/hc/en-us/articles/203821254-How-to-install-a-new-SSL-certificate

keytool -import -file /etc/ssl/certs/production_puppet_CA.pem

Plus a puppet hack and Jenkins is able to talk to relforge1001.eqiad.wmnet \o/

hashar added a comment.Jun 2 2017, 2:30 PM

I have manually created a new index jenkins on relforge1001:

$ curl -XPUT https://relforge1001.eqiad.wmnet:9243/jenkins?pretty
{
  "acknowledged" : true,
  "shards_acknowledged" : true
}
$ curl -s "https://relforge1001.eqiad.wmnet:9243/_cat/indices/jenkins?v"
health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   jenkins 7FH-BrIDQtC_CLnAvoRUpA   5   1          0            0      1.2kb           650b

I then triggered a few random builds of the jobs. Will need to make them slightly more useful :]

hashar added a comment.Jun 2 2017, 3:24 PM

I have created a job that attempts to replay the console output and test results of a production job that is fairly large/spammy: http://localhost:8080/ci/job/mediawiki-extensions-hhvm-jessie/

I have enabled the logstash build wrapper (one log entry per console line).

The job cats the production job's console log and grabs its test results:

curl -s 'https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie/lastSuccessfulBuild/consoleText'
curl -s 'https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie/lastSuccessfulBuild/artifact/log/junit-phpunit-allexts.xml' > junit.xml

Then on completion, the job interprets the junit file and sends the last 30 lines of console output to logstash together with the test results. That last event should contain something like:

"testResults":{
   "totalCount":45,
   "skipCount":0,
   "failCount":0,
   "failedTests":[]
}

On the labs instance the job takes THREE MINUTES to run. That is because every line of output triggers an HTTPS request to relforge. A snippet of the curl output (times are relative to the start of the build):

00:00:00.789 + curl -s https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie/lastSuccessfulBuild/consoleText
00:00:00.985 Started by user anonymous
00:00:01.122 Building remotely on ci-jessie-wikimedia-685855 (ci-jessie-wikimedia) in workspace /home/jenkins/workspace/mediawiki-extensions-hhvm-jessie
00:00:01.285 [mediawiki-extensions-hhvm-jessie] $ /bin/bash -xe /tmp/hudson8679592491589430907.sh
00:00:01.419 + set -eu +x
00:00:01.544 Defined: CASTOR_NAMESPACE="castor-mw-ext-and-skins/master/mediawiki-extensions-hhvm-jessie"
00:00:01.677 [mediawiki-extensions-hhvm-jessie] $ /bin/bash -xe /tmp/hudson7839898166477548713.sh
00:00:01.808 + . castor-set-namespace.env
00:00:01.935 ++ declare -x CASTOR_NAMESPACE=castor-mw-ext-and-skins/master/mediawiki-extensions-hhvm-jessie
00:00:02.076 + echo Syncing...
00:00:02.212 Syncing...
00:00:02.342 + rsync --archive --compress --contimeout 3 rsync://castor.integration.eqiad.wmflabs:/caches/castor-mw-ext-and-skins/master/mediawiki-extensions-hhvm-jessie/ /home/jenkins
00:00:02.470 + echo -e '\nDone'
00:00:02.624 
00:00:02.624 Done

1.7 seconds for 14 lines of output. So I guess we can dismiss the idea of collecting the full console, or the plugin would need to be seriously reworked.


@EBernhardson is there a Kibana instance to look at relforge1001.eqiad.wmnet data?

No Kibana instance currently; it should be pretty painless to install one on your labs instance though. The entirety of the puppet module for kibana is basically just: install the package, put the config in place, and run the service. Kibana defaults to finding elasticsearch at localhost, so it will need an extra option in the yaml to point it at relforge.
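Pointing Kibana at relforge would then be a one-line override in kibana.yml (option name as in the Kibana 4/5 series; host and port taken from earlier in this task):

```yaml
# kibana.yml — override the localhost default
elasticsearch.url: "https://relforge1001.eqiad.wmnet:9243"
```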

Also, that is a crazy slow indexing rate. It should really be using bulk indexing requests to push everything at once instead of doing a request per line. After a quick look over the plugin, it's not trivial to make that happen, sadly.
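For reference, a sketch of what batching via the `_bulk` endpoint looks like: the body is newline-delimited JSON alternating an action line and a document, so many console lines go out in one request. The index/type names and the ES host are assumptions from earlier in this task, and the actual POST is left commented out:

```shell
# Build an NDJSON _bulk body for two console lines instead of two requests.
# Index/type names ("jenkins"/"jenkins") and the ES host are assumptions.
BULK_BODY=$(printf '%s\n' \
  '{"index":{"_index":"jenkins","_type":"jenkins"}}' \
  '{"message":"Started by user anonymous","source":"jenkins"}' \
  '{"index":{"_index":"jenkins","_type":"jenkins"}}' \
  '{"message":"Syncing...","source":"jenkins"}')
echo "$BULK_BODY" | wc -l    # 4 lines: one action + one document per log line
# curl -s -XPOST 'https://relforge1001.eqiad.wmnet:9243/_bulk' \
#      -H 'Content-Type: application/x-ndjson' --data-binary "$BULK_BODY"$'\n'
```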

hashar added a comment.Jun 2 2017, 8:54 PM

Thank you for the Kibana setup, that works like a charm. Guess I can start prototyping a few dashboards.

Self note to access kibana:

ssh -L 5601:localhost:5601
http://localhost:5601/

Also, that is a crazy slow indexing rate. It should really be using bulk indexing requests to push everything at once instead of doing a request per line. After a quick look over the plugin, it's not trivial to make that happen, sadly.

Yup, I guess that would require a lot of work. Moreover the plugin is up for adoption on the wiki page :(

Change 322488 abandoned by Hashar:
contint: ElasticSearch role for build logs

Reason:
We are using relforge1001.eqiad.wmnet instead and there is a Kibana frontend on jenkinstest.integration.eqiad.wmflabs instance.

https://gerrit.wikimedia.org/r/322488

Mentioned in SAL (#wikimedia-releng) [2017-06-06T14:34:03Z] <hashar> deleting buildlog.integration.eqiad.wmflabs; it was meant to receive Jenkins logs in ElasticSearch. We are experimenting with relforge1001.eqiad.wmnet now - T78705

hashar added a comment.Jul 7 2017, 3:06 PM

Quick update:

I have been busy with other duties and bugs. Will come back to this eventually though. The idea would be to generate a few jobs that run continuously and then see what happens on the ElasticSearch / Kibana side.