Purpose: to give us trend data on test failures and passes, which would make it easy to see whether, e.g., all browser tests were failing around a given point in time (which would point to a Beta Cluster issue, for instance).
It is getting more and more tedious to dig through Jenkins logs to find regressions in jobs or the infrastructure. I usually grep the logs on gallium under /var/lib/jenkins/jobs/*/builds/*/log or look at the XML build results. That requires shell access on the Jenkins master, which is not convenient.
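The manual search described above can be approximated with a short script. This is a minimal sketch, assuming it runs on a host with read access to the Jenkins build directories; the function name `find_in_build_logs` and the search pattern are hypothetical, only the path layout comes from the description above.

```python
import glob
import os
import re


def find_in_build_logs(jobs_root, pattern):
    """Return (job, build, line) tuples for every log line matching pattern.

    Assumes the layout <jobs_root>/<job>/builds/<build>/log.
    """
    regex = re.compile(pattern)
    matches = []
    log_glob = os.path.join(jobs_root, "*", "builds", "*", "log")
    for log_path in sorted(glob.glob(log_glob)):
        build_dir = os.path.dirname(log_path)
        build = os.path.basename(build_dir)
        job = os.path.basename(os.path.dirname(os.path.dirname(build_dir)))
        # Console logs may contain arbitrary bytes; replace them instead of failing.
        with open(log_path, errors="replace") as handle:
            for line in handle:
                if regex.search(line):
                    matches.append((job, build, line.rstrip("\n")))
    return matches
```

Calling `find_in_build_logs("/var/lib/jenkins/jobs", r"Connection refused")` would list every build whose console log contains that string. It still needs shell access, which is the limitation this task is about.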
Jenkins has a plugin to emit console logs and build data to Elasticsearch: the Logstash Plugin. With Kibana in front of it, that would dramatically improve everyone's experience.
The following is extracted from a conversation between @hashar and @bd808.
Spinning up a single-node or multi-node Logstash cluster is easy. It's all Puppet and Hiera. The grunt work has been done. Production has a large setup.
Concerns
Access
For Kibana access, the Beta Cluster instance is public but the Production one is behind LDAP authentication (i.e. it requires an NDA).
gallium has a public IP and is already able to reach instances on the beta cluster and should be able to access production.
Mixing sources
@hashar wrote:
If we emit to beta or production, we will have different sources mixed up (e.g. for beta: beta logs plus CI build data and logs). That might make Kibana dashboards slightly more complicated to filter.
@bd808 replied:
This is what the log "types" are used for. If you look at the production or beta cluster kibana frontends you can see this in action. It's trivially easy to make a new dashboard that is pre-filtered by type.
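To illustrate pre-filtering by type, the sketch below builds an Elasticsearch query body restricted to one log type. It assumes records carry a `type` field and an `@timestamp` field, as Logstash events usually do, and it uses the `bool`/`filter` query form, which older Elasticsearch releases may not accept. The type value `jenkins` and the function name are hypothetical; the conversation does not say which type CI data would use.

```python
import json


def query_for_type(log_type, since="now-15d"):
    """Build an Elasticsearch query body matching one log type since a given time."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Exact match on the record type (assumes a non-analyzed field).
                    {"term": {"type": log_type}},
                    # Only records newer than the given date-math expression.
                    {"range": {"@timestamp": {"gte": since}}},
                ]
            }
        }
    }


print(json.dumps(query_for_type("jenkins"), indent=2))
```

A Kibana dashboard pre-filtered by type applies the same kind of restriction, so CI records and Beta Cluster records can share a backend without being mixed in one view.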
Disk usage
@hashar wrote:
The backend Elasticsearch will grow by an order of magnitude. But maybe we can have different indices with different retention policies. Most Jenkins jobs garbage collect logs after 15 days.
@bd808 wrote:
A different retention policy is a pain but I really don't think disk space will be a problem. deployment-logstash2 has 33G used for the log indices and 99G of free space. In production the numbers are 2.1T used & 683G free.
| Realm | Used | Free |
|---|---|---|
| Production | 2.1T | 683G |
| Beta cluster | 33G | 99G |
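The figures above can be turned into a rough headroom check. This is a back-of-the-envelope sketch that treats 1T as 1024G and takes used plus free as the total capacity; both are assumptions, since the actual partition sizes are not given.

```python
def used_fraction(used_gb, free_gb):
    """Fraction of capacity in use, assuming capacity = used + free."""
    return used_gb / (used_gb + free_gb)


# Figures quoted in the conversation, in gigabytes (1T taken as 1024G).
realms = {
    "Production": (2.1 * 1024, 683),
    "Beta cluster": (33, 99),
}

for realm, (used, free) in realms.items():
    print(f"{realm}: {used_fraction(used, free):.0%} used, {free}G free")
```

Under these assumptions the Beta Cluster log indices fill a quarter of the available space and production roughly three quarters.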
Buckets
We may need to add some new config to handle sorting the logs into useful buckets: in effect, rules like the ones we already have in https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/files/logstash/filter-syslog.conf that take generic inputs like syslog, split the record up, and add type and channel tags to make searching and filtering easier.
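As an illustration of what such a rule does, here is a minimal sketch written in Python rather than Logstash configuration. It splits a generic syslog-style message into a program name and a message body, then adds `type` and `channel` fields. The message format, the regular expression, and the `jenkins` type value in the usage note are assumptions made for the example; they are not taken from filter-syslog.conf.

```python
import re

# Hypothetical input format: "<program>: <message>".
SYSLOG_LINE = re.compile(r"^(?P<program>[\w.-]+): (?P<message>.*)$")


def tag_record(record, log_type):
    """Return a copy of record with type and channel fields added when it parses."""
    tagged = dict(record)
    match = SYSLOG_LINE.match(record.get("message", ""))
    if match is None:
        # Leave unparseable records unchanged so nothing is lost.
        return tagged
    tagged["message"] = match.group("message")
    tagged["channel"] = match.group("program")  # bucket used for searching
    tagged["type"] = log_type  # field a dashboard can pre-filter on
    return tagged
```

For example, `tag_record({"message": "browsertests: Build failed"}, "jenkins")` yields a record whose `channel` is `browsertests` and whose `type` is `jenkins`, while a message without a program prefix is passed through as is.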