Page MenuHomePhabricator

service-runner apps running on kubernetes emit logs with log level 50
Closed, ResolvedPublic

Description

Intro

Up to service-runner 2.7.7, logging to stdout would emit an integer log level, something that while functional, would not be very nice to see in logstash, nor would it convey any information (e.g. is "50" a warning or an information message?). While this wasn't a problem when services would directly log to logstash, in the logging pipeline that is becoming a minor nuisance and would be nice to solve.

We want to drop direct logstash logging and rely on the logging pipeline alone for a number of reasons, but this is a blocker.

Starting with service-runner 2.7.7, we can add named_levels: true to the configuration of services, thus emitting the more useful "WARNING", "INFO", "DEBUG" and so on.

Services

Below is a list of all services that need to be updated, before we disable direct logstash logging.

Those services need a service-runner bump and then the corresponding helm chart change.

Details

Looking at https://logstash.wikimedia.org/goto/eeb430dc9ce3bb2006e74d5535d72220 it appears that we have log messages of log level "50"

A sample would be

{
  "_index": "logstash-2019.11.29",
  "_type": "unsupported_language",
  "_id": "AW62zVUfarkxubcmhm9E",
  "_version": 1,
  "_score": null,
  "_source": {
    "msg": "501: unsupported_language",
    "kubernetes": {
      "namespace_id": "78fc9cd8-d3db-11e9-9f9a-aa0000fe6bdf",
      "container_name": "wikifeeds-production",
      "creation_timestamp": "2019-09-10T14:58:39Z",
      "host": "kubernetes1003.eqiad.wmnet",
      "master_url": "https://kubemaster.svc.eqiad.wmnet:6443",
      "pod_id": "f9c9f488-0fbe-11ea-a07e-aa0000fe6bdf",
      "namespace_labels": {
        "app": "raw",
        "heritage": "Tiller",
        "chart": "raw-0.2.0",
        "release": "wikifeeds-namespace"
      },
      "labels": {
        "app": "wikifeeds",
        "pod-template-hash": "698dd6b976",
        "release": "production"
      },
      "pod_name": "wikifeeds-production-698dd6b976-vx2lw",
      "namespace_name": "wikifeeds"
    },
    "request": {
      "headers": {
        "x-request-id": "8eae2f83-2080-4d65-9862-5118cde9160b",
        "content-length": "0"
      },
      "method": "GET",
      "query": {},
      "remotePort": 42778,
      "params": {
        "0": "/simple.wikipedia.org/v1/feed/onthisday/selected/11/23"
      },
      "url": "/simple.wikipedia.org/v1/feed/onthisday/selected/11/23",
      "remoteAddress": "10.64.0.145"
    },
    "metadata": {
      "filename": "/var/log/containers/wikifeeds-production-698dd6b976-vx2lw_wikifeeds_wikifeeds-production-51f6e0ed11a8bf36e95ee3c2b312804c351557ea7c01772bdb37d402d1b1f37f.log",
      "fileoffset": "115335246"
    },
    "pid": 16,
    "program": "input-file-kubernetes",
    "type": "unsupported_language",
    "docker": {
      "container_id": "51f6e0ed11a8bf36e95ee3c2b312804c351557ea7c01772bdb37d402d1b1f37f"
    },
    "levelPath": "error/501",
    "hostname": "wikifeeds-production-698dd6b976-vx2lw",
    "stream": "stdout",
    "host": "kubernetes1003",
    "@version": "1",
    "timestamp": "2019-11-29T10:55:39.480534+00:00",
    "level": "50",
    "logsource": "kubernetes1003",
    "message": "501: unsupported_language",
    "normalized_message": "501: unsupported_language",
    "tags": [
      "input-kafka-rsyslog-shipper",
      "rsyslog-shipper",
      "kafka",
      "syslog",
      "es",
      "normalized_message_untrimmed"
    ],
    "@timestamp": "2019-11-29T10:55:39.574Z",
    "v": 0,
    "name": "wikifeeds",
    "time": "2019-11-29T10:55:39.480Z",
    "detail": "The language you have requested is not yet supported.",
    "facility": "local0",
    "request_id": "8eae2f83-2080-4d65-9862-5118cde9160b",
    "status": 501
  },
  "fields": {
    "@timestamp": [
      1575024939574
    ]
  },
  "sort": [
    1575024939574
  ]
}

kubectl logs for wikifeeds shows the same thing (which is expected since that's the exact content that get shipped to logstash)

Details

ProjectBranchLines +/-Subject
operations/deployment-chartsmaster+4 -2
mediawiki/services/eventstreamsmaster+5 -4
operations/deployment-chartsmaster+12 -12
eventgate-wikimediamaster+2 -2
operations/deployment-chartsmaster+3 -3
mediawiki/services/citoidmaster+32 -17
operations/deployment-chartsmaster+238 -214
operations/deployment-chartsmaster+522 -367
mediawiki/services/cxservermaster+1 K -934
operations/deployment-chartsmaster+383 -493
mediawiki/services/mathoidmaster+1 -1
operations/deployment-chartsmaster+547 -359
operations/deployment-chartsmaster+3 -3
operations/deployment-chartsmaster+3 -0
mediawiki/services/chromium-rendermaster+3 -1
mediawiki/services/chromium-rendermaster+4 K -2
mediawiki/services/mobileappsmaster+11 -4
mediawiki/services/wikifeedsmaster+3 -1
mediawiki/services/recommendation-apimaster+3 -1
mediawiki/services/recommendation-apimaster+61 -19
mediawiki/services/mobileappsmaster+47 -10
mediawiki/services/wikifeedsmaster+33 -68
Show related patches Customize query in gerrit

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@Mholloway I 'd says at least the standard syslog levels[1], as their names and meaning permeate the entire industry regardless of technology, being presented in that column. There is nothing wrong with having more, but the only case where I 've found use of more is to have more debugging log output.

[1] https://en.wikipedia.org/wiki/Syslog#Severity_level

Thanks @akosiaris. I did some digging, and found that the integer value of 50 for errors is coming from bunyan, the logging package used by service-runner.

bearND added a subscriber: bearND.

This is strange. When I run this locally I get a 404, not a 501. Same when I use the aggregated=true query parameter or run it through a local RESTBase instance (http://localhost:8889/en.wikipedia.org/v1/feed/onthisday/selected/11/23). No 501 when requesting the featured feed either.

FWIW, there is this in the OpenAPI spec for wikifeeds:

501:
  description: Unsupported language
  schema:
    $ref: '#/definitions/problem'

But I don't think it's coming from this or wikifeeds.

new version of service-runner was release. You need to update service-runner in the container, add 'named_levels: true' to the stdout stream in the config and deploy.

Change 587744 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/mobileapps@master] Bump service-runner to 2.7.7

https://gerrit.wikimedia.org/r/587744

Change 587749 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/wikifeeds@master] Bump service-runner to 2.7.7

https://gerrit.wikimedia.org/r/587749

Change 587771 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/recommendation-api@master] Bump service runner to 2.7.7

https://gerrit.wikimedia.org/r/587771

Change 587774 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[mediawiki/services/chromium-render@master] Bump service-runner to 2.7.7

https://gerrit.wikimedia.org/r/587774

Change 587777 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[mediawiki/services/cxserver@master] Update service-runner from 2.7.6 -> 2.7.7

https://gerrit.wikimedia.org/r/587777

Change 587749 merged by jenkins-bot:
[mediawiki/services/wikifeeds@master] Bump service-runner to 2.7.7

https://gerrit.wikimedia.org/r/587749

Change 587804 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[mediawiki/services/mathoid@master] Bump service-runner to 2.7.7

https://gerrit.wikimedia.org/r/587804

Change 587744 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Bump service-runner to 2.7.7

https://gerrit.wikimedia.org/r/587744

Change 587771 merged by jenkins-bot:
[mediawiki/services/recommendation-api@master] Bump service runner to 2.7.7

https://gerrit.wikimedia.org/r/587771

Change 587818 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] Add named log levels

https://gerrit.wikimedia.org/r/587818

@Pchelolo Do we need to add named_levels: true to the deploy repos, too? There are already two streams.

This is only needed for k8s, not for scap deploys. For k8s - yes, there should be 'named_levels: true' for the stdout stream.

Change 587829 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/recommendation-api@master] Add named log levels

https://gerrit.wikimedia.org/r/587829

Change 587837 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/wikifeeds@master] Add named log levels

https://gerrit.wikimedia.org/r/587837

Good to know. So, no need to make changes to the deploy repos.

Change 587845 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/deployment-charts@master] Update multiple services' stdout logging to use named_levels

https://gerrit.wikimedia.org/r/587845

Change 587818 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] Add named log levels

https://gerrit.wikimedia.org/r/587818

Change 587829 merged by jenkins-bot:
[mediawiki/services/recommendation-api@master] Add named log levels

https://gerrit.wikimedia.org/r/587829

Change 587837 merged by jenkins-bot:
[mediawiki/services/wikifeeds@master] Add named log levels

https://gerrit.wikimedia.org/r/587837

Change 587774 merged by jenkins-bot:
[mediawiki/services/chromium-render@master] Bump service-runner to 2.7.7

https://gerrit.wikimedia.org/r/587774

Change 587852 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/chromium-render@master] Add named log levels

https://gerrit.wikimedia.org/r/587852

Change 587852 merged by jenkins-bot:
[mediawiki/services/chromium-render@master] Add named log levels

https://gerrit.wikimedia.org/r/587852

Change 587854 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/deployment-charts@master] Bump wikifeeds to 2020-04-09-185833-production

https://gerrit.wikimedia.org/r/587854

Change 587845 merged by jenkins-bot:
[operations/deployment-charts@master] Update multiple services' stdout logging to use named_levels

https://gerrit.wikimedia.org/r/587845

Change 587854 merged by jenkins-bot:
[operations/deployment-charts@master] Bump wikifeeds to 2020-04-09-185833-production

https://gerrit.wikimedia.org/r/587854

Change 587868 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/deployment-charts@master] Release wikifeeds 0.0.10

https://gerrit.wikimedia.org/r/587868

Change 587868 merged by jenkins-bot:
[operations/deployment-charts@master] Release new charts for wikifeeds, mobileapps, chromium-render

https://gerrit.wikimedia.org/r/587868

Change 587804 merged by Alexandros Kosiaris:
[mediawiki/services/mathoid@master] Bump service-runner to 2.7.7

https://gerrit.wikimedia.org/r/587804

Change 587980 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] mathoid: Enabled named loglevels

https://gerrit.wikimedia.org/r/587980

Change 587980 merged by jenkins-bot:
[operations/deployment-charts@master] mathoid: Enabled named loglevels

https://gerrit.wikimedia.org/r/587980

Mathoid has been deployed with this change and it works fine. I can now see in logstash names for the log levels versus the numbers which is definitely more intuitive and useful. So this has been proven even in production and the rest of the services can follow suit relatively (covid-19 permitting) fast.

@Pchelolo many thanks for reviewing and making my proposed change viable. This issue has been bugging me for some time.

Change 587777 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Update service-runner from 2.7.6 -> 2.7.7

https://gerrit.wikimedia.org/r/587777

Just verified that wikifeeds is now using named_levels.

Change 588540 had a related patch set uploaded (by KartikMistry; owner: KartikMistry):
[operations/deployment-charts@master] Update cxserver to 2020-04-13-094138-production

https://gerrit.wikimedia.org/r/588540

Change 588540 merged by jenkins-bot:
[operations/deployment-charts@master] Update cxserver to 2020-04-13-094138-production

https://gerrit.wikimedia.org/r/588540

Mentioned in SAL (#wikimedia-operations) [2020-04-15T05:22:24Z] <kart_> Update cxserver to 2020-04-13-094138-production (T239459, T249469)

Verified: cxserver is now using named levels.

akosiaris renamed this task from service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 to service-runner apps running on kubernetes emit logs with log level 50 .Apr 15 2020, 4:50 PM
akosiaris updated the task description. (Show Details)

Adding owners of services in the subscribers list.

Change 589079 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[operations/deployment-charts@master] Update change-prop container version to v0.9.5

https://gerrit.wikimedia.org/r/589079

Change 589079 merged by jenkins-bot:
[operations/deployment-charts@master] Update change-prop container version to v0.9.5

https://gerrit.wikimedia.org/r/589079

Change 594492 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/deployment-charts@master] eventgate, eventstreams, citoid: Log with namedlevels

https://gerrit.wikimedia.org/r/594492

Change 594504 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/citoid@master] Update service-runner to 2.7.7

https://gerrit.wikimedia.org/r/594504

Change 594504 merged by jenkins-bot:
[mediawiki/services/citoid@master] Update service-runner to 2.7.7

https://gerrit.wikimedia.org/r/594504

Change 594513 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[operations/deployment-charts@master] Citoid: Update restbase to 2.7.7

https://gerrit.wikimedia.org/r/594513

Change 594513 merged by jenkins-bot:
[operations/deployment-charts@master] Citoid: Update service-runner to 2.7.7

https://gerrit.wikimedia.org/r/594513

Change 609841 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[eventgate-wikimedia@master] Bump EventGate version to 85d95ea to get service-runner 2.7.7

https://gerrit.wikimedia.org/r/609841

Change 609841 merged by Ottomata:
[eventgate-wikimedia@master] Bump EventGate version to 85d95ea to get service-runner 2.7.7

https://gerrit.wikimedia.org/r/609841

Change 610057 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-* - bump to version 2020-07-07-130941-production

https://gerrit.wikimedia.org/r/610057

Change 610057 merged by Ottomata:
[operations/deployment-charts@master] eventgate-* - bump to version 2020-07-07-130941-production

https://gerrit.wikimedia.org/r/610057

All eventgate-* services are updated to service-runner 2.7.7

The hold-up seems to be eventstreams; it actually uses a fork of service runner, and the fork is missing the feature, so it's slightly more complicated to update. https://github.com/wikimedia/service-runner/tree/prometheus_metrics

The hold-up seems to be eventstreams; it actually uses a fork of service runner, and the fork is missing the feature, so it's slightly more complicated to update. https://github.com/wikimedia/service-runner/tree/prometheus_metrics

Indeed. Citoid has been done and eventgate's chart is being bumped in https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/594492

I 've pinged on https://github.com/wikimedia/service-runner/pull/230, it would be nice if we could finally wrap this up.

Change 641846 had a related patch set uploaded (by Clarakosi; owner: Clarakosi):
[mediawiki/services/eventstreams@master] Update service-runner to v2.8.0 and modify guage metric to use set()

https://gerrit.wikimedia.org/r/641846

Change 641846 merged by jenkins-bot:
[mediawiki/services/eventstreams@master] Update service-runner to v2.8.0 and modify guage metric to use set()

https://gerrit.wikimedia.org/r/641846

Change 594492 merged by jenkins-bot:
[operations/deployment-charts@master] eventgate, eventstreams: Log with namedlevels

https://gerrit.wikimedia.org/r/594492

eventstreams done. Double checked in logstash and I can see nice log levels now.

akosiaris claimed this task.

eventgate done. And with this, we can close this task. Thanks to all those that contributed.