
Bug: jobs-framework-api job run fails silently on local development environment
Closed, Resolved | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

  • clone the jobs-framework-api repository (https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api) and follow the instructions to set it up on your local machine.
  • submit a job job1.sh that echoes "delaying by 1hr" to stdout and then runs sleep 3600 (ensure that filelog is enabled)
  • submit a job job_does_not_exist.sh where the given script file doesn't exist.

What happens?:

  • The job1.sh job produces no log output, yet its status is reported as "Completed". This indicates that the script wasn't executed at all: had it run, it would have logged our desired string to the .out file and waited for 1hr (if k8s allows) before terminating, but the job terminates almost immediately.
  • job_does_not_exist.sh is also reported as "Completed" even though this job should clearly fail because the script doesn't exist.

What should have happened instead?:

  • The job1.sh job should log to the .out file, then wait for as long as 1hr before the job pod is marked as "Completed" and terminated.
  • The show command for the job_does_not_exist.sh job should report a status indicating that the job hit an error and failed.

Why was this not detected by the tests?
Because we weren't testing for this. Right now the tests submit jobs, then read the k8s object and inspect it. This approach is not always reliable: the object eventually reports the correct information, but not necessarily when it is inspected soon after the job is created.
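One way to avoid inspecting the k8s object too early is to poll until the job reaches a settled state before asserting on it. The following is only a minimal sketch of that idea, not what the test suite actually does: `fetch_status` is a hypothetical callable standing in for whatever the tests use to read the job status, and "Failed" is an assumed status name ("Completed" is the one quoted in this report).

```python
import time
from typing import Callable

# "Completed" appears in this report; "Failed" is an assumed name for the
# error state and may differ in the real API.
TERMINAL_STATUSES = frozenset({"Completed", "Failed"})


def wait_for_terminal_status(
    fetch_status: Callable[[], str],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll fetch_status() until it returns a terminal status.

    fetch_status is a hypothetical hook: in a real test it would read the
    job (or its k8s object) and return the status string.
    Raises TimeoutError if no terminal status is seen within `timeout`.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        sleep(interval)
```

A test built on this would still need to check the outcome itself, for example that job_does_not_exist.sh does not settle as "Completed" and that the .out file of job1.sh contains the expected string, since in this bug the status alone was misleading.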

Solution

Event Timeline

I have a theory about why this happens: I think it is caused by the fake/local toolforge kubernetes deployment on our development laptops. I don't think this affects the actual kubernetes in tools/toolsbeta.

The pods are created with different attributes in an actual kubernetes deployment (tools, or toolsbeta) compared to the local development environment.

Example of missing bits in the development k8s pod template:

[..]
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      runAsGroup: 54005
      runAsUser: 54005

There are at least 2 things here:

  • the whole securityContext stanza is not present. I'm not sure yet why, or what should be injecting it into the pod template. I suspect a misconfiguration somewhere in the fake local toolforge PSP. Or perhaps this is added by some admission controller?
  • uid/gids are special in tools/toolsbeta. They come from LDAP. So if we create a test tool account in the development environment we should make sure the uid/gid are consistent somehow inside/outside the devel kubernetes
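The first point can be checked mechanically by comparing a container spec against the fields shown in the snippet above. This is a hedged sketch only: the function name and the sample dictionaries are made up for illustration, and it checks only for the presence of the four keys from the snippet, not their values.

```python
from typing import Any, Dict, List

# Keys taken from the securityContext snippet quoted in this task.
EXPECTED_KEYS = (
    "allowPrivilegeEscalation",
    "capabilities",
    "runAsGroup",
    "runAsUser",
)


def missing_security_context_keys(container: Dict[str, Any]) -> List[str]:
    """Return the expected securityContext keys absent from a container spec.

    `container` is a plain dict as obtained by parsing the pod template
    (hypothetical input; how the template is fetched is out of scope here).
    """
    ctx = container.get("securityContext") or {}
    return [key for key in EXPECTED_KEYS if key not in ctx]


# Illustrative inputs: one container without the stanza, one with it.
dev_container = {"name": "job", "image": "example-image"}
full_container = {
    "name": "job",
    "image": "example-image",
    "securityContext": {
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"]},
        "runAsGroup": 54005,
        "runAsUser": 54005,
    },
}

print(missing_security_context_keys(dev_container))
print(missing_security_context_keys(full_container))
```

For the container without the stanza the function lists all four keys; for the one matching the snippet it returns an empty list.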

I think both problems can be mitigated by moving into a more robust way of creating/maintaining our local kubernetes deployment. That's why I started https://gitlab.wikimedia.org/aborrero/cloud-toolforge-lima-kilo

Change 870686 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] kubeadm: psp: base-pod-security-policies.yaml: allow hostPath volumes

https://gerrit.wikimedia.org/r/870686

Change 870694 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] jobs-api: deployment: security context fixes

https://gerrit.wikimedia.org/r/870694

Change 870697 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] deployment: use privileged-psp

https://gerrit.wikimedia.org/r/870697

Change 870887 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: migrate to helm

https://gerrit.wikimedia.org/r/870887

I have been investigating this. Apparently the conclusion by @Raymond_Ndibe is right: relocating the output redirection doesn't work as expected.

Change 871171 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] command: introduce filelog_std{out,err} parameters

https://gerrit.wikimedia.org/r/871171

Hello @aborrero, some of the related patches already submitted have reviews that need attending to. We need to resolve this issue because it's blocking other patches.

Change 870887 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: migrate to helm

https://gerrit.wikimedia.org/r/870887

Change 870694 abandoned by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-api@main] jobs-api: deployment: security context fixes

Reason:

This is wrong, we actually need a privileged-psp because we need access to every tool home dir.

https://gerrit.wikimedia.org/r/870694

Change 870697 abandoned by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-api@main] deployment: use privileged-psp

Reason:

already present in https://gerrit.wikimedia.org/r/c/cloud/toolforge/jobs-framework-api/+/870887

https://gerrit.wikimedia.org/r/870697

Change 871171 merged by jenkins-bot:

[cloud/toolforge/jobs-framework-api@main] command: introduce filelog_std{out,err} parameters

https://gerrit.wikimedia.org/r/871171

aborrero renamed this task from "Bug: jobs-framework-api job run fails silently" to "Bug: jobs-framework-api job run fails silently on local development environment". Jan 12 2023, 10:13 AM
aborrero claimed this task.

Change 879610 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: adopt the lima-kilo setup

https://gerrit.wikimedia.org/r/879610

Change 879610 merged by Arturo Borrero Gonzalez:

[cloud/toolforge/jobs-framework-api@main] jobs-framework-api: adopt the lima-kilo setup

https://gerrit.wikimedia.org/r/879610