
Add tracing to understand Toolforge and CloudVPS usage and dependencies
Open, High, Public

Description

We’d like to add tracing to both Toolforge and CloudVPS so we can better understand how tools are being used internally and how they interact with each other.
Right now, many tools rely on others as part of larger workflows, but these relationships aren’t always obvious. This becomes a real problem when we’re thinking about migrations or decommissioning - for example, a tool that looks unused might actually be quietly powering dozens of others. If we remove it without realizing that, we could end up breaking a whole chain of tools.
To prevent this, we want visibility into how tools are connected and what services they’re talking to.

This includes things like:

  • Which tools are calling other tools
  • Which tools are making requests to shared infrastructure, like:
    • Wikireplicas
    • MediaWiki APIs
    • Redis, databases, proxy services, etc.
  • How often these interactions are happening and in what direction
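
As a rough illustration of the shared-infrastructure item above, observed connections could be bucketed by destination port. This is only a sketch under stated assumptions, not something already implemented for this task: it assumes services listen on their conventional default ports (3306 for MySQL/MariaDB, 6379 for Redis, 80/443 for HTTP(S)), and `KNOWN_PORTS` and `classify_destination` are hypothetical names.

```python
# Hypothetical mapping from conventional default ports to service types.
# Assumption: the shared services listen on their default ports.
KNOWN_PORTS = {
    3306: "mariadb",  # e.g. database replicas
    6379: "redis",
    80: "http",
    443: "https",     # e.g. API requests over TLS
}


def classify_destination(port: int) -> str:
    """Return a coarse service label for a destination port."""
    return KNOWN_PORTS.get(port, "unknown")
```

A port-only classification cannot tell one HTTPS endpoint from another, so it would likely need to be combined with the destination address or hostname.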

Ideally, we want a way to trace and quantify this usage: number of requests, how frequently they happen, which tools are involved, and which external components are being hit.
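
One possible shape for that quantification, as a minimal sketch: each observed interaction is reduced to a (source, destination) pair and counted. The `Interaction` type, the `count_interactions` helper and the tool names are illustrative assumptions, not part of any existing tooling.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Interaction(NamedTuple):
    source: str       # tool that initiated the request
    destination: str  # tool or shared service being contacted


def count_interactions(events: Iterable[Interaction]) -> Counter:
    """Count how many times each (source, destination) pair was observed."""
    return Counter((event.source, event.destination) for event in events)


# Hypothetical sample data, for illustration only.
events = [
    Interaction("tools.example-a", "redis"),
    Interaction("tools.example-a", "redis"),
    Interaction("tools.example-b", "tools.example-a"),
]
usage = count_interactions(events)
```

Keeping the direction in the key (source first) preserves the "who calls whom" information asked for above; adding a timestamp to each event would allow computing frequency as well.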

This will help us:

  • Understand hidden dependencies between tools
  • Group tools that need to be migrated or maintained together
  • Avoid surprises during future infrastructure changes
  • Improve reliability across Toolforge and CloudVPS
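
To make the first two goals concrete, here is a hedged sketch of how collected tool-to-tool edges could be turned into a reverse-dependency lookup ("who depends on this tool?") and into groups of tools that are connected directly or indirectly. The function names and tool names are hypothetical; edges are assumed to be (source, destination) pairs.

```python
from collections import defaultdict


def reverse_dependencies(edges):
    """Map each destination to the set of tools that call it."""
    dependents = defaultdict(set)
    for source, destination in edges:
        dependents[destination].add(source)
    return dict(dependents)


def migration_groups(edges):
    """Group tools that are connected, ignoring edge direction."""
    neighbours = defaultdict(set)
    for source, destination in edges:
        neighbours[source].add(destination)
        neighbours[destination].add(source)

    seen, groups = set(), []
    for node in sorted(neighbours):
        if node in seen:
            continue
        # Depth-first walk collecting everything reachable from this node.
        stack, group = [node], set()
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(group)
    return groups
```

For example, with edges `[("tools.a", "tools.b"), ("tools.c", "tools.b")]`, `reverse_dependencies` shows that `tools.b` is used by both other tools even if it receives no direct user traffic. Shared services such as Redis would probably need to be left out of the grouping, otherwise nearly every tool ends up in one group.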

We could look into a few different options to make this work, including:

  • eBPF: a way to observe behavior (like network traffic or syscalls) without needing to change the tools themselves. It’s lightweight and could give us good insights into what’s talking to what.
  • SSL sniffing + HTTP protocol inspection: this would also let us passively watch encrypted traffic and decode requests to see which tools are communicating and how.
  • Zero-code OpenTelemetry setups: another way of collecting tracing data without needing tool owners to manually instrument their code. That might include things like sidecar proxies or dynamic injection.

Next Steps:

  • Explore options for implementing tracing
  • Start collecting dependency data for a representative set of tools
  • Build some basic dashboards or reports to visualize tool-to-tool and tool-to-service usage
  • Document any key findings and plan for scaling this out
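
For the reporting step, one low-effort option (an assumption on our side, not a decided approach) would be to emit the collected edges in Graphviz DOT format so they can be rendered with existing tools. The sketch below assumes the usage data is a mapping from (source, destination) pairs to request counts; `to_dot` is a hypothetical helper.

```python
def to_dot(usage: dict[tuple[str, str], int]) -> str:
    """Render (source, destination) -> count pairs as a Graphviz digraph."""
    lines = ["digraph dependencies {"]
    for (source, destination), count in sorted(usage.items()):
        # Each edge is labelled with the number of observed requests.
        lines.append(f'  "{source}" -> "{destination}" [label="{count}"];')
    lines.append("}")
    return "\n".join(lines)
```

The resulting text can be fed to any DOT renderer; the same data could equally be exposed as metrics and graphed in a dashboard instead.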

Details

Related Changes in Gerrit:
Related Changes in GitLab:
Title | Reference | Author | Source Branch | Dest Branch
infra-tracing: extend loki retention to 60 days | repos/cloud/toolforge/toolforge-deploy!1092 | volans | loki-tracing | main
shared: k8s security group for infra-metrics-loki | repos/cloud/toolforge/tofu-provisioning!104 | volans | loki-tracing | main
tools/toolsbeta: add CNAME for infra-tracing-loki | repos/cloud/toolforge/tofu-provisioning!103 | volans | loki-tracing | main
infra-tracing: set the X-Scope-OrgId header | repos/cloud/toolforge/toolforge-deploy!1087 | volans | loki-tracing | main
ingress-admission: bump to 0.0.72-20251118104433-d892c480 | repos/cloud/toolforge/toolforge-deploy!1082 | group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 | bump_ingress-admission | main
Use final namespace name for the tracing loki | repos/cloud/toolforge/ingress-admission!32 | volans | T399313 | main
tracing: add tracing loki instance | repos/cloud/toolforge/toolforge-deploy!1040 | volans | loki-tracing | main
kind: add port 30004 for loki-tracing | repos/cloud/toolforge/lima-kilo!294 | volans | loki-tracing | main
shared: add loki-tracing S3 buckets | repos/cloud/toolforge/tofu-provisioning!92 | volans | loki-tracing | main

Related Objects

Event Timeline

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!

For reference, there was a POC created several years back that was able to extract the network connections (to redis, wikis, ...):

https://github.com/david-caro/netsnoop

And for file snooping I used:

#!/usr/bin/env python3
"""Count file accesses under /mnt/nfs by Toolforge tools, using eBPF (bcc)."""
import datetime
import logging
import pwd
from functools import cache
from pathlib import Path
from typing import Any

import click
from bcc import BPF
from prometheus_client import CollectorRegistry, Counter, write_to_textfile


LOGGER = logging.getLogger(__name__)
BASE_PATH = Path("/mnt/nfs")


BPF_PROGRAM = """
#include <uapi/linux/ptrace.h>

int check_if_fname(struct pt_regs *ctx) {
    char fname[256];
    bpf_probe_read_user(&fname, sizeof(fname), (void *)PT_REGS_PARM2(ctx));
    // the lower 32 bits of the returned value hold the uid
    int uid = bpf_get_current_uid_gid();

    // trying to optimize the amount of instructions :/
    // char by char hardcoded check, as loops are costly in ebpf
    if (
        uid == 0  // skip root
        || (fname[0] == '/' && (  // it's a full path
            fname[1] != 'm'  // but not /mnt/nfs/
            || fname[2] != 'n'
            || fname[3] != 't'
            || fname[4] != '/'
            || fname[5] != 'n'
            || fname[6] != 'f'
            || fname[7] != 's'
        ))
    )
        return 0;

    // relative paths are also reported, they get resolved in userspace
    bpf_trace_printk("%s@@@%d", fname, uid);
    return 0;
}
"""


# Dependency types and the extra label each one fills in:
# dumps -> subdir
# scratch
# tool-home -> toolname
# user-home -> username
def get_dependency(path: Path) -> dict[str, Any]:
    """Map a path under BASE_PATH to the labels of the dependency it belongs to."""
    # parts looks like ('/', 'mnt', 'nfs', '<mount name>', ...)
    if len(path.parts) < 4:
        return {}

    if path.parts[3].startswith("dumps"):
        return {
            "dependency": "dumps",
            "dumps_server": path.parts[3],
            "dumps_subpath": path.parts[4] if len(path.parts) >= 5 else "",
            "dest_user": "",
            "dest_tool": "",
        }

    if path.parts[3].endswith("tools-home"):
        return {
            "dependency": "users-home",
            "dest_user": path.parts[4] if len(path.parts) >= 5 else "",
            "dumps_server": "",
            "dumps_subpath": "",
            "dest_tool": "",
        }

    if path.parts[3].endswith("tools-project"):
        return {
            "dependency": "tools-home",
            "dest_tool": f"tools.{path.parts[4]}" if len(path.parts) >= 5 else "",
            "dumps_server": "",
            "dumps_subpath": "",
            "dest_user": "",
        }

    if path.parts[3].endswith("scratch"):
        # every label has to be set, otherwise Counter.labels() raises ValueError
        return {
            "dependency": "scratch",
            "dumps_server": "",
            "dumps_subpath": "",
            "dest_user": "",
            "dest_tool": "",
        }

    return {}


@cache
def resolve_uid(uid: int) -> pwd.struct_passwd:
    """Return the passwd entry for the uid (raises KeyError if unknown)."""
    return pwd.getpwuid(uid)


def resolve_path(pid: int, relative_path: Path) -> Path:
    """Resolve a relative path against the current working directory of the process."""
    try:
        cwd_path = Path(f"/proc/{pid}/cwd")
        return cwd_path.resolve() / relative_path
    except OSError:
        # the process might be gone already, or we can't read its cwd
        pass

    return relative_path


@click.command()
@click.option("-d", "--debug", is_flag=True, show_default=True, default=False)
@click.option("-p", "--prometheus-file", type=click.Path(), show_default=True, default="./filesnoop.prom")
def main(debug: bool, prometheus_file: str):
    logging.basicConfig(level=logging.DEBUG if debug else logging.INFO)
    LOGGER.debug(f"Using prometheus file {prometheus_file}")
    LOGGER.debug(f"Loading eBPF code:\n{BPF_PROGRAM}\n{'-'*80}")
    b = BPF(text=BPF_PROGRAM, debug=1 if debug else 0)
    b.attach_kprobe(event="do_sys_openat2", fn_name="check_if_fname")
    b.attach_kprobe(event="do_sys_open", fn_name="check_if_fname")

    LOGGER.info(f"Watching directory: {BASE_PATH}")
    LOGGER.info("Press Ctrl+C to stop.\n")

    registry = CollectorRegistry()
    stats = Counter(
        "toolforge_internal_dependencies",
        "Number of times a tool is known to have accessed the given known dependency",
        ["tool", "dependency", "dumps_server", "dumps_subpath", "dest_user", "dest_tool"],
        registry=registry,
    )
    try:
        last_time = datetime.datetime.now()
        while True:
            # dump the stats roughly once a minute (only checked when an event arrives)
            cur_time = datetime.datetime.now()
            if cur_time - last_time > datetime.timedelta(seconds=60):
                LOGGER.info(f"Writing stats to file {prometheus_file}")
                write_to_textfile(prometheus_file, registry)
                last_time = cur_time

            try:
                (task, pid, cpu, flags, ts, msg) = b.trace_fields()
                raw_path, raw_uid = msg.decode().split("@@@", 1)
                path, uid = Path(raw_path), int(raw_uid)
            except ValueError:
                # malformed or undecodable trace line, skip it
                continue

            if not path.is_absolute():
                LOGGER.debug(f"Resolving non-absolute path {path}")
                path = resolve_path(pid, path)

            try:
                userinfo = resolve_uid(uid)
            except KeyError:
                LOGGER.debug(f"Ignoring entry for unknown uid {uid}")
                continue

            if not userinfo.pw_name.startswith("tools."):
                LOGGER.debug(f"Ignoring entry for user {userinfo}, not a tools user")
                continue

            if path.is_relative_to(BASE_PATH):
                dependency_info = get_dependency(path)
                LOGGER.info(f"PID {pid}, USER {userinfo.pw_name}({uid}), DEPENDENCY {dependency_info}, PATH {path}")
                if dependency_info:
                    stats.labels(tool=userinfo.pw_name, **dependency_info).inc()
            else:
                LOGGER.debug(f"Path {path} for {userinfo} not relative to {BASE_PATH}, ignored")
    except KeyboardInterrupt:
        LOGGER.info("\nStopped.")


if __name__ == "__main__":
    main()

That ran on tools-k8s-worker-nfs-47 for some time, populating the 'tools dependency' graphs at https://grafana-rw.wmcloud.org/d/3jhWxB8Vk/toolforge-general-overview?orgId=1&from=now-6h&to=now&timezone=utc&var-cluster_datasource=P8433460076D33992&var-cluster=tools&forceLogin (I just ran it again to refresh them).
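
For readers wondering why the script above indexes `path.parts[3]` and `path.parts[4]`, this small standalone sketch shows how a path under `/mnt/nfs` decomposes. The mount and tool names used here are made up for illustration; only the `/mnt/nfs` prefix and the `tools-project` suffix come from the script.

```python
from pathlib import PurePosixPath

# Hypothetical path; the mount name and tool name are illustrative.
path = PurePosixPath("/mnt/nfs/example-tools-project/example-tool/data.json")

# parts[0] is the root "/", parts[1:3] are "mnt" and "nfs",
# parts[3] is the mount name and parts[4] the first directory inside it.
mount_name = path.parts[3]
tool_dir = path.parts[4]

# Same convention as the script: the directory name becomes "tools.<name>".
dest_tool = f"tools.{tool_dir}" if mount_name.endswith("tools-project") else ""
```

This is also why paths with fewer than four parts are ignored: they do not name a mount under `/mnt/nfs` at all.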

dcaro triaged this task as High priority. Jul 31 2025, 12:21 PM

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-11T15:28:24Z] <volans@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.logging.copy_images_to_registry for Loki 3.5.7 (T399313)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-11T15:28:28Z] <volans@cloudcumin1001> Updating container image docker-registry.svc.toolforge.org/grafana/loki:3.5.7 (T399313)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-11-11T15:28:45Z] <volans@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.logging.copy_images_to_registry (exit_code=0) for Loki 3.5.7 (T399313)

group_203_bot_f4d95069bb2675e4ce1fff090c1c1620 opened https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1082

ingress-admission: bump to 0.0.72-20251118104433-d892c480

Change #1210582 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] wmcs k8s nfs: add NFS tracing script

https://gerrit.wikimedia.org/r/1210582

Change #1210591 had a related patch set uploaded (by Volans; author: Volans):

[labs/private@master] labs: add infra-tracing-nfs account

https://gerrit.wikimedia.org/r/1210591

Change #1210664 had a related patch set uploaded (by Volans; author: Volans):

[labs/private@master] labs: enable infra-tracing-nfs tracing

https://gerrit.wikimedia.org/r/1210664

Change #1210591 merged by Volans:

[labs/private@master] labs: add infra-tracing-nfs account

https://gerrit.wikimedia.org/r/1210591

Change #1210582 merged by Volans:

[operations/puppet@production] wmcs k8s nfs: add NFS tracing script

https://gerrit.wikimedia.org/r/1210582

Change #1211164 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] wmcs k8s nfs: pass the config to the NFS tracer

https://gerrit.wikimedia.org/r/1211164

Change #1210664 abandoned by Volans:

[labs/private@master] labs: enable infra-tracing-nfs tracing

Reason:

Putting the setting in horizon like the other similar ones.

https://gerrit.wikimedia.org/r/1210664

Change #1211164 merged by Volans:

[operations/puppet@production] wmcs k8s nfs: pass the config to the NFS tracer

https://gerrit.wikimedia.org/r/1211164

Change #1211610 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] toolforge: add ingress for infra-tracing-loki

https://gerrit.wikimedia.org/r/1211610

Change #1211610 merged by Volans:

[operations/puppet@production] toolforge: add ingress for infra-tracing-loki

https://gerrit.wikimedia.org/r/1211610

Change #1212186 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:toolforge::prometheus: Collect metrics for infra-tracing-loki

https://gerrit.wikimedia.org/r/1212186

Change #1212559 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] wmcs infra-tracing: optimize Loki indexing

https://gerrit.wikimedia.org/r/1212559

Change #1212559 merged by Volans:

[operations/puppet@production] wmcs infra-tracing: optimize Loki indexing

https://gerrit.wikimedia.org/r/1212559

Change #1212186 merged by Majavah:

[operations/puppet@production] P:toolforge::prometheus: Collect metrics for infra-tracing-loki

https://gerrit.wikimedia.org/r/1212186

Mentioned in SAL (#wikimedia-cloud-feed) [2025-12-01T22:30:37Z] <volans@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.logging.copy_images_to_registry for Alloy 1.4.0 (T399313)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-12-01T22:30:42Z] <volans@cloudcumin1001> Updating container image docker-registry.svc.toolforge.org/grafana/alloy:v1.4.0 (T399313)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-12-01T22:31:20Z] <volans@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.logging.copy_images_to_registry (exit_code=0) for Alloy 1.4.0 (T399313)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-12-02T08:30:52Z] <volans@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.logging.copy_images_to_registry for Alloy 1.11.3 (T399313)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-12-02T08:30:57Z] <volans@cloudcumin1001> Updating container image docker-registry.svc.toolforge.org/grafana/alloy:v1.11.3 (T399313)

Mentioned in SAL (#wikimedia-cloud-feed) [2025-12-02T08:31:45Z] <volans@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.logging.copy_images_to_registry (exit_code=0) for Alloy 1.11.3 (T399313)

Change #1214012 had a related patch set uploaded (by Volans; author: Volans):

[operations/puppet@production] wmcs infra-tracing: simplify Loki indexing

https://gerrit.wikimedia.org/r/1214012

Change #1214012 merged by Volans:

[operations/puppet@production] wmcs infra-tracing: simplify Loki indexing

https://gerrit.wikimedia.org/r/1214012