We’d like to add tracing to both Toolforge and CloudVPS so we can better understand how tools are being used internally and how they interact with each other.
Right now, many tools rely on others as part of larger workflows, but these relationships aren’t always obvious. This becomes a real problem when we’re thinking about migrations or decommissioning - for example, a tool that looks unused might actually be quietly powering dozens of others. If we remove it without realizing that, we could end up breaking a whole chain of tools.
To prevent this, we want visibility into how tools are connected and what services they’re talking to.
This includes things like:
- Which tools are calling other tools
- Which tools are making requests to shared infrastructure, like:
  - Wikireplicas
  - MediaWiki APIs
  - Redis, databases, proxy services, etc.
- How often these interactions are happening and in what direction
Ideally, we want a way to trace and quantify this usage: number of requests, how frequently they happen, which tools are involved, and which external components are being hit.
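As a rough illustration of what "quantifying" could mean, here is a minimal sketch that turns a list of observed requests into per-pair counts and an approximate hourly rate. The `Interaction` record and its field names are hypothetical; whichever collection method we pick would need to produce something similar.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One observed request. All field names here are hypothetical."""
    source: str       # tool that made the request, e.g. "tool-a"
    target: str       # tool or shared service that received it
    timestamp: float  # seconds since the epoch


def summarize(interactions):
    """Return {(source, target): (request_count, requests_per_hour)}."""
    interactions = list(interactions)  # allow generators; we iterate twice
    counts = Counter((i.source, i.target) for i in interactions)
    first, last = {}, {}
    for i in interactions:
        key = (i.source, i.target)
        first[key] = min(first.get(key, i.timestamp), i.timestamp)
        last[key] = max(last.get(key, i.timestamp), i.timestamp)
    summary = {}
    for key, count in counts.items():
        hours = (last[key] - first[key]) / 3600
        # With a zero-length observation window there is no meaningful rate.
        rate = count / hours if hours > 0 else None
        summary[key] = (count, rate)
    return summary
```

The rate here is simply the observed count divided by the observed time span, so it is only a rough indicator for pairs with few samples.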
This will help us:
- Understand hidden dependencies between tools
- Group tools that need to be migrated or maintained together
- Avoid surprises during future infrastructure changes
- Improve reliability across Toolforge and CloudVPS
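To make the grouping and hidden-dependency goals concrete, here is a small sketch under the assumption that we already have a list of `(caller, callee)` pairs. The function names are hypothetical. `migration_groups` finds sets of tools linked by any chain of calls, and `dependents_of` answers the question "what breaks if this tool goes away?".

```python
from collections import defaultdict


def migration_groups(edges):
    """Group nodes that are connected by any chain of dependencies."""
    neighbours = defaultdict(set)
    for source, target in edges:
        # Direction is ignored here: either side of a call ties the pair together.
        neighbours[source].add(target)
        neighbours[target].add(source)
    seen = set()
    groups = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        group = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(neighbours[node] - seen)
        groups.append(group)
    return groups


def dependents_of(edges, target):
    """Return every tool that reaches `target` directly or indirectly."""
    callers = defaultdict(set)
    for source, dest in edges:
        callers[dest].add(source)
    found = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for caller in callers[node]:
            if caller not in found:
                found.add(caller)
                stack.append(caller)
    found.discard(target)  # a cycle could otherwise list the target itself
    return found
```

For grouping, it probably makes sense to pass only tool-to-tool edges. Shared services such as Wikireplicas would otherwise connect almost every tool into a single group.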
We could look into a few different options to make this work, including:
- eBPF: a way to observe behavior (like network traffic or syscalls) without needing to change the tools themselves. It’s lightweight and could give us good insight into what’s talking to what.
- SSL/TLS sniffing + HTTP protocol inspection: this would also let us passively watch encrypted traffic and decode requests to see which tools are communicating and how.
- Zero-code OpenTelemetry setups: another way of collecting tracing data without needing tool owners to manually instrument their code. That might include things like sidecar proxies or dynamic injection.
Next Steps:
- Explore options for implementing tracing
- Start collecting dependency data for a representative set of tools
- Build some basic dashboards or reports to visualize tool-to-tool and tool-to-service usage
- Document any key findings and plan for scaling this out