Page MenuHomePhabricator

Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model
Open, MediumPublic

Description

At the Hackathon, @TK-999 wrote a proof-of-concept patch integrating Mediawiki with the OTel library, including creation of Spans for database queries. We're requesting a few passes of code review on this patch, beginning with any architectural issues and an estimate of how much more work is needed.

Some docs on what features & extra data you can get by linking against instrumentation: https://opentelemetry.io/docs/concepts/instrumentation/libraries/

Docs on the PHP library: https://opentelemetry.io/docs/instrumentation/php/

Details

Related Changes in Gerrit:
SubjectRepoBranchLines +/-
mediawiki/coremaster+65 -3
mediawiki/coremaster+40 -3
mediawiki/libs/Timestampmaster+89 -34
mediawiki/coremaster+48 -24
operations/mediawiki-configmaster+9 -8
mediawiki/coremaster+7 -1
mediawiki/coremaster+80 -1
mediawiki/coremaster+16 -3
mediawiki/coremaster+11 -12
mediawiki/coremaster+317 -48
operations/puppetproduction+6 -0
mediawiki/coremaster+199 -0
mediawiki/coremaster+4 -3
operations/mediawiki-configmaster+1 -7
operations/mediawiki-configmaster+5 -0
mediawiki/coremaster+7 -0
mediawiki/corewmf/1.44.0-wmf.11+1 -1
mediawiki/coremaster+1 -1
operations/mediawiki-configmaster+9 -0
mediawiki/coremaster+16 -3
mediawiki/coremaster+54 -0
mediawiki/coremaster+3 -1
mediawiki/coremaster+13 -10
mediawiki/coremaster+1 -1
mediawiki/coremaster+2 -2
mediawiki/coremaster+40 -4
mediawiki/coremaster+36 -2
mediawiki/coremaster+14 -7
mediawiki/coremaster+102 -3
mediawiki/coremaster+4 -1
mediawiki/coremaster+63 -4
mediawiki/coremaster+87 -0
mediawiki/coremaster+2 K -0
mediawiki/coremaster+643 -5
Show related patches Customize query in gerrit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change #1062104 merged by jenkins-bot:

[mediawiki/libs/Timestamp@master] Implement fake-able hrtime(), deprecate microtime()

https://gerrit.wikimedia.org/r/1062104

Change #1027519 abandoned by Máté Szabó:

[mediawiki/core@master] [DNM] Add basic OpenTelemetry instrumentation

Reason:

Superseded by Ibc3910058cd7ed064cad293a3cdc091344e66b86.

https://gerrit.wikimedia.org/r/1027519

Change #1061573 merged by jenkins-bot:

[mediawiki/core@master] Introduce minimal OTEL tracing library

https://gerrit.wikimedia.org/r/1061573

pmiazga renamed this task from MediaWiki imports OpenTelemetry client instrumentation library for enhanced trace metadata to Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model.Oct 11 2024, 2:53 PM

Updated ticket title to match the current state of things. yesterday we merged the minimal OTEL tracing library. Thank you @mszabo for great work!.

The remaining steps are:

  1. Decide whether we need to discuss the RAII implications more (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1061573/comments/6d9c0d1e_23dcce5b?tab=comments).

@mszabo found an issue when we have multiple spans inside same function. The spans are auto-ended when PHP gets out of function, not out of the scope where span was created. IMHO we | should treat RAII as syntax sugar/fallback and expect engineers to start/end spans manually. The RAII would be a prevention mechanism when PHP jumps out of scope (because of exception) to make sure we auto deactivate and end open spans.

  1. implement logic to skip sampling and respect traceparent header. When the header is present the sampling logic shouldn't trigger, instead it should trace the request.
  2. wire up the library with MediaWiki logic, eg start a root span.
  3. instrument database calls
  4. update places where Telemetry::getRequestHeaders is used and instrument it with (Http factories/MWRequest/etc)
  5. update wmf-configs - set sampling rate to 0% as we shouldn't sample by our own, only depend on header existence.

@CDanis @Krinkle - what do you think would be the best areas to instrument? IMHO we should wrap entire request with a root span and instrument Database calls and the HTTPFactory. We could instrument the Parser too. For the first run we should keep it small.

@mszabo when testing the library I noticed that it may be tricky to find a good place where to start root span and call the shutdown/export. Initially, I followed the POC and started instrumentation in the MediaWikiEntryPoint, but might seem bit late. I tried to instrument all hooks and discovered that when we get to MediaWikiEntryPoint::setup - some hooks are already called.

@CDanis @Krinkle - what do you think would be the best areas to instrument? IMHO we should wrap entire request with a root span and instrument Database calls and the HTTPFactory. We could instrument the Parser too. For the first run we should keep it small.

+1 to all.

If you wanted something even smaller to instrument as a starting point, PoolCounter locks wouldn't be bad either.

@pmiazga @mszabo will either one of you have some time soon to at least do #3 above? That's the one I know the least about.

@pmiazga @mszabo will either one of you have some time soon to at least do #3 above? That's the one I know the least about.

Sure thing, I'll polish https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1061574 and that should do the trick

Change #1061574 merged by jenkins-bot:

[mediawiki/core@master] Add request-level OpenTelemetry instrumentation

https://gerrit.wikimedia.org/r/1061574

Change #1105792 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] Rdbms\Database: export OTel spans for db operations

https://gerrit.wikimedia.org/r/1105792

Change #1105816 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] Setup: parse incoming traceparent header

https://gerrit.wikimedia.org/r/1105816

Change #1105940 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/core@master] rdbms: Inject a Tracer into the DatabaseFactory service

https://gerrit.wikimedia.org/r/1105940

Change #1105792 merged by jenkins-bot:

[mediawiki/core@master] Rdbms\Database: export OTel spans for db operations

https://gerrit.wikimedia.org/r/1105792

Change #1105940 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Inject a Tracer into the DatabaseFactory service

https://gerrit.wikimedia.org/r/1105940

Change #1105955 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] telemetry: Factor out context propagation

https://gerrit.wikimedia.org/r/1105955

Change #1105816 merged by jenkins-bot:

[mediawiki/core@master] Setup: parse incoming traceparent header

https://gerrit.wikimedia.org/r/1105816

Change #1108083 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] draft: holding-the-lock spans for PoolCounters

https://gerrit.wikimedia.org/r/1108083

Change #1108117 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] NoopSpan: optionally activate

https://gerrit.wikimedia.org/r/1108117

Change #1108117 merged by jenkins-bot:

[mediawiki/core@master] NoopSpan: optionally activate

https://gerrit.wikimedia.org/r/1108117

Change #1108083 merged by jenkins-bot:

[mediawiki/core@master] PoolCounter: holding-the-lock otel spans

https://gerrit.wikimedia.org/r/1108083

Change #1108463 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] ParserOutputAccess: add tracing spans

https://gerrit.wikimedia.org/r/1108463

Change #1108467 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] Database: prefix tracing operation names

https://gerrit.wikimedia.org/r/1108467

Change #1108463 merged by jenkins-bot:

[mediawiki/core@master] ParserOutputAccess: add tracing spans

https://gerrit.wikimedia.org/r/1108463

Change #1108467 merged by jenkins-bot:

[mediawiki/core@master] Database: prefix tracing operation names

https://gerrit.wikimedia.org/r/1108467

Change #1108481 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] Setup: tracing: tweak the root span name

https://gerrit.wikimedia.org/r/1108481

Change #1108481 merged by jenkins-bot:

[mediawiki/core@master] Setup: tracing: tweak the root span name

https://gerrit.wikimedia.org/r/1108481

Change #1108766 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] Tracing: add span statuses

https://gerrit.wikimedia.org/r/1108766

Change #1108767 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] Span: fluentize activate()/deactivate()

https://gerrit.wikimedia.org/r/1108767

Change #1108771 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] Database: tracing: query errors => span error

https://gerrit.wikimedia.org/r/1108771

Change #1108767 merged by jenkins-bot:

[mediawiki/core@master] Span: fluentize activate()/deactivate()

https://gerrit.wikimedia.org/r/1108767

Change #1108766 merged by jenkins-bot:

[mediawiki/core@master] Tracing: add span statuses

https://gerrit.wikimedia.org/r/1108766

Change #1108771 merged by jenkins-bot:

[mediawiki/core@master] Database: tracing: query errors => span error

https://gerrit.wikimedia.org/r/1108771

Change #1108783 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] Tracing: span statuses for root spans

https://gerrit.wikimedia.org/r/1108783

Change #1108783 merged by jenkins-bot:

[mediawiki/core@master] Tracing: span statuses for root spans

https://gerrit.wikimedia.org/r/1108783

Change #1108802 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/mediawiki-config@master] group0: enable OpenTelemetry exports

https://gerrit.wikimedia.org/r/1108802

Change #1108802 merged by jenkins-bot:

[operations/mediawiki-config@master] group0: enable OpenTelemetry exports

https://gerrit.wikimedia.org/r/1108802

Mentioned in SAL (#wikimedia-operations) [2025-01-07T20:00:38Z] <cdanis@deploy2002> Started scap sync-world: Backport for [[gerrit:1108802|group0: enable OpenTelemetry exports (T340552)]]

Mentioned in SAL (#wikimedia-operations) [2025-01-07T20:07:35Z] <cdanis@deploy2002> cdanis: Backport for [[gerrit:1108802|group0: enable OpenTelemetry exports (T340552)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-01-07T20:16:45Z] <cdanis@deploy2002> Finished scap sync-world: Backport for [[gerrit:1108802|group0: enable OpenTelemetry exports (T340552)]] (duration: 16m 06s)

Change #1105949 had a related patch set uploaded (by CDanis; author: Máté Szabó):

[mediawiki/core@master] tracing: Disable tracing in CLI mode

https://gerrit.wikimedia.org/r/1105949

Change #1105949 merged by jenkins-bot:

[mediawiki/core@master] tracing: Disable tracing in CLI mode

https://gerrit.wikimedia.org/r/1105949

Change #1108850 had a related patch set uploaded (by CDanis; author: Máté Szabó):

[mediawiki/core@wmf/1.44.0-wmf.11] tracing: Disable tracing in CLI mode

https://gerrit.wikimedia.org/r/1108850

Change #1108850 merged by jenkins-bot:

[mediawiki/core@wmf/1.44.0-wmf.11] tracing: Disable tracing in CLI mode

https://gerrit.wikimedia.org/r/1108850

Mentioned in SAL (#wikimedia-operations) [2025-01-07T23:20:48Z] <cdanis@deploy2002> Started scap sync-world: Backport for [[gerrit:1108850|tracing: Disable tracing in CLI mode (T340552)]]

Mentioned in SAL (#wikimedia-operations) [2025-01-07T23:27:02Z] <cdanis@deploy2002> cdanis: Backport for [[gerrit:1108850|tracing: Disable tracing in CLI mode (T340552)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-01-07T23:35:12Z] <cdanis@deploy2002> Finished scap sync-world: Backport for [[gerrit:1108850|tracing: Disable tracing in CLI mode (T340552)]] (duration: 14m 23s)

Change #1108869 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] Database: tracing: emit row respose stats

https://gerrit.wikimedia.org/r/1108869

Change #1108869 merged by jenkins-bot:

[mediawiki/core@master] Database: tracing: emit row respose stats

https://gerrit.wikimedia.org/r/1108869

Change #1109133 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/mediawiki-config@master] group1: enable OpenTelemetry exports

https://gerrit.wikimedia.org/r/1109133

Change #1109442 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] Telemetry: allow mutation of trace context

https://gerrit.wikimedia.org/r/1109442

Change #1109133 merged by jenkins-bot:

[operations/mediawiki-config@master] group1: enable OpenTelemetry exports

https://gerrit.wikimedia.org/r/1109133

Mentioned in SAL (#wikimedia-operations) [2025-01-09T16:48:33Z] <cdanis@deploy2002> Started scap sync-world: Backport for [[gerrit:1109133|group1: enable OpenTelemetry exports (T340552)]]

Mentioned in SAL (#wikimedia-operations) [2025-01-09T16:53:42Z] <cdanis@deploy2002> cdanis: Backport for [[gerrit:1109133|group1: enable OpenTelemetry exports (T340552)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-01-09T17:02:56Z] <cdanis@deploy2002> Finished scap sync-world: Backport for [[gerrit:1109133|group1: enable OpenTelemetry exports (T340552)]] (duration: 14m 22s)

Change #1109706 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] [DNM] proof-of-concept of Telemetry/TracerState

https://gerrit.wikimedia.org/r/1109706

Change #1109717 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] HttpRequestFactory: use Tracer telemetry

https://gerrit.wikimedia.org/r/1109717

Change #1107971 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] WANObjectCache: trace basic operations

https://gerrit.wikimedia.org/r/1107971

Change #1105955 merged by jenkins-bot:

[mediawiki/core@master] telemetry: Factor out context propagation

https://gerrit.wikimedia.org/r/1105955

Change #1109754 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/mediawiki-config@master] OpenTelemetry tracing to all wikis

https://gerrit.wikimedia.org/r/1109754

Change #1109754 merged by jenkins-bot:

[operations/mediawiki-config@master] OpenTelemetry tracing to all wikis

https://gerrit.wikimedia.org/r/1109754

Mentioned in SAL (#wikimedia-operations) [2025-01-13T17:11:03Z] <cdanis@deploy2002> Started scap sync-world: Backport for [[gerrit:1109754|OpenTelemetry tracing to all wikis (T340552)]]

Mentioned in SAL (#wikimedia-operations) [2025-01-13T17:15:51Z] <cdanis@deploy2002> cdanis: Backport for [[gerrit:1109754|OpenTelemetry tracing to all wikis (T340552)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-01-13T17:25:04Z] <cdanis@deploy2002> Finished scap sync-world: Backport for [[gerrit:1109754|OpenTelemetry tracing to all wikis (T340552)]] (duration: 14m 00s)

Change #1109717 merged by jenkins-bot:

[mediawiki/core@master] HttpRequestFactory: use Tracer telemetry

https://gerrit.wikimedia.org/r/1109717

Change #1111265 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/puppet@production] urldownloader: scrub outbound privacy-sensitive hdrs

https://gerrit.wikimedia.org/r/1111265

Change #1111687 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/core@master] telemetry: Add propagator tests

https://gerrit.wikimedia.org/r/1111687

Change #1111687 merged by jenkins-bot:

[mediawiki/core@master] telemetry: Add propagator tests

https://gerrit.wikimedia.org/r/1111687

Change #1111265 merged by CDanis:

[operations/puppet@production] urldownloader: scrub outbound privacy-sensitive hdrs

https://gerrit.wikimedia.org/r/1111265

Change #1114424 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] ObjectCacheFactory: use Tracer telemetry

https://gerrit.wikimedia.org/r/1114424

Change #1114479 had a related patch set uploaded (by Máté Szabó; author: Máté Szabó):

[mediawiki/core@master] objectcache: Require a TracerInterface in ObjectCacheFactory

https://gerrit.wikimedia.org/r/1114479

Change #1114424 merged by jenkins-bot:

[mediawiki/core@master] ObjectCacheFactory: use Tracer telemetry

https://gerrit.wikimedia.org/r/1114424

Change #1114479 merged by jenkins-bot:

[mediawiki/core@master] objectcache: Require a TracerInterface in ObjectCacheFactory

https://gerrit.wikimedia.org/r/1114479

Change #1115390 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] doc: SpanInterface: more dev-friendly comments

https://gerrit.wikimedia.org/r/1115390

Change #1115398 had a related patch set uploaded (by CDanis; author: CDanis):

[mediawiki/core@master] doc: add warning re: telemetry headers

https://gerrit.wikimedia.org/r/1115398

Change #1107971 merged by jenkins-bot:

[mediawiki/core@master] WANObjectCache: trace basic operations

https://gerrit.wikimedia.org/r/1107971

Change #1115398 merged by jenkins-bot:

[mediawiki/core@master] doc: add warning re: telemetry headers

https://gerrit.wikimedia.org/r/1115398

Change #1115436 had a related patch set uploaded (by CDanis; author: CDanis):

[operations/mediawiki-config@master] Trace only on k8s. Not (yet?) available on bare metal

https://gerrit.wikimedia.org/r/1115436

Change #1115436 merged by jenkins-bot:

[operations/mediawiki-config@master] Trace only on k8s.

https://gerrit.wikimedia.org/r/1115436

Mentioned in SAL (#wikimedia-operations) [2025-01-30T16:06:01Z] <cdanis@deploy2002> Started scap sync-world: Backport for [[gerrit:1115436|Trace only on k8s. (T321211 T340552 T385037)]]

Mentioned in SAL (#wikimedia-operations) [2025-01-30T16:09:45Z] <cdanis@deploy2002> cdanis: Backport for [[gerrit:1115436|Trace only on k8s. (T321211 T340552 T385037)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-01-30T16:17:57Z] <cdanis@deploy2002> Finished scap sync-world: Backport for [[gerrit:1115436|Trace only on k8s. (T321211 T340552 T385037)]] (duration: 11m 55s)

Change #1115390 merged by jenkins-bot:

[mediawiki/core@master] doc: SpanInterface: more dev-friendly comments

https://gerrit.wikimedia.org/r/1115390

The assignment in T340552#10038643 was to review the intial core lib, which is done, and I haven't looked at it much since. The expanded instrumention added since is great, and I've seen it work in Jaeger. Un-assiginging to reflect reality. Feel free to tag MediaWiki-Engineering back if you need anything!

Change #1109442 abandoned by CDanis:

[mediawiki/core@master] Telemetry: allow mutation of trace context

Reason:

Ib1cf361dd800c81a7943231441d58c790191ef5a

https://gerrit.wikimedia.org/r/1109442

Change #1109706 abandoned by CDanis:

[mediawiki/core@master] [DNM] proof-of-concept of Telemetry/TracerState

Reason:

Ib1cf361dd800c81a7943231441d58c790191ef5a

https://gerrit.wikimedia.org/r/1109706