analytics/wmde/scripts Graphite to Prometheus migration
Closed, Resolved (Public)

Description

This task is meant to split part of T389061, which came from a discussion on Slack. Please note that it also falls under the WMDE Graphite to Prometheus migration in T371616.

Follow the migration process as outlined below.

1. Secure/Conduct code review(s).
2. Deploy the changes to production via the train (https://wikitech.wikimedia.org/wiki/Deployments/Train).
3. Verify that the changes have been successfully implemented.
4. Update the relevant dashboard(s) by replacing the old Graphite metric(s) with the new Prometheus metric(s).
Please follow the guidelines and standards outlined in the provided documentation:

https://www.mediawiki.org/wiki/Manual:Stats for detailed guidance on the conversion process.
https://drive.google.com/file/d/12yQEuOapkML1vb9MgCaX1QzbLBdXE6X2/view for a video tutorial on the conversion process.
https://docs.google.com/presentation/d/1SZWf_D3mWNX-XHN8PHYI84LDZr6GUQC2AMhZ9mQXCI0/edit#slide=id.g2795460c956_0_23 for slides on the best practices for converting metrics to statslib.

Specifically it looks like the way that the metrics are collected is via:

We'll need to migrate these classes and all their instances given the migration procedures above.

GitHub mirror of the code: GitHub:wikimedia/analytics-wmde-scripts

Details

Related Changes in Gerrit:
Repo                    Branch      Lines +/-
analytics/wmde/scripts  production  +1 -1
analytics/wmde/scripts  master      +1 -1
analytics/wmde/scripts  production  +7 -7
analytics/wmde/scripts  production  +2 -2
analytics/wmde/scripts  production  +10 -12
analytics/wmde/scripts  master      +7 -7
analytics/wmde/scripts  master      +2 -2
analytics/wmde/scripts  master      +10 -12
analytics/wmde/scripts  production  +2 -1
analytics/wmde/scripts  master      +2 -1
analytics/wmde/scripts  production  +1 -0
analytics/wmde/scripts  master      +1 -0
analytics/wmde/scripts  production  +1 -1
analytics/wmde/scripts  master      +1 -1
analytics/wmde/scripts  master      +218 -16
analytics/wmde/scripts  production  +218 -16
operations/puppet       production  +2 -2
operations/puppet       production  +2 -0

Event Timeline

Change #1136417 had a related patch set uploaded (by Hasan Akgün (WMDE); author: Hasan Akgün (WMDE)):

[analytics/wmde/scripts@master] Add Prometheus stats push

https://gerrit.wikimedia.org/r/1136417

Change #1136741 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/puppet@production] statistics::wmde: Configure Prometheus Pushgateway

https://gerrit.wikimedia.org/r/1136741

Change #1136741 merged by RLazarus:

[operations/puppet@production] statistics::wmde: Configure Prometheus Pushgateway

https://gerrit.wikimedia.org/r/1136741

Change #1139431 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[operations/puppet@production] statistics::wmde: Configure statsd_exporter

https://gerrit.wikimedia.org/r/1139431

Change #1139431 merged by Filippo Giunchedi:

[operations/puppet@production] statistics::wmde: Configure statsd_exporter

https://gerrit.wikimedia.org/r/1139431

Change #1136417 merged by jenkins-bot:

[analytics/wmde/scripts@master] Add Prometheus stats push

https://gerrit.wikimedia.org/r/1136417

Change #1139471 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Hasan Akgün (WMDE)):

[analytics/wmde/scripts@production] Add Prometheus stats push

https://gerrit.wikimedia.org/r/1139471

Change #1139471 merged by jenkins-bot:

[analytics/wmde/scripts@production] Add Prometheus stats push

https://gerrit.wikimedia.org/r/1139471

Change #1139473 had a related patch set uploaded (by Hasan Akgün (WMDE); author: Hasan Akgün (WMDE)):

[analytics/wmde/scripts@master] Add -w 1 to current nc command to close the connection after sending stats

https://gerrit.wikimedia.org/r/1139473

Change #1139473 merged by jenkins-bot:

[analytics/wmde/scripts@master] Add -w 1 to current nc command to close the connection after sending stats

https://gerrit.wikimedia.org/r/1139473

Change #1139480 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Hasan Akgün (WMDE)):

[analytics/wmde/scripts@production] Add -w 1 to current nc command to close the connection after sending stats

https://gerrit.wikimedia.org/r/1139480

Change #1139480 merged by jenkins-bot:

[analytics/wmde/scripts@production] Add -w 1 to current nc command to close the connection after sending stats

https://gerrit.wikimedia.org/r/1139480
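
For context on the nc change above, here is a rough sketch of what sending one stat to a local statsd_exporter can look like. The helper name statsd_line is hypothetical, the tag syntax is the DogStatsD-style extension that statsd_exporter accepts, and port 9125 is only the exporter's default listen port; the exact line format and port that WikimediaStatsdExporter uses are not shown in this task, so treat all three as assumptions.

```shell
# Hypothetical helper: build one StatsD line, optionally with DogStatsD-style tags.
# Usage: statsd_line NAME VALUE TYPE [TAGS], where TAGS looks like "k:v,k2:v2".
statsd_line() {
  name=$1; value=$2; type=$3; tags=${4:-}
  if [ -n "$tags" ]; then
    printf '%s:%s|%s|#%s\n' "$name" "$value" "$type" "$tags"
  else
    printf '%s:%s|%s\n' "$name" "$value" "$type"
  fi
}

# Print two example lines (a gauge without tags, a counter with one tag).
statsd_line wikidata_maxlag_seconds 3 g
statsd_line test_counter 1 c testlabel:Z

# Sending would then look roughly like this (assumed UDP and assumed port 9125).
# With -w 1, nc gives up after one idle second instead of keeping the
# connection open:
#   statsd_line test_counter 1 c testlabel:Z | nc -u -w 1 localhost 9125
```

The send itself is left as a comment so the sketch can be run without touching a real exporter.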

This should be deployed now:

$ sudo -u analytics-wmde git -C /srv/analytics-wmde/graphite/src/scripts/ log --oneline | head -2
f82f1af Add -w 1 to current nc command to close the connection after sending stats
b38dbd1 Add Prometheus stats push

But I don’t see any metrics in Thanos yet :( I would expect at least some wikidata_rc_edits_*, some wikidata_maxlag_seconds, and some wikidata_dispatch_job_wb_changes_* – those should be tracked minutely.

@fgiunchedi any idea what could be wrong? Can you see anything in the wmde-analytics-minutely.service journal, perhaps? (I don’t have journal access, I’m afraid.)

The first step is to check whether stats made it to statsd-exporter at all, and it doesn't look like they did:

stat1011:~$ curl localhost:9112/metrics -s | grep -i wikidata
stat1011:~$

your test metric from earlier did though:

stat1011:~$ curl localhost:9112/metrics -s | grep -i lucas
# HELP test_Lucas_T389344_2_total Metric autogenerated by statsd_exporter.
# TYPE test_Lucas_T389344_2_total counter
test_Lucas_T389344_2_total 5
test_Lucas_T389344_2_total{testlabel="Z"} 1

You’re right, it looks like my manually sent test metric is the only “real” metric so far:

lucaswerkmeister-wmde@stat1011:~$ curl -s localhost:9112/metrics | grep -v -e '^#' -e '^go_' -e '^process_' -e '^statsd_' -e '^promhttp_'
test_Lucas_T389344_2_total 5
test_Lucas_T389344_2_total{testlabel="Z"} 1

(Do I need to somehow delete this one as well, by the way? But that’s not as important.)

Change #1139499 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[analytics/wmde/scripts@master] Add WikimediaStatsdExporter to load.php

https://gerrit.wikimedia.org/r/1139499

Presumably there’s a bunch of “class not found” errors in the journal where I can’t see them :(

You’re right, it looks like my manually sent test metric is the only “real” metric so far:

lucaswerkmeister-wmde@stat1011:~$ curl -s localhost:9112/metrics | grep -v -e '^#' -e '^go_' -e '^process_' -e '^statsd_' -e '^promhttp_'
test_Lucas_T389344_2_total 5
test_Lucas_T389344_2_total{testlabel="Z"} 1

(Do I need to somehow delete this one as well, by the way? But that’s not as important.)

No don't worry about it, all good as is

Presumably there’s a bunch of “class not found” errors in the journal where I can’t see them :(

Your user should be able to run journalctl, is it not working?

root@stat1011:~# cat /etc/sudoers.d/analytics-wmde-users
# This file is managed by Puppet!

%analytics-wmde-users ALL = (analytics-wmde) NOPASSWD: ALL
%analytics-wmde-users ALL = NOPASSWD: /bin/journalctl -u wmde-analytics-minutely.service
%analytics-wmde-users ALL = NOPASSWD: /bin/journalctl -u wmde-analytics-daily-early.service
%analytics-wmde-users ALL = NOPASSWD: /bin/journalctl -u wmde-analytics-daily-noon.service
%analytics-wmde-users ALL = NOPASSWD: /bin/journalctl -u wmde-analytics-weekly.service

Change #1139499 merged by jenkins-bot:

[analytics/wmde/scripts@master] Add WikimediaStatsdExporter to load.php

https://gerrit.wikimedia.org/r/1139499

Change #1139504 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[analytics/wmde/scripts@production] Add WikimediaStatsdExporter to load.php

https://gerrit.wikimedia.org/r/1139504

Change #1139504 merged by jenkins-bot:

[analytics/wmde/scripts@production] Add WikimediaStatsdExporter to load.php

https://gerrit.wikimedia.org/r/1139504

Oh right, I totally forgot about that :D

And yeah, there’s a bunch of errors like “PHP Fatal error: Uncaught Error: Class 'WikimediaStatsdExporter' not found in /srv/analytics-wmde/graphite/src/scripts/src/wikidata/wb_changes.php:25”
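
A small sketch of how the distinct fatal errors could be pulled out of the journal. The sudoers rule quoted above lists journalctl with fixed arguments, and sudo only permits an exact argument match in that case, so any filtering has to happen in a pipe afterwards. The helper names and the timestamp/host prefix of the sample lines are made up for illustration; only the error message itself comes from this task.

```shell
# Hypothetical helper: reduce journal output (stdin) to its distinct PHP fatal errors.
distinct_fatals() {
  grep -o 'PHP Fatal error: .*' | sort -u
}

# Illustrative journal excerpt; the line prefixes are assumptions.
sample_journal() {
  cat <<'EOF'
Apr 28 15:20:01 stat1011 php[1]: PHP Fatal error: Uncaught Error: Class 'WikimediaStatsdExporter' not found in /srv/analytics-wmde/graphite/src/scripts/src/wikidata/wb_changes.php:25
Apr 28 15:21:01 stat1011 php[2]: PHP Fatal error: Uncaught Error: Class 'WikimediaStatsdExporter' not found in /srv/analytics-wmde/graphite/src/scripts/src/wikidata/wb_changes.php:25
Apr 28 15:21:02 stat1011 systemd[1]: wmde-analytics-minutely.service: Failed with result 'exit-code'.
EOF
}

sample_journal | distinct_fatals

# Against the real journal this would be:
#   sudo journalctl -u wmde-analytics-minutely.service | distinct_fatals
```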

Mentioned in SAL (#wikimedia-operations) [2025-04-28T15:31:03Z] <Lucas_WMDE> lucaswerkmeister-wmde@stat1011:~$ sudo -u analytics-wmde git -C /srv/analytics-wmde/graphite/src/scripts/ pull # T389344, I don’t want to wait until the next Puppet run in 26 minutes

Buhhhhhh

PHP Fatal error: Uncaught Error: Call to undefined function str_ends_with() in /srv/analytics-wmde/graphite/src/scripts/lib/WikimediaStatsdExporter.php:20

Change #1139505 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[analytics/wmde/scripts@master] Don’t use str_ends_with() yet

https://gerrit.wikimedia.org/r/1139505

Change #1139505 merged by jenkins-bot:

[analytics/wmde/scripts@master] Don’t use str_ends_with() yet

https://gerrit.wikimedia.org/r/1139505

Change #1139506 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[analytics/wmde/scripts@production] Don’t use str_ends_with() yet

https://gerrit.wikimedia.org/r/1139506

Change #1139506 merged by jenkins-bot:

[analytics/wmde/scripts@production] Don’t use str_ends_with() yet

https://gerrit.wikimedia.org/r/1139506

Mentioned in SAL (#wikimedia-operations) [2025-04-28T15:47:24Z] <Lucas_WMDE> lucaswerkmeister-wmde@stat1011:~$ sudo -u analytics-wmde git -C /srv/analytics-wmde/graphite/src/scripts/ pull --ff-only # T389344

The metrics are showing up in Thanos! \o/ \o/ \o/

I took a look at the metrics and we'll need to iterate, specifically:

  • targetDate label set to change every minute is definitely a no-go, for example:
wikidata_rc_edits_summary_total{key="wbsetqualifier",targetDate="2025-04-28 22:07:00"} 10
wikidata_rc_edits_summary_total{key="wbsetqualifier",targetDate="2025-04-28 22:08:00"} 8

please remove the label, we can't store the whole history in labels. These are the metrics with targetDate I could find:

daily_wikidata_datamodel_lexeme_total
daily_wikidata_dumpRequests_total
daily_wikidata_entityUsagePages_total
daily_wikidata_entityUsage_total
wikidata_rc_edits_all_total
wikidata_rc_edits_anon_total
wikidata_rc_edits_bot_total
wikidata_rc_edits_length_total
wikidata_rc_edits_maxForAUser_total
wikidata_rc_edits_mobile_total
wikidata_rc_edits_new_total
wikidata_rc_edits_oauth_total
wikidata_rc_edits_summary_total
  • labels with potentially unbounded set of values (?) are these labels going to contain an arbitrary set of values over time? i.e. unlimited growth ?
    1. for example here what is the bound on statements ? daily_wikidata_datamodel_entities_with_statement_count{entityType="property",statements="4",type="statements"} 7
    2. ditto here on sitelinks daily_wikidata_datamodel_item_sitelinks_count{sitelinks="508"} 1
    3. ditto here on modifier daily_wikidata_entityUsage_total{aspect="C",modifier="_P11163
    4. ditto here on category daily_wikidata_datamodel_lexeme_total{category="Q10134"
  • daily_wikidata_entityUsage_total and daily_wikidata_entityUsagePages_total are using site label, please replace with site_id like with daily_wikidata_datamodel_item_sitelinks_sites_total

Mentioned in SAL (#wikimedia-operations) [2025-04-29T08:39:42Z] <godog> bounce prometheus-statsd-exporter on stat1011 - T389344

I took a look at the metrics and we'll need to iterate, specifically:

  • targetDate label set to change every minute is definitely a no-go, for example:
wikidata_rc_edits_summary_total{key="wbsetqualifier",targetDate="2025-04-28 22:07:00"} 10
wikidata_rc_edits_summary_total{key="wbsetqualifier",targetDate="2025-04-28 22:08:00"} 8

please remove the label, we can't store the whole history in labels.

To clarify the extent of the problem, this is the number of metrics collected by prometheus from statsd-exporter maybe ~1h apart:

stat1011:~$ curl localhost:9112/metrics -s | wc -l
34816
stat1011:~$ wc -l metrics 
31984 metrics

Goes without saying that 3k new metrics with targetDate updating every minute is not sustainable. I've just restarted statsd-exporter on stat1011 to reset the metrics.
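
To make the growth concrete, here is a minimal sketch that counts how many distinct values a label takes in the exporter's text output. The helper name distinct_label_values is hypothetical; the sample reuses the two series quoted earlier in this task.

```shell
# Hypothetical helper: count distinct values of one label in Prometheus text format (stdin).
distinct_label_values() {
  label=$1
  grep -o "${label}=\"[^\"]*\"" | sort -u | wc -l | tr -d ' '
}

# Two series that differ only in targetDate.
sample='wikidata_rc_edits_summary_total{key="wbsetqualifier",targetDate="2025-04-28 22:07:00"} 10
wikidata_rc_edits_summary_total{key="wbsetqualifier",targetDate="2025-04-28 22:08:00"} 8'

printf '%s\n' "$sample" | distinct_label_values targetDate
printf '%s\n' "$sample" | distinct_label_values key

# Difference between the two line counts above.
echo $((34816 - 31984))
```

For the sample this prints 2 for targetDate and 1 for key, and the line-count difference is 2832: every minute adds a new targetDate value, and each one is a new series that the exporter keeps around until it is restarted.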

  • labels with potentially unbounded set of values (?) are these labels going to contain an arbitrary set of values over time? i.e. unlimited growth ?
    1. for example here what is the bound on statements ? daily_wikidata_datamodel_entities_with_statement_count{entityType="property",statements="4",type="statements"} 7

The current maximum number of statements (per this panel via Graphite) is 8345, and has been for a year; the number could go a bit higher but not much: at some point people hit the maximum single page size.

  1. ditto here on sitelinks daily_wikidata_datamodel_item_sitelinks_count{sitelinks="508"} 1

The upper bound here should be the number of different Wikidata client wikis (wc -l dblists/wikidataclient.dblist), currently 944.

  1. ditto here on modifier daily_wikidata_entityUsage_total{aspect="C",modifier="_P11163

In general the modifier can be either a language code or a property ID, I believe; property IDs currently go up to P13519, though new properties are created relatively regularly.

  1. ditto here on category daily_wikidata_datamodel_lexeme_total{category="Q10134"

In theory this can be any Wikidata item, but in practice only few are actually used – 318 at the moment. Only the top 50 are tracked individually by the lexemes.php script (but the exact set of which 50 will vary slightly over time).

Goes without saying that 3k new metrics with targetDate updating every minute is not sustainable.

And can I just say that at least for me this absolutely does not go without saying?

Change #1139800 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[analytics/wmde/scripts@master] Don’t send targetDate to Prometheus in minutely metrics

https://gerrit.wikimedia.org/r/1139800

Change #1139801 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[analytics/wmde/scripts@master] Rename site label to site_id in Prometheus metrics

https://gerrit.wikimedia.org/r/1139801

Change #1139802 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[analytics/wmde/scripts@master] Remove current date targetDate from Prometheus metrics

https://gerrit.wikimedia.org/r/1139802

Change #1139800 merged by jenkins-bot:

[analytics/wmde/scripts@master] Don’t send targetDate to Prometheus in minutely metrics

https://gerrit.wikimedia.org/r/1139800

Change #1139801 merged by jenkins-bot:

[analytics/wmde/scripts@master] Rename site label to site_id in Prometheus metrics

https://gerrit.wikimedia.org/r/1139801

Change #1139802 merged by jenkins-bot:

[analytics/wmde/scripts@master] Remove current date targetDate from Prometheus metrics

https://gerrit.wikimedia.org/r/1139802

Change #1139813 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[analytics/wmde/scripts@production] Don’t send targetDate to Prometheus in minutely metrics

https://gerrit.wikimedia.org/r/1139813

Change #1139814 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[analytics/wmde/scripts@production] Rename site label to site_id in Prometheus metrics

https://gerrit.wikimedia.org/r/1139814

Change #1139816 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[analytics/wmde/scripts@production] Remove current date targetDate from Prometheus metrics

https://gerrit.wikimedia.org/r/1139816

Change #1139813 merged by jenkins-bot:

[analytics/wmde/scripts@production] Don’t send targetDate to Prometheus in minutely metrics

https://gerrit.wikimedia.org/r/1139813

Change #1139814 merged by jenkins-bot:

[analytics/wmde/scripts@production] Rename site label to site_id in Prometheus metrics

https://gerrit.wikimedia.org/r/1139814

Change #1139816 merged by jenkins-bot:

[analytics/wmde/scripts@production] Remove current date targetDate from Prometheus metrics

https://gerrit.wikimedia.org/r/1139816

The above changes should all be deployed now (apparently I got lucky and merged them on the production branch just as a Puppet run was starting ^^), hopefully that helps.

  • labels with potentially unbounded set of values (?) are these labels going to contain an arbitrary set of values over time? i.e. unlimited growth ?
    1. for example here what is the bound on statements ? daily_wikidata_datamodel_entities_with_statement_count{entityType="property",statements="4",type="statements"} 7

The current maximum number of statements (per this panel via Graphite) is 8345, and has been for a year; the number could go a bit higher but not much: at some point people hit the maximum single page size.

Ok thank you, if I'm reading the metric correctly then it is statements=<integer> plus type=statements and type=wb-identifiers so ~8k times 2 ?

  1. ditto here on sitelinks daily_wikidata_datamodel_item_sitelinks_count{sitelinks="508"} 1

The upper bound here should be the number of different Wikidata client wikis (wc -l dblists/wikidataclient.dblist), currently 944.

Ok we're fine here

  1. ditto here on modifier daily_wikidata_entityUsage_total{aspect="C",modifier="_P11163

In general the modifier can be either a language code or a property ID, I believe; property IDs currently go up to P13519, though new properties are created relatively regularly.

makes sense, and each modifier is multiplied by (each?) site_id ?

  1. ditto here on category daily_wikidata_datamodel_lexeme_total{category="Q10134"

In theory this can be any Wikidata item, but in practice only few are actually used – 318 at the moment. Only the top 50 are tracked individually by the lexemes.php script (but the exact set of which 50 will vary slightly over time).

ack

Goes without saying that 3k new metrics with targetDate updating every minute is not sustainable.

And can I just say that at least for me this absolutely does not go without saying?

hehe that's fair, in general any label value with unbounded / unlimited cardinality is a no-go (for example user=user_id where user_id can be any number or string)
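
One way to spot such labels early is to count the series per metric name in the exporter output; names with a large count are the ones to look at. This is only a sketch: series_per_metric is a hypothetical helper, and the sample values are illustrative rather than real data.

```shell
# Hypothetical helper: count series per metric name in Prometheus text format (stdin),
# largest first. Comment lines and blank lines are ignored.
series_per_metric() {
  grep -v -e '^#' -e '^$' \
    | sed 's/[{ ].*//' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $2, $1 }'
}

# Illustrative input; label values and numbers are made up.
sample='# TYPE wikidata_rc_edits_summary_total counter
wikidata_rc_edits_summary_total{key="wbsetqualifier"} 10
wikidata_rc_edits_summary_total{key="wbeditentity"} 4
wikidata_maxlag_seconds 1'

printf '%s\n' "$sample" | series_per_metric

# Against the live exporter this would be:
#   curl -s localhost:9112/metrics | series_per_metric | head
```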

Mentioned in SAL (#wikimedia-operations) [2025-04-29T11:10:18Z] <godog> bounce prometheus-statsd-exporter on stat1011 - T389344

Ok thank you, if I'm reading the metric correctly then it is statements=<integer> plus type=statements and type=wb-identifiers so ~8k times 2 ?

Should be, yeah – I agree those seem to be the only two type values. (idk why we map one of them from wb-claims to the more readable “statements” but leave wb-identifiers unmapped 🤷)

makes sense, and each modifier is multiplied by (each?) site_id ?

Looks like it, yeah. We could probably also get the total cardinality from Graphite, by the way, if that helps? However many metrics match daily.wikidata.entity_usage.*.

And can I just say that at least for me this absolutely does not go without saying?

hehe that's fair, in general any label value with unbounded / unlimited cardinality is a no-go (for example user=user_id where user_id can be any number or string)

Okay… I can see now that this was mentioned on Manual:Stats, but it hasn’t been on my mind at all – I’ve been thinking of these labels more or less like structured logging context fields.

The above changes should all be deployed now (apparently I got lucky and merged them on the production branch just as a Puppet run was starting ^^), hopefully that helps.

It really does -- thank you! I think we're getting there, some more notes below:

  • some metrics still carry targetDate, specifically the daily ones below, if I'm understanding the code correctly daily cronjob runs multiple times a day and updates the counts as it goes. In which case we can remove targetDate from here too
daily_wikidata_datamodel_lexeme_total
daily_wikidata_dumpRequests_total
daily_wikidata_entityUsagePages_total
daily_wikidata_entityUsage_total
  • For daily_wikidata_entityUsage_total I recommend breaking it up in two different metrics. In general it is recommended for the same metric name (daily_wikidata_entityUsage_total) to always have the same set of labels, e.g. not:
daily_wikidata_entityUsage_total{aspect="T",site="bmwiki",targetDate="2025-04-29T03:00:11+00:00"} 2782
daily_wikidata_entityUsage_total{aspect="C",modifier="_P1001",site="bnwiki",targetDate="2025-04-29T03:00:11+00:00"} 1

Perhaps daily_wikidata_entityUsage_total (with modifier) and daily_wikidata_entityUsage_total (with only site). Or you can keep only the metric with modifier and sum by site_id later at query time.

  • I'm still seeing site as opposed to site_id on the latest metrics pushed on stat1011: curl localhost:9112/metrics -s | less though I'm not sure what's up with that

I think we're getting there, some more notes below:

  • some metrics still carry targetDate, specifically the daily ones below, if I'm understanding the code correctly daily cronjob runs multiple times a day and updates the counts as it goes. In which case we can remove targetDate from here too
daily_wikidata_datamodel_lexeme_total
daily_wikidata_dumpRequests_total
daily_wikidata_entityUsagePages_total
daily_wikidata_entityUsage_total

Yeah, I started with the low-hanging fruit / less confusing cases 😅 it sounded like the daily metrics should be less urgent (not as much cardinality explosion), but I can try to take another look at them.

  • For daily_wikidata_entityUsage_total I recommend breaking it up in two different metrics. In general it is recommended for the same metric name (daily_wikidata_entityUsage_total) to always have the same set of labels, e.g. not:
daily_wikidata_entityUsage_total{aspect="T",site="bmwiki",targetDate="2025-04-29T03:00:11+00:00"} 2782
daily_wikidata_entityUsage_total{aspect="C",modifier="_P1001",site="bnwiki",targetDate="2025-04-29T03:00:11+00:00"} 1

Perhaps daily_wikidata_entityUsage_total (with modifier) and daily_wikidata_entityUsage_total (with only site). Or you can keep only the metric with modifier and sum by site_id later at query time.

I’m slightly confused, AFAICT we always log this with modifier set… but sometimes it’s the empty string, does the statsd exporter translate that into dropping the label?

In any case – in Graphite we didn’t really track the modifier separately AFAICT. The Graphite metrics could look like:

  • daily.wikidata.entity_usage.enwiki.C
  • daily.wikidata.entity_usage.enwiki.C_P1
  • daily.wikidata.entity_usage.enwiki.L
  • daily.wikidata.entity_usage.enwiki.L_en

I suspect we didn’t split anything on the _ in Grafana and just treated C and C_P1 as totally distinct. (In fact, given that in the database the latter is stored as C.P1 with a dot, arguably this code goes out of its way to separate this for Graphite in a way that won’t create a “hierarchy” or different “components” or “nodes” or whatever they’re called.) So maybe we should just ditch the $aspect and $modifierSuffix, and directly send the $row['aspect'] (i.e. C or C.P1) to Prometheus in one label?

  • I'm still seeing site as opposed to site_id on the latest metrics pushed on stat1011: curl localhost:9112/metrics -s | less though I'm not sure what's up with that

You answered this in IRC, just copying here for the record: today’s “early” daily run is still ongoing, with whatever PHP code was deployed at 3AM UTC (when it started).

daily_wikidata_datamodel_lexeme_total
daily_wikidata_entityUsagePages_total
daily_wikidata_entityUsage_total

We actually removed targetDate from these three already, so I guess this is the same explanation of the daily run just still using the old code.

That just leaves daily_wikidata_dumpRequests_total with a targetDate, and this metric is weird… dumpDownloads.php can be called with a custom date in $argv[1], but it doesn’t look like that happens in cron/daily.03.sh, so the target date will default to 4 days ago (strtotime( '-4 days', time() )). And indeed, the data is still working in Grafana, it just cuts out 4 days ago. IIRC we discussed this case and didn’t find a satisfactory solution yet, so we wanted to go ahead with targetDate for now and see if that helped us in Grafana.
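
For reference, the default target date described above can be reproduced on the command line roughly like this. It is a sketch that assumes GNU date (the -d option) and UTC; how dumpDownloads.php actually formats the date, and in which timezone, is not shown in this task.

```shell
# Hypothetical helper: the calendar date four days before now, in UTC.
# Mirrors the idea of strtotime( '-4 days', time() ); requires GNU date.
default_target_date() {
  date -u -d '4 days ago' +%Y-%m-%d
}

default_target_date
```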

Change #1139854 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[analytics/wmde/scripts@master] Send entity usage aspect with modifier combined to Prometheus

https://gerrit.wikimedia.org/r/1139854

So maybe we should just ditch the $aspect and $modifierSuffix, and directly send the $row['aspect'] (i.e. C or C.P1) to Prometheus in one label?

Done in the above Gerrit change.

I think we're getting there, some more notes below:

  • some metrics still carry targetDate, specifically the daily ones below, if I'm understanding the code correctly daily cronjob runs multiple times a day and updates the counts as it goes. In which case we can remove targetDate from here too
daily_wikidata_datamodel_lexeme_total
daily_wikidata_dumpRequests_total
daily_wikidata_entityUsagePages_total
daily_wikidata_entityUsage_total

Yeah, I started with the low-hanging fruit / less confusing cases 😅 it sounded like the daily metrics should be less urgent (not as much cardinality explosion), but I can try to take another look at them.

Yes please, you are correct though that targetDate is indeed less of a problem (i.e. daily instead of minutely) wrt cardinality explosion

  • For daily_wikidata_entityUsage_total I recommend breaking it up in two different metrics. In general it is recommended for the same metric name (daily_wikidata_entityUsage_total) to always have the same set of labels, e.g. not:
daily_wikidata_entityUsage_total{aspect="T",site="bmwiki",targetDate="2025-04-29T03:00:11+00:00"} 2782
daily_wikidata_entityUsage_total{aspect="C",modifier="_P1001",site="bnwiki",targetDate="2025-04-29T03:00:11+00:00"} 1

Perhaps daily_wikidata_entityUsage_total (with modifier) and daily_wikidata_entityUsage_total (with only site). Or you can keep only the metric with modifier and sum by site_id later at query time.

I’m slightly confused, AFAICT we always log this with modifier set… but sometimes it’s the empty string, does the statsd exporter translate that into dropping the label?

I suspect that is what might be going on, should be fixed by your latest change though (?)

In any case – in Graphite we didn’t really track the modifier separately AFAICT. The Graphite metrics could look like:

  • daily.wikidata.entity_usage.enwiki.C
  • daily.wikidata.entity_usage.enwiki.C_P1
  • daily.wikidata.entity_usage.enwiki.L
  • daily.wikidata.entity_usage.enwiki.L_en

I suspect we didn’t split anything on the _ in Grafana and just treated C and C_P1 as totally distinct. (In fact, given that in the database the latter is stored as C.P1 with a dot, arguably this code goes out of its way to separate this for Graphite in a way that won’t create a “hierarchy” or different “components” or “nodes” or whatever they’re called.) So maybe we should just ditch the $aspect and $modifierSuffix, and directly send the $row['aspect'] (i.e. C or C.P1) to Prometheus in one label?

Sounds good

Change #1139890 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[analytics/wmde/scripts@production] Send entity usage aspect with modifier combined to Prometheus

https://gerrit.wikimedia.org/r/1139890

Change #1139854 merged by jenkins-bot:

[analytics/wmde/scripts@master] Send entity usage aspect with modifier combined to Prometheus

https://gerrit.wikimedia.org/r/1139854

Change #1139890 merged by jenkins-bot:

[analytics/wmde/scripts@production] Send entity usage aspect with modifier combined to Prometheus

https://gerrit.wikimedia.org/r/1139890

Mentioned in SAL (#wikimedia-operations) [2025-04-30T09:28:44Z] <godog> bounce prometheus-statsd-exporter on stat1011 - T389344

daily_wikidata_entityUsage_total is looking much better now in Thanos – site_id label (instead of site), and aspect contains the full aspect including the modifier.
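
With the aspect and modifier combined into one label, per-site totals no longer need a separate metric: at query time they can be obtained by summing over site_id, in PromQL something like sum by (site_id) (daily_wikidata_entityUsage_total). The sketch below does the same aggregation locally on exporter text output. sum_by_site_id is a hypothetical helper, and the sample series are illustrative (the third one is made up entirely).

```shell
# Hypothetical helper: sum daily_wikidata_entityUsage_total per site_id
# from Prometheus text format on stdin; prints "site_id total", sorted by site.
sum_by_site_id() {
  awk '
    /^daily_wikidata_entityUsage_total[{]/ {
      if (match($0, /site_id="[^"]*"/)) {
        # Skip the 9 characters of site_id=" and drop the closing quote.
        site = substr($0, RSTART + 9, RLENGTH - 10)
        total[site] += $NF
      }
    }
    END { for (s in total) print s, total[s] }
  ' | sort
}

# Illustrative input.
sample='daily_wikidata_entityUsage_total{aspect="T",site_id="bmwiki"} 2782
daily_wikidata_entityUsage_total{aspect="C.P1001",site_id="bnwiki"} 1
daily_wikidata_entityUsage_total{aspect="S",site_id="bnwiki"} 4'

printf '%s\n' "$sample" | sum_by_site_id
```

For the sample this prints bmwiki 2782 and bnwiki 5.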