
[Story] Monitor size of some Wikidata database tables
Open, Normal, Public

Description

We should monitor the size of some of our database tables to make sure they stay within reasonable bounds. These tables include:

  • changes
  • terms
  • sitelinks
  • entityperpage

Also possibly some MediaWiki tables for Wikibase installs:

  • revisions
  • slots / content / text tables
  • log tables
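As a rough sketch of what such monitoring could query (not WMF's actual tooling), MariaDB exposes per-table data and index sizes in information_schema. The table names below are illustrative guesses at the Wikibase tables behind the shorthand in the list above:

```python
# Hedged sketch: build an information_schema query for the on-disk size
# of a set of watched tables on a MariaDB host. The table names are
# assumptions mapping the task's shorthand to prefixed Wikibase tables.

WATCHED = ["wb_changes", "wb_terms", "wb_items_per_site", "wb_entity_per_page"]

def table_size_query(schema: str, tables: list[str]) -> str:
    """Return SQL reporting data and index size (bytes) per table."""
    in_list = ", ".join(f"'{t}'" for t in tables)
    return (
        "SELECT table_name, data_length, index_length "
        "FROM information_schema.tables "
        f"WHERE table_schema = '{schema}' AND table_name IN ({in_list})"
    )

print(table_size_query("wikidatawiki", WATCHED))
```

Running this query periodically and recording the results would be enough for basic trend monitoring.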

Details

Reference
bz66025
Related Gerrit Patches:

Event Timeline

bzimport raised the priority of this task to Normal.Nov 22 2014, 3:24 AM
bzimport added a project: Wikidata.org.
bzimport set Reference to bz66025.
bzimport added a subscriber: Unknown Object (MLST).
Lydia_Pintscher removed a subscriber: Unknown Object (MLST).Dec 1 2014, 2:34 PM
thiemowmde renamed this task from monitor size of some database tables to [Story] Monitor size of some Wikidata database tables.Aug 13 2015, 3:00 PM
thiemowmde updated the task description. (Show Details)
thiemowmde set Security to None.
thiemowmde added subscribers: daniel, hoo.
Restricted Application added a subscriber: Aklapper.Aug 13 2015, 3:00 PM
daniel added a project: DBA.Feb 5 2016, 2:28 PM
daniel added a subscriber: jcrespo.

This could be trivial in the same way that we currently record other numbers.
What priority is this?

I wanted to be added to this kind of task because this is already done for *all tables* for operations-related tasks, so it may duplicate efforts. Improvements still have to be made, though.

daniel added a comment.Feb 5 2016, 3:06 PM

@jcrespo I think this is a pretty old ticket. Is this monitored publicly? The important thing for us is to be able to actually look at it.

Is this monitored publicly?

That is part of the improvements, it is currently non-public (NDA required). It will probably be on grafana soon, but we are migrating the monitoring backend system at the same time (being tested on http://prometheus.wmflabs.org/grafana/dashboard/db/mysql-labsdb-cluster).

I have to say it is low-priority for us in ops, but if you do something first, let us know. There will eventually be more metrics, and they will be publicly available.

As a guess it looks like the data is stored in prometheus?
Could we not set this labs prometheus set-up as a datasource backend for the production grafana install now?
Or is there some reason that we want to avoid that?

Is this monitored publicly?

That is part of the improvements, it is currently non-public (NDA required).

What URL exactly should we be looking at now to be able to see this (with an NDA)?

NDA only link (example for db1058, s5-master):
https://tendril.wikimedia.org/report/table_status?host=db1058&schema=wikidata&table=&engine=&data=&index=

This is NDA-only because it is unlikely, but possible, that confidential information related to users' privacy (the query log) may appear there.
Tendril is nice, but it is admin-focused and its data extraction is not statistics-friendly.

Could we not set this labs prometheus set-up as a datasource backend for the production grafana install now?

The idea is that it will eventually be on Grafana alongside the rest of the data. That is a test that Filippo set up in labs, for public labs data only, for performance/feature evaluation purposes. I just mentioned that both of us currently have higher priorities, so it will take some time until everything we want is in a public datasource (Graphite/Prometheus/anything else) and available for everybody to see (Grafana, direct consumption). We are still in the testing phase, and this is still WIP. Any help is welcome :-)

PS: Unrelated to this ticket, but https://tendril.wikimedia.org/report/slow_queries?host=^db1058&user=wiki.*&schema=wikidata&qmode=eq&query=&hours=24 is something you may want to keep monitored from time to time.

Addshore moved this task from incoming to monitoring on the Wikidata board.May 11 2016, 1:49 PM

@lydia if we want to do this we can easily :)

Addshore updated the task description. (Show Details)Jun 22 2019, 10:21 PM

We don't really have daily stats of any sort in grafana but maybe we should (or have a place for it to live in any case). I'd like to see # revisions, # slots, # slots -not-main-slot, all on Commons. And eventually # slots of each type on the projects that use more than the main slot.

So we (WMDE) actually made a daily namespace in graphite for tracking data daily instead of minutely.
We have been using daily metrics there generated from various scripts for a few years now.

There is no reason we can't have a similar script to get the table info / run some counts on the analytics replicas every day and have some nice dashboards for these areas.

Will have to think about the naming schema for these metrics, but this should be very achievable :)
Our scripts can currently be found in analytics/wmde/scripts on Gerrit.

Are there visible graphs for these? Looking a bit at the repo scripts now. I guess we'd want something in src/commons/contentgrowth (? names are hard)

Are there visible graphs for these?

For which in particular? If you mean daily stuff in general then yes.

Looking a bit at the repo scripts now. I guess we'd want something in src/commons/contentgrowth (? names are hard)

Maybe something more generic.

src/dbmonitoring/content, which could just be run with "wikidatawiki" and "commonswiki" as params and do what's needed :)

We already have sizes of all uncompressed and compressed tables on zarcillo; those are planned to be shown in a dashboard. The reason they are not more public is that Security told us not to put them on public Prometheus, as they could compromise the anonymity of certain users on smaller wikis. Please talk to Security before doing it, and please talk to us DBAs before reimplementing an existing feature.

@jcrespo I want specifically slot count by type, and revision count, for Commons. Eventually I will want the same for other projects with multiple slots, once the numbers are over some threshold. See T68025#5313973 and more specifically T226093 (but I think you are on that ticket too).

Having them on a dashboard is fine; the raw table sizes though aren't quite enough for my needs. I want to be ahead of Commons structured data growth as it relates to dumps, and I expect you want to stay ahead for the dbs too. And I want if possible to anticipate structured data use elsewhere.

Note that I don't need daily reports; weekly or even monthly would be good enough. But if there is an easy way to just see it on a dashboard or graph with little (or no!) work already, and the frequency is more often, then I will not complain.

Change 523938 had a related patch set uploaded (by Addshore; owner: Addshore):
[analytics/wmde/scripts@master] Create script tracking number of slots on wikibase repos

https://gerrit.wikimedia.org/r/523938

The patch above does tracking of # of slots by type for wikibase repos (so commons and wikidata).
I don't think there should be any security implications with this script.
Currently it is set to run daily; there is also a weekly cron it could be moved to.

@ArielGlenn let me know what you think.
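For readers of this ticket, the kind of query such a script runs can be illustrated against MediaWiki's slots and slot_roles tables. This is a toy SQLite stand-in with invented rows, not the actual patch (which runs against the MariaDB analytics replicas):

```python
import sqlite3

# Minimal stand-in for MediaWiki's slots / slot_roles tables, with toy data:
# two revisions with a main slot, one of which also has a mediainfo slot.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE slot_roles (role_id INTEGER PRIMARY KEY, role_name TEXT);
CREATE TABLE slots (slot_revision_id INTEGER, slot_role_id INTEGER,
                    slot_content_id INTEGER, slot_origin INTEGER);
INSERT INTO slot_roles VALUES (1, 'main'), (2, 'mediainfo');
INSERT INTO slots VALUES (10, 1, 100, 10), (11, 1, 101, 11), (11, 2, 102, 11);
""")

# Count slots by role, the number the tracking script reports per wiki.
counts = dict(conn.execute("""
    SELECT role_name, COUNT(*)
    FROM slots JOIN slot_roles ON slot_role_id = role_id
    GROUP BY role_name
"""))
print(counts)  # {'main': 2, 'mediainfo': 1}
```

On Commons the mediainfo count is the interesting series, since it tracks structured-data growth directly.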

Thanks, @Addshore! The data looks like exactly what I was hoping to track.

I agree that if we just publish the slot counts, there should be no privacy concerns; this information could be computed by anyone from the monthly dumps with some crunching, once the new slot-based schema is used (currently, only revision counts can be determined). We can see what @Reedy (hereby pinged) thinks anyway, just in case. (Feel free to punt this to someone else, Reedy, if you'd rather.)

Weekly is often enough for my purposes, and that will save 18 to 30 minutes of query time :-D

Switched to weekly in the latest PR and also fixed the missing ;.
If you give it a +1 I'll go ahead and get it running in the coming days.

Sure thing! As soon as jenkins is happy I'll give it the thumbs up. I'll poke Reedy about it tomorrow too.

Change 532763 had a related patch set uploaded (by Addshore; owner: Addshore):
[analytics/wmde/scripts@production] Create script tracking number of slots on wikibase repos

https://gerrit.wikimedia.org/r/532763

I just hit +2 on the patch.
It runs weekly, every Sunday at 01:00.
So let's check back next week!
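For reference, a weekly "Sunday at 01:00" schedule in crontab syntax looks like the line below (the script path is a hypothetical placeholder, not the actual puppetized entry):

```
# minute hour day-of-month month day-of-week (0 = Sunday)
0 1 * * 0 /usr/local/bin/wmde-track-slots
```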

Change 523938 merged by jenkins-bot:
[analytics/wmde/scripts@master] Create script tracking number of slots on wikibase repos

https://gerrit.wikimedia.org/r/523938

I think Reedy was away and didn't see my pings. Anyways, thanks for moving forward on this, and we'll see how it looks in a week!

Change 532763 merged by jenkins-bot:
[analytics/wmde/scripts@production] Create script tracking number of slots on wikibase repos

https://gerrit.wikimedia.org/r/532763

Looks like the data has landed.

https://graphite.wikimedia.org/render/?width=586&height=308&target=daily.dbtables.*.slots.byRole.*

@ArielGlenn Shall I leave you to make a dashboard somewhere?

Once the dashboard is created I'll link to it from the code that generates the metrics (and maybe from this ticket).
And the dashboard should probably also link to the script.

When I look at that image it looks pretty empty; am I missing something?

There is only a single point right now (per metric) and it might have dropped off the time range by now! But it is there :)
In grafana you might find it doesn't draw a line when looking at a time range including the point.

https://graphite.wikimedia.org/S/i I see growth on commons is quite... healthy :-D There are a few bots adding mediainfo data to images based on information in Wikidata; we'll likely see more of this in the next few months.