
[Story] Monitor size of some Wikidata database tables
Closed, Resolved · Public

Description

We should monitor the size of some of our database tables to make sure they're ok. These tables include:

  • changes
  • terms
  • sitelinks
  • entityperpage

Also possibly some MediaWiki tables for Wikibase installs:

  • revisions
  • slots / content / text tables
  • log tables

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 22 2014, 3:24 AM
bzimport added a project: Wikidata.org.
bzimport set Reference to bz66025.
bzimport added a subscriber: Unknown Object (MLST).
Lydia_Pintscher removed a subscriber: Unknown Object (MLST).Dec 1 2014, 2:34 PM
thiemowmde renamed this task from monitor size of some database tables to [Story] Monitor size of some Wikidata database tables.Aug 13 2015, 3:00 PM
thiemowmde updated the task description. (Show Details)
thiemowmde set Security to None.
thiemowmde added subscribers: daniel, hoo.

This could be trivial in the same way that we currently record other numbers.
What priority is this?

I wanted to be added to this kind of task because this is already done for *all tables* for operations-related tasks, so it may duplicate efforts. Improvements have to be made, though.

@jcrespo I think this is a pretty old ticket. Is this monitored publicly? The important thing for us is to be able to actually look at it.

Is this monitored publicly?

That is part of the improvements; it is currently non-public (NDA required). It will probably be on Grafana soon, but we are migrating the monitoring backend system at the same time (being tested on http://prometheus.wmflabs.org/grafana/dashboard/db/mysql-labsdb-cluster).

I have to say it is low-priority for us (ops), but if you do something first, let us know. There will eventually be more metrics, and they will be publicly available.

As a guess, it looks like the data is stored in Prometheus?
Could we not use this Labs Prometheus setup as a datasource backend for the production Grafana install now?
Or is there some reason that we want to avoid that?

Is this monitored publicly?

That is part of the improvements, it is currently non-public (NDA required).

What URL exactly should we be looking at now to be able to see this (with an NDA)?

NDA only link (example for db1058, s5-master):
https://tendril.wikimedia.org/report/table_status?host=db1058&schema=wikidata&table=&engine=&data=&index=

This is NDA-only because it is unlikely, but possible, that confidential information may appear there related to users' privacy (the query log).
Tendril is nice but it is admin-focused and its data extraction is not statistics-friendly.

Could we not use this Labs Prometheus setup as a datasource backend for the production Grafana install now?

The idea is that it will eventually be on Grafana alongside the rest of the data. That is a test that Filippo set up in Labs, for public Labs data only, for performance/feature evaluation purposes. I just mentioned that both of us currently have higher priorities, so it will take some time until everything we want is in a public datasource (Graphite, Prometheus, or anything else) and available for everybody to see (Grafana, direct consumption). We are still in the testing phase, and this is still a WIP. Any help is welcome :-)

@lydia if we want to do this we can easily :)

We don't really have daily stats of any sort in Grafana, but maybe we should (or at least have a place for them to live). I'd like to see # revisions, # slots, and # non-main slots, all on Commons. And eventually # slots of each type on the projects that use more than the main slot.

So we (WMDE) actually made a daily namespace in Graphite for tracking data daily instead of minutely.
We have been generating daily metrics there from various scripts for a few years now.

There is no reason we can't have a similar script get the table info / run some counts on the analytics replicas every day and have some nice dashboards for these areas; a sketch of the kind of query involved is below.

We will have to think about the naming schema for these metrics, but this should be very achievable :)
Our scripts can currently be found in analytics/wmde/scripts on Gerrit.
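
For instance, a daily count of roughly this shape could run per table on an analytics replica (a sketch; the schema and table names here are illustrative):

SELECT COUNT(*) AS row_count
FROM wikidatawiki.wb_changes;  -- repeat per monitored table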

Are there visible graphs for these? Looking a bit at the repo scripts now. I guess we'd want something in src/commons/contentgrowth (? names are hard)

Are there visible graphs for these?

For which in particular? If you mean daily stuff in general then yes.

Looking a bit at the repo scripts now. I guess we'd want something in src/commons/contentgrowth (? names are hard)

Maybe something more generic.

src/dbmonitoring/content, which could just be run with "wikidatawiki" and "commonswiki" params and do what's needed :)

We already have sizes of all uncompressed and compressed tables on zarcillo; those are planned to be shown in a dashboard. The reason those are not more public is that we were told by Security not to put them on public Prometheus, as they could compromise the anonymity of certain users on smaller wikis. Please talk to Security before doing it. Please talk to us DBAs before reimplementing an existing feature.

@jcrespo I want specifically slot count by type, and revision count, for Commons. Eventually I will want the same for other projects with multiple slots, once the numbers are over some threshold. See T68025#5313973 and more specifically T226093 (but I think you are on that ticket too).

Having them on a dashboard is fine; the raw table sizes, though, aren't quite enough for my needs. I want to be ahead of Commons structured data growth as it relates to dumps, and I expect you want to stay ahead for the DBs too. And, if possible, I want to anticipate structured data use elsewhere.

Note that I don't need daily reports; weekly or even monthly would be good enough. But if there is an easy way to just see it on a dashboard or graph with little (or no!) work already, and the frequency is more often, then I will not complain.

Change 523938 had a related patch set uploaded (by Addshore; owner: Addshore):
[analytics/wmde/scripts@master] Create script tracking number of slots on wikibase repos

https://gerrit.wikimedia.org/r/523938

The patch above tracks the # of slots by type for Wikibase repos (so Commons and Wikidata); a sketch of the sort of query involved is below.
I don't think there should be any security implications with this script.
Currently it is set to run daily; there is also a weekly cron it could be moved to.
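
A query of roughly this shape counts slots by role using MediaWiki's slots and slot_roles tables (a sketch, not the patch's exact code):

SELECT sr.role_name,
       COUNT(*) AS slot_count  -- one row in slots per (revision, role) pair
FROM slots s
JOIN slot_roles sr ON s.slot_role_id = sr.role_id
GROUP BY sr.role_name;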

@ArielGlenn let me know what you think.

Thanks, @Addshore! The data looks like exactly what I was hoping to track.

I agree that if we just publish the slot counts, there should be no privacy concerns; this information could be computed by anyone from the monthly dumps with some crunching, once the new slot-based schema is used (at present, only revision counts can be determined). We can see what @Reedy (hereby pinged) thinks anyway, just in case. (Feel free to punt this to someone else, Reedy, if you'd rather.)

Weekly is often enough for my purposes, and that will save 18 to 30 minutes of query time :-D

Switched to weekly in the latest PR and also fixed the missing ;.
If you give it a +1 I'll go ahead and get it running in the coming days.

Sure thing! As soon as jenkins is happy I'll give it the thumbs up. I'll poke Reedy about it tomorrow too.

Change 532763 had a related patch set uploaded (by Addshore; owner: Addshore):
[analytics/wmde/scripts@production] Create script tracking number of slots on wikibase repos

https://gerrit.wikimedia.org/r/532763

I just hit +2 on the patch.
It runs weekly, which is "every sunday at 01 hours".
So let's check back next week!

Change 523938 merged by jenkins-bot:
[analytics/wmde/scripts@master] Create script tracking number of slots on wikibase repos

https://gerrit.wikimedia.org/r/523938

I think Reedy was away and didn't see my pings. Anyways, thanks for moving forward on this, and we'll see how it looks in a week!

Change 532763 merged by jenkins-bot:
[analytics/wmde/scripts@production] Create script tracking number of slots on wikibase repos

https://gerrit.wikimedia.org/r/532763

Looks like the data has landed.

https://graphite.wikimedia.org/render/?width=586&height=308&target=daily.dbtables.*.slots.byRole.*

@ArielGlenn Shall I leave you to make a dashboard somewhere?

Once the dashboard is created I'll link to it from the code that generates the metrics (and maybe from this ticket).
And the dashboard should probably also link to the script.

When I look at that image it looks pretty empty; am I missing something?

There is only a single point right now (per metric), and it might have dropped off the time range by now! But it is there :)
In Grafana you might find it doesn't draw a line when the time range includes only that single point.

https://graphite.wikimedia.org/S/i I see growth on Commons is quite... healthy :-D There are a few bots adding mediainfo data to images based on information in Wikidata; we'll likely see more of this in the next few months.

I just created a basic dashboard with these metrics.
https://grafana.wikimedia.org/d/pwq8ZIxWk/large-site-db-tables

Is there anything else we want to add (given our discussions last week)?

I just created a basic dashboard with these metrics.
https://grafana.wikimedia.org/d/pwq8ZIxWk/large-site-db-tables

Is there anything else we want to add (given our discussions last week)?

I would say we should add size and not just the number of rows. There's a big refactor of the revision table being deployed that will free up lots of space, and that's what matters. Getting the size of tables should be pretty easy. Do you want to put it in the WMDE analytics refinery?

SELECT table_schema,
       table_name AS `Table`,
       ROUND((data_length + index_length) / 1024 / 1024, 2) AS `Size in MB`
FROM information_schema.TABLES
WHERE table_schema IN ('wikidatawiki', 'commonswiki')
ORDER BY (data_length + index_length) DESC;

Thanks @Addshore!
I think the size, as @Ladsgroup points out, could be pretty cool. Not sure if we should add totals and/or just data+indexes separately; what do you guys think?
I am definitely interested in the total size (data+indexes), but not sure whether it would be interesting to split it too.

...

I would say we should add size and not just the number of rows. There's a big refactor of the revision table being deployed that will free up lots of space, and that's what matters.

Size matters, but the number of rows matters for some things too (like the dumps!), so let's have both please.

I think the size, as @Ladsgroup points out, could be pretty cool. Not sure if we should add totals and/or just data+indexes separately; what do you guys think?

We can send them separately to statsd and sum them up in Grafana.
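
For the split, a query of roughly this shape (a sketch) returns the two components separately, ready to be sent as individual metrics:

SELECT table_name,
       data_length  AS data_bytes,   -- table data size
       index_length AS index_bytes   -- index size, as its own metric
FROM information_schema.TABLES
WHERE table_schema = 'wikidatawiki';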

OTOH, observability principles suggest keeping cardinality to a minimum, so even merging the metrics sounds good to me.

OTOOH (as you already know, I have three hands), this is bikeshedding; I couldn't care less.

We already have sizes of all uncompressed and compressed tables on zarcillo; those are planned to be shown in a dashboard. The reason those are not more public is that we were told by Security not to put them on public Prometheus, as they could compromise the anonymity of certain users on smaller wikis. Please talk to Security before doing it. Please talk to us DBAs before reimplementing an existing feature.

If this is the case then I'm not sure that we should be re-tracking these same things?
But perhaps it might be an idea for more people to have access to this zarcillo thing?

Change 577202 had a related patch set uploaded (by Addshore; owner: Addshore):
[analytics/wmde/scripts@master] Track wikibase repo table auto increment usage above 25%

https://gerrit.wikimedia.org/r/577202
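
For context, auto-increment usage of this kind can be read straight from information_schema; a sketch, assuming unsigned INT primary keys (so a ceiling of 4294967295):

SELECT table_name,
       ROUND(auto_increment / 4294967295 * 100, 2) AS pct_of_id_space_used
FROM information_schema.TABLES
WHERE table_schema = 'wikidatawiki'
  AND auto_increment > 0.25 * 4294967295;  -- the 25% threshold from the patch title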

perhaps it might be an idea for more people to have access to this zarcillo thing?

Stress on "we were told by Security not to put those on public Prometheus, as they could compromise the anonymity of certain users on smaller wikis". Please involve them before making anything public, as they have warned us in the past. I don't think that would be the case for these tables, on Wikidata only.

Also be careful with size: compression can make the report in information_schema very wrong, depending on the page size used (I don't remember the details, but I know that is a thing). We should sync more to make sure no misleading values are used for anything important. And indeed, to make sure we don't do the work twice. :-D

This is going to be looked at by a focus group of the Wikidata-Campsite starting soon.

Change 577202 abandoned by Addshore:

[analytics/wmde/scripts@master] Track wikibase repo table auto increment usage above 25%

Reason:

We do this as a nice % of ids available elsewhere now, thanks Amir!

https://gerrit.wikimedia.org/r/577202

Krinkle subscribed.

Is this task considered resolved, or is there more?

I'm triaging the oldest open incident follow-ups, and this appears done. I've seen the Grafana: Wikidata reliability dashboard, which is quite nice! Also related: T291332.

Yup, I'm gonna go ahead and resolve this.