
Set up dashboard to track resource usage for Commons and Wikidata Elasticsearch indexes
Closed, Resolved · Public · 5 Estimated Story Points


As an ES cluster maintainer, I want separate indices for Wikimedia Commons and Wikidata so that the Wikipedia indices aren't affected by Commons/Wikidata growth; to plan for that, I need to track how resource usage for the Commons and Wikidata Elasticsearch indexes is growing.

Wikibase-related indices are larger and growing faster than our other indices. This presents an issue for resource allocation. Isolating those indices on a dedicated cluster should allow better visibility and easier resource management and planning.

Once we have a handle on the current resource usage (T265914) and how to measure it, we need to track its growth over time so we can make plans, especially for hardware needs.


  • We have a dashboard† that can track ≥ 6 months of resource usage for Commons and Wikidata Elasticsearch indexes.

† It doesn't need to be a spiffy special-purpose "dashboard", just a reliable method for viewing the relevant historical data

Event Timeline

CBogen set the point value for this task to 5. (Oct 26 2020, 6:41 PM)
Gehel triaged this task as High priority. (Oct 28 2020, 1:29 PM)

I'm seeing this as a task to set up metrics for tracking individual index metrics, and then turning that on for a number of our largest indices. There is already a small mountain of metrics available to be collected; we initially left them turned off because collecting them for all indices would produce a small mountain of data we never look at. I will be reviewing the index metrics provided by Elasticsearch and keeping anything that seems sensible.

I checked in with Observability regarding Prometheus metric retention. While today we only see metrics going back to mid-June, this is not due to retention but to the newness of the system: it has a target retention of three years for one-minute-resolution metrics, but it was only deployed in June and so has no earlier data. This means we can submit metrics as normal and expect them to be available for capacity planning not only in 2021, but in future years as well.

We use prometheus-elasticsearch-exporter to proxy stats from the Elasticsearch APIs into Prometheus. I've reviewed the software; unfortunately, collecting data from a limited set of indices is not supported. It's an all-or-nothing affair. I don't think we can simply turn on stat collection for the ~2k indices in the eqiad and codfw chi (primary) clusters. Not only would it be a deluge of data held in Prometheus that we are unlikely to look at, but repeatedly asking the clusters to look up these values would put unnecessary load on the master nodes.

A PR adding index filtering has existed since May, but hasn't been reviewed (publicly, at least) by the maintainers. A review of other merged PRs suggests the maintainers haven't abandoned the project, but many PRs seem to go unanswered.

Unclear on the best way forward here. Our current solution won't work. We could build binaries based on the PR, but that puts us in the weird position of maintaining a fork. We could collect the data ourselves, but it seems like a waste of work to write something the software can almost already do. Even if the maintainers of the exporter merged the patch, it would likely be months between a merge and a new release landing on our systems, and we can't wait months: we need the data for capacity planning in the early months of 2021.

Overall, while I don't like it, it seems like the way forward today is to collect the specific metrics we care about in the WMF exporter, and accept that the upstream exporter could in principle provide the same data.
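For the archives, the approach amounts to polling the Elasticsearch `GET /<index>/_stats` API for an explicit allowlist of indices and rendering a few fields in Prometheus exposition format. A minimal sketch, assuming hypothetical metric names and index list (not the ones the actual wmf exporter uses):

```python
import json
import urllib.request

# Indices we explicitly opt in to, rather than all ~2k on the cluster.
TRACKED_INDICES = ["commonswiki_file", "wikidatawiki_content", "enwiki_content"]

def fetch_index_stats(base_url, index):
    """Fetch the _stats document for a single index from Elasticsearch."""
    with urllib.request.urlopen("%s/%s/_stats" % (base_url, index)) as resp:
        return json.load(resp)

def to_prometheus(stats, cluster):
    """Render a few per-index stats as Prometheus exposition lines.

    Metric names here are hypothetical sketches, not the real wmf ones.
    """
    lines = []
    for index, data in sorted(stats.get("indices", {}).items()):
        total = data["total"]
        labels = 'cluster="%s",index="%s"' % (cluster, index)
        lines.append("elasticsearch_index_store_size_bytes{%s} %d"
                     % (labels, total["store"]["size_in_bytes"]))
        lines.append("elasticsearch_index_search_query_time_ms{%s} %d"
                     % (labels, total["search"]["query_time_in_millis"]))
    return "\n".join(lines)
```

Polling per index keeps the master-node load proportional to the handful of indices we actually care about, rather than the whole cluster.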

Change 642182 had a related patch set uploaded (by Ebernhardson; owner: Ebernhardson):
[operations/puppet@production] elasticsearch: Add per-index metrics

Change 642182 merged by Ryan Kemper:
[operations/puppet@production] elasticsearch: Add per-index metrics

Set up a basic dashboard: Elasticsearch Index Stats

This dashboard is parameterized on cluster and index, showing a single index at a time. The main metrics we are trying to capture are utilization of CPU/Memory/Disk.

  • The stacked "time taken" graph should give a reasonable approximation of how many cores are occupied by a specific index, since CPU-seconds consumed per second of wall time corresponds to cores in use. Unsurprisingly, while wikidata and commons take up significant disk/memory, they only appear to be using about 50 cores for actual searches and indexing. By comparison, enwiki peaks in the low 400s.
  • The Store Size in Bytes graph should cover our need to know how much disk space is used and how it grows over time.
  • This dashboard does not give us a great way to monitor an index's memory needs, because the majority of memory usage for an index is in the Linux page cache. If desired, we can write some custom Python that invokes mincore against all the files an index owns to determine how many pages are actually loaded. This would still be far from perfect, since all available memory will be used for the cache anyway, but it would at least give us some ratios to work with.

For page cache usage I threw together a small tool (P13520) that reports page cache usage for specific indices on the local server. Unclear why, but on elastic1040 and elastic1050 this works as an unprivileged user, while on elastic1060 the results are incorrect unless run as root. The tool has to open and mmap every file associated with an index (one at a time), taking several seconds to check a single large shard, so it is not something that can be run at the high frequency (every minute) of standard Prometheus metrics. I could still see some use if we recorded it every 30 minutes or so.
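The core of such a tool is mmap'ing each file and asking the kernel, via mincore(2), which pages of the mapping are resident. A minimal per-file sketch (Linux-only, CPython-only; P13520 is the actual tool and may differ):

```python
import ctypes
import mmap
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)  # Linux only
PAGE = os.sysconf("SC_PAGE_SIZE")

def cached_bytes(path):
    """Return (resident_bytes, file_bytes) for one file via mincore(2).

    resident_bytes counts whole pages, so it can slightly exceed
    file_bytes when the final page is partial.
    """
    size = os.path.getsize(path)
    if size == 0:
        return 0, 0
    with open(path, "rb") as f:
        # A private writable mapping lets ctypes take the buffer's address;
        # MAP_PRIVATE means the underlying file is never modified.
        mm = mmap.mmap(f.fileno(), size, flags=mmap.MAP_PRIVATE,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
        npages = (size + PAGE - 1) // PAGE
        vec = (ctypes.c_ubyte * npages)()  # one byte per page
        addr = ctypes.addressof(ctypes.c_char.from_buffer(mm))
        if libc.mincore(ctypes.c_void_p(addr), ctypes.c_size_t(size), vec) != 0:
            raise OSError(ctypes.get_errno(), "mincore failed")
        # Bit 0 of each vec entry is set if that page is resident in core.
        resident = sum(b & 1 for b in vec)
        mm.close()
        return resident * PAGE, size
```

Summing this over every file a shard owns gives the per-shard numbers reported below; the mmap-per-file loop is exactly why a large shard takes several seconds to check.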


  • Page cache expands to fill all available memory, so the mere fact that data is in the page cache doesn't mean it's particularly useful. On the other hand, we know we aren't far from running out of page cache on the 128GB machines, in which case most loaded pages are likely useful.
  • The amount of page cache used per index depends on the current index/query load of that index compared to the other shards on the same node. If a node has lots of large shards it will have less page cache per shard. Aggregating values over all instances might mitigate this.
  • If we want to figure out how much memory is actually required, we would need to use a second process to allocate memory until the node shows some threshold of IO load, and then report page cache usage.
  • Probably others I haven't thought of.

Some initial data for posterity, against the same 10 indices we are collecting for the dashboards. These values are from ~21:00 UTC, close to our daily peak load. Getting this into Prometheus or something similar would make analyzing this information significantly easier:

ebernhardson@elastic1040:~$  python3 commonswiki_file wikidatawiki_content enwiki_general enwiki_content viwiki_general commonswiki_general cebwiki_content metawiki_general dewiki_content frwiki_content
commonswiki_file_1595354515 30 8012.74 MB
commonswiki_file_1595354515 31 8021.44 MB
enwiki_general_1587198756 7 1603.26 MB
frwiki_content_1605324830 6 5196.94 MB
enwiki_content_1594689468 6 9350.99 MB
enwiki_content_1594689468 1 9304.82 MB
metawiki_general_1583759756 4 26.04 MB
commonswiki_general_1582452273 4 878.01 MB
wikidatawiki_content_1587076364 7 4595.67 MB
cebwiki_content_1605166211 2 138.13 MB
ebernhardson@elastic1050:~$ python3 commonswiki_file wikidatawiki_content enwiki_general enwiki_content viwiki_general commonswiki_general cebwiki_content metawiki_general dewiki_content frwiki_content
commonswiki_file_1595354515 8 7172.28 MB
commonswiki_file_1595354515 15 6862.34 MB
commonswiki_file_1595354515 2 7465.66 MB
enwiki_general_1587198756 16 1464.20 MB
enwiki_general_1587198756 1 1191.53 MB
viwiki_general_1605993168 0 30.00 MB
frwiki_content_1605324830 1 4652.67 MB
enwiki_content_1594689468 4 9399.72 MB
commonswiki_general_1582452273 5 1009.80 MB
dewiki_content_1605225710 1 3410.16 MB
wikidatawiki_content_1587076364 9 4354.45 MB
wikidatawiki_content_1587076364 1 4395.59 MB

And for comparison, this is one of our 256G servers.

ebernhardson@elastic1060:~$ sudo python3 commonswiki_file wikidatawiki_content enwiki_general enwiki_content viwiki_general commonswiki_general cebwiki_content metawiki_general dewiki_content frwiki_content
commonswiki_file_1595354515 23 27289.64 MB
commonswiki_file_1595354515 26 26879.77 MB
enwiki_general_1587198756 8 8674.02 MB
enwiki_general_1587198756 4 7164.62 MB
viwiki_general_1605993168 2 532.23 MB
frwiki_content_1605324830 4 8405.00 MB
enwiki_content_1594689468 9 12766.45 MB
enwiki_content_1594689468 1 12362.60 MB
commonswiki_general_1582452273 1 1542.43 MB
dewiki_content_1605225710 6 4979.77 MB
wikidatawiki_content_1587076364 2 12970.17 MB
wikidatawiki_content_1587076364 9 13053.73 MB
cebwiki_content_1605166211 3 1052.48 MB

This is cool, Erik. Thanks for all the details in the write up. I'm not the target audience, but I always appreciate graphs and big piles of numbers! Gathering page cache data every ~30 minutes also seems way better than none at all.

Do you think we need to work on getting the page cache data into Prometheus right now, or can we set up a cron job to take notes and worry about import/export later? I figure that while we can only gather the data in the present, we can make it pretty/readable in the future.

It's hard to say; I suppose my worry is that this still isn't completely representative of the information we really want. These values are unlikely to change much over time, since the overall size of the cache is constant. We can of course shrink the cache: that's as easy as allocating some memory in a process so it's not available to the page cache (since we have swap disabled). I'm not sure how to reliably automate that though; the heuristics for deciding how much memory to take from the page cache before measuring seem hard to pin down, and doing this at all is somewhat invasive. Perhaps we could use metrics like major page faults/sec, or kB read/sec from disk, but at that point I'm not sure it's something that should be run many times a day.
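The major-page-fault idea is cheap to prototype: the kernel exposes a cumulative `pgmajfault` counter in /proc/vmstat, and sampling it twice gives a faults/sec rate. A minimal sketch (Linux-only; the 1-second interval is an arbitrary choice, not a recommendation):

```python
import time

def read_pgmajfault(vmstat_text):
    """Extract the cumulative major page fault counter from /proc/vmstat text."""
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name == "pgmajfault":
            return int(value)
    raise KeyError("pgmajfault not found")

def majfault_rate(interval=1.0):
    """Sample system-wide major faults/sec over a short interval (Linux only)."""
    with open("/proc/vmstat") as f:
        before = read_pgmajfault(f.read())
    time.sleep(interval)
    with open("/proc/vmstat") as f:
        after = read_pgmajfault(f.read())
    return (after - before) / interval
```

A sustained non-zero rate would suggest the working set no longer fits in the page cache, without the invasiveness of deliberately allocating memory away from it.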

The values from the 128G instances perhaps put a lower bound on the amount of memory we need per shard for page cache purposes, but based on the above I'm not sure how to turn that into a value that lets us predict the future.

Talked with @dcausse about page cache needs, agreed that the constant nature of page cache size will make measurements over time less useful. We can though make some more direct estimates:

  • commonswiki_file shards are 70-85GB each
  • commonswiki_file reports ~8GB page cache per shard, around 10% of shard size
  • wikidatawiki_content shards are ~40GB each
  • wikidatawiki_content reports ~4.5GB page cache per shard, around 11% of shard size
  • cluster wide we have 21TB of data and 2.8TB of memory cache, giving 13% of store as cache

This gives us a lower estimate of 10% of index size as page cache. It is a lower bound because these are current values and they are currently acceptable, but we know they aren't too far from trouble; 15% might be the safe number to use for future planning purposes.
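The ratios above reduce to simple arithmetic; this sketch just encodes it for sanity-checking and projection (the 15% figure is the planning number proposed above, not a measured value):

```python
def cache_ratio(cached_gb, shard_gb):
    """Fraction of a shard resident in page cache."""
    return cached_gb / shard_gb

def cache_needed_tb(store_tb, ratio=0.15):
    """Project page cache needed for a given store size at the planning ratio."""
    return store_tb * ratio

# Figures from the measurements above:
#   commonswiki_file:     ~8GB cached per ~80GB shard  -> ~10%
#   wikidatawiki_content: ~4.5GB cached per ~40GB shard -> ~11%
#   cluster-wide:         2.8TB cache / 21TB store      -> ~13%
```

At the 15% planning ratio, today's 21TB of cluster data would call for roughly 3.15TB of page cache, and any projected store growth scales the memory requirement linearly.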

One other thought that occurred to me over the weekend, perhaps obvious but not stated above: page cache needs are directly correlated with query rate. enwiki_content, which takes ~400 cores, sees 9G of page cache; that is more than 50% of the index that needs to be kept in memory. Essentially, the 10% estimate is only good as long as commons + wikidata keep their current query load; if the query rate increases we would need a higher percentage of the index in memory.