Compute an estimated label cache space usage for Items
Closed, Resolved · Public

Description

Let's say we will store all the Items' labels and descriptions in cache (memcached).
We need to understand how much space we will need (max, avg).

We want each entity type to be able to define its storage independently, so this should be limited exclusively to Items.

Details

Related Gerrit Patches:
analytics/wmde/toolkit-analyzer : master · Add item.label.length.avg metric
analytics/wmde/toolkit-analyzer : master · Added item.label.total metric

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jun 11 2018, 10:38 AM

Change 440133 had a related patch set uploaded (by WMDE-leszek; owner: WMDE-leszek):
[analytics/wmde/toolkit-analyzer@master] Added item.label.total metric

https://gerrit.wikimedia.org/r/440133

I saw "memcached" in the description so I will add this comment here:

So, one bit of feedback from a discussion that I just had in #wikimedia-operations on IRC is that the memcached cluster is already pretty full.
This can be seen from the eviction rate on https://grafana.wikimedia.org/dashboard/db/prometheus-memcached-dc-stats
For a cluster that is not full, you would expect to see no evictions.
I found this out while looking into the problem described at T197252#4284642
Before continuing to think about memcached, I would definitely get in touch with operations.

> For a cluster that is not full, you would expect to see no evictions.

Depends on how you use it. If you let memcached remove items on its own when they expire (or if they never expire, like revision content), and you create new items faster than the old ones expire, then it is normal to see evictions.

BTW: TTL_INDEFINITE is the default TTL in WANObjectCache::set
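
To illustrate the point about evictions, here is a minimal sketch of a fixed-capacity cache that drops the least recently used entry when it is full. It is a toy model, not memcached's actual implementation: the class `TinyLRUCache`, the capacity of 1000 and the `label:` key prefix are all made up for the example. Entries are stored without any expiry, so the only way space is freed is eviction.

```python
from collections import OrderedDict


class TinyLRUCache:
    """Toy fixed-capacity LRU cache (hypothetical, not memcached itself)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()
        self.evictions = 0

    def set(self, key, value):
        # Re-setting an existing key marks it as most recently used.
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        # No entry ever expires here, so a full cache must evict
        # the least recently used entry to make room.
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)
            self.evictions += 1


cache = TinyLRUCache(capacity=1000)
for i in range(1500):
    cache.set(f"label:{i}", "x")

print(cache.evictions)  # 500: every insert past the capacity evicts one entry
```

In this model a non-zero eviction count only says that more distinct keys were written than fit in the cache, which is the expected steady state when entries have no TTL.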

Change 440876 had a related patch set uploaded (by WMDE-leszek; owner: WMDE-leszek):
[analytics/wmde/toolkit-analyzer@master] [WIP] Add item.label.length.avg metric

https://gerrit.wikimedia.org/r/440876

WMDE-leszek added a comment. · Edited · Jun 20 2018, 12:59 PM

Based on queries against the wb_terms replica, the numbers look as follows as of today:

  • Number of labels: 250 M (250 778 841)
  • Average length of a label: 22.8056

Ladsgroup moved this task from incoming to in progress on the Wikidata board. · Jun 28 2018, 3:47 PM
Vvjjkkii renamed this task from Compute an estimated label cache space usage for Items to cabaaaaaaa. · Jul 1 2018, 1:04 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed subscribers: gerritbot, Aklapper.
CommunityTechBot renamed this task from cabaaaaaaa to Compute an estimated label cache space usage for Items. · Jul 2 2018, 2:55 PM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added subscribers: gerritbot, Aklapper.

> Based on queries against the wb_terms replica, the numbers look as follows as of today:
>
>   • Number of labels: 250 M (250 778 841)
>   • Average length of a label: 22.8056

So, 0.6GB of raw data ish?

> So, 0.6GB of raw data ish?

More like 6GB+
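
The arithmetic behind the two estimates can be checked with a short sketch. It uses only the label count and average length reported earlier in this task. Everything else is an assumption: that the average length can be read as roughly one byte per unit, and the hypothetical `overhead_per_item` parameter, which stands in for cache keys and per-entry metadata and is not a measured value. Descriptions are not included because no numbers for them were posted.

```python
LABEL_COUNT = 250_778_841      # from the wb_terms replica query above
AVG_LABEL_LENGTH = 22.8056     # from the same query


def estimate_bytes(count, avg_len, bytes_per_char=1.0, overhead_per_item=0.0):
    """Rough total size: count * (payload bytes + assumed per-item overhead)."""
    return count * (avg_len * bytes_per_char + overhead_per_item)


# Raw label payload only, assuming one byte per unit of length.
raw = estimate_bytes(LABEL_COUNT, AVG_LABEL_LENGTH)
print(f"raw payload: {raw / 1e9:.2f} GB")  # raw payload: 5.72 GB

# Same payload plus a purely illustrative 50 bytes of overhead per entry.
with_overhead = estimate_bytes(LABEL_COUNT, AVG_LABEL_LENGTH, overhead_per_item=50)
print(f"with assumed overhead: {with_overhead / 1e9:.2f} GB")
```

So the raw label payload alone is on the order of 6 GB rather than 0.6 GB, and any multi-byte encoding or per-entry overhead only pushes the figure higher.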

Change 440133 abandoned by WMDE-leszek:
Added item.label.total metric

https://gerrit.wikimedia.org/r/440133

Change 440876 abandoned by WMDE-leszek:
Add item.label.length.avg metric

https://gerrit.wikimedia.org/r/440876

Addshore closed this task as Resolved. · Dec 18 2018, 11:34 AM
Addshore claimed this task.
Restricted Application added a project: User-Addshore. · Dec 18 2018, 11:34 AM