
🤖 Issues with sessions on wikibase.cloud?
Closed, Resolved · Public

Description

This is half a question and half a bug that I wish to file.
I have seen reports on Telegram before, and now that I'm doing more editing I am experiencing it myself: sometimes my sessions just randomly disappear.
I believe this is particularly easy to spot while using Cradle.

At a guess, Redis is too small; sessions and other things are stored in there, and things are being evicted.
The Redis situation is:

The maxmemory for all of wikibase.cloud is still set to 75MB? https://github.com/wmde/wbaas-deploy/blob/548385dc5b0a259445929ca2c65e823d5cc9176d/k8s/helmfile/env/production/redis.values.yaml.gotmpl#L55

Ultimately, server logs and metrics are needed to actually figure out what is going on.
If it's not Redis, it would still be great not to have user sessions randomly disappear!

Event Timeline

@Addshore Maybe I'm missing something obvious, but why would Redis running short on memory result in losing the sessions stored in it? Isn't Redis persisting the Append Only File to disk?

FWIW, this file does seem to be in use and is very regularly rewritten to shrink it:

From the redis logs

1357551:C 16 Oct 2024 10:40:08.763 * Successfully created the temporary AOF base file temp-rewriteaof-bg-1357551.aof
1357551:C 16 Oct 2024 10:40:08.764 * Fork CoW for AOF rewrite: current 3 MB, peak 3 MB, average 2 MB
1:S 16 Oct 2024 10:40:08.858 * Background AOF rewrite terminated with success
1:S 16 Oct 2024 10:40:08.858 * Successfully renamed the temporary AOF base file temp-rewriteaof-bg-1357551.aof into appendonly.aof.94958.base.rdb
1:S 16 Oct 2024 10:40:08.861 * Removing the history file appendonly.aof.94971.incr.aof in the background
1:S 16 Oct 2024 10:40:08.861 * Removing the history file appendonly.aof.94957.base.rdb in the background
1:S 16 Oct 2024 10:40:08.864 * Background AOF rewrite finished successfully
1:S 16 Oct 2024 10:41:33.041 * Starting automatic rewriting of AOF on 189% growth
1:S 16 Oct 2024 10:41:33.044 * Creating AOF incr file appendonly.aof.94973.incr.aof on background rewrite
1:S 16 Oct 2024 10:41:33.046 * Background append only file rewriting started by pid 1358164
1358164:C 16 Oct 2024 10:41:33.339 * Successfully created the temporary AOF base file temp-rewriteaof-bg-1358164.aof
1358164:C 16 Oct 2024 10:41:33.341 * Fork CoW for AOF rewrite: current 4 MB, peak 4 MB, average 2 MB
1:S 16 Oct 2024 10:41:33.349 * Background AOF rewrite terminated with success
1:S 16 Oct 2024 10:41:33.349 * Successfully renamed the temporary AOF base file temp-rewriteaof-bg-1358164.aof into appendonly.aof.94959.base.rdb
1:S 16 Oct 2024 10:41:33.351 * Removing the history file appendonly.aof.94972.incr.aof in the background
1:S 16 Oct 2024 10:41:33.351 * Removing the history file appendonly.aof.94958.base.rdb in the background
1:S 16 Oct 2024 10:41:33.354 * Background AOF rewrite finished successfully
1:S 16 Oct 2024 10:43:17.376 * Starting automatic rewriting of AOF on 202% growth
1:S 16 Oct 2024 10:43:17.379 * Creating AOF incr file appendonly.aof.94974.incr.aof on background rewrite
1:S 16 Oct 2024 10:43:17.381 * Background append only file rewriting started by pid 1358921
1358921:C 16 Oct 2024 10:43:17.649 * Successfully created the temporary AOF base file temp-rewriteaof-bg-1358921.aof
1358921:C 16 Oct 2024 10:43:17.651 * Fork CoW for AOF rewrite: current 3 MB, peak 3 MB, average 1 MB
1:S 16 Oct 2024 10:43:17.683 * Background AOF rewrite terminated with success
1:S 16 Oct 2024 10:43:17.683 * Successfully renamed the temporary AOF base file temp-rewriteaof-bg-1358921.aof into appendonly.aof.94960.base.rdb
1:S 16 Oct 2024 10:43:17.686 * Removing the history file appendonly.aof.94973.incr.aof in the background
1:S 16 Oct 2024 10:43:17.686 * Removing the history file appendonly.aof.94959.base.rdb in the background
1:S 16 Oct 2024 10:43:17.688 * Background AOF rewrite finished successfully

maxmemory decides how much data Redis will remember: https://redis.io/docs/latest/develop/reference/eviction/

The AOF is only used to persist between restarts or crashes

So the AOF should only reflect what is held based on max memory (give or take)
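Concretely, the behaviour described above comes down to three settings. This is a sketch of the relevant redis.conf values, not the actual deployed file; the values match the config linked earlier and the INFO output below:

```ini
# redis.conf sketch — values as deployed on wikibase.cloud at the time
maxmemory 75mb                 # hard cap on the dataset held in memory
maxmemory-policy volatile-lru  # at the cap: evict least-recently-used keys that have a TTL
appendonly yes                 # the AOF persists across restarts, but only what survived eviction
```

So the AOF faithfully records what is in memory, but once memory is full, volatile-lru throws keys away and the AOF simply records the deletions.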

Thanks, that's my long-term misunderstanding then...

Running the INFO command on the production master: kubectl exec -it sts/redis-master -- /bin/bash -c 'REDISCLI_AUTH=$REDIS_PASSWORD redis-cli INFO'

results in:

# Server
redis_version:7.0.5
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:4178210212378ca8
redis_mode:standalone
os:Linux 6.1.100+ x86_64
arch_bits:64
monotonic_clock:POSIX clock_gettime
multiplexing_api:epoll
atomicvar_api:c11-builtin
gcc_version:10.2.1
process_id:1
process_supervised:no
run_id:96ada08f37d5c6fdd15e669274bcfeac4e877f8d
tcp_port:6379
server_time_usec:1729086982125933
uptime_in_seconds:204069
uptime_in_days:2
hz:10
configured_hz:10
lru_clock:1033734
executable:/redis-server
config_file:
io_threads_active:0

# Clients
connected_clients:22
cluster_connections:0
maxclients:10000
client_recent_max_input_buffer:131072
client_recent_max_output_buffer:20504
blocked_clients:0
tracking_clients:0
clients_in_timeout_table:0

# Memory
used_memory:78490488
used_memory_human:74.85M
used_memory_rss:95145984
used_memory_rss_human:90.74M
used_memory_peak:107762080
used_memory_peak_human:102.77M
used_memory_peak_perc:72.84%
used_memory_overhead:11732120
used_memory_startup:863248
used_memory_dataset:66758368
used_memory_dataset_perc:86.00%
allocator_allocated:78855152
allocator_active:85233664
allocator_resident:92389376
total_system_memory:135080161280
total_system_memory_human:125.80G
used_memory_lua:45056
used_memory_vm_eval:45056
used_memory_lua_human:44.00K
used_memory_scripts_eval:3680
number_of_cached_scripts:8
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:32768
used_memory_vm_total:77824
used_memory_vm_total_human:76.00K
used_memory_functions:184
used_memory_scripts:3864
used_memory_scripts_human:3.77K
maxmemory:78643200
maxmemory_human:75.00M
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.08
allocator_frag_bytes:6378512
allocator_rss_ratio:1.08
allocator_rss_bytes:7155712
rss_overhead_ratio:1.03
rss_overhead_bytes:2756608
mem_fragmentation_ratio:1.21
mem_fragmentation_bytes:16734624
mem_not_counted_for_evict:34920
mem_replication_backlog:1048592
mem_total_replication_buffers:1086376
mem_clients_slaves:37800
mem_clients_normal:423480
mem_cluster_links:0
mem_aof_buffer:1280
mem_allocator:jemalloc-5.2.1
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0

# Persistence
loading:0
async_loading:0
current_cow_peak:0
current_cow_size:0
current_cow_size_age:0
current_fork_perc:0.00
current_save_keys_processed:0
current_save_keys_total:0
rdb_changes_since_last_save:35415861
rdb_bgsave_in_progress:0
rdb_last_save_time:1728882913
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_saves:0
rdb_last_cow_size:3567616
rdb_last_load_keys_expired:0
rdb_last_load_keys_loaded:144049
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_rewrites:978
aof_rewrites_consecutive_failures:0
aof_last_write_status:ok
aof_last_cow_size:6897664
module_fork_in_progress:0
module_fork_last_cow_size:0
aof_current_size:42577698
aof_base_size:35824589
aof_pending_rewrite:0
aof_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0

# Stats
total_connections_received:19285203
total_commands_processed:136641567
instantaneous_ops_per_sec:307
total_net_input_bytes:68396432198
total_net_output_bytes:87276720796
total_net_repl_input_bytes:0
total_net_repl_output_bytes:57034813636
instantaneous_input_kbps:43.16
instantaneous_output_kbps:48.88
instantaneous_input_repl_kbps:0.00
instantaneous_output_repl_kbps:20.17
rejected_connections:0
sync_full:2
sync_partial_ok:0
sync_partial_err:0
expired_keys:1364056
expired_stale_perc:0.36
expired_time_cap_reached_count:0
expire_cycle_cpu_milliseconds:17655
evicted_keys:11379824
evicted_clients:0
total_eviction_exceeded_time:49898
current_eviction_exceeded_time:0
keyspace_hits:63425410
keyspace_misses:15936985
pubsub_channels:0
pubsub_patterns:0
pubsubshard_channels:0
latest_fork_usec:3260
total_forks:980
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
total_active_defrag_time:0
current_active_defrag_time:0
tracking_total_keys:0
tracking_total_items:0
tracking_total_prefixes:0
unexpected_error_replies:0
total_error_replies:2
dump_payload_sanitizations:0
total_reads_processed:147621773
total_writes_processed:158032850
io_threaded_reads_processed:0
io_threaded_writes_processed:0
reply_buffer_shrinks:90716
reply_buffer_expands:958

# Replication
role:master
connected_slaves:1
slave0:ip=redis-replicas-0.,port=6379,state=online,offset=57020639151,lag=1
master_failover_state:no-failover
master_replid:5336034601fee61fdf44be1b37b3f3f7d3c4ad3b
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:57020645008
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:57019573649
repl_backlog_histlen:1071360

# CPU
used_cpu_sys:2970.458011
used_cpu_user:2802.665843
used_cpu_sys_children:37.984556
used_cpu_user_children:278.229292
used_cpu_sys_main_thread:2919.603662
used_cpu_user_main_thread:2797.904919

# Modules

# Errorstats
errorstat_NOAUTH:count=2

# Cluster
cluster_enabled:0

# Keyspace
db0:keys=161021,expires=22843,avg_ttl=2762249612
db2:keys=50,expires=20,avg_ttl=2560463
db3:keys=2,expires=2,avg_ttl=50698
db11:keys=6,expires=6,avg_ttl=1232981

To decipher db type: https://github.com/wmde/wbaas-deploy/blob/main/k8s/helmfile/env/production/private.yaml#L41

# Keyspace
db0:keys=161021,expires=22843,avg_ttl=2762249612 [MediaWiki?]
db2:keys=50,expires=20,avg_ttl=2560463 [apiDb]
db3:keys=2,expires=2,avg_ttl=50698 [apiCacheDb]
db11:keys=6,expires=6,avg_ttl=1232981 [toolWidar]

So, most of the Redis instance is being taken up by MediaWiki (no surprise).
Some subset of that will be MediaWiki sessions.
10 and 11 hold the sessions of QuickStatements and Widar, and seemingly there isn't much there; they are likely getting booted fairly regularly.


11379824 evictions vs 1364056 expires — roughly 8x as many evictions as natural expirations.
That ratio could be a good metric in general.

Already written elsewhere, but generally I would recommend:

  • Ideally hooking Redis up to Prometheus so that you can see more
  • Doubling the k8s requests and limits for Redis
  • Quadrupling the maxmemory setting for Redis
  • Diving into what is being stored in Redis (primarily by MW) to see if it all needs to live there
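As a sketch of the suggested eviction-vs-expiry metric, the ratio can be pulled straight out of INFO. Here it is computed from the two counters captured in the INFO dump above; against a live instance you would pipe `redis-cli INFO stats` in instead of the `printf`:

```shell
# Eviction-to-expiry ratio from Redis INFO counters.
# The two values below are taken from the INFO output earlier in this ticket.
printf 'evicted_keys:11379824\nexpired_keys:1364056\n' |
  awk -F: '
    /^evicted_keys/ { evicted = $2 }
    /^expired_keys/ { expired = $2 }
    END { printf "evictions per expiry: %.1f\n", evicted / expired }
  '
# → evictions per expiry: 8.3
```

A ratio well above 1 means keys are being forced out by maxmemory pressure far more often than they age out naturally, which is exactly the symptom suspected here.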

We had a look at the cache keys too; here are some broken-down metrics and thoughts for them all.
Note: this is of course the number of keys, and doesn't reflect the size of each of those keys.

We also spotted 85) "PROMETHEUS_:gauge:platform_api_wiki_entity_imports_pending", which suggests the API is configured to use db0 of Redis; I'd advise separating that out!

Based on checking the size of a single key of each type (to give us a very rough idea of what's in there); the maths might be off, as I'm making some assumptions here.

unpatrollable-page
WANCache:mwdb_b357db396b-mwt_1bd8ddbd02_:unpatrollable-page:14446|#|v
key len 70, value len 58, ~120 bytes per value
(70)*120000/1024/1024 ≈ 7MB?

(73+120)*14000/1024/1024 ≈ 2MB for messages

(101+4605)*531/1024/1024 ≈ 2MB for cirrus

(70+1697)*2237/1024/1024 ≈ 4MB for blob store text

(70+310)*400/1024/1024 ≈ 0.1MB for resource loader

(70+427)*1000/1024/1024 ≈ 0.5MB for sessions
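The back-of-envelope maths above can be reproduced in one go. The key/value lengths and key counts are the rough figures from this comment, and Redis per-key overhead is ignored, so treat the results as order-of-magnitude only:

```shell
# Rough cache-size estimates: (key_len + value_len) * key_count / 1 MiB.
# Lengths and counts are the approximate per-type figures noted above.
awk 'BEGIN {
  split("messages cirrus blob-store-text resource-loader sessions", name)
  bytes[1] = (73 + 120)  * 14000   # messages
  bytes[2] = (101 + 4605) * 531    # cirrus
  bytes[3] = (70 + 1697) * 2237    # blob store text
  bytes[4] = (70 + 310)  * 400     # resource loader
  bytes[5] = (70 + 427)  * 1000    # sessions
  for (i = 1; i <= 5; i++)
    printf "%-16s ~%.1f MB\n", name[i], bytes[i] / 1024 / 1024
}'
```

The striking part is how little of the 75MB budget is sessions (~0.5MB) compared with everything else sharing the instance with them.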

So, I would look at:

  • Seeing if unpatrollable-page is needed, or whether it can be stored elsewhere; it seems crazy how big this is
  • Seeing if messages can be shared between sites (not sure how custom messages tie into this); maybe not worth it
  • Blob store text might also need to get kicked out, given the current size of Redis
  • Resource loader can probably be shared between sites?

A second thought

Keep the current Redis instance for low-volume, important stuff (mainly sessions).

Spin up another redis instance and use it for the main MW cache
https://github.com/wbstack/mediawiki/blob/main/dist-persist/wbstack/src/Settings/ProductionCache.php#L18
You would just need to define another object cache here with another name
https://github.com/wbstack/mediawiki/blob/main/dist-persist/wbstack/src/Settings/ProductionCache.php#L20-L38

This is probably the path of least resistance to stop people's sessions vanishing.
Then you could sit back, evaluate the rest of the topic, and see what's best.

For those following in the future: these are notes taken by @Addshore that are the results of an RTC discussion between us where I was poking redis

Dependency on Argo initiative: first we need to make sure that our current Redis instance can be successfully managed by Argo instead of helmfile, then proceed with this.

Anton.Kokh changed the task status from Open to Stalled. Nov 7 2024, 1:42 PM

Just checking in to say this is certainly still happening today; I can't really get through one minute of editing with my bot before the session gets lost.

Snippet from my response (not that it's actually helpful if we've already decided it's Redis):

response body: {"error":{"code":"assertbotfailed","info":"You do not have the \"bot\" right, so the action could not be completed.","*":"See https://wikibase.world/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."}} {
  statusCode: 500,
  statusMessage: undefined,
  headers: Headers {
    [Symbol(guard)]: 'immutable',
    [Symbol(headers list)]: HeadersList {
      cookies: [
        'mwdb_581538d916_mwt_49c890840f__BPsession=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; secure; HttpOnly'
      ],
      [Symbol(headers map)]: Map(16) {
        'date' => { name: 'Date', value: 'Sat, 16 Nov 2024 10:02:06 GMT' },
        ...
        'set-cookie' => {
          name: 'Set-Cookie',
          value: 'mwdb_581538d916_mwt_49c890840f__BPsession=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; secure; HttpOnly'
        },
        'mediawiki-api-error' => { name: 'MediaWiki-API-Error', value: 'assertbotfailed' },
        ....
        'x-request-id' => { name: 'X-Request-Id', value: '36fbcd93918dd02788e05e20' },
        'x-wbstack-mw-backend' => {
          name: 'X-WBSTACK-MW-BACKEND',
          value: 'mediawiki-139-app-api'
        },
        'strict-transport-security' => {
          name: 'Strict-Transport-Security',
          value: 'max-age=31536000; includeSubDomains'
        }
      },
      [Symbol(headers map sorted)]: null
    },
    [Symbol(realm)]: null
  },
  body: {
    error: {
      code: 'assertbotfailed',
      info: 'You do not have the "bot" right, so the action could not be completed.',
      '*': 'See https://wikibase.world/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes.'
    }
  },
  url: 'https://wikibase.world/w/api.php?action=wbeditentity&format=json&bot=true',
  context: {
    url: 'https://wikibase.world/w/api.php?action=wbeditentity&format=json&bot=true',
    body: {
      error: {
        code: 'assertbotfailed',
        info: 'You do not have the "bot" right, so the action could not be completed.',
        '*': 'See https://wikibase.world/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes.'
      }
    }
  }
}

BPsession=deleted implies that MediaWiki has deleted my session.

@Addshore reported to me another detail: the main user session in their webbrowser has been happily persisting for months; however this session (used for some bot? maybe @Addshore can add the specifics) is going missing.

Yes, I can confirm that the session in my browser appears to still be there, and I don't remember it ever vanishing at any point.

image.png (145×925 px, 27 KB)

However, my bot, which currently authenticates using a bot password, appears to get its session deleted (or evicted from Redis) after a few seconds or minutes (as was happening at the weekend).

Someone on Telegram was also reporting that their sessions were disappearing (not via a bot).

I imagine you'll have better luck looking at the sessions in Redis: logging in, looking for your session in Redis, and watching it vanish.

Given what we looked at before, I'd still bet heavily that this is just a full Redis evicting keys.

Andrew and I have been looking at deploying a second redis for this using argo

I wanted to try and reassure myself that we were seeing:

  • a churning of sessions in redis
  • that it was perpetually close to the maxmemory limit

I counted the number of sessions very crudely every three seconds like so:

while true; do kubectl exec -it sts/redis-master -- /bin/bash -c 'REDISCLI_AUTH=$REDIS_PASSWORD redis-cli KEYS *MWSession*' | wc -l; sleep 3 ; done
187
177
211
247
207
218
194
182
246
208
174

Clearly the number of sessions is jumping around wildly. I've convinced myself that the explanation for why some sessions happily persist while others are evicted quickly is that volatile-lru is enabled (https://github.com/wmde/wbaas-deploy/blob/main/k8s/helmfile/env/production/redis.values.yaml.gotmpl#L55), but I don't fully understand it.
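To put a number on that churn, the eleven samples listed above can be summarized directly (the same pipeline would work on the live output of the counting loop):

```shell
# Summarize the sampled MWSession key counts from above: min, max, mean, spread.
printf '%s\n' 187 177 211 247 207 218 194 182 246 208 174 |
  awk '
    NR == 1 { min = max = $1 }
    { if ($1 < min) min = $1; if ($1 > max) max = $1; sum += $1 }
    END { printf "min=%d max=%d mean=%.0f spread=%d\n", min, max, sum / NR, max - min }
  '
# → min=174 max=247 mean=205 spread=73
```

A spread of 73 around a mean of ~205 over about half a minute of sampling is a lot of turnover for keys that are supposed to be long-lived sessions.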

From the Redis docs:

The volatile-lru and volatile-random policies are mainly useful when you want to use a single Redis instance for both caching and for a set of persistent keys. However, you should consider running two separate Redis instances in a case like this, if possible.

Which makes me think we are on the right line with T380448. Of course a very temporary solution may be to bump the max memory if we take much longer to get the second instance up and configured

Given the data in this ticket, this might relieve some of the pressure and the performance differences we have been seeing in the past few weeks, but it might also make no difference at all, given how much data is being shoved into this one Redis instance currently.

Probably the new redis instance would be the thing to focus on IMO.

11379824 evictions vs 1364056 expires — roughly 8x as many evictions as natural expirations.
That ratio could be a good metric in general.

If you did have this as a metric, or even something you could just test on paper, it would be the telling sign of whether the situation improves, for either the approach of bumping maxmemory or introducing a new instance for session storage.

Tarrow renamed this task from Issues with sessions on wikibase.cloud? to 🤖 Issues with sessions on wikibase.cloud?. Nov 27 2024, 11:16 AM
Tarrow moved this task from Doing to Goals on the Wikibase Cloud (Kanban Board Q4 2024) board.

Right; the new Redis cluster for the MW MainCache is now shipped; let's evaluate on Friday or Monday whether things are less awful with sessions :P

Looks pretty good to me.
I managed to have an edit session spanning many minutes with no session issues.
Also managed to open QuickStatements and have it not immediately log me out! :)

Anton.Kokh claimed this task.