Soft-delete and duplicate-detection state breaks across reloads (filekey not stable; re-key by sha1)
Closed, Resolved | Public

Description

Symptoms

After reloading the workbench a day later, the displayed stash diverges from what is actually on Commons:

  • Photos that are in the user's Commons stash do not show up in the workbench UI.
  • Photos the user previously soft-deleted (via "Discard" / "Hide") show up again.
  • Some of the resurrected soft-deleted photos are flagged as duplicates of other rows — yet the duplicates were exactly what the user had soft-deleted.

Reported by Daanvr, observed in his own session on 2026-05-08 after a reload of https://upload-workbench.toolforge.org/.

Hypothesis

Durable state (soft-delete list, filename cache, item identity, duplicate-detection map) is keyed by MediaWiki's filekey. filekey is a per-stash-entry token and is not stable across stash regeneration / re-upload. sha1 (content hash) is content-defined and would be a stable identifier — the same bytes always produce the same sha1, regardless of how often MediaWiki re-issues the entry.

Investigation + fix plan

See comments below.

Event Timeline

Daanvr triaged this task as High priority.

Investigation: state keying trace

Mapped how every piece of durable state is keyed today, with file:line citations.

Soft-delete (hiddenFilekeys)

  • Stored as a flat array of filekeys in STORES.metadata.state.hiddenFilekeys (src/api/user-store.js:51), persisted to the user's Metadata.json wiki page.
  • onDelete in src/app.jsx:559 calls hideFilekey(item.filekey).
  • Visibility filter: !hidden.has(i.filekey) (src/app.jsx:399).
  • On bootstrap, pruneHiddenFilekeys(stashRaw.map(i => i.filekey)) (src/main.jsx:155) drops any hidden entry whose filekey is not in the current stash. So if MediaWiki re-issues a filekey for the same bytes, the user's soft-delete is silently wiped.
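The failure mode in the last bullet can be reproduced in a few lines. This is a minimal sketch, not the workbench's actual code: pruneByFilekey, pruneBySha1 and the row shape { filekey, sha1 } are hypothetical stand-ins for pruneHiddenFilekeys and its proposed sha1-based replacement.

```javascript
// Hypothetical helpers: names and row shapes are illustrative assumptions.

// Legacy behaviour: keep only hidden filekeys that still exist in the current stash.
function pruneByFilekey(hiddenFilekeys, stashRows) {
  const live = new Set(stashRows.map((row) => row.filekey));
  return hiddenFilekeys.filter((key) => live.has(key));
}

// Proposed behaviour: keep only hidden sha1s that still exist in the current stash.
// Rows that have no sha1 yet do not contribute to the live set.
function pruneBySha1(hiddenSha1s, stashRows) {
  const live = new Set(stashRows.map((row) => row.sha1).filter(Boolean));
  return hiddenSha1s.filter((sha1) => live.has(sha1));
}

// Same bytes as before, but the stash entry now carries a different filekey.
const stashAfterReissue = [{ filekey: 'new-key.jpg', sha1: 'aaa111' }];

const survivingFilekeys = pruneByFilekey(['old-key.jpg'], stashAfterReissue);
const survivingSha1s = pruneBySha1(['aaa111'], stashAfterReissue);

console.log(survivingFilekeys.length); // 0: the soft-delete is wiped
console.log(survivingSha1s.length);    // 1: the soft-delete survives
```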

Filename cache

  • Stored as STORES.metadata.state.filenames[filekey] -> filename (src/api/user-store.js:49).
  • Looked up in src/api/commons.js:48 to enrich each stash row with the user's chosen filename.
  • Keyed by filekey — same staleness problem.

Stash item id

  • normalizeStashItem sets id = file.filekey (src/api/normalize.js:202). So for stash rows, id IS filekey.
  • Drag-drop pending uploads use a temporary pending-... UUID, replaced by the real filekey after the upload completes (src/ui/dropzone.jsx:107).
  • Published items get id = filename (src/app.jsx:549).

Drafts

  • draftKey(item) = item.sha1 || item.filekey || item.id (src/api/user-store.js:356).
  • sha1 comes first, which is already the right pattern. Drafts are the only state in the metadata page that survives a filekey change (provided sha1 is available).

Duplicate-in-stash detection

  • In-memory map computed each render in src/app.jsx:280-303.
  • Groups visible stash rows by sha1 (good), then writes the result map keyed by item.id (= filekey, bad).
  • Excludes hidden rows via hidden.has(i.filekey) (also tied to filekey).
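One way to remove the dependency on item.id is to key the result map by sha1 as well. The sketch below assumes the caller has already filtered out hidden rows; findDuplicateSha1s and the row shape are hypothetical, not the code at src/app.jsx:280-303.

```javascript
// Illustrative only: groups already-visible rows by sha1 and returns
// a Map of sha1 -> row count for every sha1 that occurs more than once.
function findDuplicateSha1s(visibleRows) {
  const counts = new Map();
  for (const row of visibleRows) {
    if (!row.sha1) continue; // rows without a sha1 cannot be grouped yet
    counts.set(row.sha1, (counts.get(row.sha1) || 0) + 1);
  }
  const duplicates = new Map();
  for (const [sha1, count] of counts) {
    if (count > 1) duplicates.set(sha1, count);
  }
  return duplicates;
}

const visibleRows = [
  { id: 'key-1.jpg', sha1: 'aaa111' },
  { id: 'key-2.jpg', sha1: 'aaa111' },
  { id: 'key-3.jpg', sha1: 'bbb222' },
  { id: 'key-4.jpg' }, // sha1 not backfilled yet
];

const duplicates = findDuplicateSha1s(visibleRows);

// The lookup uses row.sha1, so it keeps working if row.id changes.
console.log(duplicates.has('aaa111')); // true
console.log(duplicates.has('bbb222')); // false
```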

existsOnCommons (cross-Commons duplicate check)

  • Lookup is by sha1 (src/api/commons.js findCommonsFileBySha1) — stable.
  • Result is joined back onto the row by capturing id at effect setup and matching row.id === id (src/app.jsx:375-391). If id (= filekey) has changed between effect setup and result arrival, the join misses.
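Since the lookup itself is already by sha1, the join could use the same key instead of a captured id. A minimal sketch under that assumption follows; applyCommonsResult is a hypothetical name and the row shape is illustrative.

```javascript
// Illustrative only: attach a lookup result to every row that carries
// the sha1 that was queried, regardless of what the row's id is now.
function applyCommonsResult(rows, queriedSha1, existsOnCommons) {
  if (!queriedSha1) return rows; // nothing to join on
  return rows.map((row) =>
    row.sha1 === queriedSha1 ? { ...row, existsOnCommons } : row
  );
}

// The row's id (= filekey) changed between effect setup and result arrival.
const rowsAtArrival = [
  { id: 'new-key.jpg', sha1: 'aaa111' },
  { id: 'other.jpg', sha1: 'bbb222' },
];

const joinedRows = applyCommonsResult(rowsAtArrival, 'aaa111', true);

console.log(joinedRows[0].existsOnCommons); // true
console.log(joinedRows[1].existsOnCommons); // undefined
```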

Summary

State                           Stored key                            Stable across re-upload / regen?
hiddenFilekeys                  filekey                               No
filenames cache                 filekey                               No
stash item id                   filekey                               No
Duplicate-in-stash result map   item.id (= filekey)                   No
existsOnCommons join            filekey snapshotted at effect setup   Partial
Drafts                          sha1 (filekey fallback)               Yes, when sha1 is available

The pattern: the metadata-page schema treats filekey as the canonical row identity, but MediaWiki does not. Anything keyed by filekey silently breaks across reloads.

Fix plan

Re-key durable state by sha1. Treat filekey as a per-session lookup token only.

Schema migration

Today's metadata-page shape:

{
  hiddenFilekeys: string[],
  filenames:      { [filekey]: string },
  drafts:         { [sha1|filekey]: {...} },
  history:        { lastSyncedAt, items[] }
}

Target shape:

{
  hiddenSha1s:    string[],            // canonical
  filenames:      { [sha1]: string },  // sha1-keyed
  drafts:         { [sha1]: {...} },   // sha1-only
  history:        { lastSyncedAt, items[] }
}

One-shot migration in loadOne('metadata'), after the page is parsed:

  1. If parsed.history.sha1Index is present (legacy from the previous fix), drop it. (Already shipped in commit c0b736e; this is a reminder so the migrations don't fight.)
  2. If parsed.hiddenFilekeys exists and parsed.hiddenSha1s does not: do not silently drop. Defer migration until after the first stash fetch lands — at that point we have {filekey -> sha1} for live stash rows, and we can translate hidden filekeys to sha1 for the entries that are still around. Anything we cannot translate (because the stash entry expired) is logged and dropped — same risk as today, no worse.
  3. If parsed.filenames is filekey-keyed (heuristic: keys match filekey pattern), do the same join via current stash to convert to sha1-keyed.
  4. scheduleSave('metadata') writes the cleaned-up shape. Same debounce + auto-shrink mechanism as the sha1Index cleanup.
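Steps 1 to 3 can be sketched as a pure function that runs once the first stash fetch has landed. This is an illustration under stated assumptions, not the shipped migration: migrateMetadata and looksLikeSha1 are hypothetical names, and the sketch tests for a 40-character hex sha1 rather than matching the filekey pattern mentioned in step 3.

```javascript
// Assumption: sha1 values are stored as 40-character lowercase hex strings.
function looksLikeSha1(key) {
  return /^[0-9a-f]{40}$/.test(key);
}

// Hypothetical migration helper. Returns the new state plus the legacy keys
// that could not be translated, so the caller can log them.
function migrateMetadata(parsed, stashRows) {
  const sha1ByFilekey = new Map(
    stashRows.filter((r) => r.filekey && r.sha1).map((r) => [r.filekey, r.sha1])
  );
  const next = { ...parsed };
  const dropped = [];

  // Step 1: drop the legacy history.sha1Index if it is still present.
  if (next.history && 'sha1Index' in next.history) {
    const { sha1Index, ...rest } = next.history;
    next.history = rest;
  }

  // Step 2: translate hidden filekeys to sha1 via the live stash.
  if (Array.isArray(parsed.hiddenFilekeys) && !Array.isArray(parsed.hiddenSha1s)) {
    const translated = new Set();
    for (const filekey of parsed.hiddenFilekeys) {
      const sha1 = sha1ByFilekey.get(filekey);
      if (sha1) translated.add(sha1);
      else dropped.push(filekey);
    }
    next.hiddenSha1s = [...translated];
    delete next.hiddenFilekeys;
  }

  // Step 3: re-key the filename cache, keeping entries that are already sha1-keyed.
  if (parsed.filenames) {
    const filenames = {};
    for (const [key, name] of Object.entries(parsed.filenames)) {
      const sha1 = sha1ByFilekey.get(key);
      if (sha1) filenames[sha1] = name;
      else if (looksLikeSha1(key)) filenames[key] = name;
      else dropped.push(key);
    }
    next.filenames = filenames;
  }

  return { next, dropped };
}
```

The function does not mutate its input, so the caller can compare parsed and next to decide whether scheduleSave('metadata') is needed at all.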

Code changes (high level)

  1. src/api/user-store.js: add sha1-keyed analogues of the existing filekey functions: hideSha1, unhideSha1, getHiddenSha1s, pruneHiddenSha1s. Switch setStashedFilename / getStashedFilename to take sha1 as primary; accept filekey as a transitional fallback. Add the migration helper described above.
  2. src/app.jsx: onDelete, onBulkDiscard, the visibility filter, and the duplicate-in-stash result-map keying all switch to item.sha1. When a row hasn't yet acquired a sha1 (the backfill effect from commit bd83df4 hasn't run), the row is shown as ungroupable / non-discardable for that brief moment.
  3. src/api/normalize.js: set id = sha1 for stash rows once sha1 is known; while sha1 is missing, keep the filekey-based id but mark the row as "pending-identity" so consumers can degrade gracefully.
  4. src/main.jsx: pruneHiddenFilekeys becomes pruneHiddenSha1s(stashRaw.map(i => i.sha1).filter(Boolean)). Items without sha1 do not contribute to the prune set.
  5. Drag-drop (src/ui/dropzone.jsx): sha1 is already computable client-side from the file bytes (the in-stash duplicate detection from commit 939cfe4 relies on it). Promote that compute to happen before the optimistic row is added, so new rows have stable identity from the start.
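Items 1 and 3 both describe a "sha1 first, filekey as fallback" rule. A minimal sketch of that rule is below; stashIdentity, lookupFilename and the pendingIdentity flag are hypothetical names chosen for illustration, not the functions in normalize.js or user-store.js.

```javascript
// Illustrative only: derive a row id, preferring sha1 and flagging
// rows that still rely on the unstable filekey.
function stashIdentity(file) {
  if (file.sha1) return { id: file.sha1, pendingIdentity: false };
  return { id: file.filekey, pendingIdentity: true };
}

// Illustrative only: read the filename cache by sha1 first, then fall
// back to a legacy filekey-keyed entry during the transition.
function lookupFilename(filenames, item) {
  if (item.sha1 && filenames[item.sha1] !== undefined) return filenames[item.sha1];
  if (item.filekey && filenames[item.filekey] !== undefined) return filenames[item.filekey];
  return null;
}

const filenameCache = { aaa111: 'Sunset.jpg', 'legacy-key.jpg': 'Harbour.jpg' };

console.log(stashIdentity({ filekey: 'k1.jpg', sha1: 'aaa111' }).id);        // 'aaa111'
console.log(stashIdentity({ filekey: 'k2.jpg' }).pendingIdentity);           // true
console.log(lookupFilename(filenameCache, { filekey: 'k1.jpg', sha1: 'aaa111' })); // 'Sunset.jpg'
console.log(lookupFilename(filenameCache, { filekey: 'legacy-key.jpg' }));   // 'Harbour.jpg'
```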

Risks and open questions

  • Items without sha1. mystashedfiles does not always return sha1; a stashimageinfo fetch is required. Backfill exists (commit bd83df4). During the gap a row has no canonical identity. Decide during implementation: render greyed-out, render with filekey-only fallback (legacy behaviour), or block the row from being soft-deleted until sha1 lands.
  • Identical sha1 across user accounts. sha1 is content-defined, so two users uploading the same bytes share a sha1. That is fine: our state is per-user (stored on User:<self>/UploadWorkbench/Metadata.json), so there is no cross-contamination.
  • Migration write spike. First load after the fix triggers one extra metadata-page edit per existing user with legacy state. Same shape of impact as the sha1Index cleanup; small text size, well within wiki edit budget.
  • Re-upload of expired-then-recreated stash. If the user soft-deleted file X, X expired, and they re-uploaded the same bytes a week later — should the new entry inherit the soft-delete state? With sha1 keying, yes. That's a behaviour change worth calling out in the CHANGELOG. (We think it's the right behaviour: "soft-delete by content" rather than "soft-delete by stash ticket".)

Versioning

MINOR bump: visible behaviour fix + metadata schema migration, no breaking external API. Target: 0.2.x -> 0.3.0.

Verification (when implementing)

  1. Soft-delete a file, force-reload, confirm it stays hidden.
  2. Soft-delete a file, wait until MediaWiki re-issues its filekey (or simulate via API by deleting and re-stashing same bytes). Confirm soft-delete is preserved.
  3. Open a session with a legacy Metadata.json (hiddenFilekeys only). Reload. Confirm the migration runs, hiddenSha1s appears, the page shrinks, and visible behaviour matches the legacy state for files still present.
  4. In-stash duplicate detection: two rows with the same bytes, soft-delete one. Confirm the visible row is no longer flagged as a duplicate (because the only twin is hidden).

Daanvr claimed this task.

Lessons learned

Two root causes were active simultaneously. Each one would have produced the user's "stash files missing" symptom on its own; together they made the misdiagnosis tempting (we initially attributed everything to filekey instability and only saw the API truncation when probing the live MediaWiki response directly).

MediaWiki list APIs silently cap at the default limit if no *limit parameter is set. mystashedfiles defaults to 10 items. The workbench was making a single un-paginated call, so any user with >10 stash files lost the rest invisibly. Fix: explicit msflimit=500 and a msfcontinue loop with a 5000-item safety cap. This is the kind of bug that gets masked during early dev (small stash counts) and bites at production scale.
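The pagination fix can be sketched as a loop that follows msfcontinue until the API stops returning it or the safety cap is reached. This is a hedged illustration rather than the workbench's code: fetchAllStashedFiles and the injected apiGet function (which takes query parameters and resolves to the parsed JSON response) are hypothetical, and the response shape assumed is the usual { continue: { msfcontinue }, query: { mystashedfiles: [...] } }.

```javascript
// Illustrative only: collect all stash entries across pages.
// apiGet is an assumed helper supplied by the caller.
async function fetchAllStashedFiles(apiGet, { pageSize = 500, maxItems = 5000 } = {}) {
  const items = [];
  let msfcontinue;
  do {
    const params = {
      action: 'query',
      list: 'mystashedfiles',
      msflimit: pageSize, // explicit limit instead of the default of 10
      format: 'json',
    };
    if (msfcontinue !== undefined) params.msfcontinue = msfcontinue;

    const data = await apiGet(params);
    items.push(...(data.query?.mystashedfiles ?? []));

    // Absent when the last page has been reached.
    msfcontinue = data.continue?.msfcontinue;
  } while (msfcontinue !== undefined && items.length < maxItems);

  return items.slice(0, maxItems);
}
```

Injecting apiGet keeps the loop independent of how requests are authenticated and makes it straightforward to exercise with canned pages.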

MediaWiki filekey is not a stable cross-session identifier. The same file bytes can land in the stash with a different filekey after expiry/re-upload or other regenerations. State keyed by filekey (the soft-delete list, the filename cache) is logically correct in-session and quietly wrong between sessions. sha1 (content hash) is content-defined and stable; that is the right primary key for any cross-session join. Soft-delete is now keyed by sha1 with hiddenSha1s as the canonical store, and hiddenFilekeys is kept only as a transitional fallback for items whose sha1 isn't yet known.

Durable rules added to CLAUDE.md

  • API list calls without an explicit *limit parameter silently cap at the MediaWiki default — set the limit and follow the *continue token. Don't trust early-dev volumes to surface this.
  • Identify stash files by sha1, not filekey, for any state that needs to survive a reload.

Released

0.3.0 (commit dfab83e, MR !4). Live on https://upload-workbench.toolforge.org/.