
Codesearch is not searching some files that it thinks are "probably not text" (incl. package-lock.json, modules/admin/data/data.yaml)
Open, Needs Triage, Public

Event Timeline

I don't actually know what changed, but it appears to work for some extensions: https://codesearch.wmflabs.org/search/?q=minimist&i=nope&files=&repos=

Pasting the exclusion message here for searchability:

Trigram ratio too high (0.11), probably not text

Also noting that operations/puppet's modules/admin/data/data.yaml file is excluded for the same "Trigram ratio too high" reason.

Skimming through https://codesearch.wmcloud.org/deployed/?action=excludes and Ctrl+F-ing for the word "probably", it seems like a large majority of the files excluded with the reason "Trigram ratio too high" (at least in the "MediaWiki & services at WMF" group) are false positives.

I wonder if there'd be appetite upstream to (e.g.) make maxTrigramRatio configurable, and/or to raise it to a higher value than 0.1?
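For anyone unfamiliar with the heuristic, here is a rough sketch of how a check like this could work. It is not the upstream implementation: it assumes the ratio is the number of distinct byte trigrams divided by the file size, and the names `trigram_ratio`, `probably_not_text` and `MAX_TRIGRAM_RATIO` are made up for illustration. Only the 0.1 threshold comes from this task.

```python
# Hypothetical sketch of a "trigram ratio" text heuristic.
# Assumption: ratio = distinct byte trigrams / file size in bytes.
# The real indexer may compute it differently.

MAX_TRIGRAM_RATIO = 0.1  # threshold value mentioned in this task


def trigram_ratio(data: bytes) -> float:
    """Return the number of distinct 3-byte sequences divided by len(data)."""
    if len(data) < 3:
        return 0.0
    distinct = {data[i:i + 3] for i in range(len(data) - 2)}
    return len(distinct) / len(data)


def probably_not_text(data: bytes, max_ratio: float = MAX_TRIGRAM_RATIO) -> bool:
    """Flag content whose trigram ratio exceeds the configurable threshold."""
    return trigram_ratio(data) > max_ratio


# Highly repetitive content has very few distinct trigrams.
repetitive = b"abc" * 1000
# A cycling byte sequence has many distinct trigrams relative to its size.
varied = bytes(range(256)) * 4

print(trigram_ratio(repetitive), probably_not_text(repetitive))
print(trigram_ratio(varied), probably_not_text(varied))
```

If the real check is anything like this, files containing many hashes or other high-entropy strings would plausibly score above the threshold even though they are plain text, and passing the threshold as a parameter (as `max_ratio` does here) is the kind of configurability being suggested.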

I have a counter-proposal. I don't think scanning package-lock is useful; or, better said, it's an X/Y problem. We need to produce proper SBOMs from our software (and publish them too) so that we can use the many existing tools to find dependencies or potential CVEs. I think (ab)using search indexing for dependency tracking is the wrong way to approach the problem when there is a good existing solution.

I mean, I guess I don't have a firm immediate opinion on that (though I guess my immediate reply would probably be something like 'if folks don't find package-lock.json results useful, that file name can be excluded from the results'). I admit that I've previously found Codesearch searching within composer.json file(s) to be useful, though.

From my perspective here, the issue is more generally that Codesearch is excluding some files (incl. but not limited to package-lock.json) based on (incorrectly) thinking that they're "probably not text".

A_smart_kitten renamed this task from "codesearch is not searching package-lock.json" to "Codesearch is not searching some files that it thinks are "probably not text" (incl. package-lock.json, modules/admin/data/data.yaml)". Mar 10 2026, 1:46 PM
A_smart_kitten updated the task description.
A_smart_kitten added subscribers: jeremyb, Dzahn.