Giuseppe says that nova resource searches turn up lots of cruft from the auto-generated instance and project pages. Would be nice to exclude them somehow.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | bd808 | T123425 [EPIC] Make wikitech more friendly for the multiple audiences it supports
Resolved | | bd808 | T122993 Exclude nova resource pages from *default* wikitech search
Event Timeline
The namespace was added for T67132: Add "Nova Resource" to default search namespaces on wikitech.wikimedia.org, but the tool labs docs have been moved now.
I have also used this to match names to instances and the corresponding projects.
I just realized this also excludes SALs (and project index pages!) from search, and SALs tend to provide useful contextual information. Would it be possible to somehow split auto-generated instance pages and content pages?
Labs-project pages could actually be very useful if people used them to document those projects (which quite often does happen). The machine pages are indeed not very useful.
Maybe there is a way we can (ab)use boost-templates to lower the importance of the instance pages if we can't get rid of them entirely? @demon might be able to tell us what secrets lie in the heart of CirrusSearch that we can use to get the cruft out of the default search space.
Default searches are based on content namespaces. Boost-templates wouldn't kill a result, but you could effectively push them to the bottom. Something like {{thisisacrappage}} on the pages you hate, then add Template:Thisisacrappage|-500% or somesuch to cirrussearch-boost-templates.
The regular expression that reads these doesn't like negative numbers. Just use a low %, like 10% or something.
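As a rough illustration of why a negative value would be rejected, here is a minimal Python sketch of parsing one `Template:Name|NN%` entry. The pattern and the `parse_boost_line` helper are hypothetical and written only for this example; they are not the regular expression CirrusSearch actually uses.

```python
import re

# Hypothetical pattern: a template title, a pipe, a non-negative integer, a percent sign.
# Illustration only; this is NOT the regular expression CirrusSearch uses.
BOOST_LINE = re.compile(r"^(?P<template>[^|]+)\|(?P<percent>\d+)%$")


def parse_boost_line(line):
    """Return (template, weight) for a 'Template:Name|NN%' entry, or None if it does not match."""
    match = BOOST_LINE.match(line.strip())
    if match is None:
        return None
    # The percentage becomes a multiplier, e.g. 10% -> 0.1.
    return match.group("template"), int(match.group("percent")) / 100


print(parse_boost_line("Template:Thisisacrappage|10%"))
print(parse_boost_line("Template:Thisisacrappage|-500%"))
```

Because `\d+` only accepts digits, the `-500%` entry does not match at all, while `10%` parses to a weight of 0.1.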
Thanks for looking into this @EBernhardson. I updated it to use a boost of 10%, which I hoped would really be a 90% importance reduction rather than a boost of 10%. https://wikitech.wikimedia.org/w/index.php?title=MediaWiki:Cirrussearch-boost-templates&oldid=291623
Testing with https://wikitech.wikimedia.org/w/index.php?search=mediawiki&title=Special%3ASearch&go=Go&cirrusDumpQuery shows, however, that this is actually a boost rather than a de-boost as hoped:
    "rescore": [
      {
        "window_size": 8192,
        "query": {
          "query_weight": 1,
          "rescore_query_weight": 1,
          "score_mode": "multiply",
          "rescore_query": {
            "function_score": {
              "functions": [
                { "field_value_factor_with_default": { "field": "incoming_links", "modifier": "log2p", "missing": 0 } },
                { "weight": 0.1, "filter": { "fquery": { "_cache": true, "query": { "match": { "template": { "query": "Template:InstanceStatus" } } } } } },
                { "weight": 0.1, "filter": { "fquery": { "_cache": true, "query": { "match": { "template": { "query": "Template:Nova Instance" } } } } } },
                { "weight": "0.05", "filter": { "terms": { "namespace": [ 0 ] } } },
                { "weight": "0.2", "filter": { "terms": { "namespace": [ 12, 498 ] } } }
              ]
            }
          }
        }
      }
    ] },
This is actually a de-boost, but you may want to configure it to 1% or even 0% (0% might not be ideal: it would completely inhibit ranking if the purpose is to search for a nova instance).
Values from 0 to 99 will de-boost and values from 101 to +inf will boost.
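The rule above can be sketched in a few lines of Python. The `describe_boost` helper is hypothetical, and treating exactly 100% as neutral is an inference from the rule (a multiplier of 1), not something stated in this thread.

```python
def describe_boost(percent):
    """Map a boost-templates percentage to (weight, effect)."""
    if percent < 0:
        raise ValueError("negative percentages are not accepted")
    weight = percent / 100  # 10% -> 0.1, matching "weight": 0.1 in the query dump
    if percent < 100:
        effect = "de-boost"
    elif percent == 100:
        effect = "neutral"  # assumption: a multiplier of 1 leaves the score unchanged
    else:
        effect = "boost"
    return weight, effect


for value in (1, 10, 100, 150):
    print(value, describe_boost(value))
```

Since the rescore uses "score_mode": "multiply", a weight of 0.1 scales a matching page's score down to a tenth, and 0.01 scales it down to a hundredth.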
Also I wonder why we set namespace 498 weight (nova resources?) higher than the main namespace?
If the goal is to remove nova resource from default search results, could we simply remove this namespace from the defaults, or at least configure $wgCirrusSearchNamespaceWeights for nova resources at a lower value?
Yes, those namespace filters are very odd. I've also just double-checked, and the namespace filters are only applied to web search, not the API[1], which is even odder...
[1] https://wikitech.wikimedia.org/w/api.php?action=query&list=search&srsearch=mediawiki&cirrusDumpQuery
I think we softly decided that completely getting rid of Nova Resource pages from search would be non-optimal due to SAL and potentially useful project pages. We do want to squash the instance pages, however, which the template-based de-boost should mostly do. I've set them to 1% now.
I don't think we should boost Nova Resource over main namespace. That may be a leftover from before the Tool Labs help was moved into the Help namespace.
Also, it looks like deleted instances do not have the templates you've set in the boost-templates system message. These pages don't have any templates, so they'll be hard to de-boost with boost-templates.
The namespace filters turned out to be a bug: any wiki with more than one content namespace was having the main namespace de-boosted by 95%. A default search for mediawiki on wikitech now looks a whole lot better. I'm not sure if there is any need to exclude nova resource pages from the search now, but it's hard to say.
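To see why the buggy namespace weights hurt main-namespace results, here is a small Python sketch that combines the weights from the query dump earlier in this thread. It assumes the matching function weights are multiplied together (which I believe is the function_score default), ignores the incoming_links factor, and assumes namespace 498 is Nova Resource; `combined_multiplier` is a hypothetical helper.

```python
import math

# Namespace weights copied from the query dump (before the bug fix).
# Assumption: 498 is the Nova Resource namespace; 12 is Help.
NAMESPACE_WEIGHTS = {0: 0.05, 12: 0.2, 498: 0.2}


def combined_multiplier(namespace, template_weights):
    """Multiply the namespace weight with the weights of all matching boost templates."""
    namespace_weight = NAMESPACE_WEIGHTS.get(namespace, 1.0)
    return math.prod([namespace_weight, *template_weights])


# Main-namespace page with no boost templates.
print("main page:", combined_multiplier(0, []))
# Page in namespace 498 without instance templates (e.g. a project page).
print("project page:", combined_multiplier(498, []))
# Instance page in namespace 498 carrying one template de-boosted to 10%.
print("instance page:", combined_multiplier(498, [0.1]))
```

Under these assumptions a main-namespace page is scaled by 0.05 while a template-free page in namespace 498 is scaled by 0.2, so the latter gets four times the multiplier at an equal base score.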
Turning them into less horribly arranged status info and more autogenerated documentation would be nice. E.g. how do I SSH into the machine? How do I make it available from the internet? Whom to contact if it does not come up? How do I set hiera rules? Surfacing information like that at the point of need would be better than having to search for it.
Search results are looking pretty good now that the deboosting is in place and the bug that @EBernhardson found with multiple content namespaces has been fixed. Should we call this one done?
A search for "puppet" is now not showing any Nova Resource pages in the first 20 results and only 1 (Nova Resource:Puppet) in the first 50.