Exclude nova resource pages from *default* wikitech search
Closed, ResolvedPublic

Description

Giuseppe says that nova resource searches turn up lots of cruft from the auto-generated instance and project pages. Would be nice to exclude them somehow.

Event Timeline

Andrew raised the priority of this task from to Needs Triage.
Andrew updated the task description.
Andrew added subscribers: Andrew, bd808, yuvipanda and 2 others.

The namespace was added for T67132: Add "Nova Resource" to default search namespaces on wikitech.wikimedia.org, but the tool labs docs have been moved now.

I have also used this to match names to instances and the corresponding projects.

yuvipanda renamed this task from Exclude nova resource pages from wikitech search to Exclude nova resource pages from *default* wikitech search. Jan 6 2016, 9:21 PM
yuvipanda set Security to None.

Yeah, I think having it just off the default list is good enough.

I just realized this also excludes SALs (and project index pages!) from search, and SALs tend to provide useful contextual information. Would it be possible to somehow split auto-generated instance pages and content pages?

Could we move SALs to a namespace?

Labs-project pages could actually be very useful if people used them to document those projects (which quite often does happen). The machine pages are indeed not very useful.

Maybe there is a way we can (ab)use boost-templates to lower the importance of the instance pages if we can't get rid of them entirely? @demon might be able to tell us what secrets lie in the heart of CirrusSearch that we can use to get the cruft out of the default search space.

Default searches are based on content namespaces. Boost-templates wouldn't kill a result, but you could effectively push those pages to the bottom. Put something like {{thisisacrappage}} on the pages you hate, then add Template:Thisisacrappage|-500% or some such to cirrussearch-boost-templates.

Looks like {{InstanceStatus}} might be the right template to de-boost.

The regular expression that reads these doesn't like negative numbers. Just use a low %, like 10% or something.
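
As a rough illustration of why a negative percentage fails, here is a minimal Python sketch of parsing boost-template lines of the form Template:Name|NN%. The pattern and the function name parse_boost_templates are assumptions made for this example, not the actual CirrusSearch regular expression, and skipping unparseable lines is only one plausible behaviour; the thread does not say what the real parser does with them.

```python
import re

# Hypothetical pattern for illustration only, NOT the real CirrusSearch regex.
# It accepts "Template:Name|NN%" with an unsigned percentage, so a value
# such as "-500%" never matches.
BOOST_LINE = re.compile(
    r"^\s*(?P<template>[^|]+?)\s*\|\s*(?P<percent>\d+(?:\.\d+)?)%\s*$"
)


def parse_boost_templates(message: str) -> dict[str, float]:
    """Map template names to multiplicative weights (percent / 100)."""
    weights: dict[str, float] = {}
    for line in message.splitlines():
        match = BOOST_LINE.match(line)
        if match is None:
            # Assumption: lines that do not parse (e.g. negative values)
            # are silently skipped.
            continue
        weights[match.group("template")] = float(match.group("percent")) / 100
    return weights


message = "Template:InstanceStatus|10%\nTemplate:Thisisacrappage|-500%"
weights = parse_boost_templates(message)
print(weights)
```

Under these assumptions only the 10% entry survives, and it becomes a weight of 0.1, which matches the "weight": 0.1 seen in the dumped query further down the thread.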

Thanks for looking into this @EBernhardson. I updated the message to use a boost of 10%, which I hoped would really be a 90% reduction in importance rather than a boost of 10%. https://wikitech.wikimedia.org/w/index.php?title=MediaWiki:Cirrussearch-boost-templates&oldid=291623

Testing with https://wikitech.wikimedia.org/w/index.php?search=mediawiki&title=Special%3ASearch&go=Go&cirrusDumpQuery shows, however, that this is actually a boost rather than a de-boost as hoped:

"rescore": [
      {
        "window_size": 8192,
        "query": {
          "query_weight": 1,
          "rescore_query_weight": 1,
          "score_mode": "multiply",
          "rescore_query": {
            "function_score": {
              "functions": [
                {
                  "field_value_factor_with_default": {
                    "field": "incoming_links",
                    "modifier": "log2p",
                    "missing": 0
                  }
                },
                {
                  "weight": 0.1,
                  "filter": {
                    "fquery": {
                      "_cache": true,
                      "query": {
                        "match": {
                          "template": {
                            "query": "Template:InstanceStatus"
                          }
                        }
                      }
                    }
                  }
                },
                {
                  "weight": 0.1,
                  "filter": {
                    "fquery": {
                      "_cache": true,
                      "query": {
                        "match": {
                          "template": {
                            "query": "Template:Nova Instance"
                          }
                        }
                      }
                    }
                  }
                },
                {
                  "weight": "0.05",
                  "filter": {
                    "terms": {
                      "namespace": [
                        0
                      ]
                    }
                  }
                },
                {
                  "weight": "0.2",
                  "filter": {
                    "terms": {
                      "namespace": [
                        12,
                        498
                      ]
                    }
                  }
                }
              ]
            }
          }
        }
      }
    ]
  },

This is actually a de-boost, but you may want to set it to 1%, or even 0% (which might not be ideal: it would completely inhibit ranking when the purpose is to search for a nova instance).
Values from 0 to 99 will de-boost and values from 101 to +inf will boost.
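
To make the percentage semantics concrete, here is a minimal Python sketch of a multiplicative rescore, assuming each matching template contributes a factor of percent / 100 as the "score_mode": "multiply" and "weight": 0.1 entries in the dump suggest. The function names and the sample scores are hypothetical, and the real Elasticsearch function_score query combines more factors (such as incoming_links) than are modelled here.

```python
def percent_to_weight(percent: float) -> float:
    """Convert a boost-templates percentage into a multiplicative weight."""
    return percent / 100


def rescore(base_score: float, page_templates: set[str],
            boosts: dict[str, float]) -> float:
    """Multiply the base score by the weight of every template on the page."""
    score = base_score
    for template, percent in boosts.items():
        if template in page_templates:
            score *= percent_to_weight(percent)
    return score


# Hypothetical boost configuration and pages, for illustration only.
boosts = {"Template:InstanceStatus": 1, "Template:Nova Instance": 1}

instance_page = rescore(100.0, {"Template:InstanceStatus"}, boosts)
content_page = rescore(100.0, set(), boosts)

# A weight below 1 shrinks the score, so the instance page ranks lower.
print(instance_page < content_page)
```

With a weight of 0 the product is always 0, which is why a 0% setting would remove any ranking signal among the instance pages themselves.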

Also I wonder why we set namespace 498 weight (nova resources?) higher than the main namespace?
If the goal is to remove nova resource from default search results, could we simply remove this namespace from the defaults, or at least configure $wgCirrusSearchNamespaceWeights for nova resources at a lower value?

Yes, those namespace filters are very odd. I've also just double-checked, and the namespace filters are only applied to web search, not the API [1], which is even odder...

[1] https://wikitech.wikimedia.org/w/api.php?action=query&list=search&srsearch=mediawiki&cirrusDumpQuery

If the goal is to remove nova resource from default search results, could we simply remove this namespace from the defaults, or at least configure $wgCirrusSearchNamespaceWeights for nova resources at a lower value?

I think we softly decided that completely getting rid of Nova Resource pages from search would be non-optimal due to SALs and potentially useful project pages. We do want to squash the instance pages, however, which the template-based de-boost should mostly do. I've set them to 1% now.

I don't think we should boost Nova Resource over main namespace. That may be a leftover from before the Tool Labs help was moved into the Help namespace.

Also, it looks like the pages for deleted instances do not have the templates you've set in the boost-templates system message. Since these pages don't have any templates, they'll be hard to de-boost with boost-templates.

The namespace filters turned out to be a bug: any wiki with more than one content namespace was having the main namespace de-boosted by 95%. A default search for mediawiki on wikitech now looks a whole lot better. I'm not sure there is any need to exclude nova resource pages from the search now, but it's hard to say.

Fixed in https://gerrit.wikimedia.org/r/269168

We do want to squash the instance pages, however, which the template-based de-boost should mostly do. I've set them to 1% now.

Turning them into less horribly arranged status info and more autogenerated documentation would be nice. E.g. how do I SSH into the machine? How do I make it available from the internet? Whom to contact if it does not come up? How do I set hiera rules? Surfacing information like that at the point of need would be better than having to search for it.

Search results are looking pretty good now that the deboosting is in place and the bug that @EBernhardson found with multiple content namespaces has been fixed. Should we call this one done?

bd808 claimed this task.

A search for "puppet" is now not showing any Nova Resource pages in the first 20 results and only 1 (Nova Resource:Puppet) in the first 50.