
Deploy geosearch relevance sort and test pushing the radius
Closed, Resolved | Public | 2 Estimated Story Points

Description

We're currently using the geosearch feature from GeoData for the nearby view. It has a default radius limit of 10000m. There may be no hard reason for that limit, though, so we might be able to raise it to whatever seems reasonable.
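For orientation, here is a minimal sketch of what a geosearch API request with an explicit radius looks like. The endpoint and the helper name `geosearch_url` are assumptions chosen for illustration; the coordinates are the Monaco example used later in this task. Nothing is sent, the function only builds the URL.

```python
from urllib.parse import urlencode

# Assumption: the public English Wikipedia endpoint, used purely for illustration.
API_URL = "https://en.wikipedia.org/w/api.php"


def geosearch_url(lat: float, lon: float, radius_m: int = 10000, limit: int = 10) -> str:
    """Build a list=geosearch request URL (hypothetical helper; sends nothing)."""
    params = {
        "action": "query",
        "list": "geosearch",
        "gscoord": f"{lat}|{lon}",  # latitude and longitude separated by a pipe
        "gsradius": radius_m,       # search radius in meters
        "gslimit": limit,           # maximum number of pages to return
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


print(geosearch_url(43.731111, 7.42))
```

With the default configuration, a `gsradius` above 10000 is expected to be rejected or capped by the API, which is the limit this task wants to test.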

Event Timeline

thiemowmde set the point value for this task to 2. May 13 2022, 1:30 PM

@EBernhardson You mentioned a test server for Cirrus experiments, I was hoping you could share some more hints about that. Currently, we have a feature that shouldn't be enabled on production until we've verified basic performance characteristics. The implementation is in ext-GeoData QueryGeoSearchElastic.php and as you can see, it uses an Elastica\Query on the backend.

Questions I have:

  • How can I dump the query executed by the backend? I found $wgCirrusSearchLogElasticRequests = true which seems to give the right timing metrics, but no details about the actual query:
2022-05-17 12:57:01 dev.wiki.local.wmftest.net dev: performing GeoData_spatial_search against dev_content took 6 millis and 2 Elasticsearch millis. Found 0 total results and returned 0 of them starting at 0 within these namespaces: 0. Requested via api for a43f7b6ed62366280e08b845efe19d4d by executor 362498998
  • Can you give an example request targeting the CirrusSearch replica for e.g. enwiki? I need to run against realistic data and indexes, but without hurting production.

How can I dump the query executed by the backend? I found $wgCirrusSearchLogElasticRequests = true which seems to give the right timing metrics, but no details about the actual query

For most Cirrus things we append &cirrusDumpQuery to the request URL and Cirrus gives the query as debug output. Sadly, that functionality doesn't exist in GeoData. In practice this means you can get dumps for the geodata search keywords, which are embedded in the Cirrus query parsing pipeline, but nothing in particular exists for the QueryGeoSearchElastic API. I imagine Max would simply have put a var_dump in the right place back when this was first written. Something like the following, stuffed near the top of GeoData\Searcher::performSearch, will probably do the trick (untested):

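// Debug aid only: dump the Elasticsearch query and stop.
// URL parameters live in $_GET, not $_SERVER; isset() is used because
// &cirrusDumpQuery is normally passed without a value.
if ( isset( $_GET['cirrusDumpQuery'] ) ) {
  echo json_encode( $search->toArray(), JSON_PRETTY_PRINT );
  die( 1 );
}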

Can you give an example request targeting the CirrusSearch replica for e.g. enwiki? I need to run against realistic data and indexes, but without hurting production.

This will work from any host in WMF cloud. Critically, everything has to be a GET request, and that request needs a body sent with the application/json content type.

curl -XGET -H 'Content-Type: application/json' https://cloudelastic.wikimedia.org:8243/enwiki_content/_search -d '{
  "query": {"match_all": {}}
}'
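The same request can be sketched in Python for scripted experiments. This is only an illustration of the "GET with a JSON body" requirement using the standard library; the helper name `build_search_request` is hypothetical, and the request is built but not sent.

```python
import json
import urllib.request

# Host and port taken from the curl example above.
CLOUDELASTIC = "https://cloudelastic.wikimedia.org:8243"


def build_search_request(index: str, body: dict) -> urllib.request.Request:
    """Build a GET /<index>/_search request that carries a JSON body."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        f"{CLOUDELASTIC}/{index}/_search",
        data=data,
        headers={"Content-Type": "application/json"},
        method="GET",  # explicit, because urllib defaults to POST when data is set
    )


req = build_search_request("enwiki_content", {"query": {"match_all": {}}})
print(req.get_method(), req.full_url)
```

To actually run it from a host with access, pass `req` to `urllib.request.urlopen` and decode the response with `json.load`. The `took` field of an Elasticsearch search response reports the server-side time in milliseconds, which is handy for comparing runs.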

Thanks @EBernhardson for the hints. I guess this is our query then. The places to tune should be obvious. We will probably want to play with distance and size. Also, coordinates.coord is currently centered on Monaco. ;-)

{
  "_source": [
    "coordinates.coord",
    "coordinates.primary",
    "coordinates.globe"
  ],
  "post_filter": {
    "bool": {
      "filter": [
        {
          "nested": {
            "path": "coordinates",
            "query": {
              "bool": {
                "filter": [
                  {
                    "bool": {
                      "filter": [
                        {
                          "term": {
                            "coordinates.globe": "earth"
                          }
                        },
                        {
                          "term": {
                            "coordinates.primary": true
                          }
                        }
                      ]
                    }
                  },
                  {
                    "geo_distance": {
                      "distance": "1000m",
                      "coordinates.coord": {
                        "lat": 43.731111,
                        "lon": 7.42
                      }
                    }
                  }
                ]
              }
            }
          }
        },
        {
          "terms": {
            "namespace": [
              0
            ]
          }
        }
      ]
    }
  },
  "rescore": [
    {
      "window_size": 8192,
      "query": {
        "query_weight": 1,
        "rescore_query_weight": 1,
        "score_mode": "multiply",
        "rescore_query": {
          "function_score": {
            "functions": [
              {
                "field_value_factor": {
                  "field": "incoming_links",
                  "modifier": "log2p",
                  "missing": 0
                }
              }
            ]
          }
        }
      }
    }
  ],
  "size": 5,
  "query": {
    "match_all": {}
  }
}
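To make it easier to vary distance and size, the query above can be wrapped in a small builder. This is a sketch that mirrors the dumped query one to one; the function name `geosearch_query`, its parameters, and the list of radii are assumptions for illustration, not something GeoData provides.

```python
import json


def geosearch_query(lat, lon, distance="1000m", size=5, namespaces=(0,)):
    """Return the dumped geosearch query with center, distance and size made variable."""
    coord_filter = {
        "nested": {
            "path": "coordinates",
            "query": {"bool": {"filter": [
                {"bool": {"filter": [
                    {"term": {"coordinates.globe": "earth"}},
                    {"term": {"coordinates.primary": True}},
                ]}},
                {"geo_distance": {
                    "distance": distance,
                    "coordinates.coord": {"lat": lat, "lon": lon},
                }},
            ]}},
        }
    }
    rescore = {
        "window_size": 8192,
        "query": {
            "query_weight": 1,
            "rescore_query_weight": 1,
            "score_mode": "multiply",
            "rescore_query": {"function_score": {"functions": [
                {"field_value_factor": {
                    "field": "incoming_links",
                    "modifier": "log2p",
                    "missing": 0,
                }},
            ]}},
        },
    }
    return {
        "_source": ["coordinates.coord", "coordinates.primary", "coordinates.globe"],
        "post_filter": {"bool": {"filter": [
            coord_filter,
            {"terms": {"namespace": list(namespaces)}},
        ]}},
        "rescore": [rescore],
        "size": size,
        "query": {"match_all": {}},
    }


# Illustrative radii only: one query body per candidate radius around the Monaco point.
bodies = {r: geosearch_query(43.731111, 7.42, distance=r, size=50)
          for r in ("1000m", "10000m", "50000m")}
print(json.dumps(bodies["10000m"], indent=2))
```

Each body can then be sent to the replica as a GET request with a JSON body, as described in the earlier comment.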
WMDE-Fisch moved this task from Doing to Sprint Backlog on the WMDE-TechWish-Sprint-2022-05-11 board.
WMDE-Fisch added a subscriber: awight.
WMDE-Fisch renamed this task from Deploy nearby search on testservers and test pushing the radius to Deploy geosearch relevance sort and test pushing the radius. May 25 2022, 10:39 AM
WMDE-Fisch updated the task description.
WMDE-Fisch updated the task description.