EBernhardson (EBernhardson)
User

User Details

User Since
Oct 7 2014, 4:49 PM (163 w, 3 d)
Availability
Available
LDAP User
EBernhardson
MediaWiki User
EBernhardson (WMF)

Recent Activity

Wed, Nov 22

EBernhardson updated the title for P6368 Running spark on jupyter in SWAP from untitled to Running spark on jupyter in SWAP.
Wed, Nov 22, 8:57 PM
EBernhardson created P6368 Running spark on jupyter in SWAP.
Wed, Nov 22, 7:32 PM

Tue, Nov 21

EBernhardson added a comment to T177520: Experiment with different grouping of queries that get fed into the DBN.

Moving back to backlog as this task actually covers 2 experiments and thought it was new:

Tue, Nov 21, 5:42 PM · Discovery-Search (Current work), Discovery

Mon, Nov 20

EBernhardson updated subscribers of T176493: Analysis of testing on 18 wikis with > 1% of search traffic.

I think it's pretty clear from the results that users prefer the existing DBN settings (min 10 queries per group) over the more restrictive settings that were tested. That's interesting as a human review of the DBN generated labels suggests the larger groups generate better labels, but our concern was that the larger groups also exclude significant numbers of queries from training. It appears those longer tail queries are particularly important. This leaves open a question posed on another ticket (i forget who, was either @dcausse or @TJones), considering *smaller* dbn groups. I hadn't expected this result, so only prepped the test for larger groupings, but perhaps we run it again going the other direction.

Mon, Nov 20, 11:32 PM · Patch-For-Review, Discovery-Analysis (Current work), Discovery-Search (Current work), Discovery

Fri, Nov 17

EBernhardson created P6342 (An Untitled Masterwork).
Fri, Nov 17, 4:08 PM

Thu, Nov 16

EBernhardson added a comment to T174103: [Epic] Port Selenium tests from Ruby to Node.js for the Search Platform.

I spent a little time today with nodemw and i concur with @Jdrewniak. For promises we can probably get away with something like bluebird.promisifyAll, but the batching functionality is key. The combination of succinct call structure, auto-magic concurrency, and operation specific error handling all built into mwbot's batch functionality is incredibly useful for our hooks that set everything up. Error handling for nodemw specifically is pretty difficult, as it doesn't actually return enough information. For example if i delete a page that doesn't exist i get back an error object containing the string Error returned by API: The page you specified doesn't exist.. This is a localized string that varies when we test against, for example, our language specific wikis. I really need the original response which has {code: 'missingtitle', info: 'The page you specified doesn\'t exist', ...}
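
A minimal sketch of why the machine-readable code matters, assuming the raw API response is available as a dict. Only the missingtitle code and its message come from the comment above; the function name, the exception class and the response shape are hypothetical, and this is not how nodemw or mwbot are implemented.

```python
# Error codes that a test-setup hook can safely ignore when deleting a page.
# 'missingtitle' is the code quoted in the comment; the set itself is an assumption.
IGNORABLE_DELETE_CODES = {"missingtitle"}


class ApiError(Exception):
    """Carries the stable error code alongside the (localized) info string."""

    def __init__(self, code, info):
        super().__init__(f"{code}: {info}")
        self.code = code
        self.info = info


def check_delete_response(response):
    """Return True if the page was deleted, False if it did not exist.

    Decisions are made on the 'code' field, which stays the same on every
    wiki, rather than on the 'info' text, which varies with the wiki language.
    """
    error = response.get("error")
    if error is None:
        return True
    if error.get("code") in IGNORABLE_DELETE_CODES:
        return False
    raise ApiError(error.get("code"), error.get("info"))
```

With this shape, a hook that cleans up before a test run can treat an already-missing page as success regardless of the language of the wiki under test.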

Thu, Nov 16, 11:35 PM · MW-1.31-release-notes (WMF-deploy-2017-10-24 (1.31.0-wmf.5)), Patch-For-Review, Release-Engineering-Team (Kanban), Discovery-Search (Current work), Discovery
EBernhardson added a comment to T180706: Phabricator search hugely degraded in quality.

Because of how phabricator is architected, it kinda does both. Elasticsearch seems to be used as a first pass filtering/sorting, but then phabricator pulls more info from the database (and potentially does more filtering).

Thu, Nov 16, 8:02 PM · Release-Engineering-Team (Kanban), Regression, Phabricator

Wed, Nov 15

EBernhardson added a comment to T174103: [Epic] Port Selenium tests from Ruby to Node.js for the Search Platform.

@Jdrewniak set up the framework for running tests in nodejs, not sure why in particular mwbot was chosen but Jan can probably comment. After using it a bit i can say that its batch() method in particular is quite convenient as our setup creates lots of different pages with varied content to be searched for.

Wed, Nov 15, 2:26 PM · MW-1.31-release-notes (WMF-deploy-2017-10-24 (1.31.0-wmf.5)), Patch-For-Review, Release-Engineering-Team (Kanban), Discovery-Search (Current work), Discovery

Mon, Nov 13

EBernhardson edited P6307 (An Untitled Masterwork).
Mon, Nov 13, 7:31 PM
EBernhardson created P6307 (An Untitled Masterwork).
Mon, Nov 13, 7:30 PM
EBernhardson moved T162369: Evaluate rescore windows for learning to rank from In progress to Done on the Discovery-Search (Current work) board.
Mon, Nov 13, 5:01 PM · Discovery-Search (Current work), Discovery
EBernhardson added a comment to T162369: Evaluate rescore windows for learning to rank.

I manually reviewed 20 queries that had a difference in the top 5 results for 1024 vs 4096. In my (somewhat arbitrary) opinion only one of those queries improved. Others mostly seemed to pull up somewhat popular pages that weren't any more relevant to the query. Based on this i think we should continue with the current rescore window rather than expanding it. I also reviewed the 20 queries with changes to top 5 between 512 and 1024. This is a little murkier, some queries seem like they might be improving while others are not. I'm not sure the effect size is large enough we could run a good AB test though.

Mon, Nov 13, 5:00 PM · Discovery-Search (Current work), Discovery
EBernhardson moved T180298: Catchable fatal error: Argument 1 passed to CirrusSearch\DataSender::reportUpdateMetrics() must be an instance of Elastica\Bulk\ResponseSet, null given from Needs review to Done on the Discovery-Search (Current work) board.
Mon, Nov 13, 4:49 PM · MW-1.31-release-notes (WMF-deploy-2017-11-14 (1.31.0-wmf.8)), Discovery-Search (Current work), Patch-For-Review, Discovery, CirrusSearch

Fri, Nov 10

EBernhardson added a comment to T177270: API:Search maxes out at 10000.

@Headbomb I'm not sure it's a great idea, but to at least think about it: how many results do you think you need? "All of them" is unfortunately not possible, as it would mean returning tens of millions of results from search shards to the coordinator.

Fri, Nov 10, 12:43 AM · Discovery-Search, CirrusSearch, Discovery, MediaWiki-API

Thu, Nov 9

EBernhardson added a comment to T179266: search.wikimedia.org is source of lots of 500s.

If needed i can pull a full month, but will take longer. This is for nov 8th (UTC). This is also limited to requests that returned a 200 response code.

Thu, Nov 9, 11:57 PM · Discovery-Search (Current work), Patch-For-Review, Operations
EBernhardson added a comment to T179266: search.wikimedia.org is source of lots of 500s.

Pulled some info on overall usage and http response codes from webrequest logs. This is for oct 9 through nov 9 for all requests with host search.wikimedia.org.

Thu, Nov 9, 11:35 PM · Discovery-Search (Current work), Patch-For-Review, Operations
EBernhardson added a comment to T179266: search.wikimedia.org is source of lots of 500s.

This is apparently https://wikitech.wikimedia.org/wiki/Search.wikimedia.org and we still need to maintain it.

Thu, Nov 9, 7:41 PM · Discovery-Search (Current work), Patch-For-Review, Operations
EBernhardson added a comment to T176493: Analysis of testing on 18 wikis with > 1% of search traffic.

@chelsyx I pulled the data out for 11/2 00:00 to 11/9 00:00 into a single tsv file at stat1005.eqiad.wmnet:/mnt/hdfs/user/ebernhardson/tss_tsv/part-00000-7faa8246-4477-421e-8c91-df291eec70cc.csv.gz This is about 234M compressed and 1.18G uncompressed. If necessary i can re-sample this on session ids to get smaller data.

Thu, Nov 9, 6:42 PM · Patch-For-Review, Discovery-Analysis (Current work), Discovery-Search (Current work), Discovery

Wed, Nov 8

EBernhardson created P6292 (An Untitled Masterwork).
Wed, Nov 8, 7:30 PM
EBernhardson added a comment to T180050: [Russian] [Chinese] PHP Warning: Recursion detected in RequestContext::getLanguage MW v1.31.0-wmf.6.

The relevant cirrus code here hasn't really changed since at least 2015. The stack trace seems a bit odd, i don't know this path much but it seems asking for the request language shouldn't require rendering i18n messages.

Wed, Nov 8, 7:24 PM · Russian-Sites, Community-Tech, Technical-Debt, Chinese-Sites, Discovery, MW-1.31-release, Wikimedia-log-errors, MediaWiki-General-or-Unknown

Tue, Nov 7

EBernhardson added a comment to T176493: Analysis of testing on 18 wikis with > 1% of search traffic.

For future reference this is what i came up with for extracting a TSV. This can be pasted into the spark scala shell (/usr/lib/spark2/bin/spark-shell --master yarn):

import org.apache.spark.sql.{functions => F, types => T}
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
Tue, Nov 7, 9:26 PM · Patch-For-Review, Discovery-Analysis (Current work), Discovery-Search (Current work), Discovery
EBernhardson created P6277 building a paldb from popularity data with spark+scala.
Tue, Nov 7, 12:57 AM

Mon, Nov 6

EBernhardson created P6269 (An Untitled Masterwork).
Mon, Nov 6, 6:04 PM
EBernhardson created T179848: Unable to add user to group in debian stretch instance.
Mon, Nov 6, 5:54 PM · Cloud-VPS
EBernhardson added a comment to T179500: Evaluation precision of discernatron results vs our retrieval query.

Huh, that's very interesting data! I'm surprised that we have such a high recall on this data. We currently set LTR rescore to 1024, but with 7 shards that means we see approximately the top 7k results, which should contain 98% of the top-20 results returned by other search engines, and 100% of results (in this small set) that graders determined were even a little bit relevant to the query.
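
The "top 7k" figure follows from the rescore window being applied on each shard separately. A small worked example of that arithmetic (the function name is illustrative only):

```python
def max_rescored_docs(rescore_window, shard_count):
    """Upper bound on documents the rescore phase sees across all shards.

    Each shard rescores its own top `rescore_window` hits, so the total is
    at most window * shards (fewer if a shard has fewer matching documents).
    """
    return rescore_window * shard_count


print(max_rescored_docs(1024, 7))  # 7168, i.e. roughly 7k
```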

Mon, Nov 6, 4:53 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson added a comment to T179748: Reading Common Crawl data from hadoop / webproxy performance.

I did run it, took just under 48 hours maxing out both proxies. I was mostly curious because, if this initial experiment works out, I would want to run it monthly with the new common crawl dataset releases. Will have to see if the data is actually a good search signal or is just the article title over and over again.

Mon, Nov 6, 3:26 PM · Analytics

Sat, Nov 4

EBernhardson updated the task description for T179748: Reading Common Crawl data from hadoop / webproxy performance.
Sat, Nov 4, 1:56 AM · Analytics

Fri, Nov 3

EBernhardson updated subscribers of T179748: Reading Common Crawl data from hadoop / webproxy performance.
Fri, Nov 3, 11:25 PM · Analytics
EBernhardson created T179748: Reading Common Crawl data from hadoop / webproxy performance.
Fri, Nov 3, 11:24 PM · Analytics
EBernhardson created P6259 (An Untitled Masterwork).
Fri, Nov 3, 4:38 PM

Thu, Nov 2

EBernhardson added a comment to T173710: Job queue is increasing non-stop.

It was perhaps noted before, but because of the recursive nature of the refreshLinks and htmlCacheUpdate jobs even if the backlog is being processed it may not look like it, because the jobs are just enqueuing new jobs. Will probably take some time to really know what effect things are having.

Thu, Nov 2, 5:25 PM · User-Elukey, Patch-For-Review, Services (watching), Performance-Team (Radar), Discovery, CirrusSearch, Wikidata, Operations, MediaWiki-JobQueue

Wed, Nov 1

EBernhardson added a comment to T176493: Analysis of testing on 18 wikis with > 1% of search traffic.

@chelsyx yes, spark makes it pretty easy to read in a text file containing a json string per line, it's mostly just reading it in and spitting it back out. If helpful can probably do some other minor pre-processing like only taking the last checkin event per pageViewId.
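
A rough sketch of that pre-processing step in plain Python rather than Spark, to show the logic only. pageViewId is the field named above; the `action` and `checkin` field names and the event layout are assumptions, not the actual schema.

```python
import json


def last_checkin_per_page_view(lines):
    """Keep non-checkin events as-is and only the latest checkin per pageViewId.

    `lines` is an iterable of strings, one JSON object per line.
    Assumes (hypothetically) that checkin events have action == "checkin"
    and a numeric "checkin" field that grows over the life of the page view.
    """
    other_events = []
    latest = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("action") != "checkin":
            other_events.append(event)
            continue
        key = event["pageViewId"]
        if key not in latest or event["checkin"] > latest[key]["checkin"]:
            latest[key] = event
    return other_events + list(latest.values())
```

In Spark the same idea would typically be a group-by on pageViewId with a max over the checkin value; the dictionary here plays that role.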

Wed, Nov 1, 7:42 PM · Patch-For-Review, Discovery-Analysis (Current work), Discovery-Search (Current work), Discovery
EBernhardson updated the task description for T179500: Evaluation precision of discernatron results vs our retrieval query.
Wed, Nov 1, 3:44 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson added projects to T179500: Evaluation precision of discernatron results vs our retrieval query: CirrusSearch, Discovery-Search.
Wed, Nov 1, 3:43 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson created T179500: Evaluation precision of discernatron results vs our retrieval query.
Wed, Nov 1, 3:43 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson added a comment to T176493: Analysis of testing on 18 wikis with > 1% of search traffic.

previous multi-wiki test was ~13k sessions per bucket per wiki, or 13*2*18, or ~480k sessions total. Rough estimates for 7 days at the set sampling rate on enwiki is 160k sessions per bucket per wiki but on a single wiki, so 320k sessions. More sessions for a single wiki, but fewer sessions overall. Getting the data out of hadoop and into a TSV file for ingestion into R is relatively easy, I can take care of that. I of course don't know that that's any easier for R, if there are suggestions i can try out different ways. If that's too much data for the report generator i can sample down by session id when generating the TSV and only use the full data size for purposes of calculating how the number of sessions affects the separability of the two groups on CTR (basically the joy plots run previously).
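
Sampling down by session id, rather than by event, keeps every event of a retained session together. One common way to do this is to hash the session id; the sketch below assumes string session ids, and the function name and the choice of SHA-256 are illustrative, not what was actually run.

```python
import hashlib


def keep_session(session_id, rate):
    """Deterministically decide whether a session is part of the sample.

    The session id is hashed to a number in [0, 1); the session is kept
    when that number falls below `rate`. Every event carrying the same
    session id therefore gets the same decision.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    position = int.from_bytes(digest[:8], "big") / 2**64
    return position < rate


# Usage: filter events (dicts with a hypothetical "session_id" key) to ~25% of sessions.
events = [{"session_id": f"s{i % 100}", "n": i} for i in range(1000)]
sampled = [e for e in events if keep_session(e["session_id"], 0.25)]
```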

Wed, Nov 1, 2:36 PM · Patch-For-Review, Discovery-Analysis (Current work), Discovery-Search (Current work), Discovery

Tue, Oct 31

EBernhardson added a comment to T176493: Analysis of testing on 18 wikis with > 1% of search traffic.

For estimating enough, the next test (starting tomorrow, tentatively) is running against enwiki and sampling ~15% of sessions into the 5 buckets of the test (for ~3% per bucket). This should hopefully give us more data than necessary to figure out how much we actually need going forward. It will of course only be able to tell us with certainty about the effect size we see on enwiki for this test, but hopefully it can be extrapolated (or maybe there is something more rigorous).

Tue, Oct 31, 8:43 PM · Patch-For-Review, Discovery-Analysis (Current work), Discovery-Search (Current work), Discovery
EBernhardson claimed T176997: Extract a set of a few hundred most popular abandoned queries.
Tue, Oct 31, 5:21 PM · Discovery-Search (Current work), CirrusSearch, Discovery

Mon, Oct 30

EBernhardson moved T170009: Evaluate training speed and accuracy for 1M and 30M sample training sets with different worker counts from Needs review to Done on the Discovery-Search (Current work) board.
Mon, Oct 30, 10:24 PM · Patch-For-Review, Discovery-Search (Current work), Discovery
EBernhardson moved T178522: Another searchmatch span altering searches from Needs review to Done on the Discovery-Search (Current work) board.
Mon, Oct 30, 10:21 PM · MW-1.31-release-notes (WMF-deploy-2017-10-17 (1.31.0-wmf.4)), Patch-For-Review, Discovery-Search (Current work), Discovery, CirrusSearch, MediaWiki-Search
EBernhardson added a comment to T166450: Better support for searching for misspellings.

fuzziness almost never does what you want it to do. With fuzziness a search for hat also includes the term hot.
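
The hat/hot example comes down to edit distance: the two terms differ by a single substitution, so any fuzzy match that allows one edit treats them as equivalent. A small sketch using plain Levenshtein distance (Elasticsearch's own fuzzy matching additionally counts a transposition as one edit, which this sketch does not model):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # delete char_a
                current[j - 1] + 1,                     # insert char_b
                previous[j - 1] + (char_a != char_b),   # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query, term, max_edits=1):
    return edit_distance(query, term) <= max_edits


print(fuzzy_matches("hat", "hot"))  # True: one substitution apart
```

With short words, a one-edit budget covers a large share of the vocabulary (hit, hut, had, ham, cat, ...), which is why the results tend to drift away from what was asked for.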

Mon, Oct 30, 9:36 PM · CirrusSearch, Discovery-Search, Discovery
EBernhardson added a comment to T166450: Better support for searching for misspellings.

Might be a bug elsewhere in the stack, can see your link doesn't provide a suggestion but the actual response from elasticsearch includes it.

Mon, Oct 30, 7:14 PM · CirrusSearch, Discovery-Search, Discovery
EBernhardson added a comment to T173710: Job queue is increasing non-stop.

All jobs have a requestId parameter, which is passed down through the execution chain. This is the same as the reqId field in logstash. Basically this means if the originating request logged anything to logstash, you should be able to find it with the query type:mediawiki reqId:xxxxx and looking for the very first message. That assumes of course the initial request logged anything.

Mon, Oct 30, 5:29 PM · User-Elukey, Patch-For-Review, Services (watching), Performance-Team (Radar), Discovery, CirrusSearch, Wikidata, Operations, MediaWiki-JobQueue

Fri, Oct 27

EBernhardson added a comment to T177353: Metrics for SDoC: look at search hits based on which element the search is hitting.

While we don't log it, we could certainly take a sampling of say 20k queries, run them against our test cluster, and poke at the results to see which parts triggered the hit.

Fri, Oct 27, 3:56 PM · Discovery-Analysis, Structured-Data-Commons, Discovery, Wikidata
EBernhardson created P6205 (An Untitled Masterwork).
Fri, Oct 27, 3:29 PM

Thu, Oct 26

EBernhardson added a comment to T156474: Add the possibility to do regex search on titles.

Certainly it is possible the shorter field will allow for significantly less filtering prior to running the regex. The acceleration phase basically extracts sets of trigrams (three sequential characters) that must be in the searched content from the regex and then looks for documents containing those trigrams as a first pass filter. This generally reduces the number of articles we need to run the regex on significantly. I think it is worth keeping in mind, and evaluating.
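
A simplified sketch of the trigram first-pass filter described here. The real acceleration derives the required trigrams from the regex itself; in this sketch the caller passes a literal that the regex is known to require, which is an assumption made to keep the example short. All names are illustrative.

```python
import re


def trigrams(text):
    """All runs of three sequential characters in `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def accelerated_regex_search(docs, pattern, required_literal):
    """Run `pattern` only on docs containing every trigram of `required_literal`.

    Returns (matching docs, number of docs the regex was actually run on).
    """
    needed = trigrams(required_literal)
    # First pass: cheap set containment, standing in for an index lookup.
    candidates = [doc for doc in docs if needed <= trigrams(doc)]
    # Second pass: the expensive regex, only on the surviving candidates.
    compiled = re.compile(pattern)
    return [doc for doc in candidates if compiled.search(doc)], len(candidates)


docs = ["the quick brown fox", "quack quack", "quick sort"]
matches, regex_runs = accelerated_regex_search(docs, r"quick\s+b\w+", "quick")
print(matches, regex_runs)  # ['the quick brown fox'] 2
```

A short field such as a title yields few trigrams per document, so the first pass is cheap, but short patterns may also produce few or no required trigrams, in which case the filter removes little.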

Thu, Oct 26, 9:39 PM · CirrusSearch, Discovery-Search, Discovery
EBernhardson added a comment to T156474: Add the possibility to do regex search on titles.

I don't think this would be particularly hard to implement, all the functionality already exists. We need to add the appropriate sub-fields to title and adjust the intitle: keyword to swap between term matching and regex matching the same as insource: does today.

Thu, Oct 26, 7:46 PM · CirrusSearch, Discovery-Search, Discovery

Oct 25 2017

EBernhardson created P6185 hyperparameter tuning of sim_explo_training.
Oct 25 2017, 11:22 PM
EBernhardson moved T167410: ApiQuerySearch.php: Call to a member function termMatches() on a non-object (boolean) from In progress to Done on the Discovery-Search (Current work) board.
Oct 25 2017, 10:58 PM · MW-1.31-release-notes (WMF-deploy-2017-10-24 (1.31.0-wmf.5)), Patch-For-Review, Discovery-Search (Current work), Discovery, MediaWiki-API, MediaWiki-Search, Wikimedia-log-errors
EBernhardson added a comment to T167410: ApiQuerySearch.php: Call to a member function termMatches() on a non-object (boolean).

Error is still pretty rare and not seeing anything in the new logging that will help. The raw error rate is slightly higher right now, at 160 errors in the last 24 hours, but it's really one error that was cached for 24 hours. We should stop caching that error once the above patch rolls out with wmf.5 this week.

Oct 25 2017, 10:58 PM · MW-1.31-release-notes (WMF-deploy-2017-10-24 (1.31.0-wmf.5)), Patch-For-Review, Discovery-Search (Current work), Discovery, MediaWiki-API, MediaWiki-Search, Wikimedia-log-errors
EBernhardson moved T178522: Another searchmatch span altering searches from In progress to Needs review on the Discovery-Search (Current work) board.
Oct 25 2017, 10:40 PM · MW-1.31-release-notes (WMF-deploy-2017-10-17 (1.31.0-wmf.4)), Patch-For-Review, Discovery-Search (Current work), Discovery, CirrusSearch, MediaWiki-Search
EBernhardson added a comment to D830: Boost recent documents in search results.

At least in CirrusSearch we would ensure something like this goes in the rescore phase, as opposed to the main query. It looks like phabricator has ~2M documents in the index so it is possibly worthwhile here as well.

Oct 25 2017, 10:32 PM
EBernhardson claimed T178522: Another searchmatch span altering searches.
Oct 25 2017, 8:38 PM · MW-1.31-release-notes (WMF-deploy-2017-10-17 (1.31.0-wmf.4)), Patch-For-Review, Discovery-Search (Current work), Discovery, CirrusSearch, MediaWiki-Search
EBernhardson moved T178522: Another searchmatch span altering searches from Backlog to In progress on the Discovery-Search (Current work) board.
Oct 25 2017, 8:38 PM · MW-1.31-release-notes (WMF-deploy-2017-10-17 (1.31.0-wmf.4)), Patch-For-Review, Discovery-Search (Current work), Discovery, CirrusSearch, MediaWiki-Search

Oct 24 2017

EBernhardson moved T177774: Refactor Elastic TTM Server implementation to allow experimenting new queries without breaking production usage from Later to Tech Debt/Misc on the Discovery-Search board.
Oct 24 2017, 5:29 PM · Discovery, Discovery-Search, Elasticsearch, MediaWiki-extensions-Translate
EBernhardson moved T156137: Reduce impact of GC pauses on elasticsearch response time from Up Next to Tech Debt/Misc on the Discovery-Search board.
Oct 24 2017, 5:28 PM · Discovery, Elasticsearch, Discovery-Search
EBernhardson moved T143553: Switching search traffic between datacenters should be faster from This Quarter to Tech Debt/Misc on the Discovery-Search board.
Oct 24 2017, 5:27 PM · Discovery, Elasticsearch, Discovery-Search
EBernhardson moved T87892: Convert CirrusSearch to use extension registration from This Quarter to Tech Debt/Misc on the Discovery-Search board.
Oct 24 2017, 5:27 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson moved T143195: align elasticsearch.yml template with the default configuration for elasticsearch 2.x from This Quarter to Tech Debt/Misc on the Discovery-Search board.
Oct 24 2017, 5:27 PM · Discovery-Search, Easy, Discovery, Elasticsearch
EBernhardson moved T145065: Decrease time required to fully restart the Cirrus elasticsearch clusters from This Quarter to Tech Debt/Misc on the Discovery-Search board.
Oct 24 2017, 5:26 PM · Discovery-Search, Operations, Discovery, Elasticsearch
EBernhardson moved T149741: Improve how search queries are built for cross-wiki searches to allow for better filtering from This Quarter to Tech Debt/Misc on the Discovery-Search board.
Oct 24 2017, 5:26 PM · Discovery-Search, Performance, Discovery
EBernhardson moved T167091: Elasticsearch errors about BulkShardRequest from This Quarter to Tech Debt/Misc on the Discovery-Search board.
Oct 24 2017, 5:23 PM · Discovery-Search, Operations, Discovery, Elasticsearch
EBernhardson moved T174745: Enable debug API's, like cirrusDumpQuery, cirrusDumpResult and cirrusExplain for prefix search and completion suggester from This Quarter to Tech Debt/Misc on the Discovery-Search board.
Oct 24 2017, 5:23 PM · Discovery-Search, Discovery, CirrusSearch

Oct 23 2017

EBernhardson added a comment to T174960: Varnish does not vary elasticsearch query by request body.

Actually on closer review, kibana is allowing some POST requests to a limited set of endpoints, but not your _search endpoint:

Oct 23 2017, 10:59 PM · Operations, Traffic, Wikimedia-Logstash
EBernhardson added a comment to T174960: Varnish does not vary elasticsearch query by request body.

I suppose i can add that the reason it has to be GET, rather than POST, is because the kibana application that receives these requests and proxies them to elasticsearch only proxies GET requests. If it tried to proxy POST it would require a good bit more complexity to ensure the requests don't perform writes.

Oct 23 2017, 10:55 PM · Operations, Traffic, Wikimedia-Logstash

Oct 18 2017

EBernhardson added a project to T178530: Improve field mapping for nginx logstash: Wikimedia-Logstash.
Oct 18 2017, 8:14 PM · Discovery-Search (Current work), Patch-For-Review, User-Smalyshev, Wikimedia-Logstash, Discovery, Wikidata, Wikidata-Query-Service
EBernhardson added a comment to T178530: Improve field mapping for nginx logstash.

There might be other opinions, but i think hard coding specific fields to specific types in the logstash config is reasonable as long as it's documented. The primary problem we run into is that without coordination different types can be sent to fields by different applications. While documenting and expecting applications to conform to types for all fields is not going to happen, doing it for some limited set of useful fields seems acceptable to me.

Oct 18 2017, 8:13 PM · Discovery-Search (Current work), Patch-For-Review, User-Smalyshev, Wikimedia-Logstash, Discovery, Wikidata, Wikidata-Query-Service

Oct 17 2017

EBernhardson created T178442: ssl terminators on elasticsearch servers (nginx) don't send their logs to logstash.
Oct 17 2017, 10:07 PM · Discovery-Search, Wikimedia-Logstash
EBernhardson added a comment to T167410: ApiQuerySearch.php: Call to a member function termMatches() on a non-object (boolean).

Increased logging levels to debug about 1.5 hours ago for:

  • org.elasticsearch.http
  • org.elasticsearch.transport
  • io.netty
Oct 17 2017, 8:45 PM · MW-1.31-release-notes (WMF-deploy-2017-10-24 (1.31.0-wmf.5)), Patch-For-Review, Discovery-Search (Current work), Discovery, MediaWiki-API, MediaWiki-Search, Wikimedia-log-errors
EBernhardson created T178425: experimental plugin throws exception on some requests.
Oct 17 2017, 7:03 PM · Discovery-Search, Elasticsearch, Discovery
EBernhardson added a comment to T167410: ApiQuerySearch.php: Call to a member function termMatches() on a non-object (boolean).

On further thought, i think transport logger is the wrong one. That looks to be inter-node transport. I'm not sure yet which logger would have information on REST connections. Will probably require more poking around to figure out where that is happening.

Oct 17 2017, 12:08 AM · MW-1.31-release-notes (WMF-deploy-2017-10-24 (1.31.0-wmf.5)), Patch-For-Review, Discovery-Search (Current work), Discovery, MediaWiki-API, MediaWiki-Search, Wikimedia-log-errors

Oct 16 2017

EBernhardson added a comment to T167410: ApiQuerySearch.php: Call to a member function termMatches() on a non-object (boolean).

Looked into this a bit more today, correlated log messages:

Oct 16 2017, 11:56 PM · MW-1.31-release-notes (WMF-deploy-2017-10-24 (1.31.0-wmf.5)), Patch-For-Review, Discovery-Search (Current work), Discovery, MediaWiki-API, MediaWiki-Search, Wikimedia-log-errors
EBernhardson added a comment to T175049: Investigate which languages we should test on next.

Actually the difficulty here isn't necessarily translating the questions, translatewiki will help out there, the problems will be:

Oct 16 2017, 9:45 PM · Discovery-Search (Current work), Discovery
EBernhardson added a comment to T176493: Analysis of testing on 18 wikis with > 1% of search traffic.

Seems easy enough, i'll re-train models for hewiki and dewiki and we can rerun the tests on both since they look pretty odd. hewiki sampling was already at 80%, so to get the same amount of data we have to run the test for two weeks. dewiki only used 3% sampling so that could be run in a single week.

Oct 16 2017, 9:28 PM · Patch-For-Review, Discovery-Analysis (Current work), Discovery-Search (Current work), Discovery
EBernhardson added a comment to T176493: Analysis of testing on 18 wikis with > 1% of search traffic.
  • @EBernhardson—is there any chance something was configured incorrectly for arwiki? The results being so similar is just too weird to accept at face value.

The per-wiki configuration is very minimal, just a name of a model to use. I double checked and the model does exist with the configured name (and if it was wrong, the search would fail as opposed to returning the un-rescored results). I agree it's odd to be so similar, but I don't see anything obvious. Perhaps we could load the prod model and arwiki dump into relforge and run a test to see how much the results change, if somehow the model is predicting results very close to the control ranking that could explain it (and would be surprising).

Oct 16 2017, 9:16 PM · Patch-For-Review, Discovery-Analysis (Current work), Discovery-Search (Current work), Discovery

Oct 13 2017

EBernhardson added a comment to T178006: Search Relevance test #5: are users happy with the search results they got?.

I've put up a patch to core that handles this in a slightly different way. Basically the timeouts will only consider time the page is visible, rather than wall time (as in, the time on the clock on your wall). Then the only thing that needs to happen is resetting the timeout back to 60s when the user clicks the link and goes to a different page to read the privacy statement or why we are running the test.

Oct 13 2017, 11:23 PM · Discovery-Search (Current work), Discovery
EBernhardson moved T176428: Search Relevance test #4 - action items from In progress to Needs review on the Discovery-Search (Current work) board.
Oct 13 2017, 10:43 PM · Patch-For-Review, Discovery-Search (Current work), Discovery
EBernhardson added a comment to T176428: Search Relevance test #4 - action items.

Patches are up. Some minor changes from the spec:

Oct 13 2017, 10:39 PM · Patch-For-Review, Discovery-Search (Current work), Discovery
EBernhardson moved T177477: rebuildtextindex fails when searchindex is on InnoDB (no support for MyISAM) from Needs review to Done on the Discovery-Search (Current work) board.
Oct 13 2017, 8:17 PM · MW-1.31-release-notes (WMF-deploy-2017-10-17 (1.31.0-wmf.4)), Patch-For-Review, MediaWiki-Maintenance-scripts, Discovery-Search (Current work), Discovery, MediaWiki-Search

Oct 12 2017

EBernhardson added a comment to T176493: Analysis of testing on 18 wikis with > 1% of search traffic.

Might also be worth looking into: I increased the sampling rates significantly for this test. This new test ran for 16 days and contains 1.4M SERP events from 683k sessions, significantly higher than anything we've collected before. Is this increase in event counts useful in making the buckets differentiable, or is it simply more data to store and process? I realize though that because the data is split between so many wikis it may not be as useful as having 700k sessions all from a single busy site like dewiki or enwiki.

Oct 12 2017, 11:57 PM · Patch-For-Review, Discovery-Analysis (Current work), Discovery-Search (Current work), Discovery
EBernhardson added a comment to T176493: Analysis of testing on 18 wikis with > 1% of search traffic.

@chelsyx I don't think the ltr-i-1024 bucket should be included in this first look, it's an interleaved result set that can't really be interpreted with our standard metrics.

Oct 12 2017, 11:50 PM · Patch-For-Review, Discovery-Analysis (Current work), Discovery-Search (Current work), Discovery
EBernhardson added a comment to T176493: Analysis of testing on 18 wikis with > 1% of search traffic.

Autocomplete data collection was intentionally turned off in this test, we were collecting much more data than usual and I wanted to avoid adding all of those events that I didn't think we would look at.

Oct 12 2017, 4:39 PM · Patch-For-Review, Discovery-Analysis (Current work), Discovery-Search (Current work), Discovery

Oct 10 2017

EBernhardson claimed T176428: Search Relevance test #4 - action items.
Oct 10 2017, 8:52 PM · Patch-For-Review, Discovery-Search (Current work), Discovery
EBernhardson moved T176428: Search Relevance test #4 - action items from Backlog to In progress on the Discovery-Search (Current work) board.
Oct 10 2017, 8:52 PM · Patch-For-Review, Discovery-Search (Current work), Discovery
EBernhardson moved T171462: HTML returned as text from Done to Backlog on the Discovery-Search (Current work) board.
Oct 10 2017, 8:35 PM · Discovery-Search, Discovery, MediaWiki-Special-pages
EBernhardson added a comment to T171462: HTML returned as text.

i was completely looking at the wrong page ... it's not fixed yet. I still think the right solution is to remove the tags from the message

Oct 10 2017, 8:35 PM · Discovery-Search, Discovery, MediaWiki-Special-pages
EBernhardson added a comment to T171462: HTML returned as text.

looks like the custom message has been removed on rowiki

Oct 10 2017, 8:33 PM · Discovery-Search, Discovery, MediaWiki-Special-pages
EBernhardson moved T171462: HTML returned as text from Backlog to Done on the Discovery-Search (Current work) board.
Oct 10 2017, 8:32 PM · Discovery-Search, Discovery, MediaWiki-Special-pages
EBernhardson added a comment to T118443: Over-ride the "zebra marquee" pending styling in Special:Search so that it's inconsistent with the rest of the interface, but consistent with every other full text search the user has likely used..

opensearch has been the default since before anyone on the team started working on search. I believe the primary difference is this endpoint uses caching by default for not-logged-in users which alleviates traffic levels and provides quicker responses, especially with shorter queries (which are the most expensive to run). I don't see any particular reason we have to use it, frontend should be able to provide appropriate requests to allow caching with the prefix api.

Oct 10 2017, 7:32 PM · Discovery-Search, MW-1.27-release (WMF-deploy-2016-03-08_(1.27.0-wmf.16)), MW-1.27-release-notes, UI-Standardization, MediaWiki-Search, Discovery
EBernhardson added a comment to T176428: Search Relevance test #4 - action items.

I think another important thing to figure out will be how many survey impressions we think we need to get reliable information (maybe already covered).

Oct 10 2017, 5:23 PM · Patch-For-Review, Discovery-Search (Current work), Discovery
EBernhardson added a comment to T175048: Search Relevance Survey test #3: analysis of test.

Tangentially related, i wonder if this can be used to better tune the DBN data as well. Basically the DBN can give us attractiveness and satisfaction %'s, which we currently just multiply together and then linearly scale up to [0, 10]. We could potentially take the values from this click model, as well as a couple of other click models (implemented in the same repository) that make different assumptions, and then learn a simple model that combines the information from the various click models to try to match the data we get out of the relevance surveys (this requires survey data on queries for which we also have enough sessions to train click models). Or maybe that ends up being too many layers of ML, not sure.
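
For illustration, the current labeling step described above could look roughly like the sketch below. This is a minimal sketch, not the actual implementation: the function name `dbn_relevance_label` and the `max_label` parameter are hypothetical, and it assumes the "linear scale" is a plain multiplication of the [0, 1] product by 10.

```python
def dbn_relevance_label(attractiveness, satisfaction, max_label=10.0):
    """Combine two DBN click-model probabilities into one relevance label.

    Assumption: both inputs are probabilities in [0, 1], and the linear
    scaling is a multiplication of their product by max_label.
    """
    for name, value in (("attractiveness", attractiveness),
                        ("satisfaction", satisfaction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # The product of the two probabilities is itself in [0, 1];
    # scale it linearly to [0, max_label].
    return attractiveness * satisfaction * max_label


# Hypothetical values for a single (query, page) pair.
label = dbn_relevance_label(attractiveness=0.5, satisfaction=0.4)
```

A learned combination of several click models, as suggested in the comment, would replace the fixed product with a model fitted against the survey data.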

Oct 10 2017, 5:21 PM · Discovery-Analysis (Current work), Discovery
EBernhardson moved T177477: rebuildtextindex fails when searchindex is on InnoDB (no support for MyISAM) from In progress to Needs review on the Discovery-Search (Current work) board.
Oct 10 2017, 4:53 PM · MW-1.31-release-notes (WMF-deploy-2017-10-17 (1.31.0-wmf.4)), Patch-For-Review, MediaWiki-Maintenance-scripts, Discovery-Search (Current work), Discovery, MediaWiki-Search
EBernhardson moved T177477: rebuildtextindex fails when searchindex is on InnoDB (no support for MyISAM) from Backlog to In progress on the Discovery-Search (Current work) board.
Oct 10 2017, 4:50 PM · MW-1.31-release-notes (WMF-deploy-2017-10-17 (1.31.0-wmf.4)), Patch-For-Review, MediaWiki-Maintenance-scripts, Discovery-Search (Current work), Discovery, MediaWiki-Search
EBernhardson updated subscribers of T171652: Language Analysis Morphological Library Research Spike.

Perhaps worth noting that I'm pretty sure http://discovery.wmflabs.org/metrics/#langproj_breakdown isn't a true breakdown of search volume, although i should double check with @mpopov. I think that's a proportion of events in the TestSearchSatisfaction schema. The sampling on low volume wikis is all the same, but the top 20 or so have custom sampling rates, which means we can't directly compare the numbers.
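
To make the sampling issue concrete, here is a minimal sketch of how raw event counts would have to be corrected before wikis can be compared. The function name, wiki names, counts, and sampling rates are all made up for illustration; it assumes each wiki samples 1 in N sessions and that events scale proportionally with that rate.

```python
def estimate_search_volume(event_counts, sampling_denominators):
    """Scale sampled event counts back up to an estimated total volume.

    event_counts: wiki -> number of logged events (hypothetical data).
    sampling_denominators: wiki -> N, where 1 in N sessions is sampled
    (hypothetical rates, not the real configuration).
    """
    estimates = {}
    for wiki, count in event_counts.items():
        denominator = sampling_denominators[wiki]
        if denominator < 1:
            raise ValueError(f"invalid sampling denominator for {wiki}")
        # Each logged event stands in for `denominator` real events.
        estimates[wiki] = count * denominator
    return estimates


# Two wikis with identical raw event counts but different sampling rates.
raw_events = {"wiki_a": 1000, "wiki_b": 1000}
sampling = {"wiki_a": 100, "wiki_b": 10}
volumes = estimate_search_volume(raw_events, sampling)
```

With these invented numbers the two wikis show the same number of logged events, yet the estimated volume for `wiki_a` is ten times that of `wiki_b`, which is why the raw proportions cannot be read as a breakdown of search volume.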

Oct 10 2017, 4:47 PM · Discovery-Search (Current work), Tamil-Sites, Malayalam-Sites, Bengali-Sites, Discovery
EBernhardson added a comment to T176997: Extract a set of a few hundred most popular abandoned queries.

The data above was for a single day, but i can get a run going with a full month of data. I wanted to get a bit more data cleaning in there, will poke at your suggestions and see if i can find nicely automated ways to filter them out of the data.

Oct 10 2017, 4:45 PM · Discovery-Search (Current work), CirrusSearch, Discovery

Oct 6 2017

EBernhardson moved T177273: avro-php tests fail under PHP 7 from Needs triage to Up Next on the Discovery-Search board.
Oct 6 2017, 7:14 PM · Discovery, NewPHP, Discovery-Search
EBernhardson added a project to T177273: avro-php tests fail under PHP 7: Discovery-Search.
Oct 6 2017, 7:14 PM · Discovery, NewPHP, Discovery-Search

Oct 5 2017

EBernhardson updated subscribers of T177535: MWException from line 545 of SiteConfiguration.php: No such wiki 's'..

One lingering problem found here, which is papered over by appropriately checking InterwikiLookup::fetch results: svwiki has renamed the wikisource interwiki prefix from 's' to 'src'. We get the prefix from $wgSiteMatrixSites, which is the same for all wikis. The end result is that svwiki doesn't get (never got) sister search results for wikisource. Not the end of the world, but it would be nice to be able to source the data correctly. Unfortunately the only place i can find this variation, so far, is in InterwikiLookup itself, and the way its data is laid out makes it impossible to extract this efficiently.

Oct 5 2017, 11:33 PM · MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)), Discovery-Search (Current work), Patch-For-Review, Wikimedia-log-errors
EBernhardson edited projects for T177535: MWException from line 545 of SiteConfiguration.php: No such wiki 's'., added: Discovery-Search (Current work); removed Discovery-Search.

Patch is up and should resolve the problem. In the longer run we need better testing of this code path, but it's hard as you need a bunch of cross-wiki stuff set up in the test environment. Perhaps we can at least get it going in the browser test suite, which already has multiple languages for testing analysis chains.

Oct 5 2017, 9:00 PM · MW-1.31-release-notes (WMF-deploy-2017-10-10 (1.31.0-wmf.3)), Discovery-Search (Current work), Patch-For-Review, Wikimedia-log-errors
EBernhardson updated the task description for T177519: Build and AB test an ML Model with all the features exploded into individual pieces.
Oct 5 2017, 5:01 PM · Discovery-Search