EBernhardson (EBernhardson)
User

User Details

User Since
Oct 7 2014, 4:49 PM (176 w, 3 d)
Availability
Available
LDAP User
EBernhardson
MediaWiki User
EBernhardson (WMF)

Recent Activity

Thu, Feb 22

EBernhardson added a comment to T183451: Collect per node latency percentiles on our elasticsearch cirrus clusters.

I suppose the :9109 in the instance names is a bit annoying, in that it makes the list of instances much longer. Really though, there are too many instances to list and it needs to be further filtered (top-N?) anyway.

Thu, Feb 22, 8:10 PM · Discovery-Search (Current work)
EBernhardson added a comment to T183451: Collect per node latency percentiles on our elasticsearch cirrus clusters.

I put together a very basic attempt at a first dashboard: https://grafana.wikimedia.org/dashboard/db/elasticsearch-per-node-percentiles?orgId=1
The overall numbers look sane and roughly what is expected.

Thu, Feb 22, 7:59 PM · Discovery-Search (Current work)
EBernhardson added a comment to T187548: [Bug] subpageof results will sometimes display wrong results .

Ran a quick test. To get the highlighter to return results I needed to add a highlight_query which referenced the text, and I needed to add the .prefix subfields to the per-field highlighter configuration matched_fields. Looks fixable, but I will need to poke around a bit to see how this should be implemented.
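A minimal sketch of what such a highlight section might look like, written as a Python dict. The field names `title` and `title.prefix`, the `match` query used as the highlight_query, and the helper name are all assumptions for illustration, not the configuration that was actually tested; which highlighter type honors matched_fields depends on the deployment and is deliberately not set here.

```python
import json


def build_highlight(text: str, field: str = "title") -> dict:
    """Build a per-field highlight section (hypothetical helper).

    `field` and its `.prefix` subfield are assumed names; the real
    CirrusSearch mapping may differ.
    """
    return {
        "fields": {
            field: {
                # Let matches on the .prefix subfield count for this field.
                "matched_fields": [field, f"{field}.prefix"],
                # Give the highlighter its own query that references the text.
                "highlight_query": {"match": {field: text}},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_highlight("Foo/Bar"), indent=2))
```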

Thu, Feb 22, 7:53 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson added a comment to T187548: [Bug] subpageof results will sometimes display wrong results .

Looking at the query we build and send to elasticsearch, it seems we might need to provide an appropriate query to the highlighter? Some testing should be able to tell.

Thu, Feb 22, 6:17 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson updated the task description for T188015: Increase ltr.cache.max_size in Cirrus elasticsearch clusters.
Thu, Feb 22, 5:14 PM · Patch-For-Review, Discovery-Search (Current work), Discovery
EBernhardson moved T188015: Increase ltr.cache.max_size in Cirrus elasticsearch clusters from Backlog to Needs review on the Discovery-Search (Current work) board.
Thu, Feb 22, 5:12 PM · Patch-For-Review, Discovery-Search (Current work), Discovery
EBernhardson triaged T188015: Increase ltr.cache.max_size in Cirrus elasticsearch clusters as Normal priority.
Thu, Feb 22, 5:09 PM · Patch-For-Review, Discovery-Search (Current work), Discovery

Wed, Feb 21

EBernhardson added a comment to T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin.

After some testing my theory is the performance on eqiad is due to our ltr model cache size. It's currently set to the default of 10MB. If I run the ltr model in a series of requests at only 1 concurrent query I get poor performance. Ramping up to 5 concurrent queries, per-query performance is still poor but improving. At 15 concurrent queries things seem to generally improve, with nothing over 500ms in the last few dozen of the 1k. My interpretation of this is that the more likely the model is to be in the cache the faster the request, and the model is being evicted very quickly. Unfortunately the cache size is not dynamically updatable; we have to do a rolling restart to update it. I've filed a bug upstream for this. Additionally, I don't have stronger proof the cache is undersized because we only report number of items and size in bytes for cache stats, nothing about loads or invalidations from which trends over time could indicate churn. I've filed an upstream bug about that as well.

Wed, Feb 21, 11:07 PM · Discovery-Search (Current work), Discovery
EBernhardson added a comment to T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin.

I've also now created a limited feature set with just what the mrmr model needs and uploaded it to the prod clusters. Query time on the codfw cluster looks really good, around 125ms. Query time on the eqiad cluster on the other hand is hanging out around 750ms which is unacceptable. Going to have to dig into why the queries take significantly different amounts of time on the different clusters. I'm suspicious it's not actually feature collecting, but maybe model caching? This is because if i adjust the query to only have 6 results, so only 6 possible items could have been provided to feature collecting across 7 shards, i still see ~750ms. This suggests some sort of constant cost rather than a per-item cost like feature collection.

Wed, Feb 21, 9:33 PM · Discovery-Search (Current work), Discovery
EBernhardson added a comment to T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin.

After evaluating, I found the previous iteration was missing all the features related to the plain fields due to a bug. Re-collected the data and came up with 0.8627. Unfortunately there is something odd going on at evaluation time. I uploaded the 0.8628 model as [[ https://en.wikipedia.org/wiki/Special:Search?search=kennedy&fulltext=1&cirrusMLRModel=20180118-query_explorer_v2-enwiki-v1 | 20180118-query_explorer_v2-enwiki-v1 ]] and it takes on average around 2s to return results. Taking the query being issued here and running it against codfw comes back at 300ms. 300ms is still probably too expensive, but 2s is amazing. The eqiad cluster load doesn't seem high enough (~25% cpu, minimal io) to make the queries 6x slower. Unfortunately it looks like the ltr plugin deployed in prod doesn't work with the elasticsearch profile api and throws an exception.

Wed, Feb 21, 8:51 PM · Discovery-Search (Current work), Discovery

Tue, Feb 20

EBernhardson added a comment to T187240: Boost search results with exact phrase match.

I have a test model up that currently pushes Chris to the top: https://en.wikipedia.org/wiki/Special:Search?search=Ingo+Heinrich&fulltext=1&cirrusMLRModel=20180118-query_explorer_v2-enwiki-v1

Tue, Feb 20, 7:43 PM · Discovery-Search (Current work), CirrusSearch, Discovery

Thu, Feb 15

EBernhardson added a comment to T187240: Boost search results with exact phrase match.

I can add a phrase match, but I'm not sure phrase match will be enough to push this single occurrence all the way to the top. For this specific page a redirect containing the name might do the trick, but that's not as generalizable. Longer term I think named entity extraction has some potential here. A phrase match and a named entity match might (but needs to be evaluated) be enough.

Thu, Feb 15, 8:43 PM · Discovery-Search (Current work), CirrusSearch, Discovery
EBernhardson added a comment to T186742: Predict relevance of search results from historical clicks using a Neural Click Model.

@Vvekbv I think first steps for this project would be to review the paper and review the aggregation performed in the associated patch, to verify that the input vectors are generated correctly according to the paper. Once we think the data collection is correct i can run it against some time period and put a sample of data (probably saved as a scipy.sparse matrix, for loading into python) on https://analytics.wikimedia.org/datasets/discovery/ to start working with.

Thu, Feb 15, 8:23 PM · Possible-Tech-Projects, Discovery-Search, Google-Summer-of-Code (2018)

Tue, Feb 13

EBernhardson updated the task description for T186742: Predict relevance of search results from historical clicks using a Neural Click Model.
Tue, Feb 13, 9:54 PM · Possible-Tech-Projects, Discovery-Search, Google-Summer-of-Code (2018)
EBernhardson updated the task description for T186742: Predict relevance of search results from historical clicks using a Neural Click Model.
Tue, Feb 13, 9:53 PM · Possible-Tech-Projects, Discovery-Search, Google-Summer-of-Code (2018)
EBernhardson moved T184008: Language fallback for search can fail when rescore profile doesn't exist on target wiki from Needs review to Done on the Discovery-Search (Current work) board.
Tue, Feb 13, 7:57 PM · Discovery-Search (Current work), MW-1.31-release-notes (WMF-deploy-2018-01-09 (1.31.0-wmf.16)), Patch-For-Review, Discovery, CirrusSearch
EBernhardson moved T186157: Build an end to end integration test for mjolnir from Needs review to Done on the Discovery-Search (Current work) board.
Tue, Feb 13, 7:56 PM · Patch-For-Review, Discovery-Search (Current work)
EBernhardson moved T185127: Add SPARQL client to core from Needs review to Done on the Discovery-Search (Current work) board.
Tue, Feb 13, 6:22 PM · MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), Patch-For-Review, Discovery-Search (Current work), User-Smalyshev, TCB-Team, German-Community-Wishlist, Discovery
EBernhardson moved T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin from Backlog to In progress on the Discovery-Search (Current work) board.
Tue, Feb 13, 6:21 PM · Discovery-Search (Current work), Discovery
EBernhardson moved T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin from Needs triage to Current work on the Discovery-Search board.
Tue, Feb 13, 6:20 PM · Discovery-Search (Current work), Discovery
EBernhardson added a comment to T186742: Predict relevance of search results from historical clicks using a Neural Click Model.

One of the things that surprised me when attending the GSOC mentor summit in 2016 was that the scope of our tasks was much more limited than that of some of the other orgs. For example, this task I think is fairly comparable to Learning to rank Clickstream Mining at https://trac.xapian.org/wiki/GSoCProjectIdeas but is a bit more focused on a specific idea. I think we will have to be careful to only accept a student with enough background/experience to have a chance at finishing, but I don't think that needs to prevent students from applying.

Tue, Feb 13, 4:17 PM · Possible-Tech-Projects, Discovery-Search, Google-Summer-of-Code (2018)
EBernhardson added a comment to T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin.

Built out feature collection for these; the dataset sizes are a bit larger than expected. The xgboost dataset for enwiki, when maintaining 35M observations, is roughly 50GB. Interestingly the lightgbm dataset for the same data is only 4GB. Similarly, while xgboost needed 10 executors with 16G memory each to run training, lightgbm looks like it will only require 1 executor with 10G of memory. Might be worth evaluating if the ltr plugin can correctly evaluate lightgbm trees (if we convert them to xgboost format).

Tue, Feb 13, 4:12 AM · Discovery-Search (Current work), Discovery
EBernhardson triaged T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin as Normal priority.
Tue, Feb 13, 4:03 AM · Discovery-Search (Current work), Discovery

Mon, Feb 12

EBernhardson added a comment to T187139: Hadoop jobs that generate large temporary files can take down nodes.

From the application side it seems like i could perhaps put the temp files in a better place. One option would be PWD of the application which looks to be something like:

Mon, Feb 12, 11:44 PM · Patch-For-Review, Analytics, Analytics-Cluster
EBernhardson updated the task description for T187139: Hadoop jobs that generate large temporary files can take down nodes.
Mon, Feb 12, 11:42 PM · Patch-For-Review, Analytics, Analytics-Cluster
EBernhardson created T187139: Hadoop jobs that generate large temporary files can take down nodes.
Mon, Feb 12, 11:39 PM · Patch-For-Review, Analytics, Analytics-Cluster

Thu, Feb 8

EBernhardson removed a project from T186742: Predict relevance of search results from historical clicks using a Neural Click Model: Patch-For-Review.

Attached patch is a small proof of concept, but to go from there to a full evaluation of NCM vs DBN is going to be a large amount of work I can't find time for in the immediate future. This would be a great project for someone with interest in applying NN to practical problems.

Thu, Feb 8, 7:52 PM · Possible-Tech-Projects, Discovery-Search, Google-Summer-of-Code (2018)

Wed, Feb 7

EBernhardson updated the task description for T186742: Predict relevance of search results from historical clicks using a Neural Click Model.
Wed, Feb 7, 11:02 PM · Possible-Tech-Projects, Discovery-Search, Google-Summer-of-Code (2018)
EBernhardson added a comment to T186742: Predict relevance of search results from historical clicks using a Neural Click Model.

abbreviations:

Wed, Feb 7, 7:49 PM · Possible-Tech-Projects, Discovery-Search, Google-Summer-of-Code (2018)
EBernhardson added projects to T186742: Predict relevance of search results from historical clicks using a Neural Click Model: Discovery-Search, Possible-Tech-Projects.
Wed, Feb 7, 7:38 PM · Possible-Tech-Projects, Discovery-Search, Google-Summer-of-Code (2018)
EBernhardson created T186742: Predict relevance of search results from historical clicks using a Neural Click Model.
Wed, Feb 7, 7:38 PM · Possible-Tech-Projects, Discovery-Search, Google-Summer-of-Code (2018)
EBernhardson added a comment to T186134: Experiment with varying MLR training hyperparameter space.

Next up: Train with more hyperopt iterations (300? 500?), to see if continued search is beneficial.

Wed, Feb 7, 3:24 AM · Discovery-Search (Current work)
EBernhardson added a comment to T186134: Experiment with varying MLR training hyperparameter space.

Next up: Experiment with variations in number of trees for small wikis, does current setting of 500 help?

Wed, Feb 7, 3:19 AM · Discovery-Search (Current work)

Tue, Feb 6

EBernhardson added a comment to T186134: Experiment with varying MLR training hyperparameter space.

First up: Train the same data with the same hyperparameter space multiple (3) times to get an idea of expected variance

Tue, Feb 6, 10:27 PM · Discovery-Search (Current work)
EBernhardson moved T177520: Experiment with different grouping of queries that get fed into the DBN from In progress to Backlog on the Discovery-Search (Current work) board.
Tue, Feb 6, 9:40 PM · Discovery-Search (Current work), Discovery
EBernhardson moved T186157: Build an end to end integration test for mjolnir from In progress to Needs review on the Discovery-Search (Current work) board.
Tue, Feb 6, 9:39 PM · Patch-For-Review, Discovery-Search (Current work)
EBernhardson moved T186157: Build an end to end integration test for mjolnir from Backlog to In progress on the Discovery-Search (Current work) board.
Tue, Feb 6, 9:39 PM · Patch-For-Review, Discovery-Search (Current work)

Wed, Jan 31

EBernhardson created T186157: Build an end to end integration test for mjolnir.
Wed, Jan 31, 9:22 PM · Patch-For-Review, Discovery-Search (Current work)
EBernhardson moved T184547: Switch mjolnir to file based training from Needs review to Done on the Discovery-Search (Current work) board.
Wed, Jan 31, 9:15 PM · Discovery-Search (Current work)
EBernhardson renamed T186134: Experiment with varying MLR training hyperparameter space from Experiment with varying the to Experiment with varying MLR training hyperparameter space.
Wed, Jan 31, 6:43 PM · Discovery-Search (Current work)
EBernhardson created T186134: Experiment with varying MLR training hyperparameter space.
Wed, Jan 31, 6:41 PM · Discovery-Search (Current work)
EBernhardson updated the task description for T175210: Select candidate jobs for transferring to the new infrastucture.
Wed, Jan 31, 4:56 AM · Patch-For-Review, Services (doing), MediaWiki-JobQueue, ChangeProp, Analytics, EventBus, Operations, User-Joe, User-Elukey

Jan 23 2018

EBernhardson added a comment to T185191: Investigate browser tested search results.

Nothing automatic as far as I'm aware either. Cirrus integration tests use the api to create all their pages.

Jan 23 2018, 8:25 PM · Patch-For-Review, WMDE-Fundraising-Sprint-15, TCB-Team, Advanced-Search

Jan 19 2018

EBernhardson added a comment to T182160: Develop tests for phabricator search to detect regressions / search quality issues.

If you are looking for search quality, typically what would be done is:

Jan 19 2018, 6:03 PM · User-zeljkofilipin, Browser-Tests, monitoring, Release-Engineering-Team (Kanban), Phabricator

Jan 18 2018

EBernhardson added a comment to T185250: Investigate irrelevant sister project search results on Wikipedia.

For the moment, patch switches wiktionary on enwiki to use the title filter. We can ponder a bit on how to improve the relevance of the wiktionary search, or how to filter results that happen to be there but aren't particularly good.

Jan 18 2018, 10:18 PM · Discovery-Search (Current work), Patch-For-Review, Discovery
EBernhardson moved T185250: Investigate irrelevant sister project search results on Wikipedia from Backlog to In progress on the Discovery-Search (Current work) board.
Jan 18 2018, 10:17 PM · Discovery-Search (Current work), Patch-For-Review, Discovery
EBernhardson claimed T185250: Investigate irrelevant sister project search results on Wikipedia.
Jan 18 2018, 10:17 PM · Discovery-Search (Current work), Patch-For-Review, Discovery
EBernhardson moved T185250: Investigate irrelevant sister project search results on Wikipedia from Up Next to Current work on the Discovery-Search board.
Jan 18 2018, 10:17 PM · Discovery-Search (Current work), Patch-For-Review, Discovery
EBernhardson added a comment to T184099: experiment with different label scales for MLR.

I realized the two posts above had serious leakage between test and train sets, so here we go again. This time we split the training data into 3 folds. Against each fold we train a model with labels on a 0-9 scale and a model with labels on a 0-3 scale, then we evaluate both models against both scales.
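The fold-and-scale setup described here can be sketched roughly as below. Assigning folds per query id (so that no query appears in more than one fold) is an assumption about how the leakage was avoided, and both function names are hypothetical.

```python
from itertools import product


def assign_folds(query_ids, num_folds=3):
    """Map each distinct query id to one fold.

    Keeping every row of a query in the same fold is one common way to
    avoid train/test leakage; this is an assumed approach, not
    necessarily what the experiment did.
    """
    distinct = sorted(set(query_ids))
    return {qid: i % num_folds for i, qid in enumerate(distinct)}


def evaluation_grid(num_folds=3, scales=("0-9", "0-3")):
    """Enumerate (fold, training scale, evaluation scale) combinations."""
    return list(product(range(num_folds), scales, scales))


if __name__ == "__main__":
    print(assign_folds(["a", "b", "a", "c", "d"]))
    for fold, train_scale, eval_scale in evaluation_grid():
        print(fold, train_scale, eval_scale)
```

With 3 folds and 2 label scales this yields 12 evaluations: each fold's two models are scored against both scales.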

Jan 18 2018, 6:09 AM · Discovery-Search (Current work)
EBernhardson added a comment to T184099: experiment with different label scales for MLR.
Jan 18 2018, 3:57 AM · Discovery-Search (Current work)

Jan 17 2018

EBernhardson added a comment to T184099: experiment with different label scales for MLR.
Jan 17 2018, 10:55 PM · Discovery-Search (Current work)
EBernhardson moved T184099: experiment with different label scales for MLR from Backlog to In progress on the Discovery-Search (Current work) board.
Jan 17 2018, 10:47 PM · Discovery-Search (Current work)
EBernhardson claimed T184099: experiment with different label scales for MLR.
Jan 17 2018, 10:47 PM · Discovery-Search (Current work)
EBernhardson moved T184099: experiment with different label scales for MLR from Up Next to Current work on the Discovery-Search board.
Jan 17 2018, 10:47 PM · Discovery-Search (Current work)
EBernhardson moved T184547: Switch mjolnir to file based training from In progress to Needs review on the Discovery-Search (Current work) board.
Jan 17 2018, 10:46 PM · Discovery-Search (Current work)

Jan 11 2018

EBernhardson updated subscribers of T184767: Make it possible to stop a survey after receiving a certain number of responses.

I'd have to pull in someone with more stats background to be sure, but I think that cutting off a survey at a specific number of responses can be a bit too early. An example I've seen used:

Jan 11 2018, 9:46 PM · Readers-Web-Backlog (Tracking), QuickSurveys, Surveys, Community-Liaisons
EBernhardson added a comment to T89970: Enable microsurveys for long-term tracking of editing experience .

T183941 is probably relevant, but T184767 less so. The problem with T184767 is we want to spread the surveys out. If we ask for 100 impressions per week we don't want to get 100 impressions all in the first hour and end the survey, which T184767 seems to be implying. For 100/week we want around 14/day spread across the days/hours of operation to get all the different kinds of users.
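The pacing arithmetic above can be illustrated with a small sketch. The function name and the even-pacing rule are assumptions for illustration only; this is not how QuickSurveys is implemented.

```python
HOURS_PER_WEEK = 7 * 24


def paced_budget(weekly_target: int, hours_elapsed: float, shown_so_far: int) -> int:
    """Impressions still allowed at this point in the week under even pacing.

    Hypothetical helper: the cumulative allowance grows linearly over the
    week instead of being available all at once.
    """
    allowed_by_now = weekly_target * hours_elapsed / HOURS_PER_WEEK
    return max(0, int(allowed_by_now) - shown_so_far)


if __name__ == "__main__":
    # 100 impressions per week is roughly 14 per day.
    print(paced_budget(100, 24, 0))
    # After the first hour nothing is allowed yet, so the survey
    # cannot burn through its whole weekly quota immediately.
    print(paced_budget(100, 1, 0))
```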

Jan 11 2018, 9:17 PM · QuickSurveys, Surveys, Community-Liaisons, MediaWiki-Page-editing
EBernhardson added a comment to T184754: many search-mjolnir-tox-docker jobs in aborted state.

Ahh, I was only thinking of JVM dependencies and didn't remember the new python dep as well. Makes sense that it is taking a while to compile; wish they would ship binary wheels like the other packages.

Jan 11 2018, 8:20 PM · Patch-For-Review, Release-Engineering-Team (Kanban), Continuous-Integration-Config
EBernhardson added a comment to T89970: Enable microsurveys for long-term tracking of editing experience .

Another important aspect, at least for how search is using micro-surveys, is per-page sampling rates. Or more specifically targeting # of impressions per week on a per-page basis, rather than site wide.

Jan 11 2018, 7:35 PM · QuickSurveys, Surveys, Community-Liaisons, MediaWiki-Page-editing
EBernhardson updated the task description for T184754: many search-mjolnir-tox-docker jobs in aborted state.
Jan 11 2018, 6:42 PM · Patch-For-Review, Release-Engineering-Team (Kanban), Continuous-Integration-Config
EBernhardson created T184754: many search-mjolnir-tox-docker jobs in aborted state.
Jan 11 2018, 6:36 PM · Patch-For-Review, Release-Engineering-Team (Kanban), Continuous-Integration-Config

Jan 9 2018

EBernhardson updated the task description for T184547: Switch mjolnir to file based training.
Jan 9 2018, 6:12 PM · Discovery-Search (Current work)
EBernhardson moved T184547: Switch mjolnir to file based training from Backlog to In progress on the Discovery-Search (Current work) board.
Jan 9 2018, 6:05 PM · Discovery-Search (Current work)
EBernhardson created T184547: Switch mjolnir to file based training.
Jan 9 2018, 6:05 PM · Discovery-Search (Current work)

Jan 3 2018

EBernhardson created T184099: experiment with different label scales for MLR.
Jan 3 2018, 6:02 PM · Discovery-Search (Current work)

Jan 2 2018

EBernhardson added a comment to T184019: Run search relevance survey on enwiki and frwiki.

Closer reading of the report:

the model is very accurate with at least 40 yes/no/unsure/dismiss responses and the most accurate with at least 70 responses
Jan 2 2018, 11:39 PM · Patch-For-Review, Discovery-Search (Current work)
EBernhardson added a comment to T183053: New Wikidata items appear in search with a delay.

Sending the basic info semi-synchronously (from DeferredUpdates, which will run in the same process as the edit but after closing the connection to the user so as not to make save timing worse) should be ok. Actually generating a "basic" set instead of the full thing might be more difficult than necessary though; I would be tempted to add a call to Updater::updateFromTitle(...) and let it do the full thing. Since article creates should be relatively rare (compared to total edit rate), I don't think the extra computation expense outweighs the maintenance cost of keeping an extra bit of code to generate partial updates, including getting the labels from wikidata, without also calculating the rest of it.

Jan 2 2018, 11:33 PM · Patch-For-Review, User-Smalyshev, Discovery-Search (Current work), Discovery, Wikidata
EBernhardson updated subscribers of T184019: Run search relevance survey on enwiki and frwiki.

@mpopov I wasn't quite sure from https://wikimedia-research.github.io/Discovery-Search-Adhoc-RelevanceSurveys/#responses_required , is 40 to 70 responses the number of impressions (yes+no+dismiss+timeout), the number of clicks (yes+no+dismiss), or the number of yes+no? I think it was yes+no+dismiss, but it might have been yes+no+dismiss+timeout?

Jan 2 2018, 11:12 PM · Patch-For-Review, Discovery-Search (Current work)
EBernhardson created T184019: Run search relevance survey on enwiki and frwiki.
Jan 2 2018, 10:50 PM · Patch-For-Review, Discovery-Search (Current work)
EBernhardson added a subtask for T142795: Offer interwiki search with language detection functionality over the API: T184008: Language fallback for search can fail when rescore profile doesn't exist on target wiki.
Jan 2 2018, 10:01 PM · MW-1.29-release (WMF-deploy-2017-01-03_(1.29.0-wmf.7)), MW-1.29-release-notes, Discovery-Search (Current work), Patch-For-Review, Easy, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog, Discovery
EBernhardson added a parent task for T184008: Language fallback for search can fail when rescore profile doesn't exist on target wiki: T142795: Offer interwiki search with language detection functionality over the API.
Jan 2 2018, 10:01 PM · Discovery-Search (Current work), MW-1.31-release-notes (WMF-deploy-2018-01-09 (1.31.0-wmf.16)), Patch-For-Review, Discovery, CirrusSearch
EBernhardson created T184008: Language fallback for search can fail when rescore profile doesn't exist on target wiki.
Jan 2 2018, 9:53 PM · Discovery-Search (Current work), MW-1.31-release-notes (WMF-deploy-2018-01-09 (1.31.0-wmf.16)), Patch-For-Review, Discovery, CirrusSearch
EBernhardson added a comment to T183053: New Wikidata items appear in search with a delay.

We may still need to look into the special-case of newly created pages being indexed from the web request, rather than being punted into the job queue. cirrusSearchLinksUpdatePrioritized, which performs the actual generation of a document and write to elasticsearch, looks to have a p99 that regularly varies from 30 to 60 seconds. This is on top of however long it takes for the refreshLinksPrioritized job which is another 20s - 2 minutes for p99. For some fraction of requests, even when the queue is healthy, there will be a couple minutes between the edit being performed and the two necessary jobs making it through the job queue and turned into a write in elasticsearch.

Jan 2 2018, 7:11 PM · Patch-For-Review, User-Smalyshev, Discovery-Search (Current work), Discovery, Wikidata

Dec 21 2017

EBernhardson moved T183053: New Wikidata items appear in search with a delay from Backlog to Needs review on the Discovery-Search (Current work) board.
Dec 21 2017, 6:19 PM · Patch-For-Review, User-Smalyshev, Discovery-Search (Current work), Discovery, Wikidata
EBernhardson added a comment to T182352: UDF for language detection.

I don't know that it would be particularly useful here, but i have a WIP patch to expose our language detection via the mediawiki API that could potentially be useful for external language detection. For accessing in bulk from hadoop (like a UDF would) hitting a mediawiki api is probably undesirable.

Dec 21 2017, 6:18 PM · Discovery-Search, Discovery-Analysis, Analytics, Discovery
EBernhardson added a comment to T182452: Searching for strings starting with a # should not redirect to main page.

This can currently be done using the insource regex search. We might need to think how to reword the "exactly this text" field in the new search interface since, as you've noted, it's not strictly that text. What it does is skip things such as stemming which converts cats into cat or simplifying the character set (depends on language used), but it still tokenizes into words and drops non-words such as # and -.
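As a rough illustration of why characters such as # and - disappear, here is a regex stand-in for word tokenization. It is only an approximation for explanation; it is not the analyzer CirrusSearch or elasticsearch actually uses, and the function name is hypothetical.

```python
import re


def tokenize_words(text: str) -> list:
    """Keep runs of word characters and drop everything else.

    Approximates "tokenize into words, drop non-words"; no stemming or
    character folding is applied.
    """
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    # The leading '#' and the standalone '-' produce no tokens at all.
    print(tokenize_words("#cats - dogs"))
```

Because the punctuation never becomes a token, a query for the literal string cannot match on it, which is why the insource regex search is suggested instead.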

Dec 21 2017, 6:13 PM · CirrusSearch, Discovery-Search, Discovery

Dec 20 2017

EBernhardson added a comment to T183053: New Wikidata items appear in search with a delay.

Took some measurements of refresh rate averaged over 5 minutes pre- and post-deployment. Overall it's perhaps a 15% increase in refresh/minute across the cluster. Disk IO graphs don't show anything particularly interesting. There will certainly be more merge volume as well, but elasticsearch should be able to bundle up the merges enough that these tiny merges are irrelevant compared to the major merges that happen on many-GB segments.

Dec 20 2017, 8:54 PM · Patch-For-Review, User-Smalyshev, Discovery-Search (Current work), Discovery, Wikidata

Dec 19 2017

EBernhardson moved T115756: Search suggests non-existent title due to namespace/redirect mixup from In progress to Needs review on the Discovery-Search (Current work) board.
Dec 19 2017, 11:40 PM · Patch-For-Review, Discovery-Search (Current work), MediaWiki-Search, WorkType-Maintenance, CirrusSearch, Discovery
EBernhardson added a comment to T183053: New Wikidata items appear in search with a delay.

It seems there are a couple options here, my thoughts:

Dec 19 2017, 11:34 PM · Patch-For-Review, User-Smalyshev, Discovery-Search (Current work), Discovery, Wikidata
EBernhardson added a comment to T183282: [epic] Search cluster upgrade to 6.x.

Probably the biggest change I'm aware of is elasticsearch removing index types. We will need to come up with different solutions for what we have previously used those index types for. This can be done prior to moving to elastic 6 though.

Dec 19 2017, 8:03 PM · Epic, Discovery-Search
EBernhardson added a comment to T181627: Port elasticsearch metrics to Prometheus.

Unfortunately all of the elasticsearch-specific metrics are not exposed over jmx. We can get generic JVM info that way, but for the specialized stats we have to query the elasticsearch APIs.

Dec 19 2017, 4:58 PM · Patch-For-Review, Discovery-Search (Current work), cloud-services-team (Kanban), User-fgiunchedi, Goal, Operations
EBernhardson added a comment to T183071: Import kibana package from jessie into stretch.

@Gehel It looks like we need to release stretch packages for all our custom elastic stuff (kibana, logstash, es, plugins?). My intuition is that since this is all JVM it should "just work" on stretch, is it possible for apt.wikimedia.org to build the appropriate package files without rebuilding all the debs? An alternate solution I've seen in the wild is to not use the distribution name in the apt line, but that always seemed a bit of a hack.

Dec 19 2017, 4:15 PM · Patch-For-Review, MediaWiki-Vagrant, Operations

Dec 18 2017

EBernhardson added a comment to T182276: Enable more accurate smaps based RSS tracking by yarn nodemanager.

Thanks! I'll try this out this week and see how things go.

Dec 18 2017, 4:45 PM · Analytics-Kanban, User-Elukey, Patch-For-Review, Analytics-Cluster
EBernhardson added a comment to T175179: Create selenium-CirrusSearch-jessie daily Jenkins job.

chromedriver has to be started to listen on :4444. There might be some magic way to get nodejs to spawn this but I'm not sure how. I generally start it myself using:

Dec 18 2017, 3:13 PM · MW-1.31-release-notes (WMF-deploy-2018-02-06 (1.31.0-wmf.20)), Patch-For-Review, User-zeljkofilipin, Release-Engineering-Team (Kanban), Discovery-Search (Current work), Discovery
EBernhardson added a comment to T181627: Port elasticsearch metrics to Prometheus.

Unrelated to dashboards, but for prometheus. We will likely need a fork (or an additional custom collector) to collect extra metrics that are only reported by our cluster. Specifically we collect per-node latency percentiles into a custom api endpoint on the elasticsearch servers. This isn't even in diamond yet as we only recently upgraded the plugin version on the cluster to expose these metrics.

Dec 18 2017, 3:09 PM · Patch-For-Review, Discovery-Search (Current work), cloud-services-team (Kanban), User-fgiunchedi, Goal, Operations
EBernhardson added a comment to T181627: Port elasticsearch metrics to Prometheus.

Load testing and percentiles could certainly go away. Cluster recovery might be useful at some point in the future, but it's hard to say. All of that data is found in other dashboards though, just not broken out by server and in a single board. It certainly won't be immediately useful, and if the data is there it can be recreated as necessary.

Dec 18 2017, 3:07 PM · Patch-For-Review, Discovery-Search (Current work), cloud-services-team (Kanban), User-fgiunchedi, Goal, Operations

Dec 15 2017

EBernhardson added a comment to T179528: Investigate full-text searches in event logging vs SRP pageviews.

Overall looks pretty close. Some parts are perhaps underdefined imo. For example, when you type into the main search bar on Special:Search you get autocomplete results; does that count as starting your search with autocomplete? Those are disambiguated in the event logging with the 'inputLocation' field. I might be tempted to throw out the autocomplete on the main search bar since it's completing the query to be submitted, instead of completing a page title to go to.

Dec 15 2017, 10:16 PM · Discovery-Analysis (Current work), Discovery

Dec 14 2017

EBernhardson added a comment to T115756: Search suggests non-existent title due to namespace/redirect mixup.

The two attached patches are not complete solutions, which would still require rethinking how we store redirects, but they should at least paper over the problem from the user's perspective.

Dec 14 2017, 1:18 AM · Patch-For-Review, Discovery-Search (Current work), MediaWiki-Search, WorkType-Maintenance, CirrusSearch, Discovery

Dec 13 2017

EBernhardson created P6459 (An Untitled Masterwork).
Dec 13 2017, 4:41 PM

Dec 12 2017

EBernhardson created P6454 (An Untitled Masterwork).
Dec 12 2017, 11:53 PM
EBernhardson moved T182616: Re-run AB test for Hebrew Wikipedia (has > 1% of search traffic) with new model from Backlog to Needs review on the Discovery-Search (Current work) board.
Dec 12 2017, 10:34 PM · MW-1.31-release-notes (WMF-deploy-2018-01-16 (1.31.0-wmf.17)), Patch-For-Review, Discovery-Search (Current work)
EBernhardson claimed T182616: Re-run AB test for Hebrew Wikipedia (has > 1% of search traffic) with new model.
Dec 12 2017, 10:34 PM · MW-1.31-release-notes (WMF-deploy-2018-01-16 (1.31.0-wmf.17)), Patch-For-Review, Discovery-Search (Current work)

Dec 8 2017

EBernhardson added a comment to T182447: Search in "" does not distinguish between "ss" and "ß".

For the moment, you can use this to get exact matches: insource:/umfaßt/
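For linking to such a search, the query has to be URL-encoded. A small sketch using only the Python standard library follows; the host `de.wikipedia.org` and the helper name are assumptions, and the `search` and `fulltext` parameters follow the Special:Search links that appear elsewhere in this feed.

```python
from urllib.parse import urlencode


def insource_search_url(regex: str, host: str = "de.wikipedia.org") -> str:
    """Build a Special:Search URL for an insource regex query.

    `host` is an assumed default; pass the wiki you actually want to search.
    """
    params = urlencode({"search": f"insource:/{regex}/", "fulltext": "1"})
    return f"https://{host}/wiki/Special:Search?{params}"


if __name__ == "__main__":
    print(insource_search_url("umfaßt"))
```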

Dec 8 2017, 6:07 PM · Discovery-Search, CirrusSearch, Discovery

Dec 7 2017

EBernhardson added a comment to T182276: Enable more accurate smaps based RSS tracking by yarn nodemanager.

I suppose for a little more background on what i think is happening:

Dec 7 2017, 8:31 PM · Analytics-Kanban, User-Elukey, Patch-For-Review, Analytics-Cluster
EBernhardson added a comment to T94868: debugger broken in wmf hhvm packages.

If this is two years old then likely things are working now. We use the hhvm debugger in wmf prod (mwrepl calls it) and break points and such seem to work.

Dec 7 2017, 5:21 PM · Need-volunteer, HHVM
EBernhardson added a project to T182276: Enable more accurate smaps based RSS tracking by yarn nodemanager: Analytics-Cluster.
Dec 7 2017, 5:10 AM · Analytics-Kanban, User-Elukey, Patch-For-Review, Analytics-Cluster
EBernhardson created T182276: Enable more accurate smaps based RSS tracking by yarn nodemanager.
Dec 7 2017, 5:09 AM · Analytics-Kanban, User-Elukey, Patch-For-Review, Analytics-Cluster
EBernhardson created P6441 mlr training container kill / nodemanager log.
Dec 7 2017, 5:05 AM
EBernhardson created P6440 example log of MLR training executor killed by yarn.
Dec 7 2017, 5:03 AM

Dec 5 2017

EBernhardson added a comment to T182136: English labels in wikidata prefix search in non-English have low ranking.

We really need a centralized place to store all these queries and expected results with different parameters. The key to making effective search is to have a set of queries and known good results, and then be able to evaluate changes to the system in how it affects all of those queries.

Dec 5 2017, 7:09 PM · MW-1.31-release-notes (WMF-deploy-2018-01-02 (1.31.0-wmf.15)), Discovery-Search (Current work), Patch-For-Review, CirrusSearch, Wikidata, Discovery