EBernhardson (EBernhardson)
User

User Details

User Since
Oct 7 2014, 4:49 PM (227 w, 4 d)
Availability
Available
LDAP User
EBernhardson
MediaWiki User
EBernhardson (WMF) [ Global Accounts ]

Recent Activity

Fri, Feb 15

EBernhardson added a comment to T215967: Add keyword for filtering based on captions in specific language.

I'm not sure if it's great, but I see two possible solutions to en being the final fallback language for almost everything:

Fri, Feb 15, 10:41 PM · User-Smalyshev, Discovery-Search (Current work), Multimedia, SDC General
EBernhardson added a comment to T216226: GPU upgrade for stat1005.

Thanks for the input @Shilad, it's much appreciated! That info is EXACTLY the kind of info we need (and why this task exists!)

To echo what some of us discussed about this in irc yesterday, we're going to have a few issues/considerations to take into account for this:

  • The existing R730 has a GPU that is added via a Dell specific daughterboard. We cannot swap it out so much as disable it and use the PCIe slot if needed.
Fri, Feb 15, 8:08 PM · Analytics, hardware-requests, Operations
EBernhardson added a comment to T215967: Add keyword for filtering based on captions in specific language.

I like the structure of the syntax but would probably bikeshed the exact delimiters a bit if possible (later). Also, are we following fallback chains or only seeking exact language match? If we match exactly we may want to also think about allowing fallbacks.

inlabel:"pt-br,pt,colaborativa"

Did you mean inlabel:"pt-br,pt|colaborativa" ?

inlabel:gift,wrap
Pages that have the words wrap and gift in labels.*, along with a warning that gift is an unknown language.

Why? Didn't you define | as language marker? This query has no | - why talk about the language?

Fri, Feb 15, 4:55 AM · User-Smalyshev, Discovery-Search (Current work), Multimedia, SDC General
EBernhardson added a comment to T215967: Add keyword for filtering based on captions in specific language.

Actually I hadn't thought about inlabel:pt-br,pt|colaborativa, that might be better than what I had with successive | characters. The successive pipes can be undetermined, but taking everything before the first pipe is very easy to reason about.
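
A minimal sketch of the "everything before the first pipe" rule, for illustration only. The function name parse_inlabel and the choice to treat a value with no pipe as a plain query with no languages are assumptions, not the behaviour of any existing CirrusSearch or WikibaseCirrusSearch code.

```python
from typing import List, Tuple


def parse_inlabel(value: str) -> Tuple[List[str], str]:
    """Split an inlabel value into (languages, query).

    Hypothetical helper: everything before the FIRST pipe is read as a
    comma-separated language list; everything after it, including any
    further pipes, is the query text. With no pipe at all, the whole
    value is treated as the query and no language is requested.
    """
    head, sep, tail = value.partition("|")
    if not sep:
        # No pipe present: no language list was given (assumption).
        return [], value.strip()
    languages = [code.strip() for code in head.split(",") if code.strip()]
    return languages, tail.strip()


print(parse_inlabel("pt-br,pt|colaborativa"))
print(parse_inlabel("gift,wrap"))
```

Because only the first pipe is significant, a value such as pt|a|b parses without ambiguity: pt is the language list and a|b is the query.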

Fri, Feb 15, 1:49 AM · User-Smalyshev, Discovery-Search (Current work), Multimedia, SDC General
EBernhardson added a comment to T215967: Add keyword for filtering based on captions in specific language.

I like the structure of the syntax but would probably bikeshed the exact delimiters a bit if possible (later). Also, are we following fallback chains or only seeking exact language match? If we match exactly we may want to also think about allowing fallbacks.

inlabel:"pt-br,pt,colaborativa"

Did you mean inlabel:"pt-br,pt|colaborativa" ?

Yes, I'll go back and edit. I started with , then used | then forgot to switch them all.

Fri, Feb 15, 1:46 AM · User-Smalyshev, Discovery-Search (Current work), Multimedia, SDC General
EBernhardson updated subscribers of T215967: Add keyword for filtering based on captions in specific language.

Proposed syntax as follows. Note that we can have an incaption alias of inlabel, but this will be implemented in WikibaseCirrusSearch, where these are considered labels, so the code in Wikibase to filter by them should probably reference label. One potential sticking point is the syntax for specifying one or more languages. I'm not entirely convinced this is the best syntax, but I'm not sure we have something today to draw from as an example. The pipe usage here is slightly different from how we use it in other places. We could potentially replace the pipe with a comma, I'm not sure if that's better or worse.

Fri, Feb 15, 12:54 AM · User-Smalyshev, Discovery-Search (Current work), Multimedia, SDC General

Thu, Feb 14

EBernhardson added a comment to T148843: GPU upgrade for stats machine.

Need to triple check with somebody else but from the inventory stat1005 is a Dell PowerEdge 730

Thu, Feb 14, 10:46 PM · Patch-For-Review, User-Elukey, Operations, Analytics, Research-management
EBernhardson moved T215967: Add keyword for filtering based on captions in specific language from elastic / cirrus to Current work on the Discovery-Search board.
Thu, Feb 14, 10:30 PM · User-Smalyshev, Discovery-Search (Current work), Multimedia, SDC General
EBernhardson added a comment to T63080: CirrusSearch: intitle:¢ returns no results despite there being a redirect at [[¢]].

Wow, I didn't realize we threw away so many interesting tokens. Unfortunate, but it seems this task can become a child of the other to be considered "some day".

Thu, Feb 14, 10:21 PM · Discovery-Search, good first bug, Discovery, CirrusSearch
EBernhardson moved T133174: Bootstrap a confidence interval for the engine scoring algorithms in Relevance Forge from later on... to ML & Data Pipeline on the Discovery-Search board.
Thu, Feb 14, 10:17 PM · Discovery, Discovery-Search
EBernhardson moved T71489: Expose mwgrep functionality on-wiki from later on... to watching / waiting on the Discovery-Search board.
Thu, Feb 14, 10:17 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson closed T109090: Investigate the need for master only (non data nodes) in our ES cluster as Declined.
Thu, Feb 14, 10:15 PM · Discovery-Search, Operations, Elasticsearch, Discovery
EBernhardson closed T109090: Investigate the need for master only (non data nodes) in our ES cluster, a subtask of T109089: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade), as Declined.
Thu, Feb 14, 10:15 PM · Discovery-Search, Operations, Epic, Discovery, Elasticsearch
EBernhardson moved T129256: Implement proper orchestration mechanics for elasticsearch upgrade from later on... to Ops / SRE on the Discovery-Search board.
Thu, Feb 14, 10:14 PM · Discovery-Search, Discovery, Elasticsearch
EBernhardson moved T150153: Enable substring searching on office wiki from later on... to elastic / cirrus on the Discovery-Search board.
Thu, Feb 14, 10:13 PM · Discovery, Discovery-Search
EBernhardson added a comment to T150153: Enable substring searching on office wiki.

In particular I think this refers to the fuzzy completion suggester? We recently made it the default on mw.org and wikitech, might as well do office wiki too.

Thu, Feb 14, 10:13 PM · Discovery, Discovery-Search
EBernhardson moved T152442: Exact title search for page in extra content namespace does not return that page in the first 500 results from later on... to elastic / cirrus on the Discovery-Search board.
Thu, Feb 14, 10:12 PM · CirrusSearch, Discovery-Search, Discovery
EBernhardson moved T133844: Improve Elasticsearch icinga alerting from later on... to Ops / SRE on the Discovery-Search board.
Thu, Feb 14, 10:10 PM · good first bug, Discovery-Search, Discovery, Operations, Elasticsearch
EBernhardson moved T166243: Create maintenance script for cleaning up stale indexes from Ops / SRE to elastic / cirrus on the Discovery-Search board.
Thu, Feb 14, 10:06 PM · CirrusSearch, Discovery, Google-Code-in-2017, Need-volunteer, good first bug, Discovery-Search
EBernhardson moved T166243: Create maintenance script for cleaning up stale indexes from later on... to Ops / SRE on the Discovery-Search board.
Thu, Feb 14, 10:06 PM · CirrusSearch, Discovery, Google-Code-in-2017, Need-volunteer, good first bug, Discovery-Search
EBernhardson closed T150370: [EPIC][Search][Dashboard] Add "well-behaved searchers" filter as Declined.
Thu, Feb 14, 10:05 PM · Discovery-Search, Discovery
EBernhardson moved T194569: Allow URL forwarding for search, while keeping native MW search for editors? from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 10:04 PM · Discovery-Search, MediaWiki-Special-pages, Discovery, MediaWiki-Search
EBernhardson added a comment to T188125: Make it possible to search by page author /contributor/ uploader.

Since then the analytics team has built up the mediawiki-history job that, essentially, exposes all of mediawiki history in a reasonably structured way. Perhaps now some job can aggregate over that to generate a list of page titles and all their contributors. If implemented this way it would not be real-time contributors though, it would be a batch job that runs monthly or something to generate a contributors list to index in the search engine. We would probably need to find some strong use cases to justify the extra moving pieces of this kind of setup as well.

Thu, Feb 14, 10:03 PM · Discovery-Search, CirrusSearch, Discovery
EBernhardson added a comment to T125926: CirrusSearch hastemplate can't find Translatable template.

At a high level, this from the opening seems to capture the request:

Hastemplate can find template usage where the target is a secondary template, and this ability should also be able to find where the target template is passed as a parameter. Currently hastemplate doesn't recognize the parameter list as "a place for template names", as it does in template code, where it the target template as a secondary.
Thu, Feb 14, 9:57 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson moved T186114: searchmenu-new links to startpage at short URL setup from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 9:49 PM · Discovery-Search, Discovery, MediaWiki-Page-editing, MediaWiki-Search
EBernhardson moved T109089: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) from later on... to [epic] on the Discovery-Search board.
Thu, Feb 14, 9:48 PM · Discovery-Search, Operations, Epic, Discovery, Elasticsearch
EBernhardson closed T127876: Run a test in relevance forge to estimate effects of rewriting misspelled queries as Declined.
Thu, Feb 14, 9:48 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson closed T135073: Boost search results within a configurable geo distance as Resolved.

Implemented as either a filter or a boost: https://www.mediawiki.org/wiki/Help:CirrusSearch#Geo_Search

Thu, Feb 14, 9:47 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson moved T63080: CirrusSearch: intitle:¢ returns no results despite there being a redirect at [[¢]] from later on... to Language Stuff on the Discovery-Search board.
Thu, Feb 14, 9:46 PM · Discovery-Search, good first bug, Discovery, CirrusSearch
EBernhardson updated subscribers of T63080: CirrusSearch: intitle:¢ returns no results despite there being a redirect at [[¢]].

intitle now queries the redirect titles, but this bug is still not fixed. It looks like the analyzers throw away this token:

Thu, Feb 14, 9:46 PM · Discovery-Search, good first bug, Discovery, CirrusSearch
EBernhardson moved T128076: EPIC: Evaluate the indexing strategy and try to make more benefits from the semi-structured content we have from later on... to [epic] on the Discovery-Search board.
Thu, Feb 14, 9:34 PM · Discovery-Search, Epic, Discovery, CirrusSearch
EBernhardson moved T110171: Alert when ES indexes are freezed for more than 30 minutes from later on... to Ops / SRE on the Discovery-Search board.
Thu, Feb 14, 9:34 PM · Discovery-Search, Discovery, Wikimedia-Incident, Operations, Incident-20150825-Redis, monitoring
EBernhardson closed T142654: Use autoloader more as Declined.
Thu, Feb 14, 9:26 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson moved T105409: Rewrite the elasticsearch rolling-restart scripts in something better than bash from later on... to Ops / SRE on the Discovery-Search board.
Thu, Feb 14, 9:26 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson moved T142809: Cleanup CirrusSearch.php and remove dynamic/non configuration vars from later on... to elastic / cirrus on the Discovery-Search board.
Thu, Feb 14, 9:25 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson moved T151479: Cross-wiki search completion suggestor based on interwiki prefixes from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 9:25 PM · good first bug, Discovery, Discovery-Search
EBernhardson closed T151217: ElasticSearch 2.3.5 ask for memlock limit to be raised as Declined.

Basically we have a different opinion than elastic on how this should work. Elastic has to support pretty generic use cases, and a huge variety of ways people set things up. We only have to support ourselves, and our opinion for production puppet is that the servers should have no swap enabled. If there is no swap then mlockall does nothing, and no additional security rights are required for the elasticsearch user either. The reasoning for this is that elasticsearch only works when at least half the server memory is available as a disk cache. If the servers get anywhere near needing to swap memory to disk, something else is fatally wrong.

Thu, Feb 14, 9:04 PM · Discovery-Search, Discovery, Elasticsearch
EBernhardson closed T145644: Experiment with wp10 as a new query independent factor as Resolved.
Thu, Feb 14, 9:00 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson closed T130590: Have dedicated master nodes for elasticsearch as Declined.
Thu, Feb 14, 9:00 PM · Discovery-Search, Operations, Discovery, Elasticsearch
EBernhardson added a parent task for T151060: Is it possible to set some word matching (synonym) for CirrusSearch/elasticsearch?: T213093: Implement reloadable search-time synonyms in extra plugin.
Thu, Feb 14, 8:59 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson added a subtask for T213093: Implement reloadable search-time synonyms in extra plugin: T151060: Is it possible to set some word matching (synonym) for CirrusSearch/elasticsearch?.
Thu, Feb 14, 8:59 PM · User-EBernhardson, Discovery-Search
EBernhardson moved T75355: "You may create the page" suggestion does not appear if search contains a hyphen from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 8:58 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson moved T91975: clarify the message shown when an exact search query wasn't found from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 8:57 PM · Discovery-Search, Discovery, Design, MediaWiki-Search, MediaWiki-General-or-Unknown
EBernhardson moved T30088: Search result snippets should skip parenthetical phrases (like Google does) from later on... to elastic / cirrus on the Discovery-Search board.
Thu, Feb 14, 8:56 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson closed T160234: Searching for images in nested categories on Commons as Resolved.

I believe the implementation of deepcat resolves this feature request:

Thu, Feb 14, 8:55 PM · Discovery-Search, Discovery, CirrusSearch, Commons, Community-Wishlist-Survey-2016
EBernhardson moved T149811: [A/B Test] Add thumbnail icons to the search results from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 8:51 PM · Discovery-Search, CirrusSearch, Discovery
EBernhardson closed T148811: Evaluate using SERP click throughs to build a search feedback loop as Resolved.

In general this has been completed with the implementation of mjolnir.

Thu, Feb 14, 8:51 PM · Discovery-Search, CirrusSearch, Discovery
EBernhardson moved T163423: Can't create new article from search when including a quote in the search string from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 8:50 PM · Discovery-Search, Discovery, CirrusSearch
EBernhardson moved T159321: [Bug] Unpredictable behavior with the order of Special:Search parameters from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 8:50 PM · Patch-For-Review, MediaWiki-Search, Discovery-Search, Discovery, CirrusSearch
EBernhardson moved T169641: Search results pages should include a wiki-markup friendly version of their URL from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 8:49 PM · Discovery-Search, Discovery
EBernhardson moved T125725: [epic] Update autocomplete search box with metadata from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 8:46 PM · Discovery-Search, Wikimedia-Portals, MediaWiki-Interface, Discovery, Contributors-Team
EBernhardson moved T128232: [Bug] Search bar clipped for lower end mobile devices from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 8:46 PM · Discovery-Search, Discovery, Mobile, MediaWiki-Search
EBernhardson moved T185126: Have searchbox recognize {{ to search Template: namespace from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 8:46 PM · Discovery-Search, Discovery, MediaWiki-Search
EBernhardson moved T18237: Sort results by date from later on... to Current work on the Discovery-Search board.
Thu, Feb 14, 8:45 PM · Discovery-Search (Current work), Discovery, MediaWiki-Search
EBernhardson moved T196392: Special:Search when interwiki link is entered wrongly says that the article does not exist from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 8:44 PM · Need-volunteer, Discovery-Search, Discovery, MediaWiki-Search
EBernhardson added a comment to T151903: Special:Search performs DB writes on GET request.

Wouldn't it be feasible to have the search request generate a simple job that saves that preference asynchronously?

The jobqueue is already capable of listening to the events in both DCs, and execute them in the primary one only, so most things that can happen post-send can probably be deferred to the jobqueue.

A generic preference setting job would be useful for this and similar cases.

Thu, Feb 14, 8:05 PM · Availability (MediaWiki-MultiDC), Discovery-Search, CirrusSearch, Discovery
EBernhardson moved T210649: Suggestions in a search field on Special:Search should act the same as suggestions in search bar from later on... to UI tickets on the Discovery-Search board.
Thu, Feb 14, 7:43 PM · Discovery-Search, UI-Standardization, MediaWiki-Search
EBernhardson moved T162331: Provide tools for processing obfuscated Chinese geodata (GCJ-02, BD-09) from later on... to Geodata on the Discovery-Search board.
Thu, Feb 14, 7:43 PM · Discovery-Search, Maps, Wikidata, Chinese-Sites, GeoData
EBernhardson created P8086 (An Untitled Masterwork).
Thu, Feb 14, 6:55 PM
EBernhardson added a comment to T148843: GPU upgrade for stats machine.

Looks like one more option, a workstation card from AMD, the Vega Frontier has 16GB of memory with very similar compute to the Vega 64. Based on my review, I think this is probably our best bet for an AMD card.

Thu, Feb 14, 5:48 PM · Patch-For-Review, User-Elukey, Operations, Analytics, Research-management
EBernhardson added a comment to T148843: GPU upgrade for stats machine.

I looked back over things and it looks like stat1005 is in an R470 case. They advertise compatibility with several full-size nvidia cards, so length is probably ok. For cards the choices seem to be:

Thu, Feb 14, 5:22 PM · Patch-For-Review, User-Elukey, Operations, Analytics, Research-management
EBernhardson added a comment to T148843: GPU upgrade for stats machine.

As long as it fits in the case, a high end consumer GPU from AMD should be just fine. The most important spec for choosing will probably be the amount of memory, more is (almost) always better. The sticking point might be length, consumer cards are typically 10.5" long which might be too long for our case? Need to find out how much space is in there.

Thu, Feb 14, 4:48 PM · Patch-For-Review, User-Elukey, Operations, Analytics, Research-management
EBernhardson updated the task description for T216093: Setup ivysettings.xml for sourcing spark job dependencies from archiva.
Thu, Feb 14, 12:19 AM · User-EBernhardson, Analytics-Cluster, Analytics
EBernhardson created T216093: Setup ivysettings.xml for sourcing spark job dependencies from archiva.
Thu, Feb 14, 12:18 AM · User-EBernhardson, Analytics-Cluster, Analytics

Wed, Feb 13

EBernhardson added a comment to T215969: Measure mutation latency across the newly split elasticsearch clusters.

One possible way to test would be to drop our 2 minute master timeout back to 30s and see how daily completion suggester builds and whatnot work. I would love to rip out all the related master timeout code in cirrus.

Wed, Feb 13, 7:35 PM · Discovery-Search (Current work)
EBernhardson added a comment to T215969: Measure mutation latency across the newly split elasticsearch clusters.

The numbers that seem most important, only chi-eqiad (primary load-bearing cluster)

Wed, Feb 13, 7:33 PM · Discovery-Search (Current work)
EBernhardson added a comment to T148843: GPU upgrade for stats machine.

Let's see if we can narrow down the packages needed:

  • hsa-rocr-dev - AMD Heterogeneous System Architecture HSA - Linux HSA Runtime for ROCm platforms
  • hsa-ext-rocr-dev - AMD Heterogeneous System Architecture HSA - Linux HSA Runtime extensions for ROCm platforms (closed source / non FLOSS)
  • rocm-device-libs - Radeon Open Compute - device libraries
  • rocm-utils - Radeon Open Compute (ROCm) Runtime software stack
  • hcc - HCC: An Open Source, Optimizing C++ Compiler for Heterogeneous Compute
  • hip_base - HIP: Heterogenous-computing Interface for Portability [BASE]
  • hip_doc - HIP: Heterogenous-computing Interface for Portability [DOCUMENTATION]
  • hip_hcc - HIP: Heterogenous-computing Interface for Portability [HCC]
  • hip_samples - HIP: Heterogenous-computing Interface for Portability [SAMPLES]
  • rocm-smi - System Management Interface for ROCm
  • hsakmt-roct - HSAKMT library for AMD KFD support
  • hsakmt-roct-dev - HSAKMT development package.
  • hsa-amd-aqlprofile - AQLPROFILE library for AMD HSA runtime API extension support
  • comgr - Library to provide support functions
  • rocr_debug_agent - Radeon Open Compute (ROCm) Runtime debug agent
Wed, Feb 13, 6:47 PM · Patch-For-Review, User-Elukey, Operations, Analytics, Research-management
EBernhardson moved T215969: Measure mutation latency across the newly split elasticsearch clusters from in progress to Needs review on the Discovery-Search (Current work) board.
Wed, Feb 13, 1:00 AM · Discovery-Search (Current work)
EBernhardson added a comment to T215969: Measure mutation latency across the newly split elasticsearch clusters.

Re-ran data collection and the report. Of particular interest here is going to be chi-eqiad which is serving the majority of traffic. The over-time graphs for chi-eqiad aren't great, but they are better than before. Additionally the largest spikes are directly attributable to disk space issues we are currently experiencing in eqiad. Looking at the allocation explain while running the test shows that sometimes the master decides all nodes are above the disk threshold. I ended up needing to increase the watermark from 75% to 79% for the test to even run.
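
For reference, a hedged sketch of what a watermark change like the one described could look like as a request body for the Elasticsearch cluster settings API (PUT /_cluster/settings). The comment does not say which watermark was raised or whether the change was transient or persistent, so the "low" watermark and the "transient" scope below are assumptions, and the helper name watermark_settings is hypothetical. The block only builds and prints the JSON body; it does not contact a cluster.

```python
import json


def watermark_settings(percent: int, level: str = "low") -> dict:
    """Build a cluster-settings body that sets a disk watermark.

    Assumptions: the watermark is expressed as a percentage of disk used,
    and the change is applied as a transient setting.
    """
    if level not in ("low", "high"):
        raise ValueError("level must be 'low' or 'high'")
    if not 0 < percent < 100:
        raise ValueError("percent must be between 1 and 99")
    key = f"cluster.routing.allocation.disk.watermark.{level}"
    return {"transient": {key: f"{percent}%"}}


# Body for raising the (assumed) low watermark to 79%.
print(json.dumps(watermark_settings(79), indent=2))
```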

Wed, Feb 13, 1:00 AM · Discovery-Search (Current work)

Tue, Feb 12

EBernhardson triaged T215969: Measure mutation latency across the newly split elasticsearch clusters as Normal priority.
Tue, Feb 12, 9:59 PM · Discovery-Search (Current work)
EBernhardson created T215945: Mjolnir kafka msearch daemon should scale up/down based on elasticsearch load.
Tue, Feb 12, 6:49 PM · User-EBernhardson
EBernhardson updated the task description for T215856: Free the mjolnir datasets.
Tue, Feb 12, 6:41 PM · User-EBernhardson, Discovery, Epic
EBernhardson added a comment to T215856: Free the mjolnir datasets.

I think the main idea will be to keep all of the appropriate code for performing transformations in mjolnir, and add oozie jobs to wikimedia/search/analytics. The new jobs can be python scripts (we already build venvs with dependencies for transfer_to_es) that import mjolnir and run the appropriate transformations. The scripts would primarily be concerned with where to load data from and where to store it. The algorithms would stay in mjolnir.

Tue, Feb 12, 6:36 PM · User-EBernhardson, Discovery, Epic
EBernhardson updated the task description for T215856: Free the mjolnir datasets.
Tue, Feb 12, 6:34 PM · User-EBernhardson, Discovery, Epic
EBernhardson updated the task description for T215916: ElasticSearch 6 migration plan checklist (search cluster).
Tue, Feb 12, 3:59 PM · Discovery-Search
EBernhardson created T215856: Free the mjolnir datasets.
Tue, Feb 12, 2:53 AM · User-EBernhardson, Discovery, Epic

Mon, Feb 11

EBernhardson moved T215487: search sorted by creation date missing some items from in progress to Waiting/Blocked on the Discovery-Search (Current work) board.
Mon, Feb 11, 11:27 PM · Discovery-Search (Current work), MW-1.33-notes (1.33.0-wmf.17; 2019-02-12), Patch-For-Review, CirrusSearch, Discovery
EBernhardson added a comment to T215487: search sorted by creation date missing some items.

Kind of good news / bad news. The good news is the patch is merged and will deploy this week. The bad news is the bug was in the process that backfills old properties like the somewhat recently added page creation date. It's basically going to take 2 more months before these new property sorts take into account all pages.

Mon, Feb 11, 11:26 PM · Discovery-Search (Current work), MW-1.33-notes (1.33.0-wmf.17; 2019-02-12), Patch-For-Review, CirrusSearch, Discovery
EBernhardson moved T215487: search sorted by creation date missing some items from elastic / cirrus to Current work on the Discovery-Search board.
Mon, Feb 11, 11:25 PM · Discovery-Search (Current work), MW-1.33-notes (1.33.0-wmf.17; 2019-02-12), Patch-For-Review, CirrusSearch, Discovery
EBernhardson claimed T215487: search sorted by creation date missing some items.
Mon, Feb 11, 11:25 PM · Discovery-Search (Current work), MW-1.33-notes (1.33.0-wmf.17; 2019-02-12), Patch-For-Review, CirrusSearch, Discovery
EBernhardson moved T214515: Run wikidata entitiy autocomplete AB test in de, fr, es from Waiting/Blocked to Done on the Discovery-Search (Current work) board.
Mon, Feb 11, 11:22 PM · Patch-For-Review, Discovery-Search (Current work)
EBernhardson added a comment to T215616: Improve interlingual links across wikis through Wikidata IDs.

@EBernhardson , this looks exactly what I was looking for, initially. Thank you very much for that.

However, I won't close this task, because wikibase_item is still missing the page_id information. Joining by page_title does not seem very 'healthy'. We should keep discussing how to solve that. Thanks

Mon, Feb 11, 7:53 PM · MediaWiki-Database, Wikidata, DBA, Analytics, Research
EBernhardson updated the task description for T214515: Run wikidata entitiy autocomplete AB test in de, fr, es.
Mon, Feb 11, 7:31 PM · Patch-For-Review, Discovery-Search (Current work)
EBernhardson added a comment to T214515: Run wikidata entitiy autocomplete AB test in de, fr, es.

AB test reports published. Clicks@1 improved in all languages with no regressions in other metrics. Clicks@1 starts around 78-80% and improves by 2-5% to 83-85% depending on language. Will deploy these profiles as the new defaults.

Mon, Feb 11, 7:31 PM · Patch-For-Review, Discovery-Search (Current work)
EBernhardson added a comment to T215616: Improve interlingual links across wikis through Wikidata IDs.

I don't know if this meets your needs, but the cirrussearch dumps have the wikidata id's broken out. This is the wikibase_item field of the ebernhardson.cirrus2hive table in hive. Alternatively there are full dumps with each article as a json object: https://dumps.wikimedia.your.org/other/cirrussearch/

Mon, Feb 11, 6:28 PM · MediaWiki-Database, Wikidata, DBA, Analytics, Research
EBernhardson removed a project from T215475: Ensure mjolnir daemons work seamlessly with elasticsearch 5 or 6: Patch-For-Review.
Mon, Feb 11, 5:44 PM · Discovery-Search (Current work)
EBernhardson removed a project from T215199: mwgrep needs to query multiple elasticsearch clusters: Patch-For-Review.
Mon, Feb 11, 5:44 PM · Discovery-Search (Current work)
EBernhardson removed a project from T215621: Make sure that ApiFeatureUsage still works with elasticsearch 6.5.4: Patch-For-Review.
Mon, Feb 11, 5:44 PM · MW-1.33-notes (1.33.0-wmf.17; 2019-02-12), Discovery-Search (Current work), ApiFeatureUsage
EBernhardson removed a project from T198734: Rework how source_regex timeout is done in Cirrus: Patch-For-Review.
Mon, Feb 11, 5:44 PM · Discovery-Search (Current work), Discovery, CirrusSearch
EBernhardson removed a project from T215369: Install fails while running updateSearchIndexConfig.php - no such index "mw_cirrus_metastore": Patch-For-Review.
Mon, Feb 11, 5:44 PM · Discovery-Search (Current work)
EBernhardson moved T215369: Install fails while running updateSearchIndexConfig.php - no such index "mw_cirrus_metastore" from Needs review to Done on the Discovery-Search (Current work) board.
Mon, Feb 11, 5:44 PM · Discovery-Search (Current work)
EBernhardson moved T198734: Rework how source_regex timeout is done in Cirrus from Needs review to Done on the Discovery-Search (Current work) board.
Mon, Feb 11, 5:43 PM · Discovery-Search (Current work), Discovery, CirrusSearch
EBernhardson moved T215621: Make sure that ApiFeatureUsage still works with elasticsearch 6.5.4 from Needs review to Done on the Discovery-Search (Current work) board.
Mon, Feb 11, 5:42 PM · MW-1.33-notes (1.33.0-wmf.17; 2019-02-12), Discovery-Search (Current work), ApiFeatureUsage
EBernhardson added a comment to T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production .

Longer term search will potentially want to generate some significantly larger datasets to ship to production, but we don't yet have a concrete implementation plan so everything is a bit hand-wavy. As one example though we have looked into turning all sentences from articles on a wiki into vectors. These vectors are 4kB and a previous estimation was 250M sentences on en.wiki, 50M sentences on fr.wiki, declining from there. Overall data size on the order of 2-10TB. This is fairly far off though, something closer term would be more like a 1kB vector per article which is a much more reasonable ~5GB for enwiki and declining from there. This is far enough out I'm really not
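
The sizes quoted above can be checked with a little arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal units (1 kB = 1000 bytes) and uses the sentence counts and vector sizes from the comment; the function name dataset_bytes is hypothetical.

```python
KB = 1000          # assumption: decimal kilobytes
GB = 1000 ** 3
TB = 1000 ** 4


def dataset_bytes(n_vectors: int, bytes_per_vector: int) -> int:
    """Raw size of n_vectors fixed-size vectors, ignoring any overhead."""
    return n_vectors * bytes_per_vector


# Sentence vectors: 4 kB each.
en_sentences = dataset_bytes(250_000_000, 4 * KB)
fr_sentences = dataset_bytes(50_000_000, 4 * KB)
print(en_sentences / TB)   # en.wiki sentence vectors, in TB
print(fr_sentences / GB)   # fr.wiki sentence vectors, in GB

# Article vectors: 1 kB each. A ~5 GB total implies roughly this many articles.
implied_articles = (5 * GB) // KB
print(implied_articles)
```

Under these assumptions en.wiki sentence vectors alone come to about 1 TB and fr.wiki to about 200 GB, and a ~5 GB per-article dataset corresponds to roughly five million 1 kB vectors.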

Mon, Feb 11, 4:07 PM · Research, Operations, Discovery, Analytics
EBernhardson added a comment to T213976: Workflow to be able to move data files computed in jobs from analytics cluster to production .

How big is the dataset and how fast is it going to grow?

In the hundreds of megabytes I believe. @Halfak, @EBernhardson, @Miriam, @bmansurov, is this right? Will ML models be about this size for the foreseeable future?

Mon, Feb 11, 3:57 PM · Research, Operations, Discovery, Analytics

Sun, Feb 10

EBernhardson added a comment to T109715: Replicate production elasticsearch indices to labs.

The servers have been purchased and racked up. Patches were going through puppet last week getting new security groups setup for accessing the cluster, installing the servers, etc. Basically, things are progressing and I'm optimistic we will have a public service ready in time for the summer hackathon.

Sun, Feb 10, 6:32 PM · Discovery-Search, Cloud-Services, Elasticsearch, Discovery

Thu, Feb 7

EBernhardson moved T215199: mwgrep needs to query multiple elasticsearch clusters from Needs review to Done on the Discovery-Search (Current work) board.
Thu, Feb 7, 11:42 PM · Discovery-Search (Current work)
EBernhardson added a comment to T215520: prefixsearch does not include result whose title is exactly the same as the search string.

The default prefix search is heavily tuned to finding content articles. It considers a redirect and the page redirected to as a single entity; the version of the string chosen to show amounts to a heuristic that tries to decide between showing something closer to what you typed that exists as a redirect, or the original page title. Additionally this system considers two versions of the string with different casing to be the same string, so only one cased version (chosen fairly randomly) is available to find.

Thu, Feb 7, 7:29 PM · CirrusSearch, Discovery-Search, Anti-Harassment
EBernhardson added a comment to T215520: prefixsearch does not include result whose title is exactly the same as the search string.
Thu, Feb 7, 6:52 PM · CirrusSearch, Discovery-Search, Anti-Harassment

Wed, Feb 6

EBernhardson moved T215475: Ensure mjolnir daemons work seamlessly with elasticsearch 5 or 6 from in progress to Needs review on the Discovery-Search (Current work) board.
Wed, Feb 6, 11:22 PM · Discovery-Search (Current work)