TJones (Trey Jones)
Staff Computational Linguist, Search Platform Team

User Details

User Since
Jul 8 2015, 3:02 PM (429 w, 3 d)
Availability
Available
IRC Nick
Trey314159
LDAP User
Tjones
MediaWiki User
TJones (WMF)

I would have written a shorter comment, but I did not have the time.

I'm part of the Search Platform team and I spend my time working on search & relevance, trying to better support search in various languages, analyzing queries, and doing random mathy things. I tend to write long, detailed notes about my investigations (so as to improve the bus number of my work).

When I have to work on _GitHub,_ /‍‍/Phab,/‍‍/ and ''MediaWiki'' all on the same day, I sometimes suffer Severe Markup Incongruence Fatigue.

I � Unicode.

Recent Activity

Wed, Sep 27

TJones added a comment to T346051: Refactor slow global analysis components.

We previously discussed how to bundle the new filters, but talked about it again today.

Wed, Sep 27, 5:02 PM · Discovery-Search (Current work)
TJones updated the task description for T346051: Refactor slow global analysis components.
Wed, Sep 27, 4:47 PM · Discovery-Search (Current work)

Thu, Sep 21

TJones updated the task description for T346051: Refactor slow global analysis components.
Thu, Sep 21, 9:54 PM · Discovery-Search (Current work)

Mon, Sep 18

TJones moved T346456: Improve concurrency limits configuration of the wdqs updater from needs triage to Current work on the Discovery-Search board.
Mon, Sep 18, 3:48 PM · Patch-For-Review, Discovery-Search (Current work), wdwb-tech, Wikidata, serviceops, Wikidata-Query-Service
TJones removed a project from T328330: Create SLI / SLO on Search update lag and error rate: Epic.
Mon, Sep 18, 3:11 PM · Patch-For-Review, Discovery-Search (Current work)

Mon, Sep 11

TJones moved T332342: Standardize ASCII-folding/ICU-folding across analyzers from In Progress to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

Moved back to ready for dev while working on T346051

Mon, Sep 11, 3:41 PM · Discovery-Search (Current work)
TJones moved T346051: Refactor slow global analysis components from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.
Mon, Sep 11, 3:40 PM · Discovery-Search (Current work)
TJones claimed T346051: Refactor slow global analysis components.
Mon, Sep 11, 3:28 PM · Discovery-Search (Current work)
TJones renamed T346051: Refactor slow global analysis components from Refactor slow analysis components to Refactor slow global analysis components.
Mon, Sep 11, 3:21 PM · Discovery-Search (Current work)
TJones updated the task description for T346051: Refactor slow global analysis components.
Mon, Sep 11, 3:20 PM · Discovery-Search (Current work)
TJones updated the task description for T346051: Refactor slow global analysis components.
Mon, Sep 11, 3:20 PM · Discovery-Search (Current work)
TJones updated the task description for T346051: Refactor slow global analysis components.
Mon, Sep 11, 3:19 PM · Discovery-Search (Current work)
TJones created T346051: Refactor slow global analysis components.
Mon, Sep 11, 3:16 PM · Discovery-Search (Current work)
TJones moved T342444: Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, and word_break_helper from Ready for Dev -- SRE/Ops to Blocked/Waiting on the Discovery-Search (Current work) board.
Mon, Sep 11, 3:15 PM · Discovery-Search (Current work)
TJones moved T170625: Smarter handling of acronyms for word_break_helper in language analyzers from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.
Mon, Sep 11, 3:13 PM · Discovery-Search (Current work)
TJones added a comment to T170625: Smarter handling of acronyms for word_break_helper in language analyzers.

This has been deployed, but the reindexing was stopped for being too slow. I'll move this ticket into needs reporting and open a new one for the new efficiency refactor.

Mon, Sep 11, 3:13 PM · Discovery-Search (Current work)

Aug 28 2023

TJones added a comment to T342444: Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, and word_break_helper.

Sounds like my local reindexing is insufficient for detecting non-egregious slowdowns in indexing speed. (I know I have other overhead; I guess it's even more than I thought.) Should we pause the reindex and investigate more thoroughly on RelForge, with the possibility of reverting some changes after finding the slowest ones?

Aug 28 2023, 2:21 PM · Discovery-Search (Current work)

Aug 1 2023

TJones moved T170625: Smarter handling of acronyms for word_break_helper in language analyzers from Needs review to To Be Deployed on the Discovery-Search (Current work) board.
Aug 1 2023, 2:10 PM · Discovery-Search (Current work)

Jul 31 2023

TJones claimed T332342: Standardize ASCII-folding/ICU-folding across analyzers.
Jul 31 2023, 8:44 PM · Discovery-Search (Current work)
TJones moved T332342: Standardize ASCII-folding/ICU-folding across analyzers from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.
Jul 31 2023, 7:03 PM · Discovery-Search (Current work)
TJones moved T170625: Smarter handling of acronyms for word_break_helper in language analyzers from In Progress to Needs review on the Discovery-Search (Current work) board.
Jul 31 2023, 6:17 PM · Discovery-Search (Current work)
TJones added a comment to T170625: Smarter handling of acronyms for word_break_helper in language analyzers.

acronym_fixer is rather complicated, as expected. word_break_helper is a little complicated, unexpectedly! More on MediaWiki.

Jul 31 2023, 6:15 PM · Discovery-Search (Current work)
TJones updated the task description for T342444: Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, and word_break_helper.
Jul 31 2023, 2:47 AM · Discovery-Search (Current work)

Jul 26 2023

TJones updated the task description for T342444: Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, and word_break_helper.
Jul 26 2023, 1:56 PM · Discovery-Search (Current work)

Jul 25 2023

TJones added a comment to T219550: [EPIC] Harmonize language analysis across languages.

While harmonizing, I noticed that the Hebrew analysis chain was creating a lot of duplicate tokens. Adding a remove_duplicates filter removed 19.7% (Wikipedia) to 22.7% (Wiktionary) of all tokens—all non-Hebrew and many Hebrew tokens were duplicated! Did a lot of refactoring (checked off the task above!), too.

Jul 25 2023, 11:40 PM · MW-1.41-notes (1.41.0-wmf.20; 2023-08-01), Discovery-Search (Current work), Epic
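
The comment above on T219550 describes adding a remove_duplicates filter to the Hebrew analysis chain. As a rough illustration only (not the deployed configuration), the sketch below mimics the idea in plain Python: a token is dropped when the same term has already been emitted at the same position. The `Token` shape, the function name, and the sample tokens are hypothetical and are not taken from the Hebrew analyzer's output.

```python
from typing import Iterable, List, Tuple

# Hypothetical token shape for illustration: (position, term).
Token = Tuple[int, str]


def remove_duplicates(tokens: Iterable[Token]) -> List[Token]:
    """Keep the first occurrence of each (position, term) pair."""
    seen = set()
    kept = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            kept.append(token)
    return kept


# Made-up analyzer output: "wiki" is emitted twice at position 0.
tokens = [(0, "wiki"), (0, "wiki"), (1, "search"), (2, "wiki")]
deduped = remove_duplicates(tokens)

# Share of tokens removed, comparable in spirit to the percentages in the comment.
removed_pct = 100 * (len(tokens) - len(deduped)) / len(tokens)
print(deduped, removed_pct)
```

Note that the same term at a different position, such as "wiki" at position 2, is kept; only same-position repeats are treated as duplicates here.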
TJones updated the task description for T219550: [EPIC] Harmonize language analysis across languages.
Jul 25 2023, 11:36 PM · MW-1.41-notes (1.41.0-wmf.20; 2023-08-01), Discovery-Search (Current work), Epic

Jul 21 2023

Pols12 awarded T342444: Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, and word_break_helper a Love token.
Jul 21 2023, 10:06 PM · Discovery-Search (Current work)
TJones added a comment to T342444: Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, and word_break_helper.

I'll do a write-up of the before-and-after impact of the reindexing and post a link here, but anyone can do the reindexing and finish the ticket without that.

Jul 21 2023, 3:16 PM · Discovery-Search (Current work)
TJones created T342444: Reindex all wikis to enable apostrophe normalization, camelCase handling, acronym handling, and word_break_helper.
Jul 21 2023, 3:15 PM · Discovery-Search (Current work)
TJones added a comment to T219108: Investigate applying aggressive_splitting everywhere, not just on English-language wikis.

@Pols12, the code is deployed, but not activated yet. In our workflow, we generally close tickets when the code is deployed, separate from when the feature is available.

Jul 21 2023, 3:05 PM · Discovery-Search (Current work), CirrusSearch

Jul 17 2023

TJones changed the point value for T332337: Repair multi-script tokens split by the ICU tokenizer from 5 to 8.
Jul 17 2023, 3:50 PM · Discovery-Search (Current work)
TJones changed the point value for T332342: Standardize ASCII-folding/ICU-folding across analyzers from 5 to 8.
Jul 17 2023, 3:50 PM · Discovery-Search (Current work)

Jul 14 2023

TJones closed T268788: Create Elasticsearch filter so we can do aggressive_splitting without causing an invalid token order, a subtask of T268730: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards, as Declined.
Jul 14 2023, 8:00 PM · MW-1.36-notes (1.36.0-wmf.30; 2021-02-09), Discovery-Search (Current work), CirrusSearch
TJones closed T268788: Create Elasticsearch filter so we can do aggressive_splitting without causing an invalid token order as Declined.

I'm going to close this one because we've deprecated aggressive_splitting as too aggressive for the text field. It is still used in the short_text field, but the context there is constrained and not language-specific, so it's less likely to accidentally get messy.

Jul 14 2023, 8:00 PM · Discovery-Search, CirrusSearch

Jul 10 2023

TJones renamed T341332: [EPIC] The CirrusSearch streaming updater should support private wikis from The CirrusSearch streaming updater should support private wikis to [EPIC] The CirrusSearch streaming updater should support private wikis.
Jul 10 2023, 3:43 PM · Epic, Discovery-Search (Current work), CirrusSearch
TJones moved T341332: [EPIC] The CirrusSearch streaming updater should support private wikis from Incoming to Epics on the Discovery-Search (Current work) board.
Jul 10 2023, 3:43 PM · Epic, Discovery-Search (Current work), CirrusSearch
TJones moved T340548: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s from Incoming to Epics on the Discovery-Search (Current work) board.
Jul 10 2023, 3:42 PM · Epic, Discovery-Search (Current work), Data-Platform-SRE
TJones renamed T340548: [EPIC] Deployment of the Search Update Pipeline on Flink / k8s from Deployment of the Search Update Pipeline on Flink / k8s to [EPIC] Deployment of the Search Update Pipeline on Flink / k8s.
Jul 10 2023, 3:41 PM · Epic, Discovery-Search (Current work), Data-Platform-SRE
TJones triaged T341073: Normalise Mongolian script when searching as High priority.

If we can get a list of mappings, this should be technically straightforward. I will review the lists provided, and consult with a linguist I know who lives in Mongolia to see if I missed anything else obvious.

Jul 10 2023, 3:23 PM · Discovery-Search, I18n, Vertical-Writing
TJones moved T315118: Handle variation in apostrophe-like characters better from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.
Jul 10 2023, 3:08 PM · Discovery-Search (Current work), CirrusSearch
TJones moved T219108: Investigate applying aggressive_splitting everywhere, not just on English-language wikis from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.
Jul 10 2023, 3:08 PM · Discovery-Search (Current work), CirrusSearch

Jul 5 2023

TJones moved T219108: Investigate applying aggressive_splitting everywhere, not just on English-language wikis from Needs review to To Be Deployed on the Discovery-Search (Current work) board.
Jul 5 2023, 1:59 PM · Discovery-Search (Current work), CirrusSearch
TJones moved T219108: Investigate applying aggressive_splitting everywhere, not just on English-language wikis from In Progress to Needs review on the Discovery-Search (Current work) board.
Jul 5 2023, 1:59 PM · Discovery-Search (Current work), CirrusSearch

Jun 26 2023

TJones added a comment to T339293: Wikimedia\Assert\PostconditionException: Postcondition failed: Regex failed: 4.

I think I found the input that causes the problem.

Jun 26 2023, 3:42 PM · API Platform, MediaWiki-REST-API, CirrusSearch, Wikimedia-production-error
TJones moved T315118: Handle variation in apostrophe-like characters better from Needs review to To Be Deployed on the Discovery-Search (Current work) board.
Jun 26 2023, 3:13 PM · Discovery-Search (Current work), CirrusSearch

Jun 23 2023

TJones updated subscribers of T219108: Investigate applying aggressive_splitting everywhere, not just on English-language wikis.

My full write-up is on MediaWiki.

Jun 23 2023, 7:54 PM · Discovery-Search (Current work), CirrusSearch

Jun 15 2023

TJones added a comment to T170625: Smarter handling of acronyms for word_break_helper in language analyzers.

Sorry for the hokey pokey—you put the ticket in, you take the ticket out.. you put the ticket in, and you shake it all about—but the aggressive_splitting ticket (T219108) overlaps with this one too much. And! I discovered I can do what I want for acronym collapsing with a regex (probably.. still checking on details) rather than a custom filter, which makes this easier—and I'd feel better about deploying word_break_helper everywhere with that fix in place.

Jun 15 2023, 2:08 PM · Discovery-Search (Current work)
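
The comment above on T170625 says acronym collapsing can probably be done with a regex rather than a custom filter, but it does not show the pattern. The following is a minimal sketch of one way this could look, assuming the goal is to delete a period that sits between two isolated single letters. The pattern and the function name `collapse_acronyms` are hypothetical, not the regex that was deployed.

```python
import re

# A single letter: any word character except digits and underscore.
LETTER = r"[^\W\d_]"

# A period with exactly one isolated letter on each side.
# The lookarounds are evaluated against the original string, so every
# internal period in "N.A.S.A." matches independently.
ACRONYM_PERIOD = re.compile(rf"(?<=\b{LETTER})\.(?={LETTER}\b)")


def collapse_acronyms(text: str) -> str:
    """Remove periods between single letters, e.g. N.A.S.A. -> NASA."""
    return ACRONYM_PERIOD.sub("", text)


print(collapse_acronyms("N.A.S.A. launch"))
print(collapse_acronyms("en.wikipedia.org"))
```

The trailing period of the acronym is left alone in this sketch, on the assumption that a later tokenizer or word-breaking step discards it. Domain names and decimal numbers are untouched because their periods are not flanked by isolated single letters.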
TJones added a comment to T219108: Investigate applying aggressive_splitting everywhere, not just on English-language wikis.

More details to come, but aggressive_splitting (which is a word_delimiter filter underneath) is just too aggressive. It breaks things ICU normalization does better, and word_break_helper (T170625) does or can do the good things aggressive_splitting does. My new plan is to deactivate aggressive_splitting on English-language wikis and replace it with a split_camelCase filter that addresses the original issue of "FilesystemHierarchyStandard" in this ticket, and delegate the good things it does to word_break_helper.

Jun 15 2023, 2:00 PM · Discovery-Search (Current work), CirrusSearch
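
The comment above on T219108 proposes a split_camelCase filter to handle inputs like "FilesystemHierarchyStandard". As a hedged sketch of the general idea only, the Python below splits at each lowercase-to-uppercase boundary. It assumes ASCII letters for brevity, and the function name is hypothetical; the actual filter may handle other scripts and edge cases differently.

```python
import re
from typing import List

# Zero-width boundary between a lowercase ASCII letter and an uppercase one.
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def split_camel_case(token: str) -> List[str]:
    """Split a token at lowercase-to-uppercase transitions."""
    return CAMEL_BOUNDARY.sub(" ", token).split()


print(split_camel_case("FilesystemHierarchyStandard"))
print(split_camel_case("NASA"))
```

Because only lowercase-to-uppercase transitions count as boundaries, all-caps tokens such as "NASA" stay whole in this sketch.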
TJones moved T170625: Smarter handling of acronyms for word_break_helper in language analyzers from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.
Jun 15 2023, 1:40 PM · Discovery-Search (Current work)

Jun 12 2023

TJones claimed T219108: Investigate applying aggressive_splitting everywhere, not just on English-language wikis.
Jun 12 2023, 5:54 PM · Discovery-Search (Current work), CirrusSearch
TJones moved T219108: Investigate applying aggressive_splitting everywhere, not just on English-language wikis from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.
Jun 12 2023, 5:54 PM · Discovery-Search (Current work), CirrusSearch
TJones moved T315118: Handle variation in apostrophe-like characters better from In Progress to Needs review on the Discovery-Search (Current work) board.
Jun 12 2023, 3:18 PM · Discovery-Search (Current work), CirrusSearch

Jun 9 2023

TJones added a comment to T315118: Handle variation in apostrophe-like characters better.

Full write-up on MediaWiki.

Jun 9 2023, 9:52 PM · Discovery-Search (Current work), CirrusSearch

Jun 8 2023

TJones moved T170625: Smarter handling of acronyms for word_break_helper in language analyzers from In Progress to Ready for Dev -- SWE on the Discovery-Search (Current work) board.
Jun 8 2023, 10:35 PM · Discovery-Search (Current work)

May 24 2023

TJones updated the task description for T219550: [EPIC] Harmonize language analysis across languages.
May 24 2023, 7:26 PM · MW-1.41-notes (1.41.0-wmf.20; 2023-08-01), Discovery-Search (Current work), Epic

May 22 2023

TJones updated the task description for T147505: [tracking] CirrusSearch: what is updated during re-indexing.
May 22 2023, 6:14 PM · Tracking-Neverending, Epic, Discovery-Search (Current work), Discovery-ARCHIVED
TJones moved T272606: [EPIC] Unpack all Elasticsearch analyzers from Epics to Needs Reporting on the Discovery-Search (Current work) board.

Holy Guacamole, Batman! It's all done!

May 22 2023, 6:12 PM · Epic, Discovery-Search (Current work)
TJones updated the task description for T272606: [EPIC] Unpack all Elasticsearch analyzers.
May 22 2023, 6:02 PM · Epic, Discovery-Search (Current work)
TJones moved T337064: Reindex Turkish wikis to enable improved apostrophe handling from In Progress to Needs Reporting on the Discovery-Search (Current work) board.

Only looking at changes from the better apostrophe handling, because everything else got reindexed as part of reindexing all wikis for another project before the Turkish plugin was deployed. At least we tested the fallback config for when the Turkish plugin is missing, and it worked just fine!

  • No meaningful changes to ZRR or top result for this sample of queries.
  • Good changes to queries getting both more results (improved recall—Italian oggetto matching all'oggetto) and fewer results (improved precision—l no longer matches l'administration).
May 22 2023, 6:02 PM · Discovery-Search (Current work)
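
The note above on T337064 gives two effects of the apostrophe changes: oggetto now matches all'oggetto, and l no longer matches l'administration. The note does not say which filters produce these effects. The sketch below shows one plausible mechanism, elision stripping after apostrophe normalization, under loudly stated assumptions: the set of apostrophe-like characters and the prefix list are hypothetical, chosen only to cover the two examples.

```python
# Assumed set of apostrophe-like characters to normalize (illustrative only):
# right/left single quotation marks, modifier letter apostrophe, fullwidth apostrophe.
APOSTROPHE_LIKE = "\u2019\u2018\u02bc\uff07"
TO_ASCII_APOSTROPHE = {ord(ch): "'" for ch in APOSTROPHE_LIKE}

# Hypothetical list of elided prefixes; a real list would be per-language.
ELIDED_PREFIXES = {"l", "d", "all", "dell"}


def strip_elision(token: str) -> str:
    """Normalize apostrophes, then drop a known elided prefix if present."""
    normalized = token.translate(TO_ASCII_APOSTROPHE)
    prefix, apostrophe, rest = normalized.partition("'")
    if apostrophe and rest and prefix.lower() in ELIDED_PREFIXES:
        return rest
    return normalized


print(strip_elision("all'oggetto"))
print(strip_elision("l\u2019administration"))
```

In this sketch the indexed form of all'oggetto is oggetto, which helps recall, and l'administration yields administration with no stray l token, which helps precision. Tokens whose prefix is not in the list, such as o'clock, pass through with only their apostrophe normalized.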
TJones moved T332355: Deploy Turkish Analyzer Plugin from Needs review to Needs Reporting on the Discovery-Search (Current work) board.
May 22 2023, 3:15 PM · Discovery-Search (Current work)

May 19 2023

TJones updated the task description for T272606: [EPIC] Unpack all Elasticsearch analyzers.
May 19 2023, 6:11 PM · Epic, Discovery-Search (Current work)
TJones moved T337064: Reindex Turkish wikis to enable improved apostrophe handling from Incoming to In Progress on the Discovery-Search (Current work) board.
May 19 2023, 6:11 PM · Discovery-Search (Current work)
TJones created T337064: Reindex Turkish wikis to enable improved apostrophe handling.
May 19 2023, 6:11 PM · Discovery-Search (Current work)

May 17 2023

TJones added a comment to T332355: Deploy Turkish Analyzer Plugin.

We've built the new package 7.10.2-5. Haven't yet done a restart of hosts.

May 17 2023, 9:54 PM · Discovery-Search (Current work)

May 16 2023

TJones claimed T170625: Smarter handling of acronyms for word_break_helper in language analyzers.
May 16 2023, 8:28 PM · Discovery-Search (Current work)
TJones claimed T315118: Handle variation in apostrophe-like characters better.
May 16 2023, 8:28 PM · Discovery-Search (Current work), CirrusSearch

May 15 2023

TJones moved T335704: Reindex Estonian wikis to enable new unpacked analyzer from Needs review to Needs Reporting on the Discovery-Search (Current work) board.
May 15 2023, 3:07 PM · Discovery-Search (Current work)

May 9 2023

TJones moved T335704: Reindex Estonian wikis to enable new unpacked analyzer from In Progress to Needs review on the Discovery-Search (Current work) board.
May 9 2023, 5:53 PM · Discovery-Search (Current work)
TJones added a comment to T335704: Reindex Estonian wikis to enable new unpacked analyzer.
  • Estonian got a new stemmer, and it had a pretty big impact! 1 in 6 previous Estonian Wikipedia zero-results queries get results. Almost 1 in 3 of non-zero-results queries get more results, and more than 1 in 8 queries had their top result change.
May 9 2023, 5:52 PM · Discovery-Search (Current work)

May 8 2023

TJones moved T335704: Reindex Estonian wikis to enable new unpacked analyzer from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.
May 8 2023, 3:22 PM · Discovery-Search (Current work)
TJones claimed T335704: Reindex Estonian wikis to enable new unpacked analyzer.
May 8 2023, 3:22 PM · Discovery-Search (Current work)
TJones moved T332322: Install and unpack Estonian analyzer from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.
May 8 2023, 3:22 PM · MW-1.41-notes (1.41.0-wmf.10; 2023-05-23), Discovery-Search (Current work)

May 2 2023

TJones raised the priority of T78485: At Vietnamese wikis, Special:Search should not redirect based on case-folding from Low to Needs Triage.
May 2 2023, 3:40 PM · Discovery-Search, CirrusSearch, Discovery-ARCHIVED, I18n, MediaWiki-Search
TJones added a project to T78485: At Vietnamese wikis, Special:Search should not redirect based on case-folding: Discovery-Search.
May 2 2023, 3:39 PM · Discovery-Search, CirrusSearch, Discovery-ARCHIVED, I18n, MediaWiki-Search

May 1 2023

TJones updated the task description for T147505: [tracking] CirrusSearch: what is updated during re-indexing.
May 1 2023, 6:07 PM · Tracking-Neverending, Epic, Discovery-Search (Current work), Discovery-ARCHIVED
TJones removed a subtask for T272606: [EPIC] Unpack all Elasticsearch analyzers: T315265: Reindex Bengali wikis to enable new analyzer.
May 1 2023, 6:07 PM · Epic, Discovery-Search (Current work)
TJones edited parent tasks for T315265: Reindex Bengali wikis to enable new analyzer, added: T294067: Install and unpack Bengali analyzer; removed: T272606: [EPIC] Unpack all Elasticsearch analyzers.
May 1 2023, 6:07 PM · Discovery-Search (Current work)
TJones added a subtask for T294067: Install and unpack Bengali analyzer: T315265: Reindex Bengali wikis to enable new analyzer.
May 1 2023, 6:07 PM · Discovery-Search (Current work)
TJones added a subtask for T294147: Unpack Arabic & Thai Elasticsearch Analyzers: T316817: Explore Using Arabic Analysis Chain for Egyptian Arabic and Moroccan Arabic.
May 1 2023, 6:05 PM · MW-1.40-notes (1.40.0-wmf.5; 2022-10-10), Discovery-Search (Current work)
TJones added a parent task for T316817: Explore Using Arabic Analysis Chain for Egyptian Arabic and Moroccan Arabic: T294147: Unpack Arabic & Thai Elasticsearch Analyzers.
May 1 2023, 6:05 PM · MW-1.40-notes (1.40.0-wmf.8; 2022-10-31), Discovery-Search (Current work)
TJones added a subtask for T322776: Deploy Ukrainian Analyzer Plugin: T323927: Reindex Ukrainian-language wikis to enable unpacked analysis.
May 1 2023, 6:03 PM · Discovery-Search (Current work)
TJones added a parent task for T323927: Reindex Ukrainian-language wikis to enable unpacked analysis: T322776: Deploy Ukrainian Analyzer Plugin.
May 1 2023, 6:03 PM · Discovery-Search (Current work)
TJones added a subtask for T318264: Investigate Unpacking Ukrainian Analyzer: T322776: Deploy Ukrainian Analyzer Plugin.
May 1 2023, 6:03 PM · MW-1.40-notes (1.40.0-wmf.12; 2022-11-28), Discovery-Search (Current work)
TJones added a parent task for T322776: Deploy Ukrainian Analyzer Plugin: T318264: Investigate Unpacking Ukrainian Analyzer.
May 1 2023, 6:03 PM · Discovery-Search (Current work)
TJones added a subtask for T325092: Unpack Brazilian (Portuguese) Elasticsearch Analyzer: T333398: Reindex brwikimedia to use new unpacked Brazlian Portuguese analysis chain.
May 1 2023, 6:02 PM · MW-1.41-notes (1.41.0-wmf.3; 2023-04-03), Discovery-Search (Current work)
TJones added a parent task for T333398: Reindex brwikimedia to use new unpacked Brazlian Portuguese analysis chain: T325092: Unpack Brazilian (Portuguese) Elasticsearch Analyzer.
May 1 2023, 6:02 PM · Discovery-Search (Current work)
TJones triaged T335704: Reindex Estonian wikis to enable new unpacked analyzer as High priority.
May 1 2023, 5:59 PM · Discovery-Search (Current work)
TJones created T335704: Reindex Estonian wikis to enable new unpacked analyzer.
May 1 2023, 5:58 PM · Discovery-Search (Current work)
TJones moved T332322: Install and unpack Estonian analyzer from Needs review to To Be Deployed on the Discovery-Search (Current work) board.
May 1 2023, 5:54 PM · MW-1.41-notes (1.41.0-wmf.10; 2023-05-23), Discovery-Search (Current work)
TJones triaged T333401: Investigate using a better stemmer & stopwords for Portuguese wikis as High priority.
May 1 2023, 3:51 PM · Discovery-Search
TJones moved T333401: Investigate using a better stemmer & stopwords for Portuguese wikis from needs triage to Language Stuff on the Discovery-Search board.
May 1 2023, 3:51 PM · Discovery-Search
TJones edited projects for T333401: Investigate using a better stemmer & stopwords for Portuguese wikis, added: Discovery-Search; removed Discovery-Search (Current work).
May 1 2023, 3:50 PM · Discovery-Search

Apr 28 2023

TJones updated the task description for T272606: [EPIC] Unpack all Elasticsearch analyzers.
Apr 28 2023, 7:59 PM · Epic, Discovery-Search (Current work)

Apr 27 2023

TJones moved T332322: Install and unpack Estonian analyzer from In Progress to Needs review on the Discovery-Search (Current work) board.
Apr 27 2023, 9:42 PM · MW-1.41-notes (1.41.0-wmf.10; 2023-05-23), Discovery-Search (Current work)
TJones added a comment to T332322: Install and unpack Estonian analyzer.

The new Estonian analyzer looks good, and enabling it has a big impact:

  • 4.372% of Wiktionary tokens (532 distinct, including case variants) and 13.146% of Wikipedia tokens (987 distinct) were filtered as stop words.
  • The merges from stemming were quite significant (even more so for Wikipedia):
    • Estonian Wiktionary: 11,044 tokens (6.912% of tokens) were merged into 2,487 groups (2.895% of groups)
    • Estonian Wikipedia: 497,501 tokens (22.675% of tokens) were merged into 23,075 groups (7.373% of groups)
Apr 27 2023, 9:41 PM · MW-1.41-notes (1.41.0-wmf.10; 2023-05-23), Discovery-Search (Current work)

Apr 17 2023

TJones moved T333398: Reindex brwikimedia to use new unpacked Brazlian Portuguese analysis chain from In Progress to Needs Reporting on the Discovery-Search (Current work) board.

A review of the cirrus-settings-dump and a test query show things are deployed as expected.

Apr 17 2023, 5:45 PM · Discovery-Search (Current work)
TJones claimed T333398: Reindex brwikimedia to use new unpacked Brazlian Portuguese analysis chain.
Apr 17 2023, 5:36 PM · Discovery-Search (Current work)

Apr 10 2023

TJones added a comment to T323628: Optimize the WikibaseCirrusSearch elasticsearch mapping and filter query for non-english users.

@EBernhardson, it is working! I also verified some of the original examples from the Wikidata discussion page.

Apr 10 2023, 6:46 PM · MW-1.41-notes (1.41.0-wmf.1; 2023-03-20), MW-1.40-notes (1.40.0-wmf.27; 2023-03-13), Discovery-Search (Current work), CirrusSearch
TJones moved T325315: Add support for redirects in CirrusSearch from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.
Apr 10 2023, 3:56 PM · Data Engineering and Event Platform Team, MW-1.41-notes (1.41.0-wmf.16; 2023-07-04), Event-Platform, Data-Engineering, Discovery-Search (Current work)
TJones moved T329762: Unpack Turkish Analyzer and improve apostrophe handling from To Be Deployed to Needs Reporting on the Discovery-Search (Current work) board.
Apr 10 2023, 3:16 PM · MW-1.41-notes (1.41.0-wmf.2; 2023-03-27), Discovery-Search (Current work)