Maniphest T219911

Retrain Chinese query-based language ID models
Open, Medium, Public

Assigned To

None

Authored By

	TJones
	Apr 2 2019, 5:59 PM

Description

As part of the investigation into T174116, we discovered that the Chinese query-based language models have a lot of long strings of dashes and periods in them, which should be removed.
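A minimal sketch of the cleanup, assuming a model can be loaded as a mapping from n-gram string to frequency count. The actual TextCat model file format is not shown in this task, and the function name `strip_punctuation_ngrams`, the two-character threshold, and the sample data are all hypothetical:

```python
import re

# Assumed rule: an n-gram counts as punctuation noise if it contains a run of
# two or more dashes and/or periods. The threshold of 2 is an assumption.
PUNCT_RUN = re.compile(r"[-.]{2,}")


def strip_punctuation_ngrams(model):
    """Return a copy of `model` without n-grams that contain dash/period runs.

    `model` is assumed to map n-gram strings to frequency counts.
    Insertion order is preserved, so the relative ranking of the
    remaining n-grams does not change.
    """
    return {ngram: count for ngram, count in model.items()
            if not PUNCT_RUN.search(ngram)}


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = {"的": 900, "...": 120, "--": 80, "中.": 5, "-----": 40}
    print(strip_punctuation_ngrams(sample))
```

Filtering the existing n-gram list is only an approximation of retraining: a real retrain would clean the training queries first, so that the slots freed by the removed n-grams are filled by the next most frequent ones.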

Once the model is updated, it should be tested against the TextCat regression test sets to make sure there's no big change in overall performance, against the punctuation-heavy sample from T174116, and against the specific known queries (... being the main one) to make sure everything works as expected.
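The before/after comparison could be sketched roughly as follows. This is not the existing TextCat test tooling: `accuracy`, `compare_models`, the `identify` callables, and the 1% tolerance standing in for "no big change" are all hypothetical, and the test sets are assumed to be available as (query, expected language) pairs:

```python
def accuracy(identify, labeled_queries):
    """Fraction of (query, expected_language) pairs that `identify` gets right.

    `identify` is any callable taking a query string and returning a
    language code, or None when no language is identified.
    """
    if not labeled_queries:
        raise ValueError("labeled_queries must not be empty")
    correct = sum(1 for query, expected in labeled_queries
                  if identify(query) == expected)
    return correct / len(labeled_queries)


def compare_models(old_identify, new_identify, labeled_queries, tolerance=0.01):
    """Return (old_accuracy, new_accuracy, ok).

    `ok` is False when the new model's accuracy is lower than the old
    model's by more than `tolerance` (an assumed threshold).
    """
    old_acc = accuracy(old_identify, labeled_queries)
    new_acc = accuracy(new_identify, labeled_queries)
    return old_acc, new_acc, (old_acc - new_acc) <= tolerance
```

The same function can be run three times, once per data set: the regression test sets, the punctuation-heavy sample, and the short list of known queries. What the expected label for a query such as ... should be (for example, no language at all) is an assumption to settle when building that list.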

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T118278 [EPIC] Improve Language Identification for use in Cirrus Search
		Open		None	T219911 Retrain Chinese query-based language ID models

Event Timeline

TJones created this task. Apr 2 2019, 5:59 PM

Restricted Application added a subscriber: Aklapper. Apr 2 2019, 5:59 PM

TJones triaged this task as Medium priority. Apr 2 2019, 6:00 PM

TJones moved this task from needs triage to Language Stuff on the Discovery-Search board.

TJones added a parent task: T174116: Another look at multi-hyphen tokens on enwiki and zhwiki. Apr 2 2019, 6:08 PM

TJones edited parent tasks, added: T118278: [EPIC] Improve Language Identification for use in Cirrus Search; removed: T174116: Another look at multi-hyphen tokens on enwiki and zhwiki. Apr 2 2019, 6:11 PM

TJones mentioned this in T174116: Another look at multi-hyphen tokens on enwiki and zhwiki. Apr 2 2019, 6:27 PM

Shizhao added a project: Chinese-Sites. Apr 9 2019, 8:59 AM

VulpesVulpes825 moved this task from Backlog to Research on the Chinese-Sites board. Jul 13 2020, 7:17 AM