Generate wikitext-based and query-based language models for TextCat
Closed, Resolved (Public)

Description

We need to generate two sets of language models for TextCat: one built from wikitext for the top 50 languages (ranked by number of speakers), and one built from queries for the top 50 languages (ranked by query logs, or the same list as above?).
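
For context, a TextCat-style language model is essentially a ranked list of the most frequent character n-grams in a training corpus. The following is a minimal sketch of that idea, not the code used for this task: the function name `build_ngram_model` and the parameters `max_n` and `top_k` are hypothetical, the default values are illustrative assumptions, and any cleanup of the wikitext or query text is assumed to have happened beforehand.

```python
import re
from collections import Counter


def build_ngram_model(text, max_n=5, top_k=400):
    """Return the top_k most frequent character n-grams (lengths 1..max_n).

    Hypothetical helper for illustration; the defaults are assumptions,
    not values taken from the TextCat code base.
    """
    counts = Counter()
    # Keep runs of letters only; digits and punctuation are dropped.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f"_{word}_"  # mark word boundaries with underscores
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram != "_":  # a bare boundary marker carries no signal
                    counts[gram] += 1
    # Most frequent first; ties broken alphabetically so output is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    print(build_ngram_model(sample, top_k=10))
```

Under this sketch, the same function would be run once per language on each corpus (wikitext or queries), producing one ranked n-gram list per language and source.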

Related Objects

Status    Assigned
Resolved  EBernhardson
Declined  mpopov
Resolved  EBernhardson
Resolved  mpopov
Resolved  EBernhardson
Resolved  debt
Open      None
Resolved  EBernhardson
Resolved  EBernhardson
Resolved  EBernhardson
Resolved  debt
Resolved  TJones
Resolved  TJones
Resolved  TJones
Resolved  TJones
Resolved  TJones
Resolved  debt
Resolved  Anikethfoss
Resolved  TJones
Resolved  debt
Resolved  Smalyshev
Resolved  TJones
Resolved  TJones
Resolved  dpatrick
Resolved  EBernhardson

Event Timeline

Smalyshev assigned this task to TJones.
Smalyshev raised the priority of this task to Medium.
Smalyshev updated the task description.
Smalyshev added a project: Discovery-ARCHIVED.
Smalyshev subscribed.

This is at least related to T121545, though some of the details are different.

This is very similar to T121545. I think they should be merged.

So far, I've created models based on lightly cleaned-up wikitext, but I haven't evaluated them yet. They have been committed and submitted for review.

The patch mentioned above has been merged, so I'm calling this complete. It could also just be merged as a duplicate, as Trey suggested.