
Milestone 2025Q2 > Freelance: Wikibase data migration
Open, High, Public

Description

IMPORTANT: This is a proposed workload and budget estimate. It refers to column D.

Hello Xavier,

Below are my recommendations for the wikibase migrations.

Per diem

Per diem at 400€.

Resources

Some documentation exists in /doc

Mission

Axis 1: Prepare
Axis 2: Extract
Axis 3: Migrate

| Task | Predicted | Actual | Total | Title |
| --- | --- | --- | --- | --- |
| Axis 1 | | | | |
| T385489 | ½ | ? | 200€ | Gather base knowledge on LinguaLibre to Wikidata properties |
| T354640 | 2 | ? | 800€ | Write migration procedure for Locutors data |
| T354634 | 2 | ? | 800€ | Write a resource sheet about Recordings |
| Axis 2 | | | | |
| | ½ | | | Take LinguaLibre.org offline for 3 days (WMFR) |
| | 1 | ? | 400€ | Extract Recordings data |
| | 1 | ? | 400€ | Extract Locutors data |
| Axis 3 | | | | |
| | 2 | ? | 800€ | Prepare Wikimedia Commons bot to migrate structured data and template data |
| | 1 | ? | 400€ | (?) Inject Recordings data into existing Wikimedia Commons files as their structured data |
| | 1 | ? | 400€ | Inject Locutors data into the Lingualibre.org MariaDB |
| | ½ | | | Bring LinguaLibre.org back online (WMFR) |
| Others | | | | |
| -- | 3.5 | ? | 1000 | Review and fixes (Optional) |
| -- | 1 | ? | 400 | Coordination (emails, meetups) |
| Total | 14 days | _ days | 6400€ | (project not approved yet) |
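
As a quick check on the estimate, a line total can be derived from the predicted days and the 400€ per diem stated above. This is a minimal sketch; the function name `line_total` is illustrative and not part of any project tooling.

```python
PER_DIEM_EUR = 400  # per diem stated in the task description


def line_total(predicted_days: float, per_diem: int = PER_DIEM_EUR) -> float:
    """Return the cost in euros of one line of the estimate."""
    return predicted_days * per_diem


# Half a day and two days, as in the Axis 1 lines.
half_day = line_total(0.5)
two_days = line_total(2)
```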

Note: Commons:Bots/Requests/Lingua_Libre_Bot

WARNING: After this sprint, local clean up is welcome. See T298412.

Event Timeline

Yug triaged this task as High priority. Feb 4 2025, 2:38 PM
Yug updated the task description.

Xavier and Michael raised the following questions:

  1. No speaker data view: Unless I’m mistaken, no public view of the speaker data is currently planned. Given that the structured data models on Commons currently do not allow for the inclusion of information such as gender, place of language learning, or language proficiency level, omitting such a view implies that this collected data will only exist hidden in the depths of a closed database.
  2. Audio metadata now temporary: The recording metadata will, in principle, only be saved temporarily during the upload process. The absence of this data in a lasting way will lead to a technical inability to filter out already-recorded words from a list — a very important feature for regular and prolific users.
  3. Broken links on Commons: The descriptions of the million or so files on Commons uploaded via LinguaLibre all contain links to the recording ID on LL as well as to the speaker's page. This is less critical, but given the previous two points, we’ll need to plan for cleaning up a large number of dead links post-migration.
  4. Why was the Vue.js UI revamped in this version?: I’m genuinely surprised by the decision to completely rewrite the RecordWizard (the “frontend”) code from scratch. The current production version of LinguaLibre already loads this as a standalone web application via MediaWiki. In my view, it would have been much simpler to reuse it as-is and only adapt its API calls (a few weeks of work at most), but perhaps I don’t have the full picture to judge.

We will address these questions here shortly.

As a gist, here are my preliminary answers, which will need Pushkar's additional input:
(1) There were concerns about user data, so he/we moved toward more opacity on that field. It won't be transparent by default.
(2) Yes, audio data does not need duplication; it is all on Commons now. From memory, I believe Pushkar reimplemented some queries accordingly, so the "no re-recording" filter still works.
(3) Commons' LL template/Lua module will need a refresh to hide broken links, yes. Not urgent, and volunteers can lead this next autumn.
(4) A new UX flow was initiated by Polslovitch, with 6 steps instead of 5; the new UI now uses Wikimedia Codex and was polished by Pushkar.

@Pushkar7077, could you address the points above for Antoine?

This comment was removed by Pushkar7077.

I have answered the questions as per my knowledge of Lingualibre:

  1. I am not sure why there is no view for the speaker's data. I feel we should create one so that speakers can see their recordings.
  2. I am not well aware of the audio metadata. However, we do store all the words recorded by speakers in Lingualibre's database, and they are used to remove already-recorded words at the list step.
  3. I am not sure about the broken links on Commons.
  4. When I took over Lingualibre, the Vue.js UI had already been revamped to some extent, so I am not sure why or when it was revamped.
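
Point 2 above describes removing already-recorded words at the list step. A minimal sketch of such a filter follows. It assumes the speaker's recorded words are available as plain strings and that matching is exact after trimming whitespace and Unicode NFC normalization. The function names are hypothetical, and the actual queries in Lingualibre may work differently.

```python
import unicodedata
from typing import Iterable, List


def _key(word: str) -> str:
    """Comparison key: trimmed and NFC-normalized (assumed matching rule)."""
    return unicodedata.normalize("NFC", word.strip())


def remaining_words(word_list: Iterable[str], recorded: Iterable[str]) -> List[str]:
    """Return the words not yet recorded, keeping order and dropping duplicates."""
    done = {_key(w) for w in recorded}
    result = []
    for word in word_list:
        key = _key(word)
        if key and key not in done:
            done.add(key)  # also skips later duplicates within the list itself
            result.append(key)
    return result


todo = remaining_words(["chat", "chien", "chat ", "oiseau"], ["chien"])
```

Keeping the comparison case-sensitive is a deliberate choice here, since two transcriptions that differ only by case may be distinct entries.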

Antoine, point 1:
1/ Table for groups of users -> no usage. Answer: No idea what that was for! -- Yug
2/ Table for users, fields is_staff and is_superuser (default Django fields). Are they used? Answer: migrate all MediaWiki sysops or bureaucrats as is_staff=true.
3/ Gender: nothing in the UI, but a field (type VARCHAR) exists in the locutor table. Answer: use Qid.
4/ V2, locutor table: no license; V3, locutor table: a license. What to do? Answer: Either default to cc-by-sa or fish for the values in users' profiles.
5/ Places: the UI selects by Qid, but the saved value is the place's name (string, e.g. "Saint-Denis"). Answer: yes, update expected.
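
The answers above can be read as per-record migration rules. The sketch below is one possible reading, not the agreed procedure. The record shape, the field names, the lookup tables and the Qid strings are all hypothetical placeholders. Values that cannot be resolved are left as `None` for manual review, because a place name alone (several places are called "Saint-Denis") may not identify a single Qid.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

STAFF_GROUPS = {"sysop", "bureaucrat"}  # MediaWiki groups mapped to is_staff
DEFAULT_LICENSE = "cc-by-sa"            # fallback when V2 has no license


@dataclass
class LegacyLocutor:
    """Illustrative legacy speaker record (not the real schema)."""
    username: str
    groups: Set[str] = field(default_factory=set)
    gender: Optional[str] = None   # free-text VARCHAR in the legacy table
    license: Optional[str] = None  # absent in V2
    place: Optional[str] = None    # place name string, e.g. "Saint-Denis"


def migrate_locutor(row: LegacyLocutor,
                    gender_qids: Dict[str, str],
                    place_qids: Dict[str, str]) -> dict:
    """Apply the rules discussed above to a single record."""
    return {
        "username": row.username,
        # Rule 2: sysops and bureaucrats become staff.
        "is_staff": bool(row.groups & STAFF_GROUPS),
        # Rule 3: store a Qid instead of free text; None if unmapped.
        "gender": gender_qids.get(row.gender) if row.gender else None,
        # Rule 4: keep an existing license, else fall back to the default.
        "license": row.license or DEFAULT_LICENSE,
        # Rule 5: replace the place name with a Qid; None if unmapped.
        "place": place_qids.get(row.place) if row.place else None,
    }


# Placeholder lookup tables; the Qid strings are NOT real identifiers.
example = migrate_locutor(
    LegacyLocutor("Example", {"sysop"}, gender="f", place="Saint-Denis"),
    gender_qids={"f": "Q_GENDER_PLACEHOLDER"},
    place_qids={"Saint-Denis": "Q_PLACE_PLACEHOLDER"},
)
```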