
[Analytics] Implement data process to identify missing Wiktionary entries
Open, Stalled, Needs Triage · Public

Description

Continuing the legacy R data infrastructure migration; the investigation for this task was done in T358254: [Analytics] Investigate effort of selective legacy migrations to Airflow.

As a Wiktionary user, I want to know the most common words ("entries") that are missing from a specific Wiktionary project.

Scope

Context

Wiktionaries describe words coming from their own languages as well as other languages. Pages on Wiktionaries are called "entries". Example: en:tree.

The Cognate extension provides automatic links between two pages of different language versions of Wiktionary that have the same title (subject to a few normalization rules), for example fr:tree and en:tree. These links then show up as automatic interwiki links.

There was also a Wiktionary Cognate dashboard that helped the community analyze the data of the extension.

This community tool included an "I miss you..." table/dashboard.

  • Users could select a particular Wiktionary from a drop-down menu. A table then showed the top 1,000 entries (page titles) found in other Wiktionaries that are absent from the selected project.
  • The idea was to give the editors of a language version some ideas on what new pages to create on their home wiki. For example, someone editing French Wiktionary would be interested in the words (in whatever language) that already have a page on many other Wiktionaries, but not the French one; those are probably the most interesting/useful pages to create. That's why users want a list of the entries that already exist in many languages, but not theirs.
  • The data was originally updated every 6 hours.

https://meta.wikimedia.org/wiki/Wiktionary_Cognate_Dashboard#I_Miss_You_tab

This is just for context; this task is only about implementing the data process to create public CSVs.

Notes

  • Some tech details of the original work were documented in this task: {T166487#4425588}

Acceptance criteria


Information below this point is filled out by the Wikidata Analytics team.

Assignee Planning

Information is filled out by the assignee of this task.

Estimation

Estimate: 4 days (using MariaDB jobs via Airflow is totally new)
Actual:

Sub Tasks

Full breakdown of the steps to complete this task:

  • Explore tables
  • Understand columns and ways to connect them
  • Create a query that will derive missing entries
  • Explore ways of implementing a MariaDB Airflow job via suggested files
  • Create PySpark job for migrating Wiktionary Cognate data from MariaDB to HDFS
  • Create PySpark job for splitting the data into separate CSVs after published data export
  • Implement DAG
  • Test jobs and DAG
  • Deploy DAG

Data to be used

We're not working from Hive in this task and need to use MariaDB via wmfdata-python. The description of the Cognate extension can be seen here. We'll be accessing the following tables:

  • cognate_wiktionary.cognate_pages
  • cognate_wiktionary.cognate_sites
  • cognate_wiktionary.cognate_titles

An example of getting data from one of the tables is:

import wmfdata as wmf

# Pull the full cognate_sites table; the cognate_wiktionary database is
# accessed via the x1 cluster, hence use_x1=True below.
cognate_sites_query = """
SELECT
    *

FROM
    cognate_sites
;
"""

df_cognate_sites = wmf.mariadb.run(
    commands=cognate_sites_query,
    dbs="cognate_wiktionary",
    use_x1=True,
)
display(df_cognate_sites)

Notes and Questions

Things that came up during the completion of this task, questions to be answered, and follow-up tasks:

  • The Cognate IDs can at times be negative, as they're derived from SHA-256 hashes of the strings in question :)
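As a quick illustration of why negative values show up, here's a minimal sketch assuming the ID is a SHA-256 digest prefix read back as a signed 64-bit integer; the helper name and the exact byte handling are assumptions for illustration, not a confirmed description of Cognate's internals:

import hashlib

def cognate_style_id(text: str) -> int:
    # Hypothetical helper: take the first 8 bytes of the SHA-256 digest and
    # interpret them as a signed 64-bit integer, which is negative whenever
    # the leading bit is set (roughly half of all titles).
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)

print(cognate_style_id("tree"))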

Event Timeline

Thanks! I'll give an estimate on the timing of this once we've finished up T341330: [Analytics] Airflow implementation of unique ips accessing Wikidata's REST API metrics. I'll need to check that the cognate_wiktionary database is an appropriate source, but here's hoping, as the original source is unclear and this is the only structured data source I've seen. Maybe there's an API for the extension that could also be used within the job. Note that the plan for this is an Airflow DAG that leverages the aforementioned job and will run on WMF's infrastructure.

I've been asking around about the data source and connecting the tables and have yet to get concrete answers. Based on general assumptions of the names of the tables/columns though, the path forward for getting missing entries for a Wiktionary will be to:

  • Start with cognate_wiktionary.cognate_sites
  • Join to cognate_wiktionary.cognate_pages (cognate_sites.cgsi_key = cognate_pages.cgpa_site)
  • Join to cognate_wiktionary.cognate_titles (cognate_pages.cgpa_title = cognate_titles.cgti_raw_key - note the use of cgti_raw_key)
  • Use cognate_titles.cgti_normalized_key as a means of checking which Wiktionary entries are shared/missing across projects

Putting this here as documentation :)
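For reference while this gets verified, here's a minimal sketch of that join path as a query run through wmfdata, mirroring the example in the task description. It only counts how many Wiktionaries share each normalized title, which is the building block for deriving missing entries; the join conditions are the assumptions listed above, so the schema should be checked before relying on it:

import wmfdata as wmf

entry_counts_query = """
SELECT
    titles.cgti_normalized_key AS normalized_title,
    COUNT(DISTINCT sites.cgsi_key) AS n_wiktionaries_with_entry

FROM
    cognate_sites AS sites
    INNER JOIN cognate_pages AS pages
        ON sites.cgsi_key = pages.cgpa_site
    INNER JOIN cognate_titles AS titles
        ON pages.cgpa_title = titles.cgti_raw_key

GROUP BY
    titles.cgti_normalized_key
;
"""

df_entry_counts = wmf.mariadb.run(
    commands=entry_counts_query,
    dbs="cognate_wiktionary",
    use_x1=True,
)

Missing entries for a given Wiktionary would then be the high-count normalized titles that have no cognate_pages row for that site's cgsi_key.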

wmde/analytics/hql/airflow_jobs/wiktionary_cognate on GitLab now has all the needed queries for missing entries, most popular entries, and comparing Wiktionaries. It was easier to write all three at once rather than lose the context later. Note that these are Hive queries, as the goal is to first migrate the data to HDFS.

I've discussed the further infrastructure needs at length with a data engineer at WMF, with the steps from here being:

  • I need to write a PySpark job that gets the cognate_wiktionary tables from the MariaDB instance and puts them on HDFS on a daily basis (a rough sketch follows after this list)
    • This will go in wmde/analytics/spark
    • Note that this is relatively uncharted territory (it can be done with current long term supported tools, but will be a new type of job)
  • From there we need a DAG that will eventually include all three processes discussed above
    • The reason we'll do a DAG for all three is that each will rely on the PySpark job to migrate the data from MariaDB to HDFS
    • We can start with just doing missing entries as an output for this task, and then other tasks can add the other two to the DAG
  • The DAG in question will create tables in HDFS and then export them to the published datasets directories
    • If the plan is that this data is community facing only, then adding something to delete the contents of the HDFS tables after the fact would be good to make sure that they're not needlessly copied to backups
      • We should delete the contents of the tables, but not drop them as admin rights are needed to create them
      • Edit: query for this has been added :)
    • It would also be good to add in a PySpark job that splits the datasets for the end user so they can just download the data they're interested in
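To make that first step a bit more concrete, here's a rough PySpark sketch of the MariaDB to HDFS transfer, assuming a plain JDBC read; the host, credentials, and output path are placeholders, and the actual mechanism is exactly what we're waiting on WMF to advise us on:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cognate_wiktionary_to_hdfs").getOrCreate()

# Read one cognate_wiktionary table over JDBC. Host and credentials are
# placeholders and would come from the analytics cluster configuration.
df_cognate_pages = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://MARIADB_HOST:3306/cognate_wiktionary")
    .option("dbtable", "cognate_pages")
    .option("user", "ANALYTICS_USER")
    .option("password", "ANALYTICS_PASSWORD")
    .option("driver", "org.mariadb.jdbc.Driver")
    .load()
)

# Write a snapshot to HDFS as Parquet (placeholder path) so the Hive queries
# mentioned above can read it.
df_cognate_pages.write.mode("overwrite").parquet(
    "hdfs:///PLACEHOLDER_PATH/cognate_wiktionary/cognate_pages"
)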

There's now a MR draft for the DAGs open on GitLab. There's still lots to do as WMF wants to sync on suggestions they'll give me on how to do the MariaDB to HDFS data transfer, but the DAGs are mapped out and the hive queries they're calling have been prepared :)

AndrewTavis_WMDE changed the task status from Open to Stalled. Thu, Jun 6, 11:13 AM

I am a simple editor who volunteered to give feedback on the original Wiktionary Cognate Dashboard. We used it on Dutch Wiktionary as a means to help editors prioritize new lemmas to add. As soon as I discovered it didn't work anymore, I asked for a remedy; that was 18 months ago. Is there someone who understands that it's really dispiriting to see the status changed from Open to Stalled, without a comprehensible explanation and no indication of when a solution can be expected?

Hi @MarcoSwart, sorry for changing the status without explanation. Was in a meeting and we were moving things around, but obviously context should have been added. This is stalled for now as we're waiting for WMF to advise us on the best way forward on migrating data from MariaDB to HDFS. The data processes we need to use for this cannot be run directly on MariaDB in a sustainable way that's in line with long term supported data practices, so first we need to migrate the data to the private data cluster, and then our normal workflows take over. This migration is non-standard, and they're looking into how best to support/guide us.

By the sounds of it they're allotting the budget of a Staff Engineer to help with this soon. The data pipeline and the needed queries are basically done, so what we're waiting on is the process to migrate the data as a final step. From there we'll get the process up and running such that the data at the very least will be exported to the published datasets folders on a daily basis.

As far as a dashboard is concerned, we're also in the midst of looking into a more sustainable solution for presenting information to the public. This is similarly tied to WMF's efforts on this front. For now we hope that an export to the published datasets will suffice such that the community can then take the data and model it as they wish. I'd be happy to help people with simple Python scripts to get the data loaded into data frames and more workable states once that's done! I'd put an estimate on the data process as end of month if things work out with WMF's resources, but if not then it's August as I'm away for most of July (no later than that though).

Please let me know if you have further questions, and again sorry for the confusion!

Talked further with WMF about this just now. One basic question for the end users: would it make it more convenient for you all if the exported datasets were per Wiktionary? There are two options here, with missing entries being used as an example:

  1. We export one file that has all missing entries for all Wiktionaries
    • 188,000 rows x 3 columns
    • 188,000 rows = the 1,000 most popular missing entries for each Wiktionary (there are 188 in the data)
    • 3 columns
      • The Wiktionary
      • The word that's missing from it
      • The total of the other Wiktionaries that have it
  2. We export 188 CSVs, each of length 1,000 with the above columns

The reason for doing Option 1 or 2 and not both is that we don't want to keep duplicate copies of the data in the published datasets directories and in the data lake. Option 1 is easier, but we can figure out Option 2 if that would be you all's preference.

So the baseline question for each option is:

  1. If you're only working on one Wiktionary, would you be ok with getting it as a subset from the whole dataset?
  2. If you're working on more than one Wiktionary, would you be ok with getting the separate datasets and combining them?

Let us know which would be better for your workflow! And thanks for your continued interest in this. Great talks today about the various options we have 😊
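And in case it helps with deciding: if we go with Option 1, subsetting the combined file down to a single Wiktionary is a small pandas step. A sketch with hypothetical file and column names:

import pandas as pd

# Hypothetical file and column names for the combined Option 1 export.
df_missing = pd.read_csv("missing_entries_all_wiktionaries.csv")

# Subset to one project (e.g. Dutch Wiktionary) and sort by how many other
# Wiktionaries already have each entry.
df_missing_nl = (
    df_missing[df_missing["wiktionary"] == "nlwiktionary"]
    .sort_values("n_wiktionaries_with_entry", ascending=False)
)

df_missing_nl.to_csv("missing_entries_nlwiktionary.csv", index=False)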

To me, it seems difficult to combine 188 separate datasets myself: they will contain lots of duplicates, because many of the smaller Wiktionaries will share the same missing entries. Would it be possible to combine proposal 1 with a comprehensive CSV that contains, in the first column, all entries that are part of at least one of the 188 CSVs and, in the second column, the total number of Wiktionaries that have each entry? Because of the duplicates I expect the number of rows in this file to be significantly lower than 188,000.

Hi @MarcoSwart 👋 Thanks for the communication here :) I guess I'm a bit confused by how the other one would be used. You're roughly talking about:

word_that_is_missing_from_a_wiktionary | number_of_wiktionaries_that_do_have_it
MOST_MISSING_WORD                      | 156
NEXT_MOST_MISSING_WORD                 | 155
...                                    | ...

With that we're missing the Wiktionary column, so editors wouldn't be able to easily tell whether their Wiktionary needs a given word or not. Maybe that can be covered by another part of the data process, though. Let me explain :)

What's planned for this data process at this point is two outputs:

  • Missing Entries (I miss you ...) as described above: per Wiktionary, the 1,000 most popular missing words (popular = the number of Wiktionaries that do have it)
  • Most Popular - the most popular entries across all Wiktionaries

Maybe Most Popular would serve your interests above? This would be a CSV with, say, the 10,000 or 100,000 (or however many you all would need) most popular entries across all Wiktionaries, all of it updating on a daily basis. Would that work for you?

Please let me know if I'm understanding correctly, by the way! Appreciate your feedback :)