Wikidata Analytics Request
This task was generated using the Wikidata Analytics request form. Please use the task template linked on our project page to create issues for the team. Thank you!
Purpose
Please provide as much context as possible as well as what the produced insights or services will be used for.
This task continues the legacy R data infrastructure migration; the investigation for it was done in T358254: [Analytics] Investigate effort of selective legacy migrations to Airflow.
As a Wiktionary user, I want to know the most common words ("entries") that are missing from a specific Wiktionary project.
Specific Results
Please detail the specific results that the task should deliver.
- Identify the original CSV for the "I miss you ..." table in https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wiktionary/
- Specifically, the {ISO_2}wiktionary_missing.csv tables found in the projectData directory of the public datasets
- (Re)create a data process that generates the table daily (daily for now so that we can evaluate the resource investment and usage)
- Some entries need to be filtered out ("Main_Page" and "main_Page")
Context
Wiktionaries describe words from their own language as well as from other languages. Pages on Wiktionaries are called "entries". Example: en:tree.
The Cognate extension automatically links pages of different language versions of Wiktionary that have the same title (after applying a few normalization rules), for example fr:tree and en:tree. These links then show up as automatic interwiki links.
There was also a Wiktionary Cognate dashboard that helped the community analyze the data of the extension.
This community tool included an "I miss you..." table/dashboard.
- Users could select a particular Wiktionary from a drop-down menu. A table then showed the top 1,000 entries (page titles) that exist on other Wiktionaries but are absent from the selected project.
- The idea was to give the editors of a language version ideas for new pages to create on their home wiki. Someone editing French Wiktionary, for example, would be interested in the words (in whatever language) that already have a page on many other Wiktionaries but not on the French one, as those are probably the most interesting and useful pages to create. That's why users want a list of the entries that already exist in a lot of languages, but not in theirs.
- The data was originally updated every 6 hours.
https://meta.wikimedia.org/wiki/Wiktionary_Cognate_Dashboard#I_Miss_You_tab
This is just for context; this task is only about implementing the data process to create public CSVs.
Notes
- Some technical details of the original work were documented in this task: {T166487#4425588}
Desired Outputs
Please list the desired outputs of this task.
- We know which CSV was the source for the "I miss you ..." table
- The {ISO_2}wiktionary_missing.csv tables found in the projectData directory of the public datasets
- A data process generates the respective CSVs daily
- Some entries are filtered out ("Main_Page" and "main_Page")
- The CSVs are published at https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wiktionary/ again
Information below this point is filled out by the task assignee.
Assignee Planning
Sub Tasks
A full breakdown of the steps to complete this task.
- Explore tables
- Understand columns and ways to connect them
- Create a query that will derive missing entries
- Explore ways of implementing a MariaDB Airflow job via suggested files
- Create PySpark job for migrating Wiktionary Cognate data from MariaDB to HDFS
- Create PySpark job for splitting the data into separate CSVs after published data export
- Implement DAG (a rough sketch follows this list)
- Test jobs and DAG
- Deploy DAG
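For the DAG step, here is a minimal, generic Airflow sketch. WMF's airflow-dags repository has its own wrappers and conventions, so the operator choice, DAG ID, task IDs, and script paths below are illustrative assumptions rather than the actual implementation.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="wiktionary_cognate_missing_entries",  # hypothetical DAG ID
    schedule="@daily",  # daily for now, per the task description
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=15)},
) as dag:
    # Step 1: pull the Cognate tables from MariaDB into HDFS.
    cognate_to_hdfs = SparkSubmitOperator(
        task_id="cognate_mariadb_to_hdfs",
        application="jobs/cognate_mariadb_to_hdfs.py",  # hypothetical script path
    )

    # Step 2: derive the missing entries and split them into per-project CSVs
    # for the published data export.
    split_missing_csvs = SparkSubmitOperator(
        task_id="split_missing_entries_csvs",
        application="jobs/split_missing_entries_csvs.py",  # hypothetical script path
    )

    cognate_to_hdfs >> split_missing_csvs
```

The two SparkSubmitOperator tasks map to the two PySpark jobs listed above, and the daily schedule matches the "daily for now" requirement so that resource usage can be evaluated before changing the cadence.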
Estimation
Estimate: 4 days (running MariaDB jobs via Airflow is entirely new to the team)
Actual:
Data
The tables that will be referenced in this task.
We're not working from Hive in this task and instead need to use MariaDB via wmfdata-python. The description of the Cognate extension can be seen here. We'll be accessing the following tables:
- cognate_wiktionary.cognate_pages
- cognate_wiktionary.cognate_sites
- cognate_wiktionary.cognate_titles
An example of getting data from one of the tables is:
```python
import wmfdata as wmf

cognate_sites_query = """
SELECT *
FROM cognate_sites
;
"""

df_cognate_sites = wmf.mariadb.run(
    commands=cognate_sites_query,
    dbs="cognate_wiktionary",
    use_x1=True,
)

display(df_cognate_sites)
```
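As a starting point for the "derive missing entries" sub-task, here is a rough sketch of the query logic: count, per title, how many Wiktionaries have a page for it, excluding titles the selected project already has as well as the "Main_Page"/"main_Page" entries. The join condition and column names (cgpa_site, cgpa_title, cgti_raw, cgti_raw_key) are assumptions based on the Cognate extension's schema description and need to be verified against the live tables.

```python
import wmfdata as wmf

# Placeholder: the Cognate site key of the selected Wiktionary project.
selected_site_id = 0

# Sketch only: column names and the join condition are assumptions from the
# Cognate schema documentation, not verified against the live tables.
missing_entries_query = """
SELECT
    cgti_raw AS entry,
    COUNT(DISTINCT cgpa_site) AS n_wiktionaries
FROM cognate_pages
JOIN cognate_titles
    ON cgpa_title = cgti_raw_key
WHERE cgti_raw NOT IN ('Main_Page', 'main_Page')
    AND cgpa_title NOT IN (
        SELECT cgpa_title
        FROM cognate_pages
        WHERE cgpa_site = {site_id}
    )
GROUP BY cgti_raw
ORDER BY n_wiktionaries DESC
LIMIT 1000
;
""".format(site_id=selected_site_id)

df_missing = wmf.mariadb.run(
    commands=missing_entries_query,
    dbs="cognate_wiktionary",
    use_x1=True,
)
```

In production this would be computed once across all projects (per-title counts joined against per-site presence) rather than one query per Wiktionary; the sketch only illustrates the shape of the logic.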
Notes
Things that came up during the completion of this task, questions to be answered and follow up tasks.
- The Cognate IDs can at times be negative as they're derived from SHA-256 hashes of the strings in question :)
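For illustration, one plausible reading of that note (an assumption, not taken from the Cognate source code): the ID is the first 8 bytes of the string's SHA-256 digest read as a signed 64-bit integer, which is what produces negative values.

```python
import hashlib
import struct

# Assumption for illustration: a Cognate-style ID as the first 8 bytes of the
# SHA-256 digest of a string, read as a big-endian signed 64-bit integer.
# The signed interpretation is what produces negative ID values.
def cognate_style_id(text: str) -> int:
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return struct.unpack(">q", digest[:8])[0]

print(cognate_style_id("tree"))  # may print a negative number
```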