This task continues the legacy R data infrastructure migration. The investigation for it was done in {T358254}.
As a Wiktionary user, I want to know which are the most common words ("entries") that are missing from a specific Wiktionary project.
===Scope===
* Identify the original CSV for the "I miss you ..." table in https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wiktionary/
* (Re)create a data process that generates the table daily (daily for now so that we can evaluate the resource investment and usage)
* Some entries need to be filtered out ("Main_Page" and "main_Page")
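The filtering step in the last scope item could be sketched as follows. This is a minimal illustration, not the original R implementation: the column names `entry` and `n_wiktionaries` and the function name `write_filtered_csv` are hypothetical, and the input is assumed to be `(title, count)` pairs that have already been ranked.

```python
import csv
import io

# Titles named in the scope that must not appear in the published CSV.
EXCLUDED_TITLES = {"Main_Page", "main_Page"}


def write_filtered_csv(rows, out):
    """Write (title, count) rows to `out` as CSV, skipping excluded titles.

    The header names are assumptions for illustration only.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["entry", "n_wiktionaries"])
    for title, count in rows:
        if title in EXCLUDED_TITLES:
            continue  # drop "Main_Page" / "main_Page"
        writer.writerow([title, count])


# Usage with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_filtered_csv([("Main_Page", 150), ("tree", 120), ("main_Page", 3)], buffer)
csv_text = buffer.getvalue()
```

Keeping the excluded titles in one set makes it easy to extend the list if further non-entry pages show up in the output.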
===Context===
**Wiktionaries** describe words coming from their own languages as well as other languages. Pages on Wiktionaries are called "entries". Example: [[ https://en.wiktionary.org/wiki/pain | en:pain ]].
The **Cognate extension** provides automatic links between pages of different language versions of Wiktionary that have the same title (after applying a few normalization rules). For example, it connects [[ https://fr.wiktionary.org/wiki/tree | fr:tree ]] and [[ https://en.wiktionary.org/wiki/tree | en:tree ]]. These links then show up as automatic interwiki links.
There was also a **Wiktionary Cognate dashboard** that helped the community analyze the data of the extension.
This community tool included an **"I miss you..." table/dashboard**.
* Users could select a particular Wiktionary from a drop-down menu. A table then showed the top 1,000 entries (page titles) found in other Wiktionaries that were absent from the selected project.
* The idea was to give the editors of a language version ideas on which new pages to create on their home wiki. For example, someone editing French Wiktionary would be interested in the words (whatever the language) that already have a page on many other Wiktionaries but not on the French one, since those are probably the most useful pages to create. That is why users want a list of the entries that already exist in many language versions but not in theirs.
* The data was originally updated every 6 hours.
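The ranking described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the input is assumed to be `(wiki, title)` pairs with titles already normalized (the Cognate normalization rules are not modeled here), and the function name `missing_entries` and the wiki identifiers are hypothetical.

```python
from collections import defaultdict


def missing_entries(pairs, target_wiki, top_n=1000):
    """Return up to `top_n` (title, count) tuples for titles absent from
    `target_wiki`, ordered by the number of other wikis that have them."""
    wikis_by_title = defaultdict(set)
    for wiki, title in pairs:
        wikis_by_title[title].add(wiki)

    # Keep only titles the target wiki does not have yet.
    counts = {
        title: len(wikis)
        for title, wikis in wikis_by_title.items()
        if target_wiki not in wikis
    }
    # Most widespread first; ties are broken alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Toy data with hypothetical wiki identifiers.
pairs = [
    ("enwiktionary", "tree"),
    ("dewiktionary", "tree"),
    ("frwiktionary", "pain"),
    ("enwiktionary", "pain"),
    ("frwiktionary", "chat"),
]
missing_in_fr = missing_entries(pairs, "frwiktionary")
missing_in_de = missing_entries(pairs, "dewiktionary")
```

In a production process the same grouping would more likely run as a query against the source tables; the in-memory version only shows the logic.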
https://meta.wikimedia.org/wiki/Wiktionary_Cognate_Dashboard#I_Miss_You_tab
This is just for context. This task is only about implementing the data process that creates the public CSVs.
===Notes===
* Some technical details of the original work were documented in this task: {T166487#4425588}
===Acceptance criteria===
[] We know which CSV was the source for the "I miss you ..." table
[] A data process is generating the respective CSV daily
[] Some entries are filtered out ("Main_Page" and "main_Page")
[] The CSV is published in https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wiktionary/ again
---
**Information below this point is filled out by the Wikidata Analytics team.**
== Assignee Planning ==
Information is filled out by the assignee of this task.
=== Estimation ===
Estimate:
Actual:
=== Sub Tasks ===
Full breakdown of the steps to complete this task:
[ ] subtask
=== Data to be used ===
See [Analytics/Data_Lake](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake) for the breakdown of the data lake databases and tables.
The following tables will be referenced in this task:
- link_to_table
=== Notes and Questions ===
Things that came up during the completion of this task, questions to be answered and follow up tasks:
- Note