
[Analytics] Implement data process to identify missing Wiktionary entries
Open, Stalled, Needs Triage · Public

Description

Continuing the legacy R data infrastructure migration; the investigation for this task was done in T358254: [Analytics] Investigate effort of selective legacy migrations to Airflow.

As a Wiktionary user, I want to know the most common words ("entries") that are missing from a specific Wiktionary project.

Scope

Context

Wiktionaries describe words coming from their own languages as well as other languages. Pages on Wiktionaries are called "entries". Example: en:tree.

The Cognate extension provides automatic links between two pages of different language versions of Wiktionary that have the same title (subject to a few normalization rules), for example fr:tree and en:tree. These links then show up as automatic interwiki links.

There was also a Wiktionary Cognate dashboard that helped the community analyze the data of the extension.

This community tool included an "I miss you..." table/dashboard.

  • Users could select a particular Wiktionary from a drop-down menu. A table then showed the top 1,000 entries (page titles) found in other Wiktionaries that are absent from the selected project.
  • The idea was to give the editors of a language version some ideas on what new pages to create on their home wiki. For example, someone editing French Wiktionary would be interested in the words (in whatever language) that already have a page on many other Wiktionaries, but not the French one; those are probably the most interesting/useful pages to create. That's why users want a list of the entries that already exist in many languages, but not theirs.
  • The data was originally updated every 6 hours.

https://meta.wikimedia.org/wiki/Wiktionary_Cognate_Dashboard#I_Miss_You_tab

This is just for context; this task is only about implementing the data process to create public CSVs.

Notes

  • Some tech details of the original work were documented in this task: {T166487#4425588}

Acceptance criteria


Information below this point is filled out by the Wikidata Analytics team.

Assignee Planning

Information is filled out by the assignee of this task.

Estimation

Estimate: 4 days (using MariaDB jobs via Airflow is totally new)
Actual:

Sub Tasks

Full breakdown of the steps to complete this task:

  • Explore tables
  • Understand columns and ways to connect them
  • Create a query that will derive missing entries
  • Explore ways of implementing a MariaDB Airflow job via suggested files
  • Create PySpark job for migrating Wiktionary Cognate data from MariaDB to HDFS
  • Create PySpark job for splitting the data into separate CSVs after published data export
  • Implement DAG
  • Test jobs and DAG
  • Deploy DAG

Data to be used

We're not working from Hive in this task and need to use MariaDB via wmfdata-python. The description of the Cognate extension can be seen here. We'll be accessing the following tables:

  • cognate_wiktionary.cognate_pages
  • cognate_wiktionary.cognate_sites
  • cognate_wiktionary.cognate_titles

An example of getting data from one of the tables is:

import wmfdata as wmf

# Pull the full cognate_sites table; the cognate_wiktionary database is
# accessed via the x1 cluster, hence use_x1=True below.
cognate_sites_query = """
SELECT
    *

FROM
    cognate_sites
;
"""

df_cognate_sites = wmf.mariadb.run(
    commands=cognate_sites_query,
    dbs="cognate_wiktionary",
    use_x1=True,
)
display(df_cognate_sites)

Notes and Questions

Things that came up during the completion of this task, questions to be answered, and follow-up tasks:

  • The Cognate IDs can at times be negative, as they're derived from SHA-256 hashes of the strings in question :)
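As a quick illustration of why negative values show up, here's a minimal sketch assuming the ID is a SHA-256 digest prefix read back as a signed 64-bit integer; the helper name and the exact byte handling are assumptions for illustration, not a confirmed description of Cognate's internals:

import hashlib

def cognate_style_id(text: str) -> int:
    # Hypothetical helper: take the first 8 bytes of the SHA-256 digest and
    # interpret them as a signed 64-bit integer, which is negative whenever
    # the leading bit is set (roughly half of all titles).
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)

print(cognate_style_id("tree"))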

Event Timeline

Thanks! I'll give an estimate on the timing of this once we've finished up T341330: [Analytics] Airflow implementation of unique ips accessing Wikidata's REST API metrics. I'll need to check that the cognate_wiktionary database is an appropriate source, but here's hoping, as the original source is unclear and this is the only structured data source I've seen. Maybe there's an API for the extension that could also be used within the job. Note that the plan for this is an Airflow DAG that leverages the aforementioned job and will run on WMF's infrastructure.

I've been asking around about the data source and connecting the tables and have yet to get concrete answers. Based on general assumptions of the names of the tables/columns though, the path forward for getting missing entries for a Wiktionary will be to:

  • Start with cognate_wiktionary.cognate_sites
  • Join to cognate_wiktionary.cognate_pages (cognate_sites.cgsi_key = cognate_pages.cgpa_site)
  • Join to cognate_wiktionary.cognate_titles (cognate_pages.cgpa_title = cognate_titles.cgti_raw_key - note the use of cgti_raw_key)
  • Use cognate_titles.cgti_normalized_key as a means of checking which Wiktionary entries are shared/missing across projects

Putting this here as documentation :)
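For reference while this gets verified, here's a minimal sketch of that join path as a query run through wmfdata, mirroring the example in the task description. It only counts how many Wiktionaries share each normalized title, which is the building block for deriving missing entries; the join conditions are the assumptions listed above, so the schema should be checked before relying on it:

import wmfdata as wmf

entry_counts_query = """
SELECT
    titles.cgti_normalized_key AS normalized_title,
    COUNT(DISTINCT sites.cgsi_key) AS n_wiktionaries_with_entry

FROM
    cognate_sites AS sites
    INNER JOIN cognate_pages AS pages
        ON sites.cgsi_key = pages.cgpa_site
    INNER JOIN cognate_titles AS titles
        ON pages.cgpa_title = titles.cgti_raw_key

GROUP BY
    titles.cgti_normalized_key
;
"""

df_entry_counts = wmf.mariadb.run(
    commands=entry_counts_query,
    dbs="cognate_wiktionary",
    use_x1=True,
)

Missing entries for a given Wiktionary would then be the high-count normalized titles that have no cognate_pages row for that site's cgsi_key.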

wmde/analytics/hql/airflow_jobs/wiktionary_cognate on GitLab now has all the needed queries for missing entries, most popular entries, and comparing Wiktionaries. It was easier to write all three at once rather than lose the context later. Note that these are Hive queries, as the goal is to first migrate the data to HDFS.

I've discussed the further infrastructure needs at length with a data engineer at WMF, with the steps from here being:

  • I need to write a PySpark job that gets the cognate_wiktionary tables from the MariaDB instance and puts them on HDFS on a daily basis (a rough sketch follows after this list)
    • This will go in wmde/analytics/spark
    • Note that this is relatively uncharted territory (it can be done with current long term supported tools, but will be a new type of job)
  • From there we need a DAG that will eventually include all three processes discussed above
    • The reason we'll do a DAG for all three is that each will rely on the PySpark job to migrate the data from MariaDB to HDFS
    • We can start with just doing missing entries as an output for this task, and then other tasks can add the other two to the DAG
  • The DAG in question will create tables in HDFS and then export them to the published datasets directories
    • If the plan is that this data is community facing only, then adding something to delete the contents of the HDFS tables after the fact would be good to make sure that they're not needlessly copied to backups
      • We should delete the contents of the tables, but not drop them as admin rights are needed to create them
      • Edit: query for this has been added :)
    • It would also be good to add in a PySpark job that splits the datasets for the end user so they can just download the data they're interested in
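To make that first step a bit more concrete, here's a rough PySpark sketch of the MariaDB to HDFS transfer, assuming a plain JDBC read; the host, credentials, and output path are placeholders, and the actual mechanism is exactly what we're waiting on WMF to advise us on:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cognate_wiktionary_to_hdfs").getOrCreate()

# Read one cognate_wiktionary table over JDBC. Host and credentials are
# placeholders and would come from the analytics cluster configuration.
df_cognate_pages = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://MARIADB_HOST:3306/cognate_wiktionary")
    .option("dbtable", "cognate_pages")
    .option("user", "ANALYTICS_USER")
    .option("password", "ANALYTICS_PASSWORD")
    .option("driver", "org.mariadb.jdbc.Driver")
    .load()
)

# Write a snapshot to HDFS as Parquet (placeholder path) so the Hive queries
# mentioned above can read it.
df_cognate_pages.write.mode("overwrite").parquet(
    "hdfs:///PLACEHOLDER_PATH/cognate_wiktionary/cognate_pages"
)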

There's now a MR draft for the DAGs open on GitLab. There's still lots to do as WMF wants to sync on suggestions they'll give me on how to do the MariaDB to HDFS data transfer, but the DAGs are mapped out and the hive queries they're calling have been prepared :)

AndrewTavis_WMDE changed the task status from Open to Stalled. Thu, Jun 6, 11:13 AM

I am a simple editor who volunteered to give feedback on the original Wiktionary Cognate Dashboard. We used it on Dutch Wiktionary as a means to help editors prioritize new lemmas to add. As soon as I discovered it didn't work anymore, I asked for a remedy; that was 18 months ago. Is there someone who understands that it's really dispiriting to see the status changed from Open to Stalled, without a comprehensible explanation and no indication of when a solution can be expected?

Hi @MarcoSwart, sorry for changing the status without explanation. Was in a meeting and we were moving things around, but obviously context should have been added. This is stalled for now as we're waiting for WMF to advise us on the best way forward on migrating data from MariaDB to HDFS. The data processes we need to use for this cannot be run directly on MariaDB in a sustainable way that's in line with long term supported data practices, so first we need to migrate the data to the private data cluster, and then our normal workflows take over. This migration is non-standard, and they're looking into how best to support/guide us.

By the sounds of it they're allotting the budget of a Staff Engineer to help with this soon. The data pipeline and the needed queries are basically done, so what we're waiting on is the process to migrate the data as a final step. From there we'll get the process up and running such that the data at the very least will be exported to the published datasets folders on a daily basis.

As far as a dashboard is concerned, we're also in the midst of looking into a more sustainable solution for presenting information to the public. This is similarly tied to WMF's efforts on this front. For now we hope that an export to the published datasets will suffice such that the community can then take the data and model it as they wish. I'd be happy to help people with simple Python scripts to get the data loaded into data frames and more workable states once that's done! I'd put an estimate on the data process as end of month if things work out with WMF's resources, but if not then it's August as I'm away for most of July (no later than that though).

Please let me know if you have further questions, and again sorry for the confusion!

Talked further with WMF about this just now. One basic question for the end users: would it make it more convenient for you all if the exported datasets were per Wiktionary? There are two options here, with missing entries being used as an example:

  1. We export one file that has all missing entries for all Wiktionaries
    • 188,000 rows x 3 columns
    • 188,000 rows = the 1,000 most popular missing entries for each Wiktionary (there are 188 in the data)
    • 3 columns
      • The Wiktionary
      • The word that's missing from it
      • The total of the other Wiktionaries that have it
  2. We export 188 CSVs, each of length 1,000 with the above columns

The reason for doing Option 1 or 2 and not both is that we don't want to keep duplicate copies of the data in the published datasets directories and in the data lake. Option 1 is easier, but we can figure out Option 2 if that would be you all's preference.

So the baseline question for each option is:

  1. If you're only working on one Wiktionary, would you be ok with getting it as a subset from the whole dataset?
  2. If you're working on more than one Wiktionary, would you be ok with getting the separate datasets and combining them?

Let us know which would be better for your workflow! And thanks for your continued interest in this. Great talks today about the various options we have 😊
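And in case it helps with deciding: if we go with Option 1, subsetting the combined file down to a single Wiktionary is a small pandas step. A sketch with hypothetical file and column names:

import pandas as pd

# Hypothetical file and column names for the combined Option 1 export.
df_missing = pd.read_csv("missing_entries_all_wiktionaries.csv")

# Subset to one project (e.g. Dutch Wiktionary) and sort by how many other
# Wiktionaries already have each entry.
df_missing_nl = (
    df_missing[df_missing["wiktionary"] == "nlwiktionary"]
    .sort_values("n_wiktionaries_with_entry", ascending=False)
)

df_missing_nl.to_csv("missing_entries_nlwiktionary.csv", index=False)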

To me, it seems difficult to combine 188 separate datasets myself: they will contain lots of duplicates, because many of the smaller Wiktionaries will share the same missing entries. Would it be possible to combine proposal 1 with a comprehensive CSV that contains, in the first column, all entries that are part of at least one of the 188 CSVs and, in the second column, the total number of Wiktionaries that have each entry? Because of the duplicates I expect the number of rows in this file to be significantly lower than 188,000.

Hi @MarcoSwart 👋 Thanks for the communication here :) I guess I'm a bit confused by how the other one would be used. You're roughly talking about:

word_that_is_missing_from_a_wiktionary | number_of_wiktionaries_that_do_have_it
MOST_MISSING_WORD                      | 156
NEXT_MOST_MISSING_WORD                 | 155
...                                    | ...

With that we're missing the Wiktionary column, so editors wouldn't be able to easily tell whether their Wiktionary needs a given word or not. Maybe that can be covered by another part of the data process, though. Let me explain :)

What's planned for this data process at this point is two outputs:

  • Missing Entries (I miss you ...) as described above: per Wiktionary, the 1,000 most popular missing words (popular = the number of Wiktionaries that do have it)
  • Most Popular - the most popular entries across all Wiktionaries

Maybe Most Popular would serve your interests above? This would be a CSV with, say, the 10,000 or 100,000 (or however many you all would need) most popular entries across all Wiktionaries, all of it updating on a daily basis. Would that work for you?

Please let me know if I'm understanding correctly, by the way! Appreciate your feedback :)