
[Analytics] Investigate effort of selective legacy migrations to Airflow
Closed, ResolvedPublic

Description

Scope

Understand the complexity of bringing back the data processes for:

  1. Cognate “I Miss You”
  2. Usage Dashboard

Acceptance criteria

  • We understand what the complexity for #1 is.
  • We understand what the complexity for #2 is.

Information below this point is filled out by the Wikidata Analytics team.

Assignee Planning

Information is filled out by the assignee of this task.

Estimation

Estimate: 1.5 days
Actual: 1.5 days

Sub Tasks

Full breakdown of the steps to complete this task:

  • Check process to derive data for the Wiktionary Cognate tables
    • Find another source for this data as the R process isn't clear
  • Check process to derive data for the Usage Dashboard tables

Data to be used

See Analytics/Data_Lake for the breakdown of the data lake databases and tables.

The following tables will be referenced in this task:

  • NA: all work is via investigating prior R-based code bases

Notes and Questions

Things that came up during the completion of this task, questions to be answered, and follow-up tasks:

  • Note

Event Timeline

Manuel renamed this task from [Analytics] Investigate effort selective legacy migrations to Airflow to [Analytics] Investigate effort of selective legacy migrations to Airflow. Feb 22 2024, 5:17 PM

Initial explorations of the Wiktionary directory indicate that the data we're trying to replicate here is in projectData. Within this directory we have three files per language ISO 2 code:

  • {ISO_2}wiktionary_entries.csv
  • {ISO_2}wiktionary_missing.csv (likely what we need for "I Miss You")
  • {ISO_2}wiktionary_searchvector.Rds (R-based data file, likely what we need for "Compare")

For "Most Popular" the file in question is mostPopularEntries.csv in the Wiktionary directory.

Looking into this more, as of now I'm not sure how the original connection to the Cognate extension data was made. I'm seeing no inputs from a source database in the Wiktionary Cognate dashboard code. The server loads data from the published datasets directory so the UI can be displayed, but there's no indication of how the data got there in the first place. For instance, mostPopularEntries.csv appears once in the code where it's read in, but then the expectation would be that there's also a step where it's saved there. Maybe there's a generation step that isn't included in the Wiktionary Cognate dashboard code on GitHub, which is what I have locally. There might also be code on the server that's doing all of this 🤔

I found the following Phab task talking about Cognate tables: T162252: Create SQL database and Tables for Cognate extension to be used on Wiktionaries. This leads to the following documentation: wikitech.wikimedia.org/wiki/WMDE/Cognate. Based on this, I've run the following queries for some baseline exploration of the Cognate data that's available via MariaDB using wmfdata-python:

Queries run with:

import wmfdata as wmf

df = wmf.mariadb.run(
    commands=QUERY,  # one of the SQL strings below
    dbs="cognate_wiktionary",
    use_x1=True,  # connect to the given database on the ExtensionStorage replica
)
All tables

SHOW TABLES;

Tables_in_cognate_wiktionary
cognate_pages
cognate_sites
cognate_titles

cognate_pages

SELECT
    *
FROM
    cognate_pages
LIMIT
    5

cgpa_site    cgpa_namespace  cgpa_title
2.50397e+18  0               -9.22337e+18
8.71187e+18  0               -9.22337e+18
6.77301e+18  0               -9.22337e+18
8.12084e+18  0               -9.22337e+18
8.71187e+18  0               -9.22337e+18
cognate_sites

SELECT
    *
FROM
    cognate_sites
LIMIT
    5

cgsi_key              cgsi_dbname    cgsi_interwiki
-9070280448546609211  cawiktionary   ca
-8834749551276028540  nahwiktionary  nah
-8821737830943167491  kuwiktionary   ku
-8705824589415612322  towiktionary   to
-8329989933404253437  wowiktionary   wo
cognate_titles

SELECT
    *
FROM
    cognate_titles
LIMIT
    5

cgti_raw   cgti_raw_key          cgti_normalized_key
выясняешь  -9223371534148352930  -9223371534148352930
అడుసు      -9223370618425054874  -9223370618425054874
skiftat    -9223370043901259262  -9223370043901259262
arreá      -9223369858987257508  -9223369858987257508
σιτικά     -9223369370128554895  -9223369370128554895

To me these tables seem like where we'd be starting from in all of this. It'd be great to find someone who has a better idea of what's actually in them, but at first glance we're looking at IDs that link Wiktionaries and the strings that are within them. Queries across these tables could then be used to recreate "I Miss You", "Compare" and "Most Popular".
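As a starting point, here's a rough, untested sketch of the kind of query that could back "I Miss You": titles that exist on at least one Wiktionary but not on a target one (dewiktionary is just an example). The join on cgpa_title = cgti_raw_key is an assumption based on the column names and the value ranges above and would need to be confirmed. Note also that the cgpa_* values above are rendered in scientific notation, so we'd want to make sure these keys stay exact integers when joining.

import wmfdata as wmf

MISSING_ENTRIES_QUERY = """
SELECT
    t.cgti_raw AS title,
    COUNT(DISTINCT p.cgpa_site) AS sites_with_entry
FROM cognate_pages p
JOIN cognate_titles t
    ON p.cgpa_title = t.cgti_raw_key  -- assumed relation, to be verified
WHERE p.cgpa_title NOT IN (
    SELECT cgpa_title
    FROM cognate_pages
    WHERE cgpa_site = (
        SELECT cgsi_key
        FROM cognate_sites
        WHERE cgsi_dbname = 'dewiktionary'
    )
)
GROUP BY t.cgti_raw
ORDER BY sites_with_entry DESC
LIMIT 10
"""

df_missing = wmf.mariadb.run(
    commands=MISSING_ENTRIES_QUERY,
    dbs="cognate_wiktionary",
    use_x1=True,
)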

Side note: can we rename "I Miss You"...? "Missing Entries" would be much better, in my opinion.

Moving on to the Usage Dashboard, what we're looking for is the following two tables (column headers shown):

  • Project | Project Type | Total Articles | Percent Articles Using WD | Total Articles Using WD | Percent Articles With Sitelinks | Total Articles With Sitelinks
  • Project Type | Total Articles | Percent Articles Using WD | Total Articles Using WD | Percent Articles With Sitelinks | Total Articles With Sitelinks

The process to produce the above tables is similarly quite confusing. There are tables being loaded into the server code that have no relation to the outputs, like wdcm_project_category.csv, which loads in per-project counts for categories like Architectural Structure. Maybe aggregates of the categories are being used to do this, but it's all quite messy, and if that is the case then it's not a fluid data process...

Generally we're looking for the process that creates the table USER_NAME.wdcm_clients_wb_entity_usage that the frontend is using. Searching the entire Wikidata Analytics code base for wdcm_clients_wb_entity_usage mostly turns up print statements with progress reports related to this table and code reading from it. The file WikidataAnalytics/_engines/_wdcmModules/WDCM_Sqoop_Clients.R is where the table is dropped, created and filled, with Sqoop being Apache software for transferring bulk data between Hadoop and relational databases. The original source table for this is wbc_entity_usage, and the original destination table before the user table is tmp/wmde/analytics/wdcm/wdcmsqoop/wdcm_clients_wb_entity_usage (copied to the user table in the same file).

The documentation for wbc_entity_usage is found here. I would suggest that we find someone with greater knowledge of this table and plan out how to recreate the data such that the steps being taken are checked and verified along the way. We'd want this to be primarily a query-based job rather than an R-based one, so working from R files that were not peer reviewed in the first place and that use systems (R, Sqoop, etc.) we won't be using seems like not the best use of time. A rough sketch of the query-based direction is below.
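To make the query-based idea concrete, here's a minimal sketch that skips Sqoop entirely and reads a sqooped copy of wbc_entity_usage from the Data Lake. The table name wmf_raw.mediawiki_wbc_entity_usage, its snapshot/wiki_db partitioning and the snapshot value are assumptions to verify against Analytics/Data_Lake before building anything on this.

import wmfdata as wmf

# Assumed Data Lake table and partitioning; verify against Analytics/Data_Lake.
USAGE_QUERY = """
SELECT
    wiki_db,
    COUNT(DISTINCT eu_page_id) AS pages_using_wikidata
FROM wmf_raw.mediawiki_wbc_entity_usage
WHERE snapshot = '2024-02'
GROUP BY wiki_db
ORDER BY pages_using_wikidata DESC
"""

usage_df = wmf.spark.run(USAGE_QUERY)

The percent columns from the dashboard tables would then come from joining counts like this against total article counts per project.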

We got confirmation that cognate_wiktionary is a source for the Wiktionary Cognate data as well, so we're covered as far as baseline data sources go 🎉

We should discuss the frequency of the jobs in question:

  • For the Wiktionary Cognate data to be of use to editors, I'd say we'd be looking at daily jobs: a user would want to be able to make the needed edits and then check the next day to see what else they can do. The frequency could be increased later if usage of the tables shows that we need more frequent updates, so that users aren't attempting edits that have already been reported and fixed.
  • For the usage dashboard I'd argue for monthly at least, but if product needs more granular frequencies we can add those in and have multiple jobs. These different frequencies could then be displayed on the same dashboard. (A sketch of both cadences as Airflow schedules follows this list.)
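For illustration, here's a generic Airflow sketch of the two cadences above. This is not based on our actual Airflow setup or operators; the DAG IDs and tasks are placeholders, and the schedule argument assumes Airflow 2.4+ (older versions use schedule_interval).

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="wiktionary_cognate_missing_entries",
    schedule="@daily",  # daily refresh for editors
    start_date=datetime(2024, 3, 1),
    catchup=False,
) as cognate_dag:
    EmptyOperator(task_id="compute_missing_entries")

with DAG(
    dag_id="wikidata_usage_dashboard",
    schedule="@monthly",  # baseline cadence for the usage dashboard
    start_date=datetime(2024, 3, 1),
    catchup=False,
) as usage_dag:
    EmptyOperator(task_id="compute_entity_usage_aggregates")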

Thank you for the investigation!

Based on the investigation, let's pause the work on the usage dashboard until further notice and only focus on the Wiktionary Cognate data. I have created a task about re-implementing the main data job: T360296: [Analytics] Implement data process to identify missing Wiktionary entries