
[Analytics] Investigate effort of selective legacy migrations to Airflow
Closed, ResolvedPublic

Description

Scope

Understand the complexity of bringing back the data processes for:

  1. Cognate “I Miss You”
  2. Usage Dashboard

Acceptance criteria

  • We understand what the complexity for #1 is.
  • We understand what the complexity for #2 is.

Information below this point is filled out by the Wikidata Analytics team.

Assignee Planning

Information is filled out by the assignee of this task.

Estimation

Estimate: 1.5 days
Actual: 1.5 days

Sub Tasks

Full breakdown of the steps to complete this task:

  • Check process to derive data for the Wiktionary Cognate tables
    • Find another source for this data as the R process isn't clear
  • Check process to derive data for the Usage Dashboard tables

Data to be used

See Analytics/Data_Lake for the breakdown of the data lake databases and tables.

The following tables will be referenced in this task:

  • NA: all work is via investigating prior R-based code bases

Notes and Questions

Things that came up during the completion of this task, questions to be answered, and follow-up tasks:

  • Note

Event Timeline

Manuel renamed this task from [Analytics] Investigate effort selective legacy migrations to Airflow to [Analytics] Investigate effort of selective legacy migrations to Airflow. Feb 22 2024, 5:17 PM

Initial explorations of the Wiktionary directory indicate that the data we're trying to replicate here is in projectData. Within this directory we have three files per language ISO 2 code:

  • {ISO_2}wiktionary_entries.csv
  • {ISO_2}wiktionary_missing.csv (likely what we need for "I Miss You")
  • {ISO_2}wiktionary_searchvector.Rds (R-based data file, likely what we need for "Compare")

For "Most Popular" the file in question is mostPopularEntries.csv in the Wiktionary directory.

Looking into this more, as of now I'm not sure how the original connection to the Cognate extension data was made. I'm seeing no inputs from a source database in the Wiktionary Cognate dashboard code. The server loads data from the published datasets directory so the UI can be displayed, but there's no indication of how the data got there in the first place. For instance, mostPopularEntries.csv appears once in the code where it's read in, but then the expectation would be that there's also a step where it's saved there. Maybe there's a generation step that isn't included in the Wiktionary Cognate dashboard code on GitHub, which is what I have locally. There might also be code on the server that's doing all of this 🤔

I found the following Phab task talking about Cognate tables: T162252: Create SQL database and Tables for Cognate extension to be used on Wiktionaries. This leads to the following documentation: wikitech.wikimedia.org/wiki/WMDE/Cognate. Based on this, I've run the following queries for some baseline exploration of the Cognate data that's available via MariaDB using wmfdata-python:

Queries run with:

import wmfdata as wmf

df = wmf.mariadb.run(
    commands=QUERY,  # one of the SQL strings below
    dbs="cognate_wiktionary",
    use_x1=True,  # connect to the given database on the ExtensionStorage replica
)
All tables

SHOW TABLES;

Tables_in_cognate_wiktionary
cognate_pages
cognate_sites
cognate_titles

cognate_pages

SELECT
    *
FROM
    cognate_pages
LIMIT
    5

cgpa_site    cgpa_namespace  cgpa_title
2.50397e+18  0               -9.22337e+18
8.71187e+18  0               -9.22337e+18
6.77301e+18  0               -9.22337e+18
8.12084e+18  0               -9.22337e+18
8.71187e+18  0               -9.22337e+18
cognate_sites

SELECT
    *
FROM
    cognate_sites
LIMIT
    5

cgsi_key              cgsi_dbname    cgsi_interwiki
-9070280448546609211  cawiktionary   ca
-8834749551276028540  nahwiktionary  nah
-8821737830943167491  kuwiktionary   ku
-8705824589415612322  towiktionary   to
-8329989933404253437  wowiktionary   wo
cognate_titles

SELECT
    *
FROM
    cognate_titles
LIMIT
    5

cgti_raw   cgti_raw_key          cgti_normalized_key
выясняешь  -9223371534148352930  -9223371534148352930
అడుసు      -9223370618425054874  -9223370618425054874
skiftat    -9223370043901259262  -9223370043901259262
arreá      -9223369858987257508  -9223369858987257508
σιτικά     -9223369370128554895  -9223369370128554895

To me these tables seem like where we'd be starting from in all of this. It'd be great to find someone who has a better idea of what's actually in them, but at first glance we're looking at IDs that link Wiktionaries and the strings that are within them. Queries across these tables could then be used to recreate "I Miss You", "Compare" and "Most Popular".
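As a starting point, here's a rough, untested sketch of the kind of query that could back "I Miss You": titles that exist on at least one Wiktionary but not on a target one (dewiktionary is just an example). The join on cgpa_title = cgti_raw_key is an assumption based on the column names and the value ranges above and would need to be confirmed. Note also that the cgpa_* values above are rendered in scientific notation, so we'd want to make sure these keys stay exact integers when joining.

import wmfdata as wmf

MISSING_ENTRIES_QUERY = """
SELECT
    t.cgti_raw AS title,
    COUNT(DISTINCT p.cgpa_site) AS sites_with_entry
FROM cognate_pages p
JOIN cognate_titles t
    ON p.cgpa_title = t.cgti_raw_key  -- assumed relation, to be verified
WHERE p.cgpa_title NOT IN (
    SELECT cgpa_title
    FROM cognate_pages
    WHERE cgpa_site = (
        SELECT cgsi_key
        FROM cognate_sites
        WHERE cgsi_dbname = 'dewiktionary'
    )
)
GROUP BY t.cgti_raw
ORDER BY sites_with_entry DESC
LIMIT 10
"""

df_missing = wmf.mariadb.run(
    commands=MISSING_ENTRIES_QUERY,
    dbs="cognate_wiktionary",
    use_x1=True,
)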

Side note: can we rename "I Miss You"...? "Missing Entries" would be much better, in my opinion.

Moving on to the Usage Dashboard, what we're looking for is the following two tables (column headers shown):

  • Project | Project Type | Total Articles | Percent Articles Using WD | Total Articles Using WD | Percent Articles With Sitelinks | Total Articles With Sitelinks
  • Project Type | Total Articles | Percent Articles Using WD | Total Articles Using WD | Percent Articles With Sitelinks | Total Articles With Sitelinks

The process to produce the above tables is similarly quite confusing. There are tables being loaded into the server code that have no relation to the outputs, like wdcm_project_category.csv, which loads in per-project counts for categories like Architectural Structure. Maybe aggregates of the categories are being used to do this, but it's all quite messy, and if that is the case then it's not a fluid data process...

Generally we're looking for the process that creates the table USER_NAME.wdcm_clients_wb_entity_usage that the frontend is using. Searching the entire Wikidata Analytics code base for wdcm_clients_wb_entity_usage mostly turns up print statements with progress reports related to this table and code reading from it. The file WikidataAnalytics/_engines/_wdcmModules/WDCM_Sqoop_Clients.R is where the table is dropped, created and filled, with Sqoop being Apache software for transferring bulk data between Hadoop and relational databases. The original source table for this is wbc_entity_usage, and the original destination table before the user table is tmp/wmde/analytics/wdcm/wdcmsqoop/wdcm_clients_wb_entity_usage (copied to the user table in the same file).

The documentation for wbc_entity_usage is found here. I would suggest that we find someone with greater knowledge of this table and plan out how to recreate the data such that the steps being taken are checked and verified along the way. We'd want this to be primarily a query-based job rather than an R-based one, so working from R files that were not peer reviewed in the first place and that use systems (R, Sqoop, etc.) we won't be using seems like not the best use of time. A rough sketch of the query-based direction is below.
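To make the query-based idea concrete, here's a minimal sketch that skips Sqoop entirely and reads a sqooped copy of wbc_entity_usage from the Data Lake. The table name wmf_raw.mediawiki_wbc_entity_usage, its snapshot/wiki_db partitioning and the snapshot value are assumptions to verify against Analytics/Data_Lake before building anything on this.

import wmfdata as wmf

# Assumed Data Lake table and partitioning; verify against Analytics/Data_Lake.
USAGE_QUERY = """
SELECT
    wiki_db,
    COUNT(DISTINCT eu_page_id) AS pages_using_wikidata
FROM wmf_raw.mediawiki_wbc_entity_usage
WHERE snapshot = '2024-02'
GROUP BY wiki_db
ORDER BY pages_using_wikidata DESC
"""

usage_df = wmf.spark.run(USAGE_QUERY)

The percent columns from the dashboard tables would then come from joining counts like this against total article counts per project.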

We got confirmation that cognate_wiktionary is a source for the Wiktionary Cognate data as well, so we're covered as far as baseline data sources go 🎉

We should discuss the frequency of the jobs in question:

  • For the Wiktionary Cognate data to be of use to editors, I'd say we'd be looking at daily jobs: a user would want to be able to make the needed edits and then check the next day to see what else they can do. The frequency could be increased later if usage of the tables shows that we need more frequent updates, so that users aren't attempting edits that have already been reported and fixed.
  • For the usage dashboard I'd argue for monthly at least, but if product needs more granular frequencies we can add those in and have multiple jobs. These different frequencies could then be displayed on the same dashboard. (A sketch of both cadences as Airflow schedules follows this list.)
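For illustration, here's a generic Airflow sketch of the two cadences above. This is not based on our actual Airflow setup or operators; the DAG IDs and tasks are placeholders, and the schedule argument assumes Airflow 2.4+ (older versions use schedule_interval).

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="wiktionary_cognate_missing_entries",
    schedule="@daily",  # daily refresh for editors
    start_date=datetime(2024, 3, 1),
    catchup=False,
) as cognate_dag:
    EmptyOperator(task_id="compute_missing_entries")

with DAG(
    dag_id="wikidata_usage_dashboard",
    schedule="@monthly",  # baseline cadence for the usage dashboard
    start_date=datetime(2024, 3, 1),
    catchup=False,
) as usage_dag:
    EmptyOperator(task_id="compute_entity_usage_aggregates")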

Thank you for the investigation!

Based on the investigation, let's pause the work on the usage dashboard until further notice and only focus on the Wiktionary Cognate data. I have created a task about re-implementing the main data job: T360296: [Analytics] Implement data process to identify missing Wiktionary entries