Add raw sites table to Analytics Data Lake
Closed, Duplicate · Public

Description

I just discovered the edit data in the wmf_raw database on the data lake. Could we add a raw copy of the sites table as well? It doesn't matter which wiki it comes from; they should all be the same.

The main benefit would be easily joining to it to filter the data in the data lake to "only Wikipedias" or "only Wikivoyages" and so on.
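To illustrate the kind of join I have in mind, here is a minimal sketch that uses an in-memory SQLite database in Python rather than the real data lake. The `edits` table and its `wiki_db` and `rev_id` columns are hypothetical stand-ins for whatever the edit data actually looks like, and the sample rows are made up; `site_global_key` and `site_group` are columns of the MediaWiki sites table. I am also assuming the edit data identifies wikis by the same database name that `site_global_key` holds.

```python
import sqlite3

# In-memory database standing in for the data lake (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Reduced copy of the sites table: only the two columns the join needs.
    CREATE TABLE sites (site_global_key TEXT PRIMARY KEY, site_group TEXT);
    -- Hypothetical edit table; the real schema may differ.
    CREATE TABLE edits (wiki_db TEXT, rev_id INTEGER);

    INSERT INTO sites VALUES
        ('enwiki', 'wikipedia'),
        ('dewiki', 'wikipedia'),
        ('enwikivoyage', 'wikivoyage');
    INSERT INTO edits VALUES
        ('enwiki', 1), ('dewiki', 2), ('enwikivoyage', 3), ('enwiki', 4);
    """
)


def edits_for_group(conn, site_group):
    """Return (wiki_db, rev_id) rows for wikis in the given project family."""
    query = """
        SELECT e.wiki_db, e.rev_id
        FROM edits AS e
        JOIN sites AS s ON s.site_global_key = e.wiki_db
        WHERE s.site_group = ?
        ORDER BY e.rev_id
    """
    return conn.execute(query, (site_group,)).fetchall()


# "Only Wikipedias": the Wikivoyage edit is filtered out by the join.
wikipedia_edits = edits_for_group(conn, "wikipedia")
print(wikipedia_edits)
```

The same join should carry over to the data lake's SQL engine more or less unchanged once a sites table is available there.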

Event Timeline

Hi @Neil - Would wmf_raw.mediawiki_project_namespace_map satisfy the need? This table is updated every month (snapshot partition) and is defined as explained here on GitHub.

Would wmf_raw.mediawiki_project_namespace_map satisfy the need? This table is updated every month (snapshot partition) and is defined as explained here on GitHub.

Unfortunately not. The data I'm most interested in is the site_group, which indicates which project family (wikipedia, wiktionary, etc.) a wiki belongs to. That isn't found in mediawiki_project_namespace_map.

I just realized I filed a broader version of this task in T184576. Let me merge this in, and we can discuss there.