Download enterprise structured content snapshots in hdfs
Closed, Resolved (Public)

Description

The enterprise structured content snapshots will be the source of the semantic search MVP.
We want to refresh the indices weekly, so we need to download these dumps at the same rate.

We are going to join these dumps with the main search index dumps, which are produced on Sundays.

Related: T403298
Related: import_enterprise_dumps.py

Preliminary work: https://gitlab.wikimedia.org/repos/search-platform/discolytics/-/commit/c5925b4e6825fc9c2bf400e08d8c44fd55e3ab26

AC:

  • structured content snapshots are available in hdfs
  • updated weekly on Sundays

Event Timeline

Change #1224894 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] airflow-search: add enterprise extra_secrets

https://gerrit.wikimedia.org/r/1224894

The full HTML snapshots are "Updated twice-monthly (on the 2nd and 21st)" (1) so I'm curious to know whether Structured Contents follows that same cadence.

The structured content snapshots are updated daily (as are the HTML snapshots, assuming you have a "non-free" account). From the same page:

If you require snapshots updated daily and/or access to the new Structured Contents snapshots, please contact our sales team for access.
Downloadable bundle of structured contents of all current revisions in a specified project and namespace. Updated daily at 12:00 UTC.

The WMF search platform has an account that we will use for this. We will download the snapshots weekly (on Sundays) because this cadence is enough for us.
Ultimately (if the project is successful) we won't rely on them, and will instead do the passage extraction from within MW itself once we have a better understanding of how to "chunk" the articles.

Change #1224894 merged by Bking:

[operations/deployment-charts@master] airflow-search: add enterprise extra_secrets

https://gerrit.wikimedia.org/r/1224894

Dumps are available as text files (raw ndjson) under /wmf/data/discovery/wikimedia_enterprise/structured_content_snapshots/snapshot=$YYYYMMDD/project=${WIKI}_namespace_0 and will be updated weekly on Sundays.
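
For anyone consuming these dumps, here is a minimal sketch of how the partition path above could be built and how the raw ndjson lines could be parsed. It is illustrative only and is not taken from import_enterprise_dumps.py. The helper names (latest_sunday, partition_path, parse_ndjson) and the example wiki name "enwiki" are hypothetical, and it assumes the snapshot= partition carries the date of the Sunday run, which the task does not state explicitly.

```python
import json
from datetime import date, timedelta
from typing import Iterable, Iterator

# Base directory as given in the comment above.
BASE_PATH = "/wmf/data/discovery/wikimedia_enterprise/structured_content_snapshots"


def latest_sunday(today: date) -> date:
    """Return the most recent Sunday on or before `today`."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return today - timedelta(days=(today.weekday() + 1) % 7)


def partition_path(snapshot: date, wiki: str) -> str:
    """Build the HDFS partition path for one wiki (hypothetical helper).

    Assumption: the snapshot= value is the download date formatted as YYYYMMDD.
    """
    return f"{BASE_PATH}/snapshot={snapshot:%Y%m%d}/project={wiki}_namespace_0"


def parse_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of an ndjson stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```

A caller could combine these as partition_path(latest_sunday(date.today()), "enwiki") and then feed the lines of each file under that directory to parse_ndjson. In practice the files would more likely be read with Spark or another HDFS-aware client than line by line in plain Python.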