Republish datasets with primary key ID column included
Open, High
Public

Description

  • simplewiki
  • dewiki
  • ptwiki
  • elwiki
  • frwiki
  • enwiki
  • arwiki
  • viwiki
  • cswiki
  • bnwiki
WARNING: Make sure the data/$WIKI_ID directory has up-to-date datasets. Otherwise you need to rerun run-pipeline.sh from the beginning.
WIKI_ID=elwiki
DATASET_PATH=$(pwd)/data/${WIKI_ID}

DB_USER=${DB_USER:-research}
DB_DATABASE=${DB_DATABASE:-staging}
DB_HOST=${DB_HOST:-dbstore1005.eqiad.wmnet}
DB_PORT=${DB_PORT:-3350}
DB_READ_DEFAULT_FILE=${DB_READ_DEFAULT_FILE:-/etc/mysql/conf.d/analytics-research-client.cnf}

DB_USER=$DB_USER \
DB_DATABASE=$DB_DATABASE \
DB_HOST=$DB_HOST \
DB_PORT=$DB_PORT \
DB_READ_DEFAULT_FILE=$DB_READ_DEFAULT_FILE \
python create_tables.py -id "$WIKI_ID"

DB_USER=$DB_USER \
DB_DATABASE=$DB_DATABASE \
DB_HOST=$DB_HOST \
DB_PORT=$DB_PORT \
DB_READ_DEFAULT_FILE=$DB_READ_DEFAULT_FILE \
python copy-sqlite-to-mysql.py -id "$WIKI_ID"

DB_USER=$DB_USER \
DB_DATABASE=$DB_DATABASE \
DB_HOST=$DB_HOST \
DB_PORT=$DB_PORT \
DB_READ_DEFAULT_FILE=$DB_READ_DEFAULT_FILE \
python export-tables.py -id "$WIKI_ID" --path "$DATASET_PATH"

echo "Generated datasets in $DATASET_PATH"
echo "To publish the datasets, run \"WIKI_ID=$WIKI_ID ./publish-datasets.sh\""
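Since the steps above have to be repeated for every wiki in the list, a small pre-flight check can help catch wikis whose data directory is missing before the tables are recreated. The following is only a sketch under stated assumptions: the function name `missing_dataset_dirs` and the `DATA_ROOT` variable are hypothetical and not part of mwaddlink, and a directory that exists is not necessarily up to date, so this does not replace the check described in the warning.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the mwaddlink repository.
# Wiki IDs copied from the list in the task description.
WIKIS="simplewiki dewiki ptwiki elwiki frwiki enwiki arwiki viwiki cswiki bnwiki"

# Assumption: datasets live under ./data/<wiki id>, matching DATASET_PATH above.
DATA_ROOT="${DATA_ROOT:-$(pwd)/data}"

# Print a space-separated list of wikis that have no directory under DATA_ROOT.
# Prints an empty line when every directory is present.
missing_dataset_dirs() {
  missing=""
  for wiki in $WIKIS; do
    if [ ! -d "$DATA_ROOT/$wiki" ]; then
      missing="$missing $wiki"
    fi
  done
  # Strip the leading space left by the concatenation above.
  echo "${missing# }"
}
```

Calling `missing_dataset_dirs` from the repository root before the steps above prints the wikis that still need run-pipeline.sh to be run; an empty result only means the directories exist, not that their contents are current.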

Event Timeline

kostajh triaged this task as High priority. Tue, Apr 6, 11:23 AM
kostajh moved this task from Backlog to April 5 - April 9 on the Add-Link board.
kostajh moved this task from Incoming to In Progress on the Growth-Team (Current Sprint) board.

Change 677507 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] Remove backwards compatibility hack for datasets lacking the id column

https://gerrit.wikimedia.org/r/677507

Change 677508 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] Add ORDER BY clause to contains and get item queries

https://gerrit.wikimedia.org/r/677508

Change 677507 merged by jenkins-bot:

[research/mwaddlink@main] Remove backwards compatibility hack for datasets lacking the id column

https://gerrit.wikimedia.org/r/677507

Change 677508 abandoned by Kosta Harlan:

[research/mwaddlink@main] Add ORDER BY clause to contains and get item queries

Reason:

https://gerrit.wikimedia.org/r/677508