
New cloud instance and attached volume for dumps processing
Closed, ResolvedPublic

Description

Create an instance in the new tech wishes WMCS project (T332040); it's fine to allocate all of the project's resources to it (8 vCPU, 16 GB RAM). This will be our runner node for HTML dump processing, and it should be considered only semi-permanent, since it will be destroyed after this initial collection. The scraper can utilize all available processors, but we don't know what its memory profile will look like yet, only that concurrency is often memory-hungry.
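
For reference, creating the instance might look roughly like this with the OpenStack CLI; the flavor, image, and network names below are placeholders matching the resources above, not taken from this task:

openstack server create \
  --flavor g3.cores8.ram16.disk20 \
  --image debian-11.0-bullseye \
  --network lan-flat-cloudinstances2b \
  runner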

An initial estimate of the required output storage is 250M articles × 1 KiB per row, or roughly 250 GiB, which tells us that we'll need to compress the intermediate outputs. Gzip-compressed, we expect more like 16 bytes per row, or about 4 GiB. Let's assume 10 GiB to be safe.
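
The arithmetic behind those numbers, as a quick shell sanity check (integer division, so results are truncated):

# 250M rows at 1 KiB each, expressed in GiB
echo $(( 250000000 * 1024 / 1024 / 1024 / 1024 ))   # 238
# 250M rows at ~16 bytes each (gzip-compressed), expressed in GiB
echo $(( 250000000 * 16 / 1024 / 1024 / 1024 ))     # 3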

Create a block device of maybe 10 GiB, attach it to the new instance, and mount it. We will store scraper intermediate and final outputs here, and it will persist beyond the lifetime of the runner node.
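
A rough sketch of the volume setup with the OpenStack CLI, assuming the instance is called runner and the new device shows up as /dev/sdb (both names are assumptions):

openstack volume create --size 10 dumps-output
openstack server add volume runner dumps-output

# on the instance, as root
mkfs.ext4 /dev/sdb
mkdir -p /srv/dumps
mount /dev/sdb /srv/dumps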

  • Create project
  • Create instance
  • Attach /public/dumps
  • Provision elixir
  • Clone scraper source and test

Access

ssh runner.dump-references-processor.eqiad1.wikimedia.cloud
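
Connecting from outside WMCS typically goes through a bastion; an ~/.ssh/config entry along these lines should work (the bastion host and username are assumptions, not taken from this task):

Host *.wikimedia.cloud
    User <shell-username>
    ProxyJump bastion.wmcloud.org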

Notes

Manual provisioning steps:

# link the 20230320 Enterprise HTML dump run from the NFS dumps share into the inputs directory
mkdir -p /srv/dumps/inputs
ln -s /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230320 /srv/dumps/inputs

# install the Erlang Solutions repository, Erlang/OTP, and build dependencies
wget https://packages.erlang-solutions.com/erlang-solutions_2.0_all.deb
sudo dpkg -i erlang-solutions_2.0_all.deb

sudo apt update
sudo apt upgrade
sudo apt dist-upgrade
sudo apt autoremove

sudo apt install -y esl-erlang make gcc g++ libc6-dev cmake

# build Elixir v1.14.3 from source and clone the scraper
git clone https://github.com/elixir-lang/elixir.git --depth 1 --branch v1.14.3 /srv/elixir
chgrp -R wikidev /srv/elixir
chmod -R g+w /srv/elixir

git clone https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump.git /srv/scrape-wiki-html-dump
chgrp -R wikidev /srv/scrape-wiki-html-dump
chmod -R g+w /srv/scrape-wiki-html-dump

# as regular user
cd /srv/elixir
make clean test

# as root
make install

# as regular user
cd /srv/scrape-wiki-html-dump
mix deps.get
mix compile
mix test

Verified working:

mix run parse_wiki.exs /srv/dumps/inputs/20230320/afwikibooks-NS0-20230320-ENTERPRISE-HTML.json.tar.gz > /tmp/out.ndjson
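
Since intermediate outputs are meant to be gzip-compressed, the same invocation can be piped through gzip onto the attached volume; the output path below is an assumption:

mix run parse_wiki.exs /srv/dumps/inputs/20230320/afwikibooks-NS0-20230320-ENTERPRISE-HTML.json.tar.gz | gzip > /srv/dumps/outputs/afwikibooks-NS0-20230320.ndjson.gz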

Event Timeline

awight renamed this task from "New cloud instance and detached volume for dumps processing" to "New cloud instance and attached volume for dumps processing". Mar 23 2023, 10:54 AM
awight updated the task description.
awight claimed this task.
awight updated the task description.