Create an instance in the new tech wishes WMCS project (T332040). It would be fine to allocate all of the project's resources to it (8 vCPUs, 16 GB RAM). This will be our runner node for HTML dump processing, and it should be considered only semi-permanent, since it will be destroyed after this initial collection. The scraper can use all available processors, but we don't yet know what its memory profile will look like, only that concurrency is often memory-hungry.
An initial estimate of the required output storage is 250M articles x 1 KiB per row, or roughly 250 GB, which tells us that we'll need to compress the intermediate outputs. Gzip-compressed, we use more like 16 bytes per row, or about 4 GB in total. Let's assume 10 GB to be safe.
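The arithmetic behind this estimate can be reproduced with a short shell sketch. The row count and per-row sizes are the rough figures from the estimate above, not measurements; the variable and function names are illustrative only. The results are expressed in exact binary units, so they differ slightly from the rounded figures in the text.

```shell
#!/bin/sh
# Back-of-the-envelope storage estimate. All inputs are the rough figures
# from the text, not values measured by this script.
rows=250000000           # ~250M articles
raw_bytes_per_row=1024   # ~1 KiB per uncompressed row
gz_bytes_per_row=16      # ~16 bytes per row after gzip

# Totals in bytes (relies on 64-bit shell arithmetic, which dash and bash
# provide on 64-bit systems).
raw_total=$((rows * raw_bytes_per_row))
gz_total=$((rows * gz_bytes_per_row))

# Convert a byte count to GiB with one decimal place; awk does the
# floating-point division.
to_gib() {
    awk -v bytes="$1" 'BEGIN { printf "%.1f", bytes / (1024 * 1024 * 1024) }'
}

echo "uncompressed:    $(to_gib "$raw_total") GiB"
echo "gzip-compressed: $(to_gib "$gz_total") GiB"
```

With these inputs the script reports about 238.4 GiB uncompressed and 3.7 GiB compressed, so a 10 GB volume leaves a bit more than 2x headroom if the 16 bytes per row figure holds.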
Create a block device of about 10 GB, attach it to the new instance, and mount it. We will store the scraper's intermediate and final outputs here, and the volume will persist beyond the lifetime of the runner node.
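Once the volume is attached, formatting and mounting it could look roughly like the sketch below. The device name `/dev/sdb`, the mount point `/srv/scraper`, the ext4 filesystem, and the `run` helper are all assumptions made for illustration; check `lsblk` for the real device before formatting anything. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch of formatting and mounting a freshly attached volume.
# DEVICE and MOUNTPOINT are hypothetical defaults, not values from the task.
DEVICE="${DEVICE:-/dev/sdb}"
MOUNTPOINT="${MOUNTPOINT:-/srv/scraper}"

# Print each command by default; set DRY_RUN=0 to execute them (as root).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run lsblk "$DEVICE"                 # confirm this is the new, empty volume
run mkfs.ext4 "$DEVICE"             # destroys any existing data on DEVICE
run mkdir -p "$MOUNTPOINT"
run mount "$DEVICE" "$MOUNTPOINT"
run df -h "$MOUNTPOINT"             # check the mounted size
```

A mount that should survive reboots would also need an `/etc/fstab` entry, which is left out here.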
== Access ==
ssh runner.dump-references-processor.eqiad1.wikimedia.cloud
== Notes ==
Manual provisioning steps:
```
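# as root (or prefix each command with sudo)
# add the Erlang Solutions package repository
wget https://packages.erlang-solutions.com/erlang-solutions_2.0_all.deb
dpkg -i erlang-solutions_2.0_all.deb
# bring the base system up to date
apt update
apt upgrade
apt dist-upgrade
apt autoremove
# install Erlang and the build tooling
apt install -y esl-erlang make
# fetch the Elixir source and make it writable by the wikidev group
git clone https://github.com/elixir-lang/elixir.git --depth 1 --branch v1.14.3 /srv/elixir
chgrp -R wikidev /srv/elixir
chmod -R g+w /srv/elixir/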
# as regular user
cd /srv/elixir
make clean test
# as root
make install
```
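After the steps above, a quick sanity check can confirm that the toolchain is on the `PATH`. This is a minimal sketch: the `check_tool` helper is made up for illustration, and it assumes `make install` placed `elixir`, `mix`, and `iex` in a directory that is already on the `PATH`.

```shell
#!/bin/sh
# Report whether a tool is on PATH; returns 0 if found, 1 otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1 -> $(command -v "$1")"
    else
        echo "missing: $1"
        return 1
    fi
}

# erl comes from esl-erlang; elixir, mix and iex come from `make install`.
missing=0
for tool in erl elixir mix iex; do
    check_tool "$tool" || missing=$((missing + 1))
done
echo "$missing tool(s) missing"

# If Elixir is present, print its version; the clone above pins v1.14.3.
if command -v elixir >/dev/null 2>&1; then
    elixir --version
fi
```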