Change Details

Create an instance in the new tech wishes WMCS project (T332040). It would be fine to allocate it all of the project resources (8 VCPU, 16GB RAM). This will be our runner node for HTML dump processing, and it should be considered only semi-permanent since it will be destroyed after this initial collection. The scraper can utilize all available processors, but we don't know what its memory profile will look like yet, only that concurrency is often memory-hungry.

An initial estimate of the required output storage is 250M articles x 1kiB per row, or roughly 250GiB, which tells us that we'll need to compress the intermediate outputs. Gzip-compressed, we use more like 16 bytes per row, or roughly 4GiB. Let's assume 10GiB to be safe.

Create a block device of maybe 10GB, attach it to the new instance, and mount it. We will store scraper intermediate and final outputs here, and it will persist beyond the lifetime of the runner node.

* [x] Create project
* [x] Create instance
* [ ] Attach /public/dumps
* [x] Provision elixir
* [ ] Clone scraper source and test

== Access ==

ssh runner.dump-references-processor.eqiad1.wikimedia.cloud

== Notes ==

Manual provisioning steps:

```
# install the Erlang Solutions package repository
wget https://packages.erlang-solutions.com/erlang-solutions_2.0_all.deb
sudo dpkg -i erlang-solutions_2.0_all.deb

# bring the system up to date, then install Erlang and make
sudo apt update
sudo apt upgrade
sudo apt dist-upgrade
sudo apt autoremove
sudo apt install -y esl-erlang make

# fetch the Elixir source and make it writable by the wikidev group
sudo git clone https://github.com/elixir-lang/elixir.git --depth 1 --branch v1.14.3 /srv/elixir
sudo chgrp -R wikidev /srv/elixir
sudo chmod -R g+w /srv/elixir/

# as regular user
cd /srv/elixir
make clean test

# as root
sudo make install
```
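
As a sanity check on the storage estimate in the description, the arithmetic can be reproduced with a short shell sketch. The row count and per-row sizes are the figures quoted above. The variable names are only illustrative, and the script assumes a shell with 64-bit integer arithmetic.

```shell
#!/bin/sh
# Back-of-the-envelope storage estimate using the figures from the task description.
rows=250000000          # ~250M articles, one output row each
raw_bytes_per_row=1024  # 1 KiB per uncompressed row
gz_bytes_per_row=16     # approximate gzip-compressed size per row

mib=$((1024 * 1024))
gib=$((1024 * mib))

raw_total=$((rows * raw_bytes_per_row))
gz_total=$((rows * gz_bytes_per_row))

# Integer division rounds down.
raw_gib=$((raw_total / gib))
gz_mib=$((gz_total / mib))

echo "uncompressed:    ${raw_gib} GiB"
echo "gzip-compressed: ${gz_mib} MiB"
```

This gives about 238 GiB uncompressed and about 3814 MiB compressed, so the rounded figures in the description hold, and a 10GB volume leaves a bit more than 2x headroom over the compressed estimate.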