Create an instance in the new tech wishes WMCS project (T332040); it's fine to allocate all of the project's resources (8 vCPU, 16 GB RAM). This will be our runner node for HTML dump processing, and it should be considered only semi-permanent: it will be destroyed after this initial collection. The scraper can use all available processors, but we don't yet know its memory profile, only that concurrency is often memory-hungry.
An initial estimate of the required output storage is 250M articles x 1 KiB per row, roughly 250 GiB, which tells us we'll need to compress the intermediate outputs. Gzip-compressed, we expect more like 16 bytes per row, or about 4 GiB. Let's assume 10 GiB to be safe.
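The arithmetic above can be sanity-checked in the shell (the row count and per-row sizes are the estimates from this note, not measurements):

```shell
rows=250000000                                   # ~250M articles
# uncompressed, 1 KiB/row -> ~238 GiB
echo $(( rows * 1024 / (1024*1024*1024) ))
# gzipped, ~16 B/row -> ~3.7 GiB (integer division truncates to 3)
echo $(( rows * 16 / (1024*1024*1024) ))
```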
Create a block device of around 10 GiB, attach it to the new instance, and mount it. We will store scraper intermediate and final outputs here; the volume will persist beyond the lifetime of the runner node.
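A minimal sketch of formatting and mounting the attached volume. The device name /dev/sdb and the mount point /srv/dumps/outputs are assumptions (verify the device with lsblk after attaching), not details from these notes:

```shell
# Assumption: the attached volume shows up as /dev/sdb (check lsblk first)
sudo mkfs.ext4 -L scraper-out /dev/sdb
# Assumption: outputs live under /srv/dumps/outputs (hypothetical path)
sudo mkdir -p /srv/dumps/outputs
sudo mount LABEL=scraper-out /srv/dumps/outputs
# Persist the mount across reboots
echo 'LABEL=scraper-out /srv/dumps/outputs ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
```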
- Create project
- Create instance
- Attach /public/dumps
- Provision elixir
- Clone scraper source and test
Access
ssh runner.dump-references-processor.eqiad1.wikimedia.cloud
Notes
Manual provisioning steps:
mkdir -p /srv/dumps/inputs
ln -s /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230320 /srv/dumps/inputs
wget https://packages.erlang-solutions.com/erlang-solutions_2.0_all.deb
sudo dpkg -i erlang-solutions_2.0_all.deb
apt update
apt upgrade
apt dist-upgrade
apt autoremove
apt install -y esl-erlang make gcc g++ libc6-dev cmake
git clone https://github.com/elixir-lang/elixir.git --depth 1 --branch v1.14.3 /srv/elixir
chgrp -R wikidev /srv/elixir
chmod -R g+w /srv/elixir/
git clone https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump.git /srv/scrape-wiki-html-dump
chgrp -R wikidev /srv/scrape-wiki-html-dump/
chmod -R g+w /srv/scrape-wiki-html-dump/

# as regular user
cd /srv/elixir
make clean test

# as root
make install

# as regular user
cd /srv/scrape-wiki-html-dump
mix deps.get
mix compile
mix test
Verified working:
mix run parse_wiki.exs /srv/dumps/inputs/20230320/afwikibooks-NS0-20230320-ENTERPRISE-HTML.json.tar.gz > /tmp/out.ndjson
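To scale the verified single-wiki run to the full dump set, a loop along these lines should work. The output directory /srv/dumps/outputs is an assumed path (not from these notes), and piping through gzip follows the compression estimate above:

```shell
# Assumption: gzipped outputs go to /srv/dumps/outputs (hypothetical path)
mkdir -p /srv/dumps/outputs
cd /srv/scrape-wiki-html-dump
for tarball in /srv/dumps/inputs/20230320/*-NS0-20230320-ENTERPRISE-HTML.json.tar.gz; do
  # e.g. afwikibooks-NS0-20230320-ENTERPRISE-HTML
  base=$(basename "$tarball" .json.tar.gz)
  mix run parse_wiki.exs "$tarball" | gzip > "/srv/dumps/outputs/${base}.ndjson.gz"
done
```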