Create an instance in the new tech wishes WMCS project (T332040); it's fine to allocate all of the project's resources (8 vCPU, 16 GB RAM). This will be our runner node for HTML dump processing, and it should be considered only semi-permanent: it will be destroyed after this initial collection. The scraper can use all available processors, but we don't yet know its memory profile, only that concurrency is often memory-hungry.
An initial estimate of the required output storage is 250M articles x 1 KiB per row, roughly 250 GiB, which tells us we'll need to compress the intermediate outputs. Gzip-compressed, we expect more like 16 bytes per row, or about 4 GiB. Let's assume 10 GiB to be safe.
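The arithmetic above can be sanity-checked in the shell (the row count and per-row sizes are the estimates from this note, not measurements):

```shell
rows=250000000                                   # ~250M articles
# uncompressed, 1 KiB/row -> ~238 GiB
echo $(( rows * 1024 / (1024*1024*1024) ))
# gzipped, ~16 B/row -> ~3.7 GiB (integer division truncates to 3)
echo $(( rows * 16 / (1024*1024*1024) ))
```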
Create a block device of around 10 GiB, attach it to the new instance, and mount it. We will store scraper intermediate and final outputs here; the volume will persist beyond the lifetime of the runner node.
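A minimal sketch of formatting and mounting the attached volume. The device name /dev/sdb and the mount point /srv/dumps/outputs are assumptions (verify the device with lsblk after attaching), not details from these notes:

```shell
# Assumption: the attached volume shows up as /dev/sdb (check lsblk first)
sudo mkfs.ext4 -L scraper-out /dev/sdb
# Assumption: outputs live under /srv/dumps/outputs (hypothetical path)
sudo mkdir -p /srv/dumps/outputs
sudo mount LABEL=scraper-out /srv/dumps/outputs
# Persist the mount across reboots
echo 'LABEL=scraper-out /srv/dumps/outputs ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab
```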
- Create project
- Create instance
- Attach /public/dumps
- Provision elixir
- Clone scraper source and test
Access
ssh runner.dump-references-processor.eqiad1.wikimedia.cloud
Notes
Manual provisioning steps:
mkdir -p /srv/dumps/inputs
ln -s /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/enterprise_html/runs/20230320 /srv/dumps/inputs
wget https://packages.erlang-solutions.com/erlang-solutions_2.0_all.deb
sudo dpkg -i erlang-solutions_2.0_all.deb
apt update
apt upgrade
apt dist-upgrade
apt autoremove
apt install -y esl-erlang make gcc g++ libc6-dev cmake
git clone https://github.com/elixir-lang/elixir.git --depth 1 --branch v1.14.3 /srv/elixir
chgrp -R wikidev /srv/elixir
chmod -R g+w /srv/elixir/
git clone https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump.git /srv/scrape-wiki-html-dump
chgrp -R wikidev /srv/scrape-wiki-html-dump/
chmod -R g+w /srv/scrape-wiki-html-dump/

# as regular user
cd /srv/elixir
make clean test

# as root
make install

# as regular user
cd /srv/scrape-wiki-html-dump
mix deps.get
mix compile
mix test
Verified working:
mix run parse_wiki.exs /srv/dumps/inputs/20230320/afwikibooks-NS0-20230320-ENTERPRISE-HTML.json.tar.gz > /tmp/out.ndjson
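To scale the verified single-wiki run to the full dump set, a loop along these lines should work. The output directory /srv/dumps/outputs is an assumed path (not from these notes), and piping through gzip follows the compression estimate above:

```shell
# Assumption: gzipped outputs go to /srv/dumps/outputs (hypothetical path)
mkdir -p /srv/dumps/outputs
cd /srv/scrape-wiki-html-dump
for tarball in /srv/dumps/inputs/20230320/*-NS0-20230320-ENTERPRISE-HTML.json.tar.gz; do
  # e.g. afwikibooks-NS0-20230320-ENTERPRISE-HTML
  base=$(basename "$tarball" .json.tar.gz)
  mix run parse_wiki.exs "$tarball" | gzip > "/srv/dumps/outputs/${base}.ndjson.gz"
done
```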