
Move behaviors/proxies/tunnels processing into an independent script
Closed, Resolved (Public)

Description

From https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/20#note_45767:

The script should be runnable independently and should generate an SQL file that can then be executed.

Details

Title | Reference | Author | Source Branch | Dest Branch
Remove updates to behaviors, proxies and tunnels from output-sql.js | repos/mediawiki/services/ipoid!63 | tchanders | remove-insert-properties | main
Add a script for populating behaviors, proxies and tunnels tables | repos/mediawiki/services/ipoid!48 | tchanders | import-properties | main

Event Timeline

I think this is worth doing for a couple of reasons:

  • Checking whether to insert known values for up to 25 million actors every day is a lot of unnecessary work (see analysis below)
  • Inserting behaviors/proxies/tunnels just before inserting rows in the map tables that use them creates dependencies which make batching more complicated (see T344272)

Profiling

I did some profiling of output-diff.js and found that inserting the behaviors/proxies/tunnels during the update takes up a large proportion of the running time, and uses a large proportion of memory.

Here's where we do it: https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/blob/main/output-diff.js#L261

We gradually build arrays of behaviors, proxies and tunnels, potentially appending data from the latest actor to each, then ensuring there are no duplicates. We spend a lot of time in the isUnique function:

[image.png: profiler screenshot, 201 KB]

And of course those arrays, and the objects whose properties they reference, use up more memory:

Without arrays | With arrays
image.png (671×1 px, 182 KB) | image.png (686×1 px, 202 KB)

Time to run script on full dataset with the arrays: 13m38.731s
Time to run script on full dataset without the arrays: 5m28.828s
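To illustrate why the duplicate check described above gets expensive, here is a minimal sketch of the pattern. It is not the code from output-diff.js: `collectWithArray`, `collectWithSet`, the `sampleActors` records and their values are all hypothetical, and the real isUnique function may work differently. The array version rescans the accumulated list for every candidate; the Set version is shown only as a point of comparison.

```javascript
// Hypothetical actor records; the real feed has a different, richer shape.
const sampleActors = [
  { behaviors: ['B1', 'B2'] },
  { behaviors: ['B2', 'B3'] },
  { behaviors: ['B1'] }
];

// Array-based accumulation: every candidate triggers a linear scan of the
// values collected so far, so the cost grows with the size of the array.
function collectWithArray(actors, key) {
  const values = [];
  for (const actor of actors) {
    for (const candidate of actor[key]) {
      if (!values.includes(candidate)) {
        values.push(candidate);
      }
    }
  }
  return values;
}

// Set-based accumulation (an alternative for comparison, not what the
// script does): the Set discards duplicates without a linear scan.
function collectWithSet(actors, key) {
  const values = new Set();
  for (const actor of actors) {
    for (const candidate of actor[key]) {
      values.add(candidate);
    }
  }
  // A Set iterates in insertion order, so the result matches the array version.
  return [...values];
}
```

Either way the values are still accumulated in memory for the whole run, which is the part that moving this work into a separate script avoids.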

The script should be runnable independently and should generate an SQL file that can then be executed.

The dataset is so small that we can just store it as a JSON file.
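A rough sketch of what such a script could look like, assuming the JSON holds one list of names per table. The table names come from this task, but the input shape, the `value` column name, the `generateSql` and `escapeSqlString` helpers, and the use of MariaDB/MySQL-style `INSERT IGNORE` are all assumptions made for illustration, not the actual ipoid schema or script.

```javascript
// Minimal escaping for illustration only; a real script should rely on the
// database driver's escaping rather than this hypothetical helper.
function escapeSqlString(value) {
  return String(value).replace(/\\/g, '\\\\').replace(/'/g, "''");
}

// Builds one INSERT statement per known value. INSERT IGNORE (an assumption
// about the target database) skips rows that already exist, so the generated
// file could be run more than once.
function generateSql(properties) {
  const tables = ['behaviors', 'proxies', 'tunnels'];
  const statements = [];
  for (const table of tables) {
    for (const value of properties[table] || []) {
      statements.push(
        `INSERT IGNORE INTO ${table} (value) VALUES ('${escapeSqlString(value)}');`
      );
    }
  }
  return statements.join('\n') + '\n';
}

// Hypothetical dataset, standing in for the contents of the JSON file.
const exampleProperties = {
  behaviors: ['EXAMPLE_BEHAVIOR'],
  proxies: ['EXAMPLE_PROXY'],
  tunnels: ['EXAMPLE_TUNNEL']
};

const exampleSql = generateSql(exampleProperties);
```

The returned string could then be written out with Node's `fs.writeFileSync` and run against the database separately from the daily update.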

We also need to make sure the other scripts don't try to populate these tables.

tchanders updated https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/63

Remove updates to behaviors, proxies and tunnels from output-sql.js