
Scraper: visual progress bar
Closed, Resolved · Public

Description

Design and code a light UI for the scraper. Show parallel cursors moving through 0-100% of the source files as a stack of progress bars. The view rolls up concurrent workers (e.g. using the checkpoint counter to show overall progress), but is partitioned by wiki, with one bar per wiki.

Print a count of rows processed, and the percentage of total lines.
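
A minimal sketch of the output described above, using only the Elixir standard library rather than the owl library. The module name `ProgressSketch`, the bar width, and the shape of the `progress` map are assumptions made for illustration and are not taken from the merge request.

```elixir
defmodule ProgressSketch do
  # Hypothetical bar width, chosen for illustration only.
  @bar_width 20

  # Render one line with the bar, the percentage of total lines,
  # and the count of rows processed.
  def render(label, processed, total) when total > 0 do
    # Clamp at 100% in case the total was underestimated.
    ratio = min(processed / total, 1.0)
    filled = round(ratio * @bar_width)
    bar = String.duplicate("#", filled) <> String.duplicate("-", @bar_width - filled)
    percent = :erlang.float_to_binary(ratio * 100, decimals: 1)
    "#{label} [#{bar}] #{String.pad_leading(percent, 5)}% (#{processed} rows)"
  end

  # Stack one bar per wiki. `progress` maps wiki name => {processed, total}
  # (an assumed shape, not the scraper's actual state).
  def render_all(progress) do
    progress
    |> Enum.sort()
    |> Enum.map_join("\n", fn {wiki, {processed, total}} ->
      render(wiki, processed, total)
    end)
  end
end

IO.puts(ProgressSketch.render_all(%{"dewiki" => {250, 1_000}, "enwiki" => {900, 1_200}}))
# dewiki [#####---------------]  25.0% (250 rows)
# enwiki [###############-----]  75.0% (900 rows)
```

The `total` argument is the piece that depends on the dump statistics: without a known line or article count per wiki, only the row count can be shown reliably.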

Screencast demonstrating the owl library (sample code, asciinema):

[Image: image.png]

An efficient implementation depends on a dump statistic that is currently missing, "number of pages in Main namespace"; see subtask T332858: Enterprise HTML dump stats should include file size and article count.

Code to review:
https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/51

Event Timeline

I'm pasting a woefully incomplete attempt to report on progress every N rows, to spark conversation.

diff --git a/lib/html_dump.ex b/lib/html_dump.ex
index ab6e9a9..0731873 100644
--- a/lib/html_dump.ex
+++ b/lib/html_dump.ex
@@ -1,4 +1,6 @@
 defmodule Wiki.HtmlDump do
+  @window_size 1_000
+
   def parse_file(path) do
     path
     |> File.stream!()
@@ -6,10 +8,17 @@ defmodule Wiki.HtmlDump do
     |> Enum.to_list()
   end
 
-  def parse_stream(%File.Stream{} = input) do
+  def parse_stream(input) do
+    window = Flow.Window.count(@window_size)
+
     input
     |> Flow.from_enumerable()
     |> Flow.filter(&(&1 != ""))
+    |> Flow.partition(window: window)
+    |> Flow.on_trigger(fn acc ->
+      IO.puts("hash")
+      {:nothing, acc}
+    end)
     |> Flow.map(&parse_line/1)
   end
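
For comparison, the "report every N rows" idea can be sketched with plain `Stream` functions from the standard library instead of Flow windows and triggers. This is a swapped-in technique, not a fix for the diff above: it only covers a single worker, and the module name `ProgressEveryN` and the `report` callback are hypothetical.

```elixir
defmodule ProgressEveryN do
  @window_size 1_000

  # Pass lines through unchanged and call `report` with the running row
  # count after every `window_size` rows. Single process, no Flow.
  def with_progress(lines, report, window_size \\ @window_size) do
    lines
    |> Stream.with_index(1)
    |> Stream.map(fn {line, count} ->
      if rem(count, window_size) == 0, do: report.(count)
      line
    end)
  end
end

# Usage with 2,500 fake rows and the default window of 1,000.
1..2_500
|> Stream.map(&"row #{&1}")
|> ProgressEveryN.with_progress(&IO.puts("#{&1} rows processed"))
|> Stream.run()
# 1000 rows processed
# 2000 rows processed
```

With concurrent workers, each one would need to send its count to a shared process or counter, which is presumably where the checkpoint counter mentioned in the description comes in.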

thiemowmde renamed this task from "Visual progress bar" to "Visual progress bar for the HTML dump scraper". Apr 19 2023, 11:40 AM
awight renamed this task from "Visual progress bar for the HTML dump scraper" to "Scraper: visual progress bar". Apr 21 2023, 9:36 AM
awight updated the task description.
awight moved this task from Doing to Tech Review on the WMDE-TechWish-Sprint-2023-04-19 board.
awight updated the task description.