awight renamed T363293: Aggregate some numbers from scraper results from Aggregate some numbers from Scraper results to Aggregate some numbers from scraper results.

Wed, Apr 24, 11:43 AM · WMDE-TechWish-Sprint-2024-04-24, WMDE-References-FocusArea

Tue, Apr 23

awight added a comment to T362904: Scraper: track with production Prometheus.

Sneak preview for those playing at home:

Tue, Apr 23, 8:25 PM · Patch-For-Review, Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight added a comment to T362904: Scraper: track with production Prometheus.

When the metrics land, they should appear on https://prometheus-eqiad.wikimedia.org/analytics/targets?search=wmde_tewu .

Tue, Apr 23, 1:23 PM · Patch-For-Review, Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight moved T357611: Re-run the scraper on a limited set of wikis from Doing to Watching / Epic / Stalled on the WMDE-TechWish-Sprint-2024-04-12 board.

Tue, Apr 23, 10:10 AM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight updated the task description for T357611: Re-run the scraper on a limited set of wikis.

Tue, Apr 23, 10:10 AM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight moved T357611: Re-run the scraper on a limited set of wikis from Sprint Backlog to Doing on the WMDE-TechWish-Sprint-2024-04-12 board.

Tue, Apr 23, 10:10 AM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

Mon, Apr 22

awight added a project to T357613: Measure the reference use and re-use in VE: Epic.

Mon, Apr 22, 1:13 PM · WMDE-TechWish-Sprint-2024-04-24, Epic, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-03-27, WMDE-TechWish-Sprint-2024-03-13, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight placed T362904: Scraper: track with production Prometheus up for grabs.

Mon, Apr 22, 1:09 PM · Patch-For-Review, Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight moved T357611: Re-run the scraper on a limited set of wikis from Demo to Sprint Backlog on the WMDE-TechWish-Sprint-2024-04-12 board.

Mon, Apr 22, 1:08 PM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight added a parent task for T362904: Scraper: track with production Prometheus: T357611: Re-run the scraper on a limited set of wikis.

Mon, Apr 22, 12:38 PM · Patch-For-Review, Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight added a subtask for T357611: Re-run the scraper on a limited set of wikis: T362904: Scraper: track with production Prometheus.

Mon, Apr 22, 12:38 PM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight claimed T362904: Scraper: track with production Prometheus.

Mon, Apr 22, 12:38 PM · Patch-For-Review, Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight added projects to T362904: Scraper: track with production Prometheus: WMDE-TechWish-Sprint-2024-04-12, Unplanned-Sprint-Work.

This will simplify how we share monitoring duty during the long-running scrape job.

Mon, Apr 22, 12:38 PM · Patch-For-Review, Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight created T363096: [Refactor] Get rid of "auto/<#>" and "literal/<name>" internal ref IDs.

Mon, Apr 22, 12:18 PM · WMDE-TechWish-Sprint-2024-04-24, Patch-For-Review, VisualEditor, VisualEditor-MediaWiki-References, WMDE-TechWish-Maintenance, WMDE-TechWish-Sprint-2024-04-12, Cite

awight created T363095: [Refactor] New class to encapsulate Cite refs in VE.

Mon, Apr 22, 12:11 PM · VisualEditor, VisualEditor-MediaWiki-References, Cite, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight changed the status of T362358: Log events for copy and paste action around references in VE from Open to Stalled.

Stalled waiting for WMF legal review.

Mon, Apr 22, 11:35 AM · WMDE-TechWish-Sprint-2024-04-24, Patch-For-Review, WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight changed the status of T362358: Log events for copy and paste action around references in VE, a subtask of T357613: Measure the reference use and re-use in VE, from Open to Stalled.

Mon, Apr 22, 11:33 AM · WMDE-TechWish-Sprint-2024-04-24, Epic, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-03-27, WMDE-TechWish-Sprint-2024-03-13, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight moved T362358: Log events for copy and paste action around references in VE from Tech Review to Watching / Epic / Stalled on the WMDE-TechWish-Sprint-2024-04-12 board.

Mon, Apr 22, 11:33 AM · WMDE-TechWish-Sprint-2024-04-24, Patch-For-Review, WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight added a project to T362900: Investigate scraper performance drop-off: Patch-For-Review.

Mon, Apr 22, 11:26 AM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight placed T362900: Investigate scraper performance drop-off up for grabs.

Well, it could be simple after all. Articles at the end are on average twice as long (by HTML length).

Mon, Apr 22, 11:22 AM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight added a comment to T362900: Investigate scraper performance drop-off.

In this example, the segment on the left is processing the tail articles starting at the 2.6M'th row, and on the right we're processing the first articles in the dump.

Mon, Apr 22, 10:26 AM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight added a comment to T362900: Investigate scraper performance drop-off.

Very surprisingly to me, Hypothesis 4 seems to be the only validated theory. I haven't yet identified what makes the last articles harder to process, but the performance characteristics are almost perfectly repeatable when going back and forth between sets of articles at the beginning vs. the end of the dump. Initial articles can be processed at ~1.5k articles/s, and final articles at ~250 articles/s.

Mon, Apr 22, 10:20 AM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight updated the task description for T362900: Investigate scraper performance drop-off.

Mon, Apr 22, 10:04 AM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight updated the task description for T362900: Investigate scraper performance drop-off.

Mon, Apr 22, 9:57 AM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight updated the task description for T362900: Investigate scraper performance drop-off.

Mon, Apr 22, 9:51 AM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight updated the task description for T362900: Investigate scraper performance drop-off.

Mon, Apr 22, 9:35 AM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight updated the task description for T362900: Investigate scraper performance drop-off.

Mon, Apr 22, 7:42 AM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight claimed T362900: Investigate scraper performance drop-off.

Mon, Apr 22, 7:40 AM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight moved T362900: Investigate scraper performance drop-off from Sprint Backlog to Doing on the WMDE-TechWish-Sprint-2024-04-12 board.

Mon, Apr 22, 7:40 AM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

Fri, Apr 19

awight added a comment to T362900: Investigate scraper performance drop-off.

WIP on the low-level-concurrency branch will let us experiment with per-page timeouts and debugging.

Fri, Apr 19, 11:57 AM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight updated the task description for T362900: Investigate scraper performance drop-off.

Fri, Apr 19, 10:08 AM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

Thu, Apr 18

awight created T362904: Scraper: track with production Prometheus.

Thu, Apr 18, 3:36 PM · Patch-For-Review, Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight created T362900: Investigate scraper performance drop-off.

Thu, Apr 18, 3:17 PM · WMDE-TechWish-Sprint-2024-04-12, WMDE-References-FocusArea

awight added a comment to T354018: Duplicate articles in snapshot dump.

This may be related to T362894: Data quality: HTML dumps contain unexplainably outdated revisions of some pages. The duplicates seem to have various revision ids, here's a set showing that the article is included three times with the same title and page id, but at different versions:

tar xzf dewiki-NS0-20240201-ENTERPRISE-HTML.json.tar.gz -O | jq 'select(.name == "10.000 B.C.") | .identifier,.version.identifier'

Thu, Apr 18, 3:00 PM · Wikimedia Enterprise

awight updated the task description for T362894: Data quality: HTML dumps contain unexplainably outdated revisions of some pages.

Thu, Apr 18, 2:19 PM · Wikimedia Enterprise, Dumps-Generation, WMDE-References-FocusArea

awight created T362894: Data quality: HTML dumps contain unexplainably outdated revisions of some pages.

Thu, Apr 18, 2:18 PM · Wikimedia Enterprise, Dumps-Generation, WMDE-References-FocusArea

awight closed T362678: Package request: install elixir and erlang-otp to the analytics clients as Resolved.

@BTullis Thanks for highlighting this possibility! I tried the Conda environment as you suggested and it works perfectly for our needs. Even at high concurrency, the performance seems to be the same as in the bare metal environment I had cobbled together previously.

Thu, Apr 18, 12:29 PM · Data-Platform-SRE, Data-Engineering

Wed, Apr 17

awight updated the task description for T357611: Re-run the scraper on a limited set of wikis.

Wed, Apr 17, 12:22 PM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight added a comment to T357611: Re-run the scraper on a limited set of wikis.

Still seeing extreme swings in performance, following the same shape as before. Now with additional metrics:

Wed, Apr 17, 6:48 AM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight placed T350300: Scraper: emit additional diagnostics up for grabs.

Wed, Apr 17, 6:42 AM · Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight moved T350300: Scraper: emit additional diagnostics from Doing to Done on the WMDE-TechWish-Sprint-2024-04-12 board.

Wed, Apr 17, 6:42 AM · Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

Tue, Apr 16

awight added a comment to T350300: Scraper: emit additional diagnostics.

Implemented in https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/122

Tue, Apr 16, 4:30 PM · Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight claimed T350300: Scraper: emit additional diagnostics.

Tue, Apr 16, 3:43 PM · Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight moved T350300: Scraper: emit additional diagnostics from Sprint Backlog to Doing on the WMDE-TechWish-Sprint-2024-04-12 board.

Tue, Apr 16, 3:43 PM · Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight assigned T357611: Re-run the scraper on a limited set of wikis to WMDE-Fisch.

Tue, Apr 16, 3:41 PM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight added projects to T350300: Scraper: emit additional diagnostics: WMDE-TechWish-Sprint-2024-04-12, Unplanned-Sprint-Work.

Pulling this in because it would be nice to have, to debug the slowdown we see after the first 20 minutes or so.

Tue, Apr 16, 3:41 PM · Unplanned-Sprint-Work, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight moved T357611: Re-run the scraper on a limited set of wikis from Watching / Epic / Stalled to Doing on the WMDE-TechWish-Sprint-2024-04-12 board.

Tue, Apr 16, 3:41 PM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight added a comment to T362678: Package request: install elixir and erlang-otp to the analytics clients.

Some of these packages already appeary in debmonitor:

Tue, Apr 16, 3:34 PM · Data-Platform-SRE, Data-Engineering

awight updated the task description for T362678: Package request: install elixir and erlang-otp to the analytics clients.

Tue, Apr 16, 3:29 PM · Data-Platform-SRE, Data-Engineering

awight created T362678: Package request: install elixir and erlang-otp to the analytics clients.

Tue, Apr 16, 3:18 PM · Data-Platform-SRE, Data-Engineering

awight placed T362659: Scraping enterprise dumps: investigate incomplete article lists up for grabs.

Hmm, spot-checking is only turning up articles which were created or moved after the snapshot date.

Tue, Apr 16, 2:47 PM · WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight added a comment to T362659: Scraping enterprise dumps: investigate incomplete article lists.

Duplicates: each copy of a page comes with a different revid, and checking the final counts we can see that our deduplication did catch the extra copies during the aggregation step:

Tue, Apr 16, 2:34 PM · WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight claimed T362659: Scraping enterprise dumps: investigate incomplete article lists.

Tue, Apr 16, 2:21 PM · WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight placed T357611: Re-run the scraper on a limited set of wikis up for grabs.

Tue, Apr 16, 2:21 PM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight moved T357611: Re-run the scraper on a limited set of wikis from Doing to Watching / Epic / Stalled on the WMDE-TechWish-Sprint-2024-04-12 board.

Tue, Apr 16, 2:21 PM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight moved T362659: Scraping enterprise dumps: investigate incomplete article lists from Sprint Backlog to Doing on the WMDE-TechWish-Sprint-2024-04-12 board.

Tue, Apr 16, 2:20 PM · WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight added a parent task for T362659: Scraping enterprise dumps: investigate incomplete article lists: T357611: Re-run the scraper on a limited set of wikis.

Tue, Apr 16, 2:13 PM · WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight added a subtask for T357611: Re-run the scraper on a limited set of wikis: T362659: Scraping enterprise dumps: investigate incomplete article lists.

Tue, Apr 16, 2:13 PM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight added a project to T362659: Scraping enterprise dumps: investigate incomplete article lists: WMDE-TechWish-Sprint-2024-04-12.

Bringing this task into our sprint because it has data quality implications and probably blocks scraping for the moment.

Tue, Apr 16, 2:13 PM · WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight renamed T362659: Scraping enterprise dumps: investigate incomplete article lists from Enterprise dumps: investigate incomplete article lists to Scraping enterprise dumps: investigate incomplete article lists.

Tue, Apr 16, 2:12 PM · WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight added a comment to T362659: Scraping enterprise dumps: investigate incomplete article lists.

Oookay there are all kinds of things happening. Diffing the two lists, we can see that the scraper is still producing duplicates:

+1._Buch_Samuel
+1._Buch_Samuel
+1._Buch_Samuel
+1._Buch_Samuel
+1._Buch_Samuel

Tue, Apr 16, 2:11 PM · WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight added a comment to T362659: Scraping enterprise dumps: investigate incomplete article lists.

analytics-mysql dewiki -B -e 'select page_title from page where page_namespace=0 and page_is_redirect=0' > dewiki_all_pages_db.txt

Tue, Apr 16, 2:06 PM · WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight added a comment to T362659: Scraping enterprise dumps: investigate incomplete article lists.

Deleted articles shouldn't show up in either list, so Hypothesis #2 is also looking unlikely.

Tue, Apr 16, 1:58 PM · WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight added a comment to T362659: Scraping enterprise dumps: investigate incomplete article lists.

ApiQueryAllPages uses the page title to carry continuation state, which is very reasonable! This would only be fooled by page renames happening during the dump interval, which is possible but not likely to add up to 1%. Hypothesis #1 is looking unlikely.

Tue, Apr 16, 1:49 PM · WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Maintenance

awight updated the task description for T357611: Re-run the scraper on a limited set of wikis.

Tue, Apr 16, 1:34 PM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight added a comment to T357611: Re-run the scraper on a limited set of wikis.

There's still a small (<1%) gap in page count. Splitting a small investigation out as T362659.

Tue, Apr 16, 1:32 PM · WMDE-TechWish-Sprint-2024-04-24, WMDE-TechWish-Sprint-2024-04-12, WMDE-TechWish-Sprint-2024-02-28, WMDE-TechWish-Sprint-2024-02-15, WMDE-References-FocusArea

awight (Adam Wight)
User

Projects (19)
View All

Calendar

Today

Tomorrow

Monday

User Details

Recent Activity
View All

Yesterday

Thu, Apr 25

Wed, Apr 24

Tue, Apr 23

Mon, Apr 22

Fri, Apr 19

Thu, Apr 18

Wed, Apr 17

Tue, Apr 16

awight (Adam Wight)User

Projects (19)View All

Calendar

Today

Tomorrow

Monday

User Details

Recent ActivityView All

Yesterday

Thu, Apr 25

Wed, Apr 24

Tue, Apr 23

Mon, Apr 22

Fri, Apr 19

Thu, Apr 18

Wed, Apr 17

Tue, Apr 16

awight (Adam Wight)
User

Projects (19)
View All

Recent Activity
View All