Investigation: How can we improve the speed of the popular pages bot
Closed, Resolved · Public · 3 Estimated Story Points

Description

The Popular pages bot (source code) is working, but running too slowly to complete all the reports within 1 month. Ideally, we would like the bot to complete all reports within 1 week so that the reports are actually useful for informing the work priorities of the WikiProjects. It looks like there are 6 possible solutions for improving the speed of the bot:

  1. Adjust the throttling in ApiHelper::getMonthlyPageviews() (may hit pageview API throttling)
  2. Use promises to run more requests asynchronously (may cause memory issues)
  3. Run multiple instances of the bot simultaneously (may hit pageview API throttling)
  4. Cache the results of the pageview API in a database table or persistent cache such as Redis to eliminate redundant queries
  5. Get the pageview API to handle redirects (T121912) (probably not a short-term solution)
  6. Get the pageview API to handle requests for multiple titles in one request (probably not a short-term solution)

Please evaluate these 6 approaches and create a new task that specifies our short-term (one sprint) strategy for improving the speed of the bot.

Event Timeline

@Milimetric: Any thoughts on items 5 and 6 in the task description?

@kaldari: can you include a link to the source code for the bot? Maybe it will benefit from using the pageviews.js client in terms of speed and asynchronicity; we can also help with that.

https://github.com/wikimedia/popularpages

@Nuria: What's the pageviews.js client? Is there a link for that?

https://github.com/tomayac/pageviews.js

We could use this because it's Node-compatible and could live server-side, but the bot was written in PHP and preferably we'd stick to one language. We also have code to make asynchronous requests using Guzzle, but as pointed out in #2, it may cause memory issues, which we have already run into.

In my opinion, a combination of Guzzle promises and Redis is worth a try. I can't speak to the timeliness of making modifications to the pageview API itself (#5 and #6 above), but it would be really awesome if they became reality! :)
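
To make the caching half of that concrete, here is a minimal sketch of what a Redis-backed lookup could look like, assuming the phpredis extension is available; the key scheme, the one-week TTL, and the fetchViewsFromApi() helper are illustrative, not part of the bot's code:

$redis = new Redis();                 // requires the phpredis extension
$redis->connect( '127.0.0.1', 6379 ); // placeholder host/port

// Hypothetical wrapper: check Redis before hitting the pageview API, and store
// whatever we fetch so redundant queries (e.g. the same page appearing in many
// projects) are served from the cache.
function getMonthlyViewsCached( Redis $redis, $title, $month ) {
    $key = "pageviews:$month:$title";
    $cached = $redis->get( $key );
    if ( $cached !== false ) {
        return (int)$cached;
    }
    $views = fetchViewsFromApi( $title, $month ); // hypothetical pageview API wrapper
    $redis->setex( $key, 7 * 24 * 3600, $views ); // monthly data, so a week-long TTL is plenty
    return $views;
}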

One thing I'd like to know more about is the throttling. @Nuria how does that work? Does it enforce the 100 req/sec limitation on a per-IP basis? Are the Tool Labs IP(s) by any chance exempt?

I think the code can benefit from many improvements on the client side before you need backend improvements, given that there is currently no parallelization. The throttling is enforced at 100 reqs per sec per host, and we now have 6 hosts, so that is > 500 reqs per sec in total. These are fresh requests (i.e. cold cache), so I doubt you would run into that limit.

More info: https://wikitech.wikimedia.org/wiki/Analytics/Systems/AQS

Regarding solution #1, it looks like the script is currently waiting 0.01 seconds after every 99 requests, which basically has no effect: even 1.5 million requests would only accumulate roughly 2.5 minutes of sleep in total. So removing that throttle won't make any substantial improvement.

For reference, each request to the pageview API takes about 2 seconds to complete via curl.

Hmm, that seems quite high; it may be worth troubleshooting on your end. Even now, with the datacenter switchover, median latencies are 100 ms, and 300 ms at the 99th percentile.

https://grafana.wikimedia.org/dashboard/db/pageviews?orgId=1&from=now-30d&to=now

For reference, each request to the pageview API takes about 2 seconds to complete via curl.

I wonder where you saw that. From tool labs:

tools.popularpages-dev@tools-bastion-03:~$ time curl https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/user/Barack_Obama/monthly/2017030100/2017033100
{"items":[{"project":"en.wikipedia","article":"Barack_Obama","granularity":"monthly","timestamp":"2017030100","access":"all-access","agent":"user","views":838481}]}
real	0m0.265s
user	0m0.012s
sys	0m0.006s

My test was from my local machine. Wow, that's a huge difference!

FYI, our "usual" latencies (when not in switchover mode) are around 50 ms at the 99th percentile.

kaldari set the point value for this task to 3.May 2 2017, 11:20 PM
DannyH triaged this task as Medium priority.May 2 2017, 11:23 PM
DannyH moved this task from Needs Discussion to Up Next (June 3-21) on the Community-Tech board.

Another question: Would querying the database for page assessments be faster than going through the API? With the latter I see we have to loop through every 1000 pages with the wppcontinue parameter. I ran a query on Biography (the biggest WikiProject I think), and it took around 1.5 minutes to finish. How does the API compare?

I ran a query on Biography (the biggest WikiProject I think), and it took around 1.5 minutes to finish.

...and ~3 mins when joining on page to get the page title.
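
For comparison, a sketch of the direct query via PDO; the replica host and credentials are placeholders, and the column names assume the standard PageAssessments schema (page_assessments, page_assessments_projects):

// Sketch only: one query returns every assessed page for a project, with no
// wppcontinue-style pagination loop.
$pdo = new PDO( 'mysql:host=enwiki.labsdb;dbname=enwiki_p;charset=utf8', 'user', 'password' );
$stmt = $pdo->prepare(
    'SELECT p.page_title, pa.pa_class, pa.pa_importance
     FROM page_assessments pa
     JOIN page_assessments_projects pap ON pap.pap_project_id = pa.pa_project_id
     JOIN page p ON p.page_id = pa.pa_page_id
     WHERE pap.pap_project_title = ?'
);
$stmt->execute( [ 'Biography' ] );
$pages = $stmt->fetchAll( PDO::FETCH_ASSOC );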

If we assume there are 1.5 million biography articles, collecting all the articles from the API would take 1500 API requests. If we assume each request takes 1 second (which is probably pessimistic), that means it takes ~25 minutes to retrieve all the articles. So we could probably save about 20 minutes. Currently, generating the report for WikiProject Biography takes several days to complete, so I'm not sure if shaving off 20 minutes would be worth the effort. It's clearly a possibility for later optimization though.

While the purpose of this bot is awesome, the approach to getting the data is wrong. This kind of data should be fetched in batches, not in millions of tiny chunks. I'm not sure what the deadlines are, but a better approach would be:

  • add a page_wiki_projects field of type array<string> to the mediawiki_history dataset
  • join mediawiki_history to pageview_hourly and get all the data we need in one query
  • make it available as a dump file

With this approach we establish a valuable new field that already has lots of use cases, and you get the data you need fairly quickly.

This comment was removed by MusikAnimal.

I made https://github.com/wikimedia/popularpages/pull/5 to use promises for fetching redirects.
This PR also takes out throttling.
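
For context, this is roughly the shape of concurrent requests with Guzzle promises and a concurrency cap (illustrative only, not the PR's actual code; the endpoint template, the example titles, and the limit of 25 are assumptions):

use GuzzleHttp\Client;
use GuzzleHttp\Promise;

require 'vendor/autoload.php';

$client = new Client( [ 'base_uri' => 'https://wikimedia.org/api/rest_v1/' ] );
$titles = [ 'Barack_Obama', 'Novel', 'Olympic_Games' ]; // example input
$results = [];

// Generator keyed by title, so the callbacks below know which page each response is for.
$requests = function () use ( $client, $titles ) {
    foreach ( $titles as $title ) {
        yield $title => $client->getAsync(
            'metrics/pageviews/per-article/en.wikipedia/all-access/user/'
            . rawurlencode( $title ) . '/monthly/2017030100/2017033100'
        );
    }
};

Promise\each_limit(
    $requests(),
    25, // concurrency cap, staying well under the per-host throttle discussed above
    function ( $response, $title ) use ( &$results ) {
        $data = json_decode( (string)$response->getBody(), true );
        $results[$title] = isset( $data['items'][0]['views'] ) ? $data['items'][0]['views'] : 0;
    },
    function ( $reason, $title ) use ( &$results ) {
        $results[$title] = 0; // count failed lookups as zero views
    }
)->wait();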

I believe @kaldari did some testing with the above PR and found it to be ~3 times faster for one wikiproject. It's worth running the bot with the promises in place and seeing if that helps. If that does not, the other remaining option is to run multiple bots in parallel.

Alright, so I got the same Out of memory error on tool labs for project Biography. Here's the stack trace:

PHP Fatal error:  Out of memory (allocated 1585709056) (tried to allocate 32 bytes) in /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/psr7/src/MessageTrait.php on line 180
PHP Stack trace:
PHP   1. {main}() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/checkReports.php:0
PHP   2. ReportUpdater->updateReports() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/checkReports.php:22
PHP   3. ApiHelper->getMonthlyPageviews() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/ReportUpdater.php:47
PHP   4. GuzzleHttp\Promise\Promise->wait() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/ApiHelper.php:149
PHP   5. GuzzleHttp\Promise\Promise->waitIfPending() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/Promise.php:62
PHP   6. GuzzleHttp\Promise\Promise->invokeWaitList() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/Promise.php:225
PHP   7. GuzzleHttp\Promise\Promise->waitIfPending() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/Promise.php:267
PHP   8. GuzzleHttp\Promise\Promise->invokeWaitFn() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/Promise.php:223
PHP   9. GuzzleHttp\Promise\EachPromise->GuzzleHttp\Promise\{closure}() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/Promise.php:246
PHP  10. GuzzleHttp\Promise\Promise->wait() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/EachPromise.php:101
PHP  11. GuzzleHttp\Promise\Promise->waitIfPending() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/Promise.php:62
PHP  12. GuzzleHttp\Promise\Promise->invokeWaitList() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/Promise.php:225
PHP  13. GuzzleHttp\Promise\Promise->waitIfPending() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/Promise.php:267
PHP  14. GuzzleHttp\Promise\Promise->invokeWaitFn() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/Promise.php:223
PHP  15. GuzzleHttp\Handler\CurlMultiHandler->execute() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/Promise.php:246
PHP  16. GuzzleHttp\Handler\CurlMultiHandler->tick() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/guzzle/src/Handler/CurlMultiHandler.php:123
PHP  17. curl_multi_exec() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/guzzle/src/Handler/CurlMultiHandler.php:106
PHP  18. GuzzleHttp\Handler\CurlFactory->GuzzleHttp\Handler\{closure}() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/guzzle/src/Handler/CurlMultiHandler.php:106
PHP  19. GuzzleHttp\Handler\EasyHandle->createResponse() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php:516
PHP  20. GuzzleHttp\Psr7\Response->__construct() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/guzzle/src/Handler/EasyHandle.php:82
PHP  21. GuzzleHttp\Psr7\Response->setHeaders() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/psr7/src/Response.php:102
PHP  22. GuzzleHttp\Psr7\Response->trimHeaderValues() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/psr7/src/MessageTrait.php:151
PHP  23. array_map() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/psr7/src/MessageTrait.php:181
PHP  24. GuzzleHttp\Psr7\Response->GuzzleHttp\Psr7\{closure}() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/psr7/src/MessageTrait.php:181

I'm not sure about the cause for this. Biography is the biggest project and so far the only one to have caused this error. Ping @MusikAnimal @Samwilson

I'm still trying to wrap my head around why the heck it's using 1.5+ gigabytes of memory. For starters, would it be worth looking into how well garbage collection is working (which I think PHP is supposed to be good at)? gc_collect_cycles will force the collection of cycles even if the possible root buffer is not full yet, and return how many cycles were collected. What if we logged that on each iteration, just out of curiosity?
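
Something like this tiny helper would do it (the function name and logging format are just for illustration):

// Hypothetical diagnostics, called at the end of each iteration of the pageview
// loop: force a cycle collection and log how much was reclaimed alongside the
// current memory usage.
function logGcStats( $iteration ) {
    $collected = gc_collect_cycles();
    error_log( sprintf(
        'iteration %d: collected %d cycles, using %d bytes',
        $iteration,
        $collected,
        memory_get_usage( true )
    ) );
}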

As for the code itself, there may be some memory-intensive things going on behind the scenes (which is why it failed while settling the promises), but all of those steps are necessary, at least if we want to retain the speed advantage promises give us.

So, thinking of things that use memory that we can cut back on: what if we only kept track of pages that ended up having more than, say, 100 pageviews? After you've done your processing, check $results[$page] and if it's less than 100, unset it (but still add it to the total sum of pageviews for that WikiProject). I'm skeptical that this approach will really save much memory, but it might be worth a try. Also, while for most WikiProjects such pages probably wouldn't make the top X anyway, a small project like WikiProject Columbia, Missouri would have a fair number of pages missing from its report. For Biography, however, I bet it would reduce the size of the $results array substantially.

I think we can generalize that idea: after every round of fetching pageviews, we can sort the array and drop the entries that fall beyond the 500/1000 cutoff (from config). This might save us some memory, but it will surely cost us more time.
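
A rough sketch of that trimming step, with variable names that are assumptions rather than the bot's actual code:

// Hypothetical helper: fold a batch of pageview counts into the running results,
// keeping only the top $limit pages (500/1000 from config) in memory while still
// counting every page toward the project total.
function mergeAndTrim( array $results, array $batch, $limit, &$totalViews ) {
    $totalViews += array_sum( $batch );
    $results += $batch;  // page titles are unique string keys, so + is a safe merge
    arsort( $results );  // highest views first
    return array_slice( $results, 0, $limit, true ); // drop entries beyond the cutoff
}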

At this point, I'm kinda inclined to settle for running two/three instances of the bot and see how that goes. @kaldari, thoughts?

#4 sounds like it could be done in a fairly straightforward manner, using a temporary indexed database table.

  1. Go through each project's list of included pages, looking up view counts and appending project assessment data, and add these page entries to a temp table indexed by page name. When running each subsequent project, remove pages that are already in the temp table from the queries before running them. And so on, so you'll get (generally) faster and faster queries.
  2. Go through each project's list of included pages a second time, this time pulling only the top <specified number> of pages with the most views, sorted by views, from the temp table.

To make all this even faster, limit the original queries to pages having views more than a particular number so that the temp db table won't become ginormous.

What is especially beautiful about this is it should be fairly straightforward for any database programmer, without having to grok a caching mechanism.
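
A sketch of how that could look with PDO and a temporary table; the table name, the fetchViewsFromApi() helper, and the assumption that a single database connection lives for the whole bot run are all hypothetical:

// Hypothetical helper: look up monthly views for a list of titles, reusing any
// counts already stored in the temp table by earlier projects in the same run.
// MySQL keeps a TEMPORARY table for the lifetime of the connection.
function getViewsWithTempTable( PDO $pdo, array $titles ) {
    $pdo->exec( 'CREATE TEMPORARY TABLE IF NOT EXISTS pageview_cache (
        page_title VARBINARY(255) NOT NULL PRIMARY KEY,
        views INT UNSIGNED NOT NULL
    )' );
    $select = $pdo->prepare( 'SELECT views FROM pageview_cache WHERE page_title = ?' );
    $insert = $pdo->prepare( 'INSERT INTO pageview_cache (page_title, views) VALUES (?, ?)' );

    $results = [];
    foreach ( $titles as $title ) {
        $select->execute( [ $title ] );
        $cached = $select->fetchColumn();
        if ( $cached !== false ) {
            $results[$title] = (int)$cached; // already fetched for an earlier project
            continue;
        }
        $views = fetchViewsFromApi( $title ); // hypothetical pageview API wrapper
        $insert->execute( [ $title, $views ] );
        $results[$title] = $views;
    }
    return $results;
}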

@Niharika: Let's try throwing a gc_collect_cycles() at the end of the foreach ( $pages as $page ) { loop and re-running WikiProject Biography. If that doesn't help, we should look at options 3 and 4. I'm not opposed to Musikanimal's suggestion of debugging the memory issues further, but I'm worried it will end up being a deep rabbit hole.

@Niharika: Let's try throwing a gc_collect_cycles() at the end of the foreach ( $pages as $page ) { loop and re-running WikiProject Biography.

I did a quick test on my local and gc_collect_cycles didn't seem to do anything, returning 0 (nothing collected) every time. So I guess hold off on that for right now. I did find this Guzzle issue, which is probably related. We're not using GuzzleHttp\Pool but I'm suspicious that the same problem persists – that resources aren't released after the requests have finished. Going to try a few other things...

@Niharika: Let's try throwing a gc_collect_cycles() at the end of the foreach ( $pages as $page ) { loop and re-running WikiProject Biography. If that doesn't help, we should look at options 3 and 4. I'm not opposed to Musikanimal's suggestion of debugging the memory issues further, but I'm worried it will end up being a deep rabbit hole.

Done. I've also added a few unset statements to free up memory when we don't need the arrays. I added a new job 'biography' so we can keep a check on it.

@Stevietheman I can't find a reference for it now, but the pageviews API does do caching on their end so caching on our end might not be as big a speed boost as it looks.

@Niharika: Let's try throwing a gc_collect_cycles() at the end of the foreach ( $pages as $page ) { loop and re-running WikiProject Biography.

I did a quick test on my local and gc_collect_cycles didn't seem to do anything, returning 0 (nothing collected) every time. So I guess hold off on that for right now. I did find this Guzzle issue, which is probably related. We're not using GuzzleHttp\Pool but I'm suspicious that the same problem persists – that resources aren't released after the requests have finished. Going to try a few other things...

Thanks for looking into this! Let's see if unset() does anything for us.

....and nope.

PHP Fatal error:  Out of memory (allocated 1576009728) (tried to allocate 76 bytes) in /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/EachPromise.php on line 155
PHP Stack trace:
PHP   1. {main}() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/generateReport.php:0
PHP   2. ReportUpdater->updateReports() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/generateReport.php:30
PHP   3. ApiHelper->getMonthlyPageviews() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/ReportUpdater.php:47
PHP   4. GuzzleHttp\Promise\settle() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/ApiHelper.php:159
PHP   5. GuzzleHttp\Promise\each() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/functions.php:321
PHP   6. GuzzleHttp\Promise\EachPromise->promise() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/functions.php:354
PHP   7. GuzzleHttp\Promise\EachPromise->refillPending() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/EachPromise.php:77
PHP   8. GuzzleHttp\Promise\EachPromise->addPending() /mnt/nfs/labstore-secondary-tools-project/popularpages-dev/public_html/vendor/guzzlehttp/promises/src/EachPromise.php:121

@Stevietheman I can't find a reference for it now, but the pageviews API does do caching on their end so caching on our end might not be as big a speed boost as it looks.

Two possible issues here:

  1. How durable is the caching? I have noticed this process running for a few hours, then waiting a day and running a few more hours. At least with a temp table, it would be like a dedicated cache continuously until disposed of.
  2. Since pageviews data is being joined with page assessments data, are both components cached?

  1. How durable is the caching? I have noticed this process running for a few hours, then waiting a day and running a few more hours. At least with a temp table, it would be like a dedicated cache continuously until disposed of.

Good point. This could speed up the bot a bit.

  2. Since pageviews data is being joined with page assessments data, are both components cached?

The page assessment data is unique every time because every project has its own evaluation for the article.

The bot is currently using promises for fetching redirects and it's... lightning fast somehow. Examples below:

Before (~20 hours):

2017-04-25 19:26:59  Fetching pages and assessments for project Olympics
2017-04-25 19:28:09  Total number of pages fetched: 103560
2017-04-25 19:28:09  Fetching monthly pageviews
2017-04-26 15:09:41  Pageviews fetch complete

After (~2 hours 15 minutes!):

2017-05-11 18:12:27  Fetching pages and assessments for project Olympics
2017-05-11 18:15:44  Total number of pages fetched: 103935
2017-05-11 18:16:17  Fetching monthly pageviews
2017-05-11 20:30:18  Pageviews fetch complete

Another one:

Before (~9 hours):

2017-04-25 06:59:29  Fetching pages and assessments for project Novels
2017-04-25 07:00:33  Total number of pages fetched: 39449
2017-04-25 07:00:33  Fetching monthly pageviews
2017-04-25 16:03:50  Pageviews fetch complete

After (~53 minutes!):

2017-05-11 17:00:00  Fetching pages and assessments for project Novels
2017-05-11 17:02:21  Total number of pages fetched: 39462
2017-05-11 17:02:21  Fetching monthly pageviews
2017-05-11 17:55:40  Pageviews fetch complete

While these numbers are good, they seem too good to be true.

@Niharika: Since using promises provides such a dramatic speed improvement, I would hate for us to throw that away for the majority of WikiProjects. Here's a slightly hacky idea that we could implement quickly:

  • Change ReportUpdater::updateReports() to skip any projects that have over 1 million pages.
  • Create a command-line script called checkBigReports.php (or something like that) that only handles projects with over 1 million pages (or is even hard-coded to handle just WikiProject Biography) and uses a non-promise version of getMonthlyPageviews().
  • Add checkBigReports.php to the crontab.

We should also report this memory leak issue upstream to guzzle/promises. Do you think you or @MusikAnimal could create a simple testcase for the memory issue? Once that is resolved (which could take months) we can throw out the hacks.

The bot is currently using promises for fetching redirects and it's...lightning fast somehow...

Woohoo! 😄

While these numbers are good, they seem too good to be true.

It's possible something's awry, but I'm not surprised the speed was improved that much. If each article has say, an average of 5 redirects, that's taking down your run time to roughly 1/5, right?

We should also report this memory leak issue upstream to guzzle/promises. Do you think you or @MusikAnimal could create a simple testcase for the memory issue? Once that is resolved (which could take months) we can throw out the hacks.

Reported here, but I'm not even convinced it's the same issue, since we're not using GuzzleHttp\Pool (maybe it uses it internally though, dunno). Like I said, during my tests PHP adds a little bit of memory on every iteration, even if you do absolutely nothing. I think WP:BIOGRAPHY is just that big, that throwing in promises was enough to send it over the edge.

@Niharika: Since using promises provides such a dramatic speed improvement, I would hate for us to throw that away for the majority of WikiProjects. Here's a slightly hacky idea that we could implement quickly:

  • Change ReportUpdater::updateReports() to skip any projects that have over 1 million pages.
  • Create a command-line script called checkBigReports.php (or something like that) that only handles projects with over 1 million pages (or is even hard-coded to handle just WikiProject Biography) and uses a non-promise version of getMonthlyPageviews().
  • Add checkBigReports.php to the crontab.

I'd prefer doing it just for Biography separately and letting all else be in the same job. According to P4950, none of the other projects are even close to the million mark. We won't have to look back at this for ages. For Biography, we can use the pre-promise version even though it may take a week or something to finish.

While these numbers are good, they seem too good to be true.

It's possible something's awry, but I'm not surprised the speed was improved that much. If each article has say, an average of 5 redirects, that's taking down your run time to roughly 1/5, right?

Could be... but 20 hours -> ~2h 15 mins? Seems a little fishy. But I did also take out throttling, even though it was pretty minor. That could add up a little bit too.

@MusikAnimal: Does the memory usage go back down when it starts on a new project (within the same bot run)?

@MusikAnimal: Does the memory usage go back down when it starts on a new project (within the same bot run)?

Doesn't look like it. Each "Used ... bytes as of iteration N" is when the memory usage has reached a new peak while processing that WikiProject:

Beginning to process: Physics/Acoustics Taskforce
Fetching pages and assessments for project Physics/Acoustics Taskforce
Total number of pages fetched: 283
Fetching monthly pageviews
Used 3794072 bytes as of iteration 1
Used 4414824 bytes as of iteration 3
Used 4738856 bytes as of iteration 20
Pageviews fetch complete
Finished processing: Physics/Acoustics Taskforce
Beginning to process: Adelaide
Fetching pages and assessments for project Adelaide
Total number of pages fetched: 1
Fetching monthly pageviews
Used 3816800 bytes as of iteration 1
Pageviews fetch complete
Finished processing: Adelaide
Beginning to process: Cartoon Network/Adult Swim task force
Fetching pages and assessments for project Cartoon Network/Adult Swim task force
Total number of pages fetched: 438
Fetching monthly pageviews
Used 5229968 bytes as of iteration 1
Used 5444504 bytes as of iteration 6
Used 7005360 bytes as of iteration 68
Pageviews fetch complete
Finished processing: Cartoon Network/Adult Swim task force
Beginning to process: Aviation/aerospace biography project
Fetching pages and assessments for project Aviation/aerospace biography project
Total number of pages fetched: 4515
Fetching monthly pageviews
Used 8791064 bytes as of iteration 1
Used 9031280 bytes as of iteration 3
Used 9083968 bytes as of iteration 4
Used 9478680 bytes as of iteration 5
Used 10668952 bytes as of iteration 6

Unrelated, I think there might be some issue with the WikiProject Adelaide task force (part of WikiProject Australia). Only one page was found via PageAssessments, but the WP 1.0 Bot reports ~1,890 pages https://tools.wmflabs.org/enwp10/cgi-bin/list2.fcgi?run=yes&projecta=Adelaide

Here's the same run with the memory output, except with the promises removed – so all it does is fetch redirects (and store 0 as the number of pageviews for that page):

Beginning to process: Physics/Acoustics Taskforce
Fetching pages and assessments for project Physics/Acoustics Taskforce
Total number of pages fetched: 283
Fetching monthly pageviews
Used 3526928 bytes as of iteration 1
Used 3548216 bytes as of iteration 3
Used 3562360 bytes as of iteration 20
Used 3562704 bytes as of iteration 205
Used 3564096 bytes as of iteration 208
Used 3564120 bytes as of iteration 209
Used 3564720 bytes as of iteration 233
Used 3576400 bytes as of iteration 239
Used 3580352 bytes as of iteration 262
Pageviews fetch complete
Finished processing: Physics/Acoustics Taskforce
Beginning to process: Adelaide
Fetching pages and assessments for project Adelaide
Total number of pages fetched: 1
Fetching monthly pageviews
Used 3293632 bytes as of iteration 1
Pageviews fetch complete
Finished processing: Adelaide
Beginning to process: Cartoon Network/Adult Swim task force
Fetching pages and assessments for project Cartoon Network/Adult Swim task force
Total number of pages fetched: 438
Fetching monthly pageviews
Used 3712376 bytes as of iteration 1
Used 3720624 bytes as of iteration 6
Used 3786232 bytes as of iteration 68
Pageviews fetch complete
Finished processing: Cartoon Network/Adult Swim task force
Beginning to process: Aviation/aerospace biography project
Fetching pages and assessments for project Aviation/aerospace biography project
Total number of pages fetched: 4515
Fetching monthly pageviews
Used 7775216 bytes as of iteration 1
Used 7783408 bytes as of iteration 3
Used 7785528 bytes as of iteration 4
Used 7799144 bytes as of iteration 5
Used 7842888 bytes as of iteration 6

Notice the memory peaks at the same iterations!! (???) But more importantly, the memory goes up each time even without promises – just to a lesser degree.

Niharika moved this task from In Development to Q1 2018-19 on the Community-Tech-Sprint board.

I believe this can be resolved now. This time the bot ran through all projects except Biography in 15 days. The update script for Biography (using the non-promise version) is still running.

It looks like even without using promises, the bot can't successfully process WikiProject Biography. The report page hasn't successfully updated since April. Rather than waste more time on this 1 report, I've removed WikiProject Biography from the JSON config and changed the bot code to skip any projects larger than 1,000,000 articles. I also updated the crontab in the popularpages-dev project to remove the special job just for WikiProject Biography.
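
For reference, a minimal sketch of what such a guard might look like (the function name and logging are assumptions, not the actual change):

// Hypothetical guard: skip any project whose page list exceeds the threshold,
// logging it so the omission is visible in the bot's output.
function shouldProcessProject( $projectName, array $pages, $maxPages = 1000000 ) {
    if ( count( $pages ) > $maxPages ) {
        error_log( "Skipping $projectName: " . count( $pages ) . " pages exceeds the $maxPages limit." );
        return false;
    }
    return true;
}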

Let's make a task to output this data directly from the cluster so the bot doesn't have to do any work.