Page MenuHomePhabricator

filter_unique leaks memory
Closed, ResolvedPublic

Description

If you have a bot running a large generator (eg. -recentchanges on Wikidata) and need to load every page into memory, it may eventually fail on MemoryError because filter_unique maintains a growing set of already visited pages (for instance, my bot on Toolforge visited 220,000 items before crashing).

I think this could be improved by calling BasePage.clear_cache() somewhere inside BaseBot.run or GeneratorFactory. Because the only thing we have to know when the page is dormant and sitting in the set is the hash, which doesn't need to know the content.

Event Timeline

This is too undetermined. Could you give a sample?
When filter_unique is used the filter key could be replaced by the hash but this does not solve a lot. Most space is used by the content but it is not clear to me where this is hold after processing.

Oh, I realized that my bot couldn't crash because of the loaded content because the preloaded items are instances of different objects than the generator yields... maybe. But still, if I preloaded the objects yielded by the generator, they are burried in the collection and nothing can flush their attributes, so this drains the memory space faster.

Another problem I can see is that filter_unique (inside GenFact it is self._filter_unique) is used twice: it's provided to RecentChangesPageGenerator (line 821) and also used for dupfiltergen = self._filter_unique(gensList) (line 532). This means there are two same independent sets, one of which is redundant.

I think this could be improved by calling BasePage.clear_cache() somewhere inside BaseBot.run or GeneratorFactory.

No, this isn't a good idea. Imagine situation you put the pages to a queue.

When filter_unique is used the filter key could be replaced by the hash but this does not solve a lot.

I like this idea.

zhuyifei1999 renamed this task from Long-running tasks may end on MemoryError to filter_unique leaks memory.Jul 14 2018, 5:27 PM

Long-running tasks may end on MemoryError to filter_unique leaks Memory

Why do you assume that?
Try:

from sys import getsizeof
import pwb, pywikibot as py
from pywikibot.tools import filter_unique as f
s = py.Site()
p = py.Page(s, 'Hydraulik')
container = set()
gen = p.getReferences()
gen = PreloadingGenerator(gen)
generator = f(gen, container)
for item in generator:
	pass
getsizeof(container)

gives 16500

container = set()
gen = p.getReferences()
gen = PreloadingGenerator(gen)
generator = f(gen, Container, key=hash)
for item in generator:
	pass
getsizeof(container)

gives 16500 too

No glue where the Memory leakage might come from.

No glue where the Memory leakage might come from.

I see the getsizeof() counts the pointers only but not the Page objects itself.

I see the getsizeof() counts the pointers only but not the Page objects itself.

The underlying implementation of set.__sizeof__ looks weird to me. It doesn't seem to be iterative or recursive on a first glance.

But yes, the problem is the container gets really large, and Page objects are too large to be kept from being garbage collected, and hashes are much better in this regard. Ideally we should have an upper bound to memory consumption, but IMO that's not really do-able in filter_unique without potential breakage.

We could

  • use hash function for the filter_unique key
  • use hash function for the filter_unique key by default
  • use a GeneratorFactory Container attribute to hold the seen pages which could be reused when we have more than one duplicate filter
  • use an container which uses disk space instead of memory (but this could be time consuming)

We could

  • use hash function for the filter_unique key
  • use hash function for the filter_unique key by default
  • use a GeneratorFactory Container attribute to hold the seen pages which could be reused when we have more than one duplicate filter

LGTM.

  • use an container which uses disk space instead of memory (but this could be time consuming)

This could increase the amount of entries that we can store, but creates more things to consider:

  • file permissions (600 or 644?)
  • location & uniqueness of file (use tempfile?)
  • access by memory-map (causes a lot of page faults) or file objects (causes a lot of syscalls and context switches)?
  • and most importantly: do we have to make our own implementation of set()?
Xqt triaged this task as Medium priority.

I did some measurements for the memory needed for the statements given above:

  • ~ 50 MB for the full page Content
  • ~ 10 MB for hash keys only

The difference will be much higher for larger generators

Therefore I propose to use the hash by default.

Change 445854 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [IMPR] Use hash key for filter_unique by default

https://gerrit.wikimedia.org/r/445854

Change 451824 had a related patch set uploaded (by Dalba; owner: dalba):
[pywikibot/core@master] Use a key for filter_unique where appropriate

https://gerrit.wikimedia.org/r/451824

Change 451824 merged by jenkins-bot:
[pywikibot/core@master] Use a key for filter_unique where appropriate

https://gerrit.wikimedia.org/r/451824

Change 445854 abandoned by Xqt:
[IMPR] Use hash key for unique filter by default

Reason:
https://gerrit.wikimedia.org/r/#/c/pywikibot/core/ /451824/

https://gerrit.wikimedia.org/r/445854

Another problem I can see is that filter_unique (inside GenFact it is self._filter_unique) is used twice: it's provided to RecentChangesPageGenerator (line 821) and also used for dupfiltergen = self._filter_unique(gensList) (line 532). This means there are two same independent sets, one of which is redundant.

Could somebody verify the last sentence is a valid statement?

Another problem I can see is that filter_unique (inside GenFact it is self._filter_unique) is used twice: it's provided to RecentChangesPageGenerator (line 821) and also used for dupfiltergen = self._filter_unique(gensList) (line 532). This means there are two same independent sets, one of which is redundant.

Could somebody verify the last sentence is a valid statement?

Yeah, it seems that it will happen if:

I'll send a change request.

Change 454198 had a related patch set uploaded (by Dalba; owner: dalba):
[pywikibot/core@master] pagegenerators.py: Avoid applying two uniquifying filters

https://gerrit.wikimedia.org/r/454198

Change 454198 merged by jenkins-bot:
[pywikibot/core@master] pagegenerators.py: Avoid applying two uniquifying filters

https://gerrit.wikimedia.org/r/454198