
tools.intersect_generators gives wrong results
Closed, Resolved (Public)

Description

>>> import pwb, pywikibot as py
>>> from pywikibot.tools import intersect_generators as isg
>>> list(isg(('ABB', 'BB')))
['B', 'B']
>>> 
>>> list(isg(('BB', 'ABB')))
['B']
>>>

Event Timeline

Xqt triaged this task as High priority. Sep 27 2020, 3:35 PM

A generator does not return the same page twice. Or is it possible in some cases?

It is possible. SQL queries and recent changes can return the same page multiple times.

OK, I see the issue.
The cache inside intersect_generators() cannot be cleared once an item has been yielded, because the same item might arrive again from all generators after it was yielded the first time.
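
To illustrate the point about the cache, here is a minimal sketch of an intersection that yields every common item exactly once. It is not the pywikibot implementation; the name intersect_unique and the round-robin strategy are assumptions made for this example. The set of already yielded items is kept for the whole run, which is what prevents a second 'B' from being emitted.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel marking an exhausted iterable


def intersect_unique(*iterables):
    """Yield each item common to all iterables exactly once.

    Hypothetical helper for illustration only.
    """
    n = len(iterables)
    seen_in = {}     # item -> indexes of the sources that produced it so far
    yielded = set()  # items already emitted; intentionally never cleared
    # Take one item from each source per round so no source must finish first.
    for row in zip_longest(*iterables, fillvalue=_MISSING):
        for index, item in enumerate(row):
            if item is _MISSING or item in yielded:
                continue
            sources = seen_in.setdefault(item, set())
            sources.add(index)
            if len(sources) == n:
                yielded.add(item)
                del seen_in[item]  # safe: 'yielded' now guards this item
                yield item


print(list(intersect_unique('ABB', 'BB')))  # ['B']
print(list(intersect_unique('BB', 'ABB')))  # ['B']
```

With this approach the result no longer depends on the order of the arguments, at the cost of remembering every yielded item.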

What is the best way to implement this for streams/generators?
The current approach behaves like a set, which means the result has unique elements. The tests compare the result with a set:
{S1} ∩ {S2}
set(S1) & set(S2)
e.g. 'ABBCD' ∩ 'ABCBA' gives 'ABC'

Another approach would be a multiset intersection that respects how often each element occurs, like:
{|S1|} ∩ {|S2|}
collections.Counter(S1) & collections.Counter(S2)
e.g. 'ABBCD' ∩ 'ABCBA' gives 'ABBC' or 'ABCB'

Probably we can have both.
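
The two semantics discussed above can be compared directly with the standard library. This is only a sketch on the example strings from this comment; the variable names are chosen for illustration.

```python
from collections import Counter

s1, s2 = 'ABBCD', 'ABCBA'

# Set semantics: every common element appears once.
as_set = set(s1) & set(s2)

# Multiset semantics: each element appears
# min(count in s1, count in s2) times.
as_multiset = Counter(s1) & Counter(s2)

print(sorted(as_set))                  # ['A', 'B', 'C']
print(sorted(as_multiset.elements()))  # ['A', 'B', 'B', 'C']
```

Both operations need the complete input up front, so they describe the expected result rather than a way to compute it lazily from generators.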

Change 630808 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [tests] Ad a intersect generator test which is known as failing

https://gerrit.wikimedia.org/r/630808

If no duplicates are desired, I think the only issue is memory, as all seen items must be remembered.
If duplicates are allowed, it should be possible to keep memory low.
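
A possible sketch of the second case, where duplicates are allowed: only items that have not yet been matched in every source are remembered, and a match removes one occurrence from each source's bookkeeping. The function name intersect_with_duplicates is hypothetical and this is not the code from the patches above; memory still grows with the number of unmatched items, so "low" is not guaranteed for all inputs.

```python
from collections import Counter
from itertools import zip_longest

_DONE = object()  # sentinel marking an exhausted iterable


def intersect_with_duplicates(*iterables):
    """Yield items common to all iterables, keeping multiplicity.

    Hypothetical helper for illustration only.
    """
    # One Counter per source holding its not-yet-matched items.
    pending = [Counter() for _ in iterables]
    for row in zip_longest(*iterables, fillvalue=_DONE):
        for index, item in enumerate(row):
            if item is _DONE:
                continue
            pending[index][item] += 1
            # The item is matched once every source has an unmatched copy.
            if all(counter[item] > 0 for counter in pending):
                for counter in pending:
                    counter[item] -= 1
                    if counter[item] == 0:
                        del counter[item]  # forget fully matched items
                yield item


print(list(intersect_with_duplicates('ABB', 'BB')))  # ['B', 'B']
print(list(intersect_with_duplicates('BB', 'ABB')))  # ['B', 'B']
print(''.join(intersect_with_duplicates('ABBCD', 'ABCBA')))  # ABCB
```

Here both argument orders from the description give the same result, and the output for 'ABBCD' and 'ABCBA' matches the Counter-based intersection.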

Change 630808 merged by jenkins-bot:
[pywikibot/core@master] [tests] Add a intersect generator test which is known as failing

https://gerrit.wikimedia.org/r/630808

Change 631282 had a related patch set uploaded (by Mpaa; owner: Mpaa):
[pywikibot/core@master] [bugfix] Avoid duplicates in intersect_generators()

https://gerrit.wikimedia.org/r/631282

Change 631284 had a related patch set uploaded (by Mpaa; owner: Mpaa):
[pywikibot/core@master] [bugfix] Avoid duplicates in intersect_generators()

https://gerrit.wikimedia.org/r/631284

Change 631284 abandoned by Mpaa:
[pywikibot/core@master] [bugfix] Avoid duplicates in intersect_generators()

Reason:
Submitted by mistake

https://gerrit.wikimedia.org/r/631284

Change 631282 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Avoid duplicates in intersect_generators()

https://gerrit.wikimedia.org/r/631282

Change 608153 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [Bugfix] Rewrite tools.intersect_generators

https://gerrit.wikimedia.org/r/608153