Once we have frequency tables of the tokens in the current revision and in the parent revision, we need to compute the delta frequency table. This lets us compute the increase/decrease diff credibility signals.
Small ticket. A couple of lines of code to copy.
For example, if:
wikitext_parent = "kljlj 90 cat apple viking viking shoes cat apple apple viking viking"
wikitext_current = ")()(. viking viking shoes dog apple apple viking viking shoes pypi"
We process these wikitexts using our utilities and obtain the lists of dictionary words as follows:
dictionary_word_parent = ['cat', 'apple', 'viking', 'viking', 'shoes', 'cat', 'apple', 'apple', 'viking', 'viking']
dictionary_word_current = ['viking', 'viking', 'shoes', 'dog', 'apple', 'apple', 'viking', 'viking', 'shoes']
Now we use this utility to get the frequency tables as follows:
old_ft = {'cat': 2, 'apple': 3, 'shoes': 1, 'viking': 4}  # parent rev
new_ft = {'apple': 2, 'shoes': 2, 'viking': 4, 'dog': 1}  # current rev
A delta of two frequency tables is a dictionary whose keys are tokens and whose values are the difference in frequency between new_ft and old_ft (new minus old). Tokens whose count is unchanged, such as 'viking' here, are left out of the delta.
For the above example:
delta_table = {'apple': -1, 'shoes': 1, 'dog': 1, 'cat': -2}
Here is the code snippet to copy/paste from:
from revscoring import Datasource


class delta(Datasource):
    """
    Generates a frequency table diff by comparing two frequency tables.

    :Parameters:
        old_ft_datasource : :class:`revscoring.Datasource`
            A frequency table datasource
        new_ft_datasource : :class:`revscoring.Datasource`
            A frequency table datasource
        name : `str`
            A name for the datasource.
    """

    def __init__(self, old_ft_datasource, new_ft_datasource, name=None):
        name = self._format_name(name, [old_ft_datasource, new_ft_datasource])
        super().__init__(name, self.process,
                         depends_on=[old_ft_datasource, new_ft_datasource])

    def process(self, old_ft, new_ft):
        # A missing parent table (e.g. first revision) is treated as empty.
        old_ft = old_ft or {}

        delta_table = {}
        # Tokens present in the new table: record only counts that changed.
        for item, new_count in new_ft.items():
            old_count = old_ft.get(item, 0)
            if new_count != old_count:
                delta_table[item] = new_count - old_count

        # Tokens that disappeared entirely: their delta is minus the old count.
        for item in old_ft.keys() - new_ft.keys():
            delta_table[item] = old_ft[item] * -1

        return delta_table

The utility will live in structured-data/packages.
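As a quick sanity check, the delta logic can be exercised against the example from this ticket without revscoring installed. The sketch below is only an illustration: `frequency_delta` is a hypothetical standalone function that mirrors the body of `process()`, and `collections.Counter` is used as a stand-in for the project's frequency-table utility, which may behave differently.

```python
from collections import Counter


def frequency_delta(old_ft, new_ft):
    """Hypothetical standalone mirror of delta.process(): new minus old."""
    old_ft = old_ft or {}
    delta_table = {}
    # Changed or newly added tokens.
    for item, new_count in new_ft.items():
        old_count = old_ft.get(item, 0)
        if new_count != old_count:
            delta_table[item] = new_count - old_count
    # Tokens that were removed entirely.
    for item in old_ft.keys() - new_ft.keys():
        delta_table[item] = -old_ft[item]
    return delta_table


# Word lists taken from the example in this ticket.
dictionary_word_parent = ['cat', 'apple', 'viking', 'viking', 'shoes',
                          'cat', 'apple', 'apple', 'viking', 'viking']
dictionary_word_current = ['viking', 'viking', 'shoes', 'dog', 'apple',
                           'apple', 'viking', 'viking', 'shoes']

# Counter is an assumption here, standing in for the real utility.
old_ft = dict(Counter(dictionary_word_parent))
new_ft = dict(Counter(dictionary_word_current))

delta_table = frequency_delta(old_ft, new_ft)
print(delta_table)
```

The resulting dictionary should contain the same entries as the `delta_table` given in the example, with 'viking' absent because its count is 4 in both revisions. Key order may differ, since it follows the iteration order of the input tables.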