Create A Utility To Generate A Delta of Two Frequency Tables
Closed, Resolved | Public | 8 Estimated Story Points

Description

Once we have frequency tables of the tokens in the current revision and in the parent revision, we need to compute the delta frequency table. The delta table is the input for the increase/decrease diff credibility signals.

Small ticket. A couple of lines of code to copy.

For example, if:

wikitext_parent = "kljlj 90 cat apple viking viking         shoes    cat apple apple viking viking"
wikitext_current = ")()(. viking viking         shoes    dog apple apple viking viking shoes pypi"

We process these wikitexts with our utilities and obtain the following lists of dictionary words (non-dictionary tokens such as "kljlj", "90", ")()(." and "pypi" are dropped):

dictionary_word_parent = ['cat', 'apple', 'viking', 'viking', 'shoes', 'cat', 'apple', 'apple', 'viking', 'viking']
dictionary_word_current = ['viking', 'viking', 'shoes', 'dog', 'apple', 'apple', 'viking', 'viking', 'shoes']

Next, we use the frequency table utility to get one table per revision:

old_ft = {'cat':2, 'apple':3, 'shoes':1, 'viking':4}   # parent rev 
new_ft = {'apple':2, 'shoes':2, 'viking':4, 'dog':1}    # current rev
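
The counting step can be sketched with Python's standard `collections.Counter`. This is only an illustration of how the two tables above come about; the helper name `frequency_table` is hypothetical and is not the project's actual utility.

```python
from collections import Counter


def frequency_table(words):
    """Count how often each token occurs (hypothetical helper, illustration only)."""
    return dict(Counter(words))


# Word lists from the example, in the order the words appear in the wikitext.
dictionary_word_parent = ['cat', 'apple', 'viking', 'viking', 'shoes',
                          'cat', 'apple', 'apple', 'viking', 'viking']
dictionary_word_current = ['viking', 'viking', 'shoes', 'dog', 'apple',
                           'apple', 'viking', 'viking', 'shoes']

old_ft = frequency_table(dictionary_word_parent)   # parent rev
new_ft = frequency_table(dictionary_word_current)  # current rev

# Dict equality ignores key order, so these match the tables in the ticket.
assert old_ft == {'cat': 2, 'apple': 3, 'shoes': 1, 'viking': 4}
assert new_ft == {'apple': 2, 'shoes': 2, 'viking': 4, 'dog': 1}
```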

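A delta of two frequency tables is a dictionary whose keys are tokens and whose values are the difference in frequency, computed as the count in new_ft minus the count in old_ft. Tokens whose count did not change (here, 'viking') are left out.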

For the above example:

delta_table = {'apple': -1, 'shoes': 1, 'dog': 1, 'cat': -2}
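
The arithmetic behind this example can be checked with a plain function that has no datasource wrapper. The name `compute_delta` is an assumption made for this sketch; it is a simplified stand-in for illustration, not the snippet the ticket asks to copy.

```python
def compute_delta(old_ft, new_ft):
    """Return {token: new_count - old_count} for tokens whose count changed.

    Hypothetical helper for illustration only.
    """
    old_ft = old_ft or {}
    new_ft = new_ft or {}

    delta = {}
    # Visit every token seen in either table; a missing token counts as 0.
    for token in old_ft.keys() | new_ft.keys():
        diff = new_ft.get(token, 0) - old_ft.get(token, 0)
        if diff != 0:
            delta[token] = diff
    return delta


old_ft = {'cat': 2, 'apple': 3, 'shoes': 1, 'viking': 4}   # parent rev
new_ft = {'apple': 2, 'shoes': 2, 'viking': 4, 'dog': 1}   # current rev

result = compute_delta(old_ft, new_ft)
assert result == {'apple': -1, 'shoes': 1, 'dog': 1, 'cat': -2}
assert 'viking' not in result  # unchanged count, so no entry
```

Omitting zero deltas keeps the table small for typical edits, where most tokens are untouched.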

Here is the code snippet to copy/paste from:

from revscoring import Datasource


class delta(Datasource):
    """
    Generates a frequency table diff by comparing two frequency tables.

    :Parameters:
        old_ft_datasource : :class:`revscoring.Datasource`
            A frequency table datasource for the parent revision
        new_ft_datasource : :class:`revscoring.Datasource`
            A frequency table datasource for the current revision
        name : `str`
            A name for the datasource.
    """

    def __init__(self, old_ft_datasource, new_ft_datasource, name=None):
        name = self._format_name(name, [old_ft_datasource, new_ft_datasource])
        super().__init__(name, self.process,
                         depends_on=[old_ft_datasource, new_ft_datasource])

    def process(self, old_ft, new_ft):
        # A missing table (None) is treated as empty.
        old_ft = old_ft or {}
        new_ft = new_ft or {}

        delta_table = {}
        # Tokens present in the new table: keep only counts that changed.
        for item, new_count in new_ft.items():
            old_count = old_ft.get(item, 0)
            if new_count != old_count:
                delta_table[item] = new_count - old_count

        # Tokens that disappeared entirely get a negative delta.
        for item in old_ft.keys() - new_ft.keys():
            delta_table[item] = -old_ft[item]

        return delta_table

The utility will live in structured-data/packages.
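
The description says the delta table feeds the increase/decrease diff signals, but their exact definitions belong to the subtasks and are not given here. As a rough sketch only, assuming "increase" means the positive deltas and "decrease" means the magnitudes of the negative deltas, the split might look like this (`split_delta` is a hypothetical name):

```python
def split_delta(delta_table):
    """Split a delta table into added and removed token counts.

    Assumption for illustration: increase = positive deltas,
    decrease = absolute values of negative deltas.
    """
    increase = {token: d for token, d in delta_table.items() if d > 0}
    decrease = {token: -d for token, d in delta_table.items() if d < 0}
    return increase, decrease


delta_table = {'apple': -1, 'shoes': 1, 'dog': 1, 'cat': -2}
increase, decrease = split_delta(delta_table)

assert increase == {'shoes': 1, 'dog': 1}
assert decrease == {'apple': 1, 'cat': 2}

# Aggregate totals, should a single number per revision be wanted.
total_increase = sum(increase.values())
total_decrease = sum(decrease.values())
assert (total_increase, total_decrease) == (2, 3)
```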

Event Timeline

prabhat updated the task description.
Protsack.stephan moved this task from Incoming to In Progress on the Wikimedia Enterprise board.
Protsack.stephan set the point value for this task to 8.
Lena.Milenko changed the task status from Open to In Progress. (Mar 22 2022, 10:58 PM)
Lena.Milenko changed the task status from In Progress to Open. (Apr 13 2022, 9:28 PM)
Lena.Milenko changed the status of subtask T299607: Create A Utility To Compute Diff Decrease from In Progress to Open.
Lena.Milenko changed the status of subtask T299599: Create A Utility To Compute Diff Increase from In Progress to Open.