
[Investigate] Get signal from adding/removing images
Closed, Resolved (Public)

Description

This will be extremely useful in cases where "conservative vandals" remove pictures of not-very-conservative subjects, for example this.

Event Timeline

We should be able to track image links on a per-wiki basis. Most images render from "wikilinks" like [[File:name.ext|...]]. We could use wikilink_titles_matching() to match /^(File|Image):/. This will work for English Wikipedia; other wikis will have their own localized prefixes.
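For illustration, the prefix match behaves like this plain `re` sketch (the titles below are made up):

```python
import re

# Hypothetical wikilink titles, as would be extracted from [[...]] links
titles = ["File:Example.jpg", "Image:Old photo.png", "Category:Birds", "Main Page"]

# English Wikipedia prefixes only; other wikis localize "File:"/"Image:"
image_prefix_re = re.compile(r"^(File|Image):")
image_titles = [t for t in titles if image_prefix_re.match(t)]
# image_titles == ["File:Example.jpg", "Image:Old photo.png"]
```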

But sometimes images appear in infoboxes, and sometimes the infobox template handles the File namespace prefix itself, so the parameter value is a bare filename. In that case, we might want to match strings that look like filenames with extensions.

{{Infobox NFL biography
| name                = Shayne Skov
| image               = Shayne Skov 2016.JPG

or

{{Infobox settlement
|official_name = Ferguson Township,<br/>Clearfield County,<br/>Pennsylvania
<!-- Images -->
|image_skyline = Cherry Corner Road near sunset.jpg
<!-- Maps -->
|image_map = Map of Ferguson Township, Clearfield County, Pennsylvania Highlighted.png
|image_map1 = Map of Pennsylvania highlighting Clearfield County.svg

So in these examples, "Shayne Skov 2016.JPG" and "Cherry Corner Road near sunset.jpg" are images that appear without the "File:" prefix in an infobox. We could match them with something like:

(file|image\w*)\s*=\s*((file|image):)?([^\n\|]+\.(jpe?g|png|gif|svg))
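To sanity-check that regex (with the dot before the extension escaped so it matches a literal "."), here is a plain `re` run against the infobox snippet above:

```python
import re

infobox = """{{Infobox NFL biography
| name                = Shayne Skov
| image               = Shayne Skov 2016.JPG
}}"""

# Group 4 captures the bare filename; re.I covers uppercase ".JPG" etc.
param_re = re.compile(
    r"(file|image\w*)\s*=\s*((file|image):)?([^\n\|]+\.(jpe?g|png|gif|svg))",
    re.I)
filenames = [m.group(4) for m in param_re.finditer(infobox)]
# filenames == ["Shayne Skov 2016.JPG"]
```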

I would implement this inside of https://github.com/wiki-ai/editquality/blob/master/editquality/feature_lists/enwiki.py. For the first link-matching strategy, I would use wikitext.revision.wikilink_titles_matching(...) and wikitext.revision.parent.wikilink_titles_matching(...).

For the infobox images, I would implement a revscoring.datasource.meta.extractors.regex() that uses the regex I supplied above to get a set of matches for parent and revision text.

For the infobox images strategy to be implemented in the most elegant way, it will require changes to both editquality and revscoring, which means two separate pull requests. Is that ok?
My plan is to add this into /feature_lists/enwiki.py
local_wiki = [
    ...

    sub(
        wikitext_features.revision.infobox_image_matching(
            r"(file|image\w*)\s*=\s*((file|image):)?([^\n\|]+\.(jpe?g|png|gif|svg))"),
        wikitext_features.revision.parent.infobox_image_matching(
            r"(file|image\w*)\s*=\s*((file|image):)?([^\n\|]+\.(jpe?g|png|gif|svg))"),
        name="enwiki.revision.wikilink.if_infobox_image_removed"
    )
]

also this into revscoring/features/wikitext/datasources/parsed.py:

def infobox_image_matching(self, regex, name=None):
    """
    Constructs a :class:`revscoring.Datasource` that generates a `list`
    of infobox image references that match any of a set of provided regexes.
    """
    if not hasattr(regex, "pattern"):
        regex = re.compile(regex, re.I)

    if name is None:
        name = "{0}({1})" \
               .format(self._name + ".infobox_image_matching",
                       regex.pattern)

    return filters.regex_matching(regex, self.wikicode, name=name)

and this into revscoring/features/wikitext/features/parsed.py:

def infobox_image_matching(self, regex, name=None):
    """
    Constructs a :class:`revscoring.Feature` that returns a count
    of infobox image references that match a regular expression.
    """
    if not hasattr(regex, "pattern"):
        regex = re.compile(regex, re.I)

    if name is None:
        name = "{0}({1})" \
               .format(self._name + ".infobox_image_matching",
                       regex.pattern)

    return aggregators.len(
        self.datasources.infobox_image_matching(regex),
        name=name
    )

You should be able to implement this entirely in editquality. Infoboxes are not defined the same across wikis, so it would not be apt to define the details of "infobox" in revscoring.

Also, you can't run a regex on "wikicode". It needs to run on either a string or a list of strings.
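In other words, something like this (using a stand-in class rather than mwparserfromhell's real Wikicode type): the parsed object has to be stringified before `re` will accept it.

```python
import re

class Wikicode:
    """Stand-in for a parsed-wikitext object (not the real mwparserfromhell API)."""
    def __init__(self, text):
        self.text = text
    def __str__(self):
        return self.text

wc = Wikicode("[[File:Example.jpg|thumb]]")
# re.findall(pattern, wc) would raise TypeError; convert to str first
matches = re.findall(r"File:\w+\.\w+", str(wc))
# matches == ["File:Example.jpg"]
```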

OK, this is my version 2:
from revscoring.features import revision_oriented, wikitext as wikitext_features
from revscoring.features.modifiers import sub
from revscoring.features.meta.aggregators import len
from revscoring.languages import english, features as language_features

from revscoring.features.wikitext.datasources.revision_oriented import Diff as wikitext_diff

from . import mediawiki, wikipedia, wikitext

local_wiki = [
len(

   language_features.regex_matches.datasources.Diff(
   "enwiki.revision.infobox_images.matches_removed",
   r"(file|image|\w*)\s*=\s*((file|image):)?([^\n\|]+.(jpe?g|png|gif|svg))",
   wikitext_diff, revision=None, exclusions=None, wrapping=(r'\b', r'\b')).matches_removed,
   name="enwiki.revision.infobox_images.matches_removed"
)

]
but this wikitext_diff is not linked to any revision... I am apparently doing something wrong. Where can I get the datasource for regex_matches?

I guess I'd do something like this:

from revscoring.features import wikitext, modifiers
from revscoring.features.meta import aggregators
from revscoring.datasources.meta import extractors

image_params_re = r"(file|image|\w*)\s*=\s*((file|image):)?([^\n\|]+.(jpe?g|png|gif|svg))"
revision_image_params = aggregators.len(extractors.regex(
    image_params_re, wikitext.revision.datasources.text), name="enwiki.revision.image_params")
parent_image_params = aggregators.len(extractors.regex(
    image_params_re, wikitext.revision.parent.datasources.text), name="enwiki.revision.parent.image_params")

revision_imagelinks = wikitext.revision.wikilinks_matching(r"^(file|image):", name="enwiki.revision.image_wikilinks")
parent_imagelinks = wikitext.revision.parent.wikilinks_matching(r"^(file|image):", name="enwiki.revision.parent.image_wikilinks")

local_wiki = [
    ...
    revision_image_params,
    parent_image_params,
    parent_image_params - revision_image_params,
    ((parent_image_params - revision_image_params) / modifiers.max(parent_image_params, 1)),
    revision_imagelinks,
    parent_imagelinks,
    parent_imagelinks - revision_imagelinks,
    ((parent_imagelinks - revision_imagelinks) / modifiers.max(parent_imagelinks, 1))
]
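One caveat with the normalized differences in that list: feature arithmetic follows normal Python precedence, where division binds tighter than subtraction, so `a - b / c` parses as `a - (b / c)` and the subtraction needs its own parentheses. A plain-number check:

```python
# Division binds tighter than subtraction, so the normalized
# difference needs explicit parentheses around the subtraction.
a, b, c = 10, 4, 2
no_parens = a - b / c       # parsed as a - (b / c) == 8.0
with_parens = (a - b) / c   # the intended (a - b) / c == 3.0
```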

Seems like this feature is not very productive.
I ended up using the following regexes:

for freestanding pictures: r"(File|Image):\w*"
for infobox pictures: r"(file|image|\w)\s{1,}=\s{1,}(((file|image):)|)([^\n\|]{1,}.(jpe{0,1}g|png|gif|svg))"
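The infobox regex just spells common quantifiers out longhand: `\s{1,}` is `\s+`, `x{0,1}` is `x?`, and `(x|)` matches the same text as `(x)?` (group numbering aside). A quick equivalence check on a made-up sample line:

```python
import re

verbose = r"(file|image|\w)\s{1,}=\s{1,}(((file|image):)|)([^\n\|]{1,}.(jpe{0,1}g|png|gif|svg))"
concise = r"(file|image|\w)\s+=\s+((file|image):)?([^\n\|]+.(jpe?g|png|gif|svg))"

sample = "image = Cherry Corner Road near sunset.jpg"
v = re.search(verbose, sample, re.I)
c = re.search(concise, sample, re.I)
# Both match the same span: "image = Cherry Corner Road near sunset.jpg"
```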

For the damaging model, the AUC improved insignificantly (0.924 vs. 0.922); for the goodfaith model, the AUC decreased slightly (0.927 vs. 0.928).

DAMAGING

Top scoring configurations

| model | mean(scores) | std(scores) | params |
| :---- | :----------- | :---------- | :----- |
| GradientBoostingClassifier | 0.924 | 0.007 | max_features="log2", max_depth=5, n_estimators=700, learning_rate=0.01 |
| GradientBoostingClassifier | 0.922 | 0.006 | max_features="log2", max_depth=5, n_estimators=500, learning_rate=0.01 |
| GradientBoostingClassifier | 0.922 | 0.007 | max_features="log2", max_depth=7, n_estimators=700, learning_rate=0.01 |

GOODFAITH

Top scoring configurations

| model | mean(scores) | std(scores) | params |
| :---- | :----------- | :---------- | :----- |
| GradientBoostingClassifier | 0.927 | 0.007 | max_features="log2", max_depth=5, learning_rate=0.01, n_estimators=700 |
| GradientBoostingClassifier | 0.927 | 0.006 | max_features="log2", max_depth=5, learning_rate=0.01, n_estimators=500 |
| GradientBoostingClassifier | 0.925 | 0.008 | max_features="log2", max_depth=7, learning_rate=0.01, n_estimators=500 |
from revscoring.features import wikitext, modifiers
from revscoring.features.meta import aggregators
from revscoring.datasources.meta import dicts, extractors, filters, frequencies

image_params_re = r"(file|image|\w*)\s*=\s*((file|image):)?([^\n\|]+.(jpe?g|png|gif|svg))"
revision_image_params_freq = frequencies.table(extractors.regex(
    image_params_re, wikitext.revision.datasources.text))
parent_image_params_freq = frequencies.table(extractors.regex(
    image_params_re, wikitext.revision.parent.datasources.text))
image_params_delta = frequencies.delta(parent_image_params_freq, revision_image_params_freq)
image_params_delta_sum = aggregators.sum(
    dicts.values(image_params_delta),
    name="enwiki.revision.diff.image_params_delta_sum")
image_params_increase = aggregators.sum(
    filters.positive(dicts.values(image_params_delta)),
    name="enwiki.revision.diff.image_params_increase")
image_params_decrease = aggregators.sum(
    filters.negative(dicts.values(image_params_delta)),
    name="enwiki.revision.diff.image_params_decrease")   

image_link_re = r"^file|image:"
revision_imagelinks_freq = frequencies.table(wikitext.revision.datasources.wikilink_titles_matching(
    image_link_re, name="enwiki.revision.image_wikilinks"))
parent_imagelinks_freq = frequencies.table(wikitext.revision.parent.datasources.wikilink_titles_matching(
    image_link_re, name="enwiki.revision.parent.image_wikilinks"))
imagelinks_delta = frequencies.delta(parent_imagelinks_freq, revision_imagelinks_freq)
imagelinks_delta_sum = aggregators.sum(
    dicts.values(imagelinks_delta),
    name="enwiki.revision.diff.imagelinks_delta_sum")
imagelinks_increase = aggregators.sum(
    filters.positive(dicts.values(imagelinks_delta)),
    name="enwiki.revision.diff.imagelinks_increase")
imagelinks_decrease = aggregators.sum(
    filters.negative(dicts.values(imagelinks_delta)),
    name="enwiki.revision.diff.imagelinks_decrease")   

local_wiki = [
    ...
    imagelinks_delta_sum,
    imagelinks_increase,
    imagelinks_decrease,
    image_params_delta_sum,
    image_params_increase,
    image_params_decrease,
]
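The delta/increase/decrease trio above can be sketched in plain Python with a Counter (hypothetical filenames; this is a sketch of the idea, not revscoring's actual implementation):

```python
from collections import Counter

def image_deltas(parent_images, revision_images):
    # frequencies.delta: per-filename count change from parent to revision
    delta = Counter(revision_images)
    delta.subtract(Counter(parent_images))
    values = list(delta.values())
    return (sum(values),                      # delta_sum: net change
            sum(v for v in values if v > 0),  # increase: additions only
            sum(v for v in values if v < 0))  # decrease: removals only

# "a.jpg" removed, "c.svg" added, "b.png" unchanged
d_sum, inc, dec = image_deltas(["a.jpg", "b.png"], ["b.png", "c.svg"])
# (d_sum, inc, dec) == (0, 1, -1)
```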

The results:

DAMAGING:
Top scoring configurations

| model | mean(scores) | std(scores) | params |
| :---- | :----------- | :---------- | :----- |
| GradientBoostingClassifier | 0.923 | 0.006 | max_depth=5, n_estimators=700, max_features="log2", learning_rate=0.01 |
| GradientBoostingClassifier | 0.923 | 0.006 | max_depth=5, n_estimators=500, max_features="log2", learning_rate=0.01 |
| GradientBoostingClassifier | 0.922 | 0.008 | max_depth=7, n_estimators=500, max_features="log2", learning_rate=0.01 |

GOODFAITH:
Top scoring configurations

| model | mean(scores) | std(scores) | params |
| :---- | :----------- | :---------- | :----- |
| GradientBoostingClassifier | 0.928 | 0.006 | max_depth=5, n_estimators=700, learning_rate=0.01, max_features="log2" |
| GradientBoostingClassifier | 0.928 | 0.006 | max_depth=5, n_estimators=500, learning_rate=0.01, max_features="log2" |
| GradientBoostingClassifier | 0.926 | 0.008 | max_depth=7, n_estimators=500, learning_rate=0.01, max_features="log2" |

¯\_(ツ)_/¯

Halfak renamed this task from Get signal from adding/removing images to [Investigate] Get signal from adding/removing images.Aug 28 2017, 4:55 PM
Halfak moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board.