This will be extremely useful in cases where "conservative vandals" remove pictures of not-very-conservative subjects, for example this
Description
Related Objects
- Mentioned In
- Blog Post: Status update (October 6, 2017)
Event Timeline
We should be able to track image links on a per-wiki basis. Most images render from "wikilinks" like [[File:name.ext|...]]. We could use wikilink_titles_matching() to match /^(File|Image):/. This will work for English Wikipedia. Other wikis will have their own localized prefixes.
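As a rough illustration of the wikilink strategy, the sketch below pulls link targets out of raw wikitext with a plain regular expression and keeps the ones with a file-namespace prefix. It does not use revscoring; `image_link_titles` and its `prefixes` parameter are hypothetical names, and the German prefixes in the usage note are only an example of what a localized wiki might need. The extractor is deliberately naive and will not handle captions that contain nested links.

```python
import re

# Naive wikilink extractor: captures the target of [[target]] or [[target|...]].
WIKILINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def image_link_titles(text, prefixes=("File", "Image")):
    """Return wikilink targets whose namespace prefix is one of `prefixes`."""
    # Group the alternatives so that "^" and ":" apply to every prefix.
    prefix_re = re.compile(
        r"^(" + "|".join(re.escape(p) for p in prefixes) + r"):", re.I)
    titles = [m.group(1).strip() for m in WIKILINK_RE.finditer(text)]
    return [t for t in titles if prefix_re.match(t)]


sample = ("Intro [[File:Example.jpg|thumb|A caption]] and a [[Plain link]] "
          "and [[image:Other.png]].")
print(image_link_titles(sample))
```

For another wiki, the same helper could be called with that wiki's prefixes, for instance `image_link_titles(text, prefixes=("Datei", "Bild"))`.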
But sometimes images appear in infoboxes. And sometimes the infoboxes handle the File namespace prefix. So, we might want to match strings that look like filenames with extensions.
{{Infobox NFL biography
| name = Shayne Skov
| image = Shayne Skov 2016.JPG
or
{{Infobox settlement
|official_name = Ferguson Township,<br/>Clearfield County,<br/>Pennsylvania
<!-- Images -->
|image_skyline = Cherry Corner Road near sunset.jpg
<!-- Maps -->
|image_map = Map of Ferguson Township, Clearfield County, Pennsylvania Highlighted.png
|image_map1 = Map of Pennsylvania highlighting Clearfield County.svg
So in this case, we see "Shayne Skov 2016.JPG" and "Cherry Corner Road near sunset.jpg" are images that appear without the "File:" prefix in an infobox. We could match it with something like:
(file|image\w*)\s*=\s*((file|image):)?([^\n\|]+.(jpe?g|png|gif|svg))
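To see what this pattern picks up, the sketch below runs it with Python's `re` module over the first infobox example. `image_params` is a hypothetical helper, and the `ESCAPED` variant is only an assumption about the intent: in the pattern as written, the `.` before the extension is unescaped, so it matches any character rather than a literal dot. That may or may not matter in practice.

```python
import re

# Pattern as proposed in the thread (unescaped "." before the extension).
PROPOSED = r"(file|image\w*)\s*=\s*((file|image):)?([^\n\|]+.(jpe?g|png|gif|svg))"
# Variant with the dot escaped; an assumption about what was intended.
ESCAPED = r"(file|image\w*)\s*=\s*((file|image):)?([^\n\|]+\.(jpe?g|png|gif|svg))"

infobox = """{{Infobox NFL biography
| name = Shayne Skov
| image = Shayne Skov 2016.JPG
| image_size = 200px
}}"""


def image_params(pattern, text):
    """Return the value part (group 4) of every matching parameter."""
    return [m.group(4) for m in re.finditer(pattern, text, re.I)]


print(image_params(PROPOSED, infobox))
print(image_params(ESCAPED, infobox))

# A value with no file extension at all, ending in the word "svg".
loose = "| image = notes about a big svg"
print(image_params(PROPOSED, loose))
print(image_params(ESCAPED, loose))
```

Both variants find `Shayne Skov 2016.JPG` and skip `image_size`. Only the unescaped pattern accepts the `loose` value, because the space before "svg" satisfies the bare `.`.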
I would implement this inside of https://github.com/wiki-ai/editquality/blob/master/editquality/feature_lists/enwiki.py. For the first link-matching strategy, I would use wikitext.revision.wikilink_titles_matching(...) and wikitext.revision.parent.wikilink_titles_matching(...).
For the infobox images, I would implement a revscoring.datasource.meta.extractors.regex() that uses the regex I supplied above to get a set of matches for parent and revision text.
For the infobox images strategy to be implemented in the most elegant way, it will require changes to both editquality and revscoring, which means two separate pull requests. Is that ok?
My plan is to add this into /feature_lists/enwiki.py
local_wiki = [
    ....
    sub(
        wikitext_features.revision.infobox_image_matching(
            r"(file|image\w*)\s*=\s*((file|image):)?([^\n\|]+.(jpe?g|png|gif|svg))"),
        wikitext_features.revision.parent.infobox_image_matching(
            r"(file|image\w*)\s*=\s*((file|image):)?([^\n\|]+.(jpe?g|png|gif|svg))"),
        name="enwiki.revision.wikilink.if_infobox_image_removed"
    )
]
also this into revscoring/features/wikitext/datasources/parsed.py:
def infobox_image_matching(self, regex, name=None):
    """
    Constructs a :class:`revscoring.Datasource` that generates a `list` of
    infobox image references that match any of a set of provided regexes.
    """
    if not hasattr(regex, "pattern"):
        regex = re.compile(regex, re.I)
    if name is None:
        name = "{0}({1})" \
            .format(self._name + ".infobox_image_matching", regex.pattern)
    return filters.regex_matching(regex, self.wikicode, name=name)
and this into revscoring/features/wikitext/features/parsed.py:
def infobox_image_matching(self, regex, name=None):
    """
    Constructs a :class:`revscoring.Datasource` that generates a count of
    infobox image references that match a regular expression.
    """
    if not hasattr(regex, "pattern"):
        regex = re.compile(regex, re.I)
    if name is None:
        name = "{0}({1})" \
            .format(self._name + ".infobox_image_matching", regex.pattern)
    return aggregators.len(
        self.datasources.infobox_image_matching(regex),
        name=name
    )
You should be able to implement this entirely in editquality. Infoboxes are not defined the same way across wikis, so it would not be apt to define the details of "infobox" in revscoring.
Also, you can't run a regex on "wikicode". It needs to run on either a string or a list of strings.
ok, this is my ver2:
from revscoring.features import revision_oriented, wikitext as wikitext_features
from revscoring.features.modifiers import sub
from revscoring.features.meta.aggregators import len
from revscoring.languages import english, features as language_features
from revscoring.features.wikitext.datasources.revision_oriented import Diff as wikitext_diff
from . import mediawiki, wikipedia, wikitext
local_wiki = [
    len(
        language_features.regex_matches.datasources.Diff(
            "enwiki.revision.infobox_images.matches_removed",
            r"(file|image|\w*)\s*=\s*((file|image):)?([^\n\|]+.(jpe?g|png|gif|svg))",
            wikitext_diff,
            revision=None,
            exclusions=None,
            wrapping=(r'\b', r'\b')).matches_removed,
        name="enwiki.revision.infobox_images.matches_removed"
    )
]
but this wikitext_diff is not linked to any revision. I am apparently doing something wrong. Where can I get the datasource for regex_matches?
I guess I'd do something like this:
from revscoring.features import wikitext, modifiers
from revscoring.features.meta import aggregators
from revscoring.datasources.meta import extractors

image_params_re = r"(file|image|\w*)\s*=\s*((file|image):)?([^\n\|]+.(jpe?g|png|gif|svg))"

# Number of infobox-style image parameters in the revision and in its parent
revision_image_params = aggregators.len(
    extractors.regex(image_params_re, wikitext.revision.datasources.text),
    name="enwiki.revision.image_params")
parent_image_params = aggregators.len(
    extractors.regex(image_params_re, wikitext.revision.parent.datasources.text),
    name="enwiki.revision.parent.image_params")

# Number of File:/Image: wikilinks in the revision and in its parent
revision_imagelinks = wikitext.revision.wikilink_titles_matching(
    r"^file|image:", name="enwiki.revision.image_wikilinks")
parent_imagelinks = wikitext.revision.parent.wikilink_titles_matching(
    r"^file|image:", name="enwiki.revision.parent.image_wikilinks")

local_wiki = [
    ...
    revision_image_params,
    parent_image_params,
    parent_image_params - revision_image_params,
    # Proportional change: subtract first, then divide by the parent count
    (parent_image_params - revision_image_params) /
    modifiers.max(parent_image_params, 1),
    revision_imagelinks,
    parent_imagelinks,
    parent_imagelinks - revision_imagelinks,
    (parent_imagelinks - revision_imagelinks) /
    modifiers.max(parent_imagelinks, 1)
]
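To make the count-based features concrete, here is a plain-Python sketch, independent of revscoring, of the raw and proportional drop in image references between a parent and a revision. `image_count_features` is a hypothetical helper, not part of either library. The subtraction is done before the division; written inline as `a - b / c`, Python would divide first.

```python
def image_count_features(parent_count, revision_count):
    """Return (removed, proportion_removed) for two image reference counts."""
    removed = parent_count - revision_count
    # max(..., 1) avoids dividing by zero when the parent has no images.
    proportion_removed = removed / max(parent_count, 1)
    return removed, proportion_removed


# Three of four image references removed.
print(image_count_features(4, 1))
# No images before or after the edit.
print(image_count_features(0, 0))
```

A negative value means the edit added image references rather than removing them.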
Seems like this feature is not very productive.
I ended up using the following regexes:
for freestanding picture = r"(File|Image):\w*"
for infobox picture = r"(file|image|\w)\s{1,}=\s{1,}(((file|image):)|)([^\n\|]{1,}.(jpe{0,1}g|png|gif|svg))"
For the Damaging model, the AUC improved insignificantly (0.924 v. 0.922); for the goodfaith model, AUC decreased a little (0.927 v. 0.928).
DAMAGING
Top scoring configurations
model | mean(scores) | std(scores) | params |
:--------------------------- | ---------------: | --------------: | :----------------------------------------------------------------------- |
GradientBoostingClassifier | 0.924 | 0.007 | max_features="log2", max_depth=5, n_estimators=700, learning_rate=0.01 |
GradientBoostingClassifier | 0.922 | 0.006 | max_features="log2", max_depth=5, n_estimators=500, learning_rate=0.01 |
GradientBoostingClassifier | 0.922 | 0.007 | max_features="log2", max_depth=7, n_estimators=700, learning_rate=0.01 |
GOODFAITH
Top scoring configurations
model | mean(scores) | std(scores) | params |
:--------------------------- | ---------------: | --------------: | :-------------------------------------------------------------------------------- |
GradientBoostingClassifier | 0.927 | 0.007 | max_features="log2", max_depth=5, learning_rate=0.01, n_estimators=700 |
GradientBoostingClassifier | 0.927 | 0.006 | max_features="log2", max_depth=5, learning_rate=0.01, n_estimators=500 |
GradientBoostingClassifier | 0.925 | 0.008 | max_features="log2", max_depth=7, learning_rate=0.01, n_estimators=500 |
from revscoring.features import wikitext, modifiers
from revscoring.features.meta import aggregators
from revscoring.datasources.meta import dicts, extractors, filters, frequencies

image_params_re = r"(file|image|\w*)\s*=\s*((file|image):)?([^\n\|]+.(jpe?g|png|gif|svg))"

# Frequency tables of infobox image parameters, and the parent -> revision delta
revision_image_params_freq = frequencies.table(
    extractors.regex(image_params_re, wikitext.revision.datasources.text))
parent_image_params_freq = frequencies.table(
    extractors.regex(image_params_re, wikitext.revision.parent.datasources.text))
image_params_delta = frequencies.delta(parent_image_params_freq,
                                       revision_image_params_freq)
image_params_delta_sum = aggregators.sum(
    dicts.values(image_params_delta),
    name="enwiki.revision.diff.image_params_delta_sum")
image_params_increase = aggregators.sum(
    filters.positive(dicts.values(image_params_delta)),
    name="enwiki.revision.diff.image_params_increase")
image_params_decrease = aggregators.sum(
    filters.negative(dicts.values(image_params_delta)),
    name="enwiki.revision.diff.image_params_decrease")

image_link_re = r"^file|image:"

# Same three features for File:/Image: wikilinks
revision_imagelinks_freq = frequencies.table(
    wikitext.revision.datasources.wikilink_titles_matching(
        image_link_re, name="enwiki.revision.image_wikilinks"))
parent_imagelinks_freq = frequencies.table(
    wikitext.revision.parent.datasources.wikilink_titles_matching(
        image_link_re, name="enwiki.revision.parent.image_wikilinks"))
imagelinks_delta = frequencies.delta(parent_imagelinks_freq,
                                     revision_imagelinks_freq)
imagelinks_delta_sum = aggregators.sum(
    dicts.values(imagelinks_delta),
    name="enwiki.revision.diff.imagelinks_delta_sum")
imagelinks_increase = aggregators.sum(
    filters.positive(dicts.values(imagelinks_delta)),
    name="enwiki.revision.diff.imagelinks_increase")
imagelinks_decrease = aggregators.sum(
    filters.negative(dicts.values(imagelinks_delta)),
    name="enwiki.revision.diff.imagelinks_decrease")

local_wiki = [
    ...
    imagelinks_delta_sum,
    imagelinks_increase,
    imagelinks_decrease,
    image_params_delta_sum,
    image_params_increase,
    image_params_decrease,
]
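The idea behind the delta-based features can be sketched in plain Python with `collections.Counter`. This only loosely mirrors what the frequency-table datasources are used for above and does not call revscoring; `frequency_delta` and the sample file names are made up for illustration.

```python
from collections import Counter


def frequency_delta(old_items, new_items):
    """Per-item change in frequency from old_items to new_items (zeros dropped)."""
    old, new = Counter(old_items), Counter(new_items)
    return {item: new[item] - old[item]
            for item in old.keys() | new.keys()
            if new[item] != old[item]}


parent_images = ["A.jpg", "B.png", "B.png"]
revision_images = ["B.png", "C.svg"]

delta = frequency_delta(parent_images, revision_images)
delta_sum = sum(delta.values())                      # net change
increase = sum(v for v in delta.values() if v > 0)   # references added
decrease = sum(v for v in delta.values() if v < 0)   # references removed
print(delta, delta_sum, increase, decrease)
```

Unlike a simple difference of counts, the increase and decrease sums still register an edit that swaps one image for another, where the net change is zero.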
the results:
DAMAGING:
Top scoring configurations
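model | mean(scores) | std(scores) | params |
:--------------------------- | ---------------: | --------------: | :----------------------------------------------------------------------- |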
GradientBoostingClassifier | 0.923 | 0.006 | max_depth=5, n_estimators=700, max_features="log2", learning_rate=0.01 |
GradientBoostingClassifier | 0.923 | 0.006 | max_depth=5, n_estimators=500, max_features="log2", learning_rate=0.01 |
GradientBoostingClassifier | 0.922 | 0.008 | max_depth=7, n_estimators=500, max_features="log2", learning_rate=0.01 |
GOODFAITH:
Top scoring configurations
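model | mean(scores) | std(scores) | params |
:--------------------------- | ---------------: | --------------: | :----------------------------------------------------------------------- |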
GradientBoostingClassifier | 0.928 | 0.006 | max_depth=5, n_estimators=700, learning_rate=0.01, max_features="log2" |
GradientBoostingClassifier | 0.928 | 0.006 | max_depth=5, n_estimators=500, learning_rate=0.01, max_features="log2" |
GradientBoostingClassifier | 0.926 | 0.008 | max_depth=7, n_estimators=500, learning_rate=0.01, max_features="log2" |
¯\_(ツ)_/¯