The feature set for Wikidata vandalism detection is good, but it could be better.
Description
Event Timeline
I just had a meeting with Wikidata's communication manager. She is starting the process and it takes some time.
Aaand now I made the landing pages for the feedback: https://www.wikidata.org/wiki/Wikidata:ORES
Sure:
is_client_move, is_client_delete, is_merge_into, is_merge_from, is_revert, is_restore, is_item_creation, sex_or_gender_changed, country_of_citizenship_changed, member_of_sports_team_changed, date_of_birth_changed, image_changed, signature_changed, commons_category_changed, official_website_changed, en_label_changed, is_human, is_blp
comment_longest_repeated_char, comment_uppercase_ratio, comment_numbers_ratio, comment_whitespace_ratio, comment_english_bad_words, comment_english_informals, comment_longest_repeated_uppercase_char, comment_has_url, comment_has_first_person_pronouns_en, comment_has_second_person_pronouns_en, comment_has_do_or_dont_en
log(wikibase.revision.parent.claims + 1), log(wikibase.revision.parent.properties + 1), log(wikibase.revision.parent.aliases + 1), log(wikibase.revision.parent.sources + 1), log(wikibase.revision.parent.qualifiers + 1), log(wikibase.revision.parent.badges + 1), log(wikibase.revision.parent.labels + 1), log(wikibase.revision.parent.sitelinks + 1), log(wikibase.revision.parent.descriptions + 1)
wikibase.revision.diff.sitelinks_added, wikibase.revision.diff.sitelinks_removed, wikibase.revision.diff.sitelinks_changed, wikibase.revision.diff.labels_added, wikibase.revision.diff.labels_removed, wikibase.revision.diff.labels_changed, wikibase.revision.diff.descriptions_added, wikibase.revision.diff.descriptions_removed, wikibase.revision.diff.descriptions_changed, wikibase.revision.diff.aliases_added, wikibase.revision.diff.aliases_removed, wikibase.revision.diff.properties_added, wikibase.revision.diff.properties_removed, wikibase.revision.diff.properties_changed, wikibase.revision.diff.claims_added, wikibase.revision.diff.claims_removed, wikibase.revision.diff.claims_changed, wikibase.revision.diff.identifiers_changed, wikibase.revision.diff.sources_added, wikibase.revision.diff.sources_removed, wikibase.revision.diff.qualifiers_added, wikibase.revision.diff.qualifiers_removed, wikibase.revision.diff.badges_added, wikibase.revision.diff.badges_removed, wikibase.revision.diff.proportion_of_qid_added, wikibase.revision.diff.proportion_of_language_added, wikibase.revision.diff.proportion_of_links_added
revision.comment.suggests_section_edit, revision.comment.has_link
revision.user.is_bot, revision.user.has_advanced_rights, revision.user.is_admin, revision.user.is_trusted, revision.user.is_patroller, revision.user.is_curator, revision_oriented.revision.user.is_anon, log(temporal.revision.user.seconds_since_registration + 1)
These are all of the features. Tell me if any of them is not clear enough.
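To make a few of the names in the list concrete, here is a minimal sketch of how features such as comment_uppercase_ratio, comment_longest_repeated_char, and log(wikibase.revision.parent.claims + 1) might be computed. The function names and the exact definitions (for example, dividing by the total comment length) are assumptions for illustration, not the actual model code.

```python
import math
from itertools import groupby


def uppercase_ratio(comment: str) -> float:
    """Assumed definition: uppercase characters divided by total characters."""
    if not comment:
        return 0.0
    return sum(1 for ch in comment if ch.isupper()) / len(comment)


def longest_repeated_char(comment: str) -> int:
    """Length of the longest run of one repeated character, e.g. 'eee' gives 3."""
    return max((len(list(run)) for _, run in groupby(comment)), default=0)


def log_count(count: int) -> float:
    """Log-scaled count, matching the log(x + 1) pattern used in the list."""
    return math.log(count + 1)


# Hypothetical edit summary and parent-revision claim count.
comment = "LOOOOL this is wrong"
parent_claims = 9

features = {
    "comment_uppercase_ratio": uppercase_ratio(comment),
    "comment_longest_repeated_char": longest_repeated_char(comment),
    "log(wikibase.revision.parent.claims + 1)": log_count(parent_claims),
}
```

The + 1 inside the logarithm keeps the feature defined when the parent revision has zero claims, labels, and so on.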
I don't know how much work this is.
@Lydia_Pintscher should this still be on the campsite?
Is this ready to be done?
It's probably good to do embedding or clustering on a set of one-hot encodings of properties changed, languages changed, number of statements per property, etc. That would likely make it much more accurate.
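A rough sketch of what this could look like, covering only the clustering variant and only the "properties changed" part: each edit becomes a 0/1 vector over the property IDs it touched, and a naive k-means groups similar edits so that the cluster ID could serve as one extra feature. The sample edits, the helper names, and the choice of plain k-means with the first k vectors as starting centroids are all assumptions for illustration; a real version would more likely use a library implementation.

```python
from typing import List, Sequence, Set


def build_vocabulary(edits: Sequence[Set[str]]) -> List[str]:
    """Sorted list of every property ID seen in any edit."""
    return sorted(set().union(*edits))


def encode(edit: Set[str], vocabulary: List[str]) -> List[float]:
    """One 0/1 slot per known property: 1.0 if the edit changed it."""
    return [1.0 if prop in edit else 0.0 for prop in vocabulary]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: List[List[float]], k: int, iterations: int = 10) -> List[int]:
    """Naive Lloyd's k-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical edits, each described by the set of properties it changed.
edits = [
    {"P21", "P569"},          # sex or gender, date of birth
    {"P18", "P373"},          # image, Commons category
    {"P21", "P569", "P27"},   # ... plus country of citizenship
    {"P18", "P373", "P856"},  # ... plus official website
]

vocabulary = build_vocabulary(edits)
vectors = [encode(edit, vocabulary) for edit in edits]
cluster_ids = kmeans(vectors, k=2)
```

In this toy data the biographical edits and the media/link edits end up in separate clusters. Languages changed and statements per property could be encoded the same way and concatenated onto the vector.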
Is it already possible to write up a more accurate description of what the expected outcome of this would look like? Right now it is wide open and general, with no clear end.
With this:
It's probably good to do embedding or clustering on a set of one-hot encodings of properties changed, languages changed, number of statements per property, etc. That would likely make it much more accurate.
Maybe we could repurpose this task to capture doing that (which I don't understand yet). @Ladsgroup, does that make sense?