
What UI vs API based revisions data do we have for Wikidata?
Closed, Duplicate · Public

Description

This ticket is for discussing an initial analysis of what data we have for telling UI-triggered (more broadly: non-API) revisions of Wikidata apart from API-triggered ones. With that understanding we will be able to analyze data on API vs non-API based Wikidata revisions, better understand typical usage of the Wikibase API, and potentially gain deeper insight into the structure of our editing community.

In particular, we want to learn about the following:

  • What data do we have to extract useful information from (e.g. user agent strings and similar)?
  • Can we map such data to items/revisions/users?
  • How difficult would it be to get them?
  • What can you quickly find out?
  • Who should we talk to if we need more knowledge or additional datasets produced?

Event Timeline

@Manuel

Initial check of the tags from T236893 ("Tag all edits made via Wikibase View and Wikibase Client"):

  • data-bridge: used 279 times;
  • client-linkitem-change: used 58740 times;
  • client-automatic-update: used 58367 times;
  • wikidata-ui: used 902690 times;
  • termbox: used 6384 times.

The counts were read from the wikidatawiki.change_tag_def table.

Thx Goran! Could you please figure out where the problem is? The tags are already in use.

@Manuel

Thx Goran! Could you please figure out where the problem is? The tags are already in use.

The tags are indeed in use, but you must have seen T285459#7321218 before I edited it.
My bad: at first sight I did not spot the new tags in the table (attached). Please take another look.

Thank you! Let's discuss how we could use these tags in combination with other information (e.g. isBot and the total number of edits).

@Manuel

Let's discuss how we could use these tags in combination with other information (e.g. isBot and the total number of edits).

I am currently working on this in SQL, because the August 2021 snapshot of wmf.mediawiki_history has not yet been produced in our Data Lake.

It might take some time to get results, since a rather tricky join (change_tag_def --> change_tag --> revision_actor_temp --> actor --> user) is needed to find out how many times each of the new tags was used, and by whom, in wikidatawiki since 20210801000000.

I will not use user.user_editcount, since it is only a rough approximation of the number of revisions a user has made; instead I will count revisions directly, joining user via revision_actor_temp --> actor.
Also, I will need to filter out bots manually in this approach; I have already created two fields for that (botByName, botByGroup), similar to what wmf.mediawiki_history does in the Data Lake.
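As an illustration of the two bot fields mentioned above, here is a minimal sketch of how such flags could be derived. The name pattern (user name ends in "bot", case-insensitive) and the group name "bot" are assumptions made for this example, not the rules actually used in this analysis or in wmf.mediawiki_history; `bot_flags` and `is_bot` are hypothetical helper names.

```python
import re

# Assumption for illustration: a user name "looks like" a bot if it ends in "bot".
# This heuristic gives false positives (e.g. a human called "Abbot").
BOT_NAME_PATTERN = re.compile(r"bot$", re.IGNORECASE)


def bot_flags(user_name, user_groups):
    """Return the two flags for one user.

    user_name   -- the account name, or None for an anonymous actor
    user_groups -- iterable of group names the account belongs to
    """
    name = (user_name or "").strip()
    return {
        "botByName": bool(BOT_NAME_PATTERN.search(name)),
        "botByGroup": "bot" in set(user_groups),
    }


def is_bot(flags):
    """Treat a user as a bot if either heuristic fires."""
    return flags["botByName"] or flags["botByGroup"]


if __name__ == "__main__":
    for name, groups in [("KrBot", []), ("Alice", ["bot"]), ("Carol", ["sysop"])]:
        flags = bot_flags(name, groups)
        print(name, flags, is_bot(flags))
```

Keeping the two flags separate makes it possible to check how often the name heuristic and the group membership disagree before deciding which one to filter on.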

To put it in a nutshell:

SELECT actor.actor_user AS user_id,
       change_tag_def.ctd_name AS rev_tag,
       COUNT(change_tag_def.ctd_name) AS rev_tag_count
FROM actor
LEFT JOIN revision_actor_temp ON (actor.actor_id = revision_actor_temp.revactor_actor)
LEFT JOIN change_tag ON (revision_actor_temp.revactor_rev = change_tag.ct_rev_id)
LEFT JOIN change_tag_def ON (change_tag.ct_tag_id = change_tag_def.ctd_id)
-- Timestamps are stored as 14-character strings (YYYYMMDDHHMMSS), so compare with a quoted literal.
-- This filter drops actors without revisions, so the first LEFT JOIN behaves like an inner join.
WHERE revision_actor_temp.revactor_timestamp > '20210801000000'
GROUP BY actor.actor_user, change_tag_def.ctd_name;
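To make the shape of this join easier to follow, here is a toy version that runs against an in-memory SQLite database. SQLite is only a stand-in for the wikidatawiki replica, the tables carry just the columns the join needs, and all rows are made up. The sketch uses inner joins, so untagged revisions are left out rather than returned with a NULL tag. `build_toy_db` and `tag_counts_per_user` are hypothetical helper names.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a minimal subset of the four tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE actor (actor_id INTEGER, actor_user INTEGER, actor_name TEXT);
        CREATE TABLE revision_actor_temp (
            revactor_rev INTEGER, revactor_actor INTEGER, revactor_timestamp TEXT);
        CREATE TABLE change_tag (ct_rev_id INTEGER, ct_tag_id INTEGER);
        CREATE TABLE change_tag_def (ctd_id INTEGER, ctd_name TEXT);

        -- Made-up rows: two users, two tags, four revisions.
        INSERT INTO actor VALUES (1, 100, 'Alice'), (2, 200, 'Bob');
        INSERT INTO change_tag_def VALUES (1, 'wikidata-ui'), (2, 'termbox');
        INSERT INTO revision_actor_temp VALUES
            (10, 1, '20210802000000'),
            (11, 1, '20210803000000'),
            (12, 2, '20210804000000'),
            (13, 2, '20210701000000');  -- before the cut-off
        INSERT INTO change_tag VALUES (10, 1), (11, 1), (12, 2), (13, 1);
        """
    )
    return conn


def tag_counts_per_user(conn, since):
    """Count tagged revisions per (user, tag) with a timestamp >= since.

    since is a 14-character YYYYMMDDHHMMSS string; string comparison
    orders these timestamps correctly.
    """
    query = """
        SELECT a.actor_user, d.ctd_name, COUNT(*)
        FROM revision_actor_temp AS r
        JOIN actor AS a ON a.actor_id = r.revactor_actor
        JOIN change_tag AS ct ON ct.ct_rev_id = r.revactor_rev
        JOIN change_tag_def AS d ON d.ctd_id = ct.ct_tag_id
        WHERE r.revactor_timestamp >= ?
        GROUP BY a.actor_user, d.ctd_name
        ORDER BY a.actor_user, d.ctd_name
    """
    return conn.execute(query, (since,)).fetchall()


if __name__ == "__main__":
    conn = build_toy_db()
    for row in tag_counts_per_user(conn, "20210801000000"):
        print(row)
```

Starting from revision_actor_temp rather than actor keeps the time filter on the first table of the chain, which is easier to reason about than filtering on the right-hand side of a LEFT JOIN.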

FWIW, a Quarry query covering one week of semi-automated edits from the most popular tools (selected manually based on another query) is here: https://quarry.wmcloud.org/query/58473
Results (tag: number of edits):
  • OAuth CID: 1776 (aka QS): 869604
  • wikidata-ui: 403984
  • client-linkitem-change: 30437
  • client-automatic-update: 20785
  • WikibaseJS-cli: 20100
  • OAuth CID: 1740 (aka Author-Disambiguator 2.0): 13024
  • openrefine-3.4: 6980
  • apps-suggested-edits: 6628
  • OAuth CID: 1768 (aka Mix'n match 1.0): 6002
  • termbox: 3422
  • openrefine-3.3: 2324
  • data-bridge: 1

According to @Lydia_Pintscher the lexeme edits are not getting the wikidata-ui tag yet.

Statistics are:
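Based on one week of data, manual edits on wikidata.org total 403984+3422+6628+1 = 414035, against 869604+20100+6980+6002+2324+13024 = 918034 tool-based edits, a ratio of 414035/918034 ≈ 0.45. Note that this compares manual edits to tool-based edits; as a share of both groups combined, manual edits make up 414035/1332069 ≈ 31%. The two client-* tags are not counted in either group.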
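The totals above can be rechecked in a few lines. The sketch below uses the counts from the results list and the same grouping of tags into manual and tool-based as the calculation above. It prints both the manual-to-tool ratio and the share of manual edits in the two groups combined, which are different figures.

```python
# Weekly counts copied from the results above; grouping follows the calculation in the text.
manual = {
    "wikidata-ui": 403984,
    "termbox": 3422,
    "apps-suggested-edits": 6628,
    "data-bridge": 1,
}
tool_based = {
    "OAuth CID: 1776": 869604,
    "WikibaseJS-cli": 20100,
    "openrefine-3.4": 6980,
    "OAuth CID: 1768": 6002,
    "openrefine-3.3": 2324,
    "OAuth CID: 1740": 13024,
}

manual_total = sum(manual.values())
tool_total = sum(tool_based.values())

# Manual edits relative to tool-based edits.
ratio = manual_total / tool_total
# Manual edits as a share of manual plus tool-based edits.
share = manual_total / (manual_total + tool_total)

print(f"manual: {manual_total}, tool-based: {tool_total}")
print(f"manual / tool-based = {ratio:.3f}")
print(f"manual / (manual + tool-based) = {share:.3f}")
```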

Thank you, Dennis! And nice to meet you in Alessandro's session!

Happy to be of service, and nice to meet you too. I liked the presentation a lot. I feel like doing a PhD myself on knowledge graphs like Wikidata 😆

So about the statistics, I'm actually surprised the percentage is so high. It will be interesting to follow this over time.