
Featured Page Revision History: Support PhD Researcher in Residence WMDE
Closed, Resolved (Public)

Description

Dydimus Zengenene is a PhD researcher at the University of Uppsala, Sweden, and currently a researcher in residence at WMDE. To support his PhD research, we need to provide several data sets on the revision history of a featured page on Wikipedia (page to be selected soon).

A brief description of what needs to be done (Dydimus Zengenene & Goran S. Milovanović, Google Hangouts 2019/11/08):

  • Topic: How do we manage online ecologies
    • People
    • Resources
    • Beliefs
  • Task: we are focusing on one Wikipedia page (featured article):
    • we want to study the historical path of that page from its creation until it became a featured article
    • track the editors who contributed to that page:
      • how are they related in terms of communication?
      • how did they collaborate?
  • The design of the data sets:
    • Page revision history
    • Pageviews for the selected page
    • External links from this page
    • Internal links from this page
    • Links to this page
    • Talk page analysis:
      • what were the users editing?
      • were the editors of the selected page collaborating on some other pages as well?
    • For the Social Networking analysis:
      • who mentions whom on the Talk page -> matrix representation -> graph (see the sketch following this list)
      • separation of editors into two groups: (a) those who produced the page, (b) those who promoted the page to become a featured article.
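For illustration only, here is a minimal sketch (in Python, using networkx) of the "matrix representation -> graph" step referenced above; the editor names and mention counts are hypothetical placeholders, not data from the selected page:

```
# Minimal sketch: turn a "who mentions whom" count matrix into a directed graph.
# The editor names and counts are hypothetical placeholders; in the actual analysis
# they would come from the (anonymized) Talk page data.
import networkx as nx

# key = (mentioning editor, mentioned editor), value = number of mentions
mention_counts = {
    ("editor_A", "editor_B"): 3,
    ("editor_A", "editor_C"): 1,
    ("editor_B", "editor_A"): 2,
}

G = nx.DiGraph()
for (source, target), count in mention_counts.items():
    G.add_edge(source, target, weight=count)

# weighted in-degree as a rough measure of how often an editor is mentioned
print(dict(G.in_degree(weight="weight")))  # {'editor_A': 2, 'editor_B': 3, 'editor_C': 1}
```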

To be specified/ToDo:

  • the page that will undergo analysis;
  • Phabricator account for Dydimus Zengenene.

Note: All private data (user names, IDs, etc.) needs to be anonymized, and all data sets will have to undergo a review with Analytics before their release.

Ideally, we want to have the data sets produced by December 08, 2019.

Event Timeline

2019/11/10: consider the WikiConv data set for this project.

2019/11/10: shared an excerpt of the WikiConv data set with Dydimus to assess it and decide if it can be used in his research project.

@Dydimusz The selected page is https://en.wikipedia.org/wiki/Hurricane_Hazel (your email today, 5:29 PM CET).
Please let me know what you think about the WikiConv data set. Thanks.


@GoranSMilovanovic I am trying to make sense of the WikiConv dataset based on the example you shared. Indeed it's big, and it may not be very useful, as it does not lead us to, or come from, our specific page of interest. However, I like their idea of measuring the toxicity of comments; maybe I can use the same idea to measure the toxicity of the talk around that page, and even of vandalism and some edits. I have not yet understood how that measure is arrived at, though.

@Dydimusz

I am trying to make sense of the WikiConv dataset based on the example you shared...

I will try to help you understand the structure of this data set in this comment (see below).

Indeed it's big, and it may not be very useful, as it does not lead us to, or come from, our specific page of interest.

I have shared just a tiny excerpt from the whole data set, which spans hundreds of gigabytes. The page of interest - Hurricane_Hazel - is somewhere in the data set. In relation to the WikiConv data set, I am primarily interested in two things:

  • first, whether the organization of the data set (leaving aside the toxicity score, which you did not intend to study in the first place) is suitable for your research, and
  • second, if we decide to go for this data set, how I will manage it in our infrastructure.

@Dydimusz You can help with the first by letting me know if the organization of the data set (the columns) at least partly matches what you need. As for the second point, I will have to decide whether to use this data set or to parse the target page directly (once again: managing the whole data set is a complex task on its own, and it probably does not make sense to go for it just to extract the data from one page).

The WikiConv data set

From the description of the WikiConv data set, and given that you are interested primarily in user interactions, the columns that seem to be relevant for your research are the following:

  • id: An id for the action that the row represents.
  • page_title: The name of the Talk Page where the action occurred.
  • replyTo_id: The id of the action that this action is a reply to.
  • cleaned_content: The text of the comment or section underlying the action without MediaWiki markup.
  • user_text: The name of the user that made the edit from which the action was extracted.
  • type: The type of action that the row represents. This will be one of the types enumerated in the previous section (@Dydimusz: the action types are described in the data set's GitHub repo; to me it seems that CREATION - "an edit that creates a new section in wiki markup" - and ADDITION - "an edit that adds a new comment to the thread of a conversation" - are the two you will be interested in, because they refer to user interactions, e.g. "replies").

From these columns of the WikiConv data set we can learn who (user_text) replied what (cleaned_content) to whom (replyTo_id; we then look up that id to see who authored it) and on which page (page_title). From the description of your research in this ticket, it seems that this information would cover what is needed for the social networking analysis (see the sketch below).
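For illustration, a minimal sketch (in Python) of that "who replied to whom" step; the rows below are hypothetical stand-ins for WikiConv rows, restricted to the columns listed above:

```
# Sketch: derive directed "reply" interactions from WikiConv-style rows.
# The rows are hypothetical placeholders; real rows would come from the WikiConv
# excerpt, keeping only id, replyTo_id, user_text and type.
from collections import Counter

rows = [
    {"id": "a1", "replyTo_id": None, "user_text": "user_1", "type": "CREATION"},
    {"id": "a2", "replyTo_id": "a1", "user_text": "user_2", "type": "ADDITION"},
    {"id": "a3", "replyTo_id": "a2", "user_text": "user_1", "type": "ADDITION"},
]

# 1. map every action id to the user who performed it
author_of = {row["id"]: row["user_text"] for row in rows}

# 2. for every reply, count a directed interaction: replier -> author of the parent action
reply_edges = Counter(
    (row["user_text"], author_of[row["replyTo_id"]])
    for row in rows
    if row["type"] == "ADDITION" and row["replyTo_id"] in author_of
)

print(reply_edges)  # e.g. Counter({('user_2', 'user_1'): 1, ('user_1', 'user_2'): 1})
```

The resulting edge counts are exactly the matrix representation needed for the graph in the social networking analysis.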

@Dydimusz Leaving aside the technical aspects (how the data will be extracted) and the parts of the data set in which we are not interested (e.g. the toxicity score): at this point I am trying to learn from you whether the structure of the data, as I have described it above, fits the purpose of your research. I will take care of the data extraction once that question is settled (and yet I have to see how: managing the whole data set to analyze one page, well, no; parsing the page with my own code, maybe; parsing the page relying on the code that produced the WikiConv data set, perhaps; we'll see). Thank you.

@Dydimusz N.B. If we will be analyzing only the Hurricane_Hazel Talk page, then in my humble opinion it can be done manually. All other data that you need (any other pages where the editors of this page have interacted, page internal links, page external links, etc.) can easily be obtained from our databases. Please consider this option - there are really not that many interactions on that Talk page.

@GoranSMilovanovic Thanks for this detailed explanation. It is really helpful. I now understand it a bit more clearly than I could from the corpus article. I think these suggested columns would be ideal then. As for the way forward, maybe we can arrange another video call, if you have a few minutes, to clear up the misunderstanding on my part here. Would you have some time for this tomorrow, 19 November, or Thursday, 21 November?

@Dydimusz Tomorrow November 19, anytime after 15:30 CET, works for me.

In the meantime, I will probably have all the data sets - except the interactions on the Talk page - extracted, but they will not be available before they are reviewed by the WMF Analytics.

As for the editor interactions on the Talk:Hurricane Hazel page: I really think it would take you only around half an hour to construct the data set manually. Tomorrow I can share with you how to organize a spreadsheet to do it.

@GoranSMilovanovic Thank you very much for the work, especially given that you are not feeling well at the moment. I am available at 15:30 to discuss, but if you are not able to make it due to your doctor's appointment, we can meet Wednesday morning or Friday. Only my Thursday is occupied this week.

@Dydimusz Today 15:30 CET is just fine. A Google Hangouts invite has been sent out to your gmail address.

@Dydimusz @WMDE-leszek

We have now opened a ticket for a public review of the most sensitive data sets for this research project: T239393.
@Dydimusz The idea now is to wait for Analytics to let us know if the data sets can be used publicly.

In the meantime, let me summarize what we have at this point:

  • the two data sets that are undergoing review encompass all (anonymized) users that have engaged in "replies" on the English Wikipedia page of interest in your research;
  • for each pair of users that have ever interacted on this page we know which user "replied" (i.e. has a reference to a parent revision) to another and how many times;
  • and we also have the data for their interactions across other pages in English Wikipedia, including the page_id of the page of interaction and the frequency of interactions.

The data sets now undergoing public review in T239393 should present all that is needed for your social networking analysis.
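For later, once the review is finished, here is a minimal sketch of how the cross-page interaction data could be summarized; the column names user_A, user_B, page_id and interactions are my assumptions for illustration, and the values are made up - the delivered files may be organized differently.

```
# Sketch: find the pages on which the (anonymized) editors of the target page
# interacted with each other most often. Column names and values are hypothetical.
import pandas as pd

cross_page = pd.DataFrame({
    "user_A":       ["u1", "u1", "u2"],
    "user_B":       ["u2", "u3", "u3"],
    "page_id":      [111, 222, 111],
    "interactions": [4, 1, 2],
})

top_pages = (
    cross_page.groupby("page_id")["interactions"]
    .sum()
    .sort_values(ascending=False)
)
print(top_pages)  # page_id 111 -> 6, page_id 222 -> 1
```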

The remaining data sets:

  • Pageviews for the selected page
  • External links from this page
  • Internal links from this page
  • Links to this page

are all easy to obtain from public sources and will be provided together with the social networking data as soon as the data review is finished.
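For the pageviews in particular, a hedged sketch of how they can be pulled from the public Wikimedia Pageviews REST API (per-article counts are available from July 2015 onward; the date range and User-Agent string below are illustrative):

```
# Sketch: fetch daily pageviews for the target page from the Wikimedia REST API.
import requests

endpoint = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    "en.wikipedia.org/all-access/user/Hurricane_Hazel/daily/20150701/20191130"
)
response = requests.get(endpoint, headers={"User-Agent": "featured-page-research"})
response.raise_for_status()

daily_views = [(item["timestamp"], item["views"]) for item in response.json()["items"]]
print(daily_views[:5])
```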

Finally, as for the talk page analysis:

I think the interactions on the talk page of the selected https://en.wikipedia.org/wiki/Hurricane_Hazel page are really sparse. The advised course of action is a good old-fashioned pen and paper analysis. It shouldn't take you more than half an hour to reconstruct all the conversations that took place there.

@Dydimusz

While we wait for the public data review in T239393 to complete, here are the additional data sets.

All of the following were obtained through the MediaWiki Action API (a minimal retrieval sketch follows the list):

  • Links to this page: backlinks_TargetPage.csv
  • External links from this page: extlinks_TargetPage.csv
  • Internal links from this page: links_TargetPage.csv.
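For reference, a minimal sketch of how such a list can be pulled from the Action API; it shows only the backlinks query (list=backlinks) and follows the API's continuation tokens - the prop=extlinks and prop=links queries behind the other two files are analogous:

```
# Sketch: collect all pages linking to the target page via the MediaWiki Action API.
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "backlinks",
    "bltitle": "Hurricane Hazel",
    "bllimit": "max",
    "format": "json",
}

backlinks, session = [], requests.Session()
while True:
    data = session.get(API, params=params).json()
    backlinks += [bl["title"] for bl in data["query"]["backlinks"]]
    if "continue" not in data:
        break
    params.update(data["continue"])  # follow the continuation token

print(len(backlinks))
```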

To be discussed (reminder: @Dydimusz we have a Hangouts session scheduled on Wednesday):

  • what were the users editing? (revision content)
  • separation of editors into two groups:
    • (a) those who produced the page,
    • (b) those who promoted the page to become a featured article.

@Dydimusz @WMDE-leszek

In relation to your question on whether we can learn what external pages point to pages in the Wikimedia universe (e.g. the target page of your research project - https://en.wikipedia.org/wiki/Hurricane_Hazel), it seems that the WMF has access to the Google Search Console service:

https://wikitech.wikimedia.org/wiki/Google_Search_Console_access

However, the documentation page also says:

User must have a valid NDA on file with WMF legal. Not a phabricator NDA electronically signed, but a full NDA with our legal department.

so they do not seem to be super-happy about providing access to it - I don't think that my (electronically signed) NDA covers this.

Example access request: T240501.

2020/03/01 - data sets on editor interactions delivered to @Dydimusz and @WMDE-leszek via email.