Quantitative Exploration of Content Translation Tools
Closed, Resolved (Public)

Description

Building on this notebook (https://paws-public.wmflabs.org/paws-public/User:Isaac_(WMF)/Content%20Translation%20Example.ipynb), start to explore what types of content are translated and what happens to these articles once they are created.

On top of the content that was translated, which the above notebook demonstrates ways to access, more data can be accessed about the translations and what occurred after them. Try comparing statistics about edits, pageviews, etc. between the source and translated versions of articles. More advanced analyses in a project might eventually compare translated articles with similar articles that were not translated or classify edits based upon their 'type' for more fine-grained analyses of what happens to translated articles.

You can programmatically access page views for source/translated pages:
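
A minimal sketch of one way to do this, assuming the Wikimedia REST pageviews endpoint (per-article) is used rather than the action API. The helper names `pageviews_url` and `total_views` are hypothetical and not taken from the notebook; the sketch only builds the request URL and sums a decoded JSON response, so the HTTP call itself is left to whichever client you prefer.

```python
from urllib.parse import quote

# Base path of the Wikimedia REST per-article pageviews endpoint.
PAGEVIEWS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(lang, title, start, end,
                  access="all-access", agent="user", granularity="daily"):
    """Build the pageviews URL for one page; start/end are YYYYMMDD strings."""
    project = f"{lang}.wikipedia"
    # Titles use underscores instead of spaces and must be percent-encoded.
    article = quote(title.replace(" ", "_"), safe="")
    return (f"{PAGEVIEWS_BASE}/{project}/{access}/{agent}/"
            f"{article}/{granularity}/{start}/{end}")


def total_views(payload):
    """Sum the 'views' field over the 'items' list of a decoded response."""
    return sum(item["views"] for item in payload.get("items", []))
```

Building one URL for the source page and one for the translated page (with the matching language code) and comparing `total_views` over the same date range gives a first, rough comparison. Send a descriptive User-Agent header with the request.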

The notebook shows you how to programmatically access a page's history as well:
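
As a rough sketch (not the notebook's own code), a page's full history can be walked by repeating a `prop=revisions` query and passing back the `continue` values from each response. `iter_revisions` is a hypothetical helper; `session` is assumed to be any object with a `get(**params)` method that returns the decoded JSON, such as an `mwapi.Session` pointed at the right language edition.

```python
def iter_revisions(session, title, props="ids|timestamp|user|size|comment"):
    """Yield every revision of one page, oldest requests last, following continuation."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": props,
        "rvlimit": "max",
        "formatversion": "2",  # makes query.pages a list instead of a dict
    }
    cont = {}
    while True:
        response = session.get(**params, **cont)
        for page in response.get("query", {}).get("pages", []):
            yield from page.get("revisions", [])
        if "continue" not in response:
            break
        # Pass the continuation values back unchanged on the next request.
        cont = response["continue"]
```

Counting the yielded revisions, or the distinct `user` values among them, for the source and the translated page is one simple statistic to compare.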

Event Timeline

Isaac triaged this task as Medium priority. Mar 13 2019, 6:06 PM

Hi there, I am currently doing this research project too.

I consider myself pretty familiar with Jupyter Notebook, so if you meet any problems, feel free to drop me a line and I'll see what I can do to help. :)

A quick tip that always helps me: if you want to glance at a function's signature, put your caret inside the brackets and press Shift + Tab.

Good luck everyone. 💃

@Isaac, could you help me with one problem? I am trying to access Page History for the translated page, but it doesn't work. It works as planned with the source page though.
https://monosnap.com/file/kxejBqfshB8RoxQVdEHb4m1TWGEROc
https://monosnap.com/file/PRpyqH0gmMZcGXvVCOKnxxVS9Ukemm

I am trying to access Page History for the translated page, but it doesn't work. It works as planned with the source page though.

Hey @Cherrywins : a few things going on:

  • My guess is that you have your host for the session set to English Wikipedia instead of Russian Wikipedia. For certain API calls, this wouldn't matter, but given that you're looking up a page's revisions, this information is language-specific. You'll want the code to look something like this:
    • session = mwapi.Session(host='https://ru.wikipedia.org', user_agent='mwapi (python) -- outreachy content translation')
  • The other thing to note for that particular example is that, after the page was translated, it seems a better name was chosen, so the page title you are using is actually a redirect and most of the content etc. is under this title: "Мендес, Камила"
    • To automatically resolve these redirects, you should be able to add "redirects": True to the API parameters
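
The two points above can be sketched roughly as follows. The helper names (`wiki_host`, `redirect_params`, `resolved_title`) are hypothetical, and the sketch assumes the response lists any title normalization and redirects under `query.normalized` and `query.redirects` as `from`/`to` pairs.

```python
def wiki_host(lang):
    """Host for a given language edition, e.g. 'ru' for Russian Wikipedia."""
    return f"https://{lang}.wikipedia.org"


def redirect_params(title):
    """Revision query parameters that ask the API to follow redirects."""
    return {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user",
        "redirects": True,
        "formatversion": "2",
    }


def resolved_title(response, requested):
    """Return the title the API actually used after normalization and redirects."""
    query = response.get("query", {})
    title = requested
    for key in ("normalized", "redirects"):
        mapping = {entry["from"]: entry["to"] for entry in query.get(key, [])}
        title = mapping.get(title, title)
    return title
```

With mwapi this would be used along the lines of `session = mwapi.Session(host=wiki_host('ru'), user_agent=...)` followed by `session.get(**redirect_params(title))`; checking `resolved_title` afterwards shows whether the title you started from was a redirect.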

Thank you very much, it worked! :)

Mansi29ag added a comment.

Hi! Does the API provide the number of pageviews for an article, or is there another API for getting the number of pageviews of a Wikipedia article?

Still waiting for @Isaac's answer to this comment.

@Israashahin : thanks for letting me know that you needed an answer to that as the original comment has been deleted. The example notebook that I provided to you ( https://paws-public.wmflabs.org/paws-public/User:Isaac_(WMF)/Content%20Translation%20Example.ipynb ) has a link to examples of how to do that under the Quantitative Analyses section. If you have more specific questions, let me know.

Yes, sorry I removed that comment as @Isaac mentioned, I was able to find it in the example later. :)

@Isaac Is there a way to get the length of each article, i.e., the number of bytes, or do I have to scrape that information? Also, please have a look at my work and tell me whether I am going in the right direction. :)

Is there a way to get the length of each article, i.e., the number of bytes, or do I have to scrape that information?

Hey @Mansi29ag: on the general question of how to get additional information about a page, follow the same pattern as the other API calls -- for example, the revisions example I include in my notebook under "Quantitative Analyses". Change the parameters that are passed, though, to query other aspects of the API. The various options you have are listed here: https://en.wikipedia.org/w/api.php?action=help&modules=query

For your specific length of article question, you'd look under the "info" module but you might also find others interesting (e.g., "linkshere")
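
A small sketch of the "info" lookup, assuming each page entry in the response carries a `length` field giving the page size in bytes. `page_lengths` is a hypothetical helper, and `session` is any object with a `get(**params)` method returning decoded JSON (for example an `mwapi.Session`).

```python
def page_lengths(session, titles):
    """Map each returned page title to its length in bytes (None if missing)."""
    response = session.get(
        action="query",
        prop="info",
        titles="|".join(titles),
        formatversion="2",  # makes query.pages a list
    )
    pages = response.get("query", {}).get("pages", [])
    return {page["title"]: page.get("length") for page in pages}
```

Pages that do not exist come back without a `length`, so they map to `None` here rather than raising an error.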

@Isaac hi! I am an Outreachy applicant and have been working on the quantitative analysis task. I'd really like someone to look at my work so far and guide me on how to go further. Please let me know how I can share my notebook with you.
I also can't seem to get on FreeNode. I registered my account but when I try login (using /connect chat.freenode.net 6667 mquin:uwhY8wgzWw22-zXs.M39p) it shows 'You're not an IRC operator'.

Thanks!

@Isaac I was trying to get a particular page's information on Hindi Wikipedia when all I have is its translation ID (for a translation from English to Hindi), which I got from the dump file. I thought I could add some parameters to the mwapi call (action=query), but I could find only pageids and titles as ways to look up a page. I tried connecting it to the cxpublishedtranslations API, but the maximum number of results returned is 500, and the given translation ID might not be among them. Am I missing something, or is there a way around this?

@Trishla08 if you have specific questions, I or others can try to provide some assistance. General feedback is not feasible at this stage though. I am not great at troubleshooting IRC but Phabricator has been the more effective channel for discussion on this project.

@AggNisha Unfortunately you'll have to follow the example I provided to get page titles as they are not included in the dump files. You might find that it is worthwhile to scan through the entire cxpublishedtranslations API and save the mapping of translation ID -> page title to a CSV so you do not have to rebuild it every time.
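
One way the scan-and-save idea might look, as a sketch under stated assumptions: `fetch_page(offset, page_size)` is a hypothetical callable that returns one page of translation records from the cxpublishedtranslations API, and the record keys `translationId`, `sourceTitle`, and `targetTitle` are assumptions about the response shape that should be checked against real output.

```python
import csv


def collect_translations(fetch_page, page_size=500):
    """Page through results by offset until a short (or empty) page is returned."""
    offset = 0
    rows = []
    while True:
        batch = fetch_page(offset, page_size)
        rows.extend(batch)
        if len(batch) < page_size:
            break
        offset += page_size
    return rows


def write_mapping(rows, out):
    """Write translation ID -> page titles to an open text file as CSV."""
    writer = csv.writer(out)
    writer.writerow(["translation_id", "source_title", "target_title"])
    for row in rows:
        # Key names are assumed; adjust them to match the actual records.
        writer.writerow([row.get("translationId"),
                         row.get("sourceTitle"),
                         row.get("targetTitle")])
```

Separating the paging loop from the request keeps the 500-result limit from mattering: each call asks for the next offset, and the CSV only has to be built once.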

@Isaac where do I submit my analysis notebook? Can't seem to find the link.

Also. If I want to get all the page revisions of multiple files, is there a way to do it in a single function call?

Thanks.

@Isaac, could you help me with one thing? How can we programmatically determine the topic/category of the translated article? I don't see this option in the MediaWiki API. Should I use some other API?

@Cherrywins : categories are not a straightforward concept on Wikipedia. I believe you can get the categories that are listed for a page (https://www.mediawiki.org/wiki/API:Categories), but this is far from a perfect solution. I would not worry about getting this perfect on a submission - if you find an approach that works, great, but I'd say more important is to discuss how you might approach this given more time.
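
A rough sketch of the category lookup mentioned above, using `prop=categories` and skipping hidden maintenance categories. `visible_categories` is a hypothetical helper, and `session` is any object with a `get(**params)` method returning decoded JSON (for example an `mwapi.Session`). As noted, the result is only a crude proxy for an article's topic.

```python
def visible_categories(session, title):
    """Return the non-hidden category names listed on one page."""
    response = session.get(
        action="query",
        prop="categories",
        titles=title,
        cllimit="max",
        clshow="!hidden",
        formatversion="2",  # makes query.pages a list
    )
    names = []
    for page in response.get("query", {}).get("pages", []):
        for category in page.get("categories", []):
            # Drop the namespace prefix ("Category:", "Категория:", ...).
            names.append(category["title"].split(":", 1)[-1])
    return names
```

Category names are language-specific, so comparing the source and translated versions would still need some mapping between the two category systems.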

@Isaac where do I submit my analysis notebook? Can't seem to find the link.

Also. If I want to get all the page revisions of multiple files, is there a way to do it in a single function call?

Thanks.

@Isaac hey, can you have a look at this?

@Isaac, I am trying to get Public Link of my Jupyter Notebook on PAWS but it gives an error 'Not Found'. (After I've clicked on "Public link" button). Did anyone have this problem before?

P.S. Problem is solved. I saved my notebook, allowed all pop-ups, turned off ad blocker on this site, reloaded page and it worked!

The public link button is pretty simple: it creates a link with a regex substitution of your current notebook. It doesn't recreate the link when the file is renamed. I can't find an issue for this now, but would appreciate it if someone could find or create one.

Yes, regarding public links for PAWS notebooks: in general if you want to check what public notebooks exist for you, you can go to this URL (with your username substituted in) to see the list:
https://paws-public.wmflabs.org/paws-public/User:<username>/

Thanks for letting me know I'd missed your comment @Trishla08:

where do I submit my analysis notebook? Can't seem to find the link.

I don't know much about what the application process looks like, but as long as the public link to your notebook is included in your application, I'll be able to access it. Others on this thread might have more insight.

Also. If I want to get all the page revisions of multiple files, is there a way to do it in a single function call?

Yes, in general for most API calls you can query multiple pages by separating them with the "|" character. Here is the documentation for the revisions API, which includes an example of what you're asking about: https://www.mediawiki.org/wiki/API:Revisions#Example_1:_Get_revision_data_of_several_pages
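
A small sketch of the "|" batching, assuming a limit of 50 titles per request for ordinary clients (worth confirming in the API help for your account). `title_batches` is a hypothetical helper.

```python
def title_batches(titles, batch_size=50):
    """Split titles into '|'-joined strings of at most batch_size titles each."""
    titles = list(titles)
    return ["|".join(titles[i:i + batch_size])
            for i in range(0, len(titles), batch_size)]
```

Each returned string can be passed as the `titles` parameter of one query. One caveat, as far as I understand the revisions module: when several pages are queried at once it returns only the latest revision of each, so full histories still need one query per page.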

@Trishla08 You can submit your work under "Record a Contribution". There is a form where you can fill in a contribution link and description. You can edit your contribution and final application later, after submitting.

@Isaac I encountered a few problems, could you please help me with them?

  • Can sections in the dump file and those returned by contenttranslationcorpora differ? I found that the dump file contains fewer sections than contenttranslationcorpora for some translation IDs, but could not understand why.
  • How can we get the type of the edit done? There is the parameter 'rvprop', which can be set to tags and comments in prop=revisions, but what about the case when no tags or comments have been provided?
  • Is 'any' attribute calculated over the number of sections translated, or over the whole article?
  • I found a few sections in contenttranslationcorpora, for which 'content' exists for 'source', but not for 'mt' or 'user'. If the section has not been included at all in the translated version, why is it counted amongst the content translated version?

Thanks

@Isaac I am sorry for my long absence. Due to a problem, I could not be involved here. As approximately 5 days still remain, I would like to continue working on this task. Please allow me to do so.
Thank you

@NuKira that is entirely up to you whether you feel you can complete the application. Glad to hear you are still interested.

@AggNisha see below:

Can sections in the dump file and those returned by contenttranslationcorpora differ? I found that the dump file contains fewer sections than contenttranslationcorpora for some translation IDs, but could not understand why.

Yes, unfortunately there appears to be a discrepancy that I was previously unaware of - see T218168

How can we get the type of the edit done?

I'm not sure what you're interested in without more information. There have been attempts to categorize edits (for example: https://en.wikipedia.org/wiki/Wikipedia:Labels/Edit_types/Taxonomy), but I don't know of any APIs yet that would do that categorization for you. For the current project, you might try hand-labeling some of the edits.

Is 'any' attribute calculated over the number of sections translated, or over the whole article?

See T218003#5029680 and the follow-up comments.

I found a few sections in contenttranslationcorpora, for which 'content' exists for 'source', but not for 'mt' or 'user'. If the section has not been included at all in the translated version, why is it counted amongst the content translated version?

I don't know the exact answer, but I'd suggest trying the content translation tool yourself if you haven't; that might shed some light on why certain sections are included or not included.

@Isaac I was filling out my application for Outreachy. Are there any community-specific questions we need to answer? Also, where can we discuss the project timeline?

Closing this task as it served primarily as a discussion space for the quantitative analysis component of the Outreachy application. Research is now ongoing under T223765.