
Outreachy Application Task: Tutorial for Wikipedia Clickstream data
Closed, Resolved, Public

Description

Overview

This task serves as a tutorial with microtasks for the Outreachy Project T275608 (Build a tool for analyzing and visualizing reader navigation on Wikipedia). Starting from this notebook, try to go through the steps and complete the different TODOs.

The full Outreachy project will involve more comprehensive coding than what is being asked for here (and some opportunities for additional explorations as desired). This task will introduce some of the basic concepts and give us a sense of your Python skills, how well you work with new data, how you document your code, and how you describe your thinking and results. We are not expecting perfection -- give it your best shot! See this example of working with Wikidata data for a sense of what a completed notebook tutorial might look like.

Set-up

  • Make sure that you can login to the PAWS service with your wiki account: https://paws.wmflabs.org/paws/hub
  • Using this notebook as a starting point, create your own notebook (see these instructions for forking the notebook to start with) and complete the functions / analyses. All PAWS notebooks have the option of generating a public link, which can be shared back so that we can evaluate what you did. Use a mixture of code cells and markdown to document what you find and your thoughts.
  • As you have questions, feel free to add comments to this task (and please don't hesitate to answer other applicants' questions if you can help)
  • If you feel you have completed your notebook, you may request feedback and we will provide high-level feedback on what is good and what is missing. To do so, send an email to your mentor with the link to your public PAWS notebook. We will try to make time to give this feedback at least once to anyone who would like it.
  • When you feel you are happy with your notebook, you should include the public link in your final Outreachy project application as a recorded contribution. You may record contributions as you go as well to track progress.

Event Timeline


Thanks everyone for your help. I've fixed the problem. I also noticed that the row had a lot of content in it. I added the suggested quoting=3 to the arguments of csv.reader, and it worked.
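For anyone hitting the same issue, here is a minimal, runnable sketch of that fix (the sample row is invented; quoting=3 is just the value of csv.QUOTE_NONE):

```python
import csv
import io

# Invented clickstream-style row whose title contains double quotes;
# with the default quoting behaviour the reader would mangle them.
sample = 'other-search\t"Weird_Al"_Yankovic\texternal\t1200\n'

def read_clickstream_rows(text):
    # quoting=3 is csv.QUOTE_NONE: treat quote characters as ordinary
    # text instead of field delimiters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

rows = read_clickstream_rows(sample)
```

With the default quoting, the embedded quote characters would be swallowed; with QUOTE_NONE the title comes through verbatim as the second field.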

Hi everyone,
Welcome to everyone who joined since the last posts. Great to see the ongoing discussion.

As the deadline for the contribution period is (slowly) approaching, I wanted to give some updates:

  • One request for everyone working on this task: To get a good sense of how many people are intending to apply to the project (T275608), I'd like to ask that you make an initial contribution on the Outreachy site with a link to your current progress by the end of this week (so April 25). You can still edit your final application until the final deadline.
  • If you feel you have completed your notebook, you may request feedback and we will provide high-level feedback on what is good and what is missing. To do so, send an email to the mentors (mgerlach@wikimedia.org and/or isaac@wikimedia.org) with the public link to your PAWS notebook. We will try to make time to give this feedback once to anyone who would like it (though it is entirely optional). Note that this will take us a day or two, so best to send us your notebook by the end of this week as well (or the beginning of next week) in order to make sure we can get back to you before the final deadline.
  • Some general recommendations on the notebooks:
* Explain each piece of code that you are running. The idea is to make the notebook easy to understand. Don't make the reader have to guess what you were trying to do.
* Describe your motivation and conclusions for every statistic you show. For example, why are you plotting variable X or Y? And what is your takeaway/conclusion?
* Avoid long/repetitive code outputs that don't provide relevant information. For example, if you are applying a model that runs 1000 epochs, avoid printing 1000 lines, one for each epoch, because that makes the notebook difficult to read. If you think there is relevant information in those outputs, think about how to show it in a way that is compact and easy to understand (for example, a plot).

Thanks. Keep the questions / collaboration coming!

Hello everyone, I would like to ask what kind of visualization (bar graphs, pie charts, etc.) you all preferred over the others for the following part:

  1. TODO: Choose a destination article from the dataset that is:
     • relatively popular (at least 1000 pageviews and 20 unique sources in the dataset)
     • shows up in at least one other language with at least 1000 pageviews in January 2021 (you can check quickly via this tool: https://pageviews.toolforge.org/langviews/)
  2. Pull all the data in the clickstream dataset for that article (both as a source and destination)
  3. Visualize the data to show what the common pathways to and from the article are

Hi, @Zealink I did a Sankey diagram, like this available in the Research: Wikipedia clickstream page. I made the Sankey diagram using Plotly.
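For others weighing the same choice, here is a minimal sketch of the data wrangling a Plotly Sankey diagram needs: Plotly wants integer node ids plus parallel source/target/value lists. The article names and counts below are invented, and the go.Sankey call itself is left as a comment since it requires Plotly to be installed:

```python
# Hypothetical aggregated clickstream rows (source, destination, count)
# for a chosen article "Chess":
links = [
    ("other-search", "Chess", 9000),
    ("Checkers", "Chess", 1200),
    ("Chess", "Chess_opening", 800),
]

def to_sankey_data(links):
    """Map article names to integer node ids, as Plotly's Sankey trace expects."""
    labels = []
    index = {}
    for src, dst, _ in links:
        for name in (src, dst):
            if name not in index:
                index[name] = len(labels)
                labels.append(name)
    source = [index[s] for s, _, _ in links]
    target = [index[t] for _, t, _ in links]
    value = [v for _, _, v in links]
    return labels, source, target, value

labels, source, target, value = to_sankey_data(links)

# Rendering (requires plotly, so not run here):
# import plotly.graph_objects as go
# fig = go.Figure(go.Sankey(node={"label": labels},
#                           link={"source": source, "target": target, "value": value}))
# fig.show()
```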

The answer lies in the message you're trying to convey. It depends on the hidden patterns/insights you consider important to highlight. So, a Sankey diagram like the one @Vanevela uses is something I would choose for the 'common pathways' part. All the same, it is not the only choice you have. To compare values (e.g., pageviews among a subset of articles), pie charts, scatter plots, or line charts are my preference.

When I feel a bit confused, I use this to check if I am making the right choice based on my objective: https://blog.hubspot.com/marketing/types-of-graphs-for-data-visualization

@Vanevela thank you so much! I was struggling to get a nice-looking network graph using networkx, I don't know why I didn't think of Plotly. ❤

@Vanevela thank you so much! I was struggling to get a nice-looking network graph using networkx, I don't know why I didn't think of Plotly. ❤

Glad it was helpful :)

Hello @MGerlach and @Isaac, this is the last question in the Outreachy final application:

Please work with your mentor to provide a timeline of the work you plan to accomplish on the project and what tasks you will finish at each step. Make sure to take into account any time commitments you have during the Outreachy internship round.

So I would like to know if it is OK to use the phases described in T275608 as steps in the project timeline, and if you have any suggestions or guidance about it.

Thanks @Vanevela, I was considering this option too. It looks appropriate for this kind of data, thanks a lot.

That was really insightful @Ahn-nath. Thanks for the help.

@Vanevela good question, I will try to clarify. Yes, you can definitely use those phases (e.g., as a rough guideline). Also feel free to use them to identify which of the phases mentioned in the task (analysis, visualization, interface building/developing, model building) are most interesting to you and which you would like to spend more time on, as well as any additional steps you think are pertinent for your work on this project. In general though, no need to spend too much time on this.

Hello, I would like to ask a question. In the first TODO I do not understand what they mean by 'destination'. Can you please help?

Destination means the 2nd column in the dataset: the article that viewers end up on after clicking a link in another article, using a search engine ("other-search"), etc.
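To keep 'source' vs. 'destination' straight, one option is to name the columns when loading. A sketch with a tiny made-up sample in the clickstream TSV format (the real files use the columns prev, curr, type, n):

```python
import io
import pandas as pd

# In-memory sample: prev (source), curr (destination), type, n (count).
sample = (
    "other-search\tLondon\texternal\t5000\n"
    "England\tLondon\tlink\t1500\n"
    "London\tRiver_Thames\tlink\t900\n"
)

# Naming the columns up front avoids ambiguity about which integer
# index is the source and which is the destination.
df = pd.read_csv(io.StringIO(sample), sep="\t",
                 names=["prev", "curr", "type", "n"], quoting=3)

# All rows where London is the destination (the 2nd column, 'curr'):
to_london = df[df["curr"] == "London"]
```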

Hello @MGerlach and @Isaac, right from the inception of my contribution I have broken the microtask down into further tasks based upon my understanding. Can I reflect the same while recording contributions on the Outreachy website, providing links to the respective notebooks? Do we have any limitations on the number of contributions? TIA

@puja_jaji The notebook from the application task is already divided into smaller tasks in separate sections with individual TODOs. If you want to further subdivide the tasks according to your understanding, that sounds good. If you add some explanation on why you subdivide it like this, that would be very much appreciated as it helps us to follow your reasoning. Ideally, you would have your contribution for the application task in a single notebook.
Also, I want to mention this from an earlier comment: you may request feedback before your final submission if you would like:

If you feel you have completed your notebook, you may request feedback and we will provide high-level feedback on what is good and what is missing. To do so, send an email to the mentors (mgerlach@wikimedia.org and/or isaac@wikimedia.org) with the public link to your PAWS notebook. We will try to make time to give this feedback once to anyone who would like it (though it is entirely optional). Note that this will take us a day or two, so best to send us your notebook by the end of this week as well (or the beginning of next week) in order to make sure we can get back to you before the final deadline.

Hi everyone, I wasn't quite sure whether we had to remove the initial comments (and Markdown texts) completely or partially before adding our own justifications and conclusions to the given tasks in the notebook. Could you all please tell me your thoughts on that?

Hi @Zealink!
I have not removed any initial comments or markdown texts from the notebook, and I do not plan to do so for the final submission. Since our interpretations of the tasks could differ from what was expected, I think it would be helpful for the mentors to keep track of what was written in the task and how we have tackled it. I believe that removing the existing documentation would be required only if you plan to explore the data further, carry out additional analysis, and expand this notebook in order to present it as an independent project.

Hey, @Zealink.
 
I support rachita on this. I wouldn't remove all the comments because they serve as a good reference; it's a given. Yet, I did remove or modify some parts. As a general pattern, I kept the ones that described the task/stage and removed parts that were much more specific to the how-to of a task (tutorial). At least for me, this rule only applied to comments.

Hi @Zealink

Like @Ahn-nath, I too made modifications to the comments, but I've kept the overall structure and flow the same.

Any advice on how I can visualize the data from a TODO that I do not quite understand?

Hi @PatsonJay!
Please elaborate on where exactly you are facing an issue and which part you are finding difficult to understand.

"Visualize the data to show what the common pathways to and from the article are." This TODO isn't clear to me; can someone please explain?

@PatsonJay

Using Python, you may showcase your analytical skills and the observations you have drawn with pictorial representations like graphs, charts, etc.

@PatsonJay You need to visualize the data to find the most common pathways to and from the article you chose. For instance, you can use Python libraries to plot a bar graph of the frequencies of the various sources and destinations of the chosen article and infer the most common pathways from it.

Hi @PatsonJay, those previous comments on a related issue can help you as well.

Hi,

Is someone else having trouble accessing the paws service? I'm seeing a 503 service unavailable error. A bit worried now.

@Pikaa97 it looks like there is some maintenance work going on. Hopefully it will be up in an hour. I'll try to give an update if I hear more, but know that it's not specific to you and the service should be back soon.

Hello @Pikaa97, yes, I have problems too, but it is good to know that the service should be back soon. :)

Hi, @Pikaa97

Yes, the service is currently unavailable.

Oh, okay, that's a relief to know. @Isaac thank you for letting me know, I'll check back again in an hour or so.
Also thanks @Ahn-nath @Vanevela!

FYI PAWS should be back but if you continue to have intermittent issues, don't be too surprised, just try back in another e.g., 10 minutes.

Oh okay, noted. Thanks!

It's up now.

Hello. While trying to do a query using the mwapi library, I'm getting the error below. What might be the problem?

ValueError: Could not decode as JSON:
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>MediaWiki API help - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!0,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","Fe

I solved the problem by doing a request command.
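For reference, the HTML in that traceback is the MediaWiki API help page (note the `<title>MediaWiki API help</title>`), which api.php serves when it doesn't receive parameters it can act on, such as a missing action, or no format=json on a raw request. A hypothetical parameter set along the lines used in this thread that avoids the problem:

```python
# Hypothetical helper: the parameter names follow the MediaWiki Action API
# (action / prop / titles / format); "format": "json" asks for a JSON
# response instead of the HTML help page.
def langlinks_params(title):
    return {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "max",
        "format": "json",
    }

params = langlinks_params("Chris_Ferguson")

# The actual call needs mwapi and network access, so it is not run here:
# import mwapi
# session = mwapi.Session("https://en.wikipedia.org", user_agent="<your contact info>")
# result = session.get(**params)
```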

Just a reminder, guys: don't forget to submit your final application on the Outreachy website. We have another 25 and a half hours. The deadline is April 30, 2021, 4 PM UTC.

@puja_jaji thanks for the reminder. Also, it was okay for us to record just one contribution for the task, right?

Hello everyone, I have an article, say with the name Chris Ferguson, that I got from the English clickstream. When I do a query using the mwapi library, I can see that this article also appears in Spanish. But when I try to find this article in the Spanish clickstream ("eswiki"), I do not find it, even if I search its name in Spanish. But when I look at this article using langviews (https://pageviews.toolforge.org/langviews/), I can see that it does exist. What might be the problem I'm facing?

@PatsonJay Please check if the article has any pageviews in January 2021 through this tool: https://pageviews.toolforge.org/langviews/. It is possible that the article is available in Spanish but did not receive any views on the Spanish Wikipedia in the month of January, since the data corresponds only to that time period.

What if it doesn't?

@PatsonJay If the article doesn't have enough pageviews, I suggest that you pick a different article which has at least 1000 pageviews in at least one other language so that you can carry out the "Compare Reader Behaviour across Languages" task.

# TODO: for at least one language the article exists in that has a corresponding clickstream dataset,
# loop through that clickstream dataset and gather all the relevant data
# (as you did in English for your visualization above)

The above is a TODO. For example, the article I got from the English clickstream and was working on is Chris Ferguson. I do a query using the mwapi library, and I noticed that it appears in Spanish, but when I look into the clickstream I do not find it. How then can I answer this TODO if the article doesn't exist in any language as per the clickstream?

@PatsonJay The name you're using might have spaces in it. The clickstream data does not store names with spaces. For example, Chris Ferguson might be stored as Chris_Ferguson in the dataset. I suggest you try using the name.replace() method to replace spaces with underscores.
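A tiny sketch of that suggestion (the helper name is my own):

```python
def normalize_title(title):
    # Clickstream page titles use underscores instead of spaces,
    # so "Chris Ferguson" must be looked up as "Chris_Ferguson".
    return title.strip().replace(" ", "_")
```

For example, `normalize_title("Chris Ferguson")` returns `"Chris_Ferguson"`.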

@PatsonJay I hope you found the solution to your problem.

Hi, @PatsonJay
You must be sure to check that the date is the same in both sources. For example, the clickstream we are analyzing is from January 2021, so when you search on Toolforge you must make the request for the same period of time, like here. And there, no pageviews appear for "Chris Ferguson" in Spanish in January 2021.

Therefore, "Chris Ferguson" does not appear in the Spanish January 2021 clickstream because, for privacy reasons, a particular path between articles must occur at least 10 times to be reported.
So "Chris Ferguson" may have a Spanish version, as you found when requesting the API, but not be in the Spanish January 2021 clickstream because it wasn't visited often enough.

Yes @Zealink, I think it is appropriate to present all of our work in a single notebook - one contribution.

Hello @PatsonJay, I forgot to mention that in this API request:
https://en.wikipedia.org/w/api.php?action=query&prop=langlinks&titles=Chris_Ferguson&lllimit=max&redirects=
a German version of Chris Ferguson appears. We also have the January 2021 German clickstream available in the notebook, and on Toolforge that German version of Chris Ferguson has 397 pageviews.
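A langlinks response like the one from that URL has roughly this shape (the page id and entries below are illustrative, not real data), and the language codes an article exists in can be collected like so:

```python
# Illustrative, truncated langlinks response in the legacy JSON format:
# query.pages is keyed by page id, and each langlink entry carries the
# language code under "lang" and the localized title under "*".
response = {
    "query": {
        "pages": {
            "1107548": {
                "title": "Chris Ferguson",
                "langlinks": [
                    {"lang": "de", "*": "Chris Ferguson"},
                    {"lang": "es", "*": "Chris Ferguson"},
                ],
            }
        }
    }
}

def languages_with_article(response):
    # Collect the language codes the article has interlanguage links to.
    langs = set()
    for page in response["query"]["pages"].values():
        for link in page.get("langlinks", []):
            langs.add(link["lang"])
    return langs
```

Each code in the result can then be checked against the clickstream languages that are actually available (e.g., dewiki here).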

@puja_jaji thanks for posting the reminder.

Note that the final deadline has been shifted to a later date due to the potential impact of Covid-19 on applicants (especially but not restricted to India).

The final application deadline is now extended to Monday, May 3 at 4pm UTC.

Please do not hesitate to reach out to Isaac and me directly via email if you have further questions and/or are personally impacted by Covid-19.

Stay safe & healthy!

@MGerlach thanks a lot to the outreachy organisers and the team for this consideration. This really means a lot.

Okay @puja_jaji, thanks for sharing.

Thanks, let me try that one.

Thank you so much.

I really do not know where I'm going wrong, but I can't seem to find the article when I search in the clickstream. I have tried all sorts of methods. For example, I would love to search for this article: "NHK_Educational_TV". But I can't find it in the clickstream when I use

df.loc[(df[0] == NHK_Educational_TV)| (df[1] == NHK_Educational_TV)]

where df is my dataframe. But when I take a quick look at this https://pageviews.toolforge.org/langviews/?project=en.wikipedia.org&platform=all-access&agent=user&start=2021-01-01&end=2021-01-31&sort=views&direction=1&view=list&page=NHK%20Educational%20TV I can see that there are lots of languages where NHK_Educational_TV appears. What might be the problem?

Another question: doesn't the mwapi query (session.request(method="Get", params)) show data for this recent month, and by that I mean April?

Hello, @PatsonJay.

Pre-conditions:
You are confident that:
(1) the article (its data) is present in the clickstream dataset you are using; you have verified it by the langlinks API.
(2) 'NHK_Educational_TV' is the page title of the article in that clickstream dataset.

What might be happening:

Syntax
I would take a look at the syntax of that statement. For example, df.loc[] is used to access a group of rows and columns by label(s) or a boolean array. If you want to select the columns with the 'source' and 'destination' articles, you should use the column names instead. To access the group of rows by integer position, change the method to 'iloc' and be careful with the syntax. The second thing I noticed is that the page title is not enclosed in single or double quotation marks. If NHK_Educational_TV is not a variable, then use quotation marks so the program recognizes it as a string.

You are not loading the full dataset and the article is not present in the subset you have
If your syntax is correct, that is, no error is being thrown by the interpreter (and that was not the exact statement you used), then it could have something to do with the subset of the data you have. Are you loading the full dataset? If not, then your article may not be present in the data you loaded, especially if you have less than 70% of it.
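For what it's worth, .loc with a boolean mask does also work when the columns are the default integer labels; in this thread the missing quotation marks (and the subset size) were the key issues. A runnable sketch with invented rows:

```python
import io
import pandas as pd

# Invented clickstream-style sample; with header=None the columns get
# the default integer labels 0..3 (0 = source, 1 = destination).
sample = (
    "other-search\tNHK_Educational_TV\texternal\t1200\n"
    "NHK\tNHK_Educational_TV\tlink\t300\n"
    "Tokyo\tShibuya\tlink\t800\n"
)
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None, quoting=3)

# The title must be a quoted string literal; a boolean mask with .loc
# then selects every row where it appears as source or destination.
hits = df.loc[(df[0] == "NHK_Educational_TV") | (df[1] == "NHK_Educational_TV")]
```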

OK, thanks so much, let me try df.iloc instead. My code was correct; I just forgot to put the single or double quotes when I brought the question here. And for the dataset, I was loading something like 20k rows of the data.

If you want to select the columns with the 'source' and 'destination' articles you should use the column names instead.

From the dataset, I think 1 represents the source and 3 represents the destination; that's why I was doing something like this:

df[3] == "NHK_Educational_TV"

I really do not know where I'm going wrong but I can't happen to find the article when I search in the clickstream. I have tried all sorts of methods. For example I would love to search this article - "NHK_Educational_TV". But I can't find it in the clickstream when I use

df.loc[(df[0] == NHK_Educational_TV)| (df[1] == NHK_Educational_TV)]

where df is my dataframe. But when I do a quick look at this https://pageviews.toolforge.org/langviews/?project=en.wikipedia.org&platform=all-access&agent=user&start=2021-01-01&end=2021-01-31&sort=views&direction=1&view=list&page=NHK%20Educational%20TV I can see that there are lots of languages where NHK_Education_TV appears. What might be the problem

Hello, @PatsonJay.

Pre-conditions. You are confident that:
(1) the article (its data) is present in the clickstream dataset you are using; you have verified this via the langlinks API;
(2) 'NHK_Educational_TV' is the page title of the article in that clickstream dataset.

What might be happening:

Syntax
I would take a look at the syntax of that statement. df.loc is used to access a group of rows and columns by label(s) or a boolean array. If you want to select the columns with the 'source' and 'destination' articles, you should use the column names instead. To access rows by integer position, change the indexer to 'iloc' and be careful with the syntax. The second thing I noticed is that the page title is not enclosed in single or double quotation marks. If NHK_Educational_TV is not a variable, then use quotation marks so the program recognizes it as a string.

You are not loading the full dataset and the article is not present in the subset you have
If your syntax is correct, that is, no error is being thrown by the interpreter (what you posted was not the exact statement you used), then it could have something to do with the subset of the data you have. Are you loading the full dataset? If not, your article may not be present in the data you loaded, especially if you have less than 70% of it.
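The advice above can be sketched with a toy example. The sample rows and the column names 'source', 'destination', 'type', 'n' below are made up for illustration; the dump itself ships without a header, so any names have to be supplied when reading it:

```python
import io
import pandas as pd

# Made-up clickstream-style rows (TSV: source, destination, type, n_occurrences)
sample = ("other-search\tNHK_Educational_TV\texternal\t120\n"
          "NHK\tNHK_Educational_TV\tlink\t45\n"
          "NHK\tNHK_World-Japan\tlink\t30\n")
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None,
                 names=["source", "destination", "type", "n"])

# .loc with a boolean mask, column names, and *quoted* string literals:
hits = df.loc[(df["source"] == "NHK_Educational_TV") |
              (df["destination"] == "NHK_Educational_TV")]
print(len(hits))  # 2 of the 3 sample rows match
```

Without the quotation marks, NHK_Educational_TV would be treated as an undefined variable name and raise a NameError rather than silently returning nothing.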

I have used iloc as you said, and for the dataset I tried using 500k and then 1 million rows, but I'm still finding an empty dataset.

Ok, thanks so much, let me try df.iloc instead. My code was correct; I just forgot to put the single or double quotes when I brought the question here. And for the dataset, I was loading something like 20k rows of the data.

You are welcome!

If you want to select the columns with the 'source' and 'destination' articles you should use the column names instead.

From the dataset I think 1 represents the source and 3 represents the destination; that's why I was doing something like this:

df[3] == "NHK_Educational_TV"

If you did not rearrange the order of columns, 0 should be 'source/prev', and 1 'destination/curr'.
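A quick way to confirm that positional layout, using a single made-up row read without a header (so the columns are the integers 0..3):

```python
import io
import pandas as pd

# One made-up clickstream-style row: source, destination, type, n_occurrences
sample = "other-search\tNHK_Educational_TV\texternal\t120\n"
df = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

print(df[0].iloc[0])  # column 0: source/prev      -> other-search
print(df[1].iloc[0])  # column 1: destination/curr -> NHK_Educational_TV
```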

Yes, that's it. I had made a mistake there.

No problem!

Another question: doesn't the mwapi query (session.request(method="GET", params)) show data for this recent month, and by that I mean April?

It depends on the API. If you are making a GET request to the langlinks API, then you can't, because it is used to "get a list of all language links from the provided pages to other languages"; it is not specific to the month.
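Since the langlinks endpoint returns a page's current interlanguage links, its request parameters carry no date at all. A sketch of the parameter dict such a GET request would use (the parameter names follow the standard MediaWiki Action API, but treat the exact set as an assumption to verify against the API documentation):

```python
def langlinks_params(title):
    """Build GET parameters for a langlinks query.

    Note there is no date/month parameter anywhere: the API returns the
    page's *current* interlanguage links, unlike the clickstream dumps,
    which are published per month.
    """
    return {
        "action": "query",
        "prop": "langlinks",
        "titles": title,    # page title to look up
        "lllimit": "max",   # return all language links for the page
        "format": "json",
    }

params = langlinks_params("NHK Educational TV")
print(sorted(params))  # no 'month' or 'date' key in sight
```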

Hello mentors, @MGerlach
I am Ananya Mahato, an Outreachy applicant. Due to my semester exams as well as poor health, I wasn't able to contribute to any project. But since the contribution deadline has been extended, I wish to ask whether I can still contribute to this project, or whether there are already a sufficient number of applicants.

Also adding @Isaac here

ok thanks so much

@123ananya you can definitely still submit a contribution for this task. We will take a look at all submissions (the deadline is Monday, May 3 at 4pm UTC).

Hello guys, I have tried different articles that appear across languages, but I still can't find the article in any language other than English. If I do a query on the langlinks API for the article "Westdeutscher_Rundfunk", you can see that this article appears in German with 80k+ pageviews, but when I look for it in the German clickstream using its German name "Westdeutscher_Rundfunk_Köln", I cannot find it. I do something like this on my German clickstream dataframe:

df_es.iloc[((df_es[0] == "Westdeutscher_Rundfunk_Köln") | (df_es[1] == "Westdeutscher_Rundfunk_Köln")).values, [0, 1, 3]]

Where am I going wrong?

Thank you.

Hello @PatsonJay, I have a question about your code: is df_es a DataFrame created from clickstream-dewiki-2021-01.tsv.gz?

Yes, it is.

Ok, @PatsonJay. If your syntax is correct, maybe you aren't loading the full dataset, and the article name "Westdeutscher_Rundfunk_Köln" is not present in the subset you got.

So you can try this approach to search the full dataset:

import gzip

with gzip.open('path_file', 'rt') as raw_text:
    for line in raw_text:
        lst = line.strip().split('\t')

Parsing each line (a string) with the strip and split methods gives you a list, so there you can add an if statement to look for your article name at the respective position in the list. The positions would be:
0: source
1: destination
2: type
3: n_occurrences
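The loop above can be wrapped into a small self-contained sketch (the file name and the find_article helper are made up for illustration; the demo file stands in for a real clickstream dump):

```python
import gzip
import os
import tempfile

def find_article(path, title):
    """Collect clickstream rows where `title` is the source (column 0)
    or the destination (column 1). Columns: source, destination, type, n."""
    hits = []
    with gzip.open(path, "rt", encoding="utf-8") as raw_text:
        for line in raw_text:
            lst = line.rstrip("\n").split("\t")
            if title in (lst[0], lst[1]):
                hits.append(lst)
    return hits

# Tiny made-up file to demonstrate the scan (not real clickstream data):
path = os.path.join(tempfile.gettempdir(), "demo_clickstream.tsv.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("other-search\tWestdeutscher_Rundfunk_Köln\texternal\t57\n")
    f.write("Berlin\tHamburg\tlink\t12\n")

print(len(find_article(path, "Westdeutscher_Rundfunk_Köln")))  # 1
```

Because it streams the file line by line, this approach never loads the whole dump into memory, so it avoids the truncated-subset problem entirely.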

Hello @Isaac, I would like to know how we can answer this question on the Outreachy application form:

Outreachy internship project timeline:

Please work with your mentor to provide a timeline of the work you plan to accomplish on the project and what tasks you will finish at each step. Make sure to take into account any time commitments you have during the Outreachy internship round. If you are still working on your contributions and need more time, you can leave this blank and edit your application later.

Outreachy application form:

Some communities or projects may want you to answer additional questions. Please check with your mentor and community coordinator to see if you need to provide any additional information after you save your final application.

Are there also any specific questions we need to answer?

@PatsonJay

You can refer to this:

Hello @MGerlach and @Isaac, this is the last question in the Outreachy final application:

Please work with your mentor to provide a timeline of the work you plan to accomplish on the project and what tasks you will finish at each step. Make sure to take into account any time commitments you have during the Outreachy internship round.

So I would like to know if it is ok to use the phases described in T275608 as steps in the project timeline, and whether you have any suggestions or guidance about it.

@Vanevela good question, I will try to clarify. Yes, you can definitely use those phases (e.g. as a rough guideline). Also feel free to use them to identify which of the phases mentioned in the task (analysis, visualization, interface building/development, model building) are most interesting to you and which you would like to spend more time on, as well as any additional steps you think are pertinent for your work on this project. In general, though, no need to spend too much time on this.

As far as I know, Wikimedia doesn't have any community-specific questions. This is specified here (step 12). Hope this is useful.


Hi all,
just a reminder: if you have not done so already, don't forget to submit your final application on the Outreachy website before the deadline on Monday, May 3 at 4pm UTC (less than 24 hours from now).
Even if you sent your notebook to Isaac or me for feedback during the past weeks (thanks to everyone who shared their progress), you still need to submit the application on the Outreachy site.

Thanks for all the great contributions and discussions.

I would like to thank this great community for the constructive advice and help that I received during the contribution stage, which ends today. It was a learning phase for me, an interesting period where I got to learn from this great community. I really enjoyed working through the problems. Thanks to everyone.

I have had an amazing experience working on this task, and I truly appreciate how the community came forward to help in any way possible every time I or anyone else faced an issue. I have learned a lot under the guidance of the mentors. I hope to apply all the knowledge I have gained through this experience in the future projects I take up. I am really looking forward to continuing to contribute to open source.

Wishing all the best to everyone!

@PatsonJay and @rachita_saha thank you to you and everyone in the community for the collaboration and mutual support. It was a period of great learning. Best wishes to all. :)

Stay safe and healthy.

@MGerlach @Isaac
Although I started contributing very late, I have managed to complete the task and emailed you my public link for feedback. Please provide your feedback as soon as possible, as I wish to submit my proposal today.

I would also like to thank everyone for being so helpful.

Thanks a lot, everyone. It was my first time being part of such an active and collaborative forum, and let me tell you, it was great :)
May you all have a great journey ahead.

Hi everyone,
the final application deadline has passed. I wanted to thank you for all the hard work and effort you put into your submissions. You all did a really good job, not only in your analyses and notebooks but also in being curious, asking questions, and helping each other out!

We are now reviewing the submissions. According to information from the organizers, the selection of interns will be announced on May 17 at 4pm UTC.

Do not hesitate to reach out in case you have any questions.

MGerlach claimed this task.