
Tru2198 (TRUSHA PATEL)
User

Projects

User does not belong to any projects.

User Details

User Since
Mon, Mar 29, 5:20 PM (2 w, 2 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
Tru2198 [ Global Accounts ]

Recent Activity

Sun, Apr 11

Tru2198 added a comment to T279290: Synchronising Wikidata and Wikipedias using pywikibot - Task 6.

@Mike_Peel, I am picking the topic Category: JudoInside_template_with_ID_not_in_Wikidata. I hope that won't be an issue!

Sun, Apr 11, 5:37 AM · Outreachy (Round 22)

Sat, Apr 10

Tru2198 added a comment to T279289: Synchronising Wikidata and Wikipedias using pywikibot - Task 5.

@Mike_Peel, for the bonus ("Explore how to identify the correct item when multiple terms are returned"): can we approach the problem by taking every QID returned for the search title and refining to the correct one by checking each item for a specific property? For instance, if my title is "Harry Potter", the search returns the Harry Potter movie, book, and character, but only the book will have the property "Language of the work", or "author", or "publication date"?

Sat, Apr 10, 12:41 PM · Outreachy (Round 22)
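
A minimal sketch of the property-based filtering idea above, assuming pywikibot is configured for Wikidata; the search term and the choice of P50 (author) are illustrative assumptions, not part of the task:

  import pywikibot
  from pywikibot.data import api

  site = pywikibot.Site('wikidata', 'wikidata')
  repo = site.data_repository()

  def find_item_with_property(search_term, property_id):
      # wbsearchentities returns candidate items matching a label or alias.
      request = api.Request(site=site, parameters={
          'action': 'wbsearchentities', 'search': search_term,
          'language': 'en', 'type': 'item'})
      for result in request.submit()['search']:
          item = pywikibot.ItemPage(repo, result['id'])
          item.get()
          if property_id in item.claims:  # keep only items carrying the property
              return item
      return None

  # P50 = author: of the film, book, and character items, only the books have it.
  book = find_item_with_property('Harry Potter', 'P50')
  if book is not None:
      print(book.id, book.labels.get('en'))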

Fri, Apr 9

Tru2198 added a comment to T278997: Synchronising Wikidata and Wikipedias using pywikibot - Task 3.

“Print out the information alongside the property name (e.g., "P31 = human").”
Isn't this only possible in Wikidata, since Wikidata stores the information in the form of properties and QIDs?

So, how should the statements be printed when parsing Wikipedia? I am confused here.

Also, please review my task_2:
https://www.wikidata.org/w/index.php?title=User:Tru2198/Outreachy_2

Fri, Apr 9, 5:03 PM · Outreachy (Round 22)
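
A minimal sketch of printing claims alongside their property labels ("P31 = human") on the Wikidata side, assuming pywikibot; Q42 is just an example item, and only simple value types are formatted:

  import pywikibot

  site = pywikibot.Site('wikidata', 'wikidata')
  repo = site.data_repository()
  item = pywikibot.ItemPage(repo, 'Q42')  # Q42 = Douglas Adams, as an example
  item.get()

  for prop_id, claims in item.claims.items():
      prop = pywikibot.PropertyPage(repo, prop_id)
      prop.get()
      label = prop.labels.get('en', prop_id)
      for claim in claims:
          target = claim.getTarget()
          if target is None:  # 'no value' / 'unknown value' claims
              continue
          if isinstance(target, pywikibot.ItemPage):
              target.get()
              value = target.labels.get('en', target.id)
          else:
              value = str(target)  # dates, quantities, strings, ...
          print(f'{prop_id} ({label}) = {value}')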

Thu, Apr 8

Tru2198 added a comment to T278997: Synchronising Wikidata and Wikipedias using pywikibot - Task 3.

“Print out the information alongside the property name (e.g., "P31 = human").”
Isn't this only possible in Wikidata, since Wikidata stores the information in the form of properties and QIDs?

Thu, Apr 8, 5:38 PM · Outreachy (Round 22)
Tru2198 added a comment to T278997: Synchronising Wikidata and Wikipedias using pywikibot - Task 3.

So we have to print from both Wikipedia and Wikidata?

Thu, Apr 8, 4:01 PM · Outreachy (Round 22)
Tru2198 added a comment to T278997: Synchronising Wikidata and Wikipedias using pywikibot - Task 3.

Hi @MSGJ, I have completed task_3, which awaits your feedback!
Also, I think this tutorial is ideal for task_4, which involves adding information to Wikidata.

Thu, Apr 8, 10:28 AM · Outreachy (Round 22)
Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

In case you might need some inspiration for the last microtask, I found these to be useful:

Thu, Apr 8, 3:12 AM · Outreachy (Round 22)

Wed, Apr 7

Tru2198 added a comment to T278997: Synchronising Wikidata and Wikipedias using pywikibot - Task 3.

Hi @Tru2198, I didn't implement it for such cases yet, except for Image.

I believe finding the href HTML tag and then asking for the title of the page at that link will resolve this issue.

And when the programme can't find an href, it must be plain text, which should be extracted from the relevant <div> tags.

If there is any better approach, please share.

Wed, Apr 7, 8:44 AM · Outreachy (Round 22)
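
A minimal sketch of the href-or-plain-text idea from the comment above, using BeautifulSoup; the cell-level helper and sample HTML are assumptions for illustration, not the actual implementation:

  from bs4 import BeautifulSoup

  def cell_value(cell_html):
      """Return the linked page title if the cell holds a link,
      otherwise the cell's plain text."""
      cell = BeautifulSoup(cell_html, 'html.parser')
      link = cell.find('a', href=True)
      if link is not None:
          # The title attribute carries the page name for wiki links.
          return link.get('title') or link.get_text(strip=True)
      return cell.get_text(strip=True)

  print(cell_value('<td><a href="/wiki/India" title="India">India</a></td>'))
  print(cell_value('<td>14 February 1958</td>'))
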
Tru2198 updated subscribers of T278997: Synchronising Wikidata and Wikipedias using pywikibot - Task 3.

Hello @Mike_Peel, @MSGJ,
I have completed Task_3, though I haven't used regex for the parsing.

Wed, Apr 7, 8:42 AM · Outreachy (Round 22)
Tru2198 added a comment to T278997: Synchronising Wikidata and Wikipedias using pywikibot - Task 3.

@Shristi0gupta, wonderful programme! I wanted to know how you dealt with the parameter values that didn't have a label, and also with values without Q-numbers, like dates?

Wed, Apr 7, 3:39 AM · Outreachy (Round 22)

Tue, Apr 6

Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Now, as per the first clickstream data dump, it's in English. But in the langlinks API, it's not in English.

@Tru2198 make sure that you're viewing all the results. This can be done via the continuation parameter in mwapi (documentation) or by just increasing the lllimit parameter in your API call to return more results at once. Hope that helps.

Tue, Apr 6, 2:59 PM · Outreachy (Round 22)
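
A minimal sketch of paging through langlinks with mwapi's continuation support, as suggested above; the article title and user agent are placeholder assumptions:

  import mwapi

  session = mwapi.Session('https://en.wikipedia.org',
                          user_agent='clickstream-tutorial example')

  # continuation=True turns get() into a generator that follows the
  # API's 'continue' parameters until all results have been returned.
  for response in session.get(action='query', prop='langlinks',
                              titles='Earth', lllimit=500,
                              continuation=True):
      for page in response['query']['pages'].values():
          for link in page.get('langlinks', []):
              print(link['lang'], link['*'])
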
Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

I have a few questions about the final task, Compare Reader Behavior across Languages. Finding an article that appears in both datasets and then comparing its respective source and destination values across languages is time-consuming, as I have to resort to a trial-and-error approach. I would like to know how others have moved forward with the task.
Thanks!

Hi @Tru2198, taking into account the memory restriction in PAWS, I iterated through the files without loading them into memory and only extracted the necessary information into less memory-consuming data structures (as @Isaac suggested before); in my case I used dictionaries and lists. That is, my initial exploration doesn't use pandas; later, once I've selected a smaller dataset, I start using the library to take advantage of its methods.

Tue, Apr 6, 2:26 PM · Outreachy (Round 22)
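
A minimal sketch of the streaming approach described above: read the dump line by line and keep only aggregated totals in a dictionary; the filename is an example:

  import gzip
  from collections import defaultdict

  pageviews = defaultdict(int)  # destination -> total occurrences

  with gzip.open('clickstream-enwiki-2021-03.tsv.gz', 'rt',
                 encoding='utf-8') as f:
      for line in f:
          parts = line.rstrip('\n').split('\t')
          if len(parts) != 4:
              continue  # skip malformed rows instead of crashing
          source, destination, link_type, count = parts
          pageviews[destination] += int(count)

  # Only the aggregated dictionary stays in memory, never the raw rows.
  print(len(pageviews), 'destination articles aggregated')
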
Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

I have a few questions about the final task, Compare Reader Behavior across Languages. Finding an article that appears in both datasets and then comparing its respective source and destination values across languages is time-consuming, as I have to resort to a trial-and-error approach. I would like to know how others have moved forward with the task.
Thanks!

Can you provide more details, @Tru2198?

  1. Does your memory run out after loading the datasets (while analyzing them) or during loading?
  2. How much memory are you consuming prior to this session?
  3. Are you operating on the whole datasets or a subsection of them?
Tue, Apr 6, 2:24 PM · Outreachy (Round 22)
Tru2198 added a comment to T278860: Synchronising Wikidata and Wikipedias using pywikibot - Task 1.

Hello, for task 1, when we are asked to add a statement, am I adding the statement directly on the article itself?

Tue, Apr 6, 1:06 PM · Outreachy (Round 22)
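
For reference, statements live on the Wikidata item rather than on the Wikipedia article. A minimal pywikibot sketch, using the Wikidata sandbox item (Q4115189) as a safe example target; a real edit requires a logged-in account:

  import pywikibot

  site = pywikibot.Site('wikidata', 'wikidata')
  repo = site.data_repository()

  item = pywikibot.ItemPage(repo, 'Q4115189')  # the Wikidata sandbox item
  item.get()

  claim = pywikibot.Claim(repo, 'P31')             # P31 = instance of
  claim.setTarget(pywikibot.ItemPage(repo, 'Q5'))  # Q5 = human
  item.addClaim(claim, summary='Adding an example statement')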
Tru2198 added a comment to T278997: Synchronising Wikidata and Wikipedias using pywikibot - Task 3.
Tue, Apr 6, 12:33 PM · Outreachy (Round 22)
Tru2198 added a comment to T278860: Synchronising Wikidata and Wikipedias using pywikibot - Task 1.

Although my task_1 was approved yesterday, I have added a few extra properties and an article that might deviate a bit.

Tue, Apr 6, 11:19 AM · Outreachy (Round 22)
Tru2198 updated subscribers of T278863: Synchronising Wikidata and Wikipedias using pywikibot - Task 2.

@Mike_Peel, @MSGJ, I have completed task 2 and have mailed you.
Here is the link as well:
https://public.paws.wmcloud.org/User:Tru2198/Outreachy2_synchronizing_pywikibot-Copy1.ipynb

Tue, Apr 6, 9:04 AM · Outreachy (Round 22)

Mon, Apr 5

Tru2198 added a comment to T276329: Synchronising Wikidata and Wikipedias using pywikibot.

@Mike_Peel Thank you for the heads up! Regardless of how many interns are selected, I am really enjoying the learning from contributing!

Mon, Apr 5, 3:30 PM · Outreachy (Round 22), Outreach-Programs-Projects
Tru2198 added a comment to T278860: Synchronising Wikidata and Wikipedias using pywikibot - Task 1.

Hello @Mike_Peel!
I have tried my best with Task_1 and look forward to your feedback. Kindly review it at your convenience.
Here is the link:
https://www.wikidata.org/wiki/User:Tru2198/Outreachy_1

Mon, Apr 5, 11:34 AM · Outreachy (Round 22)
Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

I have a few questions about the final task, Compare Reader Behavior across Languages. Finding an article that appears in both datasets and then comparing its respective source and destination values across languages is time-consuming, as I have to resort to a trial-and-error approach. I would like to know how others have moved forward with the task.
Thanks!

Mon, Apr 5, 4:54 AM · Outreachy (Round 22)
Tru2198 added a comment to T276329: Synchronising Wikidata and Wikipedias using pywikibot.
Mon, Apr 5, 4:39 AM · Outreachy (Round 22), Outreach-Programs-Projects

Fri, Apr 2

Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

I have completed my first notebook's microtask with the basic implementation of the required functions and analysis. However, there are many updates and enhancements, mentioned in the notebook, that I will act upon in the coming days. Should I record my first contribution on the Outreachy site and await the review?

Fri, Apr 2, 4:14 AM · Outreachy (Round 22)

Thu, Apr 1

Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.
Thu, Apr 1, 11:50 AM · Outreachy (Round 22)
Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

For the task Compare Reader Behavior across Languages, will a comparison between two languages suffice? Most articles are supported by only one or two languages besides English that intersect with both the available clickstream data and the langlinks API.

Thu, Apr 1, 8:04 AM · Outreachy (Round 22)

Wed, Mar 31

Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Hello, for the microtask, I am trying to load the file into pandas DataFrames, but I am getting an error where several rows have a conflicting number of columns. The parameter error_bad_lines=False could be used to ignore the troubling lines. Is that option viable?

@Tru2198 @rachita_saha, good catch regarding these errors when importing with pandas. In this case, if the number of affected lines is small, it is OK to remove them using "error_bad_lines=False".
However, more generally, it is worth checking what is happening with these lines. When pandas imports the rows line by line, it splits each line at the "\t" to get the entries for the different columns. The error suggests that sometimes the number of entries from splitting is not 4 (like most of the rows) but 5 (or more). This happens because of some characters contained in the page titles of the source and/or target page (such as quotes). Pandas provides different options to deal with this. I have found that using the option "quoting=3" when doing read_csv() can solve the problem (the default is quoting=0). See the documentation for more, if you are interested in digging deeper.

@MGerlach That's very insightful. I'll look up this option. Thank you.

Wed, Mar 31, 3:42 PM · Outreachy (Round 22)
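
A minimal sketch of loading the clickstream TSV with the quoting=3 option discussed above; the filename and column names are assumptions (the dump has no header row). Note that error_bad_lines=False has since been replaced by on_bad_lines='skip' in newer pandas:

  import pandas as pd

  df = pd.read_csv('clickstream-enwiki-2021-03.tsv.gz',
                   sep='\t',
                   names=['source', 'destination', 'type', 'n'],
                   quoting=3,  # csv.QUOTE_NONE: ignore quote characters in titles
                   compression='gzip')
  print(df.head())
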
Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

@MGerlach @Isaac Several of the popular destination articles do not link to any further articles, so there is no data on common pathways from such an article. Should we find an article that links to more articles, or should we only provide the visualizations for the common pathways to the chosen article?

Wed, Mar 31, 3:25 PM · Outreachy (Round 22)
Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Can anybody clarify for me what exactly a pageview means? Isn't it the sum of the number of occurrences (the 4th column)? @MGerlach @Tru2198

Wed, Mar 31, 2:26 PM · Outreachy (Round 22)
Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Hello, to address my query, I used this approach.

Wed, Mar 31, 11:35 AM · Outreachy (Round 22)
Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

data_frame.groupby('Destination').nunique()>=20 returns mostly False values in the Source, Destination, and Link columns. Also, I am stuck on combining the queries for the TODO task: choose a destination article from the dataset that is
"relatively popular (at least 1000 pageviews, where I am using data_frame.groupby('Destination').sum()>=1000) and has 20 unique sources in the dataset".

Wed, Mar 31, 10:32 AM · Outreachy (Round 22)
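
A minimal sketch of combining the two conditions above (total pageviews of at least 1000 and at least 20 unique sources per destination); note the pandas method is groupby, one word. The filename and column names are assumptions:

  import pandas as pd

  df = pd.read_csv('clickstream-enwiki-2021-03.tsv.gz', sep='\t',
                   names=['source', 'destination', 'type', 'n'], quoting=3)

  grouped = df.groupby('destination')
  popular = grouped['n'].sum() >= 1000          # at least 1000 total pageviews
  diverse = grouped['source'].nunique() >= 20   # at least 20 unique sources

  candidates = popular[popular & diverse].index
  print(list(candidates[:10]))
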
Tru2198 added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Hello, for the microtask, I am trying to load the file into pandas DataFrames, but I am getting an error where several rows have a conflicting number of columns. The parameter error_bad_lines=False could be used to ignore the troubling lines. Is that option viable?

Wed, Mar 31, 6:11 AM · Outreachy (Round 22)