HagerDakroury (Hager Eldakroury)
User

Projects

User does not belong to any projects.

User Details

User Since
Apr 1 2021, 12:07 PM (167 w, 1 d)
Availability
Available
LDAP User
Unknown
MediaWiki User
HagerDakroury

Recent Activity

Apr 6 2021

HagerDakroury added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

I have a few points of confusion about the final task for Compare Reader Behavior across Languages. Finding an article that appears in both datasets, and then comparing its respective source and destination values across languages, is time-consuming because I have to resort to a trial-and-error approach. How have others moved forward with the task?
Thanks!

Apr 6 2021, 9:37 AM · Outreachy (Round 22)
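
One programmatic alternative to the trial-and-error matching described in the comment above (my own suggestion, not something confirmed in the thread) is to resolve an article's cross-language titles through Wikidata's public wbgetentities API and use the returned title as a key into the other language's clickstream dump. A minimal sketch, assuming the English and German dumps; the article title is only an example:

```python
import requests

# Fetch the Wikidata sitelinks for an English Wikipedia article; sitelinks
# map the same concept to its article titles in every language edition.
resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbgetentities",
        "sites": "enwiki",
        "titles": "Albert Einstein",  # example title only
        "props": "sitelinks",
        "format": "json",
    },
)
entity = next(iter(resp.json()["entities"].values()))

# Clickstream dumps use underscores instead of spaces, so normalize the
# German title before using it to filter the dewiki clickstream rows.
de_title = entity["sitelinks"]["dewiki"]["title"].replace(" ", "_")
print(de_title)
```

With the pair of titles in hand, the respective source and destination counts can be compared by filtering each dump on its own title.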

Apr 4 2021

HagerDakroury added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Thank you, @HagerDakroury. This is useful.

Additionally, if you're not doing it already, it is always a good strategy to take advantage of the computing power of Google's servers (for example) instead of just using your own machine. Running Python scripts often requires a lot of computing power and can take time; running them in the cloud should let you focus on the task. Hope it helps: https://colab.research.google.com/

Apr 4 2021, 4:24 PM · Outreachy (Round 22)
HagerDakroury added a comment to T276315: Outreachy Application Task: Tutorial for Wikipedia Clickstream data.

Hi everyone, for anyone using pd.read_csv() and struggling with memory usage: you can read through the Scaling to large datasets guide in the pandas documentation (specifically the part about choosing more efficient datatypes). I tried doing so with some columns; for example, the column representing the type of the link is better represented as a Category, since it contains many duplicates (there are only a handful of link types). I'm still unable to load the whole dataset using pandas, but the improvement in memory usage is significant (2.1 GB down to 955.8 MB for 10 million entries). The guide goes through some other suggestions as well. Hope this helps!

Apr 4 2021, 1:35 PM · Outreachy (Round 22)
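
A minimal sketch of the dtype optimization described in the comment above, assuming the standard four-column clickstream TSV layout (prev, curr, type, n); the file name is a placeholder for a downloaded dump:

```python
import pandas as pd

# Clickstream dumps are tab-separated with four columns:
# prev (source), curr (destination), type (link type), n (count).
cols = ["prev", "curr", "type", "n"]

# Declaring the low-cardinality "type" column as a category stores each
# distinct string once and keeps only small integer codes per row, which
# is where most of the memory saving comes from.
df = pd.read_csv(
    "clickstream-enwiki.tsv.gz",  # placeholder file name
    sep="\t",
    names=cols,
    dtype={"type": "category", "n": "int32"},
)

# Compare against the default dtypes to see the reduction.
print(df.memory_usage(deep=True))
```

Reading with chunksize= and aggregating per chunk is another way to stay within memory when even the optimized dtypes are not enough.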