Aug 4 2020
Thank you, @elukey!
Jun 18 2020
Jun 2 2020
I deleted all files on nb3 and shut down the server.
I rsynced all files from nb4 and shut down the server.
May 21 2020
These were interesting and helpful metrics to review for GLOW India articles (a sketch of pulling some of them via the API follows the list):
Namespace (or just main/not main?)
num of editors
num of edits
num of watchers
time since last edit
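A minimal sketch, assuming the MediaWiki Action API, of how a few of these per-article metrics can be pulled; the endpoint, wiki, and approach are illustrative placeholders, not the actual GLOW code:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # assumed wiki for illustration

def article_metrics(title):
    # Fetch watcher count, edit/editor counts, and last-edit time for one article.
    resp = requests.get(API, params={
        "action": "query",
        "titles": title,
        "prop": "info|revisions",
        "inprop": "watchers",      # watcher count (omitted below a privacy threshold)
        "rvprop": "timestamp|user",
        "rvlimit": "max",          # sufficient only for low-edit articles; paginate otherwise
        "format": "json",
    }).json()
    page = next(iter(resp["query"]["pages"].values()))
    revs = page.get("revisions", [])
    return {
        "num_editors": len({r.get("user") for r in revs}),
        "num_edits": len(revs),
        "num_watchers": page.get("watchers"),                 # None if hidden
        "last_edit": revs[0]["timestamp"] if revs else None,  # revisions are newest-first
    }
```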
May 14 2020
Hi @elukey, I'll transfer the files and shut down the notebooks over the next few days. I'll check in on Tuesday with an update or any questions.
Apr 21 2020
Thank you, @nettrom_WMF! Sorry for the delay.
Apr 1 2020
Loading neighboring contest articles:
Mar 27 2020
Data wrangling code to pull the items in this task can be found here: https://github.com/IreneFlorez/GLOW/tree/article_suggestions/scripts/data_wrangling
Mar 24 2020
Articles that were edited using a translation tool (by type):
expanded: 113 (expanded total: 1418)
new: 3602 (new total: 7445)
Expanded articles edited using a translation tool:
Mar 20 2020
Mar 18 2020
@mpopov maybe the faulty link was related to a bug? I'm receiving bug reports related to this ticket. Would it make sense to create a new ticket?
Sorry about that, I just updated the ticket with a functional task link.
Mar 16 2020
In an effort to run these queries from a Python 3 notebook without needing to change the notebook type, I've switched them to run as Spark queries using the wmfdata package's spark.run function. I'm now able to run the queries. For example, here's the code for the translation query:
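The original snippet isn't preserved in this log; here is a minimal sketch of what it may have looked like, assuming wmfdata's spark.run against the mediawiki_history table, with the snapshot and tag filter as illustrative placeholders:

```python
from wmfdata import spark

# Hypothetical reconstruction, not the exact production query.
translation_query = """
SELECT wiki_db, page_id, page_title
FROM wmf.mediawiki_history
WHERE snapshot = '2020-02'
  AND event_entity = 'revision'
  AND array_contains(revision_tags, 'contenttranslation')  -- edits made with a translation tool
"""

translations = spark.run(translation_query)  # returns a DataFrame
translations.head()
```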
Thank you. Yes, I can confirm that I had run kinit and entered my Kerberos credentials in a notebook terminal.
@JAllemandou I tried running these spark queries over the weekend on a small batch of articles and they timed out.
Might you have tips or insights? I didn't receive any error messages; the queries simply took a very long time, and eventually I stopped the kernel.
Given that behavior, I also tried running the queries as hive queries and had similar issues.
Mar 12 2020
Total values in full rec list: 34295
Total recs in translation: 14155
Total recs in editing: 20102
Feb 18 2020
Thank you for the quick replies and the feedback re: EXPLAIN and the GROUP BY line.
Feb 17 2020
Thank you for the feedback!
I've updated the date handling in the pageviews query and added event_entity, revision_is_identity_reverted, and revision_is_deleted_by_page_deletion to the fields used in the revision-tags query against the mediawiki_history table.
These are performing much better now.
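A hedged sketch of the shape of that revised mediawiki_history query, assuming wmfdata's hive.run; the snapshot, wiki, and date bound are illustrative assumptions:

```python
from wmfdata import hive

# Illustrative sketch only; the field names match the comment above,
# everything else is a placeholder.
revisions = hive.run("""
SELECT
    event_entity,
    revision_is_identity_reverted,
    revision_is_deleted_by_page_deletion,
    page_id,
    revision_tags
FROM wmf.mediawiki_history
WHERE snapshot = '2020-01'
  AND wiki_db = 'hiwiki'
  AND event_entity = 'revision'
  AND event_timestamp >= '2019-10-01'
""")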
Feb 16 2020
Feb 14 2020
Hi All! I work with Edna on the Partnerships team.
I am a Partnerships data analyst focusing on analyzing the data coming out of the GLOW project.
Edna and I are on the Partnerships and Global Reach team; she focuses on the Latin America region.
Feb 7 2020
Feb 4 2020
Jan 24 2020
Aeryn Palmer from Legal has taken a look, and this now goes to James Fishback for the second part of the new privacy review process: a security review.
According to Aeryn, much of this should be okay to publish, although we'd need to exclude very small wikis. Security may help identify that threshold.
Jan 22 2020
Hi @JAllemandou, is this now possible via Spark? I'm querying Wikidata for the GLOW analysis and wondering if there's an update on the Hadoop version of Wikidata that I should consider or keep in mind. At present I'm setting up SPARQL queries via the SWAP notebooks.
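For context, a minimal sketch of the current approach: a SPARQL query sent to the Wikidata Query Service from a SWAP notebook. The query itself is an illustrative placeholder, not the actual GLOW query:

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

query = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q5 .                      # instance of: human (placeholder filter)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

resp = requests.get(WDQS, params={"query": query, "format": "json"},
                    headers={"User-Agent": "glow-analysis (example)"})
for row in resp.json()["results"]["bindings"]:
    print(row["item"]["value"], row["itemLabel"]["value"])
```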
Jan 21 2020
I sent out a request to legal on Thu, Jan 9. I will check in on that request today and will post updates here.
Jan 18 2020
Hoorah! Thank you @Dzahn!
@nshahquinn-wmf The wiki comparison sheet is now updated with Dec 2019 data.
You can now add formatting magic :)
Jan 17 2020
Hi @Aklapper, on my screen I see tofu characters (essentially just boxes) for the characters on the Javanese script page.
I am a data analyst working on GLOW. Here's the GLOW project phab board and the meta page.
I work with Rudolph on the Partnerships and Global Reach team. Rudolph is requesting access to Superset. He is not requesting additional Analytics access. We were told that for Superset access he needed to create a ticket using T160662 as a sample.
Jan 11 2020
I see, thank you. Your clarification is helpful.
Jan 10 2020
Given the above suggestion, I propose a longer timer.
I stop the Spark session at the end of the day or when I finish the task, but I have had periods where I needed to spend, say, 30 minutes reading documentation to address an issue before proceeding with the task at hand in the Spark session, so I could use a longer timer.
Jan 9 2020
Thank you, @elukey. I appreciate your troubleshooting and assessment.
Jan 7 2020
Jan 6 2020
No luck with the updated URL
Thank you @Nuria!
I've been using notebooks along with the wmfdata package.
I aim to shift to using notebooks with Spark and will reach out if I have any issues; rereading the SWAP documentation is helpful. A sketch of that setup follows.
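A hedged sketch of that planned setup, assuming wmfdata's spark.get_session helper; the helper signature, app name, and query are assumptions for illustration:

```python
from wmfdata import spark

# Assumed helper and signature; adjust per the SWAP documentation.
session = spark.get_session(app_name="glow-analysis")

top_pages = session.sql("""
    SELECT page_title, SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE year = 2020 AND month = 1
    GROUP BY page_title
    ORDER BY views DESC
    LIMIT 10
""").toPandas()
```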
Is this still an ongoing issue? Is this something to keep in mind for project GLOW when we begin evaluating Nigeria specific data?
I'm experiencing the same as @MMiller_WMF . Hue was working fine on Friday and today it's funky.
Today I'm unable to see any tables, and I am seeing error messages. On the left sidebar, the error message says "Error loading databases".
On the top right, I am intermittently getting this error message in red font:
java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
If I already have a correct, working query, I can paste it in and get results, but I cannot expand the database icon to see any tables. Since DataGrip is not working, I'm hoping to use Hue to test queries and see what tables are available. To be sure, Hue has been problematic for me (there are some tables it never shows me), so if we can ultimately get DataGrip to work, I will prioritize that tool.
Dec 24 2019
No luck with the steps included in that link. I went through all of the steps listed for DataGrip. Here's a screenshot taken after I tested the connection.
Dec 20 2019
Update: Worked with @elukey just now to gauge the issue.
We tried to update the URL to jdbc:hive2://localhost:10000/default;principal=hive/an-coord1001.eqiad.wmnet@WIKIMEDIA.
Per the DataGrip documentation, the URL is filled in automatically and should look like jdbc:hive2://localhost:10000/default
We updated this to see if it would fix the issue and the test failed.
With the updated URL that we tested, DataGrip no longer even lets me try to execute a query; the buttons outside the config window are 'frozen'. A restart did not fix the freeze.
We will touch base about this next week.
Dec 19 2019
Dec 16 2019
#consider https://meta.wikimedia.org/wiki/List_of_Wikipedias/Table to be the ultimate source of truth on wiki counts
Excellent! Again, I highly recommend recording a macro, as it should take just two clicks and can then be deployed with a shortcut on the next sheet. The process would be clicking
Tools > Macros > Record macro
and Sheets will put your actions into code that it saves as a macro.
But, of course, it's up to you :)
About the data issues:
#1 I will need a little more insight into gathering cumulative content edits and content pages with the API. As far as historical cumulative content pages go, the Wikistats 2.0 API notes that the total article count can be derived from the Pages Created count by taking its cumulative value; however, I'm not clear on how this addresses deleted content. I looked at this in T240253 for GLOW India and ended up pulling data from https://meta.wikimedia.org/wiki/List_of_Wikipedias/Table. I would appreciate more detail on how to pull accurate data for these two measures (a sketch of the cumulative approach follows below). In the short term, I recommend having blank columns for any historical-year wiki comparison snapshots. So, for example, if we decide to create a 2017 tab, then have blank columns for cumulative content edits and content pages.
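A hedged sketch of the cumulative approach the Wikistats 2.0 API suggests, summing monthly new content pages from the AQS edited-pages endpoint; the project and date range are illustrative, and note this does not subtract deleted pages, which is exactly the open question:

```python
import requests

# Illustrative project and date range (hi.wikipedia, 2010-2020).
AQS = ("https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/"
       "hi.wikipedia/all-editor-types/content/monthly/2010010100/2020010100")

results = requests.get(AQS, headers={"User-Agent": "glow-analysis (example)"}) \
    .json()["items"][0]["results"]

# Cumulative value of Pages Created; does NOT account for deletions.
total = sum(month["new_pages"] for month in results)
print("Approximate cumulative content pages:", total)
```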
Dec 12 2019
#2 - Yes, the 2017 readership metrics do have identical data to the June 2018 tab. I'm looking into this. For now, can you please remove that sheet from the Wiki Comparison notebook? No dates were hardcoded into the individual queries in the code repository; instead, there's one cell at the top that defines the dates (the pattern is sketched below). So I'm not sure how this came about; it seems unusual for one section to end up with different data than another. This definitely needs to be fixed.
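A hedged sketch of that date-handling pattern: one cell defines the dates, and each query interpolates them, so nothing is hardcoded per query. The variable names and query are illustrative, not the actual repository code:

```python
# Single date-defining cell at the top of the notebook (illustrative values).
START_DATE = "2019-06-01"
END_DATE = "2019-07-01"

# Every query below interpolates the shared dates rather than hardcoding them.
revisions_query = f"""
SELECT COUNT(*) AS edits
FROM wmf.mediawiki_history
WHERE snapshot = '2019-12'
  AND event_entity = 'revision'
  AND event_timestamp >= '{START_DATE}'
  AND event_timestamp < '{END_DATE}'
"""
```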
Dec 11 2019
Thank you @Neil_P._Quinn_WMF