- continued preparation for the All-Hands panel on reuse
- iterated with Jonathan on social media traffic report planning -- in particular, looking for anecdotal evidence of a link between external referrals and editing to help us gauge what we think those relationships will look like. Early observations suggest that YouTube fact-checking links lead to a low level of vandalism; Reddit links seem to lead to minor, positive edits but can have larger positive impacts depending on the community; and Twitter/Facebook links showed little evidence of leading to editing.
- Improved wiki-topic interface so it's easier to query
- Drafted a user script for automatically querying an article's topic predictions, but I do not recommend running it right now as Toolforge isn't a trusted domain on Wikipedia for making content requests (see T28508): https://en.wikipedia.org/wiki/User:Isaac_(WMF)/WikidataTopic.js
- I added a short disclaimer statement to the interface noting that it's experimental and no personal data is collected. I asked the cloud team via IRC whether they had other suggestions for privacy policies to link to but they indicated that there are no standard terms etc. that they recommend including.
- Note: uptime does not seem to be an issue with Toolforge, so I don't think I have to worry about keeping the server awake!
- Provided concrete examples of top-level findings in support of Science abstract
- Computed data for what % of pageviews come from men/women for each language based on our survey results -- in general, the % of pageviews from men is about 5-10 percentage points higher than the proportion of readers who are men, because men consistently had slightly longer reading sessions than women (see the sketch after this list).
- Green light from the team to begin expanding out the comprehensive paper on Wikipedia readership
- Florian will get back to me on deleting data from WtWRW project to free up space on stat1007
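A toy illustration of the session-length effect noted above -- how a modest difference in pageviews per session pushes the pageview share above the reader share (all numbers here are made up):
```python
# Illustrative only: pageview share vs. reader share when session lengths differ.
def pageview_share(reader_share_men, pv_per_session_men, pv_per_session_women):
    """Fraction of pageviews from men, given reader share and session lengths."""
    men = reader_share_men * pv_per_session_men
    women = (1 - reader_share_men) * pv_per_session_women
    return men / (men + women)

# e.g., 60% of readers are men, but men average slightly longer sessions:
print(pageview_share(0.60, 4.0, 3.3))  # ~0.645, i.e., ~4-5 points above 0.60
```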
Weekly progress: none while waiting on feedback for Q3 directions. I'll note that the strategy recommendations provide further evidence of why it's valuable for this work to move towards understanding misalignment between readers and content.
Wed, Jan 22
Email sent -- for the final bullet point around the foundational work, I made this suggestion:
Tue, Jan 21
The instances and my proposed solutions are below:
- Current: Harassment is a pervasive issue on the internet—even on Wikipedia. We’re working to understand and combat harassment in Wikimedia projects and discussion spaces.
- Text: Leave as is.
- Link: Possibly change the link to be: https://meta.wikimedia.org/wiki/Community_health_initiative
- Current: Verifiable citations are critical to open knowledge projects like Wikipedia. We’re building an open repository of sources to make it easier to manage citations.
- Text: probably a little too narrowly-scoped. Maybe something like "Verifiability is critical to open knowledge projects like Wikipedia. We're building tools to detect policy violations and improve information provenance."
- Link: Knowledge Integrity Program: https://research.wikimedia.org/knowledge-integrity.html
- Current: Every second 6,000 people visit Wikipedia. We want to know why. Our research aims to understand their motivations and needs so we can identify opportunities to improve.
- Text: leave as is.
- Link: Knowledge Gaps Program: https://research.wikimedia.org/knowledge-gaps.html
- Current: Wikimedia is a global movement but participation has been easier for some users than others. We are partnering with communities to understand why. With everyone’s help we can increase editor diversity across our projects.
- Text: This project has been stalled as far as I know, so we should probably focus on other projects with more to explore. We might want to push for the Foundational work and perhaps make a nod to extending / supporting the already large community of researchers.
- Link: https://research.wikimedia.org/foundational.html
@Volker_E @bmansurov Just checking in on this as we are considering revisiting this task. Any thoughts / updates on whether there is established guidance for moving to a static site generator, as was discussed previously? If not, perhaps in the meantime we should consider something like what Gapfinder uses, where various elements on that page (such as the disclaimer) are loaded as static elements using riot.js. I expect this latter approach would work at least for the header/footer, which are exactly the same across all pages.
Fri, Jan 17
- Waiting to hear on Science abstract before proceeding with narrative described above
- A few additional analyses:
- Changing confidence intervals from 99% to 95% does not change the results much -- I will continue to use 99% given that it's the more appropriate option for the number of comparisons being made (see the sketch after this list)
- Verified no relationship between gender and day of week / time of day
- Outlined basic narrative and desired data around interplay of Wikipedia and Search
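For reference, a minimal sketch of the 99% vs. 95% comparison mentioned above, assuming normal-approximation confidence intervals for proportions (the p_hat and n values are hypothetical):
```python
import math

def proportion_ci(p_hat, n, z):
    """Normal-approximation confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - z * se, p_hat + z * se)

p_hat, n = 0.55, 2000                  # hypothetical proportion and sample size
print(proportion_ci(p_hat, n, 1.96))   # 95% CI
print(proportion_ci(p_hat, n, 2.576))  # 99% CI -- wider, more conservative
# With many comparisons, the wider 99% intervals guard against spurious "significant" gaps.
```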
Weekly progress: none while waiting on feedback for Q3 directions
- Launched prototype of API: https://tools.wmflabs.org/wiki-topic/
- Coordinate with Diego to ensure uptime for API
- Improve documentation / styling for those who directly visit API
- Retrain model for new, expanded topic taxonomy
- Optional expansion:
- Add LIME-based explanations for predictions
- Give user more control over how predictions are post-processed -- e.g., whether the presence of geographic coordinates is necessary for Geography predictions
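A minimal sketch of the post-processing control described in the last bullet; the prediction format, topic labels, and has_coordinates flag are assumptions for illustration, not the actual wiki-topic API contract:
```python
# Hypothetical post-processing of topic predictions returned by the model.
def postprocess(predictions, has_coordinates, require_coords_for_geo=True,
                threshold=0.5):
    """Filter topic predictions; optionally require coordinates for Geography."""
    kept = {}
    for topic, score in predictions.items():
        if score < threshold:
            continue
        if (require_coords_for_geo and topic.startswith("Geography")
                and not has_coordinates):
            continue  # drop Geography labels for articles without coordinates
        kept[topic] = score
    return kept

print(postprocess({"Geography.Regions": 0.8, "Culture.Media": 0.7},
                  has_coordinates=False))
# {'Culture.Media': 0.7}
```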
Tue, Jan 14
@Halfak : I moved the code to stat1005 so I can hopefully get access to the GPUs there for any further testing. But here's what I have thus far:
@JAllemandou Thank you - as ever!
+1: these wikidata parquet (specifically item_page_link) dumps are super useful for us!
Mon, Jan 13
This is great -- thanks @lexnasser and others who supported! I'll rerun some of the queries that inspired this work in a few days and let you know if I see anything amiss, but silence should be interpreted as success :)
Let's keep things simple and let's document that this format is not covered.
Weekly update: no work yet on wikidata-based topic model.
Weekly progress: none while trying to ascertain what direction we will move in for Q3.
Weekly update: continued progress on organizing panel on re-use for All-Hands, which will provide an informal venue for me to share out some of what we know and gather concerns, feedback, suggestions from staff about this area.
Weekly update: proposed a narrative for the paper that is less narrow and offers a more comprehensive view of what we know about readership of Wikipedia. Summary:
Tue, Jan 7
@leila @Capt_Swing : Is there any interest in this task going forward? Obvious examples right now are Lauren Maggio's work (e.g., https://www.biorxiv.org/content/10.1101/797779v1) and some of Bob's work on article embeddings (e.g., https://dlab.epfl.ch/people/west/pub/Josifoski-Paskov-Paskov-Jaggi-West_WSDM-19.pdf)
Mon, Jan 6
@lexnasser thanks for the update -- bummer about the Google Translate issues, but thanks for continuing to work on it. If it ends up becoming too bulky to separately track host and parameters, just let me know and we'll make sure to note it as a known issue. Better handling of the search engines is already a huge improvement!
Fri, Jan 3
Both this task and T236834 have a lot of rich discussion that I'd love to see on-wiki for the general EventLogging pipeline and QuickSurveys' use of that pipeline. I volunteer to write both during the week of All Hands and would welcome collaborators.
@phuedx yeah, I would second that. I'll let you take the lead but let me know how I can help.
I am hoping to resolve this issue soon, be able to do some final testing, and then release the refinery update quickly after.
@lexnasser that sounds great -- thanks for the update!
Dec 23 2019
So, I could output property values as features using the number portion of the Wikidata QIDs.
Hmmm...I'm torn. On one hand, I like this approach because most properties have too many values to make one-hot encoding realistic. On the other hand, I'm concerned that this requires choosing a single value for each property, and many items (especially high-traffic / higher-quality items) are going to have multiple instance-of or occupation values. In my experience, there is no obvious way to do this (order on Wikidata is at best a weak proxy, and it's very difficult to automatically determine the level of detail that's most useful from the Wikidata taxonomy). It might be that just labeling the White House as a mansion or Douglas Adams as a playwright is acceptable, but I'm hesitant to say that there's a good process for reducing these properties down to a single value. And if a single value isn't good enough, then the end user might have been better off querying Wikidata themselves. So if there's no way to return an array, we might consider identifying a few static properties that we care about, like List/Disambiguation, and just return those as booleans rather than returning incomplete data.
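To make the concern concrete, a minimal sketch of the single-value encoding (the item dict is a simplified stand-in for real Wikidata claims):
```python
# Simplified stand-in for a Wikidata item's claims (property -> list of QIDs).
item = {
    "P31": ["Q5"],                  # instance of: human
    "P106": ["Q36180", "Q214917"],  # occupation: writer, playwright
}

def qid_to_int(qid):
    """Extract the number portion of a QID, e.g., 'Q42' -> 42."""
    return int(qid.lstrip("Q"))

# Single-value encoding forces an arbitrary choice when a property has
# multiple values -- the concern raised above:
single = {prop: qid_to_int(values[0]) for prop, values in item.items()}
print(single)  # {'P31': 5, 'P106': 36180} -- 'playwright' is silently dropped
```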
I will leave this task open while the external referer data subtask is still open, but I consider the literature review complete at this stage and ready for the next stages of sharing out, iteration, and the start of research.
In the course of the literature review, I identified the data that we have available to us and improvements that could be made to this data in order to make it more useful for understanding traffic:
- External referers: T239625
- Bot identification: T238357#5695312
- Identified a broken dashboard of high-level referer data that has since been fixed: https://discovery.wmflabs.org/external/
- Identified a broken dashboard of the effect of translation on readers that is in the process of being fixed: T239876
- I requested access to the Google Search Console to evaluate what can be learned from that data.
High-level summary is below. The doc has more examples and collected research that I am working towards moving to Meta so it will be accessible. Categories in the taxonomy:
- Mirrors / Portals / Offline Access
- Wikimedia content that can be viewed in full outside of Wikimedia. This ranges from very laudable projects that aim to make Wikipedia accessible in areas without good internet access, to alternative interfaces for Wikipedia content that are arguably improvements (though usually with advertisements added), to malicious bulk copying of content without providing links or attribution (piracy).
- Positive Intertwining (Linked Open Data)
- This covers reuse that is akin to linked open data (https://en.wikipedia.org/wiki/Linked_data#Linked_open_data) where data structures are linked to support transclusion of content on external sites to allow data to flow back and forth.
- Direct Search
- This covers instances in which outside services provide a direct search into Wikimedia. This is different from Google Search etc. because it is only indexing Wikipedia and often has unclear referral information.
- Snippets
- These are examples where snippets of Wikimedia content are algorithmically evaluated against other sources and then surfaced on platforms outside of Wikimedia projects, with attribution and links back to Wikimedia where required. These are generally in good faith, but the long-term impact on Wikimedia is unclear and the details vary greatly.
- Automatic Fact-checking
- These are instances where links back to Wikipedia are automatically inserted by platforms into their site to provide context about sources (e.g., BBC, RT) or problematic content like conspiracy theories. It is similar to snippets but the context is very specific and generally Wikipedia is the only source considered.
- Human-generated References / reuse
- These are organic links to Wikipedia that are generated by users on external platforms and can help surface Wikimedia content to readers on the web.
The full document is here (https://docs.google.com/document/d/1moL_JjZLJS-FlnEMrgeoqYmTgA6nqidQVjSn4JO1Jlg/edit#heading=h.bmvp8tvrk94e) but is currently internal until I ensure none of the data etc. is sensitive. I'll add summaries to the subtasks that comprise this literature review.
Dec 19 2019
Dec 17 2019
@Halfak : here are the Python regexes that my NYU masters students developed for English/Hindi/Russian, which might assist in preprocessing XML dump wikitext into model-ready tokens: https://github.com/mmarinated/topic-modeling/blob/master/baseline/data_creation/wiki_parser.py
Great, added! If you see anything that you'd like to change, just ping and I'll update.
Dec 16 2019
Complete -- both uploaded with brief descriptions
Closing this task as summary of results has been added: https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Reader_Behaviour/Demographics_and_Wikipedia_use_cases#Results
- Students submitted their report: https://drive.google.com/file/d/10KwWAB95dRqhoS7VjmlihUPrpTVOlw6D/view?usp=sharing
Yeah, that works for me. I looked, but it doesn't seem I can give you edit permissions to a figshare item I created, so just point me towards what files you want uploaded and any additional description I should add. There were also two Aaron Halfakers (!!) on figshare when I went to add your name to the item, and both were labeled as inactive, so let me know if there's an account you want linked to the item as well.
Dec 13 2019
Yes -- sorry for not following up more quickly but I have access and thanks for the quick support!
TikTok updated with an integration of direct links to Wikipedia in late September, which is a direct referral source. But we didn't find any significant increase in pageviews that looks brand-related.
I also looked into this but couldn't find any videos yet that have linked to Wikipedia. I'm really curious to follow this though so if you find any examples, please let me know!
Dec 11 2019
As discussed on IRC: the wikiproject_to_templates YAML is currently missing a number of WikiProjects. Based on the WikiProject templates that I detected in my previous analysis of English Wikipedia by case-insensitive string-matching against "wp" and "wikiproject", here are the top 100 templates that are missing from the YAML and how many articles they were found in. There is also a long tail of unique template names (1877 total, though some of them are false positives). Full list at stat1007:/home/isaacj/drafttopic/templates_missing_from_yaml.tsv
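For reference, a minimal sketch of the kind of case-insensitive matching involved, using mwparserfromhell (an assumption for illustration; the actual detection code may differ):
```python
import mwparserfromhell

def wikiproject_templates(talk_page_wikitext):
    """Return template names that look like WikiProject banners."""
    wikicode = mwparserfromhell.parse(talk_page_wikitext)
    names = []
    for template in wikicode.filter_templates():
        name = str(template.name).strip().lower()
        if name.startswith("wikiproject") or name.startswith("wp"):
            names.append(name)
    return names

print(wikiproject_templates("{{WikiProject Medicine|class=B}} {{WP Biography}}"))
# ['wikiproject medicine', 'wp biography']
```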
@Miriam looks good to me -- thanks!
Dec 10 2019
@Miriam whenever you put together the wikiworkshop banner for 2020 (e.g., like this: https://research.wikimedia.org/events.html), let me know and i'll update the website! No pressure -- just stumbled across it and realized that we'll want to update it.
Fragments aren't even sent in requests; they are handled entirely client-side.
@Pcoombe oh yikes, good point, thanks! I thought I had verified that the fragment was sent as part of the URL, but you're right that it's purely client-side. Well, I suppose that resolves why we do not currently track it :)
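For anyone following along, a quick standard-library demonstration that the fragment is a separate, client-side component of the URL:
```python
from urllib.parse import urlsplit

url = "https://en.wikipedia.org/wiki/Python_(programming_language)#History"
parts = urlsplit(url)
print(parts.path)      # /wiki/Python_(programming_language)
print(parts.fragment)  # History -- kept by the client, never sent to the server
```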
Dec 9 2019
@Halfak thanks for breaking this out as that other task was rapidly growing larger :)
Hey @lexnasser this is really great! Thanks!!
Dec 3 2019
Thanks @Halfak, this is awesome! I left a bunch of comments with the goal of trimming it down.
Dec 2 2019
Nov 27 2019
Presentation went well, with lots of questions in particular about the citation usage work.
Slide deck: https://docs.google.com/presentation/d/1etz4ihkP2lu25KJMW61ZF4iyqGOj84iqy-5adMwBnM8/edit?usp=sharing
- Wrapping up work and moving towards writing
- Fixed an error with the regex used for cleaning wikitext that resulted in removing the end of paragraphs if a citation template was included mid-paragraph (see the sketch after this list)
- Still some final exploration w/ graph embeddings with Hindi Wikipedia
- Going to use cross-fold validation to get confidence intervals for model performance
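A minimal illustration of the failure mode behind the regex fix above; these patterns are illustrative, not the actual regexes used (and neither handles nested templates):
```python
import re

text = "Start of paragraph.{{cite web|url=http://example.com}} End of paragraph."

# A buggy pattern that matches to end-of-line removes everything after a
# mid-paragraph citation -- the failure mode described above:
print(re.sub(r"\{\{cite.*", "", text))  # 'Start of paragraph.'

# Restricting the match to the template's own braces keeps the tail:
print(re.sub(r"\{\{[^{}]*\}\}", "", text))
# 'Start of paragraph. End of paragraph.'
```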
Weblight data will be excluded from the classification entirely; the way it gets to us, it does not have any client IP that we can use. This is true for any other proxy, as our traffic layer for the most part does not forward the client IP, and this is not likely to change in the near term. See: T232795
Thanks for the pointer!
Nov 26 2019
Hey @Nuria -- I had been doing some of my own research on this as part of some background work around re-use of Wikimedia content. I wanted to throw in a few thoughts in case they're useful (and am largely excited about the proposed spike detection!):
- +1 to identifying weblight traffic via user-agent string. It's a large proportion of the "None" referers, which clouds that data. I suspect it's mostly search but obviously don't know that.
- The weblight data got me thinking about bot-like traffic that is really coming from VPNs or other proxies. I took a look at some of the userhashes that have very high numbers of pageviews per hour and have generated a few hypotheses:
- Some of the userhashes have pageviews that are nearly all for a single project (e.g., en.wikipedia) and/or repeatedly hit the same title (e.g., the userhash behind this: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Simple_Mail_Transfer_Protocol) -- those feel very likely to be bots. VPNs/proxies, though, often seem to mix projects (because lots of different users are coming in via the same "device") and have the expected share of visits to Wikipedia's Main Page (~1%). So personally, I think a high pageview count but a more uniform distribution of projects / titles associated with a single userhash might be good evidence of a VPN/proxy as opposed to a bot (see the sketch after this list). I don't have a great recommendation for what that threshold is right now, but would be happy to work with you on it.
- I haven't looked at device (i.e. desktop vs. mobile) but a mix of devices might be a useful parameter as well for separating out bots from VPNs
- It looks like Google Translate preserves the user-agent even though the IP seems to be Google's servers rather than the actual client's, so I doubt it would show up in the data. It would also be simple to exclude via the presence of translationengine in the x_analytics_map.
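A minimal sketch of the bot-vs-VPN/proxy heuristic floated above; the thresholds are placeholders drawn from the hypotheses in this discussion, not validated cutoffs:
```python
from collections import Counter

def classify_userhash(pageviews, min_views=500):
    """Guess bot vs. VPN/proxy from one userhash's (project, title) pageviews."""
    if len(pageviews) < min_views:
        return "normal"
    projects = Counter(project for project, _ in pageviews)
    titles = Counter(pageviews)
    top_project_share = projects.most_common(1)[0][1] / len(pageviews)
    top_title_share = titles.most_common(1)[0][1] / len(pageviews)
    # Nearly all views to one project or one title -> likely automated.
    if top_project_share > 0.9 or top_title_share > 0.1:
        return "likely bot"
    # High volume but diverse projects/titles -> plausibly many users
    # sharing one VPN/proxy "device".
    return "possible VPN/proxy"
```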
Nov 22 2019
I wanted to add a couple data points / hypotheses to this discussion:
- Chrome Mobile Version 38 that Nuria mentions as #3 in T195880#4429156 is actually almost all Google Weblight Proxy (per an inspection of the full user-agent string: Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19).
- Intuitively, we would have a higher amount of none-referer traffic than other sites due to people setting Wikipedia (or Special:Random) as their homepage.
- This is further supported by the fact that there is almost no "none" traffic from Chrome Mobile (if you take out Google Weblight Proxy), and Chrome Mobile defaults either to no home page or just a blank "new tab". I don't have an iPhone, but my understanding is that mobile Safari loads up the most recent page you visited (which would be Wikipedia a non-trivial amount of the time).
- Approximately 20% of these page views are coming from IP+UA pairs that have over 500 pageviews an hour -- suggesting either bot or shared VPN/proxy.
- It's hard to judge what proportion of those page views are bots vs. VPNs/proxies, but when you define bots as >90% page views to a single project or a single article receiving >10% of page views, it looks like a quarter to a half of these page views could be VPNs and not bots.
- Looking at the titles being read, we further see that while the Main Page generally gets less than 1% of page views, it receives closer to 10% of page views from None referrers (backing up the Wikipedia-as-homepage hypothesis).
Nov 21 2019
Nov 13 2019
@leila I'm breaking this next update into a few. First is taking care of the smaller things that don't require reorganization. Does this team page look like what you were expecting? I'll also remove WikiCite from the events page and add the eliciting new editors blogpost. Right now we don't actually have a good spot to put the Understanding Thanks blogpost and something about Doris' Outreachy project, so I'm going to continue to think on that.
Nov 11 2019
Not sure if we can do it in Jupyterhub, but we can probably add something to the MOTD of the stat/notebook hosts, so when people ssh in they'll get instructions about what to do for Kerberos, where to find docs, etc. Nice suggestion, thanks!
- Work on this project will come to a close on December 2nd (and the students will switch to writing)
- TFIDF-based "attention" scores work almost as well as learned attention scores but are slower in training right now. Looking into the issue -- it might be a function of the TFIDF scores not being normalized to sum to 1 for a given article (see the sketch after this list).
- For any given language, the hope is to have model performance for the following experimental setups:
- Trained purely on that language. Aligned fastText embeddings. Randomly initialized model weights.
- Model trained on English with all weights frozen (no fine-tuning). Aligned fastText embeddings.
- Model trained on English with final layer fine-tuned to new language. Aligned fastText embeddings.
- Model trained on mixture of examples from different languages. Aligned fastText embeddings. No language-specific fine-tuning.
- For the transfer learning problem (fine-tune a general model to identify a specific wikiproject), examined a more difficult negative sample (positive = Human rights; negative = Politics articles) and found F1 still quite high (>0.9). Going to look into one-shot learning techniques and embedding cosine-distance as an even simpler approach.
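A minimal sketch of the normalization being tested for the TFIDF-attention issue flagged above (illustrative only):
```python
import numpy as np

def tfidf_attention(scores):
    """Normalize an article's raw token TF-IDF scores to sum to 1."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum()
    if total == 0:
        return np.full_like(scores, 1.0 / len(scores))  # fall back to uniform
    return scores / total

weights = tfidf_attention([3.2, 0.5, 1.3])
print(weights, weights.sum())  # [0.64 0.1  0.26] 1.0
```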
Nov 7 2019
The option that is currently available is a keytab.
Ok, that works for me. I'll avoid it but it's good to know it's an option if needed.