Isaac (Isaac Johnson)
User

User Details

User Since
Oct 1 2018, 2:19 PM (68 w, 5 d)
Availability
Available
IRC Nick
isaacj
LDAP User
Isaac Johnson
MediaWiki User
Isaac (WMF)

Recent Activity

Yesterday

Isaac added a comment to T242170: Define research agenda for external re-use.

Weekly update:

  • continued preparation for All-Hands panel on reuse
  • iterated with Jonathan on social media traffic report planning -- in particular looking for anecdotal evidence of a link between external referrals and editing to help us gauge what those relationships might look like. Early observations suggest that YouTube fact-checking links lead to a low level of vandalism; Reddit links seem to lead to minor, positive edits but can have larger positive impacts depending on the community; and Twitter/Facebook showed little evidence of leading to editing.
Fri, Jan 24, 4:30 PM · Research (FY2019-20-Research-January-March)
Isaac added a comment to T242176: Launch experimental API for Wikidata-based topic model.

Weekly update:

  • Improved wiki-topic interface so it's easier to query
  • Drafted user script for automatically querying for an article's topic predictions but I do not recommend running it right now as toolforge isn't a trusted domain on Wikipedia for making content requests (see T28508): https://en.wikipedia.org/wiki/User:Isaac_(WMF)/WikidataTopic.js
  • I added a short disclaimer statement to the interface noting that it's experimental and no personal data is collected. I asked the cloud team via IRC whether they had other suggestions for privacy policies to link to but they indicated that there are no standard terms etc. that they recommend including.
  • Note: uptime does not seem to be an issue with toolforge so I don't think that I have to worry about keeping the server awake!
Fri, Jan 24, 4:17 PM · Research (FY2019-20-Research-January-March)
Isaac added a comment to T242162: Submit paper on reader demographics surveys for peer-review.

Weekly update:

  • Provided concrete examples of top-level findings in support of Science abstract
  • Computed the % of page views that come from men/women for each language based on our survey results -- in general the % of pageviews from men is about 5-10% higher than the proportion of readers who are men, because men also consistently had slightly longer reading sessions than women.
  • Green light from team to begin expanding out comprehensive paper on Wikipedia readership
  • Florian will get back to me on deleting data from WtWRW project to free up space on stat1007
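
The gap between reader share and pageview share noted above follows from weighting each group by how many pages it views. A minimal sketch with made-up numbers (the 60/40 reader split and the per-reader pageview averages are illustrative assumptions, not survey results, and `pageview_share` is a hypothetical helper):

```python
def pageview_share(reader_share, views_per_reader, other_views_per_reader):
    """Share of pageviews from a group, given its share of readers and the
    average pageviews per reader for that group and for everyone else."""
    group_views = reader_share * views_per_reader
    other_views = (1 - reader_share) * other_views_per_reader
    return group_views / (group_views + other_views)

# Hypothetical inputs: 60% of readers are men, averaging 3.0 pageviews per
# session versus 2.4 for all other readers.
share = pageview_share(0.60, 3.0, 2.4)
print(f"reader share: 0.600, pageview share: {share:.3f}")
```

With these inputs the pageview share comes out at roughly 65%, about five points above the reader share, even though nothing about the reader population itself changed.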
Fri, Jan 24, 4:08 PM · Research (FY2019-20-Research-January-March)
Isaac added a comment to T242173: Build taxonomy of readership gaps.

Weekly progress: none while waiting on feedback for Q3 directions. I'll note that the strategy recommendations provide further evidence of why it's valuable for this work to move towards understanding misalignment between readers and content.

Fri, Jan 24, 3:38 PM · Research (FY2019-20-Research-January-March)

Wed, Jan 22

Isaac added a comment to T243352: Align landing page with wikimediafoundation.org.

Email sent -- for the final bullet point around the foundational work, I made this suggestion:

Wed, Jan 22, 4:00 PM · Research
Isaac added a comment to T181588: Research landing page: Maintenance development post-launch.

Thanks @bmansurov -- I wasn't thinking about the javascript dependency that would introduce. I'll still be interested to hear from Volker regarding the status of working with static site generators then, but it sounds like we should either invest fully in that solution or stick with the simple, hand-edited HTML that we currently have.

Wed, Jan 22, 2:56 PM · Research-Backlog, Research-management

Tue, Jan 21

Isaac added a comment to T243352: Align landing page with wikimediafoundation.org.

The instances and my proposed solutions are below:

Tue, Jan 21, 10:30 PM · Research
Isaac created T243352: Align landing page with wikimediafoundation.org.
Tue, Jan 21, 10:28 PM · Research
Isaac added a comment to T181588: Research landing page: Maintenance development post-launch.

@Volker_E @bmansurov Just checking in on this as we are considering revisiting this task. Any thoughts / updates on whether there is established guidance for moving to a static site generator, as was discussed previously? If not, should we consider in the meantime something like what Gapfinder uses, where various elements on the page such as the disclaimer are loaded as static elements using riot.js? I expect this latter approach would work at least for the header/footer (which are exactly the same for all pages).

Tue, Jan 21, 10:04 PM · Research-Backlog, Research-management

Fri, Jan 17

Isaac added a comment to T242162: Submit paper on reader demographics surveys for peer-review.

Weekly update:

  • Waiting to hear on Science abstract before proceeding with narrative described above
  • A few additional analyses:
    • Changing confidence intervals from 99% to 95% does not change the results much -- I will continue to use 99% given that it's the more appropriate option for the number of comparisons being made
    • Verified no relationship between gender and day of week / time of day
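
The reasoning for preferring 99% intervals when many comparisons are made can be illustrated with a standard Bonferroni-style calculation. The comment does not say which correction (if any) was applied, so this is only a sketch of the general idea; the number of comparisons is a hypothetical value and independence between comparisons is assumed.

```python
def familywise_error(alpha, n_comparisons):
    """Chance of at least one false positive across independent comparisons,
    each tested at significance level alpha."""
    return 1 - (1 - alpha) ** n_comparisons

def bonferroni_alpha(target_alpha, n_comparisons):
    """Per-comparison alpha that keeps the family-wise rate near target_alpha."""
    return target_alpha / n_comparisons

n = 5  # hypothetical number of comparisons, not taken from the paper
print(f"family-wise error at 95% intervals: {familywise_error(0.05, n):.3f}")
print(f"per-comparison alpha for a 5% family-wise rate: {bonferroni_alpha(0.05, n):.3f}")
```

Under these assumptions, five comparisons at the 95% level carry a family-wise error rate above 20%, while a per-comparison alpha of 0.01 (a 99% interval) keeps it near 5%.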
Fri, Jan 17, 9:40 PM · Research (FY2019-20-Research-January-March)
Isaac added a comment to T242170: Define research agenda for external re-use.

Weekly update:

  • Outlined basic narrative and desired data around interplay of Wikipedia and Search
  • Related:
    • Patch for external data referers that I was supporting is now in production! See: T239625
    • Began planning expanded queries for social media traffic report (T241768) as part of initial inquiries into editor impact of external traffic
Fri, Jan 17, 9:37 PM · Research (FY2019-20-Research-January-March)
Isaac added a comment to T242173: Build taxonomy of readership gaps.

Weekly progress: none while waiting on feedback for Q3 directions

Fri, Jan 17, 9:30 PM · Research (FY2019-20-Research-January-March)
Isaac added a comment to T242176: Launch experimental API for Wikidata-based topic model.

Weekly update:

  • Launched prototype of API: https://tools.wmflabs.org/wiki-topic/
  • TODOs:
    • Write user javascript for wikis to automatically grab article topics a la https://en.wikipedia.org/wiki/User:EpochFail/DraftTopic.js
    • Coordinate with Diego to ensure uptime for API
    • Improve documentation / styling for those who directly visit API
    • Retrain model for new, expanded topic taxonomy
    • Optional expansion:
      • Add LIME-based explanations for predictions
      • Give user more control over how predictions are post-processed -- e.g., whether the presence of geographic coordinates is necessary for Geography predictions
Fri, Jan 17, 9:30 PM · Research (FY2019-20-Research-January-March)

Tue, Jan 14

Isaac added a comment to T242013: Implement native NN topic model in revscoring.

@Halfak : I moved the code to stat1005 so I can hopefully get access to the GPUs there for any further testing. But here's what I have thus far:

Tue, Jan 14, 11:47 PM · Scoring-platform-team (Current), artificial-intelligence, drafttopic-modeling, revscoring
Isaac added a comment to T209655: Copy Wikidata dumps to HDFS.

@JAllemandou Thank you - as ever!

+1: these wikidata parquet (specifically item_page_link) dumps are super useful for us!

Tue, Jan 14, 6:47 PM · Research-Backlog, Wikidata, Analytics
Isaac moved T242172: Taxonomy of Knowledge Gaps from Staged to In Progress on the Research board.
Tue, Jan 14, 6:31 PM · Research, Epic
Isaac moved T242168: Study external re-use of Wikimedia content from Staged to In Progress on the Research board.
Tue, Jan 14, 6:31 PM · Research, Epic

Mon, Jan 13

Isaac added a comment to T239625: Improve quality of external referer data.

This is great -- thanks @lexnasser and others who supported! I'll rerun some of the queries that inspired this work in a few days and let you know if I see anything amiss, but silence should be interpreted as success :)

Mon, Jan 13, 10:24 PM · Product-Analytics, Analytics-Kanban, Research, Analytics
Isaac added a comment to T239625: Improve quality of external referer data.

Let's keep things simple and let's document that this format is not covered.

Mon, Jan 13, 5:49 PM · Product-Analytics, Analytics-Kanban, Research, Analytics
Isaac added a comment to T242176: Launch experimental API for Wikidata-based topic model.

Weekly update: no work yet on wikidata-based topic model.

Mon, Jan 13, 5:35 PM · Research (FY2019-20-Research-January-March)
Isaac added a comment to T242173: Build taxonomy of readership gaps.

Weekly progress: none while trying to ascertain what direction we will move in for Q3.

Mon, Jan 13, 5:30 PM · Research (FY2019-20-Research-January-March)
Isaac added a comment to T242170: Define research agenda for external re-use.

Weekly update: continued progress on organizing panel on re-use for All-Hands, which will provide an informal venue for me to share out some of what we know and gather concerns, feedback, suggestions from staff about this area.

Mon, Jan 13, 5:15 PM · Research (FY2019-20-Research-January-March)
Isaac added a comment to T242162: Submit paper on reader demographics surveys for peer-review.

Weekly update: proposed a narrative for the paper that is less narrow and more of a comprehensive view of what we know about readership of Wikipedia. Summary:

Mon, Jan 13, 5:14 PM · Research (FY2019-20-Research-January-March)

Tue, Jan 7

Isaac updated subscribers of T242176: Launch experimental API for Wikidata-based topic model.
Tue, Jan 7, 11:17 PM · Research (FY2019-20-Research-January-March)
Isaac created T242176: Launch experimental API for Wikidata-based topic model.
Tue, Jan 7, 11:06 PM · Research (FY2019-20-Research-January-March)
Isaac edited projects for T242173: Build taxonomy of readership gaps, added: Research (FY2019-20-Research-January-March); removed Research.
Tue, Jan 7, 10:56 PM · Research (FY2019-20-Research-January-March)
Isaac created T242173: Build taxonomy of readership gaps.
Tue, Jan 7, 10:51 PM · Research (FY2019-20-Research-January-March)
Isaac added a subtask for T242172: Taxonomy of Knowledge Gaps: T235544: [AKG] A taxonomy of content gaps in Wikipedia and their causes.
Tue, Jan 7, 10:48 PM · Research, Epic
Isaac added a parent task for T235544: [AKG] A taxonomy of content gaps in Wikipedia and their causes: T242172: Taxonomy of Knowledge Gaps.
Tue, Jan 7, 10:48 PM · Research
Isaac updated the task description for T242172: Taxonomy of Knowledge Gaps.
Tue, Jan 7, 10:48 PM · Research, Epic
Isaac created T242172: Taxonomy of Knowledge Gaps.
Tue, Jan 7, 10:47 PM · Research, Epic
Isaac edited projects for T242162: Submit paper on reader demographics surveys for peer-review, added: Research (FY2019-20-Research-January-March); removed Research.
Tue, Jan 7, 10:38 PM · Research (FY2019-20-Research-January-March)
Isaac edited projects for T242170: Define research agenda for external re-use, added: Research (FY2019-20-Research-January-March); removed Research.
Tue, Jan 7, 10:37 PM · Research (FY2019-20-Research-January-March)
Isaac created T242170: Define research agenda for external re-use.
Tue, Jan 7, 10:33 PM · Research (FY2019-20-Research-January-March)
Isaac added a subtask for T242168: Study external re-use of Wikimedia content: T235780: Literature review of external reuse of Wikimedia content.
Tue, Jan 7, 10:30 PM · Research, Epic
Isaac added a parent task for T235780: Literature review of external reuse of Wikimedia content: T242168: Study external re-use of Wikimedia content.
Tue, Jan 7, 10:30 PM · Research
Isaac created T242168: Study external re-use of Wikimedia content.
Tue, Jan 7, 10:30 PM · Research, Epic
Isaac created T242162: Submit paper on reader demographics surveys for peer-review.
Tue, Jan 7, 9:59 PM · Research (FY2019-20-Research-January-March)
Isaac updated subscribers of T201649: Add section to 'publications.html' for papers written by research collaborators that don't have a WMF co-author.

@leila @Capt_Swing : Is there any interest in this task going forward? The obvious example right now being Lauren Maggio (e.g., https://www.biorxiv.org/content/10.1101/797779v1) or some of Bob's work on article embeddings (e.g., https://dlab.epfl.ch/people/west/pub/Josifoski-Paskov-Paskov-Jaggi-West_WSDM-19.pdf)

Tue, Jan 7, 3:38 PM · Research-management, Research, Research-landing-page
Isaac moved T236713: Improve drafttopic training data pipeline from Staged to In Progress on the Research board.
Tue, Jan 7, 3:32 PM · Scoring-platform-team, NewcomerTasks 1.1, Research
Isaac moved T240276: Restructure WikiProject directory to be better from Staged to In Progress on the Research board.
Tue, Jan 7, 3:31 PM · Scoring-platform-team (Current), NewcomerTasks 1.1, Research
Isaac moved T240282: Improve WikiProject template --> WikiProject mapping from Staged to In Progress on the Research board.
Tue, Jan 7, 3:31 PM · Scoring-platform-team (Current), NewcomerTasks 1.1, Research

Mon, Jan 6

Isaac added a comment to T239625: Improve quality of external referer data.

@lexnasser thanks for the update -- bummer regarding the issues with Google Translate but thanks for continuing to work on it. If it ends up becoming too bulky to separately track host and parameters, just let me know and we'll just make sure to note it as a known issue. Better handling of the search engines is already a huge improvement!

Mon, Jan 6, 10:53 PM · Product-Analytics, Analytics-Kanban, Research, Analytics

Fri, Jan 3

Isaac added a comment to T220627: QuickSurveys EventLogging missing ~10% of interactions.

Both this task and T236834 have a lot of rich discussion that I'd love to see on-wiki for the general EventLogging pipeline and QuickSurveys' use of that pipeline. I volunteer to write both during the week of All Hands and would welcome collaborators.

@phuedx yeah, I would second that. I'll let you take the lead but let me know how I can help.

Fri, Jan 3, 5:20 PM · MW-1.35-notes (1.35.0-wmf.3; 2019-10-22), Patch-For-Review, Readers-Web-Backlog (Kanbanana-2019-20-Q2), Analytics, Analytics-EventLogging, QuickSurveys
Isaac added a comment to T239625: Improve quality of external referer data.

I am hoping to resolve this issue soon, be able to do some final testing, and then release the refinery update quickly after.

@lexnasser that sounds great -- thanks for the update!

Fri, Jan 3, 5:17 PM · Product-Analytics, Analytics-Kanban, Research, Analytics

Dec 23 2019

Isaac added a comment to T241270: Add wikidata features to topic models.

So, I could output property values as the number portion of Wikidata Qids as features.

Hmmm...I'm torn. On one hand, I like this approach because most properties have too many values to make one-hot encoding realistic. On the other hand, I'm concerned that this requires choosing a single value for each property, and many items (especially high-traffic / higher-quality items) are going to have multiple instance-of or occupation values. In my experience, there is no obvious way to do this (order on Wikidata is at best a weak proxy and it's very difficult to automatically determine the level of detail that's most useful from the Wikidata taxonomy). It might be that just labeling the White House as a mansion or Douglas Adams as a playwright is acceptable, but I'm hesitant to say that there's a good process for reducing these properties down to a single value. And if a single value isn't good enough, then the end user might just have been better off querying Wikidata themselves. So if there's no way to return an array, we might consider identifying a few static properties that we care about like List/Disambiguation and just return those as booleans rather than returning incomplete data.
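
The fallback suggested at the end (a few static boolean features instead of one lossy value per property) could look roughly like the sketch below. The input shape (a dict mapping property IDs to lists of QIDs) and the function name are assumptions for illustration and not the revscoring API; P31 is Wikidata's instance-of property, Q13406463 is "Wikimedia list article", and Q4167410 is "Wikimedia disambiguation page".

```python
LIST_QID = "Q13406463"     # Wikimedia list article
DISAMBIG_QID = "Q4167410"  # Wikimedia disambiguation page

def static_features(claims):
    """Return boolean features from an item's claims.

    claims: hypothetical dict of property ID -> list of QID strings,
    e.g. {"P31": ["Q5"], "P106": ["Q214917", "Q28389"]}.
    """
    # An item may have several instance-of values, so test membership
    # rather than picking one value.
    instance_of = set(claims.get("P31", []))
    return {
        "is_list": LIST_QID in instance_of,
        "is_disambiguation": DISAMBIG_QID in instance_of,
    }

print(static_features({"P31": ["Q4167410"]}))
```

Membership tests sidestep the question of which single instance-of value to keep, at the cost of only covering the handful of properties chosen up front.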

Dec 23 2019, 5:56 PM · artificial-intelligence, drafttopic-modeling, revscoring, Scoring-platform-team
Isaac updated the task description for T219903: Keep research.wikipedia.org landing page updated.
Dec 23 2019, 5:10 PM · Research
Isaac added a comment to T235780: Literature review of external reuse of Wikimedia content.

I will leave this task open while the external referer data subtask is still open, but I consider the literature review complete at this stage and ready for the next stages of sharing out, iteration, and the start of research.

Dec 23 2019, 5:06 PM · Research
Isaac closed T235784: Identify data / questions that we can(not) answer regarding external reuse, a subtask of T235780: Literature review of external reuse of Wikimedia content, as Resolved.
Dec 23 2019, 5:05 PM · Research
Isaac closed T235784: Identify data / questions that we can(not) answer regarding external reuse as Resolved.

In the course of the literature review, I identified the data that we have available to us and improvements that could be made to this data in order to make it more useful for understanding traffic:

Dec 23 2019, 5:05 PM · Research
Isaac closed T235781: Taxonomy of re-use and current knowledge of the effect on traffic to Wikimedia as Resolved.

High-level summary is below. The doc has more examples and collected research that I am working towards moving to Meta so it will be accessible. Categories in the taxonomy:

  • Mirrors / Portals / Offline Access
    • Wikimedia content that can be viewed in full outside of Wikimedia. This ranges from very laudable projects that aim to make Wikipedia accessible to areas without good internet access to just a different interface for Wikipedia content that is arguably an improvement (usually with the addition of advertisements though) to malicious bulk copying of content without providing links or attribution (piracy).
  • Positive Intertwining (Linked Open Data)
  • Direct Search
    • This covers instances in which outside services provide a direct search into Wikimedia. This is different from Google Search etc. because it is only indexing Wikipedia and often has unclear referral information.
  • Snippets
    • These are examples where snippets of Wikimedia content are algorithmically evaluated against other sources and then surfaced on platforms outside of Wikimedia projects with attribution and links back to Wikimedia where required. These are generally in good faith but the long-term impact on Wikimedia is unclear and the details vary greatly.
  • Automatic Fact-checking
    • These are instances where links back to Wikipedia are automatically inserted by platforms into their site to provide context about sources (e.g., BBC, RT) or problematic content like conspiracy theories. It is similar to snippets but the context is very specific and generally Wikipedia is the only source considered.
  • Human-generated References / reuse
    • These are organic links to Wikipedia that are generated by users on external platforms that can help surface Wikimedia content to readers on the web
Dec 23 2019, 4:45 PM · Research
Isaac closed T235781: Taxonomy of re-use and current knowledge of the effect on traffic to Wikimedia, a subtask of T235780: Literature review of external reuse of Wikimedia content, as Resolved.
Dec 23 2019, 4:45 PM · Research
Isaac added a comment to T235780: Literature review of external reuse of Wikimedia content.

The full document is here (https://docs.google.com/document/d/1moL_JjZLJS-FlnEMrgeoqYmTgA6nqidQVjSn4JO1Jlg/edit#heading=h.bmvp8tvrk94e) but is currently internal until I ensure none of the data etc. is sensitive. I'll add summaries to the subtasks that comprise this literature review.

Dec 23 2019, 4:37 PM · Research

Dec 19 2019

Isaac committed rRLP3ac345b0bebd: Fix incorrect links on Programs page. (authored by Isaac).
Dec 19 2019, 7:00 PM
Isaac committed rRLPc7fd366c4c27: Refactor projects to be programs and map to 2018 white papers. Update… (authored by Isaac).
Dec 19 2019, 5:47 PM
Isaac updated the task description for T219903: Keep research.wikipedia.org landing page updated.
Dec 19 2019, 5:40 PM · Research

Dec 17 2019

Isaac added a comment to T236713: Improve drafttopic training data pipeline.

@Halfak : here are the Python regexes that my NYU masters students developed for English/Hindi/Russian, which might assist in preprocessing XML dump wikitext to model-ready tokens: https://github.com/mmarinated/topic-modeling/blob/master/baseline/data_creation/wiki_parser.py

Dec 17 2019, 6:40 PM · Scoring-platform-team, NewcomerTasks 1.1, Research
Isaac closed T240273: Extract cross-wiki WikiProject tags, a subtask of T236713: Improve drafttopic training data pipeline, as Resolved.
Dec 17 2019, 3:15 PM · Scoring-platform-team, NewcomerTasks 1.1, Research
Isaac closed T240273: Extract cross-wiki WikiProject tags as Resolved.

Great, added! If you see anything that you'd like to change, just ping and I'll update.

Dec 17 2019, 3:15 PM · Scoring-platform-team, Research

Dec 16 2019

Isaac updated the task description for T240273: Extract cross-wiki WikiProject tags.
Dec 16 2019, 7:52 PM · Scoring-platform-team, Research
Isaac added a comment to T240273: Extract cross-wiki WikiProject tags.

see https://github.com/halfak/wikitax/tree/master/datasets

Complete -- both uploaded with brief descriptions

Dec 16 2019, 7:51 PM · Scoring-platform-team, Research
Isaac added a comment to T236713: Improve drafttopic training data pipeline.

Some follow-up to a conversation with @Halfak and @dr0ptp4kt :

Dec 16 2019, 7:28 PM · Scoring-platform-team, NewcomerTasks 1.1, Research
Isaac closed T232525: Repeat demographics surveys for longer time period, a subtask of T203042: Output 2.2: Characterizing readership by demographics, as Resolved.
Dec 16 2019, 5:40 PM · Research, address-knowledge-gaps, Epic
Isaac closed T232525: Repeat demographics surveys for longer time period as Resolved.

Closing this task as summary of results has been added: https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Reader_Behaviour/Demographics_and_Wikipedia_use_cases#Results

Dec 16 2019, 5:40 PM · Research
Isaac closed T233646: Article Topic NYU Fall 2019 Capstone Project as Resolved.
Dec 16 2019, 4:19 PM · Research
Isaac added a parent task for T239625: Improve quality of external referer data: T235780: Literature review of external reuse of Wikimedia content.
Dec 16 2019, 3:47 PM · Product-Analytics, Analytics-Kanban, Research, Analytics
Isaac added a subtask for T235780: Literature review of external reuse of Wikimedia content: T239625: Improve quality of external referer data.
Dec 16 2019, 3:47 PM · Research
Isaac added a comment to T240273: Extract cross-wiki WikiProject tags.

Yeah, that works for me. I looked but it doesn't seem I can give you edit permissions to a figshare item I created, so just point me towards what files you want uploaded and any additional description I should add. There were also two Aaron Halfakers (!!) on figshare when I went to add your name to the item and both were labeled as inactive, so let me know if there's an account you want linked to the item as well.

Dec 16 2019, 2:58 PM · Scoring-platform-team, Research

Dec 13 2019

Isaac updated the task description for T240273: Extract cross-wiki WikiProject tags.
Dec 13 2019, 10:35 PM · Scoring-platform-team, Research
Isaac added a comment to T240273: Extract cross-wiki WikiProject tags.

@Halfak : dataset is now uploaded to Figshare: https://doi.org/10.6084/m9.figshare.10248344.v1

Dec 13 2019, 10:34 PM · Scoring-platform-team, Research
Isaac added a comment to T240501: Google Search Console access request -- Isaac.

Yes -- sorry for not following up more quickly but I have access and thanks for the quick support!

Dec 13 2019, 6:46 PM · Research, Operations, SRE-Access-Requests
Isaac added a comment to T239811: Investigate recent increase in pageviews in September and October.

TikTok updated with an integration of direct links to Wiki in late September, which is a direct referral source. But we didn't find any significant increase in pageviews that looks brand-related.

I also looked into this but couldn't find any videos yet that have linked to Wikipedia. I'm really curious to follow this though so if you find any examples, please let me know!

Dec 13 2019, 5:27 PM · Product-Analytics (Kanban)

Dec 11 2019

Isaac added a comment to T240273: Extract cross-wiki WikiProject tags.

As discussed on IRC: the wikiproject_to_templates YAML is currently missing a number of WikiProjects. Based on the WikiProject templates that I detected in my previous analysis of English Wikipedia by case-insensitive string-matching against "wp" and "wikiproject", here are the top 100 templates that are missing from the YAML and how many articles they were found in. There is also a long tail of unique template names (1877 total, though some of them are false positives). Full list at stat1007:/home/isaacj/drafttopic/templates_missing_from_yaml.tsv
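
The detection step described above (case-insensitive matching of template names against "wp" and "wikiproject", then counting articles per template) can be sketched as follows. The input format and function name are assumptions for illustration; the actual extraction code is not shown in this comment.

```python
import re
from collections import Counter

# Case-insensitive match on either substring, as described in the comment.
WIKIPROJECT_PATTERN = re.compile(r"wp|wikiproject", re.IGNORECASE)

def count_candidate_templates(template_names_per_article):
    """Count, for each matching template name, how many articles use it.

    template_names_per_article: hypothetical iterable with one list of
    template names per article.
    """
    counts = Counter()
    for names in template_names_per_article:
        # Deduplicate so a template used twice in one article counts once.
        for name in set(names):
            if WIKIPROJECT_PATTERN.search(name):
                counts[name] += 1
    return counts

articles = [
    ["WikiProject Biography", "Talk header"],
    ["WPBiography", "WikiProject Biography"],
]
print(count_candidate_templates(articles).most_common())
```

A plain substring match like this is deliberately loose, which is consistent with the false positives mentioned above: any template name that merely contains "wp" will be counted.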

Dec 11 2019, 10:28 PM · Scoring-platform-team, Research
Isaac created T240501: Google Search Console access request -- Isaac.
Dec 11 2019, 8:27 PM · Research, Operations, SRE-Access-Requests
Isaac added a comment to T219903: Keep research.wikipedia.org landing page updated.

@Miriam looks good to me -- thanks!

Dec 11 2019, 7:01 PM · Research
Isaac moved T235780: Literature review of external reuse of Wikimedia content from Staged to In Progress on the Research board.
Dec 11 2019, 2:14 AM · Research
Isaac moved T235781: Taxonomy of re-use and current knowledge of the effect on traffic to Wikimedia from Staged to In Progress on the Research board.
Dec 11 2019, 2:14 AM · Research
Isaac moved T235784: Identify data / questions that we can(not) answer regarding external reuse from Staged to In Progress on the Research board.
Dec 11 2019, 2:14 AM · Research
Isaac moved T239625: Improve quality of external referer data from Staged to In Progress on the Research board.
Dec 11 2019, 2:13 AM · Product-Analytics, Analytics-Kanban, Research, Analytics
Isaac moved T240273: Extract cross-wiki WikiProject tags from Staged to In Progress on the Research board.
Dec 11 2019, 2:13 AM · Scoring-platform-team, Research

Dec 10 2019

Isaac updated the task description for T219903: Keep research.wikipedia.org landing page updated.
Dec 10 2019, 9:42 PM · Research
Isaac updated subscribers of T219903: Keep research.wikipedia.org landing page updated.

@Miriam whenever you put together the wikiworkshop banner for 2020 (e.g., like this: https://research.wikimedia.org/events.html), let me know and I'll update the website! No pressure -- just stumbled across it and realized that we'll want to update it.

Dec 10 2019, 9:42 PM · Research
Isaac closed T240359: Section fragment information stripped from webrequests as Resolved.

Fragments aren't even sent in requests, they are handled entirely client side.

@Pcoombe oh yikes, good point, thanks! I had thought I had verified that it was sent as part of the URL but you're right that it's purely client-side. Well I suppose that resolves why we do not currently track it :)
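
The point about fragments can be checked with the standard library: the fragment is a separate URL component, and the request target a client sends consists of the path and query only. In this sketch the `#History` anchor is an illustrative example and `request_target` is a hypothetical helper, not part of any Wikimedia tooling.

```python
from urllib.parse import urlsplit

def request_target(url):
    """Build the path-plus-query string a client places on the HTTP request
    line. The fragment is deliberately left out because clients keep it local."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target

url = "https://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol#History"
print(urlsplit(url).fragment)  # the part the browser keeps to itself
print(request_target(url))     # what the server can actually see and log
```

Because the server never receives the fragment, nothing downstream of the request (including webrequest logs) can recover which section a reader jumped to.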

Dec 10 2019, 7:12 PM · Analytics, Research
Isaac created T240359: Section fragment information stripped from webrequests.
Dec 10 2019, 4:36 PM · Analytics, Research

Dec 9 2019

Isaac added a comment to T240273: Extract cross-wiki WikiProject tags.

@Halfak thanks for breaking this out as that other task was rapidly growing larger :)

Dec 9 2019, 9:43 PM · Scoring-platform-team, Research
Isaac added a comment to T239625: Improve quality of external referer data.

Hey @lexnasser this is really great! Thanks!!

Dec 9 2019, 9:38 PM · Product-Analytics, Analytics-Kanban, Research, Analytics

Dec 3 2019

Isaac added a comment to T236713: Improve drafttopic training data pipeline.

Thanks @Halfak, this is awesome! I left a bunch of comments with the goal of trimming it down

Dec 3 2019, 4:44 PM · Scoring-platform-team, NewcomerTasks 1.1, Research

Dec 2 2019

Isaac created T239625: Improve quality of external referer data.
Dec 2 2019, 3:41 PM · Product-Analytics, Analytics-Kanban, Research, Analytics

Nov 27 2019

Isaac closed T235443: Report on State of Wikimedia Research of Knowledge Integrity as Resolved.

Presentation went well with lots of questions in particular about the citation usage work.
Slide deck: https://docs.google.com/presentation/d/1etz4ihkP2lu25KJMW61ZF4iyqGOj84iqy-5adMwBnM8/edit?usp=sharing

Nov 27 2019, 8:58 PM · Research
Isaac added a comment to T233646: Article Topic NYU Fall 2019 Capstone Project.
  • Wrapping up work and moving towards writing
  • Fixed an error with the regex used for cleaning wikitext that resulted in removing the end of paragraphs if a citation template was included mid-paragraph
  • Still some final exploration w/ graph embeddings with Hindi Wikipedia
  • Going to use cross-fold validation to get confidence intervals for model performance
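
The students' actual regex is not shown here, but one common way a template-stripping pattern can swallow the rest of a paragraph is greedy matching. The sketch below is an assumption about that class of bug, with made-up sample text, and is not a reconstruction of the fix that was applied.

```python
import re

# Made-up paragraph with one citation mid-paragraph and one at the end.
text = "First sentence.{{cite web|url=x}} Second sentence.{{cite book|title=y}}"

# Greedy: ".*" runs to the LAST "}}", deleting the text between templates.
GREEDY = re.compile(r"\{\{.*\}\}")
# Non-greedy: ".*?" stops at the FIRST "}}" after each "{{".
NON_GREEDY = re.compile(r"\{\{.*?\}\}")

greedy_result = GREEDY.sub("", text)
lazy_result = NON_GREEDY.sub("", text)

print(repr(greedy_result))
print(repr(lazy_result))
```

The non-greedy version keeps the second sentence, though it still mishandles nested templates, which generally need a real wikitext parser.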
Nov 27 2019, 8:08 PM · Research
Isaac added a comment to T238357: Label high volume bot spikes in pageview data as automated traffic.

weblight data will be excluded from the classification entirely; the way it gets to us, it does not have any client IP that we can use. This is true for any other proxy, as our traffic layer for the most part does not forward the client IP, and this is not likely to change in the near term. See: T232795

Thanks for the pointer!

Nov 27 2019, 6:25 PM · Analytics

Nov 26 2019

Isaac added a comment to T238357: Label high volume bot spikes in pageview data as automated traffic.

Hey @Nuria -- I had been doing some of my own research on this as part of some background work around re-use of Wikimedia content. I wanted to throw in a few thoughts in case they're useful (and am largely excited about the proposed spike detection!):

  • +1 to identifying weblight traffic via user-agent string. It's a large proportion of the "None" referers, which clouds that data. I suspect it's mostly search but obviously don't know that.
  • The weblight data got me thinking about bot-like traffic that is really VPNs or other proxies. I took a look at some of these userhashes that have very high numbers of pageviews per hour and have generated a few hypotheses:
    • Some of the userhashes have pageviews that are nearly all for a single project (e.g., en.wikipedia) and/or repeatedly hit the same title (e.g., the userhash behind this: https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Simple_Mail_Transfer_Protocol) -- those feel like they are very likely bots. VPN/proxies though often seem to mix projects (because lots of different users are coming in via the same "device") and have an expected number of visits to Wikipedia's Main Page (~1%), so personally I think a high pageview count but more uniform distribution of projects / titles associated with a single userhash might be good evidence of a VPN/proxy as opposed to bot. I don't have a great recommendation for what that threshold is right now, but would be happy to work with you on it.
    • I haven't looked at device (i.e. desktop vs. mobile) but a mix of devices might be a useful parameter as well for separating out bots from VPNs
    • It looks like Google Translate preserves the user-agent even though the IP seems to maybe be Google servers and not the actual client, so I doubt it would show up in the data but they'd also be simple to exclude via presence of x_analytics_map translationengine.
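
The user-agent check proposed in the first bullet can be sketched as a simple substring test. The `googleweblight` token comes from the user-agent string reported in T195880; the function name and the lowercase normalization are assumptions, and this is not the refinery's actual implementation.

```python
WEBLIGHT_TOKEN = "googleweblight"

def is_weblight(user_agent):
    """Return True if the user-agent string carries the Weblight proxy token."""
    return WEBLIGHT_TOKEN in (user_agent or "").lower()

# User-agent string as quoted in T195880.
ua = (
    "Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) "
    "AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) "
    "Chrome/38.0.1025.166 Mobile Safari/535.19"
)
print(is_weblight(ua))
```

Labeling these requests separately would keep them from inflating the "None" referer bucket without needing any client IP.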
Nov 26 2019, 9:51 PM · Analytics

Nov 22 2019

Isaac added a comment to T195880: % of "none" referers seems too high.

I wanted to add a couple data points / hypotheses to this discussion:

  • Chrome Mobile Version 38 that Nuria mentions as #3 in T195880#4429156 is actually almost all Google Weblight Proxy (per an inspection of the full user-agent string: Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 5 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko; googleweblight) Chrome/38.0.1025.166 Mobile Safari/535.19).
  • Intuitively, we would expect a higher share of "none" referer traffic than other sites due to people setting Wikipedia (or Special:Random) as their homepage.
    • This is further supported by the fact that there is almost no "none" traffic from Chrome Mobile (if you take out Google Weblight Proxy) and Chrome Mobile defaults either to no home page or just a blank "new tab". I don't have an iPhone but my understanding is that mobile Safari loads up the most recent page you visited (which would be Wikipedia a non-trivial amount of time).
  • Approximately 20% of these page views are coming from IP+UA pairs that have over 500 pageviews an hour -- suggesting either bot or shared VPN/proxy.
    • It's hard to judge what proportion of those page views are bots vs. VPNs/proxies, but when you define bots as >90% page views to a single project or a single article receiving >10% of page views, it looks like a quarter to a half of these page views could be VPNs and not bots.
  • Looking at the titles being read, we further see that while the Main Page generally gets less than 1% of page views, it receives closer to 10% of page views from "none" referers (backing up the Wikipedia-as-home-page hypothesis).
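
The bot vs. VPN/proxy heuristic described above could be sketched roughly as follows. The thresholds (over 500 pageviews an hour, >90% of views to a single project, a single article receiving >10% of views) come from the comment; the function name, the labels, and the input shape (a list of `(project, title)` tuples for one IP+UA pair in one hour) are hypothetical.

```python
from collections import Counter


def classify_actor(pageviews, min_views=500, project_share=0.9, title_share=0.1):
    """Label one IP+UA pair from its (project, title) pageviews in one hour."""
    n = len(pageviews)
    if n <= min_views:
        # Not high-volume enough to be part of this analysis.
        return "normal"
    # Count of views to the most-viewed project and most-viewed article.
    top_project = Counter(project for project, _ in pageviews).most_common(1)[0][1]
    top_title = Counter(pageviews).most_common(1)[0][1]
    if top_project / n > project_share or top_title / n > title_share:
        return "likely_bot"
    return "possible_vpn_or_proxy"
```

For example, 600 views of the same article in an hour would come back as `likely_bot`, while 600 views spread over several projects and many distinct titles would come back as `possible_vpn_or_proxy`.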
Nov 22 2019, 9:32 PM · Readers-Web-Backlog (Needs Product Owner Decisions), Analytics

Nov 21 2019

Isaac added a comment to T236835: FRUEC: Debug minor discrepancy in landing page data between old and new pipelines.

I'm not sure if this is at all pertinent, but we spent some time trying to debug a situation where we were missing about 10% of EventLogging data from people taking a survey served via the QuickSurveys tool. In essence, for 10% of readers, the tool was displaying the surveys correctly and we knew they had taken the survey, but we never received the EventLogging events that we should have. We ultimately decided that it was a mixture of adblock (which has settings that allow the javascript etc. to show the survey but block EventLogging) and, in our case, the fact that people could right-click off the page to take the survey, which wasn't triggering EventLogging as expected.

Nov 21 2019, 8:00 PM · Fundraising Sprint X-rays, Fundraising Sprint A Wrinkle in Timezones, Fundraising Sprint Visual Basic Instinct, Fundraising-Backlog
Isaac updated the task description for T230677: Share out results from demographics surveys.
Nov 21 2019, 2:59 PM · Research

Nov 13 2019

Isaac updated the task description for T219903: Keep research.wikipedia.org landing page updated.
Nov 13 2019, 7:23 PM · Research
Isaac added a comment to T219903: Keep research.wikipedia.org landing page updated.

@leila I'm breaking this next update into a few. First is taking care of the smaller things that don't require reorganization. Does this team page look like what you were expecting? I'll also remove WikiCite from the events page and add the eliciting new editors blogpost. Right now we don't actually have a good spot to put the Understanding Thanks blogpost and something about Doris' Outreachy project, so I'm going to continue to think on that.

Nov 13 2019, 2:42 PM · Research

Nov 11 2019

Isaac added a comment to T212258: Create test Kerberos identities/accounts for some selected users in hadoop test cluster.

Not sure if we can do it in Jupyterhub, but probably we'll be able to add something to the MOTD of the stat/notebook hosts, so when people ssh they'll get instructions about what to do for kerberos, where to find docs, etc. Nice suggestion, thanks!

Nov 11 2019, 9:20 PM · User-Elukey, Analytics-Kanban, Analytics
Isaac added a comment to T233646: Article Topic NYU Fall 2019 Capstone Project.
  • Work on this project will come to a close on December 2nd (and the students will switch to writing)
  • TFIDF-based "attention" scores work almost as well as learned attention scores but are slower in training right now. Looking into the issue -- might be a function of the TFIDF scores not being normalized to 1 for a given article.
  • For any given language, the hope is to have model performance for the following experimental setups:
    • Trained purely on that language. Aligned fastText embeddings. Randomly initialized model weights.
    • Model trained on English with all weights frozen (no fine-tuning). Aligned fastText embeddings.
    • Model trained on English with final layer fine-tuned to new language. Aligned fastText embeddings.
    • Model trained on mixture of examples from different languages. Aligned fastText embeddings. No language-specific fine-tuning.
  • For the transfer learning problem (fine-tuning a general model to identify a specific wikiproject), examined a more difficult negative sample (positive = Human rights; negative = Politics articles) and still found quite high F1 (>0.9). Going to look into one-shot learning techniques and embedding cosine-distance as an even simpler approach.
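
As a rough illustration of two ideas mentioned above, the sketch below pools word embeddings with TFIDF weights that are normalized to sum to 1 for an article, and computes the cosine distance between two resulting vectors. It is not the students' implementation: the function names and the plain-list inputs are assumptions, and reading "normalized to 1" as "sums to 1" is one possible interpretation.

```python
import math


def tfidf_attention_pool(embeddings, tfidf_scores):
    """Weighted average of word vectors using TFIDF scores as attention.

    The scores are rescaled so the weights for one article sum to 1.
    """
    total = sum(tfidf_scores)
    if total <= 0:
        raise ValueError("TFIDF scores must have a positive sum")
    weights = [score / total for score in tfidf_scores]
    dim = len(embeddings[0])
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, embeddings))
        for i in range(dim)
    ]
    return pooled, weights


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

With this in place, a simple baseline for the wikiproject task would be to compare an article's pooled vector against the pooled vectors of a few known positive examples and rank by cosine distance.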
Nov 11 2019, 9:16 PM · Research

Nov 7 2019

Isaac added a comment to T212258: Create test Kerberos identities/accounts for some selected users in hadoop test cluster.

The option that is currently available is a keytab

Ok, that works for me. I'll avoid it but it's good to know it's an option if needed.

Nov 7 2019, 6:09 PM · User-Elukey, Analytics-Kanban, Analytics