User Details
- User Since: Oct 1 2018, 2:19 PM (243 w, 4 d)
- Availability: Available
- IRC Nick: isaacj
- LDAP User: Isaac Johnson
- MediaWiki User: Isaac (WMF) [ Global Accounts ]
Yesterday
Weekly updates:
- Working on iterative feedback sessions for the Human Rights Impact checklist. TODOs include helping craft a few examples and potentially doing a pilot implementation with some of Diego's models
- Put together some patches for the plugin to help improve the quality of internal Wikipedia search results in anticipation of testing how well it does. Early indications are that it does just fine, as ChatGPT generally passes a standard list of keywords rather than the raw user question
Tue, May 30
Fri, May 26
Weekly updates:
- Participated in Hackathon and processing outcomes from that!
- Put in a few patches for the wikigpt plugin to improve logging so we can better analyze the quality of the different search options
- Provided feedback on AI Human Rights checklist and signed up to share out with team in two weeks
Tue, May 23
Thanks @NicoleLBee for adding the notes!
isaacj down from 146G -> 39G. Thanks for the nudge!
Fri, May 19
Fri, May 12
Weekly updates:
- Participated in AI + Wikimedia panel at WikiWorkshop
- Figured out issue with Hackathon demo (cloud vps configuration) and so that is working now (current endpoint: https://wikitech-search.wmcloud.org/docs)! Working on putting together learnings now for the session.
No updates -- hackathon next week and then will return attention here
Still no updates given prep for WikiWorkshop/hackathon, but after next week I'm hoping to get back to this!
Mon, May 8
How about we set the limit to about 5GB? A little more than twice the RAM limit.
@rook If that would still solve your issue, that sounds great to me and unlikely to cause new problems! thanks!
This makes sense to me overall, with a few thoughts about how to reduce frustration on the user side. I took a look at mine and I was at ~5GB (sorry), so I removed a few larger data files that could easily be re-downloaded if needed again, but found that almost 4GB of this was actually pip cache that I wasn't even aware of. Assuming this is not just me, a few thoughts:
- The pip cache is pretty hidden -- I first ran $ du -hs * and was confused because nothing stood out as particularly large. I only then looked at the cache because I'd been doing some machine learning work and knew that the model files were stored there, so I assumed that was the issue (which luckily overlapped with the pip cache issue). All to say, you have to look for it explicitly to find it, so folks who do a lot of Python work might find themselves inadvertently filling up their quota without understanding why.
- Regarding my second guess about the issue -- HuggingFace machine learning models -- it's true that PAWS has limited compute, but it does have enough to make it a really nice place for showcasing how to use ML models (current example) for Wikimedia content. Might it make more sense to set the storage limit to at least match the available RAM?
- Personally, part of the challenge for me is also that I'm a long-time user of PAWS, so I have built up a number of notebooks with small data components that together end up taking up a fair bit of space even though each one is well within expected PAWS usage.
- Potential compromises:
- Is it possible to move some of this pip cache off the individual hosts, or is that a headache / not a good security idea for some reason? If not, and there's some sort of warning message that folks would get, it would be useful for it to point them to documentation/tips, including things like running $ du -hs ~/* , doing $ pip cache purge , and checking the size of hidden folders (a small sketch of such a check follows this list).
- Will there be a way to request extra storage (as with Toolforge / Cloud VPS)? That would honestly solve most of my personal concerns because it sounds like from the statistics (thank you), there aren't too many of us who will be impacted.
- Could there be a temporary space within a working session that has a much higher limit but is automatically deleted at the end of the session? That way it could be used for larger files such as HuggingFace models or pip libraries that are needed within a session and fit normal use expectations, but can be safely deleted and downloaded fresh in a future session -- perhaps this is by default the ~/.cache folder?
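Since the pip cache only shows up if you look at hidden folders, here's a minimal sketch (not an official PAWS tool, just an illustration) of reporting the largest top-level directories in the home directory, hidden ones included:
```
import os
from pathlib import Path

def dir_size(path: Path) -> int:
    """Total size in bytes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda e: None):
        for name in files:
            fp = os.path.join(root, name)
            try:
                if not os.path.islink(fp):
                    total += os.path.getsize(fp)
            except OSError:
                pass
    return total

home = Path.home()
sizes = sorted(((dir_size(p), p.name) for p in home.iterdir() if p.is_dir()), reverse=True)
for size, name in sizes[:10]:
    # ~/.cache (pip, HuggingFace, etc.) tends to show up near the top here
    print(f"{size / 1e9:6.2f} GB  {name}")
```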
Fri, May 5
Just collecting some of our thoughts / intentions here for those who are interested:
- Goal will be to demo a ML-backed tool for doing natural-language search of Wikitech documentation. You can see a simple demo here of the process start-to-finish on PAWS though the goal will be to host it as a webapp so folks can actually use it: https://public-paws.wmcloud.org/User:Isaac_(WMF)/hackathon-2023/wikitech-natural-language-search.ipynb
- We'll share some of our learnings along the way about choosing models, adhering to open-source, challenges with working with some common libraries, etc.
- Based on what the group of assembled folks is interested in, we can primarily do Q&A or some live coding / experimenting etc.
- If folks have requests prior to the session, feel free to let us know though no promises that we'll be able to address them.
Weekly updates:
- Reviewing first draft of Human Rights checklist
- Reviewed some of the enwiki edits from the Android pilot and all were looking reasonable.
- Continued work to pull together best practices / tips around hosting ML on cloud services.
No updates this week
May 1 2023
Apr 28 2023
Weekly updates:
- Moving forward with Human Rights approach -- waiting to hear on next steps with them.
- Android pilot seems to be going well -- been monitoring VPS instance to make sure it stays up and will do some evaluation of the edits to get familiar with any issues that are popping up in usage.
- Started working on a session for the hackathon. Initial focus is on something like WikiGPT but for Wikitech Help namespaces, both as a potentially useful tool for developers there and also to showcase what's possible with open-source tech. Example: https://public-paws.wmcloud.org/User:Isaac_(WMF)/hackathon-2023/wikitech-natural-language-search.ipynb
No updates this week.
@kostajh see below:
Apr 27 2023
Checking in on the status of this issue. @Mayakp.wiki detected a large spike in pageviews that were being tagged as automated but look pretty clearly like human traffic (see T310846#8809323). The cause seems to be that Chrome's implementation of the more generic user-agent is finally rolling out in a substantial way (timeline) and so is breaking at least the bot detection pipelines in pretty significant ways. It seems the UA hints were dropped as it wasn't clear that we should be using them or that they would be of much benefit. Likely worth revisiting this conversation or considering alternatives though.
Apr 25 2023
Thanks for response and additional engagement. I don't expect the conclusion to change but some additional context / thoughts:
Apr 17 2023
Thanks for opening this ticket! I've added this to the agenda for next week's team meeting for consideration.
Thanks @nskaggs -- don't hesitate to let me know if any additional details would be useful for folks to know. FYI I'll be out the latter half of this week so if you have any clarifying questions, I might not get back to you until next week.
Apr 14 2023
Weekly updates:
- Worked with Leila to generate some remaining questions for Human Rights folks about how that policy might be used to support the ethical ML space. They're out this week but hopefully responses next week that allow us to move forward.
A few starting plots as I consider the different ways to analyze/showcase the data. I think a lot of this will eventually be more useful when we have more focused questions to ask of it -- e.g., impact of a particular tool on edit types or specific use-cases to consider such as how often do new editors add a new sentence (as is being considered by the Editing team). The plots below are all based on edit data from the main article namespace in French Wikipedia in January 2023 (minus bot edits). Only the edit category chart contains reverts/reverted edits -- they are filtered out for all other charts. I have a TODO to make it easy to understand how the categories below are constructed, but for now, this code contains some details and interfaces were determined via edit tags.
Apr 12 2023
Apr 11 2023
From discussion with Lydia/Diego:
- The concept of completeness feels closer to what we want than quality -- i.e., allowing for more nuance in how many statements are associated with a given item. We came up with a few ideas for how to make assessing item completeness easier (because otherwise it would require very extensive knowledge of a domain area to know how many statements should be associated with an item): I suggested providing both the completeness score and the quality score and asking the evaluator which was more appropriate, but I like Lydia's idea better, which was to just provide the completeness score and ask the evaluator whether they felt the actual score was lower, the same, or higher.
- Putting together a dataset like this would be fairly straightforward -- the main challenge is having a nice stratified dataset and one that provides information on top of the original quality-oriented dataset. For example, for highly-extensive items, both models tend to agree that the item is A-class so collecting a lot more annotations won't tell us much. It's only for the shorter items where we begin to see discrepancies and so that's where we should probably focus our efforts. Plus because the model is very specific to the instance-of/occupation properties, we should make sure to have a diversity of items by those properties. This is my main TODO.
- I read through the paper describing the new proposed Wikidata Property Suggester approach. My understanding of the existing item-completeness/recommender systems:
- Existing Wikidata Property Suggester: make recommendations for properties to add based on statistics on co-occurrence of properties. Ignores values of these properties except for instance-of/subclass-of where the statistics are based on the value. Recommendations are ranked by probability of co-occurrence.
- Recoin: similar to above but only uses instance-of property for determining missing properties and adds in refinement of which occupation the item has if it's a human.
- Proposed Wikidata Property Suggester: more advanced system for finding likely co-occurring properties based on more fine-grained association rules -- i.e., it doesn't just merge all the individual "if Property A -> Property B k% of the time" rules but instead does things like "if Property A and Property B and ... -> Property N k% of the time". Also takes into account instance-of/subclass-of property values like the existing suggester. This seems like a pretty reasonable enhancement and their approach is quite lightweight (~1.5GB RAM for holding the data structure).
- I am following the Recoin approach in my model, though if the new Property Suggester proves successful and provides the data needed to incorporate into the model (a list of likely missing properties + confidence scores), it would be very reasonable to swap it in in place of the Recoin model at a later point, which would also solve some of the problems that @diego was considering addressing via Wikidata embeddings (more nuanced recommendations of missing properties). A toy sketch of the basic property co-occurrence idea these systems build on follows this list.
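As promised, a toy sketch of the co-occurrence statistic that these suggesters build on -- estimating P(property B present | property A present) from a set of items, each represented as its set of property IDs. Purely illustrative; this is not the actual Property Suggester or Recoin code.
```
from collections import Counter
from itertools import permutations

def cooccurrence_scores(items: list[set[str]]) -> dict[tuple[str, str], float]:
    """Estimate P(B | A) for every ordered pair of properties seen in the data."""
    single = Counter()
    pair = Counter()
    for props in items:
        single.update(props)
        pair.update(permutations(props, 2))  # all ordered (A, B) pairs in the item
    return {(a, b): pair[(a, b)] / single[a] for (a, b) in pair}

# Tiny made-up example: three "human" items with a few property IDs each.
items = [
    {"P31", "P21", "P569"},   # instance-of, gender, date of birth
    {"P31", "P21"},
    {"P31", "P569", "P570"},  # also has date of death
]
scores = cooccurrence_scores(items)
print(round(scores[("P21", "P569")], 2))  # P(date of birth | gender) = 0.5 in this toy data
```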
Ahh drat forgot to follow up but thanks @nshahquinn-wmf for the archiving and @mpopov for not letting this continue to remain unclear!
Can you say more about this? IIUC, these are different kinds of links, yes? The page and image links are similar as @TheresNoTime says, since they are both internal hyperlinks. Is a link to a category or a template kind of the same, or are those very different?
@Ottomata fair question and I'll try to better explain myself: in theory, "links" cover a lot of interconnections between pages where changes might be useful to know about for an end-user. There are lots of ways to categorize them (intrawiki vs. interwiki vs. external; what syntax is used for creating them; how they're stored in MediaWiki; how they're used; etc.). Given that this link stream question depends on MediaWiki code, I'll do my best to categorize them according to a mixture of what they do and how they're indexed on the backend. Apologies if I get any details wrong or miss some in trying to do this quickly:
Apr 7 2023
Q: Will it be useful to have the 'prior state' of predicted_classifications in this event?
This is very tempting but I don't personally have a super strong use-case for it and it feels reasonably expensive to get right. A few thoughts:
- The best use-case I can think of for it is in being kinder when updating our Search indices -- e.g., for every revision, we compute the article topics and only if they're different from the previous topics do we send an update to the Search index (a rough sketch of this idea follows this list). This would greatly reduce the updates to Search as most edits won't change an article substantially enough to change its topic. The tricky thing is that the topic model uses an article's links via the pagelinks table, so we don't currently have a way of getting a prediction for a past revision. For this to be feasible, I assume we'd need some cache of prior predictions? This is an extreme case but in general, it's not always a perfect assumption that the current model prediction for an old revision will be the same as the then-current model prediction for that revision, and that could cause issues depending on how we source the prior prediction.
- For other use cases where we're just interested in triggering some behavior based on substantive changes to the article content as proxied by e.g., a large change in quality, my assumption is that we probably should instead focus on getting a stream enrichment that does edit types (diffs) and use that more directly. For example, if we want to flag when an article's quality decreases by a certain quantity, we're probably actually interested in edits that are removing certain types of content and we should just detect that directly with the edit types. The nice thing about the edit types library is that it would just be a direct enrichment and not a LiftWing call, so once there's a stream with previous+current wikitext in it, it's just a processing of those two strings with no additional API calls (or stream with current wikitext and we have the API call to get the previous wikitext).
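A rough sketch of the "only update Search when the prediction changed" idea from the first bullet. The predict/update callables and the cache here are hypothetical stand-ins, not an existing API:
```
from typing import Callable

def maybe_update_search_index(
    page_id: int,
    rev_id: int,
    predict_topics: Callable[[int], set[str]],            # hypothetical model call
    send_search_update: Callable[[int, set[str]], None],  # hypothetical index updater
    prior_predictions: dict[int, set[str]],               # hypothetical cache of prior topics
) -> bool:
    """Return True if a Search update was actually sent."""
    topics = predict_topics(rev_id)
    if prior_predictions.get(page_id) == topics:
        return False  # topics unchanged -> skip the (expensive) index update
    prior_predictions[page_id] = topics
    send_search_update(page_id, topics)
    return True
```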
Apr 6 2023
Resolving as we have now gotten to the state where I can do more large-scale analyses (many wikis across a month time period)
- The main improvements came first via Muniza (pyspark config for handling the computation) and then via Fabian -- the mwparserfromhell patch and optimization of my pipeline for computing edit types. Fabian's fix was merged into mwparserfromhell too, which is nice confirmation and a valuable contribution to the larger community (though they haven't made a new release in about a year so it might be some time before it's the default).
- For instance, non-bot edits from a month of French Wikipedia are aggregated below, where you can see (if you remove reverted edits) that IP editors are more likely than other groups to be doing small maintenance edits and new editors (1-10 edits) are more likely than others to do content generation. The edit size plot (reverts excluded) backs this up. This can now be expanded to other wikis, and I also have data on edit difficulty and whether it changed text:
@TheresNoTime thanks for explaining. I think I still lean towards separate streams all things equal then but ultimately I'm fine with whatever is decided so long as it enables your use case.
What do you think?
Hmm...what's the use-case for having wikilinks to articles and images in the same stream? On one hand, assuming the stream specifies the link namespace explicitly, it simplifies things to only have one place to check for link changes. On the other hand, it could force folks to filter a lot of events just to get to the ones that interest them and opens the door to other questions like whether the intent is to also include templatelinks, categorylinks, etc. As a potential end-user, my gut feeling is to keep them separate like the MediaWiki tables because personally I'm not generally working with models that use both links and images (and if I am, I'd likely prefer to just listen for the more generic page-change events because I'm probably watching for a lot of other things like references that aren't link-specific). Curious to hear other perspectives though.
Apr 5 2023
Resolving this task -- at this point we have a clearer picture of where the Tech dept is going with this:
- ML platform will continue to lead the way on internal ML models
- I'm working on our guidance for 3rd-party ML and keeping that aligned with our internally-hosted ML expectations. This is currently taking several forms:
- Engagement with community via hackathon and presumably other venues
- Collecting teams' experiences with 3rd-party ML to inform our strategy moving forward
- Discussions with Human Rights team about alignment between ethical ML and human rights policy
- Continuing to pilot more advanced ML tools like machine-assisted article descriptions to get a sense of what guardrails are useful, etc.
Apr 3 2023
Anyhow, they can also be merged "on the client side" later.
I think I would lean towards this. I like the simplicity of separate streams and, in Diego's example, I think it might be nice to not have the multilingual model (which, if I remember correctly, is higher latency) be a blocker for the language-agnostic prediction stream?
Mar 30 2023
Thanks! Indeed many models run pretty slow on CPUs but should be good enough for prototypes and we wouldn't be doing any training of models on Cloud VPS, so that bottleneck is not so awful.
Is that still the case?
Thanks for checking @mpopov! The relevant context is T239876, which had to do with specific fixes that might be needed for some errors in the notebook. My sense is that if someone wanted to revive it, a lot would change given the switches to Airflow, updates to wmfdata, etc., so there's no reason to try to do that work unless someone picks up the larger project again.
Mar 28 2023
@kostajh a few of us were also thinking about a session focused on some of the technical aspects of running an LLM on Toolforge/Cloud VPS infrastructure / (hopefully) demoing some LLMs that we'd set up in advance. I haven't submitted the session yet but do you have thoughts on whether to combine efforts here or keep them separate?
Mar 24 2023
Continued work with Fabian and having promising outcomes! Working now on frwiki after Fabian split the job into separate stages, and making progress on enwiki. I'll be able to start analyzing the results now and scale up this sort of analysis: https://public-paws.wmcloud.org/User:Isaac_(WMF)/Edit%20Diffs/Example_Edit_Analysis_French.ipynb
Weekly update:
- Presented at https://schedule.mozillafestival.org/session/SDTAZJ-1
- Met with Santhosh to discuss their learnings from Content Translation. A lot of really good points came out of that, which I'm using to update my learnings doc, and he'll share additional background on the tool with me.
- Also updating doc based on outcomes from MachineVision project: community feedback and response.
- I will be submitting a Hackathon proposal around playing with Cloud VPS-hosted AI models. ML Platform will be submitting a separate one to demo WikiGPT but hopefully this should keep the number of demos from WMF staff to a minimum.
Updated API to be slightly more robust to instance-of-only edge cases and provide the individual features. Output for https://wikidata-quality.wmcloud.org/api/item-scores?qid=Q67559155:
{ "item": "https://www.wikidata.org/wiki/Q67559155", "features": { "ref-completeness": 0.9055531797461024, "claim-completeness": 0.903502532415779, "label-desc-completeness": 1.0, "num-claims": 11 }, "predicted-completeness": "A", "predicted-quality": "C" }
Details on each field (an example API call is sketched after this list):
- ref-completeness: what proportion of expected references does the item have? References that are internal to Wikimedia are only given half-credit while external links / identifiers are given full credit. Based on what proportion of claims for a given property typically have references on Wikidata. Also takes into account missing statements.
- claim-completeness: what proportion of the expected claims does the item have. Data taken from Recoin where less common properties for a given instance-of are weighted less.
- label-desc-completeness: what proportion of expected labels/descriptions are present. Right now the expected labels/descriptions are English plus any language for which the item has a sitelink.
- num-claims: actually the total number of properties the item has, so it's a misnomer and something I'll fix at some point (I don't give more credit for, e.g., having 3 authors instead of 1 author for a scientific paper)
- predicted-completeness: E (worst) to A (best) (see guidelines), using just the proportional *-completeness features.
- predicted-quality: same classes but now also includes the more generic num-claims feature too.
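For reference, a minimal sketch of calling the item-scores endpoint above using the requests library (field names are taken from the sample output; this is just an illustration, not a supported client):
```
import requests

resp = requests.get(
    "https://wikidata-quality.wmcloud.org/api/item-scores",
    params={"qid": "Q67559155"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()

# Summary classes plus the individual completeness features
print(data["predicted-completeness"], data["predicted-quality"])
for name, value in data["features"].items():
    print(f"  {name}: {value}")
```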
Mar 17 2023
Ok, I verified that I am using the patched mwparserfromhell library on the cluster, but the French Wikipedia run still fails with messages like:
23/03/17 16:36:50 WARN TaskSetManager: Lost task 73.0 in stage 8.0 (TID 16755) (an-worker1116.eqiad.wmnet executor 87): ExecutorLostFailure (executor 87 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding physical memory limits. 20.2 GB of 20 GB physical memory used. Consider boosting spark.executor.memoryOverhead.
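For context, the knob the YARN message points at is set on the Spark session; a minimal PySpark sketch follows, with values that are purely illustrative rather than the actual cluster settings used:
```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("edit-types-frwiki")                    # hypothetical job name
    .config("spark.executor.memory", "16g")
    .config("spark.executor.memoryOverhead", "8g")   # raise overhead beyond the default (~10% of executor memory)
    .getOrCreate()
)
```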
Weekly update:
- Working to organize session proposals for Hackathon and understand what sorts of models are feasible to self-host on Cloud VPS in preparation
- Continuing to reach out to teams to set up meetings to hear about their experiences with 3rd party ML
- Preparing to be on: https://schedule.mozillafestival.org/session/SDTAZJ-1
I still need to do some checks because I know, e.g., this fails when the item lacks statements, but I put together an API for testing the model. It has two outputs: a quality class (E worst to A best) that uses the number of claims on the item as a feature (along with labels/refs/claims completeness) and corresponds very closely to the ORES model outputs and the annotated data; and a completeness class (same set of labels) that does not include the number of claims as a feature and so is more a measure of how complete an item is (a la the Recoin approach).
Mar 15 2023
See T312642 for a similar request from last year for the internal hackathon
Mar 14 2023
Mar 13 2023
Good point. Starting with predicted_ might be a good idea, so there are predicted_classification, predicted_embeddings and predicted_recommendations.
Makes sense to me -- the way I see it, the predicted_ fields could be the dependable/required fields for any downstream applications, whereas a field like probabilities might be a bit less standardized and aimed more at research/debugging -- e.g., for a topic model with 64 classes, it seems fine to include all probabilities. If a model had 1000 classes though, it maybe doesn't make so much sense to include them all (a toy sketch of trimming to the top-k probabilities follows).
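As a toy illustration of that trade-off (a purely hypothetical helper, not part of any schema under discussion): keep the full map for small label sets and trim to the top-k entries for very large ones before emitting the event.
```
def trim_probabilities(
    probabilities: dict[str, float],
    max_labels: int = 100,
    top_k: int = 10,
) -> dict[str, float]:
    """Return all probabilities for small label sets, else only the top-k by score."""
    if len(probabilities) <= max_labels:
        return probabilities
    top = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(top)
```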
Mar 10 2023
Weekly update:
- To the above, I found out about the MediaModeration Extension, which checks images on Wikimedia Commons against the industry-supported PhotoDNA tool to detect extreme content. I also was reminded of an experiment using a tool called Rosette to do named-entity recognition -- i.e. text -> Wikidata item -- for figuring out new ways to connect Wikimedia content.
- Starting to reach out to teams and Legal to understand what a process for providing guidance around these services would look like and what to be aware of.
- Released our summary of the results of the internal ethical salons but will continue to gather feedback from staff about their questions / interests and also focus on learning what community members are thinking (there's a call set up by Partnerships with the community in late March that I'll join)
Weekly updates:
- Fuller edittypes calculation (content generation vs. maintenance etc.) failed with memory errors for a month of French Wikipedia, so that does make the scaling more difficult.
- As part of some explorations into edit summary generation, I put together a dataset of edit summaries from English Wikipedia in 2022. Trying to decide now how edit types might help us filter those summaries down to ones that would most benefit from automated edit summaries -- e.g., only ones where words/sentences/paragraphs were changed -- and how the library could potentially also help us structure the input "diff" to a language model for analysis -- e.g., aligning the before/after content as a prompt to the model. I also finished the qualitative coding of 45 edit summaries which gave the following results. Generally summaries are mostly complete and correct but they tend to focus on summarizing the edit over explaining why it was done and some of these high statistics are due to well-formulated bot edit summaries. The issues tend to come with the empty edit summaries which are still not uncommon. Edits sometimes include multiple distinct changes -- e.g., fixing typos and adding a template -- which further complicates the ability of an edit summary to capture what happened.
Weekly updates:
- Discussed with Diego the challenge of whether our annotated data is really assessing what we want it to. I'll try to join the next meeting with Lydia to hear more and figure out our options.
- Diego is also considering how embeddings might help with better missing-property / out-of-date-property / quality predictions for Wikidata subgraphs where we have a lot more data and the sorts of properties you might expect vary at finer-grained levels than just instance-of/occupation. For example, instances where country of citizenship or age might further mediate what claims you'd expect. This could also be useful for fine-grained similarity to, e.g., identify similar Wikidata items to use as examples or also improve.
I like the set of languages/scripts you already have for evaluation. I know you're already aware that it will fail for Thai given the lack of explicit punctuation there. A few suggested inclusions:
Mar 7 2023
is score the best name for this field? Is that a generally used term for ML predictions?
I don't remember the logic behind score other than that's what Aaron had always used -- i.e., it used to be the Scoring Platform team before ML Platform. prediction is probably the more general term but then that feels confusingly redundant with the nested prediction field. Maybe something like model_output as the top-level name to allow for different types of outputs? And then it seems we're using prediction to be the summary of the model outputs (which makes sense generally) and probability to be the full set of outputs with their associated confidence scores. In my comment above though, I suggested a few ways to abuse the probability field in ways that don't really have probabilities (ranked results; embedding vectors), so if we go that direction, I'm wondering if something more generic like details is the only consistent umbrella term? If that feels too generic, then maybe it just makes sense to have three separate schemas (one for classification models, one for embeddings, one for recommendation models)?
Do we like this scores field? We have the opportunity to do whatever we want here, so let's take some time to brainstorm and bikeshed on what would be best, so we can use it in all the various ML use cases coming up.
Q: Would it be possible to use the same event field data model for things like image-suggestions?
So I can think of a few types of models in terms of output types:
- Classification models (topic, revert, quality, etc.) -- all of these are essentially some sort of class and associated [0-1] probability which seems well-supported.
- Recommendation models (add-a-link; add-an-image) -- currently my understanding of these models is that they are run in batch and two types of data are produced: tags indicating if an article has a recommendation (that could easily be supported with this schema as a has-rec score with probability of 1) and then the specific recommendations themselves are stored in a Mediawiki table (example schema: T267329). These recs are a lot more complicated. The add-a-link example has a bunch of context fields (see below). Supporting these I assume would require re-introducing arbitrary maps? Or maybe recommendation outputs just would require a second schema (which makes a lot of sense to me because of how different they are from classification models).
* phrase_to_link (text)
* context_before (text -- 5 characters of text that occur before the phrase to link)
* context_after (text -- 5 characters of text that occur after the phrase to link)
* link_target (text)
* instance_occurrence (integer) -- number showing how many times the phrase to link appears in the wikitext before we arrive at the one to link
* probability (boolean)
* insertion_order (integer) -- order in which to insert the link on the page (e.g. recommendation "foo" [0] comes before recommendation "bar baz" [1], which comes before recommendation "bar" [2], etc)
- A simpler subset of these recommendation models would be ones that require less context -- e.g., we have a model that generates potential descriptions to be added to Wikidata. For that, the output would just be a ranked list of text -- e.g., 1: Article description, 2: Description of an article, 3: Another description. The current schema presumably could be hacked to make that work: top-ranked description in the prediction and then each description as a key in the probability part (or maybe the value, with the key being the rank?).
- Embeddings -- we currently train embeddings as part of some of the classification models, but it's very reasonable to think that at some point we might want to have a model that just outputs embeddings for articles etc. every time they're edited so other tools could make use of them without having to train their own. The output format for that then is an n-dimensional vector of floats. In the current schema, it wouldn't really have a prediction, but the probability field could probably be repurposed to have the key be the vector index (0, 1, 2, ..., n) and the value be the embedding value for that index. For my own work, I use 50-dimensional embeddings for space reasons but it's not uncommon to see embeddings on the order of 1000 dimensions, especially for media like images. Rough payload sketches for these different output shapes follow this list.
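Rough sketches of what those output shapes might look like as payload fragments. These are illustrative only -- not the actual schemas under discussion -- and the recommendation fields just echo the add-a-link example above:
```
# Classification model: a summary prediction plus per-class probabilities.
classification_output = {
    "prediction": ["biography"],
    "probability": {"biography": 0.93, "sports": 0.04, "politics": 0.03},
}

# Recommendation model (add-a-link style): a list of structured recs,
# which doesn't fit a flat class -> probability map.
recommendation_output = {
    "recommendations": [
        {
            "phrase_to_link": "bar baz",
            "context_before": "the f",
            "context_after": " and ",
            "link_target": "Bar_baz",
            "instance_occurrence": 1,
            "insertion_order": 0,
        }
    ]
}

# Embedding model: no single prediction, just an n-dimensional float vector.
embedding_output = {
    "embedding": [0.12, -0.03, 0.58, 0.91],  # e.g., 50-dim or larger in practice
}
```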
Mar 3 2023
Weekly updates:
- I accidentally overwrote some partitions so I'll have to redo some of the edit-type calculations. That's okay though because I wanted to scale up to not just the simple edit types but the more interesting categories (maintenance vs. annotation vs. generation, small/medium/large, etc.)
- I started working on qualitatively evaluating existing edit summaries on Wikipedia to assess what it would take to auto-generate them and get a better sense of how the community currently uses them. Current status: https://docs.google.com/spreadsheets/d/1acuXczi9jS2WKNWGXiwi-cLqaWM5J6kGqhHgY-Osh4Q/edit?usp=sharing
Weekly update: Continued going through notes from salons and brief report-out to Tech Dept. I started collecting examples of 3rd-party ML model services used by WMF to help understand the current landscape for that. What I've got so far:
- Machine Translation
- The Content Translation extension hosts several open-source models but also allows users to choose between several external machine translation APIs.
- Text-to-speech
- The Community Tech team is exploring different text-to-speech approaches for IPA rendering – i.e. pronouncing words.
- Machine Vision (image annotation)
- The Machine Vision extension makes it possible to use external APIs such as Google's Cloud Vision API to identify potential depicts statements for images uploaded to Wikimedia Commons.
- OCR (image-to-text)
- Wikisource makes use of Google's OCR API, especially for languages which are not currently supported by the otherwise standard open-source Tesseract models.
- Plagiarism detection
- Earwig's copyvio tool makes use of Turnitin's API for detecting plagiarism between passages added to Wikipedia and external documents.
I slightly tweaked the model but also experimented with adding just a simple square root of the number of existing claims as a feature, and found that that is essentially all that is needed to almost match ORES (which is near perfect) at predicting item quality. That said, I think this is mainly an issue with the assessment data as opposed to Wikidata quality really just being about the number of statements. For example, the dataset has many Wikidata items that are for disambiguation pages and they're almost all rated E-class (lowest) because their only property is their instance-of. I'd argue though that that's perfectly acceptable for almost all disambiguation pages and these items are nearly complete even with just that one property (you can see the frequency of other properties that occur for these pages but they're pretty low: https://recoin.toolforge.org/getbyclassid.php?subject=Q4167410&n=200). So while the number of claims is a useful feature for matching human perception of quality, I think we'd actually want to leave it out to get closer to the concept of "to what degree is an item missing major information". By that measure most disambiguation pages would do just fine, but human items that have many more statements (but also a much higher expectation) wouldn't do as well. A small sketch of the two feature sets being compared follows.
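To make the distinction concrete, a small sketch of the two feature sets being compared. Feature names come from the item-scores API output above; the sqrt term is the simple size signal mentioned, and this is not the actual model code:
```
import math

def completeness_features(scores: dict) -> list[float]:
    """Features for the 'completeness' flavor: proportional features only, no size signal."""
    return [
        scores["ref-completeness"],
        scores["claim-completeness"],
        scores["label-desc-completeness"],
    ]

def quality_features(scores: dict) -> list[float]:
    """Same features plus sqrt(number of claims), which by itself nearly matches ORES quality."""
    return completeness_features(scores) + [math.sqrt(scores["num-claims"])]
```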
Mar 1 2023
Chiming in late but in support of not using the ORES topics for this particular use-case. The ORES topics are most useful as search/list filters because they're standardized (easy to design into interfaces) and my new model makes them available for all Wikipedia languages so we can offer the same functionality to all. They are quite broad as Alex pointed out, but when just being used to narrow down lists of articles that editors are reviewing, that broadness is not such a big deal because editors can easily skip over content that's not relevant to them. Ideally they're also used in an interface that easily lets editors select/unselect them so they can explore the other topic areas (as opposed to stating them at the start and then not easily being able to change them).
Feb 24 2023
Weekly updates:
- Transferring the approach from Muniza got me through English and then the next set of 6 biggest wikis for calculating an entire month's worth of edits (excluding bots and reverts/reverted). Doing the remainder threw an error though, so I probably need to cut that in half and try again.
- Next step will be adding to the code so it doesn't just do the simple summary (e.g., 3 templates inserted etc.) but also gathers more details as necessary to map those changes to the broader categories I care about at scale -- e.g., content generation vs. maintenance. That might pose additional scaling challenges but I will cross that bridge when I reach it.
Weekly update:
- Fourth session wrapped up including this retroboard of outstanding questions/thoughts that folks have
- Working now on summarizing feedback from the four sessions with HT. Current synopsis:
- Lots of questions / interest from staff.
- Concern over this being done in line with our values, especially around knowledge equity (avoiding digital colonialism), transparency / avoiding exploitation, producing verifiable knowledge, and doing this work in close consultation with the broader editor community. We need to make sure we move carefully in this space despite the obvious excitement too (a lot of "on one hand... on the other hand...")
- Clear benefits to trying to craft policy/guidance around how teams should consider using third-party AI-backed tools within their products
Feb 23 2023
oh wow - thanks @rook!
I appreciate the transition period for updating links. I was curious about how many links would be affected and so ran the query (at least for Meta where I expected the most links to exist on-wiki; maybe worth running a similar query for others?) and migrating existing links seems pretty doable as most are on archive pages and so can presumably be ignored: https://quarry.wmcloud.org/query/71641
Is it ok if I arrange a meeting (maybe next week?), including the ML team, Isaac and Ottomata to discuss the source event and output score data model for the outlink topic stream?
Agreed - thanks @achou!
Feb 20 2023
now tracked under T328264
now tracked under T328264
now tracked under T328264
Closing -- now tracked under T328260