
[Session] Automating categories and articles on small wikis
Closed, Resolved, Public

Description

(Please set yourself as task assignee of this session)

  • Title of session: Automating categories and articles on small wikis
  • Session description: Small presentation about how we automated category and article creation on arywiki, and a discussion about how to improve it and expand it to other small wikis
  • Username for contact: Ideophagous
  • Session duration (25 or 50 min): 25 minutes
  • Session type (presentation, workshop, discussion, etc.): presentation + discussion
  • Language of session (English, Arabic, etc.): English
  • Prerequisites (some Python, etc.): general Wikimedia technical knowledge. Good to know: XML, JSON, python
  • Any other details to share?:
  • Interested? Add your username below:

Notes from session

Automating categories and articles on small wikis

Date and time: Saturday, May 4, 2024 16:00

Relevant links

Discussion
What can be improved?
How to make it easier for normal users to build and run tasks?
How to deploy or expand this to other small wikis?

Presenter

[[User:Ideophagous|Ideophagous]] ([[User talk:Ideophagous|talk]])
https://phabricator.wikimedia.org/p/Maurusian/

Participants

Notes

Transcriber Note: I use assistive tech to take notes. There will be copying and pasting. (Kim)

Summary
The conversation centered around automating the creation and maintenance of articles and categories on Wikipedia and related wikis. Mounir shared their experience with creating a bot script for small wikis, while discussing challenges in creating and maintaining articles about villages in a language-specific wiki. He highlighted the challenges of maintaining MediaWiki entity templates, and discussed using placeholder articles. Mounir acknowledged potential benefits of automated articles, but also addressed limitations and challenges. Later, the group discussed difficulties in collecting data for rural communities due to limited access to information, and Mounir proposed using XML templates to streamline the process. Participants expressed concerns about necessary infrastructure and technical skills, while another participant questioned the feasibility of a centralized platform. Mounir emphasized the importance of involving community members in the process.

Action Items
Look into integrating the article placeholder tool with the text template approach to automatically populate articles
Connect Mounir with designers who could help create a non-wiki user interface like a dynamic form (Done)

Outline
Automating categories and articles on a small wiki using bots.
Mounir discusses automating categories on a small wiki using a bot.
Building a category tree and matching categories with Wikidata.
Mounir is building a category tree for a wiki, using a script to create the tree and match categories.
Mounir is also working on adding interlinks between categories and Wikidata, using JSON.
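
As a rough illustration of the interlinks JSON mentioned above, the sketch below stores one title pattern per language and keeps only the titles that actually exist on the other wiki. The patterns, the `match_interlinks` helper, and the sample data are all hypothetical, chosen for illustration; they are not taken from the bot's actual files.

```python
import json

# Hypothetical per-language title patterns; "{city}" is filled in per category.
PATTERNS = json.loads("""
{
  "en": "Category:People born in {city}",
  "fr": "Catégorie:Naissance à {city}",
  "es": "Categoría:Nacidos en {city}"
}
""")


def match_interlinks(patterns, city_labels, existing_titles):
    """Return {lang: title} for every pattern whose filled-in title exists."""
    links = {}
    for lang, pattern in patterns.items():
        label = city_labels.get(lang)
        if label is None:
            continue  # no label for the city in this language
        title = pattern.format(city=label)
        if title in existing_titles.get(lang, set()):
            links[lang] = title
    return links


# Illustrative inputs: city labels per language, and titles assumed to exist.
city_labels = {"en": "Stockholm", "fr": "Stockholm", "es": "Estocolmo"}
existing = {
    "en": {"Category:People born in Stockholm"},
    "fr": {"Catégorie:Naissance à Stockholm"},
    "es": set(),
}
links = match_interlinks(PATTERNS, city_labels, existing)
```

A pattern that does not resolve to an existing page is simply skipped, which matches the idea that a link is added only "whenever it's available".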

Automating article creation on Wikipedia using XML models.
Mounir discussed creating automated articles on Moroccan Arabic Wikipedia using two methods: an XML model and a plain-text template, with generated articles going to the draft namespace.
The XML model relies on Wikidata, but there isn't enough information about some topics, resulting in incomplete articles.
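
A minimal sketch of the XML-model idea, assuming a made-up schema (`article`, `paragraph`, `sentence` tags with a `text` attribute) and made-up data: sentences nest recursively, and a sentence whose parameters are missing is dropped together with its sub-sentences. This is not the session's actual model, only an illustration of why sparse data leads to very short articles.

```python
import xml.etree.ElementTree as ET

# Hypothetical model: sentences may contain sub-sentences, to any depth.
MODEL = """
<article>
  <paragraph>
    <sentence text="{name} is a village in {country}.">
      <sentence text="It was founded in {founded}."/>
    </sentence>
  </paragraph>
  <paragraph>
    <sentence text="Its mayor is {mayor}."/>
  </paragraph>
</article>
"""


def render_sentence(node, data):
    """Render a sentence and, recursively, its sub-sentences."""
    try:
        text = node.get("text", "").format_map(data)
    except KeyError:
        return []  # a parameter is missing: skip this sentence and its children
    parts = [text]
    for child in node.findall("sentence"):
        parts.extend(render_sentence(child, data))
    return parts


def render_article(model_xml, data):
    """Render all paragraphs, dropping paragraphs that end up empty."""
    root = ET.fromstring(model_xml)
    paragraphs = []
    for paragraph in root.findall("paragraph"):
        sentences = []
        for sentence in paragraph.findall("sentence"):
            sentences.extend(render_sentence(sentence, data))
        if sentences:
            paragraphs.append(" ".join(sentences))
    return "\n\n".join(paragraphs)


sparse = render_article(MODEL, {"name": "Example", "country": "Morocco"})
fuller = render_article(
    MODEL, {"name": "Example", "country": "Morocco", "founded": "1900"}
)
```

With only two values available, the model collapses to a single sentence even though it allows for more.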

Village population data collection and analysis.
Mounir discusses the challenges of creating an article about village populations in a specific region, including inconsistent data and legal restrictions.

Improving a wiki with bots, with 35-100 active editors.
Another participant discusses challenges with engaging a small community of editors on a wiki, despite having 35-100 active editors at any given time.
Using templates to help fill out census data in Indian languages.
Another participant discusses using XML templates to help fill in census data for rural areas.
Second method, using automated articles, is more convenient for communities but may require tech person per language.

Simplifying data collection for non-technical users.
The ideal target for a new project is small models, with potential for customization and automation.
Mounir discussed the possibility of making data from the census searchable, with prefilled information for specific places or people.

Integrating article placeholder with template text.
Another participant discusses article placeholder and template text integration.
Creating a user-friendly interface for generating content.
Another participant suggests creating a non-wiki user interface for users to create entity descriptions.
Using designers to create user stories and whiteboard designs.
Kim mentions designers Justin Scherer and Matthew Williams.

Questions

Article Placeholder
https://www.mediawiki.org/wiki/Extension:ArticlePlaceholder

https://commons.wikimedia.org/wiki/File:Generating_Article_Placeholders_from_Wikidata_for_Wikipedia_-_Increasing_Access_to_Free_and_Open_Knowledge.pdf

Photos

https://phabricator.wikimedia.org/F50159572
https://phabricator.wikimedia.org/F50159571
https://phabricator.wikimedia.org/F50159573

Social

Transcript

Mounir 00:01
I'll be presenting some experiments I did on automating categories and articles on a small wiki. The reason I wanted to present this is that it may be useful for other small wikis, and also to get feedback. These are basically just experiments, so there may be better ideas, or ways to extend this further or implement it. If you want the details, I put them here on a subpage of my user page, and we will be visiting the links here. First, about myself: my username is Ideophagous. I started editing Wikipedia now and then, and when the Moroccan Arabic Wikipedia came along, that was 2020, I became even more active. On that wiki I run a bot called DarijaBot. Darija is the name we use for Moroccan Arabic in Moroccan Arabic. And that's a small wiki: we just passed 10,000 articles about two weeks ago, and we don't have enough editors. In order to optimize the editing time of our editors, we try to automate as much as can be automated without losing quality. One of the things we can automate, for example, is the categories, like people born in a particular city. For example, this category is people born in Stockholm. So this category was entirely

Unknown Speaker 01:50
created with a

Mounir 01:53
bot, that is, DarijaBot. And not only the category, but the whole category tree, people born in Sweden by city, people of Stockholm, and the other categories, is built using a bot. Now, the idea here is that you want to make this easy: you don't want to rebuild the script each time, so you try to parameterize it as much as possible. So this is the main parameter file. It's a JSON file in the MediaWiki namespace. These are just some general parameters, and then there are the specific tasks; every task has the same title, only the task number changes. So we basically have two different scripts: one of them is for building the whole category tree, and the other one is for matching the categories, because once you create a category you need to match it with other categories and Wikidata. So you also need a file to match the categories. I'll show you an example of the category tree file. It basically looks like this. This is the category "people born in", and this is the city parameter. This is the main category we add, and then super-category level zero is "people born in country, by city", and so on and so forth. So we keep building up all the super-categories until we reach the base categories, for example "people of" a continent; we already have all of these categories, so we don't have to go any further. The parameters are taken from Wikidata. So this is P19, the place of birth, which is the city; the country is P17 of P19; and the continent is P30 of P17. The other problem I ran into is that at first I just stopped there and checked nothing; then I found that sometimes, instead of the city, they put a country as somebody's place of birth. So I also had to check the "instance of": is this a city or not? If it's not a city, I don't

Unknown Speaker 04:05
add it.
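
The property chain described here (P19 place of birth, then P17 country, then P30 continent, with an "instance of" check at each step) can be sketched as follows. This is only an illustration under stated assumptions: the in-memory `ENTITIES` table stands in for Wikidata, the person identifiers and the English category names are hypothetical, and a real check would probably also need to accept subclasses of "city".

```python
# Tiny in-memory stand-in for Wikidata (illustrative; a real bot would query the wiki).
ENTITIES = {
    "Q_PERSON_A": {"P19": ["Q1754"]},  # hypothetical person born in Stockholm
    "Q_PERSON_B": {"P19": ["Q34"]},    # hypothetical person whose birthplace is a country
    "Q1754": {"label": "Stockholm", "P31": ["Q515"], "P17": ["Q34"]},
    "Q34": {"label": "Sweden", "P31": ["Q6256"], "P30": ["Q46"]},
    "Q46": {"label": "Europe", "P31": ["Q5107"]},
}
CITY, COUNTRY, CONTINENT = "Q515", "Q6256", "Q5107"


def first_value(qid, prop):
    """Return the first value of a property, or None if absent."""
    values = ENTITIES.get(qid, {}).get(prop, [])
    return values[0] if values else None


def is_instance_of(qid, class_qid):
    """Check P31 (instance of) directly; subclasses are not followed here."""
    return qid is not None and class_qid in ENTITIES.get(qid, {}).get("P31", [])


def birth_category_chain(person_qid):
    """Return the category and its super-categories, or [] if a check fails."""
    city = first_value(person_qid, "P19")    # place of birth
    country = first_value(city, "P17")       # country of that place
    continent = first_value(country, "P30")  # continent of that country
    checks = [(city, CITY), (country, COUNTRY), (continent, CONTINENT)]
    if not all(is_instance_of(qid, cls) for qid, cls in checks):
        return []  # e.g. the "birthplace" is a country: add nothing
    label = lambda qid: ENTITIES[qid]["label"]
    return [
        f"People born in {label(city)}",
        f"People born in {label(country)} by city",
        f"People born in {label(continent)} by country",
    ]
```

For the second hypothetical person, whose place of birth is a country rather than a city, the function returns an empty list, so nothing is added.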

Mounir 04:08
So the checks are on the "instance of" of the different entities: the first one has to be a city, and then the same thing for the country and for the continent. The idea here is to use the same script to build any kind of category, and the other thing is to make it possible for any user who doesn't know how to write a bot script to also use the bot. For example, there's another job, this is job number one, and I don't remember whether it was two or three, that was made by an administrator on our wiki who doesn't know how to build a bot. They just built the category tree and the matching, and we were able to run it. So this is basically how it's done. The next step that I would like to go into is to potentially be able to build the category tree, and also the interlinks JSON. That one looks similar. Here, this is the category I want to add, "people born in" the city; in French it looks like this, [unclear], and then the city; in Arabic it is this way, and in Spanish, in Italian, etc. Wherever I find that the category has a certain defined structure, I add it here, so that whenever it's available it gets matched automatically. So, as I said, the next step that I would like to go into, and I don't know if I can do it or if it's feasible, is to potentially build these JSON files in an automated fashion as well. I just tell it I want to build a category tree about this thing, and it explores the category tree in English, for example, and generates something like this, which may not necessarily be the final product, but at least it will reduce the amount of time it takes people to write the JSON. Okay, so this was the first part, which is about creating the categories. You have the code here as well; you can take a look if you want, just go to the page with the details and you can check the rest.

So now, concerning creating articles, let me just go through [unclear]. For creating automated articles, it's again the same idea. There are some topics that may be very important, very interesting, but we don't have enough users to write articles about each one of them, or at least we could reduce the amount of time it takes. So we build the structure of the article, maybe some introduction or something, and then users can come and expand those articles. We experimented with two methods. The first one was a bit ambitious: we did it with an XML model. It looks something like this. You basically have an article tag; each article has paragraphs, and each paragraph has sentences. A sentence has parameters or sub-sentences, and the sub-sentences can go further down recursively, sub-sentences of sub-sentences and so on. And there is a recursive function which keeps parsing sentences until it reaches the bottom of the tree. Unfortunately, in terms of results, it wasn't great, because it relied on Wikidata, and on Wikidata there isn't enough information about a lot of topics. For example, here it just gives an article of one or two sentences, even though the XML model provides for the possibility of adding all kinds of information; this information is just not available on Wikidata. So essentially, with this method, the first step would be to fill the Wikidata items with that information, and then to move on to the next step, which is to generate the articles. These go to the draft namespace on Moroccan Arabic Wikipedia, because we have a policy that most articles, unless there is community consensus, should not be written directly in the main namespace; they should be written in the draft namespace until they are [unclear].

The second method we implemented for Moroccan villages. We have the census data from 2004 and 2014 in Morocco, and for the vast majority of the villages you don't find information online, or you find just automated articles on Wikipedia in Arabic, I think. So we thought it would be useful to have these articles and also to fill them with information from the census data, because in the end, I think they only had like two or three hundred articles. So we did something a bit more advanced, I would say, in that sense. I can show you some examples. First, this is how it's done. It doesn't work with an XML template; it's just a simple text template. In this template you have variables that you fill out with the information: the name of the village, the village type, the community it is located in, etc. And this is supposed to be a timeline tag, which will show the progress of the population; I'll show you an example of what it should look like when it's filled out. Here it shows an error because it doesn't have any information yet. And then basically some notes and two sources, which are the 2004 and 2014 census data. The problem we have here is that we have three situations. The first is where the village has all the information. The second is where the village has fewer than 30 families; by law, they're not allowed to publish the information about those villages, so for those the article is just the introduction. And then for some villages there was less information for some reason, maybe there was a problem with the census or something, so we also had to remove some paragraphs.

I can show an example of what the final result looks like. This is a category for one of the provinces, and it looks like this. You have an infobox and an introduction: the village is this and that, it is located here and there, and this is how many people live there according to the latest census. Then the progress of the population between 2004 and 2014, and then some general information about the population, like marriage and fertility rates, and also schooling, how many people went to school, employment, etc. So this is not a bad article, I would say, because it contains enough information about the topic, and it's entirely driven by bots. So it is possible.

So this is basically the idea I wanted to show, which is the different experiments we've been doing on our wiki. Now my question to you is, first, whether you think this can be improved and whether there are other ways this can be done better. Second, whether it is possible to make this easier for normal users, so that less tech-savvy users can implement it. And the third thing is how we can deploy and expand this to other small wikis, so they can potentially do a similar thing without necessarily having to rewrite the script or anything of the sort.
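
The text-template method and its three situations (full data, data withheld for very small villages, partial data) could look roughly like the sketch below. It uses Python's `string.Template`; the template wording, the field names, and the `withheld` flag are assumptions made for illustration and are not the templates used on the wiki.

```python
from string import Template

# Hypothetical templates; every $variable is filled from one census row.
INTRO = Template("$name is a $village_type in the commune of $commune.")
PARAGRAPHS = [
    Template("It had $pop_2014 inhabitants in 2014, compared with $pop_2004 in 2004."),
    Template("The schooling rate was ${schooling_rate}%."),
]


def render_village(row):
    """Build article text from one row, dropping paragraphs that lack data."""
    intro = INTRO.substitute(row)
    if row.get("withheld"):
        return intro  # census figures not published: introduction only
    parts = [intro]
    for paragraph in PARAGRAPHS:
        try:
            parts.append(paragraph.substitute(row))
        except KeyError:
            continue  # a value is missing: leave this paragraph out
    return "\n\n".join(parts)


base = {"name": "Douar A", "village_type": "village", "commune": "Commune X"}
full = render_village({**base, "pop_2004": 410, "pop_2014": 385, "schooling_rate": 62})
partial = render_village({**base, "pop_2004": 410, "pop_2014": 385})
small = render_village({**base, "withheld": True})
```

Keeping each paragraph as its own template means a missing value removes only that paragraph, which mirrors the description of removing some paragraphs for villages with less information.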

Kim S 12:52
Can you share a little bit more about the background? I guess, aside from you, who else is working on this? How small is the community? Some context on who's working on what, and their level of tech savviness. What caused you, basically, to start working on the bots in the first place, aside from the fact that there are not enough folks?

Mounir 13:21
Yeah, so our wiki basically has, let's see, around 35; I mean, the lowest I've seen is 35 people editing at least once per month, and the highest recently was a bit above 100. So it can go up and down. If we really talk about the actual active editors at any given time, apart from the administrators, we have only one or two people who are really active. So it's not a lot of people. It's hard to engage people in some projects. We sometimes try to start a project, let's write about this or that topic, but we don't get enough people engaged in the edit campaigns. In terms of programming: for the second method, a large part of the script was written by Zachary, who is sitting in the back, and then myself. The first method and the category scripts I wrote myself. We were also communicating with the administrators about what their needs are, what they want to implement, and whether it's allowed, of course, because they might disagree with the idea, or think that implementing it was a bit too much, because it should come from people from the villages. We agreed that the benefit was sufficient that we can just directly use bots to write the articles. Because in any case, if you look for the villages online, you're not going to find information. There's an added value in that sense, and for the vast majority of them, apart from the census data, there's nothing else to take.

Kim S 15:13
Fascinating.

Mounir 15:24
So I think we have discussed this; you mentioned that in Telugu maybe it would be useful to have some automated articles. I mean, do you think this is useful? Is it possible to implement there? I don't know.

Unknown Speaker 15:39
I think it's useful. What the Telugu community members did was only one or two people's effort, back in 2011 when the census data was released by the government. There was one particular volunteer who was able to get the government to put the data onto a public platform, but then he himself was not able to find any other Telugu Wikipedians for that. At last he was able to convince one person, and they started writing the articles manually for all the villages. I don't think it was even completed; the most difficult part of that was convincing the government to release the data. I think it's somewhere around 50 or 60%, so it's still in progress and hasn't been completed. I think this should be useful, because from what I have seen, they already had a template of how to write it. The community already has rules for how the grammar should be, or what points each article should include in case it has to get past the stub stage. So yeah, I think if they have a template, and this can be used and modified for their needs, it should definitely help them fill in the next census data.

Mounir 17:08
Okay. So then, is it more like the first method, the XML templates, or is the second one much better, in your opinion? I just want to know which one I should work harder on, and which one I should maybe leave for other types, because the first one is a bit more complicated. And the other thing is, I'm not sure how I can make that usable for a normal user. How can a normal user, I don't know, maybe just by moving the mouse or something, build an XML?

Unknown Speaker 17:44
The second one is always convenient, because the community has not seen a tech person for a while. So the second method is more convenient for them. But since we are also thinking of having at least one tech person per community, as a movement partner for Indian-language Wikimedia, we are thinking that probably the first should also be used, so that it can be easily spread to other languages who are thinking of the same thing. So for now, two, but the ideal target should be one, and maybe the small models.

Mounir 18:32
I don't know if there is more feedback or ideas.

Unknown Speaker 18:37
Yeah. Well, I think this could be very helpful, especially on wikis with few editors. I'm just wondering what level of technical skills people would need to have if they wanted to

Unknown Speaker 19:03
have like ones like this one?

Mounir 19:07
Yeah, that's a very good question. And that's basically the direction I want to go: to abstract away most of the technical aspects, so that users can just focus on collecting the data, getting the data, or something like that, but they don't have to deal with the scripts or anything like that, apart from maybe starting the script itself. So that's the reason why I presented this: to see where I could go in that direction, what I should abstract away and automate to get to the point where it's actually easy to use for many other non-technical users.

Unknown Speaker 19:54
Yeah, so there's an extension that was deployed several years ago, I think, called ArticlePlaceholder. It might still be deployed, actually. I don't know how good the infrastructure is around actually customizing the output from that, but I think it was envisioned that you could tune the content on any wiki [unclear]. But anyway, I was going to say, maybe you might find a way to make the data from the census, for example, searchable, and then when you click on a red link for something like a village, you would have something prefilled that someone could modify if need be, when they wanted to customize articles for a particular place, or person, or whatever. That would be interesting to see. I mean, I concur that wikitext templates are much nicer, for various reasons. But in any case, finding a way to make it as good as what there's a lot of efforts were places where like a lot of Massachusetts has had in the past and its enormous and updates, in large part. And with all the centralization that takes place, when we get out of this incentive to start trying to avoid that sort of stillness it costs the responsibility for that service. It's looking like metric results are free but doesn't like on savings.

Mounir 21:28
Is there somebody here who works on the article placeholder?

Unknown Speaker 21:33
So it's not really maintained. We don't do that much on it anymore, but I think I'm the closest person by now.

Unknown Speaker 21:41
Lucie was the one that wrote it. Yeah.

Unknown Speaker 21:45
Yes. So she obviously wrote it and did most of the community work back then, and she has left WMDE since. We did some maintenance on it, but it's just to keep it running. I'm happy to talk about it, actually.

Mounir 22:03
Yeah, sure. So do you think it's possible to integrate the article placeholder with this idea of text templates, so that the variables written in the text template will automatically be replaced with the Wikidata entities or the property values? Do you imagine that that can be done, perhaps? [unclear] Okay. But technically, what do you think we need exactly? Just a page where the template text is written, and then you just somehow invoke that from the article placeholder? Or, I don't know, maybe you have to create a new namespace for that or something. But that's

Unknown Speaker 22:48
Basically, all that the article placeholder does is call out to Lua and pass in the entity ID which you want to have a placeholder for. It has a default view, which is just some boxes with the data, but in theory there is nothing stopping you from doing whatever you can do with wikitext or Lua. So basically, everything you can do on a normal page, you could do in an article placeholder. I don't think anyone has really done that, but [unclear].

Unknown Speaker 23:29
Actually, in fact, efficient customization, I wasn't able to find one.

Unknown Speaker 23:36
I hope there is documentation on mediawiki.org. But also, Lucie wrote her thesis about that, and it is on Wikimedia Commons. That definitely documents all of these things. So it's also good to add to the notes. Yes.

Mounir 24:15
Are there any other ideas?

Kim S 24:49
Have you thought about having a non-wiki user interface that is just a form, which, I don't know, would be, I guess, just less intimidating for an average user?

Mounir 25:07
Okay, but what would be in the fields?

Kim S 25:12
Um, I'm not sure. I guess it's up to you. I just feel like sometimes anything outside of an article is intimidating to some users. And maybe if it was a more generic-looking thing, like a form, a Google Form or something, that can somehow talk to the API, maybe you'll get a little bit more engagement. But maybe that's not even your priority or focus.

Mounir 25:59
Not really. I feel like that's kind of the direction I want to go. But I'm still not sure; since you suggest a form, I'm not sure exactly what kind of tool I could give the user that would allow them to create something like this, but without any technical aspects. So a form, maybe that could work, but I'm not sure exactly what kind of fields. Maybe they could fill in the title: the title is going to be this entity or this type of entity. And then the first sentence they could write in that [unclear]. But yeah, essentially, you could just add an element and then basically build the sentences with a form. I mean, it should not be a static form, I guess; you have to be able to add new fields. And the text has to be

Kim S 26:52
Yeah, so like a dynamic form that's editable as well.

Unknown Speaker 27:08
Sounds a lot like Cradle for making items, except that instead of being an item it would be something [unclear], and you would be able to add free text between any two fields, rearrange the fields, but also add free text. So that's an idea.

Kim S 27:41
So we do have a couple of designers here that will be more than happy to help you create a user story and whiteboard something. They're not here right now, but they should be in the ballroom. Justin Scherer, and then Matthew, what's his last name? You can follow me afterwards and I'll introduce you. Matthew Williams, yeah.

Mounir 28:29
So if there are no other points, thank you for your attention.

Event Timeline

debt triaged this task as Medium priority.

Results and feedback:

  • Method 2 to generate articles (with text templates) seems easier for users
  • Received suggestion to build a dynamic form for generating text templates using Codex. Received getting-started tutorials from Codex team.

Next steps

  • POC: build a simple form and generate a simple article