
Have Event Invitations scoring model analyzed by Research
Closed, Resolved · Public

Description

Background: Before we build an MVP for Event Invitations, we will want to have the existing model analyzed by Research to see what recommendations they may have for improving the model.

Relevant links:

Acceptance Criteria:

  • Share the current documentation on the model and the Event Discovery project page with Research for feedback
  • Share feedback on model with team so we can discuss which changes we may want to make & which changes are for MVP vs. post-MVP

Event Timeline

ifried updated the task description.
ifried added a subscriber: Iflorez.
ifried renamed this task from Have Event Invitations model analyzed by Research & Product Analytics to Have Event Invitations scoring model analyzed by Research. May 16 2024, 7:50 PM
ifried updated the task description.

Ok, I got a chance to take a look. Thanks to @Daimona for the very well-documented code and function graphs, @Iflorez for your notes, and @ifried for the excellent Meta overview -- this made my job far simpler! Broad feedback:

  • Overall:
    • Generally I assume your approach is working pretty well (though this is mostly based on hunches, as we don't have enough data to really evaluate yet). I have some thoughts on smaller tweaks, but I suspect the report is spot on when it notes that the main factor affecting the effectiveness of these recommendations/invitations is the quality of the event's landing page and the clarity of what invitees are being asked to do. My main actionable recommendation is to consider flipping the edit-count feature so it prioritizes newcomers over experienced editors (more below).
  • Modeling choices:
    • When I think about designing an algorithm (or learned ML model) to solve this sort of classification task, I generally think about the trade-offs between simplicity and precision. On one end, you have a really simple model that anyone can hopefully understand and modify. Maybe it's not perfect, but that's okay because it's transparent to users and easy to adjust. On the other end are much more complex models that can't fully be explained but may be far more accurate at the task at hand. While it's tempting to think there's a good compromise that balances both, I generally find myself going with either maximally simple or maximally performant depending on the task. The middle ground unfortunately usually just means the model is somewhat hard to explain to folks and adjust, but still isn't good enough that folks will happily accept its outputs.
    • For this model, I think we're clearly in the space of wanting a maximally simple approach because the task is inherently hazy (there is no "right" answer for who to invite). I think you're largely in that space, which is great. I think the features are pretty straightforward at a high-level. The geometric mean might confuse some folks but it still is pretty easy to explain. I'll admit that personally I find some of the more complicated feature normalization functions to be more than is necessary but that's a nitpick and the graphs provided in the comments help a lot.
  • Evaluation data:
    • The main challenge is not having a method to evaluate whether the approach is working, and potentially to train a model to learn a more empirical balance of feature weights (right now they're set to 5 recent-activity vs. 4 bytes vs. 1 edit-count). So most of my feedback is based on hunches. With a fair bit of work, I assume you could encode who was recommended vs. who was invited and use that to adjust the model weights -- e.g., via a logistic regression model as Irene raised in her feedback -- but there isn't a lot of data yet so I don't know that that's useful at this stage. If you expand the pilot, I'd recommend tracking this somewhere (user, overallScore, rank, bytesScore, editCountScore, recentActivityScore, selected-for-invite, did-join-event).
    • As for alternatives, it's hard to think of reasonable proxies on the wikis for this sort of task that would have more data. You could try to re-generate past campaign participant lists, but gathering that data is quite messy and you'd have to find campaigns that recruited experienced contributors and had some sort of topical focus. WikiProjects don't currently have a good way of recording participants, or they'd be a natural fit.
  • Feature feedback:
    • Bytes:
      • Simple, and while certainly not perfect, in practice it maps pretty well to edit difficulty/engagement once bots and reverts are filtered out, as you do (quick analysis). This also makes sense -- adding a new reference/sentence/etc. generally requires adding a fair amount of bytes. The big maintenance edits are often via bots -- e.g., IABot, as is mentioned somewhere. If you wanted, you could also do things like remove any edits that match a set of tools often used for basic maintenance -- e.g., for enwiki, I've used ['AWB', 'twinkle', 'huggle', 'WPCleaner', 'canned edit summary', 'OAuth CID: 1805', 'RedWarn', 'Ultraviolet'] in the past as edit tags that indicate tool-assisted edits, but that's a manual list I put together based on scanning Special:Tags.
      • The main challenge I see with this is that you're not distinguishing between editors who are active in the topic space and editors who are just very active. For example, when I was doing some work on finding similar users to a given editor (as a proxy for potential sockpuppets), I was building a list of editors who most overlapped with a given editor based on edit history. But if you just use this info, User:Ser Amantio di Nicolao appears on everyone's list because of how prolific they are. Instead, you need to normalize for how many edits someone makes in general (akin to tf-idf). This is challenging to do via the APIs, but you might actually consider weakly "penalizing" editors for a high edit count instead of rewarding them. The logic is that while high-edit-count editors might be good contributors, they are also more likely to have just incidentally edited these pages without an actual topical interest, and we want to reduce spam to them too (them ending up on everyone's invitation list).
    • Edit count:
      • Building further on the above about "penalizing" high-edit-count editors: given the nature of the geometric mean (it tends towards the lowest value in the set), this means that if you have an editor who signed up and maybe made a few edits via the Newcomer Homepage about a topic of interest to them but isn't sure what to do next, they're going to have a really small value for the edit-count feature and probably also for bytes-changed (assuming they made simple edits). This is going to put them always at the bottom of the list, when I feel like maybe they're the most important people to invite (as they need that nudge to continue and the structure of a campaign, and they're least likely to discover the campaign themselves). As I said above, I feel like you might want to actually take the inverse of someone's edit count for this feature so lower-edit-count folks who overlapped with the worklist are prioritized (see the sketch after this list).
    • Recent activity:
      • Seems reasonable to me!
  • Expansions:
    • Honestly I feel like you have a reasonable set of features based on what can be efficiently gathered from the databases. I've done some work on classifying edits by what they changed so you could e.g., isolate editors who added new references to articles and prioritize them, but that's a much more expensive feature to compute and it's not cached anywhere yet unfortunately.
    • We've discussed using the list-building tool (or even just Search's morelike API for something production-ready) to expand worklists (when they're tiny) to include more articles in the same topical space (and therefore more potential editors). We don't have any way of evaluating how impactful that would be. My feeling would be: let's explore it if you're hearing that the invitation lists are too short or overlap too much with who organizers were already going to invite. But otherwise it's not necessary yet.
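
For concreteness, here is a minimal Python sketch of the weighted geometric mean described above, using the 5/4/1 weights from the comment and illustrating the suggested inverted edit-count tweak. The feature names, values, and the `inverted` helper are illustrative, not the extension's actual code:

```
import math

# Illustrative feature weights from the comment above:
# 5 recent-activity vs. 4 bytes vs. 1 edit-count.
WEIGHTS = {"recent_activity": 5, "bytes": 4, "edit_count": 1}

def weighted_geometric_mean(scores, weights):
    """Combine per-feature scores (each in (0, 1]) into one ranking score.
    The geometric mean tends towards the lowest value in the set, so any
    near-zero feature drags the overall score down."""
    total = sum(weights.values())
    log_sum = sum(w * math.log(max(scores[name], 1e-9)) for name, w in weights.items())
    return math.exp(log_sum / total)

def inverted(score):
    """Hypothetical tweak: flip a normalized feature so low values are
    rewarded, e.g. to prioritize newcomers on the edit-count feature."""
    return 1.0 - score

# Example: an experienced editor vs. a newcomer with a few small edits.
experienced = {"recent_activity": 0.9, "bytes": 0.8, "edit_count": 0.9}
newcomer = {"recent_activity": 0.9, "bytes": 0.2, "edit_count": 0.05}
print(weighted_geometric_mean(experienced, WEIGHTS))  # high
print(weighted_geometric_mean(newcomer, WEIGHTS))     # dragged down by the low features

# With the inverted edit-count feature, the newcomer is no longer penalized for it:
newcomer_flipped = dict(newcomer, edit_count=inverted(newcomer["edit_count"]))
print(weighted_geometric_mean(newcomer_flipped, WEIGHTS))
```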

Ok, I got a chance to take a look. Thanks to @Daimona for the very well-documented code and function graphs, @Iflorez for your notes, and @ifried for the excellent Meta overview -- this made my job far simpler! Broad feedback:

Thank you for the in-depth review! I think there's plenty of food for thought.

For this model, I think we're clearly in the space of wanting a maximally simple approach because the task is inherently hazy (there is no "right" answer for who to invite).

Right. On top of that, and this is a key point that I should've maybe made more explicit, the current implementation is entirely based on guesses. We didn't really have data (or time) to build something more rigorous. Nothing new here, but I feel this is so important that I just wanted to repeat and emphasize it :)

With a fair bit of work, I assume you could encode who was recommended vs. who was invited and use that to adjust the model weights -- e.g., via a logistic regression model as Irene raised in her feedback -- but there isn't a lot of data yet so I don't know that that's useful at this stage.

Agreed. I think we did consider this option for a future improvement.
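
To make that future improvement concrete: assuming rows like (bytesScore, editCountScore, recentActivityScore, selected-for-invite) were logged for each recommended user, fitting weights empirically could look roughly like the sketch below (the data here is invented):

```
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical logged rows: per-feature scores for each recommended user, plus
# whether the organizer actually selected them for an invitation.
X = np.array([
    [0.7, 0.9, 0.8],   # bytesScore, editCountScore, recentActivityScore
    [0.2, 0.1, 0.9],
    [0.8, 0.6, 0.4],
    [0.1, 0.05, 0.3],
])
y = np.array([1, 1, 0, 0])  # selected-for-invite

model = LogisticRegression().fit(X, y)

# The learned coefficients suggest how strongly each feature predicts organizer
# selection -- an empirical alternative to the hand-set 5/4/1 weights.
print(dict(zip(["bytes", "edit_count", "recent_activity"], model.coef_[0])))
```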

If you wanted, you could also do things like remove any edits that match a set of tools often used for basic maintenance -- e.g., for enwiki, I've used ['AWB', 'twinkle', 'huggle', 'WPCleaner', 'canned edit summary', 'OAuth CID: 1805', 'RedWarn', 'Ultraviolet'] in the past as edit tags that indicate tool-assisted edits but that's a manual list I put together based on scanning Special:Tags.

Good point. I feel like having a way for each Project to configure a list of such tags somewhere would be useful for more than just this task. Maybe it could use community configuration once that becomes a thing.

you're not distinguishing between editors who are active in the topic space and editors who are just very active. [...] Instead, you need to normalize for how many edits someone makes in general (akin to tf-idf). This is challenging to do via the APIs, but you might actually consider weakly "penalizing" editors for a high edit count instead of rewarding them. The logic is that while high-edit-count editors might be good contributors, they are also more likely to have just incidentally edited these pages without an actual topical interest, and we want to reduce spam to them too (them ending up on everyone's invitation list).

Makes a lot of sense! I'll need to think about this more, but it seems like it's definitely something we should look into.

Given the nature of the geometric mean (it tends towards the lowest value in the set), this means that if you have an editor who signed up and maybe made a few edits via the Newcomer Homepage about a topic of interest to them but isn't sure what to do next, they're going to have a really small value for the edit-count feature and probably also for bytes-changed (assuming they made simple edits). This is going to put them always at the bottom of the list, when I feel like maybe they're the most important people to invite

I believe the current behaviour is intentional: we didn't want to invite newcomers too easily without enough information on what they're actually editing. But again, your point makes sense and I will think about it more and discuss it with the team.

Thanks for engaging! Sounds like we're on the same page, and again I wanted to stress that I think what you all have built is quite reasonable.

I believe the current behaviour is intentional: we didn't want to invite newcomers too easily without enough information on what they're actually editing. But again, your point makes sense and I will think about it more and discuss it with the team.

Ahhh that makes sense. Yes, certainly a reasonable consideration. Maybe then you leave the feature as-is but split the single ranking into one for more-experienced editors and one for newcomers? Or just include the raw editCount in what's passed to the organizers if it isn't already so they can filter themselves based on whether they feel that their event is good for newcomers?
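
A rough sketch of what that split could look like on the output side; the threshold and field names are made up:

```
# Hypothetical: split one ranked list into two, using a made-up edit-count
# threshold, so organizers can decide whether their event suits newcomers.
NEWCOMER_THRESHOLD = 100

def split_ranking(ranked_users):
    newcomers = [u for u in ranked_users if u["edit_count"] < NEWCOMER_THRESHOLD]
    experienced = [u for u in ranked_users if u["edit_count"] >= NEWCOMER_THRESHOLD]
    return experienced, newcomers

# Or simply keep one list but expose the raw edit count so organizers can
# filter for themselves, e.g.:
# [{"user": "Example", "score": 0.83, "edit_count": 12}, ...]
```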

Good point. I feel like having a way for each Project to configure a list of such tags somewhere would be useful for more than just this task. Maybe it could use community configuration once that becomes a thing.

Agreed -- I learned from your code that the revert(ed) tags are actually defined in MediaWiki, which was a pleasant surprise, and would welcome community members curating additional collections.

Thank you so much for taking the time to review our work and provide a detailed summary of your thoughts, @Isaac! There's a lot to dig into here, and we'll be discussing this in greater detail as a team. In the meantime, I wanted to add some reflections on the edit count topic, since that really stuck out to me as something that I would love to explore:

  • I agree that we may want to rethink how the global edit count impacts the score. In the current fiscal year, we have been focused on targeting more experienced editors, since our KR focuses on increasing the percentage of articles of acceptable quality on high-impact topics. However, for the upcoming fiscal year, we will be focused on encouraging more connections between contributors, so edit count is perhaps less useful for some other use cases.
  • I like the idea of allowing organizers to choose whether or not to include new-ish editors in their invitation list. Some organizers may not be interested in explicitly inviting relative newcomers, but others may be mildly (or very) interested. There are many potential benefits to inviting new-ish editors, including: newcomers may crave direct outreach and respond positively to it ("Oh, I *do* belong here! Cool! I should join that event!"), newcomers can receive mentorship at the events they are invited to, and organizers may see more people register for their events (though these are all just theories and we would need to test this).
  • Great point about the problem of hyper-prolific editors always appearing in invitation lists. We could penalize editors with high edit counts, as you wrote. I like this idea, since one of my big concerns about Event Invitations is that it may appear spammy or irritating to highly prolific editors who end up on invitation lists again and again. Another approach (though perhaps this is unnecessarily complicated and therefore out of scope?) would be to include in the scoring the percentage of an editor's contributions that fall within the specific topical area(s) of the event. So, for example, if person A has a very high global edit count but only 1% of their contributions are in the topical area(s) of the event, they would probably not be a good candidate to invite. However, if person B has a very high global edit count and 20% of all of their edits are in the topical area(s) of the event, we may consider that notable and perhaps that person should be included in the invitation list. Anyway, just spitballing thoughts!
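
To illustrate the arithmetic of that last idea (the names and counts are invented):

```
def topic_overlap_fraction(edits_in_topic, total_edits):
    """Share of an editor's contributions that fall in the event's topical area(s)."""
    return edits_in_topic / total_edits if total_edits else 0.0

# Person A: very high global edit count, almost none of it in the topic area.
print(topic_overlap_fraction(500, 50000))    # 0.01 -> probably not a good invite candidate
# Person B: also a high edit count, but a fifth of it in the topic area.
print(topic_overlap_fraction(10000, 50000))  # 0.2  -> notable topical interest
```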

Just acknowledging -- thanks @ifried for engaging! All of this makes sense to me. Thinking more about not spamming high-edit-count editors, though I don't have anything that obviously fixes it if reaching some of these relevant, highly active editors is a priority:

  • The simplest way is probably to just pass along the global edit count to organizers and encourage them to exercise care before inviting -- e.g., checking user pages to see if editors express an interest in the topic.
  • I raised the prospect of filtering out edits with edit tags that suggest they're semi-automated (AutoWikiBrowser, etc.), which could be a small step in filtering out some of this.
  • Maybe there's a way to use watchlists to help filter this out -- e.g., for high-edit-count editors, check if the worklist articles are on their watchlists? The challenge here is one of privacy -- this is private information and you wouldn't want to leak it. So maybe this is instead future functionality that actually privately notifies editors if an upcoming event has a high overlap with their watchlist?
  • Maybe for each worklist, you also collect a small random sample of articles (e.g., 20) and filter out anyone who also appears heavily in those?
  • Larger than this one project, but we probably will eventually need some sort of opt-out system for these things (invitations but also surveys, etc.)? And then it would be just a matter of cross-referencing the invitations with, e.g., the user preference table to see who can be contacted.

Thanks @Isaac, @ifried, and @Daimona, it is so nice to see all these details. Here are some thoughts I have:

I think that, in a future iteration of this, we could add a mechanism to inform organizers that a user shows up in X other invitation lists. Also, when we implement the sending of invitations in our extension, we could store info like how many times users were invited and filter it by date. I mean, we could do things like the following (a rough sketch follows the list):

  • Inform organizers that a user was recently invited to X events in a given period (and even show the topics of the events the user was invited to, once we add the ability to add topics to events)
  • Hide a user from the invitation list if the user was already invited to X events in a given period (not saying we should, just that it becomes doable once we implement the logic to send invitations from our tool)
  • Inform organizers that a user appeared in X invitation lists in a given period (and even show the topics of the events the user was invited to, once we add the ability to add topics to events)
  • Hide a user from the invitation list if the user appeared in X invitation lists in a given period (not saying we should, just that it is doable)
  • Inform organizers that a user on the invitation list is attending another event around the same time (I think we should do this only if the user didn't register as private)
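
If the extension ends up storing who was invited to what and when, most of the checks above boil down to simple counts over a time window. A rough sketch with invented structures:

```
from datetime import datetime, timedelta

# Hypothetical log of sent invitations: user -> list of (event_id, topic, sent_at).
invitation_log = {}

def recent_invitations(user, days=90):
    cutoff = datetime.now() - timedelta(days=days)
    return [entry for entry in invitation_log.get(user, []) if entry[2] >= cutoff]

def should_hide(user, max_recent=3):
    """Hide a user from a new invitation list (or just warn the organizer) if
    they were already invited to several events recently."""
    return len(recent_invitations(user)) >= max_recent
```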

Maybe for each worklist, you also collect a small random sample of articles (e.g., 20) and filter out anyone who also appears heavily in those?

This sounds interesting; I think it would be a nice test to do.
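
A rough sketch of that random-sample idea; all names and thresholds are invented, and `top_editors_of` stands in for whatever query returns the main editors of an article:

```
import random

def filter_ubiquitous_editors(candidates, all_articles, top_editors_of,
                              sample_size=20, max_hits=5):
    """Drop candidates who also appear heavily among the editors of a random
    sample of unrelated articles, on the theory that they would land on every
    event's invitation list anyway."""
    sample = random.sample(all_articles, min(sample_size, len(all_articles)))
    hits = {user: 0 for user in candidates}
    for article in sample:
        for user in top_editors_of(article):
            if user in hits:
                hits[user] += 1
    return {user for user, count in hits.items() if count < max_hits}
```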

Thank you for sharing these insights, @cmelo & @Isaac! These are some great ideas for ways we can potentially prevent over-inviting, while not entirely preventing prolific editors from being included in invitation lists.

ifried claimed this task.

I am closing this task as Done, but I am adding it as a resource to T370215, which will be a ticket used to track future improvements to the scoring model. Thank you so much for this work, @Isaac! It was really valuable to the team and it will help us think through how we may improve the scoring model after the MVP release.