
Make ORES topics and their translations easily available to MediaWiki extensions
Open, Needs Triage | Public | 3 Estimated Story Points

Description

As part of T362259, the Campaigns-Product-Team would like to use ORES topics for a new feature. In particular, we would need a list of said topics, as well as translations. The same list and translations are already in use in two different MediaWiki extensions: GrowthExperiments and ContentTranslation (CX).

With this task, I am proposing that the list of topics and its translations be made easily available to MediaWiki extensions. In T368422 there had been an attempt to put these into the ORES extension, which is however not the appropriate place. Here, I'm proposing the WikimediaMessages extension. While not perfect, it has the advantages of being available on every wiki, and being Wikimedia-specific (just like the topic taxonomy).

There would be a new class basically identical to CX's ArticleTopicsDefinition class, providing both the raw topic names and their translated labels. The initial translations could presumably be imported from GrowthExperiments, since they've been around for a long time. The exact API is up for discussion. Now is a good time to identify what each team's needs are, and build it so that we all can use it.

I'm looking for general feedback on the proposal (for example, if there are better places than WikimediaMessages), and for specific feedback about what the new interface should provide. My initial proposal, based on the Campaigns team's needs, would be (method names are just examples):

/** Returns a plain list of topic IDs, for validation and the like */
public static function getTopicList(): array;

/**
 * The "main" entry point, could be identical to ArticleTopicsDefinition::getTopics(). The main difference is that we either
 * make this use message keys (with l10n up to the caller), or add a MessageLocalizer/ITextFormatter parameter.
 */
public static function getGroupedTopicMessages(): array;

/**
 * Returns localised labels for the given topic IDs. Like above, this could either return message keys, or take a
 * MessageLocalizer/ITextFormatter parameter.
 */
public static function getLocalizedLabels( array $topicIDs ): array;
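
To make the shape of this interface more concrete, here is a minimal sketch of the message-key variant, where l10n is left to the caller. The class name ArticleTopicsSketch, the method name getTopicMessageKeys, the 'culture' group assignment, and the wikimedia-articletopics-* key names are assumptions for illustration only. The real taxonomy is much larger and the final API is still up for discussion.

```php
<?php
// Sketch only: class name, data subset and group assignment are illustrative assumptions.
final class ArticleTopicsSketch {
	/** Group ID => group message key plus [ topic ID => topic message key ]. Keys are spelled out in full. */
	private const GROUPS = [
		'culture' => [
			'msgKey' => 'wikimedia-articletopics-group-culture',
			'topics' => [
				'biography' => 'wikimedia-articletopics-topic-biography',
				'women' => 'wikimedia-articletopics-topic-women',
			],
		],
	];

	/** Returns a plain list of topic IDs, for validation and the like */
	public static function getTopicList(): array {
		$ids = [];
		foreach ( self::GROUPS as $group ) {
			$ids = array_merge( $ids, array_keys( $group['topics'] ) );
		}
		return $ids;
	}

	/** Returns the groups with their topics, using message keys (l10n is up to the caller) */
	public static function getGroupedTopicMessages(): array {
		return self::GROUPS;
	}

	/** Returns [ topic ID => message key ] for the given topic IDs, skipping unknown IDs */
	public static function getTopicMessageKeys( array $topicIDs ): array {
		$keys = [];
		foreach ( self::GROUPS as $group ) {
			$keys += array_intersect_key( $group['topics'], array_flip( $topicIDs ) );
		}
		return $keys;
	}
}
```

A caller would then pass each returned key to whatever localizer it already has. The alternative mentioned above, taking a MessageLocalizer/ITextFormatter parameter, would move that step inside the class instead.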

In addition to us Campaigns-Product-Team, I am tagging Growth-Team (ref), Language and Product Localization, and Research as technical stakeholders. Please let me know if I need to use different tags or processes to reach y'all. Thanks in advance!

Event Timeline

Nikerabbit subscribed.

WikimediaMessages sounds good to me.

For code search, it would be better to avoid constructing the message keys dynamically like currently done in CX.

We probably want to preserve the existing translations (from CX?) and try to move them over to WikimediaMessages. We can handle that part once we are that far.

> For code search, it would be better to avoid constructing the message keys dynamically like currently done in CX.

Agreed - that, or list the possible keys in a comment as usual.

> We probably want to preserve the existing translations (from CX?) and try to move them over to WikimediaMessages. We can handle that part once we are that far.

I was proposing to use Growth translations because they've been around for longer, so I thought they're probably more complete. But as you say, that can come later.


I guess one thing I forgot to mention in the proposal is: how would changes to the list be handled? In T362259 I briefly mentioned the idea of having versioning; I'm not sure if it's a good idea though. We could also just treat it as an ordinary shared code change, and have the person making the change make sure that no code will break (which might include reaching out to the maintainers of said code, i.e. the teams subscribed to this task).

Thanks @Daimona for putting this together! Just adding RecentChanges as another potential stakeholder where these types of labels might appear (and therefore I think @Samwalton9-WMF ?). Context: T245906

One more thing the Campaigns team would like to have is a way to keep old topics (from a previous version of the taxonomy) around. I'm thinking they could have an extra 'disabled' => true property or something. However, there's no need to implement this now. Instead, it can be done once we actually have a change to the taxonomy.
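
As a rough illustration of the 'disabled' idea, retired topics could stay in the definition and be filtered out wherever only current topics are wanted. The array shape, the 'old-topic' ID, and the function name getActiveTopicIDs are hypothetical and not part of any existing patch.

```php
<?php
// Hypothetical definition: 'old-topic' stands in for a topic retired by a future taxonomy change.
$topicDefinitions = [
	'biography' => [ 'msgKey' => 'wikimedia-articletopics-topic-biography' ],
	'old-topic' => [ 'msgKey' => 'wikimedia-articletopics-topic-old-topic', 'disabled' => true ],
];

/** Returns the IDs of topics that are not marked as disabled */
function getActiveTopicIDs( array $definitions ): array {
	$active = array_filter(
		$definitions,
		static fn ( array $def ): bool => !( $def['disabled'] ?? false )
	);
	return array_keys( $active );
}
```

Code that validates stored topic IDs could keep checking against the full definition, while pickers that offer topics to users would call getActiveTopicIDs().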

I will make a proof of concept for the current proposal later today.

Change #1100553 had a related patch set uploaded (by Daimona Eaytoy; author: Daimona Eaytoy):

[mediawiki/extensions/WikimediaMessages@master] [POC] Introduce ArticleTopicsRegistry

https://gerrit.wikimedia.org/r/1100553

Just checking in: I'd like to confirm whether people have reviewed the proposal and the proof of concept to make sure that it fits their needs. If so, I would like to polish it up next week and put it up for review, so please let me know if you have any objections. Thanks!

Update: the patch is now in review. One thing I should mention is the translations. I found the following differences between GrowthExperiments (GE) and ContentTranslation (CX):

Topic             | GE                | CX
Geography         | Regions           | Geography
Biography         | Biography (all)   | Biography
Women biographies | Biography (women) | Women

I asked @ifried, and we would like to keep the GrowthExperiments versions of all three, because:

  • Geography is the study of Earth, but that category really only contains regions
  • "Women" alone is too broad to be used for just categories (could refer to other things such as women’s health, women’s history, women’s rights).
    • Consequently, we then need to have "Biography (all)".

There are also two messages that differ only in capitalization:

GE                           | CX
History and Society          | History and society
Science, Technology and Math | Science, technology and math

For these, I believe the CX version is preferable due to standard capitalisation.

Please let me know if you have any feedback on the above!

Also, @Nikerabbit: can you please take care of moving the messages on TWN when the above patch is merged? I can prepare a complete map from old to new if you need it in a specific format.

Is the mapping straightforward enough to do with a pattern on Special:ReplaceText? Otherwise I would use moveBatch.php, but that needs a full list of page titles.

Also, I cannot move stuff while CX is using them, so we should prepare a patch in CX to call WikimediaMessages if available, and a patch for translatewiki.net to set up this new group. This also affects testing locally.

> Is the mapping straightforward enough to do with a pattern on Special:ReplaceText? Otherwise I would use moveBatch.php, but that needs a full list of page titles.

This is actually a good question. I think it depends on how much we want to try and preserve existing translations, since we're trying to merge two separate sources. I guess a more refined solution would look like this.

  1. For each language and topic, compare growthexperiments-homepage-suggestededits-topic-name-$TOPIC and cx-articletopics-topic-$TOPIC. If they're identical, put the content in wikimedia-articletopics-topic-$TOPIC and delete both sources.
  2. Do the same for growthexperiments-homepage-suggestededits-topic-group-name-$GROUP and cx-articletopics-group-$GROUP --> wikimedia-articletopics-group-$GROUP.
  3. (Now we no longer have identical messages that exist in both sources)
  4. Move the following GrowthExperiments messages, if they exist, regardless of the CX version; delete both sources afterwards.
    1. growthexperiments-homepage-suggestededits-topic-group-name-geography to wikimedia-articletopics-group-geography
    2. growthexperiments-homepage-suggestededits-topic-name-biography to wikimedia-articletopics-topic-biography
    3. growthexperiments-homepage-suggestededits-topic-name-women to wikimedia-articletopics-topic-women
  5. For each topic, if only one exists of growthexperiments-homepage-suggestededits-topic-name-$TOPIC and cx-articletopics-topic-$TOPIC, move its content to wikimedia-articletopics-topic-$TOPIC and delete the source.
  6. Do the same for growthexperiments-homepage-suggestededits-topic-group-name-$GROUP XOR cx-articletopics-group-$GROUP --> wikimedia-articletopics-group-$GROUP.
  7. (Now we only have messages that exist in both sources but are different)
  8. For the following messages, use the CX version and then delete both sources:
    1. cx-articletopics-group-history-and-society -> wikimedia-articletopics-group-history-and-society
    2. cx-articletopics-group-science-technology-and-math -> wikimedia-articletopics-group-science-technology-and-math
  9. For the remaining messages: I don't know. I'd be inclined to keep the GE version because it's probably been around for longer than CX, but that doesn't necessarily mean it's best. Manual review would be ideal but I have no idea how many messages there will be.

I don't think the above can be achieved through any means other than a custom script. Or can it?

> Also, I cannot move stuff while CX is using them, so we should prepare a patch in CX to call WikimediaMessages if available

I can make a quick patch to have CX use the new messages as soon as they exist; the code itself can be updated later. Same for GrowthExperiments. Both should end up being two-liner patches (plus removal of messages).

> and a patch for translatewiki.net to set up this new group. This also affects testing locally.

In r1100553 I'm adding messages to the existing "Wikimedia Messages" group. I thought about creating a new group for topics, but the migration already seemed complex enough and I left that aside. Could we maybe do it after the initial migration?

> I don't think the above can be achieved through any means other than a custom script. Or can it?

Sounds like manual work to me. And a tricky one, in the sense that if we move or delete anything, the next daily export will remove the translations. It requires close coordination to avoid breaking anything.

I will check with my team that they are aware of the implications and okay with that.

Trying to summarize the outstanding questions, so we don't lose the thread:

  • @KStoller-WMF, @PWaigi-WMF: I made a proposal in T380825#10399214 about changing some of the existing labels. This will affect both GrowthExperiments and ContentTranslation, so I'd like to make sure that it's OK from a product perspective.
  • @Nikerabbit: Any feedback on the plan in T380825#10400786? Let me know if you need anything else or in a more specific format. We could also pair on the migration, if you'd like to.
  • @Isaac: In gerrit (thread), we've been trying to find a more accurate name for what we're currently calling "ORES topics". This would be specifically in relation to the taxonomy, and it would need to be a short name. Any suggestions on that?

> 7. (Now we only have messages that exist in both sources but are different)

I assume you are referring here (and in all the other items in that list) to the English message, in the respective en.json files, right? Or do you plan to apply these steps per message?

For example, the English word in both cases might be "Technology", and GrowthExperiments' de.json might have "Technologie", whereas ContentTranslation's de.json might have "Technik". Would that be treated as the messages being the same or as them being different in your series of steps?

> 7. (Now we only have messages that exist in both sources but are different)

> I assume you are referring here (and in all the other items in that list) to the English message, in the respective en.json files, right? Or do you plan to apply these steps per message?

No, I'm planning to do this for every language. I realized only now that this is only mentioned in step 1 of T380825#10400786, but all those steps would be run for each language. Pseudo-code below if that helps.

> For example, the English word in both cases might be "Technology", and GrowthExperiments' de.json might have "Technologie", whereas ContentTranslation's de.json might have "Technik". Would that be treated as the messages being the same or as them being different in your series of steps?

As being different. For the English messages, I've already verified that the GE and CX versions are identical, with the 5 exceptions mentioned in T380825#10399214. But for translations, we might have the problem you are mentioning. My proposed solution is to use the GE version, but just because it's been around for longer.


Pseudocode
<?php

$languages = [ /* list of all languages */ ];
$topics = [ /* list of all topics */ ];
$groups = [ /* list of all groups */ ];

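/**
 * In-memory stand-in for the message store, keyed by "$messageKey/$lang". This is an assumption made so that the
 * sketch can run; a real script would operate on the translatewiki.net message pages instead.
 */
$messages = [];

/** Returns content of the message (the key includes the "/$lang" suffix), or null if it does not exist */
function msg( string $key ): ?string {
	global $messages;
	return $messages[$key] ?? null;
}
/** Moves $from to $to, deleting $from in the process. No-op if $from does not exist */
function move( string $from, string $to ): void {
	global $messages;
	if ( isset( $messages[$from] ) ) {
		$messages[$to] = $messages[$from];
		unset( $messages[$from] );
	}
}
/** Deletes the message, no-op if it does not exist */
function delete_msg( string $key ): void {
	global $messages;
	unset( $messages[$key] );
}
/** Checks whether the message (the key includes the "/$lang" suffix) exists */
function msg_exists( string $key ): bool {
	global $messages;
	return isset( $messages[$key] );
}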

foreach ( $languages as $lang ) {
	// Note, below we always iterate the same topics & groups. In reality, we could remove processed topics from the
	// list after each step.

	// Step 1
	foreach ( $topics as $topic ) {
		if (
			msg( "growthexperiments-homepage-suggestededits-topic-name-$topic/$lang" ) ===
			msg( "cx-articletopics-topic-$topic/$lang" )
		) {
			move( "cx-articletopics-topic-$topic/$lang", "wikimedia-articletopics-topic-$topic/$lang" );
			delete_msg( "growthexperiments-homepage-suggestededits-topic-name-$topic/$lang" );
		}
	}
	// Step 2
	foreach ( $groups as $group ) {
		if (
			msg( "growthexperiments-homepage-suggestededits-topic-group-name-$group/$lang" ) ===
			msg( "cx-articletopics-group-$group/$lang" )
		) {
			move( "cx-articletopics-group-$group/$lang", "wikimedia-articletopics-group-$group/$lang" );
			delete_msg( "growthexperiments-homepage-suggestededits-topic-group-name-$group/$lang" );
		}
	}

	// Step 4
	if ( msg_exists( "growthexperiments-homepage-suggestededits-topic-group-name-geography/$lang" ) ) {
		move(
			"growthexperiments-homepage-suggestededits-topic-group-name-geography/$lang",
			"wikimedia-articletopics-group-geography/$lang"
		);
	}
	delete_msg( "cx-articletopics-group-geography/$lang" );
	$forceGETopics = [ 'biography', 'women' ];
	foreach ( $forceGETopics as $forceGETopic ) {
		if ( msg_exists( "growthexperiments-homepage-suggestededits-topic-name-$forceGETopic/$lang" ) ) {
			move(
				"growthexperiments-homepage-suggestededits-topic-name-$forceGETopic/$lang",
				"wikimedia-articletopics-topic-$forceGETopic/$lang"
			);
		}
		delete_msg( "cx-articletopics-topic-$forceGETopic/$lang" );
	}

	// Step 5
	foreach ( $topics as $topic ) {
		if (
			msg_exists( "growthexperiments-homepage-suggestededits-topic-name-$topic/$lang" ) &&
			!msg_exists( "cx-articletopics-topic-$topic/$lang" )
		) {
			move(
				"growthexperiments-homepage-suggestededits-topic-name-$topic/$lang",
				"wikimedia-articletopics-topic-$topic/$lang"
			);
		} elseif (
			msg_exists( "cx-articletopics-topic-$topic/$lang" ) &&
			!msg_exists( "growthexperiments-homepage-suggestededits-topic-name-$topic/$lang" )
		) {
			move(
				"cx-articletopics-topic-$topic/$lang",
				"wikimedia-articletopics-topic-$topic/$lang"
			);
		}
	}
	// Step 6
	foreach ( $groups as $group ) {
		if (
			msg_exists( "growthexperiments-homepage-suggestededits-topic-group-name-$group/$lang" ) &&
			!msg_exists( "cx-articletopics-group-$group/$lang" )
		) {
			move(
				"growthexperiments-homepage-suggestededits-topic-group-name-$group/$lang",
				"wikimedia-articletopics-group-$group/$lang"
			);
		} elseif (
			msg_exists( "cx-articletopics-group-$group/$lang" ) &&
			!msg_exists( "growthexperiments-homepage-suggestededits-topic-group-name-$group/$lang" )
		) {
			move(
				"cx-articletopics-group-$group/$lang",
				"wikimedia-articletopics-group-$group/$lang"
			);
		}
	}

	// Step 8
	$preferCXGroups = [ 'history-and-society', 'science-technology-and-math' ];
	foreach ( $preferCXGroups as $preferCXGroup ) {
		move(
			"cx-articletopics-group-$preferCXGroup/$lang",
			"wikimedia-articletopics-group-$preferCXGroup/$lang"
		);
		delete_msg( "growthexperiments-homepage-suggestededits-topic-group-name-$preferCXGroup/$lang" );
	}

	// Step 9 (choose GE)
	foreach ( $topics as $topic ) {
		move(
			"growthexperiments-homepage-suggestededits-topic-name-$topic/$lang",
			"wikimedia-articletopics-topic-$topic/$lang"
		);
		delete_msg( "cx-articletopics-topic-$topic/$lang" );
	}
	foreach ( $groups as $group ) {
		move(
			"growthexperiments-homepage-suggestededits-topic-group-name-$group/$lang",
			"wikimedia-articletopics-group-$group/$lang"
		);
		delete_msg( "cx-articletopics-group-$group/$lang" );
	}
}
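
The per-language steps above boil down to a small decision rule for each pair of messages. The sketch below restates that rule as a pure function, which may be easier to review or unit-test than the sequence of moves and deletions. The function name pickTranslation and its parameters are assumptions made for this illustration, and step 9 is resolved in favour of GE only because that is the current proposal, not a settled decision.

```php
<?php
/**
 * Picks the translation to keep for one message in one language.
 *
 * @param string|null $ge GrowthExperiments translation, null if missing
 * @param string|null $cx ContentTranslation translation, null if missing
 * @param bool $forceGE True for the geography group and the biography and women topics (step 4)
 * @param bool $preferCX True for history-and-society and science-technology-and-math (step 8)
 * @return string|null Translation to store under the new key, or null if there is none
 */
function pickTranslation( ?string $ge, ?string $cx, bool $forceGE = false, bool $preferCX = false ): ?string {
	if ( $ge === $cx ) {
		// Steps 1-2: identical in both sources (or missing in both)
		return $ge;
	}
	if ( $forceGE ) {
		// Step 4: GE wins regardless of CX; the CX version is dropped even if GE is missing
		return $ge;
	}
	if ( $ge === null || $cx === null ) {
		// Steps 5-6: only one source has a translation
		return $ge ?? $cx;
	}
	// Both exist and differ. Step 8: CX wins for the two listed groups; step 9: GE otherwise
	return $preferCX ? $cx : $ge;
}
```

With the German example from earlier in the thread, pickTranslation( 'Technologie', 'Technik' ) keeps the GE version, since the two translations differ and the message is in neither exception list.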