
Undeploy and archive ActiveAbstract
Closed, ResolvedPublic

Description

The sole purpose of this extension in production is to build the "Extracted page abstracts for Yahoo" dumps. I think providing such dumps no longer makes sense in 2024:

  • Yahoo! no longer has its own search engine; it is backed by Bing now.
  • If Yahoo! or any other company needs abstract dumps, they should either build them themselves (which is quite easy) or use WME. We are not here to serve big tech.
  • This dump was created in 2005, when bandwidth was expensive and far more limited than today's internet capacity. These days, people can simply download the full dump (current version only) and do any transformations they need.
  • The download numbers are quite low, and even those are probably mostly curious people or crawlers.
  • With current technology, if we want to provide "summary" dumps, we can do better than taking the first x bytes of each article. An LLM-based dump would be much more useful than the status quo.
  • It imposes a non-negligible cost on us:

Extension archival checklist:

  • On-wiki documentation
    • Archive documentation on mediawiki.org (https://www.mediawiki.org/wiki/Extension:ActiveAbstract): replace page contents with {{Archived extension|last revision id before archiving|task=T######}} (for extensions); replace T###### with this task's number.
    • If the documentation page was translatable, remove the <translate> tags, visit Special:PageTranslation, and click "remove from translation" (if you don't have the translation administrator right, ask a user who does).
    • Update Wikidata item (https://www.wikidata.org/wiki/Q21676088) associated with documentation page
      • add statement Abandonware (Q281039) to instance of (P31) together with qualifier start time (P580) = the YYYY-MM-DD date that you decided to archive the extension (generally per edit history)
      • add qualifier end time (P582) = the YYYY-MM-DD (same date as above) to instance of (P31) = MediaWiki extension (Q6805426)
  • Phabricator
    • Mark all Phabricator tasks for the extension as either Declined or Invalid. Add a comment pointing to this task for reference when doing so.
    • Archive the extension's Phabricator project, ActiveAbstract.
    • Edit the description of the ActiveAbstract Phabricator project to add a link to this ticket.
  • Translatewiki.net/translations
    • If the extension is deployed on Wikimedia sites but it is known that it will not receive significant feature updates or be deployed to new wikis, make sure that its project ids (usually "ext-extensionname") appear in the groups/MediaWiki/WikimediaLegacyAgg.yaml file in the translatewiki Gerrit repository, and not in WikimediaMainAgg.yaml, WikimediaAdvancedAgg.yaml, etc. (If it also has an api group, it should remain in WikimediaTechnicalAgg.yaml.)
    • If the extension is no longer deployed on Wikimedia sites, remove it from all Wikimedia*Agg.yaml files. (If it was ever deployed, by this time it's most likely in WikimediaLegacyAgg.yaml or WikimediaTechnicalAgg.yaml.)
    • If the extension is going to be completely archived and no longer developed, remove it completely from translatewiki.net by making sure that its project IDs don't appear in any of the following files:
  • Configuration/tests/integrations/etc.
  • Repositories

Event Timeline


IIRC these (and the OAI feeds) were added back in the day when the WMF got some corporate contribution to provide specialised data feeds. I imagine any contractual obligations have long expired (if they even existed), but I don't know who could verify that.

In principle it's nice to have some support for search engines other than Google, but these days we have various HTML dumps and wikitext parsing APIs. Perhaps a short announcement or documentation update could point out how to extract the same information from the HTML dumps.
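
For illustration, a minimal Python sketch of pulling lead-paragraph snippets out of the Enterprise HTML dumps; the dump file name and the article_body.html field are assumptions based on the published Enterprise schema, so verify them against the current dump format before relying on this:

import json
import tarfile
from html.parser import HTMLParser

class FirstParagraph(HTMLParser):
    """Collect the text content of the first <p> element."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.parts.append(data)

DUMP = "nlwiki-NS0-20250301-ENTERPRISE-HTML.json.tar.gz"  # hypothetical file name

with tarfile.open(DUMP, "r:gz") as tar:
    for member in tar:                        # each member is an NDJSON file
        if not member.isfile():
            continue
        for line in tar.extractfile(member):  # one JSON object per article
            article = json.loads(line)
            parser = FirstParagraph()
            parser.feed(article["article_body"]["html"])  # assumed schema field
            print(article["name"], "->", "".join(parser.parts)[:100])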

I asked legal to make sure we don't accidentally violate any binding agreement.

> IIRC these (and the OAI feeds) were added back in the day when the WMF got some corporate contribution to provide specialised data feeds. I imagine any contractual obligations have long expired (if they even existed), but I don't know who could verify that.

WME has looked into precedents, and it seems the agreement was closed in 2010.

> In principle it's nice to have some support for search engines other than Google, but these days we have various HTML dumps and wikitext parsing APIs. Perhaps a short announcement or documentation update could point out how to extract the same information from the HTML dumps.

They probably can also use short descriptions (which didn't exist back then).
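
For what it's worth, the short description and a plain-text lead extract can nowadays be fetched together from the REST summary endpoint. A small sketch (the endpoint and its "extract"/"description" fields are per the public REST API documentation, but double-check before relying on them):

import json
import urllib.parse
import urllib.request

def page_summary(title, wiki="en.wikipedia.org"):
    # The REST summary endpoint returns a plain-text lead extract
    # ("extract") plus the short description ("description").
    url = "https://%s/api/rest_v1/page/summary/%s" % (
        wiki, urllib.parse.quote(title, safe=""))
    req = urllib.request.Request(url, headers={"User-Agent": "abstract-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data.get("extract", ""), data.get("description", "")

print(page_summary("Amsterdam"))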

It seems to have been quiet on the mailing lists so far. If it is still quiet on February 7th, 2025, let's go ahead and stop producing them.

Let's start to prepare for that work by scoping out the required steps both to stop producing the dump in a way that is easily reversible and to disable it entirely. Based on the feedback we get, we will decide which path to take.

@xcollazo or @Ladsgroup please outline the work required. ty!

Change #1108844 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/dumps@master] Stop producing Yahoo! abstract dumps

https://gerrit.wikimedia.org/r/1108844

This patch would be the easiest way to stop producing the dumps. The next step would be to remove all mentions of abstract dumps in the dumps 1.0 Python scripts: that includes ./xmldumps-backup/xmlabstracts.py plus several classes, such as AbstractDump(Dump) and AbstractFileLister(OutputFileLister) in xmldumps-backup/dumps/xmljobs.py and RecombineAbstractDump(RecombineDump) in xmldumps-backup/dumps/recombinejobs.py (plus configs, etc.). Once that's merged and deployed, we can undeploy and archive the ActiveAbstract extension.

Hi team,

I am an engineer from Yahoo Search working on the wiki dataset. Thanks for all the support!

From our understanding, we are no longer consuming the abstract dump; instead, we generate the abstracts from the Enterprise dump.

However, we are not 100% sure whether the data is fully retired, as we just took over the project last month.
Could anyone point me to the download page or download link? We can double-check on our side.
Also, do we have any timeline for removing the data dump?

Thanks in advance!
Jerry

Thanks for checking! The URLs and file names contain the word "abstract". For example, this is the one for nlwiki: https://dumps.wikimedia.org/nlwiki/20241220/nlwiki-20241220-abstract.xml.gz
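
A quick way to peek at the format while these files still exist, using the nlwiki URL above (the <doc>, <title>, <url> and <abstract> element names reflect the feed format as I remember it, so double-check against a real file):

import gzip
import urllib.request
import xml.etree.ElementTree as ET

URL = "https://dumps.wikimedia.org/nlwiki/20241220/nlwiki-20241220-abstract.xml.gz"

with urllib.request.urlopen(URL) as resp:
    with gzip.open(resp) as xml_stream:
        # Stream the feed and print the first <doc> entry, then stop.
        for _, elem in ET.iterparse(xml_stream):
            if elem.tag == "doc":
                print(elem.findtext("title"))
                print(elem.findtext("url"))
                print(elem.findtext("abstract"))
                break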

Thank you! @Ladsgroup

We have checked on our side and did not find any script or code accessing the pattern "*abstract.xml.gz". I will assume it has been deprecated. :)
I will also keep an eye on the ticket status and we will check again once the data is removed.

Thank you for bringing this to our attention.
Jerry

> It seems to have been quiet on the mailing lists so far. If it is still quiet on February 7th, 2025, let's go ahead and stop producing them.

Feb 7th has passed and the Eagles have won, but there have been no objections or complaints about the abstract dumps. Shall we stop producing them? https://gerrit.wikimedia.org/r/1108844 should be the starting point.

+1 to move ahead and stop this dump.

Change #1108844 merged by jenkins-bot:

[operations/dumps@master] Stop producing Yahoo! abstract dumps

https://gerrit.wikimedia.org/r/1108844

Mentioned in SAL (#wikimedia-operations) [2025-02-13T12:19:27Z] <ladsgroup@deploy2002> Started deploy [dumps/dumps@2e0a7a5]: Stop producing Yahoo! abstract dumps (T382069)

Mentioned in SAL (#wikimedia-operations) [2025-02-13T12:19:35Z] <ladsgroup@deploy2002> Finished deploy [dumps/dumps@2e0a7a5]: Stop producing Yahoo! abstract dumps (T382069) (duration: 00m 07s)

> +1 to move ahead and stop this dump.

Deployed. In the next run, there shouldn't be any Yahoo! dumps anymore. After the next run, let's clean up the dumps code (the current patch only removes the job, to make it easier to revert).

Change #1119486 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/dumps@master] Remove abstract dumps infrastructure

https://gerrit.wikimedia.org/r/1119486

Two dump runs have now gone by without Yahoo! abstracts, and I haven't heard a single complaint so far. Shall we axe the code and the extension, @VirginiaPoundstone?

Change #1119486 merged by jenkins-bot:

[operations/dumps@master] Remove abstract dumps infrastructure

https://gerrit.wikimedia.org/r/1119486

Mentioned in SAL (#wikimedia-operations) [2025-03-10T10:45:47Z] <ladsgroup@deploy2002> Started deploy [dumps/dumps@afcb740]: Removing Yahoo! abstract dumps code (T382069)

Mentioned in SAL (#wikimedia-operations) [2025-03-10T10:45:54Z] <ladsgroup@deploy2002> Finished deploy [dumps/dumps@afcb740]: Removing Yahoo! abstract dumps code (T382069) (duration: 00m 07s)

This is now removed. @Jdforrester-WMF shall we undeploy ActiveAbstract now?

Change #1126084 had a related patch set uploaded (by Jforrester; author: Jforrester):

[operations/mediawiki-config@master] Stop loading the ActiveAbstract extension for dumps

https://gerrit.wikimedia.org/r/1126084

Change #1126085 had a related patch set uploaded (by Jforrester; author: Jforrester):

[integration/config@master] Zuul: [mediawiki/extensions/ActiveAbstract] Mark as archived

https://gerrit.wikimedia.org/r/1126085

> [integration/config@master] Zuul: [mediawiki/extensions/ActiveAbstract] Mark as archived
>
> https://gerrit.wikimedia.org/r/1126085

For clarity, is the plan to archive the ActiveAbstract extension at the same time as/immediately after undeploying it from WMF production? If so, it might be good to create a separate task to track the extension’s archival, to ensure that all the cleanup steps that need to be done get completed.

> [integration/config@master] Zuul: [mediawiki/extensions/ActiveAbstract] Mark as archived
>
> https://gerrit.wikimedia.org/r/1126085

> For clarity, is the plan to archive the ActiveAbstract extension at the same time as/immediately after undeploying it from WMF production?

This is that task.

> If so, it might be good to create a separate task to track the extension’s archival, to ensure that all the cleanup steps that need to be done get completed.

No, there are very few useful bits of that template for this case.

> For clarity, is the plan to archive the ActiveAbstract extension at the same time as/immediately after undeploying it from WMF production? If so, it might be good to create a separate task to track the extension’s archival, to ensure that all the cleanup steps that need to be done get completed.

I was thinking the same thing. The Gerrit repo archiving steps, Phabricator tag archiving steps, mediawiki.org archiving steps, and Wikidata archiving steps seem relevant to this.

Change #1126084 merged by jenkins-bot:

[operations/mediawiki-config@master] Stop loading the ActiveAbstract extension for dumps

https://gerrit.wikimedia.org/r/1126084

Mentioned in SAL (#wikimedia-operations) [2025-03-11T10:37:41Z] <ladsgroup@deploy2002> Started scap sync-world: Backport for [[gerrit:1126084|Stop loading the ActiveAbstract extension for dumps (T382069)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-11T11:05:19Z] <ladsgroup@deploy2002> ladsgroup, jforrester: Backport for [[gerrit:1126084|Stop loading the ActiveAbstract extension for dumps (T382069)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-11T11:30:00Z] <ladsgroup@deploy2002> Started scap sync-world: Backport for [[gerrit:1126084|Stop loading the ActiveAbstract extension for dumps (T382069)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-11T11:34:49Z] <ladsgroup@deploy2002> ladsgroup, jforrester: Backport for [[gerrit:1126084|Stop loading the ActiveAbstract extension for dumps (T382069)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-11T11:43:37Z] <ladsgroup@deploy2002> Finished scap sync-world: Backport for [[gerrit:1126084|Stop loading the ActiveAbstract extension for dumps (T382069)]] (duration: 13m 36s)

> For clarity, is the plan to archive the ActiveAbstract extension at the same time as/immediately after undeploying it from WMF production? If so, it might be good to create a separate task to track the extension’s archival, to ensure that all the cleanup steps that need to be done get completed.

> I was thinking the same thing. The Gerrit repo archiving steps, Phabricator tag archiving steps, mediawiki.org archiving steps, and Wikidata archiving steps seem relevant to this.

I was about to boldly add a trimmed-down version of the archival checklist to the end of this task's description (to allow for easier tracking of the necessary cleanup steps); however, given how many of the checkboxes I found myself copying over/rewriting from the standard archival task template, to be honest I feel like it'd be simplest to just open a template archival task to ensure everything that needs to be done gets done.

> If so, it might be good to create a separate task to track the extension’s archival, to ensure that all the cleanup steps that need to be done get completed.

> No, there are very few useful bits of that template for this case.

Respectfully, I disagree with this - the majority of the checkboxes on the template seem to apply in this case, and IMO it would be simplest and easiest to track them using the standard task template.

Change #1126537 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/vagrant@master] dumps: Remove ActiveAbstract

https://gerrit.wikimedia.org/r/1126537

> For clarity, is the plan to archive the ActiveAbstract extension at the same time as/immediately after undeploying it from WMF production? If so, it might be good to create a separate task to track the extension’s archival, to ensure that all the cleanup steps that need to be done get completed.

> I was thinking the same thing. The Gerrit repo archiving steps, Phabricator tag archiving steps, mediawiki.org archiving steps, and Wikidata archiving steps seem relevant to this.

> I was about to boldly add a trimmed-down version of the archival checklist to the end of this task's description (to allow for easier tracking of the necessary cleanup steps); however, given how many of the checkboxes I found myself copying over/rewriting from the standard archival task template, to be honest I feel like it'd be simplest to just open a template archival task to ensure everything that needs to be done gets done.

If you make a duplicate of this task, it will be merged into this one. Feel free to add checkboxes to this task's description if you wish, but please do not create a new task.

> If so, it might be good to create a separate task to track the extension’s archival, to ensure that all the cleanup steps that need to be done get completed.

> No, there are very few useful bits of that template for this case.

> Respectfully, I disagree with this - the majority of the checkboxes on the template seem to apply in this case, and IMO it would be simplest and easiest to track them using the standard task template.

Most of the complexity of that task is for human-facing extensions that third parties might have used (hence the Wikidata stuff) or that have i18n (hence the TWN stuff), neither of which applies here.

> Most of the complexity of that task is for human-facing extensions that third parties might have used (hence the Wikidata stuff) or that have i18n (hence the TWN stuff), neither of which applies here.

To be fair, ActiveAbstract is present on translatewiki. All of the template task's checkboxes under "Configuration/tests/integrations/etc.", "Repositories" & "Phabricator" also seem to apply in this case AFAICS.
Anyways, I'm not gonna create a template archival task for ActiveAbstract if there's opposition, but I do feel that creating one (or at least, copying over the majority of the checkboxes to this task's description) would be beneficial in this case :)

A_smart_kitten updated the task description.

> Feel free to add checkboxes to this task's description if you wish, but please do not create a new task.

Didn't see this when typing the last reply (d'oh!) - I've copied the majority of the template checklist over to this task. Obviously feel free to edit/refine as needed :)

Change #1126085 merged by jenkins-bot:

[integration/config@master] Zuul: [mediawiki/extensions/ActiveAbstract] Mark as archived

https://gerrit.wikimedia.org/r/1126085

Change #1126552 had a related patch set uploaded (by Jforrester; author: Jforrester):

[mediawiki/extensions/ActiveAbstract@master] Empty repo, no longer used

https://gerrit.wikimedia.org/r/1126552

Mentioned in SAL (#wikimedia-releng) [2025-03-11T13:19:22Z] <James_F> Zuul: [mediawiki/extensions/ActiveAbstract] Mark as archived, for T382069

Change #1126552 merged by Jforrester:

[mediawiki/extensions/ActiveAbstract@master] Empty repo, no longer used

https://gerrit.wikimedia.org/r/1126552

Change #1126537 merged by jenkins-bot:

[mediawiki/vagrant@master] dumps: Remove ActiveAbstract

https://gerrit.wikimedia.org/r/1126537

Change #1126587 had a related patch set uploaded (by Pppery; author: Pppery):

[translatewiki@master] Drop active abstract

https://gerrit.wikimedia.org/r/1126587

Change #1126587 merged by jenkins-bot:

[translatewiki@master] Drop active abstract

https://gerrit.wikimedia.org/r/1126587

I'm presently reviewing validator.nu (the HTML5 validator), mostly to check the licensing of their files.

It ships "language-profiles" files created by language-detection (langdetect.jar).

Their wiki states:

> This is a language detection library implemented in plain Java. (aliases: language identification, language guessing)
>
> Presentation: http://www.slideshare.net/shuyo/language-detection-library-for-java
>
> Abstract
>   • Generate language profiles from Wikipedia abstract xml
>   • Detect language of a text using naive Bayesian filter
>   • 99% over precision for 53 languages

From what I understand, these profiles were generated from your XML dumps.

From a licensing point of view, I guess this means that, since they derive from your data, the files are licensed under the same terms as Wikipedia text: CC-BY-SA-4.0 OR GFDL-1.1-no-invariants-or-later.

However, the language profiles are more than 10 years old (maybe more than 15) and cover approximately 50 languages. It is still possible to create language profiles based on your abstract.xml files, provided that these files can be found, which is possible with dumps up to 2025-02-01.

The method to create language profiles is as below (an example with the last file you generated for zh_yue); it is also documented here and there.

# Fetch langdetect and its JSON dependency from Maven Central.
curl -O https://repo1.maven.org/maven2/net/arnx/jsonic/1.3.10/jsonic-1.3.10.jar
curl -O https://repo1.maven.org/maven2/com/cybozu/labs/langdetect/1.1-20120112/langdetect-1.1-20120112.jar

# Fetch the last abstract dump produced for zh_yue (the 2025-02-01 run).
curl -O https://dumps.wikimedia.org/zh_yuewiki/20250201/zh_yuewiki-20250201-abstract.xml.gz

# The generated profile is written into the profiles/ directory.
mkdir -p profiles

# Generate the zh_yue profile from the abstract file in the current directory.
java -cp jsonic-1.3.10.jar:langdetect-1.1-20120112.jar com.cybozu.labs.langdetect.Command --genprofile -d . zh_yue

You mention the possibility of extracting the same information as in abstract.xml from a full dump, but that would be a huge download, especially if we want it for every language. Anyway, if one really wants to do so, would you share an XSL stylesheet somewhere to do that conversion, and say which XML file/dump should be used as input?
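
For illustration, here is a crude Python sketch of the idea (this is not the actual ActiveAbstract algorithm; the input is a pages-articles dump, and the local file name and the 100-character cutoff are illustrative assumptions):

import bz2
import re
import xml.etree.ElementTree as ET

DUMP = "nlwiki-latest-pages-articles.xml.bz2"  # hypothetical local file

def crude_abstract(wikitext, limit=100):
    # Very rough markup stripping; the real extension did more than this.
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                 # innermost templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # unwrap wiki links
    text = re.sub(r"<[^>]+>", "", text)                            # drop HTML-ish tags
    text = re.sub(r"'{2,}", "", text)                              # bold/italic markup
    return " ".join(text.split())[:limit]

with bz2.open(DUMP, "rb") as dump:
    for _, elem in ET.iterparse(dump):
        if elem.tag.rsplit("}", 1)[-1] == "page":    # ignore the export namespace
            title = elem.findtext(".//{*}title")     # "{*}" needs Python 3.8+
            text = elem.findtext(".//{*}text") or ""
            print(title, "->", crude_abstract(text))
            elem.clear()                             # keep memory bounded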

Do you have another suggestion (for end users)? What could be done?

By the way, could you confirm what the license of the files computationally derived from abstract.xml would be?

Change #1154871 had a related patch set uploaded (by Novem Linguae; author: Novem Linguae):

[mediawiki/extensions@master] delete ActiveAbstract extension

https://gerrit.wikimedia.org/r/1154871

Change #1154871 merged by Jforrester:

[mediawiki/extensions@master] delete ActiveAbstract extension

https://gerrit.wikimedia.org/r/1154871

I think the last 3 unmarked items in the checklist at the top of this ticket need someone with CI shell access, a Gerrit admin, and a Wikimedia member on GitHub, respectively.

Mentioned in SAL (#wikimedia-releng) [2025-07-07T16:26:00Z] <James_F> jforrester@doc1004:~$ sudo -u doc-uploader rm -rf /srv/doc/cover-extensions/ActiveAbstract/ # For T382069

> I think the last 3 unmarked items in the checklist at the top of this ticket need someone with CI shell access, a Gerrit admin, and a Wikimedia member on GitHub, respectively.

I'm not a Gerrit admin, but I've done the other two.

Thanks James.

@taavi, would you be willing to do "reparent on All-Archived-Project" using your Gerrit admin rights?

taavi claimed this task.
taavi updated the task description.

Done.