
Externallinks in Russian Wikinews is unusually large
Closed, Resolved · Public

Description

It currently sits at 130GB. It alone is responsible for more than 10% of all of s3 (900+ wikis) and is on par with the externallinks table of Commons, which has more than 100M pages.

A quick search showed this:

mysql:research@dbstore1007.eqiad.wmnet [ruwikinews]> select el_to_domain_index, count(*) from externallinks group by el_to_domain_index order by count(*) desc limit 5;
+----------------------------------+----------+
| el_to_domain_index               | count(*) |
+----------------------------------+----------+
| https://org.wikimedia.           | 27202031 |
| https://org.archive.web.         |  6866861 |
| https://org.toolforge.pageviews. |  6800521 |
| https://org.wmflabs.tools.       |  5793662 |
| https://com.twitter.             |  1603551 |
+----------------------------------+----------+
5 rows in set (9 min 5.839 sec)
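For context, the el_to_domain_index column stores the scheme plus the host with its labels reversed (and a trailing dot), so that all links under a domain sort together for prefix scans. A minimal sketch of that transformation (hypothetical helper for illustration; MediaWiki's real implementation is in PHP and handles more edge cases such as IP addresses and ports):

```python
from urllib.parse import urlsplit

def to_domain_index(url: str) -> str:
    """Build an el_to_domain_index-style key: scheme plus reversed host labels."""
    parts = urlsplit(url)
    # Reverse the dot-separated host labels: pageviews.toolforge.org -> org.toolforge.pageviews
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return f"{parts.scheme}://{reversed_host}."

print(to_domain_index("https://pageviews.toolforge.org/some/path"))
# https://org.toolforge.pageviews.
```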

27M links point to just the REST API. This shouldn't be considered an external domain:

| 332698561 | 13972941 | https://org.wikimedia. | /api/rest_v1/metrics/pageviews/per-article/ru.wikinews/all-access/user/%D0%92%D0%B5%D1%81%D0%B5%D0%BD%D0%BD%D1%8F%D1%8F%20%D0%B2%D1%8B%D1%81%D1%82%D0%B0%D0%B2%D0%BA%D0%B0%20%C2%AB%D0%9A%D0%BB%D0%B8%D0%BD%D0%BE%D0%BA-2023%C2%BB/daily/2015070100/2023040315
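Decoding the percent-encoded segment of that row shows it is just a Pageviews REST API call for one of the wiki's own articles:

```python
from urllib.parse import unquote

# Percent-encoded article title from the el_to row above
encoded = "%D0%92%D0%B5%D1%81%D0%B5%D0%BD%D0%BD%D1%8F%D1%8F%20%D0%B2%D1%8B%D1%81%D1%82%D0%B0%D0%B2%D0%BA%D0%B0%20%C2%AB%D0%9A%D0%BB%D0%B8%D0%BD%D0%BE%D0%BA-2023%C2%BB"
print(unquote(encoded))
# Весенняя выставка «Клинок-2023»
```

In other words, a pageview-statistics link for the article "Весенняя выставка «Клинок-2023»" ("Spring exhibition 'Klinok-2023'") on ru.wikinews itself.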

And https://org.wmflabs.tools., https://org.toolforge.pageviews., and https://org.creativecommons. should switch to interwiki links.

Event Timeline

Baseline:

mysql:research@dbstore1007.eqiad.wmnet [ruwikinews]> select count(*) from externallinks;
+----------+
| count(*) |
+----------+
| 60609082 |
+----------+
1 row in set (28.480 sec)

Asking to see if these pages are actually needed

I wonder whether there is a way to provide such download links on the page info page. I remember looking into how to add a custom row there myself, for an unrelated thing, and found it difficult.


This one feels like an API design flaw to me. It shouldn't matter whether you use the action API or the REST API to access info about the very wiki you are linking from.

> Asking to see if these pages are actually needed
>
> I wonder whether there is a way to provide such download links on the page info page. I remember looking into how to add a custom row there myself, for an unrelated thing, and found it difficult.

At this point, custom JS in common.js would be much better than creating 7M pages.

> Asking to see if these pages are actually needed
>
> I wonder whether there is a way to provide such download links on the page info page. I remember looking into how to add a custom row there myself, for an unrelated thing, and found it difficult.

Do you mean the "Page information" (action=info) page? I suppose that could work. The pageinfo-header and pageinfo-footer interface messages allow you to add a section with links above or below the rest of the page. That could work through the MediaWiki: namespace on the wiki.


If you want it to be inside the "Basic information" table, then this could be done via the PageViewInfo extension. This is the extension that adds the "Page views in the past 30 days" row today. One could write a patch in Gerrit to add a link to the API for JSON data from there.

This would mean that these "Statistic" pages don't need to exist, as the default info action of the software would provide these links instead, without requiring millions of individually created, editable, and database-indexed wiki pages.

FWIW, at least one user thinks these pages in Russian Wikinews are useless and we can simply delete all of them. I'll ask in the main forum before moving forward.

At the current deletion speed of around 60 pages per minute, it'll take 71 days to finish.
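As a sanity check on that estimate (assuming roughly 6.1M pages remain to be deleted):

```python
pages_left = 6_100_000   # approximate number of pages still to delete
rate_per_minute = 60     # observed deletion speed
minutes_per_day = 60 * 24

days = pages_left / (rate_per_minute * minutes_per_day)
print(round(days, 1))
# 70.6
```

So ~71 days at a steady 60 deletions per minute.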

@jcrespo this might cause some reduction in backup sizes of s3.

Progress report: 6.1M pages are now left; around 10% have been deleted. Here is the row count of the externallinks table:

mysql:research@dbstore1007.eqiad.wmnet [ruwikinews]> select count(*) from externallinks;
+----------+
| count(*) |
+----------+
| 55219649 |
+----------+
1 row in set (20.300 sec)

Progress update: we have deleted around 2M pages so far; 5M more are left:

mysql:research@dbstore1007.eqiad.wmnet [ruwikinews]> select count(*) from externallinks;
+----------+
| count(*) |
+----------+
| 49028285 |
+----------+
1 row in set (21.682 sec)

Nitpick: as I see it, there are 20M and 50M, respectively.

In my opinion, the next best way to reduce the number of links is to replace the https://ru.wikinews.org/wiki/Шаблон:Социальные_закладки template with a script that dynamically adds these links to the page.

> Nitpick: as I see it, there are 20M and 50M, respectively.

That's 20M rows of externallinks, but 2M pages deleted. I should have been clearer that I'm talking about pages, not rows of externallinks.
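Those two figures are consistent with each other: going from the 60.6M-row baseline to 49.0M rows over ~2M deleted pages works out to roughly 6 externallinks rows per deleted page (the 2M page count is approximate, so the ratio is too):

```python
baseline_rows = 60_609_082   # initial count(*) of externallinks
current_rows = 49_028_285    # count(*) after ~2M page deletions
pages_deleted = 2_000_000    # approximate

rows_removed = baseline_rows - current_rows
print(rows_removed, round(rows_removed / pages_deleted, 1))
# 11580797 5.8
```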

> In my opinion, the next best way to reduce the number of links is to replace the https://ru.wikinews.org/wiki/Шаблон:Социальные_закладки template with a script that dynamically adds these links to the page.

Yeah, that'd be nice too. I'll get to it after the deletion is done.

There are only 31 pages left in that category.

The number of rows has dropped to 26M:

mysql:research@dbstore1007.eqiad.wmnet [ruwikinews]> select count(*) from externallinks;
+----------+
| count(*) |
+----------+
| 26122047 |
+----------+
1 row in set (10.476 sec)

I tried changing the template to use an interwiki link, but due to double escaping it produces an invalid URL and there is no way around it. I think the better option would be T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch).

I'm optimizing all the tables everywhere in T410589: Optimize all core tables, late 2025 so the table will become much smaller soon.

Ladsgroup moved this task from In progress to Done on the DBA board.

FTR:

root@db1157:/srv/sqldata/ruwikinews# ls -Ssh | head
total 81G
 17G externallinks.ibd
 14G logging.ibd
9.7G categorylinks.ibd

Compared to another host that isn't optimized yet:

root@db1175:/srv/sqldata/ruwikinews# ls -Ssh
total 256G
130G externallinks.ibd
 31G categorylinks.ibd
 22G page_props.ibd
 16G logging.ibd