
Provide namespace IDs and names in the databases similar to toolserver.namespace
Closed, ResolvedPublic

Description

Toolserver has the local table toolserver.namespace on all databases that provides the IDs and names of all namespaces in a wiki:

mysql> SELECT * FROM toolserver.namespace LIMIT 5;
+----------+------------------+-------+----------------------+
| dbname   | domain           | ns_id | ns_name              |
+----------+------------------+-------+----------------------+
| abwiki_p | ab.wikipedia.org |    10 | Ашаблон              |
| abwiki_p | ab.wikipedia.org |     7 | Обсуждение файла     |
| abwiki_p | ab.wikipedia.org |    11 | Обсуждение шаблона   |
| abwiki_p | ab.wikipedia.org |     8 | Амедиавики           |
| abwiki_p | ab.wikipedia.org |     9 | Обсуждение MediaWiki |
+----------+------------------+-------+----------------------+
5 rows in set (0.00 sec)

mysql>

This could probably be generated easily by iterating over all wikis, querying the namespaces via the API (http://de.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces) and updating them in the database whenever the web call succeeds. An even more interesting approach would of course be to clone operations/mediawiki-config and extract the information from there.
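For illustration, here is a minimal sketch of extracting the ID-to-name mapping from such a siteinfo response. The helper name is hypothetical, and it assumes the legacy JSON format, where the localized namespace name is stored under the "*" key:

```python
import json

def parse_namespaces(siteinfo):
    # Map namespace ID -> local namespace name; the legacy JSON format
    # of meta=siteinfo stores the local name under the "*" key.
    return {int(ns["id"]): ns["*"]
            for ns in siteinfo["query"]["namespaces"].values()}

# Truncated sample of what the siteinfo call returns (assumed legacy format):
sample = json.loads("""{
  "query": {"namespaces": {
    "0":  {"id": 0,  "case": "first-letter", "*": ""},
    "1":  {"id": 1,  "case": "first-letter", "*": "Diskussion", "canonical": "Talk"},
    "10": {"id": 10, "case": "first-letter", "*": "Vorlage",    "canonical": "Template"}
  }}
}""")
print(parse_namespaces(sample))  # {0: '', 1: 'Diskussion', 10: 'Vorlage'}
```

A per-wiki updater would fetch this JSON over HTTP for each dbname and upsert the resulting pairs into the table.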


Version: unspecified
Severity: trivial
URL: http://article.gmane.org/gmane.org.wikimedia.labs/1381
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=48626
https://bugzilla.wikimedia.org/show_bug.cgi?id=56301

Details

Reference
bz48625

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 1:17 AM
bzimport added a project: Toolforge.
bzimport set Reference to bz48625.

On second thought: It would be much more elegant and logical to have a table namespace on *each* database (not one per server) that provides only the IDs (ns_id) and names (ns_name) of that particular database.

(In reply to comment #1)

On second thought: It would be much more elegant and logical to have a table
namespace on *each* database (not one per server) that provides only the IDs
(ns_id) and names (ns_name) of that particular database.

There are use cases for having it all together in one database (https://www.wikidata.org/wiki/User_talk:Legoktm/namespaces is an example); however, having it also available on each database would be useful.

I think for backwards-compatibility reasons we should have one central database structured like the Toolserver one, and then we can have tables on each individual database.

I have put most of this data in p50380g50577__wmf_wiki_info on tools-db. It was generated using the API, so some information is missing/wrong for private wikis which do not allow meta=siteinfo API access.

(In reply to comment #3)

I have put most of this data in p50380g50577__wmf_wiki_info on tools-db. It was
generated using the API, so some information is missing/wrong for private
wikis which do not allow meta=siteinfo API access.

Thanks! Could you post the script here as an attachment, please?

(In reply to comment #3)

I have put most of this data in p50380g50577__wmf_wiki_info on tools-db. It was
generated using the API, so some information is missing/wrong for private
wikis which do not allow meta=siteinfo API access.

Maybe you can make it a "tool"

(In reply to comment #0)

but an even more interesting approach would of course be to clone
operations/mediawiki-config and extract the information from there.

I'm currently doing things this way (in my script it's not only the namespace names but the whole environment, so a single table by itself wouldn't help me), but there is a caveat: to resolve all namespace names you need to have all extensions deployed and ready, at least those that create new namespaces, and you have to follow every WMF deployment of new extensions.

(In reply to comment #4)

(In reply to comment #3)

I have put most of this data in p50380g50577__wmf_wiki_info on tools-db. It was
generated using the API, so some information is missing/wrong for private
wikis which do not allow meta=siteinfo API access.

Thanks! Could you post the script here as an attachment, please?

You can find it in /data/project/wmf-wiki-info/PyMediaWikiClient/script.py
The script won't work on its own because it also needs some dependencies in certain locations (MediaWikiClient, a small Python class which makes it easy to deal with MediaWiki's API, and WMF's operations/mediawiki-config.git for the s[1-7].dblist files).
Any feedback/review/etc. is welcome.

I think we should have a discussion about what the current "toolserver" database is, what we want in the future, and whether we care about breaking backward compatibility.

Some of the design decisions in some of the database tables could probably be re-thought, but only if we're willing to break the current interfaces.

In addition, I think we should only rely on MediaWiki's API for this information (with user authentication, as necessary). This is the cleanest and sanest way to accurately get this information, as far as I know.

This should probably be discussed on-wiki so that we can gather actual requirements and figure out /how/ before we go ahead and implement something.

(In reply to comment #9)

This should probably be discussed on-wiki so that we can gather actual
requirements and figure out /how/ before we go ahead and implement something.

Agreed. That's pretty much what I had in mind in comment 8. Feel free to start that discussion. :-)

I just hit this issue trying to manually run a database report. The lack of a namespace table is a bit painful. I imagine this issue is preventing a number of database reports from being moved to Labs. Hrm. Perhaps everyone should just query the MediaWiki API themselves. Hmm.

Querying the API is probably the only way to get fresh data; the namespace setup is difficult to parse from the configuration files and may also be modified by extensions (which vary from project to project).

If we were to provide metadata tables with namespaces and other project-specific information, it'd have to be queried from the API anyways; the only benefit of having it in tables would be to allow joins (don't know how useful that'd be) at the cost of the information lagging behind reality by 0.5 update period on average.

I wrote a simple PHP function that queries the MediaWiki API for namespaces.
See the "getNamespacesAPI" function here:
https://github.com/pietrodn/intersect-contribs/blob/master/pietrodnUtils.php

My two cents:
Generating database reports via queries, like [[:en:Wikipedia:Database reports]], is quite impossible without a namespace meta-table.
Getting this information via the API means making thousands of API calls instead of a single join.

Beta16: I'm not sure how to get "thousands of API calls"; if you are generating a database report for project X, getting the list of namespaces for that project is exactly one API call. OTOH, doing a join while querying some table on that project's database doubles the number of index lookups, at best, and potentially multiplies the number of fetched rows and increases the data set size a lot.

I'm strongly inclined to agree with comment 12 and comment 15. I'm not sure this is a valid bug. I say this as the original author of [[Wikipedia:Database reports]], mentioned in comment 14. Scripts and tools should probably query the MediaWiki Web API themselves.

Given that this would provide nothing that cannot be obtained faster and more reliably by an API call, and that its use would add load to the database for no gain except a minor programming shorthand; closing WONTFIX.

Tool maintainers are perfectly allowed to make such a database available for general use (and maintain it accordingly).

I suggest re-opening this, as there are at least two use cases the table serves:

  1. creating full page titles and URLs at the SQL level; for this, at least the canonical namespaces (Talk, User, User talk, etc.) have to be provided. The Toolserver wiki shows the following example query:

SELECT
    ns_name,
    page_title
FROM enwiki_p.page
JOIN toolserver.namespacename
    ON ns_id = page_namespace AND dbname = 'enwiki_p'
WHERE page_namespace = 6
    AND ns_is_favorite = TRUE
LIMIT 1;

Note that it's impossible to construct a URL without the namespace names -- you cannot link to http://en.wikipedia.org/wiki/NS3:valhallasw, for example.
  2. providing localized page titles for multiple sites (see comment 2).

I agree with Merlijn -> reopened.

This belongs in the application, and not the database. The cost of doubling index lookups on queries is nontrivial for something that can be done at near-zero cost at the applicative level.

Comment 18 example 1 is exactly the use case that is to be /avoided/; that join is pointless and expensive when the application could simply look up the namespace number in a hash when presenting the result.

Closing again; this will not be supported by tool labs.

Marc, we don't accept no. This is widely used on the Toolserver, so we need an alternative. It's not near-zero cost, and it is not pointless and expensive.

Do you think volunteer time is free? What about all the tools that are currently using this on the Toolserver and are now forced to move to Labs?

It is pointless and expensive, and can be done in a line or two of PHP or perl; and probably no more in pretty much any other language. Do you not realize that the query above causes an index lookup and row fetch in the database for every single page?

Indeed, given that query, you more than triple the resources required to run the query, as well as increase the size of the result set by sending redundant data at every row.

You /have/ an alternative: use one line of code at the beginning of your script to fetch the namespaces info, and look the name up when outputting it.

Does that mean small changes in scripts that used to rely on offloading their work to the database? Yes. It's certainly a much smaller change than was required when we added edit tokens, for instance. APIs and interface evolve. Obsolete junk is pruned all the time. This is one of those times.

Reopening this bug will not change those facts, nor will it magically alter how databases work to make it sane to push what is a trivial hash lookup in the application into an expensive join at the database. So please don't; I'd rather not have to start cracking down on disputes in bugzilla (which isn't the right venue for this anyways).
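The application-side lookup described above ("fetch the namespaces info once, look the name up when outputting") can be sketched in a few lines of Python. All names here are hypothetical; a real script would build ns_map once from the meta=siteinfo API call rather than hard-coding it:

```python
def full_title(ns_map, page_namespace, page_title):
    # Prefix the local namespace name; the main namespace (0) has no prefix.
    name = ns_map[page_namespace]
    title = page_title.replace("_", " ")
    return f"{name}:{title}" if name else title

# ns_map would be fetched once via the API at script start; hard-coded here.
ns_map = {0: "", 6: "File", 10: "Template"}

# Rows as they would come back from a plain query on the page table:
rows = [(6, "Example.jpg"), (10, "Infobox"), (0, "Main_Page")]
print([full_title(ns_map, ns, t) for ns, t in rows])
# ['File:Example.jpg', 'Template:Infobox', 'Main Page']
```

The SELECT then only needs page_namespace and page_title, with no join against a namespace table.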

(In reply to comment #22)

It is pointless and expensive, and can be done in a line or two of PHP or
Perl; and probably no more in pretty much any other language. Do you not
realize that the query above causes an index lookup and row fetch in the
database for every single page?

Those are two claims I'd like to see evidence for :-). I don't like MySQL (or MariaDB) very much, but any query planner worth its salt will cache the namespacename table heavily.

Indeed, given that query, you more than triple the resources required to run
the query, as well as increase the size of the result set by sending
redundant data at every row.
[...]

If we are counting bytes, we should close down Tools *now*.

Most of the queries run on Toolserver/Tools are inefficient, and that fits nicely with MediaWiki being written in PHP.

If it is your professional opinion that with the current setup we need more database servers or bigger network cables to implement toolserver.namespace & Co., that's totally fine; but then the solution isn't WONTFIX, but taking it up the chain to whoever signs the order for new equipment.

Two thoughts here:

  1. This is clearly a breaking change for the transition from Toolserver to Labs. Is this appropriately documented somewhere as such?
  2. Is the source code that controls/maintains meta[_p] in a Labs project? I think it should be just another Labs project®, if it isn't already.

I also need all namespace names of all wikis in one table on each cluster.

One additional use case in my script is replacing the full page titles in the languagelinks, iwlinks and wikidata-sitelink tables with the namespace ID and the title without the namespace, as needed for further queries against the page table in this script.

Doing this at the application layer would be very inefficient and slow, because I am doing this with about 6.5 million pages (all living people) twice a day. Currently this part of my tools runs on the Toolserver, joining namespacename, in less than 5 minutes.

Running API queries to get this namespace info for all 500 wikis (Wikipedia + Commons + Wikisource + Wikiquote) would take longer. And running 6.5 million single update queries on a user table instead of one join would also stress the database server needlessly.

So the solution for me is to create this table on my own.
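For comparison, the title-splitting step itself is simple at the application level (whether it is fast enough for 6.5 million titles twice a day is exactly the point in dispute here). A hypothetical sketch; a real script would build name_to_id from the API, including namespace aliases:

```python
def split_title(name_to_id, full_title):
    # Split a full page title like "File:Example.jpg" into (ns_id, title).
    # Titles without a known namespace prefix belong to the main namespace (0).
    if ":" in full_title:
        prefix, rest = full_title.split(":", 1)
        ns_id = name_to_id.get(prefix)
        if ns_id is not None:
            return ns_id, rest
    return 0, full_title

name_to_id = {"Talk": 1, "User": 2, "File": 6}  # per-wiki map from the API
print(split_title(name_to_id, "File:Example.jpg"))  # (6, 'Example.jpg')
print(split_title(name_to_id, "Main Page"))         # (0, 'Main Page')
```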

(In reply to Tim Landscheidt from comment #23)

It is pointless and expensive, and can be done in a line or two of PHP or
Perl; and probably no more in pretty much any other language. Do you not
realize that the query above causes an index lookup and row fetch in the
database for every single page?

Those are two claims I'd like to see evidence for :-). I don't like MySQL
(or MariaDB) very much, but any query planner worth its salt will cache the
namespacename table heavily.

Indeed, given that query, you more than triple the resources required to run
the query, as well as increase the size of the result set by sending
redundant data at every row.
[...]

If we are counting bytes, we should close down Tools *now*.

Most of the queries run on Toolserver/Tools are inefficient, and that fits
nicely with MediaWiki being written in PHP.

If it is your professional opinion that with the current setup we need more
database servers or bigger network cables to implement toolserver.namespace
& Co., that's totally fine; but then the solution isn't WONTFIX, but taking
it up the chain to whoever signs the order for new equipment.

With no response, I ran some tests today on enwiki_p with toolserver.namespace dumped from the Toolserver to s51073__toolserver.namespace. I did see an increase in looking up and returning an additional field (as expected), but nothing major that would justify buying new equipment. So reopening this issue so that it can be solved in software.

(In reply to Merlijn van Deen from comment #18)

Note that it's impossible to construct a URL without the namespace names --
you cannot link to http://en.wikipedia.org/wiki/NS3:valhallasw, for example.

You can select page_id and do //en.wikipedia.org/?curid=$pageid

I checked with several people.
I can move the Toolserver DB MMP off TS and migrate it into Labs.
When I am done, something like toolserverdb_p should be available on all user DB hosts.

If anyone wants to join the project I'd be glad since it is intended to be a MMP anyway.

Thanks a lot for the proposal, nosy! Indeed, I would like others to join you, because "we" (as in WMDE) cannot / will not maintain a tool that so many other tools depend on in the long run.

Any volunteers? If there aren't any, it might not be such a good idea after all...

Did the first steps: most of the scripts already run and the DBs get filled.
I still don't know how to test the data with regard to validity.
I'd be glad if some of you who need it anyway would try to use s51892_toolserverdb on any of the DB instances.

Give me a ring once you are satisfied with the result, I can rename the database to something more mnemonic for you.

(Also, unless your database name ends with _p, you will need to give grants to it)

Any comments on Marlen's work so far? Did anyone test it?

All I can do is get a list of tables in that DB. Everything else is denied.

Little status update... According to Nosy, this is not fully done, yet.

Nosy is on it. Long term maintenance of this tool is still an open question though.

Dear all, please verify and close if okay. Thanks, nosy!

No complaints in almost a month. Closing the ticket.

Reopening: would this be merged into meta_p, or renamed without an s**_ prefix?

Renaming without prefix makes it easy to keep maintaining with the current tool. Just say the word, and I'll do the rename.

Word :). Please rename the DB to toolserver.

If it's not too complicated, I would prefer to have an additional DB called toolserver, so I can migrate and leave the old DB in place until it's done.

Wait, if you want an additional DB named toolserver, where do you want me to put s51073__toolserver?

(Also, if you want everyone to be able to select in a database, it normally needs to be named ending in _p)

I'd need a DB called toolserver. :D
It'd be fine if I can only get toolserver_p.
The rest of the DBs of the user can be left untouched.

Latest plan is to implement a view called toolserverdb that points to the original DB.
It would cover:

s51892_toolserverdb_p.language
s51892_toolserverdb_p.namespace
s51892_toolserverdb_p.namespacename
s51892_toolserverdb_p.servermapping
s51892_toolserverdb_p.wiki

(The view is named toolserverdb_p to give all users the SELECT right, and it lives on tools.labsdb.)

That has been created. Please note that (a) the views will not automatically track schema changes, but that shouldn't be an issue since those are legacy tables, and (b) the data in those tables is explicitly maintained by the community and may fall out of sync with reality unless rigorously maintained - caveat ūtilisor.

Issues with the data should be reported to the toolserverdb tool maintainers (https://tools.wmflabs.org/?list#toollist-toolserverdb).


@Nosy79 fyi: MariaDB [s51892_toolserverdb]> select * from namespace limit 1;
ERROR 1142 (42000): SELECT command denied to user 'u2916'@'10.68.16.8' for table 'namespace'

I guess the DB is called s51892_toolserverdb_p (with _p at the end for public access).

jeremyb reopened this task as Open. (Edited) Mar 9 2015, 7:26 AM

MariaDB [s51892_toolserverdb_p]> select * from namespacename;
ERROR 144 (HY000): Table './s51892_toolserverdb_p/namespacename' is marked as crashed and last (automatic?) repair failed

That was labsdb1001. Apparently 1002 and 1003 are OK.

@Nosy how is this being maintained? cron job? do you want to add a few people to the tool so they can help fix it if it breaks? or will you be able to maintain this for a while?

As I noted in T50625#488761 the correct view to use is named toolserverdb_p and it lives on tools.labsdb

As far as I know, this is the authoritative view.

The database exists, and is enduser-provided. Nothing for ops to do here.

Dispenser subscribed.

Use View as suggested by Coren:
mysql -h tools.labsdb toolserverdb_p

ERROR 1049 (42000): Unknown database 'toolserverdb_p'

Fall back to @Nosy's view, which is how old?

SELECT TABLE_NAME, UPDATE_TIME
FROM information_schema.tables
WHERE TABLE_SCHEMA="s51892_toolserverdb_p";
TABLE_NAME      UPDATE_TIME
language        2015-03-01 03:00:15
namespace       2015-03-26 09:31:17
namespace_test  2014-08-07 02:52:38
namespacename   2015-10-03 15:07:23
servermapping   2015-03-01 03:00:16
wiki            2015-03-26 09:31:16

But we're not missing any wikis, right?

SELECT family, COUNT(DISTINCT wiki.dbname) AS Missing
FROM meta_p.wiki
LEFT JOIN s51892_toolserverdb_p.namespace
       ON namespace.dbname=CONCAT(wiki.dbname, "_p")
WHERE namespace.dbname IS NULL
GROUP BY 1;
family       Missing
special      3
wikibooks    22
wikidata     2
wikimania    4
wikimedia    2
wikipedia    20
wikiquote    15
wikisource   4
wikiversity  1
wikivoyage   18
wiktionary   23

We've still got coverage for the important ones right?

SELECT page_namespace AS "Missing Namespaces"
FROM (SELECT DISTINCT page_namespace FROM enwiki_p.page) AS ns_ids
LEFT JOIN s51892_toolserverdb_p.namespace
       ON namespace.dbname="enwiki_p" AND ns_id=page_namespace
WHERE ns_name IS NULL;
Missing Namespaces
 118  Draft
 119  Draft talk
 828  Module
 829  Module talk
2300  Gadget
bd808 added subscribers: jkroll, Tobi_WMDE_SW, bd808.

This is quite a feat of thread necromancy you have performed, @Dispenser. It looks like you are probably reporting two separate and distinct issues at once.

For the first issue, I don't see anything in T50625#488761 that tells us how @coren created the view in the first place, but I have a sneaking suspicion that it was a manually created view rather than something that is automatically maintained. This is probably how the view ended up disappearing. The easiest thing to do today would probably just be to update any documentation to name the s51892_toolserverdb_p database itself rather than the view alias that was created.

For the second issue the toolserverdb tool maintainers will need to get involved.

bd808 removed coren as the assignee of this task. Mar 26 2017, 9:51 PM
bd808 moved this task from Triage to Tools on the Cloud-Services board.
bd808 added a subscriber: coren.

The problem was solved. Re-opening an old task to get some new work done is probably not the best approach. See also https://krausefx.com/blog/scaling-open-source-communities

bd808 claimed this task.

Reclosing this old bug. The issues mentioned in T50625#3125422 should be filed as new tickets for the toolserverdb tool and either a doc update or a request for an aliasing view with the previously published name.