Page MenuHomePhabricator

SQL function to recover the normal hostname, to install on Wiki Replica instances
Open, MediumPublic

Description

After T343718: Drop old columns of externallinks it is no longer easy to just build the links in SQL.

Available columns are:
el_id | el_from | el_to_domain_index | el_to_path

el_to_domain_index contains the domain reversed, as in http://org.wikimedia.commons.

Goal: create a function we could install in the cloud db servers which is able to convert el_to_domain_index into a form that could be CONCAT() to el_to_path producing a url just with SQL, without needing to do that part in code.

This might be done as a stored function, with a loop in PL/SQL, or as a User-Defined function in C/C++

The PHP version of this is in includes/ExternalLinks/LinkFilter.php

Event Timeline

bd808 renamed this task from SQL function to recover the normal hostname, to install on wmfcloud db replicas to SQL function to recover the normal hostname, to install on Wiki Replica instances.Aug 24 2023, 1:06 AM
bd808 updated the task description. (Show Details)

Untested function based on ideas from https://stackoverflow.com/a/37231257/8171:

1DELIMITER $$
2CREATE OR REPLACE
3FUNCTION domain_index_to_domain(_domain_index TEXT) RETURNS TEXT DETERMINISTIC
4BEGIN
5 DECLARE _domain TEXT DEFAULT NULL;
6 DECLARE _next TEXT DEFAULT NULL;
7 DECLARE _nextlen INT DEFAULT NULL;
8
9 SET _domain_index = TRIM(_domain_index);
10 -- Move protocol to _domain
11 SET _domain = CONCAT(SUBSTRING_INDEX(_domain_index, '://', 1), '://');
12 SET _domain_index = SUBSTRING_INDEX(_domain_index, '://', -1);
13 iterator:
14 LOOP
15 -- exit loop when there's nothing left
16 IF CHAR_LENGTH(TRIM(_domain_index)) = 0 OR _domain_index IS NULL THEN
17 LEAVE iterator;
18 END IF;
19
20 SET _next = SUBSTRING_INDEX(_domain_index, '.', 1);
21 SET _nextlen = CHAR_LENGTH(_next);
22
23 SET _domain = CONCAT(_domain, TRIM(_next), '.');
24 SET _domain_index = INSERT(_domain_index, 1, _nextlen + 1, '');
25 END LOOP;
26 RETURN TRIM(TRAILING '.' FROM _domain);
27END;
28$$
29DELIMITER ;

DBAs don't maintain the views.

I'm personally not very inclined to do a function here. Not that it won't be useful, more on the side of that this table is quite massive (and that's why we are dropping a lot of it) and adding such functions would make it quite hard to actually use the table the right way. i.e.

SELECT * from externallinks WHERE domain_index_to_domain(el_to_domain_index) = 'google.com';

would run the function on every row of extlinks table. In Commons this table has more 600M rows and can easily choke the db's CPU and be extremely slow for something super simple. So if you have the domain and reverse it internally before passing to the query, there is no needed to have the function in the first place. What I want to say is that introducing this function might encourage "bad" use and cause issues.

One other option to consider is that to have an extra column to hold the domain in the cloud only. But we are running out of space in the dbs (the table in commons was 300GB before this change) and specially since cloud replica co-host two or more sections in one physical host, they will have issues sooner.

A middle ground is to introduce that function in toolforge python library. Specially since it's not as easy as reversing dot pieces (ports, username, etc. need to be considered, see the php function)

I'm personally not very inclined to do a function here. Not that it won't be useful, more on the side of that this table is quite massive (and that's why we are dropping a lot of it) and adding such functions would make it quite hard to actually use the table the right way. i.e.

SELECT * from externallinks WHERE domain_index_to_domain(el_to_domain_index) = 'google.com';

would run the function on every row of extlinks table. In Commons this table has more 600M rows and can easily choke the db's CPU and be extremely slow for something super simple. So if you have the domain and reverse it internally before passing to the query, there is no needed to have the function in the first place. What I want to say is that introducing this function might encourage "bad" use and cause issues.

SELECT CONCAT(domain_index_to_domain(el_to_domain_index), el_to_path) from externallinks is the sort of thing that folks were asking for so that they can get the full URL via SQL without needing to add in post-processing in PHP/Python/JS/Ruby/Perl/whatever. Quarry and Superset are used by quite a lot of community members.

Though related, this does not block the parent task.

fnegri triaged this task as Medium priority.Jul 1 2024, 5:55 PM
fnegri subscribed.

SELECT CONCAT(domain_index_to_domain(el_to_domain_index), el_to_path) from externallinks is the sort of thing that folks were asking for so that they can get the full URL via SQL without needing to add in post-processing

If this is the need, could we define a function and expose it through a view only, so that we control how the function is used, and we prevent users from adding it to a WHERE clause like in the example above from @Ladsgroup?

I don't know if MariaDB could allow a function being defined but not used in WHERE (and only in SELECT for example). Not to mention that someone could do something like this:

SELECT CONCAT(domain_index_to_domain(el_to_domain_index), el_to_path) as url from externallinks WHERE url = 'google.com/'

And we are back to the pathological mode.

To emphasize. It's be nice to have that function in quarry/superset itself and deal with it in those services but not on mariadb level.

I don't know if MariaDB could allow a function being defined but not used in WHERE (and only in SELECT for example)

But you can define a function in database priv, and create a VIEW using that function in database unpriv. If the VIEW is created by root, then users with access to database unpriv can query the VIEW (which internally uses the function), but they cannot access the function itself.

Simple test case:

# as root
CREATE DATABASE priv;
CREATE DATABASE unpriv;
CREATE USER unpriv IDENTIFIED BY '***';
GRANT ALL PRIVILEGES ON unpriv.* TO unpriv@'%';
CREATE TABLE priv.test (id INT);
INSERT INTO priv.test VALUES(1);
CREATE FUNCTION priv.double_of (n INT) RETURNS INT RETURN (SELECT 2*n);
CREATE VIEW unpriv.test_p AS SELECT priv.double_of(id) FROM priv.test;

# as user "unpriv" (with access only to db "unpriv")
SELECT * FROM unpriv.test_p; # this works
SELECT priv.double_of(1); # this is blocked (execute command denied)

To emphasize. It's be nice to have that function in quarry/superset itself and deal with it in those services but not on mariadb level.

This is definitely possible, but not straightforward: neither Quarry nor Superset have built-in features to perform custom processing on the results that are returned by the db, they just show the raw results.

That's a nice trick. Feel free to add this to the existing views.