Source links in the monuments database get too long and are truncated
Closed, Resolved (Public)

Event Timeline

Multichill raised the priority of this task to Needs Triage.
Multichill updated the task description.
Multichill added subscribers: Multichill, Ahonc.

Hmmm.

CREATE TABLE `monuments_all_tmp` (
  ...
  `source` varchar(255) NOT NULL DEFAULT '',
   ...

Shall I just up that value? It looks like Labs runs a MySQL recent enough to allow for a VARCHAR of more than 256 (but I suck at DB management :-þ), or should I use TEXT or something?

And what about storing the curid/oldid of the page instead of permalinks? It would be less than 255 chars.

Multichill added a subscriber: jcrespo.

I hope @jcrespo can answer your question, JeanFred.

> The effective maximum length of a VARCHAR in MySQL 5.0.3 and later is subject to the maximum
> row size (65,535 bytes, which is shared among all columns) and the character set used.

We use 5.5 on that server, and the column is not indexed, so no issue. Character set is utf8 there, which means it has to be lower than 1/3 of that value.

You also use a non-fixed row type, which means it will not affect the storage of current columns.

We are going to enforce sql_mode TRADITIONAL on production soon; after that, we will think about enforcing it on labs/tools, which will make partial inserts fail completely instead of silently, avoiding truncation of values.

P.S.: Stop using 255 as if it has some meaning: the storage space for varchar(255) and varchar(256) is the same. The jump from 1 to 2 length bytes happens around varchar(~85) for utf8. There is no reason to use varchar(255) over varchar(260) (except if it has some logical reason on the application side). Also, embrace utf8mb4 (the real utf8 implementation) or binary for better international compatibility.
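
A quick back-of-the-envelope check of that "~85" figure (a sketch, nothing from the heritage code): with MySQL's 3-bytes-per-character utf8, the length prefix grows from 1 to 2 bytes as soon as the column's maximum byte length exceeds 255.

# Illustrative only: where the 1-byte vs 2-byte length prefix boundary falls
# for a utf8 (3 bytes per char) VARCHAR column.
for chars in (84, 85, 86, 255, 260):
    max_bytes = 3 * chars
    prefix_bytes = 1 if max_bytes <= 255 else 2
    print('varchar(%d): up to %d bytes, %d length byte(s)' % (chars, max_bytes, prefix_bytes))
# varchar(85) still fits a 1-byte prefix (255 bytes); varchar(86) needs 2.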

Thanks for the input @jcrespo and @Multichill.

Looks like upping the value is all we need. But I’m afraid I’m not familiar enough with that part of the Monuments database to do that quickly. I would appreciate help or very specific directions. :)

If you look at fill_table_monuments_all.sql you'll see:
source varchar(255) NOT NULL DEFAULT '',
That can be raised to 510. On the next update the field will be bigger.

monument_tables.py needs to be modified too:
extra_cols = " source varchar(255) NOT NULL DEFAULT (etc)

After changing that you'll need to run it and apply the SQL statements to recreate the tables. You can also use ALTER TABLE (see the sketch below).
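
For reference, a minimal sketch of the ALTER TABLE route using the same MySQLdb pattern the bot uses elsewhere; the host, database, credentials and table name below are placeholders, not the real configuration:

import MySQLdb

# Placeholder connection values; the real ones live in the tool's config.
conn = MySQLdb.connect(
    host='example-db-host', db='example_monuments_db', user='...', passwd='...',
    use_unicode=True, charset='utf8')
cursor = conn.cursor()
# Widen the column in place instead of recreating the table.
cursor.execute(
    "ALTER TABLE `monuments_all` "
    "MODIFY `source` varchar(510) NOT NULL DEFAULT ''")
conn.close()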

And why not cut the title from the URL, leaving only the oldid? That would be enough to reach the page, and we would not need to enlarge the database fields.

I +1 @Ahonc's comments about using the oldid or curid of a page instead of its name. It would solve both the problem of exceeding the length and the problem that non-Latin chars get percent-encoded.

@Lokal_Profil What do you think of the suggestion of storing the permalink with curid or oldid ?

Change 281279 had a related patch set uploaded (by Lokal Profil):
Increase source column from varchar(256) to varchar(510)

https://gerrit.wikimedia.org/r/281279

> @Lokal_Profil What do you think of the suggestion of storing the permalink with curid or oldid ?

It's probably the better suggestion long-term but I guess it requires a lot of fiddling to identify all the places which rely on it just being an easy-to-use URL. Additionally it would require T131681.

The patch I added just increases the size. I would say use that as a stop-gap measure and break out curid/oldid as a non-prioritised enhancement?

Sounds good to me yes. Patch is now merged. Is it fine, or do we need to ALTER the tables ?

Yeah, tables need to be altered or regenerated.

Change 281279 merged by jenkins-bot:
Increase source column from varchar(256) to varchar(510)

https://gerrit.wikimedia.org/r/281279

> Sounds good to me yes. Patch is now merged. Is it fine, or do we need to ALTER the tables ?

> Yeah, tables need to be altered or regenerated.

Aren't they regenerated as part of update_monuments.sh which runs fill_table_monuments_all.sql?

So before that, the 'country create sql files' need to be recreated and then run to recreate the tables.

This shows that the patch worked as expected. We just need to check that this has been updated in all of the tables (I've only checked ru and uk).

Lokal_Profil claimed this task.

I ran

SELECT COLUMN_NAME, COLUMN_TYPE, TABLE_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME = "source";

and got P2977 so all should be fine.

I then used

SELECT COLUMN_NAME, COLUMN_TYPE, TABLE_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME = "source"
AND COLUMN_TYPE != "varchar(510)";

to get the following tables which we should probably delete at some point.

COLUMN_NAME | COLUMN_TYPE  | TABLE_NAME
source      | varchar(255) | monuments_by_(be-x-old)
source      | varchar(255) | monuments_ru-old_(ru)
source      | varchar(255) | monuments_se-arbetsliv_(sv)
source      | varchar(255) | monuments_se-fornminne_(sv)

Lokal_Profil removed a project: Patch-For-Review.

Reopening since, in the course of T112415, I discovered that the source field is still too short in some cases (see e.g. here).

I guess we can simply bump the column size again?

Technically we could probably simply store the oldid. i.e. //de.wikipedia.org/w/index.php?oldid=146926857 instead of //de.wikipedia.org/w/index.php?title=Liste_der_Baudenkm%C3%A4ler_in_Breitenthal_%28Schwaben%29&oldid=146926857.

The main downside is that we often just strip the oldid bit to get the source-list page. That would have to be replaced by some type of look-up instead which I assume will be way more costly.

The alternative is that we skip urlencoding (which requires https://gerrit.wikimedia.org/r/290215)
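
To make the look-up cost concrete, resolving a stored oldid back to its list page would mean roughly one extra API request per row (a sketch against the standard MediaWiki API, not code from the tool):

import requests

# One HTTP round trip per stored oldid, instead of a simple regexp on the URL.
resp = requests.get('https://de.wikipedia.org/w/api.php', params={
    'action': 'query', 'revids': 146926857, 'prop': 'info', 'format': 'json'})
page = list(resp.json()['query']['pages'].values())[0]
print(page['title'])  # the list page the revision belongs to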

> Technically we could probably simply store the oldid. i.e. //de.wikipedia.org/w/index.php?oldid=146926857 instead of //de.wikipedia.org/w/index.php?title=Liste_der_Baudenkm%C3%A4ler_in_Breitenthal_%28Schwaben%29&oldid=146926857.

Yes

> The main downside is that we often just strip the oldid bit to get the source-list page. That would have to be replaced by some type of look-up instead which I assume will be way more costly.

Any idea where this is used exactly in our code?

Essentially anywhere we do a regexp on source. An incomplete list:

  • api/CommonFuntions.php:matchWikiprojectLink()
  • erfgoedbot/unused_monument_images.py:processCountry()
  • erfgoedbot/add_coord_to_articles.py:addCoords()
  • ...

While a lookup would probably work fine for the once-a-day Python scripts, I'm more worried about the API calls, where I wouldn't want to do a look-up per item per API call (see the sketch below).
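
As a sketch of what those regexps rely on (the real patterns in the files above may differ): with a title-carrying source the list page is recoverable offline, while an oldid-only source would need an API round trip at the same spot.

import re

SOURCE_RE = re.compile(
    r'//(?P<lang>[^./]+)\.(?P<family>[^./]+)\.org/w/index\.php'
    r'\?title=(?P<title>[^&]+)&oldid=(?P<oldid>\d+)')

source = ('//de.wikipedia.org/w/index.php?'
          'title=Liste_der_Baudenkm%C3%A4ler_in_Breitenthal_%28Schwaben%29'
          '&oldid=146926857')
match = SOURCE_RE.match(source)
# The list page falls straight out of the stored URL, no API call needed.
list_page = '//{lang}.{family}.org/wiki/{title}'.format(**match.groupdict())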

> The alternative is that we skip urlencoding (which requires https://gerrit.wikimedia.org/r/290215)

I think this is probably our best option, if it is a feature the pywikibot community wants.

The change has been merged into pywikibot. If we switch over to this method we would have to add the percent_encoded=False argument to the page.permalink() call used to store source.

We would then need to look over any function which outputs source to see if these would need to have a url-encoding bit added.
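
Roughly what the change would look like at the point where source is generated (a sketch; the real call sits in the harvesting code, and the page title here is just the example from above):

import pywikibot

site = pywikibot.Site('de', 'wikipedia')
page = pywikibot.Page(site, u'Liste der Baudenkmäler in Breitenthal (Schwaben)')
# With percent_encoded=False the stored permalink keeps the original characters
# instead of the much longer percent-encoded form.
source = page.permalink(percent_encoded=False)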

Change 309858 had a related patch set uploaded (by Lokal Profil):
[NOT READY] Store plain permalink instead of urlencoded one

https://gerrit.wikimedia.org/r/309858

So, taking a stab at avoiding the whole issue of URL encoding in the first place, I spotted that we are connecting to the database with the charset='latin1' parameter.

@JeanFred @Multichill: Does this mean we are not using utf8 encoding in the tables? If so why and are things likely to start acting up due to that?

In addition to changing the data in the monuments table this would also change the data in the id_dump table. But I can't figure out the purpose of this table (other than tools/id_checker) and so I can't see if any other endpoints now need encoding. Any insights are welcome.

Our database is completely utf8:

def connect_to_monuments_database():
    """Connect to the mysql monuments database, if it fails, go down in flames."""
    conn = MySQLdb.connect(
        host=mconfig.db_server, db=mconfig.db, user=config.db_username,
        passwd=config.db_password, use_unicode=True, charset='utf8')

MediaWiki, on the other hand, is utf8 binary encoded in latin1:

def connect_to_commons_database():
    '''
    Connect to the commons mysql database, if it fails, go down in flames
    '''
    conn = MySQLdb.connect('commonswiki.labsdb', db='commonswiki_p',
                           user=config.db_username, passwd=config.db_password,
                           use_unicode=True, charset='latin1')

Check out the beautiful code at https://www.mediawiki.org/wiki/Toolserver:Code_snippets#Fix_UTF-8_encoded_as_latin-1
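
The gist of that snippet is the classic re-interpretation round trip (a generic illustration, not the linked code verbatim):

def fix_latin1_as_utf8(text):
    """Repair text whose UTF-8 bytes were read through a latin1 connection."""
    return text.encode('latin-1').decode('utf-8')

print(fix_latin1_as_utf8(u'Baudenkm\xc3\xa4ler'))  # Baudenkmäler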

Thanks for the explanation. On the plus side that means stuff should still work for us after we drop urlencoding. The downside is that we are likely to be stuck with the latin1 in MediaWiki for the foreseeable future =(

That leaves id_dump as the only remaining question mark.

> In addition to changing the data in the monuments table this would also change the data in the id_dump table. But I can't figure out the purpose of this table (other than tools/id_checker) and so I can't see if any other endpoints now need encoding. Any insights are welcome.

> That leaves id_dump as the only remaining question mark.

I do believe that id_dump is only used for id_checker.

Thanks. Then I only need to spin the patch up on a dev machine to check it works and that pywikibot 2.0 contains my permalink patch.

@Multichill @JeanFred I can't find the thread where we previously discussed pywikibot 2.0 vs. the master branch so I'm raising this here.

It's become increasingly obvious to me that the 2.0 branch is lagging so far behind the master branch that even including specific updates in it is becoming a struggle. My main argument for using it was the ease of pip install and what that means for testing. Since then I've done a bit of pip testing and found out that git+https://github.com/wikimedia/pywikibot-core.git in requirements works fine, so my previous objection is moot.

Are there any other reasons for staying on the 2.0 branch as opposed to the master branch?

As far as I can remember this task and T139492 could both be fixed fairly easily if we were on the master branch.

Just switch to master, but don't do automatic updates. Every once in a while, do a manual update, log it, and check the next day that everything ran OK. The URL is https://gerrit.wikimedia.org/r/pywikibot/core.git btw (GitHub is just a mirror).

I’m fine with switching to master.

There are two different things to take care of:

  • The pywikibot used in unit tests and Docker, which is installed via requirements.txt. The solution is indeed to put the git repo there together with a specific hash (see the sketch after this list).
  • The pywikibot used in production: This is a manual check-out of pywikibot on disk, used via pwb.py.

We should probably align both.
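
For the requirements.txt side, the pinned line would look roughly like this (the commit hash is a placeholder):

# requirements.txt
git+https://github.com/wikimedia/pywikibot-core.git@<commit-hash>#egg=pywikibot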

https://gerrit.wikimedia.org/r/#/c/354690/ pinned a newer version of pywikibot in requirements (3.0.20170403).

Next up is checking this one out in production.

Mentioned in SAL (#wikimedia-cloud) [2017-08-02T13:32:05Z] <Lokal_Profil> Upgdate deployed pywikibot to 3.0.20170403 (T112460)

So that should hopefully have worked. I've updated the commit message of https://gerrit.wikimedia.org/r/#/c/309858/ to indicate it is no longer blocked. Merging and deploying that should hopefully fix this once and for all.

Change 309858 merged by jenkins-bot:
[labs/tools/heritage@master] Store plain permalink instead of urlencoded one

https://gerrit.wikimedia.org/r/309858

This can be closed once we have verified that the link now works (i.e. is not truncated). Another good candidate is ge_(ka), where source links are even longer.

Just spotted that this wasn't deployed. Crossing my fingers and praying that this works

ua_(uk) now looks OK. Waiting for ge_(ka) to finish before resolving, though.

It works! Boom, a two-year-old bug closed.