Source links in the monuments database get too long and are truncated
Closed, Resolved (Public)

Event Timeline

Multichill raised the priority of this task to Needs Triage.
Multichill updated the task description.
Multichill added subscribers: Multichill, Ahonc.

Hmmm.

CREATE TABLE `monuments_all_tmp` (
  ...
  `source` varchar(255) NOT NULL DEFAULT '',
   ...

Shall I just up that value? It looks like Labs runs a MySQL recent enough to allow for a VARCHAR of more than 256 (but I suck at DB management :-þ), or should I use TEXT or something?

And what about storing the curid/oldid of the page instead of permalinks? It would be less than 255 chars.

Multichill added a subscriber: jcrespo.

I hope @jcrespo can answer your question, JeanFred.

> The effective maximum length of a VARCHAR in MySQL 5.0.3 and later is subject to the maximum
> row size (65,535 bytes, which is shared among all columns) and the character set used.

We use 5.5 on that server, and the column is not indexed, so no issue. Character set is utf8 there, which means it has to be lower than 1/3 of that value.

You also use a non-fixed row type, which means it will not affect the storage of current columns.

We are going to enforce sql_mode TRADITIONAL on production soon; after that, we will think about enforcing it on labs/tools, which will make partial inserts fail completely instead of silently, avoiding truncation of values.

P.S.: Stop using 255 as if it has some meaning: the storage space for varchar(255) and varchar(256) is the same. The jump from 1 to 2 length bytes happens around varchar(~85) for utf8. There is no reason to use varchar(255) over varchar(260) (except if it has some logical reason on the application side). Also, embrace utf8mb4 (the real utf8 implementation) or binary for better international compatibility.
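
A quick back-of-the-envelope check of that "~85" figure (a sketch, nothing from the heritage code): with MySQL's 3-bytes-per-character utf8, the length prefix grows from 1 to 2 bytes as soon as the column's maximum byte length exceeds 255.

# Illustrative only: where the 1-byte vs 2-byte length prefix boundary falls
# for a utf8 (3 bytes per char) VARCHAR column.
for chars in (84, 85, 86, 255, 260):
    max_bytes = 3 * chars
    prefix_bytes = 1 if max_bytes <= 255 else 2
    print('varchar(%d): up to %d bytes, %d length byte(s)' % (chars, max_bytes, prefix_bytes))
# varchar(85) still fits a 1-byte prefix (255 bytes); varchar(86) needs 2.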

Thanks for the input @jcrespo and @Multichill.

Looks like upping the value is all we need. But I’m afraid I’m not familiar enough with that part of the Monuments database to do that quickly. I would appreciate help or very specific directions. :)

If you look at fill_table_monuments_all.sql you'll see:
source varchar(255) NOT NULL DEFAULT '',
That can be raised to 510. On the next update the field will be bigger.

monument_tables.py needs to be modified too:
extra_cols = " source varchar(255) NOT NULL DEFAULT (etc)

After changing that you'll need to run it and apply the SQL statements to recreate the tables. You can also use ALTER TABLE (see the sketch below).
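
For reference, a minimal sketch of the ALTER TABLE route using the same MySQLdb pattern the bot uses elsewhere; the host, database, credentials and table name below are placeholders, not the real configuration:

import MySQLdb

# Placeholder connection values; the real ones live in the tool's config.
conn = MySQLdb.connect(
    host='example-db-host', db='example_monuments_db', user='...', passwd='...',
    use_unicode=True, charset='utf8')
cursor = conn.cursor()
# Widen the column in place instead of recreating the table.
cursor.execute(
    "ALTER TABLE `monuments_all` "
    "MODIFY `source` varchar(510) NOT NULL DEFAULT ''")
conn.close()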

And why not cut the title from the URL, leaving only the oldid? That would be enough to reach the page, and we would not need to enlarge the database fields.

I +1 @Ahonc's comments about using the oldid or curid of a page instead of its name. It would solve both the problem of exceeding the length and the problem that non-Latin chars get percent-encoded.

@Lokal_Profil What do you think of the suggestion of storing the permalink with curid or oldid ?

Change 281279 had a related patch set uploaded (by Lokal Profil):
Increase source column from varchar(256) to varchar(510)

https://gerrit.wikimedia.org/r/281279

> @Lokal_Profil What do you think of the suggestion of storing the permalink with curid or oldid ?

It's probably the better suggestion long-term but I guess it requires a lot of fiddling to identify all the places which rely on it just being an easy-to-use URL. Additionally it would require T131681.

The patch I added just increases the size. I would say use that as a stop-gap measure and break out curid/oldid as a non-prioritised enhancement?

Sounds good to me yes. Patch is now merged. Is it fine, or do we need to ALTER the tables ?

Yeah, tables need to be altered or regenerated.

Change 281279 merged by jenkins-bot:
Increase source column from varchar(256) to varchar(510)

https://gerrit.wikimedia.org/r/281279

> Sounds good to me yes. Patch is now merged. Is it fine, or do we need to ALTER the tables ?

> Yeah, tables need to be altered or regenerated.

Aren't they regenerated as part of update_monuments.sh which runs fill_table_monuments_all.sql?

So before that, the 'country create sql files' need to be recreated and then run to recreate the tables.

This shows that the patch worked as expected. We just need to check that this has been updated in all of the tables (I've only checked ru and uk).

Lokal_Profil claimed this task.

I ran

SELECT COLUMN_NAME, COLUMN_TYPE, TABLE_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME = "source";

and got P2977 so all should be fine.

I then used

SELECT COLUMN_NAME, COLUMN_TYPE, TABLE_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME = "source"
AND COLUMN_TYPE != "varchar(510)";

to get the following tables which we should probably delete at some point.

COLUMN_NAME | COLUMN_TYPE  | TABLE_NAME
source      | varchar(255) | monuments_by_(be-x-old)
source      | varchar(255) | monuments_ru-old_(ru)
source      | varchar(255) | monuments_se-arbetsliv_(sv)
source      | varchar(255) | monuments_se-fornminne_(sv)

Lokal_Profil removed a project: Patch-For-Review.

Reopening since, in the course of T112415, I discovered that the source field is still too short in some cases (see e.g. here).

I guess we can simply bump the column size again?

Technically we could probably simply store the oldid. i.e. //de.wikipedia.org/w/index.php?oldid=146926857 instead of //de.wikipedia.org/w/index.php?title=Liste_der_Baudenkm%C3%A4ler_in_Breitenthal_%28Schwaben%29&oldid=146926857.

The main downside is that we often just strip the oldid bit to get the source-list page. That would have to be replaced by some type of look-up instead which I assume will be way more costly.

The alternative is that we skip urlencoding (which requires https://gerrit.wikimedia.org/r/290215)
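
To make the look-up cost concrete, resolving a stored oldid back to its list page would mean roughly one extra API request per row (a sketch against the standard MediaWiki API, not code from the tool):

import requests

# One HTTP round trip per stored oldid, instead of a simple regexp on the URL.
resp = requests.get('https://de.wikipedia.org/w/api.php', params={
    'action': 'query', 'revids': 146926857, 'prop': 'info', 'format': 'json'})
page = list(resp.json()['query']['pages'].values())[0]
print(page['title'])  # the list page the revision belongs to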

> Technically we could probably simply store the oldid. i.e. //de.wikipedia.org/w/index.php?oldid=146926857 instead of //de.wikipedia.org/w/index.php?title=Liste_der_Baudenkm%C3%A4ler_in_Breitenthal_%28Schwaben%29&oldid=146926857.

Yes

> The main downside is that we often just strip the oldid bit to get the source-list page. That would have to be replaced by some type of look-up instead which I assume will be way more costly.

Any idea where this is used exactly in our code?

Essentially anywhere we do a regexp on source. An incomplete list:

  • api/CommonFuntions.php:matchWikiprojectLink()
  • erfgoedbot/unused_monument_images.py:processCountry()
  • erfgoedbot/add_coord_to_articles.py:addCoords()
  • ...

While a lookup would probably work fine for the once-a-day Python scripts, I'm more worried about the API calls, where I wouldn't want to do a look-up per item per API call (see the sketch below).
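
As a sketch of what those regexps rely on (the real patterns in the files above may differ): with a title-carrying source the list page is recoverable offline, while an oldid-only source would need an API round trip at the same spot.

import re

SOURCE_RE = re.compile(
    r'//(?P<lang>[^./]+)\.(?P<family>[^./]+)\.org/w/index\.php'
    r'\?title=(?P<title>[^&]+)&oldid=(?P<oldid>\d+)')

source = ('//de.wikipedia.org/w/index.php?'
          'title=Liste_der_Baudenkm%C3%A4ler_in_Breitenthal_%28Schwaben%29'
          '&oldid=146926857')
match = SOURCE_RE.match(source)
# The list page falls straight out of the stored URL, no API call needed.
list_page = '//{lang}.{family}.org/wiki/{title}'.format(**match.groupdict())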

> The alternative is that we skip urlencoding (which requires https://gerrit.wikimedia.org/r/290215)

I think this is probably our best option, if it is a feature the pywikibot community wants.

The change has been merged into pywikibot. If we switch over to this method we would have to add the percent_encoded=False argument to the page.permalink() call used to store source.

We would then need to look over any function which outputs source to see if these would need to have a url-encoding bit added.
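
Roughly what the change would look like at the point where source is generated (a sketch; the real call sits in the harvesting code, and the page title here is just the example from above):

import pywikibot

site = pywikibot.Site('de', 'wikipedia')
page = pywikibot.Page(site, u'Liste der Baudenkmäler in Breitenthal (Schwaben)')
# With percent_encoded=False the stored permalink keeps the original characters
# instead of the much longer percent-encoded form.
source = page.permalink(percent_encoded=False)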

Change 309858 had a related patch set uploaded (by Lokal Profil):
[NOT READY] Store plain permalink instead of urlencoded one

https://gerrit.wikimedia.org/r/309858

So, taking a stab at avoiding the whole issue of URL encoding in the first place, I spotted that we are connecting to the database with the charset='latin1' parameter.

@JeanFred @Multichill: Does this mean we are not using utf8 encoding in the tables? If so why and are things likely to start acting up due to that?

In addition to changing the data in the monuments table this would also change the data in the id_dump table. But I can't figure out the purpose of this table (other than tools/id_checker) and so I can't see if any other endpoints now need encoding. Any insights are welcome.

Our database is completely utf8:

def connect_to_monuments_database():
    """Connect to the mysql monuments database, if it fails, go down in flames."""
    conn = MySQLdb.connect(
        host=mconfig.db_server, db=mconfig.db, user=config.db_username,
        passwd=config.db_password, use_unicode=True, charset='utf8')

MediaWiki, on the other hand, is utf8 binary encoded in latin1:

def connect_to_commons_database():
    '''
    Connect to the commons mysql database, if it fails, go down in flames
    '''
    conn = MySQLdb.connect('commonswiki.labsdb', db='commonswiki_p',
                           user=config.db_username, passwd=config.db_password,
                           use_unicode=True, charset='latin1')

Check out the beautiful code at https://www.mediawiki.org/wiki/Toolserver:Code_snippets#Fix_UTF-8_encoded_as_latin-1
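
The gist of that snippet is the classic re-interpretation round trip (a generic illustration, not the linked code verbatim):

def fix_latin1_as_utf8(text):
    """Repair text whose UTF-8 bytes were read through a latin1 connection."""
    return text.encode('latin-1').decode('utf-8')

print(fix_latin1_as_utf8(u'Baudenkm\xc3\xa4ler'))  # Baudenkmäler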

Thanks for the explanation. On the plus side that means stuff should still work for us after we drop urlencoding. The downside is that we are likely to be stuck with the latin1 in MediaWiki for the foreseeable future =(

That leaves id_dump as the only remaining question mark.

> In addition to changing the data in the monuments table this would also change the data in the id_dump table. But I can't figure out the purpose of this table (other than tools/id_checker) and so I can't see if any other endpoints now need encoding. Any insights are welcome.

> That leaves id_dump as the only remaining question mark.

I do believe that id_dump is only used for id_checker.

Thanks. Then I only need to spin the patch up on a dev machine to check it works and that pywikibot 2.0 contains my permalink patch.

@Multichill @JeanFred I can't find the thread where we previously discussed pywikibot 2.0 vs. the master branch so I'm raising this here.

It's become increasingly obvious to me that the 2.0 branch is lagging so far behind the master branch that even including specific updates in it is becoming a struggle. My main argument for using it was the ease of pip install and what that means for testing. Since then I've done a bit of pip testing and found out that git+https://github.com/wikimedia/pywikibot-core.git in requirements works fine, so my previous objection is moot.

Are there any other reasons for staying on the 2.0 branch as opposed to the master branch?

As far as I can remember this task and T139492 could both be fixed fairly easily if we were on the master branch.

Just switch to master, but don't do automatic updates. Every once in a while, do a manual update, log it, and check the next day that everything ran OK. The URL is https://gerrit.wikimedia.org/r/pywikibot/core.git btw (GitHub is just a mirror).

I’m fine with switching to master.

There are two different things to take care of:

  • The pywikibot used in unit tests and Docker, which is installed via requirements.txt. The solution is indeed to put the git repo there together with a specific hash (see the sketch after this list).
  • The pywikibot used in production: This is a manual check-out of pywikibot on disk, used via pwb.py.

We should probably align both.
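
For the requirements.txt side, the pinned line would look roughly like this (the commit hash is a placeholder):

# requirements.txt
git+https://github.com/wikimedia/pywikibot-core.git@<commit-hash>#egg=pywikibot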

https://gerrit.wikimedia.org/r/#/c/354690/ pinned a newer version of pywikibot in requirements (3.0.20170403).

Next up is checking this one out in production.

Mentioned in SAL (#wikimedia-cloud) [2017-08-02T13:32:05Z] <Lokal_Profil> Upgdate deployed pywikibot to 3.0.20170403 (T112460)

So that should hopefully have worked. I've updated the commit message of https://gerrit.wikimedia.org/r/#/c/309858/ to indicate it is no longer blocked. Merging and deploying that should hopefully fix this once and for all.

Change 309858 merged by jenkins-bot:
[labs/tools/heritage@master] Store plain permalink instead of urlencoded one

https://gerrit.wikimedia.org/r/309858

This can be closed once we have verified that the link now works (i.e. is not truncated). Another good candidate is ge_(ka), where source links are even longer.

Just spotted that this wasn't deployed. Crossing my fingers and praying that this works

ua_(uk) now looks OK. Waiting for ge_(ka) to finish before resolving, though.

It works! Boom, a two-year-old bug closed.