
Add el_timestamp to the externallinks table
Open, MediumPublic

Description

Betacommand	I've got an interesting idea for the externallinks table. What about including the timestamp that a link was added? Like what happens with cl_timestamp?
legoktm	what's your usecase
Reedy	it's doable, but ^
Betacommand	legoktm: tracking when links are added, so batch requests for archival (IE a partnership with IA) can be done
Betacommand	or tracking how long a link has been in an article without having to check every diff
Betacommand	tracking overall external link volume over time
Betacommand	or within a given time span
legoktm	sounds useful
legoktm	file a bug?
Betacommand	legoktm: I was thinking about it but wanted a sounding board first

This issue came up as I was thinking about external link recovery (preventing link rot). Right now there is no way of finding external links that have been added in the last X amount of time, which means any attempt at proactive archiving of URLs must be done via database dumps, diffing the externallinks table between two dumps.

While that may be feasible for smaller wikis, any kind of diffing on a large scale quickly becomes unmanageable. Being able to do a select based on a given timestamp would enable this, and would allow nightly incremental dumps that could then be passed to archival sites so they can take proactive steps to avoid link rot.
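
For illustration only, a minimal sketch of the kind of change being described, assuming a MediaWiki-style 14-character timestamp; the el_timestamp column name and the index are hypothetical, mirroring cl_timestamp:

    -- Hypothetical column and index (names illustrative, not an agreed schema)
    ALTER TABLE externallinks
      ADD COLUMN el_timestamp VARBINARY(14) NOT NULL DEFAULT '';
    CREATE INDEX el_timestamp ON externallinks (el_timestamp);

    -- The kind of incremental select this would enable:
    -- all external links added since 2014-11-01
    SELECT el_from, el_to
    FROM externallinks
    WHERE el_timestamp > '20141101000000';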


Version: unspecified
Severity: enhancement

Details

Reference
bz72931

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 4:01 AM
bzimport set Reference to bz72931.
bzimport added a subscriber: Unknown Object (MLST).
coren added a comment. Nov 4 2014, 1:59 AM

Sounds sane; the actual cost of adding a timestamp should be essentially nil, and I can think of a couple use cases when patrolling for spam links that make it easier than trawling the RC.

That said, the column would be nearly useless without an index and I know there's a cost for /that/, so someone more versed in performance will need to chime in.

Reedy added a comment. Nov 4 2014, 3:15 PM

(In reply to Marc A. Pelletier from comment #1)

Sounds sane; the actual cost of adding a timestamp should be essentially nil, and I can think of a couple use cases when patrolling for spam links that make it easier than trawling the RC.

That said, the column would be nearly useless without an index and I know there's a cost for /that/, so someone more versed in performance will need to chime in.

I think it should be alright. Any indexing has some cost, and we regularly index timestamp columns on many tables. The cost of doing so should be fine, as there's a reasonable use case rather than a generic "this might be useful".

Hey, after reading the above two comments I can't figure out whether to add a timestamp or not. Looking forward to taking up my first task. Please advise.

Restricted Application added a subscriber: Aklapper. Jan 17 2016, 9:18 PM

@Adeora7: Thanks for spotting this. As @coren wrote,

someone more versed in performance will need to chime in.

so I'm going to remove the good first task keyword as it's not actionable.

Aklapper set Security to None.
Danny_B moved this task from Unsorted to Add / Create on the Schema-change board. Apr 29 2016, 10:19 PM

@jcrespo this was recently mentioned on mediawiki-l. Do you think this is a sane thing to do?

(Answering from mail)

This is not a trivial nor cost-free change; the *links tables are some of the largest ones on all wikis, in both size and row count. I have some suggestions to avoid that, but that is off-topic here.

My biggest question is whether this would work at all. As I understand it, when a page is reparsed (edited, a transcluded template edited, purged, etc.) all links are deleted and reinserted (I could be wrong here). If that is true, maybe this wouldn't work. Please help me by checking the job implementation.

The second concern is that analysis like the one proposed maybe should not be part of the main db, and could be done on separate servers/labs?

I do not have a very clear picture of what the exact use case is; please send me a link to the ongoing discussions and I will (maybe) give a recommendation for the best option (it feels weird adding a field only to pagelinks). But maybe it is needed, I do not know yet.

(Answering from mail)

This is not a trivial nor cost-free change; the *links tables are some of the largest ones on all wikis, in both size and row count. I have some suggestions to avoid that, but that is off-topic here.

My biggest question is whether this would work at all. As I understand it, when a page is reparsed (edited, a transcluded template edited, purged, etc.) all links are deleted and reinserted (I could be wrong here). If that is true, maybe this wouldn't work. Please help me by checking the job implementation.

We compare the current links with the old links and remove/add the differences. Any links that are already there stay there. They are not deleted/reinserted. So it would work.

The second concern is that analysis like the one proposed maybe should not be part of the main db, and could be done on separate servers/labs?

Sure. But I think implementation-wise it would be much easier to have MW add a timestamp than having it added at some magic stage during the replication process to the labs server (however, I don't know much about how that sort of thing works, so maybe it's a lot easier than I think).

Some of the other comments talk about adding the field and an index. As far as I understand, an index on the main db servers isn't really needed for the use cases proposed above.

I do not have a very clear picture of what the exact use case is; please send me a link to the ongoing discussions and I will (maybe) give a recommendation for the best option (it feels weird adding a field only to pagelinks). But maybe it is needed, I do not know yet.

Note, we are talking about externallinks, not pagelinks (not that it makes a difference; externallinks is probably smaller than pagelinks, but it's still huge).

I think this bug is the discussion. It would be useful to certain people, but it's not a critical need. I would categorize this bug as something we should do if it's not a lot of work, but if it is a lot of work then we should not. The original requestors could probably speak better to the use case than I could. The main reason I ask is that the changes on the application side are pretty trivial, so if it's just waiting for someone to make a MediaWiki patch, I'd like to do it. However, if this represents any significant effort/hardship on the DBA side, we should probably reject the bug, given the use case is not super compelling.

Bawolff updated the task description. Aug 1 2016, 10:14 PM

One of the most useful use cases is having a bot that feeds URLs to archiving services such as the Wayback Machine or WebCite in a manner that is actually feasible. Right now the only way to do a task like that would be to keep an offline duplicate of the el table and compare it during each run (which eats a LOT of disk space, CPU, and time). Having an incremental reference point would mean that only those links newer than the last run would need to be processed, and that when retrieving copies of the archived version of a link we have a time reference to work with, without needing to parse the history of the article.

Second use case: tracking overall link rates/utilization over time. Did we suddenly get 5k links to a website in a short period, or was it gradual? Quite useful for anti-spam monitoring. I can go into further detail on the anti-spam angle if needed (see the query sketch after the list of use cases).

  • Side note: XLinkBot on enwiki would need far fewer resources, as it currently has to process every diff for external link activity.

Third: overall external link activity, analytics, statistics, trends, and user-created tools.
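
To make the first two use cases concrete, here is a rough sketch of the queries they would need, assuming a hypothetical el_timestamp column in the MediaWiki 14-character format; el_from, el_to and el_index are existing externallinks columns, and the cut-off timestamps are placeholders:

    -- Archival feed: every external link added since the previous bot run
    SELECT el_from, el_to
    FROM externallinks
    WHERE el_timestamp > '20160801000000';  -- placeholder: timestamp of the last run

    -- Anti-spam monitoring: links to example.org added per day over the last month
    SELECT LEFT(el_timestamp, 8) AS day_added, COUNT(*) AS links_added
    FROM externallinks
    WHERE el_index LIKE 'http://org.example.%'  -- el_index holds the reversed hostname
      AND el_timestamp > '20160701000000'
    GROUP BY LEFT(el_timestamp, 8);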

jcrespo added a comment (edited). Aug 2 2016, 7:14 AM

They are not deleted/reinserted. So it would work.

Except for 3-second vandalisms that wipe the page :-)

But I think implementation-wise it would be much easier to have MW add a timestamp than having it added at some magic stage during the replication process to the labs server

Having the extra field on labs only would be super easy, because we could use a TIMESTAMP field (instead of a MediaWiki-standard 14-char one), and its overhead would be 3 times (one per labs server) instead of ~100 times (one per production server). It would also be magically handled (on inserts and updates). I think your use cases seem more in line with statistics/user-created tools than production usage, so I believe labs-only would be the best option. We already have some labs-only tables, such as watchlist_counts (watchlist tables are private, but we generate and replicate anonymous statistics to labs).

The other advantage of labs is that you only have to convince me (which you already have), and we could do it short-term, while full MediaWiki schema changes take way more time to agree on and apply.

I think there could be a second option, more advanced, which is never deleting links at all and keeping track of their full life (when each was added, when it was removed, maybe multiple times), but for that we would need the compression/normalization I briefly mentioned before.
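
As a purely hypothetical illustration of that second option (every name below is invented, and this is not a proposed schema), a history-style table could keep one row per link lifetime instead of a single timestamp on externallinks:

    -- Hypothetical link-lifetime table: rows are never deleted, only closed off
    CREATE TABLE externallinks_history (
      elh_page INT UNSIGNED NOT NULL,          -- page the link appears on
      elh_url_id INT UNSIGNED NOT NULL,        -- normalized/compressed URL, as suggested above
      elh_added VARBINARY(14) NOT NULL,        -- when the link was added
      elh_removed VARBINARY(14) DEFAULT NULL,  -- NULL while the link is still present
      KEY elh_added (elh_added),
      KEY elh_page (elh_page, elh_url_id)
    );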

@jcrespo Nothing against labs, but T138967 is a constant and persistent issue which would make the data useless upon the next re-import. Keeping this in MediaWiki allows for more consistent data and fewer chances of corruption. Yes, page-blanking vandalism would cause minor variances, but that isn't a significant issue in the big picture. Yes, having a complete historical reference might be nice; it's just not practical at this time. The categorylinks table already has the same feature that I'm looking to add to the el table, and it enables tools like http://tools.wmflabs.org/betacommand-dev/reports/CATCSD.html for patrolling admins.

You understand that links* inconsistencies are as bad in production as in labs, right? (Whenever a job fails to update the table properly, those problems are created; it's just that labs is open for checking.)

We are about to reimport all tables from production (so all labs drift will disappear) and have new, more powerful servers: T140452

You also say:

that would be to keep an offline duplicate of the el table and compare it during each run (which eats a LOT of disk space, CPU, and time)

But you suggest that keeping that in real time on >100 redundant production instances will be more optimal than keeping it only on labs, or on a subset of separate production instances, away from the main dbs?

Production issues may occur, but getting those fixed is fairly easy: either purging or null-editing the page in question is enough. Labs drift, on the other hand, requires a full re-import (which would lose all of the custom timestamps added over time).

I don't mean to be callous, but until I see them in full operation without drift I don't plan on relying on future wishes. This also brings me back to a classic principle: KISS. Relying on an outside tool to manipulate the db records during the labs replication process is just asking for the data to get screwed up and become unusable. Since there won't be a corresponding table on the replication master, the data will end up being scrubbed and re-imported, losing the needed historical reference points.

You also say:

that would be to keep an offline duplicate of the el table and compare it during each run (which eats a LOT of disk space, CPU, and time)

But you suggest that keeping that in real time on >100 redundant production instances will be more optimal than keeping it only on labs, or on a subset of separate production instances, away from the main dbs?

I think there was a misunderstanding on that line of my comment. I wasn't referring to labs at that point; I was referring to anyone who wanted to create a tool using current versions of the el table over time (i.e. dump diffs).

In reference to keeping it on prod vs labs: given that labs has known drift issues that can only be fixed with re-imports, any custom modification to the el table will cause issues during the next import unless the timestamps are re-set. Also note that this is the groundwork for more advanced tools and options within MediaWiki that will be used on non-WMF wikis, which don't have a labs setup. Spam isn't limited to WMF wikis.

So I suppose that performing a schema change on a live, heavily-written, up-to-70GB table 10,000 times (the number of times we have that table replicated), increasing storage by a few GB per server, is the right choice, rather than first doing a proof of concept in labs (or on some special separate production instances) and proving to everyone how useful it is before committing a lot of resources?

If that is true, then please get the functionality deployed and agreed on by MediaWiki developers, then follow the steps shown here: https://wikitech.wikimedia.org/wiki/Schema_changes for deployment to WMF servers.

If it is not clear enough: I am completely neutral about the change itself; I am opposed to the proposed deployment plan. Schema changes cannot be done optionally (e.g. selectively per wiki). Starting with a smaller scope (a separate set of servers, or separate tables initially if it were deployed as a separate extension in the beginning) would be a more reasonable starting point.

jcrespo added a comment (edited). Aug 2 2016, 11:11 AM

To give you an idea of the effort differences: doing it on labs now will take no more than a day; it requires no special tooling and it is handled automatically (replication + auto-updated timestamp fields). You could have it done by next week. Doing it in production could take six months or more to get deployed.

Doing the first does not in any way limit work towards the second. In fact, I would say it would speed it up, because it would prove its usefulness (or not).

The problem with the links* tables is that they are very inefficient, so I would like to explore alternative options that are more reliable and less costly (this does not directly impact this ticket, but it explains my reticence to do a lot of work on them if they will change in the future).

Also, we should not discard the labs dbs: yes, they have issues now, but that is because they were neglected in the past, and this is no longer the case. They deserve better support and they will get it; many tools important for WMF wikis depend on them.

@jcrespo Ok, it looks like we have been on two different pages. I thought you were trying to make this a labs-only "feature", whereas you just want a proof of concept/viability check. I know you are doing quite a bit to improve labs' quality of service, and this wasn't a dig at that. Rather, it was planning for the issues that labs has historically had, and trying to limit the long-term impact of those issues. If you want to get it implemented on labs, go ahead. I'll reach out to my Wayback Machine contacts and see what the best process moving forward would be for mass archival.

Having the extra field on labs only would be super easy, because we could use a TIMESTAMP field (instead of a MediaWiki-standard 14-char one), and its overhead would be 3 times (one per labs server) instead of ~100 times (one per production server). It would also be magically handled (on inserts and updates).

Ah, so the labs db would have a TIMESTAMP DEFAULT CURRENT_TIMESTAMP type field, and everything just works. That's a really cool solution to this problem that does indeed sound much easier than doing it in MW (from a WMF DB perspective).
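
A sketch of what that labs-only field could look like, assuming the auto-updated TIMESTAMP behaviour described above is applied directly on the labs replicas (illustrative only, not a reviewed change):

    -- Applied on the labs replicas only: auto-populated on insert and update
    ALTER TABLE externallinks
      ADD COLUMN el_timestamp TIMESTAMP NOT NULL
        DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;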

jcrespo added a comment (edited). Aug 3 2016, 6:07 AM

@Bawolff I propose that you design such a change for labs now (probably on a different ticket) and we could get it done very quickly; then, once you have such a field, you show the community how useful it is with practical examples; then propose (if needed) a way to integrate it back into MediaWiki.

Most functionality starts as labs "gadgets", then gets "promoted" to production when two things happen: 1) it is shown to be a really useful thing, and 2) someone (volunteer or staff) agrees to code & maintain the functionality in MediaWiki. The largest and busiest tables tend to take some time to get changes done (aside from indexes), as a small change may have a large impact and thus needs to be done carefully.