
Cleanup ptwikibooks conversion
Closed, Resolved (Public)

Description

The LQT->Flow conversion was executed a couple of times (due to import issues).

It looks like all these consecutive import runs resulted in multiple "this LQT page has been archived" messages on the original LQT page.
That content is also copied to the description of the new Flow board.

We should probably do a cleanup & make sure that message is stripped from Flow board descriptions, and only occurs once on the original LQT page.

Details

Repo                        Branch              Lines +/-
mediawiki/extensions/Flow   wmf/1.28.0-wmf.14   +1 -1
mediawiki/extensions/Flow   master              +1 -1
mediawiki/extensions/Flow   wmf/1.28.0-wmf.7    +99 -27
mediawiki/extensions/Flow   wmf/1.28.0-wmf.7    +11 -3
mediawiki/extensions/Flow   wmf/1.28.0-wmf.7    +300 -0
mediawiki/extensions/Flow   master              +99 -27
mediawiki/extensions/Flow   master              +11 -3
mediawiki/extensions/Flow   master              +300 -0
mediawiki/extensions/Flow   master              +28 -10
mediawiki/extensions/Flow   wmf/1.28.0-wmf.3    +10 -2
mediawiki/extensions/Flow   wmf/1.28.0-wmf.2    +10 -2
mediawiki/extensions/Flow   master              +10 -2
mediawiki/extensions/Flow   wmf/1.28.0-wmf.2    +29 -8
mediawiki/extensions/Flow   master              +29 -8
mediawiki/extensions/Flow   master              +176 -10
mediawiki/extensions/Flow   master              +319 -3
mediawiki/extensions/Flow   master              +43 -11

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 290518 had a related patch set uploaded (by Matthias Mullie):
Don't assume workflows/revisions are inserted in chronological order

https://gerrit.wikimedia.org/r/290518

Change 290517 merged by Dereckson:
Don't assume workflows/revisions are inserted in chronological order

https://gerrit.wikimedia.org/r/290517

Mentioned in SAL [2016-05-25T00:16:11Z] <dereckson@tin> Synchronized /srv/mediawiki-staging/php-1.28.0-wmf.2/extensions/Flow/maintenance/FlowRemoveOldTopics.php: Don't assume workflows/revisions are inserted in chronological order (T119509) (duration: 00m 28s)

Change 290518 merged by jenkins-bot:
Don't assume workflows/revisions are inserted in chronological order

https://gerrit.wikimedia.org/r/290518

Mentioned in SAL [2016-05-25T00:22:14Z] <dereckson@tin> Synchronized /srv/mediawiki-staging/php-1.28.0-wmf.3/extensions/Flow/maintenance/FlowRemoveOldTopics.php: Don't assume workflows/revisions are inserted in chronological order (T119509) (duration: 00m 23s)

@He7d3r, have you noticed what Matthias is going to do concerning the conversion?

Do you think we should leave a message to the community about that?

I wasn't aware of that (this task). I think it is a good idea to leave a quick note at
https://pt.wikibooks.org/wiki/Wikilivros:Di%C3%A1logos_comunit%C3%A1rios
saying when something important is going to happen (I can translate it if needed).

Change 295216 had a related patch set uploaded (by Matthias Mullie):
Fix DB source store for nested structures

https://gerrit.wikimedia.org/r/295216

Change 295217 had a related patch set uploaded (by Matthias Mullie):
Script to restore LQT topics to their pre-import state

https://gerrit.wikimedia.org/r/295217

Change 295216 merged by jenkins-bot:
Fix DB source store for nested structures

https://gerrit.wikimedia.org/r/295216

Change 295217 merged by jenkins-bot:
Script to restore LQT topics to their pre-import state

https://gerrit.wikimedia.org/r/295217

Change 295350 had a related patch set uploaded (by Matthias Mullie):
Don't reimport existing headers

https://gerrit.wikimedia.org/r/295350

Change 295350 merged by jenkins-bot:
Don't reimport existing headers

https://gerrit.wikimedia.org/r/295350

Change 296232 had a related patch set uploaded (by Matthias Mullie):
Also delete topics that have more recent updates by (only) talk page manager

https://gerrit.wikimedia.org/r/296232

Change 296232 merged by jenkins-bot:
Also delete topics that have more recent updates by (only) talk page manager

https://gerrit.wikimedia.org/r/296232

Change 296511 had a related patch set uploaded (by Matthias Mullie):
Script to restore LQT topics to their pre-import state

https://gerrit.wikimedia.org/r/296511

Change 296512 had a related patch set uploaded (by Matthias Mullie):
Don't reimport existing headers

https://gerrit.wikimedia.org/r/296512

Change 296513 had a related patch set uploaded (by Matthias Mullie):
Also delete topics that have more recent updates by (only) talk page manager

https://gerrit.wikimedia.org/r/296513

Change 296511 merged by jenkins-bot:
Script to restore LQT topics to their pre-import state

https://gerrit.wikimedia.org/r/296511

Change 296512 merged by jenkins-bot:
Don't reimport existing headers

https://gerrit.wikimedia.org/r/296512

Mentioned in SAL [2016-06-29T15:22:49Z] <thcipriani@tin> Synchronized php-1.28.0-wmf.7/extensions/Flow/maintenance/FlowRestoreLQT.php: SWAT: [[gerrit:296511|Script to restore LQT topics to their pre-import state (T119509)]] (duration: 00m 26s)

Mentioned in SAL [2016-06-29T15:29:30Z] <thcipriani@tin> Synchronized php-1.28.0-wmf.7/extensions/Flow: SWAT: [[gerrit:296512|Do not reimport existing header (T119509)]] (duration: 00m 46s)

Change 296513 merged by jenkins-bot:
Also delete topics that have more recent updates by (only) talk page manager

https://gerrit.wikimedia.org/r/296513

Mentioned in SAL [2016-06-29T15:34:46Z] <thcipriani@tin> Synchronized php-1.28.0-wmf.7/extensions/Flow/maintenance/FlowRemoveOldTopics.php: SWAT: [[gerrit:296513|Also delete topics that have more recent updates by (only) talk page manager (T119509)]] (duration: 00m 25s)

jcrespo moved this task from Triage to Pending comment on the DBA board.

Sorry for the delay, I am ready for the backups. The best way is to schedule some time: the backup process takes 5 minutes (I did a test one just now). Then you continue the process, and I am available in case something goes wrong.

The time doesn't matter much, but any time that is not too late for me (Europe) would be preferred, as I assume the conversion and checking that everything is ok will take some time. Just set a date and a time and I will confirm it for you.

@matthiasmullie Are you planning to finish this up on the Flow side, or do you want to pass it to one of the Collaboration team people (i.e. probably me)?

Sorry. I wanted to send an email about this last weekend, but apparently it didn't go through.
I'm currently on vacation for 2 weeks. I plan to pick this up when I get back, but if you want to do it or if it's urgent, feel free to move ahead :)

jcrespo moved this task from In progress to Done on the DBA board.
jcrespo moved this task from Done to Blocked external/Not db team on the DBA board.

I plan to start running this Wednesday (August 10) morning (European morning). Does that work for you, @jcrespo ?

Yes, it works. Can we set a specific time for the backups? They may take around half an hour, and that will be the point in time from which we will be able to recover data in case something goes badly.

I'm sending an update to pt.wb right now.

@jcrespo Sure. Does 10AM CEST work for you to start the backups? I'll be around.

Yes, contact me on IRC and I will tell you about its progress and when they finish.

I had no idea this was on Phabricator's user pages as well. Just updated mine (mlitn) too, thanks! :)

Backups done, on standby if my help is needed.

I ran a very small test on tin: mwscript extensions/Flow/maintenance/FlowRemoveOldTopics.php --wiki=ptwikibooks --date=20100817174509
It should've deleted 4 topics:

This seems to have worked. However, they're still accessible.
I remembered (too late) that tin is in a different datacenter, with different cache, so they're probably still being displayed from cache. They should fall out of cache in 3 days max.


I ran a subsequent small sample on terbium: mwscript extensions/Flow/maintenance/FlowRemoveOldTopics.php --wiki=ptwikibooks --date=20100818210451
This should've deleted 2 topics:

They're gone!


Moving on with the rest!

@jcrespo: these are 2 of the 4 entries that were deleted in DB (as far as I can tell) but are still in memcached.
I don't have the exact ID of the other 2, but can look them up if these 2 didn't get deleted.

SELECT * FROM flow_workflow WHERE workflow_id IN(UNHEX('04aa045d80e805c1d5b12f'), UNHEX('04aa045d8a59963a643f57'));

Can you confirm that they have indeed been deleted?
If they have, we can ignore the memcached entries: they'll fall out of cache in 3 days and should only be available via permalink anyway.

The script to remove all untouched imported topics has completed running. The only Flow topics that still exist are:

As far as I can tell, this seems to have worked fine, but I will check in more detail tomorrow.
I'll also be double-checking that we don't have duplicate Flow topics that both had replies. There are likely none or few, and I'll fix those manually.

After that, probably also tomorrow, I'll start running a script that will move some Flow & LQT boards around. There shouldn't be any visible changes in the pages after this script has run. It just makes sure they're in a state where the importer will recognize boards that have already been (partially) imported, allowing us to pick up the import after that.

@Trizek-WMF/@He7d3r : I believe we've already informed ptwikibooks about this process, right? Just let them know to comment here should they be experiencing issues.
Right now, they may notice that *a lot* of old (inactive) topics have been deleted. That's ok, these will be reimported again later.
Should they notice that recent topics are gone, let us know ASAP, so we can restore a backup.

@jcrespo: Can I bug you again tomorrow (probably around noon) to take another backup before I start running the second script?

Definitely; I will check the rows mentioned tomorrow, too.

Here’s a list of the remaining duplicates I’ve found:

From what I can tell, they haven't automatically been deleted because there was user activity since they were imported (they have been closed/resolved). There were no conflicting comments that needed to be merged between topic duplicates.
@He7d3r: they're all on your talk page, so you'll get to decide what you want to do with them :) You can either deal with them yourself, or we can remove them, similar to what we've done in T117514.
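
For anyone following along, here is a rough sketch of the "untouched since import" rule described above. It is not the actual FlowRemoveOldTopics.php logic, only an illustration under assumptions: the `Revision` type, the `is_untouched` helper and the `TALK_PAGE_MANAGER` account name are hypothetical names made up for this example. Note that the check does not rely on revisions being in chronological order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable

# Assumed name of the system account that performs imports/maintenance.
TALK_PAGE_MANAGER = "Flow talk page manager"


@dataclass(frozen=True)
class Revision:
    """Hypothetical, simplified view of one revision of a topic."""
    author: str
    timestamp: datetime


def is_untouched(revisions: Iterable[Revision], imported_at: datetime) -> bool:
    """Return True if nobody except the talk page manager touched the topic
    after it was imported. The revisions may arrive in any order."""
    return all(
        rev.author == TALK_PAGE_MANAGER
        for rev in revisions
        if rev.timestamp > imported_at
    )


if __name__ == "__main__":
    imported = datetime(2016, 1, 1)
    revisions = [
        Revision("SomeUser", datetime(2016, 3, 1)),  # e.g. a user resolving the topic
        Revision(TALK_PAGE_MANAGER, datetime(2016, 2, 1)),
    ]
    # A topic with user activity after the import is kept, not deleted.
    print(is_untouched(revisions, imported))
```

Under this sketch, the duplicates listed above would be kept because a user closed/resolved them after the import.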

There were a few others, but they have already been cleaned up in T117514: Fix https://pt.wikibooks.org/wiki/Wikilivros:Plant%C3%A3o_de_d%C3%BAvidas

@jcrespo: can you take another backup at this time, before I start running the next script?

New backup just finished:

-rw-r--r--  1 root  root     76M Aug 11 10:20 ptwikibooks-s3-20160811T102044.sql.gz
-rw-r--r--  1 root  root    1.4G Aug 11 10:24 ptwikibooks-x1-20160811T102159.sql.gz

I can also confirm no such records exist on any of our servers:

$ while read host port; do mysql -h $host -P $port officewiki -e "SELECT * FROM flow_workflow WHERE workflow_id IN(UNHEX('04aa045d80e805c1d5b12f'), UNHEX('04aa045d8a59963a643f57'))"; done < ~/s3.hosts
ERROR 1049 (42000): Unknown database 'officewiki'
ERROR 1049 (42000): Unknown database 'officewiki'
ERROR 1049 (42000): Unknown database 'officewiki'
$ while read host port; do mysql -h $host -P $port flowdb -e "SELECT * FROM flow_workflow WHERE workflow_id IN(UNHEX('04aa045d80e805c1d5b12f'), UNHEX('04aa045d8a59963a643f57'))"; done < ~/x1.hosts

Feel free to remove the duplicates from my talk page.

Change 304205 had a related patch set uploaded (by Matthias Mullie):
Query wiki DB for logging table, not Flow DB

https://gerrit.wikimedia.org/r/304205

(Not urgent) I'm curious about how the URLs are generated: why does each pair of duplicates share the same 6 characters?

@Trizek-WMF/@He7d3r : I believe we've already informed ptwikibooks about this process, right? Just let them know to comment here should they be experiencing issues.
Right now, they may notice that *a lot* of old (inactive) topics have been deleted. That's ok, these will be reimported again later.
Should they notice that recent topics are gone, let us know ASAP, so we can restore a backup.

I didn't notice that mention earlier, sorry.
@He7d3r, have you sent a message or something?

(Not urgent) I'm curious about how the URLs are generated: why does each pair of duplicates share the same 6 characters?

Part of the UUID is a timestamp of the topic. Since they're duplicate posts, they have the same timestamp and share part of the UUID.
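
To make that concrete, here is a minimal sketch that pulls the timestamp out of the two workflow IDs quoted in the SQL earlier in this task. It assumes the IDs are 88 bits wide and that the 46 most significant bits hold milliseconds since the Unix epoch; that layout is my understanding of the ID generator and is not stated in this task, and `uuid_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumed layout: 88-bit ID, top 46 bits = milliseconds since the Unix epoch.
UUID_BITS = 88
TIMESTAMP_BITS = 46
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def uuid_timestamp(hex_uuid: str) -> datetime:
    """Decode the UTC timestamp embedded in a 22-character hex UUID."""
    value = int(hex_uuid, 16)
    # Drop the low 42 bits (treated as opaque here) to keep only the timestamp.
    millis = value >> (UUID_BITS - TIMESTAMP_BITS)
    return EPOCH + timedelta(milliseconds=millis)


if __name__ == "__main__":
    for hex_uuid in ("04aa045d80e805c1d5b12f", "04aa045d8a59963a643f57"):
        print(hex_uuid, uuid_timestamp(hex_uuid).isoformat())
```

If the assumed layout is right, both IDs decode to 2010-08-17 at about 17:27 UTC, which is consistent with them being picked up by the --date=20100817174509 test run. Two imports of the same post carry the same original timestamp, so the leading part of their IDs comes out identical.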

Hmm.. I didn't send a new message after https://pt.wikibooks.org/wiki/Topic:T75gnaqn6o4gviux

Ok. I've posted a message on the topic related to that import. Feel free to translate it (no rush).

Discovered a minor issue with the next script. Should be fixed with https://gerrit.wikimedia.org/r/304205, which I'll try to get reviewed & deployed soon.

Change 304205 merged by jenkins-bot:
Query wiki DB for logging table, not Flow DB

https://gerrit.wikimedia.org/r/304205

Change 304985 had a related patch set uploaded (by Matthias Mullie):
Query wiki DB for logging table, not Flow DB

https://gerrit.wikimedia.org/r/304985

Change 304985 merged by jenkins-bot:
Query wiki DB for logging table, not Flow DB

https://gerrit.wikimedia.org/r/304985

Mentioned in SAL [2016-08-16T16:09:55Z] <thcipriani@tin> Synchronized php-1.28.0-wmf.14/extensions/Flow/maintenance/FlowRestoreLQT.php: SWAT: [[gerrit:304985|Query wiki DB for logging table, not Flow DB (T119509)]] (duration: 00m 57s)

The script fix has been deployed & appears to dryrun just fine.

@jcrespo: can you take another backup please, before I start running this one?

Done.

$ ls -lha *20160817*
-rw-r--r-- 1 root root  72M Aug 17 08:02 ptwikibooks-s3-20160817T080248.sql.gz
-rw-r--r-- 1 root root 1.5G Aug 17 07:50 ptwikibooks-x1-20160817T074408.sql.gz

LQT threads have now been restored. ptwikibooks should now be in good shape again & ready for a new LQT -> Flow import.

matthiasmullie claimed this task.

I'll close this ticket. We can use the parent ticket to get back to the new LQT -> Flow import attempts.
It's possible that the import will again fail to complete, but at least we should now be able to debug it without causing duplicate imports (and if we do, T119509#2304693 can just be repeated now).

This is a special case, because the user created the section directly on the page: https://pt.wikibooks.org/w/index.php?title=Discuss%C3%A3o:Introdu%C3%A7%C3%A3o_%C3%A0_f%C3%ADsica/Arquivo_LQT_1&oldid=35890

In the source, you'll see:

== Incorreto ==

This happened in 2006, probably when the page was an old-style talk page.

When T50578: Enable LiquidThreads extension on all talkpages in pt.wikibooks.org was implemented, that old-style talk page became the header of the LQT page.

Threads like https://pt.wikibooks.org/wiki/T%C3%B3pico:Ajuda_Discuss%C3%A3o:Etapas_de_desenvolvimento/economia should be converted properly. But I believe cases like the one you linked will have to be fixed manually.