
Spike: see how easy/hard it is to sqoop all tables from the EventLogging log database
Closed, Resolved · Public

Description

Spike: see how easy/hard it is to sqoop all tables from the EventLogging log database to Hive

Event Timeline

@Nuria I am wondering if we could prioritize this during May. I could also work on it if needed, with some advice about sqoop :)

I think we definitely can. Let's move it to the kanban and assign it to @Milimetric, who will have, I think, some bandwidth to tackle it.

Milimetric moved this task from Next Up to In Progress on the Analytics-Kanban board.

I made the Python script I was talking about: you pass it the output from information_schema and it generates queries that look like they should work with sqoop. I can try running it anytime, will check with Luca. The notebook's here:

http://localhost:8000/user/milimetric/notebooks/Generate%20Sqoop%20Queries%20from%20Information%20Schema%20database.ipynb

(on stat1005)
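
For the record, the gist of it is roughly this (a hedged sketch, not the actual notebook; the connection string, credentials file, target directory and input file name are placeholders):

# Hypothetical sketch: turn an information_schema.columns dump into sqoop commands.
# Host, username, password file and target directory below are placeholders.
import csv
from collections import defaultdict

JDBC = 'jdbc:mysql://db-host.example/log'   # placeholder connection string
TARGET = '/wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-{table}'

def load_columns(tsv_path):
    """Group column names by table from a TABLE_NAME<TAB>COLUMN_NAME dump."""
    tables = defaultdict(list)
    with open(tsv_path) as f:
        for table, column in csv.reader(f, delimiter='\t'):
            tables[table].append(column)
    return tables

def sqoop_command(table, columns):
    """Build a single-mapper import of the listed columns as an Avro file."""
    return (
        'sqoop import'
        f' --connect {JDBC}'
        ' --username research --password-file /user/example/mysql-password.txt'
        f' --table {table}'
        f' --columns "{",".join(columns)}"'
        f' --target-dir {TARGET.format(table=table)}'
        ' --as-avrodatafile -m 1'
    )

for table, columns in load_columns('log_columns.tsv').items():
    print(sqoop_command(table, columns))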

Before calling it good, let's make sure we can sqoop a bunch of tables and that the data in those tables is queryable for all columns. It has happened before that sqoop imports data in binary formats we cannot query.

The following tables will not be imported because they have 0 rows:

MobileWikiAppiOSSessions_18064102, ServerSideAccountCreation_5014296, HumanSearchRelevance_17069968, MobileAppLoginAttempts_5257721, TestSearchSatisfaction2_15700292, TwoColConflictConflict_18155295, TranslationRecommendationAPIRequests_15405506, WikipediaPortal_14377354_15423246, ChangesListFilters_16484266, MobileWikiAppiOSFeed_18280649, EUCCStats_18341891, MobileWikiAppiOSReadingLists_17990228, ArticleCreationWorkflow_17145434, EditConflict_8860941, TestSearchSatisfaction2_15644862, TrackedPageContentSaveComplete_8535426, MobileWikiAppiOSFeed_18225687, UniversalLanguageSelector_7327441_15423246, ContentTranslationSuggestion_16344900, EUCCVisit_18354626, Test_12174936, CitationUsage_18810892, MultimediaViewerDimensions_10014238_15423246, MobileWikiAppiOSLoginAction_18064101, ChangesListFilters_16403617_15423246, SearchSatisfactionErrors_17181648, MobileAppCategorizationAttempts_5359208_15423246, MobileWikiAppOfflineLibrary_17649221, AdvancedSearchRequest_17841562, ContentTranslationError_11767097_15423246, LandingPageImpression_17918163, WikidataCompletionSearchClicks_18665070, MultimediaViewerDimensions_10014238, ChangesListFilters_16403617, RecentChangesTopLinks_16732249, MobileWikiAppiOSLoginAction_18121305, GuidedTour_4972209, MobileAppCategorizationAttempts_5359208, TestSearchSatisfaction2_15357244_15423246, LandingPageImpression_18146199, MobileLeftNavbarEditCTA_7074652, AdvancedSearchRequest_18227136, UniversalLanguageSelector_7327441, WikidataCompletionSearchClicks_18624312, MobileWikiAppiOSSessions_18050320, ChangesListFilters_16174591_15423246, MobileWikiAppiOSSessions_18121261, TestSearchSatisfaction2_15357244, MobileLeftNavbarEditCTA_6792179, TestSearchSatisfaction2_16909631, WikidataCompletionSearchClicks_18236836, TranslationRecommendationUserAction_15858947_15423246, AdvancedSearchRequest_17621725, CentralNoticeImpression_17712479, MobileWikiAppOfflineLibrary_17836955, ContentTranslationError_11767097, MobileWikiAppNotificationInteraction_18325732, TestSearchSatisfaction2_14318467, MobileAppUploadAttempts_5334329_15423246, TestSearchSatisfaction2_16270835_15423246, TranslationRecommendationUserAction_15858947, TwoColConflictConflict_17520555, VisualEditorDOMSaved_6063754, UniversalLanguageSelector_17799034, ContentTranslationCTA_16017678_15423246, ChangesListFilters_15876023, MobileAppUploadAttempts_5334329, TestSearchSatisfaction2_16270835, TranslationRecommendationUIRequests_15484897_15423246, InputDeviceDynamics_17687647, VisualEditorDOMRetrieved_5961496, AdvancedSearchRequest_17760635, MobileWikiAppOnThisDay_17836957, ContentTranslationCTA_16017678, EchoPrefUpdate_5488876, MobileWikiAppiOSReadingLists_18047424, ContentTranslationSuggestion_19004928, TranslationRecommendationUIRequests_15484897, ContentTranslation_11628043_15423246, ContentTranslationAbuseFilter_18472730, MobileWikiAppiOSSessions_18047432, ContentTranslation_18999884, ChangesListClickTracking_16484895, GeoFeatures_12914994_15423246, TestSearchSatisfaction2_14098806, MobileAppShareAttempts_5346170_15423246, TestSearchSatisfaction2_15922352_15423246, WikipediaPortal_15890769_15423246, ChangesListHighlights_16484288, ContentTranslation_11628043, WMDEBannerEvents_18437830, MobileWikiAppiOSSettingAction_17990226, CentralAuth_5690875_15423246, ExternalLinksChange_15716074_15423246, GeoFeatures_12914994, NewEditorEdit_6792669_15423246, MobileWikiAppiOSSettingAction_18064085, MobileWikiAppOnThisDay_18118721, TestSearchSatisfaction2_13223897, MobileAppShareAttempts_5346170, 
TestSearchSatisfaction2_15700292_15423246, WMDEBannerSizeIssue_18193993, TranslationRecommendationAPIRequests_16261139_15423246, MobileWikiAppOnThisDay_17490767, EditConflict_8860941_15423246, ExternalLinksChange_15716074, NewEditorEdit_6792669, MobileWikiAppiOSReadingLists_18064062, HumanSearchRelevance_17073843, TestEventDuplication_16757884, MobileAppLoginAttempts_5257721_15423246, TestSearchSatisfaction2_15922352, WMDEBannerEvents_18193948, TranslationRecommendationAPIRequests_16261139, AdvancedSearchRequest_17379859, WikipediaPortal_15890769, MobileWikiAppiOSReadingLists_18280648, ContentTranslationSuggestion_16344900_15423246, WikidataCompletionSearchClicks_18419582, MobileWikiAppiOSLoginAction_17990227, CentralAuth_5690875

Ok, I sqooped 4 tables, some big, some small. For some reason I can't get a Hive table created on top of them, but it's very easy to put a Spark dataframe on them and run Spark SQL. All columns work on all the tables I sqooped, so I think that's good enough for a backup. I inspected the avro files manually and they all seem fine, nothing crazy or special.

So, I wrote a script, sqoop-log-db, it's in my home on stat1006. I commented out the top 10 biggest tables as well as the ones I already sqooped. I'm running it in a screen on that box starting now. I've no idea when it'll be done, but feel free to kill it if it takes up too many resources. I'm around, ping me if anything comes up.
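
For anyone curious, a driver like that can be as simple as the following sketch (this is not the actual sqoop-log-db script; the file names and skip list are illustrative):

# Hypothetical driver sketch: run generated sqoop commands serially and
# record which tables failed, so they can be rerun later.
import subprocess

SKIP = {'Edit_10676603'}    # e.g. tables already imported or deliberately commented out

failed = []
with open('sqoop-commands.txt') as f:     # assumed output of the generator above
    for line in f:
        command = line.strip()
        if not command or any(t in command for t in SKIP):
            continue
        result = subprocess.run(command, shell=True)
        if result.returncode != 0:
            failed.append(command)

# A few tables did need a rerun, so keep the failures around.
with open('sqoop-failed.txt', 'w') as f:
    f.write('\n'.join(failed))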

side note: something weird's going on with kinit. I got this error again:

ExitCodeException exitCode=1: kinit: KDC can't fulfill requested option while renewing credentials

But my commands were still working, and my token was definitely expired this time.

Started sqoop on the top 10 tables (well, 9, 'cause I had already fetched one of them). I did a diff between what I tried to sqoop and what actually got sqooped, and there are about 20 missing, so I'll check again once the top ones run.

This task is not completed, correct? Have we sqooped all tables? cc @Milimetric

No, sadly a few of them still failed and I'm rerunning them in the background. It's just two more Edit_* schemas and all the mediawiki_* ones.

Ok, victory is ours! All tables with data (more than 0 rows) from the log database have been sqooped to hdfs here:

hdfs dfs -ls /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-*/*.avro

They're in avro format, so here's a quick way to load up the data. Once one of my Analytics fellows checks my work, I'll post these docs on Wikitech and we can move forward with decommissioning the mysql server.

pyspark2 --master yarn --executor-memory 8G --executor-cores 1 --driver-memory 4G --conf spark.dynamicAllocation.maxExecutors=64 --jars /srv/deployment/analytics/refinery/artifacts/refinery-job.jar
...

# pick one of the imported tables to inspect
schema = 'Edit_10676603'
template = '/wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-{schema}/part-m-00000.avro'
pathToAvro = template.format(schema=schema)
# load the avro file into a dataframe and query it with Spark SQL
df = spark.read.format('avro').load(pathToAvro)
df.createOrReplaceTempView(schema)
spark.sql('select * from {schema} limit 10'.format(schema=schema)).show()

I do not understand: where is the Edit_10676603 schema being read from to create the table?

ok, note to self, i forgot that in avro, the schema and data are together, duh
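
(Which also means Spark can show the embedded schema directly, continuing the snippet above:)

# the schema embedded in the avro file is what Spark uses to build the dataframe
df.printSchema()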

We still need to spot check all tables and make sure the data is queryable, but yeah, almost there
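
A quick way to do that kind of spot check (a sketch assuming the same pyspark session as above; the path and table name are just examples) is to count nulls per column and compare against the source table:

# Hypothetical spot-check sketch: count nulls per column for one imported table.
from pyspark.sql import functions as F

path = '/wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-Edit_10676603/part-m-00000.avro'
df = spark.read.format('avro').load(path)

# One row with a null count per column; columns that are entirely null stand out
# and can be checked against the source mysql table before declaring the import good.
df.select([F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in df.columns]).show()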

Some comments:

We do not need to sqoop the following tables, as that data exists in the events database and mediawiki_history; let's remove them from the backup directory.

/wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_create_1
drwxr-xr-x - analytics hadoop 0 2020-06-03 18:44 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_create_2
drwxr-xr-x - analytics hadoop 0 2020-06-03 19:46 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_create_3
drwxr-xr-x - analytics hadoop 0 2020-06-03 19:53 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_create_4
drwxr-xr-x - analytics hadoop 0 2020-06-03 19:54 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_delete_1
drwxr-xr-x - analytics hadoop 0 2020-06-03 19:57 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_delete_2
drwxr-xr-x - analytics hadoop 0 2020-06-03 20:00 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_delete_3
drwxr-xr-x - analytics hadoop 0 2020-06-03 20:01 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_delete_4
drwxr-xr-x - analytics hadoop 0 2020-06-03 20:02 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_move_1
drwxr-xr-x - analytics hadoop 0 2020-06-03 20:04 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_move_2
drwxr-xr-x - analytics hadoop 0 2020-06-03 20:07 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_move_3
drwxr-xr-x - analytics hadoop 0 2020-06-03 20:08 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_undelete_1
drwxr-xr-x - analytics hadoop 0 2020-06-03 20:09 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_undelete_2
drwxr-xr-x - analytics hadoop 0 2020-06-03 20:10 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_undelete_3
drwxr-xr-x - analytics hadoop 0 2020-06-03 20:10 /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_undelete_4

Potential issues:

Empty webhost 

MobileWikiAppEdit_17837072


+-------+------------+------------------+-----------------------------+
|webhost|event_action|event_sessionToken|event_wikidataDescriptionEdit|
+-------+------------+------------------+-----------------------------+
|   null|       saved|              null|                         null|
|   null|       start|              null|                         null|
|   null|     preview|              null|                         null|
|   null| saveAttempt|              null|                         null|
|   null|       start|              null|                         null|
|   null|     preview|              null|                         null|
|   null| saveAttempt|              null|                         null|
|   null|     preview|              null|                         null|
|   null|       start|              null|                         null|
|   null|     preview|              null|                         null|
+-------+------------+------------------+-----------------------------+

Edit: nuria looked at the log database and these fields are empty there as well, so this table is OK

Empty event_token and event_userBuckets

ServerSideAccountCreation_17719237

+-----------------+--------------------+----+-----------+-----------------+
|          webhost|                uuid|  dt|event_token|event_userBuckets|
+-----------------+--------------------+----+-----------+-----------------+
|www.mediawiki.org|71b24e6ea6c850d7a...|null|           |                 |
|www.mediawiki.org|ae5e00a2ba09530e8...|null|           |                 |
|www.mediawiki.org|73204f3c21e559f68...|null|           |                 |
|www.mediawiki.org|1d5ef461886f5d069...|null|           |                 |
|www.mediawiki.org|b47e28f8e2845d1c8...|null|           |                 |
|www.mediawiki.org|b074061ce0e259bda...|null|           |                 |
|www.mediawiki.org|553c3b6bd13c58419...|null|           |                 |
|www.mediawiki.org|e6abc691daab5c378...|null|           |                 |
|www.mediawiki.org|787de163665357cba...|null|           |                 |
|www.mediawiki.org|bc571da7196052cbb...|null|           |                 |
+-----------------+--------------------+----+-----------+-----------------+

nuria looked at the log database and these two columns are also empty there, so this table is OK

PageContentSaveComplete_5588433: the userAgent column is null. That might actually be correct, but let's check it out.

Edit: the userAgent column on that table is null in the source as well, so it seems this schema is correct

MobileWikiAppArticleSuggestions_10590869_15423246: event_pageTitle is null

Edit: event_pageTitle is also null in the original data, so it seems this schema is correct

I just did some spot checking; we probably need to take a look at these to see if they are real issues and, if so, whether they can be fixed globally (cc @Milimetric )

Some comments:

We do not need to sqoop the following tables, as that data exists in the events database and mediawiki_history; let's remove them from the backup directory.
... [everything like /wmf/data/archive/backup/mysql/eventlogging-all-old-tables/sqoop-import-log-mediawiki_page_*]

Done

@Nuria it seems to me all the other issues you found are not issues? I had looked at null columns too to make sure they match the source and I didn't find any problems. So we can call the data good and move on as far as I can tell?

Once we agree on proceeding, I'll probably reimage db1108 to Buster, wiping all the data to start fresh; keep that in mind before giving me the green light :) (we do have a backup on HDFS though).

Spot checked also TaskRecommendation_9266319, WikimediaBlogVisit_5308166, ServerSideAccountCreation_17706338, QuickSurveysResponses_15266417, UploadWizardStep_18973101

I think we can call the export good (yay!) and move on with decommissioning the machine. Thanks @Milimetric for the ALWAYS FUN sqoop work

This one wasn't bad! If we ever need to sqoop a bunch of tables again, the python script I wrote should be almost generic. Basically, it seems we had already caught all the edge cases with our other sqoop work; applying all of them at once, generically, to the schemas as read from INFORMATION_SCHEMA seems to just work.

@Nuria @Milimetric do I have the green light to wipe db1108 and reimage it to Buster? I'll then apply the new config to become our backup db node.