Page MenuHomePhabricator

MobileWikiAppDailyStats should not count Googlebot
Closed, ResolvedPublic

Description

mysql:research@analytics-store.eqiad.wmnet [log]> SELECT DISTINCT(userAgent) FROM log.MobileWikiAppDailyStats_12637385 WHERE userAgent NOT LIKE '%Android%';
userAgent
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Some of these have nonzero installation age indeed (cf. schema):

mysql:research@analytics-store.eqiad.wmnet [log]> SELECT DISTINCT(event_appInstallAgeDays) FROM log.MobileWikiAppDailyStats_12637385 WHERE userAgent LIKE '%Googlebot%';
event_appInstallAgeDays
0
1
2

Event Timeline

Tbayer raised the priority of this task from to Needs Triage.
Tbayer updated the task description. (Show Details)
Tbayer added a project: Analytics.

I believe this is a ticket for reader team so analytics doesn't need to take an action here.

It's my understanding that _all_ MobileWikiApp* event logging schemas are used _only_ by the Android app. It doesn't make sense to me how Googlebot could be counted. It's alarming if these numbers are significant because we have not accounted for them AFAIK. Can anyone provide more detail on this card?

If the user agent is anything other than what you send on the app (and sounds like it is) it is likely some code that is not your app code is accessing your pages. Are there any webviews on the android app that are hosted remotely?

The explanation seems to be that Google periodically runs our app in an automated fashion, in order to evaluate the similarity of how the content is presented in the app versus mobile web. In doing so, it inadvertently triggers some of our eventlogging logic. (and they transform the useragent to say "Googlebot")

Unfortunately I don't see a plausible way for us to determine within the app whether we're being run by Google (and conditionally turn off EL in that case). Aside from filtering our queries, the best option might be to block EL messages from Googlebot at the server end.

@Niedzielski , as to the question how significant this issue is: These Googlebot entries make up 14% of the entire table, although only 0.5% for yesterday.

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT COUNT(*) FROM log.MobileWikiAppDailyStats_12637385 WHERE userAgent NOT LIKE '%Android%';
+----------+
| COUNT(*) |
+----------+
|   138457 |
+----------+

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT COUNT(*) FROM log.MobileWikiAppDailyStats_12637385;
+----------+
| COUNT(*) |
+----------+
|   980225 |
+----------+
mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT COUNT(*) FROM log.MobileWikiAppDailyStats_12637385 WHERE timestamp LIKE '20151103%' AND userAgent NOT LIKE '%Android%';
+----------+
| COUNT(*) |
+----------+
|       34 |
+----------+

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT COUNT(*) FROM log.MobileWikiAppDailyStats_12637385 WHERE timestamp LIKE '20151103%';
+----------+
| COUNT(*) |
+----------+
|     6985 |
+----------+

The explanation seems to be that Google periodically runs our app in an automated fashion, in order to evaluate the similarity of how the content is presented in the app versus mobile web. In doing so, it inadvertently triggers some of our eventlogging logic.

Do you have an explanation as to why some events have "googlebot" in them but not others?

These Googlebot entries make up 14% of the entire database,

This seems way too high if this stats is reported (sampled or not) daily for all clients. Looking at code for this event in particular is reported on a task that is "supposed" to run daily, could it be that for an automated run of some sort that daily task is being executed many times?

@Dbrant how do you think we should document this to ensure that anyone else looking at the data (or we in the future) know to lookout for bots. Could other bots conceivably enter the field without us being aware? I want to avoid a situation like we have on the web where years of data were polluted by bots.

@JKatzWMF I think the best practice should be to always specify the userAgent when writing queries, e.g. LIKE '%-r-%' for the production app, LIKE '%-beta-%' for the beta app, etc. (This is more practical for apps than for web)

I don't think it's possible (nor would we want to) build logic in the app to detect whether we're being run headlessly by Google... so there's really no good way to stop the Googlebot events from being sent from the app side.

actually @Dbrant, does the 'like' clause you suggested actually block googlebot?

It should... unless Googlebot's useragent string contains "-r-"...

FWIW, here is a closer look on how frequently this has occurred over time - while the ratio of Googlebot-run active apps is low currently, it has reached up to 83.6% in the past (on August 14).

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT LEFT(timestamp, 8) AS date, ROUND(100*SUM(IF(userAgent NOT LIKE '%Android%', 1, 0))/SUM(1),1) AS Googlebot_percentage FROM log.MobileWikiAppDailyStats_12637385 GROUP BY date ORDER BY date;
+----------+----------------------+
| date     | Googlebot_percentage |
+----------+----------------------+
| 20150717 |                  0.0 |
| 20150720 |                  0.0 |
| 20150721 |                  0.0 |
| 20150722 |                  0.0 |
| 20150723 |                  0.0 |
| 20150724 |                  0.0 |
| 20150725 |                  0.0 |
| 20150726 |                  0.0 |
| 20150727 |                  0.0 |
| 20150728 |                  0.0 |
| 20150729 |                  0.0 |
| 20150730 |                  0.0 |
| 20150731 |                  0.0 |
| 20150801 |                  0.0 |
| 20150802 |                  0.0 |
| 20150803 |                  0.0 |
| 20150804 |                  0.0 |
| 20150805 |                  0.0 |
| 20150806 |                  0.0 |
| 20150807 |                  0.0 |
| 20150808 |                  0.0 |
| 20150809 |                  0.0 |
| 20150810 |                  0.0 |
| 20150811 |                 37.9 |
| 20150812 |                 78.9 |
| 20150813 |                 78.6 |
| 20150814 |                 83.6 |
| 20150815 |                 28.6 |
| 20150816 |                 14.3 |
| 20150817 |                 19.9 |
| 20150818 |                 25.1 |
| 20150819 |                 19.3 |
| 20150820 |                 15.8 |
| 20150821 |                 16.7 |
| 20150822 |                 14.6 |
| 20150823 |                 10.6 |
| 20150824 |                 11.4 |
| 20150825 |                 12.0 |
| 20150826 |                 20.3 |
| 20150827 |                 28.8 |
| 20150828 |                 26.9 |
| 20150829 |                 22.9 |
| 20150830 |                 20.2 |
| 20150831 |                 23.5 |
| 20150901 |                 20.5 |
| 20150902 |                 18.9 |
| 20150903 |                 20.1 |
| 20150904 |                 24.7 |
| 20150905 |                 17.3 |
| 20150906 |                 10.4 |
| 20150907 |                 10.1 |
| 20150908 |                 12.7 |
| 20150909 |                 15.9 |
| 20150910 |                  2.4 |
| 20150911 |                  0.8 |
| 20150912 |                  0.8 |
| 20150913 |                  0.8 |
| 20150914 |                  1.0 |
| 20150915 |                  0.8 |
| 20150916 |                  0.6 |
| 20150917 |                  0.9 |
| 20150918 |                  0.7 |
| 20150919 |                  0.9 |
| 20150920 |                  0.6 |
| 20150921 |                  1.0 |
| 20150922 |                  0.7 |
| 20150923 |                  0.9 |
| 20150924 |                  0.8 |
| 20150925 |                  0.8 |
| 20150926 |                  0.7 |
| 20150927 |                  0.9 |
| 20150928 |                  0.7 |
| 20150929 |                  0.9 |
| 20150930 |                  1.0 |
| 20151001 |                  0.9 |
| 20151002 |                  1.0 |
| 20151003 |                  0.9 |
| 20151004 |                  0.7 |
| 20151005 |                  0.9 |
| 20151006 |                  1.0 |
| 20151007 |                  0.8 |
| 20151008 |                  1.0 |
| 20151009 |                  0.9 |
| 20151010 |                  0.8 |
| 20151011 |                  0.7 |
| 20151012 |                  1.0 |
| 20151013 |                  1.1 |
| 20151014 |                  0.1 |
| 20151015 |                  1.0 |
| 20151016 |                  0.7 |
| 20151017 |                  0.8 |
| 20151018 |                  0.8 |
| 20151019 |                  0.8 |
| 20151020 |                  0.8 |
| 20151021 |                  0.8 |
| 20151022 |                  0.9 |
| 20151023 |                  0.6 |
| 20151024 |                  0.8 |
| 20151025 |                  0.7 |
| 20151026 |                  1.2 |
| 20151027 |                  0.5 |
| 20151028 |                  0.6 |
| 20151029 |                  0.5 |
| 20151030 |                  0.6 |
| 20151031 |                  0.8 |
| 20151101 |                  0.6 |
| 20151102 |                  0.6 |
| 20151103 |                  0.5 |
| 20151104 |                  0.7 |
| 20151105 |                  0.7 |
| 20151106 |                  0.7 |
| 20151107 |                  0.3 |
| 20151108 |                  0.6 |
| 20151109 |                  0.7 |
| 20151110 |                  1.3 |
| 20151111 |                  0.6 |
| 20151112 |                  0.6 |
| 20151113 |                  0.5 |
+----------+----------------------+

@Tbayer, cool! thanks for sharing the additional context.

BTW, it seems like we could test this ourselves using the "Fetch as Google" feature: https://support.google.com/webmasters/answer/6178089

Removing analytics. Note that our work on this regard can be tracked on T121550 ( eventlogging user agent data should be parsed so spiders can be easily identified {flea}." )

Not sure what tag to use for this ticket to be surfaced on Android.

As an update, this is still happening occasionally in this schema (with Googlebot reaching 11% in a spike two weeks ago):

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT LEFT(timestamp, 8) AS date, COUNT(*) AS total, ROUND(100*SUM(IF(userAgent LIKE '%Googlebot%', 1, 0))/SUM(1),1) AS Googlebot_percentage FROM log.MobileWikiAppDailyStats_12637385 WHERE timestamp LIKE '2016%' GROUP BY date ORDER BY date;
+----------+--------+----------------------+
| date     | total  | Googlebot_percentage |
+----------+--------+----------------------+
| 20160101 |  10794 |                  0.5 |
| 20160102 |  10234 |                  0.6 |
| 20160103 |  10852 |                  0.8 |
| 20160104 |   9095 |                  0.9 |
| 20160105 |   9370 |                  0.9 |
| 20160106 |   9617 |                  0.7 |
| 20160107 |   9200 |                  0.7 |
| 20160108 |   9673 |                  0.7 |
| 20160109 |  10259 |                  0.5 |
| 20160110 |  10924 |                  0.8 |
| 20160111 |   9895 |                  0.8 |
| 20160112 |   9435 |                  1.0 |
| 20160113 |   9165 |                  0.7 |
| 20160114 |   9312 |                  0.4 |
| 20160115 |   9578 |                  0.7 |
| 20160116 |  10715 |                  1.0 |
| 20160117 |  11000 |                  0.9 |
| 20160118 |   9653 |                  1.2 |
| 20160119 |   9642 |                  0.6 |
| 20160120 |   9331 |                  1.7 |
| 20160121 |   9106 |                  0.8 |
| 20160122 |   9597 |                  0.3 |
| 20160123 |  10475 |                  0.6 |
| 20160124 |  10940 |                  0.8 |
| 20160125 |   9684 |                  1.0 |
| 20160126 |   9484 |                  2.1 |
| 20160127 |   9320 |                  0.9 |
| 20160128 |   9182 |                  0.9 |
| 20160129 |   9209 |                  1.0 |
| 20160130 |  10183 |                  0.7 |
| 20160131 |  10812 |                  0.4 |
| 20160201 |   9263 |                  0.8 |
| 20160202 |   9654 |                  2.8 |
| 20160203 |   9237 |                  0.9 |
| 20160204 |   9066 |                  0.9 |
| 20160205 |   9105 |                  0.8 |
| 20160206 |  10082 |                  0.7 |
| 20160207 |  10676 |                  1.3 |
| 20160208 |   9629 |                  1.2 |
| 20160209 |   9448 |                  0.9 |
| 20160210 |   9397 |                  1.0 |
| 20160211 |   9280 |                  0.8 |
| 20160212 |   9412 |                  0.9 |
| 20160213 |  10410 |                  0.9 |
| 20160214 |  10909 |                  0.9 |
| 20160215 |   9734 |                  1.1 |
| 20160216 |   9550 |                  1.1 |
| 20160217 |   9050 |                  1.0 |
| 20160218 |   8997 |                  0.8 |
| 20160219 |   9126 |                  1.0 |
| 20160220 |  10112 |                  0.9 |
| 20160221 |  10674 |                  0.6 |
| 20160222 |   9393 |                  0.9 |
| 20160223 |   9672 |                  0.6 |
| 20160224 |   9340 |                  0.8 |
| 20160225 |   9205 |                  1.0 |
| 20160226 |   9153 |                  0.8 |
| 20160227 |  10218 |                  0.9 |
| 20160228 |  10661 |                  0.4 |
| 20160229 |   9706 |                  0.5 |
| 20160301 |   9366 |                  2.0 |
| 20160302 |   9039 |                  1.0 |
| 20160303 |   9389 |                  1.7 |
| 20160304 |   9362 |                  0.6 |
| 20160305 |  10338 |                  1.2 |
| 20160306 |  10757 |                  1.1 |
| 20160307 |   9403 |                  0.9 |
| 20160308 |   9368 |                  1.1 |
| 20160309 |   9135 |                  0.7 |
| 20160310 |  10026 |                  1.4 |
| 20160311 |  29942 |                  0.6 |
| 20160312 |  69380 |                  0.2 |
| 20160313 |  97012 |                  0.1 |
| 20160314 |  88597 |                  0.1 |
| 20160315 |  90056 |                  0.2 |
| 20160316 |  93100 |                  0.1 |
| 20160317 |  93977 |                  0.1 |
| 20160318 |  95430 |                  0.1 |
| 20160319 | 113355 |                  0.1 |
| 20160320 | 123756 |                  0.0 |
| 20160321 | 103384 |                  0.1 |
| 20160322 |  99978 |                  0.1 |
| 20160323 | 103017 |                  0.1 |
| 20160324 | 105708 |                  0.1 |
| 20160325 | 120550 |                  0.1 |
| 20160326 | 127960 |                  0.1 |
| 20160327 | 140647 |                  0.2 |
| 20160328 | 124517 |                  0.0 |
| 20160329 | 108152 |                  0.1 |
| 20160330 | 108368 |                  0.2 |
| 20160331 | 101123 |                  0.1 |
| 20160401 | 100202 |                  0.2 |
| 20160402 | 118359 |                  0.2 |
| 20160403 | 128941 |                  0.1 |
| 20160404 | 109140 |                  0.3 |
| 20160405 | 104198 |                  0.1 |
| 20160406 | 102354 |                  0.1 |
| 20160407 | 102140 |                  0.2 |
| 20160408 | 104064 |                  0.0 |
| 20160409 | 120313 |                  0.0 |
| 20160410 | 132568 |                  0.1 |
| 20160411 | 108392 |                  0.1 |
| 20160412 | 103767 |                  0.2 |
| 20160413 | 104038 |                  0.1 |
| 20160414 | 102338 |                  0.1 |
| 20160415 | 103793 |                  0.1 |
| 20160416 | 119017 |                  0.1 |
| 20160417 | 129157 |                  0.1 |
| 20160418 | 105950 |                  0.0 |
| 20160419 | 102427 |                  0.1 |
| 20160420 | 102278 |                  0.1 |
| 20160421 | 109616 |                  0.2 |
| 20160422 | 110518 |                  1.1 |
| 20160423 | 123262 |                  1.3 |
| 20160424 | 133150 |                  0.1 |
| 20160425 | 108094 |                  0.1 |
| 20160426 | 103923 |                  0.1 |
| 20160427 | 103516 |                  0.0 |
| 20160428 | 101007 |                  0.0 |
| 20160429 | 102762 |                  0.1 |
| 20160430 | 119287 |                  0.2 |
| 20160501 | 130005 |                  0.0 |
| 20160502 | 109566 |                  0.4 |
| 20160503 | 118533 |                 10.9 |
| 20160504 | 109335 |                  0.6 |
| 20160505 | 113522 |                  0.3 |
| 20160506 | 111303 |                  1.5 |
| 20160507 | 127800 |                  2.1 |
| 20160508 | 136090 |                  0.2 |
| 20160509 | 124297 |                  8.8 |
| 20160510 | 118273 |                  4.3 |
| 20160511 | 109299 |                  0.5 |
| 20160512 | 109717 |                  1.6 |
| 20160513 | 114873 |                  4.3 |
| 20160514 | 132298 |                  2.6 |
| 20160515 | 148252 |                  4.2 |
| 20160516 | 119820 |                  2.2 |
| 20160517 |   3178 |                  1.6 |
+----------+--------+----------------------+
138 rows in set (11.10 sec)

On the other hand, Googlebot is not showing up in other Android schemas, at least not those that I've checked so far:

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT LEFT(timestamp, 6) AS yearmonth, COUNT(*) AS total, ROUND(100*SUM(IF(userAgent LIKE '%Googlebot%', 1, 0))/SUM(1),1) AS Googlebot_percentage FROM log.MobileWikiAppSessions_14031591 GROUP BY yearmonth ORDER BY yearmonth;
+-----------+---------+----------------------+
| yearmonth | total   | Googlebot_percentage |
+-----------+---------+----------------------+
| 201510    |  142818 |                  0.0 |
| 201511    | 1798700 |                  0.0 |
| 201512    | 2266902 |                  0.0 |
| 201601    | 2621557 |                  0.0 |
| 201602    | 2726921 |                  0.0 |
| 201603    | 2884853 |                  0.0 |
| 201604    | 2889241 |                  0.0 |
| 201605    | 1495404 |                  0.0 |
+-----------+---------+----------------------+
8 rows in set (1 min 16.37 sec)

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT LEFT(timestamp, 6) AS yearmonth, COUNT(*) AS total, ROUND(100*SUM(IF(userAgent LIKE '%Googlebot%', 1, 0))/SUM(1),1) AS Googlebot_percentage FROM log.MobileWikiAppToCInteraction_14585319  GROUP BY yearmonth ORDER BY yearmonth;

+-----------+--------+----------------------+
| yearmonth | total  | Googlebot_percentage |
+-----------+--------+----------------------+
| 201511    | 110667 |                  0.0 |
| 201512    | 305436 |                  0.0 |
| 201601    | 366491 |                  0.0 |
| 201602    | 340045 |                  0.0 |
| 201603    | 359271 |                  0.0 |
| 201604    | 354281 |                  0.0 |
| 201605    | 191316 |                  0.0 |
+-----------+--------+----------------------+
7 rows in set (6.74 sec)

mysql:research@analytics-store.eqiad.wmnet [(none)]> SELECT LEFT(timestamp, 6) AS yearmonth, COUNT(*) AS total, ROUND(100*SUM(IF(userAgent LIKE '%Googlebot%', 1, 0))/SUM(1),1) AS Googlebot_percentage FROM log.MobileWikiAppWiktionaryPopup_15158116 GROUP BY yearmonth ORDER BY yearmonth;
+-----------+-------+----------------------+
| yearmonth | total | Googlebot_percentage |
+-----------+-------+----------------------+
| 201601    |   688 |                  0.0 |
| 201602    |  2563 |                  0.0 |
| 201603    |  2702 |                  0.0 |
| 201604    |  2397 |                  0.0 |
| 201605    |  1598 |                  0.0 |
+-----------+-------+----------------------+
5 rows in set (0.05 sec)

It looks like this is going to be solved by the current work on T67508. For the record, I noticed that there are two other Google spiders appearing in this schema besides "GoogleBot": "AddBot-Google" and "AppEngine-Google". Both are already detected as spider in the new parsed UA field.

SELECT DISTINCT userAgent FROM log.MobileWikiAppDailyStats_12637385 WHERE userAgent LIKE '%Google%' LIMIT 50;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| userAgent                                                                                                                                                                                             |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "WikipediaApp/2.5.190-r-2017-02-24 (Android 5.1.1; Phone) Google Play"                                                                                                                                  |
| {"os_minor": null, "os_major": null, "device_family": "Spider", "os_family": "Other", "browser_minor": "1", "wmf_app_version": "-", "browser_major": "2", "browser_family": "Googlebot"}                |
| {"os_minor": null, "os_major": null, "device_family": "Spider", "os_family": "Other", "browser_minor": null, "wmf_app_version": "-", "browser_major": null, "browser_family": "AdsBot-Google"}          |
| {"os_minor": "0", "os_major": "10", "device_family": "Spider", "os_family": "iOS", "browser_minor": null, "wmf_app_version": "5.4.0.1095", "browser_major": null, "browser_family": "AppEngine-Google"} |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (5.19 sec)
Dbrant claimed this task.

Please reopen if this is still an issue.