Page MenuHomePhabricator

Determine which Action API parameters to whitelist/blacklist for action_param_hourly aggregate table
Closed, ResolvedPublic

Description

Now that we have raw Action API requests created by T108618: Publish detailed Action API request information to Hadoop we need to design the ETL (extract, transform, load) process that will populate the aggregate reporting tables from T116065: Design aggregate tables to drive Action API reports. Populating the action_ua_hourly and action_action_hourly tables is straight forward, but the action_param_hourly requires making some decisions.

There are many request params we do not want to count distinct values of at all (eg maxlag, smaxage, maxage, requestid, origin, centralauthtoken, titles, pageids). We need to design either a whitelist or blacklist of parameter names. Whitelisting is probably the safest approach.

Event Timeline

To get an idea of what parameters are being sent by clients I ran this hive query:

select count(1) cnt, pkey
from ApiAction LATERAL VIEW explode(map_keys(params)) pTable AS pkey
where year=2016
  and month=4
  and day=1
group by pkey
order by cnt desc
limit 500
;

1cnt pkey
2379629779 action
3373453103 format
4232403950 prop
5134432561 titles
698708577 formatversion
781783097 redirects
864081520 pithumbsize
960928889 piprop
1054846023 pilimit
1152090578 iiprop
1250972051 generator
1346620759 batch
1445507720 continue
1544639195 rvprop
1644577403 wbptterms
1741608793 list
1841302536 search
1940580076 rawcontinue
2035897838 ppprop
2135185123 gpssearch
2235185085 gpslimit
2335090071 gpsnamespace
2433065514 limit
2532581810 uselang
2631530823 page
2729356798 namespace
2829011123 suggest
2927299283 maxage
3027292983 smaxage
3124623210 iiextmetadatalanguage
3224480069 iiextmetadatafilter
3322755234 callback
3421102558 srsearch
3520288910 maxlag
3618647063 explaintext
3717381910 iiurlwidth
3816789504 meta
3916664430 srlimit
4016332069 inprop
4115903014 exintro
4215161880 srprop
4315144228 iiurlheight
4414942417 revids
4512957846 rvlimit
4612442968 sections
4712272623 uiprop
4812183562 sectionprop
4911727842 noheadings
5010384247 indexpageids
5110356894 srwhat
529723256 srinfo
539661843 gsrsearch
549628569 srnamespace
559607068 lllimit
569471949 gsrlimit
579459740 gsrnamespace
589455130 gsrprop
599302036 gsrwhat
609289059 onlyrequestedsections
619279767 gsrinfo
629256178 pageids
639180197 sroffset
648761968 thumbsize
658574990 cllimit
668553491 _
677598286 ids
687536827 bltitle
697526197 exsentences
707500217 bllimit
717432192 requestid
727245836 blfilterredir
736564414 servedby
745748856 rvdir
755662948 iiurlparam
765099283 text
774789530 clshow
784727846 pllimit
794710000 props
804559695 rvsection
814423135 title
823780242 plnamespace
833621443 token
843401196 rdprop
853397135 rdnamespace
863366521 section
873315658 rdlimit
883229274 rdshow
893105058 iimetadataversion
903086288 exchars
913032041 exlimit
923016388 gimlimit
933015288 rvparse
942920258 languages
952790653 rvstartid
962744925 iilimit
972716778 rnnamespace
982675332 rnlimit
992587418 assert
1002553079 language
1012542570 gsroffset
1022471038 siprop
1032461693 llprop
1042284764 euquery
1052283929 eulimit
1062054678 contentmodel
1072010939 iiextmetadatamultilang
1081965700 baserevid
1091938709 fromrev
1101938709 torev
1111784377 sectiontitle
1121776002 contentformat
1131659585 thumbwidth
1141580410 entity
1151500590 property
1161372349 rvendid
1171340273 rcprop
1181338534 tllimit
1191336932 bot
1201315685 summary
1211256350 ususers
1221244886 sites
1231196308 cmlimit
1241183370 usprop
1251124526 rcnamespace
1261108715 converttitles
1271085467 plcontinue
1281020604 cmtitle
129885160 rctype
130881334 basetimestamp
131794194 rcshow
132790996 rclimit
133762151 gscoord
134762149 gsradius
135750326 ucuser
136747187 uclimit
137743391 pst
138706289 ucprop
139687818 redirect
140673951 gtllimit
141651199 type
142632557 cldir
143597056 lllang
144570255 md5
145569930 feedformat
146533539 export
147511979 clprop
148502865 exportnowrap
149483798 pageid
150481197 rvcontentformat
151478811 ucnamespace
152478506 gslimit
153465774 offset
154461618 aiprop
155454042 aisha1base36
156446031 maxlen
157444180 cmtype
158441878 minor
159439961 intoken
160431014 aplimit
161427653 utf8
162424439 gbltitle
163423702 hidebots
164422133 days
165418093 gaplimit
166405892 gbllimit
167397408 gapfrom
168393971 notminor
169389160 clcategories
170384568 blnamespace
171369953 gblfilterredir
172368242 gapfilterredir
173367297 pssearch
174361920 rcdir
175342349 rvdiffto
176338590 pslimit
177337169 gpllimit
178328390 apfrom
179325349 pclimit
180324457 rvtoken
181323965 gcmtitle
182320118 gcmlimit
183304227 iistart
184297052 ucend
185293933 disableeditsection
186288452 cmprop
187286064 rvgeneratexml
188277500 hidecategorization
189274320 apprefix
190273514 id
191271156 gplnamespace
192262809 bkusers
193259613 disablepp
194253168 gcmtype
195251685 letype
196250918 apnamespace
197249747 lelimit
198249328 rccontinue
199237949 bkprop
200231843 gcmnamespace
201229168 starttimestamp
202219446 iwlimit
203219352 iwprop
204212015 notprop
205204326 rvend
206203961 gcllimit
207202682 cmpageid
208195719 geititle
209195719 geilimit
210191608 letitle
211189706 watchlist
212189129 rcstart
213186323 clcontinue
214182837 paction
215181276 rvstart
216181203 geifilterredir
217180595 giulimit
218180595 giutitle
219177805 lhlimit
220177473 lhnamespace
221176398 target
222176252 lhprop
223170918 eititle
224170460 exsectionformat
225168878 tltemplates
226168559 eilimit
227167829 imlimit
228165510 agent
229165389 options
230163572 gapnamespace
231161967 linksite
232161963 rcend
233161707 linktitle
234161000 blredirect
235158873 cmnamespace
236152604 ucdir
237146824 gblnamespace
238143144 lhshow
239143108 cmcontinue
240140329 recreate
241137636 value
242136986 apfilterredir
243136932 gclshow
244133524 gapcontinue
245130503 snaktype
246125258 aulimit
247123613 bklimit
248122093 rvcontinue
249114606 eifilterredir
250110687 data
251108113 lgname
252107461 lgpassword
253104791 gcmprop
25498914 geinamespace
25598761 hash
25697289 generate
25797289 datavalue
25893914 user
25991280 rvtag
26090608 aflprop
26189749 nocreate
26288970 colimit
26386804 oldid
26486422 statement
26585253 auprop
26682162 mobileformat
26781676 includecomments
26881385 noimages
26977740 augroup
27077043 gapdir
27177024 ggsradius
27277020 ggscoord
27375134 notlimit
27474730 wltoken
27574654 wlowner
27674273 from
27768409 grnnamespace
27868326 notsections
27967682 curtimestamp
28067666 ggslimit
28167150 sitefilter
28266416 snaks
28366327 origin
28462650 ledir
28562158 to
28661525 iwurl
28761180 disabletoc
28860665 allrev
28959798 uccontinue
29059337 etag
29158329 lgtoken
29257927 sourcetitle
29356756 content
29456690 progress
29556690 sourcerevision
29656330 ucstart
29753494 rcid
29849640 rctoken
29948479 change
30047227 llinlanguagecode
30146995 leprop
30245653 gcmsort
30344410 rvexcludeuser
30443538 appendtext
30543285 geicontinue
30642660 iutitle
30742640 rvexpandtemplates
30842421 notgroupbysection
30942421 notmessageunreadfirst
31042211 claim
31141754 variant
31241068 guiuser
31340012 blcontinue
31439535 drprop
31539226 languagefallback
31639169 gcmstart
31737802 values
31837802 parser
31937304 niproject
32036243 niregion
32136097 filename
32236086 sectionpreview
32336025 leend
32435924 rctoponly
32534646 wlprop
32634360 captchaid
32734313 captchaword
32834011 wikitext
32933887 ammessages
33033745 drlimit
33133697 druser
33233102 notformat
33332251 amlang
33432077 wllimit
33530884 cachekey
33630771 gsprop
33730691 html
33830684 category
33930668 rvuser
34030415 preview
34129951 cmsort
34229900 ellimit
34328589 einamespace
34428160 cmdir
34528117 guiprop
34628089 template
34728089 view
34827905 page_id
34926959 aufrom
35026927 afllimit
35126099 llcontinue
35226092 iulimit
35326001 iunamespace
35425905 notunreadfirst
35525369 disablelimitreport
35625208 gsprimary
35724203 gsnamespace
35824146 gsglobe
35923638 lestart
36023048 context
36121789 datatype
36221158 auprefix
36321045 grclimit
36420850 wrlimit
36520479 stash
36620439 reason
36720248 feed
36820144 filekey
36920006 references
37019967 iufilterredir
37119954 wrcontinue
37219864 tbtitle
37319864 tbaction
37419663 lang
37519586 generatexml
37619580 coprop
37719212 ignorewarnings
37819127 cmend
37919018 ustoken
38018763 orlimit
38118742 ornamespace
38218030 optionname
38318026 optionvalue
38417895 pccontinue
38517498 gcmdir
38617291 aflend
38716740 grnlimit
38816184 psnamespace
38916065 comment
39015624 grcend
39115624 grcstart
39215624 grcdir
39314898 geuquery
39414898 geunamespace
39514898 geuprotocol
39614889 grcprop
39714626 grccontinue
39814458 codistancefrompoint
39914073 centralauthtoken
40014012 forcelinkupdate
40113708 grnfilterredir
40213683 expiry
40313296 apcontinue
40412548 noemail
40512487 allowusertalk
40612424 wlshow
40712329 normalize
40811828 ailimit
40911824 autoblock
41011630 gqppage
41111630 gqplimit
41210558 gulimit
4139933 name
4149921 password
4159665 gufilterlocal
4169660 guprop
4179374 viprop
4189175 rcuser
4199093 revid
4209091 wlend
4218922 ggsprop
4228900 watch
4238818 filesize
4248753 ggsmaxdim
4258654 hideminor
4268473 afldir
4278468 aflstart
4288350 amfilter
4298152 viurlwidth
4308093 ggsnamespace
4318072 wldir
4328025 lgdomain
4337900 wltype
4347755 leuser
4357699 new
4367598 gcmcontinue
4377408 aifrom
4387354 eicontinue
4397285 smlangprop
4407282 smsiteprop
4417165 smtype
4427156 smstate
4437148 smlimit
4446882 gplcontinue
4456752 gailimit
4466678 notalertunreadfirst
4476678 notmessagecontinue
4486534 geulimit
4496300 gaidir
4506300 gaisort
4516204 email
4526194 unwatch
4536156 grcnamespace
4546130 gaiuser
4555976 friprop
4565858 tags
4575812 tlcontinue
4585616 modules
4595601 pltitles
4605583 dflimit
4615577 gsgpexcludedtitle
4625577 gsgpcount
4635577 gsgptaskname
4645421 grcshow
4655421 grctype
4665341 wlexcludeuser
4675150 eipageid
4685135 imdir
4695065 coprimary
4704817 gacprefix
4714817 gacmin
4724622 eloffset
4734496 gaicontinue
4744209 properties
4754084 apdir
4763878 excontinue
4773830 leaction
4783822 gucontinue
4793713 ucshow
4803703 llurl
4813677 rctag
4823629 tlnamespace
4833586 wpCaptchaId
4843508 wpCaptchaWord
4853220 mgprop
4863220 mgformat
4873001 elexpandurl
4882929 aisha1
4892893 agulimit
4902823 ggsprimary
4912788 imcontinue
4922758 badges
4932738 rev
4942715 source
4952585 createonly
4962526 onlypst
4972466 disabletidy
4982440 undo
4992439 bkdir
5002350 mgroot
5012187 cmstartsortkeyprefix

@Anomie, @Qgil, @EBernhardson, @Addshore: can you help me come up with an initial whitelist of interesting parameters to count in the action_param_hourly table? (Or alternately convince me that a blacklist is a better approach and help create that list.)

I think a whitelist is going to be the safer approach, unless you're not actually capturing the values.

As for which values to capture... To know what modules were requested, you need action and format (for ApiMain); prop, list, and meta (for action=query); submodule (for action=flow); and generator (for anything using ApiPageSet). Beyond that, I don't really know what sorts of uses the data will be put to to be able to try to say which fields would be needed. I'm personally more likely to either need weirdly specific stuff or to just use api-feature-usage to track the things that I'm interested in.

I think a whitelist is going to be the safer approach, unless you're not actually capturing the values.

The intent of the action_param_hourly is to count (action, parameter, value, wiki) tuples. We don't really want to count parameters where the value has a very high carnality (e.g. title or revid). Having looked a bit at the data I'm pretty sure that a whitelist is what we want.

As for which values to capture... To know what modules were requested, you need action and format (for ApiMain); prop, list, and meta (for action=query); submodule (for action=flow); and generator (for anything using ApiPageSet). Beyond that, I don't really know what sorts of uses the data will be put to to be able to try to say which fields would be needed. I'm personally more likely to either need weirdly specific stuff or to just use api-feature-usage to track the things that I'm interested in.

The original request that led to designing this tables was for a report "Ranking of most requested actions/parameters, on a monthly basis + all time (DevRel, to help identifying usage of our APIs and check against our documentation, APIs we should promote...)" (T116065). I dug through my emails and found a discussion with @Spage and @Qgil from October 2015 where we settled on basically the same list that @Anomie is suggesting minus the format parameter. I think format is interesting for discussions of deprecating little used formats (e.g. T118538: Reduce the usage of API format=php), but generally it's not highly useful for a long lived dashboard.

For the initial ETL scripts I'm now thinking we should record summary counts for:

ActionParameter
queryprop
querylist
querymeta
flowsubmodule
*generator

The prop, list and meta parameters we should split the value on | and count each component value separately (i.e. "prop=foo|bar|baz" would count as 3 events).

Once we have this ETL pipeline working, additional parameters can be added if and when someone realizes that they want them for a dashboard.

I can't immediately think of anything for Wikidata here

ActionParameter
queryprop
querylist
querymeta
flowsubmodule
*generator

The prop, list and meta parameters will be split on | and count each component value separately (i.e. "prop=foo|bar|baz" would count as 3 events).