
TF-IDF to determine global stop words
Closed, Resolved (Public)

Description

I think it would be useful to generate roughly 1,000 to 2,500 stop words per wiki using TF-IDF, then run a cross-wiki TF-IDF over all wikis and take the top 500 to 1,000 words that are common to all of them.

I would expect these to include interwiki language ISO codes and words such as http, com, and net from URLs.
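
A minimal sketch of the two-stage idea described above, in plain Python. It is not the script that produced the results in this task. It rests on several assumptions: each wiki is given as a list of page texts, a "stop word candidate" is a term with the lowest inverse document frequency within its wiki (ties broken by higher total frequency), and the cross-wiki step simply counts how many per-wiki lists contain each word. The names `tokenize`, `wiki_stop_words`, `global_stop_words`, and `toy_wikis` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: only lowercase ASCII words are considered tokens.
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Split a page into lowercase alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def wiki_stop_words(documents, k):
    """Return up to k stop-word candidates for one wiki.

    Terms are ranked by ascending IDF (most widespread first);
    ties go to the term with the higher total frequency.
    """
    n_docs = len(documents)
    doc_freq = Counter()   # number of pages containing the term
    term_freq = Counter()  # total occurrences across all pages
    for doc in documents:
        tokens = tokenize(doc)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    def sort_key(term):
        idf = math.log(n_docs / doc_freq[term])
        return (idf, -term_freq[term], term)

    return sorted(doc_freq, key=sort_key)[:k]


def global_stop_words(wikis, per_wiki_k, top_n):
    """Count in how many wikis each candidate appears; return the top_n."""
    wiki_counts = Counter()
    for documents in wikis.values():
        wiki_counts.update(wiki_stop_words(documents, per_wiki_k))
    # Sort by number of wikis (descending), then alphabetically.
    ranked = sorted(wiki_counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Toy data for illustration only; real input would be wiki dumps.
toy_wikis = {
    "enwiki": ["the cat sat on http com", "the dog ran http com", "the bird http net"],
    "dewiki": ["der hund http com", "der vogel http net", "die katze http com"],
}
print(global_stop_words(toy_wikis, per_wiki_k=3, top_n=5))
```

With real data, `per_wiki_k` would be in the 1000-2500 range and `top_n` in the 500-1000 range suggested above. The output is a list of (word, number of wikis) pairs.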

Event Timeline

ToAruShiroiNeko assigned this task to Ladsgroup.
ToAruShiroiNeko raised the priority of this task from to High.
ToAruShiroiNeko updated the task description.
ToAruShiroiNeko added a subscriber: ToAruShiroiNeko.
Restricted Application added a subscriber: Aklapper. Aug 31 2015, 10:52 PM

Result of my work:

  1. nbsp 14
  2. small 14
  3. html 14
  4. at 14
  5. al 14
  6. and 14
  7. class 14
  8. the 14
  9. on 14
  10. htm 14
  11. border 14
  12. history 14
  13. org 14
  14. name 14
  15. asp 14
  16. com 14
  17. center 14
  18. commons 14
  19. le 14
  20. align 14
  21. top 14
  22. background 14
  23. net 14
  24. image 14
  25. article 14
  26. http 14
  27. by 14
  28. php 14
  29. style 14
  30. url 14
  31. right 14
  32. in 14
  33. thumb 14
  34. png 14
  35. left 13
  36. of 13
  37. width 13
  38. file 13
  39. web 13
  40. all 13
  41. ref 13
  42. em 13
  43. pdf 13
  44. for 13
  45. color 13
  46. sup 13
  47. robert 13
  48. di 13
  49. us 13
  50. paul 13
  51. york 13
  52. size 13
  53. references 13
  54. isbn 13
  55. title 13
  56. news 13
  57. colspan 13
  58. jpg 13
  59. index 13
  60. press 13
  61. page 13
  62. wiki 13
  63. www 13
  64. world 13
  65. svg 13
  66. map 13
  67. gif 13
  68. from 13
  69. info 13
  70. px 13
  71. ii 13
  72. do 13
  73. old 13
  74. text 13
  75. van 13
  76. iii 13
  77. wikipedia 13
  78. gallery 12
  79. wikitable 12
  80. with 12
  81. university 12
  82. link 12
  83. cellpadding 12
  84. bgcolor 12
  85. category 12
  86. john 12
  87. internet 12
  88. font 12
  89. james 12
  90. david 12
  91. ac 11
  92. google 11
  93. yue 11
  94. michael 11
  95. region 11
  96. flag 11
  97. national 11
  98. books 11
  99. logo 11
  100. iv 11
  101. content 11
  102. dr 11
  103. william 11
  104. margin 11
  105. lang 11
  106. des 11
  107. edu 11
  108. george 11
  109. usa 11
  110. art 11
  111. me 11
  112. film 10
  113. cellspacing 10
  114. charles 10
  115. type 10
  116. international 10
  117. du 10
  118. per 10
  119. nan 10
  120. commonscat 10
  121. date 10
  122. del 10
  123. der 10
  124. re 10
  125. ma 10
  126. thomas 9
  127. ad 9
  128. cm 9
  129. ed 9
  130. aspx 9
  131. lat 9
  132. peter 9
  133. main 9
  134. home 9
  135. ng 9
  136. city 9
  137. archive 9
  138. gov 9
  139. online 9
  140. tv 9
  141. end 8
  142. div 8
  143. publisher 8
  144. san 8
  145. site 8
  146. jean 8
  147. sub 8
  148. list 8
  149. maria 8
  150. radio 8
  151. redirect 8
  152. bat 8
  153. media 8
  154. cite 8
  155. last 8
  156. defaultsort 8
  157. von 8
  158. vol 8
  159. solid 8
  160. view 8
  161. pp 8
  162. mm 8
  163. mark 8
  164. status 7
  165. europa 7
  166. foto 7
  167. china 7
  168. alt 7
  169. red 7
  170. sport 7
  171. richard 7
  172. paris 7
  173. november 7
  174. era 7
  175. volume 7
  176. london 7
  177. september 7
  178. smg 7
  179. roma 7
  180. general 7
  181. ten 7
  182. table 7
  183. year 7
  184. video 7
  185. los 7
  186. portal 7
  187. august 7
  188. dan 7
  189. data 7
  190. man 7
  191. son 7
  192. pr 7
  193. martin 7
  194. il 7
  195. first 7
  196. area 6
  197. united 6
  198. see 6
  199. system 6
  200. au 6
  201. amerika 6
  202. central 6
  203. er 6
  204. ex 6
  205. etc 6
  206. code 6
  207. ya 6
  208. albert 6
  209. museum 6
  210. time 6
  211. pages 6
  212. clear 6
  213. book 6
  214. group 6
  215. latm 6
  216. christian 6
  217. ni 6
  218. alexander 6
  219. position 6
  220. links 6
  221. april 6
  222. long 6
  223. un 6
  224. caption 6
  225. source 6
  226. website 6
  227. accessdate 6
  228. roman 6
  229. bin 6
  230. joseph 6
  231. state 6
  232. infobox 6
  233. articles 6
  234. club 6
  235. total 6
  236. start 6
  237. est 6
  238. land 6
  239. location 6
  240. louis 6
  241. arms 6
  242. den 6
  243. lon 6
  244. ra 6
  245. anti 6
  246. park 6
  247. black 6
  248. language 5
  249. band 5
  250. xix 5
  251. journal 5
Ladsgroup closed this task as Resolved. Sep 22 2015, 10:33 AM
Ladsgroup set Security to None.