
Create a report of current keyword values and frequencies
Closed, Resolved · Public

Description

@Bmueller and @TBurmeister have been thinking about controlled vocabularies for the tagging attributes under consideration to help the community organize and find toolinfo records. As part of their research, they would like to know more about the values that have been entered in the freeform "keywords" field of the existing records cataloged by Toolhub.

One way to get this data would be to ask Elasticsearch for it, which might look something like:

$ curl -XPOST -H "Content-Type: application/json" 'localhost:9200/toolhub_tools/_search?pretty' --data '{"size":0,"aggs":{"keywords":{"terms":{"field":"keywords.keyword","size":10000}}}}' |
jq -r '.aggregations.keywords.buckets[] | [.doc_count, .key] | @tsv'
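
For readers less familiar with jq, the same request body and bucket-to-TSV step can be sketched in Python. This is only an illustration, not what was run for this task: `buckets_to_tsv` and `SAMPLE_RESPONSE` are hypothetical names, and the sample response is a hand-written fragment in the usual shape of a terms aggregation result rather than real server output. Unlike jq's `@tsv`, this sketch does not escape tabs or newlines inside keyword values.

```python
import json

# Request body taken from the curl command above: "size": 0 suppresses the
# search hits, and the terms aggregation counts documents per keyword value.
QUERY = {
    "size": 0,
    "aggs": {
        "keywords": {"terms": {"field": "keywords.keyword", "size": 10000}}
    },
}


def buckets_to_tsv(response: dict) -> str:
    """Mirror the jq filter: one 'doc_count<TAB>key' line per bucket."""
    buckets = response["aggregations"]["keywords"]["buckets"]
    return "\n".join(f"{b['doc_count']}\t{b['key']}" for b in buckets)


# Hypothetical, hand-written response fragment (assumption: the response
# nests buckets under aggregations.keywords, as the jq filter expects).
SAMPLE_RESPONSE = {
    "aggregations": {
        "keywords": {
            "buckets": [
                {"key": "wikidata", "doc_count": 170},
                {"key": "wikipedia", "doc_count": 88},
            ]
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(QUERY))                # body to POST to the _search endpoint
    print(buckets_to_tsv(SAMPLE_RESPONSE))  # TSV lines, count first
```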

Event Timeline

bd808 triaged this task as Medium priority. Feb 14 2022, 3:41 PM
bd808 moved this task from Groomed/Ready to Review on the Toolhub board.

Actual command used from mwmaint1002.eqiad.wmnet:

$ curl -XPOST -H "Content-Type: application/json" -k 'https://search.svc.eqiad.wmnet:9243/toolhub_tools/_search?pretty' --data '{"size":0,"aggs":{"keywords":{"terms":{"field":"keywords.keyword","size":10000}}}}' |
jq -r '.aggregations.keywords.buckets[] | [.doc_count, .key] | @tsv'
keywords.tsv
170	wikidata
88	wikipedia
73	commons
48	statistics
43	tools
35	api
34	bot
33	images
33	python
30	stats
28	list
27	search
25	toolforge
24	wikimedia commons
24	wikisource
22	files
21	i18n
21	maps
20	category
19	categories
19	irc
18	multilingual
18	query
17	pywikibot
17	sparql
16	map
16	views
16	wdq
16	xtools
14	glam
14	pagepile
14	wlm
13	database
12	analytics
12	cross-wiki
12	mediawiki
12	php
12	upload
11	edits
11	javascript
11	language
11	users
11	wikidata query service
11	wikimedia
10	demo
10	flask
10	generator
10	nodejs
10	oauth
10	ptwikis
10	user
10	wikitext
9	counter
9	editing
9	external
9	graph
9	heritage
9	pageviews
9	phabricator
8	article
8	articles
8	audio
8	edit count
8	openstreetmap
8	page views
8	traffic
8	translation
8	visitors
7	activity
7	authors
7	citation
7	crosswiki
7	enwiki
7	properties
7	redirect
7	swedish
7	test
7	tree
7	user analysis
7	user edits
6	coordinates
6	deployment
6	events
6	missing
6	monuments
6	web tools
6	wiki loves monuments
5	attribution
5	automated contributions
5	cloud-vps
5	gerrit
5	grid-engine
5	ip
5	lexeme
5	links
5	machine learning
5	non-automated
5	nonautomated
5	rss
5	svg
5	sweden
5	table
5	userscript
5	visual
5	visualization
5	wikiproject
5	wle
4	authority control
4	batch
4	biology
4	books
4	chatbot
4	contributions
4	convert
4	creative commons
4	discord
4	discoverability
4	dnb
4	edit
4	editors
4	flagged revisions
4	flickr
4	game
4	interlanguage
4	interwiki
4	json
4	languages
4	leaflet
4	pages
4	patrol
4	photos
4	reference
4	references
4	reporting
4	research
4	reuse
4	science
4	stewards
4	template
4	templates
4	transfer
4	user script
4	vandalism
4	video
4	visualisation
4	watchlist
4	wmse
3	admin
3	articles for creation
3	biography
3	bookmarklet
3	ci
3	codequality
3	conversion
3	converter
3	copy
3	csv
3	data-services
3	disambiguation
3	email
3	file
3	gamification
3	global
3	image
3	importing
3	infobox
3	interface
3	isbn
3	kubernetes
3	kulturarvsdata
3	label
3	lists
3	media
3	mobile
3	monitoring
3	multilingualism
3	mysql
3	ogg
3	openrefine
3	osm
3	outreachy
3	pdf
3	plwiki
3	projects
3	proxy
3	random
3	ranking
3	recent changes
3	reconciliation
3	redlinks
3	report
3	reports
3	shex
3	spelling
3	sql
3	syntax
3	thanks
3	timeline
3	timezones
3	tooltranslate
3	tutorial
3	user statistics
3	warped
3	whois
3	wiki-replicas
3	wikicite
3	wikicommons
3	wikilovesmonuments
3	youtube
2	abstract wikipedia
2	address
2	admins
2	anti-harassment
2	anti-spam
2	archive
2	assessment
2	automated edits
2	beacon
2	beta
2	bio
2	calendar
2	charts
2	chinese
2	citations
2	class
2	classes
2	community award
2	community-tech
2	configuration
2	coordinate
2	coverage
2	css
2	dashboard
2	dataset
2	dbpedia
2	death
2	docker
2	doi
2	dumps
2	duplicates
2	ebooks
2	elections
2	entity schema
2	enwiktionary
2	epub
2	eswiki
2	excel
2	extensions
2	external links
2	feed
2	filter
2	floss
2	footnote
2	gallery
2	genealogy
2	geocoding
2	geograph
2	geonotice
2	german wikipedia
2	git
2	github
2	good
2	google
2	gps
2	guidelines
2	html
2	ical
2	iiif
2	information
2	intersection
2	java
2	kb
2	kml
2	labels
2	language identification
2	license review
2	link
2	lint
2	maintenance
2	math
2	meta
2	multimedia
2	name
2	nationaal archief
2	new
2	ocr
2	operations
2	ores
2	pages created
2	paws
2	playlist
2	polls
2	popular
2	prefix
2	pronunciation
2	provenance
2	qid
2	quickstatements
2	randomness
2	rangeblocks
2	reactjs
2	reasonator
2	reconcile
2	redirects
2	regexp
2	reverse geocoding
2	reversions
2	revisions
2	rust
2	sal
2	sandbox
2	scientific articles
2	sdoc
2	semi-automated edits
2	shorturls
2	skim
2	software-development
2	steward
2	structured-data
2	sysop
2	sysops
2	tables
2	task
2	telegram
2	time
2	tool
2	topics
2	transform
2	translate
2	trees
2	url
2	user contributions
2	vote
2	votes
2	web
2	wikilaeum
2	wikiläum
2	wikipathways
2	wiktionary
2	word count
2	xml
2	zppixbot
1	abuse
1	abusefilter
1	adminstats
1	agile
1	album
1	algorithmic accountability
1	alpha
1	animals
1	anniversaries
1	apt
1	arabic
1	archiving
1	around
1	art
1	article review
1	articles created
1	artwork
1	artworks
1	assamese calendar
1	assamese wikisourse
1	automation
1	bash
1	bbr
1	bible
1	bilderwunsch
1	birthdays
1	block
1	blocking
1	blogging
1	blogs
1	blp
1	blubber
1	book
1	brokenimages
1	brokentemplates
1	browse
1	browser
1	bub
1	buildings
1	bulk upload
1	calendars
1	campaigns
1	candidates
1	cas
1	catalog
1	category management
1	ceph
1	chart
1	chat
1	check
1	checker
1	chemicals
1	chemistry
1	cite
1	ckbwikipedia
1	cleanup
1	cli
1	clock
1	cloud
1	code
1	code review
1	coding
1	coding conventions
1	collaboration
1	computer vision
1	constests
1	content
1	content contributor
1	contentmine
1	contest
1	convention
1	coortinates
1	copyright
1	copyright violations
1	copyvios
1	counter-vandalism
1	crawling
1	creation
1	creator
1	cron
1	crowdsource
1	cvn
1	cycling
1	daily
1	data quality
1	datasheet
1	date
1	dates
1	datetime
1	dead
1	delinker
1	deno
1	depiction
1	depicts
1	description
1	descriptions
1	diagram
1	dictionary
1	differential privacy
1	directory
1	distributed-game
1	distribution
1	django
1	djvu
1	dnsbl
1	documentation
1	domain
1	doodle
1	download
1	drag'n'drop
1	dynamicpagelist
1	edit analysis
1	edit summaries
1	edit summary
1	edit-counter
1	editathons
1	editcount
1	editcounter
1	editor
1	editor retention
1	education program
1	embedding
1	en.wp
1	engineering productivity
1	excellent
1	experiments
1	export
1	extraction
1	facebook
1	facet
1	fair
1	featured
1	featured articles
1	featured video candidates
1	feedback request service
1	feeds
1	find-a-grave
1	fixing
1	fmis
1	forward
1	fpc
1	framadate
1	freebase
1	fun
1	gelocation
1	generation
1	geo
1	geolocation
1	geolookup
1	gitlab
1	global eits
1	glyph
1	gmt
1	google books
1	grammar
1	graphs
1	gtaa
1	gwt
1	hashing
1	heatmap
1	history
1	iabot
1	icalendar
1	ics
1	image requested
1	incubator
1	infra
1	innerlinks
1	instance of
1	interact-oa
1	interactoa
1	internet archive
1	internet-archive
1	internetarchivebot
1	inventory
1	ip address
1	ip info
1	ipaddress
1	ipv4
1	ipv6
1	isin
1	it.wikinews
1	italy
1	items
1	itwiki
1	itwikiversity
1	jem
1	jembot
1	jobs
1	json-ld
1	json-schema
1	jsonp
1	jupyter notebook
1	jury
1	l10n
1	langviews
1	ldap
1	libraries
1	linkbot
1	linked data
1	linkrot
1	listener
1	live
1	location
1	lod
1	login
1	logs
1	lonely
1	m3u
1	mailing list
1	main authors
1	main subject
1	management
1	mapillary
1	massviews
1	mattermost
1	mde
1	mdwiki
1	mediaviews
1	meetup
1	messenger
1	metadata
1	metrics
1	mistersynergy
1	mod
1	model cards
1	modmessage
1	momentjs
1	monument
1	monuments-database
1	morbid
1	movement strategy
1	moves
1	mp4
1	museu paulista
1	music
1	musicbrainz
1	mwparserfromhell
1	na
1	names
1	namespace
1	national archive
1	nccommons
1	network
1	new pages
1	newcomers
1	news
1	newsletters
1	ninthcircuit
1	obits
1	oclc
1	offentligkonst.se
1	oojs-ui
1	ooui
1	open data
1	otrs
1	outreach
1	packaging
1	page
1	page title
1	paintings
1	panoramio
1	parser
1	pashto
1	paste
1	pastebin
1	patch
1	patroller
1	patrollers
1	pattern
1	performance
1	periodic table
1	permalink
1	permissions
1	person
1	persondata
1	petscan
1	pica
1	pilot flag
1	plagiarism
1	plain
1	plain-text
1	plaintexteditcounter
1	play
1	playcounts
1	plwiktionary
1	png
1	polices
1	policy
1	politics
1	potd
1	preferences
1	priority
1	privacy engineering
1	product ontology
1	project
1	project statistics
1	prometheus
1	proofread
1	property
1	qa
1	quality
1	quarry
1	quips
1	quiz
1	rating
1	rbl
1	rdf
1	rdns
1	reader
1	reading
1	recentchanges
1	recursion
1	recursive
1	redirectviews
1	regex
1	regulatory networks
1	releng
1	remarkup
1	rental
1	replag
1	reverse dns
1	revert
1	reviewer
1	revision
1	runtime
1	schedule
1	schema
1	schema.org
1	scholarly articles
1	scientific article
1	score
1	scraping
1	screensaver
1	scripting
1	sde
1	sections
1	seedlist
1	semantics
1	seo
1	server
1	sgd
1	shape expressions
1	share
1	short
1	signature
1	similarity
1	single
1	sitelink
1	sitelinks
1	siteviews
1	skins
1	social media
1	sockpuppetry
1	sockpuppets
1	sort
1	sound
1	source available
1	source-metadata
1	sources
1	spam blacklist
1	specie
1	spell checking
1	spotify
1	statements
1	staten generaal digitaal
1	stock market
1	streaming
1	string
1	structured data
1	style
1	sul
1	summary
1	surveys
1	syndication
1	task1
1	templatedata
1	text
1	theme
1	timezone
1	titles
1	today in history
1	todo
1	toolhub
1	toolinfo
1	top
1	topviews
1	torrents
1	train
1	training
1	translatewiki
1	trends
1	tts
1	twitter
1	typescript
1	unicode
1	unused
1	uploads
1	user interface
1	usergroups
1	userviews
1	vagrant
1	validator
1	vandals
1	ve
1	venn
1	verse
1	videocuttool
1	vikidia
1	visualeditor
1	viz
1	volley
1	volleyball
1	w.wiki
1	wdqs
1	web app
1	webfont
1	webm
1	webp
1	webscraper
1	widar
1	wiki
1	wiki-irc
1	wikibase
1	wikihack
1	wikiholic
1	wikiholism
1	wikilambda
1	wikilinks
1	wikilovesearth
1	wikimania
1	wikimedia nederland
1	wikinews
1	wikipediholism
1	wikiportret
1	wikiradio
1	wikitable
1	wikiversity
1	wiktionnaire
1	wmcs
1	wmf
1	words
1	yaml
1	zhwiki

This data dump has also been saved as a Google Sheet in the Technical Engagement team drive, where we hope to do some additional analysis.
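
One possible starting point for that kind of analysis, given the interest in controlled vocabularies, is to group keywords that differ only in case, spacing, or hyphenation. The following is a minimal sketch under that assumption and not a method anyone on the task has committed to: `parse_tsv`, `normalize`, and `group_variants` are hypothetical helper names, and `SAMPLE` is a handful of rows copied from the output above.

```python
import re
from collections import defaultdict


def parse_tsv(text):
    """Parse 'count<TAB>keyword' lines into (count, keyword) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        count, keyword = line.split("\t", 1)
        rows.append((int(count), keyword))
    return rows


def normalize(keyword):
    """Lowercase and drop whitespace, hyphens, and underscores."""
    return re.sub(r"[\s\-_]+", "", keyword.lower())


def group_variants(rows):
    """Return only the normalized forms shared by more than one keyword."""
    groups = defaultdict(list)
    for count, keyword in rows:
        groups[normalize(keyword)].append((keyword, count))
    return {key: members for key, members in groups.items() if len(members) > 1}


# A few rows taken from keywords.tsv above.
SAMPLE = "12\tcross-wiki\n7\tcrosswiki\n9\tpageviews\n8\tpage views\n21\tmaps\n"

if __name__ == "__main__":
    for key, members in group_variants(parse_tsv(SAMPLE)).items():
        print(key, members)
```

This only catches purely mechanical variants; pairs such as singular and plural forms or abbreviations would still need manual review.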