F30603251
raw.txt
Authored By
dcausse
Oct 9 2019, 1:39 PM
2019-10-09 13:39:07 (UTC+0)
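from pyspark.sql import functions as F, types as T, Window

# One hour of web requests (2019-10-08, hour 11) tagged as SPARQL traffic.
# `spark` is the SparkSession provided by the pyspark shell or notebook.
df = (
    spark.read.table("wmf.webrequest")
    .where(F.col('year') == '2019')
    .where(F.col('month') == '10')
    .where(F.col('day') == '08')
    .where(F.col('hour') == '11')
    .where(F.array_contains(F.col('tags'), 'sparql'))
    .cache()
)

# Request counts per HTTP method.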
df.groupBy(df.http_method).agg(F.count(F.lit(1))).show(20)
queries = df.where(df.http_method == 'GET').select('uri_query')
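from urllib.parse import parse_qsl


def ext_query(uq):
    """Return the value of the first 'query' parameter of uri_query, or None."""
    # Guard against null or empty uri_query values, which cannot be sliced or parsed.
    if not uq:
        return None
    # Drop the first character, expected to be the leading '?'.
    parsed = parse_qsl(uq[1:])
    matches = [value for name, value in parsed if name == 'query']
    return matches[0] if matches else None


# F.udf returns a string column by default.
ext_query_udf = F.udf(ext_query)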
sparql_queries = (
    queries.withColumn('sparql_query', ext_query_udf(F.col('uri_query')))
    .where(F.col('sparql_query').isNotNull())
    .select(F.col('sparql_query'))
)
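# Shuffle before limiting so that the 10000 rows are a random sample of the
# distinct queries; limiting first only shuffles an arbitrary subset.
fetched_queries = sparql_queries.distinct().orderBy(F.rand()).limit(10000).toPandas()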
with open('/home/dcausse/sparqlqueries.lst', 'w', encoding='utf-8') as f:
    for idx, q in fetched_queries.iterrows():
        f.write(q['sparql_query'])
        f.write('\n---\n')
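
The extraction step and the output format can be checked without a Spark cluster. The sketch below is not part of the original paste. It reimplements ext_query in plain Python, runs it on a made-up uri_query value, and adds a hypothetical read_queries helper that splits a dump written with the '\n---\n' separator back into individual queries. The sample queries and the helper name are assumptions for illustration only.

```python
from urllib.parse import parse_qsl, urlencode

SEPARATOR = '\n---\n'  # same separator the script writes after each query


def ext_query(uq):
    """Return the value of the first 'query' parameter, or None if absent."""
    if not uq:
        return None
    # Drop the first character, expected to be the leading '?'.
    parsed = parse_qsl(uq[1:])
    matches = [value for name, value in parsed if name == 'query']
    return matches[0] if matches else None


def read_queries(text):
    """Hypothetical helper: split a dump back into its individual queries."""
    return [chunk for chunk in text.split(SEPARATOR) if chunk]


# Made-up request parameters, encoded the way a GET request would carry them.
sample_uri_query = '?' + urlencode({
    'format': 'json',
    'query': 'SELECT ?s WHERE { ?s ?p ?o } LIMIT 1',
})
extracted = ext_query(sample_uri_query)
no_query = ext_query('?format=json')

# Build a small in-memory dump in the same layout as sparqlqueries.lst.
dump = ''.join(q + SEPARATOR for q in ['SELECT 1', 'ASK { ?s ?p ?o }'])
recovered = read_queries(dump)
```

One caveat with this layout: a query that itself contains a line consisting only of --- would be split in two by read_queries, so the separator is only safe as long as no query contains it.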
P9281 extract wdqs sparql queries