More pdf stats

Authored By
akosiaris
Nov 10 2020, 1:58 PM
Size
1 KB
Referenced Files
None
Subscribers
None

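"""
Get some basic stats from Proton externally.
"""
import io
import pprint
import random
from urllib.parse import quote

import requests
from pdfminer.high_level import extract_text

COUNTERS = ('cl-matches-bytes', 'cl-fails-bytes', 'valid', 'invalid')

# Overall totals, plus the same counters broken down per HTTP status code.
stats = {
    'status_codes': {},
    'valid': 0,
    'invalid': 0,
    'cl-matches-bytes': 0,
    'cl-fails-bytes': 0,
}

WIKIS = ['en', 'es', 'el', 'it', 'fr', 'simple', 'de', 'ar', 'bg', 'no', 'tr']


def count(code, counter):
    """Increment a counter both overall and for the given status code."""
    stats[counter] += 1
    stats['status_codes'][code][counter] += 1


for _ in range(100):
    wiki = random.choice(WIKIS)
    r = requests.get('https://{}.wikipedia.org/api/rest_v1/page/random/title'.format(wiki))
    title = r.json()['items'][0]['title']
    print('Random page title: {}'.format(title))

    # Request the PDF from the same wiki the title came from (the original
    # always asked en.wikipedia.org), and percent-encode the title so that
    # characters such as '/' or '?' do not change the URL path.
    pdf = requests.get('https://{}.wikipedia.org/api/rest_v1/page/pdf/{}'.format(
        wiki, quote(title, safe='')))
    code = pdf.status_code
    if code not in stats['status_codes']:
        stats['status_codes'][code] = {counter: 0 for counter in COUNTERS}

    # Compare the advertised Content-Length with the number of bytes received.
    # A missing header is counted as a mismatch.
    content_length = pdf.headers.get('content-length')
    if content_length is not None and int(content_length) == len(pdf.content):
        count(code, 'cl-matches-bytes')
    else:
        count(code, 'cl-fails-bytes')

    # A body counts as a valid PDF if pdfminer can extract text from it.
    f = io.BytesIO(pdf.content)
    try:
        extract_text(f)
        count(code, 'valid')
    except Exception:
        count(code, 'invalid')

pprint.pprint(stats)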

File Metadata

Mime Type
text/plain; charset=utf-8
Storage Engine
blob
Storage Format
Raw Data
Storage Handle
8782986
Default Alt Text
More pdf stats (1 KB)
