
Wikisource Ebooks: Collect Data on Current Ebook Exports [8 hours]
Closed, Resolved (Public)

Description

As a Wikisource user, I want baseline data related to ebook exports to be established by the Community Tech team, so they can set goals for potential improvements and track progress.

Background: As part of our Wikisource work, we hope to improve the overall reliability of ebook exports. To do this effectively, we need a current understanding of overall reliability. The purpose of this ticket is to identify how we can determine this data and then retrieve it. There are 3 main types of issues we encounter with ebook exports: 1) the tool doesn't work at all (which we can track with UptimeRobot), 2) the tool technically works but there are errors (which we can track with other error logs, potentially Logstash), 3) formatting and style issues (which we need to get from direct examples and feedback from users; this isn't easy to track).

Acceptance Criteria:

  • Collect information on the following:
    • Total number of ebook exports in the last 30 days, 60 days, and 90 days (logs are at https://wsexport.wmflabs.org/logs/ --> how can I easily access/analyze them?)
    • Total number of WSExport connection timeouts in the past month, date of timeout, and duration of timeout (you can use T226136 as a model).
    • Current uptime stats for the last 24 hours, 30 days, 60 days, and 90 days
    • Do we know which file formats are downloaded most often right now from the export tool?
    • Do we know what % of traffic to WS is from mobile?
  • Determine how we can collect logged data on ebook export errors (for example, can this be done in logstash), which may include:
    • Linting errors
    • Export tool errors out
  • Determine the number of the errors for the last 24 hours, 30 days, 60 days, and 90 days, if possible?
  • Can we generate data on any changes that may have occurred in uptime and reliability from before and after the move to VPS?
  • General question: Is UptimeRobot giving us a complete picture of uptime, or is anything missing? Just want to check in about this.

Event Timeline

ifried updated the task description.
ifried renamed this task from Wikisource Ebooks: Collect Data on Current Ebook Exports [placeholder] to Wikisource Ebooks: Collect Data on Current Ebook Exports. Jun 25 2020, 7:30 PM
ifried renamed this task from Wikisource Ebooks: Collect Data on Current Ebook Exports to Wikisource Ebooks: Collect Data on Current Ebook Exports [8 hours]. Jun 30 2020, 11:35 PM

Errors before VPS:

The error log from the old wsexport Toolforge account runs from 2019-07-03 to 2020-03-24 (it contains entries later than that, but they're not errors). We migrated to VPS on 2020-02-11. (There's another batch of errors in the 30 days to March 24, all relating to requests to https://tools.wmflabs.org/phetools/credits.py, mostly 404 Not Found; I've opened T257543 for this.)

In the three months prior to 2020-02-11, the following exception counts were logged:

2019-11-12 to 2019-12-12:
	4830	Exception
	2665	WSExportInvalidArgumentException
	1708	ProcessTimedOutException
	173	ProcessFailedException
	111	HttpException
	78	ServerException
	40	ConnectException
	35	RequestException
	20	ProcessSignaledException

2019-12-12 to 2020-01-12:
	14076	Exception
	6183	WSExportInvalidArgumentException
	6170	ServerException
	1373	ProcessTimedOutException
	152	ProcessFailedException
	142	ConnectException
	44	RequestException
	6	ProcessSignaledException
	5	HttpException
	1	ClientException

2020-01-12 to 2020-02-12:
	3782	Exception
	2851	ProcessTimedOutException
	506	ServerException
	174	ProcessFailedException
	149	ConnectException
	45	RequestException
	37	ProcessSignaledException
	15	WSExportInvalidArgumentException
	5	HttpException

(Hacky thing to get these counts: sed -n '/2020-01-11/,/2020-02-12/p' old-wsexport-error-log.txt | grep -P '^[0-9]{4}-' | grep -o '[^\/]*Exception' | sort | uniq -c | sort -nr)
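The same kind of count can be sketched in Python, which may be easier to re-run for several date ranges. This is only a rough equivalent, not the script that produced the numbers above: it assumes each log entry starts with a YYYY-MM-DD date, it filters by comparing dates rather than by sed's first-match range, and the function name is made up for illustration. A namespace segment that is literally "Exception" is counted as a bare Exception in this sketch.

```python
import re
from collections import Counter

# Lines starting with an ISO date are treated as log entries (assumed format);
# this mirrors the grep -P '^[0-9]{4}-' step of the shell pipeline.
ENTRY = re.compile(r"^(\d{4}-\d{2}-\d{2})")
# A class name ending in "Exception", without its namespace or path prefix.
EXCEPTION_NAME = re.compile(r"[A-Za-z_]*Exception")


def count_exceptions(lines, start, end):
    """Count exception names on dated lines where start <= date <= end.

    `start` and `end` are 'YYYY-MM-DD' strings; ISO dates sort correctly
    when compared as plain strings.
    """
    counts = Counter()
    for line in lines:
        match = ENTRY.match(line)
        if not match or not (start <= match.group(1) <= end):
            continue  # undated continuation line, or outside the range
        counts.update(EXCEPTION_NAME.findall(line))
    return counts
```

Usage would be along the lines of count_exceptions(open("old-wsexport-error-log.txt"), "2020-01-12", "2020-02-12").most_common().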

Hey, @Samwilson, I added two more points to this ticket, based on some data that @Prtksxna was interested in us collecting (let me know if that's okay!). I added:

  • Do we know which file formats are downloaded most often right now from the export tool?
  • Do we know what % of traffic to WS is from mobile?

If you think it doesn't fit into the ticket, I can change things. Thanks!

Hey, @Samwilson! Just pinging you about the fact that Viticulum has collected some information on Wikisource download times, which may be interesting to look into/analyze, as well: https://meta.wikimedia.org/wiki/User:Viticulum/WSExport-Export_Time

Total number of ebook exports in the last 30 days, 60 days, and 90 days:

select `format`,
  sum( if(`time` between date_sub(now(), interval 30 day) and now(), 1, 0) ) as 30day,
  sum( if(`time` between date_sub(now(), interval 60 day) and now(), 1, 0) ) as 60day,
  sum( if(`time` between date_sub(now(), interval 90 day) and now(), 1, 0) ) as 90day
from books_generated
group by `format`
order by 30day desc, 60day desc, 90day desc

Format		30day	60day	90day
pdf-a5		24046	52455	82496
epub		23695	49797	80318
mobi		9366	25823	36949
epub-3		6450	15536	22509
pdf-a4		4791	9765	16142
txt		2232	4172	6874
rtf		985	2748	4408
epub-2		17	68	100
htmlz		11	22	39
odt		9	29	46
pdf-letter	8	11	32
pdf-a6		3	7	11
atom		0	0	0
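To try the windowed counts without access to the production database, the query can be approximated against SQLite. This is a sketch under assumptions: it uses SQLite date functions instead of MySQL's date_sub, takes the reference time as a parameter so results are repeatable, and assumes a books_generated table with text `time` values in 'YYYY-MM-DD HH:MM:SS' form. The function name is illustrative only.

```python
import sqlite3


def export_counts(conn, now):
    """Per-format export counts for the 30, 60 and 90 days before `now`.

    `now` is a 'YYYY-MM-DD HH:MM:SS' string. In SQLite a comparison yields
    0 or 1, so summing it counts the matching rows.
    """
    sql = """
        select format,
          sum(time >= datetime(:now, '-30 days')) as d30,
          sum(time >= datetime(:now, '-60 days')) as d60,
          sum(time >= datetime(:now, '-90 days')) as d90
        from books_generated
        where time <= :now
        group by format
        order by d30 desc, d60 desc, d90 desc
    """
    return conn.execute(sql, {"now": now}).fetchall()
```

Each returned row is a (format, d30, d60, d90) tuple, in the same order as the table above.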

The totals of each of the main formats (bundling the pdfs and epubs together):

select date_format(`time`, '%Y-%m') as month,
  sum(if(format like 'pdf%', 1,0)) as pdf,
  sum(if(format like 'epub%', 1,0)) as epub,
  sum(if(format like 'mobi%', 1,0)) as mobi,
  sum(if(format not like 'mobi%', if(format not like 'epub%', if(format not like 'pdf%', 1,0),0),0)) as other
from books_generated group by date_format(`time`, '%Y-%m')

month	pdf	epub	mobi	other
2020-01	66457	51170	19383	13709
2020-02	34833	47599	13746	6527
2020-03	43684	227100	12828	6096
2020-04	48925	51125	12705	7201
2020-05	41800	44736	12274	7818
2020-06	36226	37222	15228	3429
2020-07	30891	32252	13836	4278
2020-08	9909	10885	2076	636

chart2.png (394×1 px, 33 KB)

There have been a bit under a million books exported so far this year.

Some of the logged data is not accurate, due to T242760#5874616.

Current uptime stats for the last 24 hours, 30 days, 60 days, and 90 days:

Our stats are at https://stats.uptimerobot.com/BN16RUOP5/782558466 but they're not necessarily very accurate because they only check whether the tool is online, and not whether it can actually export ebooks. There have been a few occasions when exporting has failed (e.g. from a database connectivity issue) but the tool has not shown as offline.

The UptimeRobot stats since March 1st this year are as follows:

2020-08-01 21:39:06	Connection Timeout	0 hrs, 8 mins
2020-07-22 18:22:01	Connection Timeout	0 hrs, 8 mins
2020-07-21 00:46:47	Connection Timeout	0 hrs, 3 mins
2020-07-20 21:40:45	Connection Timeout	0 hrs, 13 mins
2020-07-15 08:49:33	Connection Timeout	0 hrs, 3 mins
2020-07-12 18:08:25	Connection Timeout	0 hrs, 3 mins

Full stats are here:

We only keep access logs for two weeks. At the moment they go back to August 3 and contain 177,322 entries.

I loaded them into AWStats, and got the following numbers...

To get an idea of desktop vs mobile, we can look at OS info from user agents:

 	OS		Pages	Percent	Hits	Percent
	Linux		136,127	76.7 %	136,138	76.7 %
	Windows		24,708	13.9 %	24,712	13.9 %
	Android		9,827	5.5 %	9,827	5.5 %
	Macintosh	4,266	2.4 %	4,266	2.4 %
	iOS		1,765	0.9 %	1,765	0.9 %
	Unknown		465	0.2 %	467	0.2 %
	Unknown Unix 	74	0 %	74	0 %
	Java Mobile	53	0 %	53	0 %
	OS/2		19	0 %	19	0 %
	BSD		3	0 %	3	0 %

Or look at referers from wikis; the ones with ".m." in the URL are mobile:

Total: 811 different pages-url	Pages	Percent	Hits	Percent
https://fr.wikisource.org      12,176	32.2 % 12,176	32.2 %
https://pl.wikisource.org	3,270	8.6 %	3,270	8.6 %
https://pl.m.wikisource.org	2,471	6.5 %	2,471	6.5 %
https://en.wikisource.org	2,153	5.7 %	2,153	5.7 %
https://en.m.wikisource.org	1,656	4.3 %	1,656	4.3 %
		/tool/book.php	1,429	3.7 %	1,429	3.7 %
https://es.wikisource.org	1,401	3.7 %	1,401	3.7 %
https://it.wikisource.org	1,228	3.2 %	1,228	3.2 %
https://de.wikisource.org	1,101	2.9 %	1,101	2.9 %
https://ta.m.wikisource.org	  891	2.3 %	  891	2.3 %
https://it.m.wikisource.org	  841	2.2 %	  841	2.2 %
https://zh.wikisource.org	  749	1.9 %	  749	1.9 %
https://es.m.wikisource.org	  671	1.7 %	  671	1.7 %
https://hy.m.wikisource.org	  650	1.7 %	  650	1.7 %
https://ro.m.wikisource.org	  632	1.6 %	  632	1.6 %
https://la.wikisource.org	  627	1.6 %	  627	1.6 %
https://hy.wikisource.org	  515	1.3 %	  515	1.3 %
https://ta.wikisource.org	  473	1.2 %	  473	1.2 %
https://fr.m.wikisource.org	  468	1.2 %	  468	1.2 %
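The ".m." rule can be turned into a single mobile share figure. The sketch below copies the rows listed above (top referrers only, so it is not the full 811-row report) and skips any row that is not a wikisource.org host, such as /tool/book.php. The names are illustrative.

```python
from urllib.parse import urlsplit

# Hits per referrer, copied from the table above (top rows only).
TOP_REFERRERS = {
    "https://fr.wikisource.org": 12176,
    "https://pl.wikisource.org": 3270,
    "https://pl.m.wikisource.org": 2471,
    "https://en.wikisource.org": 2153,
    "https://en.m.wikisource.org": 1656,
    "/tool/book.php": 1429,
    "https://es.wikisource.org": 1401,
    "https://it.wikisource.org": 1228,
    "https://de.wikisource.org": 1101,
    "https://ta.m.wikisource.org": 891,
    "https://it.m.wikisource.org": 841,
    "https://zh.wikisource.org": 749,
    "https://es.m.wikisource.org": 671,
    "https://hy.m.wikisource.org": 650,
    "https://ro.m.wikisource.org": 632,
    "https://la.wikisource.org": 627,
    "https://hy.wikisource.org": 515,
    "https://ta.wikisource.org": 473,
    "https://fr.m.wikisource.org": 468,
}


def mobile_share(hits_by_referrer):
    """Fraction of wiki-referred hits whose referrer host contains '.m.'."""
    total = mobile = 0
    for url, hits in hits_by_referrer.items():
        host = urlsplit(url).hostname
        if not host or not host.endswith(".wikisource.org"):
            continue  # not a wiki referrer, e.g. the tool's own pages
        total += hits
        if ".m." in host:
            mobile += hits
    return mobile / total if total else 0.0
```

For the rows listed here, mobile_share(TOP_REFERRERS) comes to roughly 26% (8,280 of 31,973 hits), noticeably higher than the Android and iOS share in the OS table, which also includes requests that carry no wiki referrer.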

To give us a baseline for epub errors, I made a script to look at random ebooks. For example, taking 5 random works from 15 random Wikisources gave the following errors with epubcheck:

CSS-008 -- 5 -- CSS-008	ERROR	An error occurred while parsing the CSS: %1$s.	
RSC-005 -- 68 -- RSC-005	ERROR	Error while parsing file '%1$s'.	
CSS-007 -- 253 -- CSS-007	INFO	Font-face reference %1$s refers to non-standard font type %2$s.	
OPF-053 -- 3 -- OPF-053	WARNING	Date value '%1$s' does not follow recommended syntax as per http://www.w3.org/TR/NOTE-datetime:%2$s.	
RSC-012 -- 2 -- RSC-012	ERROR	Fragment identifier is not defined.	
PKG-003 -- 3 -- PKG-003	ERROR	Unable to read EPUB file header.  This is likely a corrupted EPUB file.	
PKG-008 -- 3 -- PKG-008	FATAL	Unable to read file '%1$s'.

These generic messages aren't very informative on their own, but the message IDs are good for grouping errors into classes. When we want to look into the actual errors, we can get the full details.
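Grouping by message ID could look something like the sketch below. It is not the script used for the numbers above; it assumes epubcheck's plain-text output, where each message line starts with the severity followed by the ID in parentheses, for example "ERROR(RSC-005): ...". The function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape: SEVERITY(ID): location: message
MESSAGE = re.compile(r"^(FATAL|ERROR|WARNING|INFO|USAGE)\(([A-Z]{3}-\d{3}[a-z]?)\)")


def tally_messages(output):
    """Count epubcheck messages per (ID, severity) in one run's text output."""
    counts = Counter()
    for line in output.splitlines():
        match = MESSAGE.match(line.strip())
        if match:
            counts[(match.group(2), match.group(1))] += 1
    return counts
```

Summing the Counters from many books (Counter supports +) gives per-class totals like the ones listed above.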

I can re-run this with larger numbers, if we think that'd be more useful.

Error counts from the last 14 days are as follows:

[other] => 250
[ProcessTimedOutException] => 21637
[ProcessFailedException] => 805
[ServerException] => 143
[WSExportInvalidArgumentException] => 4
[HttpException] => 168

I think that's about the last of the data for this ticket. To summarise the above comments:

  • Total number of ebook exports in the last 30 days, 60 days, and 90 days
  • Total number of WSExport connection timeouts in the past month, date of timeout, and duration of timeout

See T256018#6378892

  • Current uptime stats for the last 24 hours, 30 days, 60 days, and 90 days
  • Do we know which file formats are downloaded most often right now from the export tool?

See T256018#6387718

  • Do we know what % of traffic to WS is from mobile?

See T256018#6387750

  • Determine how we can collect logged data on ebook export errors (for example, can this be done in logstash), which may include: Linting errors

See T256018#6388041

  • Determine the number of the errors for the last 24 hours, 30 days, 60 days, and 90 days, if possible?

See above.

  • Can we generate data on any changes that may have occurred in uptime and reliability from before and after the move to VPS?

See T256018#6292268 for some error counts from pre-VPS, and above for recent ones.

  • General question: Is UptimeRobot giving us a complete picture of uptime, or is anything missing? Just want to check in about this.

No, it's not. We could look at setting up a periodic (hourly? daily?) export of a random ebook, to check that the tool is not only online but also operational.
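A periodic check of that kind could be sketched roughly as follows. The base URL and the lang/page/format query parameters are assumptions for illustration, as are the function names; choosing a random page and scheduling the check are left to the caller. The check relies on an EPUB being a ZIP archive whose "mimetype" entry contains application/epub+zip.

```python
import io
import zipfile
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; adjust to wherever the export script actually lives.
BASE_URL = "https://wsexport.wmflabs.org/tool/book.php"


def looks_like_epub(data):
    """True if `data` is a ZIP archive whose mimetype entry marks it as EPUB."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.read("mimetype").strip() == b"application/epub+zip"
    except (zipfile.BadZipFile, KeyError):
        return False  # not a ZIP at all, or no mimetype entry


def export_works(lang, page, fetch=None, timeout=120):
    """Request one EPUB export and report whether a plausible file came back.

    `fetch` takes a URL and returns bytes; it can be swapped out in tests.
    """
    query = urlencode({"lang": lang, "page": page, "format": "epub"})
    url = BASE_URL + "?" + query
    if fetch is None:
        def fetch(target):
            with urlopen(target, timeout=timeout) as response:
                return response.read()
    try:
        return looks_like_epub(fetch(url))
    except OSError:
        return False  # covers URLError, HTTPError and timeouts
```

A cron job could call export_works() for a known page and alert when it returns False, which would catch the cases where the site is up but exports fail.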

In general, I think there are a few things we could do to give us better visibility of how the tool is performing:

  • We already log every export with time, format, language, and title. We could add the user-agent and the referring URL to this data, to get a better idea of what devices people are using and what sites they're coming from. This may constitute personal information though, so maybe it'd be better to avoid it.
  • Keep error logs for longer than 14 days. We have a system for SVGTranslate that sends an email on every exception. That might get pretty annoying, but it might be useful to at least look at the logs sometimes and try to fix any fixable errors. Lots of this will change when we do the job queue work, but just increasing the log time is pretty easy and shouldn't result in much extra disk space used or anything.
  • Some of the linting errors are fixable in templates etc. on wiki, but we could do more to make them visible to editors.
  • The current stats page isn't very useful, and could quite easily be improved with, for example, a chart of format vs time for a given language. We should open tickets for adding the most commonly wanted statistics, so they can be retrieved at any time.

@Samwilson What do you make of the PDF downloads? I thought the current PDF link goes to Electron, and there is no way to get a PDF except by using the tool directly. Could these be coming from French and Bengali?

No, some Wikisources have direct links to WSExport PDFs as well as Electron. It looks like French and Polish have the highest PDF download counts:

lang	30day	60day	90day
fr	7048	24532	46628
pl	1292	5244	9543
es	1044	2969	4384
ta	879	2637	4201
it	821	2687	5295
hy	429	1133	1848
uk	365	1512	2654

select `lang`,
  sum( if(`time` between date_sub(now(), interval 30 day) and now(), 1, 0) ) as 30day,
  sum( if(`time` between date_sub(now(), interval 60 day) and now(), 1, 0) ) as 60day,
  sum( if(`time` between date_sub(now(), interval 90 day) and now(), 1, 0) ) as 90day
from books_generated
where format like 'pdf%'
group by `lang`
order by 30day desc, 60day desc, 90day desc

I think T255790 will get rid of the ElectronPDF link.

This investigation has been completed, and the potential work for installing a stats front will be examined in T261480. We will refer to this ticket and its data over the course of our ebook export improvement project. However, it is appropriate to close the ticket, as the work is complete.