
Wikisource: supporting schema data for ebook exports
Open, Needs Triage, Public


Background: We don't have the schema data (stored in MariaDB or another database used for queries) to determine the product metrics for Wikisource ebook exports. In other words, we can't easily determine how many books are being exported by Wikisource users per day, month, or year, or spot trends. This information could be useful, since it would let us better measure our impact. Do we want to add it, and can we add it in a manageable scope? This is something to discuss during estimation, so this ticket is a reminder of that discussion.

Acceptance Criteria:

  • Bring schema data for ebook export downloads into MariaDB, so we can chart ebook export downloads over a period of time (per day, per week, etc.)

Event Timeline

I'm not sure I understand what's missing here. At the moment we store the date/time, language, title, format, and (soon, T267079) generation duration for every ebook exported. This gives us totals per day, per month, etc., and things like the 'recently popular' list. It sounds like we also want to be able to look at per-user statistics, is that right?
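As an aside, the kind of per-day aggregation described above can be sketched against a hypothetical log table. This is only an illustration: the table and column names (`books_generated`, `time`, `lang`, `title`, `format`) are assumptions based on the fields listed in the comment, not the tool's actual schema, and SQLite stands in for MariaDB here.

```python
import sqlite3

# Hypothetical log table modelled on the fields mentioned above
# (names are assumptions, not the real wsexport schema).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE books_generated (
        time   TEXT,  -- when the export ran
        lang   TEXT,  -- Wikisource language code
        title  TEXT,  -- work exported
        format TEXT   -- epub, pdf, etc.
    )
""")
conn.executemany(
    "INSERT INTO books_generated VALUES (?, ?, ?, ?)",
    [
        ("2020-11-01 09:15:00", "en", "Dracula", "epub"),
        ("2020-11-01 17:40:00", "fr", "Candide", "pdf"),
        ("2020-11-02 08:05:00", "en", "Dracula", "epub"),
    ],
)

# Exports per day -- the same GROUP BY a Superset chart would issue.
rows = conn.execute("""
    SELECT date(time) AS day, COUNT(*) AS exports
    FROM books_generated
    GROUP BY day
    ORDER BY day
""").fetchall()
print(rows)  # [('2020-11-01', 2), ('2020-11-02', 1)]
```

The same `GROUP BY` works unchanged per week or per month by swapping the `date(time)` bucket for a coarser one.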

@Samwilson Perhaps we already do have this data available! I wrote this placeholder ticket after talking with Jennifer: she found she was able to create charts for Wikisource data, but she hadn't been able to create one for Wikisource downloads yet. So the purpose of this ticket was to raise the question of where and how to get this data, so that @jwang could create such a Superset chart of ebook exports. As for per-user statistics, perhaps they could be useful? What do you think would be the main reason or use case for getting this data at a per-user level? To see if certain users are increasing their activity over time?

I think the value of tracking users would be smaller than the trouble of storing personal information. Actually, it's probably moot because we're not allowed to store usernames on Toolforge.

I think I just misread the task description about "users per day" anyway! :)

@Samwilson, I am curious what the pros and cons are, from a developer's perspective, of storing data in a user-created database on Toolforge, compared with the other options, like the product MariaDB or the EventLogging database?

From the point of view of data analysis and visualization, the latter two are accessible to dashboard tools like Superset and Jupyter Notebook. With those, we are able to build a dashboard or automated report to track the impact of this ebook project in the long term, just like the dashboards we have for pageviews, edits, and editors on Wikisource, mentioned in the ticket:

Wikisource: Create dashboard on Wikisource activity:

  • Total number of pageviews
  • Total number of edits
  • Total number of active editors

A user-created database on Toolforge, on the other hand, is not accessible to Superset or Jupyter Notebook as far as I know, and cannot be joined with the tables in the Wikisource product database. But usually benefits come with a cost. ^_^ I don't know the effort behind it; I just want to learn your point of view. And definitely, the product team makes the call on it.

It'd certainly be great if we could use existing analysis tools to look at wsexport data!

I'm not sure it's possible (or permitted) for a Toolforge tool to write to a production database, though. I think to do that we'd have to move the tool to production, and that's a much bigger task.

Maybe it'd be possible to have a separate Superset installation for Toolforge? (I've no idea what that would take to get going.)

@Samwilson, totally understand. Thanks for sharing.

This would be one way to make it easier to run arbitrary queries against the wsexport database: T151158: Support queries against Quarry's own database and ToolsDB

When we switched to using MariaDB to store the logs (instead of SQLite), we thought it'd immediately be possible to use Quarry, but we were wrong.