
Record domains on which we get Zotero misses in production so we can identify good places to prioritise writing new translators
Open, Medium, Public · 40 Estimated Story Points

Event Timeline

Jdforrester-WMF raised the priority of this task from to Needs Triage.
Jdforrester-WMF updated the task description.
Jdforrester-WMF added a subscriber: Jdforrester-WMF.
Restricted Application added a subscriber: Aklapper. · Jul 18 2015, 1:08 AM

Which project is this about, or who should this be assigned to?

Krenair added a subscriber: Krenair.

Zotero... Citoid, I think?

Jdforrester-WMF set Security to None.
Jdforrester-WMF added subscribers: Mvolz, mobrovac.

For now, there is the Citoid dashboard, which gives a basic count of hits vs. misses. This doesn't help us identify the missing translators, though.

Mvolz added a comment. · Jul 24 2015, 1:00 PM

It's not the exact number of misses, though, because for each request we try Zotero twice: a successful Zotero request will include zero or one 501 error; an unsuccessful one, two 501 errors.
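To make the ambiguity concrete: each miss contributes exactly two 501s, while each success contributes zero or one, so from the raw 501 count you can only bound the number of misses, not recover it. A minimal sketch of that bound (hypothetical helper, not Citoid code):

```typescript
// Given R total requests and T observed 501 errors, each miss produces
// exactly two 501s and each success produces zero or one, so the true
// miss count m only satisfies: max(0, T - R) <= m <= floor(T / 2).
function missBounds(totalRequests: number, total501s: number): [number, number] {
  const lower = Math.max(0, total501s - totalRequests);
  const upper = Math.floor(total501s / 2);
  return [lower, upper];
}

// e.g. 100 requests producing 60 501s: anywhere from 0 to 30 were misses.
console.log(missBounds(100, 60)); // [0, 30]
```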

Is this something that we could report using statsd? We could simply report all the domains which end up being scraped as stats. I'm not sure whether this is what statsd is for, though, as there will obviously be a lot of unique domains. Would it be better to simply analyse the logs for this info? (If so, this task is blocked by T102986, which prevents logs from easily being analysed.)

I've had requests in the past to report similar stats (i.e. most-used DOIs) or to have an RSS feed or similar of all generated citations (like we do with edits on the Wikipedias), but I wasn't sure what the best way to do this was.
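For illustration, per-domain statsd reporting along the lines floated above might look like the sketch below, assuming a standard Node StatsD client (hot-shots here); the metric prefix and function name are assumptions, not Citoid's actual instrumentation:

```typescript
// A minimal sketch, not Citoid's actual code: report each Zotero miss as a
// per-domain statsd counter via the hot-shots StatsD client.
import { StatsD } from 'hot-shots';

const statsd = new StatsD({ prefix: 'citoid.zotero.' });

function reportZoteroMiss(url: string): void {
  // Dots are namespace separators in statsd metric names, so flatten the
  // domain first, e.g. "www.example.org" -> "www_example_org".
  const domain = new URL(url).hostname.replace(/\./g, '_');
  statsd.increment(`miss.${domain}`);
}

reportZoteroMiss('https://www.example.org/some/article');
// increments the counter "citoid.zotero.miss.www_example_org"
```

Note, though, that every unique domain becomes its own metric here, which is exactly the cardinality concern raised above; analysing the logs instead would sidestep that.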

> It's not the exact number of misses, though, because for each request we try Zotero twice: a successful Zotero request will include zero or one 501 error; an unsuccessful one, two 501 errors.

Yup, hence my "for now". Even if it were the exact number, it would be quantitative, not qualitative.

> Is this something that we could report using statsd? We could simply report all the domains which end up being scraped as stats. I'm not sure whether this is what statsd is for, though, as there will obviously be a lot of unique domains. Would it be better to simply analyse the logs for this info? (If so, this task is blocked by T102986, which prevents logs from easily being analysed.)

> I've had requests in the past to report similar stats (i.e. most-used DOIs) or to have an RSS feed or similar of all generated citations (like we do with edits on the Wikipedias), but I wasn't sure what the best way to do this was.

Citoid logs are structured and saved in ElasticSearch, which is an indexing service. We should thus be able to get all that info with (rather simple) API calls. For that, we need to ensure that:

  • ES actually indexes the fields we are interested in (request_id, url/search, doi); AFAIK that's not the case right now.
  • All Citoid logs get into ElasticSearch. That's currently not the case: request logs are emitted at the trace level, while only logs with level warn or higher are sent to Logstash/ES. My concern here is that storing all of the logs in ElasticSearch might be too noisy and increase the cluster's load.

Alternatively, we might want to use a different log name for such cases (e.g. citoid-domain, citoid-doi, etc.), emitted once a request has been handled and we know whether it has been successfully translated/scraped or not.
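Once the domain and the translation outcome are indexed, the "rather simple" API call could be a terms aggregation along these lines; the index name and the domain/zotero_translated field names are assumptions about the eventual log schema:

```typescript
// A sketch only: count Zotero misses per domain with an ES terms aggregation.
const query = {
  size: 0, // we only want the aggregation buckets, not individual log lines
  query: { term: { zotero_translated: false } }, // misses only
  aggs: {
    top_missed_domains: {
      terms: { field: 'domain', size: 50 }, // the 50 most-missed domains
    },
  },
};

const res = await fetch('http://localhost:9200/logstash-citoid-*/_search', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(query),
});
const body = await res.json();
console.log(body.aggregations.top_missed_domains.buckets);
// -> [{ key: 'example.org', doc_count: 1234 }, ...]
```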

Mvolz moved this task from Backlog to Production on the Citoid board. · Jul 26 2015, 5:16 PM
Jdforrester-WMF renamed this task from Record Zotero misses in production so we can identify good places to prioritise writing new translators to Record domains on which we get Zotero misses in production so we can identify good places to prioritise writing new translators. · Jul 31 2015, 11:33 PM
Jdforrester-WMF triaged this task as Medium priority.
Mvolz added a comment. · Aug 4 2015, 3:42 PM

Possibly we could just use the information in T96927 and cross-reference that with the existing translators?