The documentation for citoid is rather bare-bones, amounting to little more than "we have a service called citoid", and it is nowhere near what is needed to track down problems with the live service. I have spent several days trying to figure out what works and what doesn't, and have come away with only a marginally better idea.
There is no clear information on how and why some metadata parses and some doesn't. I assume that these translators are supposed to work, but they only do so sometimes. To be sure that I in fact had a valid page, I wrote a server that does no caching at all and also tells everything downstream not to do any caching. Then I used the turtle_movie.html test file and checked the response. What I got back was this
[ { "url":"https://my.own.web.server", "itemType":"webpage", "language":"en-US", "title":"Turtles of the Jungle", "abstractNote":"A 2008 film about jungle turtles.", "websiteTitle":"Awesome Turtle Movies Website", "date":"2012-02-04", "accessDate":"2016-04-25", "author":[ ["","http://www.example.com/turtlelvr"] ] } ]
Clearly some fields are missing, and the itemType is "webpage". Author is also broken, but hey, it's a test file.
Getting a bit frustrated, I found a page at Time, which I copied locally. I chose this page because citoid said it was something other than a "webpage". Fetching it from Time's own server I got this
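For anyone who wants to reproduce the setup, a no-cache test server of the kind described above could look roughly like this. It is a minimal sketch using only Python's standard library, not the server actually used for these tests; the names NoCacheHandler, NO_CACHE_HEADERS and make_server are made up for illustration, and it assumes the test file (e.g. turtle_movie.html) sits in the current directory.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# Response headers asking browsers, proxies and CDNs not to store or reuse the page.
NO_CACHE_HEADERS = {
    "Cache-Control": "no-store, no-cache, must-revalidate, max-age=0",
    "Pragma": "no-cache",
    "Expires": "0",
}


class NoCacheHandler(SimpleHTTPRequestHandler):
    """Serve files from the current directory with caching disabled (hypothetical name)."""

    def end_headers(self):
        # Attach the no-cache headers to every response before the header block is closed.
        for name, value in NO_CACHE_HEADERS.items():
            self.send_header(name, value)
        super().end_headers()

    def send_head(self):
        # Drop conditional-request headers so the server always sends the full page
        # instead of answering "304 Not Modified".
        for name in ("If-Modified-Since", "If-None-Match"):
            if name in self.headers:
                del self.headers[name]
        return super().send_head()


def make_server(port=8000):
    """Build (but do not start) a server on the given port; the port is an assumption."""
    return ThreadingHTTPServer(("", port), NoCacheHandler)
```

Calling make_server().serve_forever() starts it. Whether citoid or anything between it and the server honours these headers is a separate question, which is part of what is hard to tell from the outside.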
[ { "itemType":"magazineArticle", "notes":[], "tags":[ {"tag":"al-qaeda","type":1}, {"tag":"crime","type":1}, {"tag":"hizballah","type":1}, {"tag":"narcotrafficking","type":1}, {"tag":"syria","type":1}, {"tag":"breaking bad","type":1}, {"tag":"captagon","type":1}, {"tag":"meth","type":1}, {"tag":"syria","type":1} ], "title":"Syria’s Breaking Bad: Are Amphetamines Funding the War?", "publicationTitle":"Time", "url":"http://world.time.com/2013/10/28/syrias-breaking-bad-are-amphetamines-funding-the-war/", "abstractNote":"A spike in the trafficking of the illegal drug Captagon, popular in the Middle East, is connected to Syria's brutal civil war", "libraryCatalog":"world.time.com", "accessDate":"2016-04-24", "ISSN":["0040-718X"], "shortTitle":"Syria’s Breaking Bad", "author":[ ["Aryn","Baker"] ] } ]
Note that itemType is now "magazineArticle".
My exact copy of the same page turned up this
[ { "url":"https://my.own.server", "itemType":"webpage", "websiteTitle":"TIME.com", "title":"Syria’s Breaking Bad: Are Amphetamines Funding the War? | TIME.com", "abstractNote":"A spike in the trafficking of the illegal drug Captagon, popular in the Middle East, is connected to Syria's brutal civil war", "accessDate":"2016-04-25" } ]
Note that the itemType is now "webpage". Something is clearly wrong, unless citoid sees something different from what I see. And yes, I have checked that the metadata that does parse and gets reported back from citoid in fact reflects my changes to the same page on my server.
We need a better description of the parsing process: how it chooses which translator to use and why it typically fails. As it is now, it is very difficult to tell external parties how they should change their pages so they will parse correctly. Right now we more or less only have a black box that does "something", and often not even that.
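To make the difference between the two responses easier to see, they can be compared field by field. The following is a small sketch in Python; diff_citations is a hypothetical helper, and the two dictionaries are trimmed copies of the responses quoted above (tags, url, accessDate and a few other fields are left out for brevity).

```python
def diff_citations(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs
    or exists on only one side; a missing field shows up as None."""
    fields = sorted(set(a) | set(b))
    return {f: (a.get(f), b.get(f)) for f in fields if a.get(f) != b.get(f)}


# Trimmed from the response for the page on world.time.com.
original = {
    "itemType": "magazineArticle",
    "title": "Syria’s Breaking Bad: Are Amphetamines Funding the War?",
    "publicationTitle": "Time",
    "abstractNote": "A spike in the trafficking of the illegal drug Captagon, "
                    "popular in the Middle East, is connected to Syria's brutal civil war",
    "ISSN": ["0040-718X"],
    "author": [["Aryn", "Baker"]],
}

# Trimmed from the response for the exact copy on my own server.
local_copy = {
    "itemType": "webpage",
    "websiteTitle": "TIME.com",
    "title": "Syria’s Breaking Bad: Are Amphetamines Funding the War? | TIME.com",
    "abstractNote": "A spike in the trafficking of the illegal drug Captagon, "
                    "popular in the Middle East, is connected to Syria's brutal civil war",
}

differences = diff_citations(original, local_copy)
for field, (from_original, from_copy) in differences.items():
    print(f"{field}: {from_original!r} -> {from_copy!r}")
```

Only abstractNote survives unchanged. The copy loses ISSN, author and publicationTitle, gains websiteTitle, and gets a different itemType and title. That pattern would be consistent with a different translator handling the copy, but that is a guess, since the selection process is not documented.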
Sorry for the tone, I'm a bit frustrated over this.