Documentation for citoid in general, but dwindled down info type handling
Closed, Resolved · Public · 0 Estimated Story Points

Assigned To

None

Authored By

	jeblad
	Apr 25 2016, 12:31 PM

Description

The documentation for citoid is a rather bare-bones "we have a service citoid", and it is nowhere near what's needed to track down problems with the live service. Trying to figure out what works and what doesn't, I have spent several days and come away with only a marginally better idea of what's working.

There is no clear information on how and why some metadata parses and some doesn't. I assume that these translators are supposed to work, but they only do so sometimes. So, to be sure that I in fact had a valid page, I wrote a server that does no caching at all and also tells everything downstream not to do any caching. Then I used the turtle_movie.html test file and checked the response. What I got back was this

[
	{
		"url":"https://my.own.web.server",
		"itemType":"webpage",
		"language":"en-US",
		"title":"Turtles of the Jungle",
		"abstractNote":"A 2008 film about jungle turtles.",
		"websiteTitle":"Awesome Turtle Movies Website",
		"date":"2012-02-04",
		"accessDate":"2016-04-25",
		"author":[
			["","http://www.example.com/turtlelvr"]
		]
	}
]

Clearly some fields are missing, and the itemType is "webpage". Author is also broken, but hey, it's a test file.

Starting to get a bit frustrated, I found a page at Time, which I copied locally. I chose this page because citoid said it was something other than a "webpage". Using their own server I got this

[
	{
		"itemType":"magazineArticle",
		"notes":[],
		"tags":[
			{"tag":"al-qaeda","type":1},
			{"tag":"crime","type":1},
			{"tag":"hizballah","type":1},
			{"tag":"narcotrafficking","type":1},
			{"tag":"syria","type":1},
			{"tag":"breaking bad","type":1},
			{"tag":"captagon","type":1},
			{"tag":"meth","type":1},
			{"tag":"syria","type":1}
		],
		"title":"Syria’s Breaking Bad: Are Amphetamines Funding the War?",
		"publicationTitle":"Time",
		"url":"http://world.time.com/2013/10/28/syrias-breaking-bad-are-amphetamines-funding-the-war/",
		"abstractNote":"A spike in the trafficking of the illegal drug Captagon, popular in the Middle East, is connected to Syria's brutal civil war",
		"libraryCatalog":"world.time.com",
		"accessDate":"2016-04-24",
		"ISSN":["0040-718X"],
		"shortTitle":"Syria’s Breaking Bad",
		"author":[
			["Aryn","Baker"]
		]
	}
]

Note that itemType is now "magazineArticle".

My exact copy of the same page turned up this

[
	{
		"url":"https://my.own.server",
		"itemType":"webpage",
		"websiteTitle":"TIME.com",
		"title":"Syria’s Breaking Bad: Are Amphetamines Funding the War? | TIME.com",
		"abstractNote":"A spike in the trafficking of the illegal drug Captagon, popular in the Middle East, is connected to Syria's brutal civil war",
		"accessDate":"2016-04-25"
	}
]

Note that the itemType is now "webpage". Something is clearly wrong, unless citoid sees something different from what I see. And yes, I have checked that the metadata that parses and gets reported back from citoid in fact reflects my changes to the same page on my server.
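For anyone repeating this comparison, a field-by-field diff of the two responses makes the differences easier to see. This is only a minimal sketch: `diff_citations` is a hypothetical helper, not part of citoid, and the sample values are abbreviated copies of the two responses quoted above.

```python
def diff_citations(a: dict, b: dict) -> dict:
    """Map each differing field to a (value_in_a, value_in_b) pair.

    A field that is absent on one side is reported as None on that side.
    """
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Abbreviated copies of the two responses quoted above.
original = {
    "itemType": "magazineArticle",
    "publicationTitle": "Time",
    "libraryCatalog": "world.time.com",
}
local_copy = {
    "itemType": "webpage",
    "websiteTitle": "TIME.com",
}

for field, (theirs, mine) in diff_citations(original, local_copy).items():
    print(f"{field}: {theirs!r} -> {mine!r}")
```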

We need a better description of the parsing process: how it chooses the translator to use and why it typically fails. As it is now, it is very difficult to tell external parties how they should change their pages so they will parse correctly; we more or less only have a black box that does "something", and often not even that.

Sorry for the tone, I'm a bit frustrated over this.

Related Objects

		Status	Subtype	Assigned	Task
		Resolved		None	T133531 Documentation for citoid in general, but dwindled down info type handling
		Resolved		Mvolz	T133540 Use DC metadata to detemine itemType and fix open graph type determination

Event Timeline

jeblad created this task. Apr 25 2016, 12:31 PM

Restricted Application added a project: VisualEditor. Apr 25 2016, 12:31 PM

Restricted Application added a subscriber: Aklapper.

jeblad updated the task description. Apr 25 2016, 12:38 PM

Just a guess: are you running a separate translator for Time.com? That could be the reason for some of the weirdness. It still does not explain why local newspapers give answers different from what I expect.

The reason for the discrepancy is that citoid uses Zotero, but if Zotero
isn't running it falls back on native citoid code. It looks like your local
version doesn't have Zotero working so you're only getting the native
parser results.

I'm testing against this endpoint https://citoid.wikimedia.org/api?format=mediawiki&search=

Ah yeah, so Zotero uses regex against the submitted URL to determine which
Zotero translator to use. If you're using a different URL with the same
metadata you'll get the metadata from the citoid translators instead.
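A rough sketch of the selection described above, assuming a registry of URL patterns. The pattern, the translator name, and the function names are all made up for illustration; they are not taken from Zotero or citoid.

```python
import re
from typing import Optional

# Hypothetical registry of (URL pattern, translator name) pairs.
# Real Zotero translators carry their own target patterns; this one is invented.
TRANSLATORS = [
    (re.compile(r"^https?://([^/]+\.)?time\.com/"), "Time.com"),
]


def pick_translator(url: str) -> Optional[str]:
    """Return the first translator whose pattern matches the URL, else None."""
    for pattern, name in TRANSLATORS:
        if pattern.search(url):
            return name
    return None


def metadata_source(url: str, zotero_available: bool = True) -> str:
    """Say which code path would produce the metadata for this URL."""
    if zotero_available and pick_translator(url) is not None:
        return "Zotero"
    # No site-specific translator matched (or Zotero is down):
    # fall back on the native citoid scraper.
    return "citoid"


print(metadata_source("http://world.time.com/2013/10/28/syrias-breaking-bad-are-amphetamines-funding-the-war/"))
print(metadata_source("https://my.own.server"))
```

Under these assumptions, the same HTML served from a different host never reaches the site-specific translator, which is consistent with the two different results reported above.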

I made a stripped-down HTML page like the following, but it still turns up as "webpage" (though without missing fields in this case).

<html lang="en">
	<head>
		<meta charset="utf-8">
		<title>Turtles are AWESOME!!1 | Awesome Turtles Website</title>
		
		<meta name="author" content="Turtle Lvr">
		<meta name="robots" content="we welcome our robot overlords"/>
		<meta name="description" content="Exposition on the awesomeness of turtles"/>
		<meta name="keywords" content="turtle, movie" />
		
		<link rel="canonical" href="http://example.com/turtles" />
		<link rel="publisher" href="https://mediawiki.org"/>
		<link rel="author" href="http://examples.com/turtlelvr"/>
		<link rel="shortlink" href="http://example.com/c" />
		
		<!--Dublin Core-->
		<meta name="DC.Title" content="Turtles of the Jungle" >
		<meta name="DC.Creator" content="http://www.example.com/turtlelvr" >
		<meta name="DC.Description" content="A 2008 film about jungle turtles." >
		<meta name="DC.Date" content="2012-02-04 12:00:00" >
		<meta name="DC.Type" content="Image.Moving" >
	</head>
	<body>
	</body>
</html>

It will always turn out to be webpage, because you can't tell it's a
magazineArticle from the metadata alone. You need to know it's from time.com
for that. This is part of the reason the zotero translators give much
better results, because the itemType can be hard coded based on the URL.

Webpage is the default itemType because technically, everything is a
webpage :)
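The defaulting described above can be sketched as a lookup with a fallback. The mapping below is a hypothetical illustration of how a metadata-based fallback could behave; it is not citoid's actual table.

```python
from typing import Optional

DEFAULT_ITEM_TYPE = "webpage"

# Hypothetical mapping from a declared Open Graph type to an itemType.
OG_TYPE_TO_ITEM_TYPE = {
    "video.movie": "videoRecording",
    "book": "book",
    "website": "webpage",
}


def resolve_item_type(og_type: Optional[str] = None) -> str:
    """Use the declared type when it is recognised, otherwise the default."""
    return OG_TYPE_TO_ITEM_TYPE.get(og_type, DEFAULT_ITEM_TYPE)


print(resolve_item_type("video.movie"))
print(resolve_item_type(None))
```

Nothing in generic metadata says "this is Time magazine", so a lookup like this can never produce "magazineArticle" for that page; only a rule keyed on the URL can.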

The page explicitly states which type of item it describes, but that information is ignored?

Hmm, I don't think we're using Dublin core types to determine type. Only
open graph. That could be filed as a separate bug.

Check the translator for Dublin Core; it clearly shows that the type is in use.

Checked OpenGraph, page still fails to give correct type.

This page

<html lang="en">

<head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# video: http://ogp.me/ns/video#">

<meta charset="utf-8">

<title>Turtles are AWESOME!!1 | Awesome Turtles Website</title>

<meta name="author" content="Turtle Lvr">
<meta name="robots" content="we welcome our robot overlords"/>
<meta name="description" content="Exposition on the awesomeness of turtles"/>
<meta name="keywords" content="turtle, movie" />

<link rel="canonical" href="http://example.com/turtles" />
<link rel="publisher" href="https://mediawiki.org"/>
<link rel="author" href="http://examples.com/turtlelvr"/>
<link rel="shortlink" href="http://example.com/c" />

<!--Open Graph-->

<meta property="og:locale" content="en_US" />
<meta property="og:type" content="video.movie" />
<meta property="og:title" content="Turtles of the Jungle" />
<meta property="og:description" content="A 2008 film about jungle turtles." />
<meta property="og:url" content="http://example.com" />
<meta property="og:site_name" content="Awesome Turtle Movies Website" />
<meta property="og:image" content="http://example.com/turtle.jpg" />
<meta property="og:image" content="http://example.com/shell.jpg" />

<meta property="video:tag" content="turtle" />
<meta property="video:tag" content="movie" />
<meta property="video:tag" content="awesome" />
<meta property="video:director" content="http://www.example.com/PhilTheTurtle" />
<meta property="video:actor" content="http://www.example.com/PatTheTurtle" />
<meta property="video:actor:role" content="Turtle #3" /> <!-- Currently ignored -->
<meta property="video:actor" content="http://www.example.com/SaminaTheTurtle" />
<meta property="video:writer" content="http://www.example.com/TinaTheTurtle" />
<meta property="video:release_date" content="2015-01-14T19:14:27+00:00" />
<meta property="video:duration"  content="1000000" />

</head>

<body>
</body>

</html>

Gives this result

[{
"url":"https://this.is.my.shit",
"itemType":"webpage",
"language":"en-US",
"title":"Turtles of the Jungle",
"abstractNote":"A 2008 film about jungle turtles.",
"websiteTitle":"Awesome Turtle Movies Website",
"accessDate":"2016-04-25",
"author":[["Turtle","Lvr"]]
}]

It is run through citoid.wikimedia.org
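To double-check what a page actually declares before blaming the service, the meta tags can be listed locally. This is a minimal sketch using only Python's standard library; `MetaCollector` and `collect_meta` are illustrative names, and this is not how citoid itself scrapes pages.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags keyed by their property= or name= attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content is not None:
            # Keep every value: tags such as og:image may repeat.
            self.meta.setdefault(key, []).append(content)


def collect_meta(html: str) -> dict:
    """Return {meta key: [content values]} for an HTML document."""
    collector = MetaCollector()
    collector.feed(html)
    collector.close()
    return collector.meta


page = """
<html><head>
<meta charset="utf-8">
<meta property="og:type" content="video.movie" />
<meta name="DC.Type" content="Image.Moving" >
</head><body></body></html>
"""
declared = collect_meta(page)
print(declared.get("og:type"), declared.get("DC.Type"))
```

If the declared type shows up here but the response still says "webpage", the problem is on the service side rather than in the page.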

Yeah, that's a bug; it should be used here but isn't:
https://github.com/wikimedia/citoid/blob/ff2c50b778d78539ca4be5670b09bb9b87ee4b36/lib/Scraper.js#L355

Mvolz created subtask T133540: Use DC metadata to detemine itemType and fix open graph type determination. Apr 25 2016, 1:54 PM

I wonder if this happens for all of the citoid translators. Will check...

Base data is picked up from turtle_movie.html
Type from OpenGraph fails
Type from DublinCore fails
Nothing additional is picked up from Twitter
Nothing additional is picked up from AL

Jdforrester-WMF moved this task from To Triage to Freezer on the VisualEditor board. Apr 26 2016, 7:11 PM

jeblad renamed this task from Documentation for citoid to Documentation for citoid in general, but dwindled down info type handling. May 12 2016, 1:50 AM

Elitre subscribed. May 12 2016, 12:12 PM

Jdforrester-WMF triaged this task as High priority. Jun 8 2016, 12:00 PM

Jdforrester-WMF added a project: Documentation.

Findings will be welcome at https://www.mediawiki.org/wiki/Citoid/Enabling_Citoid_on_your_wiki .

mobrovac closed subtask T133540: Use DC metadata to detemine itemType and fix open graph type determination as Resolved. Jun 23 2016, 2:44 PM

Some examples

NRK: Ny reality-serie (citoid) has a really strange author entry, but it is correct given the source. This type of error is pretty common. If the publisher is used, then clean out the same entries from author.
Aftenposten: Pinlig for Skåber (citoid) has site name both for Twitter and OGP, still does not use them.
Dagbladet: Her er katastrofetallene som vitner om Ramsay-fiasko (citoid) here something goes seriously wrong during parsing of author.
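The clean-up suggested for the NRK case could look roughly like this. It is a sketch only: `drop_publisher_from_authors` is a hypothetical helper, the author format is assumed to be the [first, last] pairs seen in the responses above, and the sample names are invented.

```python
def drop_publisher_from_authors(authors: list, publisher: str) -> list:
    """Remove author entries whose full name is just the publisher's name."""
    target = publisher.strip().casefold()
    return [
        name
        for name in authors
        # Join the non-empty name parts and compare case-insensitively.
        if " ".join(part for part in name if part).strip().casefold() != target
    ]


# Invented sample data: the site name has leaked into the author list.
authors = [["", "NRK"], ["Kari", "Nordmann"]]
print(drop_publisher_from_authors(authors, "NRK"))
```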

Wild guess, type handling is still broken? That is, the code does not fall back to use Twitter and OGP.

Another thing that might help is that responses now include the metadata source in the source field, so you can tell if any given response is coming from a zotero translator or not:

See:

https://citoid.wikimedia.org/api?format=mediawiki&search=example.com

vs.

https://citoid.wikimedia.org/api?format=mediawiki&search=http%3A%2F%2Flink.springer.com%2Fchapter%2F10.1007%2F11926078_68

(We do not support Twitter and AL currently in citoid.)
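A small sketch of how the source field could be checked from a script. The endpoint and query parameters are the ones used in this thread; the helper names are hypothetical, and the assumption that `source` holds a string or a list of strings such as "Zotero" should be verified against the API documentation.

```python
from urllib.parse import quote

API = "https://citoid.wikimedia.org/api"


def build_query_url(search: str, fmt: str = "mediawiki") -> str:
    """Build a query URL, percent-encoding the search term."""
    return f"{API}?format={fmt}&search={quote(search, safe='')}"


def sources(response: list) -> list:
    """Collect the 'source' values from every citation in a response.

    Assumes 'source' is either a string or a list of strings.
    """
    found = []
    for citation in response:
        value = citation.get("source", [])
        found.extend([value] if isinstance(value, str) else value)
    return found


def came_from_zotero(response: list) -> bool:
    """True if any citation names Zotero as a metadata source (assumed label)."""
    return any(s.casefold() == "zotero" for s in sources(response))


print(build_query_url("http://link.springer.com/chapter/10.1007/11926078_68"))
```

Fetching the URL and decoding the JSON is left out so the sketch stays free of network access; the decoded list can be passed straight to `came_from_zotero`.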

I've added a section here with a little more information, including docs about the new source field: https://www.mediawiki.org/wiki/Citoid/API#Response

Too much to hope for that this resolves this ticket? :D

Mvolz mentioned this in T147079: publicationTitle /blogTitle not present in response from citoid. Sep 30 2016, 4:27 PM

In T133531#2618242, @jeblad wrote:

Some examples

NRK: Ny reality-serie (citoid) has a really strange author entry, but it is correct given the source. This type of error is pretty common. If the publisher is used, then clean out the same entries from author.

Can't do much about that in citoid, since it's a problem with the underlying metadata. This kind of issue is best solved by creating a Zotero translator which can deal with issues specific to particular sites.

Aftenposten: Pinlig for Skåber (citoid) has site name both for Twitter and OGP, still does not use them.

There are at least two bugs encompassed in that one, created: T147079

Dagbladet: Her er katastrofetallene som vitner om Ramsay-fiasko (citoid) here something goes seriously wrong during parsing of author.

Wild guess, type handling is still broken? That is, the code does not fall back to use Twitter and OGP.

This is a problem with a Zotero translator, probably could be filed upstream.

I'm going to mark this as resolved, as the service now gives the source, which should make debugging much easier, and there is updated documentation on wikitech. Happy to re-open if it's not satisfactory, though.

Mvolz closed this task as Resolved. Oct 14 2016, 1:25 PM

Jdforrester-WMF moved this task from Freezer to External and Administrivia on the VisualEditor board. Oct 20 2016, 5:32 PM

Jdforrester-WMF set the point value for this task to 0.
