Add a bibtex format
Closed, Resolved, Public

Description

Semantic MediaWiki has a bibtex export format, provided by the Semantic Result Formats extension (https://www.semantic-mediawiki.org/wiki/Help:BibTeX_format). I think it would be nice to include such a format in Cargo too.

I have made a patch where I added 2 export formats, namely:

  • bibtex
  • bibtex export

The bibtex format prints the bibtex entries directly on the page (using HTML). The bibtex export format is intended for exporting the data: it creates a link to download the bibtex entries as a file, or to view them in a new tab.

An example query (in this case, for conference publications) is the following:

{{#cargo_query:
tables=conference_publications
|fields= bibtex_title=title, bibtex_author=author, conference=booktitle, start_date=date, doi, initial_page=initialpage, last_page=lastpage, CONCAT(city, ", ", country)=address
|order by=start_date DESC
|format=bibtex export
|default entry type=inproceedings
}}

As shown in the example, the queried fields must be named with specific field aliases. Most of the aliases are the standard BibTeX field names, but we also define some additional ones. The full list of fields accepted by the bibtex format is the following:

address, annote, author, booktitle, chapter, crossref, doi, edition, editor, howpublished, institution, journal, key, month, note, number, organization, pages, publisher, school, series, title, type, volume, year

A description of each field can be found on the Wikipedia page (https://en.wikipedia.org/wiki/BibTeX#Field_types).

In the case of the author and editor fields, a list of names (in bibtex format) separated by 'and' must be given. For these cases, if the type of the cargo data is a list of values, the code takes care of constructing the list with the 'and' separator. If the type of the cargo data is not a list, then the value is used directly.
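The list handling described above can be sketched as follows. This is an illustrative Python sketch, not the PHP code of the patch; the function name `bibtex_names` is hypothetical, and it assumes a list-typed Cargo value arrives as a Python list of strings.

```python
def bibtex_names(value):
    """Return a BibTeX name list for an author or editor field.

    Assumption: a list-typed Cargo field is passed in as a list/tuple
    of strings; any other value is treated as an already formatted string.
    """
    if isinstance(value, (list, tuple)):
        # Join the individual names with the ' and ' separator BibTeX expects,
        # skipping empty items.
        return " and ".join(name.strip() for name in value if name.strip())
    # Not a list: use the value directly, as described above.
    return value
```

For instance, `bibtex_names(["Castedo, Luis", "Jane Doe"])` gives `Castedo, Luis and Jane Doe`, while a plain string is returned unchanged.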

Additionally, the following special aliases are available:

  • bibtexkey: The name of the bibtex entry. If not specified or empty, the name of the entry will be generated from the authors, the title and the year data.
  • date: Date of the publication. This will be converted to the year and month fields in the bibtex entry.
  • entrytype: Type of the bibtex entry, e.g., article, book, booklet, etc. The full list of available types can be seen at https://en.wikipedia.org/wiki/BibTeX#Entry_types. However, note that the code does not check whether the field values are valid entry types; it just takes the values directly. If not specified or if the value is empty, the entry type will be set to the one specified in the default entry type option (see below for the description of this option). If default entry type is not specified, the entry type will be set to article.
  • initialpage and endpage: Indicate the initial and end pages, respectively. Can be used instead of the pages alias.
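As a rough illustration of how these special aliases map onto standard BibTeX fields, here is a minimal Python sketch. It is not the patch's PHP code: the function name `normalize_row` is hypothetical, the key names follow the alias list above, and it assumes that the date arrives as an ISO `YYYY-MM-DD` string.

```python
from datetime import date

# Three-letter month macros, in calendar order.
MONTH_MACROS = ["jan", "feb", "mar", "apr", "may", "jun",
                "jul", "aug", "sep", "oct", "nov", "dec"]


def normalize_row(row, default_entry_type="article"):
    """Resolve the special aliases of one result row (hypothetical helper)."""
    fields = dict(row)

    # entrytype: fall back to the default when missing or empty.
    # The value is not validated, mirroring the behaviour described above.
    entry_type = fields.pop("entrytype", "") or default_entry_type

    # date: split into the year and month fields (assumes an ISO date string).
    raw_date = fields.pop("date", "")
    if raw_date:
        parsed = date.fromisoformat(raw_date)
        fields["year"] = str(parsed.year)
        fields["month"] = MONTH_MACROS[parsed.month - 1]

    # initialpage / endpage: build a page range unless 'pages' was given.
    first = fields.pop("initialpage", "")
    last = fields.pop("endpage", "")
    if first and last and not fields.get("pages"):
        fields["pages"] = f"{first}--{last}"

    return {"entrytype": entry_type, **fields}
```

With sample values such as `{"date": "2017-08-28", "initialpage": "643", "endpage": "647"}` and a default entry type of inproceedings, this yields year 2017, month aug and pages 643--647.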

Additional allowed parameters for both formats are:

  • default entry type: The default type of the bibtex entries, e.g., article, book, booklet, etc. If not specified, the default will be set to article.
  • plain text: This must be 1 to output a plain text result (no HTML), and 0 to output a result in HTML format. For the bibtex format the default is 0 (HTML is created), and for the bibtex export format the default is 1 (plain text is created). Originally I had called this option no html, but I changed the name because its semantics and inner workings differ from those of the no html Cargo option.

The bibtex export format also allows the following parameters:

  • link text: Sets the text of the link (default is "View BibTeX", or the value at the page MediaWiki:cargo-viewbibtex)
  • export as file: Sets the way of exporting the data. If set to 1, the results will be downloaded as a file; if set to 0, the results will be shown in a new tab. The default is 0.
  • filename: Sets the name of the file that is downloaded (default is "results.bib")

For example, the query shown above would produce a list of entries like the following:

@inproceedings{dominguez2017throughput,
  title={Throughput-Based Performance Evaluation of 5G-Candidate Waveforms in High Speed Scenarios},
  author={Tom\'as Dom\'inguez-Bola{\~n}o  and  Jos\'e Rodr\'iguez-Pi{\~n}eiro  and  Jos\'e A. Garc\'ia-Naya  and  Castedo, Luis},
  address={Kos, Greece},
  booktitle={25th European Signal Processing Conference (EUSIPCO 2017)},
  doi={10.23919/EUSIPCO.2017.8081280},
  pages={643--647},
  year={2017},
  month=aug,
}

The key of the entry is generated from the authors, the title and the year data, since no bibtexkey field is provided in the query. The format of this key is the same as in the citations from Google Scholar, i.e., <Last name><Year><First word of title>. When generating the key, the code removes all non-alphabetic characters from the last name and from the first word of the title.
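A minimal Python sketch of this key scheme is shown below. It is only an approximation of the idea, not the patch's PHP code. The function names are hypothetical, and several details are assumptions: names are given as plain Unicode rather than LaTeX escapes, accents are stripped, and only the first run of letters of the surname and of the title is kept (chosen so that the example above reproduces the key dominguez2017throughput).

```python
import re
import unicodedata


def ascii_letter_runs(text):
    """Split text into runs of ASCII letters after stripping accents."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"[A-Za-z]+", stripped)


def make_key(first_author, year, title):
    """Build a <last name><year><first word of title> key (hypothetical helper)."""
    # Assumption: "Last, First" if a comma is present, otherwise "First Last".
    if "," in first_author:
        surname = first_author.split(",", 1)[0]
    else:
        parts = first_author.split()
        surname = parts[-1] if parts else ""

    surname_runs = ascii_letter_runs(surname)
    title_runs = ascii_letter_runs(title)
    last = surname_runs[0] if surname_runs else ""
    word = title_runs[0] if title_runs else ""
    return f"{last}{year}{word}".lower()
```

Under these assumptions, `make_key("Tomás Domínguez-Bolaño", "2017", "Throughput-Based Performance Evaluation")` returns `dominguez2017throughput`.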

Event Timeline

Change 511394 had a related patch set uploaded (by Tombolano; owner: Tombolano):
[mediawiki/extensions/Cargo@master] Added 'bibtex' display format and 'bibtex export' export format.

https://gerrit.wikimedia.org/r/511394

It would indeed be nice to have BibTeX support in Cargo! This is great!

Is there really a need for two different formats, though? As you note, in SMW it's only an export format. Is there a use case for displaying the information on the page?

Hi Yaron, I think you are right: there is no important use case for a display format, and the export format is enough. So I have uploaded a new patch that leaves only a bibtex format (which is now the export format).

The options of this format are the same but without the plain text option, which is not necessary anymore:

  • default entry type: The default type of the bibtex entries, e.g., article, book, booklet, etc. If not specified, the default will be set to article.
  • link text: Sets the text of the link (default is "View BibTeX", or the value at the page MediaWiki:cargo-viewbibtex)
  • export as file: Sets the way of exporting the data. If set to 1, the results will be downloaded as a file; if set to 0, the results will be shown in a new tab. The default is 0.
  • filename: Sets the name of the file that is downloaded (default is "results.bib")

I have tried several queries and I think everything is working fine.

Okay, great - this simplifies things. I have some more questions:

  • How useful is "export as file"? None of the other export formats have it: they either always lead to a file download (csv, excel) or never (json).
  • The values for the $titleExcludeWords array - is that some official list, or did you come up with it yourself? And how will this work for non-English wikis?
  • Similarly, with $monthStrings, should that be hardcoded in English? (If not, those values are already available for every language in MediaWiki.)

Also some comments:

  • There needs to be an addition to qqq.json for "View BibTeX" as well.
  • Overall, the code formatting looks great - the one exception I see is, when concatenating strings, there need to be spaces - so "$a . $b" instead of "$a.$b".
  • You should be the author of the format, not me. :)

Could the export as file be dropped, and filename be used only? i.e. if a filename is provided, it'll prompt for download, and if it's not then it'll display directly in the browser (as plain text)?

I don't want to derail this Phabricator page, but I've asked what I think may be a related question on the Cargo talk page: https://www.mediawiki.org/wiki/Extension_talk:Cargo#Text-file_output_display_format

Hi, thanks a lot for the questions and comments:

  • How useful is "export as file"? None of the other export formats have it: they either always lead to a file download (csv, excel) or never (json).

I thought that some people might want to download a file, so I made an option for that. But you are right, that option can be removed so that the results are only shown in a new tab. Anyway, if somebody wants to download the file, they just have to right-click on the link and choose "save link as..." in the context menu. Maybe the behaviour suggested by @Samwilson could be implemented, but I think that for simplicity showing the results in a new tab is enough.

  • The values for the $titleExcludeWords array - is that some official list, or did you come up with it yourself? And how will this work for non-English wikis?

Thank you for pointing that out, I hadn't thought about other languages. Actually, the language of the publication should be considered, not the language of the wiki. However, I think this is too complicated. My idea was to be able to generate a key in the same style as Google Scholar: <Last name><Year><First relevant word>. Google Scholar removes a few stop words (i.e., articles, prepositions, etc.), so I tried to do the same.

If we look at Semantic MediaWiki, the bibtex format generates the title part of the key by just concatenating the first letter of all the words of the title. For instance, the key of the example that I give in the first post is dominguez2017throughput, but with Semantic MediaWiki the key would be dominguez2017tpef5wihss. I think that the key in the Google Scholar format is saner, but it has the disadvantage that it is more likely to produce 'collisions', i.e., the same key could be generated for two different publications if the first author, the year, and the first word of the title are the same.

A simple language-agnostic algorithm to select a word of the title is to select the first word whose length is greater than a preset value. I think that this value could be 5 or 6, to obtain a medium-sized word. If no word of this length is found, the code would try again after subtracting one from the value, and so on until a word is found. This way, a sane key for the entry is obtained, and the algorithm does not depend on the language of the wiki or publication.
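The word selection just described can be sketched in a few lines of Python. This is only an illustration of the idea, not the PHP code of the patch; the function name `pick_title_word`, the default threshold of 5, and the choice to treat any run of letters as a word are assumptions.

```python
import re


def pick_title_word(title, min_length=5):
    """Pick the first title word longer than a threshold (hypothetical helper).

    The threshold starts at min_length and is lowered by one until
    some word qualifies.
    """
    # Assumption: a "word" is any run of letters (Unicode-aware),
    # so digits, hyphens and punctuation act as separators.
    words = re.findall(r"[^\W\d_]+", title)
    for threshold in range(min_length, -1, -1):
        for word in words:
            if len(word) > threshold:
                return word.lower()
    # No letters in the title at all.
    return ""
```

For the title of the example entry this selects `throughput`; for a short title such as "On War" the threshold drops until `war` qualifies.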

  • Similarly, with $monthStrings, should that be hardcoded in English? (If not, those values are already available for every language in MediaWiki.)

Yes. When you have, for example, month=nov, nov is not taken as a string but as a variable (or macro) that will be substituted by the actual month name. It is the bibliography style (or the user) that must define the values for these macros. The standard bibliography styles included with BibTeX do define the month strings for the three-letter macros (you can check the .bst files in https://ctan.org/tex-archive/biblio/bibtex/base). Therefore, the three-letter macro is recommended, although in practice you could use any string. A good explanation is given in this TeX Stack Exchange answer: https://tex.stackexchange.com/a/286422. Also, I have searched CTAN for bibliography styles for non-English languages, and all of them define the month strings for the English three-letter macros, so it seems to be a de facto standard.
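To make the macro point concrete, here is a small Python sketch of how an entry could be serialized so that the month is written as a bare macro while the other values are wrapped in braces, matching the example entry in the task description. The function name `render_entry` and its `macro_fields` parameter are hypothetical and do not come from the patch.

```python
def render_entry(entry_type, key, fields, macro_fields=("month",)):
    """Serialize one BibTeX entry (hypothetical helper)."""
    lines = [f"@{entry_type}{{{key},"]
    for name, value in fields.items():
        if name in macro_fields:
            # Bare macro: BibTeX substitutes it using the style's definition.
            lines.append(f"  {name}={value},")
        else:
            # Literal string value, wrapped in braces.
            lines.append(f"  {name}={{{value}}},")
    lines.append("}")
    return "\n".join(lines)
```

Calling it with `{"year": "2017", "month": "aug"}` writes `year={2017},` but `month=aug,`, so the bibliography style decides how August is printed.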


I have uploaded a new patch with the following changes:

  • I updated the qqq.json file
  • I added spaces when concatenating strings (e.g., $a . $b)
  • I removed the options of export as file and filename. Now the only options accepted by the format are default entry type and link text
  • I updated the algorithm to obtain the key for the entries as I explained above.

Regarding the question from @Jonathan3: actually, I don't know. I think that the template format generates HTML code, so maybe it is not so easy to modify it.

Change 511394 merged by jenkins-bot:
[mediawiki/extensions/Cargo@master] Added 'bibtex' export format.

https://gerrit.wikimedia.org/r/511394

Thanks for accepting this change!

Now the missing part is adding the documentation to the MediaWiki page of the extension. I think it is better to wait with that until the next version of Cargo is released.