Introduction
CiteULike is a wonderful service that helps researchers manage their bibliographies and keep track of what they are reading. The idea is similar to social bookmarking sites like Delicious or Reddit, except that it's for papers instead of web pages. Once you find an article that you are interested in reading, you can post that article to your CiteULike library, tag the article with descriptive tags, and even upload your own personal PDF so you can read the article when you're not surfing from a university IP. In addition to all this, the real hook is that CiteULike will extract all the bibliographic information about an article at the click of a button, provided that you can find the article in some fairly well-known database. Between me and my closest friends, this is the real reason that we use CiteULike. I hate crawling through articles later, trying to find a reference to something that I know I read 6 months ago in order to complete a bibliography. I've gotten to the point that I sometimes refuse to read a paper that I haven't entered into my CiteULike database.
CiteULike can extract the bibliographic information from many websites, but not all of them. Currently, CiteULike can't extract the bibliographic info from the following computer science websites:
- CiteSeerX
- IEEE computer society digital library (CSDL)
This really frustrates me. Often I'll find an extremely interesting article on one of these sites, but I'm afraid to make the commitment because I'm too lazy to extract the bibliographic information by hand and too scared to glean anyone else's idea without being able to cite it later.
Project proposal
For this weekly project, I am going to write as many bibliographic extractors as I can for CiteULike. CiteULike has tutorials and example code detailing how to write such an extractor. Learning the system should be as simple as:
- learn how a major portal's HTML is laid out
- write regular expressions to extract the relevant info
- figure out how to feed that information into CiteULike's servers.
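The first two steps above can be sketched in Python, one of the two plugin languages. This is only a sketch under stated assumptions: the HTML fragment, the class names, and the tab-separated output are all hypothetical, made up for illustration. A real portal's markup will differ, and the format CiteULike actually expects is described in its plugin documentation, not here.

```python
import re

# Hypothetical HTML fragment. A real portal's markup will look different,
# so the patterns below would have to be rewritten for each site.
SAMPLE_HTML = """
<html><head><title>An Example Paper</title></head>
<body>
<h1 class="title">An Example Paper</h1>
<span class="author">A. Author</span>
<span class="author">B. Writer</span>
<span class="year">2008</span>
</body></html>
"""

# One regular expression per field we want to pull out of the page.
TITLE_RE = re.compile(r'<h1 class="title">(.*?)</h1>', re.S)
AUTHOR_RE = re.compile(r'<span class="author">(.*?)</span>', re.S)
YEAR_RE = re.compile(r'<span class="year">(\d{4})</span>')


def extract_fields(html):
    """Return a dict holding whichever bibliographic fields were found."""
    fields = {}
    title = TITLE_RE.search(html)
    if title:
        fields["title"] = title.group(1).strip()
    authors = [a.strip() for a in AUTHOR_RE.findall(html)]
    if authors:
        fields["authors"] = authors
    year = YEAR_RE.search(html)
    if year:
        fields["year"] = year.group(1)
    return fields


def format_fields(fields):
    """Render the fields as tab-separated key/value lines.

    This layout is an assumption for illustration only, not the
    documented CiteULike plugin output format.
    """
    lines = []
    if "title" in fields:
        lines.append("title\t" + fields["title"])
    for author in fields.get("authors", []):
        lines.append("author\t" + author)
    if "year" in fields:
        lines.append("year\t" + fields["year"])
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_fields(extract_fields(SAMPLE_HTML)))
```

Regular expressions are brittle against markup changes, so a plugin written this way will probably need updating whenever the portal redesigns its pages.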
Build notes
I spent a good 8 hours this weekend finishing up this project. It took way longer than I anticipated, but that's my fault and not the plugin creators' fault. The plugin documentation was wonderful for this project, but there is a lot of it to read. Hopefully, you can use this page to build a plugin with minimal fuss.
First check out the source code from the subversion repository (svn co http://svn.citeulike.org/svn/ citeulike). Also, pull up the documentation, because you'll be needing it later. This documentation is very comprehensive, but you don't need to read all of it at first. Give it a scan for 10 minutes and move on to implementing the plugin.
I won't regurgitate the excellent documentation, but basically every plugin is made from a descr/plugin.tcl file and a perl/plugin.pl file (or a python/plugin.py file). The two commands you'll need to know are:
./driver.tcl parse URL
Asks each plugin in turn if it is interested in the URL. You'll use this to test your plugin.
./driver.tcl test plugin
Runs the unit tests for the plugin. You'll run this when you are finished.
Here's my quick start guide if anyone wants to quickly make a CiteULike plugin:
Conclusion
This build took me way longer than expected, but now that I have the process down I'm fairly sure I could knock out another plugin in a matter of hours. The hardest part of implementing this project was knowing where to go if something went wrong. I started by reading the documentation, which in retrospect was a mistake. I learned much more by copying and pasting another person's plugin and gutting it to serve my purposes.
My biggest stalls were when I ran out of ideas on how to fix my current problem. Hopefully by posting my build process, I'll make it easier for the next person who comes along.
I contributed CiteSeerX and CSDL to the repository, and they went live on 2008-07-14. The CiteSeerX plugin as I wrote it will only work on journal papers and conference proceedings. This is because CiteSeerX doesn't use BibTeX or RIS or another standard format, but instead has invented its own. I need to find a book chapter on CiteSeerX in order to properly parse book chapters, so if you find one let me know.
The CSDL plugin should work on basically anything, but it might have problems parsing Unicode characters in names. CSDL currently encodes the BibTeX in a JavaScript string. I need to use a Perl parser to un-HTML-ify the string in order to really make it bulletproof, but I was too lazy to look up the documentation.
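For anyone picking this up, here is a rough Python sketch of that un-HTML-ify step. It is not the code in the contributed plugin, and it rests on assumptions: the page fragment and the variable name bibtex are hypothetical, and it assumes the JavaScript string is double-quoted with JSON-compatible escapes, which may not hold for the real CSDL pages.

```python
import html
import json
import re

# Hypothetical page fragment. The variable name "bibtex" and the record
# itself are made up; the real CSDL markup may differ.
SAMPLE_PAGE = r'''
<script>
var bibtex = "@article{example2008,\n  author = {Jos&eacute; Garc&iacute;a},\n  title = {An Example &amp; More}\n}";
</script>
'''

# Capture a double-quoted JavaScript string literal, allowing backslash escapes.
JS_STRING_RE = re.compile(r'var\s+bibtex\s*=\s*("(?:[^"\\]|\\.)*")\s*;')


def extract_bibtex(page):
    """Return the decoded BibTeX text, or None if no string is found."""
    match = JS_STRING_RE.search(page)
    if match is None:
        return None
    # json.loads decodes escapes such as \n, \" and \uXXXX. This assumes the
    # literal is JSON-compatible, which is not true of every JavaScript string.
    raw = json.loads(match.group(1))
    # html.unescape turns entities such as &eacute; and &amp; into characters.
    return html.unescape(raw)


if __name__ == "__main__":
    print(extract_bibtex(SAMPLE_PAGE))
```

Decoding the JavaScript escapes first and the HTML entities second matters here, because an entity could itself be written with escaped characters inside the string.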