
CiteULikeExtensionProject

15 Jul 2008 - 01:25:41 by Rob Blake in Finished

 

Introduction

CiteULike is a wonderful service that helps researchers manage their bibliographies and keep track of what they are reading. The idea is similar to social bookmarking sites like Delicious or Reddit, except that it's for papers instead of web pages. Once you find an article that you are interested in reading, you can post it to your CiteULike library, tag it with descriptive tags, and even upload your own personal PDF so you can read the article when you're not surfing from a university IP. On top of all this, the real hook is that CiteULike will extract all the bibliographic information about an article at the click of a button, provided that you can find the article in a fairly well-known database. For me and my closest friends, this is the real reason we use CiteULike. I hate crawling through articles later, trying to find a reference to something that I know I read 6 months ago in order to complete a bibliography. I've gotten to the point that I sometimes refuse to read a paper that I haven't entered into my CiteULike database.

CiteULike can extract the bibliographic information from many websites, but not all of them. Currently, CiteULike can't extract the bibliographic info from the following computer science websites:

  • CiteSeerX
  • IEEE Computer Society Digital Library (CSDL)

This really frustrates me. Often I'll find an extremely interesting article on one of these sites, but I'm afraid to make the commitment because I'm too lazy to extract the bibliographic information by hand and too scared to glean someone else's idea without being able to cite it later.

Project proposal

For this weekly project, I am going to write as many bibliographic extractors as I can for CiteULike. CiteULike has tutorials and example code detailing how to write such an extractor. Learning the system should be as simple as:

  • learn how a major portal's HTML is laid out
  • write regular expressions to extract the relevant info
  • figure out how to feed that information into CiteULike's servers.
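
To make the second step concrete, here is a minimal Python sketch of pulling bibliographic fields out of a page with a regular expression. The markup in SAMPLE_HTML and the citation_* field names are assumptions made up for illustration, not the layout of any particular portal, so a real extractor would start from the portal's actual HTML.

```python
import re

# Hypothetical markup for illustration only; inspect the real portal's HTML first.
SAMPLE_HTML = """
<html><head>
<meta name="citation_title" content="A Study of Things">
<meta name="citation_author" content="Ada Lovelace">
<meta name="citation_author" content="Alan Turing">
<meta name="citation_year" content="2008">
</head></html>
"""

# Capture the part of the name after "citation_" and the content attribute.
META = re.compile(r'<meta\s+name="citation_(\w+)"\s+content="([^"]*)"', re.IGNORECASE)


def extract_fields(html):
    """Collect every citation_* meta tag into a dict of lists, keyed by field name."""
    fields = {}
    for name, value in META.findall(html):
        fields.setdefault(name.lower(), []).append(value)
    return fields


if __name__ == "__main__":
    for name, values in extract_fields(SAMPLE_HTML).items():
        print(name, values)
```

Each field maps to a list because some fields, such as authors, repeat on a page.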

Build notes

I spent a good 8 hours this weekend finishing up this project. It took way longer than I anticipated, but that's my fault and not the fault of the plugin system's creators. The plugin documentation was wonderful for this project, but there is a lot of it to read. Hopefully, you can use this page to build a plugin with minimal fuss.

First, check out the source code from the Subversion repository (svn co http://svn.citeulike.org/svn/ citeulike). Also pull up the documentation, because you'll need it later. The documentation is very comprehensive, but you don't need to read all of it at first. Give it a scan for 10 minutes and move on to implementing the plugin.

I won't regurgitate the excellent documentation, but basically every plugin is made from a descr/plugin.cul file and a perl/plugin.pl (or python/plugin.py) file. The two commands you'll need to know are:

./driver.tcl parse URL
  Asks each plugin in turn if it is interested in the URL. You'll use this to test your plugin.
./driver.tcl test plugin
  Runs the unit tests for the plugin.  You'll run this when you are finished.

Here's my quick start guide for anyone who wants to quickly make a CiteULike plugin:

  • Copy one of the .cul files in descr, change the obvious fields in your copy
  • Make a dummy program that returns "status\terr" no matter what. Check it by running it from the command line.
  • Using this status program, adjust the regex in the .cul file so that ./driver.tcl parse http://url catches all the URLs you want.
  • Change the dummy program so that it reads one line from standard input and parses the URL on that line to pull out all the unique bits needed to identify the article. Run the program manually and have it print out the unique identifiers.
  • Make your dummy plugin return
    begin_tsv
    linkout URL_UIDS
    end_tsv
    
    You can copy from another plugin or read the format for linkouts. You'll probably only need $ckey_1. Remember to use tabs instead of spaces.
  • Change the format_linkout line in the .cul file. See if running ./driver.tcl parse formats the output link correctly.
  • Copy from another plugin to download the desired webpage.
  • Change the plugin so it outputs "status\tok" before exiting.
  • See if the database allows BibTeX downloads. If so, download the BibTeX and use begin_bibtex before begin_tsv. Read the HOWTO.txt for more. When done, test with driver.tcl to see if it properly parsed the BibTeX.
  • Figure out regexps you can use to extract the rest of the information you couldn't get through BibTeX. Run the code each time and check that you're getting good results. Test both by running your plugin manually and through driver.tcl parse.
  • Take the output from driver.tcl for a couple of URLs and reformat it so it can go into the test section of your plugin. Look at other tests for examples.
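
The early steps of this guide can be sketched roughly in Python, for anyone using the python/ plugin route. This is only a sketch under stated assumptions: the portal URL pattern is hypothetical, and the fields on the linkout line (a made-up link type followed by the article id) are placeholders, so check the linkout format in the plugin documentation for the real layout. The demo at the bottom feeds a fake standard input; a real plugin would pass sys.stdin instead.

```python
import io
import re
import sys

# Hypothetical URL pattern for an imaginary portal; a real plugin would match
# the article URLs of the site it supports.
ARTICLE_URL = re.compile(r"^https?://portal\.example\.org/article/(?P<doc_id>[\w.]+)$")


def extract_id(url):
    """Return the unique article id found in the URL, or None if it does not match."""
    match = ARTICLE_URL.match(url.strip())
    return match.group("doc_id") if match else None


def plugin_output(url):
    """Build the tab-separated lines the plugin prints for one URL."""
    doc_id = extract_id(url)
    if doc_id is None:
        return ["status\terr"]
    return [
        "begin_tsv",
        # Placeholder field layout (assumption): link type, then the article id.
        "\t".join(["linkout", "EXAMPLE", doc_id]),
        "end_tsv",
        "status\tok",
    ]


def run(stdin, stdout):
    """Read one URL line from stdin and write the plugin response to stdout."""
    url = stdin.readline()
    stdout.write("\n".join(plugin_output(url)) + "\n")


if __name__ == "__main__":
    # Simulated input for the demo; a real plugin would call run(sys.stdin, sys.stdout).
    run(io.StringIO("http://portal.example.org/article/abc123\n"), sys.stdout)
```

Keeping plugin_output separate from the I/O makes it easy to check the tab-separated lines by hand before wiring the plugin into driver.tcl.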

Conclusion

This build took me way longer than expected, but now that I have the process down I'm fairly sure I could knock out another plugin in a matter of hours. The hardest part of implementing this project was knowing where to go if something went wrong. I started by reading the documentation, which in retrospect was a mistake: I learned much more by copying and pasting another person's plugin and gutting it to serve my purposes.

My biggest stalls were when I ran out of ideas on how to fix my current problem. Hopefully, by posting my build process, I'll make it easier for the next person who comes along.

I contributed CiteSeerX and CSDL plugins to the repository, and they went live on 2008-07-14. The CiteSeerX plugin as I wrote it will only work on journal papers and conference proceedings. This is because CiteSeerX doesn't use BibTeX, RIS, or another standard format, but has instead invented its own. I need to find a book chapter on CiteSeerX in order to properly parse book chapters, so if you find one let me know.

The CSDL plugin should work on basically anything, but it might have problems parsing Unicode characters in names. CSDL currently encodes the BibTeX in a JavaScript string. I need to use a Perl parser to un-HTML-ify the string in order to really make it bulletproof, but I was too lazy to look up the documentation.
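
For comparison, here is a rough Python sketch of that un-HTML-ify step. It is not the Perl code in the plugin. It assumes the page embeds the BibTeX as a double-quoted JavaScript string assigned to a variable called bibtex (a made-up name), and that the string literal happens to be valid JSON, so json.loads can decode the escapes before html.unescape turns entities into Unicode characters.

```python
import html
import json
import re

# Hypothetical page fragment; the variable name and the entry are invented.
SAMPLE_PAGE = r'''<script>var bibtex = "@article{garcia2008,\n  author = {Jos&eacute; Garc&iacute;a},\n  title = {Caf&eacute; \"Protocols\"}\n}";</script>'''

# Capture the whole double-quoted literal, allowing backslash escapes inside it.
JS_STRING = re.compile(r'var bibtex = ("(?:[^"\\]|\\.)*");')


def extract_bibtex(page):
    """Return the decoded BibTeX text, or None if no embedded string is found."""
    match = JS_STRING.search(page)
    if match is None:
        return None
    # Assumption: the JavaScript literal is also valid JSON (double quotes,
    # standard escapes), so json.loads can turn \n and \" into real characters.
    raw = json.loads(match.group(1))
    # Convert HTML entities such as &eacute; into Unicode characters.
    return html.unescape(raw)


if __name__ == "__main__":
    print(extract_bibtex(SAMPLE_PAGE))
```

Single-quoted literals or JavaScript-only escapes would need a more forgiving decoder than json.loads.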


 