CrossRef's Extracto can extract unstructured references from article PDFs. The service wraps CrossRef Labs' PDF content extraction toolkit, pdf-extract, which is an open source Ruby library currently hosted on GitHub.
The extraction toolkit is a work in progress, so please don't be surprised if you get incomplete results or completely rubbish output. It also relies heavily on a bunch of heuristics whose defaults may not work well with your PDFs. If this service does return junk results for your PDFs, you may want to try running pdf-extract with custom settings.
Do not expect this service to be very reliable. It is a technology preview running on development servers.
Do you have a large batch of PDFs that you'd like to submit for reference scanning programmatically? This service can be used via a REST API for programmatic upload of PDFs and subsequent retrieval of references. Retrieval of references requires two or more HTTP calls: one to upload a PDF and one or more to request the status of parsed references.
Upload a PDF by POSTing a PDF to the /pdfs resource, while also specifying a content type of application/pdf. This should not be a multi-part form post. Instead, the content of the HTTP request message must be the binary content of a PDF file.
For example, with curl:
$ curl -H "Content-Type: application/pdf" --data-binary "@myfile.pdf" http://extracto.labs.crossref.org/pdfs
This request will return a JSON document containing an ID which can be used to query the status of reference parsing.
{
"id": "4ea837fdf8049231e4000001"
}
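The same upload can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the function names, the `opener` parameter (there so the HTTP call can be swapped out), and the assumption that the response body is UTF-8 JSON are all illustrative choices. The host and the /pdfs path come from the curl example above.

```python
import json
import urllib.request

# Host taken from the curl example above.
BASE_URL = "http://extracto.labs.crossref.org"


def build_upload_request(pdf_bytes, base_url=BASE_URL):
    """Build a POST to /pdfs whose body is the raw PDF (not a multi-part form)."""
    return urllib.request.Request(
        base_url + "/pdfs",
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def upload_pdf(pdf_bytes, opener=urllib.request.urlopen, base_url=BASE_URL):
    """Upload a PDF and return the "id" field of the JSON response.

    `opener` is a hypothetical hook for substituting the HTTP call;
    it defaults to urllib.request.urlopen.
    """
    with opener(build_upload_request(pdf_bytes, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))["id"]
```

A call might look like `upload_pdf(open("myfile.pdf", "rb").read())`, which returns the ID string used in the status requests.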
Use the ID returned from uploading a PDF to query the status of reference parsing, and retrieve references if they are available. Send a GET request to /pdfs/ID.
$ curl http://extracto.labs.crossref.org/pdfs/4ea837fdf8049231e4000001
{
"parsed": true,
"uploaded": true,
"failed": false,
"file_digest": "71e176c528fcfa1b502b49465ffb51ca",
"doi": "10.10/abc",
"id": "4ea837fdf8049231e4000001"
}
If the status response reports that the PDF is parsed, as in data["parsed"] == true, then a GET request can be made to retrieve references from /pdfs/ID/refs. If the status reports that parsing has not completed, then the client should wait a few seconds and make another request for the status.
$ curl http://extracto.labs.crossref.org/pdfs/4ea837fdf8049231e4000001/refs
[
"Special Scrutiny: A Targeted Form of Research Protocol Review ANN INTERN MED 2004;140:220-223",
...
]
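The status-then-references sequence can be sketched as a small polling loop in Python. Several details here are assumptions rather than documented behaviour: the three-second delay and the cap of 20 polls are arbitrary, treating "failed": true as a terminal error is a guess based on the field name in the status response, and the function names and the `opener` and `sleep` parameters are hypothetical hooks for substituting the HTTP call and the wait.

```python
import json
import time
import urllib.request

# Host taken from the curl examples above.
BASE_URL = "http://extracto.labs.crossref.org"


def get_json(path, opener=urllib.request.urlopen, base_url=BASE_URL):
    """GET base_url + path and decode the body as JSON (UTF-8 assumed)."""
    with opener(base_url + path) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_references(pdf_id, opener=urllib.request.urlopen,
                        delay=3.0, max_polls=20, sleep=time.sleep):
    """Poll /pdfs/ID until "parsed" is true, then return /pdfs/ID/refs."""
    for _ in range(max_polls):
        status = get_json("/pdfs/" + pdf_id, opener)
        # Assumption: a true "failed" flag means parsing will never finish.
        if status.get("failed"):
            raise RuntimeError("reference parsing failed for " + pdf_id)
        if status.get("parsed"):
            return get_json("/pdfs/" + pdf_id + "/refs", opener)
        sleep(delay)  # wait a few seconds before asking again
    raise TimeoutError("gave up waiting for " + pdf_id)
```

Capping the number of polls keeps a batch job from hanging forever on a PDF that never finishes parsing, which seems prudent for a preview service.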
Extracto can produce CrossRef deposit XML for parsed PDFs. Deposit XML including parsed references for a particular PDF will be generated and returned on a GET request to /pdfs/ID/deposit.
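Fetching the deposit XML could look like the following Python sketch. It assumes the response is UTF-8 text and only checks that the body is well-formed XML; it does not validate against the CrossRef deposit schema. The function name and the `opener` parameter are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Host taken from the curl examples above.
BASE_URL = "http://extracto.labs.crossref.org"


def fetch_deposit_xml(pdf_id, opener=urllib.request.urlopen, base_url=BASE_URL):
    """GET /pdfs/ID/deposit and return the XML as a string."""
    with opener(base_url + "/pdfs/" + pdf_id + "/deposit") as response:
        xml_text = response.read().decode("utf-8")
    # Raises xml.etree.ElementTree.ParseError if the body is not well-formed.
    ET.fromstring(xml_text)
    return xml_text
```

The well-formedness check is there to catch an error page or truncated response before the result is handed to a deposit workflow.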