What is this?

CrossRef's Extracto can extract unstructured references from article PDFs. The service wraps CrossRef Labs' PDF content extraction toolkit, pdf-extract, an open source Ruby library currently hosted on GitHub.

How accurate is it?

The extraction toolkit is a work in progress, so please don't be surprised if you get incomplete results or completely rubbish output. It also relies heavily on heuristics whose defaults may not work well with your PDFs. If this service returns junk results for your PDFs, you may want to try running pdf-extract yourself with custom settings.

How reliable/fast is this service?

Not very. This is a technology preview running on development servers.

Upload a PDF for reference extraction
Optional. Used to create CrossRef deposit data for your PDF's references.
Note: Content from your PDF will be used as training data for pdf-extract. Content will not be given to third parties.
If this is not the first time you are uploading a PDF, check this box if you wish to re-process the PDF. Otherwise you will be redirected to cached results.

RESTish API

Do you have a large batch of PDFs that you'd like to submit for reference scanning programmatically? This service exposes a REST API for uploading PDFs and subsequently retrieving their references. Retrieving references requires two or more HTTP calls: one to upload a PDF and one or more to request the status of parsed references.

Uploading a PDF

Upload a PDF by POSTing it to the /pdfs resource with a content type of application/pdf. This should not be a multi-part form post; instead, the body of the HTTP request must be the binary content of the PDF file. For example, with curl:

$ curl -H "Content-Type: application/pdf" --data-binary "@myfile.pdf" http://extracto.labs.crossref.org/pdfs

This request will return a JSON document containing an ID which can be used to query the status of reference parsing.

{
  "id": "4ea837fdf8049231e4000001"
}
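
For batch use, the same upload can be scripted. The following is a minimal sketch using only Python's standard library, assuming the host shown in the curl example and a response shaped like the JSON above. The function names build_upload_request and upload_pdf are illustrative and are not part of the service.

```python
import json
import urllib.request

# Host taken from the curl example; adjust if the service lives elsewhere.
BASE_URL = "http://extracto.labs.crossref.org"


def build_upload_request(pdf_bytes, base_url=BASE_URL):
    """Build a POST to /pdfs whose body is the raw PDF (not a multi-part form)."""
    return urllib.request.Request(
        base_url + "/pdfs",
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def upload_pdf(path, base_url=BASE_URL, urlopen=urllib.request.urlopen):
    """Upload the PDF at `path` and return the "id" field of the JSON response.

    `urlopen` is injectable so the function can be exercised without a network.
    """
    with open(path, "rb") as f:
        request = build_upload_request(f.read(), base_url)
    with urlopen(request) as response:
        # Assumption: the response body is UTF-8 encoded JSON.
        return json.loads(response.read().decode("utf-8"))["id"]
```

Calling upload_pdf("myfile.pdf") should return the ID string to use in the status requests described next.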

Retrieving references

Use the ID returned from uploading a PDF to query the status of reference parsing, and to retrieve references once they are available. Send a GET request to /pdfs/ID.

$ curl http://extracto.labs.crossref.org/pdfs/4ea837fdf8049231e4000001

{
  "parsed": true,
  "uploaded": true,
  "failed": false,
  "file_digest": "71e176c528fcfa1b502b49465ffb51ca",
  "doi": "10.10/abc",
  "id": "4ea837fdf8049231e4000001"
}

If the status response reports that the PDF has been parsed, that is, data["parsed"] == true, a GET request to /pdfs/ID/refs will retrieve the references. If the status reports that parsing has not yet completed, the client should wait a few seconds and then request the status again.

$ curl http://extracto.labs.crossref.org/pdfs/4ea837fdf8049231e4000001/refs

[
  "Special Scrutiny: A Targeted Form of Research Protocol Review ANN INTERN MED 2004;140:220-223",
  ...
]
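
The status-then-references sequence can be wrapped in a small polling loop. This is a sketch under stated assumptions rather than an official client: the 5 second interval and the cap of 60 polls are arbitrary choices, treating "failed": true as a terminal error is an interpretation of the status document above, and the names get_json and fetch_references are illustrative.

```python
import json
import time
import urllib.request

# Host taken from the curl examples; adjust if the service lives elsewhere.
BASE_URL = "http://extracto.labs.crossref.org"


def get_json(url, urlopen=urllib.request.urlopen):
    """GET `url` and decode the body as JSON (assumed to be UTF-8)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_references(pdf_id, base_url=BASE_URL, interval=5.0, max_polls=60,
                     fetch=get_json, sleep=time.sleep):
    """Poll /pdfs/ID until it reports parsed, then return the list from /pdfs/ID/refs.

    `fetch` and `sleep` are injectable so the loop can be tested offline.
    """
    status_url = f"{base_url}/pdfs/{pdf_id}"
    for _ in range(max_polls):
        status = fetch(status_url)
        # Assumption: "failed": true means parsing will never complete.
        if status.get("failed"):
            raise RuntimeError(f"reference parsing failed for PDF {pdf_id}")
        if status.get("parsed"):
            return fetch(status_url + "/refs")
        # Not parsed yet: wait before asking again.
        sleep(interval)
    raise TimeoutError(f"PDF {pdf_id} was not parsed after {max_polls} status checks")
```

Since this is a preview service on development servers, a generous interval and an explicit upper bound on polling seem prudent; both can be tuned through the keyword arguments.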

Generating CrossRef deposit XML

Extracto can produce CrossRef deposit XML for parsed PDFs. A GET request to /pdfs/ID/deposit generates and returns deposit XML that includes the parsed references for that PDF.
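
Fetching the deposit XML follows the same pattern as the other GET requests. The sketch below assumes the PDF has already been parsed and that the response body is the XML document itself. It only checks that the body is well-formed XML and makes no assumption about the deposit schema. The function name fetch_deposit_xml is illustrative.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Host taken from the curl examples; adjust if the service lives elsewhere.
BASE_URL = "http://extracto.labs.crossref.org"


def fetch_deposit_xml(pdf_id, base_url=BASE_URL, urlopen=urllib.request.urlopen):
    """GET /pdfs/ID/deposit and return the raw XML bytes.

    Raises xml.etree.ElementTree.ParseError if the body is not well-formed XML,
    which guards against saving an error page as deposit data.
    """
    with urlopen(f"{base_url}/pdfs/{pdf_id}/deposit") as response:
        body = response.read()
    ET.fromstring(body)  # well-formedness check only; the result is discarded
    return body
```

The returned bytes can be written to a file opened in binary mode, which preserves whatever encoding the XML declaration states.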