Text Mining

How do I retrieve the documents I want to mine?
If you already know the DOI (or the PII, a proprietary Elsevier identifier) of the documents you want to mine, you can get them one by one by calling this URL:

	http://api.elsevier.com/content/article/doi/[DOI]
or
	http://api.elsevier.com/content/article/pii/[PII]
Notes:
- [DOI] is the formatted DOI - e.g. http://api.elsevier.com/content/article/doi/10.1016/j.ibusrev.2010.09.002.
- [PII] is Elsevier's own internal document identifier (e.g. S0190962202001020), which is also used in the permanent URLs of articles on ScienceDirect (e.g. www.sciencedirect.com/science/article/pii/S0190962202001020)
- This interface requires your client to pass in a valid API Key as an HTTP request header:
	X-ELS-APIKey: [APIKey]
This is the APIKey that is created for you when you register your text mining project here.

Alternatively, the APIKey can be passed in as a URL parameter:
	http://api.elsevier.com/content/article/doi/[DOI]?APIKey=[APIKey]
E.g. http://api.elsevier.com/content/article/doi/10.1016/j.ibusrev.2010.09.002?APIKey=665f15638156da2156b60a48095b4abc (please note that the APIKey in this example is not actually valid).
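
As an illustration of the retrieval call described above, here is a minimal Python sketch that builds a request for a single article by DOI or PII and passes the APIKey in the X-ELS-APIKey header. The function names (article_request, fetch_article) are hypothetical helpers invented for this example; only the URL pattern and the header name come from the text above.

```python
import urllib.request

# URL pattern taken from this FAQ.
BASE = "http://api.elsevier.com/content/article"


def article_request(identifier, api_key, id_type="doi"):
    """Build (but do not send) a retrieval request for one article.

    id_type is "doi" or "pii", matching the two URL forms above.
    """
    if id_type not in ("doi", "pii"):
        raise ValueError("id_type must be 'doi' or 'pii'")
    url = f"{BASE}/{id_type}/{identifier}"
    # urllib stores the header name as "X-els-apikey"; HTTP header
    # names are case-insensitive, so this is equivalent on the wire.
    return urllib.request.Request(url, headers={"X-ELS-APIKey": api_key})


def fetch_article(identifier, api_key, id_type="doi"):
    """Send the request and return the raw response body as bytes."""
    request = article_request(identifier, api_key, id_type)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling fetch_article("10.1016/j.ibusrev.2010.09.002", "[APIKey]") with your own key in place of the placeholder would then request that one document.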

By default, a request to http://api.elsevier.com/content/article/... will return:
- the full-text XML version (FULL view) of the document, if your client IP address is recognized as that of a subscribing institution that has access to that article on ScienceDirect (which we call 'being entitled to the document');
- the abstract XML version (META_ABS) of the document, if you're not entitled.

If you explicitly request the FULL view, the response will be the full-text XML if you're entitled, and an error if you're not entitled:
	http://api.elsevier.com/content/article/doi/[DOI]?view=FULL
This may make client-side exception handling a little easier to code.
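
To show how the explicit FULL view can simplify exception handling, here is a hedged Python sketch. It assumes the "error" arrives as a non-success HTTP status, which the text above does not specify, and it therefore treats any HTTP error as "full text not available". The names full_view_url and fetch_full_or_none, and the opener parameter, are hypothetical and exist only for this example.

```python
import urllib.error
import urllib.request


def full_view_url(doi):
    """URL that explicitly requests the FULL view for one DOI."""
    return f"http://api.elsevier.com/content/article/doi/{doi}?view=FULL"


def fetch_full_or_none(doi, api_key, opener=urllib.request.urlopen):
    """Return the full-text XML as bytes, or None on an HTTP error.

    `opener` is injectable so the function can be exercised without
    network access; by default it is urllib's urlopen.
    """
    request = urllib.request.Request(
        full_view_url(doi), headers={"X-ELS-APIKey": api_key}
    )
    try:
        with opener(request) as response:
            return response.read()
    except urllib.error.HTTPError:
        # Assumption: the status code for "not entitled" is not stated
        # in this FAQ, so every HTTP error is handled the same way.
        return None
```

With this shape, the calling code only has to check for None instead of inspecting the response to see whether it received the full text or just the abstract.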

Apart from the 'view=...' parameter, you can pass in an 'Accept' parameter to choose between the full XML version (the full text with all of Elsevier's DTD markup) and a 'simplified' document, which contains a structured metadata 'header' followed by the body of the text as one big UTF-8 string, without markup. By default (as implied by the previous point), the response is the full-text XML, but the 'Accept' parameter can be used to request either format explicitly. The parameter can be passed in as an HTTP request header:
	Accept: text/xml (for full XML)
or
	Accept: text/plain (for stripped-down full-text)
Alternatively, the parameter can be passed in as a URL parameter:
	http://api.elsevier.com/content/article/doi/[DOI]?httpAccept=text/xml
or
	http://api.elsevier.com/content/article/doi/[DOI]?httpAccept=text/plain
It is possible to combine multiple parameters in one URL, e.g. the APIKey and the request format:
	http://api.elsevier.com/content/article/doi/[DOI]?APIKey=[APIKey]&httpAccept=text/plain
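
The URL-parameter variants above can be assembled programmatically. The following Python sketch is one way to do that; article_url is a hypothetical helper, and the parameter names (APIKey, httpAccept) are the ones given in this FAQ.

```python
from urllib.parse import urlencode


def article_url(doi, api_key=None, plain_text=False):
    """Build a retrieval URL that selects the response format.

    plain_text=False asks for the full XML (text/xml);
    plain_text=True asks for the stripped-down full text (text/plain).
    """
    params = {}
    if api_key:
        params["APIKey"] = api_key
    params["httpAccept"] = "text/plain" if plain_text else "text/xml"
    # safe="/" keeps the slash in "text/plain" unescaped, so the
    # result matches the URL form shown above.
    query = urlencode(params, safe="/")
    return f"http://api.elsevier.com/content/article/doi/{doi}?{query}"
```

If you prefer request headers over URL parameters, the same choice can be expressed by sending "Accept: text/xml" or "Accept: text/plain" with the request instead.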


Where can I find the DTD for your XML articles?
Here: http://www.elsevier.com/author-schemas/elsevier-xml-dtds-and-transport-schemas

How do I select the corpus I want to mine?
Or, in other words: how do you find out which documents are relevant? Ultimately, your corpus selection process has to produce a list of URIs for the documents you want to mine. Generating that list can generally be done in two ways:
- searching an index that returns a list of documents. This approach allows you to limit your corpus to documents that match certain search terms, such as keywords, author, date, etc. See more below.
- browsing a resource to find references to documents and collating those references in a list. This approach allows you to limit a corpus by the way it is referenced or structured on a site; examples are using a hierarchical site map to mine only a part of that site's hierarchy, and using citation links between documents to mine a set of documents and all the documents they reference. See more below.

How do I search for documents I want to mine?
Elsevier's own search index for ScienceDirect can be targeted through:

	http://api.elsevier.com/content/search/scidir?query=[query]
A request to this URL returns a list of documents matching the [query], each with its basic metadata and the URI to retrieve it from api.elsevier.com.

Notes:
- This interface requires your client to pass in a valid API Key as an HTTP request header:
	X-ELS-APIKey: [APIKey]
- [query] can be any search query that is valid on the expert search form on www.sciencedirect.com.
- For performance reasons, a search request returns up to 200 results in a single response. Multiple requests can be used to collate more than 200 results:
	http://api.elsevier.com/content/search/scidir?query=[query]&count=200
	http://api.elsevier.com/content/search/scidir?query=[query]&count=200&start=201
	http://api.elsevier.com/content/search/scidir?query=[query]&count=200&start=401
	http://api.elsevier.com/content/search/scidir?query=[query]&count=200&start=601
	etc.
Also, the ScienceDirect search index never returns more than roughly the first 5,000 results for any given query (the exact number varies). This means that with this request...:
	http://api.elsevier.com/content/search/scidir?query=[query]&count=200&start=5001
... you will likely hit the end of the available result set. This can be worked around (to some degree) by making the query itself more restrictive; e.g. if your search is '?query=heart+attack', you can use a date limiter, '?query=heart+attack+AND+PUBYEAR(2012)', to first collate the results from 2012, and then move on to PUBYEAR(2011), etc., all the while using '&count=200&start=...' to page through the results for each query.
- Please note that by default, searches return documents that you may not have access to under your subscription, which will lead to an authorization error if you try to retrieve such a document by calling its URL. If you want to avoid that by limiting your results to subscribed content only, use the '&subscribed=true' parameter in the request URL.
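
The paging and date-slicing scheme from the notes above can be sketched in Python as follows. This only generates the request URLs; it does not send them. It assumes that 'start' is 1-based, as the example URLs suggest (start=201 following the first 200 results), and that the query string is already in the URL-ready form used in this FAQ (e.g. heart+attack). The function names page_urls and yearly_queries are hypothetical.

```python
SEARCH = "http://api.elsevier.com/content/search/scidir"


def page_urls(query, max_results, count=200):
    """Yield search URLs covering results 1..max_results.

    Assumption: 'start' is 1-based, as in the examples above
    (no start, then start=201, start=401, ...).
    """
    for start in range(1, max_results + 1, count):
        url = f"{SEARCH}?query={query}&count={count}"
        if start > 1:
            url += f"&start={start}"
        yield url


def yearly_queries(query, first_year, last_year):
    """Split one query into a PUBYEAR-limited query per year, newest first."""
    return [
        f"{query}+AND+PUBYEAR({year})"
        for year in range(last_year, first_year - 1, -1)
    ]


# Usage: page through each year's results separately, so that no
# single query runs into the cap of roughly 5,000 results.
for yearly in yearly_queries("heart+attack", 2011, 2012):
    for url in page_urls(yearly, max_results=600):
        pass  # send the request here, with the X-ELS-APIKey header
```

Since the exact cap varies, a real script would stop paging for a given query as soon as a response comes back with no results, rather than rely on a fixed max_results.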

How can I use crawling to select a corpus?
ScienceDirect has a site map with the following general hierarchy:



This differs somewhat per journal, since some titles don't have issues but do have volumes, and vice versa. But the lowest level is always a list of articles, and therefore this sitemap can be used to target content in specific journals, with the option to take into account each journal's volumes/issues.
For text mining purposes, we have exposed an XML version of this sitemap on: http://api.elsevier.com/sitemap/page/sitemap/index.html

Notes:
- The sitemap doesn't know about your subscriptions and therefore may list URLs for documents you don't have access to. This will lead to an authorization error if you try to retrieve such a document, and your crawling script will have to be able to handle that error.
- No APIKey is needed to crawl the sitemap; however, whenever your crawler follows a URL from the sitemap to an actual article (i.e. follows a URL of the form http://api.elsevier.com/content/article/...), an APIKey will be needed in the request as outlined above.
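
Because the sitemap may list documents you are not entitled to, a crawling script needs to survive authorization errors, as noted above. The Python sketch below shows one possible shape for that loop. It does not parse the sitemap itself (its XML structure is not described here), and it assumes the authorization error surfaces as an HTTP error status; the function fetch_accessible and its fetch argument are hypothetical.

```python
import urllib.error


def fetch_accessible(article_urls, fetch):
    """Fetch every URL, skipping documents that raise an HTTP error.

    `fetch` is any caller-supplied function that takes an article URL,
    adds the APIKey, and returns the document body. Returns a pair:
    a dict of url -> document, and a list of (url, status code) pairs
    for the documents that were skipped.
    """
    documents = {}
    skipped = []
    for url in article_urls:
        try:
            documents[url] = fetch(url)
        except urllib.error.HTTPError as error:
            # Assumption: "authorization error" means an HTTP error
            # status; the code is recorded so it can be reviewed later.
            skipped.append((url, error.code))
    return documents, skipped
```

Keeping the skipped list lets you review afterwards which documents failed and with which status, instead of having the crawl stop at the first document you cannot access.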

Can I use other sources to help me select my corpus before I download it?
Of course! Here are some options:
- Google and Google Scholar: if you append "site:sciencedirect.com OR site:linkinghub.elsevier.com" to your query, you'll limit search results to content indexed on ScienceDirect.com. In the URLs of the results, you'll see an identifier like S0190962202001020 or B0741521410016381; this is the PII of the article, which you can use to construct a retrieval request for that article (see above).
- If you use A&I databases like PubMed, Web of Science or Elsevier's own Scopus, they will often return the DOI for documents in result sets. You can use that DOI to construct a retrieval request to our full-text API (see above) to try to retrieve that document; if the document is not an Elsevier document or if you are not subscribed to it, you will simply get an error.
- CrossRef's Metadata services, while not available to everyone, allow metadata-based requests to retrieve DOIs.
- Many Elsevier journals also have their own separate sites, such as cell.com and thelancet.com; they use DOIs and PIIs to identify articles as well, and thus could help in identifying the corpus to mine.
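
As an example of turning such external results into retrieval requests, the Python sketch below extracts a PII from a result URL and builds the matching api.elsevier.com URL. It assumes the identifier follows a "/pii/" path segment, as in the ScienceDirect permanent URL shown earlier; URLs that carry the PII in another position would need a different pattern. The name pii_retrieval_url is hypothetical.

```python
import re

# Assumption: the PII appears right after "/pii/" in the result URL
# and consists of uppercase letters and digits, like S0190962202001020.
PII_IN_URL = re.compile(r"/pii/([A-Z0-9]+)")


def pii_retrieval_url(result_url):
    """Return the API retrieval URL for the PII found in result_url.

    Returns None when no PII can be found in the URL.
    """
    match = PII_IN_URL.search(result_url)
    if match is None:
        return None
    return f"http://api.elsevier.com/content/article/pii/{match.group(1)}"
```

For DOIs returned by A&I databases no extraction is needed: the DOI can be placed directly into the /content/article/doi/[DOI] URL form.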