Logo Search packages:      
Sourcecode: python-biopython version File versions

Bio::EUtils::ThinClient Namespace Reference


Detailed Description

Low-level interface to NCBI's EUtils for Entrez search and retrieval.

For higher-level interfaces, see DBIdsClient (which works with a set
of database identifiers) and HistoryClient (which does a much better
job of handling history).

There are five classes of services:
  ESearch - search a database
  EPost - upload a list of indicies for further use
  ESummary - get document summaries for a given set of records
  EFetch - get the records translated to a given format
  ELink - find related records in other databases

You can find more information about them at
  http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
but that document isn't very useful.  Perhaps the following is better.

EUtils offers a structured way to query Entrez, get the results in
various formats, and get information about related documents.  The way
to start off is create an EUtils object.

>>> import EUtils
>>> from EUtils.ThinClient import ThinClient
>>> eutils = ThinClient.ThinClient()
>>> 

You can search Entrez with the "esearch" method.  This does a query on
the server, which generates a list of identifiers for records that
matched the query.  However, not all the identifiers are returned.
You can request only a subset of the matches (using the 'retstart' and
'retmax') terms.  This is useful because searches like 'cancer' can
have over 1.4 million matches.  Most people would rather change the
query or look at more details about the first few hits than wait to
download all the identifiers before doing anything else.

The esearch method, and indeed all these methods, returns a
'urllib.addinfourl' which is an HTTP socket connection that has
already parsed the HTTP header and is ready to read the data from the
server.

For example, here's a query and how to use it

  Search in PubMed for the term cancer for the entrez date from the
  last 60 days and retrieve the first 10 IDs and translations using
  the history parameter.

>>> infile = eutils.esearch("cancer",
...                         daterange = EUtils.WithinNDays(60, "edat"),
...                         retmax = 10)
>>>
>>> print infile.read()
<?xml version="1.0"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
        <Count>7228</Count>
        <RetMax>10</RetMax>
        <RetStart>0</RetStart>
        <IdList>
    <Id>12503096</Id>
    <Id>12503075</Id>
    <Id>12503073</Id>
    <Id>12503033</Id>
    <Id>12503030</Id>
    <Id>12503028</Id>
    <Id>12502932</Id>
    <Id>12502925</Id>
    <Id>12502881</Id>
    <Id>12502872</Id>
        </IdList>
        <TranslationSet>
    <Translation>
            <From>cancer%5BAll+Fields%5D</From>
            <To>(%22neoplasms%22%5BMeSH+Terms%5D+OR+cancer%5BText+Word%5D)</To>
    </Translation>
        </TranslationSet>
        <TranslationStack>
    <TermSet>
            <Term>"neoplasms"[MeSH Terms]</Term>
            <Field>MeSH Terms</Field>
            <Count>1407151</Count>
            <Explode>Y</Explode>
    </TermSet>
    <TermSet>
            <Term>cancer[Text Word]</Term>
            <Field>Text Word</Field>
            <Count>382919</Count>
            <Explode>Y</Explode>
    </TermSet>
    <OP>OR</OP>
    <TermSet>
            <Term>2002/10/30[edat]</Term>
            <Field>edat</Field>
            <Count>-1</Count>
            <Explode>Y</Explode>
    </TermSet>
    <TermSet>
            <Term>2002/12/29[edat]</Term>
            <Field>edat</Field>
            <Count>-1</Count>
            <Explode>Y</Explode>
    </TermSet>
    <OP>RANGE</OP>
    <OP>AND</OP>
        </TranslationStack>
</eSearchResult>

>>>

You get a raw XML input stream which you can process in many ways.
(The appropriate DTDs are included in the subdirectory "DTDs" and see
also the included POM reading code.)

    WARNING! As of this writing (2002/12/3) NCBI returns their
    XML encoded as Latin-1 but their processing instruction says
    it is UTF-8 because they leave out the "encoding" attribute.
    Until they fix it you will need to recode the input stream
    before processing it with XML tools, like this

        import codecs
        infile = codecs.EncodedFile(infile, "utf-8", "iso-8859-1")


The XML fields are mostly understandable:
  Count -- the total number of matches from this search
  RetMax -- the number of <ID> values returned in this subset
  RetStart -- the start position of this subset in the list of
      all matches

  IDList and ID -- the identifiers in this subset

  TranslationSet / Translation -- if the search field is not
      explicitly specified ("qualified"), then the server will
      apply a set of hueristics to improve the query.  Eg, in
      this case "cancer" is first parsed as
        cancer[All Fields]
      then turned into the query
        "neoplasms"[MeSH Terms] OR cancer[Text Word]

      Note that these terms are URL escaped.
      For details on how the translation is done, see
http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#AutomaticTermMapping

  TranslationStack -- The (possibly 'improved' query) fully
      parsed out and converted into a postfix (RPN) notation.
      The above example is written in the Entrez query language as

        ("neoplasms"[MeSH Terms] OR cancer[Text Word]) AND
         2002/10/30:2002/12/29[edat]
      Note that these terms are *not* URL escaped.  Nothing like
      a bit of inconsistency for the soul.

      The "Count" field shows how many matches were found for each
      term of the expression.  I don't know what "Explode" does.


Let's get more information about the first record, which has an id of
12503096.  There are two ways to query for information, one uses a set
of identifiers and the other uses the history.  I'll talk about the
history one in a bit.  To use a set of identifiers you need to make a
DBIds object containing the that list.

>>> dbids = EUtils.DBIds("pubmed", ["12503096"])
>>>

Now get the summary using dbids

>>> infile = eutils.esummary_using_dbids(dbids)
>>> print infile.read()
<?xml version="1.0"?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD eSummaryResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_020511.dtd">
<eSummaryResult>
<DocSum>
        <Id>12503096</Id>
        <Item Name="PubDate" Type="Date">2003 Jan 30</Item>
        <Item Name="Source" Type="String">Am J Med Genet</Item>
        <Item Name="Authors" Type="String">Coyne JC, Kruus L, Racioppo M, Calzone KA, Armstrong K</Item>
        <Item Name="Title" Type="String">What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer?</Item>
        <Item Name="Volume" Type="String">116</Item>
        <Item Name="Pages" Type="String">222-8</Item>
        <Item Name="EntrezDate" Type="Date">2002/12/28 04:00</Item>
        <Item Name="PubMedId" Type="Integer">12503096</Item>
        <Item Name="MedlineId" Type="Integer">22390532</Item>
        <Item Name="Lang" Type="String">English</Item>
        <Item Name="PubType" Type="String"></Item>
        <Item Name="RecordStatus" Type="String">PubMed - in process</Item>
        <Item Name="Issue" Type="String">3</Item>
        <Item Name="SO" Type="String">2003 Jan 30;116(3):222-8</Item>
        <Item Name="DOI" Type="String">10.1002/ajmg.a.10844</Item>
        <Item Name="JTA" Type="String">3L4</Item>
        <Item Name="ISSN" Type="String">0148-7299</Item>
        <Item Name="PubId" Type="String"></Item>
        <Item Name="PubStatus" Type="Integer">4</Item>
        <Item Name="Status" Type="Integer">5</Item>
        <Item Name="HasAbstract" Type="Integer">1</Item>
        <Item Name="ArticleIds" Type="List">
    <Item Name="PubMedId" Type="String">12503096</Item>
    <Item Name="DOI" Type="String">10.1002/ajmg.a.10844</Item>
    <Item Name="MedlineUID" Type="String">22390532</Item>
        </Item>
</DocSum>
</eSummaryResult>
>>>

This is just a summary.  To get the full details, including an
abstract (if available) use the 'efetch' method.  I'll only print a
bit to convince you it has an abstract.

>>> s = eutils.efetch_using_dbids(dbids).read()
>>> print s[587:860]
<ArticleTitle>What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer?</ArticleTitle>
<Pagination>
<MedlinePgn>222-8</MedlinePgn>
</Pagination>
<Abstract>
<AbstractText>Women recruited from a hereditary cancer registry provided
>>>

Suppose instead you want the data in a text format.  Different
databases have different text formats.  For example, PubMed has a
"docsum" format which gives just the summary of a document and
"medline" format as needed for a citation database.  To get these, use
a "text" "retmode" ("return mode") and select the appropriate
"rettype" ("return type").

Here are examples of those two return types

>>> print eutils.efetch_using_dbids(dbids, "text", "docsum").read()[:497]
1:  Coyne JC, Kruus L, Racioppo M, Calzone KA, Armstrong K.
What do ratings of cancer-specific distress mean among women at high risk of breast and ovarian cancer?
Am J Med Genet. 2003 Jan 30;116(3):222-8.
PMID: 12503096 [PubMed - in process]
>>> print eutils.efetch_using_dbids(dbids, "text", "medline").read()[:369]
UI  - 22390532
PMID- 12503096
DA  - 20021227
IS  - 0148-7299
VI  - 116
IP  - 3
DP  - 2003 Jan 30
TI  - What do ratings of cancer-specific distress mean among women at high risk
      of breast and ovarian cancer?
PG  - 222-8
AB  - Women recruited from a hereditary cancer registry provided ratings of
      distress associated with different aspects of high-risk status
>>> 

It's also possible to get a list of records related to a given
article.  This is done through the "elink" method.  For example,
here's how to get the list of PubMed articles related to the above
PubMed record.  (Again, truncated because otherwise there is a lot of
data.)

>>> print eutils.elink_using_dbids(dbids).read()[:590]
<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
<eLinkResult>
<LinkSet>
        <DbFrom>pubmed</DbFrom>
        <IdList>
    <Id>12503096</Id>
        </IdList>
        <LinkSetDb>
    <DbTo>pubmed</DbTo>
    <LinkName>pubmed_pubmed</LinkName>
    <Link>
            <Id>12503096</Id>
            <Score>2147483647</Score>
    </Link>
    <Link>
            <Id>11536413</Id>
            <Score>30817790</Score>
    </Link>
    <Link>
            <Id>11340606</Id>
            <Score>29939219</Score>
    </Link>
    <Link>
            <Id>10805955</Id>
            <Score>29584451</Score>
    </Link>
>>>

For a change of pace, let's work with the protein database to learn
how to work with history.  Suppose I want to do a multiple sequene
alignment of bacteriorhodopsin with all of its neighbors, where
"neighbors" is defined by NCBI.  There are good programs for this -- I
just need to get the records in the right format, like FASTA.

The bacteriorhodopsin I'm interested in is BAA75200, which is
GI:4579714, so I'll start by asking for its neighbors.

>>> results = eutils.elink_using_dbids(
...             EUtils.DBIds("protein", ["4579714"]),
...             db = "protein").read()
>>> print results[:454]
<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
<eLinkResult>
<LinkSet>
        <DbFrom>protein</DbFrom>
        <IdList>
    <Id>4579714</Id>
        </IdList>
        <LinkSetDb>
    <DbTo>protein</DbTo>
    <LinkName>protein_protein</LinkName>
    <Link>
            <Id>4579714</Id>
            <Score>2147483647</Score>
    </Link>
    <Link>
            <Id>11277596</Id>
            <Score>1279</Score>
    </Link>
>>>

Let's get all the <Id> fields.  (While the following isn't a good way
to parse XML, it is easy to understand and works well enough for this
example.)  Note that I remove the first <Id> because that's from the
query and not from the results.

>>> import re
>>> ids = re.findall(r"<Id>(\d+)</Id>", results)
>>> ids = ids[1:]
>>> len(ids)
222
>>> dbids = EUtils.DBIds("protein", ids)
>>> 

That's a lot of records.  I could use 'efetch_using_dbids' but there's
a problem with that.  Efetch uses the HTTP GET protocol to pass
information to the EUtils server.  ("GET" is what's used when you type
a URL in the browser.)  Each id takes about 9 characters, so the URL
would be over 2,000 characters long.  This may not work on some
systems, for example, some proxies do not support long URLs.  (Search
for "very long URLs" for examples.)

Instead, we'll upload the list to the server then fetch the FASTA
version using the history.

The first step is to upload the data.  We want to put that into the
history so we set 'usehistory' to true.  There's no existing history
so the webenv string is None.


>>> print eutils.epost(dbids, usehistory = 1, webenv = None).read()
<?xml version="1.0"?>
<!DOCTYPE ePostResult PUBLIC "-//NLM//DTD ePostResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ePost_020511.dtd">
<ePostResult>
        <QueryKey>1</QueryKey>
        <WebEnv>%7BPgTHRHFBsJfC%3C%5C%5C%5B%3EAfJCKQ%5Ey%60%3CGkH%5DH%5E%3DJHGBKAJ%3F%40CbCiG%3FE%3C</WebEnv>
</ePostResult>

>>>

This says that the identifiers were saved as query #1, which will be
used later on as the "query_key" field.  The WebEnv is a cookie (or
token) used to tell the server where to find that query.  The WebEnv
changes after every history-enabled ESearch or EPost so you'll need to
parse the output from those to get the new WebEnv field.  You'll also
need to unquote it since it is URL-escaped.

Also, you will need to pass in the name of the database used for the
query in order to access the history.  Why?  I don't know -- I figure
the WebEnv and query_key should be enough to get the database name.

>>> import urllib
>>> webenv = urllib.unquote("%7BPgTHRHFBsJfC%3C%5C%5C%5B%3EAfJCKQ%5Ey%60%3CGkH%5DH%5E%3DJHGBKAJ%3F%40CbCiG%3FE%3C")
>>> print webenv
{PgTHRHFBsJfC<\\[>AfJCKQ^y`<GkH]H^=JHGBKAJ?@CbCiG?E<
>>>

Okay, now to get the data in FASTA format.  Notice that I need the
'retmax' in order to include all the records in the result.  (The
default is 20 records.)

>>> fasta = eutils.efetch_using_history("protein", webenv, query_key = "1",
...                                     retmode = "text", rettype = "fasta",
...                                     retmax = len(dbids)).read()
>>> fasta.count(">")
222
>>> print fasta[:694]
>gi|14194475|sp|O93742|BACH_HALSD Halorhodopsin (HR)
MMETAADALASGTVPLEMTQTQIFEAIQGDTLLASSLWINIALAGLSILLFVYMGRNLEDPRAQLIFVAT
LMVPLVSISSYTGLVSGLTVSFLEMPAGHALAGQEVLTPWGRYLTWALSTPMILVALGLLAGSNATKLFT
AVTADIGMCVTGLAAALTTSSYLLRWVWYVISCAFFVVVLYVLLAEWAEDAEVAGTAEIFNTLKLLTVVL
WLGYPIFWALGAEGLAVLDVAVTSWAYSGMDIVAKYLFAFLLLRWVVDNERTVAGMAAGLGAPLARCAPA
DD
>gi|14194474|sp|O93741|BACH_HALS4 Halorhodopsin (HR)
MRSRTYHDQSVCGPYGSQRTDCDRDTDAGSDTDVHGAQVATQIRTDTLLHSSLWVNIALAGLSILVFLYM
ARTVRANRARLIVGATLMIPLVSLSSYLGLVTGLTAGPIEMPAAHALAGEDVLSQWGRYLTWTLSTPMIL
LALGWLAEVDTADLFVVIAADIGMCLTGLAAALTTSSYAFRWAFYLVSTAFFVVVLYALLAKWPTNAEAA
GTGDIFGTLRWLTVILWLGYPILWALGVEGFALVDSVGLTSWGYSLLDIGAKYLFAALLLRWVANNERTI
AVGQRSGRGAIGDPVED
>>> 

To round things out, here's a query which refines the previous query.
I want to get all records from the first search which also have the
word "Structure" in them.  (My background was originally structural
biophysics, whaddya expect?  :)

>>> print eutils.search("#1 AND structure", db = "protein", usehistory = 1,
...                     webenv = webenv).read()
<?xml version="1.0"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
        <Count>67</Count>
        <RetMax>20</RetMax>
        <RetStart>0</RetStart>
        <QueryKey>2</QueryKey>
        <WebEnv>UdvMf%3F%60G%3DIE%60bG%3DGec%3E%3D%3Cbc_%5DgBAf%3EAi_e%5EAJcHgDi%3CIqGdE%7BmC%3C</WebEnv>
        <IdList>
    <Id>461608</Id>
    <Id>114808</Id>
    <Id>1364150</Id>
    <Id>1363466</Id>
    <Id>1083906</Id>
    <Id>99232</Id>
    <Id>99212</Id>
    <Id>81076</Id>
    <Id>114811</Id>
    <Id>24158915</Id>
    <Id>24158914</Id>
    <Id>24158913</Id>
    <Id>1168615</Id>
    <Id>114812</Id>
    <Id>114809</Id>
    <Id>17942995</Id>
    <Id>17942994</Id>
    <Id>17942993</Id>
    <Id>20151159</Id>
    <Id>20150922</Id>
        </IdList>
        <TranslationSet>
        </TranslationSet>
        <TranslationStack>
    <TermSet>
            <Term>#1</Term>
            <Field>All Fields</Field>
            <Count>222</Count>
            <Explode>Y</Explode>
    </TermSet>
    <TermSet>
            <Term>structure[All Fields]</Term>
            <Field>All Fields</Field>
            <Count>142002</Count>
            <Explode>Y</Explode>
    </TermSet>
    <OP>AND</OP>
        </TranslationStack>
</eSearchResult>

>>> 

One last thing about history.  It doesn't last very long -- perhaps an
hour or so.  (Untested.)  You may be able to toss it some keep-alive
signal every once in a while.  Or you may want to keep 

The known 'db' fields and primary IDs (if known) are
  genome -- GI number
  nucleotide -- GI number
  omim  -- MIM number
  popset -- GI number
  protein -- GI number
  pubmed  -- PMID
  sequences (not available; this will combine all sequence databases)
  structure -- MMDB ID
  taxonomy -- TAXID

The 'field' parameter is different for different databases.  The
fields for PubMed are listed at

http://www.ncbi.nlm.nih.gov/entrez/query/static/help/pmhelp.html#SearchFieldDescriptionsandTags

  Affiliation -- AD
  All Fields -- All
  Author -- AU
  EC/RN Number -- RN
  Entrez Date -- EDAT  (also valid for 'datetype')
  Filter -- FILTER
  Issue -- IP
  Journal Title -- TA
  Language -- LA
  MeSH Date -- MHDA  (also valid for 'datetype')
  MeSH Major Topic -- MAJR
  MeSH Subheadings -- SH
  MeSH Terms -- MH
  Pagination -- PG
  Personal Name as Subject -- PS
  Publication Date -- DP  (also valid for 'datetype')
  Publication Type -- PT
  Secondary Source ID -- SI
  Subset -- SB
  Substance Name -- NM
  Text Words -- TW
  Title -- TI
  Title/Abstract -- TIAB
  Unique Identifiers -- UID
  Volume -- VI

The fields marked as 'datetype' can also be used for date searches.
Date searches can be done in the query (for example, as

   1990/01/01:1999/12/31[edat]

or by passing a WithinNDays or DateRange field to the 'date' parameter
of the search.


Please pay attention to the usage limits!  The are listed at
  http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html

At the time of this writing they are:
    * Run retrieval scripts on weekends or between 9 PM and 5 AM ET
weekdays for any series of more than 100 requests.
    * Make no more than one request every 3 seconds.
    * Only 5000 PubMed records may be retrieved in a single day.

    * NCBI's Disclaimer and Copyright notice must be evident to users
      of your service.  NLM does not hold the copyright on the PubMed
      abstracts the journal publishers do.  NLM provides no legal
      advice concerning distribution of copyrighted materials, consult
      your legal counsel.

(Their disclaimer is at
       http://www.ncbi.nlm.nih.gov/About/disclaimer.html )



Classes

class  ThinClient

Functions

def _dbids_to_id_string

Variables

int DUMP_RESULT = 0
int DUMP_URL = 0
string EMAIL = "biopython-dev@biopython.org"
string TOOL = "EUtils_Python_client"


Generated by  Doxygen 1.6.0   Back to index