Package GenBank

Code to work with GenBank formatted files.

Rather than using Bio.GenBank, you are now encouraged to use Bio.SeqIO with
the "genbank" or "embl" format names to parse GenBank or EMBL files into
SeqRecord and SeqFeature objects (see the Biopython tutorial for details).
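As a minimal sketch of the recommended route, the function below reads a GenBank file with Bio.SeqIO and reports a few facts about each SeqRecord. The function name summarize_genbank and its path argument are hypothetical names chosen for illustration, and the import sits inside the function only so that the sketch can be loaded on a machine without Biopython.

```python
def summarize_genbank(path):
    """Yield (record id, sequence length, feature count) for each record.

    Hypothetical helper for illustration; not part of Biopython.
    """
    # Imported lazily so the sketch can be loaded without Biopython installed.
    from Bio import SeqIO

    # "genbank" is the format name given in the text above; "embl" works the
    # same way for EMBL files.
    for record in SeqIO.parse(path, "genbank"):
        yield record.id, len(record.seq), len(record.features)
```

Each item yielded comes from a SeqRecord, so the usual SeqRecord attributes (annotations, features, seq) are available inside the loop.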

Also, rather than using Bio.GenBank to search or download files from the NCBI,
you are now encouraged to use Bio.Entrez instead (again, see the Biopython
tutorial for details).

Currently the ONLY reason to use Bio.GenBank directly is for the RecordParser,
which turns a GenBank file into GenBank-specific Record objects.  This is a
much closer representation of the raw file contents than the SeqRecord
alternative from the FeatureParser (used in Bio.SeqIO).
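For comparison, here is a rough sketch of the RecordParser route. The helper name list_raw_features is hypothetical, and the sketch assumes the file holds at least one record. It relies on Record features keeping their location as the text found in the file rather than as a FeatureLocation object.

```python
def list_raw_features(path):
    """Return (key, location string) pairs from the first record in a file.

    Hypothetical helper for illustration; not part of Biopython.
    """
    # Imported lazily so the sketch can be loaded without Biopython installed.
    from Bio import GenBank

    parser = GenBank.RecordParser()
    with open(path) as handle:
        record = parser.parse(handle)
    # Each Record feature exposes the feature key and the raw location text.
    return [(feature.key, feature.location) for feature in record.features]
```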

Classes:
Iterator              Iterate through a file of GenBank entries
ErrorFeatureParser    Catch errors caused during parsing.
FeatureParser         Parse GenBank data in SeqRecord and SeqFeature objects.
RecordParser          Parse GenBank data into a Record object.

Exceptions:
ParserFailureError    Exception indicating a failure in the parser (i.e.
                      scanner or consumer)
LocationParserError   Exception indicating a problem with the spark-based
                      location parser.


17-MAR-2009: Added wgs and wgs_scafld for GenBank whole genome shotgun master
records. These are GenBank files that summarize the content of a project and
provide lists of scaffold and contig files in the project. These lists will be
in annotations['wgs'] and annotations['wgs_scafld']. These GenBank files do not
have sequences. For more details of this format, and an example, see
http://groups.google.com/group/bionet.molbio.genbank/browse_thread/thread/51fb88bf39e7dc36
http://is.gd/nNgk
Added by Ying Huang & Iddo Friedberg

Classes
  Iterator
Iterator interface to move over a file of GenBank entries one at a time.
  ParserFailureError
Failure caused by some kind of problem in the parser.
  LocationParserError
Could not properly parse out a location from a GenBank file.
  FeatureParser
Parse GenBank files into Seq + Feature objects.
  RecordParser
Parse GenBank files into Record objects.
  _BaseGenBankConsumer
Abstract GenBank consumer providing useful general functions.
  _FeatureConsumer
Create a SeqRecord object with Features to return.
  _RecordConsumer
Create a GenBank Record object from scanner generated information.

Functions
_pos(pos_str, offset=0)
Build a Position object (PRIVATE).

_loc(loc_str, expected_seq_length)
FeatureLocation from non-compound non-complement location (PRIVATE).

_split_compound_loc(compound_loc)
Split a tricky compound location string (PRIVATE).

_test()
Run the Bio.GenBank module's doctests.

Variables
  GENBANK_INDENT = 12
  GENBANK_SPACER = ' '
  FEATURE_KEY_INDENT = 5
  FEATURE_QUALIFIER_INDENT = 21
  FEATURE_KEY_SPACER = ' '
  FEATURE_QUALIFIER_SPACER = ' '
  _solo_location = '[<>]?\\d+'
  _pair_location = '[<>]?\\d+\\.\\.[<>]?\\d+'
  _between_location = '\\d+\\^\\d+'
  _within_position = '\\(\\d+\\.\\d+\\)'
  _re_within_position = re.compile(r'\(\d+\.\d+\)')
  _within_location = '([<>]?\\d+|\\(\\d+\\.\\d+\\))\\.\\.([<>]?\...
  _oneof_position = 'one\\-of\\(\\d+(,\\d+)+\\)'
  _re_oneof_position = re.compile(r'one-of\(\d+(,\d+)+\)')
  _oneof_location = '([<>]?\\d+|one\\-of\\(\\d+(,\\d+)+\\))\\.\\...
  _simple_location = '\\d+\\.\\.\\d+'
  _re_simple_location = re.compile(r'\d+\.\.\d+')
  _re_simple_compound = re.compile(r'^(join|order|bond)\(\d+\.\....
  _complex_location = '([a-zA-z][a-zA-Z0-9]*(\\.[a-zA-Z0-9]+)?\\...
  _re_complex_location = re.compile(r'^([a-zA-z][a-zA-Z0-9]*(\.[...
  _possibly_complemented_complex_location = '(([a-zA-z][a-zA-Z0-...
  _re_complex_compound = re.compile(r'^(join|order|bond)\((([a-z...
  __package__ = 'Bio.GenBank'
Function Details

_pos(pos_str, offset=0)

Build a Position object (PRIVATE).

For an end position, leave offset as zero (default):

>>> _pos("5")
ExactPosition(5)

For a start position, set offset to minus one (for Python counting):

>>> _pos("5", -1)
ExactPosition(4)

This also covers fuzzy positions:

>>> _pos("<5")
BeforePosition(5)
>>> _pos(">5")
AfterPosition(5)
>>> _pos("one-of(5,8,11)")
OneOfPosition([ExactPosition(5), ExactPosition(8), ExactPosition(11)])
>>> _pos("(8.10)")
WithinPosition(8,2)
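
The offset convention above reflects the difference between GenBank coordinates (1-based, inclusive at both ends) and Python slices (0-based, end exclusive). The following sketch illustrates the arithmetic on plain integers. The name to_python_slice is hypothetical and the code is not the library's implementation.

```python
def to_python_slice(start, end):
    """Convert a 1-based inclusive GenBank range to 0-based slice bounds."""
    # The start moves back by one (offset -1); the end is left unchanged
    # (offset 0) because Python slices exclude the end index.
    return start - 1, end


seq = "ACGTACGTAC"
start, end = to_python_slice(3, 6)  # GenBank location 3..6
print(seq[start:end])  # GTAC
```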

_loc(loc_str, expected_seq_length)

FeatureLocation from non-compound non-complement location (PRIVATE).

Simple examples:

>>> _loc("123..456", 1000)
FeatureLocation(ExactPosition(122),ExactPosition(456))
>>> _loc("<123..>456", 1000)
FeatureLocation(BeforePosition(122),AfterPosition(456))

A more complex location using within positions:

>>> _loc("(9.10)..(20.25)", 1000)
FeatureLocation(WithinPosition(8,1),WithinPosition(20,5))

A zero-length between feature:

>>> _loc("123^124", 1000)
FeatureLocation(ExactPosition(123),ExactPosition(123))

The expected sequence length is needed for a special case, a between position at the start/end of a circular genome:

>>> _loc("1000^1", 1000)
FeatureLocation(ExactPosition(1000),ExactPosition(1000))

Apart from this special case, between positions P^Q must have P+1==Q:

>>> _loc("123^456", 1000)
Traceback (most recent call last):
   ...
ValueError: Invalid between location '123^456'
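
The between-position rule can be sketched in a few lines of plain Python. The function below is a hypothetical illustration, not the code behind _loc. It returns a pair of integers instead of a FeatureLocation and handles only the P^Q form.

```python
def between_to_zero_length(loc_str, expected_seq_length):
    """Map a 'P^Q' between location to zero-length (start, end) integers."""
    start_str, end_str = loc_str.split("^")
    start, end = int(start_str), int(end_str)
    # Normal case: P and Q are adjacent bases.
    # Special case: the site between the last and first base of a circular
    # sequence, which is why the expected length is needed.
    if start + 1 == end or (start == expected_seq_length and end == 1):
        return start, start
    raise ValueError("Invalid between location %r" % loc_str)


print(between_to_zero_length("123^124", 1000))  # (123, 123)
print(between_to_zero_length("1000^1", 1000))   # (1000, 1000)
```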

_split_compound_loc(compound_loc)

Split a tricky compound location string (PRIVATE).

>>> list(_split_compound_loc("123..145"))
['123..145']
>>> list(_split_compound_loc("123..145,200..209"))
['123..145', '200..209']
>>> list(_split_compound_loc("one-of(200,203)..300"))
['one-of(200,203)..300']
>>> list(_split_compound_loc("complement(123..145),200..209"))
['complement(123..145)', '200..209']
>>> list(_split_compound_loc("123..145,one-of(200,203)..209"))
['123..145', 'one-of(200,203)..209']
>>> list(_split_compound_loc("123..145,one-of(200,203)..one-of(209,211),300"))
['123..145', 'one-of(200,203)..one-of(209,211)', '300']
>>> list(_split_compound_loc("123..145,complement(one-of(200,203)..one-of(209,211)),300"))
['123..145', 'complement(one-of(200,203)..one-of(209,211))', '300']
>>> list(_split_compound_loc("123..145,200..one-of(209,211),300"))
['123..145', '200..one-of(209,211)', '300']
>>> list(_split_compound_loc("123..145,200..one-of(209,211)"))
['123..145', '200..one-of(209,211)']
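
What makes these strings tricky is that commas appear both between parts and inside one-of(...) groups. One way to get the same results as the examples above is to split only on commas that are outside all parentheses. The sketch below takes that approach. The name split_top_level is hypothetical, and this is not necessarily how _split_compound_loc is implemented.

```python
def split_top_level(compound_loc):
    """Yield the comma-separated parts that sit outside any parentheses."""
    depth = 0
    current = []
    for char in compound_loc:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == "," and depth == 0:
            # A comma at depth zero separates two parts.
            yield "".join(current)
            current = []
        else:
            current.append(char)
    yield "".join(current)


print(list(split_top_level("123..145,one-of(200,203)..one-of(209,211),300")))
```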

Variables Details

_within_location

Value:
'([<>]?\\d+|\\(\\d+\\.\\d+\\))\\.\\.([<>]?\\d+|\\(\\d+\\.\\d+\\))'

_oneof_location

Value:
'([<>]?\\d+|one\\-of\\(\\d+(,\\d+)+\\))\\.\\.([<>]?\\d+|one\\-of\\(\\d\
+(,\\d+)+\\))'

_re_simple_compound

Value:
re.compile(r'^(join|order|bond)\(\d+\.\.\d+(,\d+\.\.\d+)*\)$')
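
To show what this pattern accepts, the sketch below recompiles the same expression under a hypothetical local name and tries it on a few location strings. Only join, order, or bond around plain start..end pairs match. Fuzzy positions and complement(...) fall outside it.

```python
import re

# Same expression as the value shown above, under a local name.
simple_compound = re.compile(r"^(join|order|bond)\(\d+\.\.\d+(,\d+\.\.\d+)*\)$")


def is_simple_compound(loc_str):
    """Return True if loc_str is a join/order/bond of plain N..M ranges."""
    return simple_compound.match(loc_str) is not None


print(is_simple_compound("join(1..10,20..30)"))   # True
print(is_simple_compound("join(<1..10,20..30)"))  # False
```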

_complex_location

Value:
'([a-zA-z][a-zA-Z0-9]*(\\.[a-zA-Z0-9]+)?\\:)?([<>]?\\d+\\.\\.[<>]?\\d+\
|[<>]?\\d+|\\d+\\^\\d+|([<>]?\\d+|\\(\\d+\\.\\d+\\))\\.\\.([<>]?\\d+|\\
\(\\d+\\.\\d+\\))|([<>]?\\d+|one\\-of\\(\\d+(,\\d+)+\\))\\.\\.([<>]?\\\
d+|one\\-of\\(\\d+(,\\d+)+\\)))'
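
The value above appears to be assembled from the smaller patterns in the Variables list: an optional accession prefix ending in a colon, followed by the pair, solo, between, within, or one-of alternatives. The sketch below rebuilds it that way under local names, as an illustration of the composition and not as the module's source, then anchors it with ^ and $ to test whole strings.

```python
import re

solo = r"[<>]?\d+"
pair = r"[<>]?\d+\.\.[<>]?\d+"
between = r"\d+\^\d+"
within_pos = r"\(\d+\.\d+\)"
within = r"([<>]?\d+|%s)\.\.([<>]?\d+|%s)" % (within_pos, within_pos)
oneof_pos = r"one\-of\(\d+(,\d+)+\)"
oneof = r"([<>]?\d+|%s)\.\.([<>]?\d+|%s)" % (oneof_pos, oneof_pos)

# Optional "ACCESSION.VERSION:" prefix, then one of the five location forms.
complex_location = (
    r"([a-zA-z][a-zA-Z0-9]*(\.[a-zA-Z0-9]+)?\:)?(%s|%s|%s|%s|%s)"
    % (pair, solo, between, within, oneof)
)
re_complex_location = re.compile("^" + complex_location + "$")

for text in ["J00194.1:100..202", "123^124", "(9.10)..(20.25)", "join(1..2)"]:
    print(text, bool(re_complex_location.match(text)))
```

The first three strings match and the last does not, since compound operators such as join are handled by the separate compound patterns.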

_re_complex_location

Value:
re.compile(r'^([a-zA-z][a-zA-Z0-9]*(\.[a-zA-Z0-9]+)?:)?([<>]?\d+\.\.[<\
>]?\d+|[<>]?\d+|\d+\^\d+|([<>]?\d+|\(\d+\.\d+\))\.\.([<>]?\d+|\(\d+\.\\
d+\))|([<>]?\d+|one-of\(\d+(,\d+)+\))\.\.([<>]?\d+|one-of\(\d+(,\d+)+\\
)))$')

_possibly_complemented_complex_location

Value:
'(([a-zA-z][a-zA-Z0-9]*(\\.[a-zA-Z0-9]+)?\\:)?([<>]?\\d+\\.\\.[<>]?\\d\
+|[<>]?\\d+|\\d+\\^\\d+|([<>]?\\d+|\\(\\d+\\.\\d+\\))\\.\\.([<>]?\\d+|\
\\(\\d+\\.\\d+\\))|([<>]?\\d+|one\\-of\\(\\d+(,\\d+)+\\))\\.\\.([<>]?\\
\d+|one\\-of\\(\\d+(,\\d+)+\\)))|complement\\(([a-zA-z][a-zA-Z0-9]*(\\\
.[a-zA-Z0-9]+)?\\:)?([<>]?\\d+\\.\\.[<>]?\\d+|[<>]?\\d+|\\d+\\^\\d+|([\
<>]?\\d+|\\(\\d+\\.\\d+\\))\\.\\.([<>]?\\d+|\\(\\d+\\.\\d+\\))|([<>]?\\
\d+|one\\-of\\(\\d+(,\\d+)+\\))\\.\\.([<>]?\\d+|one\\-of\\(\\d+(,\\d+)\
+\\)))\\))'

_re_complex_compound

Value:
re.compile(r'^(join|order|bond)\((([a-zA-z][a-zA-Z0-9]*(\.[a-zA-Z0-9]+\
)?:)?([<>]?\d+\.\.[<>]?\d+|[<>]?\d+|\d+\^\d+|([<>]?\d+|\(\d+\.\d+\))\.\
\.([<>]?\d+|\(\d+\.\d+\))|([<>]?\d+|one-of\(\d+(,\d+)+\))\.\.([<>]?\d+\
|one-of\(\d+(,\d+)+\)))|complement\(([a-zA-z][a-zA-Z0-9]*(\.[a-zA-Z0-9\
]+)?:)?([<>]?\d+\.\.[<>]?\d+|[<>]?\d+|\d+\^\d+|([<>]?\d+|\(\d+\.\d+\))\
\.\.([<>]?\d+|\(\d+\.\d+\))|([<>]?\d+|one-of\(\d+(,\d+)+\))\.\.([<>]?\\
d+|one-of\(\d+(,\d+)+\)))\))(,(([a-zA-z][a-zA-Z0-9]*(\.[a-zA-Z0-9]+)?:\
)?([<>]?\d+\.\.[<>]?\d+|[<>]?\d+|\d+\^\d+|([<>]?\d+|\(\d+\.\d+\))\.\.(\
...