Package Alphabet

Alphabets used in Seq objects, etc., to declare the sequence type and its letters.

This is used by sequences which contain a finite number of similar words.

Classes
  Alphabet
  SingleLetterAlphabet
  ProteinAlphabet
  NucleotideAlphabet
  DNAAlphabet
  RNAAlphabet
  SecondaryStructure
  ThreeLetterProtein
  AlphabetEncoder
  Gapped
  HasStopCodon
Functions
  _get_base_alphabet(alphabet)
      Returns the non-gapped non-stop-codon Alphabet object (PRIVATE).
  _ungap(alphabet)
      Returns the alphabet without any gap encoder (PRIVATE).
  _consensus_base_alphabet(alphabets)
      Returns a common but often generic base alphabet object (PRIVATE).
  _consensus_alphabet(alphabets)
      Returns a common but often generic alphabet object (PRIVATE).
  _check_type_compatible(alphabets)
      Returns True except for DNA+RNA or Nucleotide+Protein (PRIVATE).
  _verify_alphabet(sequence)
      Check all letters in sequence are in the alphabet (PRIVATE).
Variables
  generic_alphabet = Alphabet()
  single_letter_alphabet = SingleLetterAlphabet()
  generic_protein = ProteinAlphabet()
  generic_nucleotide = NucleotideAlphabet()
  generic_dna = DNAAlphabet()
  generic_rna = RNAAlphabet()
  __package__ = None
Function Details

_consensus_base_alphabet(alphabets)

Returns a common but often generic base alphabet object (PRIVATE).

This throws away any AlphabetEncoder information, e.g. Gapped alphabets.

Note that DNA+RNA -> Nucleotide, and Nucleotide+Protein -> generic single letter. These combinations DO NOT raise an exception!
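
The fallback rules described here can be illustrated with a small stand-alone sketch. The classes below are hypothetical stand-ins that only mirror the class names listed on this page, and consensus_base is an illustrative name, not the private Biopython function. The sketch assumes any encoder wrappers (such as Gapped) have already been stripped from the inputs.

```python
# Hypothetical stand-ins mirroring the class hierarchy listed on this page.
class Alphabet:
    def __repr__(self):
        return self.__class__.__name__ + "()"

class SingleLetterAlphabet(Alphabet): pass
class ProteinAlphabet(SingleLetterAlphabet): pass
class NucleotideAlphabet(SingleLetterAlphabet): pass
class DNAAlphabet(NucleotideAlphabet): pass
class RNAAlphabet(NucleotideAlphabet): pass

def consensus_base(alphabets):
    """Return the most specific alphabet that covers every input (sketch)."""
    common = None
    for a in alphabets:
        if common is None:
            common = a
        elif isinstance(a, type(common)):
            pass  # a is the same type or more specific: keep the generic one
        elif isinstance(common, type(a)):
            common = a  # a is more generic than what we have so far
        elif isinstance(a, NucleotideAlphabet) and isinstance(common, NucleotideAlphabet):
            common = NucleotideAlphabet()  # e.g. DNA + RNA
        elif isinstance(a, SingleLetterAlphabet) and isinstance(common, SingleLetterAlphabet):
            common = SingleLetterAlphabet()  # e.g. nucleotide + protein
        else:
            return Alphabet()  # nothing in common beyond the root class
    return common if common is not None else Alphabet()

dna_rna = consensus_base([DNAAlphabet(), RNAAlphabet()])
dna_protein = consensus_base([DNAAlphabet(), ProteinAlphabet()])
```

In this sketch, mixing incompatible types never raises; the result simply degrades to a more generic class, which matches the behaviour the note describes.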

_consensus_alphabet(alphabets)

Returns a common but often generic alphabet object (PRIVATE).

>>> from Bio.Alphabet import IUPAC
>>> _consensus_alphabet([IUPAC.extended_protein, IUPAC.protein])
ExtendedIUPACProtein()
>>> _consensus_alphabet([generic_protein, IUPAC.protein])
ProteinAlphabet()

Note that DNA+RNA -> Nucleotide, and Nucleotide+Protein -> generic single letter. These combinations DO NOT raise an exception!

>>> _consensus_alphabet([generic_dna, generic_nucleotide])
NucleotideAlphabet()
>>> _consensus_alphabet([generic_dna, generic_rna])
NucleotideAlphabet()
>>> _consensus_alphabet([generic_dna, generic_protein])
SingleLetterAlphabet()
>>> _consensus_alphabet([single_letter_alphabet, generic_protein])
SingleLetterAlphabet()

This function is aware of Gapped and HasStopCodon, and of new letters added by other AlphabetEncoders. It WILL raise a ValueError if more than one gap character or more than one stop symbol is present.

>>> from Bio.Alphabet import IUPAC
>>> _consensus_alphabet([Gapped(IUPAC.extended_protein), HasStopCodon(IUPAC.protein)])
HasStopCodon(Gapped(ExtendedIUPACProtein(), '-'), '*')
>>> _consensus_alphabet([Gapped(IUPAC.protein, "-"), Gapped(IUPAC.protein, "=")])
Traceback (most recent call last):
    ...
ValueError: More than one gap character present
>>> _consensus_alphabet([HasStopCodon(IUPAC.protein, "*"), HasStopCodon(IUPAC.protein, "+")])
Traceback (most recent call last):
    ...
ValueError: More than one stop symbol present
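
The gap-character check can be sketched in isolation. The Gapped class below is a hypothetical, simplified stand-in (it only stores a gap_char attribute and does not delegate to nested encoders), and consensus_gap_char is an illustrative name rather than part of Bio.Alphabet. The error message is copied from the doctest output above.

```python
class Gapped:
    """Hypothetical stand-in: wraps an alphabet and records its gap character."""
    def __init__(self, alphabet, gap_char="-"):
        self.alphabet = alphabet
        self.gap_char = gap_char

def consensus_gap_char(alphabets):
    """Return the single gap character shared by the inputs, or None (sketch)."""
    gap = None
    for alpha in alphabets:
        char = getattr(alpha, "gap_char", None)
        if char is None:
            continue  # this alphabet declares no gap character
        if gap is None:
            gap = char
        elif gap != char:
            raise ValueError("More than one gap character present")
    return gap

# Plain strings stand in for ungapped base alphabets here.
shared_gap = consensus_gap_char([Gapped("protein", "-"), "protein"])
```

The same pattern would apply to stop symbols, with a different attribute name and error message.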

_check_type_compatible(alphabets)

Returns True except for DNA+RNA or Nucleotide+Protein (PRIVATE).

>>> _check_type_compatible([generic_dna, generic_nucleotide])
True
>>> _check_type_compatible([generic_dna, generic_rna])
False
>>> _check_type_compatible([generic_dna, generic_protein])
False
>>> _check_type_compatible([single_letter_alphabet, generic_protein])
True

This relies on the Alphabet subclassing hierarchy. It does not check things like gap characters or stop symbols.
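
A minimal sketch of how such a check can rest purely on the subclass hierarchy follows. The classes are hypothetical stand-ins named after the ones listed on this page, and check_type_compatible is an illustrative re-creation, not the library's private function. As in the description above, gap characters and stop symbols are ignored.

```python
# Hypothetical stand-ins mirroring the class hierarchy listed on this page.
class Alphabet: pass
class SingleLetterAlphabet(Alphabet): pass
class ProteinAlphabet(SingleLetterAlphabet): pass
class NucleotideAlphabet(SingleLetterAlphabet): pass
class DNAAlphabet(NucleotideAlphabet): pass
class RNAAlphabet(NucleotideAlphabet): pass

def check_type_compatible(alphabets):
    """Return False for DNA+RNA or nucleotide+protein mixes, else True (sketch)."""
    dna = rna = nucl = protein = False
    for a in alphabets:
        # Test the most specific classes first, since DNA and RNA
        # are also instances of NucleotideAlphabet.
        if isinstance(a, DNAAlphabet):
            dna = nucl = True
            if rna or protein:
                return False
        elif isinstance(a, RNAAlphabet):
            rna = nucl = True
            if dna or protein:
                return False
        elif isinstance(a, NucleotideAlphabet):
            nucl = True
            if protein:
                return False
        elif isinstance(a, ProteinAlphabet):
            protein = True
            if nucl:
                return False
    return True

dna_with_rna = check_type_compatible([DNAAlphabet(), RNAAlphabet()])
```

Generic alphabets (the root class or the single-letter class) set no flags in this sketch, so they are compatible with everything.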

_verify_alphabet(sequence)

Check all letters in sequence are in the alphabet (PRIVATE).

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
...              IUPAC.protein)
>>> _verify_alphabet(my_seq)
True

This example contains an X, which is not in the IUPAC protein alphabet (use the IUPAC extended protein alphabet for such sequences):

>>> bad_seq = Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVFX",
...                IUPAC.protein)
>>> _verify_alphabet(bad_seq)
False

This replaces Bio.utils.verify_alphabet(), which is being deprecated. Potentially this check could be added to the Alphabet object, and it would be useful as an option when creating a Seq object, but that might slow things down.
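
The membership test itself is simple and can be sketched without Biopython. Here verify_letters is a hypothetical helper name, and PROTEIN_LETTERS is an assumed constant holding the 20 standard amino acid one-letter codes; the real function instead reads the letters from the sequence's alphabet object.

```python
# Assumed constant: the 20 standard amino acid one-letter codes.
PROTEIN_LETTERS = "ACDEFGHIKLMNPQRSTVWY"

def verify_letters(sequence, letters):
    """Return True if every character of sequence occurs in letters (sketch)."""
    if not letters:
        raise ValueError("Alphabet does not define letters.")
    allowed = set(letters)  # set lookup avoids rescanning the string per letter
    return all(letter in allowed for letter in sequence)

good = verify_letters("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF", PROTEIN_LETTERS)
bad = verify_letters("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVFX", PROTEIN_LETTERS)
```

The check is a single pass over the sequence, so its cost grows linearly with sequence length, which is the slowdown concern raised above for running it on every Seq creation.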