Package Bio :: Package AlignIO
[hide private]
[frames] | no frames]

Source Code for Package Bio.AlignIO

  1  # Copyright 2008-2010 by Peter Cock.  All rights reserved. 
  2  # This code is part of the Biopython distribution and governed by its 
  3  # license.  Please see the LICENSE file that should have been included 
  4  # as part of this package. 
  5   
  6  """Multiple sequence alignment input/output as alignment objects. 
  7   
  8  The Bio.AlignIO interface is deliberately very similar to Bio.SeqIO, and in 
  9  fact the two are connected internally.  Both modules use the same set of file 
 10  format names (lower case strings).  From the user's perspective, you can read 
 11  in a PHYLIP file containing one or more alignments using Bio.AlignIO, or you 
 12  can read in the sequences within these alignmenta using Bio.SeqIO. 
 13   
 14  Bio.AlignIO is also documented at U{http://biopython.org/wiki/AlignIO} and by 
 15  a whole chapter in our tutorial: 
 16   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.html} 
 17   - U{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf} 
 18   
 19  Input 
 20  ===== 
 21  For the typical special case when your file or handle contains one and only 
 22  one alignment, use the function Bio.AlignIO.read().  This takes an input file 
 23  handle (or in recent versions of Biopython a filename as a string), format 
 24  string and optional number of sequences per alignment.  It will return a single 
 25  MultipleSeqAlignment object (or raise an exception if there isn't just one 
 26  alignment): 
 27   
 28      >>> from Bio import AlignIO 
 29      >>> align = AlignIO.read("Phylip/interlaced.phy", "phylip") 
 30      >>> print align 
 31      SingleLetterAlphabet() alignment with 3 rows and 384 columns 
 32      -----MKVILLFVLAVFTVFVSS---------------RGIPPE...I-- CYS1_DICDI 
 33      MAHARVLLLALAVLATAAVAVASSSSFADSNPIRPVTDRAASTL...VAA ALEU_HORVU 
 34      ------MWATLPLLCAGAWLLGV--------PVCGAAELSVNSL...PLV CATH_HUMAN 
 35   
 36  For the general case, when the handle could contain any number of alignments, 
 37  use the function Bio.AlignIO.parse(...) which takes the same arguments, but 
 38  returns an iterator giving MultipleSeqAlignment objects (typically used in a 
 39  for loop). If you want random access to the alignments by number, turn this 
 40  into a list: 
 41   
 42      >>> from Bio import AlignIO 
 43      >>> alignments = list(AlignIO.parse("Emboss/needle.txt", "emboss")) 
 44      >>> print alignments[2] 
 45      SingleLetterAlphabet() alignment with 2 rows and 120 columns 
 46      -KILIVDDQYGIRILLNEVFNKEGYQTFQAANGLQALDIVTKER...--- ref_rec 
 47      LHIVVVDDDPGTCVYIESVFAELGHTCKSFVRPEAAEEYILTHP...HKE gi|94967506|receiver 
 48   
 49  Most alignment file formats can be concatenated so as to hold as many 
 50  different multiple sequence alignments as possible.  One common example 
 51  is the output of the tool seqboot in the PHLYIP suite.  Sometimes there 
 52  can be a file header and footer, as seen in the EMBOSS alignment output. 
 53   
 54  Output 
 55  ====== 
 56  Use the function Bio.AlignIO.write(...), which takes a complete set of 
 57  Alignment objects (either as a list, or an iterator), an output file handle 
 58  (or filename in recent versions of Biopython) and of course the file format:: 
 59   
 60      from Bio import AlignIO 
 61      alignments = ... 
 62      count = SeqIO.write(alignments, "example.faa", "fasta") 
 63   
 64  If using a handle make sure to close it to flush the data to the disk:: 
 65   
 66      from Bio import AlignIO 
 67      alignments = ... 
 68      handle = open("example.faa", "w") 
 69      count = SeqIO.write(alignments, handle, "fasta") 
 70      handle.close() 
 71   
 72  In general, you are expected to call this function once (with all your 
 73  alignments) and then close the file handle.  However, for file formats 
 74  like PHYLIP where multiple alignments are stored sequentially (with no file 
 75  header and footer), then multiple calls to the write function should work as 
 76  expected when using handles. 
 77   
 78  If you are using a filename, the repeated calls to the write functions will 
 79  overwrite the existing file each time. 
 80   
 81  Conversion 
 82  ========== 
 83  The Bio.AlignIO.convert(...) function allows an easy interface for simple 
 84  alignnment file format conversions. Additionally, it may use file format 
 85  specific optimisations so this should be the fastest way too. 
 86   
 87  In general however, you can combine the Bio.AlignIO.parse(...) function with 
 88  the Bio.AlignIO.write(...) function for sequence file conversion. Using 
 89  generator expressions provides a memory efficient way to perform filtering or 
 90  other extra operations as part of the process. 
 91   
 92  File Formats 
 93  ============ 
 94  When specifying the file format, use lowercase strings.  The same format 
 95  names are also used in Bio.SeqIO and include the following: 
 96   
 97   - clustal   - Ouput from Clustal W or X, see also the module Bio.Clustalw 
 98                 which can be used to run the command line tool from Biopython. 
 99   - emboss    - EMBOSS tools' "pairs" and "simple" alignment formats. 
100   - fasta     - The generic sequence file format where each record starts with 
101                 an identifer line starting with a ">" character, followed by 
102                 lines of sequence. 
103   - fasta-m10 - For the pairswise alignments output by Bill Pearson's FASTA 
104                 tools when used with the -m 10 command line option for machine 
105                 readable output. 
106   - ig        - The IntelliGenetics file format, apparently the same as the 
107                 MASE alignment format. 
108   - nexus     - Output from NEXUS, see also the module Bio.Nexus which can also 
109                 read any phylogenetic trees in these files. 
110   - phylip    - Used by the PHLIP tools. 
111   - stockholm - A richly annotated alignment file format used by PFAM. 
112   
113  Note that while Bio.AlignIO can read all the above file formats, it cannot 
114  write to all of them. 
115   
116  You can also use any file format supported by Bio.SeqIO, such as "fasta" or 
117  "ig" (which are listed above), PROVIDED the sequences in your file are all the 
118  same length. 
119  """ 
120  __docformat__ = "epytext en" #not just plaintext 
121   
122  #TODO 
123  # - define policy on reading aligned sequences with gaps in 
124  #   (e.g. - and . characters) including how the alphabet interacts 
125  # 
126  # - Can we build the to_alignment(...) functionality 
127  #   into the generic Alignment class instead? 
128  # 
129  # - How best to handle unique/non unique record.id when writing. 
130  #   For most file formats reading such files is fine; The stockholm 
131  #   parser would fail. 
132  # 
133  # - MSF multiple alignment format, aka GCG, aka PileUp format (*.msf) 
134  #   http://www.bioperl.org/wiki/MSF_multiple_alignment_format 
135   
136  from Bio.Align import MultipleSeqAlignment 
137  from Bio.Align.Generic import Alignment 
138  from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet 
139   
140  import StockholmIO 
141  import ClustalIO 
142  import NexusIO 
143  import PhylipIO 
144  import EmbossIO 
145  import FastaIO 
146   
147  #Convention for format names is "mainname-subtype" in lower case. 
148  #Please use the same names as BioPerl and EMBOSS where possible. 
149   
150  _FormatToIterator = {#"fasta" is done via Bio.SeqIO 
151                       "clustal" : ClustalIO.ClustalIterator, 
152                       "emboss" : EmbossIO.EmbossIterator, 
153                       "fasta-m10" : FastaIO.FastaM10Iterator, 
154                       "nexus" : NexusIO.NexusIterator, 
155                       "phylip" : PhylipIO.PhylipIterator, 
156                       "phylip-relaxed" : PhylipIO.RelaxedPhylipIterator, 
157                       "stockholm" : StockholmIO.StockholmIterator, 
158                       } 
159   
160  _FormatToWriter = {#"fasta" is done via Bio.SeqIO 
161                     #"emboss" : EmbossIO.EmbossWriter, (unfinished) 
162                     "nexus" : NexusIO.NexusWriter, 
163                     "phylip" : PhylipIO.PhylipWriter, 
164                     "phylip-relaxed" : PhylipIO.RelaxedPhylipWriter, 
165                     "stockholm" : StockholmIO.StockholmWriter, 
166                     "clustal" : ClustalIO.ClustalWriter, 
167                     } 
168   
169 -def write(alignments, handle, format):
170 """Write complete set of alignments to a file. 171 172 Arguments: 173 - alignments - A list (or iterator) of Alignment objects (ideally the 174 new MultipleSeqAlignment objects), or (if using Biopython 175 1.54 or later) a single alignment object. 176 - handle - File handle object to write to, or filename as string 177 (note older versions of Biopython only took a handle). 178 - format - lower case string describing the file format to write. 179 180 You should close the handle after calling this function. 181 182 Returns the number of alignments written (as an integer). 183 """ 184 from Bio import SeqIO 185 186 #Try and give helpful error messages: 187 if not isinstance(format, basestring): 188 raise TypeError("Need a string for the file format (lower case)") 189 if not format: 190 raise ValueError("Format required (lower case string)") 191 if format != format.lower(): 192 raise ValueError("Format string '%s' should be lower case" % format) 193 194 if isinstance(alignments, Alignment): 195 #This raised an exception in order version of Biopython 196 alignments = [alignments] 197 198 if isinstance(handle, basestring): 199 handle = open(handle, "w") 200 handle_close = True 201 else: 202 handle_close = False 203 204 #Map the file format to a writer class 205 if format in _FormatToWriter: 206 writer_class = _FormatToWriter[format] 207 count = writer_class(handle).write_file(alignments) 208 elif format in SeqIO._FormatToWriter: 209 #Exploit the existing SeqIO parser to the dirty work! 210 #TODO - Can we make one call to SeqIO.write() and count the alignments? 211 count = 0 212 for alignment in alignments: 213 if not isinstance(alignment, Alignment): 214 raise TypeError(\ 215 "Expect a list or iterator of Alignment objects.") 216 SeqIO.write(alignment, handle, format) 217 count += 1 218 elif format in _FormatToIterator or format in SeqIO._FormatToIterator: 219 raise ValueError("Reading format '%s' is supported, but not writing" \ 220 % format) 221 else: 222 raise ValueError("Unknown format '%s'" % format) 223 224 assert isinstance(count, int), "Internal error - the underlying %s " \ 225 "writer should have returned the alignment count, not %s" \ 226 % (format, repr(count)) 227 228 if handle_close: 229 handle.close() 230 231 return count
232 233 #This is a generator function!
234 -def _SeqIO_to_alignment_iterator(handle, format, alphabet=None, seq_count=None):
235 """Uses Bio.SeqIO to create an MultipleSeqAlignment iterator (PRIVATE). 236 237 Arguments: 238 - handle - handle to the file. 239 - format - string describing the file format. 240 - alphabet - optional Alphabet object, useful when the sequence type 241 cannot be automatically inferred from the file itself 242 (e.g. fasta, phylip, clustal) 243 - seq_count - Optional integer, number of sequences expected in each 244 alignment. Recommended for fasta format files. 245 246 If count is omitted (default) then all the sequences in the file are 247 combined into a single MultipleSeqAlignment. 248 """ 249 from Bio import SeqIO 250 assert format in SeqIO._FormatToIterator 251 252 if seq_count: 253 #Use the count to split the records into batches. 254 seq_record_iterator = SeqIO.parse(handle, format, alphabet) 255 256 records = [] 257 for record in seq_record_iterator: 258 records.append(record) 259 if len(records) == seq_count: 260 yield MultipleSeqAlignment(records, alphabet) 261 records = [] 262 if len(records) > 0: 263 raise ValueError("Check seq_count argument, not enough sequences?") 264 else: 265 #Must assume that there is a single alignment using all 266 #the SeqRecord objects: 267 records = list(SeqIO.parse(handle, format, alphabet)) 268 if records: 269 yield MultipleSeqAlignment(records, alphabet) 270 raise StopIteration
271
272 -def _force_alphabet(alignment_iterator, alphabet):
273 """Iterate over alignments, over-riding the alphabet (PRIVATE).""" 274 #Assume the alphabet argument has been pre-validated 275 given_base_class = _get_base_alphabet(alphabet).__class__ 276 for align in alignment_iterator: 277 if not isinstance(_get_base_alphabet(align._alphabet), 278 given_base_class): 279 raise ValueError("Specified alphabet %s clashes with "\ 280 "that determined from the file, %s" \ 281 % (repr(alphabet), repr(align._alphabet))) 282 for record in align: 283 if not isinstance(_get_base_alphabet(record.seq.alphabet), 284 given_base_class): 285 raise ValueError("Specified alphabet %s clashes with "\ 286 "that determined from the file, %s" \ 287 % (repr(alphabet), repr(record.seq.alphabet))) 288 record.seq.alphabet = alphabet 289 align._alphabet = alphabet 290 yield align
291
292 -def parse(handle, format, seq_count=None, alphabet=None):
293 """Iterate over an alignment file as MultipleSeqAlignment objects. 294 295 Arguments: 296 - handle - handle to the file, or the filename as a string 297 (note older verions of Biopython only took a handle). 298 - format - string describing the file format. 299 - alphabet - optional Alphabet object, useful when the sequence type 300 cannot be automatically inferred from the file itself 301 (e.g. fasta, phylip, clustal) 302 - seq_count - Optional integer, number of sequences expected in each 303 alignment. Recommended for fasta format files. 304 305 If you have the file name in a string 'filename', use: 306 307 >>> from Bio import AlignIO 308 >>> filename = "Emboss/needle.txt" 309 >>> format = "emboss" 310 >>> for alignment in AlignIO.parse(filename, format): 311 ... print "Alignment of length", alignment.get_alignment_length() 312 Alignment of length 124 313 Alignment of length 119 314 Alignment of length 120 315 Alignment of length 118 316 Alignment of length 125 317 318 If you have a string 'data' containing the file contents, use: 319 320 from Bio import AlignIO 321 from StringIO import StringIO 322 my_iterator = AlignIO.parse(StringIO(data), format) 323 324 Use the Bio.AlignIO.read() function when you expect a single record only. 325 """ 326 from Bio import SeqIO 327 328 handle_close = False 329 330 if isinstance(handle, basestring): 331 handle = open(handle, "rU") 332 #TODO - On Python 2.5+ use with statement to close handle 333 handle_close = True 334 335 #Try and give helpful error messages: 336 if not isinstance(format, basestring): 337 raise TypeError("Need a string for the file format (lower case)") 338 if not format: 339 raise ValueError("Format required (lower case string)") 340 if format != format.lower(): 341 raise ValueError("Format string '%s' should be lower case" % format) 342 if alphabet is not None and not (isinstance(alphabet, Alphabet) or \ 343 isinstance(alphabet, AlphabetEncoder)): 344 raise ValueError("Invalid alphabet, %s" % repr(alphabet)) 345 if seq_count is not None and not isinstance(seq_count, int): 346 raise TypeError("Need integer for seq_count (sequences per alignment)") 347 348 #Map the file format to a sequence iterator: 349 if format in _FormatToIterator: 350 iterator_generator = _FormatToIterator[format] 351 if alphabet is None : 352 i = iterator_generator(handle, seq_count) 353 else: 354 try: 355 #Initially assume the optional alphabet argument is supported 356 i = iterator_generator(handle, seq_count, alphabet=alphabet) 357 except TypeError: 358 #It isn't supported. 359 i = _force_alphabet(iterator_generator(handle, seq_count), 360 alphabet) 361 362 elif format in SeqIO._FormatToIterator: 363 #Exploit the existing SeqIO parser to the dirty work! 364 i = _SeqIO_to_alignment_iterator(handle, format, 365 alphabet=alphabet, 366 seq_count=seq_count) 367 else: 368 raise ValueError("Unknown format '%s'" % format) 369 370 #This imposes some overhead... wait until we drop Python 2.4 to fix it 371 for a in i: 372 yield a 373 if handle_close: 374 handle.close()
375
376 -def read(handle, format, seq_count=None, alphabet=None):
377 """Turns an alignment file into a single MultipleSeqAlignment object. 378 379 Arguments: 380 - handle - handle to the file, or the filename as a string 381 (note older verions of Biopython only took a handle). 382 - format - string describing the file format. 383 - alphabet - optional Alphabet object, useful when the sequence type 384 cannot be automatically inferred from the file itself 385 (e.g. fasta, phylip, clustal) 386 - seq_count - Optional integer, number of sequences expected in each 387 alignment. Recommended for fasta format files. 388 389 If the handle contains no alignments, or more than one alignment, 390 an exception is raised. For example, using a PFAM/Stockholm file 391 containing one alignment: 392 393 >>> from Bio import AlignIO 394 >>> filename = "Clustalw/protein.aln" 395 >>> format = "clustal" 396 >>> alignment = AlignIO.read(filename, format) 397 >>> print "Alignment of length", alignment.get_alignment_length() 398 Alignment of length 411 399 400 If however you want the first alignment from a file containing 401 multiple alignments this function would raise an exception. 402 403 >>> from Bio import AlignIO 404 >>> filename = "Emboss/needle.txt" 405 >>> format = "emboss" 406 >>> alignment = AlignIO.read(filename, format) 407 Traceback (most recent call last): 408 ... 409 ValueError: More than one record found in handle 410 411 Instead use: 412 413 >>> from Bio import AlignIO 414 >>> filename = "Emboss/needle.txt" 415 >>> format = "emboss" 416 >>> alignment = AlignIO.parse(filename, format).next() 417 >>> print "First alignment has length", alignment.get_alignment_length() 418 First alignment has length 124 419 420 You must use the Bio.AlignIO.parse() function if you want to read multiple 421 records from the handle. 422 """ 423 iterator = parse(handle, format, seq_count, alphabet) 424 try: 425 first = iterator.next() 426 except StopIteration: 427 first = None 428 if first is None: 429 raise ValueError("No records found in handle") 430 try: 431 second = iterator.next() 432 except StopIteration: 433 second = None 434 if second is not None: 435 raise ValueError("More than one record found in handle") 436 if seq_count: 437 assert len(first)==seq_count 438 return first
439
440 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
441 """Convert between two alignment files, returns number of alignments. 442 443 - in_file - an input handle or filename 444 - in_format - input file format, lower case string 445 - output - an output handle or filename 446 - out_file - output file format, lower case string 447 - alphabet - optional alphabet to assume 448 449 NOTE - If you provide an output filename, it will be opened which will 450 overwrite any existing file without warning. This may happen if even the 451 conversion is aborted (e.g. an invalid out_format name is given). 452 """ 453 #TODO - Add optimised versions of important conversions 454 #For now just off load the work to SeqIO parse/write 455 if isinstance(in_file, basestring): 456 in_handle = open(in_file, "rU") 457 in_close = True 458 else: 459 in_handle = in_file 460 in_close = False 461 #This will check the arguments and issue error messages, 462 alignments = parse(in_handle, in_format, None, alphabet) 463 #Don't open the output file until we've checked the input is OK: 464 if isinstance(out_file, basestring): 465 out_handle = open(out_file, "w") 466 out_close = True 467 else: 468 out_handle = out_file 469 out_close = False 470 #This will check the arguments and issue error messages, 471 #after we have opened the file which is a shame. 472 count = write(alignments, out_handle, out_format) 473 #Must now close any handles we opened 474 if in_close: 475 in_handle.close() 476 if out_close: 477 out_handle.close() 478 return count
479
480 -def _test():
481 """Run the Bio.AlignIO module's doctests. 482 483 This will try and locate the unit tests directory, and run the doctests 484 from there in order that the relative paths used in the examples work. 485 """ 486 import doctest 487 import os 488 if os.path.isdir(os.path.join("..", "..", "Tests")): 489 print "Runing doctests..." 490 cur_dir = os.path.abspath(os.curdir) 491 os.chdir(os.path.join("..", "..", "Tests")) 492 doctest.testmod() 493 os.chdir(cur_dir) 494 del cur_dir 495 print "Done" 496 elif os.path.isdir(os.path.join("Tests", "Fasta")): 497 print "Runing doctests..." 498 cur_dir = os.path.abspath(os.curdir) 499 os.chdir(os.path.join("Tests")) 500 doctest.testmod() 501 os.chdir(cur_dir) 502 del cur_dir 503 print "Done"
504 505 if __name__ == "__main__": 506 _test() 507