1
2
3
4
5
6 """Multiple sequence alignment input/output as alignment objects.
7
8 The Bio.AlignIO interface is deliberately very similar to Bio.SeqIO, and in
9 fact the two are connected internally. Both modules use the same set of file
10 format names (lower case strings). From the user's perspective, you can read
11 in a PHYLIP file containing one or more alignments using Bio.AlignIO, or you
12 can read in the sequences within these alignmenta using Bio.SeqIO.
13
14 Bio.AlignIO is also documented at U{http://biopython.org/wiki/AlignIO} and by
15 a whole chapter in our tutorial:
16 - U{http://biopython.org/DIST/docs/tutorial/Tutorial.html}
17 - U{http://biopython.org/DIST/docs/tutorial/Tutorial.pdf}
18
19 Input
20 =====
21 For the typical special case when your file or handle contains one and only
22 one alignment, use the function Bio.AlignIO.read(). This takes an input file
23 handle (or in recent versions of Biopython a filename as a string), format
24 string and optional number of sequences per alignment. It will return a single
25 MultipleSeqAlignment object (or raise an exception if there isn't just one
26 alignment):
27
28 >>> from Bio import AlignIO
29 >>> align = AlignIO.read("Phylip/interlaced.phy", "phylip")
30 >>> print align
31 SingleLetterAlphabet() alignment with 3 rows and 384 columns
32 -----MKVILLFVLAVFTVFVSS---------------RGIPPE...I-- CYS1_DICDI
33 MAHARVLLLALAVLATAAVAVASSSSFADSNPIRPVTDRAASTL...VAA ALEU_HORVU
34 ------MWATLPLLCAGAWLLGV--------PVCGAAELSVNSL...PLV CATH_HUMAN
35
36 For the general case, when the handle could contain any number of alignments,
37 use the function Bio.AlignIO.parse(...) which takes the same arguments, but
38 returns an iterator giving MultipleSeqAlignment objects (typically used in a
39 for loop). If you want random access to the alignments by number, turn this
40 into a list:
41
42 >>> from Bio import AlignIO
43 >>> alignments = list(AlignIO.parse("Emboss/needle.txt", "emboss"))
44 >>> print alignments[2]
45 SingleLetterAlphabet() alignment with 2 rows and 120 columns
46 -KILIVDDQYGIRILLNEVFNKEGYQTFQAANGLQALDIVTKER...--- ref_rec
47 LHIVVVDDDPGTCVYIESVFAELGHTCKSFVRPEAAEEYILTHP...HKE gi|94967506|receiver
48
49 Most alignment file formats can be concatenated so as to hold as many
50 different multiple sequence alignments as possible. One common example
51 is the output of the tool seqboot in the PHLYIP suite. Sometimes there
52 can be a file header and footer, as seen in the EMBOSS alignment output.
53
54 Output
55 ======
56 Use the function Bio.AlignIO.write(...), which takes a complete set of
57 Alignment objects (either as a list, or an iterator), an output file handle
58 (or filename in recent versions of Biopython) and of course the file format::
59
60 from Bio import AlignIO
61 alignments = ...
62 count = SeqIO.write(alignments, "example.faa", "fasta")
63
64 If using a handle make sure to close it to flush the data to the disk::
65
66 from Bio import AlignIO
67 alignments = ...
68 handle = open("example.faa", "w")
69 count = SeqIO.write(alignments, handle, "fasta")
70 handle.close()
71
72 In general, you are expected to call this function once (with all your
73 alignments) and then close the file handle. However, for file formats
74 like PHYLIP where multiple alignments are stored sequentially (with no file
75 header and footer), then multiple calls to the write function should work as
76 expected when using handles.
77
78 If you are using a filename, the repeated calls to the write functions will
79 overwrite the existing file each time.
80
81 Conversion
82 ==========
83 The Bio.AlignIO.convert(...) function allows an easy interface for simple
84 alignnment file format conversions. Additionally, it may use file format
85 specific optimisations so this should be the fastest way too.
86
87 In general however, you can combine the Bio.AlignIO.parse(...) function with
88 the Bio.AlignIO.write(...) function for sequence file conversion. Using
89 generator expressions provides a memory efficient way to perform filtering or
90 other extra operations as part of the process.
91
92 File Formats
93 ============
94 When specifying the file format, use lowercase strings. The same format
95 names are also used in Bio.SeqIO and include the following:
96
97 - clustal - Ouput from Clustal W or X, see also the module Bio.Clustalw
98 which can be used to run the command line tool from Biopython.
99 - emboss - EMBOSS tools' "pairs" and "simple" alignment formats.
100 - fasta - The generic sequence file format where each record starts with
101 an identifer line starting with a ">" character, followed by
102 lines of sequence.
103 - fasta-m10 - For the pairswise alignments output by Bill Pearson's FASTA
104 tools when used with the -m 10 command line option for machine
105 readable output.
106 - ig - The IntelliGenetics file format, apparently the same as the
107 MASE alignment format.
108 - nexus - Output from NEXUS, see also the module Bio.Nexus which can also
109 read any phylogenetic trees in these files.
110 - phylip - Used by the PHLIP tools.
111 - stockholm - A richly annotated alignment file format used by PFAM.
112
113 Note that while Bio.AlignIO can read all the above file formats, it cannot
114 write to all of them.
115
116 You can also use any file format supported by Bio.SeqIO, such as "fasta" or
117 "ig" (which are listed above), PROVIDED the sequences in your file are all the
118 same length.
119 """
120 __docformat__ = "epytext en"
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136 from Bio.Align import MultipleSeqAlignment
137 from Bio.Align.Generic import Alignment
138 from Bio.Alphabet import Alphabet, AlphabetEncoder, _get_base_alphabet
139
140 import StockholmIO
141 import ClustalIO
142 import NexusIO
143 import PhylipIO
144 import EmbossIO
145 import FastaIO
146
147
148
149
150 _FormatToIterator = {
151 "clustal" : ClustalIO.ClustalIterator,
152 "emboss" : EmbossIO.EmbossIterator,
153 "fasta-m10" : FastaIO.FastaM10Iterator,
154 "nexus" : NexusIO.NexusIterator,
155 "phylip" : PhylipIO.PhylipIterator,
156 "phylip-relaxed" : PhylipIO.RelaxedPhylipIterator,
157 "stockholm" : StockholmIO.StockholmIterator,
158 }
159
160 _FormatToWriter = {
161
162 "nexus" : NexusIO.NexusWriter,
163 "phylip" : PhylipIO.PhylipWriter,
164 "phylip-relaxed" : PhylipIO.RelaxedPhylipWriter,
165 "stockholm" : StockholmIO.StockholmWriter,
166 "clustal" : ClustalIO.ClustalWriter,
167 }
168
169 -def write(alignments, handle, format):
170 """Write complete set of alignments to a file.
171
172 Arguments:
173 - alignments - A list (or iterator) of Alignment objects (ideally the
174 new MultipleSeqAlignment objects), or (if using Biopython
175 1.54 or later) a single alignment object.
176 - handle - File handle object to write to, or filename as string
177 (note older versions of Biopython only took a handle).
178 - format - lower case string describing the file format to write.
179
180 You should close the handle after calling this function.
181
182 Returns the number of alignments written (as an integer).
183 """
184 from Bio import SeqIO
185
186
187 if not isinstance(format, basestring):
188 raise TypeError("Need a string for the file format (lower case)")
189 if not format:
190 raise ValueError("Format required (lower case string)")
191 if format != format.lower():
192 raise ValueError("Format string '%s' should be lower case" % format)
193
194 if isinstance(alignments, Alignment):
195
196 alignments = [alignments]
197
198 if isinstance(handle, basestring):
199 handle = open(handle, "w")
200 handle_close = True
201 else:
202 handle_close = False
203
204
205 if format in _FormatToWriter:
206 writer_class = _FormatToWriter[format]
207 count = writer_class(handle).write_file(alignments)
208 elif format in SeqIO._FormatToWriter:
209
210
211 count = 0
212 for alignment in alignments:
213 if not isinstance(alignment, Alignment):
214 raise TypeError(\
215 "Expect a list or iterator of Alignment objects.")
216 SeqIO.write(alignment, handle, format)
217 count += 1
218 elif format in _FormatToIterator or format in SeqIO._FormatToIterator:
219 raise ValueError("Reading format '%s' is supported, but not writing" \
220 % format)
221 else:
222 raise ValueError("Unknown format '%s'" % format)
223
224 assert isinstance(count, int), "Internal error - the underlying %s " \
225 "writer should have returned the alignment count, not %s" \
226 % (format, repr(count))
227
228 if handle_close:
229 handle.close()
230
231 return count
232
233
235 """Uses Bio.SeqIO to create an MultipleSeqAlignment iterator (PRIVATE).
236
237 Arguments:
238 - handle - handle to the file.
239 - format - string describing the file format.
240 - alphabet - optional Alphabet object, useful when the sequence type
241 cannot be automatically inferred from the file itself
242 (e.g. fasta, phylip, clustal)
243 - seq_count - Optional integer, number of sequences expected in each
244 alignment. Recommended for fasta format files.
245
246 If count is omitted (default) then all the sequences in the file are
247 combined into a single MultipleSeqAlignment.
248 """
249 from Bio import SeqIO
250 assert format in SeqIO._FormatToIterator
251
252 if seq_count:
253
254 seq_record_iterator = SeqIO.parse(handle, format, alphabet)
255
256 records = []
257 for record in seq_record_iterator:
258 records.append(record)
259 if len(records) == seq_count:
260 yield MultipleSeqAlignment(records, alphabet)
261 records = []
262 if len(records) > 0:
263 raise ValueError("Check seq_count argument, not enough sequences?")
264 else:
265
266
267 records = list(SeqIO.parse(handle, format, alphabet))
268 if records:
269 yield MultipleSeqAlignment(records, alphabet)
270 raise StopIteration
271
273 """Iterate over alignments, over-riding the alphabet (PRIVATE)."""
274
275 given_base_class = _get_base_alphabet(alphabet).__class__
276 for align in alignment_iterator:
277 if not isinstance(_get_base_alphabet(align._alphabet),
278 given_base_class):
279 raise ValueError("Specified alphabet %s clashes with "\
280 "that determined from the file, %s" \
281 % (repr(alphabet), repr(align._alphabet)))
282 for record in align:
283 if not isinstance(_get_base_alphabet(record.seq.alphabet),
284 given_base_class):
285 raise ValueError("Specified alphabet %s clashes with "\
286 "that determined from the file, %s" \
287 % (repr(alphabet), repr(record.seq.alphabet)))
288 record.seq.alphabet = alphabet
289 align._alphabet = alphabet
290 yield align
291
292 -def parse(handle, format, seq_count=None, alphabet=None):
293 """Iterate over an alignment file as MultipleSeqAlignment objects.
294
295 Arguments:
296 - handle - handle to the file, or the filename as a string
297 (note older verions of Biopython only took a handle).
298 - format - string describing the file format.
299 - alphabet - optional Alphabet object, useful when the sequence type
300 cannot be automatically inferred from the file itself
301 (e.g. fasta, phylip, clustal)
302 - seq_count - Optional integer, number of sequences expected in each
303 alignment. Recommended for fasta format files.
304
305 If you have the file name in a string 'filename', use:
306
307 >>> from Bio import AlignIO
308 >>> filename = "Emboss/needle.txt"
309 >>> format = "emboss"
310 >>> for alignment in AlignIO.parse(filename, format):
311 ... print "Alignment of length", alignment.get_alignment_length()
312 Alignment of length 124
313 Alignment of length 119
314 Alignment of length 120
315 Alignment of length 118
316 Alignment of length 125
317
318 If you have a string 'data' containing the file contents, use:
319
320 from Bio import AlignIO
321 from StringIO import StringIO
322 my_iterator = AlignIO.parse(StringIO(data), format)
323
324 Use the Bio.AlignIO.read() function when you expect a single record only.
325 """
326 from Bio import SeqIO
327
328 handle_close = False
329
330 if isinstance(handle, basestring):
331 handle = open(handle, "rU")
332
333 handle_close = True
334
335
336 if not isinstance(format, basestring):
337 raise TypeError("Need a string for the file format (lower case)")
338 if not format:
339 raise ValueError("Format required (lower case string)")
340 if format != format.lower():
341 raise ValueError("Format string '%s' should be lower case" % format)
342 if alphabet is not None and not (isinstance(alphabet, Alphabet) or \
343 isinstance(alphabet, AlphabetEncoder)):
344 raise ValueError("Invalid alphabet, %s" % repr(alphabet))
345 if seq_count is not None and not isinstance(seq_count, int):
346 raise TypeError("Need integer for seq_count (sequences per alignment)")
347
348
349 if format in _FormatToIterator:
350 iterator_generator = _FormatToIterator[format]
351 if alphabet is None :
352 i = iterator_generator(handle, seq_count)
353 else:
354 try:
355
356 i = iterator_generator(handle, seq_count, alphabet=alphabet)
357 except TypeError:
358
359 i = _force_alphabet(iterator_generator(handle, seq_count),
360 alphabet)
361
362 elif format in SeqIO._FormatToIterator:
363
364 i = _SeqIO_to_alignment_iterator(handle, format,
365 alphabet=alphabet,
366 seq_count=seq_count)
367 else:
368 raise ValueError("Unknown format '%s'" % format)
369
370
371 for a in i:
372 yield a
373 if handle_close:
374 handle.close()
375
376 -def read(handle, format, seq_count=None, alphabet=None):
377 """Turns an alignment file into a single MultipleSeqAlignment object.
378
379 Arguments:
380 - handle - handle to the file, or the filename as a string
381 (note older verions of Biopython only took a handle).
382 - format - string describing the file format.
383 - alphabet - optional Alphabet object, useful when the sequence type
384 cannot be automatically inferred from the file itself
385 (e.g. fasta, phylip, clustal)
386 - seq_count - Optional integer, number of sequences expected in each
387 alignment. Recommended for fasta format files.
388
389 If the handle contains no alignments, or more than one alignment,
390 an exception is raised. For example, using a PFAM/Stockholm file
391 containing one alignment:
392
393 >>> from Bio import AlignIO
394 >>> filename = "Clustalw/protein.aln"
395 >>> format = "clustal"
396 >>> alignment = AlignIO.read(filename, format)
397 >>> print "Alignment of length", alignment.get_alignment_length()
398 Alignment of length 411
399
400 If however you want the first alignment from a file containing
401 multiple alignments this function would raise an exception.
402
403 >>> from Bio import AlignIO
404 >>> filename = "Emboss/needle.txt"
405 >>> format = "emboss"
406 >>> alignment = AlignIO.read(filename, format)
407 Traceback (most recent call last):
408 ...
409 ValueError: More than one record found in handle
410
411 Instead use:
412
413 >>> from Bio import AlignIO
414 >>> filename = "Emboss/needle.txt"
415 >>> format = "emboss"
416 >>> alignment = AlignIO.parse(filename, format).next()
417 >>> print "First alignment has length", alignment.get_alignment_length()
418 First alignment has length 124
419
420 You must use the Bio.AlignIO.parse() function if you want to read multiple
421 records from the handle.
422 """
423 iterator = parse(handle, format, seq_count, alphabet)
424 try:
425 first = iterator.next()
426 except StopIteration:
427 first = None
428 if first is None:
429 raise ValueError("No records found in handle")
430 try:
431 second = iterator.next()
432 except StopIteration:
433 second = None
434 if second is not None:
435 raise ValueError("More than one record found in handle")
436 if seq_count:
437 assert len(first)==seq_count
438 return first
439
440 -def convert(in_file, in_format, out_file, out_format, alphabet=None):
441 """Convert between two alignment files, returns number of alignments.
442
443 - in_file - an input handle or filename
444 - in_format - input file format, lower case string
445 - output - an output handle or filename
446 - out_file - output file format, lower case string
447 - alphabet - optional alphabet to assume
448
449 NOTE - If you provide an output filename, it will be opened which will
450 overwrite any existing file without warning. This may happen if even the
451 conversion is aborted (e.g. an invalid out_format name is given).
452 """
453
454
455 if isinstance(in_file, basestring):
456 in_handle = open(in_file, "rU")
457 in_close = True
458 else:
459 in_handle = in_file
460 in_close = False
461
462 alignments = parse(in_handle, in_format, None, alphabet)
463
464 if isinstance(out_file, basestring):
465 out_handle = open(out_file, "w")
466 out_close = True
467 else:
468 out_handle = out_file
469 out_close = False
470
471
472 count = write(alignments, out_handle, out_format)
473
474 if in_close:
475 in_handle.close()
476 if out_close:
477 out_handle.close()
478 return count
479
481 """Run the Bio.AlignIO module's doctests.
482
483 This will try and locate the unit tests directory, and run the doctests
484 from there in order that the relative paths used in the examples work.
485 """
486 import doctest
487 import os
488 if os.path.isdir(os.path.join("..", "..", "Tests")):
489 print "Runing doctests..."
490 cur_dir = os.path.abspath(os.curdir)
491 os.chdir(os.path.join("..", "..", "Tests"))
492 doctest.testmod()
493 os.chdir(cur_dir)
494 del cur_dir
495 print "Done"
496 elif os.path.isdir(os.path.join("Tests", "Fasta")):
497 print "Runing doctests..."
498 cur_dir = os.path.abspath(os.curdir)
499 os.chdir(os.path.join("Tests"))
500 doctest.testmod()
501 os.chdir(cur_dir)
502 del cur_dir
503 print "Done"
504
505 if __name__ == "__main__":
506 _test()
507