1
2
3
4
5
6
7
8
9 """Bio.SeqIO support for the FASTQ and QUAL file formats.
10
11 Note that you are expected to use this code via the Bio.SeqIO interface, as
12 shown below.
13
14 The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute
15 to bundle a FASTA sequence and its PHRED quality data (integers between 0 and
16 90). Rather than using a single FASTQ file, often paired FASTA and QUAL files
17 are used containing the sequence and the quality information separately.
18
19 The PHRED software reads DNA sequencing trace files, calls bases, and
20 assigns a non-negative quality value to each called base using a logged
21 transformation of the error probability, Q = -10 log10( Pe ), for example::
22
23 Pe = 1.0, Q = 0
24 Pe = 0.1, Q = 10
25 Pe = 0.01, Q = 20
26 ...
27 Pe = 0.00000001, Q = 80
28 Pe = 0.000000001, Q = 90
29
30 In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40.
31 In the QUAL format these quality values are held as space separated text in
32 a FASTA like file format. In the FASTQ format, each quality values is encoded
33 with a single ASCI character using chr(Q+33), meaning zero maps to the
34 character "!" and for example 80 maps to "q". For the Sanger FASTQ standard
35 the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and
36 quality are then stored in pairs in a FASTA like format.
37
38 Unfortunately there is no official document describing the FASTQ file format,
39 and worse, several related but different variants exist. For more details,
40 please read this open access publication::
41
42 The Sanger FASTQ file format for sequences with quality scores, and the
43 Solexa/Illumina FASTQ variants.
44 P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby),
45 M.L.Heuer (BioJava) and P.M. Rice (EMBOSS).
46 Nucleic Acids Research 2010 38(6):1767-1771
47 http://dx.doi.org/10.1093/nar/gkp1137
48
49 The good news is that Roche 454 sequencers can output files in the QUAL format,
50 and sensibly they use PHREP style scores like Sanger. Converting a pair of
51 FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL
52 files from a Roche 454 SFF binary file, use the Roche off instrument command
53 line tool "sffinfo" with the -q or -qual argument. You can extract a matching
54 FASTA file using the -s or -seq argument instead.
55
56 The bad news is that Solexa/Illumina did things differently - they have their
57 own scoring system AND their own incompatible versions of the FASTQ format.
58 Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can
59 be negative. PHRED scores and Solexa scores are NOT interchangeable (but a
60 reasonable mapping can be achieved between them, and they are approximately
61 equal for higher quality reads).
62
63 Confusingly early Solexa pipelines produced a FASTQ like file but using their
64 own score mapping and an ASCII offset of 64. To make things worse, for the
65 Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the
66 FASTQ file format, this time using PHRED scores (which is more consistent) but
67 with an ASCII offset of 64.
68
69 i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ
70 file format: The original Sanger PHRED standard, and two from Solexa/Illumina.
71
72 The good news is that as of CASAVA version 1.8, Illumina sequencers will
73 produce FASTQ files using the standard Sanger encoding.
74
75 You are expected to use this module via the Bio.SeqIO functions, with the
76 following format names:
77
78 - "qual" means simple quality files using PHRED scores (e.g. from Roche 454)
79 - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII
80 offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+).
81 These can potentially hold PHRED scores from 0 to 93.
82 - "fastq-sanger" is an alias for "fastq".
83 - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ
84 files, using Solexa scores with an ASCII offset 64. These can hold Solexa
85 scores from -5 to 62.
86 - "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using
87 PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0
88 to 62.
89
90 We could potentially add support for "qual-solexa" meaning QUAL files which
91 contain Solexa scores, but thus far there isn't any reason to use such files.
92
93 For example, consider the following short FASTQ file::
94
95 @EAS54_6_R1_2_1_413_324
96 CCCTTCTTGTCTTCAGCGTTTCTCC
97 +
98 ;;3;;;;;;;;;;;;7;;;;;;;88
99 @EAS54_6_R1_2_1_540_792
100 TTGGCAGGCCAAGGCCGATGGATCA
101 +
102 ;;;;;;;;;;;7;;;;;-;;;3;83
103 @EAS54_6_R1_2_1_443_348
104 GTTGCTTCTGGCGTGGGTGGGGGGG
105 +
106 ;;;;;;;;;;;9;7;;.7;393333
107
108 This contains three reads of length 25. From the read length these were
109 probably originally from an early Solexa/Illumina sequencer but this file
110 follows the Sanger FASTQ convention (PHRED style qualities with an ASCII
111 offet of 33). This means we can parse this file using Bio.SeqIO using
112 "fastq" as the format name:
113
114 >>> from Bio import SeqIO
115 >>> for record in SeqIO.parse("Quality/example.fastq", "fastq"):
116 ... print record.id, record.seq
117 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
118 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
119 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
120
121 The qualities are held as a list of integers in each record's annotation:
122
123 >>> print record
124 ID: EAS54_6_R1_2_1_443_348
125 Name: EAS54_6_R1_2_1_443_348
126 Description: EAS54_6_R1_2_1_443_348
127 Number of features: 0
128 Per letter annotation for: phred_quality
129 Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet())
130 >>> print record.letter_annotations["phred_quality"]
131 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
132
133 You can use the SeqRecord format method to show this in the QUAL format:
134
135 >>> print record.format("qual")
136 >EAS54_6_R1_2_1_443_348
137 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
138 24 18 18 18 18
139 <BLANKLINE>
140
141 Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"):
142
143 >>> print record.format("fastq")
144 @EAS54_6_R1_2_1_443_348
145 GTTGCTTCTGGCGTGGGTGGGGGGG
146 +
147 ;;;;;;;;;;;9;7;;.7;393333
148 <BLANKLINE>
149
150 Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset
151 of 64):
152
153 >>> print record.format("fastq-illumina")
154 @EAS54_6_R1_2_1_443_348
155 GTTGCTTCTGGCGTGGGTGGGGGGG
156 +
157 ZZZZZZZZZZZXZVZZMVZRXRRRR
158 <BLANKLINE>
159
160 You can also get Biopython to convert the scores and show a Solexa style
161 FASTQ file:
162
163 >>> print record.format("fastq-solexa")
164 @EAS54_6_R1_2_1_443_348
165 GTTGCTTCTGGCGTGGGTGGGGGGG
166 +
167 ZZZZZZZZZZZXZVZZMVZRXRRRR
168 <BLANKLINE>
169
170 Notice that this is actually the same output as above using "fastq-illumina"
171 as the format! The reason for this is all these scores are high enough that
172 the PHRED and Solexa scores are almost equal. The differences become apparent
173 for poor quality reads. See the functions solexa_quality_from_phred and
174 phred_quality_from_solexa for more details.
175
176 If you wanted to trim your sequences (perhaps to remove low quality regions,
177 or to remove a primer sequence), try slicing the SeqRecord objects. e.g.
178
179 >>> sub_rec = record[5:15]
180 >>> print sub_rec
181 ID: EAS54_6_R1_2_1_443_348
182 Name: EAS54_6_R1_2_1_443_348
183 Description: EAS54_6_R1_2_1_443_348
184 Number of features: 0
185 Per letter annotation for: phred_quality
186 Seq('TTCTGGCGTG', SingleLetterAlphabet())
187 >>> print sub_rec.letter_annotations["phred_quality"]
188 [26, 26, 26, 26, 26, 26, 24, 26, 22, 26]
189 >>> print sub_rec.format("fastq")
190 @EAS54_6_R1_2_1_443_348
191 TTCTGGCGTG
192 +
193 ;;;;;;9;7;
194 <BLANKLINE>
195
196 If you wanted to, you could read in this FASTQ file, and save it as a QUAL file:
197
198 >>> from Bio import SeqIO
199 >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq")
200 >>> out_handle = open("Quality/temp.qual", "w")
201 >>> SeqIO.write(record_iterator, out_handle, "qual")
202 3
203 >>> out_handle.close()
204
205 You can of course read in a QUAL file, such as the one we just created:
206
207 >>> from Bio import SeqIO
208 >>> for record in SeqIO.parse("Quality/temp.qual", "qual"):
209 ... print record.id, record.seq
210 EAS54_6_R1_2_1_413_324 ?????????????????????????
211 EAS54_6_R1_2_1_540_792 ?????????????????????????
212 EAS54_6_R1_2_1_443_348 ?????????????????????????
213
214 Notice that QUAL files don't have a proper sequence present! But the quality
215 information is there:
216
217 >>> print record
218 ID: EAS54_6_R1_2_1_443_348
219 Name: EAS54_6_R1_2_1_443_348
220 Description: EAS54_6_R1_2_1_443_348
221 Number of features: 0
222 Per letter annotation for: phred_quality
223 UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?')
224 >>> print record.letter_annotations["phred_quality"]
225 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
226
227 Just to keep things tidy, if you are following this example yourself, you can
228 delete this temporary file now:
229
230 >>> import os
231 >>> os.remove("Quality/temp.qual")
232
233 Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL
234 files. Because the Bio.SeqIO system is designed for reading single files, you
235 would have to read the two in separately and then combine the data. However,
236 since this is such a common thing to want to do, there is a helper iterator
237 defined in this module that does this for you - PairedFastaQualIterator.
238
239 Alternatively, if you have enough RAM to hold all the records in memory at once,
240 then a simple dictionary approach would work:
241
242 >>> from Bio import SeqIO
243 >>> reads = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fasta"), "fasta"))
244 >>> for rec in SeqIO.parse(open("Quality/example.qual"), "qual"):
245 ... reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"]
246
247 You can then access any record by its key, and get both the sequence and the
248 quality scores.
249
250 >>> print reads["EAS54_6_R1_2_1_540_792"].format("fastq")
251 @EAS54_6_R1_2_1_540_792
252 TTGGCAGGCCAAGGCCGATGGATCA
253 +
254 ;;;;;;;;;;;7;;;;;-;;;3;83
255 <BLANKLINE>
256
257 It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are
258 using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values,
259 "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina"
260 for the more recent variant), as this cannot be detected reliably
261 automatically.
262
263 To illustrate this problem, let's consider an artifical example:
264
265 >>> from Bio.Seq import Seq
266 >>> from Bio.Alphabet import generic_dna
267 >>> from Bio.SeqRecord import SeqRecord
268 >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test",
269 ... description="Made up!")
270 >>> print test.format("fasta")
271 >Test Made up!
272 NACGTACGTA
273 <BLANKLINE>
274 >>> print test.format("fastq")
275 Traceback (most recent call last):
276 ...
277 ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test).
278
279 We created a sample SeqRecord, and can show it in FASTA format - but for QUAL
280 or FASTQ format we need to provide some quality scores. These are held as a
281 list of integers (one for each base) in the letter_annotations dictionary:
282
283 >>> test.letter_annotations["phred_quality"] = [0,1,2,3,4,5,10,20,30,40]
284 >>> print test.format("qual")
285 >Test Made up!
286 0 1 2 3 4 5 10 20 30 40
287 <BLANKLINE>
288 >>> print test.format("fastq")
289 @Test Made up!
290 NACGTACGTA
291 +
292 !"#$%&+5?I
293 <BLANKLINE>
294
295 We can check this FASTQ encoding - the first PHRED quality was zero, and this
296 mapped to a exclamation mark, while the final score was 40 and this mapped to
297 the letter "I":
298
299 >>> ord('!') - 33
300 0
301 >>> ord('I') - 33
302 40
303 >>> [ord(letter)-33 for letter in '!"#$%&+5?I']
304 [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
305
306 Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED
307 scores with an offset of 64:
308
309 >>> print test.format("fastq-illumina")
310 @Test Made up!
311 NACGTACGTA
312 +
313 @ABCDEJT^h
314 <BLANKLINE>
315
316 And we can check this too - the first PHRED score was zero, and this mapped to
317 "@", while the final score was 40 and this mapped to "h":
318
319 >>> ord("@") - 64
320 0
321 >>> ord("h") - 64
322 40
323 >>> [ord(letter)-64 for letter in "@ABCDEJT^h"]
324 [0, 1, 2, 3, 4, 5, 10, 20, 30, 40]
325
326 Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style
327 FASTQ files look for the same data! Then we have the older Solexa/Illumina
328 format to consider which encodes Solexa scores instead of PHRED scores.
329
330 First let's see what Biopython says if we convert the PHRED scores into Solexa
331 scores (rounding to one decimal place):
332
333 >>> for q in [0,1,2,3,4,5,10,20,30,40]:
334 ... print "PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q))
335 PHRED 0 maps to Solexa -5.0
336 PHRED 1 maps to Solexa -5.0
337 PHRED 2 maps to Solexa -2.3
338 PHRED 3 maps to Solexa -0.0
339 PHRED 4 maps to Solexa 1.8
340 PHRED 5 maps to Solexa 3.3
341 PHRED 10 maps to Solexa 9.5
342 PHRED 20 maps to Solexa 20.0
343 PHRED 30 maps to Solexa 30.0
344 PHRED 40 maps to Solexa 40.0
345
346 Now here is the record using the old Solexa style FASTQ file:
347
348 >>> print test.format("fastq-solexa")
349 @Test Made up!
350 NACGTACGTA
351 +
352 ;;>@BCJT^h
353 <BLANKLINE>
354
355 Again, this is using an ASCII offset of 64, so we can check the Solexa scores:
356
357 >>> [ord(letter)-64 for letter in ";;>@BCJT^h"]
358 [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40]
359
360 This explains why the last few letters of this FASTQ output matched that using
361 the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores
362 are approximately equal.
363
364 """
365 __docformat__ = "epytext en"
366
367 from Bio.Alphabet import single_letter_alphabet
368 from Bio.Seq import Seq, UnknownSeq
369 from Bio.SeqRecord import SeqRecord
370 from Bio.SeqIO.Interfaces import SequentialSequenceWriter
371 from math import log
372 import warnings
373 from Bio import BiopythonWarning
374
375
376
377
378 SANGER_SCORE_OFFSET = 33
379 SOLEXA_SCORE_OFFSET = 64
380
382 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality.
383
384 PHRED and Solexa quality scores are both log transformations of a
385 probality of error (high score = low probability of error). This function
386 takes a PHRED score, transforms it back to a probability of error, and
387 then re-expresses it as a Solexa score. This assumes the error estimates
388 are equivalent.
389
390 How does this work exactly? Well the PHRED quality is minus ten times the
391 base ten logarithm of the probability of error::
392
393 phred_quality = -10*log(error,10)
394
395 Therefore, turning this round::
396
397 error = 10 ** (- phred_quality / 10)
398
399 Now, Solexa qualities use a different log transformation::
400
401 solexa_quality = -10*log(error/(1-error),10)
402
403 After substitution and a little manipulation we get::
404
405 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10)
406
407 However, real Solexa files use a minimum quality of -5. This does have a
408 good reason - a random a random base call would be correct 25% of the time,
409 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED
410 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random
411 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5.
412
413 Taken literally, this logarithic formula would map a PHRED quality of zero
414 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED
415 score of zero means a probability of error of one (i.e. the base call is
416 definitely wrong), which is worse than random! In practice, a PHRED quality
417 of zero usually means a default value, or perhaps random - and therefore
418 mapping it to the minimum Solexa score of -5 is reasonable.
419
420 In conclusion, we follow EMBOSS, and take this logarithmic formula but also
421 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED
422 quality of zero to -5.0 as well.
423
424 Note this function will return a floating point number, it is up to you to
425 round this to the nearest integer if appropriate. e.g.
426
427 >>> print "%0.2f" % round(solexa_quality_from_phred(80),2)
428 80.00
429 >>> print "%0.2f" % round(solexa_quality_from_phred(50),2)
430 50.00
431 >>> print "%0.2f" % round(solexa_quality_from_phred(20),2)
432 19.96
433 >>> print "%0.2f" % round(solexa_quality_from_phred(10),2)
434 9.54
435 >>> print "%0.2f" % round(solexa_quality_from_phred(5),2)
436 3.35
437 >>> print "%0.2f" % round(solexa_quality_from_phred(4),2)
438 1.80
439 >>> print "%0.2f" % round(solexa_quality_from_phred(3),2)
440 -0.02
441 >>> print "%0.2f" % round(solexa_quality_from_phred(2),2)
442 -2.33
443 >>> print "%0.2f" % round(solexa_quality_from_phred(1),2)
444 -5.00
445 >>> print "%0.2f" % round(solexa_quality_from_phred(0),2)
446 -5.00
447
448 Notice that for high quality reads PHRED and Solexa scores are numerically
449 equal. The differences are important for poor quality reads, where PHRED
450 has a minimum of zero but Solexa scores can be negative.
451
452 Finally, as a special case where None is used for a "missing value", None
453 is returned:
454
455 >>> print solexa_quality_from_phred(None)
456 None
457 """
458 if phred_quality is None:
459
460
461 return None
462 elif phred_quality > 0:
463
464
465 return max(-5.0, 10*log(10**(phred_quality/10.0) - 1, 10))
466 elif phred_quality == 0:
467
468 return -5.0
469 else:
470 raise ValueError("PHRED qualities must be positive (or zero), not %s" \
471 % repr(phred_quality))
472
474 """Convert a Solexa quality (which can be negative) to a PHRED quality.
475
476 PHRED and Solexa quality scores are both log transformations of a
477 probality of error (high score = low probability of error). This function
478 takes a Solexa score, transforms it back to a probability of error, and
479 then re-expresses it as a PHRED score. This assumes the error estimates
480 are equivalent.
481
482 The underlying formulas are given in the documentation for the sister
483 function solexa_quality_from_phred, in this case the operation is::
484
485 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10)
486
487 This will return a floating point number, it is up to you to round this to
488 the nearest integer if appropriate. e.g.
489
490 >>> print "%0.2f" % round(phred_quality_from_solexa(80),2)
491 80.00
492 >>> print "%0.2f" % round(phred_quality_from_solexa(20),2)
493 20.04
494 >>> print "%0.2f" % round(phred_quality_from_solexa(10),2)
495 10.41
496 >>> print "%0.2f" % round(phred_quality_from_solexa(0),2)
497 3.01
498 >>> print "%0.2f" % round(phred_quality_from_solexa(-5),2)
499 1.19
500
501 Note that a solexa_quality less then -5 is not expected, will trigger a
502 warning, but will still be converted as per the logarithmic mapping
503 (giving a number between 0 and 1.19 back).
504
505 As a special case where None is used for a "missing value", None is
506 returned:
507
508 >>> print phred_quality_from_solexa(None)
509 None
510 """
511 if solexa_quality is None:
512
513 return None
514 if solexa_quality < -5:
515 warnings.warn("Solexa quality less than -5 passed, %s" \
516 % repr(solexa_quality), BiopythonWarning)
517 return 10*log(10**(solexa_quality/10.0) + 1, 10)
518
520 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE).
521
522 If there are no PHRED qualities, but there are Solexa qualities, those are
523 used instead after conversion.
524 """
525 try:
526 return record.letter_annotations["phred_quality"]
527 except KeyError:
528 pass
529 try:
530 return [phred_quality_from_solexa(q) for \
531 q in record.letter_annotations["solexa_quality"]]
532 except KeyError:
533 raise ValueError("No suitable quality scores found in "
534 "letter_annotations of SeqRecord (id=%s)." \
535 % record.id)
536
537
538 _phred_to_sanger_quality_str = dict((qp, chr(min(126, qp+SANGER_SCORE_OFFSET))) \
539 for qp in range(0, 93+1))
540
541 _solexa_to_sanger_quality_str = dict( \
542 (qs, chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET))) \
543 for qs in range(-5, 93+1))
545 """Returns a Sanger FASTQ encoded quality string (PRIVATE).
546
547 >>> from Bio.Seq import Seq
548 >>> from Bio.SeqRecord import SeqRecord
549 >>> r = SeqRecord(Seq("ACGTAN"), id="Test",
550 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0]})
551 >>> _get_sanger_quality_str(r)
552 'SI?5+!'
553
554 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO),
555 the PHRED qualities are integers, this function is able to use a very fast
556 pre-cached mapping. However, if they are floats which differ slightly, then
557 it has to do the appropriate rounding - which is slower:
558
559 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2",
560 ... letter_annotations = {"phred_quality":[50.0,40.05,29.99,20,9.55,0.01]})
561 >>> _get_sanger_quality_str(r2)
562 'SI?5+!'
563
564 If your scores include a None value, this raises an exception:
565
566 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3",
567 ... letter_annotations = {"phred_quality":[50,40,30,20,10,None]})
568 >>> _get_sanger_quality_str(r3)
569 Traceback (most recent call last):
570 ...
571 TypeError: A quality value of None was found
572
573 If (strangely) your record has both PHRED and Solexa scores, then the PHRED
574 scores are used in preference:
575
576 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4",
577 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0],
578 ... "solexa_quality":[-5,-4,0,None,0,40]})
579 >>> _get_sanger_quality_str(r4)
580 'SI?5+!'
581
582 If there are no PHRED scores, but there are Solexa scores, these are used
583 instead (after the approriate conversion):
584
585 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5",
586 ... letter_annotations = {"solexa_quality":[40,30,20,10,0,-5]})
587 >>> _get_sanger_quality_str(r5)
588 'I?5+$"'
589
590 Again, integer Solexa scores can be looked up in a pre-cached mapping making
591 this very fast. You can still use approximate floating point scores:
592
593 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6",
594 ... letter_annotations = {"solexa_quality":[40.1,29.7,20.01,10,0.0,-4.9]})
595 >>> _get_sanger_quality_str(r6)
596 'I?5+$"'
597
598 Notice that due to the limited range of printable ASCII characters, a
599 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ
600 file (using ASCII 126, the tilde). This function will issue a warning
601 in this situation.
602 """
603
604
605
606 try:
607
608 qualities = record.letter_annotations["phred_quality"]
609 except KeyError:
610
611 pass
612 else:
613
614 try:
615 return "".join([_phred_to_sanger_quality_str[qp] \
616 for qp in qualities])
617 except KeyError:
618
619 pass
620 if None in qualities:
621 raise TypeError("A quality value of None was found")
622 if max(qualities) >= 93.5:
623 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ",
624 BiopythonWarning)
625
626 return "".join([chr(min(126, int(round(qp))+SANGER_SCORE_OFFSET)) \
627 for qp in qualities])
628
629 try:
630 qualities = record.letter_annotations["solexa_quality"]
631 except KeyError:
632 raise ValueError("No suitable quality scores found in "
633 "letter_annotations of SeqRecord (id=%s)." \
634 % record.id)
635
636 try:
637 return "".join([_solexa_to_sanger_quality_str[qs] \
638 for qs in qualities])
639 except KeyError:
640
641 pass
642 if None in qualities:
643 raise TypeError("A quality value of None was found")
644
645
646 if max(qualities) >= 93.5:
647 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ",
648 BiopythonWarning)
649
650 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET)) \
651 for qs in qualities])
652
653
654 assert 62+SOLEXA_SCORE_OFFSET == 126
655 _phred_to_illumina_quality_str = dict((qp, chr(qp+SOLEXA_SCORE_OFFSET)) \
656 for qp in range(0, 62+1))
657
658 _solexa_to_illumina_quality_str = dict( \
659 (qs, chr(int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \
660 for qs in range(-5, 62+1))
662 """Returns an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE).
663
664 Notice that due to the limited range of printable ASCII characters, a
665 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ
666 file (using ASCII 126, the tilde). This function will issue a warning
667 in this situation.
668 """
669
670
671
672 try:
673
674 qualities = record.letter_annotations["phred_quality"]
675 except KeyError:
676
677 pass
678 else:
679
680 try:
681 return "".join([_phred_to_illumina_quality_str[qp] \
682 for qp in qualities])
683 except KeyError:
684
685 pass
686 if None in qualities:
687 raise TypeError("A quality value of None was found")
688 if max(qualities) >= 62.5:
689 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ",
690 BiopythonWarning)
691
692 return "".join([chr(min(126, int(round(qp))+SOLEXA_SCORE_OFFSET)) \
693 for qp in qualities])
694
695 try:
696 qualities = record.letter_annotations["solexa_quality"]
697 except KeyError:
698 raise ValueError("No suitable quality scores found in "
699 "letter_annotations of SeqRecord (id=%s)." \
700 % record.id)
701
702 try:
703 return "".join([_solexa_to_illumina_quality_str[qs] \
704 for qs in qualities])
705 except KeyError:
706
707 pass
708 if None in qualities:
709 raise TypeError("A quality value of None was found")
710
711
712 if max(qualities) >= 62.5:
713 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ",
714 BiopythonWarning)
715
716 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \
717 for qs in qualities])
718
719
720 assert 62+SOLEXA_SCORE_OFFSET == 126
721 _solexa_to_solexa_quality_str = dict((qs, chr(min(126, qs+SOLEXA_SCORE_OFFSET))) \
722 for qs in range(-5, 62+1))
723
724 _phred_to_solexa_quality_str = dict(\
725 (qp, chr(min(126, int(round(solexa_quality_from_phred(qp)))+SOLEXA_SCORE_OFFSET))) \
726 for qp in range(0, 62+1))
728 """Returns a Solexa FASTQ encoded quality string (PRIVATE).
729
730 Notice that due to the limited range of printable ASCII characters, a
731 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ
732 file (using ASCII 126, the tilde). This function will issue a warning
733 in this situation.
734 """
735
736
737
738 try:
739
740 qualities = record.letter_annotations["solexa_quality"]
741 except KeyError:
742
743 pass
744 else:
745
746 try:
747 return "".join([_solexa_to_solexa_quality_str[qs] \
748 for qs in qualities])
749 except KeyError:
750
751 pass
752 if None in qualities:
753 raise TypeError("A quality value of None was found")
754 if max(qualities) >= 62.5:
755 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ",
756 BiopythonWarning)
757
758 return "".join([chr(min(126, int(round(qs))+SOLEXA_SCORE_OFFSET)) \
759 for qs in qualities])
760
761 try:
762 qualities = record.letter_annotations["phred_quality"]
763 except KeyError:
764 raise ValueError("No suitable quality scores found in "
765 "letter_annotations of SeqRecord (id=%s)." \
766 % record.id)
767
768 try:
769 return "".join([_phred_to_solexa_quality_str[qp] \
770 for qp in qualities])
771 except KeyError:
772
773
774 pass
775 if None in qualities:
776 raise TypeError("A quality value of None was found")
777
778
779 if max(qualities) >= 62.5:
780 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ",
781 BiopythonWarning)
782 return "".join([chr(min(126,
783 int(round(solexa_quality_from_phred(qp))) + \
784 SOLEXA_SCORE_OFFSET)) \
785 for qp in qualities])
786
787
789 """Iterate over Fastq records as string tuples (not as SeqRecord objects).
790
791 This code does not try to interpret the quality string numerically. It
792 just returns tuples of the title, sequence and quality as strings. For
793 the sequence and quality, any whitespace (such as new lines) is removed.
794
795 Our SeqRecord based FASTQ iterators call this function internally, and then
796 turn the strings into a SeqRecord objects, mapping the quality string into
797 a list of numerical scores. If you want to do a custom quality mapping,
798 then you might consider calling this function directly.
799
800 For parsing FASTQ files, the title string from the "@" line at the start
801 of each record can optionally be omitted on the "+" lines. If it is
802 repeated, it must be identical.
803
804 The sequence string and the quality string can optionally be split over
805 multiple lines, although several sources discourage this. In comparison,
806 for the FASTA file format line breaks between 60 and 80 characters are
807 the norm.
808
809 WARNING - Because the "@" character can appear in the quality string,
810 this can cause problems as this is also the marker for the start of
811 a new sequence. In fact, the "+" sign can also appear as well. Some
812 sources recommended having no line breaks in the quality to avoid this,
813 but even that is not enough, consider this example::
814
815 @071113_EAS56_0053:1:1:998:236
816 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA
817 +071113_EAS56_0053:1:1:998:236
818 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
819 @071113_EAS56_0053:1:1:182:712
820 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG
821 +
822 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
823 @071113_EAS56_0053:1:1:153:10
824 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT
825 +
826 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
827 @071113_EAS56_0053:1:3:990:501
828 TGGGAGGTTTTATGTGGA
829 AAGCAGCAATGTACAAGA
830 +
831 IIIIIII.IIIIII1@44
832 @-7.%<&+/$/%4(++(%
833
834 This is four PHRED encoded FASTQ entries originally from an NCBI source
835 (given the read length of 36, these are probably Solexa Illumna reads where
836 the quality has been mapped onto the PHRED values).
837
838 This example has been edited to illustrate some of the nasty things allowed
839 in the FASTQ format. Firstly, on the "+" lines most but not all of the
840 (redundant) identifiers are ommited. In real files it is likely that all or
841 none of these extra identifiers will be present.
842
843 Secondly, while the first three sequences have been shown without line
844 breaks, the last has been split over multiple lines. In real files any line
845 breaks are likely to be consistent.
846
847 Thirdly, some of the quality string lines start with an "@" character. For
848 the second record this is unavoidable. However for the fourth sequence this
849 only happens because its quality string is split over two lines. A naive
850 parser could wrongly treat any line starting with an "@" as the beginning of
851 a new sequence! This code copes with this possible ambiguity by keeping
852 track of the length of the sequence which gives the expected length of the
853 quality string.
854
855 Using this tricky example file as input, this short bit of code demonstrates
856 what this parsing function would return:
857
858 >>> handle = open("Quality/tricky.fastq", "rU")
859 >>> for (title, sequence, quality) in FastqGeneralIterator(handle):
860 ... print title
861 ... print sequence, quality
862 071113_EAS56_0053:1:1:998:236
863 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III
864 071113_EAS56_0053:1:1:182:712
865 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+
866 071113_EAS56_0053:1:1:153:10
867 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6
868 071113_EAS56_0053:1:3:990:501
869 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(%
870 >>> handle.close()
871
872 Finally we note that some sources state that the quality string should
873 start with "!" (which using the PHRED mapping means the first letter always
874 has a quality score of zero). This rather restrictive rule is not widely
875 observed, so is therefore ignored here. One plus point about this "!" rule
876 is that (provided there are no line breaks in the quality sequence) it
877 would prevent the above problem with the "@" character.
878 """
879
880
881 handle_readline = handle.readline
882
883
884 while True:
885 line = handle_readline()
886 if line == "" : return
887 if line[0] == "@":
888 break
889
890 while True:
891 if line[0] != "@":
892 raise ValueError("Records in Fastq files should start with '@' character")
893 title_line = line[1:].rstrip()
894
895
896
897 seq_string = handle_readline().rstrip()
898
899 while True:
900 line = handle_readline()
901 if not line:
902 raise ValueError("End of file without quality information.")
903 if line[0] == "+":
904
905 second_title = line[1:].rstrip()
906 if second_title and second_title != title_line:
907 raise ValueError("Sequence and quality captions differ.")
908 break
909 seq_string += line.rstrip()
910
911
912 if " " in seq_string or "\t" in seq_string:
913 raise ValueError("Whitespace is not allowed in the sequence.")
914 seq_len = len(seq_string)
915
916
917 quality_string = handle_readline().rstrip()
918
919 while True:
920 line = handle_readline()
921 if not line : break
922 if line[0] == "@":
923
924
925
926
927 if len(quality_string) >= seq_len:
928
929
930 break
931
932 quality_string += line.rstrip()
933
934 if seq_len != len(quality_string):
935 raise ValueError("Lengths of sequence and quality values differs "
936 " for %s (%i and %i)." \
937 % (title_line, seq_len, len(quality_string)))
938
939
940 yield (title_line, seq_string, quality_string)
941 if not line : return
942 assert False, "Should not reach this line"
943
944
946 """Generator function to iterate over FASTQ records (as SeqRecord objects).
947
948 - handle - input file
949 - alphabet - optional alphabet
950 - title2ids - A function that, when given the title line from the FASTQ
951 file (without the beginning >), will return the id, name and
952 description (in that order) for the record as a tuple of
953 strings. If this is not given, then the entire title line
954 will be used as the description, and the first word as the
955 id and name.
956
957 Note that use of title2ids matches that of Bio.SeqIO.FastaIO.
958
959 For each sequence in a (Sanger style) FASTQ file there is a matching string
960 encoding the PHRED qualities (integers between 0 and about 90) using ASCII
961 values with an offset of 33.
962
963 For example, consider a file containing three short reads::
964
965 @EAS54_6_R1_2_1_413_324
966 CCCTTCTTGTCTTCAGCGTTTCTCC
967 +
968 ;;3;;;;;;;;;;;;7;;;;;;;88
969 @EAS54_6_R1_2_1_540_792
970 TTGGCAGGCCAAGGCCGATGGATCA
971 +
972 ;;;;;;;;;;;7;;;;;-;;;3;83
973 @EAS54_6_R1_2_1_443_348
974 GTTGCTTCTGGCGTGGGTGGGGGGG
975 +
976 ;;;;;;;;;;;9;7;;.7;393333
977
978 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching
979 string encoding the PHRED qualities using a ASCI values with an offset of
980 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88").
981
982 Using this module directly you might run:
983
984 >>> handle = open("Quality/example.fastq", "rU")
985 >>> for record in FastqPhredIterator(handle):
986 ... print record.id, record.seq
987 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
988 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
989 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
990 >>> handle.close()
991
992 Typically however, you would call this via Bio.SeqIO instead with "fastq"
993 (or "fastq-sanger") as the format:
994
995 >>> from Bio import SeqIO
996 >>> handle = open("Quality/example.fastq", "rU")
997 >>> for record in SeqIO.parse(handle, "fastq"):
998 ... print record.id, record.seq
999 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
1000 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
1001 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
1002 >>> handle.close()
1003
1004 If you want to look at the qualities, they are record in each record's
1005 per-letter-annotation dictionary as a simple list of integers:
1006
1007 >>> print record.letter_annotations["phred_quality"]
1008 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1009 """
1010 assert SANGER_SCORE_OFFSET == ord("!")
1011
1012
1013
1014
1015
1016 q_mapping = dict()
1017 for letter in range(0, 255):
1018 q_mapping[chr(letter)] = letter-SANGER_SCORE_OFFSET
1019 for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
1020 if title2ids:
1021 id, name, descr = title2ids(title_line)
1022 else:
1023 descr = title_line
1024 id = descr.split()[0]
1025 name = id
1026 record = SeqRecord(Seq(seq_string, alphabet),
1027 id=id, name=name, description=descr)
1028 qualities = [q_mapping[letter] for letter in quality_string]
1029 if qualities and (min(qualities) < 0 or max(qualities) > 93):
1030 raise ValueError("Invalid character in quality string")
1031
1032
1033
1034
1035
1036 dict.__setitem__(record._per_letter_annotations,
1037 "phred_quality", qualities)
1038 yield record
1039
1040
1042 r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping).
1043
1044 The optional arguments are the same as those for the FastqPhredIterator.
1045
1046 For each sequence in Solexa/Illumina FASTQ files there is a matching string
1047 encoding the Solexa integer qualities using ASCII values with an offset
1048 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython
1049 will NOT perform any automatic conversion when loading.
1050
1051 NOTE - This file format is used by the OLD versions of the Solexa/Illumina
1052 pipeline. See also the FastqIlluminaIterator function for the NEW version.
1053
1054 For example, consider a file containing these five records::
1055
1056 @SLXA-B3_649_FC8437_R1_1_1_610_79
1057 GATGTGCAATACCTTTGTAGAGGAA
1058 +SLXA-B3_649_FC8437_R1_1_1_610_79
1059 YYYYYYYYYYYYYYYYYYWYWYYSU
1060 @SLXA-B3_649_FC8437_R1_1_1_397_389
1061 GGTTTGAGAAAGAGAAATGAGATAA
1062 +SLXA-B3_649_FC8437_R1_1_1_397_389
1063 YYYYYYYYYWYYYYWWYYYWYWYWW
1064 @SLXA-B3_649_FC8437_R1_1_1_850_123
1065 GAGGGTGTTGATCATGATGATGGCG
1066 +SLXA-B3_649_FC8437_R1_1_1_850_123
1067 YYYYYYYYYYYYYWYYWYYSYYYSY
1068 @SLXA-B3_649_FC8437_R1_1_1_362_549
1069 GGAAACAAAGTTTTTCTCAACATAG
1070 +SLXA-B3_649_FC8437_R1_1_1_362_549
1071 YYYYYYYYYYYYYYYYYYWWWWYWY
1072 @SLXA-B3_649_FC8437_R1_1_1_183_714
1073 GTATTATTTAATGGCATACACTCAA
1074 +SLXA-B3_649_FC8437_R1_1_1_183_714
1075 YYYYYYYYYYWYYYYWYWWUWWWQQ
1076
1077 Using this module directly you might run:
1078
1079 >>> handle = open("Quality/solexa_example.fastq", "rU")
1080 >>> for record in FastqSolexaIterator(handle):
1081 ... print record.id, record.seq
1082 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
1083 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
1084 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
1085 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
1086 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
1087 >>> handle.close()
1088
1089 Typically however, you would call this via Bio.SeqIO instead with
1090 "fastq-solexa" as the format:
1091
1092 >>> from Bio import SeqIO
1093 >>> handle = open("Quality/solexa_example.fastq", "rU")
1094 >>> for record in SeqIO.parse(handle, "fastq-solexa"):
1095 ... print record.id, record.seq
1096 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA
1097 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA
1098 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG
1099 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG
1100 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA
1101 >>> handle.close()
1102
1103 If you want to look at the qualities, they are recorded in each record's
1104 per-letter-annotation dictionary as a simple list of integers:
1105
1106 >>> print record.letter_annotations["solexa_quality"]
1107 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17]
1108
1109 These scores aren't very good, but they are high enough that they map
1110 almost exactly onto PHRED scores:
1111
1112 >>> print "%0.2f" % phred_quality_from_solexa(25)
1113 25.01
1114
1115 Let's look at faked example read which is even worse, where there are
1116 more noticeable differences between the Solexa and PHRED scores::
1117
1118 @slxa_0001_1_0001_01
1119 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1120 +slxa_0001_1_0001_01
1121 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
1122
1123 Again, you would typically use Bio.SeqIO to read this file in (rather than
1124 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will
1125 contain thousands of reads, so you would normally use Bio.SeqIO.parse()
1126 as shown above. This example has only as one entry, so instead we can
1127 use the Bio.SeqIO.read() function:
1128
1129 >>> from Bio import SeqIO
1130 >>> handle = open("Quality/solexa_faked.fastq", "rU")
1131 >>> record = SeqIO.read(handle, "fastq-solexa")
1132 >>> handle.close()
1133 >>> print record.id, record.seq
1134 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1135 >>> print record.letter_annotations["solexa_quality"]
1136 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
1137
1138 These quality scores are so low that when converted from the Solexa scheme
1139 into PHRED scores they look quite different:
1140
1141 >>> print "%0.2f" % phred_quality_from_solexa(-1)
1142 2.54
1143 >>> print "%0.2f" % phred_quality_from_solexa(-5)
1144 1.19
1145
1146 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format
1147 method to output the record(s):
1148
1149 >>> print record.format("fastq-solexa")
1150 @slxa_0001_1_0001_01
1151 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1152 +
1153 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<;
1154 <BLANKLINE>
1155
1156 Note this output is slightly different from the input file as Biopython
1157 has left out the optional repetition of the sequence identifier on the "+"
1158 line. If you want the to use PHRED scores, use "fastq" or "qual" as the
1159 output format instead, and Biopython will do the conversion for you:
1160
1161 >>> print record.format("fastq")
1162 @slxa_0001_1_0001_01
1163 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN
1164 +
1165 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##""
1166 <BLANKLINE>
1167
1168 >>> print record.format("qual")
1169 >slxa_0001_1_0001_01
1170 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21
1171 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2
1172 1 1
1173 <BLANKLINE>
1174
1175 As shown above, the poor quality Solexa reads have been mapped to the
1176 equivalent PHRED score (e.g. -5 to 1 as shown earlier).
1177 """
1178 q_mapping = dict()
1179 for letter in range(0, 255):
1180 q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET
1181 for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
1182 if title2ids:
1183 id, name, descr = title_line
1184 else:
1185 descr = title_line
1186 id = descr.split()[0]
1187 name = id
1188 record = SeqRecord(Seq(seq_string, alphabet),
1189 id=id, name=name, description=descr)
1190 qualities = [q_mapping[letter] for letter in quality_string]
1191
1192 if qualities and (min(qualities) < -5 or max(qualities)>62):
1193 raise ValueError("Invalid character in quality string")
1194
1195
1196 dict.__setitem__(record._per_letter_annotations,
1197 "solexa_quality", qualities)
1198 yield record
1199
1200
1202 """Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping).
1203
1204 The optional arguments are the same as those for the FastqPhredIterator.
1205
1206 For each sequence in Illumina 1.3+ FASTQ files there is a matching string
1207 encoding PHRED integer qualities using ASCII values with an offset of 64.
1208
1209 >>> from Bio import SeqIO
1210 >>> record = SeqIO.read(open("Quality/illumina_faked.fastq"), "fastq-illumina")
1211 >>> print record.id, record.seq
1212 Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
1213 >>> max(record.letter_annotations["phred_quality"])
1214 40
1215 >>> min(record.letter_annotations["phred_quality"])
1216 0
1217
1218 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores
1219 with an ASCII offset of 64. They are approximately equal but only for high
1220 quality reads. If you have an old Solexa/Illumina file with negative
1221 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail:
1222
1223 >>> record2 = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-illumina")
1224 Traceback (most recent call last):
1225 ...
1226 ValueError: Invalid character in quality string
1227
1228 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33.
1229 """
1230 q_mapping = dict()
1231 for letter in range(0, 255):
1232 q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET
1233 for title_line, seq_string, quality_string in FastqGeneralIterator(handle):
1234 if title2ids:
1235 id, name, descr = title2ids(title_line)
1236 else:
1237 descr = title_line
1238 id = descr.split()[0]
1239 name = id
1240 record = SeqRecord(Seq(seq_string, alphabet),
1241 id=id, name=name, description=descr)
1242 qualities = [q_mapping[letter] for letter in quality_string]
1243 if qualities and (min(qualities) < 0 or max(qualities) > 62):
1244 raise ValueError("Invalid character in quality string")
1245
1246
1247 dict.__setitem__(record._per_letter_annotations,
1248 "phred_quality", qualities)
1249 yield record
1250
1252 """For QUAL files which include PHRED quality scores, but no sequence.
1253
1254 For example, consider this short QUAL file::
1255
1256 >EAS54_6_R1_2_1_413_324
1257 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
1258 26 26 26 23 23
1259 >EAS54_6_R1_2_1_540_792
1260 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
1261 26 18 26 23 18
1262 >EAS54_6_R1_2_1_443_348
1263 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
1264 24 18 18 18 18
1265
1266 Using this module directly you might run:
1267
1268 >>> handle = open("Quality/example.qual", "rU")
1269 >>> for record in QualPhredIterator(handle):
1270 ... print record.id, record.seq
1271 EAS54_6_R1_2_1_413_324 ?????????????????????????
1272 EAS54_6_R1_2_1_540_792 ?????????????????????????
1273 EAS54_6_R1_2_1_443_348 ?????????????????????????
1274 >>> handle.close()
1275
1276 Typically however, you would call this via Bio.SeqIO instead with "qual"
1277 as the format:
1278
1279 >>> from Bio import SeqIO
1280 >>> handle = open("Quality/example.qual", "rU")
1281 >>> for record in SeqIO.parse(handle, "qual"):
1282 ... print record.id, record.seq
1283 EAS54_6_R1_2_1_413_324 ?????????????????????????
1284 EAS54_6_R1_2_1_540_792 ?????????????????????????
1285 EAS54_6_R1_2_1_443_348 ?????????????????????????
1286 >>> handle.close()
1287
1288 Becase QUAL files don't contain the sequence string itself, the seq
1289 property is set to an UnknownSeq object. As no alphabet was given, this
1290 has defaulted to a generic single letter alphabet and the character "?"
1291 used.
1292
1293 By specifying a nucleotide alphabet, "N" is used instead:
1294
1295 >>> from Bio import SeqIO
1296 >>> from Bio.Alphabet import generic_dna
1297 >>> handle = open("Quality/example.qual", "rU")
1298 >>> for record in SeqIO.parse(handle, "qual", alphabet=generic_dna):
1299 ... print record.id, record.seq
1300 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN
1301 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN
1302 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN
1303 >>> handle.close()
1304
1305 However, the quality scores themselves are available as a list of integers
1306 in each record's per-letter-annotation:
1307
1308 >>> print record.letter_annotations["phred_quality"]
1309 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1310
1311 You can still slice one of these SeqRecord objects with an UnknownSeq:
1312
1313 >>> sub_record = record[5:10]
1314 >>> print sub_record.id, sub_record.letter_annotations["phred_quality"]
1315 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26]
1316 """
1317
1318 while True:
1319 line = handle.readline()
1320 if line == "" : return
1321 if line[0] == ">":
1322 break
1323
1324 while True:
1325 if line[0] != ">":
1326 raise ValueError("Records in Fasta files should start with '>' character")
1327 if title2ids:
1328 id, name, descr = title2ids(line[1:].rstrip())
1329 else:
1330 descr = line[1:].rstrip()
1331 id = descr.split()[0]
1332 name = id
1333
1334 qualities = []
1335 line = handle.readline()
1336 while True:
1337 if not line : break
1338 if line[0] == ">": break
1339 qualities.extend([int(word) for word in line.split()])
1340 line = handle.readline()
1341
1342 if qualities and min(qualities) < 0:
1343 raise ValueError(("Negative quality score %i found in %s. " + \
1344 "Are these Solexa scores, not PHRED scores?") \
1345 % (min(qualities), id))
1346
1347
1348 record = SeqRecord(UnknownSeq(len(qualities), alphabet),
1349 id = id, name = name, description = descr)
1350
1351
1352 dict.__setitem__(record._per_letter_annotations,
1353 "phred_quality", qualities)
1354 yield record
1355
1356 if not line : return
1357 assert False, "Should not reach this line"
1358
1360 """Class to write standard FASTQ format files (using PHRED quality scores).
1361
1362 Although you can use this class directly, you are strongly encouraged
1363 to use the Bio.SeqIO.write() function instead via the format name "fastq"
1364 or the alias "fastq-sanger". For example, this code reads in a standard
1365 Sanger style FASTQ file (using PHRED scores) and re-saves it as another
1366 Sanger style FASTQ file:
1367
1368 >>> from Bio import SeqIO
1369 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
1370 >>> out_handle = open("Quality/temp.fastq", "w")
1371 >>> SeqIO.write(record_iterator, out_handle, "fastq")
1372 3
1373 >>> out_handle.close()
1374
1375 You might want to do this if the original file included extra line breaks,
1376 which while valid may not be supported by all tools. The output file from
1377 Biopython will have each sequence on a single line, and each quality
1378 string on a single line (which is considered desirable for maximum
1379 compatibility).
1380
1381 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa
1382 quality scores) is converted into a standard Sanger style FASTQ file using
1383 PHRED qualities:
1384
1385 >>> from Bio import SeqIO
1386 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa")
1387 >>> out_handle = open("Quality/temp.fastq", "w")
1388 >>> SeqIO.write(record_iterator, out_handle, "fastq")
1389 5
1390 >>> out_handle.close()
1391
1392 This code is also called if you use the .format("fastq") method of a
1393 SeqRecord, or .format("fastq-sanger") if you prefer that alias.
1394
1395 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is
1396 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
1397 warning is issued.
1398
1399 P.S. To avoid cluttering up your working directory, you can delete this
1400 temporary file now:
1401
1402 >>> import os
1403 >>> os.remove("Quality/temp.fastq")
1404 """
1405 assert SANGER_SCORE_OFFSET == ord("!")
1406
1408 """Write a single FASTQ record to the file."""
1409 assert self._header_written
1410 assert not self._footer_written
1411 self._record_written = True
1412
1413 if record.seq is None:
1414 raise ValueError("No sequence for record %s" % record.id)
1415 seq_str = str(record.seq)
1416 qualities_str = _get_sanger_quality_str(record)
1417 if len(qualities_str) != len(seq_str):
1418 raise ValueError("Record %s has sequence length %i but %i quality scores" \
1419 % (record.id, len(seq_str), len(qualities_str)))
1420
1421
1422
1423 id = self.clean(record.id)
1424 description = self.clean(record.description)
1425 if description and description.split(None, 1)[0]==id:
1426
1427 title = description
1428 elif description:
1429 title = "%s %s" % (id, description)
1430 else:
1431 title = id
1432
1433 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1434
1436 """Class to write QUAL format files (using PHRED quality scores).
1437
1438 Although you can use this class directly, you are strongly encouraged
1439 to use the Bio.SeqIO.write() function instead. For example, this code
1440 reads in a FASTQ file and saves the quality scores into a QUAL file:
1441
1442 >>> from Bio import SeqIO
1443 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq")
1444 >>> out_handle = open("Quality/temp.qual", "w")
1445 >>> SeqIO.write(record_iterator, out_handle, "qual")
1446 3
1447 >>> out_handle.close()
1448
1449 This code is also called if you use the .format("qual") method of a
1450 SeqRecord.
1451
1452 P.S. Don't forget to clean up the temp file if you don't need it anymore:
1453
1454 >>> import os
1455 >>> os.remove("Quality/temp.qual")
1456 """
1457 - def __init__(self, handle, wrap=60, record2title=None):
1458 """Create a QUAL writer.
1459
1460 Arguments:
1461 - handle - Handle to an output file, e.g. as returned
1462 by open(filename, "w")
1463 - wrap - Optional line length used to wrap sequence lines.
1464 Defaults to wrapping the sequence at 60 characters
1465 Use zero (or None) for no wrapping, giving a single
1466 long line for the sequence.
1467 - record2title - Optional function to return the text to be
1468 used for the title line of each record. By default
1469 a combination of the record.id and record.description
1470 is used. If the record.description starts with the
1471 record.id, then just the record.description is used.
1472
1473 The record2title argument is present for consistency with the
1474 Bio.SeqIO.FastaIO writer class.
1475 """
1476 SequentialSequenceWriter.__init__(self, handle)
1477
1478 self.wrap = None
1479 if wrap:
1480 if wrap < 1:
1481 raise ValueError
1482 self.wrap = wrap
1483 self.record2title = record2title
1484
1486 """Write a single QUAL record to the file."""
1487 assert self._header_written
1488 assert not self._footer_written
1489 self._record_written = True
1490
1491 handle = self.handle
1492 wrap = self.wrap
1493
1494 if self.record2title:
1495 title = self.clean(self.record2title(record))
1496 else:
1497 id = self.clean(record.id)
1498 description = self.clean(record.description)
1499 if description and description.split(None, 1)[0]==id:
1500
1501 title = description
1502 elif description:
1503 title = "%s %s" % (id, description)
1504 else:
1505 title = id
1506 handle.write(">%s\n" % title)
1507
1508 qualities = _get_phred_quality(record)
1509 try:
1510
1511
1512 qualities_strs = [("%i" % round(q, 0)) for q in qualities]
1513 except TypeError, e:
1514 if None in qualities:
1515 raise TypeError("A quality value of None was found")
1516 else:
1517 raise e
1518
1519 if wrap > 5:
1520
1521 data = " ".join(qualities_strs)
1522 while True:
1523 if len(data) <= wrap:
1524 self.handle.write(data + "\n")
1525 break
1526 else:
1527
1528
1529 i = data.rfind(" ", 0, wrap)
1530 handle.write(data[:i] + "\n")
1531 data = data[i+1:]
1532 elif wrap:
1533
1534 while qualities_strs:
1535 line = qualities_strs.pop(0)
1536 while qualities_strs \
1537 and len(line) + 1 + len(qualities_strs[0]) < wrap:
1538 line += " " + qualities_strs.pop(0)
1539 handle.write(line + "\n")
1540 else:
1541
1542 data = " ".join(qualities_strs)
1543 handle.write(data + "\n")
1544
1546 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities).
1547
1548 This outputs FASTQ files like those from the early Solexa/Illumina
1549 pipeline, using Solexa scores and an ASCII offset of 64. These are
1550 NOT compatible with the standard Sanger style PHRED FASTQ files.
1551
1552 If your records contain a "solexa_quality" entry under letter_annotations,
1553 this is used, otherwise any "phred_quality" entry will be used after
1554 conversion using the solexa_quality_from_phred function. If neither style
1555 of quality scores are present, an exception is raised.
1556
1557 Although you can use this class directly, you are strongly encouraged
1558 to use the Bio.SeqIO.write() function instead. For example, this code
1559 reads in a FASTQ file and re-saves it as another FASTQ file:
1560
1561 >>> from Bio import SeqIO
1562 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa")
1563 >>> out_handle = open("Quality/temp.fastq", "w")
1564 >>> SeqIO.write(record_iterator, out_handle, "fastq-solexa")
1565 5
1566 >>> out_handle.close()
1567
1568 You might want to do this if the original file included extra line breaks,
1569 which (while valid) may not be supported by all tools. The output file
1570 from Biopython will have each sequence on a single line, and each quality
1571 string on a single line (which is considered desirable for maximum
1572 compatibility).
1573
1574 This code is also called if you use the .format("fastq-solexa") method of
1575 a SeqRecord. For example,
1576
1577 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger")
1578 >>> print record.format("fastq-solexa")
1579 @Test PHRED qualities from 40 to 0 inclusive
1580 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
1581 +
1582 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;;
1583 <BLANKLINE>
1584
1585 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is
1586 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit,
1587 a warning is issued.
1588
1589 P.S. Don't forget to delete the temp file if you don't need it anymore:
1590
1591 >>> import os
1592 >>> os.remove("Quality/temp.fastq")
1593 """
1595 """Write a single FASTQ record to the file."""
1596 assert self._header_written
1597 assert not self._footer_written
1598 self._record_written = True
1599
1600
1601 if record.seq is None:
1602 raise ValueError("No sequence for record %s" % record.id)
1603 seq_str = str(record.seq)
1604 qualities_str = _get_solexa_quality_str(record)
1605 if len(qualities_str) != len(seq_str):
1606 raise ValueError("Record %s has sequence length %i but %i quality scores" \
1607 % (record.id, len(seq_str), len(qualities_str)))
1608
1609
1610
1611 id = self.clean(record.id)
1612 description = self.clean(record.description)
1613 if description and description.split(None, 1)[0]==id:
1614
1615 title = description
1616 elif description:
1617 title = "%s %s" % (id, description)
1618 else:
1619 title = id
1620
1621 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1622
1624 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores).
1625
1626 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline,
1627 using PHRED scores and an ASCII offset of 64. Note these files are NOT
1628 compatible with the standard Sanger style PHRED FASTQ files which use an
1629 ASCII offset of 32.
1630
1631 Although you can use this class directly, you are strongly encouraged to
1632 use the Bio.SeqIO.write() function with format name "fastq-illumina"
1633 instead. This code is also called if you use the .format("fastq-illumina")
1634 method of a SeqRecord. For example,
1635
1636 >>> from Bio import SeqIO
1637 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger")
1638 >>> print record.format("fastq-illumina")
1639 @Test PHRED qualities from 40 to 0 inclusive
1640 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN
1641 +
1642 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@
1643 <BLANKLINE>
1644
1645 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is
1646 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a
1647 warning is issued.
1648 """
1650 """Write a single FASTQ record to the file."""
1651 assert self._header_written
1652 assert not self._footer_written
1653 self._record_written = True
1654
1655
1656 if record.seq is None:
1657 raise ValueError("No sequence for record %s" % record.id)
1658 seq_str = str(record.seq)
1659 qualities_str = _get_illumina_quality_str(record)
1660 if len(qualities_str) != len(seq_str):
1661 raise ValueError("Record %s has sequence length %i but %i quality scores" \
1662 % (record.id, len(seq_str), len(qualities_str)))
1663
1664
1665
1666 id = self.clean(record.id)
1667 description = self.clean(record.description)
1668 if description and description.split(None, 1)[0]==id:
1669
1670 title = description
1671 elif description:
1672 title = "%s %s" % (id, description)
1673 else:
1674 title = id
1675
1676 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1677
1679 """Iterate over matched FASTA and QUAL files as SeqRecord objects.
1680
1681 For example, consider this short QUAL file with PHRED quality scores::
1682
1683 >EAS54_6_R1_2_1_413_324
1684 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26
1685 26 26 26 23 23
1686 >EAS54_6_R1_2_1_540_792
1687 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26
1688 26 18 26 23 18
1689 >EAS54_6_R1_2_1_443_348
1690 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18
1691 24 18 18 18 18
1692
1693 And a matching FASTA file::
1694
1695 >EAS54_6_R1_2_1_413_324
1696 CCCTTCTTGTCTTCAGCGTTTCTCC
1697 >EAS54_6_R1_2_1_540_792
1698 TTGGCAGGCCAAGGCCGATGGATCA
1699 >EAS54_6_R1_2_1_443_348
1700 GTTGCTTCTGGCGTGGGTGGGGGGG
1701
1702 You can parse these separately using Bio.SeqIO with the "qual" and
1703 "fasta" formats, but then you'll get a group of SeqRecord objects with
1704 no sequence, and a matching group with the sequence but not the
1705 qualities. Because it only deals with one input file handle, Bio.SeqIO
1706 can't be used to read the two files together - but this function can!
1707 For example,
1708
1709 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
1710 ... open("Quality/example.qual", "rU"))
1711 >>> for record in rec_iter:
1712 ... print record.id, record.seq
1713 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC
1714 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA
1715 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG
1716
1717 As with the FASTQ or QUAL parsers, if you want to look at the qualities,
1718 they are in each record's per-letter-annotation dictionary as a simple
1719 list of integers:
1720
1721 >>> print record.letter_annotations["phred_quality"]
1722 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18]
1723
1724 If you have access to data as a FASTQ format file, using that directly
1725 would be simpler and more straight forward. Note that you can easily use
1726 this function to convert paired FASTA and QUAL files into FASTQ files:
1727
1728 >>> from Bio import SeqIO
1729 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"),
1730 ... open("Quality/example.qual", "rU"))
1731 >>> out_handle = open("Quality/temp.fastq", "w")
1732 >>> SeqIO.write(rec_iter, out_handle, "fastq")
1733 3
1734 >>> out_handle.close()
1735
1736 And don't forget to clean up the temp file if you don't need it anymore:
1737
1738 >>> import os
1739 >>> os.remove("Quality/temp.fastq")
1740 """
1741 from Bio.SeqIO.FastaIO import FastaIterator
1742 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, \
1743 title2ids=title2ids)
1744 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, \
1745 title2ids=title2ids)
1746
1747
1748
1749 while True:
1750 try:
1751 f_rec = fasta_iter.next()
1752 except StopIteration:
1753 f_rec = None
1754 try:
1755 q_rec = qual_iter.next()
1756 except StopIteration:
1757 q_rec = None
1758 if f_rec is None and q_rec is None:
1759
1760 break
1761 if f_rec is None:
1762 raise ValueError("FASTA file has more entries than the QUAL file.")
1763 if q_rec is None:
1764 raise ValueError("QUAL file has more entries than the FASTA file.")
1765 if f_rec.id != q_rec.id:
1766 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." \
1767 % (f_rec.id, q_rec.id))
1768 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]):
1769 raise ValueError("Sequence length and number of quality scores disagree for %s" \
1770 % f_rec.id)
1771
1772 f_rec.letter_annotations["phred_quality"] = q_rec.letter_annotations["phred_quality"]
1773 yield f_rec
1774
1775
1776
1778 """Run the Bio.SeqIO module's doctests.
1779
1780 This will try and locate the unit tests directory, and run the doctests
1781 from there in order that the relative paths used in the examples work.
1782 """
1783 import doctest
1784 import os
1785 if os.path.isdir(os.path.join("..", "..", "Tests")):
1786 print "Runing doctests..."
1787 cur_dir = os.path.abspath(os.curdir)
1788 os.chdir(os.path.join("..", "..", "Tests"))
1789 assert os.path.isfile("Quality/example.fastq")
1790 assert os.path.isfile("Quality/example.fasta")
1791 assert os.path.isfile("Quality/example.qual")
1792 assert os.path.isfile("Quality/tricky.fastq")
1793 assert os.path.isfile("Quality/solexa_faked.fastq")
1794 doctest.testmod(verbose=0)
1795 os.chdir(cur_dir)
1796 del cur_dir
1797 print "Done"
1798
1799 if __name__ == "__main__":
1800 _test()
1801