Package Bio :: Package SeqIO :: Module QualityIO
[hide private]
[frames] | no frames]

Source Code for Module Bio.SeqIO.QualityIO

   1  # Copyright 2009-2010 by Peter Cock.  All rights reserved. 
   2  # This code is part of the Biopython distribution and governed by its 
   3  # license.  Please see the LICENSE file that should have been included 
   4  # as part of this package. 
   5  # 
   6  # This module is for reading and writing FASTQ and QUAL format files as 
   7  # SeqRecord objects, and is expected to be used via the Bio.SeqIO API. 
   8   
   9  """Bio.SeqIO support for the FASTQ and QUAL file formats. 
  10   
  11  Note that you are expected to use this code via the Bio.SeqIO interface, as 
  12  shown below. 
  13   
  14  The FASTQ file format is used frequently at the Wellcome Trust Sanger Institute 
  15  to bundle a FASTA sequence and its PHRED quality data (integers between 0 and 
  16  90).  Rather than using a single FASTQ file, often paired FASTA and QUAL files 
  17  are used containing the sequence and the quality information separately. 
  18   
  19  The PHRED software reads DNA sequencing trace files, calls bases, and 
  20  assigns a non-negative quality value to each called base using a logged 
  21  transformation of the error probability, Q = -10 log10( Pe ), for example:: 
  22   
  23      Pe = 1.0,         Q =  0 
  24      Pe = 0.1,         Q = 10 
  25      Pe = 0.01,        Q = 20 
  26      ... 
  27      Pe = 0.00000001,  Q = 80 
  28      Pe = 0.000000001, Q = 90 
  29   
  30  In typical raw sequence reads, the PHRED quality valuea will be from 0 to 40. 
  31  In the QUAL format these quality values are held as space separated text in 
  32  a FASTA like file format.  In the FASTQ format, each quality values is encoded 
  33  with a single ASCI character using chr(Q+33), meaning zero maps to the 
  34  character "!" and for example 80 maps to "q".  For the Sanger FASTQ standard 
  35  the allowed range of PHRED scores is 0 to 93 inclusive. The sequences and 
  36  quality are then stored in pairs in a FASTA like format. 
  37   
  38  Unfortunately there is no official document describing the FASTQ file format, 
  39  and worse, several related but different variants exist. For more details, 
  40  please read this open access publication:: 
  41   
  42      The Sanger FASTQ file format for sequences with quality scores, and the 
  43      Solexa/Illumina FASTQ variants. 
  44      P.J.A.Cock (Biopython), C.J.Fields (BioPerl), N.Goto (BioRuby), 
  45      M.L.Heuer (BioJava) and P.M. Rice (EMBOSS). 
  46      Nucleic Acids Research 2010 38(6):1767-1771 
  47      http://dx.doi.org/10.1093/nar/gkp1137 
  48   
  49  The good news is that Roche 454 sequencers can output files in the QUAL format, 
  50  and sensibly they use PHREP style scores like Sanger.  Converting a pair of 
  51  FASTA and QUAL files into a Sanger style FASTQ file is easy. To extract QUAL 
  52  files from a Roche 454 SFF binary file, use the Roche off instrument command 
  53  line tool "sffinfo" with the -q or -qual argument.  You can extract a matching 
  54  FASTA file using the -s or -seq argument instead. 
  55   
  56  The bad news is that Solexa/Illumina did things differently - they have their 
  57  own scoring system AND their own incompatible versions of the FASTQ format. 
  58  Solexa/Illumina quality scores use Q = - 10 log10 ( Pe / (1-Pe) ), which can 
  59  be negative.  PHRED scores and Solexa scores are NOT interchangeable (but a 
  60  reasonable mapping can be achieved between them, and they are approximately 
  61  equal for higher quality reads). 
  62   
  63  Confusingly early Solexa pipelines produced a FASTQ like file but using their 
  64  own score mapping and an ASCII offset of 64. To make things worse, for the 
  65  Solexa/Illumina pipeline 1.3 onwards, they introduced a third variant of the 
  66  FASTQ file format, this time using PHRED scores (which is more consistent) but 
  67  with an ASCII offset of 64. 
  68   
  69  i.e. There are at least THREE different and INCOMPATIBLE variants of the FASTQ 
  70  file format: The original Sanger PHRED standard, and two from Solexa/Illumina. 
  71   
  72  The good news is that as of CASAVA version 1.8, Illumina sequencers will 
  73  produce FASTQ files using the standard Sanger encoding. 
  74   
  75  You are expected to use this module via the Bio.SeqIO functions, with the 
  76  following format names: 
  77   
  78   - "qual" means simple quality files using PHRED scores (e.g. from Roche 454) 
  79   - "fastq" means Sanger style FASTQ files using PHRED scores and an ASCII 
  80      offset of 33 (e.g. from the NCBI Short Read Archive and Illumina 1.8+). 
  81      These can potentially hold PHRED scores from 0 to 93. 
  82   - "fastq-sanger" is an alias for "fastq". 
  83   - "fastq-solexa" means old Solexa (and also very early Illumina) style FASTQ 
  84      files, using Solexa scores with an ASCII offset 64. These can hold Solexa 
  85      scores from -5 to 62. 
  86   - "fastq-illumina" means newer Illumina 1.3 to 1.7 style FASTQ files, using 
  87      PHRED scores but with an ASCII offset 64, allowing PHRED scores from 0 
  88      to 62. 
  89   
  90  We could potentially add support for "qual-solexa" meaning QUAL files which 
  91  contain Solexa scores, but thus far there isn't any reason to use such files. 
  92   
  93  For example, consider the following short FASTQ file:: 
  94   
  95      @EAS54_6_R1_2_1_413_324 
  96      CCCTTCTTGTCTTCAGCGTTTCTCC 
  97      + 
  98      ;;3;;;;;;;;;;;;7;;;;;;;88 
  99      @EAS54_6_R1_2_1_540_792 
 100      TTGGCAGGCCAAGGCCGATGGATCA 
 101      + 
 102      ;;;;;;;;;;;7;;;;;-;;;3;83 
 103      @EAS54_6_R1_2_1_443_348 
 104      GTTGCTTCTGGCGTGGGTGGGGGGG 
 105      + 
 106      ;;;;;;;;;;;9;7;;.7;393333 
 107   
 108  This contains three reads of length 25.  From the read length these were 
 109  probably originally from an early Solexa/Illumina sequencer but this file 
 110  follows the Sanger FASTQ convention (PHRED style qualities with an ASCII 
 111  offet of 33).  This means we can parse this file using Bio.SeqIO using 
 112  "fastq" as the format name: 
 113   
 114      >>> from Bio import SeqIO 
 115      >>> for record in SeqIO.parse("Quality/example.fastq", "fastq"): 
 116      ...     print record.id, record.seq 
 117      EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 
 118      EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 
 119      EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 
 120   
 121  The qualities are held as a list of integers in each record's annotation: 
 122   
 123      >>> print record 
 124      ID: EAS54_6_R1_2_1_443_348 
 125      Name: EAS54_6_R1_2_1_443_348 
 126      Description: EAS54_6_R1_2_1_443_348 
 127      Number of features: 0 
 128      Per letter annotation for: phred_quality 
 129      Seq('GTTGCTTCTGGCGTGGGTGGGGGGG', SingleLetterAlphabet()) 
 130      >>> print record.letter_annotations["phred_quality"] 
 131      [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 132   
 133  You can use the SeqRecord format method to show this in the QUAL format: 
 134   
 135      >>> print record.format("qual") 
 136      >EAS54_6_R1_2_1_443_348 
 137      26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 
 138      24 18 18 18 18 
 139      <BLANKLINE> 
 140   
 141  Or go back to the FASTQ format, use "fastq" (or "fastq-sanger"): 
 142   
 143      >>> print record.format("fastq") 
 144      @EAS54_6_R1_2_1_443_348 
 145      GTTGCTTCTGGCGTGGGTGGGGGGG 
 146      + 
 147      ;;;;;;;;;;;9;7;;.7;393333 
 148      <BLANKLINE> 
 149   
 150  Or, using the Illumina 1.3+ FASTQ encoding (PHRED values with an ASCII offset 
 151  of 64): 
 152   
 153      >>> print record.format("fastq-illumina") 
 154      @EAS54_6_R1_2_1_443_348 
 155      GTTGCTTCTGGCGTGGGTGGGGGGG 
 156      + 
 157      ZZZZZZZZZZZXZVZZMVZRXRRRR 
 158      <BLANKLINE> 
 159   
 160  You can also get Biopython to convert the scores and show a Solexa style 
 161  FASTQ file: 
 162   
 163      >>> print record.format("fastq-solexa") 
 164      @EAS54_6_R1_2_1_443_348 
 165      GTTGCTTCTGGCGTGGGTGGGGGGG 
 166      + 
 167      ZZZZZZZZZZZXZVZZMVZRXRRRR 
 168      <BLANKLINE> 
 169   
 170  Notice that this is actually the same output as above using "fastq-illumina" 
 171  as the format! The reason for this is all these scores are high enough that 
 172  the PHRED and Solexa scores are almost equal. The differences become apparent 
 173  for poor quality reads. See the functions solexa_quality_from_phred and 
 174  phred_quality_from_solexa for more details. 
 175   
 176  If you wanted to trim your sequences (perhaps to remove low quality regions, 
 177  or to remove a primer sequence), try slicing the SeqRecord objects.  e.g. 
 178   
 179      >>> sub_rec = record[5:15] 
 180      >>> print sub_rec 
 181      ID: EAS54_6_R1_2_1_443_348 
 182      Name: EAS54_6_R1_2_1_443_348 
 183      Description: EAS54_6_R1_2_1_443_348 
 184      Number of features: 0 
 185      Per letter annotation for: phred_quality 
 186      Seq('TTCTGGCGTG', SingleLetterAlphabet()) 
 187      >>> print sub_rec.letter_annotations["phred_quality"] 
 188      [26, 26, 26, 26, 26, 26, 24, 26, 22, 26] 
 189      >>> print sub_rec.format("fastq") 
 190      @EAS54_6_R1_2_1_443_348 
 191      TTCTGGCGTG 
 192      + 
 193      ;;;;;;9;7; 
 194      <BLANKLINE> 
 195       
 196  If you wanted to, you could read in this FASTQ file, and save it as a QUAL file: 
 197   
 198      >>> from Bio import SeqIO 
 199      >>> record_iterator = SeqIO.parse("Quality/example.fastq", "fastq") 
 200      >>> out_handle = open("Quality/temp.qual", "w") 
 201      >>> SeqIO.write(record_iterator, out_handle, "qual") 
 202      3 
 203      >>> out_handle.close() 
 204   
 205  You can of course read in a QUAL file, such as the one we just created: 
 206   
 207      >>> from Bio import SeqIO 
 208      >>> for record in SeqIO.parse("Quality/temp.qual", "qual"): 
 209      ...     print record.id, record.seq 
 210      EAS54_6_R1_2_1_413_324 ????????????????????????? 
 211      EAS54_6_R1_2_1_540_792 ????????????????????????? 
 212      EAS54_6_R1_2_1_443_348 ????????????????????????? 
 213   
 214  Notice that QUAL files don't have a proper sequence present!  But the quality 
 215  information is there: 
 216   
 217      >>> print record 
 218      ID: EAS54_6_R1_2_1_443_348 
 219      Name: EAS54_6_R1_2_1_443_348 
 220      Description: EAS54_6_R1_2_1_443_348 
 221      Number of features: 0 
 222      Per letter annotation for: phred_quality 
 223      UnknownSeq(25, alphabet = SingleLetterAlphabet(), character = '?') 
 224      >>> print record.letter_annotations["phred_quality"] 
 225      [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 
 226   
 227  Just to keep things tidy, if you are following this example yourself, you can 
 228  delete this temporary file now: 
 229   
 230      >>> import os 
 231      >>> os.remove("Quality/temp.qual") 
 232   
 233  Sometimes you won't have a FASTQ file, but rather just a pair of FASTA and QUAL 
 234  files.  Because the Bio.SeqIO system is designed for reading single files, you 
 235  would have to read the two in separately and then combine the data.  However, 
 236  since this is such a common thing to want to do, there is a helper iterator 
 237  defined in this module that does this for you - PairedFastaQualIterator. 
 238   
 239  Alternatively, if you have enough RAM to hold all the records in memory at once, 
 240  then a simple dictionary approach would work: 
 241   
 242      >>> from Bio import SeqIO 
 243      >>> reads = SeqIO.to_dict(SeqIO.parse(open("Quality/example.fasta"), "fasta")) 
 244      >>> for rec in SeqIO.parse(open("Quality/example.qual"), "qual"): 
 245      ...     reads[rec.id].letter_annotations["phred_quality"]=rec.letter_annotations["phred_quality"] 
 246   
 247  You can then access any record by its key, and get both the sequence and the 
 248  quality scores. 
 249   
 250      >>> print reads["EAS54_6_R1_2_1_540_792"].format("fastq") 
 251      @EAS54_6_R1_2_1_540_792 
 252      TTGGCAGGCCAAGGCCGATGGATCA 
 253      + 
 254      ;;;;;;;;;;;7;;;;;-;;;3;83 
 255      <BLANKLINE> 
 256   
 257  It is important that you explicitly tell Bio.SeqIO which FASTQ variant you are 
 258  using ("fastq" or "fastq-sanger" for the Sanger standard using PHRED values, 
 259  "fastq-solexa" for the original Solexa/Illumina variant, or "fastq-illumina" 
 260  for the more recent variant), as this cannot be detected reliably 
 261  automatically. 
 262   
 263  To illustrate this problem, let's consider an artifical example: 
 264   
 265      >>> from Bio.Seq import Seq 
 266      >>> from Bio.Alphabet import generic_dna 
 267      >>> from Bio.SeqRecord import SeqRecord 
 268      >>> test = SeqRecord(Seq("NACGTACGTA", generic_dna), id="Test", 
 269      ... description="Made up!") 
 270      >>> print test.format("fasta") 
 271      >Test Made up! 
 272      NACGTACGTA 
 273      <BLANKLINE> 
 274      >>> print test.format("fastq") 
 275      Traceback (most recent call last): 
 276       ... 
 277      ValueError: No suitable quality scores found in letter_annotations of SeqRecord (id=Test). 
 278   
 279  We created a sample SeqRecord, and can show it in FASTA format - but for QUAL 
 280  or FASTQ format we need to provide some quality scores. These are held as a 
 281  list of integers (one for each base) in the letter_annotations dictionary: 
 282   
 283      >>> test.letter_annotations["phred_quality"] = [0,1,2,3,4,5,10,20,30,40] 
 284      >>> print test.format("qual") 
 285      >Test Made up! 
 286      0 1 2 3 4 5 10 20 30 40 
 287      <BLANKLINE> 
 288      >>> print test.format("fastq") 
 289      @Test Made up! 
 290      NACGTACGTA 
 291      + 
 292      !"#$%&+5?I 
 293      <BLANKLINE> 
 294   
 295  We can check this FASTQ encoding - the first PHRED quality was zero, and this 
 296  mapped to a exclamation mark, while the final score was 40 and this mapped to 
 297  the letter "I": 
 298   
 299      >>> ord('!') - 33 
 300      0 
 301      >>> ord('I') - 33 
 302      40 
 303      >>> [ord(letter)-33 for letter in '!"#$%&+5?I'] 
 304      [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 305   
 306  Similarly, we could produce an Illumina 1.3 to 1.7 style FASTQ file using PHRED 
 307  scores with an offset of 64: 
 308   
 309      >>> print test.format("fastq-illumina") 
 310      @Test Made up! 
 311      NACGTACGTA 
 312      + 
 313      @ABCDEJT^h 
 314      <BLANKLINE> 
 315   
 316  And we can check this too - the first PHRED score was zero, and this mapped to 
 317  "@", while the final score was 40 and this mapped to "h": 
 318   
 319      >>> ord("@") - 64 
 320      0 
 321      >>> ord("h") - 64 
 322      40 
 323      >>> [ord(letter)-64 for letter in "@ABCDEJT^h"] 
 324      [0, 1, 2, 3, 4, 5, 10, 20, 30, 40] 
 325   
 326  Notice how different the standard Sanger FASTQ and the Illumina 1.3 to 1.7 style 
 327  FASTQ files look for the same data! Then we have the older Solexa/Illumina 
 328  format to consider which encodes Solexa scores instead of PHRED scores. 
 329   
 330  First let's see what Biopython says if we convert the PHRED scores into Solexa 
 331  scores (rounding to one decimal place): 
 332   
 333      >>> for q in [0,1,2,3,4,5,10,20,30,40]: 
 334      ...     print "PHRED %i maps to Solexa %0.1f" % (q, solexa_quality_from_phred(q)) 
 335      PHRED 0 maps to Solexa -5.0 
 336      PHRED 1 maps to Solexa -5.0 
 337      PHRED 2 maps to Solexa -2.3 
 338      PHRED 3 maps to Solexa -0.0 
 339      PHRED 4 maps to Solexa 1.8 
 340      PHRED 5 maps to Solexa 3.3 
 341      PHRED 10 maps to Solexa 9.5 
 342      PHRED 20 maps to Solexa 20.0 
 343      PHRED 30 maps to Solexa 30.0 
 344      PHRED 40 maps to Solexa 40.0 
 345   
 346  Now here is the record using the old Solexa style FASTQ file: 
 347   
 348      >>> print test.format("fastq-solexa") 
 349      @Test Made up! 
 350      NACGTACGTA 
 351      + 
 352      ;;>@BCJT^h 
 353      <BLANKLINE> 
 354   
 355  Again, this is using an ASCII offset of 64, so we can check the Solexa scores: 
 356   
 357      >>> [ord(letter)-64 for letter in ";;>@BCJT^h"] 
 358      [-5, -5, -2, 0, 2, 3, 10, 20, 30, 40] 
 359   
 360  This explains why the last few letters of this FASTQ output matched that using 
 361  the Illumina 1.3 to 1.7 format - high quality PHRED scores and Solexa scores 
 362  are approximately equal. 
 363   
 364  """ 
 365  __docformat__ = "epytext en" #Don't just use plain text in epydoc API pages! 
 366   
 367  from Bio.Alphabet import single_letter_alphabet 
 368  from Bio.Seq import Seq, UnknownSeq 
 369  from Bio.SeqRecord import SeqRecord 
 370  from Bio.SeqIO.Interfaces import SequentialSequenceWriter 
 371  from math import log 
 372  import warnings 
 373  from Bio import BiopythonWarning 
 374   
 375   
 376  # define score offsets. See discussion for differences between Sanger and 
 377  # Solexa offsets. 
 378  SANGER_SCORE_OFFSET = 33 
 379  SOLEXA_SCORE_OFFSET = 64 
 380   
381 -def solexa_quality_from_phred(phred_quality):
382 """Covert a PHRED quality (range 0 to about 90) to a Solexa quality. 383 384 PHRED and Solexa quality scores are both log transformations of a 385 probality of error (high score = low probability of error). This function 386 takes a PHRED score, transforms it back to a probability of error, and 387 then re-expresses it as a Solexa score. This assumes the error estimates 388 are equivalent. 389 390 How does this work exactly? Well the PHRED quality is minus ten times the 391 base ten logarithm of the probability of error:: 392 393 phred_quality = -10*log(error,10) 394 395 Therefore, turning this round:: 396 397 error = 10 ** (- phred_quality / 10) 398 399 Now, Solexa qualities use a different log transformation:: 400 401 solexa_quality = -10*log(error/(1-error),10) 402 403 After substitution and a little manipulation we get:: 404 405 solexa_quality = 10*log(10**(phred_quality/10.0) - 1, 10) 406 407 However, real Solexa files use a minimum quality of -5. This does have a 408 good reason - a random a random base call would be correct 25% of the time, 409 and thus have a probability of error of 0.75, which gives 1.25 as the PHRED 410 quality, or -4.77 as the Solexa quality. Thus (after rounding), a random 411 nucleotide read would have a PHRED quality of 1, or a Solexa quality of -5. 412 413 Taken literally, this logarithic formula would map a PHRED quality of zero 414 to a Solexa quality of minus infinity. Of course, taken literally, a PHRED 415 score of zero means a probability of error of one (i.e. the base call is 416 definitely wrong), which is worse than random! In practice, a PHRED quality 417 of zero usually means a default value, or perhaps random - and therefore 418 mapping it to the minimum Solexa score of -5 is reasonable. 419 420 In conclusion, we follow EMBOSS, and take this logarithmic formula but also 421 apply a minimum value of -5.0 for the Solexa quality, and also map a PHRED 422 quality of zero to -5.0 as well. 423 424 Note this function will return a floating point number, it is up to you to 425 round this to the nearest integer if appropriate. e.g. 426 427 >>> print "%0.2f" % round(solexa_quality_from_phred(80),2) 428 80.00 429 >>> print "%0.2f" % round(solexa_quality_from_phred(50),2) 430 50.00 431 >>> print "%0.2f" % round(solexa_quality_from_phred(20),2) 432 19.96 433 >>> print "%0.2f" % round(solexa_quality_from_phred(10),2) 434 9.54 435 >>> print "%0.2f" % round(solexa_quality_from_phred(5),2) 436 3.35 437 >>> print "%0.2f" % round(solexa_quality_from_phred(4),2) 438 1.80 439 >>> print "%0.2f" % round(solexa_quality_from_phred(3),2) 440 -0.02 441 >>> print "%0.2f" % round(solexa_quality_from_phred(2),2) 442 -2.33 443 >>> print "%0.2f" % round(solexa_quality_from_phred(1),2) 444 -5.00 445 >>> print "%0.2f" % round(solexa_quality_from_phred(0),2) 446 -5.00 447 448 Notice that for high quality reads PHRED and Solexa scores are numerically 449 equal. The differences are important for poor quality reads, where PHRED 450 has a minimum of zero but Solexa scores can be negative. 451 452 Finally, as a special case where None is used for a "missing value", None 453 is returned: 454 455 >>> print solexa_quality_from_phred(None) 456 None 457 """ 458 if phred_quality is None: 459 #Assume None is used as some kind of NULL or NA value; return None 460 #e.g. Bio.SeqIO gives Ace contig gaps a quality of None. 461 return None 462 elif phred_quality > 0: 463 #Solexa uses a minimum value of -5, which after rounding matches a 464 #random nucleotide base call. 465 return max(-5.0, 10*log(10**(phred_quality/10.0) - 1, 10)) 466 elif phred_quality == 0: 467 #Special case, map to -5 as discussed in the docstring 468 return -5.0 469 else: 470 raise ValueError("PHRED qualities must be positive (or zero), not %s" \ 471 % repr(phred_quality))
472
473 -def phred_quality_from_solexa(solexa_quality):
474 """Convert a Solexa quality (which can be negative) to a PHRED quality. 475 476 PHRED and Solexa quality scores are both log transformations of a 477 probality of error (high score = low probability of error). This function 478 takes a Solexa score, transforms it back to a probability of error, and 479 then re-expresses it as a PHRED score. This assumes the error estimates 480 are equivalent. 481 482 The underlying formulas are given in the documentation for the sister 483 function solexa_quality_from_phred, in this case the operation is:: 484 485 phred_quality = 10*log(10**(solexa_quality/10.0) + 1, 10) 486 487 This will return a floating point number, it is up to you to round this to 488 the nearest integer if appropriate. e.g. 489 490 >>> print "%0.2f" % round(phred_quality_from_solexa(80),2) 491 80.00 492 >>> print "%0.2f" % round(phred_quality_from_solexa(20),2) 493 20.04 494 >>> print "%0.2f" % round(phred_quality_from_solexa(10),2) 495 10.41 496 >>> print "%0.2f" % round(phred_quality_from_solexa(0),2) 497 3.01 498 >>> print "%0.2f" % round(phred_quality_from_solexa(-5),2) 499 1.19 500 501 Note that a solexa_quality less then -5 is not expected, will trigger a 502 warning, but will still be converted as per the logarithmic mapping 503 (giving a number between 0 and 1.19 back). 504 505 As a special case where None is used for a "missing value", None is 506 returned: 507 508 >>> print phred_quality_from_solexa(None) 509 None 510 """ 511 if solexa_quality is None: 512 #Assume None is used as some kind of NULL or NA value; return None 513 return None 514 if solexa_quality < -5: 515 warnings.warn("Solexa quality less than -5 passed, %s" \ 516 % repr(solexa_quality), BiopythonWarning) 517 return 10*log(10**(solexa_quality/10.0) + 1, 10)
518
519 -def _get_phred_quality(record):
520 """Extract PHRED qualities from a SeqRecord's letter_annotations (PRIVATE). 521 522 If there are no PHRED qualities, but there are Solexa qualities, those are 523 used instead after conversion. 524 """ 525 try: 526 return record.letter_annotations["phred_quality"] 527 except KeyError: 528 pass 529 try: 530 return [phred_quality_from_solexa(q) for \ 531 q in record.letter_annotations["solexa_quality"]] 532 except KeyError: 533 raise ValueError("No suitable quality scores found in " 534 "letter_annotations of SeqRecord (id=%s)." \ 535 % record.id)
536 537 #Only map 0 to 93, we need to give a warning on truncating at 93 538 _phred_to_sanger_quality_str = dict((qp, chr(min(126, qp+SANGER_SCORE_OFFSET))) \ 539 for qp in range(0, 93+1)) 540 #Only map -5 to 93, we need to give a warning on truncating at 93 541 _solexa_to_sanger_quality_str = dict( \ 542 (qs, chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET))) \ 543 for qs in range(-5, 93+1))
544 -def _get_sanger_quality_str(record):
545 """Returns a Sanger FASTQ encoded quality string (PRIVATE). 546 547 >>> from Bio.Seq import Seq 548 >>> from Bio.SeqRecord import SeqRecord 549 >>> r = SeqRecord(Seq("ACGTAN"), id="Test", 550 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0]}) 551 >>> _get_sanger_quality_str(r) 552 'SI?5+!' 553 554 If as in the above example (or indeed a SeqRecord parser with Bio.SeqIO), 555 the PHRED qualities are integers, this function is able to use a very fast 556 pre-cached mapping. However, if they are floats which differ slightly, then 557 it has to do the appropriate rounding - which is slower: 558 559 >>> r2 = SeqRecord(Seq("ACGTAN"), id="Test2", 560 ... letter_annotations = {"phred_quality":[50.0,40.05,29.99,20,9.55,0.01]}) 561 >>> _get_sanger_quality_str(r2) 562 'SI?5+!' 563 564 If your scores include a None value, this raises an exception: 565 566 >>> r3 = SeqRecord(Seq("ACGTAN"), id="Test3", 567 ... letter_annotations = {"phred_quality":[50,40,30,20,10,None]}) 568 >>> _get_sanger_quality_str(r3) 569 Traceback (most recent call last): 570 ... 571 TypeError: A quality value of None was found 572 573 If (strangely) your record has both PHRED and Solexa scores, then the PHRED 574 scores are used in preference: 575 576 >>> r4 = SeqRecord(Seq("ACGTAN"), id="Test4", 577 ... letter_annotations = {"phred_quality":[50,40,30,20,10,0], 578 ... "solexa_quality":[-5,-4,0,None,0,40]}) 579 >>> _get_sanger_quality_str(r4) 580 'SI?5+!' 581 582 If there are no PHRED scores, but there are Solexa scores, these are used 583 instead (after the approriate conversion): 584 585 >>> r5 = SeqRecord(Seq("ACGTAN"), id="Test5", 586 ... letter_annotations = {"solexa_quality":[40,30,20,10,0,-5]}) 587 >>> _get_sanger_quality_str(r5) 588 'I?5+$"' 589 590 Again, integer Solexa scores can be looked up in a pre-cached mapping making 591 this very fast. You can still use approximate floating point scores: 592 593 >>> r6 = SeqRecord(Seq("ACGTAN"), id="Test6", 594 ... letter_annotations = {"solexa_quality":[40.1,29.7,20.01,10,0.0,-4.9]}) 595 >>> _get_sanger_quality_str(r6) 596 'I?5+$"' 597 598 Notice that due to the limited range of printable ASCII characters, a 599 PHRED quality of 93 is the maximum that can be held in an Illumina FASTQ 600 file (using ASCII 126, the tilde). This function will issue a warning 601 in this situation. 602 """ 603 #TODO - This functions works and is fast, but it is also ugly 604 #and there is considerable repetition of code for the other 605 #two FASTQ variants. 606 try: 607 #These take priority (in case both Solexa and PHRED scores found) 608 qualities = record.letter_annotations["phred_quality"] 609 except KeyError: 610 #Fall back on solexa scores... 611 pass 612 else: 613 #Try and use the precomputed mapping: 614 try: 615 return "".join([_phred_to_sanger_quality_str[qp] \ 616 for qp in qualities]) 617 except KeyError: 618 #Could be a float, or a None in the list, or a high value. 619 pass 620 if None in qualities: 621 raise TypeError("A quality value of None was found") 622 if max(qualities) >= 93.5: 623 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 624 BiopythonWarning) 625 #This will apply the truncation at 93, giving max ASCII 126 626 return "".join([chr(min(126, int(round(qp))+SANGER_SCORE_OFFSET)) \ 627 for qp in qualities]) 628 #Fall back on the Solexa scores... 629 try: 630 qualities = record.letter_annotations["solexa_quality"] 631 except KeyError: 632 raise ValueError("No suitable quality scores found in " 633 "letter_annotations of SeqRecord (id=%s)." \ 634 % record.id) 635 #Try and use the precomputed mapping: 636 try: 637 return "".join([_solexa_to_sanger_quality_str[qs] \ 638 for qs in qualities]) 639 except KeyError: 640 #Either no PHRED scores, or something odd like a float or None 641 pass 642 if None in qualities: 643 raise TypeError("A quality value of None was found") 644 #Must do this the slow way, first converting the PHRED scores into 645 #Solexa scores: 646 if max(qualities) >= 93.5: 647 warnings.warn("Data loss - max PHRED quality 93 in Sanger FASTQ", 648 BiopythonWarning) 649 #This will apply the truncation at 93, giving max ASCII 126 650 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SANGER_SCORE_OFFSET)) \ 651 for qs in qualities])
652 653 #Only map 0 to 62, we need to give a warning on truncating at 62 654 assert 62+SOLEXA_SCORE_OFFSET == 126 655 _phred_to_illumina_quality_str = dict((qp, chr(qp+SOLEXA_SCORE_OFFSET)) \ 656 for qp in range(0, 62+1)) 657 #Only map -5 to 62, we need to give a warning on truncating at 62 658 _solexa_to_illumina_quality_str = dict( \ 659 (qs, chr(int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \ 660 for qs in range(-5, 62+1))
661 -def _get_illumina_quality_str(record):
662 """Returns an Illumina 1.3 to 1.7 FASTQ encoded quality string (PRIVATE). 663 664 Notice that due to the limited range of printable ASCII characters, a 665 PHRED quality of 62 is the maximum that can be held in an Illumina FASTQ 666 file (using ASCII 126, the tilde). This function will issue a warning 667 in this situation. 668 """ 669 #TODO - This functions works and is fast, but it is also ugly 670 #and there is considerable repetition of code for the other 671 #two FASTQ variants. 672 try: 673 #These take priority (in case both Solexa and PHRED scores found) 674 qualities = record.letter_annotations["phred_quality"] 675 except KeyError: 676 #Fall back on solexa scores... 677 pass 678 else: 679 #Try and use the precomputed mapping: 680 try: 681 return "".join([_phred_to_illumina_quality_str[qp] \ 682 for qp in qualities]) 683 except KeyError: 684 #Could be a float, or a None in the list, or a high value. 685 pass 686 if None in qualities: 687 raise TypeError("A quality value of None was found") 688 if max(qualities) >= 62.5: 689 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 690 BiopythonWarning) 691 #This will apply the truncation at 62, giving max ASCII 126 692 return "".join([chr(min(126, int(round(qp))+SOLEXA_SCORE_OFFSET)) \ 693 for qp in qualities]) 694 #Fall back on the Solexa scores... 695 try: 696 qualities = record.letter_annotations["solexa_quality"] 697 except KeyError: 698 raise ValueError("No suitable quality scores found in " 699 "letter_annotations of SeqRecord (id=%s)." \ 700 % record.id) 701 #Try and use the precomputed mapping: 702 try: 703 return "".join([_solexa_to_illumina_quality_str[qs] \ 704 for qs in qualities]) 705 except KeyError: 706 #Either no PHRED scores, or something odd like a float or None 707 pass 708 if None in qualities: 709 raise TypeError("A quality value of None was found") 710 #Must do this the slow way, first converting the PHRED scores into 711 #Solexa scores: 712 if max(qualities) >= 62.5: 713 warnings.warn("Data loss - max PHRED quality 62 in Illumina FASTQ", 714 BiopythonWarning) 715 #This will apply the truncation at 62, giving max ASCII 126 716 return "".join([chr(min(126, int(round(phred_quality_from_solexa(qs)))+SOLEXA_SCORE_OFFSET)) \ 717 for qs in qualities])
718 719 #Only map 0 to 62, we need to give a warning on truncating at 62 720 assert 62+SOLEXA_SCORE_OFFSET == 126 721 _solexa_to_solexa_quality_str = dict((qs, chr(min(126, qs+SOLEXA_SCORE_OFFSET))) \ 722 for qs in range(-5, 62+1)) 723 #Only map -5 to 62, we need to give a warning on truncating at 62 724 _phred_to_solexa_quality_str = dict(\ 725 (qp, chr(min(126, int(round(solexa_quality_from_phred(qp)))+SOLEXA_SCORE_OFFSET))) \ 726 for qp in range(0, 62+1))
727 -def _get_solexa_quality_str(record):
728 """Returns a Solexa FASTQ encoded quality string (PRIVATE). 729 730 Notice that due to the limited range of printable ASCII characters, a 731 Solexa quality of 62 is the maximum that can be held in a Solexa FASTQ 732 file (using ASCII 126, the tilde). This function will issue a warning 733 in this situation. 734 """ 735 #TODO - This functions works and is fast, but it is also ugly 736 #and there is considerable repetition of code for the other 737 #two FASTQ variants. 738 try: 739 #These take priority (in case both Solexa and PHRED scores found) 740 qualities = record.letter_annotations["solexa_quality"] 741 except KeyError: 742 #Fall back on PHRED scores... 743 pass 744 else: 745 #Try and use the precomputed mapping: 746 try: 747 return "".join([_solexa_to_solexa_quality_str[qs] \ 748 for qs in qualities]) 749 except KeyError: 750 #Could be a float, or a None in the list, or a high value. 751 pass 752 if None in qualities: 753 raise TypeError("A quality value of None was found") 754 if max(qualities) >= 62.5: 755 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 756 BiopythonWarning) 757 #This will apply the truncation at 62, giving max ASCII 126 758 return "".join([chr(min(126, int(round(qs))+SOLEXA_SCORE_OFFSET)) \ 759 for qs in qualities]) 760 #Fall back on the PHRED scores... 761 try: 762 qualities = record.letter_annotations["phred_quality"] 763 except KeyError: 764 raise ValueError("No suitable quality scores found in " 765 "letter_annotations of SeqRecord (id=%s)." \ 766 % record.id) 767 #Try and use the precomputed mapping: 768 try: 769 return "".join([_phred_to_solexa_quality_str[qp] \ 770 for qp in qualities]) 771 except KeyError: 772 #Either no PHRED scores, or something odd like a float or None 773 #or too big to be in the cache 774 pass 775 if None in qualities: 776 raise TypeError("A quality value of None was found") 777 #Must do this the slow way, first converting the PHRED scores into 778 #Solexa scores: 779 if max(qualities) >= 62.5: 780 warnings.warn("Data loss - max Solexa quality 62 in Solexa FASTQ", 781 BiopythonWarning) 782 return "".join([chr(min(126, 783 int(round(solexa_quality_from_phred(qp))) + \ 784 SOLEXA_SCORE_OFFSET)) \ 785 for qp in qualities])
786 787 #TODO - Default to nucleotide or even DNA?
788 -def FastqGeneralIterator(handle):
789 """Iterate over Fastq records as string tuples (not as SeqRecord objects). 790 791 This code does not try to interpret the quality string numerically. It 792 just returns tuples of the title, sequence and quality as strings. For 793 the sequence and quality, any whitespace (such as new lines) is removed. 794 795 Our SeqRecord based FASTQ iterators call this function internally, and then 796 turn the strings into a SeqRecord objects, mapping the quality string into 797 a list of numerical scores. If you want to do a custom quality mapping, 798 then you might consider calling this function directly. 799 800 For parsing FASTQ files, the title string from the "@" line at the start 801 of each record can optionally be omitted on the "+" lines. If it is 802 repeated, it must be identical. 803 804 The sequence string and the quality string can optionally be split over 805 multiple lines, although several sources discourage this. In comparison, 806 for the FASTA file format line breaks between 60 and 80 characters are 807 the norm. 808 809 WARNING - Because the "@" character can appear in the quality string, 810 this can cause problems as this is also the marker for the start of 811 a new sequence. In fact, the "+" sign can also appear as well. Some 812 sources recommended having no line breaks in the quality to avoid this, 813 but even that is not enough, consider this example:: 814 815 @071113_EAS56_0053:1:1:998:236 816 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA 817 +071113_EAS56_0053:1:1:998:236 818 IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 819 @071113_EAS56_0053:1:1:182:712 820 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG 821 + 822 @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 823 @071113_EAS56_0053:1:1:153:10 824 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT 825 + 826 IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 827 @071113_EAS56_0053:1:3:990:501 828 TGGGAGGTTTTATGTGGA 829 AAGCAGCAATGTACAAGA 830 + 831 IIIIIII.IIIIII1@44 832 @-7.%<&+/$/%4(++(% 833 834 This is four PHRED encoded FASTQ entries originally from an NCBI source 835 (given the read length of 36, these are probably Solexa Illumna reads where 836 the quality has been mapped onto the PHRED values). 837 838 This example has been edited to illustrate some of the nasty things allowed 839 in the FASTQ format. Firstly, on the "+" lines most but not all of the 840 (redundant) identifiers are ommited. In real files it is likely that all or 841 none of these extra identifiers will be present. 842 843 Secondly, while the first three sequences have been shown without line 844 breaks, the last has been split over multiple lines. In real files any line 845 breaks are likely to be consistent. 846 847 Thirdly, some of the quality string lines start with an "@" character. For 848 the second record this is unavoidable. However for the fourth sequence this 849 only happens because its quality string is split over two lines. A naive 850 parser could wrongly treat any line starting with an "@" as the beginning of 851 a new sequence! This code copes with this possible ambiguity by keeping 852 track of the length of the sequence which gives the expected length of the 853 quality string. 854 855 Using this tricky example file as input, this short bit of code demonstrates 856 what this parsing function would return: 857 858 >>> handle = open("Quality/tricky.fastq", "rU") 859 >>> for (title, sequence, quality) in FastqGeneralIterator(handle): 860 ... print title 861 ... print sequence, quality 862 071113_EAS56_0053:1:1:998:236 863 TTTCTTGCCCCCATAGACTGAGACCTTCCCTAAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIICII+III 864 071113_EAS56_0053:1:1:182:712 865 ACCCAGCTAATTTTTGTATTTTTGTTAGAGACAGTG @IIIIIIIIIIIIIIICDIIIII<%<6&-*).(*%+ 866 071113_EAS56_0053:1:1:153:10 867 TGTTCTGAAGGAAGGTGTGCGTGCGTGTGTGTGTGT IIIIIIIIIIIICIIGIIIII>IAIIIE65I=II:6 868 071113_EAS56_0053:1:3:990:501 869 TGGGAGGTTTTATGTGGAAAGCAGCAATGTACAAGA IIIIIII.IIIIII1@44@-7.%<&+/$/%4(++(% 870 >>> handle.close() 871 872 Finally we note that some sources state that the quality string should 873 start with "!" (which using the PHRED mapping means the first letter always 874 has a quality score of zero). This rather restrictive rule is not widely 875 observed, so is therefore ignored here. One plus point about this "!" rule 876 is that (provided there are no line breaks in the quality sequence) it 877 would prevent the above problem with the "@" character. 878 """ 879 #We need to call handle.readline() at least four times per record, 880 #so we'll save a property look up each time: 881 handle_readline = handle.readline 882 883 #Skip any text before the first record (e.g. blank lines, comments?) 884 while True: 885 line = handle_readline() 886 if line == "" : return #Premature end of file, or just empty? 887 if line[0] == "@": 888 break 889 890 while True: 891 if line[0] != "@": 892 raise ValueError("Records in Fastq files should start with '@' character") 893 title_line = line[1:].rstrip() 894 #Will now be at least one line of quality data - in most FASTQ files 895 #just one line! We therefore use string concatenation (if needed) 896 #rather using than the "".join(...) trick just in case it is multiline: 897 seq_string = handle_readline().rstrip() 898 #There may now be more sequence lines, or the "+" quality marker line: 899 while True: 900 line = handle_readline() 901 if not line: 902 raise ValueError("End of file without quality information.") 903 if line[0] == "+": 904 #The title here is optional, but if present must match! 905 second_title = line[1:].rstrip() 906 if second_title and second_title != title_line: 907 raise ValueError("Sequence and quality captions differ.") 908 break 909 seq_string += line.rstrip() #removes trailing newlines 910 #This is going to slow things down a little, but assuming 911 #this isn't allowed we should try and catch it here: 912 if " " in seq_string or "\t" in seq_string: 913 raise ValueError("Whitespace is not allowed in the sequence.") 914 seq_len = len(seq_string) 915 916 #Will now be at least one line of quality data... 917 quality_string = handle_readline().rstrip() 918 #There may now be more quality data, or another sequence, or EOF 919 while True: 920 line = handle_readline() 921 if not line : break #end of file 922 if line[0] == "@": 923 #This COULD be the start of a new sequence. However, it MAY just 924 #be a line of quality data which starts with a "@" character. We 925 #should be able to check this by looking at the sequence length 926 #and the amount of quality data found so far. 927 if len(quality_string) >= seq_len: 928 #We expect it to be equal if this is the start of a new record. 929 #If the quality data is longer, we'll raise an error below. 930 break 931 #Continue - its just some (more) quality data. 932 quality_string += line.rstrip() 933 934 if seq_len != len(quality_string): 935 raise ValueError("Lengths of sequence and quality values differs " 936 " for %s (%i and %i)." \ 937 % (title_line, seq_len, len(quality_string))) 938 939 #Return the record and then continue... 940 yield (title_line, seq_string, quality_string) 941 if not line : return #StopIteration at end of file 942 assert False, "Should not reach this line"
943 944 #This is a generator function!
945 -def FastqPhredIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
946 """Generator function to iterate over FASTQ records (as SeqRecord objects). 947 948 - handle - input file 949 - alphabet - optional alphabet 950 - title2ids - A function that, when given the title line from the FASTQ 951 file (without the beginning >), will return the id, name and 952 description (in that order) for the record as a tuple of 953 strings. If this is not given, then the entire title line 954 will be used as the description, and the first word as the 955 id and name. 956 957 Note that use of title2ids matches that of Bio.SeqIO.FastaIO. 958 959 For each sequence in a (Sanger style) FASTQ file there is a matching string 960 encoding the PHRED qualities (integers between 0 and about 90) using ASCII 961 values with an offset of 33. 962 963 For example, consider a file containing three short reads:: 964 965 @EAS54_6_R1_2_1_413_324 966 CCCTTCTTGTCTTCAGCGTTTCTCC 967 + 968 ;;3;;;;;;;;;;;;7;;;;;;;88 969 @EAS54_6_R1_2_1_540_792 970 TTGGCAGGCCAAGGCCGATGGATCA 971 + 972 ;;;;;;;;;;;7;;;;;-;;;3;83 973 @EAS54_6_R1_2_1_443_348 974 GTTGCTTCTGGCGTGGGTGGGGGGG 975 + 976 ;;;;;;;;;;;9;7;;.7;393333 977 978 For each sequence (e.g. "CCCTTCTTGTCTTCAGCGTTTCTCC") there is a matching 979 string encoding the PHRED qualities using a ASCI values with an offset of 980 33 (e.g. ";;3;;;;;;;;;;;;7;;;;;;;88"). 981 982 Using this module directly you might run: 983 984 >>> handle = open("Quality/example.fastq", "rU") 985 >>> for record in FastqPhredIterator(handle): 986 ... print record.id, record.seq 987 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 988 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 989 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 990 >>> handle.close() 991 992 Typically however, you would call this via Bio.SeqIO instead with "fastq" 993 (or "fastq-sanger") as the format: 994 995 >>> from Bio import SeqIO 996 >>> handle = open("Quality/example.fastq", "rU") 997 >>> for record in SeqIO.parse(handle, "fastq"): 998 ... print record.id, record.seq 999 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1000 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1001 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1002 >>> handle.close() 1003 1004 If you want to look at the qualities, they are record in each record's 1005 per-letter-annotation dictionary as a simple list of integers: 1006 1007 >>> print record.letter_annotations["phred_quality"] 1008 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1009 """ 1010 assert SANGER_SCORE_OFFSET == ord("!") 1011 #Originally, I used a list expression for each record: 1012 # 1013 # qualities = [ord(letter)-SANGER_SCORE_OFFSET for letter in quality_string] 1014 # 1015 #Precomputing is faster, perhaps partly by avoiding the subtractions. 1016 q_mapping = dict() 1017 for letter in range(0, 255): 1018 q_mapping[chr(letter)] = letter-SANGER_SCORE_OFFSET 1019 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1020 if title2ids: 1021 id, name, descr = title2ids(title_line) 1022 else: 1023 descr = title_line 1024 id = descr.split()[0] 1025 name = id 1026 record = SeqRecord(Seq(seq_string, alphabet), 1027 id=id, name=name, description=descr) 1028 qualities = [q_mapping[letter] for letter in quality_string] 1029 if qualities and (min(qualities) < 0 or max(qualities) > 93): 1030 raise ValueError("Invalid character in quality string") 1031 #For speed, will now use a dirty trick to speed up assigning the 1032 #qualities. We do this to bypass the length check imposed by the 1033 #per-letter-annotations restricted dict (as this has already been 1034 #checked by FastqGeneralIterator). This is equivalent to: 1035 #record.letter_annotations["phred_quality"] = qualities 1036 dict.__setitem__(record._per_letter_annotations, 1037 "phred_quality", qualities) 1038 yield record
1039 1040 #This is a generator function!
1041 -def FastqSolexaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
1042 r"""Parsing old Solexa/Illumina FASTQ like files (which differ in the quality mapping). 1043 1044 The optional arguments are the same as those for the FastqPhredIterator. 1045 1046 For each sequence in Solexa/Illumina FASTQ files there is a matching string 1047 encoding the Solexa integer qualities using ASCII values with an offset 1048 of 64. Solexa scores are scaled differently to PHRED scores, and Biopython 1049 will NOT perform any automatic conversion when loading. 1050 1051 NOTE - This file format is used by the OLD versions of the Solexa/Illumina 1052 pipeline. See also the FastqIlluminaIterator function for the NEW version. 1053 1054 For example, consider a file containing these five records:: 1055 1056 @SLXA-B3_649_FC8437_R1_1_1_610_79 1057 GATGTGCAATACCTTTGTAGAGGAA 1058 +SLXA-B3_649_FC8437_R1_1_1_610_79 1059 YYYYYYYYYYYYYYYYYYWYWYYSU 1060 @SLXA-B3_649_FC8437_R1_1_1_397_389 1061 GGTTTGAGAAAGAGAAATGAGATAA 1062 +SLXA-B3_649_FC8437_R1_1_1_397_389 1063 YYYYYYYYYWYYYYWWYYYWYWYWW 1064 @SLXA-B3_649_FC8437_R1_1_1_850_123 1065 GAGGGTGTTGATCATGATGATGGCG 1066 +SLXA-B3_649_FC8437_R1_1_1_850_123 1067 YYYYYYYYYYYYYWYYWYYSYYYSY 1068 @SLXA-B3_649_FC8437_R1_1_1_362_549 1069 GGAAACAAAGTTTTTCTCAACATAG 1070 +SLXA-B3_649_FC8437_R1_1_1_362_549 1071 YYYYYYYYYYYYYYYYYYWWWWYWY 1072 @SLXA-B3_649_FC8437_R1_1_1_183_714 1073 GTATTATTTAATGGCATACACTCAA 1074 +SLXA-B3_649_FC8437_R1_1_1_183_714 1075 YYYYYYYYYYWYYYYWYWWUWWWQQ 1076 1077 Using this module directly you might run: 1078 1079 >>> handle = open("Quality/solexa_example.fastq", "rU") 1080 >>> for record in FastqSolexaIterator(handle): 1081 ... print record.id, record.seq 1082 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1083 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1084 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1085 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1086 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1087 >>> handle.close() 1088 1089 Typically however, you would call this via Bio.SeqIO instead with 1090 "fastq-solexa" as the format: 1091 1092 >>> from Bio import SeqIO 1093 >>> handle = open("Quality/solexa_example.fastq", "rU") 1094 >>> for record in SeqIO.parse(handle, "fastq-solexa"): 1095 ... print record.id, record.seq 1096 SLXA-B3_649_FC8437_R1_1_1_610_79 GATGTGCAATACCTTTGTAGAGGAA 1097 SLXA-B3_649_FC8437_R1_1_1_397_389 GGTTTGAGAAAGAGAAATGAGATAA 1098 SLXA-B3_649_FC8437_R1_1_1_850_123 GAGGGTGTTGATCATGATGATGGCG 1099 SLXA-B3_649_FC8437_R1_1_1_362_549 GGAAACAAAGTTTTTCTCAACATAG 1100 SLXA-B3_649_FC8437_R1_1_1_183_714 GTATTATTTAATGGCATACACTCAA 1101 >>> handle.close() 1102 1103 If you want to look at the qualities, they are recorded in each record's 1104 per-letter-annotation dictionary as a simple list of integers: 1105 1106 >>> print record.letter_annotations["solexa_quality"] 1107 [25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 23, 25, 25, 25, 25, 23, 25, 23, 23, 21, 23, 23, 23, 17, 17] 1108 1109 These scores aren't very good, but they are high enough that they map 1110 almost exactly onto PHRED scores: 1111 1112 >>> print "%0.2f" % phred_quality_from_solexa(25) 1113 25.01 1114 1115 Let's look at faked example read which is even worse, where there are 1116 more noticeable differences between the Solexa and PHRED scores:: 1117 1118 @slxa_0001_1_0001_01 1119 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1120 +slxa_0001_1_0001_01 1121 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1122 1123 Again, you would typically use Bio.SeqIO to read this file in (rather than 1124 calling the Bio.SeqIO.QualtityIO module directly). Most FASTQ files will 1125 contain thousands of reads, so you would normally use Bio.SeqIO.parse() 1126 as shown above. This example has only as one entry, so instead we can 1127 use the Bio.SeqIO.read() function: 1128 1129 >>> from Bio import SeqIO 1130 >>> handle = open("Quality/solexa_faked.fastq", "rU") 1131 >>> record = SeqIO.read(handle, "fastq-solexa") 1132 >>> handle.close() 1133 >>> print record.id, record.seq 1134 slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1135 >>> print record.letter_annotations["solexa_quality"] 1136 [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] 1137 1138 These quality scores are so low that when converted from the Solexa scheme 1139 into PHRED scores they look quite different: 1140 1141 >>> print "%0.2f" % phred_quality_from_solexa(-1) 1142 2.54 1143 >>> print "%0.2f" % phred_quality_from_solexa(-5) 1144 1.19 1145 1146 Note you can use the Bio.SeqIO.write() function or the SeqRecord's format 1147 method to output the record(s): 1148 1149 >>> print record.format("fastq-solexa") 1150 @slxa_0001_1_0001_01 1151 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1152 + 1153 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@?>=<; 1154 <BLANKLINE> 1155 1156 Note this output is slightly different from the input file as Biopython 1157 has left out the optional repetition of the sequence identifier on the "+" 1158 line. If you want the to use PHRED scores, use "fastq" or "qual" as the 1159 output format instead, and Biopython will do the conversion for you: 1160 1161 >>> print record.format("fastq") 1162 @slxa_0001_1_0001_01 1163 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN 1164 + 1165 IHGFEDCBA@?>=<;:9876543210/.-,++*)('&&%%$$##"" 1166 <BLANKLINE> 1167 1168 >>> print record.format("qual") 1169 >slxa_0001_1_0001_01 1170 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 1171 20 19 18 17 16 15 14 13 12 11 10 10 9 8 7 6 5 5 4 4 3 3 2 2 1172 1 1 1173 <BLANKLINE> 1174 1175 As shown above, the poor quality Solexa reads have been mapped to the 1176 equivalent PHRED score (e.g. -5 to 1 as shown earlier). 1177 """ 1178 q_mapping = dict() 1179 for letter in range(0, 255): 1180 q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET 1181 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1182 if title2ids: 1183 id, name, descr = title_line 1184 else: 1185 descr = title_line 1186 id = descr.split()[0] 1187 name = id 1188 record = SeqRecord(Seq(seq_string, alphabet), 1189 id=id, name=name, description=descr) 1190 qualities = [q_mapping[letter] for letter in quality_string] 1191 #DO NOT convert these into PHRED qualities automatically! 1192 if qualities and (min(qualities) < -5 or max(qualities)>62): 1193 raise ValueError("Invalid character in quality string") 1194 #Dirty trick to speed up this line: 1195 #record.letter_annotations["solexa_quality"] = qualities 1196 dict.__setitem__(record._per_letter_annotations, 1197 "solexa_quality", qualities) 1198 yield record
1199 1200 #This is a generator function!
1201 -def FastqIlluminaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
1202 """Parse Illumina 1.3 to 1.7 FASTQ like files (which differ in the quality mapping). 1203 1204 The optional arguments are the same as those for the FastqPhredIterator. 1205 1206 For each sequence in Illumina 1.3+ FASTQ files there is a matching string 1207 encoding PHRED integer qualities using ASCII values with an offset of 64. 1208 1209 >>> from Bio import SeqIO 1210 >>> record = SeqIO.read(open("Quality/illumina_faked.fastq"), "fastq-illumina") 1211 >>> print record.id, record.seq 1212 Test ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1213 >>> max(record.letter_annotations["phred_quality"]) 1214 40 1215 >>> min(record.letter_annotations["phred_quality"]) 1216 0 1217 1218 NOTE - Older versions of the Solexa/Illumina pipeline encoded Solexa scores 1219 with an ASCII offset of 64. They are approximately equal but only for high 1220 quality reads. If you have an old Solexa/Illumina file with negative 1221 Solexa scores, and try and read this as an Illumina 1.3+ file it will fail: 1222 1223 >>> record2 = SeqIO.read(open("Quality/solexa_faked.fastq"), "fastq-illumina") 1224 Traceback (most recent call last): 1225 ... 1226 ValueError: Invalid character in quality string 1227 1228 NOTE - True Sanger style FASTQ files use PHRED scores with an offset of 33. 1229 """ 1230 q_mapping = dict() 1231 for letter in range(0, 255): 1232 q_mapping[chr(letter)] = letter-SOLEXA_SCORE_OFFSET 1233 for title_line, seq_string, quality_string in FastqGeneralIterator(handle): 1234 if title2ids: 1235 id, name, descr = title2ids(title_line) 1236 else: 1237 descr = title_line 1238 id = descr.split()[0] 1239 name = id 1240 record = SeqRecord(Seq(seq_string, alphabet), 1241 id=id, name=name, description=descr) 1242 qualities = [q_mapping[letter] for letter in quality_string] 1243 if qualities and (min(qualities) < 0 or max(qualities) > 62): 1244 raise ValueError("Invalid character in quality string") 1245 #Dirty trick to speed up this line: 1246 #record.letter_annotations["phred_quality"] = qualities 1247 dict.__setitem__(record._per_letter_annotations, 1248 "phred_quality", qualities) 1249 yield record
1250
1251 -def QualPhredIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
1252 """For QUAL files which include PHRED quality scores, but no sequence. 1253 1254 For example, consider this short QUAL file:: 1255 1256 >EAS54_6_R1_2_1_413_324 1257 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1258 26 26 26 23 23 1259 >EAS54_6_R1_2_1_540_792 1260 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1261 26 18 26 23 18 1262 >EAS54_6_R1_2_1_443_348 1263 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1264 24 18 18 18 18 1265 1266 Using this module directly you might run: 1267 1268 >>> handle = open("Quality/example.qual", "rU") 1269 >>> for record in QualPhredIterator(handle): 1270 ... print record.id, record.seq 1271 EAS54_6_R1_2_1_413_324 ????????????????????????? 1272 EAS54_6_R1_2_1_540_792 ????????????????????????? 1273 EAS54_6_R1_2_1_443_348 ????????????????????????? 1274 >>> handle.close() 1275 1276 Typically however, you would call this via Bio.SeqIO instead with "qual" 1277 as the format: 1278 1279 >>> from Bio import SeqIO 1280 >>> handle = open("Quality/example.qual", "rU") 1281 >>> for record in SeqIO.parse(handle, "qual"): 1282 ... print record.id, record.seq 1283 EAS54_6_R1_2_1_413_324 ????????????????????????? 1284 EAS54_6_R1_2_1_540_792 ????????????????????????? 1285 EAS54_6_R1_2_1_443_348 ????????????????????????? 1286 >>> handle.close() 1287 1288 Becase QUAL files don't contain the sequence string itself, the seq 1289 property is set to an UnknownSeq object. As no alphabet was given, this 1290 has defaulted to a generic single letter alphabet and the character "?" 1291 used. 1292 1293 By specifying a nucleotide alphabet, "N" is used instead: 1294 1295 >>> from Bio import SeqIO 1296 >>> from Bio.Alphabet import generic_dna 1297 >>> handle = open("Quality/example.qual", "rU") 1298 >>> for record in SeqIO.parse(handle, "qual", alphabet=generic_dna): 1299 ... print record.id, record.seq 1300 EAS54_6_R1_2_1_413_324 NNNNNNNNNNNNNNNNNNNNNNNNN 1301 EAS54_6_R1_2_1_540_792 NNNNNNNNNNNNNNNNNNNNNNNNN 1302 EAS54_6_R1_2_1_443_348 NNNNNNNNNNNNNNNNNNNNNNNNN 1303 >>> handle.close() 1304 1305 However, the quality scores themselves are available as a list of integers 1306 in each record's per-letter-annotation: 1307 1308 >>> print record.letter_annotations["phred_quality"] 1309 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1310 1311 You can still slice one of these SeqRecord objects with an UnknownSeq: 1312 1313 >>> sub_record = record[5:10] 1314 >>> print sub_record.id, sub_record.letter_annotations["phred_quality"] 1315 EAS54_6_R1_2_1_443_348 [26, 26, 26, 26, 26] 1316 """ 1317 #Skip any text before the first record (e.g. blank lines, comments) 1318 while True: 1319 line = handle.readline() 1320 if line == "" : return #Premature end of file, or just empty? 1321 if line[0] == ">": 1322 break 1323 1324 while True: 1325 if line[0] != ">": 1326 raise ValueError("Records in Fasta files should start with '>' character") 1327 if title2ids: 1328 id, name, descr = title2ids(line[1:].rstrip()) 1329 else: 1330 descr = line[1:].rstrip() 1331 id = descr.split()[0] 1332 name = id 1333 1334 qualities = [] 1335 line = handle.readline() 1336 while True: 1337 if not line : break 1338 if line[0] == ">": break 1339 qualities.extend([int(word) for word in line.split()]) 1340 line = handle.readline() 1341 1342 if qualities and min(qualities) < 0: 1343 raise ValueError(("Negative quality score %i found in %s. " + \ 1344 "Are these Solexa scores, not PHRED scores?") \ 1345 % (min(qualities), id)) 1346 1347 #Return the record and then continue... 1348 record = SeqRecord(UnknownSeq(len(qualities), alphabet), 1349 id = id, name = name, description = descr) 1350 #Dirty trick to speed up this line: 1351 #record.letter_annotations["phred_quality"] = qualities 1352 dict.__setitem__(record._per_letter_annotations, 1353 "phred_quality", qualities) 1354 yield record 1355 1356 if not line : return #StopIteration 1357 assert False, "Should not reach this line"
1358
1359 -class FastqPhredWriter(SequentialSequenceWriter):
1360 """Class to write standard FASTQ format files (using PHRED quality scores). 1361 1362 Although you can use this class directly, you are strongly encouraged 1363 to use the Bio.SeqIO.write() function instead via the format name "fastq" 1364 or the alias "fastq-sanger". For example, this code reads in a standard 1365 Sanger style FASTQ file (using PHRED scores) and re-saves it as another 1366 Sanger style FASTQ file: 1367 1368 >>> from Bio import SeqIO 1369 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq") 1370 >>> out_handle = open("Quality/temp.fastq", "w") 1371 >>> SeqIO.write(record_iterator, out_handle, "fastq") 1372 3 1373 >>> out_handle.close() 1374 1375 You might want to do this if the original file included extra line breaks, 1376 which while valid may not be supported by all tools. The output file from 1377 Biopython will have each sequence on a single line, and each quality 1378 string on a single line (which is considered desirable for maximum 1379 compatibility). 1380 1381 In this next example, an old style Solexa/Illumina FASTQ file (using Solexa 1382 quality scores) is converted into a standard Sanger style FASTQ file using 1383 PHRED qualities: 1384 1385 >>> from Bio import SeqIO 1386 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa") 1387 >>> out_handle = open("Quality/temp.fastq", "w") 1388 >>> SeqIO.write(record_iterator, out_handle, "fastq") 1389 5 1390 >>> out_handle.close() 1391 1392 This code is also called if you use the .format("fastq") method of a 1393 SeqRecord, or .format("fastq-sanger") if you prefer that alias. 1394 1395 Note that Sanger FASTQ files have an upper limit of PHRED quality 93, which is 1396 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1397 warning is issued. 1398 1399 P.S. To avoid cluttering up your working directory, you can delete this 1400 temporary file now: 1401 1402 >>> import os 1403 >>> os.remove("Quality/temp.fastq") 1404 """ 1405 assert SANGER_SCORE_OFFSET == ord("!") 1406
1407 - def write_record(self, record):
1408 """Write a single FASTQ record to the file.""" 1409 assert self._header_written 1410 assert not self._footer_written 1411 self._record_written = True 1412 #TODO - Is an empty sequence allowed in FASTQ format? 1413 if record.seq is None: 1414 raise ValueError("No sequence for record %s" % record.id) 1415 seq_str = str(record.seq) 1416 qualities_str = _get_sanger_quality_str(record) 1417 if len(qualities_str) != len(seq_str): 1418 raise ValueError("Record %s has sequence length %i but %i quality scores" \ 1419 % (record.id, len(seq_str), len(qualities_str))) 1420 1421 #FASTQ files can include a description, just like FASTA files 1422 #(at least, this is what the NCBI Short Read Archive does) 1423 id = self.clean(record.id) 1424 description = self.clean(record.description) 1425 if description and description.split(None, 1)[0]==id: 1426 #The description includes the id at the start 1427 title = description 1428 elif description: 1429 title = "%s %s" % (id, description) 1430 else: 1431 title = id 1432 1433 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1434
1435 -class QualPhredWriter(SequentialSequenceWriter):
1436 """Class to write QUAL format files (using PHRED quality scores). 1437 1438 Although you can use this class directly, you are strongly encouraged 1439 to use the Bio.SeqIO.write() function instead. For example, this code 1440 reads in a FASTQ file and saves the quality scores into a QUAL file: 1441 1442 >>> from Bio import SeqIO 1443 >>> record_iterator = SeqIO.parse(open("Quality/example.fastq"), "fastq") 1444 >>> out_handle = open("Quality/temp.qual", "w") 1445 >>> SeqIO.write(record_iterator, out_handle, "qual") 1446 3 1447 >>> out_handle.close() 1448 1449 This code is also called if you use the .format("qual") method of a 1450 SeqRecord. 1451 1452 P.S. Don't forget to clean up the temp file if you don't need it anymore: 1453 1454 >>> import os 1455 >>> os.remove("Quality/temp.qual") 1456 """
1457 - def __init__(self, handle, wrap=60, record2title=None):
1458 """Create a QUAL writer. 1459 1460 Arguments: 1461 - handle - Handle to an output file, e.g. as returned 1462 by open(filename, "w") 1463 - wrap - Optional line length used to wrap sequence lines. 1464 Defaults to wrapping the sequence at 60 characters 1465 Use zero (or None) for no wrapping, giving a single 1466 long line for the sequence. 1467 - record2title - Optional function to return the text to be 1468 used for the title line of each record. By default 1469 a combination of the record.id and record.description 1470 is used. If the record.description starts with the 1471 record.id, then just the record.description is used. 1472 1473 The record2title argument is present for consistency with the 1474 Bio.SeqIO.FastaIO writer class. 1475 """ 1476 SequentialSequenceWriter.__init__(self, handle) 1477 #self.handle = handle 1478 self.wrap = None 1479 if wrap: 1480 if wrap < 1: 1481 raise ValueError 1482 self.wrap = wrap 1483 self.record2title = record2title
1484
1485 - def write_record(self, record):
1486 """Write a single QUAL record to the file.""" 1487 assert self._header_written 1488 assert not self._footer_written 1489 self._record_written = True 1490 1491 handle = self.handle 1492 wrap = self.wrap 1493 1494 if self.record2title: 1495 title = self.clean(self.record2title(record)) 1496 else: 1497 id = self.clean(record.id) 1498 description = self.clean(record.description) 1499 if description and description.split(None, 1)[0]==id: 1500 #The description includes the id at the start 1501 title = description 1502 elif description: 1503 title = "%s %s" % (id, description) 1504 else: 1505 title = id 1506 handle.write(">%s\n" % title) 1507 1508 qualities = _get_phred_quality(record) 1509 try: 1510 #This rounds to the nearest integer. 1511 #TODO - can we record a float in a qual file? 1512 qualities_strs = [("%i" % round(q, 0)) for q in qualities] 1513 except TypeError, e: 1514 if None in qualities: 1515 raise TypeError("A quality value of None was found") 1516 else: 1517 raise e 1518 1519 if wrap > 5: 1520 #Fast wrapping 1521 data = " ".join(qualities_strs) 1522 while True: 1523 if len(data) <= wrap: 1524 self.handle.write(data + "\n") 1525 break 1526 else: 1527 #By construction there must be spaces in the first X chars 1528 #(unless we have X digit or higher quality scores!) 1529 i = data.rfind(" ", 0, wrap) 1530 handle.write(data[:i] + "\n") 1531 data = data[i+1:] 1532 elif wrap: 1533 #Safe wrapping 1534 while qualities_strs: 1535 line = qualities_strs.pop(0) 1536 while qualities_strs \ 1537 and len(line) + 1 + len(qualities_strs[0]) < wrap: 1538 line += " " + qualities_strs.pop(0) 1539 handle.write(line + "\n") 1540 else: 1541 #No wrapping 1542 data = " ".join(qualities_strs) 1543 handle.write(data + "\n")
1544
1545 -class FastqSolexaWriter(SequentialSequenceWriter):
1546 r"""Write old style Solexa/Illumina FASTQ format files (with Solexa qualities). 1547 1548 This outputs FASTQ files like those from the early Solexa/Illumina 1549 pipeline, using Solexa scores and an ASCII offset of 64. These are 1550 NOT compatible with the standard Sanger style PHRED FASTQ files. 1551 1552 If your records contain a "solexa_quality" entry under letter_annotations, 1553 this is used, otherwise any "phred_quality" entry will be used after 1554 conversion using the solexa_quality_from_phred function. If neither style 1555 of quality scores are present, an exception is raised. 1556 1557 Although you can use this class directly, you are strongly encouraged 1558 to use the Bio.SeqIO.write() function instead. For example, this code 1559 reads in a FASTQ file and re-saves it as another FASTQ file: 1560 1561 >>> from Bio import SeqIO 1562 >>> record_iterator = SeqIO.parse(open("Quality/solexa_example.fastq"), "fastq-solexa") 1563 >>> out_handle = open("Quality/temp.fastq", "w") 1564 >>> SeqIO.write(record_iterator, out_handle, "fastq-solexa") 1565 5 1566 >>> out_handle.close() 1567 1568 You might want to do this if the original file included extra line breaks, 1569 which (while valid) may not be supported by all tools. The output file 1570 from Biopython will have each sequence on a single line, and each quality 1571 string on a single line (which is considered desirable for maximum 1572 compatibility). 1573 1574 This code is also called if you use the .format("fastq-solexa") method of 1575 a SeqRecord. For example, 1576 1577 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger") 1578 >>> print record.format("fastq-solexa") 1579 @Test PHRED qualities from 40 to 0 inclusive 1580 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1581 + 1582 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJHGFECB@>;; 1583 <BLANKLINE> 1584 1585 Note that Solexa FASTQ files have an upper limit of Solexa quality 62, which is 1586 encoded as ASCII 126, the tilde. If your quality scores must be truncated to fit, 1587 a warning is issued. 1588 1589 P.S. Don't forget to delete the temp file if you don't need it anymore: 1590 1591 >>> import os 1592 >>> os.remove("Quality/temp.fastq") 1593 """
1594 - def write_record(self, record):
1595 """Write a single FASTQ record to the file.""" 1596 assert self._header_written 1597 assert not self._footer_written 1598 self._record_written = True 1599 1600 #TODO - Is an empty sequence allowed in FASTQ format? 1601 if record.seq is None: 1602 raise ValueError("No sequence for record %s" % record.id) 1603 seq_str = str(record.seq) 1604 qualities_str = _get_solexa_quality_str(record) 1605 if len(qualities_str) != len(seq_str): 1606 raise ValueError("Record %s has sequence length %i but %i quality scores" \ 1607 % (record.id, len(seq_str), len(qualities_str))) 1608 1609 #FASTQ files can include a description, just like FASTA files 1610 #(at least, this is what the NCBI Short Read Archive does) 1611 id = self.clean(record.id) 1612 description = self.clean(record.description) 1613 if description and description.split(None, 1)[0]==id: 1614 #The description includes the id at the start 1615 title = description 1616 elif description: 1617 title = "%s %s" % (id, description) 1618 else: 1619 title = id 1620 1621 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1622
1623 -class FastqIlluminaWriter(SequentialSequenceWriter):
1624 r"""Write Illumina 1.3+ FASTQ format files (with PHRED quality scores). 1625 1626 This outputs FASTQ files like those from the Solexa/Illumina 1.3+ pipeline, 1627 using PHRED scores and an ASCII offset of 64. Note these files are NOT 1628 compatible with the standard Sanger style PHRED FASTQ files which use an 1629 ASCII offset of 32. 1630 1631 Although you can use this class directly, you are strongly encouraged to 1632 use the Bio.SeqIO.write() function with format name "fastq-illumina" 1633 instead. This code is also called if you use the .format("fastq-illumina") 1634 method of a SeqRecord. For example, 1635 1636 >>> from Bio import SeqIO 1637 >>> record = SeqIO.read(open("Quality/sanger_faked.fastq"), "fastq-sanger") 1638 >>> print record.format("fastq-illumina") 1639 @Test PHRED qualities from 40 to 0 inclusive 1640 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTN 1641 + 1642 hgfedcba`_^]\[ZYXWVUTSRQPONMLKJIHGFEDCBA@ 1643 <BLANKLINE> 1644 1645 Note that Illumina FASTQ files have an upper limit of PHRED quality 62, which is 1646 encoded as ASCII 126, the tilde. If your quality scores are truncated to fit, a 1647 warning is issued. 1648 """
1649 - def write_record(self, record):
1650 """Write a single FASTQ record to the file.""" 1651 assert self._header_written 1652 assert not self._footer_written 1653 self._record_written = True 1654 1655 #TODO - Is an empty sequence allowed in FASTQ format? 1656 if record.seq is None: 1657 raise ValueError("No sequence for record %s" % record.id) 1658 seq_str = str(record.seq) 1659 qualities_str = _get_illumina_quality_str(record) 1660 if len(qualities_str) != len(seq_str): 1661 raise ValueError("Record %s has sequence length %i but %i quality scores" \ 1662 % (record.id, len(seq_str), len(qualities_str))) 1663 1664 #FASTQ files can include a description, just like FASTA files 1665 #(at least, this is what the NCBI Short Read Archive does) 1666 id = self.clean(record.id) 1667 description = self.clean(record.description) 1668 if description and description.split(None, 1)[0]==id: 1669 #The description includes the id at the start 1670 title = description 1671 elif description: 1672 title = "%s %s" % (id, description) 1673 else: 1674 title = id 1675 1676 self.handle.write("@%s\n%s\n+\n%s\n" % (title, seq_str, qualities_str))
1677
1678 -def PairedFastaQualIterator(fasta_handle, qual_handle, alphabet = single_letter_alphabet, title2ids = None):
1679 """Iterate over matched FASTA and QUAL files as SeqRecord objects. 1680 1681 For example, consider this short QUAL file with PHRED quality scores:: 1682 1683 >EAS54_6_R1_2_1_413_324 1684 26 26 18 26 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 1685 26 26 26 23 23 1686 >EAS54_6_R1_2_1_540_792 1687 26 26 26 26 26 26 26 26 26 26 26 22 26 26 26 26 26 12 26 26 1688 26 18 26 23 18 1689 >EAS54_6_R1_2_1_443_348 1690 26 26 26 26 26 26 26 26 26 26 26 24 26 22 26 26 13 22 26 18 1691 24 18 18 18 18 1692 1693 And a matching FASTA file:: 1694 1695 >EAS54_6_R1_2_1_413_324 1696 CCCTTCTTGTCTTCAGCGTTTCTCC 1697 >EAS54_6_R1_2_1_540_792 1698 TTGGCAGGCCAAGGCCGATGGATCA 1699 >EAS54_6_R1_2_1_443_348 1700 GTTGCTTCTGGCGTGGGTGGGGGGG 1701 1702 You can parse these separately using Bio.SeqIO with the "qual" and 1703 "fasta" formats, but then you'll get a group of SeqRecord objects with 1704 no sequence, and a matching group with the sequence but not the 1705 qualities. Because it only deals with one input file handle, Bio.SeqIO 1706 can't be used to read the two files together - but this function can! 1707 For example, 1708 1709 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"), 1710 ... open("Quality/example.qual", "rU")) 1711 >>> for record in rec_iter: 1712 ... print record.id, record.seq 1713 EAS54_6_R1_2_1_413_324 CCCTTCTTGTCTTCAGCGTTTCTCC 1714 EAS54_6_R1_2_1_540_792 TTGGCAGGCCAAGGCCGATGGATCA 1715 EAS54_6_R1_2_1_443_348 GTTGCTTCTGGCGTGGGTGGGGGGG 1716 1717 As with the FASTQ or QUAL parsers, if you want to look at the qualities, 1718 they are in each record's per-letter-annotation dictionary as a simple 1719 list of integers: 1720 1721 >>> print record.letter_annotations["phred_quality"] 1722 [26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 24, 26, 22, 26, 26, 13, 22, 26, 18, 24, 18, 18, 18, 18] 1723 1724 If you have access to data as a FASTQ format file, using that directly 1725 would be simpler and more straight forward. Note that you can easily use 1726 this function to convert paired FASTA and QUAL files into FASTQ files: 1727 1728 >>> from Bio import SeqIO 1729 >>> rec_iter = PairedFastaQualIterator(open("Quality/example.fasta", "rU"), 1730 ... open("Quality/example.qual", "rU")) 1731 >>> out_handle = open("Quality/temp.fastq", "w") 1732 >>> SeqIO.write(rec_iter, out_handle, "fastq") 1733 3 1734 >>> out_handle.close() 1735 1736 And don't forget to clean up the temp file if you don't need it anymore: 1737 1738 >>> import os 1739 >>> os.remove("Quality/temp.fastq") 1740 """ 1741 from Bio.SeqIO.FastaIO import FastaIterator 1742 fasta_iter = FastaIterator(fasta_handle, alphabet=alphabet, \ 1743 title2ids=title2ids) 1744 qual_iter = QualPhredIterator(qual_handle, alphabet=alphabet, \ 1745 title2ids=title2ids) 1746 1747 #Using zip(...) would create a list loading everything into memory! 1748 #It would also not catch any extra records found in only one file. 1749 while True: 1750 try: 1751 f_rec = fasta_iter.next() 1752 except StopIteration: 1753 f_rec = None 1754 try: 1755 q_rec = qual_iter.next() 1756 except StopIteration: 1757 q_rec = None 1758 if f_rec is None and q_rec is None: 1759 #End of both files 1760 break 1761 if f_rec is None: 1762 raise ValueError("FASTA file has more entries than the QUAL file.") 1763 if q_rec is None: 1764 raise ValueError("QUAL file has more entries than the FASTA file.") 1765 if f_rec.id != q_rec.id: 1766 raise ValueError("FASTA and QUAL entries do not match (%s vs %s)." \ 1767 % (f_rec.id, q_rec.id)) 1768 if len(f_rec) != len(q_rec.letter_annotations["phred_quality"]): 1769 raise ValueError("Sequence length and number of quality scores disagree for %s" \ 1770 % f_rec.id) 1771 #Merge the data.... 1772 f_rec.letter_annotations["phred_quality"] = q_rec.letter_annotations["phred_quality"] 1773 yield f_rec
1774 #Done 1775 1776
1777 -def _test():
1778 """Run the Bio.SeqIO module's doctests. 1779 1780 This will try and locate the unit tests directory, and run the doctests 1781 from there in order that the relative paths used in the examples work. 1782 """ 1783 import doctest 1784 import os 1785 if os.path.isdir(os.path.join("..", "..", "Tests")): 1786 print "Runing doctests..." 1787 cur_dir = os.path.abspath(os.curdir) 1788 os.chdir(os.path.join("..", "..", "Tests")) 1789 assert os.path.isfile("Quality/example.fastq") 1790 assert os.path.isfile("Quality/example.fasta") 1791 assert os.path.isfile("Quality/example.qual") 1792 assert os.path.isfile("Quality/tricky.fastq") 1793 assert os.path.isfile("Quality/solexa_faked.fastq") 1794 doctest.testmod(verbose=0) 1795 os.chdir(cur_dir) 1796 del cur_dir 1797 print "Done"
1798 1799 if __name__ == "__main__": 1800 _test() 1801