Package Bio :: Package SeqIO :: Module SffIO
[hide private]
[frames] | no frames]

Module SffIO

source code

Bio.SeqIO support for the binary Standard Flowgram Format (SFF) file format.

SFF was designed by 454 Life Sciences (Roche), the Whitehead Institute for Biomedical Research and the Wellcome Trust Sanger Institute. You are expected to use this module via the Bio.SeqIO functions under the format name "sff" (or "sff-trim" as described below).

For example, to iterate over the records in an SFF file,

>>> from Bio import SeqIO
>>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff"):
...     print record.id, len(record), record.seq[:20]+"..."
E3MFGYR02JWQ7T 265 tcagGGTCTACATGTTGGTT...
E3MFGYR02JA6IL 271 tcagTTTTTTTTGGAAAGGA...
E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC...
E3MFGYR02GFKUC 299 tcagCGGCCGGGCCTCTCAT...
E3MFGYR02FTGED 281 tcagTGGTAATGGGGGGAAA...
E3MFGYR02FR9G7 261 tcagCTCCGTAAGAAGGTGC...
E3MFGYR02GAZMS 278 tcagAAAGAAGTAAGGTAAA...
E3MFGYR02HHZ8O 221 tcagACTTTCTTCTTTACCG...
E3MFGYR02GPGB1 269 tcagAAGCAGTGGTATCAAC...
E3MFGYR02F7Z7G 219 tcagAATCATCCACTTTTTA...

Each SeqRecord object will contain all the annotation from the SFF file, including the PHRED quality scores.

>>> print record.id, len(record)
E3MFGYR02F7Z7G 219
>>> print record.seq[:10], "..."
tcagAATCAT ...
>>> print record.letter_annotations["phred_quality"][:10], "..."
[22, 21, 23, 28, 26, 15, 12, 21, 28, 21] ...

Notice that the sequence is given in mixed case, the central upper case region corresponds to the trimmed sequence. This matches the output of the Roche tools (and the 3rd party tool sff_extract) for SFF to FASTA.

>>> print record.annotations["clip_qual_left"]
4
>>> print record.annotations["clip_qual_right"]
134
>>> print record.seq[:4]
tcag
>>> print record.seq[4:20], "...", record.seq[120:134]
AATCATCCACTTTTTA ... CAAAACACAAACAG
>>> print record.seq[134:]
atcttatcaacaaaactcaaagttcctaactgagacacgcaacaggggataagacaaggcacacaggggataggnnnnnnnnnnn

The annotations dictionary also contains any adapter clip positions (usually zero), and information about the flows. e.g.

>>> print record.annotations["flow_key"]
TCAG
>>> print record.annotations["flow_values"][:10], "..."
(83, 1, 128, 7, 4, 84, 6, 106, 3, 172) ...
>>> print len(record.annotations["flow_values"])
400
>>> print record.annotations["flow_index"][:10], "..."
(1, 2, 3, 2, 2, 0, 3, 2, 3, 3) ...
>>> print len(record.annotations["flow_index"])
219

Note that to convert from a raw reading in flow_values to the corresponding homopolymer stretch estimate, the value should be rounded to the nearest 100:

>>> print [int(round(value, -2)) // 100
...        for value in record.annotations["flow_values"][:10]], '...'
[1, 0, 1, 0, 0, 1, 0, 1, 0, 2] ...

As a convenience method, you can read the file with SeqIO format name "sff-trim" instead of "sff" to get just the trimmed sequences (without any annotation except for the PHRED quality scores):

>>> from Bio import SeqIO
>>> for record in SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff-trim"):
...     print record.id, len(record), record.seq[:20]+"..."
E3MFGYR02JWQ7T 260 GGTCTACATGTTGGTTAACC...
E3MFGYR02JA6IL 265 TTTTTTTTGGAAAGGAAAAC...
E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG...
E3MFGYR02GFKUC 295 CGGCCGGGCCTCTCATCGGT...
E3MFGYR02FTGED 277 TGGTAATGGGGGGAAATTTA...
E3MFGYR02FR9G7 256 CTCCGTAAGAAGGTGCTGCC...
E3MFGYR02GAZMS 271 AAAGAAGTAAGGTAAATAAC...
E3MFGYR02HHZ8O 150 ACTTTCTTCTTTACCGTAAC...
E3MFGYR02GPGB1 221 AAGCAGTGGTATCAACGCAG...
E3MFGYR02F7Z7G 130 AATCATCCACTTTTTAACGT...

Looking at the final record in more detail, note how this differs to the example above:

>>> print record.id, len(record)
E3MFGYR02F7Z7G 130
>>> print record.seq[:10], "..."
AATCATCCAC ...
>>> print record.letter_annotations["phred_quality"][:10], "..."
[26, 15, 12, 21, 28, 21, 36, 28, 27, 27] ...
>>> print record.annotations
{}

You might use the Bio.SeqIO.convert() function to convert the (trimmed) SFF reads into a FASTQ file (or a FASTA file and a QUAL file), e.g.

>>> from Bio import SeqIO
>>> from StringIO import StringIO
>>> out_handle = StringIO()
>>> count = SeqIO.convert("Roche/E3MFGYR02_random_10_reads.sff", "sff",
...                       out_handle, "fastq")
>>> print "Converted %i records" % count
Converted 10 records

The output FASTQ file would start like this:

>>> print "%s..." % out_handle.getvalue()[:50]
@E3MFGYR02JWQ7T
tcagGGTCTACATGTTGGTTAACCCGTACTGATT...

Bio.SeqIO.index() provides memory efficient random access to the reads in an SFF file by name. SFF files can include an index within the file, which can be read in making this very fast. If the index is missing (or in a format not yet supported in Biopython) the file is indexed by scanning all the reads - which is a little slower. For example,

>>> from Bio import SeqIO
>>> reads = SeqIO.index("Roche/E3MFGYR02_random_10_reads.sff", "sff")
>>> record = reads["E3MFGYR02JHD4H"]
>>> print record.id, len(record), record.seq[:20]+"..."
E3MFGYR02JHD4H 310 tcagAAAGACAAGTGGTATC...

Or, using the trimmed reads:

>>> from Bio import SeqIO
>>> reads = SeqIO.index("Roche/E3MFGYR02_random_10_reads.sff", "sff-trim")
>>> record = reads["E3MFGYR02JHD4H"]
>>> print record.id, len(record), record.seq[:20]+"..."
E3MFGYR02JHD4H 292 AAAGACAAGTGGTATCAACG...

You can also use the Bio.SeqIO.write() function with the "sff" format. Note that this requires all the flow information etc, and thus is probably only useful for SeqRecord objects originally from reading another SFF file (and not the trimmed SeqRecord objects from parsing an SFF file as "sff-trim").

As an example, let's pretend this example SFF file represents some DNA which was pre-amplified with a PCR primers AAAGANNNNN. The following script would produce a sub-file containing all those reads whose post-quality clipping region (i.e. the sequence after trimming) starts with AAAGA exactly (the non- degenerate bit of this pretend primer):

>>> from Bio import SeqIO
>>> records = (record for record in
...            SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff","sff")
...            if record.seq[record.annotations["clip_qual_left"]:].startswith("AAAGA"))
>>> count = SeqIO.write(records, "temp_filtered.sff", "sff")
>>> print "Selected %i records" % count
Selected 2 records

Of course, for an assembly you would probably want to remove these primers. If you want FASTA or FASTQ output, you could just slice the SeqRecord. However, if you want SFF output we have to preserve all the flow information - the trick is just to adjust the left clip position!

>>> from Bio import SeqIO
>>> def filter_and_trim(records, primer):
...     for record in records:
...         if record.seq[record.annotations["clip_qual_left"]:].startswith(primer):
...             record.annotations["clip_qual_left"] += len(primer)
...             yield record
>>> records = SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff")
>>> count = SeqIO.write(filter_and_trim(records,"AAAGA"),
...                     "temp_filtered.sff", "sff")
>>> print "Selected %i records" % count
Selected 2 records

We can check the results, note the lower case clipped region now includes the "AAAGA" sequence:

>>> for record in SeqIO.parse("temp_filtered.sff", "sff"):
...     print record.id, len(record), record.seq[:20]+"..."
E3MFGYR02JHD4H 310 tcagaaagaCAAGTGGTATC...
E3MFGYR02GAZMS 278 tcagaaagaAGTAAGGTAAA...
>>> for record in SeqIO.parse("temp_filtered.sff", "sff-trim"):
...     print record.id, len(record), record.seq[:20]+"..."
E3MFGYR02JHD4H 287 CAAGTGGTATCAACGCAGAG...
E3MFGYR02GAZMS 266 AGTAAGGTAAATAACAAACG...
>>> import os
>>> os.remove("temp_filtered.sff")

For a description of the file format, please see the Roche manuals and: http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=formats&m=doc&s=formats

Classes [hide private]
  _AddTellHandle
Wrapper for handles which do not support the tell method (PRIVATE).
  SffWriter
SFF file writer.
Functions [hide private]
 
_sff_file_header(handle)
Read in an SFF file header (PRIVATE).
source code
 
_sff_do_slow_index(handle)
Generates an index by scanning though all the reads in an SFF file (PRIVATE).
source code
 
_sff_find_roche_index(handle)
Locate any existing Roche style XML meta data and read index (PRIVATE).
source code
 
_sff_read_roche_index_xml(handle)
Reads any existing Roche style XML manifest data in the SFF "index" (PRIVATE, DEPRECATED).
source code
 
ReadRocheXmlManifest(handle)
Reads any Roche style XML manifest data in the SFF "index".
source code
 
_sff_read_roche_index(handle)
Reads any existing Roche style read index provided in the SFF file (PRIVATE).
source code
 
_sff_read_seq_record(handle, number_of_flows_per_read, flow_chars, key_sequence, alphabet, trim=False)
Parse the next read in the file, return data as a SeqRecord (PRIVATE).
source code
 
_sff_read_raw_record(handle, number_of_flows_per_read)
Extract the next read in the file as a raw (bytes) string (PRIVATE).
source code
 
SffIterator(handle, alphabet=DNAAlphabet(), trim=False)
Iterate over Standard Flowgram Format (SFF) reads (as SeqRecord objects).
source code
 
_SffTrimIterator(handle, alphabet=DNAAlphabet())
Iterate over SFF reads (as SeqRecord objects) with trimming (PRIVATE).
source code
Variables [hide private]
  _null = '\x00'
  _sff = '.sff'
  _hsh = '.hsh'
  _srt = '.srt'
  _mft = '.mft'
  _flag = '\xff'
  __package__ = 'Bio.SeqIO'
Function Details [hide private]

_sff_file_header(handle)

source code 

Read in an SFF file header (PRIVATE).

Assumes the handle is at the start of the file, will read forwards though the header and leave the handle pointing at the first record. Returns a tuple of values from the header (header_length, index_offset, index_length, number_of_reads, flows_per_read, flow_chars, key_sequence)

>>> handle = open("Roche/greek.sff", "rb")
>>> values = _sff_file_header(handle)
>>> print values[0]
840
>>> print values[1]
65040
>>> print values[2]
256
>>> print values[3]
24
>>> print values[4]
800
>>> values[-1]
'TCAG'

_sff_do_slow_index(handle)

source code 

Generates an index by scanning though all the reads in an SFF file (PRIVATE).

This is a slow but generic approach if we can't parse the provided index (if present).

Will use the handle seek/tell functions.

_sff_find_roche_index(handle)

source code 

Locate any existing Roche style XML meta data and read index (PRIVATE).

Makes a number of hard coded assumptions based on reverse engineered SFF files from Roche 454 machines.

Returns a tuple of read count, SFF "index" offset and size, XML offset and size, and the actual read index offset and size.

Raises a ValueError for unsupported or non-Roche index blocks.

_sff_read_roche_index_xml(handle)

source code 

Reads any existing Roche style XML manifest data in the SFF "index" (PRIVATE, DEPRECATED).

Will use the handle seek/tell functions. Returns a string.

This has been replaced by ReadRocheXmlManifest. We would normally just delete an old private function without warning, but I believe some people are using this so we'll handle this with a deprecation warning.

ReadRocheXmlManifest(handle)

source code 

Reads any Roche style XML manifest data in the SFF "index".

The SFF file format allows for multiple different index blocks, and Roche took advantage of this to define their own index block wich also embeds an XML manifest string. This is not a publically documented extension to the SFF file format, this was reverse engineered.

The handle should be to an SFF file opened in binary mode. This function will use the handle seek/tell functions and leave the handle in an arbitrary location.

Any XML manifest found is returned as a Python string, which you can then parse as appropriate, or reuse when writing out SFF files with the SffWriter class.

Returns a string, or raises a ValueError if an Roche manifest could not be found.

_sff_read_roche_index(handle)

source code 

Reads any existing Roche style read index provided in the SFF file (PRIVATE).

Will use the handle seek/tell functions.

This works on ".srt1.00" and ".mft1.00" style Roche SFF index blocks.

Roche SFF indices use base 255 not 256, meaning we see bytes in range the range 0 to 254 only. This appears to be so that byte 0xFF (character 255) can be used as a marker character to separate entries (required if the read name lengths vary).

Note that since only four bytes are used for the read offset, this is limited to 255^4 bytes (nearly 4GB). If you try to use the Roche sfffile tool to combine SFF files beyound this limit, they issue a warning and omit the index (and manifest).

SffIterator(handle, alphabet=DNAAlphabet(), trim=False)

source code 
Iterate over Standard Flowgram Format (SFF) reads (as SeqRecord objects).

handle - input file, an SFF file, e.g. from Roche 454 sequencing.
         This must NOT be opened in universal read lines mode!
alphabet - optional alphabet, defaults to generic DNA.
trim - should the sequences be trimmed?

The resulting SeqRecord objects should match those from a paired FASTA
and QUAL file converted from the SFF file using the Roche 454 tool
ssfinfo. i.e. The sequence will be mixed case, with the trim regions
shown in lower case.

This function is used internally via the Bio.SeqIO functions:

>>> from Bio import SeqIO
>>> handle = open("Roche/E3MFGYR02_random_10_reads.sff", "rb")
>>> for record in SeqIO.parse(handle, "sff"):
...     print record.id, len(record)
E3MFGYR02JWQ7T 265
E3MFGYR02JA6IL 271
E3MFGYR02JHD4H 310
E3MFGYR02GFKUC 299
E3MFGYR02FTGED 281
E3MFGYR02FR9G7 261
E3MFGYR02GAZMS 278
E3MFGYR02HHZ8O 221
E3MFGYR02GPGB1 269
E3MFGYR02F7Z7G 219
>>> handle.close()

You can also call it directly:

>>> handle = open("Roche/E3MFGYR02_random_10_reads.sff", "rb")
>>> for record in SffIterator(handle):
...     print record.id, len(record)
E3MFGYR02JWQ7T 265
E3MFGYR02JA6IL 271
E3MFGYR02JHD4H 310
E3MFGYR02GFKUC 299
E3MFGYR02FTGED 281
E3MFGYR02FR9G7 261
E3MFGYR02GAZMS 278
E3MFGYR02HHZ8O 221
E3MFGYR02GPGB1 269
E3MFGYR02F7Z7G 219
>>> handle.close()

Or, with the trim option:

>>> handle = open("Roche/E3MFGYR02_random_10_reads.sff", "rb")
>>> for record in SffIterator(handle, trim=True):
...     print record.id, len(record)
E3MFGYR02JWQ7T 260
E3MFGYR02JA6IL 265
E3MFGYR02JHD4H 292
E3MFGYR02GFKUC 295
E3MFGYR02FTGED 277
E3MFGYR02FR9G7 256
E3MFGYR02GAZMS 271
E3MFGYR02HHZ8O 150
E3MFGYR02GPGB1 221
E3MFGYR02F7Z7G 130
>>> handle.close()