summarizeOverlaps {GenomicRanges} | R Documentation |
Count reads that map to genomic features with options to resolve reads that overlap multiple features
## S4 method for signature 'GRanges,GappedAlignments' summarizeOverlaps( features, reads, mode, ignore.strand = FALSE, ..., param = ScanBamParam())
features |
A GRanges or a GRangesList object of genomic regions of interest. When a GRanges is supplied, each row is considered a different feature. When a GRangesList is supplied, each highest list-level is considered a feature and the multiple elements are considered portions of the same feature. See examples or vignette for details. |
reads |
A GappedAlignments, BamFileList or a BamViews object. |
mode |
Character name of a function that defines the counting method to be used.
Available counting modes include "Union", "IntersectionStrict", or
"IntersectionNotEmpty" and are designed after the counting modes available
in the HTSeq package by Simon Anders (see references). A user provided
count function can be used as the
|
param |
An optional ScanBamParam instance to further influence scanning, counting, or filtering of the BAM file. |
ignore.strand |
A logical value indicating if strand should be considered when matching. |
... |
Additional arguments for other methods. If using multiple cores, you can pass arguments in here to be used by mclapply to indicate the number of cores to use etc. |
In the context of summarizeOverlaps
a "feature" can be any portion of a
genomic region such as a gene, transcript, exon etc. When the features
argument is a GRanges the rows define the features to be
overlapped. When features
is a GRangesList the highest
list-levels define the features.
summarizeOverlaps
offers three mode
functions to handle reads
that overlap multiple features: "Union", "IntersectionStrict", and
"IntersectionNotEmpty". These functions are patterned after the counting
methods in the HTSeq package (see references). Each mode has a set of rules
that dictate how a read is assigned. Reads are counted a maximum of once.
Alternatively, users can provide their own counting function as the
mode
argument and take advantage of the infrastructure in
summarizeOverlaps
to count across multiple files and parse the results
into a SummarizedExperiment object.
Currently reads must be input as either a BAM file or a GappedAlignments object. The information in the CIGAR field is used to determine if gapped reads are present.
NOTE : summarizeOverlaps
does not currently handle paired-end reads.
A SummarizedExperiment object. The assays
slot holds the
counts, rowData
holds the features
, colData
will either be NULL
or hold any metadata that was present in the
reads
.
Valerie Obenchain <vobencha@fhcrc.org>
http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html home page for HTSeq
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html counting with htseq-count
DESeq
, DEXSeq
and edgeR
packages
BamFileList
BamViews
group_id <- c("A", "B", "C", "C", "D", "D", "E", "F", "G", "H", "H") features <- GRanges( seqnames = Rle(c("chr1", "chr2", "chr1", "chr1", "chr2", "chr2", "chr1", "chr1", "chr2", "chr1", "chr1")), strand = strand(rep("+", length(group_id))), ranges = IRanges( start=c(1000, 2000, 3000, 3600, 7000, 7500, 4000, 4000, 3000, 5000, 5400), width=c(500, 900, 500, 300, 600, 300, 500, 900, 500, 500, 500)), DataFrame(group_id) ) reads <- GappedAlignments( names = c("a","b","c","d","e","f","g"), rname = Rle(c(rep(c("chr1", "chr2"), 3), "chr1")), pos = as.integer(c(1400, 2700, 3400, 7100, 4000, 3100, 5200)), cigar = c("500M", "100M", "300M", "500M", "300M", "50M200N50M", "50M150N50M"), strand = strand(rep.int("+", 7L))) ## Results from countOverlaps are included to highlight how the ## modes in summarizeOverlaps count a read a maximum of once. ## When the 'features' argument is a GRanges, each row ## is treated as a different feature. rowsAsFeatures <- data.frame(union = assays(summarizeOverlaps(features, reads))$counts, intStrict = assays(summarizeOverlaps(features, reads, mode="IntersectionStrict"))$counts, intNotEmpty = assays(summarizeOverlaps(features, reads, mode="IntersectionNotEmpty"))$counts, countOverlaps = countOverlaps(features, reads)) ## When the 'features' argument is a GRangesList, each ## highest list-level is a different feature. lst <- split(features, values(features)[["group_id"]]) listAsFeatures <- data.frame(union = assays(summarizeOverlaps(lst, reads))$counts, intStrict = assays(summarizeOverlaps(lst, reads, mode="IntersectionStrict"))$counts, intNotEmpty = assays(summarizeOverlaps(lst, reads, mode="IntersectionNotEmpty"))$counts, countOverlaps = countOverlaps(lst, reads)) ## Read across BAM files and package output for DESeq or edgeR analysis library(Rsamtools) library(DESeq) library(edgeR) fls = list.files(system.file("extdata",package="GenomicRanges"), recursive=TRUE, pattern="*bam$", full=TRUE) bfl <- BamFileList(fls) features <- GRanges( seqnames = Rle(c("chr2L", "chr2R", "chr2L", "chr2R", "chr2L", "chr2R", "chr2L", "chr2R", "chr2R", "chr3L", "chr3L")), strand = strand(rep("+", 11)), ranges = IRanges(start=c(1000, 2000, 3000, 3600, 7000, 7500, 4000, 4000, 3000, 5000, 5400), width=c(500, 900, 500, 300, 600, 300, 500, 900, 500, 500, 500)) ) solap <- summarizeOverlaps(features, bfl) deseq <- newCountDataSet(countData=assays(solap)$counts, conditions=rownames(colData(solap))) edger <- DGEList(counts=assays(solap)$counts, group=rownames(colData(solap)))