matchPDict {Biostrings} | R Documentation |
A set of functions for finding all the occurences (aka "matches" or "hits") of a set of patterns (aka the dictionary) in a reference sequence or set of reference sequences (aka the subject)
The following functions differ in what they return: matchPDict
returns the "where" information i.e. the positions in the subject of all the
occurrences of every pattern; countPDict
returns the "how many
times" information i.e. the number of occurrences for each pattern;
and whichPDict
returns the "who" information i.e. which patterns
in the preprocessed dictionary have at least one match.
vcountPDict
is similar to countPDict
but it works on
a set of reference sequences in a vectorized fashion.
This man page shows how to use these functions for exact matching of a constant width dictionary i.e. a dictionary where all the patterns have the same length (same number of nucleotides).
See ?`matchPDict-inexact`
for how to use these functions
for inexact matching or when the original dictionary has a variable width.
matchPDict(pdict, subject, algorithm="auto", max.mismatch=0, fixed=TRUE, verbose=FALSE) countPDict(pdict, subject, algorithm="auto", max.mismatch=0, fixed=TRUE, verbose=FALSE) whichPDict(pdict, subject, algorithm="auto", max.mismatch=0, fixed=TRUE, verbose=FALSE) vcountPDict(pdict, subject, algorithm="auto", max.mismatch=0, fixed=TRUE, collapse=FALSE, weight=1L, verbose=FALSE)
pdict |
A PDict object containing the preprocessed dictionary. |
subject |
An XString or MaskedXString object containing the
subject sequence for matchPDict , countPDict and whichPDict .
An XStringSet object containing the subject sequences for vcountPDict .
For now, only subjects of base class DNAString are supported. |
algorithm |
Not supported yet. |
max.mismatch |
The maximum number of mismatching letters allowed (see
?isMatching for the details).
This man page focuses on exact matching of a constant width
dictionary so max.mismatch=0 in the examples below.
See ?`matchPDict-inexact` for inexact matching.
|
fixed |
If FALSE then IUPAC extended letters are interpreted as ambiguities
(see ?isMatching for the details).
This man page focuses on exact matching of a constant width
dictionary so fixed=TRUE in the examples below.
See ?`matchPDict-inexact` for inexact matching.
|
verbose |
TRUE or FALSE .
|
collapse,weight |
collapse must be FALSE , 1 , or 2 .
If collapse=FALSE (the default), then weight is ignored
and vcountPDict returns the full matrix of counts (M0 ).
If collapse=1 , then M0 is collapsed "horizontally"
i.e. it is turned into a vector with length equal to
length(pdict) .
If weight=1L (the default), then this vector is defined by
rowSums(M0) .
If collapse=2 , then M0 is collapsed "vertically"
i.e. it is turned into a vector with length equal to
length(subject) .
If weight=1L (the default), then this vector is defined by
colSums(M0) .
If collapse=1 or collapse=2 , then the elements in
subject (collapse=1 ) or in pdict (collapse=2 )
can be weighted thru the weight argument.
In that case, the returned vector is defined by
M0 %*% rep(weight, length.out=length(subject))
and rep(weight, length.out=length(pdict)) %*% M0 ,
respectively.
|
In this man page, we assume that you know how to preprocess a dictionary
of DNA patterns that can then be used with matchPDict
,
countPDict
, whichPDict
or vcountPDict
.
Please see ?PDict
if you don't.
When using matchPDict
, countPDict
, whichPDict
or vcountPDict
for exact matching of a constant width dictionary,
the standard way to preprocess the original dictionary is by calling
the PDict
constructor on it with no extra arguments.
This returns the preprocessed dictionary in a PDict object
that can be used with any of the functions described here.
If M
denotes the number of patterns in the pdict
argument (M <- length(pdict)
), then matchPDict
returns
an MIndex object of length M
,
and countPDict
an integer vector of length M
.
whichPDict
returns an integer vector made of the indices of the
patterns in the pdict
argument that have at least one match.
If N
denotes the number of sequences in the subject
argument (N <- length(subject)
), then vcountPDict
returns an integer matrix with M
rows and N
columns,
unless the collapse
argument is used. In that case, depending
on the type of weight
, an integer or numeric vector is returned
(see above for the details).
H. Pages
Aho, Alfred V.; Margaret J. Corasick (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM 18 (6): 333-340.
PDict-class,
MIndex-class,
matchPDict-inexact,
isMatching
,
coverage,MIndex-method
,
matchPattern
,
alphabetFrequency
,
DNAString-class,
DNAStringSet-class,
XStringViews-class,
MaskedDNAString-class
## --------------------------------------------------------------------- ## A. A SIMPLE EXAMPLE OF EXACT MATCHING ## --------------------------------------------------------------------- ## Creating the pattern dictionary: library(drosophila2probe) dict0 <- DNAStringSet(drosophila2probe$sequence) dict0 # The original dictionary. length(dict0) # Hundreds of thousands of patterns. pdict0 <- PDict(dict0) # Store the original dictionary in # a PDict object (preprocessing). ## Using the pattern dictionary on chromosome 3R: library(BSgenome.Dmelanogaster.UCSC.dm3) chr3R <- Dmelanogaster$chr3R # Load chromosome 3R chr3R mi0 <- matchPDict(pdict0, chr3R) # Search... ## Looking at the matches: start_index <- startIndex(mi0) # Get the start index. length(start_index) # Same as the original dictionary. start_index[[8220]] # Starts of the 8220th pattern. end_index <- endIndex(mi0) # Get the end index. end_index[[8220]] # Ends of the 8220th pattern. count_index <- countIndex(mi0) # Get the number of matches per pattern. count_index[[8220]] mi0[[8220]] # Get the matches for the 8220th pattern. start(mi0[[8220]]) # Equivalent to startIndex(mi0)[[8220]]. sum(count_index) # Total number of matches. table(count_index) i0 <- which(count_index == max(count_index)) pdict0[[i0]] # The pattern with most occurrences. mi0[[i0]] # Its matches as an IRanges object. Views(chr3R, mi0[[i0]]) # And as an XStringViews object. ## Get the coverage of the original subject: cov3R <- as.integer(coverage(mi0, width=length(chr3R))) max(cov3R) mean(cov3R) sum(cov3R != 0) / length(cov3R) # Only 2.44% of chr3R is covered. if (interactive()) { plotCoverage <- function(cx, start, end) { plot.new() plot.window(c(start, end), c(0, 20)) axis(1) axis(2) axis(4) lines(start:end, cx[start:end], type="l") } plotCoverage(cov3R, 27600000, 27900000) } ## --------------------------------------------------------------------- ## B. NAMING THE PATTERNS ## --------------------------------------------------------------------- ## The names of the original patterns, if any, are propagated to the ## PDict and MIndex objects: names(dict0) <- mkAllStrings(letters, 4)[seq_len(length(dict0))] dict0 dict0[["abcd"]] pdict0n <- PDict(dict0) names(pdict0n)[1:30] pdict0n[["abcd"]] mi0n <- matchPDict(pdict0n, chr3R) names(mi0n)[1:30] mi0n[["abcd"]] ## This is particularly useful when unlisting an MIndex object: unlist(mi0)[1:10] unlist(mi0n)[1:10] # keep track of where the matches are coming from ## --------------------------------------------------------------------- ## C. PERFORMANCE ## --------------------------------------------------------------------- ## If getting the number of matches is what matters only (without ## regarding their positions), then countPDict() will be faster, ## especially when there is a high number of matches: count_index0 <- countPDict(pdict0, chr3R) identical(count_index0, count_index) # TRUE if (interactive()) { ## What's the impact of the dictionary width on performance? ## Below is some code that can be used to figure out (will take a long ## time to run). For different widths of the original dictionary, we ## look at: ## o pptime: preprocessing time (in sec.) i.e. time needed for ## building the PDict object from the truncated input ## sequences; ## o nnodes: nb of nodes in the resulting Aho-Corasick tree; ## o nupatt: nb of unique truncated input sequences; ## o matchtime: time (in sec.) needed to find all the matches; ## o totalcount: total number of matches. getPDictStats <- function(dict, subject) { ans_width <- width(dict[1]) ans_pptime <- system.time(pdict <- PDict(dict))[["elapsed"]] pptb <- pdict@threeparts@pptb ans_nnodes <- length(pptb@nodes) %/% Biostrings:::.ACtree.ints_per_acnode(pptb) ans_nupatt <- sum(!duplicated(pdict)) ans_matchtime <- system.time( mi0 <- matchPDict(pdict, subject) )[["elapsed"]] ans_totalcount <- sum(countIndex(mi0)) list( width=ans_width, pptime=ans_pptime, nnodes=ans_nnodes, nupatt=ans_nupatt, matchtime=ans_matchtime, totalcount=ans_totalcount ) } stats <- lapply(6:25, function(width) getPDictStats(DNAStringSet(dict0, end=width), chr3R)) stats <- data.frame(do.call(rbind, stats)) stats } ## --------------------------------------------------------------------- ## D. vcountPDict() ## --------------------------------------------------------------------- subject <- Dmelanogaster$upstream1000[1:200] subject mat1 <- vcountPDict(pdict0, subject) dim(mat1) # length(pdict0) x length(subject) nhit_per_probe <- rowSums(mat1) table(nhit_per_probe) ## Without vcountPDict(), 'mat1' could have been computed with: mat2 <- sapply(unname(subject), function(x) countPDict(pdict0, x)) identical(mat1, mat2) # TRUE ## but using vcountPDict() is faster (10x or more, depending of the ## average length of the sequences in 'subject'). if (interactive()) { ## This will fail (with message "allocMatrix: too many elements ## specified") because, on most platforms, vectors and matrices in R ## are limited to 2^31 elements: subject <- Dmelanogaster$upstream1000 vcountPDict(pdict0, subject) length(pdict0) * length(Dmelanogaster$upstream1000) 1 * length(pdict0) * length(Dmelanogaster$upstream1000) # > 2^31 ## But this will work: nhit_per_seq <- vcountPDict(pdict0, subject, collapse=2) sum(nhit_per_seq >= 1) # nb of subject sequences with at least 1 hit table(nhit_per_seq) which(nhit_per_seq == 37) # 603 sum(countPDict(pdict0, subject[[603]])) # 37 } ## --------------------------------------------------------------------- ## E. RELATIONSHIP BETWEEN vcountPDict(), countPDict() AND ## vcountPattern() ## --------------------------------------------------------------------- dict3 <- DNAStringSet(mkAllStrings(DNA_BASES, 3)) # all trinucleotides dict3 pdict3 <- PDict(dict3) subject <- Dmelanogaster$upstream1000 subject ## The 3 following calls are equivalent (from faster to slower): mat3a <- vcountPDict(pdict3, subject) mat3b <- sapply(dict3, function(pattern) vcountPattern(pattern, subject)) mat3c <- sapply(unname(subject), function(x) countPDict(pdict3, x)) stopifnot(identical(mat3a, t(mat3b))) stopifnot(identical(mat3a, mat3c)) ## The 2 following calls are equivalent (from faster to slower): nhitpp3a <- vcountPDict(pdict3, subject, collapse=1) # rowSums(mat3a) nhitpp3b <- sapply(dict3, function(pattern) sum(vcountPattern(pattern, subject))) stopifnot(identical(nhitpp3a, nhitpp3b)) ## The 2 following calls are equivalent (from faster to slower): nhitps3a <- vcountPDict(pdict3, subject, collapse=2) # colSums(mat3a) nhitps3b <- sapply(unname(subject), function(x) sum(countPDict(pdict3, x))) stopifnot(identical(nhitps3a, nhitps3b))