letterFrequency {Biostrings}R Documentation

Calculate the frequency of letters in a biological sequence, or the consensus matrix of a set of sequences

Description

Given a biological sequence (or a set of biological sequences), the alphabetFrequency function computes the frequency of each letter in the (base) alphabet.

The consensusMatrix function computes the consensus matrix of a set of sequences, and the consensusString function creates the consensus sequence based on a 50% + 1 vote from the consensus matrix (using the "?" letter to represent the lack of consensus).

In this man page we call "DNA input" (or "RNA input") an XString, XStringSet, XStringViews or MaskedXString object of base type DNA (or RNA).

Usage

  alphabetFrequency(x, baseOnly=FALSE, freq=FALSE, ...)
  hasOnlyBaseLetters(x)
  uniqueLetters(x)

  ## S4 method for signature 'character':
  consensusMatrix(x, freq=FALSE)
  ## S4 method for signature 'XStringSet':
  consensusMatrix(x,
      baseOnly=FALSE, freq=FALSE, shift=0L, width=NULL)

  ## S4 method for signature 'matrix':
  consensusString(x)
  ## S4 method for signature 'XStringSet':
  consensusString(x, shift=0L, width=NULL)
  ## S4 method for signature 'ANY':
  consensusString(x)

Arguments

x An XString, XStringSet, XStringViews or MaskedXString object for alphabetFrequency and uniqueLetters.
DNA or RNA input for hasOnlyBaseLetters.
A character vector, or an XStringSet or XStringViews object for consensusMatrix.
A consensus matrix (as returned by consensusMatrix), or an XStringSet or XStringViews object for consensusString.
baseOnly TRUE or FALSE. If TRUE, the returned vector (or matrix) only contains the frequencies of the letters that belong to the "base" alphabet of x i.e. to the alphabet returned by alphabet(x, baseOnly=TRUE). Note that when x is not a DNA or RNA input, then specifying baseOnly has no effect.
freq If TRUE then relative frequencies are reported, otherwise counts (the default).
... Further arguments to be passed to or from other methods. For the XStringViews and XStringSet methods, the collapse argument is accepted.
shift An integer vector (recycled to the length of x) specifying how each sequence in x should be (horizontally) shifted with respect to the first column of the consensus matrix to be returned. By default (shift=0), each sequence in x has its first letter aligned with the first column of the matrix. A positive shift value means that the corresponding sequence must be shifted to the right, and a negative shift value that it must be shifted to the left. For example, a shift of 5 means that it must be shifted 5 positions to the right (i.e. the first letter in the sequence must be aligned with the 6th column of the matrix), and a shift of -3 means that it must be shifted 3 positions to the left (i.e. the 4th letter in the sequence must be aligned with the first column of the matrix).
width The number of columns of the returned matrix for the consensusMatrix method for XStringSet objects. When width=NULL (the default), then this method returns a matrix that has just enough columns to have its last column aligned with the rightmost letter of all the sequences in x after those sequences have been shifted (see the shift argument above). This ensures that any wider consensus matrix would be a "padded with zeros" version of the matrix returned when width=NULL.
The length of the returned sequence for the consensusString method for XStringSet objects.

Details

alphabetFrequency is a generic function defined in the Biostrings package.

Value

alphabetFrequency returns a numeric vector when x is an XString or MaskedXString object. When x is an XStringSet or XStringViews object, then it returns a numeric matrix with length(x) rows where the i-th row contains the frequencies for x[[i]]. If x is a DNA or RNA input, then the returned vector is named with the letters in the alphabet. If the baseOnly argument is TRUE, then the returned vector has only 5 elements: 4 elements corresponding to the 4 nucleotides + the 'other' element.
hasOnlyBaseLetters returns TRUE or FALSE indicating whether or not x contains only base letters (i.e. As, Cs, Gs and Ts for DNA input and As, Cs, Gs and Us for RNA input).
uniqueLetters returns a vector of 1-letter or empty strings. The empty string is used to represent the nul character if x happens to contain any. Note that this can only happen if the base class of x is BString.
An integer matrix with letters as row names for consensusMatrix.
A standard character string for consensusString.

Author(s)

H. Pages and P. Aboyoun

See Also

alphabet, coverage, oligonucleotideFrequency, countPDict, XString-class, XStringSet-class, XStringViews-class, MaskedXString-class, strsplit

Examples

  ## ---------------------------------------------------------------------
  ## A. BASIC alphabetFrequency() EXAMPLES
  ## ---------------------------------------------------------------------
  data(yeastSEQCHR1)
  yeast1 <- DNAString(yeastSEQCHR1)

  alphabetFrequency(yeast1)
  alphabetFrequency(yeast1, baseOnly=TRUE)
  hasOnlyBaseLetters(yeast1)
  uniqueLetters(yeast1)

  ## With input made of multiple sequences:
  library(drosophila2probe)
  probes <- DNAStringSet(drosophila2probe$sequence)
  alphabetFrequency(probes[1:50], baseOnly=TRUE)
  alphabetFrequency(probes, baseOnly=TRUE, collapse=TRUE)

  ## ---------------------------------------------------------------------
  ## B. consensus*() EXAMPLES
  ## ---------------------------------------------------------------------
  ## Read in ORF data:
  file <- system.file("extdata", "someORF.fa", package="Biostrings")
  orf <- read.DNAStringSet(file, "fasta")

  ## To illustrate, the following example assumes the ORF data
  ## to be aligned for the first 10 positions (patently false):
  orf10 <- DNAStringSet(orf, end=10)
  consensusMatrix(orf10, baseOnly=TRUE)

  ## The following example assumes the first 10 positions to be aligned
  ## after some incremental shifting to the right (patently false):
  consensusMatrix(orf10, baseOnly=TRUE, shift=0:6)
  consensusMatrix(orf10, baseOnly=TRUE, shift=0:6, width=10)

  ## For the character matrix containing the "exploded" representation
  ## of the strings, do:
  as.matrix(orf10, use.names=FALSE)

  ## consensusMatrix() can be used to just compute the alphabet frequency
  ## for each position in the input sequences:
  consensusMatrix(probes, baseOnly=TRUE)

  ## After sorting, the first 5 probes might look similar (at least on
  ## their first bases):
  consensusString(sort(probes)[1:5])

  ## ---------------------------------------------------------------------
  ## C. RELATIONSHIP BETWEEN consensusMatrix() AND coverage()
  ## ---------------------------------------------------------------------
  ## Applying colSums() on a consensus matrix gives the coverage that
  ## would be obtained by piling up (after shifting) the input sequences
  ## on top of an (imaginary) reference sequence:
  cm <- consensusMatrix(orf10, shift=0:6, width=10)
  colSums(cm)

  ## Note that this coverage can also be obtained with:
  as.integer(coverage(IRanges(rep(1, length(orf)), width(orf)), shift=0:6, width=10))

[Package Biostrings version 2.12.9 Index]