Title: | Interface for Popular Multiple Sequence Alignment Tools |
---|---|
Description: | Seamlessly interfaces the Multiple Sequence Alignment software packages ClustalW, MAFFT, MUSCLE and Kalign (downloaded separately) and provides support to calcualte distances between sequences. This work was partially supported by grant no. R21HG005912 from the National Human Genome Research Institute. |
Authors: | Michael Hahsler, Anurag Nagar |
Maintainer: | Michael Hahsler <[email protected]> |
License: | GPL-3 |
Version: | 0.99.1 |
Built: | 2024-11-19 04:56:01 UTC |
Source: | https://github.com/mhahsler/rMSA |
Executes boxshade on a multiple sequence alignment.
boxshade(x, file, dev="pdf", param="-thr=0.5 -cons -def", pdfCrop=TRUE) boxshade_help()
boxshade(x, file, dev="pdf", param="-thr=0.5 -cons -def", pdfCrop=TRUE) boxshade_help()
x |
a multiple alignment as an object of class DNAMultipleAlignment, RNAMultipleAlignment or AAMultipleAlignment. |
file |
output file |
dev |
used output device. Available are: ps, eps, hpgl, rtf, crt, ansi, vt, ascii, fig, pict, html and pdf. |
param |
character string with the command line parameters for clustal (see output of |
pdfCrop |
crop the pdf file if it is smaller than a page. Use
|
For installation details see: https://github.com/mhahsler/rMSA/blob/master/INSTALL
Only a file is created.
Michael Hahsler
Boxshade has been written by Kay Hofmann and Michael D. Baron
## Not run: rna <- readRNAStringSet(system.file("examples/RNA_example.fasta", package="rMSA")) rna <- narrow(rna, start=1, end=50) al <- clustal(rna) boxshade(al, file="alignment.pdf", dev="pdf") ## End(Not run)
## Not run: rna <- readRNAStringSet(system.file("examples/RNA_example.fasta", package="rMSA")) rna <- narrow(rna, start=1, end=50) al <- clustal(rna) boxshade(al, file="alignment.pdf", dev="pdf") ## End(Not run)
Executes Clustal on a set of sequences to obtain a multiple sequence alignment.
clustal(x, param) clustal_help()
clustal(x, param) clustal_help()
x |
an object of class XStringSet (e.g., DNAStringSet) with the sequences to be aligned. |
param |
character string with the command line parameters for clustal (see output of |
For installation details see: https://github.com/mhahsler/rMSA/blob/master/INSTALL
An object of class DNAMultipleAlignment (see BioStrings).
Michael Hahsler
Larkin M., et al. Clustal W and Clustal X version 2.0, Bioinformatics 2007 23(21):2947-29
## Not run: ### DNA dna <- readDNAStringSet(system.file("examples/DNA_example.fasta", package="rMSA")) dna al <- clustal(dna) al ### inspect alignment detail(al) ### plot a sequence logo for the first 20 positions plot(al, 1, 20) ### RNA rna <- readRNAStringSet(system.file("examples/RNA_example.fasta", package="rMSA")) rna al <- clustal(rna) al ### Proteins aa <- readAAStringSet(system.file("examples/Protein_example.fasta", package="rMSA")) aa al <- clustal(aa) al ## End(Not run)
## Not run: ### DNA dna <- readDNAStringSet(system.file("examples/DNA_example.fasta", package="rMSA")) dna al <- clustal(dna) al ### inspect alignment detail(al) ### plot a sequence logo for the first 20 positions plot(al, 1, 20) ### RNA rna <- readRNAStringSet(system.file("examples/RNA_example.fasta", package="rMSA")) rna al <- clustal(rna) al ### Proteins aa <- readAAStringSet(system.file("examples/Protein_example.fasta", package="rMSA")) aa al <- clustal(aa) al ## End(Not run)
Implements different methods to calculate distance between sets of sequences based on k-mer distribution, edit distance/alignment or evolutionary distance.
# k-mer-based methods distFFP(x, k=3, method="JSD", normalize=TRUE) distCV(x, k=3) distNSV(x, k=3, method="Manhattan", normalize=FALSE) distKMer(x, k=3) distSimRank(x, k=7) # edit distance/alignment distEdit(x) distAlignment(x, substitutionMatrix=NULL, ...) # evolutionary distance distApe(x, model="K80" ,...)
# k-mer-based methods distFFP(x, k=3, method="JSD", normalize=TRUE) distCV(x, k=3) distNSV(x, k=3, method="Manhattan", normalize=FALSE) distKMer(x, k=3) distSimRank(x, k=7) # edit distance/alignment distEdit(x) distAlignment(x, substitutionMatrix=NULL, ...) # evolutionary distance distApe(x, model="K80" ,...)
x |
an object of class XStringSet containing the sequences. For
|
k |
size of used k-mers. |
method |
metric used to calculate the dissimilarity between two k-mer frequency distributions. |
substitutionMatrix |
matrix with substitution scores (defaults to a matrix with match=1, mismatch=0) |
normalize |
normalize the k-mer frequencies by the total number of k-mers in the sequence. |
model |
evolutionary model used. |
... |
further arguments passed on. |
Feature frequency profile (distFFP
): A FFP is the
normalized (by the number of k-mers in the sequence) count of each possible
k-mer in a sequence. The distance is defined
as the Jensen-Shannon divergence (JSD) between FFPs (Sims and Kim, 2011).
Composition Vector (distCV
): A CV is a vector with the
frequencies of each k-mer in the sequency minus the expected frequency
of random background of neutral mutations obtained from a Markov Model.
The cosine distance is used between CVs. (Qi et al, 2007).
Numerical Summarization Vector (distNSV
): An NSV is
frequency distribution of all possible k-mers in a sequence.
The Manhattan distance is used between NSVs (Nagar and Hahsler, 2013).
Distance between sets of k-mers (distkMer
): Each
sequence is represented as a set of k-mers. The Jaccard (binary) distance is
used between sets (number of unique shared k-mers over the total number of
unique k-mers in both sequences).
Distance based on SimRank (distSimRank
): 1-simRank
(see simRank
).
Edit (Levenshtein) Distance (distEdit
): Edit distance
between sequences.
Distance based on alignment score (distAlignment
):
see stringDist
in Biostrings.
Evolutionary distances (distApe
):
see dist.dna
in ape.
A dist object.
Michael Hahsler
Sims, GE; Kim, SH (2011 May 17). "Whole-genome phylogeny of Escherichia coli/Shigella group by feature frequency profiles (FFPs).". Proceedings of the National Academy of Sciences of the United States of America 108 (20): 8329-34. PMID 21536867.
Gao, L; Qi, J (2007 Mar 15). "Whole genome molecular phylogeny of large dsDNA viruses using composition vector method.". BMC evolutionary biology 7: 41. PMID 17359548.
Qi J, Wang B, Hao B: Whole Proteome Prokaryote Phylogeny without Sequence Alignment: A K-String Composition Approach. Journal of Molecular Evolution 2004, 58:1-11.
Anurag Nagar; Michael Hahsler (2013). "Fast discovery and visualization of conserved regions in DNA sequences using quasi-alignment." BMC Bioinformatics, 14(Suppl. 11), 2013
s <- mutations(random_sequences(100), 100) s ### calculate NSV distance dNSV <- distNSV(s) ### relationship with edit distance dEdit <- distEdit(s) df <- data.frame(dNSV=as.vector(dNSV), dEdit=as.vector(dEdit)) plot(sapply(df, jitter), cex=.1) ### add lower bound (2*k, for Manhattan distance) abline(0,1/(2*3), col="red", lwd=2) ### add regression line abline(lm(dEdit~dNSV, data=df), col="blue", lwd=2) ### check correlation cor(dNSV,dEdit)
s <- mutations(random_sequences(100), 100) s ### calculate NSV distance dNSV <- distNSV(s) ### relationship with edit distance dEdit <- distEdit(s) df <- data.frame(dNSV=as.vector(dNSV), dEdit=as.vector(dEdit)) plot(sapply(df, jitter), cex=.1) ### add lower bound (2*k, for Manhattan distance) abline(0,1/(2*3), col="red", lwd=2) ### add regression line abline(lm(dEdit~dNSV, data=df), col="blue", lwd=2) ### check correlation cor(dNSV,dEdit)
Runs Kalign progressive multiple sequence alignment on a set of sequences.
kalign(x, param=NULL) kalign_help()
kalign(x, param=NULL) kalign_help()
x |
an object of class DNAStringSet with the sequences to be aligned. |
param |
character string with the command line parameters for kalign (see output of |
For installation details see: https://github.com/mhahsler/rMSA/blob/master/INSTALL
An object of class DNAMultipleAlignment (see BioStrings).
Michael Hahsler
Lassmann T., Sonnhammer E. Kalign - an accurate and fast multiple sequence alignment algorithm, BMC Bioinformatics 2005, 6:298
## Not run: dna <- readDNAStringSet(system.file("examples/DNA_example.fasta", package="rMSA")) dna ### align the sequences al <- kalign(dna) al ## End(Not run)
## Not run: dna <- readDNAStringSet(system.file("examples/DNA_example.fasta", package="rMSA")) dna ### align the sequences al <- kalign(dna) al ## End(Not run)
Executes mafft on a set of sequences to obtain a multiple sequence alignment.
mafft(x, param="--auto") mafft_help()
mafft(x, param="--auto") mafft_help()
x |
an object of class XStringSet (e.g., DNAStringSet) with the sequences to be aligned. |
param |
character string with the command line parameters (see output of |
For installation details see: https://github.com/mhahsler/rMSA/blob/master/INSTALL
An object of class DNAMultipleAlignment (see BioStrings).
Michael Hahsler
Katoh, Standley 2013 (Molecular Biology and Evolution 30:772-780) MAFFT multiple sequence alignment software version 7: improvements in performance and usability.
## Not run: ### DNA dna <- readDNAStringSet(system.file("examples/DNA_example.fasta", package="rMSA")) dna al <- mafft(dna) al ### inspect alignment detail(al) ### plot a sequence logo for the first 20 positions plot(al, 1, 20) ### RNA rna <- readRNAStringSet(system.file("examples/RNA_example.fasta", package="rMSA")) rna al <- mafft(rna) al ### Proteins aa <- readAAStringSet(system.file("examples/Protein_example.fasta", package="rMSA")) aa al <- mafft(aa) al ## End(Not run)
## Not run: ### DNA dna <- readDNAStringSet(system.file("examples/DNA_example.fasta", package="rMSA")) dna al <- mafft(dna) al ### inspect alignment detail(al) ### plot a sequence logo for the first 20 positions plot(al, 1, 20) ### RNA rna <- readRNAStringSet(system.file("examples/RNA_example.fasta", package="rMSA")) rna al <- mafft(rna) al ### Proteins aa <- readAAStringSet(system.file("examples/Protein_example.fasta", package="rMSA")) aa al <- mafft(aa) al ## End(Not run)
Executes MUSLCE on a set of sequences to obtain a multiple sequence alignment.
muscle(x, param="") muscle_help()
muscle(x, param="") muscle_help()
x |
an object of class XStringSet (e.g., DNAStringSet) with the sequences to be aligned. |
param |
character string with the command line parameters (see output of |
For installation details see: https://github.com/mhahsler/rMSA/blob/master/INSTALL
An object of class DNAMultipleAlignment (see BioStrings).
Michael Hahsler
Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res. 32(5):1792-1797
Edgar, R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity, BMC Bioinformatics, (5) 113
## Not run: ### DNA dna <- readDNAStringSet(system.file("examples/DNA_example.fasta", package="rMSA")) dna al <- muscle(dna) al ### inspect alignment detail(al) ### plot a sequence logo for the first 20 positions plot(al, 1, 20) ### RNA rna <- readRNAStringSet(system.file("examples/RNA_example.fasta", package="rMSA")) rna al <- MUSCLE(rna) al ### Proteins aa <- readAAStringSet(system.file("examples/Protein_example.fasta", package="rMSA")) aa al <- MUSCLE(aa) al ## End(Not run)
## Not run: ### DNA dna <- readDNAStringSet(system.file("examples/DNA_example.fasta", package="rMSA")) dna al <- muscle(dna) al ### inspect alignment detail(al) ### plot a sequence logo for the first 20 positions plot(al, 1, 20) ### RNA rna <- readRNAStringSet(system.file("examples/RNA_example.fasta", package="rMSA")) rna al <- MUSCLE(rna) al ### Proteins aa <- readAAStringSet(system.file("examples/Protein_example.fasta", package="rMSA")) aa al <- MUSCLE(aa) al ## End(Not run)
Creates a set of sequences which are random mutations (with base changes, insertions and deletions) for a given DNA, RNA or AA sequence.
mutations(x, number=1, change=0.01, insertion=0.01, deletion=0.01, prob=NULL)
mutations(x, number=1, change=0.01, insertion=0.01, deletion=0.01, prob=NULL)
x |
A XString or an XStringSet of length 1. |
number |
number of sequences to create. |
change , insertion , deletion
|
probability of this operation. |
prob |
a named vector with letter probabilities. 4 for DNA and RNA and 20
for AA (see |
A XStringSet.
Michael Hahsler
### create random sequences s <- random_sequences(100, number=1) s ### create 10 sequences with 1 percent base changes, insertions and deletions m <- mutations(s, 10, change=0.01, insertion=0.01, deletion=0.01) m ### calculate edit distance between the original sequence and the mutated ### sequences stringDist(c(s,m)) ### multiple sequence alignment ## Not run: al <- clustal(c(s,m)) detail(al) ## End(Not run)
### create random sequences s <- random_sequences(100, number=1) s ### create 10 sequences with 1 percent base changes, insertions and deletions m <- mutations(s, 10, change=0.01, insertion=0.01, deletion=0.01) m ### calculate edit distance between the original sequence and the mutated ### sequences stringDist(c(s,m)) ### multiple sequence alignment ## Not run: al <- clustal(c(s,m)) detail(al) ## End(Not run)
Plots genetic sequences (RNA/DNA) using sequence logos.
plot
creates a sequence logo. Parameters are
start
(position to start the logo),
end
(position to end the logo),
ic.scale
(if TRUE then each column are scaled proportional to its information content).
seqLogo
in seqLogo.
Creates a set of random DNA, RNA or AA sequences.
random_sequences(len, number=1, prob=NULL, type=c("DNA", "RNA", "AA"))
random_sequences(len, number=1, prob=NULL, type=c("DNA", "RNA", "AA"))
len |
sequence length |
number |
number of sequences in the set |
prob |
a named vector with letter probabilities or a transition
probability matrix (as produced by
|
type |
sequence type |
A XStringSet.
Michael Hahsler
### create random sequences (using given letter frequencies) seqs <- random_sequences(100, number=10, prob=c(a=.5, c=.3, g=.1, t=.1)) seqs ### check letter frequencies summary(oligonucleotideFrequency(seqs, width=1, as.prob=TRUE)) ### creating random sequences using a random dinocleodite transition matrix prob <- matrix(runif(16), nrow=4, ncol=4, dimnames=list(DNA_BASES, DNA_BASES)) prob <- prob/rowSums(prob) seqs <- random_sequences(100, number=10, prob=prob) seqs ### check dinocleodite transition probabilities prob oligonucleotideTransitions(seqs, as.prob=TRUE)
### create random sequences (using given letter frequencies) seqs <- random_sequences(100, number=10, prob=c(a=.5, c=.3, g=.1, t=.1)) seqs ### check letter frequencies summary(oligonucleotideFrequency(seqs, width=1, as.prob=TRUE)) ### creating random sequences using a random dinocleodite transition matrix prob <- matrix(runif(16), nrow=4, ncol=4, dimnames=list(DNA_BASES, DNA_BASES)) prob <- prob/rowSums(prob) seqs <- random_sequences(100, number=10, prob=prob) seqs ### check dinocleodite transition probabilities prob oligonucleotideTransitions(seqs, as.prob=TRUE)
Computes the SimRank similarity (number of shared unique k-mers over the smallest number of unique k-mers.)
simRank(x, k = 7)
simRank(x, k = 7)
x |
an object of class DNAStringSet containing the sequences. |
k |
size of used k-mers. |
distSimRank()
returns 1-simRank()
.
simRank()
returns a similarity object of class "simil" (see proxy).
distSimRank()
returns a dist object.
Michael Hahsler
Santis et al, Simrank: Rapid and sensitive general-purpose k-mer search tool, BMC Ecology 2011, 11:11
### load sequences sequences <- readDNAStringSet(system.file("examples/DNA_example.fasta", package="rMSA")) sequences ### compute similarity simil <- simRank(sequences) ### use hierarchical clustering hc <- hclust(distSimRank(sequences)) plot(hc)
### load sequences sequences <- readDNAStringSet(system.file("examples/DNA_example.fasta", package="rMSA")) sequences ### compute similarity simil <- simRank(sequences) ### use hierarchical clustering hc <- hclust(distSimRank(sequences)) plot(hc)
These convenience function can be used to convert character strings into vectors of single characters and back.
c2s(x) s2c(x)
c2s(x) s2c(x)
x |
for |
Either a single character string or a vector of single characters.
Michael Hahsler
s <- sample(c("A", "C", "G", "T"), 10, replace = TRUE) s s2 <- c2s(s) s2 s2c(s2)
s <- sample(c("A", "C", "G", "T"), 10, replace = TRUE) s s2 <- c2s(s) s2 s2c(s2)