Using GramAlign

Licensing

This program is freely available for academic use, without any warranty. Commercial distribution of this program, in whole or in part, requires prior agreement with the authors.

Install

GramAlign is written in ANSI C, so it should build without error on any platform with an ANSI C compiler. Use the following commands to build the program:

cd src
make clean
make

At this point, the executable GramAlign (on Linux/Unix/Mac OS X) or GramAlign.exe (on Windows) will be in the src directory. Copy the executable to any directory in your PATH.

Usage

To run GramAlign from a command-line window, use the following command:

/path/to/executable/GramAlign [options]

Examples

./GramAlign -i ~/Desktop/input.fasta -o ~/Desktop/output.txt

The example above performs an alignment for the sequences in the file "input.fasta", which is located on the user's desktop. The output is written to a file named "output.txt", also on the desktop.

./GramAlign -f 0 -C -i ~/Desktop/input.fasta -o ~/Desktop/output.txt

The example above creates a distance matrix for the sequences in the file "input.fasta". Supplying the "-C" option tells the program to compute the full distance matrix, and the program's output is written to "output.txt".
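For scripted runs, the two command lines above can be assembled programmatically. The following is a minimal sketch, not part of GramAlign itself: the helper name gramalign_command is hypothetical, and it uses only the -f, -C, -i and -o flags documented in this guide.

```python
def gramalign_command(executable, infile, outfile, out_format=None, complete_matrix=False):
    """Build a GramAlign argument list from flags documented in this guide.

    Hypothetical helper for illustration; it does not validate the values.
    """
    args = [executable]
    if out_format is not None:
        args += ["-f", str(out_format)]   # output file format (0 = distance matrix)
    if complete_matrix:
        args.append("-C")                 # force the complete distance matrix
    args += ["-i", infile, "-o", outfile]
    return args


cmd = gramalign_command("./GramAlign", "input.fasta", "output.txt",
                        out_format=0, complete_matrix=True)
print(" ".join(cmd))  # ./GramAlign -f 0 -C -i input.fasta -o output.txt
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the executable has been built. Note that subprocess does not expand "~", so paths such as ~/Desktop/input.fasta would need os.path.expanduser first.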

Options

General Options

-h

Help! Display the command-line options.

-q

Turn on "quiet mode", which prevents GramAlign from displaying any text. The default is verbose mode, in which GramAlign reports its progress during the alignment procedure.

-S [value]

Specify the maximum trace-back matrix size before temporary paging begins. The largest one-time memory requirement for GramAlign occurs during each pairwise alignment, when the trace-back matrix (an N_i x N_j matrix of bytes, where N_i and N_j are the lengths of sequences i and j) is needed for the backward portion of the dynamic programming procedure. If both sequences being aligned are large (relative to the computer's available physical memory), this block of memory can become so large that the computer spends a significant amount of processing time paging physical memory to and from the system virtual memory (i.e., hard-drive area). The value specified here sets a maximum size for the trace-back matrix; beyond it, GramAlign will (much more) efficiently copy pieces of the trace-back matrix to "page files" within the current directory. At the end of the trace-back procedure, these temporary page files are removed. For systems with larger physical memory, this amount should be increased. If this option is not specified, the default value is 100 (i.e., 100 million bytes), which indicates that the value is given in units of one million bytes. Note that the temporary page files are named "_ga_temp.pagexxxxx", where xxxxx is replaced with the page number. These files are safe to delete as long as GramAlign is not running.
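To judge whether a given pair of sequences is likely to trigger paging, the size of the trace-back matrix can be estimated by hand. The sketch below assumes one byte per matrix cell (as stated above), assumes the -S value counts millions of bytes (inferred from the default of 100 being 100 million bytes), and assumes paging starts once the matrix is strictly larger than the limit. The function names are hypothetical and are not part of GramAlign.

```python
BYTES_PER_UNIT = 1_000_000  # assumption: the -S value is in millions of bytes


def traceback_bytes(len_i, len_j):
    """Size of an N_i x N_j trace-back matrix at one byte per cell."""
    return len_i * len_j


def needs_paging(len_i, len_j, s_value=100):
    """True if the matrix would exceed the -S limit (assumed strict comparison)."""
    return traceback_bytes(len_i, len_j) > s_value * BYTES_PER_UNIT


print(traceback_bytes(20_000, 20_000))  # 400000000
print(needs_paging(20_000, 20_000))     # True
print(needs_paging(5_000, 5_000))       # False
```

Under these assumptions, two sequences of 20,000 residues each need about 400 million bytes, which is above the default limit, while two sequences of 5,000 residues need only 25 million bytes.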

File Options

-i [filename]

Specify the input file, which needs to be in FASTA format. If this option is not used, the default file name is "infile". The type of input sequence (nucleotide or amino acid) is determined by the input file format command line option (-F).

-o [filename]

Specify the output file name. If this option is not used, the default file name is "outfile".

-f [value]

Specify the output file format. A value of 0 will output the grammar-based distance matrix. A value of 1 will output the alignment in PHYLIP format. A value of 2 will output the alignment in Aligned FASTA format. A value of 3 will output the alignment in MSF/GCG format. A value of 4 will output the consensus sequence in HTML, including gaps in the alignment. A value of 5 will output the consensus sequence in HTML, ignoring any gap elements in the alignment. If this option is not specified, the default file format is PHYLIP (a value of 1).

-F [value]

Specify the input file format. A value of 0 will cause GramAlign to automatically detect whether the input file contains amino acid sequences. The auto-detection checks whether any character other than A, C, G, T, U, or X is part of a sequence. Should any other character appear in any of the input sequences, the program will align the sequences as though the input file contains only amino acid sequences. A value of 1 will force the alignment to assume all sequences are either DNA or RNA. A value of 2 will force the alignment to assume all sequences are amino acid sequences. If this option is not specified, the default is to automatically detect the input file type.
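The auto-detection rule described for a value of 0 can be sketched in a few lines. This is an illustration of the rule as stated here, not GramAlign's actual code: the function name is hypothetical, and the sketch assumes case-insensitive matching and sequences that contain no whitespace or gap characters.

```python
NUCLEOTIDE_CHARS = set("ACGTUX")  # characters treated as DNA/RNA by the rule above


def looks_like_amino_acid(sequences):
    """Return True if any sequence holds a character outside A, C, G, T, U, X.

    Assumption: letter case is ignored; the source does not say how case is handled.
    """
    return any(ch not in NUCLEOTIDE_CHARS
               for seq in sequences
               for ch in seq.upper())


print(looks_like_amino_acid(["ACGTTGCA", "ACGUUGCA"]))  # False
print(looks_like_amino_acid(["ACGTTGCA", "MKVLAT"]))    # True
```

A single non-nucleotide character in any one sequence is enough to treat the whole input as amino acid data, which is why -F 1 is useful for nucleotide files that contain other symbols.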

Distance Matrix Options

-C

Force GramAlign to generate a complete distance matrix prior to making the spanning tree. By default, GramAlign generates a partial distance matrix with a time complexity on the order of N log N. This option ensures the most accurate grammar-based spanning tree, but requires a time complexity on the order of N^2. In creating the partial distance matrix, one initial column is completely filled in and divided into two clusters (one with the smallest distances and the other with the largest distances). Then each cluster is processed recursively (i.e., each cluster is calculated and divided into two clusters). This works because of the transitivity of grammars (i.e., if the initial sequence has a short grammar distance to two other sequences, then those two sequences are likely to have a short grammar distance to each other).
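The recursive construction of the partial distance matrix can be illustrated with a toy sketch. Several details here are assumptions rather than GramAlign's actual behavior: the first item of each cluster is used as the pivot, each cluster is split at the median of the pivot distances, and a simple numeric difference stands in for the grammar-based distance. The function name partial_distances is hypothetical.

```python
def partial_distances(items, dist, known=None):
    """Fill in part of a distance matrix by recursive two-way clustering.

    items: list of unique item identifiers; dist: function of two identifiers.
    Returns a dict mapping (pivot, other) pairs to distances.
    """
    if known is None:
        known = {}
    if len(items) < 2:
        return known
    pivot, rest = items[0], items[1:]
    # Fill one complete "column": distances from the pivot to the whole cluster.
    for other in rest:
        known[(pivot, other)] = dist(pivot, other)
    # Split into a near cluster and a far cluster (assumed: split at the median).
    ordered = sorted(rest, key=lambda o: known[(pivot, o)])
    half = len(ordered) // 2
    partial_distances(ordered[:half], dist, known)
    partial_distances(ordered[half:], dist, known)
    return known


# Toy data: numbers stand in for sequences, absolute difference for the distance.
values = [0, 1, 2, 3, 10, 11, 12, 13]
known = partial_distances(list(range(len(values))),
                          lambda a, b: abs(values[a] - values[b]))
print(len(known))                             # 13 distances computed
print(len(values) * (len(values) - 1) // 2)   # 28 in the complete matrix
```

For eight items the sketch evaluates 13 of the 28 possible pairs; the saving grows with the number of sequences, at the cost of never measuring some pairs directly.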

-M

Disable use of the merged amino acid alphabet. As discussed in our paper on GramAlign, we developed a merged alphabet whereby certain amino acid characters were found to have similar row scores within the substitution matrices. We were able to reduce the original 23 characters into a set of 11 characters. This ability is particularly useful for the grammar-based distance calculation. This option will disable using the merged alphabet. This option is ignored for nucleotide sequences.

-T [value]

Specify the relative grammar-based similarity threshold. All sequences that have a relative complexity measure below this threshold will be grouped together. Sequences within each group will be aligned to each other first. Then a consensus sequence for each group will be aligned to the overall alignment ensemble. Lower thresholds require sequences to be more similar before they are grouped together. If this option is not specified, the default value is 0.10.

Alignment Heuristic Options

-v [value]

Specify the alignment overhang percentage for similar sequences. The Needleman-Wunsch based dynamic programming method is used to perform pairwise alignment. Normally every residue of one sequence is compared to every residue of the second sequence. For nearly identical sequences especially, this is unnecessary. This value represents the percent of residues shifted past both ends of the longer sequence. Smaller values result in quicker alignments, but at the risk of increased mismatches. If this option is not specified, the default value is 0.10.

-V [value]

Specify the alignment overhang percentage for dissimilar sequences. The Needleman-Wunsch based dynamic programming method is used to perform pairwise alignment. Normally every residue of one sequence is compared to every residue of the second sequence. For nearly identical sequences especially, this is unnecessary. This value represents the percent of residues shifted past both ends of the longer sequence. Smaller values result in quicker alignments, but at the risk of increased mismatches. If this option is not specified, the default value is 0.25.

Alignment Scoring Options

-g [value]

Specify the gap-open cost. At the core of the alignment algorithm is the pairwise alignment algorithm used to progressively align each sequence not already in the alignment. This pairwise process is the well-known Needleman-Wunsch dynamic programming algorithm modified for affine gap penalties. The value specified in this option is the cost assigned whenever a gap in a sequence is started. If this option is not specified, the default value is 15.2 for protein sequences or 8.7 for DNA sequences. When the GONNET250 substitution matrix is used, this value is multiplied by 10.

-G [value]

Specify the tail gap-open cost. The value specified in this option is the cost assigned if a gap is started at either the beginning or the end of the alignment. If this option is not specified, the default value is 15.2 for protein sequences or 8.7 for DNA sequences. When the GONNET250 substitution matrix is used, this value is multiplied by 10.

-e [value]

Specify the gap-extension cost. The value specified in this option is the cost assigned each time a gap in a sequence is extended by an additional character. If this option is not specified, the default value is 0.6 for protein sequences or 0.8 for DNA sequences. When the GONNET250 substitution matrix is used, this value is multiplied by 10.

-E [value]

Specify the tail gap-extension cost. The value specified in this option is the cost assigned each time a gap is extended at either the beginning or the end of the alignment. If this option is not specified, the default value is 0.3 for protein sequences or 0.4 for DNA sequences. When the GONNET250 substitution matrix is used, this value is multiplied by 10.
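As a worked illustration of how the open and extension costs combine under an affine gap model, consider the sketch below. It assumes the common reading in which the first gap character pays the open cost and every additional character pays the extension cost; whether GramAlign also charges an extension for the first character is not stated here. The function name and the scale parameter are hypothetical; scale=10 models the multiplication applied with the GONNET250 matrix.

```python
def gap_cost(length, gap_open, gap_extend, scale=1):
    """Total cost of one gap of the given length under an affine model.

    Assumption: cost = open + (length - 1) * extend, then scaled.
    """
    if length <= 0:
        return 0.0
    return scale * (gap_open + (length - 1) * gap_extend)


# Protein defaults (-g 15.2, -e 0.6) with the GONNET250 scaling of 10.
print(f"{gap_cost(3, 15.2, 0.6, scale=10):.1f}")  # 164.0
# DNA defaults (-g 8.7, -e 0.8), no scaling.
print(f"{gap_cost(3, 8.7, 0.8):.1f}")             # 10.3
# Tail gap with protein defaults (-G 15.2, -E 0.3), no scaling.
print(f"{gap_cost(3, 15.2, 0.3):.1f}")            # 15.8
```

Because the extension cost is much smaller than the open cost, one long gap is far cheaper than several short gaps of the same total length, and the lower tail extension defaults make overhanging ends cheaper still.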

-m [value]

Specify the amino acid substitution matrix. Another important piece of the Needleman-Wunsch pairwise alignment procedure is the substitution scoring matrix. A value of 0 will use the GONNET250 matrix. A value of 1 will use the BLOSUM45 matrix. A value of 2 will use the BLOSUM62 matrix. A value of 3 will use the BLOSUM80 matrix. Note, when using the GONNET250 matrix, the values specified by -g, -G, -e and -E are all multiplied by 10 to account for the relative differences between the GONNET and BLOSUM matrices. If this option is not specified, the default substitution matrix is the GONNET250. This option is ignored for nucleotide sequences, which use a simple matrix of positive diagonal entries and negative off-diagonal entries.

Alignment Gap Filter Options

-t [value]

Specify the percentage of gaps in a column before a blind adjustment can occur. At the end of the multiple sequence alignment algorithm, the alignment is scanned for columns containing at least as many gaps as specified via this percentage (e.g., 0 = 0% = zero gaps in the column, 1.0 = 100% = column with all gaps, 0.5 = 50% = at least half of the column entries are gaps). If any column contains at least this many gaps, a surrounding window (specified in the -w option) of columns is checked for possible gaps that may be shifted into the current column. To disable this action, simply set this value to 1.0 (setting the threshold to be columns that contain nothing but gaps, which are non-existent based on the pairwise alignment process). If this option is not specified, the default value is 1.0 (i.e., 100%).

-w [value]

Specify the number of columns in the gap-adjustment window. Regarding the process discussed in the (-t) option, when a column in the initial multiple sequence alignment is found to have at least the necessary number of gaps to be adjusted, this value determines the number of neighboring columns on either side to be scanned for gaps that may be shifted into the current column. Another way to disable this blind shifting is setting this value to 0. If this option is not specified, the default value is 0.
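The column scan described for the -t and -w options can be sketched as follows. This only identifies which columns meet the gap threshold and which neighboring columns fall inside the window; it does not attempt the gap shifting itself, whose rules are not described here. The function names are hypothetical, and "-" is assumed to be the gap character.

```python
def gap_fraction(column):
    """Fraction of entries in a column that are gaps (assumed gap symbol: '-')."""
    return column.count("-") / len(column)


def columns_to_adjust(alignment, threshold=1.0, window=0):
    """Map each qualifying column index to the neighboring columns to scan.

    alignment: list of equal-length aligned strings.
    threshold: value of -t (0.5 means at least half the entries are gaps).
    window:    value of -w (columns scanned on either side).
    """
    n_cols = len(alignment[0])
    result = {}
    for c in range(n_cols):
        column = [row[c] for row in alignment]
        if gap_fraction(column) >= threshold:
            lo, hi = max(0, c - window), min(n_cols - 1, c + window)
            result[c] = [k for k in range(lo, hi + 1) if k != c]
    return result


alignment = ["AC-GT",
             "A--GT",
             "ACTGT",
             "A-TG-"]
print(columns_to_adjust(alignment, threshold=0.5, window=1))  # {1: [0, 2], 2: [1, 3]}
print(columns_to_adjust(alignment))                           # {}
```

With the default values (-t 1.0, -w 0) no column qualifies in this example, which matches the description of the defaults as disabling the adjustment.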

Secondary Structure Options (Experimental)

The next three command-line options are available, but only useful with the secondary-structure output from IVS (a secondary structure grammar inference program not currently released).

-p [filename]

Specify the grammar piece file, which is output by IVS. By default, no grammar piece file is used. A grammar piece file is meant to help guide an alignment based on secondary structure present in DNA/RNA sequences. This option is ignored for amino acid sequences.

-c [value]

Specify the grammar piece mismatch cost. This is the subtractive penalty applied when either a residue position being aligned is not within a structural piece but the ensemble is, or a residue position is within a structural piece but the ensemble is not. This amount is subtracted from the regular pairwise substitution cost. If this option is not specified, the default cost is 0.0.

-s [value]

Specify the grammar piece match score. This is the additive benefit applied when both a residue position being aligned is contained within a structural piece and the ensemble location is also within a structural piece. This amount is added to the regular pairwise substitution score. If this option is not specified, the default score is 0.0.
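The way the -c and -s values modify the pairwise substitution score can be summarized in a short sketch. This follows the descriptions of the two options above; the function name and argument names are hypothetical, and the case where neither position lies within a structural piece is assumed to leave the score unchanged.

```python
def adjusted_score(base, residue_in_piece, ensemble_in_piece,
                   match_score=0.0, mismatch_cost=0.0):
    """Apply the grammar piece bonus (-s) or penalty (-c) to a substitution score."""
    if residue_in_piece and ensemble_in_piece:
        return base + match_score      # both inside a structural piece
    if residue_in_piece != ensemble_in_piece:
        return base - mismatch_cost    # exactly one inside a structural piece
    return base                        # neither inside (assumed: no change)


print(adjusted_score(2.0, True, True, match_score=1.0, mismatch_cost=0.5))    # 3.0
print(adjusted_score(2.0, True, False, match_score=1.0, mismatch_cost=0.5))   # 1.5
print(adjusted_score(2.0, False, False, match_score=1.0, mismatch_cost=0.5))  # 2.0
```

With the default values of 0.0 for both options, the substitution score is never changed, so the grammar piece file has no effect on scoring unless -c or -s is set.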
