Man page of tacg

tacg

Section: User Commands (1)
Updated: tacg (v4.3.x) - a command line tool for DNA and Protein Analysis
 

NAME

tacg - finds short patterns and specific combinations of patterns in nucleic acids, translates DNA <-> protein.  

SYNOPSIS

tacg -flag [option] -flag [option] ... <input.file >output.file

tacg takes input from a file (--infile) or via stdin (| or <); spits output to screen (default), >file, | next command

[-chHlLsv] [-b #] [-e #] [-C {0-16}] [--clone '#_#,#x#...'] [--cost #] [-D 0-4] [--dam] [--dcm] [--example] [-f {0|1}] [-F {0-3}] [-g #,#] [-G #,{X|Y|L}] [-H (--HTML) 0|1] [-i (--idonly) 0-2] [--infile 'input/data/file'] [-m #] [-M #] [-n {3-8}] [--numstart] [--notics] [-o {0|1|3|5,#}] [-O {1-6,#}] [--orfmap] [-p Name,pattern,Err] [-P NameA,(+|-)(l|g)Dist_Lo(-Dist_Hi),NameB] [--ps] [--pdf] [--logdegens] [-r (--regex) {'Label:RegexPat' | 'FILE:FileOfRegexPats'}] [--rule 'Name,(LabA:m:M&LabB:m:M),Win'] [--rulefile '/path/to/rulefile'] [-R 'alternative Pattern/Matrix file'] [--raw] [-S (1*|2)] [--silent] [--strands {1|2}] [-T {0|1|3|6},{1|3}] [-V {1-3}] [-w {1|#}] [-W (--slidwin) #] [-x NameA(=),NameB..(,C)] [-X (--extract) {b,e,[0|1]}] [-# %] [--rev] [--comp] [--revcomp]

 

DESCRIPTION

tacg takes input from stdin (or a file specified via the --infile option), automagically translates most standard ASCII formats of Nucleic Acid (NA) sequence, then analyses that sequence for restriction enzyme (RE) sites and other NA motifs such as Transcription Factor (TF) binding sites (w/ or w/o mismatch errors), matrix matches, and regular expressions, finally writing analyses to stdout. It can also translate the NA input to protein in any frame, using any of a number of Codon translation tables, and search for Open Reading Frames (ORFs), as well as perform many other analyses. Most of the internals use dynamic memory so there are few limits on sequence input size and pattern number. It's ~5-50x faster than the comparable routines in GCG or EMBOSS and, as it's written in ANSI C, portable to all unix variants, and even Microsoft Win32 with the Cygwin and MinGW32 toolkits.

tacg searches the sequence read from stdin for matches based on descriptions stored in a database of patterns: explicit sequences, possibly containing IUPAC degeneracies (default rebase.data, in GCG format or extended format); matrix descriptions (default matrix.data, in TRANSFAC format); regular expressions (default regex.data, in GCG-like format); or a rules file (default rules.data, in a simple format). Based on the matches and the options entered on the command line, it sends ALL output to stdout. (Unless requested, it no longer sends errors to stderr (except failure errors) and it no longer emits default output - you have to request all output, except for the simplest case: the '-p' flag will also set the -S flag to generate Sites.)

tacg now automagically translates most ASCII formats (Genbank, FASTA, etc) via Jim Knight's SEQIO library and now handles multiple sequences at one time, internally converting 'u's to 't's. It considers both strands at the same time so you don't have to manually reverse complement the sequence (altho you can - see --rev, --comp, --revcomp), and will by default accept all IUPAC degeneracies (yrmkwsbdhv), performing all possible operations on that sequence. It treats degeneracies in the input sequence in different ways depending on the -D flag (see below). It either strips all letters other than 'a','c','g', or 't' and analyzes the sequence as 'pure' using a fast incremental hashing algorithm, or it treats it as degenerate and analyses it via a slower de novo hash. By default, it treats sequence as 'pure' unless it detects an IUPAC degeneracy, in which case it will adaptively switch back and forth between the fast and slow hashing routines.

NB: tacg can produce lots of output, especially in the Linear map mode; while it's possible to pipe direct to lp/lpr, you'll probably regret it.
        

REQUIREMENTS

tacg 4.x requires an external Codon file codon.data but does not absolutely require a pattern/REBASE file, allowing you to enter patterns via the command line with the '-p' flag. However, most users will want to use a REBASE file in GCG format to supply the RE definitions. By default the name of this (supplied) file is: rebase.data, altho other files in the same format can be specified by the -R flag. While you can use the default GCG-formatted file from NEB's REBASE distribution (http://rebase.neb.com), additional information is required to use the --dam, --dcm, or --cost options. This info is included in the distribution of tacg and can be added or modified with a text editor. Searching for Matrices requires the use of a TRANSFAC-formatted file (also supplied in the default name of matrix.data ).

The codons/pattern/matrix data files may exist in any of 3 locations which are searched in the order of: the current directory $PWD, your home directory $HOME, or tacg lib $TACGLIB. Many shells will automatically define the 1st two; the last must be specified either via command line or in your .cshrc file.

ie. 'setenv TACGLIB /usr/local/lib/tacg' [csh/tcsh]

or 'export TACGLIB=/usr/local/lib/tacg' [bash]
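The search order above can be sketched in a few lines of Python. This is only an illustration of the documented order ($PWD, then $HOME, then $TACGLIB); the function name find_data_file and its arguments are hypothetical and are not part of tacg.

```python
import os
from pathlib import Path


def find_data_file(name, env=None, cwd=None):
    """Return the first existing copy of `name`, or None.

    Hypothetical helper: it mimics the documented search order
    ($PWD, then $HOME, then $TACGLIB); it is not tacg's own code.
    """
    env = os.environ if env is None else env
    candidates = [
        Path(cwd) if cwd is not None else Path.cwd(),  # current directory
        env.get("HOME"),                               # home directory
        env.get("TACGLIB"),                            # tacg library directory
    ]
    for directory in candidates:
        if not directory:
            continue  # variable not set: skip this location
        path = Path(directory) / name
        if path.is_file():
            return path
    return None
```

A copy of rebase.data in the current directory therefore shadows one in $HOME, which in turn shadows the one in $TACGLIB.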

 

FLAGS and OPTIONS

{} = required for flag; * = default (doesn't need to be entered); # = an integer value; () = optional

ie. -f {0,1*} means that the flag must be entered -f1 or -f0. The flags -f0 or -f 0 are equally acceptable and flags without variables can be grouped together (-scl). A single flag requiring an option can be appended to the end of a string of simple flags, but not more than 1. -Ls is OK, and -Lsn6 is OK, but -Lsn6F3 is NOT - it must be entered -Lsn6 -F3. Appending a flag that expects an option value without one will cause odd behavior, usually a cryptic error message and the program halting. NOT entering the flag will cause the default behavior.

-b {#}
select the beginning of a subsequence from a larger sequence file; 1* for 1st base of sequence. In the Linear Map output, the upper label indicates numbering from beginning of subsequence; the lower label indicates numbering from the beginning of the entire sequence (see file 'tacg.main.html' for more detail). The smallest sequence that tacg can handle is 4 bases, 10 for the ladder map (-l). This allows analysis of primers and linkers.

-e {#}
select the end of a subsequence from a larger sequence file; 0* for last base of sequence. The largest sequence that I've sent thru it is ~225MB.

-c
order the output by # of cuts/fragments by each RE (Strider style) and thence alphabetically; otherwise output is by order of appearance in the REBASE file.

-C {0*-16}
Codon Usage table to use for translation:


 0 Standard*       6 Echino_Mito        12 Blepharisma
 1 Vert_Mito       7 Euplotid_Nuclear   13 Chloro_mito
 2 Yeast_Mito      8 Bacterial          14 Trematode_mito
 3 Mold_Mito       9 Alt_Yeast          15 Scenedes_mito
 4 Invert_Mito    10 Ascidian_Mito      16 Thrausto_mito
 5 Ciliate_Mito   11 Alt_Flatworm_mito
The Codon Usage file used in versions 3 & 4 (codon.data) is a slightly modified NCBI format, which includes info (currently ignored) about multiple initiator codons and references. Please page through it for more info.

--clone '#_#,#x#...'
Clone finds sequence ranges which either MUST NOT be cut (#_#) or that MUST be cut (#x#), up to a maximum of 15 at once. Ranges not specified can be either cut or not cut. The output first lists all REs (if any) which match ALL the rules, then all REs which match SOME rules as long as all NO-CUT rules are respected. The same filters that work in other RE selections (-n, -o, -m, -M, --cost, --dam/dcm) can be applied here to fine-tune the selection.

--cost {#}
Cost controls which REs are chosen, based on the # units/$, where the higher the number, the lower the cost (>100 U/$ is cheap; <10 U/$ is quite expensive), based on the prices quoted in NEB's catalog for their high unit products.

-D {0-4}
Degeneracy flag - controls input and analysis of degenerate sequence input where:

 0  FORCES exclusion of degens in sequence; only 'acgtu' accepted
 1* cut as NONdegen unless degen's found; then cut as '-D3'
 2  degen's OK; ignore in KEY hexamer, but match outside of KEY
 3  degen's OK; expand in KEY hexamer, find only EXACT matches
 4  degen's OK; expand in KEY hexamer, find ALL POSSIBLE matches
The pattern matching is adaptive; given a small window of nondegenerate sequence, the algorithm will match very fast; if degenerate sequence is detected, it will switch to a slower, iterative approach. This results in speed that is proportional to degeneracy for most cases. If you have long sequences of 'n's (inserted as placekeepers, for instance), -D2 may be a better choice. In all cases, as soon as degeneracy of the KEY hexamer exceeds a compiled-in limit (usually 256-fold degeneracy), the KEY is skipped.
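To make the '256-fold degeneracy' limit concrete, here is a small Python sketch. It assumes that the degeneracy of a KEY hexamer is the product of the number of bases each IUPAC code stands for; the names key_degeneracy, key_is_skipped and KEY_LIMIT are made up for the illustration and are not taken from the tacg source.

```python
# Number of bases each IUPAC code stands for.
IUPAC_FOLD = {
    "a": 1, "c": 1, "g": 1, "t": 1, "u": 1,
    "r": 2, "y": 2, "m": 2, "k": 2, "w": 2, "s": 2,
    "b": 3, "d": 3, "h": 3, "v": 3,
    "n": 4,
}
KEY_LIMIT = 256  # assumed value of the compiled-in limit ("usually 256")


def key_degeneracy(hexamer):
    """Number of distinct plain sequences a (possibly degenerate) key expands to."""
    fold = 1
    for base in hexamer.lower():
        fold *= IUPAC_FOLD[base]
    return fold


def key_is_skipped(hexamer, limit=KEY_LIMIT):
    """True if the key's degeneracy exceeds the limit (assumption: strictly greater)."""
    return key_degeneracy(hexamer) > limit
```

Under these assumptions a key with four 'n's (4*4*4*4 = 256) is still expanded, while one with five 'n's (1024) is skipped.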

--dam
Dam sensitivity simulation of Dam methylation of the DNA. Dam methylase has a palindromic recognition site (GmATC) which can interfere with the binding and cutting of a number of Type II REs. This flag simulates the effect of Dam methylation, but requires extra data to be available in the rebase file. If the RE is completely blocked, it will be noted that it did not cut at all in the summary statement. Otherwise, the effect is noted only by difference in the number of sites listed for the -S and -F flags. The sites are still listed in the Linear Map to indicate where they WOULD be if the DNA was not methylated.

--dcm
Dcm sensitivity similar to '--dam' simulation above but with Dcm methylation of the DNA. Dcm methylase also has a palindromic recognition site (CmCWGG) which can interfere with RE action.
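As a rough illustration of where these methylation sites fall, the following Python sketch lists the Dam (GATC) and Dcm (CCWGG, W = A or T) sites in a sequence. It only locates the sites; which REs are blocked by them depends on the extra data in the rebase file and is not modelled here. The function name methylation_sites is hypothetical.

```python
import re

DAM_SITE = re.compile(r"GATC")      # Dam recognition site
DCM_SITE = re.compile(r"CC[AT]GG")  # Dcm recognition site, W = A or T


def methylation_sites(seq):
    """Return 1-based start positions of Dam and Dcm sites on the top strand.

    Both sites are palindromic, so scanning one strand finds them all.
    """
    seq = seq.upper()
    return {
        "dam": [m.start() + 1 for m in DAM_SITE.finditer(seq)],
        "dcm": [m.start() + 1 for m in DCM_SITE.finditer(seq)],
    }
```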

--example {1-10}
example code to show how to add your own flags and functions. Search for 'EXAMPLE' in 'SetFlags.c' and 'tacg.c' for the code.
  
-f {0|1*}
form (or topology) of DNA - 0 (zero) for circular; 1 for linear. This flag also operates on subsequences.
      
-F {0*-3}
print/sort Fragments, based on the user-supplied selection criteria ('-n', '-m', '-M', '-o', etc). See also '-c' above.

 0*-omit; 
 1-unsorted; fragments printed in order of generation.
 2-sorted; fragments sorted by size, smallest to largest.
 3-both. This flag has been left active for the matrix matching, even tho it doesn't make much sense to use it in that way.
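The interplay between topology (-f) and fragment output (-F) can be sketched as follows. This is a simplified Python model, assuming each cut is given as the number of the last base before the break; fragment_sizes is a made-up name and the code is not tacg's own.

```python
def fragment_sizes(cuts, length, circular=False):
    """Sizes of the fragments left by cutting after each position in `cuts`."""
    cuts = sorted(set(cuts))
    if not cuts:
        return [length]  # uncut sequence: one piece either way
    # Pieces between consecutive cuts.
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    head, tail = cuts[0], length - cuts[-1]
    if circular:
        # The piece spanning the origin joins the tail to the head.
        sizes.append(tail + head)
    else:
        sizes = [head] + sizes + [tail]
    return sizes
```

In this model n cuts give n+1 fragments on a linear sequence but only n on a circular one; the list as returned corresponds to the unsorted order, and sorted(fragment_sizes(...)) to the order sorted by size.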
                                        
-g {min#(,Max#)}
specify if you want a pseudo-gel map graphic, with a low end cutoff of min# bases and a high end cutoff of Max#. If Max# is omitted, the length of the sequence is assumed, altho you can set Max# to be any number so as to constrain the output for comparisons between sequences. These numbers can be any integer power of 10 (10, 100, 1000, etc). See examples below.

-G {binsize,X|Y|L}
Graphic data output, so (mis)named for its original use, where:
binsize = # bases for which hits should be pooled
X|Y|L indicates whether the BaseBins should be on the X axis, on the Y axis, or listed separately for each Name (L):
 X: BaseBins 1000 2000 3000 4000  ..
    NameA      0    4    0    7   ..   
    NameB     22   57   98   29   ..     (#s = matches per bin)
    NameC      1    0    0    3   ..
    .
 Y: BaseBins  NameA   NameB   NameC   ..
      1000      0      22       1     ..
      2000      4      57       0     .. 
      3000      0      98       0     ..
      4000      7      29       3     ..
     .
 L: Basebins  NameA
      1000      0    
      2000      4    
        .      .
    Basebins  NameB
      1000     22
      2000     57
        .      .
  This addresses some missing features - allows the export of match data for the selected Names to allow external analysis of the raw data. Like other output, it is streamed to stdout, so it's not wise to mix -G with other analyses; the lines generated (esp. w/ the X option) can be quite long and are NOT governed by the -w flag.
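The pooling that -G performs can be imitated in Python. The sketch below assumes that the bin labelled N covers bases N-binsize+1 through N (the man page does not say where the bin edges fall), and prints something like the Y layout; bin_hits and y_layout are hypothetical names.

```python
from collections import Counter


def bin_hits(positions, binsize, seq_len):
    """Count hits per bin; bin label N is assumed to cover bases N-binsize+1 .. N."""
    counts = Counter((pos - 1) // binsize for pos in positions)
    nbins = -(-seq_len // binsize)  # ceiling division
    return [counts.get(i, 0) for i in range(nbins)]


def y_layout(hits, binsize, seq_len):
    """Format {name: [positions]} with BaseBins down the side (like -G #,Y)."""
    names = list(hits)
    table = {n: bin_hits(hits[n], binsize, seq_len) for n in names}
    nbins = -(-seq_len // binsize)
    rows = ["BaseBins " + " ".join(f"{n:>7}" for n in names)]
    for i in range(nbins):
        cells = " ".join(f"{table[n][i]:>7}" for n in names)
        rows.append(f"{(i + 1) * binsize:>8} {cells}")
    return "\n".join(rows)
```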

-h
brief help page (condensed man page).
   
-H (--HTML) {0*|1}
generates complete or partial HTML tags for viewing with a Web browser.

 0* makes a standalone HTML page, with Table of Contents (TOC).
 1  no page headers, only TOC, to embed in other HTML pages.

Not useful in a functional sense in the command line version. More HTML markup can always be added as eye candy.

-i (--idonly) {0*-2}
controls the output for sequences (in a collection) that have no hits for the options selected.

 0* ID line and normal output are printed regardless of hits.
 1  BOTH ID line and normal output are printed ONLY IF there are hits.
 2  ONLY the ID line is printed if there are hits (to identify sequences of interest in a scan for further analysis).

--infile {input_sequence_file}
provides an alternative method for specifying the input file, useful for some scripting frameworks and web pages. The filename specified is passed to SearchPaths() and so it will be found if it is in the current directory, your home directory, or the TACGLIB directory, in that order. A full pathname will identify only that file, of course.


                  

-l
specify if you want a ladder map of selected enzymes, much like the GCG MAPPLOT output. Also appends a summary of those enzymes that match a few times. The number of matches that is included in the summary is length-sensitive in the distributed source code, but it can be overridden by changing the value assigned to '#define SUMMARY_CUTS' in 'tacg.h'.

-L
specify if you WANT a Linear map. This spews the most output (about 10x the # of input characters) and depending on what other options are specified, can be of moderate to very little use. This option no longer generates the co-translation by default as it did in prior versions. If you want the co-translation, you'll have to specify it via the -T flag below. The Linear map also no longer shows ALL the patterns that match from the pattern file. It now obeys the same filtering rules that the Sites, Fragments, Ladder Map and other analyses do. This behavior was requested by several people, and I have to admit it makes sense. tacg 4 also labels non-palindromic patterns as to orientation if they are reversed relative to the way they were entered, by appending a ~ character to the end of the pattern label in the linear map.

--strands {1|2*}
in Linear Map, print 1 or 2 strands. Along with '--notics', can be used to compact the output by 2 lines per stanza.

 1  only the top strand is printed.
 2* both top and bottom strands are printed.

--notics
in Linear Map, DON'T print the tics - can be used to compact the output by up to 2 lines per stanza.

--numstart {#}
the value given with this flag is the beginning number in the Linear Map (-L) output. This can be used to force a particular numbering scheme on the output or to force upstream (negative) numbering for promoter sequences.

-m/M {#}
select enzyme by minimum (-m) and/or Maximum (-M) # cuts in sequence; 0* for all. Affects the number of enzymes displayed by the sites (-s), fragments (-F), gel (-g), ladder (-l), and linear map (-L) flags.
   
-n {3*-10}
select enzymes by magnitude of recognition site; 3 = all, 5 = 5,6,7,8... n's don't count, other degeneracies are summed ie: tgca=4, tgyrca=5, tgcnnngca=6, tannnnnnnnnnta=4
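The four worked values above can be reproduced with a short Python sketch. The weights are an assumption inferred from those examples: a plain base counts 1, a 2-fold degeneracy counts 0.5 (so that tgyrca = 5) and 'n' counts 0. The value used for 3-fold degeneracies (b, d, h, v) is a pure guess, as the man page gives no example for them. site_magnitude is a hypothetical name.

```python
# Contribution of each IUPAC code to the 'magnitude' of a recognition site.
# 1.0, 0.5 and 0.0 follow from the examples in the man page;
# 0.25 for the 3-fold codes is an ASSUMPTION, not a documented value.
WEIGHT = {
    "a": 1.0, "c": 1.0, "g": 1.0, "t": 1.0,
    "r": 0.5, "y": 0.5, "m": 0.5, "k": 0.5, "w": 0.5, "s": 0.5,
    "b": 0.25, "d": 0.25, "h": 0.25, "v": 0.25,
    "n": 0.0,
}


def site_magnitude(pattern):
    """Sum the per-base weights of a recognition pattern."""
    return sum(WEIGHT[base] for base in pattern.lower())


def passes_n_filter(pattern, n):
    """True if the pattern would survive '-n n' under the assumptions above."""
    return site_magnitude(pattern) >= n
```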

-o {0|1*|3|5,#}
select enzymes by overhang generated; 5 = 5', 3 = 3', 0 for blunt, 1 for all. If you append an integer between 1 and 6 inclusive, you can additionally filter on the LENGTH of the overhang: ie -o5,4 will produce output only for those REs that leave 5' overhangs that are exactly 4 bases long.
                  
-O {1-6(x),MinSiz}
crude ORF analysis producing either a line or a block (depends on -w) for each ORF including:

 = Frame of the Current ORF
 = Sequence # of the Current ORF
 = Offset from the start in both bases and AAs
 = Size of the ORF in AAs and KDa 
 = ORF itself in 1 letter code
 = if 'x' is appended to frames, extended info is included (# & % of total AAs)
 
NB: If -w is set to 1, the output is written in a 2 line, FASTA-like stanza for each ORF (the header prefixed by '>', and the ORF itself), so that line-oriented pattern-matching tools (grep, egrep, awk) can examine the ORF for matching regular expressions (see the GNU grep man page for an explanation of regular expressions). In this way you can search all 6 frames of >MinSize AAs for whatever pattern interests you. If -w is set to one of the regular widths, the ORF will be wrapped at that length to form a FASTA formatted block for analysis by other apps, more biologically aware tools like FASTA, BLAST, etc.
Examples:
 -O 145,25  frames 1,4,5 with a min ORF size of 25 AAs
 -O 35x,200  frames 3 & 5 with a min ORF size of 200 AAs, with extended info.
 -O 2,66    frame 2 with a min ORF size of 66 AAs

--orfmap
requests a pseudographic ORF map and a MET & STOP map of those Frames requested with the -O flag (see above), and so requires the -O flag to be specified with it. You can expand the scale with the -w flag (see below) to increase accuracy somewhat, but it will still be limited due to the character based mapping. The map does match the one produced by the -l flag, so you can use them together to get a relative sense of where patterns and ORFs map, and then use the -b and -e flags to zoom into the sequence of interest.
 
-p {Name,Pattern[,Err]}
allows entry of search patterns from the command line;

   Name = Pattern name (1-10 chars)
   Pattern = <30 IUPAC characters (ie. gryttcnnngt)
   Err = (optional) max # of errors that are tolerated 
         (<6). If omitted, Err is set to 0
This flag also logs the patterns you've entered into the file tacg.patterns in the correct format for later copying to a REBASE file. You can enter up to 10 of these at a time. This uses a brute force approach, so long patterns with high #s of errors (>3) will cause SUBSTANTIAL cpu usage (ie. minutes) in validating the patterns. But the actual search will go very fast.

-P {NameA,[+-][lg]Dist_Lo[-Dist_Hi],NameB}
Proximity matching. Use this option to search for spatial relationships between factors, 2 at a time (up to a total of 10).
NameA and NameB must be in a REBASE-formatted file, either the default rebase.data or another specified by the -R flag and are case INsensitive. NameA/B patterns can be composed of any IUPAC bases and ERRORs can be specified in the REBASE entry ie:

 Pit1  5  WWTATNCATW  0  2 ! a Pit1 site with 2 errors
 Tataa 4  TATAAWWWW   0  1 ! a Tataa site with 1 error

 +  NameA is DOWNSTREAM of NameB (default is either)
 -  NameA is UPSTREAM of NameB  (ditto)    

 l  NameA is LESS THAN Dist_Lo from NameB (default)
 g  NameA is GREATER THAN Dist_Lo from NameB
 Dist_Hi - if used, implies a RANGE, obviates l or g  
Example I
   -PHindIII,350,bamhi     Match all HindIII sites within 350 bases of BamHI sites
Example II
   -PPit1,-30-2500,Tataa   Match all Pit1 sites that are 30 to 2500 bases UPSTREAM of a Tataa site.
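To show how the pieces of a -P option string fit together, here is a small Python parser for the syntax given above (NameA,[+-][lg]Dist_Lo[-Dist_Hi],NameB). It is only a sketch of the syntax as documented, not tacg's own option parser; parse_proximity and the dictionary keys are made-up names.

```python
import re

PROXIMITY = re.compile(
    r"^(?P<a>[^,]+),"                  # NameA
    r"(?P<side>[+-])?(?P<mode>[lg])?"  # optional +/- and l/g
    r"(?P<lo>\d+)(?:-(?P<hi>\d+))?"    # Dist_Lo and optional -Dist_Hi
    r",(?P<b>[^,]+)$"                  # NameB
)


def parse_proximity(spec):
    """Split a -P option string into its documented parts."""
    m = PROXIMITY.match(spec)
    if m is None:
        raise ValueError(f"not a proximity spec: {spec!r}")
    hi = int(m["hi"]) if m["hi"] else None
    return {
        "name_a": m["a"],
        "name_b": m["b"],
        "side": m["side"] or "either",  # default: either side
        # A Dist_Hi implies a range and obviates l/g; otherwise 'l' is the default.
        "mode": "range" if hi is not None else (m["mode"] or "l"),
        "lo": int(m["lo"]),
        "hi": hi,
    }
```

With this, Example I parses as 'HindIII less than 350 from bamhi, either side', and Example II as 'Pit1 upstream of Tataa, in the range 30 to 2500'.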

--ps
generates a postscript plasmid map (and multiple pages with the same parameters if fed a multi-sequence file). The output file is named tacg_Map.ps and additional plots will be appended to it if it exists in the same directory. REs to be plotted can be selected with the usual parameters (-m -M --cost -n -x -p) but you'll usually want to use -M1 or -M2. Degeneracies are plotted along the rim as grayscale arcs (remember tacg can tolerate degeneracies in sequence, so you can compose accurate plasmid maps by connecting known sequences with N's). ORFs from any and all frames can be plotted internal to the sequence ring by using the -O flag.

--pdf
Invokes --ps above and automatically converts the Postscript output to Adobe's Portable Document Format, which is considerably more compact.

--logdegens
(off by default) Using this flag forces the logging of every degeneracy in the sequence, trivial if a short sequence (<1Mb), but of concern for chromosome-sized chunks. This info will be used for drawing graphic maps of the sequence and shading degeneracies differently. It is quite memory intensive as it marks the beginning and end of every degeneracy run. No external data is produced, but could be as it's just a simple 2-step array.


   

-R {REBASE|Matrix file}
specifies an alternative database (RE or Matrix) to use. The RE database must be in the same GCG format as rebase.data. There are some example alternative REBASE files shipped with the tacg distribution named '*.RB'.

The latest REBASE files are available via FTP:

ftp://ftp.neb.com/pub/rebase/
  or via WWW:

http://www.neb.com/rebase/rebase.html

and the latest TRANSFAC database is available at:

http://transfac.gbf.de/TRANSFAC/index.html

The file specified with the -R flag is searched for in the same order as the other data files: $PWD , $HOME , $TACGLIB.

--raw
makes tacg consider ALL input as raw, unformatted sequence. This allows it to process unstructured data such as fragments of files and editor buffers. It ignores everything NOT an IUPAC degeneracy, but will consider all possible IUPAC degeneracies, so will produce odd output if fed a regularly formatted sequence file (it will process headers and comments as sequence.) This is the behavior of the version 2 tacg (before SEQIO).

-r (--regex) {'Label:RegexPat'} | {'FILE:FileOfRegexPats'}
searches for regular expressions entered from the commandline using the 1st approach above or searches for the regular expressions read from a file using the 2nd approach. The regular expression syntax can be formal regex patterns or the IUPAC'ed version thereof; the translation from one to the other is handled automatically. Because regex's typically have many characters that shells are happy to misinterpret, the single quotes (') surrounding the option string are almost always required. When trying to specify a file, the term FILE must be in CAPs (so don't go naming a regex pattern 'FILE'). Specific regex patterns from the file can be specified by using the '-x' flag to name them explicitly. Regular expression searches are considerably slower than other types of searches, but searches of 100Kb, with <10 regex patterns of even reasonably high complexity should be tolerable.

--rule {logic}
(see also -P above) --rule allows you to specify arbitrarily complex logical associations of characteristics to detect the patterns that interest you. Admittedly, that phrase is incomprehensible on its own, so let me give an example:

Say you wanted to search for an enhancer that you suspected might be involved in the transcriptional regulation of a pituitary-specific gene. You knew that you were looking for a sequence about 1000 bp long in which there were at least 2 Pit1 sites and 3-5 Estrogen response elements, but NO TATAA boxes. If you had defined these patterns in a file called pit.specific as:


 Pit1  0  WWTATNCATW    0 1 ! Pit1 site w/ 1 error
 ERE   0  GGTCAGCCTGACC 0 1 ! ERE site w/ 1 error
 TATAA 0  tataawwww     0 0 ! TATAA site, no errors allowed


 you could specify this search by:
  tacg --rule '((Pit1:2:7&ERE:3:5)&(TATAA:0:0),1000)' -R pit.specific < input_sequence >output

This query searches a sliding window of 1000 bps (-W 1000) for ((2-7 Pit1 AND 3-5 ERE sites) AND (0 TATAA sites)). These combinations can be as large as your OS allows your command-line to be, with arbitrarily complex relations represented with logical AND (&), OR (|), and XOR (^) as conjunctions. Parens enforce groupings; otherwise it's evaluated left to right.
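What such a rule means can be spelled out in Python. The sketch below takes precomputed match positions for each label and slides a window along the sequence, testing the example rule in each one. It is a naive illustration of the logic only, with the rule hard-coded as a function; windows_matching, count_in_window and enhancer_rule are hypothetical names and nothing here reflects how tacg evaluates rules internally.

```python
from bisect import bisect_left


def count_in_window(sorted_positions, start, win):
    """Number of hits with start <= position < start + win."""
    return (bisect_left(sorted_positions, start + win)
            - bisect_left(sorted_positions, start))


def windows_matching(hits, win, seq_len, rule):
    """Yield 1-based window starts at which `rule` holds for the hit counts."""
    ordered = {name: sorted(pos) for name, pos in hits.items()}
    last_start = max(seq_len - win + 1, 1)
    for start in range(1, last_start + 1):
        counts = {n: count_in_window(p, start, win) for n, p in ordered.items()}
        if rule(counts):
            yield start


def enhancer_rule(c):
    # ((Pit1:2:7 & ERE:3:5) & (TATAA:0:0))
    return (2 <= c["Pit1"] <= 7 and 3 <= c["ERE"] <= 5) and c["TATAA"] == 0
```

Swapping 'and' for 'or' or '!=' in the rule function gives the OR (|) and XOR (^) conjunctions.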

--rulefile '/path/to/the/rulefile'
This option allows you to read in a complete file of the kind of complex rules described above and have them all evaluated. The file format is described in the supplied example data file, rules.data.

-s
prints the summary of site information, describing how many times each pattern matches the sequence.
Those that match zero times are shown first. In Ver >2, only those that match at least once are shown in the second part (the 0 matchers are not reiterated).

-S (1*|2)
prints the actual matched Sites in tabular form, much like Strider's output. See also '-c', above.

 1* = sites noted as + offsets; fine for restriction mapping.
 2  = note nonpalindrome patterns on bottom strand with '-' offsets.

--silent
requests that the nucleic acid sequence submitted be translated starting at the 1st base, in frame 1 (use -b to shift the starting base), according to the Codon Translation table specified with -C, then reverse translated with the same table, using all the possible degeneracies. tacg then restricts that (quite) degenerate sequence and shows all the REs that will match it. You should use the '-L' and '-T' flags to generate the linear map which shows both the REs and the cotranslated sequence to verify that all is as it should be. NB: Depending on Codon Table, some AAs are not reversibly translatable. Using the standard table, Arg (=mgn), Leu (=ytn), and Ser (=wsn) cannot be Forward translated from their Reverse translation.

--tmppath /path/to/tmp/dir
passes the path to tacg to cooperate with CGIs or other programs that need to tell tacg where to place the ps/pdf files.
                  
-T {[0*|1|3|6],[1|3]}
requests frames 1, 1-3, or 1-6 to be cotranslated with the Linear Map using 1 or 3 letter codes. Requires '-L' to have any effect.

Ex: "-T3,3" translates Frames 1,2,3 with 3 letter labels.
    "-T1,1" translates Frame 1 with 1 letter labels.
        
    

-v
asks for program version (there may be multiple versions of the same functional program to track its migration).
                  
-V {1-3}
Verbose output- requests all kinds of ugly diagnostic info to be spat to the screen. May be useful in diagnosing why tacg did not behave as expected..but maybe not. The values 1 - 3 ask for increasing amounts of detail.             

-w {1|#}
output width in bp's (the option number must be exactly 1 or between 60* and 210).

The number (if not 1) is truncated to a # exactly divisible by 15 ('-w 100' will be interpreted as '-w 90') and actual printed output will be about 20 characters wider. Also applies to output of the ladder and gel maps, so if you're trying to get more accuracy and your output device can display small fonts, you may want to use this flag to widen the output. In version 3, the option '-w 1' allows you to put as much information as possible on one line for easier parsing by some external apps.

Ex: "-w 1" prints output in one line
    "-w 150" causes wrapping at about 170 characters (150 bp wide in the Linear map option).

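The width rule for -w (exactly 1, or 60 to 210 truncated down to a multiple of 15) is easy to check with a few lines of Python. effective_width is a hypothetical name, and raising an error for out-of-range values is an assumption, as the man page does not say what tacg does in that case.

```python
def effective_width(requested):
    """Width in bp that a '-w requested' option would be interpreted as."""
    if requested == 1:
        return 1  # special case: everything on one line
    if not 60 <= requested <= 210:
        # Assumption: treat out-of-range values as an error.
        raise ValueError("width must be 1 or between 60 and 210")
    return requested - requested % 15  # truncate to a multiple of 15
```

The printed lines will still be about 20 characters wider than this value.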
-x {Label(,=),Label..(,C)}
used to restrict the patterns searched for by Name label (either from the 1st field of a REBASE format file or the NA field from a TRANSFAC format file) up to a maximum of 15. Case INsensitive (HindIII = hindiii = HinDiIi), but it HAS to be spelled exactly like the entry in rebase.data with no spaces. (HindIII != Hind III != Hind3).

The '=' tag invokes the Hookey function (named after its requestor, John Hookey), in which the '=' tags the RE to which it is appended. This is useful if you're trying to discern or predict a labelled fragment in a mixture of fragments. The output shows the fragments generated only if they have one or both ends generated by the tagged RE. This option works even if there are a number of REs, but only one can be tagged. Ex: '-x HindIII,=,MseI,HinfI' causes the DNA to be cut by HindIII, MseI, and HinfI, but only fragments that have a HindIII end will be shown. The output is shown both unsorted and sorted by fragment size. If you want to cause the output to simulate a multiple digest with all the REs designated, append a ',C' to the list of RE names. Ex: -xBamHI,EcorI,NruI,C

NB: Don't assign the name 'C' to any patterns or REs.


            

-X (--extract) {b,e,[0|1]}
causes the sequences bounding the match to be spat to stdout in FASTA format. b and e are the beginning and ending offsets respectively for varying the window around the match. NB: both b and e are measured from the start of the match, so e must be corrected for the length of the pattern itself.

-# {#}
calls for matrix matching of either ALL the patterns in the default Matrix file matrix.data or that specified via the '-R' flag, or ONLY THOSE specified via the '-x' flag, regardless of the input file. The number indicates the CUTOFF as the percentage of the maximum score possible (the sum of the highest score at each nucleotide across the matrix - see tacg3.main.html for more info). Example: 'tacg -# 95 -x GCN4 -S <yeastchromo4.genbank' will search all of 'yeastchromo4.genbank' for the Matrix named in matrix.data as GCN4 at a cutoff of 95% (the pattern has to match the matrix at 95% or better).
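The cutoff arithmetic can be sketched in Python. The matrix is represented here as a list of per-position score dictionaries, which is a simplification chosen for the example and not the TRANSFAC file format; only the top strand is scanned. max_score and matrix_hits are hypothetical names.

```python
def max_score(matrix):
    """Sum of the highest score at each position across the matrix."""
    return sum(max(col.values()) for col in matrix)


def matrix_hits(seq, matrix, cutoff_pct):
    """Return (1-based position, score) for windows scoring >= cutoff_pct % of max."""
    seq = seq.lower()
    width = len(matrix)
    threshold = max_score(matrix) * cutoff_pct / 100
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        # Bases missing from a column (eg. degenerate codes) score 0 here.
        score = sum(col.get(base, 0) for col, base in zip(matrix, window))
        if score >= threshold:
            hits.append((i + 1, score))
    return hits
```

For a toy 3-column matrix with a maximum score of 28, a cutoff of 95 requires a score of at least 26.6, while a cutoff of 75 admits anything scoring 21 or more.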


                  

--rev
causes the sequence(s) to be reversed before analysis: tacg -> gcat. Useful for figuring out sequencing/entry errors.

--comp
causes the sequence(s) to be complemented before analysis: tacg -> atgc. Useful for figuring out sequencing/entry errors.

--revcomp
causes the sequence(s) to be reverse-complemented before analysis: tacg -> cgta. Useful for checking the translation in opposite orientation without having to read translation backwards or convert with another program.
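The three transformations are simple enough to state in Python, which may help when checking what to expect from --rev, --comp and --revcomp. The complement table below covers the IUPAC degeneracies as well; the function names are made up for the illustration.

```python
# IUPAC-aware complement table: a<->t, c<->g, r<->y, m<->k, b<->v, d<->h;
# s, w and n are their own complements.
COMPLEMENT = str.maketrans("acgtrymkswbdhvn", "tgcayrkmswvhdbn")


def rev(seq):
    """Reverse the sequence (like --rev)."""
    return seq[::-1]


def comp(seq):
    """Complement the sequence (like --comp)."""
    return seq.lower().translate(COMPLEMENT)


def revcomp(seq):
    """Reverse-complement the sequence (like --revcomp)."""
    return rev(comp(seq))
```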


                  
                  

 

RELATED PROGRAMS

In Ver 3, tacg incorporated Jim Knight's (jknight@guarneri.curagen.com) SEQIO library calls to provide automagic format conversion of incoming sequences. This also allows multiple sequences to be run at the same time, allowing tacg to scan databases.

Wu and Manber's agrep is an amazing piece of software for searching for multiple patterns with errors. While not optimized for molecular biology, it can be used to scan sequences. Jim Knight distributes a variant of it called grepseq with his SEQIO pkg, which IS molbio-aware, but not as generally useful (to me anyway) as tacg, as it only scans one strand and will only search up to 6 matches for some reason. However, I've started to incorporate the grepseq core into tacg. agrep is available via ftp://ftp.cs.arizona.edu/agrep/ or http://manber.com.
 The SEQIO pkg is distributed around the web.

You can also use the excellent paging utility less to move thru your sequence file and use its marking and piping facility to punt the sequence of interest to 'tacg'. In many terminal emulators it will also highlight matched search terms, and so makes an excellent way to scan the output for regions of interest. Many editors also allow piping a selection of text to an external program and inclusion of the result into another window (nedit, crisp, joe, the indefatigable emacs/xemacs and others).

Much of the output benefits from wider-than-normal printing. The '-w#' flag allows output up to about 230 characters wide, however to print this without wrapping, you need to use small fonts. A number of unix printing utilities allow you to do this, notably genscript: http://www.hut.fi/%7Emtr/genscript/index.html


   
    

EXAMPLES

Used alone:
tacg -f0 -n5 -T3,1 -sL -F3 -g 100,1000 <NewFile.Genbank >output.file
Translation: read sequence from NewFile.Genbank and analyze it as circular (-f0), with 5+ cutters (-n5), returning both site info and linear map (-sL) as well as sorted and unsorted fragment data (-F3), do 3 frame translation w/ 1 letter codes (-T3,1) on the linear map, and produce a pseudo gel diagram for those enzymes that pass the filtering, with a low cutoff of 100 bp and a high cutoff of 1000 bp (-g 100,1000), then write the output to output.file.

Matching matrices:

tacg -R yeast.matrices -# 85 -sSlc -w90 < yst_chr_4.seq >out
Translation: Search the sequence in yst_chr_4.seq for all the matrices described in the file yeast.matrices, applying a uniform cutoff of 85% (-# 85) of the maximum possible score, writing the summary, Sites, and ladder map, doubly-sorted (-sSlc) and printed 90 characters wide (-w90), to the file out.

Specifying patterns on the command-line

tacg -p Pit1,tatwcata,1 -p ap2,tgygcatw,1 -w90 -sSL < rprlPromo.seq > promo.map
Translation: search the sequence from the file rprlPromo.seq for the patterns labeled Pit1 and ap2, allowing 1 error each, and print the results (summary (-s), Sites (-S), and the Linear Map (-L)) 90 characters wide (-w90) to the file promo.map.

Used to search the entire yeast 500bp Upstream Regulatory sequences (a database of 6226 500 bp sequences) for matches to the MATa1 binding site (from TRANSFAC) :

tacg -R TRANSFAC.data -sScw1 -xMATa1 -#95 < utr5_sc_500.fasta > yeast.summary
Translation: translate each of the FASTA formatted entries in the input file utr5_sc_500.fasta into usable sequence, and after finding the MATa1 (-x MATa1) matrix description from the database (-R TRANSFAC.data), search the sequences for matches at 95% of the max score that it has in the TRANSFAC database (-# 95), returning the summary (-s) and the sites (-S) sorted in Strider order (-c) with results printed on 1 line (-w1), directing the output into the file yeast.summary.

 

BUGS and ODDITIES

Major

the inclusion of the seqio functions has caused an enormous increase in the compiled size of the executable to ~340kB (up from ~50kb before). If I get a lot of complaints about this, I'll look into stripping out the functions that I use from the SEQIO library, but I'd rather not as it does include a lot of (hidden) functionality that I plan to use later.
      
tacg v2.0 will not currently cut sequence shorter than 5 bases; if you need to analyze sequences shorter than this, perhaps you're using the wrong program.

main() and functions were originally written as single pass code, but with the help of Gray Watson's excellent (!) dmalloc malloc debugging library (available at http://www.dmalloc.com), I've recently put some effort into tracking memory leaks, especially since much of the code has to be re-entrant for doing analyses over many sequences. However, it's not completely leak-free yet, so user beware.

The command line handling has been completely re-written, using the getopt() and getopt_long() functions, so the flags are considerably less sensitive to spacing and order.

Translation in 6 frames assumes circular sequence regardless of '-f' flag, so that the last amino acids in frames 5 and 6 in the 1st output block are obviously incorrect if you are assuming linear sequence.

See the manual for other bugs which the author thinks are less problematic.

Harry Mangalam (hjm@tacgi.com)


 


This document was created by man2html, using the manual pages.
Time: 15:56:42 GMT, April 19, 2016