
This is the ReadMe file of Teiresias Version 0.9.1 RELEASED 3/13/2014

An online version of this file can be found at https://cm.jefferson.edu/data-tools-downloads/teiresias/teiresias-tutorial-and-version-info/

Contents:
---------
1. What is Teiresias
2. Glossary
3. Algorithm Guarantees
4. Parameters
5. File Formats
6. Execution examples and what to use when
7. How to download and run
8. Changelog
9. References



1. What is Teiresias
--------------------
The Teiresias algorithm is a combinatorial algorithm for the discovery of rigid patterns (motifs) in biological sequences. It is named after the Greek prophet Teiresias and was created in 1997 by Isidore Rigoutsos and Aris Floratos.[1] Teiresias should be used for the discovery of exact or more lenient similarities within a set of character sequences. In other words, Teiresias finds all patterns that exist in the input, treating each input character as a unit. It was initially developed to discover amino acid patterns in sequence data, but it can be used in any text mining setting.



2. Glossary
-----------
 * Pattern: The Teiresias algorithm uses regular expressions to define patterns. The patterns created by the algorithm are <L,W> patterns that have at least k instances in the input, where L ≤ W and L, W and k are positive integers. A pattern is called an <L,W> pattern if and only if any L consecutive literals or bracketed literals span at most W positions (i.e. there can be no more than W-L wildcards among them).

 * Characters - Literal/Wildcard/Bracket: The algorithm returns patterns that contain a) literals, that is characters, e.g. 'Y', b) wildcards, that is a dot '.' that represents any character of the input, and c) brackets, that is a set of literals in square brackets, e.g. [YW], any of which can appear at that position in the pattern's occurrences

 * Label line: A line of the input file that starts with '>'. Those lines can be used to include information about each sequence and are ignored by the algorithm. Any line that shouldn't be considered by the algorithm should start with '>'

 * Equivalent Literals: The literals that are on the same line in the equivalence file. Those literals will be considered interchangeable and will reside in the same bracket if needed

 * Offset List: A list of pairs of integer numbers. The first number is the sequence number, the second number is the position in the sequence. The list accompanies each pattern and contains the sequences and positions that the pattern occupies in the input. Please remember that the numbering of the sequences and of the positions starts at 0 and not at 1

 * Maximal pattern: Given a set of sequences S, a pattern P that appears k times in S is called maximal if and only if there exists no pattern P' which is more specific than P and also appears exactly k times in S
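
The glossary entries for patterns and offset lists can be illustrated with a short Python sketch. It is not part of Teiresias; the function name offset_list and the toy sequences are assumptions made for illustration. It relies on the fact that a pattern made of letters, dots and brackets is also a valid regular expression, and it reports 0-based (sequence, position) pairs as described above.

```python
import re

def offset_list(pattern, sequences):
    """Return 0-based (sequence_index, position) pairs for every occurrence.

    Hypothetical helper: assumes the pattern only contains letters,
    '.' wildcards and [..] brackets, so it can be used as a regex as-is.
    """
    # A lookahead lets finditer report overlapping occurrences too.
    regex = re.compile("(?=" + pattern + ")")
    return [(s, m.start())
            for s, seq in enumerate(sequences)
            for m in regex.finditer(seq)]

# Toy data, not taken from the sample input file.
sequences = ["AYKAWK", "KAYK"]
print(offset_list("A[YW]K", sequences))  # [(0, 0), (0, 3), (1, 1)]
```

The length of the returned list corresponds to the number of occurrences, and the number of distinct sequence indices corresponds to the number of sequences the pattern appears in.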



3. Algorithm Guarantees
-----------------------
The algorithm

 * Is able to produce all patterns that have a user-defined number of copies in the given input
 * Manages to be very efficient by avoiding the enumeration of the entire space
 * Reports motifs that are maximal in both length and composition
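
The maximality guarantee can be sketched in Python as follows. This is an illustration, not the Teiresias implementation. It assumes the usual reading of "more specific": a pattern P' is more specific than P if P' can be obtained from P by turning wildcards into literals and/or by extending P to the left or right. Brackets are ignored for simplicity, and the names is_more_specific and is_maximal are hypothetical.

```python
def is_more_specific(candidate, pattern):
    """True if `candidate` is more specific than `pattern` (no brackets)."""
    if candidate == pattern or len(candidate) < len(pattern):
        return False
    # Slide `pattern` along `candidate`; every literal of `pattern`
    # must be matched by the same literal in `candidate`.
    for shift in range(len(candidate) - len(pattern) + 1):
        window = candidate[shift:shift + len(pattern)]
        if all(p == "." or p == c for p, c in zip(pattern, window)):
            return True
    return False

def is_maximal(pattern, supports):
    """`supports` maps each known pattern to its number of occurrences."""
    return not any(is_more_specific(other, pattern)
                   and supports[other] == supports[pattern]
                   for other in supports)

# Supports as they appear in the sample outputs of this file.
supports = {"ASGK": 5, "F..ASGK": 5, "F.TASGK": 4, "F..ASGK.V": 4}
print(is_maximal("F..ASGK", supports))  # True
print(is_maximal("ASGK", supports))     # False
```

In this sketch "ASGK" is not maximal because "F..ASGK" is more specific and has the same support, while "F..ASGK" stays maximal because its more specific variants appear fewer times.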



4. Parameters
-------------
 * -l: REQUIRED. This is the minimum number of literals and/or brackets of the output patterns. Every output pattern will have length at least -l and will contain at least -l literals and/or brackets. It should always be followed by a number, e.g. -l4

 * -w: REQUIRED. This parameter controls the minimum literal and/or bracket density of the output patterns. In the worst case an output pattern will have length -w and -l literals and/or brackets. The rest w-l will be wildcards. It should always be at least equal to -l and followed by a number, e.g. -w6

 * -k: REQUIRED. This is the minimum support that any pattern can have. Every reported pattern will have at least -k appearances in the file. It should always be larger than or equal to 2 and followed by a number, e.g. -k5

 * -q: This is the maximum support that any pattern can have. Every reported pattern will have at most -q appearances in the file. By omitting it, it gets automatically set to "infinity". It should always be followed by a number, e.g. -q85

 * -v: This parameter controls how pattern occurrences are counted. If omitted, all occurrences are counted and taken into account for -k, regardless of the sequence they appear in. If present, each pattern is counted once per sequence. Set it when you want the number of sequences that contain a pattern; omit it when you want the number of times a pattern appears in the input file.

 * -i: REQUIRED. The name of the input file. Check below for the correct input file format. It should always be followed by a file name, e.g. -iInput.txt

 * -o: The name of the output file. If omitted the name will be "output.txt". Check below for the correct output file format. It should always be followed by a file name or omitted, e.g. -oOUT.txt

 * -b: The name of the equivalence file. If omitted no equivalences are considered. If added then all characters that are on the same line are considered equivalent and can be interchanged in patterns. Check below for the correct equivalence file format. It should always be followed by a file name, e.g. -bHomologies.txt

 * -n: The number of brackets a pattern can have. If it's not set then there is no bracket limit. Should always be followed by a number, e.g. -n10

 * -p: Prints the offset lists

 * -s: When this parameter is set, only patterns that satisfy the minimum requirements set by the user (elementary patterns) are returned

 * -c: This parameter controls the convolution length. In this version it is set to (-l)-1 always. It can still be used but will have no effect on the output of the algorithm

 * -u,-m,-d,-r: Those parameters were used in previous versions and are not included in the current one for the moment. They can still be used but will have no effect on the output of the algorithm
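
The constraints listed above can be checked before launching a run. The following Python sketch builds the argument list for one run; the function build_command and its keyword names are hypothetical and not part of Teiresias. It only uses the flags documented in this section and writes each value directly after its flag (e.g. -l4), as in the examples of this file.

```python
def build_command(l, w, k, input_file, output_file=None, q=None,
                  per_sequence=False, equivalence_file=None,
                  max_brackets=None, print_offsets=False, scan_only=False,
                  executable="./teiresias"):
    """Return the argument list for one Teiresias run (hypothetical helper)."""
    if l < 1 or w < l:
        raise ValueError("-l must be positive and -w at least equal to -l")
    if k < 2:
        raise ValueError("-k must be larger than or equal to 2")
    if max_brackets is not None and equivalence_file is None:
        raise ValueError("-n should be used together with -b")
    args = [executable, f"-l{l}", f"-w{w}", f"-k{k}", f"-i{input_file}"]
    if output_file is not None:
        args.append(f"-o{output_file}")
    if q is not None:
        args.append(f"-q{q}")
    if per_sequence:
        args.append("-v")
    if equivalence_file is not None:
        args.append(f"-b{equivalence_file}")
    if max_brackets is not None:
        args.append(f"-n{max_brackets}")
    if print_offsets:
        args.append("-p")
    if scan_only:
        args.append("-s")
    return args

print(" ".join(build_command(4, 6, 4, "SampleInputTJU.txt",
                             output_file="MyOutputFile.txt",
                             print_offsets=True)))
```

The returned list could be passed to subprocess.run(args, check=True) from the folder that contains the compiled executable.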



5. File Formats
---------------
Here you will find the formats of the following files:
a) Input File Format
b) Equivalence File Format
c) Output File Format

Please read the following instructions carefully. The correctness of your results depends on the structure of the input and equivalence files as well as on the usage of the parameters.

a) Input File Format
-----------------
The input file must contain all data you want processed. It is assumed that every input file contains sequences and sequence headers, although starting from version 0.9.1 the headers are not required. A header is a line that starts with ">". Please note that any line that starts with ">" will not be considered by the algorithm and every other line will.
A sequence consists of the lines that don't start with ">", spanning from the end of the previous header (or the beginning of the file) until a new header is found (or the end of the file). Unlike previous versions, the input file can contain new line characters within a sequence. Empty lines are skipped but spaces shouldn't appear in the sequence lines. The input must be in ASCII format (.txt files with rich text option turned off). A sample input file named "SampleInputTJU.txt", included in the downloadable files, is the following.

>sp|P38398|BRCA1_HUMAN Breast cancer type 1 susceptibility protein OS=Homo sapiens GN=BRCA1 PE=1 SV=2
MDLSALRVEEVQNVINAMQKILECPICLELIKEPVSTKCDHIFCKFCMLKLLNQKKGPSQ
CPLCKNDITKRSLQESTRFSQLVEELLKIICAFQLDTGLEYANSYNFAKKENNSPEHLKD
EVSIIQSMGYRNRAKRLLQSEPENPSLQETSLSVQLSNLGTVRTLRTKQRIQPQKTSVYI
ELGSDSSEDTVNKATYCSVGDQELLQITPQGTRDEISLDSAKKAACEFSETDVTNTEHHQ
PSNNDLNTTEKRAAERHPEKYQGSSVSNLHVEPCGTNTHASSLQHENSSLLLTKDRMNVE
KAEFCNKSKQPGLARSQHNRWAGSKETCNDRRTPSTEKKVDLNADPLCERKEWNKQKLPC
SENPRDTEDVPWITLNSSIQKVNEWFSRSDELLGSDDSHDGESESNAKVADVLDVLNEVD
EYSGSSEKIDLLASDPHEALICKSERVHSKSVESNIEDKIFGKTYRKKASLPNLSHVTEN
LIIGAFVTEPQIIQERPLTNKLKRKRRPTSGLHPEDFIKKADLAVQKTPEMINQGTNQTE
QNGQVMNITNSGHENKTKGDSIQNEKNPNPIESLEKESAFKTKAEPISSSISNMELELNI
HNSKAPKKNRLRRKSSTRHIHALELVVSRNLSPPNCTELQIDSCSSSEEIKKKKYNQMPV
RHSRNLQLMEGKEPATGAKKSNKPNEQTSKRHDSDTFPELKLTNAPGSFTKCSNTSELKE
FVNPSLPREEKEEKLETVKVSNNAEDPKDLMLSGERVLQTERSVESSSISLVPGTDYGTQ
ESISLLEVSTLGKAKTEPNKCVSQCAAFENPKGLIHGCSKDNRNDTEGFKYPLGHEVNHS
RETSIEMEESELDAQYLQNTFKVSKRQSFAPFSNPGNAEEECATFSAHSGSLKKQSPKVT
FECEQKEENQGKNESNIKPVQTVNITAGFPVVGQKDKPVDNAKCSIKGGSRFCLSSQFRG
NETGLITPNKHGLLQNPYRIPPLFPIKSFVKTKCKKNLLEENFEEHSMSPEREMGNENIP
STVSTISRNNIRENVFKEASSSNINEVGSSTNEVGSSINEIGSSDENIQAELGRNRGPKL
NAMLRLGVLQPEVYKQSLPGSNCKHPEIKKQEYEEVVQTVNTDFSPYLISDNLEQPMGSS
HASQVCSETPDDLLDDGEIKEDTSFAENDIKESSAVFSKSVQKGELSRSPSPFTHTHLAQ
GYRRGAKKLESSEENLSSEDEELPCFQHLLFGKVNNIPSQSTRHSTVATECLSKNTEENL
LSLKNSLNDCSNQVILAKASQEHHLSEETKCSASLFSSQCSELEDLTANTNTQDPFLIGS
SKQMRHQSESQGVGLSDKELVSDDEERGTGLEENNQEEQSMDSNLGEAASGCESETSVSE
DCSGLSSQSDILTTQQRDTMQHNLIKLQQEMAELEAVLEQHGSQPSNSYPSIISDSSALE
DLRNPEQSTSEKAVLTSQKSSEYPISQNPEGLSADKFEVSADSSTSKNKEPGVERSSPSK
CPSLDDRWYMHSCSGSLQNRNYPSQEELIKVVDVEEQQLEESGPHDLTETSYLPRQDLEG
TPYLESGISLFSDDPESDPSEDRAPESARVGNIPSSTSALKVPQLKVAESAQSPAAAHTT
DTAGYNAMEESVSREKPELTASTERVNKRMSMVVSGLTPEEFMLVYKFARKHHITLTNLI
TEETTHVVMKTDAEFVCERTLKYFLGIAGGKWVVSYFWVTQSIKERKMLNEHDFEVRGDV
VNGRNHQGPKRARESQDRKIFRGLEICCYGPFTNMPTDQLEWMVQLCGASVVKELSSFTL
GTGVHPIVVVQPDAWTEDNGFHAIGQMCEAPVVTREWVLDSVALYQCQELDTYLIPQIPH
SHY

>sp|P51587|BRCA2_HUMAN Breast cancer type 2 susceptibility protein OS=Homo sapiens GN=BRCA2 PE=1 SV=2
MPIGSKERPTFFEIFKTRCNKADLGPISLNWFEELSSEAPPYNSEPAEESEHKNNNYEPN
LFKTPQRKPSYNQLASTPIIFKEQGLTLPLYQSPVKELDKFKLDLGRNVPNSRHKSLRTV
KTKMDQADDVSCPLLNSCLSESPVVLQCTHVTPQRDKSVVCGSLFHTPKFVKGRQTPKHI
SESLGAEVDPDMSWSSSLATPPTLSSTVLIVRNEEASETVFPHDTTANVKSYFSNHDESL
KKNDRFIASVTDSENTNQREAASHGFGKTSGNSFKVNSCKDHIGKSMPNVLEDEVYETVV
DTSEEDSFSLCFSKCRTKNLQKVRTSKTRKKIFHEANADECEKSKNQVKEKYSFVSEVEP
NDTDPLDSNVAHQKPFESGSDKISKEVVPSLACEWSQLTLSGLNGAQMEKIPLLHISSCD
QNISEKDLLDTENKRKKDFLTSENSLPRISSLPKSEKPLNEETVVNKRDEEQHLESHTDC
ILAVKQAISGTSPVASSFQGIKKSIFRIRESPKETFNASFSGHMTDPNFKKETEASESGL
EIHTVCSQKEDSLCPNLIDNGSWPATTTQNSVALKNAGLISTLKKKTNKFIYAIHDETSY
KGKKIPKDQKSELINCSAQFEANAFEAPLTFANADSGLLHSSVKRSCSQNDSEEPTLSLT
SSFGTILRKCSRNETCSNNTVISQDLDYKEAKCNKEKLQLFITPEADSLSCLQEGQCEND
PKSKKVSDIKEEVLAAACHPVQHSKVEYSDTDFQSQKSLLYDHENASTLILTPTSKDVLS
NLVMISRGKESYKMSDKLKGNNYESDVELTKNIPMEKNQDVCALNENYKNVELLPPEKYM
RVASPSRKVQFNQNTNLRVIQKNQEETTSISKITVNPDSEELFSDNENNFVFQVANERNN
LALGNTKELHETDLTCVNEPIFKNSTMVLYGDTGDKQATQVSIKKDLVYVLAEENKNSVK
QHIKMTLGQDLKSDISLNIDKIPEKNNDYMNKWAGLLGPISNHSFGGSFRTASNKEIKLS
EHNIKKSKMFFKDIEEQYPTSLACVEIVNTLALDNQKKLSKPQSINTVSAHLQSSVVVSD
CKNSHITPQMLFSKQDFNSNHNLTPSQKAEITELSTILEESGSQFEFTQFRKPSYILQKS
TFEVPENQMTILKTTSEECRDADLHVIMNAPSIGQVDSSKQFEGTVEIKRKFAGLLKNDC
NKSASGYLTDENEVGFRGFYSAHGTKLNVSTEALQKAVKLFSDIENISEETSAEVHPISL
SSSKCHDSVVSMFKIENHNDKTVSEKNNKCQLILQNNIEMTTGTFVEEITENYKRNTENE
DNKYTAASRNSHNLEFDGSDSSKNDTVCIHKDETDLLFTDQHNICLKLSGQFMKEGNTQI
KEDLSDLTFLEVAKAQEACHGNTSNKEQLTATKTEQNIKDFETSDTFFQTASGKNISVAK
ESFNKIVNFFDQKPEELHNFSLNSELHSDIRKNKMDILSYEETDIVKHKILKESVPVGTG
NQLVTFQGQPERDEKIKEPTLLGFHTASGKKVKIAKESLDKVKNLFDEKEQGTSEITSFS
HQWAKTLKYREACKDLELACETIEITAAPKCKEMQNSLNNDKNLVSIETVVPPKLLSDNL
CRQTENLKTSKSIFLKVKVHENVEKETAKSPATCYTNQSPYSVIENSALAFYTSCSRKTS
VSQTSLLEAKKWLREGIFDGQPERINTADYVGNYLYENNSNSTIAENDKNHLSEKQDTYL
SNSSMSNSYSYHSDEVYNDSGYLSKNKLDSGIEPVLKNVEDQKNTSFSKVISNVKDANAY
PQTVNEDICVEELVTSSSPCKNKNAAIKLSISNSNNFEVGPPAFRIASGKIVCVSHETIK
KVKDIFTDSFSKVIKENNENKSKICQTKIMAGCYEALDDSEDILHNSLDNDECSTHSHKV
FADIQSEEILQHNQNMSGLEKVSKISPCDVSLETSDICKCSIGKLHKSVSSANTCGIFST
ASGKSVQVSDASLQNARQVFSEIEDSTKQVFSKVLFKSNEHSDQLTREENTAIRTPEHLI
SQKGFSYNVVNSSAFSGFSTASGKQVSILESSLHKVKGVLEEFDLIRTEHSLHYSPTSRQ
NVSKILPRVDKRNPEHCVNSEMEKTCSKEFKLSNNLNVEGGSSENNHSIKVSPYLSQFQQ
DKQQLVLGTKVSLVENIHVLGKEQASPKNVKMEIGKTETFSDVPVKTNIEVCSTYSKDSE
NYFETEAVEIAKAFMEDDELTDSKLPSHATHSLFTCPENEEMVLSNSRIGKRRGEPLILV
GEPSIKRNLLNEFDRIIENQEKSLKASKSTPDGTIKDRRLFMHHVSLEPITCVPFRTTKE
RQEIQNPNFTAPGQEFLSKSHLYEHLTLEKSSSNLAVSGHPFYQVSATRNEKMRHLITTG
RPTKVFVPPFKTKSHFHRVEQCVRNINLEENRQKQNIDGHGSDDSKNKINDNEIHQFNKN
NSNQAAAVTFTKCEEEPLDLITSLQNARDIQDMRIKKKQRQRVFPQPGSLYLAKTSTLPR
ISLKAAVGGQVPSACSHKQLYTYGVSKHCIKINSKNAESFQFHTEDYFGKESLWTGKGIQ
LADGGWLIPSNDGKAGKEEFYRALCDTPGVDPKLISRIWVYNHYRWIIWKLAAMECAFPK
EFANRCLSPERVLLQLKYRYDTEIDRSRRSAIKKIMERDDTAAKTLVLCVSDIISLSANI
SETSSNKTSSADTQKVAIIELTDGWYAVKAQLDPPLLAVLKNGRLTVGQKIILHGAELVG
SPDACTPLEAPESLMLKISANSTRPARWYTKLGFFPDPRPFPLPLSSLFSDGGNVGCVDV
IIQRAYPIQWMEKTSSGLYIFRNEREEEKEAAKYVEAQQKRLEALFTKIQEEFEEHEENT
TKPYLPSRALTRQQVRALQDGAELYEAVKNAADPAYLEGYFSEEQLRALNNHRQMLNDKK
QAQIQLEIRKAMESAEQKEQGLSRDVTTVWKLRIVSYSKKEKDSVILSIWRPSSDLYSLL
TEGKRYRIYHLATSKSKSKSERANIQLAATKKTQYQQLPVSDEILFQIYQPREPLHFSKF
LDPDFQPSCSEVDLIGFVVSVVKKTGLAPFVYLSDECYNLLAIKFWIDLNEDIIKPHMLI
AASNLQWRPESKSGLLTLFAGDFSVFSASPKEGHFQETFNKMKNTVENIDILCNEAENKL
MHILHANDPKWSTPTKDCTSGPYTAQIIPGTGNKLLMSSPNCEIYYQSPLSLCMAKRKSV
STPVSAQMTSKSCKGEKEIDDQKNCKKRRALDFLSRLPLPPPVSPICTFVSPAAQKAFQP
PRSCGTKYETPIKKKELNSPQMTPFKKFNEISLLESNSIADEELALINTQALLSGSTGEK
QFISVSESTRTAPTSSEDYLRLKRRCTTSLIKEQESSQASTEECEKNKQDTITTKKYI
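
The input rules above can be summarized with a small Python reader. It is a sketch of how such a file could be parsed, not the parser used by Teiresias, and the name read_sequences is hypothetical. One assumption is made loudly here: consecutive non-header lines are joined into one sequence, so a file without any header yields a single sequence.

```python
def read_sequences(lines):
    """Group non-header lines into sequences (hypothetical helper)."""
    sequences, current = [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue                      # empty lines are skipped
        if line.startswith(">"):          # header: closes the current sequence
            if current:
                sequences.append("".join(current))
                current = []
            continue
        if " " in line:
            raise ValueError("spaces shouldn't appear in sequence lines")
        current.append(line)              # a sequence may span several lines
    if current:
        sequences.append("".join(current))
    return sequences

sample = [">first header\n", "MDLSAL\n", "RVEEVQ\n", "\n",
          ">second header\n", "MPIGSK\n"]
print(read_sequences(sample))  # ['MDLSALRVEEVQ', 'MPIGSK']
```

The position of a sequence in the returned list is its 0-based sequence number, which matches the numbering used in the offset lists.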


b) Equivalence File Format
--------------------------
The equivalence file has a very simple format. Every line is considered by the algorithm and all characters on a line are considered equivalent. Each character can appear in multiple lines but only once per line. Each line must be unique. The file must be in ASCII format (.txt files with rich text option turned off). A sample equivalence file named "SampleEquivalencesTJU.txt", included in the downloadable files, is the following.

YW
STY
ILV
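
The rules for the equivalence file can be sketched in Python as follows. The helper names read_equivalences and are_equivalent are hypothetical. Two assumptions are made for illustration: empty lines are ignored, and equivalence is not carried across lines (in the sample file, 'Y' shares a line with 'W' and another with 'S' and 'T', but 'W' and 'S' never share a line).

```python
def read_equivalences(lines):
    """Return one frozenset of characters per line (hypothetical helper)."""
    groups = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # assumption: empty lines are ignored
        if len(set(line)) != len(line):
            raise ValueError(f"character repeated on line {line!r}")
        group = frozenset(line)
        if group in groups:
            raise ValueError(f"line {line!r} is not unique")
        groups.append(group)
    return groups

def are_equivalent(a, b, groups):
    """Two characters are interchangeable if some line contains both."""
    return a == b or any(a in g and b in g for g in groups)

groups = read_equivalences(["YW", "STY", "ILV"])
print(are_equivalent("S", "Y", groups))  # True
print(are_equivalent("W", "S", groups))  # False
```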


c) Output File Format
--------------------
The output is automatically formatted by the algorithm. It consists of a header (5 lines in the example below) followed by the results, one pattern per line, in 3-4 columns. The columns are as follows:

 * Number of pattern occurrences in the entire file
 * Number of sequences the pattern appears in
 * The pattern
 * The list of positions where the pattern appears. The first number of each pair denotes the sequence and the second the position (this column appears only if -p is set)

For example, if we run the algorithm with the input given above and the following parameters

 ./teiresias -l4 -w6 -k4 -iSampleInputTJU.txt -oMyOutputFile.txt -p

we get the following output.

##########################################################
#                                                        #
#                       FINAL RESULTS                    #
#                                                        #
##########################################################
5	1	F..ASGK 1 1427 1 1523 1 1843 1 1977 1 2057
5	1	AS.K.V 1 1526 1 1846 1 1980 1 2060 1 2184
5	1	F.TAS.K 1 1008 1 1427 1 1523 1 1977 1 2057
5	2	A.GK.V 0 1707 1 1526 1 1846 1 1980 1 2060
4	1	F..ASGK.V 1 1523 1 1843 1 1977 1 2057
4	1	F.TASGK 1 1427 1 1523 1 1977 1 2057
4	2	E.GSS 0 1045 0 1052 0 1059 1 2138
4	2	LFSD 0 1569 1 881 1 1239 1 2807
4	2	EL.SD 0 180 0 1338 1 880 1 1464
4	2	GSD.S 0 182 0 393 1 1337 1 2440
4	2	SL..AK 0 216 1 1684 1 2508 1 3230
4	2	IKE..S 0 30 0 1158 1 1379 1 3390
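
A result line in the format shown above can be split back into its columns with a few lines of Python. This is a sketch under the assumption that the columns are separated by whitespace, as in the example output; the name parse_result_line is hypothetical.

```python
def parse_result_line(line):
    """Split one result line into (occurrences, sequences, pattern, offsets)."""
    fields = line.split()
    occurrences, n_sequences, pattern = int(fields[0]), int(fields[1]), fields[2]
    numbers = [int(x) for x in fields[3:]]       # present only when -p was used
    # Pair the numbers up as (sequence, position); both are 0-based.
    offsets = list(zip(numbers[0::2], numbers[1::2]))
    return occurrences, n_sequences, pattern, offsets

line = "4\t2\tE.GSS 0 1045 0 1052 0 1059 1 2138"
occurrences, n_sequences, pattern, offsets = parse_result_line(line)
print(pattern, offsets)
# Consistency checks between the columns of this line.
assert len(offsets) == occurrences
assert len({seq for seq, _ in offsets}) == n_sequences
```

Header lines start with '#', so they can be skipped before calling the function.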



6. Execution examples and what to use when
------------------------------------------
Every time you run the algorithm you have to specify values for -l, -w, -k and -i. All the other parameters can be used independently and in any combination. The order of the parameters doesn't matter. When setting your parameters, think about what you would like your results to look like and choose accordingly. Below we provide a list of possible desired results and the recommended parameters. Please keep in mind that even more combinations are possible! When unsure about what to use or how to expect the data to behave, it might be beneficial to set a parameter very high or very low and then gradually tune it until you get the results you are looking for.
In the text below you will find:
a) How to use -i and -o
b) How to use -l and -w
c) How to use -p
d) How to use -s
e) How to use -k, -q and -v
f) How to use -b and -n


a) How to use -i and -o.
------------------------
Both parameters should always be followed by a file name. The input file parameter, -i, is required and the output file parameter, -o, is optional. If -o is not given, the name of the output file will be "output.txt". If you intend to run Teiresias multiple times, you should provide a different output file name each time; otherwise output.txt will be created once and overwritten every time you run the program. A usage example is the following:
./teiresias -l2 -w6 -k2 -p -iSampleInputTJU.txt -oOutputBRCA1_BRCA2_HUMAN.txt


b) How to use -l and -w.
------------------------
The parameter -l controls the minimum number of literals in any pattern, which is also the minimum length. The parameter -w controls the maximum number of positions that any -l consecutive (but not necessarily contiguous) literals can span, so at most (-w)-(-l) of those positions can be wildcards. It follows that -w must be larger than or equal to -l and that -l/-w is the minimum literal density.

 * If you are interested in seeing only exact patterns, in other words -l/-w = 1, you should set both -l and -w to the minimum length you wish your shortest pattern to have. For example

./teiresias -l6 -w6 -k2 -iSampleInputTJU.txt

returns patterns that have length at least 6, contain literals at all positions and appear at least 2 times.

##########################################################
#                                                        #
#                       FINAL RESULTS                    #
#                                                        #
##########################################################
2	1	NEVGSS
2	1	FSTASGK
2	1	SFSKVI
2	1	SLQNAR

 * If you are interested in patterns with wildcards, then you should set -w larger than -l. If -w = -l + 2, for example, then each produced pattern will have at most two wildcards among every -l consecutive (but not necessarily contiguous) literals. For example

./teiresias -l4 -w6 -k4 -iSampleInputTJU.txt

returns all patterns with length at least 4 that can have (-w)-(-l) = 2 wildcards ('.') among every 4 consecutive (but not necessarily contiguous) literals and appear at least 4 times in the input file.

##########################################################
#                                                        #
#                       FINAL RESULTS                    #
#                                                        #
##########################################################
5	1	F..ASGK
5	1	AS.K.V
5	1	F.TAS.K
5	2	A.GK.V
4	1	F..ASGK.V
4	1	F.TASGK
4	2	E.GSS
4	2	LFSD
4	2	EL.SD
4	2	GSD.S
4	2	SL..AK
4	2	IKE..S
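
The relation between -l and -w can be verified on any reported pattern. The Python sketch below checks the <L,W> condition from the glossary: every L consecutive literals or brackets must span at most W positions. The function names are hypothetical and the code is only an illustration of the definition.

```python
import re

def tokenize(pattern):
    """Split a pattern into positions; a bracket counts as one position."""
    return re.findall(r"\[[^\]]+\]|.", pattern)

def is_lw_pattern(pattern, l, w):
    """True if every l consecutive literals/brackets span at most w positions."""
    tokens = tokenize(pattern)
    solid = [i for i, t in enumerate(tokens) if t != "."]   # non-wildcard positions
    if len(solid) < l:
        return False
    return all(solid[i + l - 1] - solid[i] + 1 <= w
               for i in range(len(solid) - l + 1))

print(is_lw_pattern("F..ASGK.V", 4, 6))  # True
print(is_lw_pattern("SL..AK", 4, 4))     # False
```

For example, "SL..AK" has its 4 literals spread over 6 positions, so it satisfies -l4 -w6 but not -l4 -w4.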


c) How to use -p.
-----------------
This parameter is quite straightforward. If set, the positions at which each pattern appears in the file will be added to the output. For example

./teiresias -l6 -w6 -k2 -iSampleInputTJU.txt -p

returns the positions of all appearances after each pattern.

##########################################################
#                                                        #
#                       FINAL RESULTS                    #
#                                                        #
##########################################################
2	1	NEVGSS 0 1044 0 1051
2	1	FSTASGK 1 1977 1 2057
2	1	SFSKVI 1 1785 1 1868
2	1	SLQNAR 1 1991 1 2482


d) How to use -s.
-----------------
The algorithm is executed in two stages: the scanning phase and the convolution phase. During the scanning phase all patterns that satisfy the minimum requirements (the "elementary" patterns) are found, and during the convolution phase the elementary patterns are combined and maximized. When -s is used the algorithm stops after the scanning phase. Although the output patterns are not maximized when this parameter is used, the results are returned much faster and are length-specific, as seen below.
If you want to find only the patterns with a specific number of literals you should set -l to that number and use -s. For example:
./teiresias -l5 -w6 -k2 -iSampleInputTJU.txt -s
returns all patterns with exactly 5 literals. If you also want them to have length exactly 5, you should set both -l and -w to 5.

##########################################################
#                                                        #
#                       FINAL RESULTS                    #
#                                                        #
##########################################################
2	1	SKSKS
2	1	LQNAR
2	1	SLQNA
2	1	STASG
2	1	FSTAS
2	1	FSKVI
2	1	SFSKV
2	1	GQPER
4	1	TASGK
2	1	LPRIS
2	1	KEQGL
2	1	RKPSY
2	1	LGPIS
2	2	SLFSD
2	2	LEESG
2	1	EVGSS
2	1	NEVGS
2	1	VQTVN
2	2	ISLLE
2	1	SSSIS
2	2	GSDDS
2	2	GSDSS
2	1	MEKT.S
2	1	SLQN.R
2	1	STAS.K
2	1	FSTA.G
2	1	VSKI.P
2	1	SFSK.I
2	1	KEQG.S
4	1	ASGK.V
2	1	GTGN.L
2	1	LFSD.E
2	1	ETEA.E
2	1	ECEK.K
2	1	EANA.E
2	1	PTLS.T
2	2	LEEN.Q
2	1	EENL.S
2	2	IKED.S
2	2	QTVN.D
2	1	VGSS.N
2	1	NEVG.S
2	1	LGSD.S
2	1	SLQ.AR
2	1	STA.GK
2	1	FST.SG
2	1	SFS.VI
2	1	SDE.YN
2	1	EKE.AK
2	1	NSL.ND
2	1	GSD.SK
2	1	SLN.DK
2	1	LAL.NT
2	1	FSD.EN
2	1	SDT.FQ
2	1	KKV.DI
2	1	KDF.TS
2	2	EEQ.LE
2	2	KNT.EN
2	1	GSS.NE
2	1	NEV.SS
2	1	INE.GS
2	2	ENV.KE
2	2	SPK.TF
2	1	SGL.PE
2	2	EKI.LL
2	2	KKE.NS
2	1	SL.NAR
2	1	ST.SGK
2	1	FS.ASG
2	1	SF.KVI
2	1	LK.SKS
2	1	SL.KVK
2	1	GF.TAS
2	1	KI.KES
2	1	QL.ATK
2	1	SD.SKN
2	1	SV.VSD
2	1	KE.KLS
2	1	SQ.SLL
2	2	TS.LPR
2	2	EQ.TSE
2	2	DK.LVS
3	1	NE.GSS
2	2	LS.ERV
2	2	SK.HDS
2	1	SD.SED
2	2	KE.NSP
2	1	S.QNAR
2	1	S.ASGK
2	1	S.SKVI
2	1	L.TSKS
2	1	F.GQPE
4	1	F.TASG
2	1	K.AGLL
2	1	K.NNYE
2	2	S.VKEL
2	2	D.RNPE
2	1	S.LEDL
2	1	V.NIPS
2	2	E.LSSE
2	2	L.SDNL
2	1	N.VGSS
2	2	E.SSSN
2	2	E.FEEH
2	2	S.ERVL
2	1	V.SKSV
2	2	L.NQKK
2	2	E.IKEP


e) How to use -k, -q and -v.
----------------------------
The parameter -k controls the minimum number of appearances a pattern can have. The parameter -q controls the maximum. The parameter -v controls how the appearances are counted. If omitted, every appearance is counted, but if included each pattern is counted once per sequence. There are many interesting combinations among those parameters.
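
The interplay of these three parameters can be sketched in Python as a filter over an offset list. The function passes_support is hypothetical and is not the Teiresias implementation. It assumes that, when -v is set, the per-sequence count is compared against both -k and -q; this file only states that explicitly for -k.

```python
def passes_support(offsets, k, q=None, per_sequence=False):
    """Decide whether a pattern with this offset list would be reported."""
    if per_sequence:
        # With -v: count each sequence once, however often the pattern occurs.
        support = len({seq for seq, _ in offsets})
    else:
        # Without -v: count every occurrence.
        support = len(offsets)
    return support >= k and (q is None or support <= q)

# Offset list of the pattern E.GSS from the sample output with -p.
offsets = [(0, 1045), (0, 1052), (0, 1059), (1, 2138)]
print(passes_support(offsets, k=4))                     # True
print(passes_support(offsets, k=4, per_sequence=True))  # False
```

The pattern occurs 4 times but in only 2 sequences, so it satisfies -k4 without -v and fails it with -v.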

 * If you are looking for patterns that appear at least -k times and/or no more than -q times, set -k and -q accordingly. For example:

./teiresias -l3 -w3 -k5 -q5 -iSampleInputTJU.txt

returns all exact patterns that span at least 3 positions and appear exactly 5 times.

##########################################################
#                                                        #
#                       FINAL RESULTS                    #
#                                                        #
##########################################################
5	2	TVN
5	2	SLK
5	2	LPR
5	2	VEE
5	2	NVE
5	2	GLE
5	2	KTK
5	2	SEN
5	2	SNN
5	2	GKE
5	2	KEL
5	2	LSN
5	2	LES
5	2	ENI
5	2	GLL
5	2	GSD
5	2	QSP
5	2	SCS
5	2	TKC
5	2	SET
5	2	ENK
5	2	ESS
5	2	KKK
5	2	SDE
5	2	EKN
5	2	PTS
5	2	VST
5	2	TSL
5	2	STR
5	1	KVK
5	1	ASGK
5	1	RAL
5	1	KLS
5	2	ESG
5	2	SST
5	1	SFS
5	2	SLP
5	1	DSE
5	2	EES
5	2	SLN
5	2	EIK
5	2	SQK
5	2	PIS
5	2	SRN
5	2	SEL
5	2	ETV
5	2	TEN
5	2	DSS
5	2	FSD
5	2	SIS
5	2	KCS
5	2	SVV
5	2	SIK
5	2	LKN
5	2	SLF
5	2	SSA
5	2	EEQ
5	2	AKK

 * If you want to find patterns that exist in a specific number of sequences (or all) then set -v and -k accordingly. For example

./teiresias -l4 -w4 -k2 -v -iSampleInputTJU.txt

returns all patterns with length 4 or more and no wildcards that appear in both sequences of the sample input file.

##########################################################
#                                                        #
#                       FINAL RESULTS                    #
#                                                        #
##########################################################
4	2	LFSD
3	2	NEVG
3	2	QTVN
3	2	SKNK
3	2	SLQN
3	2	SDSS
3	2	LEEN
3	2	SLFS
3	2	SLLE
3	2	IKKK
3	2	SSED
2	2	LSNL
2	2	PEKY
2	2	AEND
2	2	LSKN
2	2	SDNL
2	2	HLSE
2	2	NSLN
2	2	SNSY
2	2	TSVS
2	2	RNPE
2	2	SDTF
2	2	KSVQ
2	2	VFSK
2	2	LESS
2	2	TEQN
2	2	ITPQ
2	2	CNKS
2	2	LEESG
2	2	GSKE
2	2	SLPR
2	2	FGKT
2	2	SSKQ
2	2	TLKY
2	2	DTYL
2	2	SEET
2	2	IKED
2	2	SPAA
2	2	GSSE
2	2	VEEL
2	2	ASTE
2	2	FTKC
2	2	KENN
2	2	VGQK
2	2	FKTK
2	2	FEEH
2	2	ERVL
2	2	PEHL
2	2	SVSE
2	2	GSDDS
2	2	EQKE
2	2	EEKE
2	2	DEEL
2	2	ISLLE
2	2	KSER
2	2	EISL
2	2	ESTR
2	2	LIKE
2	2	NQKK
2	2	KCSI
2	2	SPYL
2	2	KVSK
2	2	TENL
2	2	GSDSS
2	2	APES
2	2	SLFSD
2	2	SISN
2	2	SEEI
2	2	SVVK
2	2	SETS
2	2	SSSN
2	2	NVEK
2	2	IKEP
2	2	KTSV
2	2	NKSK
2	2	ENNS
2	2	SPER
2	2	ELSS
2	2	VKEL
2	2	VKTK
2	2	EETT
2	2	EESE
2	2	DLLD
2	2	SQKS
2	2	SLKK
2	2	VHPI
2	2	KADL
2	2	NQEE
2	2	DIKE
2	2	SVAL
2	2	GLEI
2	2	VNKR
2	2	LSSE
2	2	EAAS
2	2	LGRN
2	2	ETSY


f) How to use -b and -n.
------------------------
The parameter -n controls the number of bracketed literals; it should always be used with -b. These parameters are used when some freedom of replacement is given to the literals but the replacements can only occur within specific groups. Of course they can also be combined with wildcards. The result will be patterns with some specific literals, some literals that can only be chosen among the bracketed ones and some wildcards, i.e. positions that can be anything. Keep in mind that unlike -w, which limits the number of '.' among every -l consecutive (but not necessarily contiguous) literals, -n limits the number of brackets per pattern. If -n is not set but a file is provided using -b, then -n is set to "infinity". The brackets are trimmed so that only literals that actually appear at that position in the pattern's occurrences in the input remain.
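
Bracket trimming and the -n limit can be illustrated with a short Python sketch. It is not the Teiresias implementation: it assumes that all occurrences of a pattern have the same length and that the pattern contains no wildcards, and the names trimmed_pattern and bracket_count are hypothetical.

```python
def trimmed_pattern(occurrences):
    """Build a pattern whose brackets hold only the letters actually seen."""
    out = []
    for column in zip(*occurrences):        # one column per pattern position
        letters = sorted(set(column))
        if len(letters) == 1:
            out.append(letters[0])           # a single letter stays a literal
        else:
            out.append("[" + "".join(letters) + "]")
    return "".join(out)

def bracket_count(pattern):
    """Number of brackets in a pattern, to be compared with -n."""
    return pattern.count("[")

# Toy occurrences, not taken from the sample input file.
pattern = trimmed_pattern(["LFSD", "IFTD", "LFTD"])
print(pattern)                 # [IL]F[ST]D
print(bracket_count(pattern))  # 2
```

In this toy case 'V' never occurs at the first position, so the bracket is [IL] rather than [ILV], and the pattern would be accepted with -n2 but not with -n1.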

 * If you don't want to use wildcards but would still like some positions to be non specific, you should use -b. For example

./teiresias -l4 -w4 -k5 -iSampleInputTJU.txt -bSampleEquivalencesTJU.txt

returns

##########################################################
#                                                        #
#                       FINAL RESULTS                    #
#                                                        #
##########################################################
11	2	[STY][STY][ILV][ILV]
11	2	[ST][STY][ILV][SY]
10	2	[ST][STY][ILV]S
9	2	[ILV][STY][ST][ILV]
9	2	[ILV][ILV][ST][STY]
8	2	[IV][SY][ILV][ILV]
8	2	[ST][ST][STY][ILV]
8	2	[ILV][STY][ILV][ST]
8	2	[ST]S[ILV][SY]
7	2	[ST]S[ILV]S
7	2	[ILV][ILV][STY][ILV]
7	2	[ILV]K[ILV][ST]
7	2	[ILV][STY][STY][STY]
7	2	[ST][ILV][ST][ST]
7	2	[ST][ILV][ILV]E
7	2	[IV]S[ILV][ILV]
6	2	[ILV][ILV][SY]D
6	2	S[STY][ILV][ILV]
6	1	[ILV][STY][ILV]S
6	2	[IL]F[ST]D
6	2	D[ILV][ILV][STY]
6	2	[ILV][STY]L[ST]
6	2	T[SY][ILV][SY]
6	2	[ILV][STY]E[ST]
6	2	EE[ILV][ILV]
6	2	[STY][ILV][ILV][STY]
6	2	[ST]S[STY][ILV]
6	2	[ILV]S[ST][IL]
6	2	[ILV]K[ILV]S
6	2	[ILV][STY][ST]L
6	2	K[ST][ST][ST]
6	2	[ST][ILV][SY][ILV]
6	2	E[ILV][STY][STY]
6	2	S[ILV][ILV]E
6	2	[ST][LV][ILV][ILV]
6	2	[STY][STY][ILV]L
6	2	[ST][STY]EE
6	2	[ILV]D[ILV][IL]
6	2	[STY][STY]L[IL]
6	2	[STY]S[LV][ILV]
5	2	[ILV][ILV]KE
5	2	V[SY][ILV][ILV]
5	2	[STY][ILV][ST]N
5	2	TS[ILV][SY]
5	2	[ST][ST]V[SY]
5	2	[SY][ILV]E[ST]
5	2	[ST][STY][SY]K
5	2	[SY]G[ILV][STY]
5	2	[LV]E[STY]S
5	2	E[ST]D[ILV]
5	2	LF[ST]D
5	2	[ILV][ST]S[ST]
5	2	[ILV][ILV]T[ST]
5	2	SN[ST][STY]
5	2	[LV][ILV][ILV]Q
5	2	[ST][IL]KK
5	1	[IL]S[ILV][ST]
5	2	[ST][STY][STY][SY]
5	2	[ILV][ILV]SD
5	2	[LV][ST][STY]E
5	1	[ST]ASG
5	1	S[ST][IV]S
5	1	ASGK
5	2	[ST][ST]S[IL]
5	2	[ST]D[ILV][STY]
5	2	[LV][ILV][LV][ST]
5	2	[ST][IL]LE
5	2	[ILV][ST][SY]S
5	2	S[ILV][SY][ILV]
5	2	[IL][STY][STY]Q
5	2	K[ST]S[ST]
5	2	[ST][LV]S[ST]
5	2	[LV][STY][IL]E
5	2	[ILV][ST]T[ILV]
5	2	DE[ILV][STY]
5	2	[ILV][ST]N[ST]
5	2	E[ST][ST][SY]
5	2	[SY][IL]QN
5	2	H[ILV][STY]E
5	2	S[LV]F[ST]
5	2	[ST]G[IL]E
5	2	[ILV][ILV]S[ILV]
5	2	[STY]P[STY][ST]
5	2	[ILV][ST]N[LV]
5	2	T[SY][ILV]S

 * The number of brackets per pattern can be restricted by using -n. For example

./teiresias -l4 -w4 -k5 -n2 -iSampleInputTJU.txt -bSampleEquivalencesTJU.txt

returns

##########################################################
#                                                        #
#                       FINAL RESULTS                    #
#                                                        #
##########################################################
7	2	[ST]S[ILV]S
6	2	[ST][STY]EE
6	2	S[ILV][ILV]E
6	2	[ILV]K[ILV]S
6	2	EE[ILV][ILV]
6	2	[IL]F[ST]D
5	2	[ILV][ILV]KE
5	2	[ST][IL]LE
5	2	K[ST]S[ST]
5	2	DE[ILV][STY]
5	2	[SY][IL]QN
5	2	H[ILV][STY]E
5	2	S[LV]F[ST]
5	2	[ST]G[IL]E
5	2	T[SY][ILV]S
5	2	TS[ILV][SY]
5	2	[LV]E[STY]S
5	2	E[ST]D[ILV]
5	2	LF[ST]D
5	2	SN[ST][STY]
5	2	[ST][IL]KK
5	2	[ILV][ILV]SD
5	1	[ST]ASG
5	1	S[ST][IV]S
5	1	ASGK



7. How to download and run
--------------------------
a) Download the compressed file from here.
b) Un-compress the files and move them to a folder on your desktop.
c) Open a terminal window.
d) Type "cd Desktop/name-of-folder/" where name-of-folder is the name of the folder containing the files. Press enter.
e) Type "make". Press enter.
f) Type "./teiresias -l4 -w6 -k4 -iSampleInputTJU.txt -p" or any other parameters you want and press enter.
g) Open the output file created in the same folder to see the results!



8. Changelog
------------
v0.9.1 RELEASED 3/13/2014

The input file can contain empty lines
The input file's headers can contain anything (must still start with >)
The input file's sequences can have end-of-line characters among their characters



9. References
-------------
[1] Rigoutsos, I, Floratos, A (1998) Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics 14: 55-67
