README for Threshold-seq Code
Rogan Magee, rgm005@jefferson.edu
Current: 20 March, 2017 

This text file serves as the README for the Threshold-seq R script.

For any questions or issues with this code, please email Rogan Magee at rgm005@jefferson.edu. 

Index:

1. Description
2. Installation
3. R Script Usage
4. R Command Line Script Usage
5. Terms of Use
6. Change Logs

———

1. Description:

Threshold-seq is a computational method for determining the read threshold for use with short RNA sequencing data. The threshold given as output by this method represents the raw read level at or above which sequences from a short RNA sequencing dataset should be used. Threshold-seq uses statistical sampling methods to determine the read level at which read counts are thought to be significant, and therefore important.

——-
———

2. Installation

The Threshold-seq algorithm is distributed as an R script (‘Threshold-seq.r’) and an R command line script (‘cmd.Threshold-seq.r’). No installation of the code itself is required. However, you must have the latest version of R installed to use the code. RStudio may be used as a front end, but it requires R to be installed as well.

To install R, please visit: https://cran.r-project.org/. RStudio, if you wish to use it, is distributed separately from CRAN.

Once R is installed, no further setup is needed. To use the Threshold-seq code, please open R.

——-
———

3. R Script Usage:

	Example R code:
	
	source("Threshold-seq.r")
	data <- as.numeric(readLines("exampleData.txt"))  # convert the lines read to a numeric vector
	myVariable <- thresholdSeq(data, "myOutputFile.txt")

To use Threshold-seq in R:

1) load the script into R using the source() command (line 1 above)
2) gather read count data from a short RNA sequencing dataset into a numeric vector (line 2 above). This numeric vector should contain the count for each unique sequence in your dataset, after quality and adaptor trimming have been done. OPTIONAL: use the makeCountSequencePairs() function with a vector of sequence identifiers to generate this numeric vector.
3) pass the vector from step 2 as the first argument to thresholdSeq(), together with a filename as the second argument (line 3 above)
4) apply the reported numeric threshold by keeping only those sequences from your short RNA sequencing dataset whose read count is at the threshold or higher
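
Steps 2 and 4 can be sketched in base R as follows. This is a minimal illustration and not part of the Threshold-seq code: it uses base R's table() as a stand-in for makeCountSequencePairs(), the reads are made up, and exampleThreshold is a placeholder standing in for the number reported by thresholdSeq().

```r
# Hypothetical trimmed reads; in practice these come from your own dataset.
reads <- c("ACGT", "ACGT", "ACGT", "TTGA", "TTGA", "GGCC")

# Step 2: one count per unique sequence. table() is a base R stand-in for the
# optional makeCountSequencePairs() helper, whose interface is not shown here.
countTable <- table(reads)
counts <- as.numeric(countTable)
names(counts) <- names(countTable)

# Step 4: keep only sequences read at the threshold or higher.
# exampleThreshold is a placeholder, NOT a value computed by Threshold-seq.
exampleThreshold <- 2
keptSequences <- names(counts)[counts >= exampleThreshold]
print(keptSequences)
```

The unnamed numeric vector, as.numeric(countTable), is the kind of input that step 3 passes to thresholdSeq().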

Optional arguments to Threshold-seq include:

nperm -- The number N of iterations through the Threshold-seq algorithm to run. The higher the number of iterations, the less likely Threshold-seq is to report a different value every time it is run (these values differ due to the stochastic nature of resampling). To be safe, use values of N = 500 or higher.

CDFmin -- The minimum value of the CDF to begin the scan from. This value represents the point above which to start asking whether or not the CDFs gathered by resampling are beginning to differ. The higher this value, the less likely Threshold-seq is to report a different value every time it is run (these values differ due to the stochastic nature of resampling). To be safe, use values of 0.9 or higher. 

CDFstep -- The value by which to increase the scanned CDF values. This can generally be set to 0.005, with smaller values unlikely to change threshold output, and higher values likely to yield higher thresholds. 

verbose -- Should incremental output be printed to the console?

autoRetry -- If a threshold can't be found, should Threshold-seq automatically retry with CDFstep halved (CDFstep/2)?
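
The effect of nperm, CDFmin and CDFstep can be illustrated with a small, self-contained sketch. It is not the Threshold-seq algorithm: the simulated counts, the helper resampledQuantile(), the use of a bootstrap quantile, and the upper scan limit of 1 are all assumptions made for illustration. It only shows (a) that an estimate averaged over more resamples varies less from run to run, and (b) what a grid of scanned CDF values looks like for a given CDFmin and CDFstep, including the halved step described for autoRetry.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Simulated read counts (assumption: a skewed count distribution).
counts <- rnbinom(2000, size = 0.5, mu = 20)

# Hypothetical helper: average the 0.9 quantile over nperm bootstrap resamples.
resampledQuantile <- function(x, nperm, prob = 0.9) {
  estimates <- replicate(
    nperm,
    quantile(sample(x, replace = TRUE), prob, names = FALSE)
  )
  mean(estimates)
}

# (a) Repeat the whole estimate 20 times and measure the run-to-run spread.
spreadFew  <- sd(replicate(20, resampledQuantile(counts, nperm = 10)))
spreadMany <- sd(replicate(20, resampledQuantile(counts, nperm = 500)))
cat("spread with nperm = 10: ", spreadFew, "\n")
cat("spread with nperm = 500:", spreadMany, "\n")

# (b) CDF values scanned from CDFmin upward in increments of CDFstep.
# The upper limit of 1 is an assumption of this sketch.
CDFmin <- 0.9
CDFstep <- 0.005
scanGrid  <- seq(CDFmin, 1, by = CDFstep)
retryGrid <- seq(CDFmin, 1, by = CDFstep / 2)  # finer grid after halving the step
cat(length(scanGrid), "values at CDFstep;", length(retryGrid), "at CDFstep/2\n")
```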

———
———

4. R Command Line Script Usage:

	Example UNIX code:

	Rscript cmd.Threshold-seq.r inputFile.txt outputFile.txt 1000 0.9

INPUT:
usage: Rscript ./cmd.Threshold-seq.r inputFile outputFile [nperm [CDFmin [CDFstep [verbose]]]]

Arguments:

inputFile:

This is a required argument. It is a file containing one read count per line, with one line for each unique sequence in your dataset.

For example, if your dataset contains the following sequence-count pairs, A=1, B=2, C=3, then the inputFile should contain only the following lines:

1
2
3

outputFile:

This is a required argument. It is a file to which the threshold number will be written at the end of the script execution. This can be any path on your file system.
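
An input file in the format described above can be produced from R with writeLines(). The sketch below is only an illustration: the counts are the A=1, B=2, C=3 example, and the file is written to a temporary path rather than to a real inputFile.txt.

```r
# Sequence-count pairs from the small example (A=1, B=2, C=3).
counts <- c(A = 1, B = 2, C = 3)

# Write one count per line, with no sequence identifiers and no header.
inputFile <- tempfile(fileext = ".txt")  # temporary path used for illustration
writeLines(as.character(counts), inputFile)

# Read the file back to confirm that it contains only numeric lines.
fileLines <- readLines(inputFile)
print(fileLines)
stopifnot(!any(is.na(as.numeric(fileLines))))
```

The resulting path can then be passed as the inputFile argument of cmd.Threshold-seq.r.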

Optional Arguments:

The cmd.Threshold-seq.r command line executable script takes the same optional arguments as the Threshold-seq.r script. They are, in order:

nperm: DEFAULT = 1000; The number N of iterations through the Threshold-seq algorithm to run. The higher the number of iterations, the less likely Threshold-seq is to report a different value every time it is run (these values differ due to the stochastic nature of resampling). To be safe, use values of N = 500 or higher.

CDFmin: DEFAULT = 0.9; The minimum value of the CDF to begin the scan from. This value represents the point above which to start asking whether or not the CDFs gathered by resampling are beginning to differ. The higher this value, the less likely Threshold-seq is to report a different value every time it is run (these values differ due to the stochastic nature of resampling). To be safe, use values of 0.9 or higher. 

CDFstep: DEFAULT = 0.005; The value by which to increase the scanned CDF values. This can generally be set to 0.005, with smaller values unlikely to change threshold output, and higher values likely to yield higher thresholds. 

verbose: DEFAULT = FALSE; Should incremental output be printed to the console?

***PLEASE NOTE*** the optional arguments are positional: they must be passed in the order listed above, and every argument that comes before the last one you wish to set must also be given explicitly, even if you only want its default value. FOR EXAMPLE, if you wish to specify the CDFmin, then YOU MUST specify the nperm before it.

FOR EXAMPLE, the following is an acceptable execution:

Rscript cmd.Threshold-seq.r inputFile.txt outputFile.txt 1000 0.9

The following is not an acceptable execution and will likely result in an error, because 0.9 is read as nperm:

Rscript cmd.Threshold-seq.r inputFile.txt outputFile.txt 0.9
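
Why the second command is a problem can be seen from how positional arguments are typically filled in. The function below is a hypothetical sketch, not the parsing code in cmd.Threshold-seq.r: parsePositional() and its return structure are assumptions, and only the argument order and default values are taken from the list above.

```r
# Hypothetical positional parser; args mimics the values after the script name.
parsePositional <- function(args) {
  if (length(args) < 2) stop("inputFile and outputFile are required")
  # Defaults as documented above, in the order they must be passed.
  opts <- list(nperm = 1000, CDFmin = 0.9, CDFstep = 0.005, verbose = FALSE)
  extra <- args[-(1:2)]
  if (length(extra) > length(opts)) stop("too many arguments")
  for (i in seq_along(extra)) {
    name <- names(opts)[i]
    # The i-th extra value always fills the i-th option, whatever it looks like.
    opts[[name]] <- if (name == "verbose") as.logical(extra[i]) else as.numeric(extra[i])
  }
  c(list(inputFile = args[1], outputFile = args[2]), opts)
}

ok  <- parsePositional(c("inputFile.txt", "outputFile.txt", "1000", "0.9"))
bad <- parsePositional(c("inputFile.txt", "outputFile.txt", "0.9"))
cat("ok:  nperm =", ok$nperm, " CDFmin =", ok$CDFmin, "\n")
cat("bad: nperm =", bad$nperm, " CDFmin =", bad$CDFmin, "\n")  # 0.9 landed in nperm
```

Under this sketch, the value 0.9 in the second call fills the nperm slot, so the script would be asked for a fractional number of iterations while CDFmin silently keeps its default.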

———
———

5. Terms of Use

This code can be freely used for research, academic and other non-profit activities.
Only one instance of the code may be used at a time, and then for only one concurrent user. You may not
use the code to conduct any type of application service, service bureau or time-sharing operation or to
provide any remote processing, network processing, network telecommunications or similar services to
any person, entity or organization, whether on a fee basis or otherwise. The code can be copied and
compiled on any platform for the use authorized by these terms and conditions. All copies of the code
must be accompanied by this note. The code cannot be modified without the written permission of the
Computational Medicine Center of Thomas Jefferson University https://cm.jefferson.edu

Commercial use is strictly prohibited. If you wish to use these codes commercially please contact the
Computational Medicine Center of Thomas Jefferson University: https://cm.jefferson.edu/contact-us/

THE CODE IS PROVIDED “AS IS” WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESSED
OR IMPLIED. TO THE FULLEST EXTENT PERMISSIBLE PURSUANT TO APPLICABLE LAW. THOMAS JEFFERSON
UNIVERSITY, AND ITS AFFILIATES, DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NON-INFRINGEMENT.

NEITHER THOMAS JEFFERSON UNIVERSITY NOR ITS AFFILIATES MAKE ANY REPRESENTATION AS TO THE RESULTS
TO BE OBTAINED FROM USE OF THE CODE.

———
———

6. Change Logs

Version 1.0 - February 14, 2017: Initial release of Threshold-seq!

Version 1.1 - March 23, 2017: Added two functions to 1) sanitize the input data so as to reject input that contains non-numeric values and 2) account for datasets in which a threshold cannot be found given a set of input parameters.

———
———

