Threshold-seq Written Example
Rogan Magee; March 20, 2017 

This guide walks you through using the Threshold-seq R script distribution, both from the command line and from the R console. An example file, “exampleData.txt”, is provided to get you started. The threshold for this file should be the integer 4 (four).

Please note that the example file, “exampleData.txt”, is a list of read counts for unique sequences from a real short RNA sequencing dataset. The dataset has been processed for quality and adaptor trimming, and each number is the read count of one unique sequence in the dataset.
 
———
R Console/RStudio:

To start working with Threshold-seq, please open the R console or RStudio.

Next, load the example data file, “exampleData.txt”, into R, using the following command:

data = as.numeric(readLines("exampleData.txt"));

This command reads the example file, one line per unique sequence, and converts each line to a number. (On its own, readLines returns text rather than numbers, hence the as.numeric wrapper.) The variable data is now a numeric vector of read counts for the unique sequences in a real short RNA sequencing dataset.
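Before running Threshold-seq, it can help to confirm that the counts were read as expected. The following is a minimal sketch of such a check. It writes a small stand-in file to a temporary location so that it runs anywhere; the file contents and the variable names (tmpFile, counts) are made up for illustration and are not part of the Threshold-seq distribution.

```r
# Stand-in for exampleData.txt: one read count per line (made-up values).
tmpFile <- tempfile(fileext = ".txt")
writeLines(c("12", "4", "1", "1", "250", "3"), tmpFile)

# readLines() returns text, so convert to numbers explicitly.
counts <- as.numeric(readLines(tmpFile))

# Basic checks: every line parsed, and every value is a positive whole number.
stopifnot(!anyNA(counts), all(counts > 0), all(counts == round(counts)))

# Quick look at the distribution of read counts.
print(summary(counts))
```

To check your own data, replace tmpFile with the path to your count file.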

Next, load the Threshold-seq source code into R, using the following command:

source("Threshold-seq.r");

Please note that the Threshold-seq.r file must be in R's current working directory (the folder from which you are running the R console or RStudio). If it is not, please refer to the R script by its full file path on your file system.

You are now ready to run Threshold-seq. The optional arguments are discussed from a theoretical standpoint in the paper itself, and the README.Threshold-seq.RM.16March2017.txt file describes how they are implemented in the R code.
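If the script lives somewhere else, one way to load it is sketched below. The folder name in scriptDir is purely hypothetical; please replace it with the folder that holds Threshold-seq.r on your system.

```r
# Hypothetical install location; replace with the folder that holds Threshold-seq.r.
scriptDir <- "~/tools/Threshold-seq"
scriptPath <- file.path(path.expand(scriptDir), "Threshold-seq.r")

# Only source the file if it is really there; otherwise report where R looked.
if (file.exists(scriptPath)) {
  source(scriptPath)
} else {
  message("Threshold-seq.r not found at ", scriptPath,
          "; current working directory is ", getwd())
}
```

Alternatively, setwd() can be used to change R's working directory to the folder that contains the script before calling source.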

Run Threshold-seq with the following command:

thresholdSeq(data,"myOutputFile.txt");

Please note that the data argument is required, and is the list of read counts for your sequenced dataset. The second argument is the name, as a string, of the output file to which the resulting threshold will be written. Optionally, you can save the output of Threshold-seq to an R variable:

myVariable = thresholdSeq(data,"myOutputFile.txt");

This saves several of the variables that Threshold-seq uses to an R list called “myVariable”, which you can inspect to check or troubleshoot your Threshold-seq run.

That’s it! You should expect to see one number in the output file you chose, namely, the threshold for this dataset.
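Once the output file exists, the threshold can be read back into R and applied to your counts. The sketch below uses a temporary stand-in for the output file and made-up read counts, so it runs without Threshold-seq itself. Please note that keeping sequences with a count at or above the threshold (rather than strictly above it) is an assumption made here for illustration; please check the paper for the intended convention.

```r
# Stand-in for the file that Threshold-seq writes; it holds a single number.
outFile <- tempfile(fileext = ".txt")
writeLines("4", outFile)   # 4 is the threshold quoted for exampleData.txt

# Read the threshold back as a number.
threshold <- as.numeric(readLines(outFile))

# Made-up read counts, for illustration only.
counts <- c(12, 4, 1, 1, 250, 3)

# ASSUMPTION: sequences with a read count at or above the threshold are kept.
kept <- counts[counts >= threshold]
print(kept)
```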
———

———
Command Line (unix and/or Mac OS X):

To run Threshold-seq from the command line, please use the extended script, called “cmd.Threshold-seq.r”, with the following syntax:

Rscript cmd.Threshold-seq.r exampleData.txt myOutputFile.txt

Please note that the first word on the line, Rscript, is R's command line tool for running scripts: it loads the file named in the next argument as R source code and executes the entire script. The second word on the line is the name of the Threshold-seq command line script.

The next two words are the required arguments. The first of these (word three on the line) is your input data file, containing read counts for unique sequences, one per line. The second (word four on the line) is the output file to which you would like to save the result of Threshold-seq.

Please note that you can also pass the optional arguments to Threshold-seq on the command line. The full list is given below. ***IMPORTANT*** all optional arguments must:
1) be given in the order that they’re listed below
2) be given without gaps: to set a later optional argument, you must also supply a value for every optional argument listed before it (see the examples below)
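If your counts currently sit in a table alongside the sequences, the input file can be prepared from within R. The following is a minimal sketch; the table seqCounts, its column names, and the file name myCounts.txt are all made up for illustration, and the file is written without a header line on the assumption that the script expects nothing but counts.

```r
# Made-up table of unique sequences and their read counts (illustration only).
seqCounts <- data.frame(
  sequence = c("ACGTACGT", "TTGACCA", "GGCATTC"),
  count    = c(120L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Write only the counts, one per line, with no header and no sequence column.
inputFile <- file.path(tempdir(), "myCounts.txt")
writeLines(as.character(seqCounts$count), inputFile)

# Confirm the file has one line per unique sequence.
stopifnot(length(readLines(inputFile)) == nrow(seqCounts))
```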

Optional arguments to Threshold-seq include:

nperm -- The number N of iterations through the Threshold-seq algorithm to run. The higher the number of iterations, the less likely Threshold-seq is to report a different value every time it is run (these values differ due to the stochastic nature of resampling). To be safe, use values of N = 500 or higher.

CDFmin -- The value of the CDF (cumulative distribution function) at which to begin the scan. This is the point above which Threshold-seq starts asking whether the CDFs gathered by resampling are beginning to differ. The higher this value, the less likely Threshold-seq is to report a different value every time it is run (these values differ due to the stochastic nature of resampling). To be safe, use values of 0.9 or higher.

CDFstep -- The value by which to increase the scanned CDF values. This can generally be set to 0.005, with smaller values unlikely to change threshold output, and higher values likely to yield higher thresholds. 

verbose -- Should incremental output be printed to the console?

autoRetry -- If a threshold can't be found, should Threshold-seq automatically retry with CDFstep halved (CDFstep/2)?
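To see why a larger nperm gives more repeatable results, consider the following generic illustration of resampling. Please note that this is not the Threshold-seq algorithm: it simply averages a resampled statistic (the mean) over a chosen number of iterations, repeats that 20 times, and compares how much the answer moves from run to run. The counts and the helper function resampledEstimate are made up for illustration.

```r
set.seed(1)  # fixed seed so the illustration is repeatable

# Made-up, skewed read counts (illustration only).
counts <- rgeom(2000, prob = 0.2) + 1

# Average a resampled statistic (here, the mean) over nIter resamples.
resampledEstimate <- function(x, nIter) {
  mean(replicate(nIter, mean(sample(x, replace = TRUE))))
}

# Repeat the whole procedure 20 times and measure the run-to-run spread.
spreadFew  <- sd(replicate(20, resampledEstimate(counts, 5)))
spreadMany <- sd(replicate(20, resampledEstimate(counts, 500)))

cat("Run-to-run spread with 5 iterations:  ", spreadFew, "\n")
cat("Run-to-run spread with 500 iterations:", spreadMany, "\n")
```

The spread with 500 iterations should come out noticeably smaller than the spread with 5, which is the same effect that makes higher values of nperm safer.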

For example, the following is an acceptable execution:

Rscript cmd.Threshold-seq.r exampleData.txt myOutputFile.txt 1000 0.9

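The following is not an acceptable execution and will likely result in an error, because the optional arguments are positional and the value 0.9, presumably intended for CDFmin, sits in the position reserved for nperm: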

Rscript cmd.Threshold-seq.r exampleData.txt myOutputFile.txt 0.9
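
The difference between the two executions above can be sketched in R. Please note that parsePositional is a hypothetical helper written for this guide, not the code in cmd.Threshold-seq.r; it only illustrates how purely positional arguments are matched to names by their position.

```r
# Hypothetical parser: NOT the code in cmd.Threshold-seq.r, only an illustration
# of how purely positional arguments are interpreted.
parsePositional <- function(args) {
  optionalNames <- c("nperm", "CDFmin", "CDFstep", "verbose", "autoRetry")
  stopifnot(length(args) >= 2, length(args) <= 2 + length(optionalNames))
  parsed <- list(input = args[1], output = args[2])
  extras <- args[-(1:2)]
  # The i-th extra value is always assigned to the i-th optional name.
  for (i in seq_along(extras)) {
    parsed[[optionalNames[i]]] <- extras[i]
  }
  parsed
}

ok  <- parsePositional(c("exampleData.txt", "myOutputFile.txt", "1000", "0.9"))
bad <- parsePositional(c("exampleData.txt", "myOutputFile.txt", "0.9"))

str(ok)   # nperm is "1000" and CDFmin is "0.9"
str(bad)  # nperm is "0.9": the value meant for CDFmin lands in nperm
```

In a real script, the vector of arguments would come from commandArgs(trailingOnly = TRUE) rather than being typed in by hand.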
———

That concludes the Threshold-seq tutorial! Please email Rogan Magee at rogan.magee@jefferson.edu with any questions or concerns you have regarding the script. 

———

Terms of Use
---------------
This code can be freely used for research, academic and other non-profit activities.
Only one instance of the code may be used at a time, and then for only one concurrent user. You may not
use the code to conduct any type of application service, service bureau or time-sharing operation or to
provide any remote processing, network processing, network telecommunications or similar services to
any person, entity or organization, whether on a fee basis or otherwise. The code can be copied and
compiled on any platform for the use authorized by these terms and conditions. All copies of the code
must be accompanied by this note. The code cannot be modified without the written permission of the
Computational Medicine Center of Thomas Jefferson University https://cm.jefferson.edu

Commercial use is strictly prohibited. If you wish to use these codes commercially please contact the
Computational Medicine Center of Thomas Jefferson University: https://cm.jefferson.edu/contact-us/

THE CODE IS PROVIDED “AS IS” WITH NO REPRESENTATIONS OR WARRANTIES OF ANY KIND, EITHER EXPRESSED
OR IMPLIED. TO THE FULLEST EXTENT PERMISSIBLE PURSUANT TO APPLICABLE LAW. THOMAS JEFFERSON
UNIVERSITY, AND ITS AFFILIATES, DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NON-INFRINGEMENT.

NEITHER THOMAS JEFFERSON UNIVERSITY NOR ITS AFFILIATES MAKE ANY REPRESENTATION AS TO THE RESULTS
TO BE OBTAINED FROM USE OF THE CODE.

———
