AdapterRemoval − Fast short-read adapter trimming and processing
AdapterRemoval [options…] −−file1 <filenames> [−−file2 <filenames>]
AdapterRemoval removes residual adapter sequences from single−end (SE) or paired−end (PE) FASTQ reads, optionally trimming Ns and low−quality bases and/or collapsing overlapping paired−end mates into one read. Low−quality reads are filtered based on the resulting length and the number of ambiguous nucleotides (‘N’) present following trimming. These operations may be combined with simultaneous demultiplexing using 5’ barcode sequences. Alternatively, AdapterRemoval may attempt to reconstruct consensus adapter sequences from paired−end data, in order to allow the identification of the adapter sequences originally used.
If you use this program, please cite the paper:
Schubert, Lindgreen, and Orlando (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 12;9(1):88
http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104−016−1900−2
For detailed documentation, please see
http://adapterremoval.readthedocs.io/en/v2.2.3/
−−help
Display summary of command−line options.
−−version
Print the version string.
−−file1 filename [filenames...]
Read FASTQ reads from one or more files, either uncompressed, bzip2 compressed, or gzip compressed. This contains either the single−end (SE) reads or, if paired−end, the mate 1 reads. If running in paired−end mode, both −−file1 and −−file2 must be set. See the primary documentation for a list of supported formats.
−−file2 filename [filenames...]
Read one or more FASTQ files containing mate 2 reads for a paired−end run. If specified, −−file1 must also be set.
−−identify−adapters
Attempt to build a consensus adapter sequence from fully overlapping pairs of paired−end reads. The minimum overlap is controlled by −−minalignmentlength. The result will be compared with the values set using −−adapter1 and −−adapter2. No trimming is performed in this mode. Default is off.
−−threads n
Maximum number of threads. Defaults to 1.
FASTQ options
−−qualitybase base
The Phred quality score encoding used in input reads − either ‘64’ for Phred+64 (Illumina 1.3+ and 1.5+) or ‘33’ for Phred+33 (Illumina 1.8+). In addition, the value ‘solexa’ may be used to specify reads with Solexa encoded scores. Default is 33.
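The relationship between the quality base and the characters in a FASTQ quality line can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of AdapterRemoval; Solexa encoded scores are not handled here.

```python
def decode_qualities(quality_string, base=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    `base` is the ASCII offset: 33 for Phred+33, 64 for Phred+64.
    """
    scores = [ord(character) - base for character in quality_string]
    if any(score < 0 for score in scores):
        raise ValueError("quality character below the chosen offset")
    return scores


def encode_qualities(scores, base=33):
    """Convert a list of Phred scores back into a FASTQ quality string."""
    return "".join(chr(score + base) for score in scores)
```

For example, the character ‘I’ decodes to a Phred score of 40 with base 33, while ‘h’ decodes to 40 with base 64.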
−−qualitybase−output base
The base of the quality score for reads written by AdapterRemoval − either ‘64’ for Phred+64 (i.e., Illumina 1.3+ and 1.5+) or ‘33’ for Phred+33 (Illumina 1.8+). In addition, the value ‘solexa’ may be used to specify reads with Solexa encoded scores. However, note that quality scores are represented using Phred scores internally, and conversion to and from Solexa scores therefore results in a loss of information. The default corresponds to the value given for −−qualitybase.
−−qualitymax base
Specifies the maximum Phred score expected in input files, and used when writing output files. Possible values are 0 to 93 for Phred+33 encoded files, and 0 to 62 for Phred+64 encoded files. Defaults to 41.
−−mate−separator separator
Character separating the mate number (1 or 2) from the read name in FASTQ records. Defaults to ‘/’.
−−interleaved
Enables −−interleaved−input and −−interleaved−output.
−−interleaved−input
If set, input is expected to be an interleaved FASTQ file specified using −−file1, in which pairs of reads are written one after the other (e.g. read1/1, read1/2, read2/1, read2/2, etc.).
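The interleaved layout can be illustrated with a short Python sketch that pairs up consecutive records. This is only an illustration of the file layout, not AdapterRemoval's parser; it assumes simple 4−line FASTQ records with no blank lines inside a record, and the function name is hypothetical.

```python
def read_interleaved_pairs(lines):
    """Yield (mate1, mate2) tuples from the lines of an interleaved FASTQ file.

    Each mate is a (name, sequence, qualities) tuple. Assumes 4-line records
    in which mate 1 is immediately followed by its mate 2.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 8 != 0:
        raise ValueError("expected complete pairs of 4-line FASTQ records")

    for index in range(0, len(rows), 8):
        # rows[index + 2] and rows[index + 6] are the '+' separator lines
        mate1 = (rows[index][1:], rows[index + 1], rows[index + 3])
        mate2 = (rows[index + 4][1:], rows[index + 5], rows[index + 7])
        yield mate1, mate2
```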
−−interleaved−output
Write paired−end reads to a single file, interleaving mate 1 and mate 2 reads. By default, this file is named basename.paired.truncated, but this may be changed using the −−output1 option.
−−combined−output
Write all reads into the files specified by −−output1 and −−output2. The sequences of reads discarded due to quality filters or read merging are replaced with a single ‘N’ with Phred score 0. This option can be combined with −−interleaved−output to write PE reads to a single output file specified with −−output1.
Output file options
−−basename filename
Prefix used for naming output files, unless these names have been overridden using the corresponding command−line option (see below).
−−settings file
Output file containing information on the parameters used in the run as well as overall statistics on the reads after trimming. Default filename is ‘basename.settings’.
−−output1 file
Output file containing trimmed mate 1 reads. Default filename is ‘basename.pair1.truncated’ for paired−end reads, ‘basename.truncated’ for single−end reads, and ‘basename.paired.truncated’ for interleaved paired−end reads.
−−output2 file
Output file containing trimmed mate 2 reads when −−interleaved−output is not enabled. Default filename is ‘basename.pair2.truncated’ in paired−end mode.
−−singleton file
Output file containing paired reads for which the mate has been discarded. Default filename is ‘basename.singleton.truncated’.
−−outputcollapsed file
If −−collapse is set, contains overlapping mate−pairs which have been merged into a single read (PE mode) or reads for which the adapter was identified by a minimum overlap, indicating that the entire template molecule is present. This does not include reads which have subsequently been trimmed due to low−quality or ambiguous nucleotides. Default filename is ‘basename.collapsed’.
−−outputcollapsedtruncated file
Collapsed reads (see −−outputcollapsed) which were trimmed due to the presence of low−quality or ambiguous nucleotides. Default filename is ‘basename.collapsed.truncated’.
−−discarded file
Contains reads discarded due to the −−minlength, −−maxlength or −−maxns options. Default filename is ‘basename.discarded’.
Output compression options
−−gzip
If set, all FASTQ files written by AdapterRemoval will be gzip compressed using the compression level specified using −−gzip−level. The extension “.gz” is added to files for which no filename was given on the command−line. Defaults to off.
−−gzip−level level
Determines the compression level used when gzip’ing FASTQ files. Must be a value in the range 0 to 9, with 0 disabling compression and 9 being the best compression. Defaults to 6.
−−bzip2
If set, all FASTQ files written by AdapterRemoval will be bzip2 compressed using the compression level specified using −−bzip2−level. The extension “.bz2” is added to files for which no filename was given on the command−line. Defaults to off.
−−bzip2−level level
Determines the compression level used when bzip2’ing FASTQ files. Must be a value in the range 1 to 9, with 9 being the best compression. Defaults to 9.
FASTQ trimming options
−−adapter1 adapter
Adapter sequence expected to be found in mate 1 reads, specified in read direction. For a detailed description of how to provide the appropriate adapter sequences, see the “Adapters” section of the online documentation. Default is AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG.
−−adapter2 adapter
Adapter sequence expected to be found in mate 2 reads, specified in read direction. For a detailed description of how to provide the appropriate adapter sequences, see the “Adapters” section of the online documentation. Default is AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT.
−−adapter−list filename
Read one or more adapter sequences from a table. The first two columns (separated by whitespace) of each line in the file are expected to correspond to values passed to −−adapter1 and −−adapter2. In single−end mode, only column one is required. Lines starting with ‘#’ are ignored. When multiple rows are found in the table, AdapterRemoval will try each adapter (pair), and select the best aligning adapters for each FASTQ read processed.
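The table layout described above can be sketched in Python as follows. This is not AdapterRemoval's own parser: the function name is hypothetical, and skipping blank lines and ignoring leading whitespace before ‘#’ are assumptions made for the sketch.

```python
def parse_adapter_table(lines, paired_end=True):
    """Parse an adapter table into a list of (adapter1, adapter2) tuples.

    Columns are separated by whitespace; columns beyond the first two are
    ignored. In single-end mode only the first column is required, and
    adapter2 is returned as None.
    """
    required = 2 if paired_end else 1
    adapters = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines (assumption) and comment lines

        fields = line.split()
        if len(fields) < required:
            raise ValueError(f"expected at least {required} column(s): {line!r}")

        adapter1 = fields[0]
        adapter2 = fields[1] if paired_end else None
        adapters.append((adapter1, adapter2))

    return adapters
```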
−−minadapteroverlap length
In single−end mode, reads are only trimmed if the overlap between read and the adapter is at least −−minadapteroverlap bases long, not counting ambiguous nucleotides (N); this is independent of the −−minalignmentlength when using −−collapse, allowing a conservative selection of putative complete inserts in single−end mode, while ensuring that all possible adapter contamination is trimmed. The default is 0.
−−mm mismatchrate
The maximum fraction of mismatches allowed in the aligned region. If the value is less than 1, then the value is used directly. If −−mm is greater than 1, the rate is set to 1 / −−mm. The default setting is 3 when trimming adapters, corresponding to a maximum mismatch rate of 1/3, and 10 when using −−identify−adapters.
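The interpretation of the −−mm value can be sketched in Python as follows. The function name is hypothetical, and rejecting negative values is an assumption made for the sketch.

```python
def effective_mismatch_rate(mm):
    """Translate a --mm value into a mismatch rate between 0 and 1.

    Values up to 1 are used directly; larger values are treated as the
    denominator of the rate, so that 3 corresponds to 1/3.
    """
    if mm < 0:
        raise ValueError("mismatch rate must not be negative")
    if mm > 1:
        return 1.0 / mm
    return float(mm)
```

With the defaults described above, a value of 3 yields a rate of 1/3 and a value of 10 yields a rate of 0.1.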
−−shift n
To allow for missing bases in the 5’ end of the read, the program can let the alignment slip −−shift bases in the 5’ end. This corresponds to starting the alignment at most −−shift nucleotides into read2 (for paired−end) or the adapter (for single−end). The default is 2.
−−trim5p n [n]
Trim the 5’ termini of reads by a fixed amount after removing adapters, but before carrying out quality based trimming. Specify one value to trim mate 1 and mate 2 reads by the same amount, or two values separated by a space to trim each mate by a different amount. Off by default.
−−trim3p n [n]
Trim the 3’ termini of reads by a fixed amount. See −−trim5p.
−−trimns
Trim consecutive Ns from the 5’ and 3’ termini. If quality trimming is also enabled (−−trimqualities), then stretches of mixed low−quality bases and/or Ns are trimmed.
−−maxns n
Discard reads containing more than −−maxns ambiguous bases (‘N’) after trimming. Default is 1000.
−−trimqualities
Trim consecutive stretches of low quality bases (threshold set by −−minquality) from the 5’ and 3’ termini. If trimming of Ns is also enabled (−−trimns), then stretches of mixed low−quality bases and Ns are trimmed.
−−trimwindows window_size
Trim low quality bases using a sliding window based approach inspired by sickle with the given window size. See the “Window based quality trimming” section of the manual page for a description of this algorithm.
−−minquality minimum
Set the threshold for trimming low quality bases using −−trimqualities and −−trimwindows. Default is 2.
−−minlength length
Reads shorter than this length are discarded following trimming. Defaults to 15.
−−maxlength length
Reads longer than this length are discarded following trimming. Defaults to 4294967295.
FASTQ merging options
−−collapse
In paired−end mode, merge overlapping mates into a single read and recalculate the quality scores. In single−end mode, attempt to identify templates for which the entire sequence is available. In both cases, complete “collapsed” reads are written with a ‘M_’ name prefix, and “collapsed” reads which are trimmed due to quality settings are written with a ‘MT_’ name prefix. The overlap needs to be at least −−minalignmentlength nucleotides, with a maximum number of mismatches determined by −−mm.
−−minalignmentlength length
The minimum overlap between mate 1 and mate 2 before the reads are collapsed into one, when collapsing paired−end reads, or when attempting to identify complete template sequences in single−end mode. Default is 11.
−−seed seed
When collapsing reads at positions where the two reads differ, and the qualities of the bases are identical, AdapterRemoval will select a random base. This option specifies the seed used for the random number generator used by AdapterRemoval. This value is also written to the settings file. Note that setting the seed is not reliable in multithreaded mode, since the order of operations is non−deterministic.
−−deterministic
Enable deterministic mode; currently this only affects −−collapse: differing overlapping bases with equal quality are set to ‘N’ with quality 0, instead of being randomly sampled.
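The difference between the default (seeded) behaviour and −−deterministic at a single overlapping position can be sketched in Python. This sketch only covers positions where the mates disagree and the qualities are equal; how qualities are recalculated in other cases is not modelled, and the function name and return convention are hypothetical.

```python
import random


def resolve_equal_quality_mismatch(base1, base2, deterministic=False, rng=None):
    """Pick the merged base where the mates disagree with equal quality.

    Returns (base, quality); quality is 0 for the 'N' written in
    deterministic mode, and None otherwise, since the recalculated
    quality is outside the scope of this sketch.
    """
    if base1 == base2:
        raise ValueError("bases agree; nothing to resolve")
    if deterministic:
        return "N", 0

    # Without --deterministic, one of the two bases is sampled at random;
    # passing a seeded random.Random mimics the effect of --seed.
    chooser = rng if rng is not None else random
    return chooser.choice([base1, base2]), None
```

Passing two generators created with the same seed, e.g. random.Random(42), gives the same choice on each call, whereas deterministic mode never consults the generator.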
FASTQ demultiplexing options
−−barcode−list filename
Perform demultiplexing using a table of one or two fixed−length barcodes for SE or PE reads. The table is expected to contain 2 or 3 columns, the first of which represents the name of a given sample, and the second and third of which represent the mate 1 and (optionally) the mate 2 barcode sequence. For a detailed description, see the “Demultiplexing” section of the online documentation.
−−barcode−mm n
Maximum number of mismatches allowed when counting mismatches in both the mate 1 and the mate 2 barcode for paired reads.
−−barcode−mm−r1 n
Maximum number of mismatches allowed for the mate 1 barcode; if not set, this value is equal to the −−barcode−mm value; cannot be higher than the −−barcode−mm value.
−−barcode−mm−r2 n
Maximum number of mismatches allowed for the mate 2 barcode; if not set, this value is equal to the −−barcode−mm value; cannot be higher than the −−barcode−mm value.
−−demultiplex−only
Only carry out demultiplexing using the list of barcodes supplied with −−barcode−list. No other processing is done.
Window based quality trimming
As of v2.2.2, AdapterRemoval implements a sliding window based approach to quality based base−trimming inspired by sickle. If window_size is greater than or equal to 1, that number is used as the window size for all reads. If window_size is a number greater than or equal to 0 and less than 1, then that number is multiplied by the length of individual reads to determine the window size. If the window length is zero or is greater than the current read length, then the read length is used instead.
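The rules for deriving the window length from window_size can be sketched in Python as follows. The function name is hypothetical, and truncating fractional results to a whole number of bases is an assumption made for the sketch.

```python
def window_length(window_size, read_length):
    """Return the window length in bases for a read of the given length."""
    if window_size < 0:
        raise ValueError("window_size must not be negative")

    if window_size >= 1:
        # Used as-is for all reads
        length = int(window_size)
    else:
        # Interpreted as a fraction of the length of the individual read
        length = int(window_size * read_length)

    # Fall back to the full read for empty or overly long windows
    if length == 0 or length > read_length:
        length = read_length

    return length
```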
Reads are trimmed as follows for a given window size:
1. The new 5’ is determined by locating the first window where both the average quality and the quality of the first base in the window are greater than −−minquality.
2. The new 3’ is located by sliding the first window right, until the average quality becomes less than or equal to −−minquality. The new 3’ is placed at the last base in that window where the quality is greater than or equal to −−minquality.
3. If no 5’ position could be determined, the read is discarded.
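The three steps can be sketched in Python as a literal reading of the description above. This is not AdapterRemoval's implementation, which may differ in details such as the handling of Ns and of the final window; the function name is hypothetical, the window length is given in bases, and stopping at the last full window when the average never drops is an assumption.

```python
def trim_windowed(qualities, window_length, min_quality):
    """Return the retained region as a half-open (start, end) interval.

    `qualities` is a list of Phred scores. Returns None if the read is
    discarded because no 5' position could be determined.
    """
    read_length = len(qualities)
    if read_length == 0:
        return None
    if window_length <= 0 or window_length > read_length:
        window_length = read_length

    def average(offset):
        window = qualities[offset:offset + window_length]
        return sum(window) / len(window)

    last_offset = read_length - window_length

    # Step 1: first window whose average and first base both exceed the threshold
    start = None
    for offset in range(last_offset + 1):
        if average(offset) > min_quality and qualities[offset] > min_quality:
            start = offset
            break

    # Step 3: no 5' position found, so the read is discarded
    if start is None:
        return None

    # Step 2: slide right until the average drops to the threshold or below,
    # or until the window reaches the end of the read (assumption)
    offset = start
    while offset < last_offset and average(offset) > min_quality:
        offset += 1

    # Place the 3' end after the last base in that window meeting the threshold
    end = offset
    for index in range(offset, offset + window_length):
        if qualities[index] >= min_quality:
            end = index + 1

    return start, end
```

For a read with qualities 2, 2, 30, 30, 30, 30, 2, 2, a window of 2 bases and a threshold of 10, this sketch retains only the four bases with quality 30.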
AdapterRemoval exits with status 0 if the program ran successfully, and with a non−zero exit code if any errors were encountered. Do not use the output from AdapterRemoval if the program returned a non−zero exit code!
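When AdapterRemoval is run from a script, the exit status can be checked before any output is used. The following Python sketch shows one way of doing so; the helper name is hypothetical and the helper is not specific to AdapterRemoval.

```python
import subprocess


def run_checked(command):
    """Run a command and raise an error if it exits with a non-zero status."""
    result = subprocess.run(command)
    if result.returncode != 0:
        raise RuntimeError(
            f"{command[0]} exited with status {result.returncode}; "
            "do not use its output"
        )
    return result
```

A call might look like run_checked(["AdapterRemoval", "--file1", "reads_1.fq", "--file2", "reads_2.fq", "--basename", "output"]), where the file names and basename are placeholders.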
Please report any bugs using the AdapterRemoval issue−tracker:
https://github.com/MikkelSchubert/adapterremoval/issues
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.
Mikkel Schubert; Stinus Lindgreen
2017, Mikkel Schubert; Stinus Lindgreen