Benchmark Proposal #3


Arian Smit and Robert Hubley

Institute for Systems Biology






TEBenchPress is an evolving set of TE annotation benchmarks that we use to evaluate the performance of the RepeatMasker package.  For this proposal we provide a human intergenic sequence simulated with GARLIC[1], containing modeled TEs and simple repeats, as a demonstration benchmark for TE annotation software.  Alongside the sequence we provide tools that compare the known TE insertions against a user-provided set of putative TE ranges and report true positive, false positive, true negative, and false negative counts together with sensitivity, specificity, accuracy, and false discovery rate.  At this time we also provide reversed and shuffled natural sequences as an additional false positive benchmark.
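
As an illustration of the reversed and shuffled negative controls mentioned above, here is a minimal Python sketch. It rests on assumptions not stated in this proposal: "reversed" is taken to mean reversing base order without complementing, and "shuffled" is taken to mean a simple per-base permutation (which preserves base composition but not dinucleotide frequencies). The function names are hypothetical and are not part of the TEBenchPress tools.

```python
import random


def reversed_sequence(seq: str) -> str:
    """Reverse the base order without complementing (assumed meaning of 'reversed')."""
    return seq[::-1]


def shuffled_sequence(seq: str, seed: int = 0) -> str:
    """Permute bases at random (assumed meaning of 'shuffled').

    Base composition is preserved; higher-order structure is not.
    A fixed seed keeps the control sequence reproducible.
    """
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


if __name__ == "__main__":
    natural = "ACGTTTGACCA"
    print(reversed_sequence(natural))
    print(shuffled_sequence(natural, seed=42))
```

Because neither control should contain genuine TE copies, any TE annotation reported on these sequences can be counted as a false positive.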





Modified real and simulated
Modeled intergenic sequences with modeled TE insertions from Repbase. Reversed and shuffled real genomic sequences.
Primary Uses
To measure both sensitivity and specificity of TE annotation ranges.  No evaluation of repeat family membership or classification is performed.
Currently Homo sapiens
R. Hubley



This package contains a human-like artificial sequence dataset for use as a TE annotation benchmark.  Included are a “makefile” which was used to generate the benchmark dataset, an evaluation of the artificial sequence against real human sequence using a variety of sequence complexity measures, and utilities to evaluate a set of annotations against the known locations of TEs in the artificial sequence.

The artificial sequence, which contains inserted simple repeats and TEs, is created using the GARLIC algorithm[1].  With this sequence it is possible to evaluate both false positives and false negatives for many types of repeat annotation programs.  The comparison program uses a simple BED format to relate annotation ranges to the known insertion sites.  A script is provided to convert RepeatMasker output into this BED format.

Example run with RepeatMasker and evaluation of TE results:

% RepeatMasker -engine cross_match -s artSeq.fasta
% ./ -noSimple artSeq.fasta.out > artSeq.fasta.noSimple.out.bed
% ./ artSeq.IROnly.inserts.bed artSeq.fasta.noSimple.out.bed artSeq.fasta
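
To make the reported metrics concrete, the following Python sketch compares two sets of BED-style ranges and derives the counts and rates listed in this proposal. It is not the comparison program shipped with the package. It assumes a base-level comparison on a single sequence with half-open BED coordinates, and the function names (`merge`, `overlap`, `base_level_metrics`) are hypothetical; the actual tool may count at the element level or treat overlaps differently.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED


def merge(intervals: Iterable[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered(intervals: Iterable[Interval]) -> int:
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge(intervals))


def overlap(a: Iterable[Interval], b: Iterable[Interval]) -> int:
    """Number of bases covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def base_level_metrics(truth: Iterable[Interval],
                       predicted: Iterable[Interval],
                       seq_len: int) -> Dict[str, float]:
    """Base-level confusion counts and derived rates (assumed definition)."""
    truth, predicted = list(truth), list(predicted)
    tp = overlap(truth, predicted)
    fp = covered(predicted) - tp
    fn = covered(truth) - tp
    tn = seq_len - tp - fp - fn

    def ratio(num: int, den: int) -> float:
        return num / den if den else float("nan")

    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, seq_len),
        "FDR": ratio(fp, tp + fp),
    }


if __name__ == "__main__":
    known = [(0, 100)]       # known TE insertion ranges
    putative = [(50, 150)]   # ranges reported by an annotation tool
    print(base_level_metrics(known, putative, seq_len=200))
```

Merging each interval set first means that overlapping annotations of the same bases are counted once, which matters for tools that report nested or fragmented hits.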

The GARLIC modeling approach requires a priori knowledge of the types, abundance, and age of repeats in a genome.  Using this model to create a benchmark sequence for the programs that were used to define the model is therefore circular. The primary consequence is that the program providing the initial repeat list sets the difficulty of the benchmark. False positives are still evaluated fairly, since they arise primarily from the realistic simple repeat sequences included in the benchmark. Because the TE insertions are mutated starting from a consensus library, the actual insertions are independent of any given detection method and provide a good false negative benchmark.



1. Caballero, Juan, et al. “Realistic artificial DNA sequences as negative controls for computational genomics.” Nucleic Acids Research (2014): gku356.