TEBenchPress

Arian Smit and Robert Hubley

Institute for Systems Biology
Contact: rhubley@systemsbiology.org

 

Original Document

TE-Annotation-Benchmark-Proposal-AS_RH

 

Description

TEBenchPress is an evolving set of TE annotation benchmarks that we use to evaluate the performance of the RepeatMasker package.  For this proposal we provide a GARLIC[1] simulated human intergenic sequence containing modeled TEs and simple repeat sequences as a demonstration benchmark for TE annotation software.  In addition to the sequence, we provide tools that compare the known TE insertions with a user-provided set of putative TE ranges and calculate false positive, false negative, true positive, and true negative counts along with specificity, sensitivity, accuracy, and false discovery rate.  At this time we also provide reversed and shuffled natural sequences as an additional false positive benchmark.
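
The reversed and shuffled controls mentioned above can be illustrated with a minimal Python sketch. It assumes that "reversed" means reversal without complementing and that "shuffled" means a simple per-base shuffle that preserves only base composition; the proposal does not state which shuffling scheme is used, and the function names here are hypothetical.

```python
import random


def reverse_sequence(seq: str) -> str:
    """Reverse the sequence without complementing it (assumed meaning of 'reversed')."""
    return seq[::-1]


def shuffle_sequence(seq: str, seed: int = 0) -> str:
    """Shuffle individual bases; only the base composition is preserved (assumption)."""
    bases = list(seq)
    random.Random(seed).shuffle(bases)
    return "".join(bases)


natural = "ACGTTTGACCA"
reversed_seq = reverse_sequence(natural)
shuffled_seq = shuffle_sequence(natural, seed=42)
print(reversed_seq)
```

Any TE annotation reported on such a control sequence would be counted as a false positive.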


Specification

 

Type: Modified real and simulated
    Comments: Modeled intergenic sequences with modeled TE insertions from Repbase. Reversed and shuffled real genomic sequences.
Primary Uses: To measure both sensitivity and specificity of TE annotation ranges.  No evaluation of repeat family membership or classification is performed.
Taxa: Currently Homo sapiens
Source: R. Hubley
Documentation: Included
Version: 1.0

 

Details

This package contains a human-like artificial sequence dataset for use as a TE annotation benchmark.  Included are the makefile that was used to generate the benchmark dataset, an evaluation of the artificial sequence against real human sequence using a variety of sequence complexity measures, and utilities to evaluate a set of annotations against the known locations of TEs in the artificial sequence.

The artificial sequence containing inserted simple repeats and TEs is created using the GARLIC algorithm[1].  With this sequence it is possible to evaluate both false positives and false negatives for many types of repeat annotation programs.  The comparison program uses a simple BED format to relate annotation ranges to the known insertion sites.  A script is provided to convert RepeatMasker output into this BED format.
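
To make the conversion step concrete, here is a minimal Python sketch of turning one line of a RepeatMasker .out file into a BED line. It is not the provided conversion script. The six-column BED layout, the function name out_line_to_bed, and the choice to drop the Simple_repeat and Low_complexity classes when simple repeats are excluded are all assumptions made for illustration.

```python
def out_line_to_bed(line, skip_simple=True):
    """Convert one RepeatMasker .out line to a BED line, or return None to skip it."""
    fields = line.split()
    # Header and blank lines do not start with an integer score.
    if len(fields) < 11 or not fields[0].isdigit():
        return None
    score, query = fields[0], fields[4]
    begin, end = int(fields[5]), int(fields[6])
    # RepeatMasker marks the complement strand with "C".
    strand = "-" if fields[8] == "C" else "+"
    name, rep_class = fields[9], fields[10]
    # Assumed filter: drop simple repeats and low-complexity regions.
    if skip_simple and rep_class in ("Simple_repeat", "Low_complexity"):
        return None
    # .out coordinates are 1-based inclusive; BED is 0-based half-open.
    return "\t".join([query, str(begin - 1), str(end), name, score, strand])


example = ("  1306 15.6  6.2  0.0 HSU08988  6563  6781  (22462) C "
           "MER7A    DNA/MER2_type    (0)   336   103  1")
print(out_line_to_bed(example))
```

The only coordinate change is subtracting one from the start position; the end position carries over unchanged because BED intervals are half-open.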

Example run with RepeatMasker and evaluation of TE results:

% RepeatMasker -engine cross_match -s artSeq.fasta
% ./outToBed.pl -noSimple artSeq.fasta.out > artSeq.fasta.noSimple.out.bed
% ./compareResults.pl artSeq.IROnly.inserts.bed artSeq.fasta.noSimple.out.bed artSeq.fasta
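
The kind of comparison performed in the last step can be sketched in Python as follows. This is a simplified illustration, not the code of the provided comparison program: it assumes that every base of the sequence is classified independently (per-base counting) and that intervals are 0-based half-open BED ranges on a single sequence. The function names are hypothetical.

```python
def covered_positions(intervals):
    """Return the set of positions covered by half-open (start, end) intervals."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return covered


def ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def per_base_metrics(truth, predicted, seq_length):
    """Compare predicted ranges with known insertions, base by base."""
    t = covered_positions(truth)
    p = covered_positions(predicted)
    tp = len(t & p)   # bases in a known TE and annotated
    fp = len(p - t)   # bases annotated but not in a known TE
    fn = len(t - p)   # bases in a known TE but not annotated
    tn = seq_length - tp - fp - fn
    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, seq_length),
        "FDR": ratio(fp, fp + tp),
    }


truth = [(10, 20), (50, 60)]
predicted = [(15, 25), (50, 60)]
metrics = per_base_metrics(truth, predicted, seq_length=100)
print(metrics)
```

Storing every covered position in a set keeps the sketch short but uses memory proportional to the annotated length; a sweep over sorted intervals would be preferable for whole genomes.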

The GARLIC modeling approach requires a priori knowledge of the types, abundance, and age of repeats in a genome.  Using this model to create a benchmark sequence for the programs that were used to define the model is therefore circular. The primary consequence is that the program providing the initial repeat list sets the difficulty of the benchmark. False positives are still fairly evaluated; they arise primarily from the realistic simple repeat sequences included in the benchmark. Because the TE insertions are mutated starting from a consensus library, the actual insertions are independent of any given detection method and provide a good false negative benchmark.

 

References

1. Caballero J, et al. (2014) Realistic artificial DNA sequences as negative controls for computational genomics. Nucleic Acids Research: gku356.

 

Drosophila melanogaster genome

Emmanuelle Lerat

Lab. Biométrie et Biologie Evolutive, Université Lyon 1, France

 

Original Document

TE Annotation Benchmark Proposal _EL (Word)

TE Annotation Benchmark Proposal _EL (PDF)

 

Description

The Drosophila melanogaster genome was published in 2000. Since then, it has been the subject of a thorough annotation process, in particular to describe its transposable element (TE) content. To date, the latest version of the TE insertion annotation is available on FlyBase (version 5.57; http://flybase.org/), and an exhaustive list of reference sequences is described in the Repbase database (http://www.girinst.org/).

Specification

 

Type: Real and modified real
    Comments: Assembled genomic data
Primary Uses: Sensitivity, specificity
    Comments: Used to test the performance of various de novo programs
Taxa: Drosophila melanogaster
Source: FlyBase

 

Details

* data description

The genome of D. melanogaster is about 140 Mb in size (Adams et al. 2000; Smith et al. 2007). Repeats have been estimated to make up 7% of the euchromatin (Bergman et al. 2006; Smith et al. 2007) and 77% of the heterochromatin (Smith et al. 2007). A total of 5,409 TE insertions have been annotated (version 5.57; http://flybase.org/). However, some reference TEs listed in the Repbase database (http://www.girinst.org/) also need to be taken into account. It is thus still possible to find some unannotated copies in this genome, although the vast majority have been described. Fewer than a thousand of the insertions are estimated to be nested (Bergman et al. 2006; Smith et al. 2007).

* performance of signature based programs

To compare signature-based programs developed specifically to detect LTR retrotransposons, the X chromosome of D. melanogaster was used as a benchmark (Lerat 2010). According to the annotation, this chromosome contains 225 LTR-retrotransposon insertions, 96 of which are full-length elements, the type of sequence that these programs are able to detect. It was thus possible to compute the sensitivity (TP/(TP+FN)) of each program.
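
The sensitivity calculation described above can be sketched at the level of whole elements. The Python example below counts an annotated full-length element as a true positive when a single predicted range covers at least a given fraction of it. The 0.8 threshold, the matching rule, and the function names are hypothetical choices made for illustration and are not taken from Lerat (2010).

```python
def overlap(a, b):
    """Length of the overlap between two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def element_sensitivity(annotated, predicted, min_fraction=0.8):
    """Return (TP, FN, sensitivity) where sensitivity = TP / (TP + FN)."""
    tp = 0
    for element in annotated:
        length = element[1] - element[0]
        # Best coverage of this element by any single predicted range.
        best = max((overlap(element, p) for p in predicted), default=0)
        if length > 0 and best / length >= min_fraction:
            tp += 1
    fn = len(annotated) - tp
    sensitivity = tp / (tp + fn) if annotated else 0.0
    return tp, fn, sensitivity


annotated = [(100, 600), (1000, 1500), (2000, 2500)]
predicted = [(90, 610), (1000, 1200)]
tp, fn, sensitivity = element_sensitivity(annotated, predicted)
print(tp, fn, round(sensitivity, 3))
```

In this toy input the first element is fully recovered, the second is only 40% covered, and the third is missed, so one of three elements counts as detected.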

* performance of a read based de novo program

To test a read-based de novo program, simulated read data were generated with the program ART (Huang et al. 2012) at 10% coverage and with various read lengths (80, 150, and 250 nt), based on the genome sequence of D. melanogaster with and without chromosomes U and Uextra. This dataset can be used together with the set of reference sequences described in Repbase.
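
As a rough guide to the size of such a dataset, the number of reads needed for a target coverage follows from N = coverage × genome size / read length. The Python sketch below applies this to the three read lengths, assuming that the stated coverage means 0.1-fold coverage, that the genome size is the approximate 140 Mb quoted above, and that reads are single-end. It does not reproduce how ART itself was invoked.

```python
GENOME_SIZE = 140_000_000   # approximate genome size in bases (from the text)
COVERAGE_PERCENT = 10       # assumed to mean 0.1-fold coverage


def reads_for_coverage(genome_size, coverage_percent, read_length):
    """Number of reads needed to reach the target coverage, rounded up."""
    total_bases = genome_size * coverage_percent // 100
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bases // read_length)


read_counts = {
    length: reads_for_coverage(GENOME_SIZE, COVERAGE_PERCENT, length)
    for length in (80, 150, 250)
}
for length, count in read_counts.items():
    print(f"{length} nt reads: {count}")
```

Under these assumptions the three datasets contain 175,000, 93,334, and 56,000 reads respectively.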

 

References

Adams MD et al. (2000) The genome sequence of Drosophila melanogaster. Science 287:2185-2195.

Bergman CM, Quesneville H, Anxolabéhère D, Ashburner M (2006) Recurrent insertion and duplication generate networks of transposable element sequences in the Drosophila melanogaster genome. Genome Biol. 7:R112.

Huang W, Li L, Myers JR, Marth GT (2012) ART: a next-generation sequencing read simulator. Bioinformatics 28:593-594.

Lerat E (2010) Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs. Heredity 104:520-533.

Smith CD, Shu S, Mungall CJ, Karpen GH (2007) The Release 5.1 annotation of Drosophila melanogaster heterochromatin. Science 316:1586-1591.

 

Benchmark for quality assessment of de novo repeat identification and genome annotation

 

Florian Maumus and Hadi Quesneville

URGI-INRA, FRANCE
 

 

Original Document

Benchmark_Proposal_URGI   (revision 22-Aug-14)

 

Description

This dataset aims to help assess the sensitivity and specificity of de novo repeat detection and annotation tools using the A. thaliana genome. It uses the coverage of known repeats as a proxy for sensitivity and the coverage of a simulated genome as a proxy for specificity.

 

The sixteen researchers who attended the TEAM meeting in Barbados:

Mathieu Blanchette

Guillaume Bourque

Thomas Bureau

Josep Casacuberta

Richard Cordaux

Anna-Sophie Fiston-Lavier

Glenn Hickey

Douglas Hoen

Aurélie Hua-Van

Robert Hubley

Aurélie Kapusta

Emmanuelle Lerat

Florian Maumus

David Pollock

Arian Smit

Travis Wheeler

Thanks for coming!

Please visit the Barbados Transposable Element Annotation Meeting page.