General information:
The main database content comprises original sequence data and cDNA library information from different organisms, as well as results from BlastX searches against the major protein sequence databases contained in NRPEP. Additionally, sequence alignments of stackPACK clustering projects are available.
The analysis application content of CR-EST is growing and changing. Information about recent developments can be found at the version history page.
Database and application:
The CR-EST database runs on an ORACLE 11g database management system. Applications may access data only through database views, which results in a 3-tier architecture (see Fig. 1). The system can easily be extended with additional plant organisms and their EST information. The data workflow from cDNA library creation to data import is based on different technologies (see Fig. 2).
cDNA libraries:
For information about all cDNA libraries, select View cDNA libraries from the CR-EST home page, or select the cDNA library catalog from the CR-EST application menu for an organism-specific list.
Sequences:
Vector sequences and sequence ends were trimmed from the 5' and 3' ends until a 50 bp window contained fewer than two ambiguities. The maximum length was set to 700 bp. In a second step, CrossMatch (Green 1996) was used to detect remaining vector artefacts. Only sequences longer than 100 bp after this process were included in the dataset.
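The window-based end trimming and the length limits described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline actually used for CR-EST: it assumes that only the character N counts as an ambiguity, that the 700 bp limit is applied by cutting at the 3' end, and it leaves out the CrossMatch step entirely. All function and constant names are hypothetical.

```python
# Illustrative sketch only; names and the assumptions noted below are not from CR-EST.
AMBIGUOUS = set("Nn")  # assumption: only N is treated as an ambiguity
WINDOW = 50            # window size in bp (from the text)
MAX_AMBIG = 2          # a window is accepted when it has fewer than 2 ambiguities
MAX_LEN = 700          # maximum retained length in bp (from the text)
MIN_LEN = 100          # sequences must be longer than this to be kept


def count_ambiguous(segment):
    """Count ambiguous bases in a sequence segment."""
    return sum(1 for base in segment if base in AMBIGUOUS)


def trim_ends(seq):
    """Trim both ends until the terminal 50 bp window is acceptable.

    Returns an empty string when no acceptable window exists.
    """
    n = len(seq)
    start = None
    # Slide a window in from the 5' end until it is acceptable.
    for i in range(0, n - WINDOW + 1):
        if count_ambiguous(seq[i:i + WINDOW]) < MAX_AMBIG:
            start = i
            break
    if start is None:
        return ""
    end = n
    # Slide a window in from the 3' end; j is the exclusive end position.
    for j in range(n, WINDOW - 1, -1):
        if count_ambiguous(seq[j - WINDOW:j]) < MAX_AMBIG:
            end = j
            break
    # Assumption: the 700 bp limit keeps the 5'-most part of the read.
    return seq[start:end][:MAX_LEN]


def passes_length_filter(seq):
    """Keep only sequences longer than 100 bp."""
    return len(seq) > MIN_LEN
```

A read would be passed through trim_ends first and kept only if passes_length_filter returns True for the result. Because a window is accepted as soon as it holds fewer than two ambiguities, a single ambiguous base can remain at either end of the trimmed sequence.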
Sequence identifier:
The EST sequence identifier consists of three parts, e.g. the fictitious identifier XYZ99A01p:
XYZ : library code (up to 4 letters)
99A01 : plate address (99: plate number (up to 3 digits), A: column on plate, 01: row on plate)
p : primer code (1 letter)
Primer code | Library id | Direction (5' or 3' end) |
r | HA | 5 |
u | HA | 3 |
r | HB | 5 |
T | HB | 5 |
w | HB | 3 |
w | HC | 5 |
y | HC | 3 |
r | HD | 5 |
T | HD | 5 |
u | HD | 3 |
w | HD | 3 |
T | HDP | 3 |
w | HDP | 5 |
u | HE | 3 |
r | HE | 5 |
w | HF | 3 |
r | HF | 5 |
T | HF | 5 |
r | HG | 5 |
u | HG | 3 |
y | HH | 3 |
u | HH | 3 |
w | HH | 5 |
u | HI | 3 |
T | HI | 5 |
r | HI | 5 |
w | HI | 3 |
x | HJ | 5 |
u | HK | 5 |
r | HK | 3 |
w | HL | 5 |
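The identifier layout and the primer code table above can be combined into a small parser. The following Python sketch is an illustration only: the regular expression, the EstId type and the function names are hypothetical, the row is assumed to have one or two digits, and the lookup dictionary holds just four rows copied from the table rather than the full list. Because the same primer code can denote different directions in different libraries, the lookup is keyed by library id and primer code together.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern: library code (up to 4 letters), plate number
# (up to 3 digits), column letter, row digits, primer code (1 letter).
EST_ID = re.compile(r"([A-Za-z]{1,4})(\d{1,3})([A-Za-z])(\d{1,2})([A-Za-z])")


class EstId(NamedTuple):
    library: str
    plate: int
    column: str
    row: int
    primer: str


# Excerpt of the primer code table: (library id, primer code) -> direction.
DIRECTION_EXCERPT = {
    ("HA", "r"): 5,
    ("HA", "u"): 3,
    ("HK", "u"): 5,
    ("HK", "r"): 3,
}


def parse_est_id(identifier: str) -> EstId:
    """Split an EST identifier such as XYZ99A01p into its parts."""
    match = EST_ID.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a valid EST identifier: {identifier!r}")
    library, plate, column, row, primer = match.groups()
    return EstId(library, int(plate), column, int(row), primer)


def read_direction(est: EstId) -> Optional[int]:
    """Return 5 or 3 for known (library, primer) pairs, else None."""
    return DIRECTION_EXCERPT.get((est.library, est.primer))
```

For example, parse_est_id("XYZ99A01p") yields library XYZ, plate 99, column A, row 1 and primer code p; read_direction returns None for it because the fictitious library XYZ is not in the excerpt.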
Sequence similarity searches:
Using the BlastX2 program, all sequences were compared to NRPEP, a database containing non-redundant protein sequences from GENBANK translations, PDB, SWISSPROT and PIR (for details on the databases see http://genome.dkfz-heidelberg.de/).
The first ten hits were included in CR-EST. So far, similarity search results are available for barley, pea and potato EST sequences generated at the IPK. The searches were conducted using BlastX2 from release 2.0.9 of the Blast2 (NCBI) suite of programs.
These programs apply the SEG filter by default. Searches were performed using the default parameters (matrix: blosum62, -EXP=10, -WORD=3, -THRES=11, -EXT=15, -GAP=11, -LEN=1).
StackPACK consensus sequences:
We included Blast results for the consensus sequences produced by the clustering process to give users information about redundancy and, potentially, more significant similarity scores than those of single EST sequences.
EST clustering was performed with stackPACK 2.1 and stackPACK 2.2 from Nov. 26, 2004 (http://www.egenetics.com). For information about the clustering process see the ISMB99 EST clustering tutorial (http://www.sanbi.ac.za).
There are different types of consensus sequences available within the CR-EST database. Identifiers were adopted from stackPACK and extended by a special code for each clustering project. For detailed information about the clustering projects, log into the CR-EST application and select the clustering project statistics.
The structure of consensus sequence identifiers is cl#ct#cn#[a-z]##, where '#' is a wildcard for a number and [a-z] stands for one lowercase character. The fictitious consensus sequence identifier cl100ct110cn120a00 contains the following information:
- cl100 stands for cluster number 100
- ct110 stands for contig number 110
- cn120 stands for consensus number 120
- a00 stands for clustering project a00
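The identifier structure above maps directly onto a regular expression. The following Python sketch is an illustration only; the pattern, the ConsensusId type and the function name are hypothetical and assume that the project code is exactly one lowercase letter followed by two digits, as the pattern cl#ct#cn#[a-z]## suggests.

```python
import re
from typing import NamedTuple

# Hypothetical pattern for cl#ct#cn#[a-z]##.
CONSENSUS_ID = re.compile(r"cl(\d+)ct(\d+)cn(\d+)([a-z]\d{2})")


class ConsensusId(NamedTuple):
    cluster: int
    contig: int
    consensus: int
    project: str


def parse_consensus_id(identifier: str) -> ConsensusId:
    """Split a consensus sequence identifier into its four parts."""
    match = CONSENSUS_ID.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a valid consensus identifier: {identifier!r}")
    cluster, contig, consensus, project = match.groups()
    return ConsensusId(int(cluster), int(contig), int(consensus), project)
```

Applied to the fictitious identifier cl100ct110cn120a00, this returns cluster 100, contig 110, consensus 120 and project a00.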
In all cases, primary and alternative consensus sequences are stored. For more information about stackPACK consensus sequences, see the Egenetics webpage.
NOTE: Clustering data will change periodically; cluster IDs are therefore not suitable as stable reference points!