About CR-EST

Fig. 1: CR-EST system architecture
Fig. 2: CR-EST data workflow
General information:
The main database content is original sequence data and cDNA library information from different organisms, as well as results from BlastX searches against the major protein sequence databases contained in NRPEP. Additionally, sequence alignments from stackPACK clustering projects are available. The analysis applications offered by CR-EST are growing and changing. Information about recent developments can be found on the version history page.

Database and application:
The CR-EST database runs on an ORACLE 11g database management system. Applications may access data only through database views, which results in a 3-tier architecture (see Fig. 1). The system can easily be extended with additional plant organisms and their EST information. The data workflow from cDNA library creation to data import is based on several technologies (see Fig. 2).
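
The view-only access rule can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the ORACLE system, and the table, view, and column names (est_raw, v_est, internal_note) are hypothetical; they are not the actual CR-EST schema.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the ORACLE backend (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical base table; applications never query it directly.
conn.execute(
    "CREATE TABLE est_raw (est_id TEXT PRIMARY KEY, library TEXT, "
    "sequence TEXT, internal_note TEXT)"
)
conn.execute(
    "INSERT INTO est_raw VALUES ('XYZ99A01p', 'XYZ', 'ACGTACGT', 'curator only')"
)

# Hypothetical view exposing only the columns applications may read.
conn.execute(
    "CREATE VIEW v_est AS SELECT est_id, library, sequence FROM est_raw"
)


def fetch_est(est_id):
    """Application-tier read that goes through the view, not the base table."""
    cur = conn.execute(
        "SELECT est_id, library, sequence FROM v_est WHERE est_id = ?",
        (est_id,),
    )
    return cur.fetchone()
```

SQLite does not model user privileges, so the sketch only shows the shape of the access path. In a production system the restriction would presumably be enforced by granting applications rights on the views alone.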

cDNA libraries:
For information about all cDNA libraries, select View cDNA libraries from the CR-EST home page, or select the cDNA library catalog from the CR-EST application menu for an organism-specific list.

Sequences:
Vector sequences and sequence ends were trimmed from the 5' and 3' ends until a 50 bp window contained fewer than two ambiguities. The maximum length was set to 700 bp. In a second step, CrossMatch (Green 1996) was used to detect remaining vector artefacts. Only sequences longer than 100 bp after this process were included in the dataset.
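
The window-based end trimming and the length limits can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: an ambiguity is taken to be any character other than A, C, G or T, the 700 bp limit is assumed to be applied by cutting the 3' end, and the function names are hypothetical. The CrossMatch vector screening step is not modelled.

```python
WINDOW = 50        # window size in bp, from the text
MAX_LENGTH = 700   # maximum sequence length in bp, from the text
MIN_LENGTH = 100   # sequences must be longer than this, from the text


def count_ambiguous(window):
    """Count characters that are not A, C, G or T (assumed definition)."""
    return sum(1 for base in window.upper() if base not in "ACGT")


def trim_ends(seq):
    """Trim both ends until the outermost 50 bp windows hold < 2 ambiguities."""
    start, end = 0, len(seq)
    # Move the 5' boundary inwards while its window is too ambiguous.
    while end - start >= WINDOW and count_ambiguous(seq[start:start + WINDOW]) >= 2:
        start += 1
    # Move the 3' boundary inwards while its window is too ambiguous.
    while end - start >= WINDOW and count_ambiguous(seq[end - WINDOW:end]) >= 2:
        end -= 1
    if end - start < WINDOW:
        return ""  # no acceptable window was found
    # Assumption: the length cap is applied by truncating the 3' end.
    return seq[start:end][:MAX_LENGTH]


def passes_length_filter(trimmed):
    """Keep only sequences longer than 100 bp after trimming."""
    return len(trimmed) > MIN_LENGTH
```

For example, trim_ends("NNNNN" + "ACGT" * 40 + "NNNN") removes Ns from both ends until each terminal window holds at most one ambiguity, and the result passes the length filter.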

Sequence identifier:
The EST sequence identifier consists of three parts, as in the fictitious identifier XYZ99A01p:

XYZ : library code (up to 4 letters)
99A01 : plate address (99: plate number (up to 3 digits), A: column on plate, 01: row on plate)
p : primer code (1 letter)
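
A minimal sketch of how such an identifier could be split into its parts is given here. The regular expression encodes the layout described above; the two-digit row and the acceptance of both upper- and lowercase letters are assumptions based on the example, and the names EstId and parse_est_id are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class EstId(NamedTuple):
    library: str   # library code, up to 4 letters
    plate: int     # plate number, up to 3 digits
    column: str    # column on plate (one letter)
    row: int       # row on plate (assumed to be two digits)
    primer: str    # primer code (one letter)


# Layout: library code, plate number, column letter, row digits, primer code.
EST_ID_PATTERN = re.compile(
    r"^([A-Za-z]{1,4})(\d{1,3})([A-Za-z])(\d{2})([A-Za-z])$"
)


def parse_est_id(identifier: str) -> Optional[EstId]:
    """Return the identifier's parts, or None if it does not fit the layout."""
    match = EST_ID_PATTERN.match(identifier)
    if match is None:
        return None
    library, plate, column, row, primer = match.groups()
    return EstId(library, int(plate), column, int(row), primer)
```

With the fictitious identifier from the text, parse_est_id("XYZ99A01p") yields library XYZ, plate 99, column A, row 1 and primer code p.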

Primer code   Library id   Direction
r             HA           5
u             HA           3
r             HB           5
T             HB           5
w             HB           3
w             HC           5
y             HC           3
r             HD           5
T             HD           5
u             HD           3
w             HD           3
T             HDP          3
w             HDP          5
u             HE           3
r             HE           5
w             HF           3
r             HF           5
T             HF           5
r             HG           5
u             HG           3
y             HH           3
u             HH           3
w             HH           5
u             HI           3
T             HI           5
r             HI           5
w             HI           3
x             HJ           5
u             HK           5
r             HK           3
w             HL           5
u             HM           3
T             HM           5
r             HM           5
w             HM           3
r             HO           5
S             HO           5
w             HO           3
T             HP           5
w             HP           3
w             HQ           5
r             HR           5
u             HR           3
u             HS           3
T             HS           5
r             HS           5
w             HS           3
u             HT           3
T             HT           5
r             HT           5
w             HT           3
u             HU           3
T             HU           5
r             HU           5
w             HU           3
r             HV           5
T             HV           5
u             HV           3
w             HV           3
T             HW           5
u             HW           3
V             HW           5
w             BP           3
f             GAN          3
u             GAN          3
f             GBN          3
u             GBN          3
r             GBN          3
x             GCA          5
u             GCN          3
r             GCW          5
f             GCW          5
f             GD           3
r             GNW          5
u             GPN          3
f             GW           5
r             GW           5
w             HX           3
u             HX           3
r             HX           5
T             HX           5
u             HY           3
V             HY           5
T             HY           5
u             HZ           3
r             HZ           5
r             RUS          5
w             RUS          3
u             SDBN         3
x             SDBT         5
u             SSBN         3
x             SSBT         5
u             STDB         3
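
The same primer code can stand for different directions depending on the library (for example, w is listed with direction 3 for HB but 5 for HC), so a lookup has to use both values as its key. The sketch below shows this with a small excerpt of the table; the names DIRECTIONS and read_direction are hypothetical, and a full implementation would load every row.

```python
# Excerpt of the primer code table: (primer code, library id) -> direction.
# Only a few rows are included here for illustration.
DIRECTIONS = {
    ("r", "HA"): "5",
    ("u", "HA"): "3",
    ("r", "HB"): "5",
    ("T", "HB"): "5",
    ("w", "HB"): "3",
    ("w", "HC"): "5",
    ("y", "HC"): "3",
}


def read_direction(primer_code, library_id):
    """Return the listed direction, or None if the pair is not in the excerpt."""
    return DIRECTIONS.get((primer_code, library_id))
```

Primer codes are case-sensitive in this sketch, since the table lists both uppercase codes (T, S, V) and lowercase ones.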


Sequence similarity searches:
Using the BlastX2 program, all sequences were compared to NRPEP, a database containing non-redundant protein sequences from GENBANK translations, PDB, SWISSPROT and PIR (for details on the databases see http://genome.dkfz-heidelberg.de/). The first ten hits were included in CR-EST. So far, similarity search results are available for barley, pea and potato EST sequences generated at the IPK. Sequence similarity searches against databases were conducted using BlastX2 from release 2.0.9 of the Blast2 (NCBI) suite of programs. These programs apply filtering tools (SEG) by default. Searches were performed using the default parameters (matrix: blosum62, -EXP=10, -WORD=3, -THRES=11, -EXT=15, -GAP=11, -LEN=1).

StackPACK consensus sequences:
We included Blast results for the consensus sequences of a clustering process in the database to give the user information about redundancy and potentially more significant similarity scores than single EST sequences provide. EST clustering was performed with stackPACK 2.1 and stackPACK 2.2 from Nov. 26, 2004 (http://www.egenetics.com). For information about the clustering process, see the ISMB99 EST clustering tutorial (http://www.sanbi.ac.za). Different types of consensus sequences are available within the CR-EST database. Identifiers were adopted from stackPACK and extended by a special code for each clustering project. For detailed information about the clustering projects, log into the CR-EST application and select the clustering project statistics. The structure of consensus sequence identifiers is cl#ct#cn#[a-z]##, where '#' is a wildcard for a number and [a-z] stands for one lowercase character. The fictitious consensus sequence identifier cl100ct110cn120a00 contains the following information:
  • cl100 stands for cluster number 100
  • ct110 stands for contig number 110
  • cn120 stands for consensus number 120
  • a00 stands for clustering project a00
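
The identifier structure cl#ct#cn#[a-z]## translates directly into a regular expression. The following is a minimal sketch; the names ConsensusId and parse_consensus_id are hypothetical, and each '#' is assumed to match one or more digits, while the project code is one lowercase letter followed by exactly two digits.

```python
import re
from typing import NamedTuple, Optional


class ConsensusId(NamedTuple):
    cluster: int     # number after "cl"
    contig: int      # number after "ct"
    consensus: int   # number after "cn"
    project: str     # clustering project code, e.g. "a00"


# cl#ct#cn#[a-z]## with '#' read as a run of digits (assumption).
CONSENSUS_ID_PATTERN = re.compile(r"^cl(\d+)ct(\d+)cn(\d+)([a-z]\d{2})$")


def parse_consensus_id(identifier: str) -> Optional[ConsensusId]:
    """Split a consensus sequence identifier, or return None if malformed."""
    match = CONSENSUS_ID_PATTERN.match(identifier)
    if match is None:
        return None
    cluster, contig, consensus, project = match.groups()
    return ConsensusId(int(cluster), int(contig), int(consensus), project)
```

For the fictitious identifier cl100ct110cn120a00, the function returns cluster 100, contig 110, consensus 120 and project a00.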

In all cases, primary and alternative consensus sequences are stored. For more information about stackPACK consensus sequences, see the Egenetics webpage. NOTE: Clustering data will change periodically; cluster IDs are therefore not suitable as reference points!
© Copyright 2003 - 2024, IPK Gatersleben