Work Packages

Work package 1: Deep genotyping and diversity analysis of the IPK barley collection

WP1a: We will use genotyping-by-sequencing (GBS) as a universal approach for deep genotyping of more than 20,000 accessions of IPK's barley collection. This will deliver a diversity catalog with different levels of data resolution suited to different analytical purposes and will yield detailed information on the genetic structure of IPK's barley collection. Homogeneity of accessions and diversity between accessions can be determined at high sensitivity, thus supporting novel measures of collection management as well as population genetic analyses. The information will be used to develop customized core collections (discovery collections) for genetic analyses such as trait mapping, gene discovery, haplotype analysis and allele mining. Since barley is predominantly self-pollinating, most domesticated barley accessions are expected to be highly homogeneous. However, seed contamination during propagation cannot be ruled out. To allow for the detection of heterogeneity, we will sample DNA by standard protocols from excised embryos of 50 germinated grains per accession (500 extractions per week per person, i.e. 22,000 samples per year). GBS libraries will be prepared and sequenced by established protocols [8] using Illumina short-read (100 nt) technology (established at IPK). 96 independently barcoded GBS samples will be pooled per lane (8 lanes per flowcell, 1x100 nt). 26 HiSeq2000 runs are required to complete GBS of 20,000 accessions (total run time = 26 weeks). Read data of one Illumina HiSeq2000 flowcell can be mapped using the established, highly automated SNP data analysis pipeline. As a reference sequence for mapping, we will use either the existing WGS assembly of barley cultivar "Morex" [7] or, as soon as available, any improved reference sequence of barley, or a superposition of both. Read mapping will be carried out in parallel with sequencing.
The estimated cumulative amount of raw read data will not exceed 6 TB (up to 30 flowcells x 0.2 TB). During computation, the required temporary disk space may be up to four times the amount of raw data (i.e. 24 TB). The existing analysis pipeline will be adapted to handle the required throughput.
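
The throughput and storage figures in WP1a can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; all constant and function names are illustrative and not part of any existing pipeline. It shows that 26 runs provide 19,968 sample slots, close to the 20,000 target, so the storage bound of up to 30 flowcells leaves headroom for additional or repeated samples.

```python
import math

# Figures quoted in WP1a; the names are illustrative assumptions.
SAMPLES_PER_LANE = 96
LANES_PER_FLOWCELL = 8
PLANNED_RUNS = 26
MAX_FLOWCELLS = 30
RAW_TB_PER_FLOWCELL = 0.2
TEMP_SPACE_FACTOR = 4
EXTRACTIONS_PER_WEEK = 500


def samples_per_flowcell() -> int:
    """Barcoded samples that fit on one flowcell."""
    return SAMPLES_PER_LANE * LANES_PER_FLOWCELL


def capacity(runs: int) -> int:
    """Total sample slots provided by a number of sequencing runs."""
    return runs * samples_per_flowcell()


def runs_needed(n_samples: int) -> int:
    """Smallest number of full runs that covers n_samples."""
    return math.ceil(n_samples / samples_per_flowcell())


def storage_tb(flowcells: int) -> tuple:
    """Return (raw data, temporary disk space) in TB for a number of flowcells."""
    raw = flowcells * RAW_TB_PER_FLOWCELL
    return raw, raw * TEMP_SPACE_FACTOR


def extraction_weeks(n_samples: int) -> float:
    """Person-weeks of DNA extraction needed for n_samples."""
    return n_samples / EXTRACTIONS_PER_WEEK


if __name__ == "__main__":
    print("samples per flowcell:", samples_per_flowcell())
    print("slots in planned runs:", capacity(PLANNED_RUNS))
    print("runs to cover 20,000 samples:", runs_needed(20_000))
    print("raw / temporary storage (TB):", storage_tb(MAX_FLOWCELLS))
    print("extraction weeks for 22,000 samples:", extraction_weeks(22_000))
```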

WP1b: Primary analysis of GBS data will be centered on the R statistical environment. We estimate the size of the final numeric genotypic matrix to be approximately 80 GB (20,000 samples x 500,000 markers x 8 bytes; 4 bytes for an integer data point in R + 4 bytes overhead). This amount of data can be held persistently in memory as an R object (data.frame) for analysis with R packages on a high-memory server. Alternatively, a high-level interface to disk-based storage of variation datasets has been coupled to population genetic methods and is available in the R package SNPRelate [13]. The purity of accessions will be assessed at markers with sufficient sequencing coverage (>20x). The matrix of genotype data will be evaluated by population genetic analysis and can be correlated with phenotypic traits. Basic population genetic summary statistics (number of segregating sites, site frequency spectrum, nucleotide diversity, fixation index F_ST between subpopulations) will be calculated in genomic subintervals with standard population genetic software (hierfstat [14], libsequence [15]). Computationally efficient methods for assessing population structure are available [13,16]. Relatedness and decay of linkage disequilibrium will be estimated by methods implemented in the SNPRelate package [13]. Recombination rates will be estimated with LDhat [17]. Genome-wide association studies (GWAS) will be performed to find significant associations of genetic markers with phenotypic information (from legacy data) or geographic information (obtained by GIS). We will apply the GAPIT R package [18], which has been used for GWAS analysis of GBS data of maize and sorghum [19,20]. To increase the power of GWAS, we will impute missing data using software tested for imputing GBS data in inbreeding crop plants [21,22].
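
The memory estimate above follows from simple multiplication. The sketch below, written in Python for illustration, reproduces the 80 GB figure and compares it with two more compact encodings. The 1-byte and 2-bit encodings are assumptions added here for comparison; they are not stated in the work plan.

```python
GB = 10**9  # decimal gigabyte, as used in the estimate above


def matrix_bytes(n_samples: int, n_markers: int, bits_per_genotype: int) -> float:
    """Storage needed for a dense samples x markers genotype matrix, in bytes."""
    return n_samples * n_markers * bits_per_genotype / 8


def matrix_gb(n_samples: int, n_markers: int, bits_per_genotype: int) -> float:
    """Same quantity expressed in decimal gigabytes."""
    return matrix_bytes(n_samples, n_markers, bits_per_genotype) / GB


if __name__ == "__main__":
    n_samples, n_markers = 20_000, 500_000
    # 8 bytes per data point: the estimate quoted in WP1b.
    print("8 bytes per genotype:", matrix_gb(n_samples, n_markers, 64), "GB")
    # Hypothetical alternatives, shown only to indicate the possible savings.
    print("1 byte per genotype:", matrix_gb(n_samples, n_markers, 8), "GB")
    print("2 bits per genotype:", matrix_gb(n_samples, n_markers, 2), "GB")
```

A compact encoding of this kind would reduce the in-memory footprint by roughly an order of magnitude or more, at the cost of decoding genotypes before numerical analysis.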
In addition to population genetic methods based on conventional SNP x sample matrices, we will explore recently developed approaches that avoid separate SNP calling/genotyping pipelines, but utilize read mapping files directly [23].
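
To make the summary statistics named in WP1b concrete, the following Python sketch computes the number of segregating sites, nucleotide diversity and a simple F_ST estimate from a small genotype matrix. It assumes inbred accessions coded as 0 (reference allele), 1 (alternative allele) or None (missing), with rows as accessions and columns as markers. The F_ST shown is a simplified heterozygosity-based estimator, (H_T - H_S) / H_T, and is not the estimator implemented in hierfstat or libsequence.

```python
from typing import List, Optional, Sequence, Tuple

Genotype = Optional[int]  # 0 = reference, 1 = alternative, None = missing
Matrix = Sequence[Sequence[Genotype]]  # rows = accessions, columns = markers


def allele_freq(column: Sequence[Genotype]) -> Tuple[Optional[float], int]:
    """Alternative-allele frequency and number of called genotypes at one marker."""
    called = [g for g in column if g is not None]
    if not called:
        return None, 0
    return sum(called) / len(called), len(called)


def segregating_sites(matrix: Matrix) -> int:
    """Count markers at which both alleles are observed."""
    count = 0
    for column in zip(*matrix):
        p, n = allele_freq(column)
        if n > 0 and 0 < p < 1:
            count += 1
    return count


def nucleotide_diversity(matrix: Matrix) -> float:
    """Sum over markers of the expected pairwise difference, n/(n-1) * 2p(1-p)."""
    total = 0.0
    for column in zip(*matrix):
        p, n = allele_freq(column)
        if n >= 2:
            total += n / (n - 1) * 2 * p * (1 - p)
    return total


def fst(pop_a: Matrix, pop_b: Matrix) -> Optional[float]:
    """Simplified F_ST = (H_T - H_S) / H_T, summed over markers called in both groups."""
    numerator = 0.0
    denominator = 0.0
    for col_a, col_b in zip(zip(*pop_a), zip(*pop_b)):
        p_a, n_a = allele_freq(col_a)
        p_b, n_b = allele_freq(col_b)
        if n_a == 0 or n_b == 0:
            continue
        h_s = (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
        p_bar = (p_a + p_b) / 2
        h_t = 2 * p_bar * (1 - p_bar)
        numerator += h_t - h_s
        denominator += h_t
    return numerator / denominator if denominator > 0 else None


if __name__ == "__main__":
    toy: List[List[Genotype]] = [[0, 0, 1], [0, 1, 1], [0, None, 1]]
    print("segregating sites:", segregating_sites(toy))
    print("nucleotide diversity:", nucleotide_diversity(toy))
    print("F_ST:", fst([[0, 1], [0, 1]], [[1, 0], [1, 0]]))
```

In practice these statistics would be computed per genomic subinterval by passing only the marker columns that fall within each window.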

Work package 2: Unlocking the value of historic legacy data

For IPK's barley collection, an extensive legacy dataset (C&E) has been accumulated and digitized (85,000 records related to 22,000 barley accessions). Since these data cannot yet be queried by the public, the objective of this work package is to transform all barley legacy data into a suitable format, perform quality checks (e.g., plausibility, completeness) and, in case of doubt, cross-check against hard copy documentation (field books, card files). Eventually, the datasets will be transferred into the GBIS (and EURISCO) database structure. Legacy data provide a powerful proxy for educated selection of germplasm from genebanks, and several algorithms and software tools for analysing and aggregating C&E data have been developed at IPK (Keilwagen et al., under revision). These software tools will be implemented in the BRIDGE data warehouse, providing the public with easy-to-use access for selecting genebank accessions based on C&E data. To this end, the data exchange formats and interfaces between GBIS (and EURISCO) on one side and the BRIDGE data warehouse on the other side will be established.

Work package 3: The link to pre-breeding and future application

WP3a: The concept of a "Genebank of the Future" envisions that potential users (e.g. plant breeders) will enter the BRIDGE data warehouse interface and query the available germplasm information by genotype, geographic origin, climatic adaptation, yield parameters or plant performance. The BRIDGE project cannot cover all of these areas due to budgetary constraints; however, we see the need to implement the architecture required to serve user needs and requests from the start. Therefore, we will perform multi-environment trials on a set of ~800 barley accessions purified by single-seed descent, for which genotyping information based on GBS and exome capture re-sequencing has already been or will be gathered (see WP3b). We will focus on pollination capability, a trait for which genetic variation is currently very limited in the Central European elite barley genepool, with most elite lines being cleistogamous (non-open flowering). Recent efforts to implement hybrid barley breeding have made it necessary to enrich the elite pool with alleles promoting allogamy for hybrid production. Pollination capability will serve as a use case to demonstrate the potential of association mapping and genomic selection methods to mine for favorable alleles available in IPK's Genebank. The collected phenotypic data will be entered into the BRIDGE data warehouse in order to design a test case query option, based on breeders' interests, for the informed selection of Genebank accessions from genotype and phenotype data.

WP3b: A second future application area of information provided by the BRIDGE data warehouse will be data mining dedicated to alleles of key genes on the basis of re-sequencing information. GBS analysis during WP1 will deliver diversity catalog information based on an average of ~190,000 (1-fold coverage) to ~35,000 (20-fold coverage) SNP loci per accession. These data are not suitable for direct allele mining, but they will guide the definition of a core set of accessions with maximal genetic distance based on diversity types ("quasi haplotypes") obtained from all "homogeneous" accessions (<10% heterogeneity at loci with >10-fold coverage). This core set will represent the initial target for re-sequencing for allele mining. Exome capture re-sequencing of 700 highly diverse, geo-referenced and Genebank-related barley accessions is already underway and will be available to the BRIDGE project (280 by IPK, own unpublished data; about 500 by the EU FP7 project WHEALBI). In WP3b we will extend this dataset with exome capture sequencing of 100 additional accessions to ensure availability of deep re-sequencing data for allele mining of the ~800 accessions evaluated under WP3a. Exome capture re-sequencing of further accessions (core set of most diverse haplotypes) is anticipated but will require additional funding beyond the frame of BRIDGE.
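
The selection logic described in WP3b can be illustrated with a short Python sketch. It first applies the homogeneity criterion (<10% heterogeneous loci among loci with >10-fold coverage) and then picks a core set by a greedy farthest-point heuristic. Both the minor-allele read fraction used to call a locus heterogeneous (`min_minor_fraction`) and the greedy heuristic itself are assumptions made for illustration; the work plan does not specify how heterogeneity is called or which algorithm maximizes genetic distance.

```python
from typing import Dict, List, Optional, Sequence, Tuple

ReadCounts = Tuple[int, int]  # (reference reads, alternative reads) at one locus


def heterogeneity(loci: Sequence[ReadCounts], min_coverage: int = 10,
                  min_minor_fraction: float = 0.1) -> Optional[float]:
    """Fraction of sufficiently covered loci at which both alleles are seen.

    min_minor_fraction is a hypothetical threshold to tolerate sequencing errors.
    """
    informative = 0
    mixed = 0
    for ref, alt in loci:
        depth = ref + alt
        if depth <= min_coverage:
            continue  # keep only loci with more than min_coverage reads
        informative += 1
        if min(ref, alt) / depth >= min_minor_fraction:
            mixed += 1
    return mixed / informative if informative else None


def is_homogeneous(loci: Sequence[ReadCounts], max_heterogeneity: float = 0.10) -> bool:
    """Apply the <10% heterogeneity criterion; accessions without data fail."""
    h = heterogeneity(loci)
    return h is not None and h < max_heterogeneity


def distance(a: Sequence[Optional[int]], b: Sequence[Optional[int]]) -> float:
    """Fraction of differing calls among markers called in both accessions."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def greedy_core_set(genotypes: Dict[str, Sequence[Optional[int]]], size: int) -> List[str]:
    """Repeatedly add the accession farthest from those already selected."""
    names = sorted(genotypes)
    if not names or size <= 0:
        return []
    selected = [names[0]]  # arbitrary but reproducible starting point
    while len(selected) < min(size, len(names)):
        candidates = [n for n in names if n not in selected]
        best = max(candidates, key=lambda n: min(
            distance(genotypes[n], genotypes[s]) for s in selected))
        selected.append(best)
    return selected


if __name__ == "__main__":
    calls = {"A": [0, 0, 0, 0], "B": [0, 0, 0, 1], "C": [1, 1, 1, 1], "D": [1, 1, 0, 0]}
    print(greedy_core_set(calls, 3))
    print(is_homogeneous([(20, 0), (0, 15), (8, 8), (5, 0)]))
```

The greedy approach scales quadratically with the number of accessions per added member, so a collection-wide run would likely need a precomputed or approximated distance matrix.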

Work package 4: The BRIDGE data warehouse

About seven million accessions of PGR are hosted globally in genebanks. Although some provide access to legacy data and limited genotypic information, none of these genebanks offers comprehensive access to deep genotype/phenotype information for entire collections. Likewise, no genebank can support informed selection of accessions other than based on passport or legacy data. We will contextualize the datasets accumulated or curated within, or linked to, the BRIDGE project (e.g. GBIS, EURISCO) with related information from public repositories in order to build an integrated platform for facilitated and educated utilization of crop plant biodiversity (Figure 1). We will explore several data warehouse strategies for hosting large-scale variation datasets. Both classical concepts based on relational databases [24-26] and NoSQL solutions such as key-value databases [27,28] will be evaluated during the data accumulation phase. Existing exome capture sequence datasets will support testing during the implementation phase. Apart from performance criteria (speed, ease of use and hardware requirements), we will put special emphasis on the possibility of integrating the BRIDGE data warehouse with the IPK Genebank Information System (GBIS) to provide a direct link to passport and legacy data. Integration efforts will take advantage of information retrieval systems developed by IPK [29].

Fig. 1: The BRIDGE variation warehouse collects primary data from genotyping and field evaluation as well as legacy data of the entire barley collection. These data will be contextualized with other public datasets. The platform will then provide interactive user interfaces to support educated selection of material for pre-breeding.

Handling large amounts of genotypic data will require a strategy for efficiently storing genotype matrices for interactive access and for visualizing analysis results. We will explore non-relational database architectures as well as data compression methods based on the Burrows-Wheeler transformation [30] as a backbone for our database. A graphical web interface will serve as the public front-end to the BRIDGE data warehouse. It will feature (i) entry points to database searches for conventional queries based on accession names or the genomic positions of markers, (ii) innovative interactive maps visualizing the geographic origin of accessions, including links to passport and legacy data, and (iii) as a guide for pre-breeding efforts, a query option that lets users compare their own marker genotype in a given genomic region against BRIDGE datasets of genebank accessions; for this we will explore the combination of haplotype matching and imputation methods.
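
As an illustration of why the Burrows-Wheeler transformation can help with compression, the following Python sketch implements the textbook transform, its inverse, and a run-length encoder. It is a naive implementation for short strings only; the sentinel character and function names are assumptions, and production tools use suffix arrays or specialized variants rather than sorting all rotations.

```python
from typing import List, Tuple


def bwt(text: str, sentinel: str = "$") -> str:
    """Textbook Burrows-Wheeler transform via sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Recover the original text by repeatedly prepending and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


def run_lengths(text: str) -> List[Tuple[str, int]]:
    """Run-length encode a string as (character, count) pairs."""
    runs: List[Tuple[str, int]] = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


if __name__ == "__main__":
    # Hypothetical genotype calls at consecutive markers for one accession.
    calls = "0101010101010101"
    transformed = bwt(calls)
    print(transformed, run_lengths(transformed))
    assert inverse_bwt(transformed) == calls
```

The transform is reversible and tends to group identical symbols from repetitive input into long runs, which a run-length or entropy coder can then store compactly.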

Apart from single queries, convenient bulk download options will be provided. The design of the web interface and of on-the-fly analysis methods will involve consultation with the collection curators of the IPK Genebank as well as with an external user panel (e.g. scientists and breeders) to define the implementation of use cases (footnote 2). Before developing our own software, we will thoroughly evaluate existing solutions from similar projects, e.g. Seeds of Discovery (CIMMYT), with regard to hardware and software requirements and the adaptability of existing tools (e.g. Germinate 3) to our needs. We will also monitor the progress of the new German Network for Bioinformatics Infrastructure (de.NBI) and seek potential synergies by adapting newly developed technologies to our needs. Preferably, a set of information systems (Figure 1) will be loosely coupled with the BRIDGE Variation Warehouse based on defined interfaces and standardized data exchange formats.

Work package 5: The link to capacity building and education

Diversity informatics opens a new scientific arena, integrating genomic information into the operational context of genebanks [6]. Specialists with novel skill combinations will be in demand, such as biologists with a strong background in genetics, computation and statistics, and people with expertise in the development and management of online databases, annotation services, and the hosting of large and computationally intensive querying resources. The BRIDGE project offers a unique opportunity to help close this gap and to involve students of several disciplines such as plant biology, agriculture, computer science and bioinformatics. IPK is collaborating in this field with the ScienceCampus Plant-based Bioeconomy of Martin-Luther University Halle-Wittenberg. We plan to link the BRIDGE project tightly to the established master programs "Nutzpflanzenwissenschaften" (module "Pflanzengenetische Ressourcen und Genomforschung") and "Bioinformatics" (modules "Algorithms on sequences II", "Statistical pattern recognition in DNA sequences"). Each year, five students of each discipline (in total 30 students) will receive three-month short-term fellowships, awarded on the basis of interest and qualification, to support individual activities and tasks of WPs 1-4. Furthermore, an interdisciplinary seminar ("Journal club on classical and current works of bioinformatics") for students of the different disciplines will be strengthened to review and discuss the most recent literature in the area of diversity informatics and utilization of PGR.

2 This will include organizing a user panel workshop at the end of year 2 to test the functionalities.