Date: November 15 - 17, 2007
Location: The Global Learning Conference Center
Welcome to the 6th Georgia Tech-Oak Ridge National Lab International Conference on Bioinformatics, in silico Biology: Gene Discovery and Systems Genomics
10th anniversary of the first conference in 1997 -- Gene Discovery in silico
Astonishing progress in genome-related biological science and engineering has been made in the ten years since the first Georgia Tech biennial International Conference on Bioinformatics on Gene Discovery was held in Autumn 1997. In particular, molecular biology has tremendously widened its horizons in terms of the number of species well studied at the cellular and molecular levels. These advances occurred in large part under the guidance of genome analysis and comparative genomics. At the same time, the power of genome-based analysis made possible numerous discoveries revealing deep and important secrets of life at the cellular and sub-cellular levels.
This year, Georgia Tech continues the tradition of organizing this scientific forum, bringing together leading, world-renowned researchers in genomics and bioinformatics to present recent advances in the field and to discuss open problems. The 6th International Conference on Bioinformatics, to be held in November, will focus on Gene Discovery and Systems Genomics.
Each of the many types of genome-scale network data --- protein interactions, genetic interactions, transcriptional regulation and mRNA response --- provides a limited view of the organization of an entire cellular system. This talk describes principled algorithms for data unification: a graph diffusion kernel that uses renormalized propagators to combine different edge types; a variational mean field approach for identifying modules based on the social equivalent of 'friends' and 'enemies'; and a new graphical model for understanding transcriptional wiring. The talk concludes with a description of a new synthetic biology project to create a yeast cell whose genome has been entirely replaced by synthetic DNA. This new life form will answer long-standing questions about the fitness benefit of introns and other non-protein-coding elements and will permit a wet-lab exploration of minimal genomes. Much of this work was inspired by the 2007 DIMACS - Georgia Tech Workshop on Complex Networks and their Applications.
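A generic graph diffusion kernel conveys the flavor of this kind of data unification. The sketch below is not the talk's renormalized-propagator construction; it simply combines two hypothetical edge types by a weighted sum of adjacency matrices and exponentiates the resulting Laplacian:

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(adj, beta=1.0):
    """Classic graph diffusion kernel K = exp(-beta * L), where L is the
    graph Laplacian of the (weighted) adjacency matrix adj. Illustrative
    only; the talk's kernel with renormalized propagators is more involved."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    return expm(-beta * laplacian)

# Toy 3-node network with two edge types (e.g. protein interactions and
# genetic interactions), unified by a weighted sum of adjacency matrices.
protein = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
genetic = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]], dtype=float)
combined = 0.7 * protein + 0.3 * genetic
K = diffusion_kernel(combined, beta=0.5)
```

Because the Laplacian annihilates the constant vector, each row of K sums to one, so the kernel behaves like heat diffusing across the combined network.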
We have developed an algorithm ("Lever") that systematically maps metazoan DNA regulatory motifs or motif combinations to the sets of genes that they likely regulate. Lever accomplishes this by assessing whether the motifs are enriched within predicted cis-regulatory modules (CRMs) in the noncoding sequences surrounding genes in a collection of gene sets. When these gene sets correspond to Gene Ontology (GO) categories, the results of Lever analysis allow the unbiased assignment of functional annotations to the regulatory motifs and also to the candidate CRMs that comprise the genomic motif occurrences. We demonstrate these methods using human myogenic differentiation as a model system, for which we statistically assessed more than 25,000 pairings of gene sets and motifs or motif combinations. These results allowed us to assign functional annotations to candidate regulatory motifs predicted previously, and to identify gene sets that are likely to be co-regulated via shared regulatory motifs. Lever represents a major genome-wide step in moving beyond the identification of putative regulatory motifs in mammalian genomes, towards understanding their biological roles. The approach is general and can be applied readily to any cell type, gene expression pattern, or organism of interest.
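As a simplified stand-in for Lever's statistics (which score predicted CRMs directly rather than gene counts), a hypergeometric test illustrates how enrichment of a motif within a gene set can be assessed; the counts below are invented:

```python
from scipy.stats import hypergeom

def motif_enrichment_pvalue(n_genome, n_motif, n_set, n_overlap):
    """P(X >= n_overlap) under a hypergeometric null: the chance that a
    gene set of size n_set contains at least n_overlap of the n_motif
    genes whose surrounding noncoding sequence carries the motif, out of
    n_genome genes total. A toy proxy for Lever's CRM-based scoring."""
    return hypergeom.sf(n_overlap - 1, n_genome, n_motif, n_set)

# Hypothetical numbers: 500 motif-bearing genes in a 20,000-gene genome,
# 30 of which fall in a 200-gene GO category (expected by chance: ~5).
p = motif_enrichment_pvalue(20000, 500, 200, 30)
```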
Fast sequence similarity searches, multiple sequence alignments and tree building have become the three most indispensable bioinformatic tools for biologists in the post-genomic era. These tools are central to elucidating the functions and evolutionary patterns of the many genes whose sequences are now in databases. I will report on the Phylogeny.fr project, the goal of which is to provide state-of-the-art algorithms in an integrated manner, with an interface and logic simple enough to make them accessible to experimental biologists with little or no training in bioinformatics or molecular evolution. A beta-test version is available at phylogeny.fr.
Phylogeny.fr, developed jointly by the MAB team (headed by Olivier Gascuel) at LIRMM (Montpellier, France) and the SGI laboratory (Marseille, France), is a special project of the French national genopole network agency (RNG).
Transcribed regions have long been regarded as a distinguishing characteristic of functional portions of the human genome. As part of the Encyclopedia of DNA Elements (ENCODE) project, the sites of transcription in the non-repeat sequences across a representative 1% of the human genome have been determined in a large number of different cell line/tissue samples using high-throughput transcription interrogation techniques. In addition, a detailed annotation of the protein-coding content of the ENCODE regions has been obtained through a combination of computational, experimental and manual methods. Overall, at least 90% of the ENCODE regions appear to be transcribed as primary nuclear transcripts, and about 15% are present as mature processed polyadenylated transcripts. Interestingly, up to 30% of these sites of transcription had not been previously identified.
In addition, using a combination of 5' Rapid Amplification of cDNA Ends (RACE) and high-density tiling arrays, we have systematically explored the transcriptional diversity of protein-coding loci. RACE allows detection of low-copy-number transcripts/isoforms and high-resolution analysis of individual genes, while pooling strategies and array hybridization permit a high-throughput readout. We identified previously unannotated and often tissue/cell line-specific transcribed fragments (RACEfrags), both 5' distal to the annotated 5' terminus and internal to the annotated gene bounds, for the vast majority (81.5%) of the tested genes. Half of the distal RACEfrags span large segments of genomic sequence away from the main portion of the coding transcript and often overlap the upstream annotated gene. The 5'-most novel exons detected are significantly associated with independently derived evidence of transcription initiation. Notably, more than 50% of the novel transcripts resulting from the inclusion of novel exons have changes in their open reading frames. A significant fraction of distal RACEfrags show expression levels comparable to those of known exons of the same locus, suggesting that they are not part of rare minority splice forms. These results may revise our current understanding of the architecture of protein-coding genes. They have significant implications for our views on the locations of regulatory regions in the genome and for the interpretation of sequence polymorphisms mapping to regions hitherto considered "non-coding", ultimately bearing on the identification of disease-related sequence alterations.
MicroRNAs (miRNAs) are ~22-nucleotide non-coding RNAs that regulate the expression of protein-coding genes through translational repression and/or degradation of mRNA. They are known to regulate cell proliferation and death, and it has been found that miRNA expression signatures can distinguish cancer subtypes and predict biological and clinical behavior within the same cancer type. An understanding of miRNA function is likely to lead to novel therapeutic treatments for cancer.
Combined computational/experimental approaches have played a significant role in recent years in the identification of novel microRNAs (miRNAs), as well as in the analysis of their function. I will also discuss the role of edited microRNAs, the implications of SNPs within miRNA targets, and the expression of microRNAs in cancer.
Since the completion of the Human Genome Project, high-throughput experimental projects have been initiated to uncover genomic information in an extended sense, including the transcriptome and proteome, as well as the metabolome, glycome, and other genome-encoded information. Together with traditional genome sequencing for an increasing number of organisms from bacteria to higher eukaryotes, we are beginning to understand the genomic space of possible genes and proteins that make up the biological system. In contrast, we have very limited knowledge about the chemical space of possible chemical substances that exist as an interface between the biological world and the natural world. This situation is rapidly changing thanks to the chemical genomics initiatives for systematic screening of biologically active chemical compounds. The KEGG resource (http://www.genome.jp/kegg/) has been widely used as a reference knowledge base for linking genomes to life through the process of PATHWAY mapping, which maps, for example, the genomic or transcriptomic content of genes to KEGG reference pathways to infer systemic behaviors of the cell or the organism. In addition, KEGG now provides a reference knowledge base for linking genomes to the environment, such as for the analysis of drug-target relationships, through the process of BRITE mapping. KEGG BRITE is an ontology database representing functional hierarchies of various biological objects, including molecules, cells, organisms, diseases, and drugs, as well as relationships among them. I will discuss how KEGG can be used to extract biological information encoded in small molecular structures, such as for predicting the metabolic fate of xenobiotic chemical compounds.
As more non-coding RNAs are discovered, the importance of methods for RNA analysis increases. Since the structure of an ncRNA is intimately tied to the function of the molecule, programs for RNA structure prediction are necessary tools in this growing field of research. Furthermore, it is known that RNA structure is often evolutionarily more conserved than sequence. However, few existing methods are capable of simultaneously performing multiple sequence alignment and structure prediction. We present a novel solution to the problem of simultaneous structure prediction and multiple alignment of RNA sequences. The algorithm MASTR (Multiple Alignment of STructural RNAs) iteratively improves both sequence alignment and structure prediction for a set of RNA sequences. This is done by minimizing, using simulated annealing, a combined cost function that considers sequence conservation, covariation and base-pairing probabilities. The results show that the method is very competitive with similar programs available today, both in terms of accuracy and computational efficiency. In the talk, I will focus particularly on covariance: how to use it and how much it helps.
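The annealing loop itself can be sketched generically; the toy numeric cost below merely stands in for MASTR's combined score of sequence conservation, covariation and base-pairing probabilities:

```python
import math
import random

def simulated_annealing(state, cost, propose, t0=1.0, cooling=0.995,
                        steps=5000, seed=0):
    """Generic simulated-annealing minimizer: accept a proposed move if it
    lowers the cost, or with Boltzmann probability exp(-delta/t) otherwise,
    while the temperature t decays geometrically. A sketch of the strategy
    MASTR uses, not its actual move set or scoring."""
    rng = random.Random(seed)
    cur, cur_cost = state, cost(state)
    best, best_cost = cur, cur_cost
    t = t0
    for _ in range(steps):
        cand = propose(cur, rng)
        c = cost(cand)
        if c < cur_cost or rng.random() < math.exp((cur_cost - c) / t):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
        t *= cooling
    return best, best_cost

# Toy problem: minimize (x - 3)^2 by random local perturbations.
best, best_cost = simulated_annealing(
    0.0, lambda x: (x - 3) ** 2,
    lambda x, rng: x + rng.uniform(-0.5, 0.5))
```

In MASTR the "state" would be an alignment plus a consensus structure, and a "move" would locally change one or the other; the acceptance rule is the same.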
Understanding how genes are regulated in various circumstances (e.g., heat shock or starvation) is a central problem in molecular biology. The adoption of large-scale biological data generation techniques such as mRNA microarrays has enabled researchers to tackle the gene regulation problem in a global way. I will survey some computational and statistical strategies developed by our group on how to effectively use gene upstream sequence information in conjunction with mRNA expression microarray data to dissect the gene regulatory network.
I term these methods "predictive modeling approaches" because they take the expression data (or other gene-level quantitative measurements) as the response variable and attempt to use sequence information to predict the response. The main advantages of these approaches are: (a) they are much more sensitive and specific than sequence-only motif-discovery approaches when the motif signal is weak; (b) many advanced statistical learning tools can be used, and various sophisticated dimension-reduction and variable-selection techniques can be applied within this coherent framework; (c) the discovered motifs or other sequence patterns can be "statistically" confirmed by cross-validation instead of relying purely on previous biological knowledge or further follow-up experiments.
We first demonstrate a re-analysis of the dataset from Beer and Tavazoie (2004), which serves to warn against "over-interpretation" when a predictive modeling approach is used. Then we describe some successful applications of the methods, such as a study of RacA binding activities in Bacillus subtilis and statistical analyses of histone modification and nucleosome occupancy data.
The family of miRNAs has recently extended beyond metazoans and plants with the discovery of virally encoded miRNAs. This expands our view of the cellular processes regulated by miRNAs, with the intriguing prospect of host-pathogen cross-talk mediated by miRNAs. Although much progress has been made in the miRNA field, the identification of miRNA targets remains one of the most challenging tasks. We have developed a predictive algorithm, RepTar, which is independent of evolutionary conservation considerations and thus better suited for identifying targets of the less conserved viral miRNAs. We applied our algorithm to all human 3'UTRs in search of targets of the miRNAs of Human Cytomegalovirus (HCMV), Epstein-Barr virus (EBV) and Kaposi's sarcoma-associated herpesvirus (KSHV). A statistically significant number of target genes were common to all three viruses, despite the lack of sequence conservation between their miRNAs. This common group of targets was enriched in genes involved in apoptosis and cell signaling, demonstrating a convergent evolution of the viruses to use miRNAs for down-regulation of processes that may threaten their survival within the host cell. We also found immune-response genes among the putative common targets, suggesting that the viruses may use miRNAs to evade the immune system. This conjecture was verified experimentally for the HCMV miRNA UL112 and its top predicted target, MICB, an activating ligand of Natural Killer cells. Experimental validation showed down-regulation of MICB by hcmv-miR-UL112 and reduced killing of the infected cells by Natural Killer cells. Our results demonstrate a novel viral miRNA-based immuno-evasion strategy and have promising therapeutic implications.
The forces shaping genome organization and evolution in bacteria include many factors beyond the protein-coding genes themselves. Much of the noncoding DNA encodes important regulatory signals as well as remnants of past evolutionary events. In this talk I will focus first on the transcriptional structure of bacterial genomes and on bioinformatics methods for identifying this structure. I will describe TransTermHP, a new method for identifying transcription terminators in bacteria, and discuss how we serendipitously discovered new DNA uptake signal sequences with this system. I will also discuss our OperonDB algorithm, which uses comparative genomics methods to identify genes that are co-transcribed in a wide range of species. Finally, I will discuss our recent examination of the surprisingly high rate of overlapping protein-coding genes in bacteria, and a simple evolutionary model that explains the patterns of overlaps seen in hundreds of genomes.
Increasingly, when a genome is newly sequenced, related, previously sequenced genomes already exist. Frequently, these informant genomes have been annotated and cDNA data exist for them. We present a method for generating cross-species alignments of alien cDNA sequences to a target genomic sequence. The cDNAs are first aligned to their native genome, and this alignment is then mapped to the target genome via the syntenic alignment between the two genomes. This indirect alignment method allows us to align a much larger fraction of the alien cDNA to the target genome than established direct alignment methods do. We use these mapped alignments as approximate information for gene annotation in the target genome. When mapping ORF-annotated RefSeq alignments from mouse, rat, dog, cow and chicken to human, the images of the alignments in the human genome frequently match a human gene structure fairly closely. We integrate such "hints" about the approximate boundaries of exons and introns into a gene finder. Taking the human genome as a test case, but using only non-human cDNA, our gene predictions are not much less accurate than the best automatic methods which use all available data, including human mRNAs. This annotation methodology is well suited for many new genomes; for example, for annotating new vertebrate genomes which have few native full-length mRNAs but which would benefit from the large amount of existing human, mouse and rat mRNA sequences.
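The indirect mapping step can be illustrated with a toy liftover-style coordinate mapper. Real syntenic alignments carry gaps, strand flips and rearrangements that this sketch ignores; the block coordinates below are invented:

```python
def map_position(pos, blocks):
    """Map a coordinate from a source genome to a target genome through a
    list of gapless syntenic alignment blocks (src_start, tgt_start, length).
    Returns None when the position falls between blocks, i.e. has no image
    in the target genome. A toy stand-in for mapping a cDNA alignment
    through a whole-genome alignment."""
    for src_start, tgt_start, length in blocks:
        if src_start <= pos < src_start + length:
            return tgt_start + (pos - src_start)
    return None

# Two collinear blocks with an unaligned gap between them.
blocks = [(100, 5000, 50), (200, 5100, 80)]
```

A cDNA-to-native-genome alignment would be mapped exon by exon through such blocks, and the resulting images used as exon/intron boundary hints for the gene finder.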
Dissecting the genetic architecture of complex phenotypes remains a major challenge for human genetics. The availability of large genotyping microarrays has propelled Genome-Wide Association Studies (GWAS) to become a major tool for the genetic analysis of complex diseases. The development of low-cost sequencing technology is a widely anticipated next technological breakthrough. Simultaneous progress in collecting genetic material from large clinical populations will enable human geneticists to conduct studies on samples of tens of thousands of individuals. We investigate the potential of these two technological advances to enable fundamentally new ways of identifying genes underlying human complex phenotypes.
Our study was based on a computational population genetic model inferred from the deepest systematic re-sequencing dataset to date. The model incorporates incoming mutations, the demographic history of the human population, and natural selection. We demonstrate that the model accurately recapitulates an experimental dataset of 757 sequenced individuals. We use this model to simulate genetic variation in larger population cohorts. We further simulate sequencing of samples from phenotypic extremes to evaluate the power of unbiased discovery of genes underlying the phenotype.
Our results suggest that genome-wide analysis of rare coding variation in individuals at phenotypic extremes will provide a powerful tool for discovery of new gene-phenotype associations. However, these studies would require sequencing of very large population panels exceeding 10,000 individuals.
Recent advances in sequencing technologies have produced an unprecedented amount of genomic sequence data. Whole-genome shotgun sequencing produces increasingly higher coverage of a genome with random sequence reads. It is commonly believed that 3- to 5-fold coverage of genomic DNA can yield a sufficient amount of biologically meaningful data if the appropriate analysis methods are applied. However, draft genomes pose special problems for the annotation process. We have come to rely on the output of annotation pipelines and assume that the data are "correct". We would like to present some examples which illustrate the danger of these assumptions.
The NCBI approach to eukaryotic genome annotation uses a combination of homology searching and ab initio modeling. To assess the quality of genome annotation, we have compared several vertebrate genomes with varying depths of coverage and levels of assembly. We have analyzed frameshifted genes in the context of genome assembly and have shown that some annotation problems indicate potential assembly issues.
Whole-genome alignments are invaluable for comparative genomics. Before doing any comparative analysis on a region of interest, one must have confidence in that region's alignment. This talk presents a methodology to measure the accuracy of arbitrary regions of these alignments. We have applied it to the UCSC Genome Browser's 17-vertebrate alignment. We identify 9.7% (21 Mbp) of the human chromosome 1 alignment as suspiciously aligned. We present independent evidence that many of these suspicious regions represent misalignments. This is joint work with Amol Prakash.
Understanding gene function and regulation on a whole-genome scale is a key challenge for systems biology. The discovery and wide adoption of functional genomics technologies in the past decades promised a rapid means to address this challenge and has fueled the development of numerous computational methods to deal with the resulting data. However, functional understanding of the proteome still lags behind experimental data generation. We address this data-knowledge disconnect through close integration of computation and experimentation in an iterative framework. I will present this framework and discuss our successful application of it to discover novel biology of the mitochondria.
In this talk we review a biophysical method for predicting transcription factor affinity for binding sites on the DNA. The affinity prediction is calibrated to reproduce ChIP-chip values where these are available, while also allowing prediction based solely on a weight matrix description of a binding site. Recently, we also computed statistics for the significance of the affinity values, which allows comparison of the predicted binding behavior of different factors. Further, we identified tissue-specific transcription factors by analyzing promoters from sets of tissue-specific genes. The results confirm established knowledge and provide several new predictions.
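A minimal sketch of such a biophysical affinity score, assuming a simple hand-made mismatch-energy matrix in place of a calibrated weight matrix, sums Boltzmann weights over all windows of a sequence:

```python
import math

def pwm_affinity(seq, energy, beta=1.0):
    """Biophysical-style affinity of a factor for a sequence: the sum of
    Boltzmann weights exp(-beta * E(site)) over all windows, where E(site)
    is a per-position binding energy taken from the matrix `energy`
    (one dict of base -> energy per position). A simplified illustration;
    the calibration against ChIP-chip described in the talk is not shown."""
    width = len(energy)
    total = 0.0
    for i in range(len(seq) - width + 1):
        e = sum(energy[j][seq[i + j]] for j in range(width))
        total += math.exp(-beta * e)
    return total

# Invented energy matrix for a 3-bp site preferring "TAT":
# 0 for the preferred base at each position, a penalty of 2 otherwise.
energy = [{b: (0.0 if b == pref else 2.0) for b in "ACGT"}
          for pref in "TAT"]
score = pwm_affinity("GGTATGG", energy)
```

A sequence containing the preferred site contributes a weight near 1 for that window, so its total affinity dominates that of a sequence with only poor matches.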
The identification of regulatory elements from different cell types is necessary for understanding the mechanisms controlling cell type-specific and housekeeping gene expression. Mapping DNaseI hypersensitive (HS) sites is an accurate method for identifying the location of functional regulatory elements. We have used a high-throughput method, called DNase-chip, to identify 3904 DNaseI HS sites from six cell types across 1% of the human genome. A significant number (22%) of DNaseI HS sites from each cell type are ubiquitously present among all cell types studied. Surprisingly, nearly all of these ubiquitous DNaseI HS sites correspond to either promoters or insulator elements: 86% of them are located near annotated transcription start sites (TSS) and 10% are bound by CTCF, a protein with known enhancer-blocking insulator activity. We also identified a large number of DNaseI HS sites that are cell type-specific (only present in one cell type); these regions are enriched for enhancer elements and correlate with cell type-specific gene expression as well as cell type-specific histone modifications. Finally, we find that approximately 8% of the genome overlaps a DNaseI HS site in at least one of the six cell types studied, indicating that a significant percentage of the genome is potentially functional.
In this talk, we will focus on the role of DNA methylation in genome evolution, especially the evolution of vertebrate promoters. The majority of human promoters are associated with high CpG dinucleotide content; these are referred to as HCG (high-CpG), or 'strong CpG island', promoters. The structural distinction between HCG and LCG (low-CpG) promoters extends to functional differences: HCG promoters are enriched in broadly expressed housekeeping genes, while LCG promoters often exhibit tissue-specific expression. We will demonstrate that this structural and functional distinction between LCG and HCG promoters evolved early in vertebrate evolution, by showing that (i) the distinction between LCG and HCG promoters is conserved in a variety of vertebrate species, while it is not present in a urochordate genome, and (ii) along vertebrate evolution the clustering of LCG and HCG has become more defined. We suggest that during early vertebrate evolution, as DNA methylation spread throughout the whole genome, the promoter regions of some broadly expressed genes were 'protected' from DNA methylation. Our results emphasize the conserved role of DNA methylation in the regulation of gene expression during vertebrate evolution, and provide a scenario for the evolution of CpG islands associated with promoters.
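The HCG/LCG distinction is commonly quantified with the normalized (observed/expected) CpG ratio; the sketch below uses that classic measure, with invented example sequences, and any particular cutoff between the two classes is an assumption rather than the talk's exact criterion:

```python
def cpg_ratio(seq):
    """Normalized CpG content: observed CpG dinucleotides divided by the
    count expected from the C and G frequencies, i.e.
    (#CpG * length) / (#C * #G). High values characterize HCG ('strong
    CpG island') promoters; low values characterize LCG promoters."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return (seq.count("CG") * n) / (c * g)

# Invented example sequences: CpG-dense vs. CpG-depleted.
hcg_like = "CGCGCGGCGCCGCG"
lcg_like = "ATGCATTTAAGTAC"
```

In practice the ratio is computed over a window around each transcription start site, and the resulting distribution is bimodal, which is what allows promoters to be clustered into the two classes.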