Archives and Documentation Center
Digital Archives

A Bayesian approach to the clustering problem with application to gene expression analysis

dc.contributor Ph.D. Program in Computer Engineering.
dc.contributor.advisor Cemgil, Ali Taylan.
dc.contributor.author Fidaner, Işık Barış.
dc.date.accessioned 2023-03-16T10:13:46Z
dc.date.available 2023-03-16T10:13:46Z
dc.date.issued 2016.
dc.identifier.other CMPE 2016 F53 PhD
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/12608
dc.description.abstract This thesis investigates methods for extracting information from gene expression time series data. These time series provide indirect measurements of the underlying biological mechanisms, so their analysis depends heavily on statistical modelling techniques. One particularly popular approach is to cluster genes by the similarity of their expression profiles. For scientific data analysis, however, clustering requires a rigorous methodology, and Bayesian nonparametrics provides a promising framework. In this context, two novel models were developed: the Infinite Multiway Mixture (IMM), which extends the standard infinite mixture model, and the Infinite Mixture of Piecewise Linear Sequences (IMPLS), which assumes a specific structure for its mixture components, tailored to gene expression time series. In the Bayesian paradigm, the key object for gene analysis is the posterior distribution over partitionings, given the model and the observed data. However, a posterior distribution over partitionings is a highly complicated object. Here, we apply Markov Chain Monte Carlo (MCMC) inference to obtain a sample from the posterior distribution of gene partitionings, and cluster genes using a heuristic algorithm. We also develop an alternative, novel approach to the analysis of distributions over partitions, which we call entropy agglomeration (EA). We demonstrate the use of EA in a clustering experiment on a literary text, Ulysses by James Joyce. In our bioinformatics application CLUSTERnGO (CnG), the relevance of the resulting clusters is evaluated by applying standard multiple hypothesis testing to compare them against previous biological knowledge encoded in a Gene Ontology. The complete workflow of CnG is a four-phase pipeline (Configuration, Inference, Clustering, Evaluation).
dc.format.extent 30 cm.
dc.publisher Thesis (Ph.D.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2016.
dc.subject.lcsh Gene expression -- Analysis.
dc.subject.lcsh Bayesian statistical decision theory.
dc.title A Bayesian approach to the clustering problem with application to gene expression analysis
dc.format.pages xiii, 88 leaves

