Archives and Documentation Center
Digital Archives

Predicting intracellular functions of proteins from amino acid sequences using language processing methods

dc.contributor Graduate Program in Computer Engineering.
dc.contributor.advisor Özgür, Arzucan.
dc.contributor.author Çaldır, Bedirhan.
dc.date.accessioned 2023-10-15T06:48:30Z
dc.date.available 2023-10-15T06:48:30Z
dc.date.issued 2022
dc.identifier.other CMPE 2022 C35
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/19700
dc.description.abstract Rapidly increasing computational power and mature sequencing technologies enable the use of advanced, computationally intensive algorithms to predict the intracellular functions of proteins, one of the most important problems in computational biology. The functions of proteins emerge primarily through their three-dimensional folded structures. When these structures are interpreted as graphs, the application of graph neural networks leads to promising results. However, these approaches are limited because the three-dimensional folded structures are not yet known for most proteins. Amino acid sequences have properties similar to natural languages, and large amounts of sequence data are available, which suggests that these sequences can be processed using natural language processing (NLP) methods. In this thesis, two different NLP methods are adapted to the problem of protein function prediction, assuming that the protein sequence data contain necessary and sufficient information to predict both three-dimensional folded structure and intracellular function: (i) a bidirectional Transformer (BERT) model and (ii) a heterogeneous Graph Convolutional Network (GCN) model. The results show that it is more advantageous to treat the proteins as graphs. The GCN model performs better than the BERT model and achieves performance close to that of the state-of-the-art model that uses three-dimensional folding information. In addition, we find that tokenizing the sequences, instead of using the individual amino acids as tokens, improves performance.
dc.publisher Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2022.
dc.subject.lcsh Proteins.
dc.subject.lcsh Amino acid sequence.
dc.subject.lcsh Computational biology.
dc.title Predicting intracellular functions of proteins from amino acid sequences using language processing methods
dc.format.pages xiii, 79 leaves

