Abstract:
Information about the physical interactions among proteins is crucial, since protein-protein interactions (PPIs) are central to many biological processes. Knowing the experimental techniques used to verify PPIs is also vital for characterizing the identified PPIs and assessing their reliability. Much of the information about PPIs and the experimental methods used to detect them is available only in the text of the scientific publications that report them. In this thesis, we approach the problem of identifying passages that describe experimental methods for detecting physical interactions between proteins as an information retrieval task. The baseline system is based on query matching, where the queries are generated from the names (including synonyms) of the experimental methods in the Proteomics Standards Initiative - Molecular Interactions (PSI-MI) ontology. We propose two methods that expand the baseline queries with additional relevant terms. The first method is a supervised approach, in which the most salient terms for each experimental method are obtained with the term frequency-relevance frequency (tf.rf) metric computed over 13 articles from our manually annotated data set of 30 full-text articles. This data set is made publicly available as an additional contribution of this study. The first method is evaluated on a test set consisting of the remaining 17 articles and achieves higher recall than the baseline. The second method is an unsupervised approach, in which the queries for each experimental method are expanded by using the word embeddings of the names of the experimental methods in the PSI-MI ontology. The second method achieves higher recall and F-measure than the baseline on the same test set.
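As a rough illustration of the supervised expansion step, the sketch below scores terms with tf.rf as the metric is commonly defined, rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive and negative passages that contain the term. Several details are assumptions made for illustration and are not taken from the thesis: passages are pre-tokenized lists of words, tf is the raw count of a term over the positive passages, and the function names and the cut-off `k` are hypothetical.

```python
import math
from collections import Counter


def rf(term, positive_sets, negative_sets):
    """Relevance frequency: log2(2 + a / max(1, c))."""
    a = sum(1 for doc in positive_sets if term in doc)  # positive passages containing the term
    c = sum(1 for doc in negative_sets if term in doc)  # negative passages containing the term
    return math.log2(2 + a / max(1, c))


def tf_rf_scores(positive_docs, negative_docs):
    """Score every term that occurs in the positive passages.

    Assumption: tf is the raw term count over all positive passages.
    """
    tf = Counter(tok for doc in positive_docs for tok in doc)
    pos_sets = [set(doc) for doc in positive_docs]
    neg_sets = [set(doc) for doc in negative_docs]
    return {t: n * rf(t, pos_sets, neg_sets) for t, n in tf.items()}


def expand_query(base_terms, positive_docs, negative_docs, k=3):
    """Append the k highest-scoring new terms to the baseline query."""
    scores = tf_rf_scores(positive_docs, negative_docs)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    extra = [t for t in ranked if t not in base_terms][:k]
    return list(base_terms) + extra


if __name__ == "__main__":
    # Toy passages, invented for this example only.
    positive = [
        ["cells", "lysed", "antibody", "immunoprecipitation"],
        ["antibody", "beads", "immunoprecipitation"],
    ]
    negative = [
        ["cells", "yeast", "two", "hybrid"],
        ["cells", "bait", "prey"],
    ]
    print(expand_query(["immunoprecipitation"], positive, negative, k=2))
```

Here the positive passages stand in for passages annotated with one experimental method and the negative passages for those annotated with any other method. Terms that are frequent in the positive passages but rare elsewhere rank highest, which is the property that makes them plausible query expansions.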