Named entity recognition for Turkish microblog texts using semi-supervised learning with word embeddings

Okur, Eda.

dc.contributor	Graduate Program in Computer Engineering.
dc.contributor.advisor	Özgür, Arzucan.
dc.contributor.author	Okur, Eda.
dc.date.accessioned	2023-03-16T10:02:04Z
dc.date.available	2023-03-16T10:02:04Z
dc.date.issued	2015.
dc.identifier.other	CMPE 2015 O58
dc.identifier.uri	http://digitalarchive.boun.edu.tr/handle/123456789/12291
dc.description.abstract	Recently, due to the increasing popularity of social media and the value of the information contained in real data, extracting information from informal text types such as microblog texts has gained significant attention, together with the challenges it brings to the Natural Language Processing (NLP) research community. In this study, we focused on the Named Entity Recognition (NER) problem on informal text types such as microblog texts for Turkish, which is a morphologically rich language. For that purpose, we used a semi-supervised learning approach composed of an unsupervised stage followed by a supervised stage based on neural networks. We applied a fast unsupervised method for learning continuous representations of Turkish words in vector space. We used the resulting word embeddings, together with language-independent features engineered to work better on informal text types, to build a Turkish NER system for microblog texts. To examine informal and short texts in Turkish, we focused on Twitter, the most popular microblogging environment, and evaluated our Turkish NER system on its short, unstructured messages, called tweets. Our NER system achieved better F-scores than the published results of previously proposed NER systems on Turkish tweets. To be more precise, we outperformed the state-of-the-art F-score by up to 11% on the same Turkish Twitter data. The only language-dependent stage of our system is the normalization scheme we applied to Turkish microblog texts as a preprocessing step before NER, which improves the performance of our system on informal text types. Since we did not employ any language-dependent features other than text normalization, we believe that our method can be easily adapted to microblog texts in other morphologically rich languages.
dc.format.extent	30 cm.
dc.publisher	Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2015.
dc.subject.lcsh	Microblogs.
dc.subject.lcsh	Microblogs -- Turkey.
dc.subject.lcsh	Twitter.
dc.title	Named entity recognition for Turkish microblog texts using semi-supervised learning with word embeddings
dc.format.pages	xiii, 101 leaves