Archive and Documentation Center
Digital Archive

Text normalization using lexical and contextual features


dc.contributor Graduate Program in Computer Engineering.
dc.contributor.advisor Özgür, Arzucan.
dc.contributor.author Uluşahin Sönmez, Çağıl.
dc.date.accessioned 2023-03-16T10:01:46Z
dc.date.available 2023-03-16T10:01:46Z
dc.date.issued 2014.
dc.identifier.other CMPE 2014 U68
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/12265
dc.description.abstract The informal nature of social media text makes it very difficult to process automatically with natural language processing tools. Text normalization, which restores noisy words to their canonical forms, provides a solution to this challenge. We introduce an unsupervised text normalization approach that utilizes not only lexical, but also contextual and grammatical features of social text. The contextual and grammatical features are extracted from a word association graph built from a large unlabeled social media text corpus. The graph encodes the relative positions of the words with respect to each other, as well as their part-of-speech tags. The lexical features are obtained by using the longest common subsequence ratio and edit distance measures to encode the surface similarity among words, and the double metaphone algorithm to represent the phonetic similarity. Unlike most recent approaches, which are based on generating normalization dictionaries, the proposed approach performs normalization by considering the context of the noisy words in the input text. Our results show that it achieves state-of-the-art F-score performance on a standard data set. In addition, the system can be tuned to achieve very high precision without sacrificing much recall.
dc.format.extent 30 cm.
dc.publisher Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2014.
dc.subject.lcsh Text processing (Computer science)
dc.title Text normalization using lexical and contextual features
dc.format.pages xi, 39 leaves
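
The abstract names the longest common subsequence ratio and edit distance as the surface-similarity measures. The sketch below illustrates those two measures only; it is not the thesis's implementation. The function names are hypothetical, the ratio is normalized by the length of the longer word (one common convention, assumed here), and the double metaphone step and the word association graph are not covered.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a: str, b: str) -> float:
    """LCS length divided by the longer word's length (assumed normalization)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical noisy/canonical pair, chosen only for illustration.
    noisy, canonical = "tmrw", "tomorrow"
    print(lcs_ratio(noisy, canonical), edit_distance(noisy, canonical))
```

A noisy token whose letters all appear in order in a candidate word keeps a high LCS ratio even when several vowels are dropped, while the edit distance counts the missing characters directly. That makes the two measures complementary signals of surface similarity.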

