Abstract:
Word alignment is a crucial first step in learning statistical translation models. In this dissertation, we propose a Bayesian approach to unsupervised learning of word alignments by introducing a sparse prior on the parameters of IBM word alignment models. In the original approach, word translation probabilities are estimated using the expectation-maximization (EM) algorithm. In the proposed approach, they are random variables with a prior and are integrated out during inference, where collapsed Gibbs sampling is used. The inferred word alignments are evaluated in a statistical machine translation (SMT) setting, experimenting with several language pairs and corpus sizes and comparing against the EM and variational Bayes (VB) methods. We show that Bayesian inference outperforms both EM and VB in the majority of test cases, effectively addresses the high-fertility rare word problem in EM and the unaligned rare word problem in VB, achieves higher agreement and vocabulary coverage rates than both, and leads to smaller phrase tables. We also propose a method for unsupervised learning of the optimal segmentation for SMT. We augment the original Morfessor monolingual segmentation model with a word alignment model so that the new model optimizes the posterior probability of the parallel training corpus according to a generative segmentation-translation model. In order to speed up computation, we propose an incremental method for approximate translation likelihood calculation and a parallelizable search algorithm, which improves the performance of even the monolingual segmentation. We use the proposed method to segment the Turkish side in a Turkish-to-English SMT system and find that the bilingual model results in more intuitive segmentations but does not yield a further significant increase in BLEU scores.
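
To make the inference step concrete, the following is a minimal sketch of collapsed Gibbs sampling for a Bayesian variant of IBM Model 1. It is an illustration under stated assumptions, not the dissertation's implementation: the symmetric Dirichlet prior with a single hyperparameter `theta`, the restriction to Model 1, the random initialization, and all names (`gibbs_ibm1`, `bitext`, `tgt_vocab_size`) are hypothetical choices made here. Because the translation probabilities are integrated out, each alignment link is resampled from a ratio of counts (excluding the link itself) rather than from explicit parameter estimates.

```python
import random
from collections import defaultdict


def gibbs_ibm1(bitext, tgt_vocab_size, theta=0.0001, iterations=50, seed=0):
    """Collapsed Gibbs sampler sketch for a Bayesian IBM Model 1.

    bitext: list of (source_tokens, target_tokens) pairs.
    tgt_vocab_size: number of distinct target word types.
    theta: symmetric Dirichlet hyperparameter (an assumption; small
           values favor sparse translation distributions).
    Returns, per sentence, a list a where a[j] is the index of the
    source word (0 = NULL) aligned to target word j in the last sample.
    """
    rng = random.Random(seed)
    pair_count = defaultdict(int)  # N(s, t): links between s and t
    src_count = defaultdict(int)   # N(s): all links leaving s

    # Prepend a NULL source word so target words may stay unaligned.
    sents = [(["<NULL>"] + list(s), list(t)) for s, t in bitext]

    # Random initialization of all alignment links.
    alignments = []
    for src, tgt in sents:
        a = [rng.randrange(len(src)) for _ in tgt]
        for j, i in enumerate(a):
            pair_count[src[i], tgt[j]] += 1
            src_count[src[i]] += 1
        alignments.append(a)

    for _ in range(iterations):
        for (src, tgt), a in zip(sents, alignments):
            for j, t in enumerate(tgt):
                # Remove the current link from the counts.
                old = src[a[j]]
                pair_count[old, t] -= 1
                src_count[old] -= 1
                # Posterior predictive with parameters integrated out:
                # (N(s, t) + theta) / (N(s) + theta * V_t)
                weights = [
                    (pair_count[s, t] + theta)
                    / (src_count[s] + theta * tgt_vocab_size)
                    for s in src
                ]
                a[j] = rng.choices(range(len(src)), weights=weights)[0]
                # Add the newly sampled link back.
                new = src[a[j]]
                pair_count[new, t] += 1
                src_count[new] += 1
    return alignments


if __name__ == "__main__":
    toy = [
        ("the house".split(), "das haus".split()),
        ("the book".split(), "das buch".split()),
        ("a book".split(), "ein buch".split()),
    ]
    for links in gibbs_ibm1(toy, tgt_vocab_size=4):
        print(links)
```

The sketch returns only the final sample. How samples are combined into the alignments used for SMT training, and how the hyperparameters are set, are not specified in this abstract and are left open here.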