Abstract:
Sign languages are the primary medium of communication for the Deaf. However, the lack of adequate retrieval tools for sign languages restricts the Deaf community's access to information. To address this issue, we tackle the problem of keyword search in sign language. Although keyword search is well studied in domains such as speech processing, it has received little attention in the context of sign language. To this end, we introduce improvements to an existing keyword search system for sign language, along with a new iterative training approach. We adapt Graph Attention Networks (GATs) to the sign language domain and extend their capabilities by employing a learnable mask and a separate temporal attention mechanism. Moreover, we investigate the effectiveness of the Pseudo-Relevance Feedback (PRF) technique in improving retrieval accuracy. We also demonstrate that the existing model can be trained with similarity-based methods using cosine and triplet losses, and that the resulting models can later be fused with other models to boost performance. Finally, we introduce an iterative training method, similar to Expectation-Maximization (EM), that gradually improves its predictions. This method employs a cross-modal attention mechanism and a query encoder to discover subtle video-query interactions. Experiments are carried out on the RWTH-Phoenix2014T dataset, where the effectiveness of the proposed methods is verified. The results show that pose-based models trained with a GAT-based encoder and with the iterative approach significantly improve retrieval performance.