List of Articles: Word Embedding

      • Open Access Article

        1 - Farsi Conceptual Text Summarizer: A New Model in Continuous Vector Space
        Mohammad Ebrahim Khademi, Mohammad Fakhredanesh, Seyed Mojtaba Hoseini
        Traditional methods of summarization were very costly and time-consuming, which led to the emergence of automatic methods for text summarization. Extractive summarization is an automatic method that generates a summary by identifying the most important sentences of a text. In this paper, two innovative approaches are presented for summarizing Persian texts. In these methods, using a combination of deep learning and statistical methods, we cluster the concepts of the text and, based on the importance of the concepts in each sentence, extract the sentences that carry the greatest conceptual load. With the first, unsupervised method, without using any hand-crafted features, we achieved state-of-the-art results on the Pasokh single-document corpus compared to the best supervised Persian methods. To better understand the results, we also evaluated the human summaries written by the contributing authors of the Pasokh corpus as a yardstick for the success rate of the proposed methods; in terms of recall, our methods achieved favorable results. In the second method, by introducing a title-effect coefficient and increasing it, the average ROUGE-2 score improved by 0.4% on the Pasokh single-document corpus compared to the first method, and the average ROUGE-1 score improved by 3% on the Khabir news corpus.
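        The pipeline above can be illustrated with a short sketch: cluster pretrained word vectors into concepts, weight each concept by its frequency in the text, score each sentence by the concepts it covers, and keep the top-scoring sentences. This is a minimal illustration under assumptions (pretrained vectors supplied as a dict; K-means standing in for the paper's clustering), not the authors' exact method.

        ```python
        # Minimal sketch of concept-based extractive summarization.
        # Assumptions: `word_vectors` maps tokens to pretrained embedding
        # vectors, and K-means stands in for the paper's concept clustering.
        import numpy as np
        from sklearn.cluster import KMeans

        def summarize(sentences, word_vectors, n_concepts=10, n_keep=3):
            words = sorted({w for s in sentences for w in s.split() if w in word_vectors})
            if not words:
                return sentences[:n_keep]
            vecs = np.array([word_vectors[w] for w in words])
            k = min(n_concepts, len(words))
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(vecs)
            concept_of = dict(zip(words, labels))
            weights = np.bincount(labels, minlength=k).astype(float)
            weights /= weights.sum()                      # concept importance = frequency

            def score(sentence):
                concepts = {concept_of[w] for w in sentence.split() if w in concept_of}
                return sum(weights[c] for c in concepts)  # conceptual load of the sentence

            ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
            return [sentences[i] for i in sorted(ranked[:n_keep])]  # keep original order
        ```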
      • Open Access Article

        2 - Density Measure in Context Clustering for Distributional Semantics of Word Sense Induction
        Masood Ghayoomi
        Word Sense Induction (WSI) aims at inducing word senses from data without using prior knowledge. The absence of labeled data has motivated researchers to use clustering techniques for this task. Clustering algorithms come in two types: parametric and non-parametric. Although non-parametric clustering algorithms are more suitable for inducing word senses, their shortcomings make them impractical. Parametric clustering algorithms, meanwhile, show competitive results but suffer from a major drawback: a fixed number of clusters must be set in advance. The main contribution of this paper is to show that using the silhouette score, normally an internal evaluation metric for measuring cluster density, within a parametric clustering algorithm such as K-means captures word senses in the WSI task better than the state-of-the-art models. To this end, a word embedding approach is used to represent words' contextual information as vectors. To capture the context in the vectors, we propose two modes of experiment: building the vectors from either the whole sentence or a limited number of surrounding words in the local context of the target word. Experimental results based on the V-measure evaluation metric show that the two modes of our proposed model beat the state-of-the-art models by 4.48% and 5.39%, respectively. Moreover, the average and maximum numbers of clusters in the outputs of our proposed models are close to those of the gold data.
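        The core idea, selecting the number of K-means clusters by maximizing the silhouette score, can be sketched as below. Building each context vector by averaging word embeddings over a window around the target word is a simplifying assumption standing in for the paper's two modes (whole sentence vs. local window).

        ```python
        # Sketch: choose K for K-means by maximizing the silhouette score,
        # then treat each cluster as one induced sense of the target word.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.metrics import silhouette_score

        def context_vector(tokens, target_idx, word_vectors, window=5):
            # Average the embeddings of words around the target (assumption).
            lo, hi = max(0, target_idx - window), target_idx + window + 1
            vecs = [word_vectors[w] for w in tokens[lo:hi] if w in word_vectors]
            return np.mean(vecs, axis=0)

        def induce_senses(context_vecs, k_max=10):
            X = np.asarray(context_vecs)
            best_k, best_score, best_labels = 2, -1.0, None
            for k in range(2, min(k_max, len(X) - 1) + 1):
                labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
                score = silhouette_score(X, labels)   # cluster density/separation
                if score > best_score:
                    best_k, best_score, best_labels = k, score, labels
            return best_k, best_labels                # chosen K and sense labels
        ```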
      • Open Access Article

        3 - Utilizing Gated Recurrent Units to Retain Long Term Dependencies with Recurrent Neural Network in Text Classification
        Nidhi Chandra, Laxmi Ahuja, Sunil Kumar Khatri, Himanshu Monga
        Text classification is one of the key areas of research in natural language processing. Most organizations receive customer reviews and feedback on their products and want to act on them quickly. Manual review would take considerable time and effort and could affect product sales, so organizations increasingly rely on machine learning algorithms to process such text in real time. Gated recurrent units (GRUs), an extension of the recurrent neural network that adds a gating mechanism, provide exactly this capability. Recurrent neural networks (RNNs) have proven to be the main choice for sequence classification and are effective at retaining information from past steps and using it to adjust predictions. The GRU model mitigates the vanishing-gradient problem, letting the network learn long-term dependencies in text; use cases such as sentiment analysis benefit directly from this ability, which is why GRUs are combined with RNNs when long-term dependencies must be retained. This paper presents a text classification technique in which sequential word embeddings are processed by gated recurrent units with sigmoid activations in a recurrent neural network. It focuses on classifying text with the GRU method, which embeds text into fixed-size matrices and explicitly informs the network of long-term dependencies. We applied the GRU model to the movie review dataset and achieved a classification accuracy of 87%.
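        A minimal Keras sketch of the architecture the abstract outlines, an embedding layer feeding a GRU with a sigmoid output unit, is given below. The vocabulary size, embedding dimension, and GRU width are illustrative assumptions, not the paper's exact hyperparameters.

        ```python
        # Sketch: word embeddings -> GRU -> sigmoid for binary review sentiment.
        import tensorflow as tf
        from tensorflow.keras import layers

        VOCAB_SIZE, EMBED_DIM = 20000, 128      # illustrative assumptions

        model = tf.keras.Sequential([
            layers.Embedding(VOCAB_SIZE, EMBED_DIM),   # learned word embeddings
            layers.GRU(64),                            # gating retains long-term dependencies
            layers.Dense(1, activation="sigmoid"),     # positive/negative review probability
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        # model.fit(x_train, y_train, epochs=5) on padded sequences of word ids
        ```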
      • Open Access Article

        4 - Using Sentiment Analysis and Combining Classifiers for Spam Detection in Twitter
        Mehdi Salkhordeh Haghighi, Aminolah Kermani
        The popularity of social networks, especially Twitter, has posed a new challenge to researchers: spam. Numerous approaches to dealing with spam have been proposed. In this study, we attempt to improve spam detection accuracy by applying one of the latest spam detection techniques in combination with sentiment analysis. Using the word embedding technique, we feed the tweet text into a convolutional neural network (CNN) whose output indicates whether the text is spam. Simultaneously, by extracting suitable features from the Twitter network and applying machine learning methods to them, we perform spam detection separately. Finally, we feed the output of both approaches into a meta-classifier whose output gives the final decision on whether the tweet is spam. We employ both balanced and imbalanced datasets to examine the impact of the proposed model on the two types of data. The results indicate an increase in the accuracy of the proposed method on both datasets.
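        The stacking step the abstract describes can be sketched briefly: the CNN's text-based spam probability and the feature-based classifier's probability become a two-column input to a meta-classifier. Logistic regression here is an illustrative stand-in for the paper's meta-classifier, and the base scores are assumed to be precomputed.

        ```python
        # Sketch of the meta-classifier (stacking) step: combine two base
        # spam scores into a final spam/ham decision.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def train_meta(cnn_scores, feature_scores, labels):
            # Each row: [CNN text score, feature-based score]; label 1 = spam.
            X = np.column_stack([cnn_scores, feature_scores])
            return LogisticRegression().fit(X, labels)

        def predict_spam(meta, cnn_score, feature_score):
            return int(meta.predict(np.array([[cnn_score, feature_score]]))[0])
        ```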
      • Open Access Article

        5 - Semantic Word Embedding Using BERT on the Persian Web
        Shekoofe Bostan, Ali-Mohammad Zare-Bidoki, Mohammad-Reza Pajoohan
        Using the context and order of the words in a sentence leads to a better understanding of it. Pre-trained language models have recently achieved great success in natural language processing, and among them BERT has become increasingly popular. This problem had not been investigated for the Persian language and is considered a challenge in the Persian web domain. In this article, the embedding of the Persian words forming a sentence is investigated using BERT. In the proposed approach, a model was trained on a Persian web dataset, and the final model was produced through two stages of fine-tuning with different architectures. Finally, features were extracted from the model and evaluated on document ranking. The results obtained from this model improve on those of the other investigated models, exceeding the accuracy of the multilingual BERT model by at least one percent. Moreover, applying the fine-tuning process with our proposed structure to other existing models improved model and embedding accuracy after each fine-tuning stage. This process improves the accuracy of Persian web ranking by around 5%.
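        A minimal sketch of the evaluation step, extracting mean-pooled BERT sentence vectors with Hugging Face transformers and ranking documents by cosine similarity to the query, is shown below. The multilingual checkpoint is a stand-in; the paper's own fine-tuned Persian-web model is not publicly named here.

        ```python
        # Sketch: mean-pooled BERT embeddings used for query-document ranking.
        import torch
        from transformers import AutoModel, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
        model = AutoModel.from_pretrained("bert-base-multilingual-cased")

        def embed(text):
            inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
            with torch.no_grad():
                hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden)
            return hidden.mean(dim=1).squeeze(0)             # mean-pooled sentence vector

        def rank(query, documents):
            q = embed(query)
            scores = [torch.cosine_similarity(q, embed(d), dim=0).item() for d in documents]
            return sorted(zip(documents, scores), key=lambda p: p[1], reverse=True)
        ```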
      • Open Access Article

        6 - Ranking Improvement Using BERT
        Shekoofe Bostan, Ali-Mohammad Zare-Bidoki, Mohammad-Reza Pajoohan
        In today's information age, efficient document ranking plays a crucial role in information retrieval systems. This article proposes a new approach to document ranking based on embedding models, focusing on the BERT language model to improve ranking results. The proposed approach uses word embedding methods to build semantic representations of user queries and document content. By converting textual data into semantic vectors, the relationships and similarities between queries and documents are evaluated by the proposed ranking relations at lower cost. The proposed ranking relations consider several factors to improve accuracy, including word embedding vectors, keyword location, and the impact of high-value words on ranking based on semantic vectors. Comparative experiments and analyses were conducted to evaluate the effectiveness of the proposed relations. The empirical results demonstrate that the proposed approach achieves higher accuracy than common ranking methods. These results indicate that using embedding models and combining them in the proposed ranking relations significantly improves ranking accuracy, reaching 0.87 in the best case. This study contributes to improving document ranking and demonstrates the potential of BERT embeddings for improving ranking performance.
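        As a rough illustration of a ranking relation of the kind described, the sketch below combines embedding cosine similarity with a bonus for query terms that occur early in the document. The 0.8/0.2 weights and the position formula are assumptions for illustration, not the paper's actual relations.

        ```python
        # Sketch of a ranking relation: embedding similarity plus a bonus
        # for query terms appearing near the start of the document.
        import numpy as np

        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

        def position_bonus(query_terms, doc_tokens):
            # Earlier occurrences of query terms contribute more (assumption).
            hits = [1.0 - i / len(doc_tokens)
                    for i, tok in enumerate(doc_tokens) if tok in query_terms]
            return sum(hits) / len(query_terms) if query_terms else 0.0

        def rank_score(query_vec, doc_vec, query_terms, doc_tokens,
                       w_sim=0.8, w_pos=0.2):
            return (w_sim * cosine(query_vec, doc_vec)
                    + w_pos * position_bonus(query_terms, doc_tokens))
        ```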