Title: Development of Different Word Vectors and Testing Using Text Classification Algorithms for Telugu
Abstract: Word embedding methods represent words numerically. Text data cannot be processed directly by machine learning or deep learning algorithms; these algorithms process numerical data efficiently, so word embedding techniques are needed to transform text data into numerical form. One-hot encoding vectors of real-valued numbers are simple and easy to generate, while researchers now widely use Word2vec for semantic word representation. In the literature review, we found that fewer tools and resources are available for Indian languages compared to European languages. We therefore construct word embeddings (vectors) using one-hot encoding and the Word2vec strategy. In this paper, we evaluate these vectors using supervised machine learning algorithms for sentiment classification. We pursue a two-step approach: the first step is to generate a vocabulary from a news corpus and create word vectors using various word embedding methods; the second step is to validate the vector quality using machine learning algorithms. We preprocessed the corpus we received and obtained 178,210 types and 929,594 tokens, so our vocabulary contains 178,210 unique words. We used a labeled corpus, i.e., movie review sentences, together with the vocabulary to develop sentence vectors. Using the one-hot encoding and Word2vec vector models, we translated sentences into vectors. Once the labeled sentences were translated into vectors, three machine learning algorithms were trained and evaluated. Finally, we compared the outcomes.
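The pipeline the abstract describes (build a vocabulary of unique words, then turn each sentence into a vector from one-hot word vectors) can be sketched as below. This is a minimal illustration, not the authors' code: it assumes whitespace tokenization and a bag-of-words sum of one-hot vectors as the sentence representation.

```python
def build_vocab(corpus):
    """Map each unique word (type) in the corpus to an integer index."""
    vocab = {}
    for sentence in corpus:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot(word, vocab):
    """One-hot vector: 1 at the word's index, 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def sentence_vector(sentence, vocab):
    """Sum of one-hot vectors, i.e. a bag-of-words count vector."""
    vec = [0] * len(vocab)
    for word in sentence.split():
        if word in vocab:  # skip out-of-vocabulary words
            vec[vocab[word]] += 1
    return vec

# Tiny hypothetical example (the paper uses Telugu movie reviews).
corpus = ["good movie", "bad movie"]
vocab = build_vocab(corpus)
print(vocab)                                      # {'good': 0, 'movie': 1, 'bad': 2}
print(sentence_vector("good good movie", vocab))  # [2, 1, 0]
```

The resulting sentence vectors (here, or from averaged Word2vec embeddings) are what get fed to the classifiers as training features.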
Publication Year: 2022
Publication Date: 2022-01-01
Language: en
Type: book-chapter
Indexed In: ['crossref']
Access and Citation
Cited By Count: 2