Bengali Word Spelling Correction Using Pre-trained Word2Vec

Source: Deep Learning on Medium

Correct spelling is important in any kind of documentation. Many automatic spell checkers are available online for different languages. They help us correct a misspelled word, or replace it with the correct one automatically, and some also flag grammatical mistakes and syntax errors. Grammarly is a well-known example. No such automatic spell checker is available for our Bengali language, although a tool like Grammarly is badly needed by Bengali language users.

This article discusses an approach for automatically replacing a misspelled Bengali word with the correct one, as a step toward building an automatic spell checker. The whole procedure depends on word2vec. A pre-trained word2vec file is used; its vocabulary is small, but it is effective for this work. A short description of how the spell checker is built follows.

Library Function

The gensim library is used to load the pre-trained Bengali word2vec file from the local machine.

import gensim

Word2Vec

Word embedding is one of the most important techniques in natural language processing, where words are mapped to vectors of real numbers. A word embedding can capture the meaning of a word in a document, its semantic and syntactic similarity to other words, and its relations with other words. It has also been widely used for recommender systems and text classification.

‘bnword2vec’ is a pre-trained word2vec file for the Bengali language, and ‘.txt’ is its file extension, so it is loaded in word2vec text format.

model = gensim.models.KeyedVectors.load_word2vec_format('bnword2vec.txt')
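For orientation, a word2vec file in text format starts with a header line holding the vocabulary size and the vector dimension, followed by one line per word: the word, then its vector components. The sketch below parses a tiny made-up sample with plain Python. The three words and their numbers are invented for illustration and are not taken from bnword2vec.txt, and `parse_word2vec_text` is a hypothetical helper, not part of gensim.

```python
def parse_word2vec_text(lines):
    """Parse word2vec text-format lines into (vocabulary list, vectors dict)."""
    vocab_size, dim = (int(x) for x in lines[0].split())
    vocab, vectors = [], {}
    for line in lines[1:1 + vocab_size]:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}")
        vocab.append(word)
        vectors[word] = values
    return vocab, vectors

# Invented sample: 3 words, 2 dimensions (not real embeddings).
sample = [
    "3 2",
    "আমি 0.1 0.2",
    "বাংলা 0.3 0.4",
    "ভাষা 0.5 0.6",
]
vocab, vectors = parse_word2vec_text(sample)
print(vocab)
```

The order of the lines is what matters for the rest of the approach: the position of a word in the file becomes its index in the vocabulary.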

Words Rank

A word rank orders the words by importance. Here, the rank of a word is its index in the vocabulary of the Word2Vec file. Such files typically list words from most frequent to least frequent, so a lower index indicates a more common word.

words = the list of vocabulary words from the Word2Vec file, in index order.

w_rank = a dictionary that is filled inside the loop, mapping each word to its index.

enumerate() = a built-in function that adds a counter to an iterable and returns it as an enumerate object.

WORDS = this variable holds the w_rank dictionary.

# index2word exists in gensim < 4.0; gensim 4.x renamed it to index_to_key
words = model.index2word
w_rank = {}
for i, word in enumerate(words):
    w_rank[word] = i
WORDS = w_rank
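The idea behind the rank dictionary can be seen on a toy vocabulary. The sketch below assumes, as is usual for word2vec files, that the vocabulary is ordered from most to least frequent; the three-word list and the `toy_` names are made up for illustration.

```python
# Toy vocabulary, assumed to be ordered from most to least frequent.
toy_words = ["আমি", "বাংলা", "ভাষা"]

# Word -> index mapping, written as a dict comprehension.
toy_rank = {word: i for i, word in enumerate(toy_words)}

print(toy_rank[toy_words[0]])   # index of the first (most frequent) word
print(toy_rank[toy_words[-1]])  # index of the last word
print(len(toy_rank))            # vocabulary size
```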

The len() function returns the number of items in an object. Here it gives the size of the vocabulary.

len(words)

Function

A function is a block of organized, reusable code that performs a single, related action. Functions give an application better modularity and a high degree of code reuse.

P() = This function looks up the rank of the given word in the dictionary with the get() method and returns its negative. A more frequent word has a lower rank and therefore a higher value of P.

The syntax of the get() method is dictionary.get(key, default=None).

def P(word):
    return -WORDS.get(word, 0)

max() = This function returns the largest of the values passed to it, or the lexicographically largest value if strings are passed. With a key function, it returns the item for which the key is largest.

correction() = It returns the candidate word with the highest value of P, that is, the most frequent candidate.

def correction(word):
    return max(candidates(word), key=P)
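To see why P() negates the rank, consider a toy dictionary. `max()` with a key function returns the item whose key is largest; since the ranks are negated, that is the candidate with the smallest rank. The words and ranks below are invented, and the `toy_` prefix marks the names as illustrative.

```python
toy_WORDS = {"আমি": 0, "বাংলা": 1, "ভাষা": 2}

def toy_P(word):
    # Negated rank: the most frequent word (rank 0) gets the largest value.
    return -toy_WORDS.get(word, 0)

# Two candidates with ranks 2 and 1; the one with the lower rank should win.
ranked = sorted(toy_WORDS, key=toy_WORDS.get)   # words ordered by rank
best = max([ranked[2], ranked[1]], key=toy_P)
print(best)
```

One consequence of the default value 0 is that a word missing from the dictionary scores as high as the most frequent word. This appears harmless in this design, because the candidate list is either filtered down to known words or reduced to the single original word.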

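candidates() = This function generates the possible corrections for a misspelled word. Using known(), it tries in order: the word itself, the words one edit away, and the words two edits away. It returns the first of these sets that is not empty, and falls back to the original word if none of them contains a known word.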

def candidates(word):
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

known() = This function returns the set of the given words that are present in the dictionary.

set()= A set is an unordered collection of items. Every element is unique (no duplicates) and must be immutable.

def known(words):
    return set(w for w in words if w in WORDS)
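The `or` chain in candidates() relies on the fact that an empty set is falsy in Python, so evaluation moves on to the next alternative until one of them is non-empty. The sketch below shows this with hand-written candidate lists instead of real edit sets; the dictionary and the `toy_` names are assumptions for illustration.

```python
toy_WORDS = {"আমি": 0, "বাংলা": 1, "ভাষা": 2}

def toy_known(words):
    # Keep only the words that are in the toy dictionary.
    return set(w for w in words if w in toy_WORDS)

target = "বাংলা"
typo = target[:-1]        # the same word with its final vowel sign dropped

# First alternative is empty, second one contains a known word.
first_hit = toy_known([typo]) or toy_known([target, typo + "x"]) or [typo]

# Nothing is known, so the chain falls through to the original word.
no_hit = toy_known(["xyz"]) or toy_known(["xyzz"]) or ["xyz"]

print(first_hit)
print(no_hit)
```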

edits1() = This function builds every string that is one simple edit away from the input word. Four kinds of edit are used: deletes, transposes, replaces and inserts. The four lists are merged with set(), which removes duplicates and gives an unordered collection of candidate words.

edits2() = This function returns the words that are two edits away, by applying edits1() to every result of edits1().

letters = A string of the Bengali characters used to build replacements and insertions. It contains the independent vowels, each of which is called a ‘স্বরবর্ণ’ (11 are listed in the code), the consonants, which are known as ‘ব্যঞ্জনবর্ণ’, and the dependent vowel signs and other marks.

splits = A list of every way to split the word into a left part and a right part.

deletes = A list of the words formed by removing the first character of the right part at each split.

transposes = A list of the words formed by swapping two adjacent characters at each split.

replaces = A list of the words formed by replacing one character of the word with each character from letters.

inserts = A list of the words formed by inserting each character from letters at each split position.

def edits1(word):
    letters = 'ঁংঃঅআইঈউঊঋএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফবভমযরলশষসহ়ঽািীুূৃৄেৈোৌ্ৎৗড়ঢ়য়'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
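The four edit types can be checked on a very small case. The sketch below repeats the construction of edits1() with a tiny assumed alphabet of three characters, so the result stays small enough to inspect; `toy_edits1` and the sample word are illustrative. Note that Python strings are indexed by Unicode code point, so a Bengali vowel sign counts as a separate character in the splits.

```python
def toy_edits1(word, letters):
    # Same construction as edits1(), with the alphabet passed in.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

base = "কলা"            # consonant, consonant, vowel sign
word = base[:-1]        # the same word without the final vowel sign
edits = toy_edits1(word, letters=base)

print(len(word))            # number of code points in the shortened word
print(base in edits)        # reachable by inserting the vowel sign at the end
print(word[::-1] in edits)  # reachable by transposing the two consonants
print(word[:1] in edits)    # reachable by deleting the second consonant
```

With the full letters string, each input word produces far more candidates, which is why known() is applied before ranking.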

Now the code is ready to produce corrections. If a user puts a misspelled word in the variable, the corresponding correct word is returned. This code is built for checking the spelling of a single word only. A practical spell checker would need to check the spelling of a whole paragraph or document continuously. An example call is given below.

a = input()
correction(a)
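As a possible next step toward checking more than one word, the single-word corrector could be applied to each token of a text. The following is a minimal sketch under stated assumptions, not the article's implementation: words are split on whitespace only, punctuation is not handled, the two-edit step is omitted to keep it short, and the toy vocabulary, the alphabet derived from it, and the names `make_corrector` and `correct_text` are all hypothetical.

```python
def make_corrector(vocabulary, letters):
    """Build a single-word corrector from a frequency-ordered vocabulary."""
    rank = {w: i for i, w in enumerate(vocabulary)}

    def known(words):
        return set(w for w in words if w in rank)

    def edits1(word):
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(word):
        # One-edit candidates only; fall back to the word itself.
        candidates = known([word]) or known(edits1(word)) or [word]
        return max(candidates, key=lambda w: -rank.get(w, 0))

    return correct

def correct_text(text, correct):
    # Assumption: tokens are separated by whitespace.
    return " ".join(correct(token) for token in text.split())

toy_vocab = ["আমি", "বাংলা", "ভাষা"]
# Assumed alphabet: every character that occurs in the toy vocabulary.
toy_letters = "".join(sorted(set("".join(toy_vocab))))
toy_correct = make_corrector(toy_vocab, toy_letters)

# Misspellings made by dropping the final vowel sign from two of the words.
text = " ".join([toy_vocab[0], toy_vocab[1][:-1], toy_vocab[2][:-1]])
corrected = correct_text(text, toy_correct)
print(corrected)
```

A real document checker would also need proper Bengali tokenization and some handling of punctuation and numerals, which this sketch leaves out.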