NLP Foundations

LexRank is a graph-based unsupervised algorithm that measures the similarity of two sentences with an IDF-modified cosine, that is, cosine similarity in which each word is weighted by its inverse document frequency.
2021-10-04

#Machine learning || #ML || #Tech

Text Data Vectorization

Vectorization is the process of converting text into numbers. After text preprocessing, the text has to be represented in numerical form, that is, encoded as vectors of numbers that algorithms can take as input.

"Bag of words"

Bag of words (BOW) is one of the simplest text vectorization techniques. Under BOW, two sentences are treated as the same if they contain the same set of words, regardless of word order.
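To make the "same set of words" idea concrete, here is a minimal sketch in plain Python. The helper name bag_of_words_set and the two example sentences are hypothetical, and tokenization is assumed to be simple lowercasing plus whitespace splitting, which is cruder than a real preprocessing pipeline.

```python
def bag_of_words_set(sentence: str) -> frozenset:
    """Return the set of lowercase, whitespace-separated tokens (assumed tokenization)."""
    return frozenset(sentence.lower().split())


# Hypothetical example sentences: same words, different order and meaning.
s_a = "the dog bites the man"
s_b = "the man bites the dog"

# Under the set-of-words view the two sentences are indistinguishable.
same_under_bow = bag_of_words_set(s_a) == bag_of_words_set(s_b)
print(same_under_bow)
```

The two sentences compare as equal even though they mean different things, which shows the main limitation of the approach: word order is discarded.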

BOW builds a dictionary of the d unique words in a corpus (the collection of all tokens in the data). For example, the corpus in the image above consists of all the words in sentences S1 and S2.

Now we can build a table whose columns correspond to the d unique words in the corpus and whose rows correspond to the sentences (documents). A cell is set to 1 if the word occurs in the sentence and 0 if it does not.
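The table described above can be sketched in a few lines of plain Python. The function names build_vocabulary and binary_bow and the two sentences are hypothetical stand-ins (the original S1 and S2 are not reproduced here), and tokenization is again assumed to be lowercasing plus whitespace splitting.

```python
def build_vocabulary(sentences):
    """Collect the unique words of the corpus in first-seen order."""
    vocab = []
    seen = set()
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab


def binary_bow(sentences, vocab):
    """Build one row per sentence: 1 if the vocabulary word occurs in it, else 0."""
    rows = []
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        rows.append([1 if word in tokens else 0 for word in vocab])
    return rows


# Hypothetical stand-ins for S1 and S2.
sentences = ["the cat sat on the mat", "the dog sat on the log"]

vocab = build_vocabulary(sentences)    # the d unique words (table columns)
matrix = binary_bow(sentences, vocab)  # one row per sentence

print(vocab)
for row in matrix:
    print(row)
```

Each row has length d, the size of the vocabulary, so every sentence is mapped to a vector of the same dimension no matter how long it is. Note that this binary variant records only presence or absence, so a repeated word such as "the" still contributes a single 1.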