Evaluating Methods for Calculating Document Similarity – KDnuggets

Image by Editor


Data science is a field that has grown tremendously in the last hundred years because of advancements in computer science. With computer and cloud storage costs getting cheaper, we are now able to store copious amounts of data at a very low cost compared to a few years ago. With the increase in computational power, we can run machine learning algorithms on large sets of data and churn through them to produce insights. With advancements in networking, we can generate and transmit data over the internet at lightning speed. As a result of all of this, we live in an era where abundant data is generated every second. We have data in the form of email, financial transactions, social media content, web pages on the internet, customer data for businesses, medical records of patients, fitness data from smartwatches, video content on YouTube, telemetry from smart devices, and the list goes on. This abundance of data in both structured and unstructured formats has given rise to a field called Data Mining.

Data Mining is the process of discovering patterns, anomalies, and correlations in large data sets in order to predict an outcome. While data mining techniques can be applied to any form of data, one branch of Data Mining is Text Mining, which refers to finding meaningful information in unstructured textual data. In this paper, I will focus on a common task in Text Mining: finding Document Similarity.

Document Similarity helps in efficient information retrieval. Applications of document similarity include detecting plagiarism, answering web search queries effectively, clustering research papers by topic, finding similar news articles, clustering similar questions on Q&A sites such as Quora, StackOverflow, and Reddit, grouping products on Amazon based on their descriptions, and so on. Document similarity is also used by companies like Dropbox and Google Drive to avoid storing duplicate copies of the same document, thereby saving processing time and storage cost.



There are several steps to computing document similarity. The first step is to represent the document in a vector format. We can then apply pairwise similarity functions to those vectors. A similarity function is a function that computes the degree of similarity between a pair of vectors. There are several pairwise similarity functions, such as Euclidean Distance, Cosine Similarity, Jaccard Similarity, Pearson's correlation, Spearman's correlation, Kendall's Tau, and so on [2]. A pairwise similarity function can be applied to two documents, two search queries, or a document and a search query. While pairwise similarity functions work well for comparing a smaller number of documents, there are more advanced techniques, such as Doc2Vec and BERT, that are based on deep learning and are used by search engines like Google for efficient information retrieval based on the search query. In this paper, I will focus on Jaccard Similarity, Euclidean Distance, Cosine Similarity, Cosine Similarity with TF-IDF, Doc2Vec, and BERT.




Pre-Processing

A common first step in computing distances or similarities between documents is pre-processing. The pre-processing step includes converting all text to lowercase, tokenizing the text, removing stop words, removing punctuation, and lemmatizing words [4].

Tokenization: This step involves breaking sentences down into smaller units for processing. A token is the smallest lexical atom that a sentence can be broken down into. One way of tokenizing is to split a sentence on whitespace. For example, the sentence "tokenization is a really cool step" is broken into the tokens ['tokenization', 'is', 'a', 'really', 'cool', 'step']. These tokens form the building blocks of Text Mining, and tokenization is one of the first steps in modeling textual data.

Lowercasing: While preserving case may be needed in some specific situations, we generally want to treat words with different casing as one. This step is important in order to get consistent results from a large data set. For example, if a user searches for the word 'india', we want to retrieve relevant documents containing the word in any casing ("India", "INDIA", or "india") if they are relevant to the search query.

Removing punctuation: Removing punctuation marks and whitespace helps focus the search on important words and tokens.

Removing stop words: Stop words are words that are commonly used in the English language, and removing them helps retrieve documents that match the more important words conveying the context of the query. This also reduces the size of the feature vector, thereby reducing processing time.

Lemmatization: Lemmatization helps reduce sparsity by mapping words to their root form. For example, 'Plays', 'Played', and 'Playing' are all mapped to 'play'. This also reduces the size of the feature set and matches all variations of a word across different documents to bring up the most relevant ones.
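The steps above can be sketched in pure Python. A real pipeline would typically use NLTK or spaCy for tokenization, stop words, and lemmatization; the tiny stop-word set and lemma map below are illustrative stand-ins, not real linguistic resources.

```python
import string

# Tiny illustrative stand-ins: real pipelines use NLTK or spaCy resources
# for stop words and lemmatization.
STOP_WORDS = {"is", "a", "the", "and", "of", "to", "in"}
LEMMA_MAP = {"plays": "play", "played": "play", "playing": "play"}

def preprocess(text):
    """Lowercase, tokenize on whitespace, strip punctuation,
    drop stop words, and map words to their root form."""
    tokens = text.lower().split()                              # tokenization
    tokens = [t.strip(string.punctuation) for t in tokens]     # punctuation
    tokens = [t for t in tokens if t and t not in STOP_WORDS]  # stop words
    return [LEMMA_MAP.get(t, t) for t in tokens]               # lemmatization

print(preprocess("Tokenization is a really cool step!"))
# ['tokenization', 'really', 'cool', 'step']
```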





Jaccard Similarity

This method is one of the easiest. It tokenizes the documents into words and computes the ratio of the number of shared words (the intersection) to the total number of distinct words across both documents (the union). If the two documents are identical, the score is 1; if they share no words, the score is 0 [3].


Image source: O'Reilly


Summary: This method has some drawbacks. As the size of the documents increases, the number of common words will increase, even though the two documents may be semantically different.
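A minimal sketch of the method, assuming lowercasing and whitespace tokenization as the only pre-processing (the example documents are illustrative):

```python
def jaccard_similarity(doc1, doc2):
    """Jaccard similarity: size of the intersection of the two token sets
    divided by the size of their union."""
    a = set(doc1.lower().split())
    b = set(doc2.lower().split())
    if not a and not b:
        return 1.0  # two empty documents are trivially identical
    return len(a & b) / len(a | b)

# 3 shared tokens out of 5 distinct tokens overall -> 0.6
print(jaccard_similarity("data science is fun", "data science is hard"))
```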



Euclidean Distance

After pre-processing the document, we convert it into a vector. The weight of each vector component can either be the term frequency, where we count the number of times the term appears in the document, or the relative term frequency, where we compute the ratio of the term's count to the total number of words in the document [3].

Let d1 and d2 be two documents represented as vectors of n terms (representing n dimensions); we can then compute the shortest distance between the two documents using the Pythagorean theorem, i.e., the length of the straight line between the two vectors. The greater the distance, the lower the similarity; the lower the distance, the higher the similarity between the two documents.


Image source: Medium.com


Summary: The major drawback of this approach is that when the documents differ in size, Euclidean Distance implies a lower similarity even though the two documents are similar in nature. Smaller documents result in vectors with a smaller magnitude and larger documents result in vectors with a larger magnitude, since the magnitude of the vector is directly proportional to the number of words in the document, thereby making the overall distance larger.
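Under the term-frequency representation described above, the distance can be sketched as follows (pure Python; the documents are illustrative):

```python
import math
from collections import Counter

def euclidean_distance(doc1, doc2):
    """Euclidean distance between term-frequency vectors built over
    the combined vocabulary of the two documents."""
    tf1 = Counter(doc1.lower().split())
    tf2 = Counter(doc2.lower().split())
    vocab = set(tf1) | set(tf2)
    return math.sqrt(sum((tf1[w] - tf2[w]) ** 2 for w in vocab))

# The vectors differ only in the count of "red": distance 1.0
print(euclidean_distance("red red blue", "red blue"))
```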



Cosine Similarity

Cosine similarity measures the similarity between documents by computing the cosine of the angle between their two vectors. For term-frequency vectors, whose components are non-negative, the result takes values between 0 and 1: if the vectors point in the same direction, the similarity is 1; if the vectors are orthogonal (the documents share no terms), the similarity is 0 [6].


Image source: Medium.com


Summary: The advantage of cosine similarity is that it measures the orientation between vectors and not their magnitude. It will therefore capture the similarity between two documents that are alike despite being different in size.
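A pure-Python sketch over term-frequency vectors (the example documents are illustrative); note that repeating a document changes the vector's magnitude but not its orientation:

```python
import math
from collections import Counter

def cosine_similarity(doc1, doc2):
    """Cosine of the angle between the term-frequency vectors."""
    tf1 = Counter(doc1.lower().split())
    tf2 = Counter(doc2.lower().split())
    dot = sum(tf1[w] * tf2[w] for w in set(tf1) | set(tf2))
    norm1 = math.sqrt(sum(c * c for c in tf1.values()))
    norm2 = math.sqrt(sum(c * c for c in tf2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty document has no orientation to compare
    return dot / (norm1 * norm2)

# Same orientation despite different sizes: similarity ~ 1
print(cosine_similarity("data science", "data science data science"))
```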

The fundamental drawback of the above three approaches is that they miss documents that are similar in semantics rather than in vocabulary. Also, all of these techniques can only be applied pairwise, requiring more and more comparisons as the number of documents grows.



Cosine Similarity with TF-IDF

This technique for finding document similarity is used in the default search implementation of Elasticsearch, and it has been around since 1972 [4]. tf-idf stands for term frequency-inverse document frequency. We first compute the term frequency (tf): the count of a term in the document, or its relative frequency, the count divided by the total number of terms in the document. We then compute the inverse document frequency (idf): the logarithm of the total number of documents divided by the number of documents containing the term.
Finally, we compute tf-idf by multiplying tf by idf. We then use cosine similarity on the vectors, with tf-idf as the weight of each vector component.

Summary: Multiplying the term frequency by the inverse document frequency offsets words that appear frequently across documents in general and focuses on words that differ between documents. This technique helps find documents that match a search query by focusing the search on important keywords.
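The weighting can be sketched in pure Python (libraries such as scikit-learn provide production implementations; the toy corpus below is illustrative). Note that a term appearing in every document gets weight 0, since log(1) = 0:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute tf-idf weights for each tokenized document:
    tf(t, d) = count of t in d / number of tokens in d
    idf(t)   = log(total documents / documents containing t)"""
    n = len(docs)
    df = Counter()  # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({t: (c / total) * math.log(n / df[t])
                        for t, c in counts.items()})
    return vectors

docs = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "ran"]]
vecs = tfidf_vectors(docs)
# "mat" appears in only one of the three documents, so it is weighted highly:
print(round(vecs[0]["mat"], 3))  # 0.366
```

These weighted vectors can then be fed to any cosine similarity routine in place of raw term frequencies.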



Doc2Vec

While using individual words from documents (BOW, Bag of Words) to build vectors may be easier to implement, it gives no importance to the order of words in a sentence. Doc2Vec is built on top of Word2Vec. While Word2Vec represents the meaning of a word, Doc2Vec represents the meaning of a document or paragraph [5].

This method converts a document into a vector representation while preserving its semantic meaning. The approach converts variable-length text such as sentences, paragraphs, or whole documents into vectors [5]. The Doc2Vec model is then trained. Training is similar to training other machine learning models: selecting training and test documents and adjusting the tuning parameters to achieve better results.

Summary: This vectorized form of the document preserves its semantic meaning, since paragraphs with similar context or meaning end up closer together when converted to vectors.



BERT

BERT is a transformer-based machine learning model used in NLP tasks, developed by Google.

With the advent of BERT (Bidirectional Encoder Representations from Transformers), NLP models are trained on huge unlabeled text corpora, looking at each text both from right to left and from left to right. BERT uses a technique called "attention" to improve results. Google's search ranking improved by a huge margin after adopting BERT [4]. Some of the distinctive features of BERT include:

  • Pre-trained on Wikipedia articles from 104 languages.
  • Looks at text both left to right and right to left.
  • Helps in understanding context.

Summary: As a result, BERT can be fine-tuned for many applications, such as question answering, sentence paraphrasing, spam classification, and language detection, without substantial task-specific architecture modifications.



It was great to learn about how similarity functions are used in finding document similarity. Today it is up to the developer to pick the similarity function that best suits the scenario. For example, tf-idf is currently the state of the art for matching documents, while BERT is the state of the art for query searches [4]. It would be great to build a tool that auto-detects which similarity function is best suited to a scenario and picks one optimized for memory and processing time. That could greatly help in scenarios like auto-matching resumes to job descriptions, clustering documents by category, or classifying patients into different categories based on their medical records.



In this paper, I covered some notable algorithms for calculating document similarity. This is by no means an exhaustive list. There are several other methods for finding document similarity, and the decision on the right one depends on the particular scenario and use case. Simple statistical methods like tf-idf, Jaccard, Euclidean, and cosine similarity are well suited for simpler use cases; one can easily get set up with existing libraries in Python or R and calculate similarity scores without requiring heavy machines or processing capabilities. More advanced algorithms like BERT depend on pre-training neural networks, which can take hours but produce efficient results for analysis that requires an understanding of the document's context.





References

[1] Heidarian, A., & Dinneen, M. J. (2016). A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering. 2016 IEEE Second International Conference on Big Data Computing Service and Applications (BigDataService), 1–5. https://doi.org/10.1109/bigdataservice.2016.14

[2] Kavitha Karun A, Philip, M., & Lubna, K. (2013). Comparative analysis of similarity measures in document clustering. 2013 International Conference on Green Computing, Communication and Conservation of Energy (ICGCE), 1–4. https://doi.org/10.1109/icgce.2013.6823554

[3] Lin, Y.-S., Jiang, J.-Y., & Lee, S.-J. (2014). A Similarity Measure for Text Classification and Clustering. IEEE Transactions on Knowledge and Data Engineering, 26(7), 1575–1590. https://doi.org/10.1109/tkde.2013.19

[4] Nishimura, M. (2020, September 9). The Best Document Similarity Algorithm in 2020: A Beginner's Guide. Towards Data Science, Medium. https://towardsdatascience.com/the-best-document-similarity-algorithm-in-2020-a-beginners-guide-a01b9ef8cf05

[5] Sharaki, O. (2020, July 10). Detecting Document Similarity With Doc2vec. Towards Data Science, Medium. https://towardsdatascience.com/detecting-document-similarity-with-doc2vec-f8289a9a7db7

[6] Lüthe, M. (2019, November 18). Calculate Similarity — the most relevant Metrics in a Nutshell. Towards Data Science, Medium. https://towardsdatascience.com/calculate-similarity-the-most-relevant-metrics-in-a-nutshell-9a43564f533e

[7] S. (2019, October 27). Similarity Measures — Scoring Textual Articles. Towards Data Science, Medium. https://towardsdatascience.com/similarity-measures-e3dbd4e58660

Poornima Muthukumar is a Senior Technical Product Manager at Microsoft with over 10 years of experience in developing and delivering innovative solutions for various domains such as cloud computing, artificial intelligence, and distributed and big data systems. I have a Master's Degree in Data Science from the University of Washington. I hold four patents at Microsoft specializing in AI/ML and Big Data Systems and was the winner of the Global Hackathon in 2016 in the Artificial Intelligence Category. I was honored to be on the Grace Hopper Conference reviewing panel for the Software Engineering category this year, 2023. It was a rewarding experience to read and evaluate the submissions from talented women in these fields, to contribute to the advancement of women in technology, and to learn from their research and insights. I was also a committee member for the Microsoft Machine Learning, AI and Data Science (MLADS) June 2023 conference. I am also an Ambassador at the Women in Data Science Worldwide Community and the Women Who Code Data Science Community.
