Cosine Similarity

Custom Activity

Updated: October 19th, 2020

Published: January 14th, 2020

Downloads: 13

Language compatibility: Visual Basic

Community Support

Summary: 

Cosine similarity is a metric that measures how similar two documents are, irrespective of their size.

A common approach to matching similar documents is to count the words they have in common.
This approach has an inherent flaw: as documents grow longer, the number of common words tends to increase even when the documents cover different topics.
Cosine similarity helps overcome this flaw in the ‘count-the-common-words’ and Euclidean distance approaches, because it compares the angle between the documents' word-count vectors rather than their magnitudes.
Input:
  • TestingDocumentText - string containing the text content to be tested
  • TrainingDocumentText - string containing the text content of the training document to compare against
Output:
  • CosineSimilarityValue - decimal value in the range [0, 1]
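
As a rough illustration of how two input strings can map to a score in that range, here is a minimal Python sketch. It is not the activity's actual implementation. It assumes a simple lowercase regex tokenizer and raw word counts, and it applies no stemming (the activity lists a stemmer library as a dependency, so it presumably normalizes words first). The names tokenize and cosine_similarity_value are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: words are runs of lowercase letters, digits, and apostrophes.
    # No stemming or stop-word removal is done here.
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine_similarity_value(testing_document_text, training_document_text):
    # Build raw term-frequency vectors for both documents.
    a = Counter(tokenize(testing_document_text))
    b = Counter(tokenize(training_document_text))

    # Dot product over the words the two documents share.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())

    # Vector lengths (Euclidean norms) of the two count vectors.
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))

    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction; treat it as no similarity.
        return 0.0

    # Clamp to 1.0 to guard against floating-point rounding.
    return min(1.0, dot / (norm_a * norm_b))
```

Because word counts are never negative, the result of this sketch stays between 0 (no shared words) and 1 (identical word proportions).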

Details

Benefits

Cosine similarity is advantageous because two similar documents can be far apart by Euclidean distance purely because of their size (for example, the word ‘cricket’ appears 50 times in one document and 10 times in another) and still have a small angle between them. The smaller the angle, the higher the similarity.
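
The ‘cricket’ example can be checked numerically with a small Python sketch. The counts for ‘cricket’ (50 and 10) come from the text above; the second word, ‘bat’, and its counts are made up for illustration so that the vectors have two dimensions.

```python
import math

# Hypothetical term counts: the long document uses each word 5x as often.
long_doc = {"cricket": 50, "bat": 20}
short_doc = {"cricket": 10, "bat": 4}

# Align both documents on a shared vocabulary.
words = sorted(set(long_doc) | set(short_doc))
u = [long_doc.get(w, 0) for w in words]
v = [short_doc.get(w, 0) for w in words]

# Euclidean distance is dominated by the difference in document size.
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Cosine similarity depends only on the direction of the vectors.
cosine = sum(x * y for x, y in zip(u, v)) / (
    math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
)

# Clamp before acos to guard against floating-point rounding.
angle_degrees = math.degrees(math.acos(min(1.0, cosine)))
```

With these made-up counts the Euclidean distance is about 43.1, while the cosine similarity is essentially 1 and the angle essentially 0, since both documents use the words in the same proportions.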

Compatibility

Developed in 2019.4.4

Dependencies

Centivus.EnglishStemmer.dll

Licensing

By clicking download, you agree to the following license.