Jaccard Similarity

May 4, 2023 Engineering/ML

When building an AI code assistant like Cody, you often need a quick way to rank how similar blocks of text are to one another.

This is where Jaccard similarity is helpful. It measures how similar two sets are by comparing the size of their overlap to the size of their combined contents, and it is often used to compare text.

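The equation for Jaccard similarity is shown below, where A and B are the two sets being compared. (The Jaccard distance is its complement, 1 − J(A, B).)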

J(A, B) = |A ∩ B| / |A ∪ B|
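
As a quick worked example of the formula, take two small word sets made up for illustration: A = {the, cat, sat} and B = {the, cat, ran}.

```latex
A \cap B = \{\text{the}, \text{cat}\}, \qquad
A \cup B = \{\text{the}, \text{cat}, \text{sat}, \text{ran}\}

J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{2}{4} = 0.5
```

Identical sets score 1, and sets with nothing in common score 0.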

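Here's a simple example in TypeScript to calculate the Jaccard similarity between two sentences:

function jaccardSimilarity(sentence1: string, sentence2: string): number {
  // Treat each sentence as a set of unique, space-separated words.
  const set1 = new Set(sentence1.split(' '));
  const set2 = new Set(sentence2.split(' '));

  // Count the words that appear in both sets, and the words that appear in either.
  const intersection = [...set1].filter(x => set2.has(x)).length;
  const union = new Set([...set1, ...set2]).size;

  // 1 means identical word sets, 0 means no words in common.
  return intersection / union;
}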
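
To connect this back to the ranking use case from the introduction, here is a minimal sketch that sorts candidate snippets by their similarity to a query. The helpers `wordSet`, `jaccard`, and `rankBySimilarity` are hypothetical names introduced for this illustration, and both the whitespace tokenization and the choice to score two empty sets as 0 are assumptions rather than part of the definition.

```typescript
// Split on runs of whitespace and drop empty tokens.
function wordSet(text: string): Set<string> {
  return new Set(text.split(/\s+/).filter(token => token.length > 0));
}

// Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|.
function jaccard(a: Set<string>, b: Set<string>): number {
  const intersection = [...a].filter(x => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  // Assumption: two empty sets score 0, which avoids dividing 0 by 0.
  return union === 0 ? 0 : intersection / union;
}

// Return the candidates sorted from most to least similar to the query.
function rankBySimilarity(
  query: string,
  candidates: string[],
): { text: string; score: number }[] {
  const querySet = wordSet(query);
  return candidates
    .map(text => ({ text, score: jaccard(querySet, wordSet(text)) }))
    .sort((x, y) => y.score - x.score);
}

const ranked = rankBySimilarity('read a file from disk', [
  'write a file to disk',
  'read a file from disk quickly',
  'open a network socket',
]);
console.log(ranked.map(r => `${r.score.toFixed(2)} ${r.text}`));
```

Because the score depends only on which words are shared, the candidate that contains every query word ranks first here, even though it is longer than the query.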

Jaccard similarity comes in handy for finding similar text, for example when detecting plagiarism or comparing documents. Because it ignores word order and word frequency, it is not the most advanced method for text comparison, but it is simple and fast, which makes it a good fit for many applications.

You can also pre-process the text to improve accuracy. The example above naively compares exact words, but you could apply stemming so that different forms of the same word, such as "file" and "files", count as a match.
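
The pre-processing idea can be sketched as a normalization step that runs before the sets are built. Note that `crudeStem` is a deliberately naive suffix stripper invented for this illustration, not a real stemming algorithm; a proper stemmer from a library would handle far more cases. The names `normalize` and `jaccardNormalized` are likewise hypothetical, and the sketch assumes plain ASCII English text.

```typescript
// Deliberately naive stemmer: strips a few common English suffixes.
// An illustrative stand-in only, not a real stemming algorithm.
function crudeStem(word: string): string {
  for (const suffix of ['ing', 'ed', 's']) {
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
}

// Lowercase the text, turn punctuation into spaces, split into words, and stem each word.
// Assumption: ASCII input; any other character is treated as a separator.
function normalize(text: string): Set<string> {
  const tokens = text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter(token => token.length > 0);
  return new Set(tokens.map(crudeStem));
}

// Jaccard similarity computed over the normalized word sets.
function jaccardNormalized(text1: string, text2: string): number {
  const a = normalize(text1);
  const b = normalize(text2);
  const intersection = [...a].filter(x => b.has(x)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : intersection / union;
}

const score = jaccardNormalized('Reading the files.', 'read the file');
console.log(score);
```

With exact word matching, these two inputs share only "the". After lowercasing, punctuation removal, and the crude stemming, both reduce to the same three words.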

Why not give it a try with some sample texts and see how it works for you?


About the author

Philipp Spiess [ˈʃpiːs]

Prev: Engineer at Sourcegraph and Meta, curator of This Week in React, React DOM team member, and Team Lead at PSPDFKit.