Jaccard Similarity: The Definitive Guide

Jaccard Index - ThatWare

The Jaccard index, also known as Intersection over Union (IoU) or the Jaccard similarity coefficient, is a statistic for comparing the similarity and diversity of finite sample sets. It is defined as the size of the intersection divided by the size of the union of the two sets:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The value ranges from 0 (no shared elements) to 1 (identical sets).
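
As a quick check of the definition, the following minimal sketch computes the index for two small sets in base R. The vectors `A` and `B` are made-up illustration data, not taken from the article.

```r
# Two small example sets (hypothetical data, for illustration only)
A <- c(1, 2, 3, 4)
B <- c(3, 4, 5, 6)

shared   <- intersect(A, B)  # elements present in both sets: 3, 4
combined <- union(A, B)      # elements present in either set: 1 to 6

# Size of the intersection divided by size of the union
jaccard <- length(shared) / length(combined)
print(jaccard)  # 2 / 6, about 0.333
```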

Loading Libraries:


Loading stopwords files:

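The original code for this step is not shown, so the following is only a sketch of how stopwords might be loaded. It assumes the `tm` package and a plain-text stopword file with one word per line. The file path and the names `stop_path`, `custom_stopwords`, and `all_stopwords` are hypothetical.

```r
library(tm)  # assumed text-mining package for the corpus steps

# Hypothetical stopword file, written to a temp path so the sketch runs anywhere
stop_path <- tempfile(fileext = ".txt")
writeLines(c("the", "a", "is", "of"), stop_path)

# Read the file: one stopword per line
custom_stopwords <- readLines(stop_path)

# Merge with the English stopword list that ships with tm, dropping duplicates
all_stopwords <- union(stopwords("english"), custom_stopwords)
```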

Text Data input:

This step lists every text file (.txt) in the default (working) directory and collects the file names in a list.

The lapply function applies an operation to each element of a list and returns a list of the same length as the original.
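
The file-listing and reading step could look roughly like the sketch below. The directory, file names, and sample sentences are hypothetical; a temporary directory is used in place of the working directory so that the example is self-contained.

```r
# Hypothetical input: two small .txt files in a temporary directory
doc_dir <- tempfile("docs")
dir.create(doc_dir)
writeLines("The quick brown fox jumps over the lazy dog",
           file.path(doc_dir, "doc1.txt"))
writeLines("The quick red fox sleeps near the lazy cat",
           file.path(doc_dir, "doc2.txt"))

# List every .txt file in the directory, with full paths
txt_files <- list.files(doc_dir, pattern = "\\.txt$", full.names = TRUE)

# lapply returns a list with one element per input file
texts <- lapply(txt_files, function(f) paste(readLines(f), collapse = " "))

length(texts) == length(txt_files)  # TRUE
```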

Creating a Corpus:

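A minimal sketch of corpus creation, assuming the `tm` package. The sample texts are hypothetical, and `VCorpus` is used here as one reasonable choice; the article's original call is not shown.

```r
library(tm)

# Hypothetical document texts standing in for the files read earlier
texts <- c("The quick brown fox jumps over the lazy dog",
           "The quick red fox sleeps near the lazy cat")

# VectorSource treats each element of the character vector as one document
corpus <- VCorpus(VectorSource(texts))

length(corpus)             # number of documents in the corpus
as.character(corpus[[1]])  # raw text of the first document
```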

Content Transformation:

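The exact transformations used in the original are not shown. The sketch below applies a common cleaning sequence with `tm_map` from the `tm` package, on a hypothetical one-document corpus. Which steps to include, and in what order, is an assumption.

```r
library(tm)

# Hypothetical one-document corpus with mixed case, punctuation and digits
corpus <- VCorpus(VectorSource("The Quick, brown fox!  Jumps over 2 lazy dogs."))

corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case everything
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stopwords
corpus <- tm_map(corpus, stripWhitespace)                    # collapse repeated spaces

cleaned <- as.character(corpus[[1]])
print(cleaned)
```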

Creating Term Doc Matrix:

Inspecting the output:

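A sketch of building and inspecting a term-document matrix with `tm`, using two short hypothetical documents that are assumed to be cleaned already. Rows are terms and columns are documents.

```r
library(tm)

# Hypothetical, already-cleaned documents
texts <- c("quick brown fox jumps lazy dog",
           "quick red fox sleeps lazy cat")
corpus <- VCorpus(VectorSource(texts))

# One row per term, one column per document, cells hold term counts
tdm <- TermDocumentMatrix(corpus)

inspect(tdm)  # prints a summary and a sample of the matrix
Terms(tdm)    # the vocabulary found across both documents
```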

Converting the tdm into a data frame:

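One way to convert the tdm into a data frame is to go through a dense matrix first. This is a sketch under the same assumptions as before: the `tm` package and two hypothetical documents. The column names `doc1` and `doc2` are set by hand for readability.

```r
library(tm)

# Hypothetical, already-cleaned documents
texts <- c("quick brown fox jumps lazy dog",
           "quick red fox sleeps lazy cat")
tdm <- TermDocumentMatrix(VCorpus(VectorSource(texts)))

# Dense matrix first, then data frame; row names are the terms
tdm_df <- as.data.frame(as.matrix(tdm))
colnames(tdm_df) <- c("doc1", "doc2")  # readable column names (our choice)

head(tdm_df)
```

Converting to a dense matrix is fine for a handful of documents; for a large corpus it can use a lot of memory.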

A = the tdm values (term counts) for doc1.

B = the tdm values (term counts) for doc2.

Converting into a set:
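
The sketch below shows one plausible reading of these two steps in base R: pull each document's column out of the data frame, then keep the terms with a non-zero count as that document's set. The data frame `tdm_df` and its values are hypothetical, and treating "terms with non-zero count" as the set is an assumption, not necessarily what the original code did.

```r
# Hypothetical term-count data frame: rows are terms, columns are documents
tdm_df <- data.frame(
  doc1 = c(1, 0, 1, 1, 0),
  doc2 = c(0, 1, 1, 1, 1),
  row.names = c("brown", "cat", "fox", "lazy", "red")
)

A <- tdm_df$doc1  # term counts for doc1
B <- tdm_df$doc2  # term counts for doc2

# Assumption: a document's set is the terms whose count is non-zero
set_A <- rownames(tdm_df)[A > 0]
set_B <- rownames(tdm_df)[B > 0]

set_A  # "brown" "fox" "lazy"
set_B  # "cat" "fox" "lazy" "red"
```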

Calculating the similarity using the Set similarity package:
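
The article does not name the package precisely. As an assumption, the sketch below uses the CRAN package `sets`, whose `set_similarity()` function supports a Jaccard method, and cross-checks the result against base R. Calls are namespace-qualified so that nothing in base R is masked; the term vectors are hypothetical.

```r
# Hypothetical term vectors for two documents
a <- c("brown", "fox", "lazy")
b <- c("cat", "fox", "lazy", "red")

# Assumed package: 'sets' (install.packages("sets") if it is missing)
set_A <- sets::as.set(a)
set_B <- sets::as.set(b)

# Jaccard similarity as reported by the package
sim <- sets::set_similarity(set_A, set_B, method = "Jaccard")

# Cross-check against the definition using base R
manual <- length(intersect(a, b)) / length(union(a, b))
isTRUE(all.equal(as.numeric(sim), manual))
```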



Intersection and Union of the data:

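In base R, the intersection and union of two term vectors can be computed directly with `intersect()` and `union()`. The vectors `set_A` and `set_B` below are hypothetical.

```r
# Hypothetical term sets for two documents
set_A <- c("brown", "fox", "lazy")
set_B <- c("cat", "fox", "lazy", "red")

common_terms <- intersect(set_A, set_B)  # terms that appear in both documents
all_terms    <- union(set_A, set_B)      # terms that appear in either document

common_terms  # "fox" "lazy"
all_terms     # "brown" "fox" "lazy" "cat" "red"
```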

Dividing the length of the intersection by the length of the union:

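The division can be wrapped in a small helper. The function name `jaccard_similarity` is hypothetical, and returning `NA` when both inputs are empty is a design choice made here, since the ratio is undefined in that case.

```r
# Hypothetical helper: Jaccard similarity of two term vectors
jaccard_similarity <- function(a, b) {
  union_size <- length(union(a, b))
  if (union_size == 0) return(NA_real_)  # undefined when both sets are empty
  length(intersect(a, b)) / union_size
}

score <- jaccard_similarity(c("brown", "fox", "lazy"),
                            c("cat", "fox", "lazy", "red"))
print(score)  # 2 shared terms / 5 distinct terms = 0.4
```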

According to the equation, this ratio is the Jaccard similarity of the two documents.




