The goal is to create a program that compares the contents of two sites and shows their similarities in a heat map. This kind of comparison is used in professional SEO services.
Using hierarchical clustering and k-means clustering, a group of terms is selected according to their term frequency (TF). The terms are then shown side by side in a document heat map, where the colors represent each term's TF in that particular document.
To download a package from outside the CRAN archive:
The “genefilter” package needs to be downloaded in order to plot the heat map.
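A minimal sketch of the install step, assuming the current Bioconductor installer (BiocManager) is used; genefilter is distributed through Bioconductor rather than CRAN. The original install command is not shown, so this may differ from what was actually run.

```r
# Install BiocManager from CRAN if it is missing, then use it to
# install genefilter from Bioconductor.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("genefilter", quietly = TRUE)) {
  BiocManager::install("genefilter")
}

library(genefilter)
```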
Loading stopwords files:
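The stopword files themselves are not named here, so the following is only a sketch. It assumes a plain-text file with one stopword per line; `stopwords_file` is a hypothetical name, and a temporary file stands in for the real one so the example runs on its own.

```r
# Hypothetical stopword file: one word per line. A temporary file stands in
# for the real file so that the sketch is self-contained.
stopwords_file <- file.path(tempdir(), "stopwords.txt")
writeLines(c("the", "and", "of", "", "The "), stopwords_file)

# Read the words, trim whitespace, lower-case them, and drop
# duplicates and blank entries.
stop_words <- readLines(stopwords_file, warn = FALSE)
stop_words <- unique(tolower(trimws(stop_words)))
stop_words <- stop_words[nzchar(stop_words)]

print(stop_words)
```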
Text Data input:
The input step collects every text file (.txt) in the working directory into a list.
The lapply function applies an operation to each element of a list and returns a list of the same length as the original.
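One way the input step could look, as a sketch in base R. The directory and file names (`text_dir`, `doc1.txt`, `doc2.txt`) are made up for illustration; in practice the files would sit in the working directory.

```r
# Two small sample files in a temporary directory stand in for the site texts.
text_dir <- file.path(tempdir(), "site_texts")
dir.create(text_dir, showWarnings = FALSE)
writeLines("seo heatmap clustering", file.path(text_dir, "doc1.txt"))
writeLines("heatmap terms", file.path(text_dir, "doc2.txt"))

# List every .txt file in the directory.
txt_files <- list.files(text_dir, pattern = "\\.txt$", full.names = TRUE)

# lapply returns a list with one element per input file.
docs <- lapply(txt_files, function(path) {
  paste(readLines(path, warn = FALSE), collapse = " ")
})
names(docs) <- basename(txt_files)

str(docs)
```

Because lapply always returns a list of the same length as its input, `docs` has exactly one entry per file found.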
Creating a Corpus:
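A sketch of building and cleaning a corpus, assuming the tm package is the one in use (the package is not named at this step). The sample texts are invented, and the cleaning steps are typical choices rather than the exact ones applied originally.

```r
library(tm)

# Hypothetical stand-ins for the text of the two sites.
texts <- c("SEO heatmap, clustering and SEO!",
           "The heatmap of terms.")

corpus <- VCorpus(VectorSource(texts))

# Typical cleaning steps: lower-case, strip punctuation, drop English
# stopwords, and collapse repeated whitespace.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

print(as.character(corpus[[1]]))
```

A custom stopword vector can be passed to removeWords in place of `stopwords("english")`.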
Creating Term Doc Matrix:
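A sketch of the term-document matrix step, again assuming the tm package. Rows are terms, columns are documents, and each cell holds the term frequency. The two sample texts and the column labels are illustrative only.

```r
library(tm)

# Hypothetical, already-cleaned texts for two documents.
texts <- c("seo heatmap clustering seo", "heatmap terms")
corpus <- VCorpus(VectorSource(texts))

# Build the term-document matrix and convert it to a plain matrix.
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
colnames(m) <- c("Doc 1", "Doc 2")

print(m)
```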
Distance Between vectors:
The algorithm uses Euclidean distance to measure the distance between each pair of terms in this case.
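As a small worked example, the Euclidean distance between two term vectors x and y is sqrt(sum((x - y)^2)). The sketch below uses base R's dist on a made-up matrix of term counts and checks one pair by hand.

```r
# Rows are terms, columns are documents (hypothetical counts).
m <- rbind(seo = c(3, 0), heatmap = c(0, 4), terms = c(3, 4))
colnames(m) <- c("Doc 1", "Doc 2")

# Pairwise Euclidean distances between the rows (terms).
d <- dist(m, method = "euclidean")
print(d)

# The same distance for one pair, computed by hand: sqrt(3^2 + 4^2).
by_hand <- sqrt(sum((m["seo", ] - m["heatmap", ])^2))
print(by_hand)
```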
Number of groups to visualize:
Creating clusters using the Ward method.
hang = -1 is used to align all leaf labels at the same level in the dendrogram.
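The clustering steps above can be sketched in base R as follows. The term counts and the group count `k` are invented. The sketch uses `method = "ward.D2"`, which applies Ward's criterion to unsquared Euclidean distances; older code that passed `method = "ward"` corresponds to `"ward.D"` in current R, so results may differ from the original.

```r
# Hypothetical term counts: rows are terms, columns are documents.
m <- rbind(seo = c(9, 8), ranking = c(8, 9), heatmap = c(1, 0),
           cluster = c(0, 1), keyword = c(5, 5))
colnames(m) <- c("Doc 1", "Doc 2")
d <- dist(m, method = "euclidean")

# Number of groups to visualize.
k <- 2

# Hierarchical clustering with Ward's method.
fit <- hclust(d, method = "ward.D2")

# hang = -1 draws all leaf labels at the same level.
plot(fit, hang = -1)

# Outline the k groups on the dendrogram and extract the memberships.
rect.hclust(fit, k = k, border = "red")
groups <- cutree(fit, k = k)
print(groups)
```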
K-means clustering:
Plotting the k-means clusters:
set.seed() fixes the starting point for the sequence of pseudo-randomly generated numbers; it takes an (arbitrary) integer argument. Any value works, say 1, 123, 300, or 12345, as long as the same value is reused to get reproducible random numbers.
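A sketch of why the seed matters: k-means picks its starting centers at random, so fixing the seed before each call makes the result repeatable. The data and the seed value 123 are arbitrary, and the scatter plot is a simple stand-in for whatever plotting function was used originally.

```r
# Hypothetical term counts: rows are terms, columns are documents.
m <- rbind(seo = c(9, 8), ranking = c(8, 9), heatmap = c(1, 0),
           cluster = c(0, 1), keyword = c(5, 5))
colnames(m) <- c("Doc 1", "Doc 2")

# Same seed before each run, so both runs pick the same starting centers.
set.seed(123)
fit1 <- kmeans(m, centers = 2)
set.seed(123)
fit2 <- kmeans(m, centers = 2)
print(identical(fit1$cluster, fit2$cluster))

# Plot the terms by their counts in each document, colored by cluster.
plot(m, col = fit1$cluster, pch = 19,
     xlab = "TF in Doc 1", ylab = "TF in Doc 2")
text(m, labels = rownames(m), pos = 3)
```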
Row Variance Of An Array:
Calculates the variance of each row of an array.
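The description suggests a row-variance helper such as rowVars from genefilter, though that is an assumption. The sketch below uses base R's `apply(m, 1, var)` as a stand-in, then keeps the terms whose frequency differs most between documents. The counts and the cutoff of two terms are illustrative.

```r
# Hypothetical term counts: rows are terms, columns are documents.
m <- rbind(seo = c(9, 1), heatmap = c(4, 4), terms = c(2, 6))
colnames(m) <- c("Doc 1", "Doc 2")

# Sample variance of each row (base R stand-in for a rowVars helper).
row_var <- apply(m, 1, var)
print(row_var)

# Indices of the two terms with the highest variance across documents.
top_terms <- order(row_var, decreasing = TRUE)[1:2]
print(m[top_terms, ])
```

Terms with zero variance, such as `heatmap` here, appear equally often in both documents and add little contrast to the heat map.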
Get Indices For Significant Edges
Get the indices for the significant edges in a network.
Plotting the heatmap:
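A sketch of the plotting step using base R's heatmap function; the original may have used a different heat map function. The term counts are invented. Column reordering is switched off so the two documents stay side by side, and scaling is disabled so the colors reflect raw TF values.

```r
# Hypothetical term counts: rows are terms, columns are documents.
m <- rbind(seo = c(9, 8), ranking = c(8, 9), heatmap = c(1, 0),
           cluster = c(0, 1), keyword = c(5, 5))
colnames(m) <- c("Doc 1", "Doc 2")

# Colv = NA keeps the document order; scale = "none" keeps raw TF values.
h <- heatmap(m, Colv = NA, scale = "none",
             col = heat.colors(16), margins = c(6, 8))

# Row order chosen by the row dendrogram.
print(rownames(m)[h$rowInd])
```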
[Figure: heat map plot, with columns Doc 1 and Doc 2]