Whitepaper On Artificial Intelligence Based SEO

Welcome to THATWARE, we are on a mission to innovate SEO using artificial intelligence, machine learning, and data science.

Here we are publicly sharing some of our whitepapers on how AI can be utilized to optimize a page for better SEO performance. We are using cutting-edge technologies from various fields such as NLP, information retrieval, semantic engineering, data science, AI, and the R language.

Disclaimer: This whitepaper should not be used without the consent and permission of THATWARE; if used, please cite THATWARE as the source. Furthermore, this whitepaper documentation is legally licensed and registered under the Copyright Act, 1957 in the name of THATWARE LLP.

1. Using Bag of Words (word cloud) to find the best featured anchor text:

By ThatWare

Introduction:

Bag of Words, in short BoW, is used to extract featured keywords from a text document. These features can be used for training machine learning algorithms. Basically speaking, Bag of Words is an algorithm that counts how many times a word appears in a document. Word counts allow us to compare documents and their similarities, and the technique is used in many different fields.

Analysis:

By counting word occurrences, the number of occurrences comes to represent the importance of a word: higher frequency means higher importance.

First we tokenize each and every word in the text document, then we use the tokenized words for each observation and find the frequency of each token. After that we can determine the frequency of a particular word, which later on indicates the relevancy of a keyword to a particular site.

Given a corpus (a collection of documents) containing text related to the site, we vectorized all the words in the corpus and separated them; each word has an integer value that indicates its number of occurrences. Later on we created a visualization to make clear which words stand out the most. Finally, to get a good anchor text from particular content, we extracted the most frequently occurring words from the corpus.

Creating a corpus, cleaning it, and converting it to lower case:

Creating a term-document matrix:

Sorting the data and converting into a data frame:

Creating a word cloud:

Setting min and max term freq

Output:

Also creating a barplot:

Output:
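For illustration, here is a minimal R sketch of the steps above. The tm and wordcloud packages, the example text, and the frequency thresholds are assumptions for this sketch, not necessarily the exact setup of our runs:

library(tm)
library(wordcloud)

# Illustrative page text; in practice this comes from the site's content files
docs <- c("sport sunglasses with polarized lenses for running",
          "polarized sunglasses protect eyes during sport",
          "running sunglasses and lenses buying guide")

# Create a corpus, clean it, and convert it to lower case
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Term-document matrix, then sort term frequencies into a data frame
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
df   <- data.frame(word = names(freq), freq = freq)

# Word cloud with minimum and maximum term-frequency settings
wordcloud(words = df$word, freq = df$freq, min.freq = 1, max.words = 50)

# Bar plot of the most frequent candidate anchor terms
barplot(head(df$freq, 5), names.arg = head(df$word, 5), las = 2)

The highest-frequency terms at the top of the data frame are the candidates we extract as anchor text.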

Use in SEO:

The Bag-of-words model is mainly used as a tool of feature generation. After transforming the text into a “bag of words”, we can calculate various measures to characterize the text. The most common type of characteristics or features calculated from the Bag-of-words model is term frequency, namely, the number of times a term appears in the text.

Conclusion:

As far as we know, keywords and content are among the most valuable things nowadays, and most search engines' algorithms are advancing daily as we speak. For every search query hit, the search engine returns relatable sites to browse, a result derived by many algorithms. So choosing the keywords for a site holds the key to your site's visibility in the SERP.

2. Using Precision and Recall to get the accuracy of a particular model (Topic Model)

By ThatWare

Introduction:

Precision and recall each indicate a distinct meaning. Precision is the percentage of your results which are relevant, and recall is the percentage of the total relevant results correctly classified by a particular algorithm.

Analysis:

Precision:

In the field of information retrieval, precision is the fraction of retrieved documents that are relevant to the query.

Recall:

In information retrieval, recall is the fraction of the relevant documents that are successfully retrieved.

Image Source [https://en.wikipedia.org/wiki/Precision_and_recall ]

Initializing values:

Calculating precision and recall:

Output:
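A small R sketch of the calculation, using hypothetical counts from a keyword-classification run (the numbers are purely illustrative):

# Hypothetical counts
true_positives  <- 40   # relevant keywords correctly retrieved
false_positives <- 10   # irrelevant keywords retrieved
false_negatives <- 15   # relevant keywords missed

precision <- true_positives / (true_positives + false_positives)
recall    <- true_positives / (true_positives + false_negatives)

precision   # 0.8   -> 80% of retrieved keywords are relevant
recall      # ~0.73 -> about 73% of all relevant keywords were retrieved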

Use in SEO:

Precision and recall are used in various fields such as pattern recognition, information retrieval, and binary classification to fetch relevant information. Precision is the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that have been retrieved out of the total number of relevant instances.

Likewise, in SEO, clusters of semantically similar focus keywords can be evaluated using precision and recall. Precision reflects a classification model's ability to return only relevant instances; recall reflects its ability to identify all relevant instances.

Conclusion:

Precision and recall are two extremely important model evaluation metrics. We used these metrics to identify relevant keywords from a particular model. Also, we can determine the relevancy of particular content.

Generally, in information retrieval, you want to identify as many relevant documents as you can (that is recall) while avoiding having to sort through junk (that is precision).

3. Using Naïve Bayes to classify focus keywords according to their rank

By ThatWare

Introduction:

Naive Bayes classifiers are a collection of probabilistic classification algorithms based on Bayes' Theorem. Basically, it is a classification technique with an assumption of independence among predictors: a Naive Bayes classifier assumes that a particular feature in a class is unrelated to any other feature in the class. This assumption is called class conditional independence.

Analysis:

The Naive Bayes model is very useful for large-scale data sets. Naive Bayes is a technique for constructing classifier models: instances, represented as vectors of feature values, are assigned class labels drawn from some finite set.

Base formula:

P(A|B) = [ P(B|A) × P(A) ] / P(B)

P(A|B) is the posterior probability of class A (the target) given predictor B (the attributes).

P(A) is the prior probability of the class.

P(B|A) is the likelihood: the probability of the predictor given the class.

P(B) is the prior probability of the predictor.

There are three types of Naive Bayes model:

1. Gaussian

2. Multinomial

3. Bernoulli

Constructing a sparse document-feature matrix:

Setting factors and keeping the last one as ‘NA’

Creating the naive Bayes classifier:

textmodel_nb: Naive Bayes classifier for texts, a multinomial or Bernoulli Naive Bayes model, given a dfm and some training labels.

coef: A function which extracts model coefficients from objects returned by modeling functions.

predict: is a generic function for predictions from the results.

Predicting models with different prior distribution:

Prior distribution on texts: one of "uniform", "docfreq", or "termfreq".
Prior distributions refer to the prior probabilities assigned to the training classes, and the choice of prior distribution affects the calculation of the fitted probabilities. The default is uniform priors, which set the unconditional probability of observing one class to be the same as observing any other class.

Getting the result:

Distribution: count model for text features, can be multinomial or Bernoulli.

prior distribution on texts; “docfreq”
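A minimal R sketch of this workflow; it assumes the quanteda and quanteda.textmodels packages, and the texts and rank labels below are illustrative rather than real ranking data:

library(quanteda)
library(quanteda.textmodels)

# Illustrative keyword snippets with known rank classes; the last one is unlabelled (NA)
texts  <- c("buy running shoes online", "cheap shoes discount click here",
            "best trail running shoes review", "free shoes spam offer",
            "running shoes size guide")
labels <- factor(c("good", "bad", "good", "bad", NA))

# Sparse document-feature matrix
dfmat <- dfm(tokens(texts))

# Multinomial Naive Bayes trained on the labelled documents with uniform priors
nb <- textmodel_nb(dfmat[!is.na(labels), ], y = labels[!is.na(labels)],
                   distribution = "multinomial", prior = "uniform")

coef(nb)                      # model coefficients per feature and class
predict(nb, newdata = dfmat)  # predicted class for every document, including the unlabelled one

Re-fitting with prior = "docfreq" or prior = "termfreq" gives the alternative prior distributions described above.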

Use in SEO:

We used Naive Bayes classifiers to classify good and bad keywords according to their rank. From this we can analyze why rankings are dropping for a particular keyword, and later on the result can be used in penalty analysis.

Conclusion:

The Naive Bayes algorithm really helps in SEO: classification can identify potential ranking keywords as well as bad keywords. This helps in mapping keywords in a better way, which will increase visibility in the SERP.

4. Correlating Multiple Sites to get a brief understanding of ranking factors

By ThatWare

Introduction:

A correlation is a numerical measure of the statistical relationship between two entities. Each entity has a set of observed data which is later used in various types of analysis.

In our case it is website correlation, or website matching: a process used to get a preview of both the main site's and the competitor site's structure, content, and categories. It later helps in identifying critical issues which may lead to bad rankings or a penalty.

Analysis:

Two main types of correlation that we use are:

Pearson:

The Pearson correlation coefficient is a measure of the linear correlation between two variables (data sets), where 1 represents perfect positive linear correlation, 0 represents no linear correlation, and -1 represents perfect negative linear correlation.

Base formula:

Source: https://wikimedia.org/api/rest_v1/media/math/render/svg/2b9c2079a3ffc1aacd36201ea0a3fb2460dc226f

r = Σ(xi − x̄)(yi − ȳ) / ( √Σ(xi − x̄)² × √Σ(yi − ȳ)² )

where n is the sample size, xi and yi are the individual sample points indexed with i (the sums run over i = 1 … n), and x̄ and ȳ are the sample means.

Spearman:

Spearman’s rank correlation coefficient measures statistical dependence between the rankings of two variables. Spearman’s correlation assesses monotonic relationships. If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.

Base Formula:

Source: https://wikimedia.org/api/rest_v1/media/math/render/svg/b69578f3203ecf1b85b1a0929772b376ae07a3ce

But we used correlation in a different way that really helps to analyze a site's structure: we made a program that scrapes sites to get a view of their structure and of how it affects ranking in the SERP by judging the factors below; a small illustrative sketch follows the checklist.

Scraping selected URLs:

Keyword input:

Number of urls:

Checking H1:

Checking H2:

Checking H3:

Checking H4:

Checking H5:

Checking H6:

Checking span tag:

Checking i tag:

Checking em tag:

Checking strong tag:

Checking b tag:

Checking li tag:

Checking ol tag:

Checking ul tag:

Checking option tag:

Checking p tag:

Checking body tag:

Checking div tag:

Checking article tag:

Page title length checking:

Checking Page Title Attribute Matches:

Checking if Page Li tag Matches:

Checking Page Title contains Question Words:

Checking if Page Li Tag ExactMatches:

Checking Page Meta Description Length:

Checking Page Meta Description Matches:

Checking Page Meta Description SearchTerm Matches:

Checking Page Meta Keywords Matches:

Checking Page Has WordPress Generator Tag:
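A small R sketch of the idea, assuming the rvest package; the URLs and the list of tags passed to the helper are illustrative:

library(rvest)

# Count selected on-page elements for one URL (the tags above are checked the same way)
count_tags <- function(url, tags = c("h1", "h2", "h3", "p", "li", "strong", "div")) {
  page <- read_html(url)
  sapply(tags, function(tag) length(html_nodes(page, tag)))
}

main_site  <- count_tags("https://www.example.com/")             # illustrative URL
competitor <- count_tags("https://www.competitor-example.com/")  # illustrative URL

# Correlate the two structural profiles
cor(main_site, competitor, method = "pearson")
cor(main_site, competitor, method = "spearman")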

Use in SEO:

Search engines commonly correlate sites; this helps uncover keywords with similar time-based search patterns (with weekly or monthly frequency options) or with a provided search query.

"Google Correlate uses the Pearson correlation to compare normalized query data to surface the highest correlative terms."

We used our program to judge each and every keyword in the meta description, title tag, alt text, i tag, p tag, etc. against multiple competitors to check what kinds of matches (exact match, phrase match, search-term match) are present in the content; this affects ranking in terms of the keywords and search queries hit by users. We also check the Pearson value and Spearman's rank value to analyze how the main site is performing compared to its competitors.

Our method of correlating websites gives insight into various crucial elements which affect a site's visibility in the SERP.

Conclusion:

Pearson's correlation coefficient and Spearman's rank-order correlation help to get a complete overview of the data fetched by correlating the main site and the competitors' sites.

In SEO, using correlation we can determine what the main site is lacking, and according to the results we can implement changes which will lead to better rankings in the SERP.

5. Topic Modelling Using LDA (Latent Dirichlet Allocation) for keyword optimization in SEO

By ThatWare

Introduction: The Information Retrieval model

Latent Dirichlet Allocation (LDA) is a “generative probabilistic model” of a collection of composites made up of parts. In terms of topic modeling, the composites are documents and the parts are words and/or phrases (phrases n-words in length are referred to as n-grams).

Topic models provide a simple way to analyze large volumes of unlabeled text. Topic modeling is a method for unsupervised classification of documents. Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.

Purpose:

Although LDA has applications in many different fields, we used LDA to find a keyword's probability of belonging to a particular topic within a particular document. Each keyword has a probability of belonging to a particular topic, which also indicates its relevance to a particular document.

After analyzing a website's content, using LDA topic modeling we can determine which keyword holds the highest relevancy score based on a query (search term). Using those keywords helps in increasing a page's relevancy for most search engines.

Analysis:

Latent Dirichlet allocation is one of the most common algorithms for topic modeling. Without diving into the math behind the model, we can understand it as being guided by two principles.

  1. Every document is a mixture of topics

  2. Every topic is a mixture of words

LDA views documents as bags of words. LDA first assumes that a document was generated by picking a set of topics and then, for each topic, picking a set of words. The algorithm then reverse-engineers the whole process to find the topic for each set of words.

Load the files into a character vector:

Create corpus from the vector:

Data Preprocessing:

Remove potential problematic symbols:

Remove punctuation, digits, stop words and white space:

Creating a Document Term Matrix:

Setting parameters required in “Gibbs” sampling method:

K = number of topics (min = 2)

K = 2 means that each document's topic distribution is spread over 2 topics.

The main point of using the seed is to be able to reproduce a particular sequence of 'random' numbers.

The number of iterations is set to 2000.

 

Finally ready to use the LDA() function from the topicmodels package:

Method = "Gibbs": Gibbs sampling (a Gibbs sampler) is a Markov chain Monte Carlo algorithm for obtaining a sequence of observations approximated from a specified multivariate probability distribution when direct sampling is difficult.

Word-topic probabilities:

Test_lda_td2 <- tidy(test_lda2)

This converts the LDA output into tidy form.

Once a dataset is tidy, it can be used as input into a variety of other functions that may transform, model, or visualize the data.

     

group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”. ungroup() removes grouping. In this case the tbl is grouped by topic.

top_n(10, beta): in this case, the top 10 terms along with their beta values are shown.

Beta: beta represents topic-word density. With a high beta, topics are made up of most of the words in the corpus; with a low beta, they consist of few words.

In this case the beta also shows the topic-word probability:

Visualization of the 2 topics that were extracted from the documents:

Output:
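A condensed R sketch of the LDA run described above, using the topicmodels and tidytext packages; the two documents are illustrative stand-ins for real site content:

library(tm)
library(topicmodels)
library(tidytext)
library(dplyr)
library(ggplot2)

docs <- c("running shoes for trail and road running sessions",
          "sunglasses lenses and uv protection for sport")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)

# LDA with Gibbs sampling: k = 2 topics, fixed seed, 2000 iterations
test_lda2 <- LDA(dtm, k = 2, method = "Gibbs",
                 control = list(seed = 1234, iter = 2000))

# Word-topic probabilities (beta) in tidy form
test_lda_td2 <- tidy(test_lda2, matrix = "beta")

top_terms <- test_lda_td2 %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

# Visualize the top terms of the 2 extracted topics
ggplot(top_terms, aes(reorder(term, beta), beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()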

Note: Our analysis method may vary depending upon the requirements.

Use in SEO:

We use LDA topic modeling to find the best keywords by their relevancy score. We analyze the main site's content to extract the best keywords available and increase the page's visibility in the SERP.

Algorithm We Used:

  1. Assume there are k topics across all of the documents.

  2. Distribute these k topics across document m by assigning each word a topic.

  3. For each word in document m, assume its topic is wrong but that every other word is assigned the correct topic.

  4. Probabilistically assign the word a topic based on two things:

    – what topics are in document m

    – how many times the word has been assigned a particular topic across all of the documents

  5. Repeat this process a number of times for each document.

Conclusion:

Topic modeling is the process of identifying topics in a set of documents. This can be useful for search engines, trending news topics, and any other instance where knowing the topics of documents is important.

In our case we used the algorithm for finding the money keywords in the content. This algorithm is very useful for keyword research, content optimization, and other factors, which leads to better ranking for a particular site.

6. Using Fleiss' Kappa to determine the agreement percentage for implementing particular factors

By ThatWare

Introduction:

Cohen's kappa is used to measure the percentage of agreement between two raters. The drawback is that we can't use more than two raters with Cohen's kappa; to overcome this limitation, an extension called Fleiss' kappa is used. Fleiss' kappa is a way to measure agreement between three or more raters.

Base formula for two raters:

K = (P − Pe) / (1 − Pe)

1 − Pe gives the degree of agreement that is attainable above chance.

P − Pe gives the degree of agreement actually achieved above chance.

K = 1 if the raters are in complete agreement.

K ≤ 0 if there is no agreement among the raters beyond what would be expected by chance.

Analysis:

We used kappa statistics to determine the degree of agreement of the nominal or ordinal ratings made by multiple raters when the raters evaluate the same samples. In our case, agreement can be defined as follows: if a fixed number of raters assign numerical ratings to a number of factors, then kappa gives a measure of how consistent the ratings are.

To serve our purpose, we used the kappa statistics shown below to get the desired result.

We have taken a set of supervised data subjects, observed from several websites, recording whether a particular subject or point is present or not.

Based on that, we created a test file containing agreements and disagreements represented as 1 and 0, where 0 means disagreement and 1 means agreement.

The dataset we used

Subjects (Factors)

Raters (Competitors)

Imported the data from the default directory

Applied Fleiss Kappa:

Output:

Now,

Deriving the percentage of agreement for each subject:

10, because there are 10 competitors

Output:

Result file:

Percentage of the agreement for each subject

The kappa agreement value
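A compact R sketch of the computation, assuming the irr package; the 0/1 ratings matrix below is illustrative:

library(irr)

# Rows = subjects (SEO factors), columns = raters (competitor sites);
# 1 = factor present (agreement), 0 = absent (disagreement)
ratings <- matrix(c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1,   # e.g. heading tags
                    0, 0, 1, 0, 0, 1, 0, 0, 0, 0,   # e.g. meta keywords
                    1, 1, 1, 1, 1, 1, 1, 1, 0, 1),  # e.g. meta description
                  nrow = 3, byrow = TRUE)

# Fleiss' kappa across all 10 raters
kappam.fleiss(ratings)

# Percentage of agreement (presence) for each subject: divide by the 10 raters
rowSums(ratings) / 10 * 100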

Use in SEO:

In Kappa statistics, we measure the percentage of agreement of ratings made by multiple raters, which helps in making a decision for many different tasks.

Likewise, in our case we use kappa statistics to judge elements or factors (tags) that are crucial for SEO (heading tags, meta description, nofollow, noindex, iframe, etc.), by comparing the main site with multiple competitors' sites to see, from the degree of agreement, which factors we should use the most and which we should avoid in order to rank in the SERP.

Conclusion:

In SEO, analyzing competitors' sites is one of the most essential procedures; it gives a brief review of how your site's condition compares to your competitors' sites. This helps you understand where your site is lacking. By using kappa statistics we can determine which factor holds the highest priority by observing the degree of agreement. Kappa statistics can really help in competitor analysis.

7. Using the Jaccard Index to find related tags

By ThatWare

Introduction:

The Jaccard index determines how similar two sets are: it compares the members of the two sets to see which are shared and which are distinct. Simply put, it is the intersection over the union: the size of the intersection divided by the size of the union of the sample sets.

Analysis:

Base formula of Jaccard index:

J(A,B) = |A∩B| / |A∪B|

In our case we have two vectors which contain terms from a corpus.

Two vectors A and B, each represented as a set:

A = {0,1,2,5,6}

B = {0,2,3,4,5,7,9}

If the Jaccard index between two sets is 1, then the intersection has as many elements as the union, so A∩B = A∪B. Since A∩B ⊆ A ⊆ A∪B, it follows that A = B.

Loading the documents and creating it into a corpus:

Creating term document matrix:

Giving a specified tag

Calculating the results:

Output:
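A short R sketch of the calculation, reusing the example sets from above (the tag vectors at the end are illustrative):

# Jaccard index: size of the intersection divided by size of the union
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

A <- c(0, 1, 2, 5, 6)
B <- c(0, 2, 3, 4, 5, 7, 9)
jaccard(A, B)   # intersection {0, 2, 5}, union has 9 elements -> 3/9 = 0.33

# The same function works directly on two sets of tags or terms
tags_main       <- c("sunglasses", "sport", "lenses", "uv")
tags_competitor <- c("sunglasses", "lenses", "polarized")
jaccard(tags_main, tags_competitor)   # 2 shared / 5 total = 0.4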

Use in SEO: The Jaccard index is used to measure similarity between two sets by intersection over union. The Jaccard index allows us to find highly related tags which may have no textual characteristics in common; this is one of the most widely used methods for identifying good tags, bad tags, and related tags. By using the Jaccard index we can determine whether the tags are valuable or not.

Algorithm: 

  1. Creating corpus with multiple documents

  2. Pre-processing the documents

  3. Creating term document matrix

  4. Separating terms and value

  5. Creating two sets

  6. Calculating the sets with the base formula

  7. Getting the output

Conclusion:

User-generated content is one of the most valuable sources of content and can help us build human-driven natural-language descriptions. User-generated tags can hold the key to increasing visibility in the SERP. Although there are millions of tags created by users for a particular product, some may lead to duplicate content; the Jaccard index can solve this issue by identifying the tags that actually relate to a particular product.

8. Using Hierarchical clustering to find out similar tags in the site

By ThatWare

Introduction:

Hierarchical clustering is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.

All the data points within a cluster are very similar to each other; likewise, other clusters also contain very similar data points. But the clusters themselves are dissimilar from each other, because the data points in one cluster have different values than the data points in another cluster.

Analysis:

Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom.

Hierarchical clustering initially treats each data point as a separate cluster. It then repeatedly identifies the two clusters that are closest together and merges the two most similar clusters into one.

Hierarchical clustering generally falls into two types:

  • Agglomerative: This is a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. In this technique, initially, each data point is considered an individual cluster. At each iteration, the similar clusters merge with other clusters until one cluster or K clusters are formed.

  • Divisive: This is a “top-down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. In divisive or top-down clustering method we assign all of the observations to a single cluster and then partition the cluster to two least similar clusters

Algorithms to determine the distance of the data points:

Euclidean distance, squared Euclidean distance, Manhattan distance, maximum distance, and Mahalanobis distance.

Calculate the similarity between two clusters:

Calculating the similarity between two clusters is important to merge or divide the clusters. Certain approaches used to calculate the similarity between two clusters:

  • Maximum or complete-linkage clustering

  • Minimum or single-linkage clustering

  • Group Average

  • Distance Between Centroids

  • Ward’s Method

Implementation:

 Taken a specific page to analyze:

Now, scraping the site:

Creating term document matrix:

A given range of terms (1 to 40) against document 1.

Distance Between vectors:

The algorithm uses Euclidean distance to measure the distance between terms in this case.

Plotting and Clustering:

Single Linkage :

Complete Linkage :

Output:

The Hierarchical clustering Technique can be visualized using a Dendrogram.

A Dendrogram is a tree-like diagram that records the sequences of merges or splits.
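A minimal R sketch of the clustering steps above; the page text is illustrative, and in practice the corpus comes from the scraped site:

library(tm)

page_text <- c("sport sunglasses lenses uv protection",
               "polarized lenses for running and cycling",
               "sunglasses frames fit and lens materials")
corpus <- VCorpus(VectorSource(page_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Term-document matrix; terms are clustered by their frequency profiles
tdm <- TermDocumentMatrix(corpus)
m   <- as.matrix(tdm)

# Euclidean distances between term vectors
d <- dist(m, method = "euclidean")

# Single- and complete-linkage hierarchical clustering
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")

# Dendrograms record the sequence of merges
plot(hc_single,   main = "Single linkage")
plot(hc_complete, main = "Complete linkage")

# Cut the tree into, say, 3 groups of similar terms/tags
cutree(hc_complete, k = 3)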

Use in SEO:

Clusters of similar data points actually help in analyzing the tags of a particular site, and we can determine how similar they are. Finding similar tags is helpful: if the tags are relevant to the site, this can increase the site's visibility in the SERP.

Algorithm of basic hierarchical clustering:

Step- 1: In the initial step, we calculate the proximity of individual points and consider all the data points as individual clusters.

Step- 2: In step two, similar clusters are merged together and formed as a single cluster.

 Step- 3: We again calculate the proximity of new clusters and merge the similar clusters to form new clusters.

Step- 4: Calculate the proximity of the new clusters and form a new cluster.

Step- 5: Finally, all the clusters are merged together and form a single cluster.

Conclusion:

Hierarchical clustering is a powerful technique that allows you to build tree structures from data similarities. Clustering can discover hidden patterns in the data. We can now see and separate how different sub-clusters relate to each other, and how far apart data points are.

But in our case we just need to find similar tags to use in tag optimization, which helps SEO in its own way, so that a particular site can rank higher in the SERP.

9. Using K-means to identify a group of semantically similar keywords

By ThatWare

Introduction:

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. The algorithm inputs are the number of clusters K and the data set. The data set is a collection of features for each data point.

Analysis:

We have taken a data set which includes items with certain features, and values for these features (like a vector). The task is to categorize those items into groups (clusters). To get the desired output, we used the k-means algorithm, an unsupervised learning algorithm.

Fetching the files and cleaning them:

Creating term document matrix:

Performing the cluster:

Output:
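A small R sketch of the clustering run; the keyword list is illustrative:

library(tm)

keywords <- c("running shoes for men", "mens running shoes sale",
              "trail running shoes", "polarized sport sunglasses",
              "sunglasses for cycling", "prescription sport sunglasses")
corpus <- VCorpus(VectorSource(keywords))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Documents (keywords) as rows, terms as columns
dtm <- DocumentTermMatrix(corpus)
m   <- as.matrix(dtm)

# k-means with k = 2 clusters; a fixed seed keeps the result reproducible
set.seed(123)
km <- kmeans(m, centers = 2)

split(keywords, km$cluster)   # which keywords land in which cluster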

Use in SEO:

In general, we used k-means clustering to find groups of semantically similar keywords. With this algorithm we can determine that the words present in a particular cluster are similar to each other and dissimilar to the words in other clusters.

To optimize a particular keyword, we use k-means clustering to find more meaning in a particular word that is being used in a site's content. By using semantic keywords, a particular keyword can indicate the true intent of the content and better satisfy the user's query, which may lead to more traffic to the website, and the site's rank may also get higher in the SERP.

Algorithm:

  1. Clusters the data into k groups where k is predefined.

  2. Select k points at random as cluster centers.

  3. Assign objects to their closest cluster center according to the Euclidean distance function.

  4. Calculate the centroid or mean of all objects in each cluster.

  5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.

Conclusion:

K-means clustering is one of the most popular clustering algorithms and usually the first thing practitioners apply when solving clustering tasks to get an idea of the structure of the dataset.

In our case we used k-means to optimize keywords; doing so gives more depth to particular content and provides more value in the eyes of Google. This creates more chances to obtain a variety of keyword rankings, as well as an opportunity to rank for a longer period of time. By implementing the keywords we chose from a particular cluster, we can deliver more relevant content.

10. RankBrain Schema Implementation

By ThatWare

Introduction:  

Schema markup is a form of microdata added to a particular webpage that creates an enhanced description, also known as a rich snippet, that appears in the search results. Top search engines – including Google, Yahoo, Bing, and Yandex – first started collaborating on Schema.org back in 2011.

Purpose:

Schema is used to improve the way search engines read and represent your page in SERPs.

In the example above you can see star ratings and prices; both of these can be added using schema. Search results with more extensive rich snippets (like the one above, created using schema) will have a better click-through rate. Schema markup is important for the RankBrain algorithm. We have developed a schema that satisfies the RankBrain algorithm and helps interpret the context of a query.

Analysis: 

Schema markup was invented for users, so they can see in the SERPs what a website is all about, the place, the time schedule, cost, rating, etc. This is a user-centric improvement. Search engines help users to find the information they need.

Much research indicates that schema holds one of the important keys to the SERP: websites with schema markup tend to rank better in the SERPs than websites without markup. Some studies have also found that websites with markup rank an average of four positions higher in the SERPs than those without schema markup.

In short, millions of websites are missing out on a huge source of SEO potential. By implementing schema you gain a real edge over your competitors.

Creating schema according to RankBrain algorithm:

RankBrain is one of Google's core algorithm components for determining search results according to their relevance. Schema is one of the important inputs to this algorithm. RankBrain has been cited as part of the overall Google Hummingbird algorithm.

In 2015, Google stated that RankBrain was being used to process the roughly 15% of queries that the system had never encountered before. By 2016, Google was applying RankBrain to all queries.

So the old concept of One-keyword-one-page won’t work anymore.

By implementing schema in such a way that the points required for relevancy to a particular query can be found, we can also increase the chance of appearing in the SERP at a good position. An example of our schema is shown below (the code may change depending on the site's content):

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "WebPage",
  "breadcrumb": "Home > Blog > High Performance Sport Sunglasses Guide",   //Breadcrumb
  "mainEntity": {
    "@type": "Article",
    "name": "High Performance Sport Sunglasses Guide",
    "headline": "High Performance Sport Sunglasses Guide",
    "author": "Heavyglare Eyewear",
    "image": "https://shop.heavyglare.com/blog/wp-content/uploads/2017/10/High-Performance-Sport-Sunglasses-Guide.jpg",
    "inLanguage": "English",
    "publisher": {
      "@type": "Organization",
      "name": "High Performance Sport Sunglasses Guide",
      "logo": {
        "@type": "ImageObject",
        "name": "Heavyglare Eyewear Logo",
        "height": "50",
        "width": "200",
        "url": "https://shop.heavyglare.com/media/wysiwyg/logo.png"
      }
    },
    "datePublished": "2018-05-12",
    "dateModified": "2018-05-14",
    "mainEntityOfPage": {
      "@type": "WebPage",
      "@id": "https://shop.heavyglare.com/blog"
    }
  }
}
</script>

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "BlogPosting",
  "publishingPrinciples": {
    "@type": "Table",       //Contents
    "name": "Contents",
    "sameAs": ["1.Functionality of your lenses is imperative", "2.Quality lens materials make a difference", "3.High performance sport sunglasses need to fit properly"]
  },
  "headline": "High Performance Sport Sunglasses Guide",
  "author": {
    "@type": "Person",
    "name": "Heavyglare Eyewear"
  },
  "image": [
    "https://shop.heavyglare.com/blog/wp-content/uploads/2017/10/High-Performance-Sport-Sunglasses-Guide.jpg"
  ],
  "publisher": {
    "@type": "Organization",
    "name": "Heavyglare Eyewear",
    "logo": {
      "@type": "ImageObject",
      "name": "Heavyglare Eyewear Logo",
      "height": "50",
      "width": "200",
      "url": "https://shop.heavyglare.com/media/wysiwyg/logo.png"
    }
  },
  "datePublished": "2018-05-12",
  "dateModified": "2018-05-14",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://shop.heavyglare.com/blog"
  },
  "keywords": "lens,sunglass lens characteristics,sunglass lenses,sunglass lens issues",
  "wordCount": "1800", //wordcount
  "mainEntity": {        //Anchortext
    "@type": "Brand",
    "name": ["shop.heavyglare.com",
      "heavyglare eyewear",
      "heavyglare",
      "sports eyewear",
      "prescription sunglasses",
      "heavyglare.com"
    ]
  },
  "workExample": {          //Quora questions
    "@type": "Question",
    "text": "What is the colour of sky?",
    "dateCreated": "2018-05-12",
    "author": {
      "@type": "Person",
      "name": "A"
    },
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "blue",
      "dateCreated": "2018-05-14",
      "author": {
        "@type": "Person",
        "name": "A"
      }
    },
    "mainEntity": {     //loop
      "@type": "HowToStep",
      "name": "suggested answers",
      "itemListOrder": "http://schema.org/ItemListOrderAscending",
      "itemListElement": [
        {
          "@type": "Answer",
          "position": "1",
          "text": "black",
          "dateCreated": "2018-05-14",
          "author": {
            "@type": "Person",
            "name": "B"
          }
        },
        {
          "@type": "Answer",
          "position": "2",
          "text": "red",
          "dateCreated": "2018-05-14",
          "author": {
            "@type": "Person",
            "name": "B"
          }
        }
      ]
    }
  },
  "potentialAction": {        //questions with loop
    "@type": "AskAction",
    "agent": {
      "@type": "Person",
      "name": "John"
    },
    "recipient": {
      "@type": "Person",
      "name": "Steve"
    },
    "object": {
      "@type": "HowToStep",
      "name": "Queries",
      "itemListOrder": "http://schema.org/ItemListOrderAscending",
      "itemListElement": [
        {
          "@type": "Question",
          "position": "1",
          "text": "What's 2 + 2?"
        },
        {
          "@type": "Question",
          "position": "2",
          "text": "What's 21 + 22?"
        }
      ]
    }
  },
  "exampleOfWork": {          //how to
    "@type": "HowTo",
    "name": "How to choose a good sunglass for better performance?",
    "url": "https://shop.heavyglare.com/blog/high-performance-sport-sunglasses-guide/",
    "description": "Read the article to know the process of choosing a good sunglass",
    "image": "https://shop.heavyglare.com/blog/wp-content/uploads/2017/10/High-Performance-Sport-Sunglasses-Guide.jpg",
    "inLanguage": "en-US",
    "keywords": "lens,sunglass lens characteristics,sunglass lenses,sunglass lens issues",
    "steps": {
      "@type": "HowToStep",
      "name": "Queries",
      "itemListOrder": "http://schema.org/ItemListOrderAscending",
      "itemListElement": [
        {
          "@type": "HowToDirection",
          "position": "1",
          "name": "Aim to protect your eyes first and foremost.",
          "description": "Excessive exposure to UV radiation can cause a variety of problems for your eyes such as cataracts, burns, and cancer."
        },
        {
          "@type": "HowToDirection",
          "position": "2",
          "name": "If you want your sunglasses to protect you from these risks, look for pairs that block at least 99% of UVB rays and at least 95% of UVA rays.",
          "description": "Also look for the amount of cover the sunglasses provide. Look at how much you can see around the frames––will the sunglasses let in sun from the top or sides?"
        },
        {
          "@type": "HowToDirection",
          "position": "3",
          "name": "Don't buy sunglasses if they're labeled as 'cosmetic' or don't provide any information on UV protection.",
          "description": "Look for scratch resistance, many lenses have very fragile coatings. If you are spending much money, you want them to last. Fortunately damaged lenses can be replaced for most models."
        },
        {
          "@type": "HowToDirection",
          "position": "4",
          "name": "Choose scratch-resistant lenses.",
          "description": "Scratched up sunglasses are useless sunglasses. Lenses made from NXT polyurethane are impact-resistant, flexible, lightweight, and have great optical clarity, but they're expensive."
        },
        {
          "@type": "HowToDirection",
          "position": "5",
          "name": "Check for distortion.",
          "description": "Hold the lenses up to a fluorescent lamp. As you move the sunglasses up and down, check that wave distortion doesn't happen. If it doesn't happen, this is a good sign."
        }
      ]
    }
  },
  "video": {                //video
    "@type": "VideoObject",
    "description": "Get a visual idea about the high performance sport sunglasses.",
    "uploadDate": "2018-05-14",
    "duration": "PT1M33S",
    "name": "High Performance Sport Sunglasses Guide",
    "thumbnail": "High-Performance-Sport-Sunglasses-Guide.jpg",
    "thumbnailUrl": "https://shop.heavyglare.com/blog/wp-content/uploads/2017/10/High-Performance-Sport-Sunglasses-Guide.jpg"
  },
  "encoding": {                 //infographic
    "@type": "ImageObject",
    "name": "Infographic on High Performance Sport Sunglasses Guide",
    "author": "Heavyglare Eyewear",
    "contentUrl": "https://shop.heavyglare.com/blog/wp-content/uploads/2017/10/High-Performance-Sport-Sunglasses-Guide.jpg",
    "datePublished": "2018-05-12",
    "description": "Get a visual idea about the high performance sport sunglasses."
  }
}
</script>

The main part of the entire schema above is the "How to" section ("exampleOfWork"), shown again below:

"exampleOfWork": {          //how to
  "@type": "HowTo",
  "name": "How to choose a good sunglass for better performance?",
  "url": "https://shop.heavyglare.com/blog/high-performance-sport-sunglasses-guide/",
  "description": "Read the article to know the process of choosing a good sunglass",
  "image": "https://shop.heavyglare.com/blog/wp-content/uploads/2017/10/High-Performance-Sport-Sunglasses-Guide.jpg",
  "inLanguage": "en-US",
  "keywords": "lens,sunglass lens characteristics,sunglass lenses,sunglass lens issues",
  "steps": {
    "@type": "HowToStep",
    "name": "Queries",
    "itemListOrder": "http://schema.org/ItemListOrderAscending",
    "itemListElement": [
      {
        "@type": "HowToDirection",
        "position": "1",
        "name": "Aim to protect your eyes first and foremost.",
        "description": "Excessive exposure to UV radiation can cause a variety of problems for your eyes such as cataracts, burns, and cancer."
      },
      {
        "@type": "HowToDirection",
        "position": "2",
        "name": "If you want your sunglasses to protect you from these risks, look for pairs that block at least 99% of UVB rays and at least 95% of UVA rays.",
        "description": "Also look for the amount of cover the sunglasses provide. Look at how much you can see around the frames––will the sunglasses let in sun from the top or sides?"
      }
      //... the remaining HowToDirection steps continue as in the full schema above
    ]
  }
}

The "How to" section actually increases the landing page's ranking probability. The "How to" section is directly tied to the page's intent, which also increases the relevancy score for matching queries.

Schema implementation:

  1. First go to the Google Tag Manager.

  2. Then activate the preview mode.

  3. Next create one tag.

  4. Select custom HTML and paste the above code there and save the Tag.

  5. Then create one trigger and select ‘Page View’.

  6. If you want to implement the code for all pages, then select 'All Page Views'; if you want to implement it only for some particular pages, then select 'Some Page Views' and put the page path in it.

  7. Add the trigger in the Tag.

  8. Then at last submit it.

Conclusion:

Structured data markup is an important aspect of any comprehensive SEO solution. It is recommended that one should implement the schema types that are most applicable to the business. This will likely provide a competitive edge in the SERPs, and it is a cost-effective way to boost the organic search results.

11. Cosine Similarity Implementation In SEO

By ThatWare

Introduction:

Cosine similarity is a measure of similarity between two non-zero vectors that estimates the cosine of the angle between them. If two vectors have the same orientation, their cosine similarity is 1; with different orientations the cosine similarity falls between 0 and 1 (down to 0 for orthogonal vectors). Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1].

Purpose:

In information retrieval, cosine similarity is a commonly used algorithm across various fields; in our case, we used cosine similarity to check content similarity between two websites. We have created many applications of cosine similarity that are totally SEO-focused, for instance:

  1. Anchor Text Similarity

  2. Document to Document Similarity

  3. Keyword Similarity between your site and the competitor’s site

Analysis:

Across all the applications we created, the base formula is the same; we modified each application according to the requirement. We used the R language for our analysis.

Base Formula:

Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitudes as:

cos(θ) = (A · B) / (‖A‖ ‖B‖) = Σ AiBi / ( √Σ Ai² × √Σ Bi² )

where Ai and Bi are the components of vectors A and B respectively.

For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity of two documents will range from 0 to 1, since term frequencies cannot be negative; the angle between two term-frequency vectors cannot be greater than 90°. We use cosine similarity to measure the cosine of the angle between two websites' content: whether the sites' content is similar and, if so, by how much. The range we prefer is [0.3–0.5].
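A small R sketch of the base formula applied to term-frequency vectors (the vectors themselves are illustrative counts, not real site data):

# Cosine similarity between two term-frequency vectors
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a ^ 2)) * sqrt(sum(b ^ 2)))
}

# Counts of shared terms on the main site and a competitor site
main_site  <- c(sunglasses = 12, lenses = 8, sport = 5, uv = 3)
competitor <- c(sunglasses = 9,  lenses = 4, sport = 7, uv = 0)

cosine_sim(main_site, competitor)   # value in [0, 1]; higher means more similar content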

Use in SEO:

3 Applications that we use the most in competitor analysis:

             1. Anchor Text Similarity:

The anchor element is used to create hyperlinks between a source anchor and a destination anchor.

We take two specific websites: one is the main site, and the other is a competitor of the main site. This method is used for finding a site's anchor-text similarity with the competitor's site, in other words the common anchor texts. Similarity is measured by the cosine of the angle.

We need to scrape the sites for all anchor texts from a given link:

We clean and vectorize all the content, then store it in a data frame.

The same method is followed for the competitor's site.

Creating a corpus which contains both sites' anchor texts:

Creating a term-document matrix to find out their frequency:

Applying the base formula and intersecting to sort out the common anchor text

  2. Document-to-Document Similarity:

In this method, we scrape all the content from the main site and from the competitor site and store it in a variable; later on we convert it into a data frame for further analysis.

Creating Term Doc Matrix:

       3. Keyword Similarity between your site and the competitor's site:

We take a particular keyword from the main site as the main term and run cosine similarity to determine how similar it is to the competitor's keywords.

After the test run, if the degree of similarity in the output is high, this insight helps the main site rank better in the SERP.

Note: The similarity of the content can cause plagiarism. The only thing we need to compare is the keywords.

Cosine Similarity base algorithm:

The Cosine Similarity procedure computes similarity between all pairs of items. It is a symmetrical algorithm, which means that the result from computing the similarity of Item A to Item B is the same as computing the similarity of Item B to Item A.

  1. A and B are two vectors containing numeric values (the cosine similarity function computes the similarity of two lists of numbers).

  2. Compute the dot product of the two vectors divided by the product of the two vectors' lengths (magnitudes).

  3. The cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1].

Conclusion:

Measuring similarity using the cosine similarity algorithm is efficient and simple to work with. It gives an output which does not complicate the analysis, and for competitor research cosine similarity is one of the most useful algorithms we have seen.

12. Using Sentiment analysis to analyze the emotion (positive or negative) associated with a particular content of a site:

By ThatWare

Introduction:

Sentiment Analysis is the process of determining whether the content is positive, negative or neutral. By analyzing the content you can identify what kind of emotion the content reflects. A sentiment analysis system for text analysis combines natural language processing (NLP) and machine learning techniques to assign weighted sentiment scores to the entities, topics, themes, and categories within a sentence or phrase.
Analysis: Basically, sentiment analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.

We used three different sentiment dictionaries (AFINN, NRC, and Bing), which give us a word stock expressing many different kinds of emotion. This makes our work easier, since what each word reflects is already predefined.

Sentiment analysis can be applied at different levels of scope:

Document-level sentiment analysis obtains the sentiment of a complete document or paragraph.

Sentence level sentiment analysis obtains the sentiment of a single sentence.

Sub-sentence level sentiment analysis obtains the sentiment of sub-expressions within a sentence.

Types of Sentiment Analysis:

Fine-grained Sentiment Analysis

This expands basic polarity into more precise categories. It could, for example, be mapped onto a 5-star rating in a review, e.g. very positive = 5 stars and very negative = 1 star.

Emotion detection

Emotion detection aims at detecting emotions like happiness, frustration, anger, sadness, and the like.

Intent analysis

The intent analysis basically detects what people want to do with a text rather than what people say with that text.

Reading the text file:

Tibble to Doc:

This gives us a tibble with a single variable equal to the name of the file. We now need to unnest that tibble so that we have the document, line numbers and words as columns.

SENTIMENT ANALYSIS OF THE TEXT:

Output:
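A minimal R sketch of the text scoring, assuming the tidytext and dplyr packages; the two content lines are illustrative:

library(tidytext)
library(dplyr)

content <- data.frame(line = 1:2,
                      text = c("The lenses are excellent and very comfortable",
                               "Shipping was slow and the frame feels cheap"),
                      stringsAsFactors = FALSE)

# Tokenize into words, then score them against the Bing sentiment lexicon
content %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(line, sentiment)   # positive/negative word counts per line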

Use in SEO:

Google had already filed a patent for an algorithm that would ensure "each snippet comprises a plurality of sentiments about the entity," which, in theory, would keep the emotional content of the snippets relatively balanced, as reported by Search Engine People.

In our case we used sentiment analysis in various reports that cover several factors to judge. It depends entirely on the task: we use it in review analysis and content analysis. Using sentiment analysis algorithms across product reviews lets online retailers know what consumers think of their products and respond accordingly, and it can especially be used to gain insight into customer sentiment about products and services. As for SEO, creating content that aligns with positive sentiment can really help with ranking in the SERP.

Conclusion:

As sentiment analysis tools become increasingly available, the SEO industry cannot help but be affected by them. So we started analyzing content for various forms of sentiment, which really helps in product analysis and review analysis.

13. Using TF-IDF to determine a keyword's value against the document:

By ThatWare


Introduction:

In information retrieval, TF-IDF stands for Term Frequency–Inverse Document Frequency. The tf-idf weight is often used to indicate a keyword's relevance to a particular document. Variations of the tf-idf weighting scheme are often used by search engines in scoring and ranking a document's relevance given a query. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.


Analysis:

The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents.

Term frequency:

The number of times a term occurs in a document is called its term frequency. For the term frequency tf(t, d), we use the raw count of a term in a document.

t= Terms

d= Document

Augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the raw frequency of the most occurring term in the document:

Inverse document frequency:

An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

N= Total number of documents in the corpus

Test Query:

Creating Corpus and clean it:

Creating term document matrix:

Creating a function for calculating tf-idf:

Calling the tf-idf function:

Getting the tf-idf value of query and docs separately:

Getting each and every Doc score:

Converting the results into a dataFrame:

Printing the result according to the score in decreasing order:

Output:

Printing only the 1st and 2nd row:

Output:

The results indicate the relevance of the query to each particular document.
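A condensed R sketch of the scoring pipeline above; the documents and query are illustrative:

library(tm)

docs  <- c("high performance sport sunglasses guide",
           "running shoes buying guide",
           "sunglasses lens materials and protection")
query <- "sport sunglasses"

corpus <- VCorpus(VectorSource(c(docs, query)))
corpus <- tm_map(corpus, content_transformer(tolower))
tdm    <- as.matrix(TermDocumentMatrix(corpus))

# tf-idf: raw term frequency weighted by inverse document frequency
tf     <- tdm
idf    <- log(ncol(tdm) / rowSums(tdm > 0))
tf_idf <- tf * idf

# Score each document against the query (dot product of tf-idf vectors)
query_vec <- tf_idf[, ncol(tf_idf)]
scores    <- apply(tf_idf[, 1:length(docs)], 2, function(d) sum(d * query_vec))
sort(scores, decreasing = TRUE)   # most relevant documents first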

Use in SEO:

TF-IDF is a very useful method to find a keyword and check how relevant it is to a particular document. Implementing keywords that are more relevant to the actual document can increase visibility in the SERP and give a good ranking opportunity to any site.

Conclusion:

TF-IDF is intended to reflect how relevant a term is in a given document. In SEO, terms or keywords are crucial and are among the important factors for ranking high in the SERP. TF-IDF actually helps us observe how relevant a query is to a document.



14. Using AdaBoost to solve clustering problems for focus keywords of a particular content:

By ThatWare

Introduction:

AdaBoost, short for "Adaptive Boosting", is the first practical boosting algorithm; it focuses on classification problems and aims to convert a set of weak classifiers into a strong one.

Generally this method creates a strong classifier from a number of weak classifiers. This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model.

Analysis:

AdaBoost is a popular boosting technique which helps you combine multiple “weak classifiers” into a single “strong classifier”. AdaBoost can be applied to any classification algorithm, so it’s really a technique that builds on top of other classifiers as opposed to being a classifier itself.

  1. Retrains the algorithm iteratively by choosing the training set based on the accuracy of previous training.

  2. The weight of each trained classifier at any iteration depends on the accuracy achieved.

The final strong classifier is F(x) = Σ (from m = 1 to M) θ_m f_m(x), where f_m stands for the m-th weak classifier and θ_m is the corresponding weight; it is exactly the weighted combination of M weak classifiers.

Fetching the files and cleaning them:
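A small R sketch of the boosting step, assuming the adabag package; the keyword features, labels, and tree settings are illustrative:

library(adabag)

# Illustrative keyword features: search volume, on-page frequency, and a rank class
keywords <- data.frame(
  volume    = c(1200, 90, 5400, 60, 880, 40, 2300, 150),
  frequency = c(14, 2, 22, 1, 9, 3, 17, 4),
  class     = factor(c("focus", "weak", "focus", "weak",
                       "focus", "weak", "focus", "weak"))
)

# Boost many weak tree classifiers into one strong classifier
set.seed(42)
model <- boosting(class ~ volume + frequency, data = keywords,
                  mfinal = 25, control = rpart::rpart.control(minsplit = 2))

predict(model, newdata = keywords)$class   # predicted class per keyword
model$importance                           # which features drive the classification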

Use in SEO:

Basically this algorithm helps on classification problems and aims to convert a set of weak classifiers into a strong one.

There are layers of work involved in making proper use of this algorithm. When there is a lot of data (text, products, categories) on the site that needs a clear-cut classification, we mostly use the hierarchical and k-means algorithms to sort it into clusters of information; then we use LDA to categorize each topic according to its tf-idf weight. Sometimes the weak clusters have problems fitting into a proper cluster, which is why we use AdaBoost to build a strong classification.

Conclusion:

AdaBoost really helps in this kind of classification, so later on it can help in getting a proper view of focus keywords. In this context, AdaBoost actually has two roles: each layer of the cascade is a strong classifier built out of a combination of weaker classifiers, and the principles of AdaBoost are also used to find the best features to use in each layer of the cascade.

This algorithm tells you what the best "features" are and how to combine them into a classifier.
