Gephi Report: The Definitive Guide

Gephi is free, open-source software used to represent large networks, such as computer networks and social media networks, as graphs. It also lets the user analyze and manipulate the network as needed.

It has a wide variety of applications, depending on the user’s requirements, such as link analysis, exploratory data analysis, social network analysis, biological network analysis, and poster creation. Its main features include dynamic filtering, real-time visualization, input/output, cartography creation, layout, and data table and edition.

2. Benefits of Gephi:

After crawling your site with Screaming Frog and extracting all the necessary links, we imported the collected data into Gephi. We could then query any node of your website and display the links between nodes (the nodes are pages collected from the site crawl data).

Gephi helps turn a website’s data into a visualization. A wide variety of layouts is available for network visualization. The layout we used for the graphical representation is ForceAtlas 2. Because we are doing link analysis, we calculate the metrics of every node (target page) from real-time data.

3. SEO Advantage:

Generally, a very important part of a strong SEO strategy is understanding how a website is structured, how its pages are connected, and how PageRank flows through it. This is where Gephi steps in. Through a visual representation of a site’s internal structure, we can diagnose and detect SEO issues such as PageRank deficiency and link equity problems, and also see how Google might crawl your website. We built a network in Gephi that allowed us to detect the issues that might affect the ranking of your pages later on.

Note: If PageRank is improperly distributed across the network, or if the main target page has a low PageRank, the main target page will not rank higher. Gephi is used to find these kinds of issues.

4. Getting Started:

We started by crawling the website and collecting the data (the necessary internal links):
The tool we used to crawl your site is Screaming Frog. Once Screaming Frog finished crawling your site, we had a large number of links, including images, JS files, CSS files, and many other unnecessary items. We were mainly interested in pages, not the other files, so we needed to exclude those from the crawl data.

Exporting the files to a spreadsheet and excluding unnecessary files and links:

After gathering the crawl data, we exported it from Screaming Frog:

“Bulk Export” > “All links” to an Excel spreadsheet.

Next, we filtered and cleaned the spreadsheet. We first renamed the “Destination” column to “Target”, then removed the first column, “Type”, and every other column except “Source” and “Target”.

After that, we filtered out images, CSS, and JavaScript files. We also removed tag links, category links, duplicate links, and pagination links from the source.
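The same cleanup can be sketched in a few lines of Python instead of doing it by hand in the spreadsheet. This is only an illustration: the column names “Type”, “Source”, and “Destination” come from the export described above, while the extension list, the URL patterns for tags, categories, and pagination, and the example.com URLs are assumptions that would need adjusting for a real site.

```python
import csv
import io
from urllib.parse import urlparse

# Assumed lists; adapt them to the site being analyzed.
ASSET_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js")
SKIP_PATH_PARTS = ("/tag/", "/category/", "/page/")  # hypothetical URL patterns


def is_page(url):
    """Return True when the URL looks like a regular page."""
    path = urlparse(url).path.lower()
    if path.endswith(ASSET_EXTENSIONS):
        return False
    return not any(part in path for part in SKIP_PATH_PARTS)


def clean_links(rows):
    """Keep only page-to-page links, drop duplicates, rename Destination to Target."""
    seen = set()
    edges = []
    for row in rows:
        source, target = row["Source"], row["Destination"]
        if not (is_page(source) and is_page(target)):
            continue
        if (source, target) in seen:
            continue
        seen.add((source, target))
        edges.append({"Source": source, "Target": target})
    return edges


# Tiny made-up export used only to demonstrate the filter.
sample = """Type,Source,Destination
Hyperlink,https://example.com/,https://example.com/about
Hyperlink,https://example.com/,https://example.com/about
Image,https://example.com/,https://example.com/logo.png
Hyperlink,https://example.com/,https://example.com/tag/news/
Hyperlink,https://example.com/,https://example.com/style.css
"""

edges = clean_links(csv.DictReader(io.StringIO(sample)))
```

Only the first link survives in this sample: the duplicate, the image, the tag page, and the stylesheet are dropped. The resulting rows can be written back out with `csv.DictWriter` using the “Source” and “Target” headers.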

Bulk Export >> All links

Exported the crawled links to an Excel sheet.

Filtered and renamed “Destination” to “Target”.

At this point, we have the filtered Excel spreadsheet containing the Source and Target information.
In Gephi, we imported the Excel spreadsheet from the “Data Laboratory” section. Once all the data is in the “Data Laboratory,” a graphical representation of it is visible in the “Overview” section. This initial layout may not be suitable to work with; it needs tuning with a proper layout algorithm.

After importing the Excel sheet-


An overview of the present layout-


3.1.2 Tuning the Graph:

Generally, most of the main layouts run on a “force-based” algorithm. The layout algorithm we used to separate the nodes (the web pages) and the edges (the links between web pages) is “ForceAtlas 2”. After adjusting Scaling and Gravity, the graph settled into a viewable position.
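To give a feel for what Scaling and Gravity do, here is a toy force-directed layout in Python. It is a simplified sketch and not Gephi’s ForceAtlas 2 implementation: the force formulas, the step size, and the parameter names `scaling` and `gravity` are assumptions chosen for illustration. Every pair of nodes repels (stronger with a higher `scaling`), linked nodes attract, and `gravity` pulls every node toward the center.

```python
import math
import random


def force_layout(edges, scaling=2.0, gravity=1.0, iterations=300, seed=0):
    """Toy force-directed layout; NOT Gephi's ForceAtlas 2 implementation."""
    rng = random.Random(seed)
    nodes = sorted({n for edge in edges for n in edge})
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    for _ in range(iterations):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion: every pair of nodes pushes apart; `scaling` sets the strength.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = scaling / dist
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Attraction: linked nodes pull together in proportion to their distance.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += dx
            force[a][1] += dy
            force[b][0] -= dx
            force[b][1] -= dy
        # Gravity: every node is pulled toward the origin so nothing drifts away.
        for n in nodes:
            force[n][0] -= gravity * pos[n][0]
            force[n][1] -= gravity * pos[n][1]
        # Move each node a small, capped step along its net force.
        for n in nodes:
            fx, fy = force[n]
            mag = math.hypot(fx, fy)
            if mag > 0:
                step = min(0.05 * mag, 0.5)
                pos[n][0] += step * fx / mag
                pos[n][1] += step * fy / mag
    return pos


# Hypothetical four-page site used only to exercise the function.
demo_edges = [("home", "a"), ("home", "b"), ("a", "b"), ("b", "c")]
demo_pos = force_layout(demo_edges)
```

In this sketch, raising `scaling` strengthens the push between nodes, while raising `gravity` strengthens the pull toward the center.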

Adjusting the view:


Complete Overview:


After adjusting the view, we calculated PageRank and Modularity.

These options are available in the “Statistics” tab. We ran PageRank with the default settings, but for Modularity we un-ticked “Use weights.” Running them appends data about your pages in new columns, which are used for the visualization and plotting later on.
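For readers who want to see what the PageRank statistic computes, the following is a minimal power-iteration sketch in Python. It is not Gephi’s code: the damping factor of 0.85 is the commonly used value and is assumed here, the fixed iteration count stands in for a convergence check, and the three-page site is made up.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank over a list of (source, target) links."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for source, target in edges:
        out_links[source].append(target)
    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is shared evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1.0 - damping) / count + damping * dangling / count
                    for n in nodes}
        # Each page splits its rank equally among the pages it links to.
        for n in nodes:
            for target in out_links[n]:
                new_rank[target] += damping * rank[n] / len(out_links[n])
        rank = new_rank
    return rank


# Hypothetical site: the homepage "/" links to both pages and both link back.
site_edges = [("/", "/a"), ("/", "/b"), ("/a", "/"), ("/b", "/"), ("/a", "/b")]
site_rank = pagerank(site_edges)
top_page = max(site_rank, key=site_rank.get)
```

In this example the homepage ends up with the highest score because every other page links to it, which is the pattern the analysis below relies on when it picks the origin page.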

Ran Modularity-


Ran PageRank-


PageRank identification tuning:

We chose Nodes, then Size, then the Ranking tab, and set a minimum and a maximum size.

This shows pages with a high PageRank as large circles and pages with a low PageRank as small circles, within the given minimum and maximum sizes.
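The size ranking can be pictured as a simple interpolation between the minimum and maximum size. The sketch below assumes a linear mapping; the function name `node_sizes`, the default sizes, and the sample scores are illustrative and not taken from Gephi.

```python
def node_sizes(rank, min_size=10.0, max_size=50.0):
    """Map each PageRank score linearly onto the range [min_size, max_size]."""
    lowest, highest = min(rank.values()), max(rank.values())
    span = highest - lowest
    if span == 0:
        # All pages have the same score, so give them the same middle size.
        return {page: (min_size + max_size) / 2 for page in rank}
    return {page: min_size + (score - lowest) / span * (max_size - min_size)
            for page, score in rank.items()}


# Made-up scores for three pages.
sizes = node_sizes({"/": 0.4, "/a": 0.1, "/b": 0.25})
```

Here the page with the highest score gets the maximum size, the page with the lowest score gets the minimum size, and the page halfway between them lands in the middle.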


Modularity identification tuning:

Here we chose Nodes, then Color, then the Ranking tab, and then Modularity Class.

This shows nodes with a high Modularity Class value as deep-colored circles and nodes with a low value as faded circles.


Arranging Data in the Spreadsheets: (According to their PageRank and Modularity)

We created a table with four columns to record the pages according to their measurements:

In the first column, we added the links that have high modularity.

In the second column, we added the links that have low modularity.

In the third column, we added the links that have a high PageRank.

In the fourth column, we added the links that have a low PageRank.


Note: The top-ranked page, meaning the page with the highest PageRank according to Gephi, is treated as the root page (the main page, or origin page).

Calculations of the relative distances between nodes:

This measures the distance from the top-ranked page to every other page in the table (spreadsheet) created previously, according to their PageRank and Modularity.

The distances are shown as percentages:

To calculate the percentages, we used a small C program that takes the maximum PageRank value and all the other PageRank values and returns the relative distance between the origin node and every other node in the spreadsheet.
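The C program itself is not reproduced in this report, so the exact formula is not known. The Python sketch below shows one plausible reading, under the assumption that the relative distance of a page is the gap between the maximum PageRank and that page’s PageRank, expressed as a percentage of the maximum. The function name and the sample values are hypothetical.

```python
def relative_distances(page_ranks):
    """Percentage gap between the highest PageRank and each value (assumed formula)."""
    highest = max(page_ranks)
    if highest == 0:
        # Avoid dividing by zero when no page has any rank.
        return [0.0 for _ in page_ranks]
    return [(highest - value) / highest * 100.0 for value in page_ranks]


# Made-up PageRank values; the first one belongs to the origin page.
distances = relative_distances([0.4, 0.1, 0.2])
```

Under this assumption the origin page has a distance of 0%, and a page with a quarter of the origin’s PageRank has a distance of 75%.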


Output of the relative distances in a[i] (i = 1, 2, 3, 4, 5):



Plotting the Internal structure:

We followed two general rules of PageRank distribution to maintain link equity:

When the relative distance (as a percentage) from the origin page to another page is high, we need to pass PageRank to that page.

When the relative distance (as a percentage) from the origin page to another page is low, we leave that node as it is. Only when such a node also has low modularity do we need to pass modularity to it from another page with a high modularity count.

For example:


A is the origin page; it has a high PageRank but low modularity.

B is any page on the website with a low PageRank.

C is any page on the website with high modularity.


X- The two-way arrow between A and C means that both nodes pass PageRank (PR) and Modularity (MD) to each other to maintain link equity.

Y- Only A passes PR and MD to B to maintain link equity.

Z- Only C passes PR and MD to B to maintain link equity.
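The example above can also be written down as a small list of directed links, which makes it easy to check which pages each page receives PR and MD from. This is only an illustrative encoding of the X, Y, and Z connections; the variable names are made up.

```python
# Each connection from the example, expressed as (from_page, to_page) links.
link_plan = {
    "X": [("A", "C"), ("C", "A")],  # two-way link between A and C
    "Y": [("A", "B")],              # A passes to B only
    "Z": [("C", "B")],              # C passes to B only
}

# Collect, for every page, the pages that link to it.
inbound = {}
for links in link_plan.values():
    for source, target in links:
        inbound.setdefault(target, []).append(source)
```

Reading `inbound` shows that B receives links from both A and C, while A and C receive links only from each other.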

The site’s linking structure, which benefits the homepage:

5. Conclusion:

This analysis covers the basics of the internal linking structure and gives a general overview of the network visualization. It is an important analysis that will help design a better internal linking strategy in the near future.
Once the data has been gathered, analyzed, and calculated, the next step is the internal linking strategy, which will pass PR and MD in such a way that every page ends up with an equal PR and MD.

Through Gephi, we now have a basic strategy to maintain link equity across the whole website.