Visualization Of Data In Python Part 1

“Every second of every day, our senses bring in way too much data than we can possibly process in our brains.” – Peter Diamandis, Chairman/CEO, X-Prize Foundation.

 

However far-fetched and exaggerated the quote may seem, it is true. The amount of data that surrounds us is immense. This data provides us with information, which in turn provides us with insights. But while collecting a lot of data may seem very useful, the sheer quantity is a disadvantage in itself: it is not humanly possible to reach a satisfactory conclusion simply by inspecting the raw values. This is where Data Visualization comes in handy. This blog covers how data can be visualized using a very common language, Python.

 

Why Python?

Python is an object-oriented, open-source, flexible, and easy-to-use programming language that is widely used by Data Analysts and Data Scientists. Its wide variety of libraries also makes it easier to analyze data.

How to visualize data using Python?

Python has two widely used libraries for Data Visualization. They are:

  • Matplotlib: Matplotlib is a Python-based plotting library. It mainly provides 2D visualizations of data while also supporting limited 3D visualizations.
  • Seaborn: Seaborn is built on top of Matplotlib. It provides features such as numerous color palettes, themes, and tools for visualizing statistical data, including time series.

 

How to visualize data using Python libraries? 

Histogram

Histograms are used to visualize the distribution of continuous data. From a histogram, we can read the shape of a variable's distribution and roughly estimate statistics such as its mean, median, and mode.

Creating a Histogram in Python:

The dataset used here is the iris dataset, loaded with seaborn's load_dataset function (the same data also ships with the scikit-learn package).

import matplotlib.pyplot as plt #needed for the plt calls below

import seaborn as sns

df = sns.load_dataset("iris")

#Creating the histogram plot selecting the required continuous variable from the dataset

df.sepal_length.hist()

plt.title("Sepal Length Distribution") #setting plot title

plt.xlabel("Sepal Length") #setting x-axis label

plt.ylabel("Count of iris plants") #setting y-axis label

plt.show()
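
To make clear what a histogram actually draws, the sketch below counts values into equal-width bins using plain Python. The function name bin_counts and the sample values are hypothetical and chosen only for illustration; plotting libraries may choose bins and handle edge cases differently.

```python
def bin_counts(values, n_bins=10):
    """Return (edges, counts) for equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: a single bin holds everything.
        return [lo, hi], [len(values)]
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical sepal-length-like measurements (not taken from the iris data).
sample = [4.4, 4.9, 5.0, 5.1, 5.8, 6.3, 6.9, 7.6]
edges, counts = bin_counts(sample, n_bins=4)

# Print a rough text histogram: one '#' per value in each bin.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {'#' * count}")
```

Each bar in a plotted histogram corresponds to one entry of counts, with its left and right sides given by neighbouring entries of edges.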


Box Plot

Box Plots show the spread of data from the minimum to the maximum value, with a box marking the first quartile, the median, and the third quartile. Box Plots are used extensively to detect outliers in a given dataset. Any data point above the upper limit or below the lower limit is usually considered an outlier.

 

Upper Limit: Q3  +  1.5 * (Q3 – Q1), and,

Lower Limit: Q1  –  1.5 * (Q3 – Q1),

 

Where Q3 and Q1 are the third and first quartiles respectively, and (Q3 – Q1) is the Inter-Quartile Range. Box plots can also be used to compare variables: two continuous variables side by side, or one continuous variable split by a categorical variable.
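
The limits above can be worked through on a small example. This is a minimal sketch using Python's standard statistics module; the function name iqr_limits and the sample data are hypothetical, and the "inclusive" quartile method is only one of several conventions, so other tools may report slightly different quartiles.

```python
import statistics


def iqr_limits(values):
    """Return (lower_limit, upper_limit) using the 1.5 * IQR rule."""
    # method="inclusive" is an assumption; other quartile conventions exist.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical measurements with one obviously extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
lower, upper = iqr_limits(data)
outliers = [v for v in data if v < lower or v > upper]

print(f"Lower limit: {lower}, upper limit: {upper}")
print(f"Outliers: {outliers}")
```

Here Q1 is 3.25 and Q3 is 7.75, so the limits are -3.5 and 14.5, and only the value 30 falls outside them.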

 

Creating a Box Plot in Python:

The iris dataset is used here as well.

 

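import matplotlib.pyplot as plt #needed for plt.boxplot and plt.show

import seaborn as sns

df = sns.load_dataset("iris")

plt.boxplot(df['sepal_length']) #Creating a boxplot using matplotlib

plt.show() #showing the first plot before drawing the second

sns.boxplot(x=df['sepal_length']) #Creating a boxplot using seaborn

plt.show()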


Box Plot of sepal length using matplotlib


Box Plot of sepal length using seaborn

 

Bar Charts and Stacked Bar Charts

Bar Charts are mostly used to compare values across the categories of a categorical variable. Stacked Bar Charts are usually used to compare multiple metrics across different categories.

Creating Bar Chart and Stacked Bar Chart in Python:

Here again, the iris dataset has been used.

import matplotlib.pyplot as plt #needed for plt.show

import seaborn as sns

df = sns.load_dataset("iris")

 

df.groupby('species').mean().plot(kind = 'bar')   #grouping by "species" and plotting the mean dimensions of each iris species

df.groupby('species').mean().plot(kind = 'bar', stacked = True, color = ['red', 'blue', 'green', 'yellow'], grid = False)  #creating stacked bar chart

plt.show()


Bar Chart comparing mean dimensions of different iris flower species


Stacked Bar Chart comparing mean dimensions between different iris flower species
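
The height of each bar is the mean of one measurement within one category. As a rough illustration of that grouping step, here is a plain-Python sketch; the helper group_means and the rows are hypothetical (the values are made up, not taken from the iris data), and pandas performs the equivalent work far more efficiently.

```python
from collections import defaultdict


def group_means(rows, key, value):
    """Average rows[value] separately for each distinct rows[key]."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {name: sum(vals) / len(vals) for name, vals in groups.items()}


# Hypothetical rows shaped like the iris data (values are made up).
rows = [
    {"species": "setosa", "sepal_length": 5.0},
    {"species": "setosa", "sepal_length": 5.5},
    {"species": "virginica", "sepal_length": 6.0},
    {"species": "virginica", "sepal_length": 7.0},
]

means = group_means(rows, key="species", value="sepal_length")
for species, mean in means.items():
    print(f"{species}: {mean}")
```

A stacked bar chart uses the same per-category means, but places the bars for the different measurements on top of one another instead of side by side.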

Line Chart

Line Charts are usually plotted to study time series data. They are mostly used to detect trends over time.

Creating a Line Chart in Python:

Here the flights dataset from the seaborn package has been used.

import matplotlib.pyplot as plt #needed for the plt calls below

import seaborn as sns

df1 = sns.load_dataset("flights")

 

#This creates a trend line of the number of flight passengers per month; the flights dataset covers the years 1949 to 1960

df1.plot(x = 'month', y = 'passengers', kind = 'line')

 

plt.ylabel("#passengers")

plt.xlabel("months")

plt.show()


Line Chart showing the trend in the number of flight passengers over the months
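
Because month names repeat every year, a line chart that spans several years reads best on a single chronological axis. The sketch below shows one way to build such an axis with the standard library. It assumes month names are stored as three-letter English abbreviations, and the helper to_date and the rows are hypothetical (the passenger counts are made up).

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def to_date(year, month_abbr):
    """Map a (year, month abbreviation) pair to the first day of that month."""
    return date(year, MONTHS.index(month_abbr) + 1, 1)


# Hypothetical (year, month, passengers) rows, deliberately out of order.
records = [(2001, "Feb", 120), (2000, "Dec", 95), (2001, "Jan", 110)]

# Sorting on the combined date puts the rows in chronological order.
series = sorted((to_date(y, m), p) for y, m, p in records)
for day, passengers in series:
    print(day.isoformat(), passengers)
```

The sorted dates can then serve as the x-axis values and the passenger counts as the y-axis values of the line chart.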
