“Every second of every day, our senses bring in way too much data than we can possibly process in our brains.” – Peter Diamandis, Chairman/CEO, X-Prize Foundation.
However far-fetched or exaggerated the quote may seem, it is nothing but the truth. The amount of data that surrounds us is immense. These data provide us with information, which in turn provides us with insights. But while collecting a lot of data may seem very useful, the sheer quantity of data is a disadvantage in itself. Simply by looking at the raw data, it is not humanly possible to reach any satisfactory conclusion. This is where Data Visualization comes in handy. This blog covers how data can be visualized using a very common language, Python.
Why Python?
Python is an object-oriented, open-source, flexible, and easy-to-use programming language that is widely used by Data Analysts and Data Scientists. Its wide variety of libraries also makes it easier to analyze data.
How to visualize data using Python?
Python has two popular libraries for Data Visualization, both of which are used in the examples in this blog. They are Matplotlib and Seaborn.
How to visualize data using Python libraries?
Histogram
Histograms are used to visualize the distribution of continuous data. By visualizing data using a histogram, we can approximate statistical values like the mean, median, mode of data, and the distribution of the variable.
Creating a Histogram in Python:
The data set used here is the iris data set, loaded through seaborn's load_dataset function. The same data set is also part of the scikit-learn package.
import matplotlib.pyplot as plt
import seaborn as sns
df = sns.load_dataset("iris")
df.sepal_length.hist()
#Creating the histogram plot selecting the required continuous variable from the dataset
plt.title("Sepal Length Distribution") #setting plot title
plt.xlabel("Sepal Length") #setting x-axis label
plt.ylabel("Count of iris plants") #setting y-axis label
plt.show()
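To see what a histogram actually counts, here is a minimal sketch in plain Python, assuming equal-width bins between the minimum and maximum value. The function name histogram_counts and the sample values are made up for illustration. Plotting libraries may treat values that fall exactly on a bin edge slightly differently.

```python
def histogram_counts(values, bins=10):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 if all values are identical.
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The last bin is closed on the right so the maximum is counted too.
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Made-up sample values, not taken from the iris data set.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]
edges, counts = histogram_counts(sample, bins=4)
print(edges)
print(counts)
```

The height of each bar in the histogram is one entry of counts, and the bar spans the interval between two neighbouring entries of edges.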
Box Plot
Box Plots show the variation of data ranging from the minimum to the maximum value. Box Plots are used extensively to detect outliers in a given dataset. Any data point outside the upper and lower limits is usually considered an outlier.
Upper Limit: Q3 + 1.5 * (Q3 - Q1), and,
Lower Limit: Q1 - 1.5 * (Q3 - Q1),
where Q3 and Q1 are the third and first quartiles respectively, and (Q3 - Q1) is the Inter-Quartile Range. Box plots can also compare two continuous variables side by side, or show one continuous variable broken down by a categorical variable.
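The two limits above can be computed directly. The following is a minimal sketch using only the Python standard library. The function name iqr_limits and the sample values are made up for illustration, and the "inclusive" quantile method is an assumption chosen because it interpolates linearly between data points. Other quartile conventions give slightly different limits.

```python
from statistics import quantiles


def iqr_limits(values):
    """Return (lower_limit, upper_limit) from the 1.5 * IQR rule."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # Inter-Quartile Range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Made-up sample with one obviously extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
lower, upper = iqr_limits(sample)
outliers = [v for v in sample if v < lower or v > upper]
print(lower, upper, outliers)
```

Any value listed in outliers would be drawn as a separate point beyond the whiskers of the box plot.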
Creating a Box Plot in Python:
The iris dataset is used here as well.
import matplotlib.pyplot as plt
import seaborn as sns
df = sns.load_dataset("iris")
plt.boxplot(df['sepal_length']) #Creating a boxplot using matplotlib
plt.show()
sns.boxplot(x=df['sepal_length']) #Creating a boxplot using seaborn
plt.show()
Fig: Box Plot of sepal length using matplotlib
Fig: Box Plot of sepal length using seaborn
Bar Charts and Stacked Bar Charts
Bar Charts are mostly used to compare values across the categories of a categorical variable. Stacked Bar Charts are usually used to compare multiple metrics across different categories.
Creating Bar Chart and Stacked Bar Chart in Python:
Here again, the iris dataset has been used.
import matplotlib.pyplot as plt
import seaborn as sns
df = sns.load_dataset("iris")
df.groupby('species').mean().plot(kind = 'bar') #selecting "species" as the independent variable and plotting the mean dimensions of the different species of iris flowers
df.groupby('species').mean().plot(kind = 'bar', stacked = True, color = ['red', 'blue', 'green', 'yellow'], grid = False) #creating stacked bar chart
plt.show()
Fig: Bar Chart comparing mean dimensions of different iris flower species
Fig: Stacked Bar Chart comparing mean dimensions between different iris flower species
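To make the grouping and stacking concrete, here is a minimal sketch in plain Python of the numbers behind such charts: the mean of each metric per category, and the offsets at which each segment of a stacked bar starts. The function names group_means and stacked_segments are hypothetical, and the measurements are made up rather than taken from the iris data.

```python
from collections import defaultdict
from statistics import mean


def group_means(rows, key, fields):
    """Average each field separately for every distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {
        name: {f: mean(r[f] for r in members) for f in fields}
        for name, members in groups.items()
    }


def stacked_segments(values):
    """Return (metric, bottom, height) so each segment sits on the previous one."""
    segments, bottom = [], 0.0
    for metric, height in values.items():
        segments.append((metric, bottom, height))
        bottom += height
    return segments


# Made-up measurements for two hypothetical categories "a" and "b".
rows = [
    {"species": "a", "length": 5.0, "width": 3.0},
    {"species": "a", "length": 5.5, "width": 3.5},
    {"species": "b", "length": 6.0, "width": 2.5},
    {"species": "b", "length": 7.0, "width": 3.0},
]
means = group_means(rows, "species", ["length", "width"])
print(means)
print(stacked_segments(means["a"]))
```

A plain bar chart draws every height starting from zero, while a stacked bar chart draws each height starting from its bottom offset.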
Line Chart
Line Charts are usually plotted to study time series data. They are mostly used to detect trends over time.
Creating a Line Chart in Python:
Here the flights dataset from the seaborn package has been used.
import matplotlib.pyplot as plt
import seaborn as sns
df1 = sns.load_dataset("flights")
df1.plot(x = 'month', y = 'passengers', kind = 'line')
#This creates a trend line of the number of passengers taking flights over the months of the period covered by the dataset (1949 to 1960)
plt.ylabel("#passengers")
plt.xlabel("months")
plt.show()
Fig: Line Chart showing the trend in the number of passengers taking flights over the months
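A line chart joins the points in the order they are given, so time series records should be in chronological order before plotting. Here is a minimal sketch in plain Python, assuming each record has separate year and month fields. The function name to_time_series and the passenger counts are made up for illustration.

```python
def to_time_series(records):
    """Sort records chronologically and build one label per point."""
    ordered = sorted(records, key=lambda r: (r["year"], r["month"]))
    labels = [f'{r["year"]}-{r["month"]:02d}' for r in ordered]
    values = [r["passengers"] for r in ordered]
    return labels, values


# Made-up records, deliberately out of order.
records = [
    {"year": 2001, "month": 2, "passengers": 120},
    {"year": 2000, "month": 12, "passengers": 110},
    {"year": 2001, "month": 1, "passengers": 115},
    {"year": 2000, "month": 11, "passengers": 100},
]
labels, values = to_time_series(records)
print(labels)
print(values)
```

The labels would go on the x-axis and the values on the y-axis, which keeps months from different years from being mixed up.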