EDA Visualization

4 min read · Sep 14, 2020

EDA (Exploratory Data Analysis) visualization helps you understand your data and can surface insights or problems worth investigating further.

A good article about EDA visualization: https://medium.com/@tarammullin/python-data-visualization-for-exploratory-data-analysis-eda-fafcf6ecdadc

1.0 Import Packages

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

2.0 Import Data

The data comes from Kaggle: Restaurant Data with Consumer Ratings (provided by UCI). It includes restaurant, customer, and rating data, and you could build a recommendation system on top of it.
For this article, I only use userprofile.csv and rating_final.csv.

Source: https://www.kaggle.com/uciml/restaurant-data-with-consumer-ratings?select=userprofile.csv

data_profile = pd.read_csv('userprofile.csv')
data_rating = pd.read_csv('rating_final.csv')

3.0 Glance Data

It is a good habit to glance at your data before building models because it helps you judge whether anything needs further cleaning.
My previous article explains the purpose of the following steps: https://medium.com/@p1234567834/eda-and-data-preprocessing-bbebfac54488

data_profile.head()

data_profile.shape


data_profile.dtypes

data_profile.describe()

data_rating.head()
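A missing-value count fits naturally into the same first look. The following is only a minimal sketch on a small made-up frame, not the real userprofile.csv: the rows are invented, and the idea that missing entries are written as the string "?" is an assumption you should verify against your own copy of the file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_profile; all values are invented for illustration.
toy_profile = pd.DataFrame({
    "userID": ["U1001", "U1002", "U1003", "U1004"],
    "smoker": ["false", "?", "true", "false"],
    "weight": [69, 40, 60, 44],
})

# Assumption: missing entries are encoded as the string "?".
toy_profile = toy_profile.replace("?", np.nan)

# Number of missing values per column.
missing_counts = toy_profile.isna().sum()
print(missing_counts)

# Number of distinct non-missing values per column.
print(toy_profile.nunique())
```

If the counts are all zero on the real data, check whether missing values are hidden behind a placeholder string before trusting the result.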

4.0 Combine Data

Because I want numeric variables for more interesting charts such as scatter plots, I combine userprofile.csv and rating_final.csv on the key userID.

data = data_rating.merge(data_profile, on='userID', how='left')
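After a left merge it can be worth checking that no rows were duplicated and seeing how many ratings found no matching profile. Here is a rough sketch on invented toy frames (the names toy_rating and toy_profile and all values are hypothetical); validate and indicator are standard options of DataFrame.merge.

```python
import pandas as pd

# Toy stand-ins for data_rating and data_profile; values are invented.
toy_rating = pd.DataFrame({
    "userID": ["U1", "U1", "U2", "U3"],
    "placeID": [10, 11, 10, 12],
    "rating": [2, 1, 0, 2],
})
toy_profile = pd.DataFrame({
    "userID": ["U1", "U2"],
    "weight": [69, 40],
})

merged = toy_rating.merge(
    toy_profile,
    on="userID",
    how="left",
    validate="many_to_one",  # raises if userID is duplicated in toy_profile
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# A left merge on a unique key keeps exactly one row per rating.
print(merged.shape)

# "left_only" rows are ratings whose user has no profile.
print(merged["_merge"].value_counts())
```

Rows marked left_only carry NaN in every profile column, which matters later for plots and correlations.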

5.0 Histogram

A histogram-style count plot shows the distribution of a categorical variable: one bar per category, with the bar height equal to the number of rows in that category. For example, we can see how many customers gave a restaurant a rating of 1.

sns.catplot(x='rating', data=data, kind='count')
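The numbers behind the bars can be read directly with value_counts. This is a small sketch on invented ratings, not the real rating_final.csv; the frame name toy is hypothetical.

```python
import pandas as pd

# Invented ratings for illustration only.
toy = pd.DataFrame({"rating": [0, 1, 2, 2, 1, 2, 0, 2]})

# Absolute counts per rating value, ordered by rating.
counts = toy["rating"].value_counts().sort_index()

# The same distribution as shares of all rows.
shares = toy["rating"].value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(2))
```

Printing the shares next to the chart makes it easier to quote exact figures instead of reading them off the axis.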

5.0 Boxplot

A boxplot combines one categorical variable and one numerical variable.
It is a good way to see the quartiles of the numerical variable for each category.
Although describe() reports the same statistics, they are easier to compare visually in a boxplot.

sns.catplot(x='dress_preference', y='rating', data=data, kind='box')
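To cross-check what the boxes show, the per-category quartiles can also be computed with groupby. Below is a minimal sketch on invented rows; the category labels and ratings are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented data: two dress preferences with three ratings each.
toy = pd.DataFrame({
    "dress_preference": ["formal", "formal", "formal",
                         "informal", "informal", "informal"],
    "rating": [0, 1, 2, 2, 2, 1],
})

# Quartiles of rating within each category: one row per category,
# one column per quantile (0.25, 0.5, 0.75).
quartiles = (
    toy.groupby("dress_preference")["rating"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```

The 0.25 and 0.75 columns correspond to the bottom and top edges of each box, and the 0.5 column to the line inside it.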

6.0 Heatmap

It is always good to check the correlation between independent variables before fitting a regression.
If two independent variables are highly correlated (multicollinearity), you should address it, for example by dropping one of them, because it makes the regression coefficients unstable and hard to interpret.
A heatmap is a good way to see the correlations between numeric variables.

data_heatmap = data[['rating', 'food_rating', 'service_rating']]
data_heatmap.corr()

sns.heatmap(data_heatmap.corr(), annot=True)
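With more than a handful of columns, it can help to list the strongly correlated pairs rather than read them off the heatmap. The sketch below uses invented ratings, and the 0.8 cut-off is a hypothetical threshold, not a rule.

```python
import numpy as np
import pandas as pd

# Invented ratings for illustration only.
toy = pd.DataFrame({
    "rating":         [0, 1, 2, 2, 1, 0, 2, 1],
    "food_rating":    [0, 1, 2, 2, 1, 0, 2, 2],
    "service_rating": [1, 0, 2, 1, 1, 0, 2, 0],
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (always 1.0) is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

threshold = 0.8  # hypothetical cut-off; choose one that suits your model

# Flatten to (variable, variable) pairs and keep the strong ones.
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs.abs() > threshold]
print(high_pairs)
```

Taking the absolute value means strong negative correlations are flagged as well as positive ones.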

7.0 Scatter Plot

Sometimes we want to see whether two numerical variables have a positive or negative linear relationship, and a scatter plot helps with that.

sns.relplot(x='weight', y='height', kind='scatter', data=data, alpha=0.4)
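The direction you read off the scatter plot can be backed up with a number. Here is a small sketch on invented weight and height values (not the real user profiles) that computes the Pearson correlation and the slope of a fitted straight line.

```python
import numpy as np
import pandas as pd

# Invented weight (kg) and height (m) values for illustration only.
toy = pd.DataFrame({
    "weight": [50, 60, 70, 80, 90],
    "height": [1.55, 1.65, 1.70, 1.80, 1.85],
})

# Pearson correlation: the sign gives the direction of the linear
# relationship, the magnitude gives its strength.
r = toy["weight"].corr(toy["height"])

# Least-squares straight line: height ≈ slope * weight + intercept.
slope, intercept = np.polyfit(toy["weight"], toy["height"], 1)

print(round(r, 3), round(slope, 4))
```

A positive r and slope match an upward-sloping cloud of points; values near zero suggest there is no linear relationship to speak of.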

If this article helped you understand basic EDA visualization, please clap for this post. Thanks!