EDA (Exploratory Data Analysis) visualization helps you understand your data, and it can surface insights or problems that deserve a closer look.
A good article about EDA visualization: https://medium.com/@tarammullin/python-data-visualization-for-exploratory-data-analysis-eda-fafcf6ecdadc
1.0 Import Packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
2.0 Import Data
The data comes from Kaggle: Restaurant Data with Consumer Ratings (provided by UCI). It includes restaurant, customer, and rating data, so you could build a recommendation system on top of it.
For this article, I use only userprofile.csv and rating_final.csv.
Source: https://www.kaggle.com/uciml/restaurant-data-with-consumer-ratings?select=userprofile.csv
data_profile = pd.read_csv('userprofile.csv')
data_rating = pd.read_csv('rating_final.csv')
3.0 Glance Data
It is a good habit to glance at your data before building models because it helps you assess whether anything needs further cleaning.
You could refer to my last article to understand the purpose of the following steps: https://medium.com/@p1234567834/eda-and-data-preprocessing-bbebfac54488
data_profile.head()
data_profile.shape
data_profile.dtypes
data_profile.describe()
data_rating.head()
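If you would rather see the key facts about every column in one table, here is a minimal sketch. It assumes a small made-up frame (toy_profile) in place of userprofile.csv, and overview() is a hypothetical helper of mine, not a pandas function.

```python
import pandas as pd

# Hypothetical toy frame standing in for userprofile.csv; the values are made up.
toy_profile = pd.DataFrame({
    "userID": ["U1001", "U1002", "U1003", "U1004"],
    "weight": [69, 40, 60, None],
    "dress_preference": ["informal", "informal", "formal", None],
})

def overview(df):
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),   # NaN/None entries per column
        "unique": df.nunique(),       # distinct non-missing values per column
    })

summary = overview(toy_profile)
print(summary)
```

Columns with many missing values or only one distinct value are usually the first candidates for cleaning.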
4.0 Combine Data
Because I want numeric variables for more interesting charts such as scatter plots, I need to combine userprofile.csv and rating_final.csv on the key userID.
data = data_rating.merge(data_profile, on='userID', how='left')
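A left merge keeps every rating row, but it is worth checking that it behaved as expected. The sketch below uses two made-up frames (ratings and profiles are my own toy data, not the Kaggle files) and two optional merge arguments, validate and indicator, to confirm that no rows were duplicated and to count ratings without a matching profile.

```python
import pandas as pd

# Hypothetical toy data; the real files have many more rows and columns.
ratings = pd.DataFrame({
    "userID": ["U1", "U1", "U2", "U3"],
    "placeID": [10, 11, 10, 12],
    "rating": [2, 1, 0, 2],
})
profiles = pd.DataFrame({
    "userID": ["U1", "U2"],
    "weight": [69, 40],
})

merged = ratings.merge(
    profiles,
    on="userID",
    how="left",
    validate="many_to_one",  # raises if userID is not unique in profiles
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# A left merge against unique keys must keep the row count of the left frame.
same_row_count = len(merged) == len(ratings)
# Ratings whose user has no profile get NaN in the profile columns.
unmatched = int((merged["_merge"] == "left_only").sum())
print(same_row_count, unmatched)
```

If unmatched is large, the profile columns will be mostly empty and charts built on them can mislead.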
5.0 Histogram
A count plot, which works like a histogram for discrete values, is a good tool for seeing the distribution of a categorical variable. For example, we could see how many times restaurants were given a rating of 1.
sns.catplot(x='rating', data=data, kind='count')
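The bar heights in a count plot are simply the frequencies of each value, so you could cross-check the chart numerically. This is a minimal sketch on a made-up rating column (toy is hypothetical data, not the merged dataset).

```python
import pandas as pd

# Hypothetical ratings; the values are invented for illustration only.
toy = pd.DataFrame({"rating": [0, 1, 1, 2, 2, 2]})

# Absolute frequency of each rating, ordered by the rating value.
counts = toy["rating"].value_counts().sort_index()
# Relative frequency, handy when comparing datasets of different sizes.
shares = toy["rating"].value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```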
5.0 Boxplot
A boxplot pairs one categorical variable with one numerical variable.
It is a good way to see the quartiles of the numerical variable within each category.
Although the function describe() reports the same statistics, they are easier to read at a glance from a boxplot.
sns.catplot(x='dress_preference', y='rating', data=data, kind='box')
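To see the numbers behind the boxes, you could compute the quartiles per category yourself. A minimal sketch, assuming a small made-up frame (toy) with the same two column names:

```python
import pandas as pd

# Hypothetical data; only the column names mirror the merged dataset.
toy = pd.DataFrame({
    "dress_preference": ["formal", "formal", "formal",
                         "informal", "informal", "informal"],
    "rating": [0, 1, 2, 2, 2, 1],
})

# First quartile, median, and third quartile of rating for each category.
# unstack() turns the quantile levels into columns, one row per category.
quartiles = (
    toy.groupby("dress_preference")["rating"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```

The 0.25 and 0.75 columns correspond to the bottom and top edges of each box, and the 0.5 column to the line inside it.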
6.0 Heatmap
It is always good to check the correlation of independent variables before implementing regression.
If two independent variables are highly correlated (multicollinearity), you should deal with it, for example by dropping one of them, because it can make the regression coefficients unstable.
Heatmap is a good way to see the correlation between numeric variables.
data_heatmap = data[['rating', 'food_rating', 'service_rating']]
data_heatmap.corr()
sns.heatmap(data_heatmap.corr(), annot=True)
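With only three variables the heatmap is easy to read, but with more columns you may want a list of the highly correlated pairs. The sketch below uses made-up values (toy) and a cutoff of 0.8, which is my own assumption rather than a fixed rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names mirror the dataset.
toy = pd.DataFrame({
    "rating": [0, 1, 2, 2, 1, 0],
    "food_rating": [0, 1, 2, 2, 1, 1],
    "service_rating": [2, 0, 1, 2, 0, 1],
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the cells above the diagonal so each pair appears once
# and the trivial self-correlations are masked out.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

THRESHOLD = 0.8  # assumed cutoff; choose one that suits your problem
high = pairs[pairs.abs() > THRESHOLD]
print(high)
```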
7.0 Scatter Plot
Sometimes we want to see whether two numerical variables have a positive or a negative linear relationship, and a scatter plot helps with that.
sns.relplot(x='weight', y='height', kind='scatter', data=data, alpha=0.4)
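If the direction is hard to judge by eye, the Pearson correlation coefficient gives a quick numeric answer. A minimal sketch on made-up weight and height values (toy is hypothetical data):

```python
import pandas as pd

# Hypothetical measurements, invented for illustration only.
toy = pd.DataFrame({
    "weight": [50, 60, 70, 80],
    "height": [1.55, 1.65, 1.72, 1.85],
})

# Pearson correlation: close to +1 means a strong positive linear relationship,
# close to -1 a strong negative one, and close to 0 little linear relationship.
r = toy["weight"].corr(toy["height"])
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(round(r, 3), direction)
```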
If this article helped you understand basic EDA visualization, please clap for this post. Thanks!