Purpose of EDA and Data Preprocessing
Before building models, it is both helpful and necessary to perform EDA and data preprocessing: EDA helps you find patterns in the data, and data preprocessing cleans up the mess, which can improve the accuracy of your models.
What is the difference between EDA and Data Preprocessing?
EDA (Exploratory Data Analysis) is the first step you take when you receive the data. In this step, you check whether there are null values in the data, what the attributes of the data are (e.g., data types), and how the data is distributed. You can compute summaries and create visualizations to get some basic ideas about the data.
Data preprocessing is the second step. We need to convert variables with a non-numeric data type to numeric ones because our models calculate with numbers. We can also find a way to fill the NA values; this step is called data cleansing. We can also convert categorical variables into dummy variables.
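For example, here is a minimal sketch of converting a categorical column into dummy variables with pandas (the DataFrame and its 'Color' column are hypothetical):

import pandas as pd

# A toy DataFrame with one categorical column (hypothetical data)
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Red']})

# get_dummies() creates one binary (0/1) column per category
dummies = pd.get_dummies(df['Color'], prefix='Color')
# dummies now has the columns Color_Blue and Color_Red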
Content
- Import Packages
- Import Data
- Glance at the Data
- Missing Values
- Data Range
- Duplicated Values
1.0 Import Packages
import numpy as np # Conventionally rename numpy as np
import pandas as pd # Conventionally rename pandas as pd
The reason why we import NumPy is that it helps us deal with arrays in an efficient way, for example through array calculations.
Pandas, on the other hand, is a popular Python package for data science since it provides powerful data structures that make data manipulation and analysis easy.
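For instance, a tiny sketch of what array calculation means: NumPy applies an operation to the whole array at once, without an explicit loop.

arr = np.array([1, 2, 3, 4])
arr * 2       # element-wise calculation: array([2, 4, 6, 8])
arr.mean()    # aggregate over the whole array: 2.5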
2.0 Import Data
In the following sections, I will use a dataset from Kaggle, Women’s E-Commerce Clothing Reviews.
Source: https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews
This dataset includes 23,486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:
- Clothing ID: Integer categorical variable that refers to the specific piece being reviewed.
- Age: Positive integer variable of the reviewer's age.
- Title: String variable for the title of the review.
- Review Text: String variable for the review body.
- Rating: Positive ordinal integer variable for the product score granted by the customer, from 1 (worst) to 5 (best).
- Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
- Positive Feedback Count: Positive integer documenting the number of other customers who found this review positive.
- Division Name: Categorical name of the product's high-level division.
- Department Name: Categorical name of the product's department.
- Class Name: Categorical name of the product's class.
data = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')
3.0 Glance at the Data
data.head()
head() shows the first 5 rows of the data. It helps you glance at what information is included in each column.
data.shape
shape shows the number of rows and columns of the data. For example, the output indicates that the data has 23,486 rows (observations) and 11 columns (features). The 11th column, one more than the 10 features listed above, is the unnamed index column, which we drop below.
data.columns
columns shows all the column names of your data. When the data has lots of columns, this method is a useful way to find a column name.
data.dtypes
dtypes shows the data type of each column. It is important to make sure each column has the right data type before further analysis. For instance, Age should be a numeric value instead of a string. (The output shows Age is int64, which means Age is stored as an integer, the right data type.)
You can look up the meaning of the most common pandas data types below.
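- object: text (strings) or mixed types
- int64: integer numbers
- float64: floating-point numbers
- bool: True/False values
- datetime64: dates and times
- category: a finite set of categorical values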
data.describe()
describe() shows summary statistics of the numeric columns, such as count, mean, std, min, and max. It is useful for spotting abnormal values in min and max. For example, the Age column should not have a min less than 0.
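For example, a quick sanity check on the Age column (a minimal sketch):

# Select rows with an impossible age; an empty result means the column passes this check
data[data['Age'] < 0]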
# Drop 'Unnamed: 0' because it is useless
data = data.drop('Unnamed: 0', axis='columns')
Because the column 'Unnamed: 0' is useless (it just repeats the row index), I drop it first.
4.0 Missing Values
Some models or packages cannot tolerate any missing values. In this case, it is good to deal with missing values first.
data.isnull().sum()
isnull().sum() indicates how many missing values each column has. For instance, the Title column has 3,810 missing values.
data = data.dropna()
dropna() drops every row that contains a missing value. Remember to save the result back to data: if you just type data.dropna(), the result will not be saved. Hence, you should type data = data.dropna().
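Note that dropping rows is not the only option. If you would rather keep the rows, fillna() can replace the missing values instead (a sketch; the right fill value depends on the column):

# For example, fill missing review titles with an empty string instead of dropping the rows
data['Title'] = data['Title'].fillna('')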
data.isnull().sum()
After dropna(), it is good to check whether the missing values were removed successfully. The output indicates that there are no missing values left after dropna().
5.0 Data Range
In the sections above, I mentioned that describe() is a good way to check the min and max of each column.
Another thing to remember: if the data has a date column, it is good to check whether the dates are reasonable. For instance, some rows may contain dates from the "future", which is not reasonable. (This dataset does not include a date column.)
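Since this dataset has no date column, here is a hypothetical sketch of that check, assuming a column named 'Order Date':

# Hypothetical: parse the column as datetime, then look for rows dated in the future
data['Order Date'] = pd.to_datetime(data['Order Date'])
data[data['Order Date'] > pd.Timestamp.now()]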
data['Division Name'].value_counts()
Sometimes we also need to check whether a categorical column contains the right segments. value_counts() shows the count of each segment, and the output clearly shows that there are three segments. If 'Initmates' should not be included in the column 'Division Name', we can find those rows and drop them, as sketched below.
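If a segment really were invalid, here is a sketch of removing those rows (using 'Initmates' only as an illustration):

# Keep only the rows whose Division Name is not the unwanted segment
data = data[data['Division Name'] != 'Initmates']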
6.0 Duplicated Values
In some cases there are duplicated rows, i.e., two rows that have exactly the same values in all columns.
data.duplicated().sum()
duplicated().sum() shows how many duplicated rows there are. In this example, there are no duplicated rows.
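If duplicated rows do show up, drop_duplicates() removes them; remember to save the result back, just like with dropna():

# Keep the first occurrence of each duplicated row and drop the rest
data = data.drop_duplicates()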
In this article, I covered some basic problems we need to deal with before building a model. However, there are other topics worth studying further, such as EDA visualization and time data preprocessing. I will try to explain these topics in the future.
If this article helped you understand basic EDA and data preprocessing, please clap for this post. Thanks!