Data Cleaning

Tanwir Khan
11 min read · Nov 7, 2019

Cleaning the data is one of the most important steps in model creation. If our data is not clean, we will not get accurate predictions.

In this post we will discuss a few approaches that can help us make the data cleaner so that we can feed it to the model for better results. Before diving into the implementation, let’s understand the dataset.

Dataset — survey_results_public.csv

The dataset is taken from Stack Overflow’s Annual Developer Survey for the year 2019. It contains around 88K survey responses collected from over 170 countries and dependent territories. The survey aims at capturing the developer experience from different standpoints, such as career satisfaction, job search, education, and opinions about open source.

There are a total of 85 columns in the dataset.

So now let’s dive in and load the dataset and start analysing it to see what needs to be cleaned to make it more ingestible to a machine learning model.

1. Import the pandas package and load the data.

2. Check the shape of the dataset.

3. Check the columns in the dataframe. We’ll use the df.columns command for this.

4. Check the total number of rows.
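These four steps can be sketched as follows. Since the CSV itself isn’t bundled here, the snippet builds a tiny stand-in DataFrame with a few of the survey’s columns (values are illustrative); with the real file you would use the pd.read_csv() line instead:

```python
import pandas as pd

# 1. Import pandas and load the data.
# With the real file: df = pd.read_csv('survey_results_public.csv')
# A tiny stand-in frame so the snippet is self-contained:
df = pd.DataFrame({
    'Respondent': [1, 2, 3],
    'MainBranch': ['I am a developer by profession', None, 'I code as a hobby'],
    'Age': [25.0, None, 31.0],
})

# 2. Shape of the dataset: (rows, columns); the real file is roughly 88K rows x 85 columns
print(df.shape)

# 3. Columns in the dataframe
print(df.columns)

# 4. Total number of rows
print(len(df))
```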

Now that we have seen the columns in the dataset and the total rows present in the dataset let’s start checking the dataset in a more detailed way and clean it using different data cleaning approaches.

Data Cleaning : Approach — I

1. Removing missing data

The most important step in data preprocessing is checking whether the dataset has any missing values. If we are creating any kind of machine learning model, it won’t perform well with missing values/data.
One of the approaches to mitigate this problem is to remove the missing data from the dataset.

The way we do it is: delete a row if it contains missing values, or delete a whole column if around 70–75% of its data is missing.

This is not a hard threshold value; it mostly depends on how much data we are willing to sacrifice. The main disadvantage of this approach is that we end up losing important information, because we may delete a whole feature based on a few missing values.
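The column-threshold idea can be sketched like this with pandas (the 70% cutoff and the column names are illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({
    'Respondent': [1, 2, 3, 4],
    'mostly_missing': [None, None, None, 4.0],  # 75% of this column is missing
})

# Keep only columns whose fraction of missing values is at most 70%
threshold = 0.70
df_reduced = df.loc[:, df.isnull().mean() <= threshold]
print(df_reduced.columns)  # 'mostly_missing' is dropped
```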

5. Before doing this, let’s see what data types the columns have:

6. Check the total “NaN” values across all the columns:

7. Now that we have seen the columns that have missing values, we can remove them using the dropna() function:

8. Check if there are any null values present now.
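Steps 5 to 8 can be sketched like this on a small stand-in frame (with the real data, df would come from survey_results_public.csv):

```python
import pandas as pd

df = pd.DataFrame({
    'Respondent': [1, 2, 3, 4],
    'MainBranch': ['Developer', None, 'Hobbyist', 'Student'],
    'Age': [25.0, 34.0, None, 29.0],
})

# 5. Data types of the columns
print(df.dtypes)

# 6. Total "NaN" values across all the columns
print(df.isnull().sum())

# 7. Drop every row that contains at least one missing value
df_cleaned = df.dropna()

# 8. Verify that no null values remain
print(df_cleaned.isnull().sum().sum())
```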

This was one of the approaches, where we directly remove all the “NaN” values. But, as discussed above, this is not the most efficient way to do it, since we end up losing important information: deleting a whole feature based on a few missing values will eventually compromise the accuracy of our machine learning model.

Data Cleaning : Approach — II

Mean/Median/Mode Imputation for handling missing data

In this approach we will calculate the mean/median for numerical data and use the result to replace the missing values.

For missing values in case of categorical data we compute the mode and replace the missing data with the mode.

The benefit of this approach is that it prevents data loss; the disadvantage is that you cannot be sure how accurate the mean, median, or mode will be for a given use case.

9. To fill in the missing values of the numerical columns, we can calculate the median of each numerical column and use it to fill in the missing values.

Let’s see how we could do it:

Impute the numerical data of CompTotal column with its median. To do so, first find the median of the CompTotal column using the median() function of pandas, and then print it:

10. Now that we have computed the median for the numerical columns containing missing values, we can fill in those missing values with the computed medians.

11. Check whether the Age and CompTotal columns still contain any “NaN” values.
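Steps 9 to 11 in sketch form, with illustrative numbers standing in for the survey’s Age and CompTotal columns:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [22.0, None, 28.0, 35.0],
    'CompTotal': [50000.0, 72000.0, None, 61000.0],
})

# 9. Median of the CompTotal column (NaN values are skipped by default)
comp_median = df['CompTotal'].median()
print(comp_median)  # 61000.0

# 10. Fill the missing numerical values with the column medians
df['CompTotal'] = df['CompTotal'].fillna(comp_median)
df['Age'] = df['Age'].fillna(df['Age'].median())

# 11. Age and CompTotal no longer contain "NaN" values
print(df[['Age', 'CompTotal']].isnull().sum())
```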

12. To fill in the missing values of the categorical columns, we can calculate the mode of each categorical column and use it to fill in the missing values.

Let’s see how we could do it:

13. Now that we have computed the mode for the individual columns containing missing values, we can replace the missing data with the mode values of the respective columns.

14. Now, if you check the “MainBranch” and “OpenSource” columns, they have no missing values anymore.
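Steps 12 to 14 in sketch form (the category strings are illustrative; note that mode() returns a Series, so we take its first element):

```python
import pandas as pd

df = pd.DataFrame({
    'MainBranch': ['Developer', 'Developer', None, 'Student'],
    'OpenSource': ['Yes', None, 'Yes', 'No'],
})

# 12. Mode of each categorical column
main_mode = df['MainBranch'].mode()[0]
open_mode = df['OpenSource'].mode()[0]

# 13. Replace the missing data with the mode of the respective column
df['MainBranch'] = df['MainBranch'].fillna(main_mode)
df['OpenSource'] = df['OpenSource'].fillna(open_mode)

# 14. No missing values remain in either column
print(df.isnull().sum())
```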

Data Transformation in case of Numerical/Categorical data

15. Now that we have done the data cleaning, there are still a few steps that we need to carry out before feeding the data to the model.

In the above section we did some adjustments to the data by either removing the missing values or by replacing the missing values with mean/median/mode of that particular column.

The main goal here is to transform our data into a machine-learning digestible format. As all machine learning algorithms are mathematical computations, we need to transform all the columns into numerical format.

Let’s first understand the broader classification of data and try breaking down the broader categories into subcategories.

1. Numerical: Numerical data is anything quantifiable.
2. Categorical: Non-numeric data, generally strings, which is qualitative.

We can further break down numerical data into the following sub-categories:

1. Discrete: If you can count something, it’s discrete. For example, the number of passengers in a bus or the number of people attending a meeting: 1, 2, 3, 4.

2. Continuous: Numerical data that is measured rather than counted is continuous. For example, the weight of a person, or the time taken to travel from one location to another.

Similarly, categorical data can be broken down into the following sub-categories:

1. Ordered: Ordered (ordinal) data is bucketed into categories that have a natural ranking. For example, survey responses for a show can be: excellent, good, bad, worst.

2. Nominal: Categorical data that doesn’t have any order. For example, country.

Challenges with Categorical Data

There are challenges when dealing with categorical data, and most machine learning algorithms don’t work well with it. Decision trees can handle categorical data, but for most other algorithms we need to convert the categorical data to numerical form. If the desired output needs to be categorical, we can convert the numerical predictions back to categorical format.

Let’s see what are the challenges that we might face while dealing with categorical data:

  1. Data with high cardinality: We might have a few columns in our dataset with very high cardinality, which means they have a lot of unique values. For example, the ID column in the dataset has a unique value in every row.
  2. Variables with rare occurrences: We might have some data columns as well with very rare occurring variables.
  3. Frequent occurring variables: We might also have some data columns as well which occur many times with low variance.
  4. We might also encounter some data columns which won’t fit the model at all if we don’t process it.

To overcome all these above mentioned challenges we use the following methods:

  1. Encoding: In this method we encode the categorical data to numerical values. There are 3 types of encoding that we commonly follow:

i. Label encoding

ii. One-hot encoding

iii. Dummy encoding

2. Replacing: In this method we simply replace the categorical data with a number. This does not involve any logical processing.

That’s a lot of theory and definitions; let’s implement all of this and see how our data looks afterwards.

Dealing with Categorical data — Approach I

1. Replace categorical data with number

1. Find the categorical data in the current dataframe and then create a new dataframe containing only the categorical data. To do so, use the select_dtypes() function from pandas.

As you can see, out of the 85 columns, 79 are categorical.

2. Let’s take one categorical column, “OpenSourcer”, and see the unique categorical values in it, so that we can replace them with numerical values.

There are 4 categorical values in the “OpenSourcer” column

3. Find the frequency distribution of each categorical column. To do so, use the value_counts() function on each column. This function returns the counts of unique values in an object.

4. Replace the entries in the “OpenSourcer” column with numerical values as below
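Steps 1 to 4 of this approach can be sketched as follows. The OpenSourcer answer strings mirror the survey’s options, and the 0 to 3 mapping is an illustrative choice:

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25.0, 31.0, 28.0, 40.0],
    'OpenSourcer': [
        'Never',
        'Less than once per year',
        'Never',
        'Once a month or more often',
    ],
})

# 1. Create a new dataframe with only the categorical (object-dtype) columns
df_categorical = df.select_dtypes(include='object')
print(df_categorical.columns)

# 2. Unique values in the "OpenSourcer" column
print(df['OpenSourcer'].unique())

# 3. Frequency distribution of the column
print(df['OpenSourcer'].value_counts())

# 4. Replace each category with a number
df['OpenSourcer'] = df['OpenSourcer'].replace({
    'Never': 0,
    'Less than once per year': 1,
    'Less than once a month but more than once per year': 2,
    'Once a month or more often': 3,
})
print(df['OpenSourcer'].unique())
```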

Dealing with Categorical data — Approach II

2. Label Encoding

Label encoding is a technique in which we basically replace each value in the categorical column with numbers from 0 to n-1.

Let’s say we have a list of names in a column. After label encoding the data in that column, each name will be assigned a numerical label.

Disadvantages of this approach

This approach is not efficient in every case, because the model might make the mistake of interpreting the numerical values as weights assigned to the data.

This approach is best suited for ordinal data, where the categorical values have a natural order. For example, attitude toward something (i.e. strongly agree, agree, disagree, strongly disagree) or clothing sizes (i.e. small, medium, large, extra large). The scikit-learn library provides LabelEncoder(), which helps with label encoding.

  1. Remove all the missing data before doing the label-encoding step.

2. Select the categorical data from the “label_encoding” dataframe and create a new dataframe with only the categorical data.

3. Iterate through these categorical columns and convert them to numeric data using LabelEncoder(). To do so, import the sklearn.preprocessing package and use the LabelEncoder() class to transform the data:
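A sketch of the label-encoding loop, assuming scikit-learn is installed (the frame and its values are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'MainBranch': ['Developer', 'Student', 'Developer', 'Hobbyist'],
    'OpenSourcer': ['Never', 'Once a month or more often', 'Never', 'Never'],
})

# Select only the categorical (object-dtype) columns
df_categorical = df.select_dtypes(include='object')

# Iterate through the categorical columns; LabelEncoder assigns 0..n-1
# to the sorted unique values of each column
encoder = LabelEncoder()
for col in df_categorical.columns:
    df[col] = encoder.fit_transform(df[col])

print(df)
```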

We use fit_transform() here to apply the label encoder. Let’s understand the advantage of using fit_transform().

To center the data (make it have zero mean and unit variance), you subtract the mean and then divide the result by the standard deviation:

x′ = (x − μ) / σ

You do that on the training set. But then you have to apply the same transformation to your test set (e.g. in cross-validation), or to newly obtained examples before forecasting, using the same two parameters μ and σ that you computed on the training set.

Hence, every sklearn transformer’s fit() just calculates the parameters (e.g. μ and σ in the case of StandardScaler) and saves them as internal state. Afterwards, you can call its transform() method to apply the transformation to a particular set of examples.

fit_transform() joins these two steps and is used for the initial fitting of parameters on the training set x, but it also returns a transformed x′. Internally, it just calls first fit() and then transform() on the same data.
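A small StandardScaler example of this fit/transform split (the numbers are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [3.0], [5.0]])
X_test = np.array([[3.0]])

scaler = StandardScaler()

# fit_transform() learns mu and sigma from the training data and scales it
X_train_scaled = scaler.fit_transform(X_train)
print(scaler.mean_)  # mu learned from the training set: [3.]

# transform() reuses the SAME mu and sigma on new data
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)  # [[0.]] because 3.0 equals the training mean
```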

Dealing with Categorical data — Approach III

3. One-hot Encoding

In the previous approach we used label encoding to convert categorical data to numerical values. The values were assigned labels of the form 0, 1, 2, …. However, in predictive modeling, the machine learning algorithm might mistake these labels for some kind of order/weight. To avoid this confusion, we use one-hot encoding.

One-hot encoding works by breaking the label-encoded data down further into n columns, where n denotes the total number of unique labels generated while performing label encoding.

For example, if a column has 3 unique labels, after performing one-hot encoding the column will be divided into 3 different columns: column_1, column_2, column_3.

A new column is created for every level or category. To prefix each new column’s name with the original column name, write the following code:
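With pandas this can be sketched via get_dummies(), which prefixes each generated column with the original column name (the values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'OpenSourcer': ['Never', 'Once a month or more often', 'Never'],
})

# One-hot encode; each new column is named <column>_<category>
df_encoded = pd.get_dummies(df, columns=['OpenSourcer'])
print(df_encoded.columns)
```

Passing drop_first=True to get_dummies() keeps n−1 columns instead of n, which corresponds to the dummy-encoding variant listed earlier.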

Below is the github gist for the whole code.

https://gist.github.com/tkhan0/601237430879471c3ad89a1bc1711001
