These days we hear a lot about data and data forms the basis of the digital world. The human race is producing data every second with a button click and a web search. By the end of 2020, we have generated 44 zettabytes of data! One zettabyte is equal to a billion terabytes. All this data has revolutionized science and technology. This data also forms the core of advanced machine learning models. However, all of this data in raw form is not of much use to us and we cannot infer much from it unless it is in a specific format.
The process of converting raw data into a clean format to deduce useful information is called data wrangling. It is also called the data preprocessing stage. In this stage, the garbage and missing values in the data are handled. Hence, it is also called the data cleaning phase. In this article, we will discuss in detail data wrangling. We will see how we can prepare a dataset for actual processing and analysis in Python language.
Data Problem
Let’s start with an example. Suppose we want to sell a car. We have to find the right price for it. The price should be neither too low nor too high, given the specifications of the car. Now, a data analyst will think, are there other cars similar to my car? How much do they sell for? What are their prices given their specifications? The first thing a data analyst looks for while solving this problem is data. To continue with this example problem, we will use an open-source dataset of used cars by Jeffrey C. Schlimmer. We can download this dataset from here. There are 26 columns in total. Our target column is 26th, the price. We want to know the price of a car given the other columns as predictors.
Data Wrangling Methods
Data understanding:
The first step in the analysis of data is its acquisition and loading. We will use the Pandas library in Python to load and read our dataset file. Pandas is a widely used library for data manipulation. It uses a dataframe data structure to store data. The dataset can be saved locally or is available online, as in our case. Read_csv method takes the path. It considers the first row in the dataset as the header of the file. If this is not the case, as in ours, we can set it as none. The df is our dataframe.
pandas as pd DatasetPath = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv" df = pd.read_csv(DatasetPath, header = None)
We can print the top and bottom five rows using the head and tail functions.
print(df.head(5)) print(df.tail(5))
Now, we can set header names using the documentation provided here for easy analysis of our data. Otherwise, Pandas assigns numbers to each column.
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style","drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type","num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price"] df.columns = headers
The describe method gives a summary of the whole dataset.
print(df.describe())
It returns some useful statistical insights of data. The mean, standard deviation, minimum and maximum values of columns, etc. We can also get the datatypes of each column through,
print(df.dtypes)
It gives an overall idea of the nature of data.
Handling Missing Values
Each row in our dataset is a data point or a record. A record consists of column values. A record can have missing column values. These missing values show as a blank, question mark, 0, or a NaN. It is important to handle such values in the initial stages to ensure smooth processing. Different methods help to handle such missing values.
The first approach is to retrieve the unknown or missing values from the source and fill them in the data set. The second approach is to skip records with the missing values. Dropping data is dependent on the amount of data and missing values in it. If there are not many missing values, we can consider dropping the tables. The third approach is entering values in blank places considering the other data values. This approach includes a chance of introducing inaccurate values, since we are guessing in place of an original data value. Lastly, we can use the method of average values introduction. We take the mean value in the case of quantitative data and add it in place of a missing value. If the column is categorical or qualitative, we can use the most appeared value in place of the missing column value and this value is called a mode of the column.
Let’s apply these methods to our data.
df.dropna(subset=["peak-rpm"], axis=0, inplace=True)
dropna deletes the rows with the missing values in the specified column. Axis=0 indicates to delete rows, and axis=1 is to delete the whole column. The True Inplace argument makes changes to the dataframe directly.
We will replace the missing values in the city-mpg column with the average value of the city-mpg. Use replace() function.
avg_city_mpg = df["city-mpg"].astype(float).mean(axis=0) df["city-mpg"].replace(np.nan,avg_city_mpg,inplace=True)
Data Formatting
If the data under study is collected from multiple sources, there are chances of data inconsistency. Consistent data helps with accurate data analysis. Changing different formats into one consistent format for meaningful data processing is data formatting.
We can use dtype() to check the data type of every column. Then we can change the data type to a consistent one wherever needed. Astype() function changes datatype from one into another.
Data Normalization
Data normalization is an important concept for data analysis. Normalization ensures that all of our variables have the same effect on the model irrespective of data values. For example, our predictor variables are price and area of the house. The price has a lower range of values whereas the area variable has large values. Area values, if used as is, can influence our model when making predictions, although, they are not more important than the price variable. However, if we normalize price and area to a range between 0 and 1, they will have the same effect.
Some ways to normalize the data values are simple feature scaling, minimum-maximum method, z-score, etc.
Dividing each value in the column with the maximum column value is feature scaling.
\( xn= \frac {x}{xmax} \)
The Min-Max method keeps all values in the range of 0 and 1. It takes a difference of each value from the minimum column value and then divides it with the column range.
\( xn= \frac {x - xmin}{xmax - xmin} \)
Standard score or z-score subtracts the mean value of the column from each column value. Then divides with the standard deviation of the column. This method keeps the values between 3 and -3.
\( xn=\frac {x-µ} {σ} \)
Here, µ is the mean and σ is the standard deviation of the feature column.
We can apply these normalization methods in Python as,
df["width"] = df["width"]/df["width"].max() df["width"] = (df["width"]-df["width"].min())/(df["width"].max()-df["width"].min()) df["width"] = (df["width"]-df["width"].mean())/df["width"].std()
Data Binning
Binning refers to the grouping of data values. For example, we can group the age values as 0 to 18, 19 to 29, etc. The difficulty of a task can be grouped as easy, medium, and hard. Such grouping of column values makes data analysis easier. Let’s bin price range in our dataset.
The Linspace() function will give us equally spaced four numbers in our price column. The cut () function will divide the price column into the given number. First, we handle the missing and null values in the price column.
df.replace("?", np.nan, inplace = True) df.dropna(subset=["price"], axis=0, inplace=True) df["price"] = df["price"].astype("int") df.reset_index(drop=True, inplace=True) bins = np.linspace(min(df["price"]),max(df["price"]),4) bin_names = ["Low","Medium","High"] df["price.binned"]=pd.cut(df["price"], bins, labels=bin_names, include_lowest = True)
Here, np is the numpy library.
We can also visualize this price range with a histogram.
import matplotlib as plt from matplotlib import pyplot plt.pyplot.hist(df["price"]) plt.pyplot.xlabel("price") plt.pyplot.ylabel("cars count") plt.pyplot.title("price groups") plt.pyplot.show()
Conversion of Categorical Data
Lastly, we will see how to convert qualitative or categorical data into quantitative data. In data analysis, we use statistical methods, and these methods usually work with quantitative data. Dummies() function in Pandas can help us here.
For example, we want to change the engine type column into straight, inline, and flat engine types. 0 and 1 values will indicate the car’s respective engine type.
dummy_variable = pd.get_dummies(df["engine-type"]) dummy_variable.rename(columns={'engine-type-available':'straight', 'engine-type-available':'inline', 'engine-type-available':'flat'}, inplace=True)
Then we merge the dataframe and the dummy variables and drop the original column that is now quantized.
df = pd.concat([df, dummy_variable], axis=1) df.drop("engine-type", axis = 1, inplace=True)
So that’s it for data wrangling in Python and we are now familiar with an important prerequisite of data analysis and data processing.