Interpolation – Fill Missing Values

Posted at 2021-09-14

Introduction

When I was a research student at the University of Tsukuba, we had to carry out a field survey, so I chose to build a radiation map of the university dormitory premises. With a Geiger counter, I went out to measure the radiation at pre-selected spots. I spent 4 to 5 hours and was only able to cover half of those spots, and even they did not cover the entire premises. I went to the professor and said I needed more time. The professor said, "Your area is too broad. You would need a few weeks." Oh noooo.
The solution the professor gave me was to select a few spots that cover the entire university, then interpolate the data to create the radiation map.

What is interpolation?

Interpolation is a statistical procedure used to estimate unknown data points that lie between two or more known data points. In data science, it is mainly used to fill in missing values in a dataset.

The most common way to fill in missing values is to use the average value. But there are many scenarios where the average value does not work. The typical case is time-series data.

Demo

First, let's create a pandas Series. Then we will try some interpolation methods.

int.py

import pandas as pd
import numpy as np
values = pd.Series([1, 3, np.nan, 7, 9, np.nan, 13, 15])
print(values)

As I mentioned earlier, in most cases we fill missing data with the average of the values. If we did the same here, we would be making a big mistake: the series increases steadily, so a single average value does not fit either of the gaps.
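To make the problem with averaging concrete, here is a minimal sketch that fills the same series with its mean. The variable names `mean_value` and `mean_filled` are only illustrative and are not part of the original demo.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 3, np.nan, 7, 9, np.nan, 13, 15])

# Mean of the six known values: (1 + 3 + 7 + 9 + 13 + 15) / 6 = 8.0
mean_value = values.mean()

# Every NaN is replaced by the same number, regardless of its neighbours
mean_filled = values.fillna(mean_value)
print(mean_filled)
```

Both gaps receive 8.0, although the first one sits between 3 and 7 and the second one between 9 and 13.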

Linear Interpolation
Linear interpolation estimates a missing value by connecting the neighbouring known points with a straight line. In brief, it places the unknown value proportionally between the previous and the next known values. Linear is the default interpolation method in pandas, so we can call interpolate() without any arguments.
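A minimal sketch of that call on the series from the demo could look like this. The name `linear_filled` is just an illustrative choice.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 3, np.nan, 7, 9, np.nan, 13, 15])

# method="linear" is the default, so no arguments are needed
linear_filled = values.interpolate()
print(linear_filled)
```

The gap between 3 and 7 becomes 5.0 and the gap between 9 and 13 becomes 11.0, which keeps the steady increase of the series.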

Polynomial Interpolation
In polynomial interpolation we need to set an order, that is, the degree of the polynomial that is fitted through the available data points to fill the missing values.
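As a rough sketch, the call might look like the following. The choice of `order=2` is an assumption made for illustration, since the original value is not shown here. This method also assumes SciPy is installed, because pandas hands the polynomial fit over to it.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 3, np.nan, 7, 9, np.nan, 13, 15])

# order is required for method="polynomial" (2 is an illustrative choice);
# pandas relies on SciPy for this method, so SciPy must be installed
poly_filled = values.interpolate(method="polynomial", order=2)
print(poly_filled)
```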

We can see that the result is the same as the previous one, because the known values in this series already lie on a straight line.

Now let's use some interpolation methods to fill NaN values in a pandas DataFrame. First, let's create a DataFrame.

int.py

df = pd.DataFrame({"COL A": [11, 3, None, 5, 3],
                   "COL B": [33, None, 37, None, 1],
                   "COL C": [19, None, None, 5, None],
                   "COL D": [None, 1, 7, None, 5]})

print(df)

Linear Interpolation (Forward)
If we set the limit direction to forward, filling a missing value requires some earlier value. That is why the NaN in the first row of COL D is not filled.

int.py

# print the result so it is visible when int.py is run as a script
print(df.interpolate(method='linear', limit_direction='forward'))

Linear Interpolation (Backward)
This is the reverse of the above. When we specify backward, filling a missing value requires some later value. So the NaN in the last row of COL C is not filled, while the NaN in the first row of COL D is.

int.py

# print the result so it is visible when int.py is run as a script
print(df.interpolate(method='linear', limit_direction='backward'))
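To compare the two directions side by side, a small sketch like the one below can help. The names `forward` and `backward` are illustrative, and the behaviour described afterwards is what I would expect from recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"COL A": [11, 3, None, 5, 3],
                   "COL B": [33, None, 37, None, 1],
                   "COL C": [19, None, None, 5, None],
                   "COL D": [None, 1, 7, None, 5]})

forward = df.interpolate(method="linear", limit_direction="forward")
backward = df.interpolate(method="linear", limit_direction="backward")

# Count the cells that are still NaN after each pass, per column
print(forward.isna().sum())
print(backward.isna().sum())
```

Gaps that have known values on both sides are filled the same way in both results. The two directions differ only at the edges: forward leaves the leading NaN in COL D and fills the trailing NaN in COL C with the last known value, while backward does the opposite. Passing limit_direction='both' should fill both edges.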

There are some more advanced ways to use the interpolate function. I may write about them later.

*This article was written by @nuwan, a member of @qualitia_cdev.
