
Data Science Advent Calendar 2020

Most wanted pandas functions (Part 01)

Posted at 2020-12-20

Hello again.

Among data scientists who use Python, pandas is the most popular data analysis and manipulation library. It provides a wide range of functions and methods for data preprocessing and analysis.
Here, I will cover functions that are useful in many projects. For the demonstrations, I will use the famous Titanic dataset.

First, let's install pandas.

pandas_funcs.ipynb

!pip install pandas

1. Import CSV file.

We can import a CSV file with the read_csv function. It has many options for advanced usage; please refer to the documentation for details.

pandas_funcs.ipynb

import pandas as pd

df = pd.read_csv("titanic.csv")
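As a rough illustration of the options mentioned above, here is a minimal sketch that uses usecols, dtype and index_col. The inline CSV text is a made-up stand-in for titanic.csv (an assumption for this example), so the snippet runs without the real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for titanic.csv (not the real data).
csv_text = """PassengerId,Survived,Sex,Age,Fare
1,0,male,22,7.25
2,1,female,38,71.2833
3,1,female,,7.925
"""

small = pd.read_csv(
    io.StringIO(csv_text),                  # any file path or file-like object works
    usecols=["PassengerId", "Sex", "Age"],  # load only these columns
    dtype={"Sex": "category"},              # set a column's type while reading
    index_col="PassengerId",                # use this column as the row index
)
print(small)
```

Empty fields such as the missing age in the last row are read as NaN.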

2. View first N rows.

pandas_funcs.ipynb

df.head(n=7)

3. View last N rows.

pandas_funcs.ipynb

df.tail(n=7)

4. View shape of the DataFrame.

pandas_funcs.ipynb

df.shape

# Output >>
(891, 12)

5. Get column names.

pandas_funcs.ipynb

df.columns

# Output >>
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

6. Drop columns.

We can use the drop function to remove columns or rows. Pass it a list of the column names or row labels to delete.

pandas_funcs.ipynb

df.drop(['Name','SibSp','Pclass'], axis=1, inplace=True)

df.columns
# Output >>
Index(['PassengerId', 'Survived', 'Sex', 'Age', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked'],
      dtype='object')

Please note that axis is set to 1 here, which tells drop to remove columns. If you need to delete rows, use axis=0.
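To make the row versus column distinction concrete, here is a small sketch on a made-up three-row frame (the names and values are assumptions for illustration). It uses the index= and columns= keywords, which are equivalent to axis=0 and axis=1, and it leaves the original frame untouched because inplace is not set.

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame(
    {"Name": ["A", "B", "C"], "Age": [22, 38, 26], "Fare": [7.25, 71.28, 7.92]}
)

# Drop the row whose index label is 0 (same as toy.drop([0], axis=0)).
without_first = toy.drop(index=[0])

# Drop a column by name (same as toy.drop(['Fare'], axis=1)).
without_fare = toy.drop(columns=["Fare"])

print(without_first.shape)
print(without_fare.columns.tolist())
```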

7. Sampling.

We can easily draw a random sample, either by number of rows (n) or by fraction of rows (frac).

pandas_funcs.ipynb

df.shape
# Output >>
(891, 9)

df_sample_1 = df.sample(n=100)
df_sample_1.shape
# Output >>
(100, 9)

df_sample_2 = df.sample(frac=0.75)
df_sample_2.shape
# Output >>
(668, 9)
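Because sample is random, the selected rows change on every run. If you need a reproducible sample, one option is the random_state parameter; the sketch below uses a made-up 100-row frame to show that the same seed gives the same rows.

```python
import pandas as pd

# Toy frame with 100 rows, invented for this example.
toy = pd.DataFrame({"value": range(100)})

# The same random_state selects the same rows each time.
a = toy.sample(n=10, random_state=42)
b = toy.sample(n=10, random_state=42)
print(a.equals(b))

# frac works the same way: 25% of 100 rows.
quarter = toy.sample(frac=0.25, random_state=0)
print(quarter.shape)
```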

8. Find missing values.

It is important to deal with the missing values in a data set. pandas lets us count the missing values per column easily:

pandas_funcs.ipynb

df.isna().sum()
# Output >>
PassengerId      0
Survived         0
Sex              0
Age            177
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
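Counts are easier to judge as proportions. As a small sketch on made-up data (the values below are assumptions, not the Titanic data), taking the mean of isna() gives the missing ratio per column, and any(axis=1) picks out the rows that contain at least one missing value.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame(
    {
        "Age": [22, np.nan, 26, np.nan],
        "Cabin": [None, "C85", "E46", None],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    }
)

# Fraction of missing values in each column.
missing_ratio = toy.isna().mean()
print(missing_ratio)

# Rows that have at least one missing value.
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```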

9. Fill missing values.

pandas provides a fillna function to fill missing values. It is common to fill with the mean or median of the column. We can also fill missing values from the previous or next value (the method parameter of fillna, or the ffill and bfill methods in recent pandas versions), which is useful for time series data.
For example, 177 values are missing from the Age column. I will fill them with the average age of the other passengers.

pandas_funcs.ipynb

avg = df['Age'].mean()
# Assign the result back; inplace=True on a selected column may not update df in recent pandas versions.
df['Age'] = df['Age'].fillna(avg)
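The forward and backward filling mentioned above can be sketched as follows. The daily series is made up for illustration, and the example uses the ffill and bfill methods rather than the method parameter of fillna, which recent pandas versions deprecate.

```python
import numpy as np
import pandas as pd

# Toy daily time series with gaps, invented for this example.
ts = pd.Series(
    [1.0, np.nan, np.nan, 4.0, np.nan],
    index=pd.date_range("2020-12-01", periods=5, freq="D"),
)

# Forward fill: carry the last known value forward.
forward = ts.ffill()

# Backward fill: pull the next known value backward.
# The final gap has no later value, so it stays missing.
backward = ts.bfill()

print(forward)
print(backward)
```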

10. Drop missing values.

In some projects we have to delete entire rows that contain missing values. In that case we can use the dropna function.
In this data set, 2 values are missing from the “Embarked” column. We can remove the corresponding rows with:

pandas_funcs.ipynb

print(df.shape)
# Output >>
(891, 9)

df.dropna(subset=['Embarked'], axis=0, how='any', inplace=True)
df.shape
# Output >>
(889, 9)

Please note that axis is set to 0 here, which means rows are dropped, and subset limits the check to the Embarked column. To drop columns that contain missing values instead, use axis=1.
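The how and axis parameters, together with thresh, control how strict dropna is. Here is a minimal sketch on a made-up four-row frame (the values are assumptions for illustration) comparing the options.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame(
    {
        "Age": [22, np.nan, np.nan, np.nan],
        "Cabin": ["C85", "B20", None, None],
        "Fare": [7.25, 71.28, 8.05, np.nan],
    }
)

# how='any': drop a row if any value in it is missing.
any_missing = toy.dropna(how="any")

# how='all': drop a row only if every value in it is missing.
all_missing = toy.dropna(how="all")

# thresh=2: keep rows that have at least 2 non-missing values.
at_least_two = toy.dropna(thresh=2)

# axis=1: drop columns (not rows) that contain missing values,
# checked here on the first three rows only.
complete_cols = toy.iloc[:3].dropna(axis=1)

print(len(any_missing), len(all_missing), len(at_least_two))
print(complete_cols.columns.tolist())
```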

11. Conditional Selections.

We often need to select rows based on certain conditions. The code below selects the passengers who are male and more than 30 years old.

pandas_funcs.ipynb

male_30_plus = df[(df.Sex == 'male') & (df.Age> 30)]
male_30_plus
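Conditions can also be combined with | (or) and negated with ~, and each condition needs its own parentheses. The sketch below uses a made-up frame to show these, plus the query method as an alternative way to write the same kind of filter.

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame(
    {"Sex": ["male", "female", "male", "female"], "Age": [45, 31, 12, 8]}
)

# OR: male passengers, or anyone younger than 18.
male_or_child = toy[(toy["Sex"] == "male") | (toy["Age"] < 18)]

# NOT: everyone who is not male.
not_male = toy[~(toy["Sex"] == "male")]

# The same kind of AND filter written as a query string.
with_query = toy.query("Sex == 'male' and Age > 30")

print(male_or_child)
print(not_male)
print(with_query)
```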

12. Conditional Selections with isin

If we need to match any of several values, we can use the isin function. It returns a boolean mask, and indexing with that mask gives a dataframe containing only the rows whose column value is in the list.

Suppose we need the passengers whose age is 12 or 15:

pandas_funcs.ipynb

df[df['Age'].isin([12,15])]
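The mask that isin returns can be negated with ~ to exclude values instead, and it works on string columns too. A small sketch on made-up data:

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame(
    {"Age": [12, 15, 30, 12], "Embarked": ["S", "Q", "C", "Q"]}
)

# Rows whose Age is 12 or 15.
selected = toy[toy["Age"].isin([12, 15])]

# Negate the mask to get every other row.
others = toy[~toy["Age"].isin([12, 15])]

# isin also works with strings.
from_s_or_c = toy[toy["Embarked"].isin(["S", "C"])]

print(selected)
print(others)
print(from_s_or_c)
```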

13. Insert new column.

We can easily add a new column to a dataframe like this.

pandas_funcs.ipynb

import numpy as np

# One random integer from 0 to 9 for each row
group = np.random.randint(10, size=len(df))
df['new_group_name'] = group
df

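If you need to add a column at a specific position, use the insert function. Note that insert raises a ValueError if a column with that name already exists, so use it instead of the assignment above, not after it.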

pandas_funcs.ipynb

df.insert(2, 'new_group_name', group)
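Here is a minimal sketch of how insert behaves, on a made-up frame with a hypothetical column name. It modifies the frame in place, and by default it refuses to add a column name that is already present.

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame(
    {"PassengerId": [1, 2, 3], "Survived": [0, 1, 1], "Age": [22, 38, 26]}
)

# Insert a new column at position 1 (0-based); this changes toy in place.
toy.insert(1, "group", [7, 3, 5])
print(toy.columns.tolist())

# Inserting the same name again raises ValueError by default.
duplicate_error = None
try:
    toy.insert(0, "group", [0, 0, 0])
except ValueError as exc:
    duplicate_error = str(exc)
print(duplicate_error)
```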

14. Ranking

We can rank the rows by the values in a column. Let's rank the passengers by ticket price, with the most expensive fare ranked first.

pandas_funcs.ipynb

df['rank'] = df['Fare'].rank(method='first', ascending=False).astype('int')
df
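The method parameter decides how tied values are ranked. The sketch below compares a few options on a made-up series of fares in which the highest fare appears twice; with method='first', as used above, every row gets a distinct rank.

```python
import pandas as pd

# Toy fares with a tie at the top, invented for this example.
fares = pd.Series([71.28, 7.25, 71.28, 8.05])

# 'first': ties get distinct ranks, so ranks are unique.
first = fares.rank(method="first", ascending=False)

# 'min': tied values share the lowest rank of their group.
minimum = fares.rank(method="min", ascending=False)

# 'dense': like 'min', but ranks increase by 1 with no gaps.
dense = fares.rank(method="dense", ascending=False)

# 'average' (the default): tied values share the mean of their ranks.
average = fares.rank(method="average", ascending=False)

print(pd.DataFrame({"fare": fares, "first": first, "min": minimum,
                    "dense": dense, "average": average}))
```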

15. Groupby

This is one of the most frequently used functions in pandas. It makes it easy to get an overview of the data and of the relationships among variables.

pandas_funcs.ipynb

df.groupby(['Embarked','Sex'])['Survived'].count()
# Output >>
Embarked  Sex   
C         female     73
          male       95
Q         female     36
          male       41
S         female    203
          male      441
Name: Survived, dtype: int64

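In this example we get the number of passengers for each combination of the 3 embarkation ports and sex. Note that count() counts the non-missing rows in each group, not the survivors; since Survived is 0 or 1, use sum() to count survivors.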
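To see the difference between counting rows and counting survivors, here is a small sketch on made-up data (not the Titanic values) that computes count, sum and mean per group in one call with agg.

```python
import pandas as pd

# Toy data, invented for this example.
toy = pd.DataFrame(
    {
        "Embarked": ["S", "S", "C", "C", "S"],
        "Sex": ["male", "female", "female", "male", "male"],
        "Survived": [0, 1, 1, 0, 1],
    }
)

# count = passengers per group, sum = survivors, mean = survival rate.
summary = toy.groupby(["Embarked", "Sex"])["Survived"].agg(["count", "sum", "mean"])
print(summary)
```

Because Survived holds 0 or 1, the sum is the number of survivors and the mean is the survival rate of each group.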

Since this article is getting long, I will continue in Part 2 at a later date.

*This article was written by @nuwan, a member of @qualitia_cdev.
