Pandas: About DataFrames, Part 11: Handling Duplicate Data

Posted at 2022-06-22

Handling Duplicate Data

This article covers operations on unique values within columns and on unique rows.

import pandas as pd
df = pd.DataFrame({
    'a': [4, 4, 1, 5, 1, 3, 4, 3, 5, 3, 3, 4, 3, 3, 5],
    'b': [4, 1, 4, 3, 1, 2, 4, 4, 2, 1, 1, 1, 1, 3, 3]
})
df

	a	b
0	4	4
1	4	1
2	1	4
3	5	3
4	1	1
5	3	2
6	4	4
7	3	4
8	5	2
9	3	1
10	3	1
11	4	1
12	3	1
13	3	3
14	5	3

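1. Number of unique values `nunique`

Counts the number of unique values along an axis: per column with axis=0 (the default), per row with axis=1.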

Usage: nunique(axis: 'Axis' = 0, dropna: 'bool' = True)

df.nunique()

a    4
b    4
dtype: int64
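The two parameters in the usage line can be tried as in the following minimal sketch. The frame `df_na` and its columns `x` and `y` are hypothetical, introduced here only to show the effect of `dropna`; they are not part of the article's data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [4, 4, 1, 5, 1, 3, 4, 3, 5, 3, 3, 4, 3, 3, 5],
    'b': [4, 1, 4, 3, 1, 2, 4, 4, 2, 1, 1, 1, 1, 3, 3]
})

# axis=1 counts the distinct values within each row:
# row 0 is (4, 4) and gives 1, row 1 is (4, 1) and gives 2.
per_row = df.nunique(axis=1)

# Hypothetical frame with missing values (illustration only).
df_na = pd.DataFrame({'x': [1.0, 1.0, np.nan], 'y': [2.0, np.nan, np.nan]})

without_na = df_na.nunique()           # NaN is ignored (dropna=True)
with_na = df_na.nunique(dropna=False)  # NaN counts as one more value

print(per_row.tolist())
print(without_na.tolist(), with_na.tolist())
```

With `dropna=False`, a column that contains NaN reports one more unique value than it does by default.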

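2. Unique values and their frequencies (frequency distribution) `value_counts`

It is probably rare to want only the number of unique values that appear in the data.

value_counts tabulates the unique values together with how often each one occurs. In statistical terms, it computes a frequency distribution.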

Usage: value_counts(subset: 'Sequence[Hashable] | None' = None,
                    normalize: 'bool' = False, sort: 'bool' = True,
                    ascending: 'bool' = False, dropna: 'bool' = True)

df['a'].value_counts()

3    6
4    4
5    3
1    2
Name: a, dtype: int64

df['a'].value_counts(ascending=True)

1    2
5    3
4    4
3    6
Name: a, dtype: int64
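As a small sketch of two other common variations, `normalize=True` returns relative frequencies instead of counts, and chaining `sort_index()` orders the result by value rather than by frequency. The variable names below are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [4, 4, 1, 5, 1, 3, 4, 3, 5, 3, 3, 4, 3, 3, 5],
    'b': [4, 1, 4, 3, 1, 2, 4, 4, 2, 1, 1, 1, 1, 3, 3]
})

# Relative frequencies: each count divided by the number of rows (15).
rel = df['a'].value_counts(normalize=True)

# Counts ordered by the values themselves (1, 3, 4, 5).
by_value = df['a'].value_counts().sort_index()

print(rel)
print(by_value)
```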

Something resembling a cross tabulation is also possible.

df.value_counts(['a', 'b'])

a  b
3  1    3
4  1    2
   4    2
5  3    2
1  1    1
   4    1
3  2    1
   3    1
   4    1
5  2    1
dtype: int64

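However, what is wanted is probably pd.crosstab(), covered in the next section.

3. Cross tabulation (two-dimensional frequency distribution)

Computes what statistics calls a cross tabulation (contingency table).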
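To see how the two results relate, the long-format counts from `value_counts` can be pivoted into a wide table. This is only a sketch of the correspondence, assuming that `unstack(fill_value=0)` is an acceptable way to fill the combinations that never occur; it is not presented as the recommended route.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [4, 4, 1, 5, 1, 3, 4, 3, 5, 3, 3, 4, 3, 3, 5],
    'b': [4, 1, 4, 3, 1, 2, 4, 4, 2, 1, 1, 1, 1, 3, 3]
})

# Long format: one entry per (a, b) combination that actually occurs.
long_counts = df.value_counts(['a', 'b'])

# Move level 'b' to the columns; absent combinations become 0.
wide = (long_counts.unstack(fill_value=0)
        .sort_index()
        .sort_index(axis=1))

# The same table built directly.
ct = pd.crosstab(df['a'], df['b'])

print(wide)
print((wide.values == ct.values).all())
```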

Usage: crosstab(index, columns, values=None, rownames=None,
                colnames=None, aggfunc=None, margins: 'bool' = False,
                margins_name: 'str' = 'All', dropna: 'bool' = True,
                normalize=False)

pd.crosstab(df['a'], df['b'])

b	1	2	3	4
a
1	1	0	0	1
3	3	1	1	1
4	2	0	0	2
5	0	1	2	0
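The `margins` and `normalize` parameters from the usage line are often what a frequency table needs next. The following is a minimal sketch: `margins=True` appends row and column totals under the label given by `margins_name` (default 'All'), and `normalize='index'` converts each row to proportions.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [4, 4, 1, 5, 1, 3, 4, 3, 5, 3, 3, 4, 3, 3, 5],
    'b': [4, 1, 4, 3, 1, 2, 4, 4, 2, 1, 1, 1, 1, 3, 3]
})

# Counts with an extra 'All' row and column holding the totals.
with_totals = pd.crosstab(df['a'], df['b'], margins=True)

# Row-wise proportions: every row sums to 1.
row_share = pd.crosstab(df['a'], df['b'], normalize='index')

print(with_totals)
print(row_share)
```

`normalize` also accepts 'columns' for column-wise proportions and 'all' (or True) for proportions of the grand total.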

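4. Removing duplicate rows

Removes rows whose contents are identical to another row. With the default keep='first', the first occurrence of each row is kept.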

Usage: drop_duplicates(subset: 'Hashable | Sequence[Hashable] | None' = None,
                       keep: "Literal['first'] | Literal['last'] | Literal[False]" = 'first',
                       inplace: 'bool' = False, ignore_index: 'bool' = False)

df.drop_duplicates()

	a	b
0	4	4
1	4	1
2	1	4
3	5	3
4	1	1
5	3	2
7	3	4
8	5	2
9	3	1
13	3	3
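The remaining parameters change which rows count as duplicates and which copy survives. The sketch below walks through `subset`, `keep`, and `ignore_index`, and uses the related `duplicated()` method to count the rows that would be dropped. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [4, 4, 1, 5, 1, 3, 4, 3, 5, 3, 3, 4, 3, 3, 5],
    'b': [4, 1, 4, 3, 1, 2, 4, 4, 2, 1, 1, 1, 1, 3, 3]
})

# Judge duplicates on column 'a' only: one row per value of 'a'.
first_per_a = df.drop_duplicates(subset='a')

# Keep the last occurrence of each row instead of the first.
keep_last = df.drop_duplicates(keep='last')

# keep=False drops every row that has a duplicate anywhere.
only_singletons = df.drop_duplicates(keep=False)

# Renumber the surviving rows 0, 1, 2, ...
renumbered = df.drop_duplicates(ignore_index=True)

# duplicated() marks the rows that drop_duplicates() would remove.
n_dropped = df.duplicated().sum()

print(first_per_a.index.tolist())
print(keep_last.index.tolist())
print(only_singletons.index.tolist())
print(n_dropped)
```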

5. For those who want to know more

help(df.nunique)         # Help on method nunique in module pandas.core.frame
help(df.value_counts)    # Help on method value_counts in module pandas.core.frame
help(pd.crosstab)        # Help on function crosstab in module pandas.core.reshape.pivot
help(df.drop_duplicates) # Help on method drop_duplicates in module pandas.core.frame
