
RPKM and TPM

Posted at 2023-01-06, last updated at 2023-01-06

RPKM (reads per kilobase per million mapped reads) and TPM (transcripts per million) are both methods for normalizing count data measured with a next-generation sequencer (NGS).

Reference pages:

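Suppose the count table has transcripts along the rows and samples along the columns. RPKM normalizes each column first and then each row; TPM normalizes each row first and then each column.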

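Column-wise normalization corrects for differences in the total number of reads between samples. Row-wise normalization corrects for differences in gene length between transcripts.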
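The two procedures can also be written as formulas. The symbols are introduced here for illustration and do not come from the original article: c_{ij} is the count of transcript i in sample j, and L_i is the length of transcript i, assumed to be in base pairs.

```latex
\begin{align}
\mathrm{RPKM}_{ij} &= \frac{c_{ij}}{\sum_{k} c_{kj}} \cdot \frac{1}{L_i} \cdot 10^{9} \\
\mathrm{TPM}_{ij}  &= \frac{c_{ij} / L_i}{\sum_{k} \left( c_{kj} / L_k \right)} \cdot 10^{6}
\end{align}
```

Under this assumption, the factor 10^9 in RPKM is the product of 10^3 (per kilobase) and 10^6 (per million reads), while the factor 10^6 in TPM is the "per million" scaling alone.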

The Python code is shown below. Prepare the count data (a 2-D array as a data frame) and the gene length data (a 1-D array).

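# INPUT:
#   count_df: count data as a pd.DataFrame (rows: transcripts, columns: samples)
#   len_sr: length data, one value per row of count_df
#           (1-D sequence: pd.Series, np.ndarray, etc.)
# OUTPUT:
#   rpkm_df: RPKM data
#   tpm_df: TPM data

# count -> RPKM
rpkm_df = count_df / count_df.sum()        # per sample: divide by total reads
rpkm_df = rpkm_df.divide(len_sr, axis=0)   # per transcript: divide by length
rpkm_df *= 10**9                           # 10**3 (per kilobase) * 10**6 (per million)

# count -> TPM
tpm_df = count_df.divide(len_sr, axis=0)   # per transcript: divide by length
tpm_df = tpm_df / tpm_df.sum()             # per sample: divide by the column total
tpm_df *= 10**6                            # per million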
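As a minimal usage sketch, the same steps can be wrapped in functions and run on a small table. The function names `counts_to_rpkm` and `counts_to_tpm`, the transcript and sample labels, and all numbers below are made up for illustration; lengths are assumed to be in base pairs.

```python
import numpy as np
import pandas as pd


def counts_to_rpkm(count_df, len_sr):
    """Column-wise (per sample) first, then row-wise (per transcript)."""
    per_sample = count_df / count_df.sum()
    return per_sample.divide(len_sr, axis=0) * 10**9


def counts_to_tpm(count_df, len_sr):
    """Row-wise (per transcript) first, then column-wise (per sample)."""
    per_length = count_df.divide(len_sr, axis=0)
    return per_length / per_length.sum() * 10**6


# Hypothetical toy data: 3 transcripts x 2 samples
count_df = pd.DataFrame(
    {"sample_A": [100, 200, 700], "sample_B": [300, 300, 400]},
    index=["tx1", "tx2", "tx3"],
)
# Hypothetical transcript lengths in bp, indexed like the rows of count_df
len_sr = pd.Series([1000, 2000, 500], index=count_df.index)

rpkm_df = counts_to_rpkm(count_df, len_sr)
tpm_df = counts_to_tpm(count_df, len_sr)

# Column sums: TPM is rescaled per sample last, RPKM is not
print(tpm_df.sum())
print(rpkm_df.sum())

# Rescaling RPKM within each sample should reproduce TPM
tpm_from_rpkm = rpkm_df / rpkm_df.sum() * 10**6
print(np.allclose(tpm_from_rpkm, tpm_df))
```

Because the column-wise step comes last in TPM, every sample sums to one million, which is not true of RPKM in general. Note that when `len_sr` is a `pd.Series`, `divide(..., axis=0)` aligns it with the rows by index label, so its index should match that of `count_df`.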
