More than 3 years have passed since last update.

R で graphical lasso を用いた異常検知　その１：基礎分析

Last updated at 2021-06-07Posted at 2021-05-24

はじめに

graphical lasso を用いた異常検知を行ってみた。参考資料は末尾に記載した。今回は時系列、多変量、かつ異常と正常のラベルデータが付随しているデータセットであるSECOM Data Setを使用した。この記事のシリーズでは、

R で graphical lasso を用いた異常検知　その１：基礎分析　←　今ここ！
R で graphical lasso を用いた異常検知　その２：前処理
R で graphical lasso を用いた異常検知　その３：モデル構築
R で graphical lasso で用いた異常検知　その４：異常度の算出

の順番で書いていく。本記事ではまずはデータセットの概要を理解するために基礎分析を行っていく。

データセットの読み込み

スペース区切りのデータセットのため、読み込みにちょっと工夫が必要だった。まずは基本統計量を確認する。データ項目が多いので、可視化の方法は考えなくてはならない。

コード

library(dplyr)
library(ggplot2)

df <- read.table("./secom.data", sep = " ", na.strings="NaN")
df_label <- read.table("./secom_labels.data", sep = " ")

str(df)

出力

'data.frame':	1567 obs. of  590 variables:
 $ V1  : num  3031 3096 2933 2989 3032 ...
 $ V2  : num  2564 2465 2560 2480 2503 ...
 $ V3  : num  2188 2230 2186 2199 2233 ...
 $ V4  : num  1411 1464 1698 910 1327 ...
 $ V5  : num  1.36 0.829 1.51 1.32 1.533 ...
 $ V6  : int  100 100 100 100 100 100 100 100 100 100 ...
 $ V7  : num  97.6 102.3 95.5 104.2 100.4 ...
--[省略]--

ラベルの時刻化、時間範囲の確認、データセットのラベル確認

時刻と異常正常のラベルが含まれたデータセットの確認を行う。

まず時刻データのデータ型を変更し、その範囲を確認する。

コード

library(lubridate)
df_label$V2 <- df_label$V2 %>% dmy_hms()
df_label$V2 %>% range()

出力

[1] "2008-07-19 11:55:00 UTC" "2008-10-17 06:07:00 UTC"

約3か月間のデータであることが分かる。

次にラベルを確認する。データ項目の詳細にある通り、

The dataset presented in this case represents a selection of such features where each example represents a single production entity with associated measured features and the labels represent a simple pass/fail yield for in house line testing, figure 2, and associated date time stamp. Where –1 corresponds to a pass and 1 corresponds to a fail and the data time stamp is for that specific test point.

より、合格が -1 のラベル、不合格が 1 のラベルになっている。

コード

df_label %>% filter(V1 == -1) %>% dim
df_label %>% filter(V1 == 1) %>% dim

出力

[1] 1463    2
[1] 104   2

以上より、正常データ（合格）が1463 個、異常データ（不合格）が104個含まれていることが分かる。

基本統計量の確認

これまで行ってきた各データセットの基礎分析と基本的には同じ。データ型、平均、分散、標準偏差、四分位数、欠損の数を確認する。

コード

classes <- df %>% summarise_all(class) %>% t()
means <- df %>% select(!where(is.character)) %>% summarise_all(mean, na.rm=TRUE) %>% t()
vars <- df %>% select(!where(is.character)) %>% summarise_all(var, na.rm=TRUE) %>% t()
sds <- df %>% select(!where(is.character)) %>% summarise_all(sd, na.rm=TRUE) %>% t()
qs <- df %>% select(!where(is.character)) %>% summarise_all(quantile, na.rm=TRUE) %>% t()
colnames(qs) <- c("min", "q14", "median", "q34", "max")
is.nas <- df %>% is.na %>% as.data.frame %>% summarise_all(sum) %>% t()

stat.1 <- data.frame(
  rownames=rownames(classes),
  classes=classes,
  is.nas=is.nas,
  is.not.nas=nrow(df)-is.nas
)

stat.2 <- data.frame(
  rownames=rownames(means),
  means=means,
  vars=vars,
  sds=sds,
  qs
)

stat <- full_join(stat.1, stat.2, by="rownames")
write.csv(x=stat, file="summary.csv")

summary.csvの一部(V1~V7)を抜き出す。

	rownames	classes	is.nas	is.not.nas	means	vars	sds	min	q14	median	q34	max
1	V1	numeric	6	1561	3014.452896	5420.167484	73.62178675	2743.24	2966.26	3011.49	3056.65	3356.35
2	V2	numeric	7	1560	2495.850231	6465.39902	80.40770498	2158.75	2452.2475	2499.405	2538.8225	2846.44
3	V3	numeric	14	1553	2200.547318	871.0261444	29.51315206	2060.66	2181.0444	2201.0667	2218.0555	2315.2667
4	V4	numeric	14	1553	1396.376627	195091.5052	441.6916404	0	1081.8758	1285.2144	1591.2235	3715.0417
5	V5	numeric	14	1553	4.197013136	3175.946899	56.35554009	0.6815	1.0177	1.3168	1.5257	1114.5366
6	V6	integer	14	1553	100	0	0	100	100	100	100	100
7	V7	numeric	14	1553	101.1129082	38.90283585	6.237213789	82.1311	97.92	101.5122	104.5867	129.2522

全てのデータが数値データであることが確認できる。また、いくつか分散、標準偏差が0の項目（上記表だとV6等）があることが分かった。これは時系列で変化せず、分析には使用しにくいデータ項目であるので、のちの前処理でこういった項目は落としていく必要がある。

時系列プロットの作成

時系列データなので、全項目についての時系列プロットを作成した。この時に、ggplot2 の aes_ を使用して変数でもプロットの項目を選択できるようにした。詳しくは「文字列として変数に格納されている列名をggplotのaes()に入れたい」を参考に。

コード

# 可視化の為にbind
colnames(df_label) <- c("label", "datetime")
df.tmp <- cbind(df_label, df)
cl <- as.name("label")
df.tmp$label <- as.factor(df.tmp$label)

library(ggplot2)

dir <- "./plot/"
for (i in 3:ncol(df.tmp)) {
  xn <- as.name(colnames(df.tmp)[2])
  yn <- as.name(colnames(df.tmp)[i])
  fname <- paste(dir, colnames(df.tmp)[i], ".png", sep = "")
  
  g <- ggplot(data = df.tmp, mapping = aes_(x = xn, y = yn, color = cl))
  g <- g + geom_point()
  
  ggsave(file = fname, plot = g, width = 9.44, height = 4, dpi = 300)
}

出力されたプロットを二つほど以下に示す。

分散、標準偏差が0のデータ（V6）は一定値のプロットを示していることが分かる。

おわりに

まずはデータの事を知るのが必要なので、データセットの基礎分析を行った。ここで得られた結果（欠損が多い、一定値を示すデータがある等々…）をもとに、次回の前処理で分析に必要な形にデータを加工していく。

参考文献（graphical lasso を用いた異常検知について）

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

R で graphical lasso を用いた異常検知 その１：基礎分析