Simple analysis of the Kaggle Eating & Health Module Dataset

Background

I analyze the Eating & Health Module Dataset from Kaggle.
The dataset is here.

Although the dataset consists of three files, I focus only on ehresp_2014.csv.
This file contains information about general health and body mass index for each respondent.
It also records the income of each respondent.

In this article, I analyze the relationship between income and the other variables.

The code can be found here.

My approach is based on traditional k-means clustering.
The difference from traditional k-means clustering is that I incorporate the impact of income.

In this dataset, income information is stored in the variable "erincome".
It takes integer values from 1 to 5, described as follows:
- 1: Income > 185% of poverty threshold
- 2: Income <= 185% of poverty threshold
- 3: 130% of poverty threshold < Income < 185% of poverty threshold
- 4: Income > 130% of poverty threshold
- 5: Income <= 130% of poverty threshold

Approach

There are 32 variables after removing the income-related variables ("eeincome1", "euincome2", "exincome1", "erincome").
I remove these variables because they all relate to income, and I use only "erincome" to carry the income information.
As preprocessing, I normalize each variable.
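Assuming the analysis is done in Python with pandas (the linked code may differ in details such as the normalization used), the preprocessing could look like the following sketch. The column names come from the dataset, while the variable names (`df`, `X`, `s`) are mine, and handling of missing-value codes in the raw file is omitted.

```python
import pandas as pd

# Load the Eating & Health Module respondent file.
df = pd.read_csv("ehresp_2014.csv")

# Keep "erincome" aside as the income signal S, and drop all
# income-related columns from the feature matrix.
s = df["erincome"].astype(float)
X = df.drop(columns=["eeincome1", "euincome2", "exincome1", "erincome"])

# Normalize each remaining variable (z-score normalization is an
# assumption; the article only says "normalize each variable").
X = (X - X.mean()) / X.std()
```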

Let $K$ be the number of clusters and $x_{v,n,k}$ be the value of the
$v$th variable of the $n$th respondent within the $k$th cluster.
I denote the value of "erincome" of the $n$th respondent of the $k$th cluster
as $S_{n,k}$.

The objective function to be minimized is as follows:

$$
\min_{m_{k,v}} \sum_{k=1}^{K} \sum_{n=1}^{N_{k}} \sum_{v=1}^{V} \left( x_{v,n,k} - S_{n,k}\, m_{k,v} \right)^2,
$$

where $m_{k,v}$ is the centroid of the $v$th variable within the $k$th cluster and $N_{k}$ is the number of respondents assigned to the $k$th cluster.
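Setting the derivative of this objective with respect to $m_{k,v}$ to zero gives the closed-form centroid update $m_{k,v} = \sum_{n} S_{n,k} x_{v,n,k} / \sum_{n} S_{n,k}^2$; respondents are then reassigned to the cluster whose scaled centroid is closest, analogously to ordinary k-means. Below is a minimal NumPy sketch of this alternating scheme; the function name `scaled_kmeans` and all variable names are mine, not necessarily those of the linked code.

```python
import numpy as np

def scaled_kmeans(X, s, K=7, n_iter=50, seed=0):
    """k-means variant in which each centroid is scaled by the
    respondent's "erincome" value before measuring the distance.

    X : (N, V) array of normalized features (income columns removed)
    s : (N,)   array with the "erincome" value of each respondent
    """
    rng = np.random.default_rng(seed)
    N, V = X.shape
    labels = rng.integers(K, size=N)            # random initial allocation
    history = []
    for _ in range(n_iter):
        # Centroid update: m_{k,v} = sum_n S_n x_{n,v} / sum_n S_n^2
        # over respondents currently assigned to cluster k.
        M = np.zeros((K, V))
        for k in range(K):
            idx = labels == k
            if idx.any():
                M[k] = (s[idx][:, None] * X[idx]).sum(axis=0) / (s[idx] ** 2).sum()
        # Assignment update: each respondent joins the cluster whose
        # scaled centroid S_n * m_k is closest to x_n.
        resid = X[:, None, :] - s[:, None, None] * M[None, :, :]   # (N, K, V)
        dist = (resid ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        history.append(dist.min(axis=1).sum())  # objective value per iteration
    return labels, M, history
```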

Experimental setting

I set the number of clusters $K$ to $7$.
I initialize the cluster allocation randomly.
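With the sketches above, this setting corresponds to a call like the following (again using my hypothetical names `scaled_kmeans`, `X`, and `s`):

```python
# K = 7 clusters, random initial allocation inside scaled_kmeans.
labels, M, history = scaled_kmeans(X.to_numpy(), s.to_numpy(), K=7, n_iter=50, seed=0)
```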

Results

First, I show the convergence of this method.
(Figure: converge.png — objective value versus iteration)

Like traditional k-means clustering, this method reaches a local minimum by alternately updating one group of variables while keeping the others fixed (the centroids given the allocation, then the allocation given the centroids).
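If the per-iteration objective values are recorded (the `history` list in the sketch above), a convergence plot like the figure can be reproduced roughly as follows; this is my own plotting sketch, not the original code.

```python
import matplotlib.pyplot as plt

plt.plot(history, marker="o")
plt.xlabel("iteration")
plt.ylabel("objective value")
plt.title("Convergence of the scaled k-means objective")
plt.show()
```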

Next, I show the cluster allocation.

|           | erincome=1 | erincome=2 | erincome=3 | erincome=4 | erincome=5 |
|-----------|-----------:|-----------:|-----------:|-----------:|-----------:|
| Cluster 0 |         96 |          2 |          7 |          2 |          0 |
| Cluster 1 |        649 |        193 |        307 |         10 |        705 |
| Cluster 2 |        292 |          0 |          0 |          0 |          0 |
| Cluster 3 |       1926 |          9 |          0 |          0 |          0 |
| Cluster 4 |       1946 |         53 |          3 |          0 |          0 |
| Cluster 5 |       1155 |         43 |          0 |          0 |          0 |
| Cluster 6 |        926 |        233 |        659 |         24 |       1692 |

According to this table, respondents whose "erincome" is $5$ are allocated only to Cluster 1 and Cluster 6.
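A cross-tabulation like the table above can be computed with `pandas.crosstab`, assuming `labels` and `df` from the sketches above:

```python
import pandas as pd

# Count respondents per (cluster, erincome) pair.
alloc = pd.crosstab(pd.Series(labels, name="cluster"),
                    df["erincome"].astype(int))
print(alloc)
```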

Next, I describe the features of each cluster by focusing on the coefficients ($m_{k,v}$).
(Figure: coe.png — coefficients $m_{k,v}$ for each cluster and variable)

According to this figure,
- Cluster 0: "ertsart" is large.
- Cluster 2: "erbmi", "etwgt", "euhgt" and "enwgt" are small.
- Cluster 3: "eumeant" and "eumilk" are large.
- Cluster 5: "eudrink" and "eusoda" are large.
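A coefficient figure of this kind can be drawn as a heatmap of the centroid matrix `M` from the sketches above; this is my own plotting sketch, not the original code.

```python
import matplotlib.pyplot as plt

# Heatmap of the coefficients m_{k,v}: rows are clusters, columns variables.
fig, ax = plt.subplots(figsize=(12, 4))
im = ax.imshow(M, aspect="auto", cmap="coolwarm")
ax.set_yticks(range(M.shape[0]))
ax.set_yticklabels([f"Cluster {k}" for k in range(M.shape[0])])
ax.set_xticks(range(M.shape[1]))
ax.set_xticklabels(X.columns, rotation=90, fontsize=7)
fig.colorbar(im, ax=ax, label="$m_{k,v}$")
fig.tight_layout()
plt.show()
```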

I think these features might be the secret to attaining a large value of "erincome",
although it is difficult to distinguish Clusters 1 and 6 from Cluster 4.
