Simple analysis of the Kaggle Eating & Health Module Dataset

Background

I analyze the Eating & Health Module Dataset from Kaggle.
The dataset is here.

Although the dataset consists of three files, I focus only on ehresp_2014.csv.
This file contains information about general health and body mass index for each respondent.
It also records the income of each respondent.

In this article, I analyze the relationship between income and the other variables.

The code can be found here.

My approach is based on traditional k-means clustering.
The difference from traditional k-means clustering is that I incorporate the impact of income.

In this dataset, income information is stored in the variable "erincome".
It takes integer values from 1 to 5, described as follows:
- 1: Income > 185% of poverty threshold
- 2: Income <= 185% of poverty threshold
- 3: 130% of poverty threshold < Income < 185% of poverty threshold
- 4: Income > 130% of poverty threshold
- 5: Income <= 130% of poverty threshold

Approach

There are 32 variables after removing the income-related variables ("eeincome1", "euincome2", "exincome1", "erincome").
I remove these variables because they all relate to income, and I use only "erincome" to carry the income information.
As preprocessing, I normalize each variable.
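Assuming the analysis is done in Python with pandas (the linked code may differ in details such as the normalization used), the preprocessing could look like the following sketch. The column names come from the dataset, while the variable names (`df`, `X`, `s`) are mine, and handling of missing-value codes in the raw file is omitted.

```python
import pandas as pd

# Load the Eating & Health Module respondent file.
df = pd.read_csv("ehresp_2014.csv")

# Keep "erincome" aside as the income signal S, and drop all
# income-related columns from the feature matrix.
s = df["erincome"].astype(float)
X = df.drop(columns=["eeincome1", "euincome2", "exincome1", "erincome"])

# Normalize each remaining variable (z-score normalization is an
# assumption; the article only says "normalize each variable").
X = (X - X.mean()) / X.std()
```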

Let $K$ be the number of clusters and $x_{v,n,k}$ be the value of the
$v$th variable of the $n$th respondent within the $k$th cluster.
I denote the value of "erincome" of the $n$th respondent of the $k$th cluster
as $S_{n,k}$.

The objective function to be minimized is as follows:

$$
\min_{m_{k,v}} \sum_{k=1}^{K} \sum_{n=1}^{N_{k}} \sum_{v=1}^{V} \left( x_{v,n,k} - S_{n,k}\, m_{k,v} \right)^2,
$$

where $m_{k,v}$ is the centroid of the $v$th variable within the $k$th cluster and $N_{k}$ is the number of respondents assigned to the $k$th cluster.
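Setting the derivative of this objective with respect to $m_{k,v}$ to zero gives the closed-form centroid update $m_{k,v} = \sum_{n} S_{n,k} x_{v,n,k} / \sum_{n} S_{n,k}^2$; respondents are then reassigned to the cluster whose scaled centroid is closest, analogously to ordinary k-means. Below is a minimal NumPy sketch of this alternating scheme; the function name `scaled_kmeans` and all variable names are mine, not necessarily those of the linked code.

```python
import numpy as np

def scaled_kmeans(X, s, K=7, n_iter=50, seed=0):
    """k-means variant in which each centroid is scaled by the
    respondent's "erincome" value before measuring the distance.

    X : (N, V) array of normalized features (income columns removed)
    s : (N,)   array with the "erincome" value of each respondent
    """
    rng = np.random.default_rng(seed)
    N, V = X.shape
    labels = rng.integers(K, size=N)            # random initial allocation
    history = []
    for _ in range(n_iter):
        # Centroid update: m_{k,v} = sum_n S_n x_{n,v} / sum_n S_n^2
        # over respondents currently assigned to cluster k.
        M = np.zeros((K, V))
        for k in range(K):
            idx = labels == k
            if idx.any():
                M[k] = (s[idx][:, None] * X[idx]).sum(axis=0) / (s[idx] ** 2).sum()
        # Assignment update: each respondent joins the cluster whose
        # scaled centroid S_n * m_k is closest to x_n.
        resid = X[:, None, :] - s[:, None, None] * M[None, :, :]   # (N, K, V)
        dist = (resid ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        history.append(dist.min(axis=1).sum())  # objective value per iteration
    return labels, M, history
```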

Experimental setting

I set the number of clusters $K$ to $7$.
I initialize the cluster allocation randomly.
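With the sketches above, this setting corresponds to a call like the following (again using my hypothetical names `scaled_kmeans`, `X`, and `s`):

```python
# K = 7 clusters, random initial allocation inside scaled_kmeans.
labels, M, history = scaled_kmeans(X.to_numpy(), s.to_numpy(), K=7, n_iter=50, seed=0)
```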

Results

First, I show the convergence of this method.
(Figure: converge.png — objective value versus iteration)

Like traditional k-means clustering, this method reaches a local minimum by alternately updating one group of variables while keeping the others fixed (the centroids given the allocation, then the allocation given the centroids).
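If the per-iteration objective values are recorded (the `history` list in the sketch above), a convergence plot like the figure can be reproduced roughly as follows; this is my own plotting sketch, not the original code.

```python
import matplotlib.pyplot as plt

plt.plot(history, marker="o")
plt.xlabel("iteration")
plt.ylabel("objective value")
plt.title("Convergence of the scaled k-means objective")
plt.show()
```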

Next, I show the cluster allocation.

|           | erincome=1 | erincome=2 | erincome=3 | erincome=4 | erincome=5 |
|-----------|-----------:|-----------:|-----------:|-----------:|-----------:|
| Cluster 0 |         96 |          2 |          7 |          2 |          0 |
| Cluster 1 |        649 |        193 |        307 |         10 |        705 |
| Cluster 2 |        292 |          0 |          0 |          0 |          0 |
| Cluster 3 |       1926 |          9 |          0 |          0 |          0 |
| Cluster 4 |       1946 |         53 |          3 |          0 |          0 |
| Cluster 5 |       1155 |         43 |          0 |          0 |          0 |
| Cluster 6 |        926 |        233 |        659 |         24 |       1692 |

According to this table, respondents whose "erincome" is $5$ are allocated only to Cluster 1 and Cluster 6.
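A cross-tabulation like the table above can be computed with `pandas.crosstab`, assuming `labels` and `df` from the sketches above:

```python
import pandas as pd

# Count respondents per (cluster, erincome) pair.
alloc = pd.crosstab(pd.Series(labels, name="cluster"),
                    df["erincome"].astype(int))
print(alloc)
```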

Next, I describe the features of each cluster by focusing on the coefficients ($m_{k,v}$).
(Figure: coe.png — coefficients $m_{k,v}$ for each cluster and variable)

According to this figure,
- Cluster 0: "ertsart" is large.
- Cluster 2: "erbmi", "etwgt", "euhgt" and "enwgt" are small.
- Cluster 3: "eumeant" and "eumilk" are large.
- Cluster 5: "eudrink" and "eusoda" are large.
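A coefficient figure of this kind can be drawn as a heatmap of the centroid matrix `M` from the sketches above; this is my own plotting sketch, not the original code.

```python
import matplotlib.pyplot as plt

# Heatmap of the coefficients m_{k,v}: rows are clusters, columns variables.
fig, ax = plt.subplots(figsize=(12, 4))
im = ax.imshow(M, aspect="auto", cmap="coolwarm")
ax.set_yticks(range(M.shape[0]))
ax.set_yticklabels([f"Cluster {k}" for k in range(M.shape[0])])
ax.set_xticks(range(M.shape[1]))
ax.set_xticklabels(X.columns, rotation=90, fontsize=7)
fig.colorbar(im, ax=ax, label="$m_{k,v}$")
fig.tight_layout()
plt.show()
```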

I think these features might be the secret to attaining a large value of "erincome",
although it is difficult to distinguish Clusters 1 and 6 from Cluster 4.
