Analyze nantoka data by SVM

Posted at 2016-09-30

Task 1

This task is to build a predictive model that distinguishes 0 from 1 based on an input value. The file "training-data-small.txt.bz2" contains the training dataset, which consists of tab-separated output/input pairs. Using the constructed model, generate a prediction for each line of the test dataset in "test-data-small.txt.bz2". Submit your predictions and a report explaining how the model was built.

Technique and learning process in task 1

To generate a learning model, I use the support vector machine (SVM) [1], a supervised pattern recognition technique. For the binary classification problem of this task, the K-nearest neighbor method, neural networks, and deep learning techniques (Restricted Boltzmann Machines or Deep Belief Networks) could also be applied. SVM was chosen for the following reasons:

  • Training reduces, via the method of Lagrange multipliers, to a convex quadratic programming problem, so a unique global optimal solution is obtained without getting trapped in local optima. This makes the method stable enough to be used in practice.
  • The kernel trick maps feature vectors into a high-dimensional space, which makes non-linear separation possible.
  • Soft-margin SVM can deal with training data that contains noise.

Procedure for generating the learning model with SVM

  1. Preprocessing
  2. Design of SVM
  3. Optimization and accuracy evaluation of the learning model parameters

Preprocessing

The training data consists of words, which must be converted into vectors that the SVM can accept. Each element of a feature vector is a boolean indicating whether the corresponding word is present. Since the data contains 125 distinct words, the feature space has 125 dimensions. Because the feature vectors are sparse, the feature hashing method is applied to compress the dimensionality of the feature space. The test data is vectorized in the same way.
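
As a minimal sketch of this preprocessing, the snippet below converts each comma-separated line of words into a {word: count} dictionary and hashes it into a 16-dimensional space, mirroring the work.py script at the end of this article; the sample strings are hypothetical.

from sklearn.feature_extraction import FeatureHasher

def to_feature_dict(line):
    """Turn one comma-separated line of words into a {word: count} dict."""
    words = line.rstrip('\n').split(',')
    return {w: words.count(w) for w in set(words)}

# Hypothetical samples; the real data contains 125 distinct words.
samples = ['foo,bar,baz', 'bar,bar,qux']
dicts = [to_feature_dict(s) for s in samples]

# Feature hashing compresses the sparse feature space down to 16 dimensions.
hasher = FeatureHasher(n_features=16)
X = hasher.transform(dicts)  # scipy sparse matrix of shape (2, 16)
print(X.shape)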

Design of SVM

To generate a learning model, I apply a non-linear SVM with an RBF (Radial Basis Function) kernel to the training data. The learning model includes a penalty term to prevent over-fitting. For a given set of pairs of a $d$-dimensional feature vector $\mathbf{x}_i$ and a label $y_i$, define

(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_l, y_l) \\
\forall i,\ \mathbf{x}_i \in \mathbf{R}^d,\ y_i \in \{-1, 1\},

where $l$ is the number of training samples. A soft-margin (linearly inseparable) SVM constructs the maximum-margin hyperplane, the margin being the distance between the hyperplane and the nearest feature vectors. The hyperplane satisfies

\forall i,\ g(\mathbf{x}_i) = \mathbf{w}^t\mathbf{x}_i + b \left\{
\begin{array}{ll}
\geq 1 - \xi_i & (\mathbf{x}_i \in X_1) \\
\leq -1 + \xi_i & (\mathbf{x}_i \in X_2)
\end{array}
\right.

Here $X_1$ and $X_2$ denote the samples with $y_i = 1$ and $y_i = -1$, respectively. When the data are not linearly separable, no $\mathbf{w}$ satisfies the above conditions with $\xi_i = 0$, so non-negative slack variables $\xi_i\ (i=1,\ldots,l)$ are introduced to relax the constraints. Maximizing the margin is equivalent to minimizing the norm $|\mathbf{w}|$, so the problem becomes

\min.\ \frac{1}{2}|\mathbf{w}|^2 + C\sum_{i=1}^{l} \xi_i \\
\mathrm{s.t.}\ \forall i,\ y_i(\mathbf{w}^t\mathbf{x}_i + b) - (1 - \xi_i) \geq 0,\ \xi_i \geq 0

The first term favors a large margin, and the second term penalizes feature vectors that protrude beyond the margin, where $C \in \mathbf{R}$ is a constant that balances the two terms: the smaller $C$ is, the larger the distance between the hyperplanes $g(\mathbf{x}) = \pm1$ becomes. The solution is obtained by solving the dual problem with the method of Lagrange multipliers [1]. Because this reduces to a convex quadratic programming problem, a unique global optimal solution is obtained without falling into local optima.
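Concretely, the dual problem is the following convex quadratic program in the Lagrange multipliers $\alpha_i$ (the standard soft-margin dual, written out here for reference):

\max_{\alpha}\ \sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^t \mathbf{x}_j \\
\mathrm{s.t.}\ \sum_{i=1}^{l} \alpha_i y_i = 0,\ 0 \leq \alpha_i \leq C\ (i=1,\ldots,l)

The training vectors enter this objective only through the inner products $\mathbf{x}_i^t\mathbf{x}_j$, which is exactly what the kernel substitution below exploits.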
On its own, the SVM above is a linear classifier, so it is combined with the kernel method [2]: the inner product in the minimization problem is replaced by a kernel function. In this task the RBF kernel $\exp(-\frac{1}{2\sigma^2}|\mathbf{x}_i - \mathbf{x}_j|^2) = \exp(-\gamma|\mathbf{x}_i - \mathbf{x}_j|^2)$ was used. The kernel function implicitly maps the feature vectors to a higher-dimensional space in which they are linearly separable, and this mapping allows the SVM to perform non-linear separation.
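
As a quick illustration of the relation $\gamma = \frac{1}{2\sigma^2}$ (my own sketch, not part of the submitted code), the snippet below computes the RBF kernel value for two hypothetical vectors both directly with NumPy and with scikit-learn's rbf_kernel helper:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two hypothetical feature vectors.
x_i = np.array([[1.0, 0.0, 1.0]])
x_j = np.array([[0.0, 1.0, 1.0]])

sigma = 1.0
gamma = 1.0 / (2.0 * sigma ** 2)  # gamma = 1 / (2 sigma^2)

# Manual computation: exp(-gamma * ||x_i - x_j||^2)
manual = np.exp(-gamma * np.sum((x_i - x_j) ** 2))

# scikit-learn computes the same quantity.
library = rbf_kernel(x_i, x_j, gamma=gamma)[0, 0]

print(manual, library)  # both are exp(-1) for these vectors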

Optimization and accuracy evaluation of the learning model parameters

The penalty constant $C$ and the RBF kernel parameter $\gamma \in \mathbf{R}$ are tuned by cross-validation (the K-fold method): the classification accuracy is computed for each parameter pair, and the pair with the highest accuracy is adopted. In this task, 3-fold cross-validation is used.
The results are evaluated using accuracy (the correct classification rate), i.e. the proportion of samples whose predicted label matches the true label.
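
A sketch of this parameter search with scikit-learn's GridSearchCV and 3-fold cross-validation; the feature matrix and labels below are tiny synthetic stand-ins, the C and gamma grids match the tables that follow, and the submitted work.py uses the older sklearn.grid_search module for the same purpose.

import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the hashed feature matrix and labels.
rng = np.random.RandomState(0)
X = rng.rand(60, 16)
y = (X[:, 0] > 0.5).astype(int)

param_grid = {
    'C': [2 ** i for i in range(0, 3)],                # 1, 2, 4
    'gamma': [2 ** i for i in np.arange(-1, 2, 0.5)],  # 0.5 ... 2.828
    'kernel': ['rbf'],
}

# 3-fold cross-validation; the parameter set with the highest accuracy is kept.
search = GridSearchCV(svm.SVC(), param_grid, scoring='accuracy', cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)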

Result of task 1

I measured the accuracy of the learning model to verify that SVM can be applied to the training data. The best parameters were the penalty constant $C = 2$ and the RBF kernel parameter $\gamma = 0.5$, giving an accuracy of 67.5%. The cross-validation results are shown in the following table.

C γ Accuracy
1 0.5 0.675
1 0.707 0.668
1 1.0 0.658
1 1.414 0.655
1 2.0 0.651
1 2.828 0.648
2 0.5 0.675
2 0.707 0.666
2 1 0.661
2 1.414 0.656
2 2 0.653
2 2.828 0.648
4 0.5 0.671
4 0.707 0.666
4 1 0.660
4 1.414 0.657
4 2 0.653
4 2.828 0.648

Result of predictions

The results of applying the learning model to the test data are stored in result-for-small-test-data.txt on Dropbox. Of the 100,000 test samples, 81,775 were classified as 0 and 18,225 as 1. The processing time for training and prediction was 326 seconds.

Task 2

This task is the same as Task 1 except for the dataset, which is significantly larger than the earlier one. The training dataset is in "training-data-large.txt.bz2", and the test dataset is in "test-data-large.txt.bz2".

Preprocessing for task 2

Training an SVM requires solving a quadratic programming problem, which involves a huge amount of computation and is therefore not well suited to large-scale data. Platt proposed SMO (Sequential Minimal Optimization), which repeatedly selects just two training examples at a time and updates the weights for them. In addition, for imbalanced training data, under-sampling [3] such as One-Sided Selection is often applied. In Task 2, the model is learned efficiently by sampling the data, deleting redundant samples, and compressing the dimensionality of the feature space with feature hashing.
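
As an illustration of the sampling step, here is a minimal sketch assuming that stride-based thinning plus random under-sampling of the majority class is acceptable; the submitted work2 function below uses a stride of 50 and, optionally, the UnbalancedDataset library [3]. The helper name thin_and_balance is hypothetical.

import random

def thin_and_balance(X, y, stride=50, seed=0):
    """Keep every `stride`-th sample, then randomly drop majority-class
    samples until both classes have the same size."""
    X, y = X[::stride], y[::stride]
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    minority, majority = sorted((idx0, idx1), key=len)
    random.Random(seed).shuffle(majority)
    keep = sorted(minority + majority[:len(minority)])
    return [X[i] for i in keep], [y[i] for i in keep]

# Usage: X_small, y_small = thin_and_balance(X, y, stride=50)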

Result of task 2

The best parameters were the penalty constant $C = 1$ and the RBF kernel parameter $\gamma = 2.8284$, giving an accuracy of 71.1%. The cross-validation results are shown in the following table.

C γ Accuracy
1 0.5 0.672
1 0.707 0.686
1 1.0 0.697
1 1.414 0.706
1 2.0 0.711
1 2.828 0.711
2 0.5 0.691
2 0.707 0.703
2 1 0.709
2 1.414 0.707
2 2 0.706
2 2.828 0.711
4 0.5 0.706
4 0.707 0.710
4 1 0.709
4 1.414 0.707
4 2 0.706
4 2.828 0.711

Result of predictions

The results for the test data are stored in result-for-large-test-data.txt on Dropbox. Of the 100,000 test samples, 81,775 were classified as 0 and 5,481 as 1. The processing time for training and prediction was 2,720 seconds.

Experiment environment

The experiment was conducted on an Intel Core i5 2.5 GHz CPU with 8 GB of memory running OS X, with the algorithm implemented in Python. scikit-learn [4] was used as the SVM library.

work.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(__doc__)

import numpy as np
from sklearn import svm
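# NOTE: sklearn.cross_validation and sklearn.grid_search are the module names
# provided by the scikit-learn versions available when this was written;
# scikit-learn >= 0.20 moved this functionality to sklearn.model_selection.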
from sklearn import cross_validation
from sklearn import grid_search
from sklearn.feature_extraction import DictVectorizer, FeatureHasher
import argparse
import logging
import time
import csv


def init_logger():
    global logger
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)
    log_fmt = '%(asctime)s/%(name)s[%(levelname)s]: %(message)s'
    logging.basicConfig(format=log_fmt)

def __pad_feature_vector(x):
    x = x.rstrip('\n').split(',')
    x = {i: x.count(i) for i in set(x)}
    return x

def load_train_data(file):
    X = []
    y = []
    f = open(file, 'r')
    for l in f.readlines():
        train = l.split()
        y.append(int(train[0]))
        X.append(__pad_feature_vector(train[1]))
    f.close()
    return X, y

def load_test_data(file):
    X = []
    f = open(file, 'r')
    for l in f.readlines():
        X.append(__pad_feature_vector(l))
    f.close()
    return X

def get_cross_validation_score(clf, train_features, train_labels, test_size):
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(train_features, train_labels, test_size=test_size, random_state=0)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))

def get_cross_validation_accuracy(clf, train_features, train_labels):
    scores = cross_validation.cross_val_score(clf, train_features, train_labels, cv=10)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

def get_best_estimator_by_grid_search(train_X, train_y):
    u"""grid_search
    Grid search to estimate kernel params
    """
    params = [
        {'C': [2**i for i in range(0, 3, 1)], 'gamma': [2**i for i in np.arange(-1, 2, 0.5)], 'kernel': ['rbf']},
    ]
    method = svm.SVC()
    gscv = grid_search.GridSearchCV(method, params, scoring='accuracy', n_jobs=9)
    gscv.fit(train_X, train_y)
    for params, mean_score, all_scores in gscv.grid_scores_:
        logger.info('{:.3f} (+/- {:.3f}) for {}'.format(mean_score, all_scores.std() / 2, params))
    logger.info('params:{params}'.format(params=gscv.best_params_))
    logger.info('score:{params}'.format(params=gscv.best_score_))
    # pred = gscv.best_estimator_.predict_proba(test_X)

def work2(X, y, test_X):
    from unbalanced_dataset import UnderSampler, OneSidedSelection
    sampling_interval = 50
    X = X[::sampling_interval]
    y = y[::sampling_interval]
    # dvec = DictVectorizer(sparse=True)
    # X = dvec.fit_transform(X)
    # test_X = dvec.transform(test_X)
    # X = np.array(X.toarray())
    # y = np.array(y)
    # verbose = False
    # US = UnderSampler(verbose=verbose)
    # X, y = US.fit_transform(X, y)
    # X = dvec.inverse_transform(X)
    # y = y.tolist()
    return X, y, test_X

def main(args):
    # Load training data and format training vectors X and class labels y
    X, y = load_train_data(args.train_file)
    # Load test data
    test_X = load_test_data(args.test_file)

    # For work. Under sampling
    # X, y, __test_X = work2(X, y, test_X)

    # Feature hashing
    hasher = FeatureHasher(n_features=16)
    X = hasher.transform(X)
    test_X = hasher.transform(test_X)
    logger.debug(u'Imbalanced data 0: ' + str(y.count(0)) + ' 1: ' + str(y.count(1)))

    # SVM by RBF (Gaussian) kernel with optimized C and gamma.
    # Use grid search to estimate params
    get_best_estimator_by_grid_search(X, y) 
    clf = svm.SVC(kernel='rbf', C=2, gamma=0.5)
    clf.fit(X, y)

    # Predict class labels of test data
    result = clf.predict(test_X)
    logger.debug(result)
    logger.debug('0: ' + str(list(result).count(0)) + ' 1: ' + str(list(result).count(1)))
    return result


if __name__ == '__main__':
    u"""__main__
     Usage: python work.py training-data-small.txt test-data-small.txt -o result.txt
    """
    init_logger()
    parser = argparse.ArgumentParser()
    parser.add_argument('train_file')
    parser.add_argument('test_file')
    parser.add_argument('-o', '--output', required=True)
    args = parser.parse_args()

    start_time = time.time()

    result = main(args)

    elapsed_time = time.time() - start_time
    logger.info(("{0}".format(elapsed_time)) + '[sec]')

    test_file = open(args.test_file, 'r')
    test_data = test_file.readlines()
    test_file.close()
    output_file = open(args.output, 'w')
    for i, class_label in enumerate(result):
        output_file.write(str(class_label) + '\t' + test_data[i])
    output_file.close()

Conclusions

Accuracy was only around 70% for this task. Because the nature of the data was unknown, the only preprocessing applied was dimensionality compression by the hashing trick. Performance could be improved by cleaning the data, for example by deleting unnecessary words and weighting words according to the characteristics of the data. Furthermore, with the sampling technique proposed by Kamei [5], the model could likely be trained on less data without compromising performance. On the learning-algorithm side, actively researched techniques such as transfer learning (which can help deep learning avoid falling into local optima) may also improve performance.

References


  1. E. Maeda, ``Exhilarating! Support Vector Machines: An Old but New Pattern Recognition Method,'' IPSJ Magazine (Joho Shori), Vol.42, No.7, pp.676-683, 2001 (in Japanese). 

  2. H. Takamura, ``Constructive Induction and Text Categorization with SVMs,'' J.IPS Japan, Vol.44, 2003. 

  3. UnbalancedDataset 

  4. scikit-learn 

  5. Y. Kamei, ``Applying Sampling Methods to Fault-prone Module Detection,'' J.IPS Japan, Vol.48, No.8, pp.2651-2662, 2007. 
