
I found AffineTransform code written in Swift, so I implemented heatmap processing for pose estimation

Last updated at 2023-11-03. Posted at 2023-10-08.

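When I previously implemented ViTPose for iOS, I wrote the heatmap pre-processing and post-processing so that they depended on C++ and OpenCV.
At that time I learned that dropping the OpenCV dependency and implementing everything in Swift alone would require writing my own AffineTransform.

I had no background knowledge of AffineTransform and no obvious starting point, so I left it alone for a while. After finding code written in Swift, I tried implementing it in Swift myself.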

Perspective transform from quadrilateral to quadrilateral in Swift

Rewriting heatmap processing written in C++ in Swift

The original C++ code (heatmap processing) is adapted from the PaddlePaddle code.
https://github.com/PaddlePaddle/PaddleDetection/blob/develop/deploy/lite/src/keypoint_postprocess.cc

The resulting code
https://github.com/otmb/TopDownPoseEstimation/blob/main/TopDownPoseEstimation/KeypointProcess.swift

The repository is here.

Replacing cv::getAffineTransform with a Swift function

Even a difficult-looking formula comes out like this once it is written as code.
Reference: Perspective transform from quadrilateral to quadrilateral in Swift using SIMD for matrix operations

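import simd
import CoreGraphics

// Returns the matrix M that satisfies M * from = to, where the columns of
// `from` and `to` are three points in homogeneous coordinates (x, y, 1).
func affineTransform(from: double3x3, to: double3x3) -> double3x3? {
  let invA = from.inverse
  // Treat a NaN determinant of the inverse as a failed inversion
  // (for example, when the three source points are collinear).
  if invA.determinant.isNaN {
    return nil
  }
  let M = to * invA
  return M
}

// Converts the 3x3 matrix into a CGAffineTransform.
func cgAffineTransform(from: double3x3, to: double3x3) -> CGAffineTransform? {
  guard let M = affineTransform(from: from, to: to) else {
    return nil
  }
  // The columns of M are (a, b, 0), (c, d, 0), and (tx, ty, 1).
  let (m1, m2, m3) = M.columns
  return CGAffineTransform(a: m1.x, b: m1.y, c: m2.x, d: m2.y, tx: m3.x, ty: m3.y)
}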
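
As a rough illustration of how such a function can stand in for cv::getAffineTransform, the sketch below maps three source points onto three destination points. With the points stored as columns (x, y, 1) of matrices P and Q, the transform M satisfies M * P = Q, so M = Q * P^-1. The helper names pointMatrix and solveAffine are hypothetical and not part of the article's repository, and the determinant threshold is an assumption that replaces the NaN check used above.

```swift
import simd
import CoreGraphics

// Hypothetical helper: three points become the columns (x, y, 1) of a 3x3 matrix.
func pointMatrix(_ points: [CGPoint]) -> double3x3 {
  precondition(points.count == 3, "exactly three points are required")
  let c = points.map { SIMD3<Double>(Double($0.x), Double($0.y), 1) }
  return double3x3(columns: (c[0], c[1], c[2]))
}

// Hypothetical helper: solves M * from = to and converts M to a CGAffineTransform.
func solveAffine(from: double3x3, to: double3x3) -> CGAffineTransform? {
  // Assumption: reject (near-)singular input by checking the determinant of `from`.
  guard abs(from.determinant) > 1e-12 else { return nil }
  let m = to * from.inverse
  let (c0, c1, c2) = m.columns
  return CGAffineTransform(a: CGFloat(c0.x), b: CGFloat(c0.y),
                           c: CGFloat(c1.x), d: CGFloat(c1.y),
                           tx: CGFloat(c2.x), ty: CGFloat(c2.y))
}

// Source triangle and destination triangle (x is scaled by 2 and shifted by 10,
// y is scaled by 3 and shifted by 20).
let src = pointMatrix([CGPoint(x: 0, y: 0), CGPoint(x: 1, y: 0), CGPoint(x: 0, y: 1)])
let dst = pointMatrix([CGPoint(x: 10, y: 20), CGPoint(x: 12, y: 20), CGPoint(x: 10, y: 23)])

let transform = solveAffine(from: src, to: dst)
let mapped = transform.map { CGPoint(x: 1, y: 1).applying($0) }

// Collinear source points cannot define an affine transform.
let collinear = pointMatrix([CGPoint(x: 0, y: 0), CGPoint(x: 1, y: 1), CGPoint(x: 2, y: 2)])
let failed = solveAffine(from: collinear, to: dst)
```

Here mapped should come out close to (12, 23), and failed should be nil because the three collinear points give a singular matrix.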

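Other points worth noting

If I had to single out something noteworthy in this implementation: part of the PaddlePaddle code is implemented with pointers, but in Swift it was faster to do the same processing with Accelerate, so that is how I implemented it.

PaddlePaddle code (C++)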

void get_max_preds(std::vector<float>& heatmap,
                   std::vector<int>& dim,
                   std::vector<float>& preds,
                   std::vector<float>& maxvals,
                   int batchid,
                   int joint_idx) {
  int num_joints = dim[1];
  int width = dim[3];
  std::vector<int> idx;
  idx.resize(num_joints * 2);

  for (int j = 0; j < dim[1]; j++) {
    float* index = &(
        heatmap[batchid * num_joints * dim[2] * dim[3] + j * dim[2] * dim[3]]);
    float* end = index + dim[2] * dim[3];
    float* max_dis = std::max_element(index, end);
    auto max_id = std::distance(index, max_dis);
    maxvals[j] = *max_dis;
    if (*max_dis > 0) {
      preds[j * 2] = static_cast<float>(max_id % width);
      preds[j * 2 + 1] = static_cast<float>(max_id / width);
    }
  }
}

Code rewritten in Swift

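// Requires `import Accelerate` for vDSP.
// heatmapWidth, heatmapHeight, and keypointsNumber are properties of the enclosing type.
func getMaxCoords(heatmap: [Double]) -> [MaxCoord] {
  let size = heatmapHeight * heatmapWidth

  return (0..<keypointsNumber).map { j in
    // Slice out the heatmap of the j-th keypoint.
    let idx = j * size
    let end = idx + size
    // maxIdx is relative to the start of the slice.
    let (maxIdx, maxValue) = vDSP.indexOfMaximum(heatmap[idx..<end])
    // Integer remainder and division, matching max_id % width and max_id / width in the C++ code.
    let coord = CGPoint(
      x: Double(Int(maxIdx) % heatmapWidth),
      y: Double(Int(maxIdx) / heatmapWidth))
    return MaxCoord(coord: coord, maxval: maxValue)
  }
}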
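
To make the index arithmetic concrete, here is a small self-contained sketch that runs the same kind of peak search on toy data. The heatmap values, sizes, and the names keypointCount, mapHeight, mapWidth, and peaks are made up for illustration and do not come from the repository.

```swift
import Accelerate

// Toy data (not from the article): 2 keypoints, each a 2x3 heatmap, row-major.
let keypointCount = 2
let mapHeight = 2
let mapWidth = 3
let heatmaps: [Double] = [
  0.1, 0.8, 0.2,   // keypoint 0, row 0
  0.0, 0.3, 0.1,   // keypoint 0, row 1
  0.0, 0.1, 0.2,   // keypoint 1, row 0
  0.1, 0.0, 0.7,   // keypoint 1, row 1
]

let peaks = (0..<keypointCount).map { j -> (x: Int, y: Int, value: Double) in
  let start = j * mapHeight * mapWidth
  let slice = heatmaps[start..<(start + mapHeight * mapWidth)]
  // The returned index counts from the start of the slice, not of `heatmaps`.
  let (flatIndex, value) = vDSP.indexOfMaximum(slice)
  let i = Int(flatIndex)
  // Row-major layout: column = i % width, row = i / width (integer division).
  return (x: i % mapWidth, y: i / mapWidth, value: value)
}
```

For this data the peak of keypoint 0 is at column 1, row 0, and the peak of keypoint 1 is at column 2, row 1.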

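Other: drawing

Here too, I switched from drawing polygons with OpenCV to drawing with CoreGraphics.
It is partly a matter of preference, but using CoreGraphics keeps everything in Swift, which seems better in terms of app size and ease of integration.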

Before

After

I referred to the following for the drawing code.

Detecting human body poses in an image
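
As a rough idea of what drawing with CoreGraphics alone can look like, the sketch below strokes bones and fills joints into a bitmap context. It is not the repository's drawing code: the function drawSkeleton, its parameters, the colors, and the sizes are all hypothetical. Note that a bitmap CGContext has its origin at the bottom-left, so keypoints in image coordinates may need a vertical flip first.

```swift
import CoreGraphics

// Hypothetical helper: draws bones as lines and keypoints as dots on a blank RGBA bitmap.
func drawSkeleton(width: Int, height: Int,
                  keypoints: [CGPoint], bones: [(Int, Int)]) -> CGImage? {
  guard let ctx = CGContext(
    data: nil, width: width, height: height,
    bitsPerComponent: 8, bytesPerRow: 0,
    space: CGColorSpaceCreateDeviceRGB(),
    bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue) else { return nil }

  // Bones: one line segment per pair of keypoint indices.
  ctx.setLineWidth(2)
  ctx.setStrokeColor(red: 0, green: 1, blue: 0, alpha: 1)
  for (a, b) in bones where keypoints.indices.contains(a) && keypoints.indices.contains(b) {
    ctx.move(to: keypoints[a])
    ctx.addLine(to: keypoints[b])
  }
  ctx.strokePath()

  // Joints: small filled circles centered on each keypoint.
  ctx.setFillColor(red: 1, green: 0, blue: 0, alpha: 1)
  for p in keypoints {
    ctx.fillEllipse(in: CGRect(x: p.x - 3, y: p.y - 3, width: 6, height: 6))
  }
  return ctx.makeImage()
}

// Usage with three made-up keypoints connected in a chain.
let skeletonImage = drawSkeleton(
  width: 64, height: 64,
  keypoints: [CGPoint(x: 10, y: 10), CGPoint(x: 32, y: 40), CGPoint(x: 54, y: 12)],
  bones: [(0, 1), (1, 2)])
```

In an app, the same calls would typically draw on top of the input image instead of a blank bitmap.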

Addendum

MS COCO val2017

To check that the code works correctly, I evaluated its accuracy on val2017.
Getting this level of accuracy on mobile is quite impressive, I think. Vision Transformer models are remarkable.

Models	AP
yolov7-tiny_fp16 + vitpose-b256x192_fp16.mlmodel	0.589
yolov7-tiny_fp16 + vitpose_s256x192_wholebody_fp16.mlmodel	0.579
yolov7-tiny_fp16 + vitpose_b256x192_wholebody_fp16.mlmodel	0.600

Reference: SwiftでViTPoseのランタイムを書いたので精度を確認する (I wrote a ViTPose runtime in Swift, so I checked its accuracy)
