More than 3 years have passed since last update.

iOS でのGPUプログラムの並列処理の単位

Last updated at 2021-08-13Posted at 2021-07-22

　iOS+Metal+コンピュートシェーダーで配列を処理する際のGPU上の処理の実行単位を調べてました。GPUでは並列で処理を行うため、どういった単位で処理を進めるのか、を本記事では説明します。

１. 並列処理の単位

　GPUは並列でのデータ処理が得意です。
　得意ではあるものの、例えば写真の明るさやコントラストの変換処理を行う場合、１枚の写真を構成する数十万を超える画素すべてについて、同時並行に処理をするような能力はありません。
　では、どの単位で並列に処理をするかというと、画像であればその画素の配列を細分化して、部分部分で並列処理をすすめます。イメージとしては下図の一番右側 SIMD Group という(緑色、橙色の)単位です。

Grid ※写真１枚に相当	Thread Groups	SIMD Groups

処理対象の配列全体が Grid で、それを分割したのがThread Groups で、それをさらに分割したのがSIMD Groups です。
それぞれについてもう少し詳しく説明します。

Grid
- 処理する対象の配列の要素全てを表す。処理対象が画像であれば画像全体の画素を含む。
- 2次元配列のほかに、1次元配列、3次元配列での処理も可。
- Grid は Thread Group の集まり。「Thread Groupの集まり」なので、ピッタリと画像のサイズと一致しないケースがありうる（後述）
Thread Groups
- GPUの計算ユニット(compute unit)に処理させる単位。GPUが複数の計算ユニットを持っていれば¹、複数のスレッドグループを複数の計算ユニットで同時に実行可能²。
- スレッドグループ内でデータ共有ができる。GPU内の高速なメモリに中間結果を格納・共有できるので、システムメモリに対するロード/ストアよりも高速に処理を進められる。
- スレッドグループ内のスレッドは同期をとって処理を進めることができる。各スレッドの中間処理結果がスレッドグループのメモリに格納されていれば、そのデータを使った処理が可能。
- 1つのThread Groupに含められるThread数には上限があり、例えばiPhone12Proであれば 1,024個まででプログラマが指定する必要がある（後述するように、GPUの仕様として1,024個までだが、実際に指定できるスレッド数は実行状態に依存）
- 2次元配列のほかに、1次元配列、3次元配列での処理も可。
- Thread Group は SIMD Group の集まり。
SIMD Groups
- 同時に実行される Thread の集まり。
- SIMD Group のスレッドの数は手元にあるiPhone6s、iPhone8plus、iPhoneXR、iPhone12Proで確認したところ、いずれも32個。言い換えると32スレッドで1つのSIMD Groupとなる³。
- SIMD Groupは NVIDIAの「ワープ」、AMDの「ウェーブフロント」と同等の位置付け⁴。ググると、ワープは32スレッド、ウェーブフロントは16〜64スレッドを単位としているようなので、スレッド数についてはAppleも似た感じといえそう。
Thread
- 並列処理の最小単位

２. Thread Groupのサイズ

2-1) Thread Groupのスレッド数の上限

Thread Groupのスレッド数はプログラマが指定する必要があります。
Thread Groupのスレッド数は、1次元、2次元、3次元で指定できますが、1つのThread Groupに設定できるスレッド数には上限があります。

Appleのドキュメント「Metal Feature Set Tables」⁵をみると下表のように、GPUの種類毎に上限が決まっています。ただし、実行時のメモリ等、GPUの状態により上限が変わることがあるため、下表の値をそのまま使うことはできません。

GPU	Maximum threads per threadgroup
A7〜A10(iPhone5s〜iPhone7)	512
A11(iPhone8〜)	1,024

では、どこからThread Groupのスレッド数の上限を取得できるかというとコンピュートシェーダーの MTLComputePipelineState の maxTotalThreadsPerThreadgroup から取得します。

下表は後述するかなりシンプルなプログラムの実行時に取得できた maxTotalThreadsPerThreadgroup の値です。

GPU	OS	maxTotalThreadsPerThreadgroup
A9(iPhone6s)	iOS14.3	512
A11(iPhone8plus)	iOS13.7	1,024
A12(iPhoneXR)	iOS14.6	1,024
A14(iPhone12Pro)	iOS14.6	1,024

シンプル・軽量な処理なので、GPUの仕様通りのスレッド数になったと思われます。

2-2) Thread Groupのサイズ

　1つのThread Groupのサイズは1〜3次元で指定できます。ここでは画像を処理するケースを想定し2次元の場合を考えます。
サイズを決めるのに必要な情報は次の２つです。

①SIMD Groupのサイズ（例. 32)
②Thread Groupのスレッド数の上限（例. 512、1,024)

同時並行に実行できるスレッド数が①なので、これを2次元配列の横方向のサイズとします。
とすると、縦方向のサイズは「② ÷ ①」となります。Appleのドキュメントではこのサイズの決定方法が最適とされています⁶。

ここでSIMD Groupのサイズは maxTotalThreadsPerThreadgroupと同様に、MTLComputePipelineState の threadExecutionWidth プロパティから取得します。

2-3) Grid内のThread Groupの数

　実際にMetalでのThread Groupのサイズの設定は、MTLComputeCommandEncoder の次のどちらかのメソッドを使います。

#	MTLComputeCommandEncoder のメソッド	A11未満	A11以降 (non-uniform対応)
１	dispatchThreads( _ threadsPerGrid: MTLSize, threadsPerThreadgroup: MTLSize)	利用不可	利用可
２	dispatchThreadgroups( _ threadgroupsPerGrid: MTLSize, threadsPerThreadgroup: MTLSize)	利用可	利用可

　１はGPUがNon-uniform threadgroup size ⁵の機能を備えている必要があるため少し古い機種だと利用できないです。
　１、２ともにThread Group内のスレッドのサイズthreadsPerThreadgroup を指定するのは共通ですが、１はGridの中の配列の要素数threadsPerGrid を指定するのに対し、２はGridの中のThread Groupの数threadgroupsPerGrid を指定します。
したがって、２を利用する場合には次のように threadgroupsPerGrid を計算する必要があります。

let w = pipelineState.threadExecutionWidth
let h = pipelineState.maxTotalThreadsPerThreadgroup / w
// textureは処理対象の画像が格納されている前提
let threadgroupsPerGrid = MTLSize(width: (texture.width + w - 1) / w,
                                  height: (texture.height + h - 1) / h,
                                  depth: 1)

　２の場合はThread Groupの数を指定しており、GridのサイズがThread Groupの数で割り切れない場合、余りの部分のスレッド（GPUリソース）が無駄になります。これに対し１の場合はMetalによって、余りが生じないようにThread Groupが調整されるためパフォーマンスが向上します⁷。

　なお、Thread Group内のスレッドが常に threadExecutionWidth の倍数であることが保証できる場合、MTLComputePipelineDescriptor の threadGroupSizeIsMultipleOfThreadExecutionWidth にtrueを設定することでパフォーマンスが向上する場合があります⁸。

３. 動作確認

次の画像(80x70px)をインプットとして単純な色変換処理を行い、実際に処理単位を確認してみます。

Thread Groupの処理範囲

Grid内のThread Groupの位置を市松模様のように１つ飛ばしで、画素の白黒を反転させた場合。

iPhone12Pro	iPhone6s

Thread Group Size 32x32(= 1,024)	Thread Group Size 32x16(= 512)

想定した通りmaxTotalThreadsPerThreadgroupの違いによるThread Groupのサイズの差異が確認できます。

SIMD Groupの処理範囲

SIMD Groupの位置（インデックス）が奇数の場合に白黒を反転させた場合。

SIMD Groupは縦方向にインデックスが増えるので（横幅は１つしかないので）、縦方向に交互に変換できていることが確認できます。
【追記】横65〜80pxの16pxの部分は2行毎に変換されています。１つのSIMD Groupは32スレッドなので、スレッドが遊ばないように調整されているようです。

Threadの処理範囲

SIMD Group内のThreadの位置（インデックス）が８で割り切れない場合に白黒を反転させた場合。

想定した通りの結果になりました。

　ちなみに、MSL側でThread Group内のSIMD Groupの位置を取得するのに[[simdgroup_index_in_threadgroup]]を、SIMD Group内のThreadの位置を取得するのに、[[thread_index_in_simdgroup]]を利用したのですが、iPhone6s/iPhone8plusでは動作しませんでした（「MTLCompiler: Compilation failed with XPC_ERROR_CONNECTION_INTERRUPTED on 3 try」エラーになる）。iPhoneXR で取得可能であることは確認しており、iPhone8plusとiPhoneXR（A11とA12）でなにかしらの違いがあると思うのですが、Metal Feature Set Tables⁵ をみても該当しそうな箇所は見つかりませんでした。

４. 全体プログラム

ViewController

class ViewController: UIViewController {

    @IBOutlet weak var mtkView: MTKView!

    var renderer: Renderer!

    override func viewDidLoad() {
        super.viewDidLoad()
        
        mtkView.device = MTLCreateSystemDefaultDevice()
        renderer = Renderer(mtkView: mtkView)
        renderer.mtkView(mtkView, drawableSizeWillChange: mtkView.drawableSize)
        
        mtkView.delegate = renderer
    }
}

Renderer.swift

import MetalKit
import Metal

fileprivate struct Const {
    static let maxBuffersInFlight: Int = 3
    // 処理するテクスチャのサイズ
    static let textureWidth = 80
    static let textureHeight = 70
    static let textureName = "tex80x70"
    // Uniformバッファは256バイトの倍数にする（今回のサンプルでは無用。。。）
    static let alignedUniformsSize = (MemoryLayout<Uniforms>.size & ~0xFF) + 0x100
    // Metal 画像頂点座標(x,y), uv座標(u,v)
    static let kImagePlaneVertexData: [Float] = [
        -1.0, -1.0, 0.0, 1.0,
         1.0, -1.0, 1.0, 1.0,
        -1.0,  1.0, 0.0, 0.0,
         1.0,  1.0, 1.0, 0.0
    ]
}

class Renderer: NSObject {
    private let device: MTLDevice!
    private weak var mtkView: MTKView!
    private var commandQueue: MTLCommandQueue!
    // バッファ
    private var dynamicUniformBuffer: MTLBuffer!
    private var vertexBuffer: MTLBuffer!
    private var inputTexture: MTLTexture!
    private var outputTexture: MTLTexture!
    // 同時コマンド送信制御
    private let inFlightSemaphore = DispatchSemaphore(value: Const.maxBuffersInFlight)
    // １つめのパス
    private var textureRenderPipeline: MTLComputePipelineState!
    // ２つ目のパス
    private var drawableRenderPipeline: MTLRenderPipelineState!
    // 画面描画用
    private let renderPassDescriptor = MTLRenderPassDescriptor()
    // Uniform使用状態
    private var uniformBufferOffset = 0
    private var uniformBufferIndex = 0
    private var uniforms: UnsafeMutablePointer<Uniforms>!
    // シェーダーに渡す情報
    private var aspectRatio: Float = 0.0

    init(mtkView: MTKView) {
        self.mtkView = mtkView
        self.device = mtkView.device
        
        super.init()
        
        setupViews()
        makeBuffers()
        makePipelines()
        makeRenderPassDescriptor()
    }
}

// MARK: - private

private extension Renderer {
    func setupViews() {
        mtkView.backgroundColor = UIColor.clear
        mtkView.preferredFramesPerSecond = 60
        mtkView.clearColor = .init(red: 1, green: 1, blue: 1, alpha: 1)
    }
    
    func makeBuffers() {
        // 頂点座標バッファ確保＆頂点情報流し込み
        let vSize = Const.kImagePlaneVertexData.count * MemoryLayout<Float>.size
        vertexBuffer = device.makeBuffer(bytes: Const.kImagePlaneVertexData, length: vSize, options: [])
        
        // テクスチャを読み込む
        // see: https://developer.apple.com/forums/thread/121269
        let textureLoader = MTKTextureLoader(device: device)
        let textureUsage = MTLTextureUsage.shaderRead.rawValue | MTLTextureUsage.shaderWrite.rawValue
        let textureLoaderOptions = [
            MTKTextureLoader.Option.textureUsage: NSNumber(value:textureUsage),
            MTKTextureLoader.Option.textureStorageMode: NSNumber(value: MTLStorageMode.private.rawValue)
        ]

        guard let imageFilePath = Bundle.main.url(forResource: Const.textureName, withExtension: "png") else { fatalError("image file not found.") }
        do {
            inputTexture = try textureLoader.newTexture(URL: imageFilePath, options: textureLoaderOptions)
        } catch {
            let err = error as NSError
            fatalError("texture load error. [\(err.code)] [\(err.domain)]  [\(err.userInfo)] [\(err.localizedDescription)]")
        }

        // 画像変換後のテクスチャのバッファを確保
        let colorDesc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba8Unorm,
                                                                 width: Const.textureWidth,
                                                                 height: Const.textureHeight,
                                                                 mipmapped: false)
        colorDesc.usage = [.shaderRead, .shaderWrite]
        outputTexture = device.makeTexture(descriptor: colorDesc)
        
        // Uniformバッファを確保
        let uniformBufferSize = Const.alignedUniformsSize * Const.maxBuffersInFlight
        guard let buffer = device.makeBuffer(length: uniformBufferSize, options: [MTLResourceOptions.storageModeShared]) else { fatalError() }
        dynamicUniformBuffer = buffer
        uniforms = UnsafeMutableRawPointer(dynamicUniformBuffer.contents()).bindMemory(to: Uniforms.self, capacity: 1)
    }
    
    func makePipelines() {
        commandQueue = device.makeCommandQueue()
        guard let library = device.makeDefaultLibrary() else { fatalError() }
        
        // コンピュートシェーダー
        let shadarName = "convert"
        guard let computeShader = library.makeFunction(name: shadarName) else {
            fatalError("shader not found.")
        }
        do {
            textureRenderPipeline = try self.device.makeComputePipelineState(function: computeShader)
            print("makeComputePipelineState: maxTotalThreadsPerThreadgroup [\(textureRenderPipeline.maxTotalThreadsPerThreadgroup)] threadExecutionWidth [\(textureRenderPipeline.threadExecutionWidth)]")
        } catch {
            fatalError("makeComputePipelineState() failed. error[\(error.localizedDescription)]")
        }
        
        // 画面描画シェーダー
        let descriptor = MTLRenderPipelineDescriptor()
        descriptor.vertexFunction = library.makeFunction(name: "vertexShader")
        descriptor.fragmentFunction = library.makeFunction(name: "fragmentShader")
        descriptor.colorAttachments[0].pixelFormat = mtkView.colorPixelFormat
        descriptor.vertexBuffers[0].mutability = .immutable
        do {
            drawableRenderPipeline = try self.device.makeRenderPipelineState(descriptor: descriptor)
        } catch {
            fatalError("makeRenderPipelineState() failed. error[\(error.localizedDescription)]")
        }
    }
    
    func makeRenderPassDescriptor() {
        renderPassDescriptor.colorAttachments[0].loadAction = .clear
        renderPassDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(0.0, 0.0, 0.0, 1.0)
    }
}

// MARK: - MTKViewDelegate

extension Renderer: MTKViewDelegate {
    func mtkView(_ view: MTKView, drawableSizeWillChange size: CGSize) {
        aspectRatio = Float(size.height) / Float(size.width)
    }
    
    func draw(in view: MTKView) {
        // GPUへの並行コマンド送信制御（今回のサンプルでは無用。。。）
        _ = inFlightSemaphore.wait(timeout: DispatchTime.distantFuture)
        
        let commandBuffer = commandQueue.makeCommandBuffer()!
        commandBuffer.addCompletedHandler { [weak self] commandBuffer in
            if let self = self {
                self.inFlightSemaphore.signal()
            }
        }
        
        // スレッドが指すUniformバッファの場所（オフセット）を決める（今回のサンプルでは大袈裟。。。）
        uniformBufferIndex = (uniformBufferIndex + 1) % Const.maxBuffersInFlight
        uniformBufferOffset = Const.alignedUniformsSize * uniformBufferIndex
        // Uniformバッファに画面のアスペクトレシオを設定
        uniforms = UnsafeMutableRawPointer(dynamicUniformBuffer.contents() + uniformBufferOffset).bindMemory(to: Uniforms.self, capacity: 1)
        uniforms[0].aspectRatio = aspectRatio
        
        // コンピュートコマンド生成
        encodeComputeCommand(commandBuffer)
        // 画面描画コマンド生成
        encodeRenderCommand(commandBuffer)
        
        // 実行
        guard let drawable = mtkView.currentDrawable else { return }
        commandBuffer.present(drawable)
        commandBuffer.commit()
    }
    
    func encodeComputeCommand(_ commandBuffer: MTLCommandBuffer) {
        // コマンド送信
        let computeEncoder = commandBuffer.makeComputeCommandEncoder()!
        computeEncoder.setComputePipelineState(textureRenderPipeline)
        computeEncoder.setTexture(inputTexture, index: 0)
        computeEncoder.setTexture(outputTexture, index: 1)
        
        let threadWidth = textureRenderPipeline.threadExecutionWidth
        let threadHeight = textureRenderPipeline.maxTotalThreadsPerThreadgroup / threadWidth
        let threadsPerThreadgroup = MTLSizeMake(threadWidth, threadHeight, 1)
        let threadsPerGrid = MTLSizeMake(Const.textureWidth, Const.textureHeight, 1)
        
        if device.supportsFeatureSet(.iOS_GPUFamily4_v2) {
            // A11以降で　non-uniform　に対応している
            computeEncoder.dispatchThreads(threadsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
        } else {
            let threadgroupsPerGrid = MTLSizeMake((Const.textureWidth + threadWidth - 1) / threadWidth,
                                                  (Const.textureHeight + threadHeight - 1) / threadHeight,
                                                  1)
            computeEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
        }
        computeEncoder.endEncoding()
    }
    
    func encodeRenderCommand(_ commandBuffer: MTLCommandBuffer) {
        
            renderPassDescriptor.colorAttachments[0].texture = mtkView.currentDrawable?.texture
        guard let renderEncoder = commandBuffer.makeRenderCommandEncoder(descriptor: renderPassDescriptor) else { return }
        
        renderEncoder.setRenderPipelineState(drawableRenderPipeline)
        renderEncoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)
        renderEncoder.setVertexBuffer(dynamicUniformBuffer, offset: uniformBufferOffset, index: 1)
        if let texture = outputTexture {
            renderEncoder.setFragmentTexture(texture, index: 0)
        }
        renderEncoder.drawPrimitives(type: .triangleStrip, vertexStart: 0, vertexCount: 4)
        renderEncoder.endEncoding()
    }
}

Shaders.metal

# include <metal_stdlib>
using namespace metal;

# include "ShaderTypes.h"

typedef struct {
    float2 position;
    float2 texCoord;
} Vertex;

typedef struct {
    float4 position [[position]];
    float2 texCoord;
} ColorInOut;

constexpr sampler s = sampler(coord::normalized,
                              address::clamp_to_zero,
                              filter::nearest);

kernel void convert(texture2d<half, access::read> input [[ texture(0) ]],
                     texture2d<half, access::write> output [[ texture(1) ]],
                     uint2 position [[thread_position_in_grid]],
                     uint2 group_pos [[threadgroup_position_in_grid]],
                     uint simd_group_index [[simdgroup_index_in_threadgroup]],  // iPhone6s, iPhone8plusでは取得不可
                     uint thread_index [[thread_index_in_simdgroup]]) // iPhone6s, iPhone8plusでは取得不可
                     {
    
    // non-uniform未対応のGPUの場合、処理対象外の領域についてもスレッドが呼び出されてしまうので、その場合には処理しないようにする。
    // see: https://developer.apple.com/documentation/metal/calculating_threadgroup_and_grid_sizes#2922042
    if (position.x >= input.get_width() || position.y >= input.get_height()) {
        return;
    }

    half4 input_color = input.read(position);
    half4 output_color;
    
    if (fmod(float(group_pos.x + group_pos.y), 2) == 0) { // Thread Group毎に色変換
//    if (fmod(float(simd_group_index), 2) == 0) {    // SIMD-Group毎に色変換
//    if (fmod(float(thread_index), 8) == 0) {    // SIMD-Group内のThread毎に色変換
        output_color = input_color;
    } else {
        // 白黒を反転
        output_color = input_color - 1.0;
        output_color = abs(output_color);
        output_color.a = 1.0;
    }
    
    output.write(output_color, position);
}

vertex ColorInOut vertexShader(const device Vertex *vertices [[ buffer(0) ]],
                               constant Uniforms &uniforms [[ buffer(1) ]],
                              unsigned int vid [[ vertex_id ]]) {
    ColorInOut out;
    const device Vertex& vert = vertices[vid];
    
    out.position = vector_float4(0.0, 0.0, 0.0, 1.0);

    if (uniforms.aspectRatio > 1.0) {
        out.position.x = vert.position.x;
        out.position.y = vert.position.y / uniforms.aspectRatio;
    } else  {
        out.position.x = vert.position.x * uniforms.aspectRatio;
        out.position.y = vert.position.y;
    }
    
    out.texCoord = vert.texCoord;
    return out;
}

fragment float4 fragmentShader(ColorInOut in [[ stage_in ]],
                              texture2d<float> texture [[texture(0)]]) {
    float4 colorSample = texture.sample(s, in.texCoord);
    colorSample.w = 1.0;
    return colorSample;
}

iPhoneがいくつの計算ユニットをもっているのか調べてもわからなかった。A12以降はGPUコアは4つあるしそれ以前のチップもGPUは複数ある。1コアにも複数の計算ユニットがあると思う。Apple M1であればGPUは『8 cores 128 execution units』と発表されている。この execution units が MSLのマニュアルにあるcompute unit に相当なのか、1コアあたり16のexecution unitsなのか、iPhoneも似た感じなのか？ ↩
Metal Shading Language Specification 4.4 threadgroup Address Space ↩
A14を前提としたAppleの記事 Tech Talks: Discover Metal enhancements for A14 Bionic では「Threadgroups are further organized into groups of 32 threads that are called SIMD groups. 」と言及している。 ↩
Appleの記事 Creating Threads and Threadgroups 「The threads in a threadgroup are further organized into single-instruction, multiple-data (SIMD) groups, also known as warps or wavefronts, that execute concurrently. 」 ↩
Metal Feature Set Tables ↩ ↩² ↩³
Calculating Threadgroup and Grid Sizes ↩
WWDC2017 Video Introducing Metal 2 ↩
API: threadGroupSizeIsMultipleOfThreadExecutionWidth ↩

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up