More than 1 year has passed since last update.

Juliaでのスレッド並列：FLoops.jlの紹介

Last updated at 2023-02-23Posted at 2023-02-23

Juliaでは並列計算がサポートされています。以前、Juliaでのスレッド並列を調べてみた
の記事でスレッド並列を調べていましたが、実用的な問題の場合にどうすればいいのか、よくわかっていませんでした。ここでは、FLoops.jlを使うと簡単にforループをスレッド並列化できることを示します。なお、Threads.@threadsによる並列化は、ループが短く何度も呼ぶ場合には、1スレッドの時に並列化しない場合より遅くなることがありますが、@floopではそうなりませんでした。そのような意味でも@floopがおすすめです。

環境

versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 8 virtual cores
Environment:
  JULIA_SSL_NO_VERIFY_HOSTS = github.com
  LD_LIBRARY_PATH = :/usr/local/lib/
  DYLD_LIBRARY_PATH = /opt/intel_org/oneapi/mkl/2021.3.0/lib

コード例

以下のようなコードを考えます。

using BenchmarkTools
using FLoops
using Random

function partial_tr(A::Array{T,3},NC,NV) where {T <: Number} 
    s = 0.0
    for i=1:NV
        @inbounds for k=1:NC
            s += real(A[k,k,i])
        end
    end
    return s/(NV*NC^2)
end

function update!(C,s,t,B,NV,NC)
    for i=1:NV
        for k2=1:NC
            @inbounds for k1=1:NC            
                C[k1,k2,i] += exp(im*B[k1,k2,i]*t*s)
            end
        end
    end
end

function timedev(A,B,itemax)
    NC,_,NV = size(A)
    st = 0.0im
    C = zeros(ComplexF64,NC,NC,NV)
    C.= A
    ts = range(0,1,length=itemax)
    for j=1:itemax
        s = partial_tr(C,NC,NV)/j
        t = ts[j]
        update!(C,s,t,B,NV,NC)
        st += s
    end
    return st
end

function test()
    NC = 3
    NV = 12^4
    Random.seed!(123)
    A = rand(Float64,NC,NC,NV)
    B = abs.(rand(ComplexF64,NC,NC,NV))
    A .= A.*B
    itemax = 300
    
    println("serial ")
    st = timedev(A,B,itemax)
    println("result: ",st/itemax)
    @btime st = timedev($A,$B,$itemax)

end
test()

これは、なんらかの時間発展を想定しています。timedev内のjループはj=1から順番に処理しなければならず、この部分は並列化できないとします。なお、こちら（英語）にも似たような問題が話されています。
コードを実行すると、

serial 
result: 0.32769669929382095 + 0.0im
  540.230 ms (2 allocations: 2.85 MiB)

という結果が出ます。

Threadsの使用

まず初めに、JuliaのドキュメントにあるThreads.@threadsを試してみます。
コードは

function partial_tr_threads(A::Array{T,3},NC,NV) where {T <: Number} 
    acc = Threads.Atomic{Float64}(0.0);
    Threads.@threads for i=1:NV
        @inbounds for k=1:NC
            Threads.atomic_add!(acc, real(A[k,k,i]))
            #s += A[k,k,i]
        end
    end
    s = acc[]
    return  s/(NV*NC^2)
end

function update_threads!(C,s,t,B,NV,NC)
    Threads.@threads for i=1:NV
        for k2=1:NC
            @inbounds for k1=1:NC
                C[k1,k2,i] += exp(im*B[k1,k2,i]*t*s) 
            end
        end
    end
end

function timedev_threads(A,B,itemax)
    NC,_,NV = size(A)
    C = zeros(ComplexF64,NC,NC,NV)
    C.= A
    st = 0.0im
    ts = range(0,1,length=itemax)
    for j=1:itemax
        s = partial_tr_threads(C,NC,NV) /j
        #st += exp(im*s)
        t = ts[j]
        update_threads!(C,s,t,B,NV,NC)
        st += s
    end
    return st
end

function test()
    NC = 3
    #NV = 12^4
    NV = 12^4
    Random.seed!(123)
    A = rand(Float64,NC,NC,NV)
    B = abs.(rand(ComplexF64,NC,NC,NV))
    A .= A.*B
    itemax = 300
    
    println("serial ")
    st = timedev(A,B,itemax)
    println("result: ",st/itemax)
    @btime st = timedev($A,$B,$itemax)
    
    println("with @threads ")
    println("num. of threads: ",Threads.nthreads())
    st = timedev_threads(A,B,itemax)
    println("result with @threads: ",st/itemax)
    @btime st = timedev_threads($A,$B,$itemax)


end
test()

です。二つの関数のforループにThreads.@threadsをつけました。和をとる方の関数では、s += という形でやるとそれぞれのスレッドのアクセスが競合するため、Threads.atomic_add!(acc, real(A[k,k,i]))というものをしていますが、これはs+= real(A[k,k,i])のスレッド並列版だと思ってください。

結果は、

julia -t 1 thread.jl
serial 
result: 0.32769669929382095 + 0.0im
  540.230 ms (2 allocations: 2.85 MiB)
with @threads 
num. of threads: 1
result with @threads: 0.32769669929382095 + 0.0im
  739.145 ms (3902 allocations: 3.19 MiB)

となります。並列化していない1スレッドの時に、普通にやるより遅くなっています。
2並列でやると

julia -t 2 thread.jl
serial 
result: 0.32769669929382095 + 0.0im
  528.284 ms (2 allocations: 2.85 MiB)
with @threads 
num. of threads: 2
result with @threads: 0.32769669929382117 + 0.0im
  941.464 ms (8336 allocations: 3.53 MiB)

と、1スレッドの時より遅くなりました。何が起きているのかよくわかりません。
4並列でやると

serial 
result: 0.32769669929382095 + 0.0im
  537.703 ms (2 allocations: 2.85 MiB)
with @threads 
num. of threads: 4
result with @threads: 0.3276966992938214 + 0.0im
  1.233 s (15873 allocations: 4.18 MiB)

となりもっと遅くなりました。何が起きているのか、本当によくわかりません。スレッド並列の方法を間違えているのでしょうか？
Atomicを使ったのが遅い可能性もあるので、

function partial_tr_threads(A::Array{T,3},NC,NV,temp) where {T <: Number} 
    Threads.@threads for i=1:NV
        @inbounds for k=1:NC
            temp[(i-1)*NC+k] = real(A[k,k,i])
            #Threads.atomic_add!(acc, real(A[k,k,i]))
            #s += A[k,k,i]
        end
    end
    s = sum(temp)
    return  s/(NV*NC^2)
end

function timedev_threads(A,B,itemax)
    NC,_,NV = size(A)
    C = zeros(ComplexF64,NC,NC,NV)
    C.= A
    st = 0.0im
    ts = range(0,1,length=itemax)
    temp = zeros(NC*NV)
    for j=1:itemax
        s = partial_tr_threads(C,NC,NV,temp) /j
        #st += exp(im*s)
        t = ts[j]
        update_threads!(C,s,t,B,NV,NC)
        st += s
    end
    return st
end

と変更してみました。1スレッドの時は

julia -t 1 thread.jl
serial 
result: 0.32769669929382095 + 0.0im
  540.105 ms (2 allocations: 2.85 MiB)
with @threads 
num. of threads: 1
result with @threads: 0.32769669929382117 + 0.0im
  563.366 ms (3604 allocations: 3.66 MiB)

となり、いい感じになりました。これならよさそうです。
4並列だと

julia -t 4 thread.jl
serial 
result: 0.32769669929382095 + 0.0im
  526.584 ms (2 allocations: 2.85 MiB)
with @threads 
num. of threads: 4
result with @threads: 0.32769669929382117 + 0.0im
  166.316 ms (15543 allocations: 4.65 MiB)

でちゃんと速くなっています。オーバーヘッドがどのくらいあるのか確かめるために、itemax=1000と変更してみると、

julia -t 4 thread.jl
serial 
result: 0.33044274593955386 + 0.0im
  1.795 s (2 allocations: 2.85 MiB)
with @threads 
num. of threads: 4
result with @threads: 0.3304427459395537 + 0.0im
  562.210 ms (51852 allocations: 7.74 MiB)

となっています。悪くはなさそうです。最後に、行列サイズが小さい場合にはどうなるでしょうか？NV=4^4、itemax = 10000としてみます。1スレッドだと

julia -t 1 thread.jl
serial 
result: 0.331733478989352 + 0.0im
  220.750 ms (2 allocations: 36.06 KiB)
with @threads 
num. of threads: 1
result with @threads: 0.3317334789893521 + 0.0im
  459.525 ms (120003 allocations: 11.18 MiB)

となり、1スレッドの時に並列化しない場合よりも遅くなってしまっています。2スレッドだと

julia -t 2 thread.jl
serial 
result: 0.331733478989352 + 0.0im
  220.617 ms (2 allocations: 36.06 KiB)
with @threads 
num. of threads: 2
result with @threads: 0.3317334789893521 + 0.0im
  352.282 ms (260180 allocations: 22.32 MiB)

となります。少し速くなりましたが、まだ並列化しないほうが速いです。行列が小さすぎる時はオーバーヘッドがあるのは当然のような気もします。中間サイズ、NV = 8^4、itemax=3000の場合には、
1スレッド

julia -t 1 thread.jl
serial 
result: 0.33138648932804204 + 0.0im
  1.066 s (2 allocations: 576.06 KiB)
with @threads 
num. of threads: 1
result with @threads: 0.331386489328042 + 0.0im
  1.173 s (36004 allocations: 4.00 MiB)

2スレッド

julia -t 2 thread.jl
serial 
result: 0.33138648932804204 + 0.0im
  1.061 s (2 allocations: 576.06 KiB)
with @threads 
num. of threads: 2
result with @threads: 0.331386489328042 + 0.0im
  702.595 ms (79677 allocations: 7.39 MiB)

4スレッド

julia -t 4 thread.jl
serial 
result: 0.33138648932804204 + 0.0im
  1.065 s (2 allocations: 576.06 KiB)
with @threads 
num. of threads: 4
result with @threads: 0.331386489328042 + 0.0im
  353.716 ms (154818 allocations: 13.90 MiB)

のような感じになりました。

結論

Threads.@threadsは小さなループを並列化して大量に呼び出す場合には、1スレッドで遅くなる場合がある
そもそも小さなループは並列化に向いていないのでそのような場合はスレッドを増やさない方がいいが、1スレッドの場合に遅くなるThreads.@threadsはやはり向いていない。

FLoops.jlの使用

FLoops.jlを使ってみます。コードは

using BenchmarkTools
using FLoops
using Random

function partial_tr_floop(A::Array{T,3},NC,NV) where {T <: Number} 
    @floop for i=1:NV
        @inbounds for k=1:NC
            @reduce s += real(A[k,k,i])
        end
    end
    return  s/(NV*NC^2)
end

function update_floop!(C,s,t,B,NV,NC)
    @floop for i=1:NV
        for k2=1:NC
            @inbounds for k1=1:NC
                C[k1,k2,i] += exp(im*B[k1,k2,i]*t*s)
            end
        end
    end
end

function timedev_floop(A,B,itemax)
    NC,_,NV = size(A)
    C = zeros(ComplexF64,NC,NC,NV)
    C.= A
    st = 0.0im
    ts = range(0,1,length=itemax)
    for j=1:itemax
        s = partial_tr_floop(C,NC,NV) /j
        t = ts[j]
        update_floop!(C,s,t,B,NV,NC)
        st += s
    end
    return st
end

function test()
    NC = 3
    NV = 12^4
    Random.seed!(123)
    A = rand(Float64,NC,NC,NV)
    B = abs.(rand(ComplexF64,NC,NC,NV))
    A .= A.*B
    itemax = 1000
    
    println("serial ")
    st = timedev(A,B,itemax)
    println("result: ",st/itemax)
    @btime st = timedev($A,$B,$itemax)
    
    println("with @threads ")
    println("num. of threads: ",Threads.nthreads())
    st = timedev_threads(A,B,itemax)
    println("result with @threads: ",st/itemax)
    @btime st = timedev_threads($A,$B,$itemax)

    println("num. of threads: ",Threads.nthreads())
    st = timedev_floop(A,B,itemax)
    println("result with @floop: ",st/itemax)
    @btime st = timedev_floop($A,$B,$itemax)

end
test()

です。@floopをfor文につけるだけでスレッド並列化してくれます。和をとる場合には、@reduceとつけるだけです。reduceが使えるのでOpenMPっぽい感じに使えそうです。1スレッドの場合、

julia -t 1 thread.jl
serial 
result: 0.33044274593955386 + 0.0im
  1.799 s (2 allocations: 2.85 MiB)
with @threads 
num. of threads: 1
result with @threads: 0.3304427459395537 + 0.0im
  1.865 s (12004 allocations: 4.44 MiB)
num. of threads: 1
result with @floop: 0.33044274593955386 + 0.0im
  1.891 s (15491 allocations: 3.18 MiB)

ん？若干遅いですね。Threads.@threadsと同じようにtempを使いますと、

julia -t 1 thread.jl
serial 
result: 0.33044274593955386 + 0.0im
  1.789 s (2 allocations: 2.85 MiB)
with @threads 
num. of threads: 1
result with @threads: 0.3304427459395537 + 0.0im
  1.870 s (12004 allocations: 4.44 MiB)
num. of threads: 1
result with @floop: 0.3304427459395537 + 0.0im
  1.884 s (5493 allocations: 3.42 MiB)

となりますが、これはほとんど影響を与えていなそうです。ラップトップ上の計測ですので、厳密なベンチマークではありませんが、Threads.@threadsとだいたい同じ感じに見えます。
4スレッドの場合には

julia -t 4 thread.jl
serial 
result: 0.33044274593955386 + 0.0im
  1.798 s (2 allocations: 2.85 MiB)
with @threads 
num. of threads: 4
result with @threads: 0.3304427459395537 + 0.0im
  561.394 ms (51852 allocations: 7.74 MiB)
num. of threads: 4
result with @floop: 0.3304427459395537 + 0.0im
  598.754 ms (50799 allocations: 6.91 MiB)

となって、ちゃんと並列化できています。

さて、Threadsの時に遅かったケース、NV = 4^4,itemax=10000の場合はどうでしょうか。

julia -t 1 thread.jl
serial 
result: 0.331733478989352 + 0.0im
  216.411 ms (2 allocations: 36.06 KiB)
with @threads 
num. of threads: 1
result with @threads: 0.3317334789893521 + 0.0im
  455.882 ms (120003 allocations: 11.18 MiB)
num. of threads: 1
result with @floop: 0.331733478989352 + 0.0im
  237.793 ms (149491 allocations: 3.23 MiB)

となり、並列化しないものとほぼ同じになっています。これは良い性質ですね。
2スレッドと4スレッドを見てみると

julia -t 2 thread.jl
serial 
result: 0.331733478989352 + 0.0im
  221.937 ms (2 allocations: 36.06 KiB)
with @threads 
num. of threads: 2
result with @threads: 0.3317334789893521 + 0.0im
  375.643 ms (260291 allocations: 22.33 MiB)
num. of threads: 2
result with @floop: 0.33173347898935207 + 0.0im
  581.538 ms (634297 allocations: 22.60 MiB)

julia -t 4 thread.jl
serial 
result: 0.331733478989352 + 0.0im
  220.311 ms (2 allocations: 36.06 KiB)
with @threads 
num. of threads: 4
result with @threads: 0.3317334789893521 + 0.0im
  308.189 ms (512785 allocations: 44.07 MiB)
num. of threads: 4
result with @floop: 0.3317334789893521 + 0.0im
  473.959 ms (1618115 allocations: 61.78 MiB)

となり、遅くなってしまいました。しかし、これは、サイズが並列化に向いていない、ということを意味していますので、1スレッドで走らせた時に極端に遅くなるよりは良いように思います。tempを使った場合も確かめてみます。

julia -t 4 thread.jl
serial 
result: 0.331733478989352 + 0.0im
  216.524 ms (2 allocations: 36.06 KiB)
with @threads 
num. of threads: 4
result with @threads: 0.3317334789893521 + 0.0im
  297.626 ms (512950 allocations: 44.08 MiB)
num. of threads: 4
result with @floop: 0.3317334789893521 + 0.0im
  527.237 ms (498740 allocations: 35.72 MiB)

悪化しました。ということで@floopの時はtempを使わない方がよさそうです。

結論

@floopは1スレッドでも遅くならない
小さなサイズの場合、そもそもスレッド並列が向いていないので、@floopもThreads.@threadsもよくない。
スレッド並列数を増やした時に遅くなるような問題の時はサイズが小さすぎるという示唆を与えてくれるという意味では、@floopの方が使いやすい
ThreadsX.jlというパッケージもあるらしいが、まだ調べていない

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up