Parallel programming with Julia

Julia offers different levels of parallel programming

  • distributed processing, where jobs are split among different Julia processes
  • multi-threading, where jobs are split among multiple threads within the same process
  • SIMD instructions
  • task switching

In this lecture, we will focus mainly on the first two, since SIMD instructions are mainly used for low-level optimization (such as writing your own very performant BLAS library), and task switching is not true parallelism, but allows running a different task while one task is waiting, for example, for IO.

The most important lesson is that before you jump into parallelism, be certain you have made your sequential code as fast as possible.

Process-level parallelism

Process-level parallelism means we run several instances of Julia (in different processes) which communicate with each other using inter-process communication (IPC). The implementation of IPC differs depending on whether the parallel Julia instances share the same machine or are spread over the network. By default, different processes do not share any libraries or any variables. They are started clean, and it is up to the user to set up all needed code and data.

Julia's default modus operandi is a single main instance controlling several workers. This main instance has myid() == 1, worker processes receive higher numbers. Julia can be started with multiple workers from the very beginning, using the -p switch as

julia -p n

where n is the number of workers, or you can add workers after Julia has been started by

using Distributed
addprocs(n)

You can also remove workers using rmprocs. When Julia is started with -p, the Distributed library is loaded by default on the main process. Workers can be on the same physical machine, or on different machines. Julia offers integration with most scheduling systems via ClusterManagers.jl.
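For illustration, workers on remote machines can be added by passing machine specifications to addprocs (SSH is used under the hood), and batch schedulers can be accessed through ClusterManagers.jl; the hostnames, paths, and worker counts below are placeholders:

using Distributed
addprocs(4)                                   # 4 workers on the local machine

# workers on remote machines reachable over SSH ("user@host1" etc. are placeholders)
addprocs([("user@host1", 2), ("user@host2", 2)]; exename = "julia", dir = "/home/user")

# workers allocated by a batch scheduler, e.g. Slurm via ClusterManagers.jl
# using ClusterManagers
# addprocs(SlurmManager(16))

rmprocs(workers())                            # remove all workers again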

If you want to evaluate a piece of code on all workers including the main process, the convenience macro @everywhere is offered.

@everywhere @show myid()

As we have mentioned, workers are started without any libraries loaded. We can see that by running

@everywhere InteractiveUtils.varinfo()

which fails, but after loading InteractiveUtils everywhere

using Statistics
@everywhere begin 
	using InteractiveUtils
	println(InteractiveUtils.varinfo(;imported = true))
end

we see that Statistics was loaded only on the main process. Thus, there is no magical sharing of data and code. With the @everywhere macro we can define functions and variables, and import libraries on workers as

@everywhere begin 
	foo(x, y) = x * y + sin(y)
	foo(x) = foo(x, myid())
	x = rand()
end
@everywhere @show foo(1.0)
@everywhere @show x

The fact that x has different values on the workers and on the master demonstrates again the independence of the processes. While we can set up everything using the @everywhere macro, we can also put all the code for workers into a separate file, e.g. worker.jl, and load it on all workers using -L worker.jl.
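For illustration, such a worker.jl could contain the definitions from above (the file name and its content are only an example):

# worker.jl – loaded on the main process and on every worker via the -L switch
using Distributed
foo(x, y) = x * y + sin(y)
foo(x) = foo(x, myid())

and Julia would then be started as

julia -p 4 -L worker.jl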

Julia's multi-processing model is based on the message-passing paradigm, but the abstraction is more akin to remote procedure calls. This means that users are spared from prepending messages with headers and implementing the logic deciding which function should be called for which header. Instead, we can schedule the execution of a function on a remote worker and immediately get the control back to continue our own job. A low-level function providing this functionality is remotecall(fun, worker_id, args...). For example

@everywhere begin 
	function delayed_foo(x, y, n )
		sleep(n)
		println("woked up")
		foo(x, y)
	end
end
r = remotecall(delayed_foo, 2, 1, 1, 60)

returns immediately, even though the function will take at least 60 seconds. r does not contain the result of foo(1, 1), but a struct Future, which is a remote reference in Julia's terminology. It points to data located on some machine, indicates whether they are available, and allows fetching them from the remote worker. fetch is blocking, which means that the execution is blocked until the data are available (if they are never available, the process can wait forever). The presence of data can be checked using isready, which in the case of a Future returned from remotecall indicates that the computation has finished.

isready(r)
fetch(r) == foo(1, 1)

An advantage of the remote reference is that it can be freely shared between processes and the result can be retrieved on a different node than the one which issued the call.

r = remotecall(delayed_foo, 2, 1, 1, 60)
remotecall(r -> println("value: ",fetch(r), " retrieved on ", myid()) , 3, r)

An interesting feature of fetch is that it re-throws an exception raised on a different process.

@everywhere begin 
	function exfoo()
		throw("Exception from $(myid())")
	end
end
r = @spawnat 2 exfoo()

where we have used @spawnat instead of remotecall. It is a higher-level alternative which executes a closure around the expression (in this case exfoo()) on a specified worker, in this case 2. Coming back to the example, when we fetch the result r, the exception is thrown on the main process, not on the worker

fetch(r)

@spawnat can be called with :any to signal that the user does not care where the function will be executed; the choice is left up to Julia.

r = @spawnat :any foo(1,1)
fetch(r)

Finally, if for some reason you need to block and wait for the computed value, you can use

remotecall_fetch(foo, 2, 1, 1)

Running example: Julia sets

Our running example for explaining the mechanisms of distributed computing will be Julia set fractals, as they can be easily parallelized. The example is adapted from Eric Aubanel. Some fractals (Julia set, Mandelbrot set) are determined by properties of complex-valued functions. The Julia set counts how many iterations of $f(z) = z^2+c$ are needed before the absolute value exceeds two, $|z| \geq 2$. The number of iterations can then be mapped to the pixel's color, which creates the nice visualizations we know.

function juliaset_pixel(z₀, c)
    z = z₀
    for i in 1:255
        abs2(z)> 4.0 && return (i - 1)%UInt8
        z = z*z + c
    end
    return UInt8(255)
end

A nice property of fractals like the Julia set is that the computation can be easily parallelized, since the value of each pixel is independent of the others. In our experiments, the level of granularity will be one column, since the calculation of a single pixel is so fast that thread or process switching would have a much higher overhead.

function juliaset_column!(img, c, n, j)
    x = -2.0 + (j-1)*4.0/(n-1)
    for i in 1:n
        y = -2.0 + (i-1)*4.0/(n-1)
        @inbounds img[i,j] = juliaset_pixel(x+im*y, c)
    end
    nothing
end

To calculate the full image

function juliaset(x, y, n=1000)
    c = x + y*im
    img = Array{UInt8,2}(undef,n,n)
    for j in 1:n
        juliaset_column!(img, c, n, j)
    end
    return img
end

and run it and view it

using Plots
frac = juliaset(-0.79, 0.15)
plot(heatmap(1:size(frac,1),1:size(frac,2), frac, color=:Spectral))

or with GLMakie

using GLMakie
frac = juliaset(-0.79, 0.15)
heatmap(frac)

To measure the execution time, we will use BenchmarkTools.jl

using BenchmarkTools
julia> @btime juliaset(-0.79, 0.15);
  39.822 ms (2 allocations: 976.70 KiB)

Let's now try to speed up the computation using more processes. We first make the functions available to the workers

@everywhere begin 
	function juliaset_pixel(z₀, c)
	    z = z₀
	    for i in 1:255
	        abs2(z)> 4.0 && return (i - 1)%UInt8
	        z = z*z + c
	    end
	    return UInt8(255)
	end

	function juliaset_column!(img, c, n, colj, j)
	    x = -2.0 + (j-1)*4.0/(n-1)
	    for i in 1:n
	        y = -2.0 + (i-1)*4.0/(n-1)
	        @inbounds img[i,colj] = juliaset_pixel(x+im*y, c)
	    end
	    nothing
	end
end

For the actual parallelisation, we split the computation of the whole image into bands, such that each worker computes a smaller portion.

@everywhere begin 
	function juliaset_columns(c, n, columns)
	    img = Array{UInt8,2}(undef, n, length(columns))
	    for (colj, j) in enumerate(columns)
	        juliaset_column!(img, c, n, colj, j)
	    end
	    img
	end
end

function juliaset_spawn(x, y, n = 1000)
    c = x + y*im
    columns = Iterators.partition(1:n, div(n, nworkers()))
    r_bands = [@spawnat w juliaset_columns(c, n, cols) for (w, cols) in enumerate(columns)]
    slices = map(fetch, r_bands)
    reduce(hcat, slices)
end

we observe some speed-up over the serial version, but it is not linear in the number of workers

julia> @btime juliaset(-0.79, 0.15);
  38.699 ms (2 allocations: 976.70 KiB)

julia> @btime juliaset_spawn(-0.79, 0.15);
  21.521 ms (480 allocations: 1.93 MiB)

In the above example, we spawn one function on each worker and collect the results. In essence, we are performing a map over bands. Julia offers a parallel version of map for this use-case, pmap. With that, our example can look like

function juliaset_pmap(x, y, n = 1000, np = nworkers())
    c = x + y*im
    columns = Iterators.partition(1:n, div(n, np))
    slices = pmap(cols -> juliaset_columns(c, n, cols), columns)
    reduce(hcat, slices)
end

julia> @btime juliaset_pmap(-0.79, 0.15);
  17.597 ms (451 allocations: 1.93 MiB)

which has a slightly better timing than the version based on @spawnat and fetch. As explained below in the section about threads, the parallel computation of the Julia set suffers from each pixel taking a different time to compute, which can be alleviated by dividing the work into more parts:

julia> @btime juliaset_pmap(-0.79, 0.15, 1000, 16);
  12.686 ms (1439 allocations: 1.96 MiB)

Shared memory

When the main process and all workers are located on the same machine, and the OS supports sharing memory between processes (by sharing memory pages), we can use SharedArrays to avoid sending the matrix with the results.

@everywhere begin
	using SharedArrays
	function juliaset_shared(x, y, n=1000)
	    c = x + y*im
	    img = SharedArray(Array{UInt8,2}(undef,n,n))
	    @sync @distributed for j in 1:n
	        juliaset_column!(img, c, n, j, j)
	    end
	    return img
	end 
end

julia> @btime juliaset_shared(-0.79, 0.15);
  19.088 ms (963 allocations: 1017.92 KiB)

The allocation of the SharedArray might be costly; let's try to move the allocation outside of the function

img = SharedArray(Array{UInt8,2}(undef,1000,1000))
function juliaset_shared!(img, x, y, n=1000)
    c = x + y*im
    @sync @distributed for j in 1:n
        juliaset_column!(img, c, n, j, j)
    end
    return img
end 

julia> @btime juliaset_shared!(img, -0.79, 0.15);
  17.399 ms (614 allocations: 27.61 KiB)

but the two versions are not on par. It seems the allocation of the SharedArray costs approximately 2 ms.
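We can check this estimate by benchmarking the allocation alone (a quick sketch; the exact timing depends on the machine):

using BenchmarkTools, SharedArrays
@btime SharedArray(Array{UInt8,2}(undef, 1000, 1000));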

@distributed for (Distributed.pfor) does not allow controlling the granularity of the split, as it always divides the for loop into nworkers() parts. Above we have seen that more splits are better

@everywhere begin 
	function juliaset_columns!(img, c, n, columns)
	    for (colj, j) in enumerate(columns)
	        juliaset_column!(img, c, n, colj, j)
	    end
	end
end

img = SharedArray(Array{UInt8,2}(undef,1000,1000))
function juliaset_shared!(img, x, y, n=1000, np = nworkers())
    c = x + y*im
    columns = Iterators.partition(1:n, div(n, np))
    pmap(cols -> juliaset_columns!(img, c, n, cols), columns)
    return img
end 

julia> @btime juliaset_shared!(img, -0.79, 0.15, 1000, 16);
  11.760 ms (1710 allocations: 85.98 KiB)

which is almost 1 ms faster than without the pre-allocated SharedArray. Notice the speedup is now 38.699 / 11.76 = 3.29×.

Synchronization / Communication primitives

The orchestration of a complicated computation might be difficult with relatively low-level remote calls. The producer / consumer paradigm is a synchronization paradigm that uses queues. The consumer fetches work instructions from one queue and pushes results to a different queue. Julia supports this paradigm with the Channel and RemoteChannel primitives. Importantly, putting to and taking from a queue is an atomic operation, hence we do not have to take care of race conditions. The code for the worker might look like

@everywhere begin 
	function juliaset_channel_worker(instructions, results)
		while isready(instructions)
			c, n, cols = take!(instructions)
			put!(results, (cols, juliaset_columns(c, n, cols)))
		end
	end
end

The code for the main process will look like

function juliaset_channels(x, y, n = 1000, np = nworkers())
	c = x + y*im
	columns = Iterators.partition(1:n, div(n, np))
	instructions = RemoteChannel(() -> Channel(np))
	foreach(cols -> put!(instructions, (c, n, cols)), columns)
	results = RemoteChannel(()->Channel(np))
	rfuns = [@spawnat i juliaset_channel_worker(instructions, results) for i in workers()]

	img = Array{UInt8,2}(undef, n, n)
	for i in 1:np
		cols, impart = take!(results)
		img[:,cols] .= impart;
	end
	img
end

julia> @btime juliaset_channels(-0.79, 0.15);

The execution time is much higher than what we have observed in the previous cases, and changing the number of workers does not help much. What went wrong? The reason is that setting up the infrastructure around the remote channels is a costly process. Consider the following alternative, where (i) we let the workers run endlessly and (ii) the channel infrastructure is set up once and wrapped into an anonymous function

@everywhere begin 
	function juliaset_channel_worker(instructions, results)
		# println("worker $(myid()) has started")
		while true
		  # println("worker $(myid()) is waiting for instructions")
		  c, n, cols = take!(instructions)
		  # println("worker $(myid()) is computing results")
		  put!(results, (cols, juliaset_columns(c, n, cols)))
		  # println("worker $(myid()) has computed results")
		end
	  # println("worker $(myid()) has computed results")
	end
end

function juliaset_init(x, y, n = 1000, np = nworkers())
  c = x + y*im
  columns = Iterators.partition(1:n, div(n, np))
  T = Tuple{ComplexF64,Int64,UnitRange{Int64}}
  instructions = RemoteChannel(() -> Channel{T}(np))
  T = Tuple{UnitRange{Int64},Array{UInt8,2}}
  results = RemoteChannel(()->Channel{T}(np))
  foreach(p -> remote_do(juliaset_channel_worker, p, instructions, results), workers())
  function compute()
    img = Array{UInt8,2}(undef, n, n)
    foreach(cols -> put!(instructions, (c, n, cols)), columns)
    for i in 1:np
      cols, impart = take!(results)
      img[:,cols] .= impart;
    end
    img
  end 
end

t = juliaset_init(-0.79, 0.15)
julia> @btime t();
  17.697 ms (776 allocations: 1.94 MiB)

with which we obtain a speed comparable to the pmap approach.

Info

remote_do vs remotecall

Instead of @spawnat (remotecall), we can also use remote_do as foreach(p -> remote_do(juliaset_channel_worker, p, instructions, results), workers()), which executes the function juliaset_channel_worker at worker p with parameters instructions and results but does not return a Future handle to the result.

Info

Channel and RemoteChannel

An AbstractChannel has to implement the interface put!, take!, fetch, isready and wait, i.e. it should behave like a queue. Channel is an implementation of an AbstractChannel that facilitates communication within a single process (for the purpose of multi-threading and task switching). A Channel is created by Channel{T}(capacity), where the capacity can be infinite. The storage of a channel can be seen in the data field, but direct access will of course break all guarantees, such as the atomicity of take! and put!. For communication between processes, the <:AbstractChannel has to be wrapped in a RemoteChannel. The constructor RemoteChannel(f::Function, pid::Integer=myid()) takes as its first argument a function (without arguments) which constructs the channel on the remote machine identified by pid, and returns the RemoteChannel. The storage thus resides on the machine specified by pid, and the handle provided by the RemoteChannel can be freely passed to any process. (For the curious, the ProcessGroup Distributed.PGRP contains information about channels on machines.)
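A minimal sketch of this API, assuming at least two workers have been added (the snippet is illustrative and not part of the Julia set example):

ch = Channel{Int}(4)           # local channel with capacity 4
put!(ch, 1)                    # enqueue a value
isready(ch)                    # true – an item is available
take!(ch)                      # dequeue it (blocks when the channel is empty)

# the storage of this RemoteChannel lives on worker 2; the handle can be passed to any process
rch = RemoteChannel(() -> Channel{Int}(4), 2)
put!(rch, 42)
fetch(rch)                     # peek at the first item without removing it
take!(rch)                     # remove it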

In the above example, juliaset_channel_worker defined as

function juliaset_channel_worker(instructions, results)
	while true
	  c, n, cols = take!(instructions)
	  put!(results, (cols, juliaset_columns(c, n, cols)))
	end
end

runs forever due to the while true loop.

By default, Julia does not provide any facility to kill a remote execution except sending Ctrl-C to the remote worker via interrupt(pids::Integer...). To stop the computation, we usually extend the type accepted by the instructions channel with some stopping token (e.g. :stop) and terminate the worker loop when it is received.

@everywhere begin 
	function juliaset_channel_worker(instructions, results)
		while true
			i = take!(instructions)
			i === :stop && break
			c, n, cols = i
			put!(results, (cols, juliaset_columns(c, n, cols)))
		end
		println("worker $(myid()) stopped")
		put!(results, :stop)
	end
end

function juliaset_init(x, y, n = 1000, np = nworkers())
  c = x + y*im
  columns = Iterators.partition(1:n, div(n, np))
  instructions = RemoteChannel(() -> Channel(np))
  results = RemoteChannel(()->Channel(np))
  foreach(p -> remote_do(juliaset_channel_worker, p, instructions, results), workers())
  function compute()
    img = Array{UInt8,2}(undef, n, n)
    foreach(cols -> put!(instructions, (c, n, cols)), columns)
    for i in 1:np
      cols, impart = take!(results)
      img[:,cols] .= impart;
    end
    img
  end 
end

t = juliaset_init(-0.79, 0.15)
t()
foreach(i -> put!(t.instructions, :stop), workers())

In the above example we paid the price of introducing type instability into the channels, which now have element type Any instead of the carefully constructed tuples. But the impact on the overall running time is negligible

t = juliaset_init(-0.79, 0.15)
julia> @btime t()
  17.551 ms (774 allocations: 1.94 MiB)
foreach(i -> put!(t.instructions, :stop), workers())
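If we wanted to avoid the Any-typed channels while keeping the :stop token, one option (a sketch, not part of the original code) is a small Union element type inside juliaset_init:

T = Union{Symbol, Tuple{ComplexF64,Int64,UnitRange{Int64}}}
instructions = RemoteChannel(() -> Channel{T}(np))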

In some use-cases, an alternative can be to put all jobs into the RemoteChannel before the workers are started, and then let the workers stop once the remote channel is empty, as in

@everywhere begin 
	function juliaset_channel_worker(instructions, results)
		while isready(instructions)
		  c, n, cols = take!(instructions)
		  put!(results, (cols, juliaset_columns(c, n, cols)))
		end
	end
end

Sending data

Sending parameters of functions and receiving results from remotely called functions might incur a significant cost.

  1. Try to minimize the data movement as much as possible. A prototypical example is
A = rand(1000,1000);
Bref = @spawnat :any A^2;

and

Bref = @spawnat :any rand(1000,1000)^2;
where the first version sends the locally constructed matrix A to the remote worker, while the second constructs the random matrix directly on the worker.
  2. It is not only the volume of data (in terms of the number of bytes) that matters, but also the complexity of the objects being sent. Serialization can be very time consuming, so an efficient conversion to something simple might be worth it
using BenchmarkTools
@everywhere begin 
	using Random
	v = [randstring(rand(1:20)) for i in 1:1000];
	p = [i => v[i] for i in 1:1000]
	d = Dict(p)

	send_vec() = v
	send_dict() = d
	send_pairs() = p
	custom_serialization() = (length.(v), join(v, ""))
end

@btime remotecall_fetch(send_vec, 2);
@btime remotecall_fetch(send_dict, 2);
@btime remotecall_fetch(send_pairs, 2);
@btime remotecall_fetch(custom_serialization, 2);
  3. Some types of objects cannot be properly serialized and deserialized
a = IdDict(
	:a => rand(1,1),
	)
b = remotecall_fetch(identity, 2, a)
a[:a] === a[:a]
a[:a] === b[:a]
  4. If you need to send data to a worker, i.e. you want to define (overwrite) a global variable there
@everywhere begin 
	g = rand()
	show_secret() = println("secret of ", myid(), " is ", g)
end
@everywhere show_secret()

for i in workers()
	remotecall_fetch(g -> eval(:(g = $(g))), i, g)
end
@everywhere show_secret()

which is implemented, with other variants, in ParallelDataTransfer.jl, but in general this construct should be avoided.

Alternatively, you can overwrite a global variable through a setter function defined on the workers

@everywhere begin 
	g = rand()
	show_secret() = println("secret of ", myid(), " is ", g)
	function set_g(x) 
		global g
		g = x
		nothing
	end
end

@everywhere show_secret()
remote_do(set_g, 2, 2)
@everywhere show_secret()

Practical advice

Recall that (i) workers are started as clean processes and (ii) they might not share the same environment with the main process. The latter is because remote machines may have a different directory structure.

@everywhere begin 
	using Pkg
	println(Pkg.project().path)
end

Our advice, earned by practice, is:

  • to have a shared directory (shared home) with the code and to share the location of packages
  • to place all the code for workers into one file, let's call it worker.jl (the author of this puts the code for the master there as well)
  • to put at the beginning of worker.jl code activating the specified environment (or to specify the environment for all workers via the environment variable export JULIA_PROJECT="$PWD"), as
using Pkg
Pkg.activate(@__DIR__)

and optionally

Pkg.resolve()
Pkg.instantiate()
  • to run Julia as
julia -p ?? -L worker.jl main.jl

where main.jl is the script to be executed on the main node. Or

julia -p ?? -L worker.jl -e "main()"

where main() is the function defined in worker.jl to be executed on the main node.

A complete example can be seen in juliaset_p.jl.

Multi-threading

So far, we have been able to decrease the computation time from 39 ms to something like 13 ms. Can we improve? Let's now turn our attention to multi-threading, where we will not pay the penalty for IPC. Moreover, the computation of the Julia set is multi-thread friendly, as all the memory can be pre-allocated. We slightly modify our code to accept different methods of distributing the work among slices of the pre-allocated matrix. To start Julia with support for multi-threading, run it with julia -t n, where n is the number of threads. It is recommended to set n to the number of physical cores, since with hyper-threading two threads share the arithmetic units of a single core, and in applications for which Julia was built, these are usually saturated.
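For example, the number of threads can be set either on the command line or through the JULIA_NUM_THREADS environment variable, and checked from within the session (the value 4 below is just an example):

julia -t 4
# or equivalently
JULIA_NUM_THREADS=4 julia

and inside the session

Threads.nthreads()    # the number of threads the session was started with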

using BenchmarkTools
function juliaset_pixel(z₀, c)
    z = z₀
    for i in 1:255
        abs2(z)> 4.0 && return (i - 1)%UInt8
        z = z*z + c
    end
    return UInt8(255)
end

function juliaset_column!(img, c, n, j)
    x = -2.0 + (j-1)*4.0/(n-1)
    for i in 1:n
        y = -2.0 + (i-1)*4.0/(n-1)
        @inbounds img[i,j] = juliaset_pixel(x+im*y, c)
    end
    nothing
end

function juliaset(x, y, n=1000)
    c = x + y*im
    img = Array{UInt8,2}(undef,n,n)
    for j in 1:n
        juliaset_column!(img, c, n, j)
    end
    return img
end

julia> @btime juliaset(-0.79, 0.15, 1000);
   38.932 ms (2 allocations: 976.67 KiB)

Let's now try to speed up the calculation using multi-threading. Julia v0.5 introduced multi-threading with a static scheduler and a simple syntax: just prepend the for-loop with the Threads.@threads macro. With that, the first multi-threaded version will look like

function juliaset_static(x, y, n=1000)
    c = x + y*im
    img = Array{UInt8,2}(undef,n,n)
    Threads.@threads :static for j in 1:n
        juliaset_column!(img, c, n, j)
    end
    return img
end

with benchmark

julia> 	@btime juliaset_static(-0.79, 0.15, 1000);
  15.751 ms (27 allocations: 978.75 KiB)

Although we have used four threads, and the communication overhead should be next to zero, the speedup is only about 2.4×. Why is that?

To understand better what is going on, we have improved the profiler we have been developing last week. The logging profiler logs the time of entering and exiting every function call of every thread, which is useful for understanding what is going on. The API is not yet polished, but it will do its job. Importantly, to prevent excessive logging, we ask it to log only some functions.

using LoggingProfiler
function juliaset_static(x, y, n=1000)
    c = x + y*im
    img = Array{UInt8,2}(undef,n,n)
    Threads.@threads :dynamic for j in 1:n
        LoggingProfiler.@recordfun juliaset_column!(img, c, n, j)
    end
    return img
end

LoggingProfiler.initbuffer!(1000)
juliaset_static(-0.79, 0.15, 1000);
LoggingProfiler.recorded()
LoggingProfiler.adjustbuffer!()
juliaset_static(-0.79, 0.15, 1000)
LoggingProfiler.export2svg("/tmp/profile.svg")
LoggingProfiler.export2luxor("profile.png")

From the visualization of the profiler we can see that not all threads were working for the same amount of time: threads 1 and 4 were working less than threads 2 and 3. The reason is that the static scheduler partitions the total number of columns (1000) into equal parts, one per thread, and assigns each part to a single thread. In our case, we have four parts, each of size 250. Since the execution time of computing each pixel is not the same, threads whose columns contain many pixels requiring only a few iterations finish considerably faster. This is an incarnation of one of the biggest problems in multi-threading / scheduling. A contemporary approach is to switch to dynamic scheduling, which divides the problem into smaller parts, and when a thread finishes one part, it is assigned a new, not-yet-computed part.

From Julia 1.5, one can specify the scheduler in the Threads.@threads [schedule] for construct to be either :static or :dynamic. The :dynamic option is compatible with the partr dynamic scheduler. From 1.8, :dynamic is the default, but the range is still divided into nthreads() parts, which is the reason why we do not see an improvement.

The dynamic scheduler is also supported by the Threads.@spawn macro. The prototypical approach to its invocation is the fork-join model, where one recursively partitions the problem and each task waits for the tasks it has spawned

function juliaset_recspawn!(img, c, n, lo=1, hi=n, ntasks=128)
    if hi - lo > n/ntasks-1
        mid = (lo+hi)>>>1
        finish = Threads.@spawn juliaset_recspawn!(img, c, n, lo, mid, ntasks)
        juliaset_recspawn!(img, c, n, mid+1, hi, ntasks)
        wait(finish)
        return
    end
    for j in lo:hi
        juliaset_column!(img, c, n, j)
    end
    nothing
end

Measuring the time, we observe an almost four-times speedup, which roughly corresponds to the number of threads.

function juliaset_forkjoin(x, y, n=1000, ntasks = 16)
    c = x + y*im
    img = Array{UInt8,2}(undef,n,n)
    juliaset_recspawn!(img, c, n, 1, n, ntasks)
    return img
end

julia> @btime juliaset_forkjoin(-0.79, 0.15);
  10.326 ms (142 allocations: 986.83 KiB)

This is so far our fastest construction with speedup 38.932 / 10.326 = 3.77×.

Unfortunately, the LoggingProfiler does not handle task migration at the moment, which means that we cannot visualize the results. Due to the task switching overhead, increasing the granularity might not pay off; below we measure the time for different numbers of tasks.
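These measurements can be reproduced with a loop along the following lines (a sketch using BenchmarkTools; the exact output format differs from the listing below):

using BenchmarkTools
for ntasks in 2 .^ (2:9)        # 4, 8, …, 512 tasks
    t = @belapsed juliaset_forkjoin(-0.79, 0.15, 1000, $ntasks)
    println(ntasks, " tasks: ", round(t * 1000, digits = 3), " ms")
end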

4 tasks: 16.262 ms (21 allocations: 978.05 KiB)
8 tasks: 10.660 ms (45 allocations: 979.80 KiB)
16 tasks: 10.326 ms (142 allocations: 986.83 KiB)
32 tasks: 10.786 ms (238 allocations: 993.83 KiB)
64 tasks: 10.211 ms (624 allocations: 1021.89 KiB)
128 tasks: 10.224 ms (1391 allocations: 1.05 MiB)
256 tasks: 10.617 ms (2927 allocations: 1.16 MiB)
512 tasks: 11.012 ms (5999 allocations: 1.38 MiB)
An alternative is offered by FLoops.jl together with the Folds ecosystem: the @floop macro accepts an executor that controls how the loop is scheduled over threads, for example ThreadedEx:

using FLoops, FoldsThreads
function juliaset_folds(x, y, n=1000, basesize = 2)
    c = x + y*im
    img = Array{UInt8,2}(undef,n,n)
    @floop ThreadedEx(basesize = basesize) for j in 1:n
        juliaset_column!(img, c, n, j)
    end
    return img
end

julia> @btime juliaset_folds(-0.79, 0.15, 1000);
  10.253 ms (3960 allocations: 1.24 MiB)

where basesize is the size of the smallest part allocated to a single thread, in this case 2 columns.

julia> @btime juliaset_folds(-0.79, 0.15, 1000);
  10.575 ms (52 allocations: 980.12 KiB)
The FoldsThreads.jl package provides further executors with different scheduling strategies, for example DepthFirstEx:

function juliaset_folds(x, y, n=1000, basesize = 2)
    c = x + y*im
    img = Array{UInt8,2}(undef,n,n)
    @floop DepthFirstEx(basesize = basesize) for j in 1:n
        juliaset_column!(img, c, n, j)
    end
    return img
end
julia> @btime juliaset_folds(-0.79, 0.15, 1000);
  10.421 ms (3582 allocations: 1.20 MiB)

We can search for the best basesize (the size of the smallest unit of work) and measure its influence on the time

using DataFrames
map(2 .^ (0:7)) do bs 
	t = @belapsed juliaset_folds(-0.79, 0.15, 1000, $(bs));
	(;basesize = bs, time = t)
end |> DataFrame
 Row │ basesize  time
     │ Int64     Float64
─────┼─────────────────────
   1 │        1  0.0106803
   2 │        2  0.010267
   3 │        4  0.0103081
   4 │        8  0.0101652
   5 │       16  0.0100204
   6 │       32  0.0100097
   7 │       64  0.0103293
   8 │      128  0.0105411

We observe that the minimum is for basesize = 32, for which we get a 38.932 / 10.0097 ≈ 3.89× speedup.

Garbage collector is single-threaded

Keep in mind that while threads are very easy and convenient to use, there are use-cases where you might be better off with processes, even though there will be some communication overhead. One such case happens when you need to allocate and free a lot of memory. This is because Julia's garbage collector is single-threaded (since 1.10 it is partially multi-threaded). Imagine the task of making a histogram of bytes of all files in a directory. For a fair comparison, we will use Transducers, since they offer both thread- and process-based parallelism

using Transducers
@everywhere begin 
	function histfile(filename)
		h = Dict{UInt8,Int}()
		foreach(open(read, filename, "r")) do b 
			h[b] = get(h, b, 0) + 1
		end
		h
	end
end

files = filter(isfile, readdir("/Users/tomas.pevny/Downloads/", join = true))
@elapsed foldxd(mergewith(+), files |> Map(histfile))
150.863183701

and using the multi-threaded version of map

@elapsed foldxt(mergewith(+), files |> Map(histfile))
205.309952618

we see that threading is actually worse than process-based parallelism, despite paying the price for the serialization and deserialization of the Dict. Needless to say, changing the Dict to a Vector, as in

using Transducers
@everywhere begin 
	function histfile(filename)
		h = zeros(Int, 256)
		foreach(open(read, filename, "r")) do b 
			h[b + 1] += 1
		end
		h
	end
end
files = filter(isfile, readdir("/Users/tomas.pevny/Downloads/", join = true))
@elapsed foldxd(+, files |> Map(histfile))
86.44577969
@elapsed foldxt(+, files |> Map(histfile))
105.32969331

is much better.

Locks / lock-free multi-threading

Avoid locks whenever possible; prefer letting each task accumulate its own partial result and combining the results at the end.
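To illustrate why, below is a minimal sketch (the helper functions are hypothetical, not part of the lecture code) counting the pixels that reached the iteration limit in a computed fractal image. The first version protects a shared counter with a lock, so every update serializes the threads; the second is lock-free, letting each task compute its own partial sum and merging them at the end:

using Base.Threads

# locked version: every update has to acquire the lock, which serializes the threads
function count_maxiter_locked(img)
    lk = ReentrantLock()
    total = 0
    @threads for j in axes(img, 2)
        s = count(==(UInt8(255)), view(img, :, j))
        lock(lk) do
            total += s
        end
    end
    total
end

# lock-free version: each task computes a partial sum over a block of columns,
# and the partial sums are reduced at the end
function count_maxiter_lockfree(img)
    parts = map(Iterators.partition(axes(img, 2), 100)) do cols
        Threads.@spawn count(==(UInt8(255)), view(img, :, cols))
    end
    sum(fetch, parts)
end

count_maxiter_locked(juliaset(-0.79, 0.15)) == count_maxiter_lockfree(juliaset(-0.79, 0.15))

The lock-free variant avoids contention entirely and is usually the preferred pattern.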

Take away message

When deciding what kind of parallelism to employ, consider the following

  • for tightly coupled computation over shared data, multi-threading is more suitable, since processes do not share data
  • but if the computation requires frequent allocation and freeing of memory, or IO, separate processes are more suitable, since their garbage collectors run independently
  • making all cores busy while achieving an ideally linear speedup is difficult and requires a lot of experience and knowledge; tooling and profilers supporting the debugging of parallel processes are not well developed
  • Transducers strives to support thread- and process-based parallelism with (almost) the same code

Materials