A little study on multithreading with framelib

Hi there! :slight_smile:
So I started using Framelib a bit more seriously, and I think I can now grasp the basics. I am really into sampling and working with large corpora of sounds with granular synths. I have always used poly~ for these things, and enjoyed the @parallel 1 multithreading mode, which to me seems to almost always give way more headroom than what’s normally possible in a single patch.
So when I started diving into Framelib, and realized I can easily launch hundreds, or thousands of grains per sec, reliably and accurately, and without too much boilerplate, I caught the scent of blood.
But I quickly realized that - on Windows at least - the multithread method of fl.contextcontrol~ was not as rewarding as on a Mac system, where it seems to almost always help. (On Windows it very often just adds a 1-5% CPU bump with scary spikes every now and then.) Without the additional headroom I could get from multithreading, I am still ultimately better off with poly~, but the scent of blood already clouded my judgement, so I wanted to see if I can somehow squeeze more multithreading out of Framelib on Windows.
I then discovered that the multithreading starts to really work its magic when I am using multistream networks. I was still not fully convinced to abandon my poly~-workflow though.
So I set out to make some simple tests (not really benchmarks, but something like that) to see what gives me the best “yield”, and this post is sharing the results with patches.

The task

What I am ultimately after is to be able to play a large corpus of sounds that resides in a polybuffer~. Since @a.harker added the wonderful fl.makestring~, this became possible in Frameland. So the test will be to load a folder of 4895 sound files into a polybuffer~ and see in which configuration I can squeeze out the highest number of concurrent grains without clogging my CPU (or reaching its ceiling). My screen recording will distort the readings a bit, but in each case I’ll run the patch for a while before recording so you can see the real CPU load.

Environment

As I mentioned I am on Windows 10, “rocking” a 6-core/12-thread Intel Core i7-10750H CPU @2.60GHz.

Attempt #1 - single stream, single patch, no tricks

Here is the first “baseline” attempt:

…and the patch: fl_multicore_single.maxpat (22.7 KB)

It basically gets a random buffer at each trigger frame and plays it through its full length. The sound files range from around 500ms to 3-4s, and rarely 7-8s in length.

I can push it to around 500 Hz, where I start to lick my CPU ceiling from below (the screen recording makes it look a bit worse, but just observe the graphs before the huge spike on the right side). I get regular 100% spikes, and the 1-second mean CPU is hovering around 59-60% (pre-screencast).
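Just to put that 500 Hz in perspective, here is a back-of-the-envelope estimate of how many grains are sounding at once (the 2 s mean grain length is my own assumption, eyeballed from the 0.5-8 s file range):

```python
# Expected number of concurrent grains ~= trigger rate * mean grain length
# (Little's law). The 2.0 s mean is an assumed average of the 0.5-8 s files.
trigger_rate_hz = 500
mean_grain_len_s = 2.0  # assumption, not measured
concurrent_grains = trigger_rate_hz * mean_grain_len_s
print(concurrent_grains)  # -> 1000.0
```

So around a thousand overlapping grains at a time - no wonder the CPU is busy.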

fl_multicore_single

I forgot to record that, but in this scenario (and again, on Windows) multithreading makes no perceivable difference.

Attempt #2 - multithreading with poly~

So now I am curious if I can improve this by simply wrapping it into a poly~ and driving it with a set of phasor~s phased at equal “distance” from each other. Since my CPU has 12 threads, I give the poly~ 12 voices, and create a 12-channel mc.phasor~ to provide the ticks.
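The phase offsets are just evenly spaced fractions of a cycle; a quick sketch of the arithmetic (the 500 Hz total rate is only an example, and the per-voice rate being total/voices is my assumption about how the ticks are shared):

```python
# Evenly spaced phase offsets for an n-channel mc.phasor~, so each
# poly~ voice gets its tick at a different point in the cycle.
n_voices = 12
offsets = [i / n_voices for i in range(n_voices)]

# If the total grain rate is shared across the voices, each phasor
# runs at total_rate / n_voices (an assumption about this setup).
total_rate_hz = 500
per_voice_hz = total_rate_hz / n_voices

print(round(offsets[1], 4))    # -> 0.0833
print(round(per_voice_hz, 2))  # -> 41.67
```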

Inside the poly~ patch:

main patch: fl_multicore_w_poly.maxpat (26.7 KB)
poly~ patch: p.fl_polybuf_player.maxpat (9.1 KB)

This time, with the same 500Hz, I seem to get lower CPU measures (mean is hovering around 42-43%, pre-screencast), and almost no 100% spikes ever.
fl_multicore_poly

That’s a significant improvement. Again, enabling multithreading in Frameland does not seem to change much, but I would guess the idea doesn’t make much sense at this point anyway(?).

Unfortunately, it does not scale too well, and I can’t really get over 700 Hz safely. I also noticed that the load is not evenly distributed across the threads, or maybe hyperthreading is ignored by poly~’s threading, no clue.

Attempt #3 - multithreading with multistream networks

This attempt follows my hunch that multithreading à la framelib - on Windows at least - favors multistream networks. So what if I created a 12-stream network and distributed the ticks with fl.chain~? Something like this:

the patch: fl_multicore_w_streams.maxpat (35.5 KB)

The idea here is that I have an fl.interval~ with 1/12th of the interval I want at a given time, and subdivide that interval into 12 parts. Each part goes into its own stream, and at the sink we separate them.
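In other words, a sketch of the timing (assuming a simple round-robin handout of ticks to streams, which is how I understand the fl.chain~ setup here - the function name is made up):

```python
# The master clock ticks at target_interval / n_streams, and consecutive
# ticks are handed to consecutive streams round-robin. Each stream then
# effectively ticks at the full target interval, offset by its index.
def tick_schedule(target_interval_ms, n_streams, n_ticks):
    master = target_interval_ms / n_streams
    return [(i % n_streams, i * master) for i in range(n_ticks)]

sched = tick_schedule(24.0, 12, 24)
# Ticks seen by stream 0 are spaced by the full 24 ms target interval:
stream0 = [t for s, t in sched if s == 0]
print(stream0)  # -> [0.0, 24.0]
```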
image

In this attempt we let Framelib do the multithreading for us.

The results are impressive. My (pre-screencast) mean CPU is around 30-31% at a 500 Hz grain density. And look at the nice, even distribution over all cores.
fl_multicore_multistream

I can verify that the multithreading does magic here: if I switch it off, the CPU (in the patch) just runs to 100% permanently, and audio glitches and sweats.

With this version I can even go safely to around 1300-1400 Hz without the occasional spikes hitting the ceiling (mean CPU is still around 60% there). This frequency is far from possible with the previous two attempts.

Wrapping it into an abstraction

So now I can verify that, at least on Windows, multistreaming is the key to squeezing the most out of a framelib network (of this kind at least; I guess there are many scenarios I don’t consider here). So why not make it into an abstraction? I call it fl.multiinterval~, and at the moment it doesn’t have too many options: it always expects intervals in Hz, it ignores any /option, and it only considers one positional argument, which is the number of streams to generate:


patch: fl.multiinterval~.maxpat (17.0 KB)

Sadly, Max does not recognize the #1 after the = sign, hence the “LOL” subpatch:

Nevertheless, it seems to work just fine:


patch: fl_multicore_w_streams_w_abs.maxpat (26.6 KB)

everything in a zip: fl_multicore.zip (18.0 KB)

Happy framing! :slight_smile:


Wow - thanks for all of this. Lots of detailed investigation.

Your basic premise is more or less accurate - multi streams are a form of parallelism and the multithreading in framelib can (mostly) only exploit parallelism in the network. There are some caveats/nuances:

  • grains can only process in parallel when they occur within the same signal vector
  • in the case of two grains in the same signal vector even a serial network might be able to parallelise, but probably not as efficiently as in the case of a parallel network.
  • one issue for FrameLib in comparison to poly~ is that each grain is computed in a single signal vector, unlike poly~, which spreads that cost over time even for just one grain - so long grains can create CPU spikes.

I’d also expect these multistream patches to run best on Mac, and there is still some comparison to do there in terms of gains.

Additionally, if you don’t care about latency then running a higher signal vector size is likely to have a noticeable effect.


Thanks! Yes, I totally forgot to mention this, but I noticed that I get more headroom with a signal vector of 1024 than with, say, 256. So all of this was tested at 1024.

I also noticed that larger grains will raise CPU levels, but thanks, now I understand why. So in case I need long grains but want to avoid the CPU spikes, I just need to divide them into small grains and then “launch” them in a sequence - right? Can’t see clearly how I would build this, but I guess it should be possible.

OK. I made an abstraction to sequence the reading of long frames in small “subframes”. I got such good results that I am not sure they’re true. The spikes are gone from the CPU reading, and I can easily go up to 6000 Hz (or beyond) without issues.

Can that be true?? :open_mouth: :face_with_raised_eyebrow:

Here is the abstraction (I call it fl.read.insubframes~):

fl.read.insubframes~.maxpat (16.8 KB)

It has one positional argument, which is the maximum frame length in ms. The inlet takes the buffer reference, which also acts as a trigger frame. (No other options yet.) If the buffer is longer than the set length, then the ramping/reading will be scheduled in subframes of the specified length (with the last one shorter if needed). I attempt to spread the computational load this way to avoid CPU spikes.
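The chunking logic boils down to something like this (a sketch of what I described, not the actual patch; the function name is made up):

```python
# Split a frame of total_ms into subframes of at most max_len_ms each;
# the last subframe carries whatever remains.
def subframes(total_ms, max_len_ms):
    chunks = []
    start = 0.0
    while start < total_ms:
        length = min(max_len_ms, total_ms - start)
        chunks.append((start, length))
        start += length
    return chunks

print(subframes(2500.0, 1000.0))
# -> [(0.0, 1000.0), (1000.0, 1000.0), (2000.0, 500.0)]
```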

Patch: fl_multicore_w_streams_chopping_long_grains.maxpat (34.3 KB)

I’ll need to read this patch carefully. However, if you go back one step there are a couple of other optimisations to try:

1 - when using fl.window~ for granular things you should set the /size parameter (something like 4096 should be fine). Then it won’t recalculate the window when the input size changes, and it’ll linearly interpolate instead.

2 - if you don’t need subsample offsets or anything other than normal speed you can set the /interp parameter on fl.read~ to none.


Thanks! I will keep that in mind. I just realized that my sound files are already faded, so I removed the window altogether.

OK, I’ll try this now!

A hann window over a long file is also going to sound pretty different to just playback!


I think just letting everything go through fl.read.insubframes~ without the routing should work the same way too, thanks to 0-length frames being possible in framelib.

fl_multicore_w_streams_chopping_long_grains_simplified.maxpat (25.3 KB)

OK, so first of all, I realized fl.ticks~ needs to get a /limit, otherwise it maxes out at 10 subframes. This is fixed here:
fl.read.insubframes~.maxpat (17.7 KB)

But more importantly, I found out why this does not work the way I’d expect. It comes back to a tradeoff: either calculate large frames at once, freeing up a stream for something else but taking that CPU spike, or spread the load over several smaller frames, in which case I need voice management, and one stream stays occupied until it finishes playing all the subframes of an incoming long frame. So we avoided big CPU spikes, but now we have the limitation that the number of streams equals the number of possible concurrent voices (like in poly~). I guess this tradeoff is inescapable(?).
So you either make a lot of streams (similar to making a lot of poly~ voices) and balance the load so as not to “steal” voices, or you don’t subdivide large frames and balance the load so the CPU spikes for large frames don’t hit the ceiling.

I am very happy that this Xmas I finally got the time to sit down and read this legendary thread. Thank you.