x264 threads & real-time encoding. We gotta talk

x264 threads do not do what you think they do

I've seen a lot of misinformation regarding the threads=x x264 parameter, so I wanted to do a quick write-up of the why, the what and the how. Getting this wrong can cause a lot of issues, especially when it comes to real-time encoding.

Why?

Why do people want to manually set a value? I believe this has become more pronounced with the ryze (heh) of reasonably priced higher core count chips. People became aware that the threads parameter had some impact on quality, usually via references to articles like this: https://streaminglearningcenter.com/blogs/ffmpeg-command-threads-how-it-affects-quality-and-performance.html

^ Beware, there is some misinformation in that article; I'm just using it as an example.

Of course, people want the highest quality possible, so they would like to use the fewest threads possible. The graphs make it seem like there is a steep decline, when in reality I personally think it's incredibly unlikely that anyone would manage to spot a difference of 2 VMAF "points".

The article makes no mention of resolution, which is very important, and the scale of the graphs makes the difference appear huge.

I've had first-hand experience with multiple people who have set both threads=1 and, more commonly, threads=8 or something close to it. Some people also seem quite intent that 18 is the magic number.

Again, I don't think manually messing with this is productive in most circumstances, and the headache is, in my opinion, really not worth what I view as an imperceptible quality difference.

My primary concern is use-cases where we need to achieve real-time encoding (like livestreaming). If it's not real-time and you don't care about the speed/efficiency of the process at all, then this isn't really of any concern.

What?

Let me preface this by stating that I'm exclusively talking about frame-based threading, not slice-based. Slice-based threading sucks, and we shall erase the tech from our minds going forward.

But what is this thread parameter? What does it do? How does it work?
Usually when I talk to people, the answer I get is that it limits the number of logical threads (CPU threads/cores) to be used. This is unfortunately incorrect. The setting can impact CPU utilization, but that description is far from accurate.

When the thread parameter is set, either by the user or automatically, what it tells x264 is how many "workers" to spawn to work on the job (I'll be referring to them as workers from here on). This does not mean that all the workers will be actively doing work all the time (see the threadpool section below). It also has some knock-on effects, like impacting the number of lookahead threads.

It's best to view these workers as dumb and independent. They cannot share information with other workers and are not particularly good at working together (which is where some of the quality impact comes from). More workers make the process a bit more complicated, but overall it works quite well.

The workers do their job as quickly as they can, assuming there is work to be done. Many workers can work on the same frame, although having too many workers on a single frame has some downsides, which is why there is an algorithm that decides how many workers are allowed.

How?

So with the default, meaning you haven't specified the parameter, x264 automatically uses an algorithm to decide on the threads=x value. The formulas can be found in the x264 source, and I'll do my best to write them up in an easy-to-understand way.

threads = logical threads * 1.5

This again shows us that x264 threads do not equal logical threads. "Threads" here means the workers (x264 threads); logical threads are the CPU cores/threads (on HT/SMT-capable chips), and we multiply that number by 1.5. So let's take an 8-core chip with hyper-threading enabled: that gives us 16 logical threads * 1.5 = 24 workers (x264 threads).
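If you want to sanity-check that math, here's a tiny sketch of the heuristic as I described it (my own illustration, not x264's actual source; the logical thread count is hard-coded as an example):

#include <stdio.h>

int main(void)
{
    /* Example chip: 8 cores with HT/SMT enabled = 16 logical threads. */
    int logical_threads = 16;

    /* The default heuristic described above: 1.5x the logical threads,
       done with integer math. */
    int threads = logical_threads * 3 / 2;

    printf("default x264 threads = %d\n", threads);   /* prints 24 */
    return 0;
}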

This gives us the ideal number of workers x264 would like to spawn. We can also do a 10-core HT chip, giving us 30 x264 threads. According to the graph, 24 should lose us 1-2 VMAF points, and 30 would lose a bit more than that. Is this reasonable?

Well, it depends on the video we are encoding, which is why x264 also has a value called max_threads. This is the value it would prefer not to exceed, even if the previous algorithm gave us a higher number. The formula is below.

max_threads = (video height + 15) / 16 / 2

Height is the vertical resolution (480, 720, 1080, 1440, etc.), and the divisions are integer divisions. The +15 rounds the height up to a whole number of 16-pixel macroblock rows, dividing by 16 converts pixels into macroblock rows, and dividing by 2 means each worker always has at least a couple of macroblock rows to itself. Whatever the exact reasoning, it works quite well to keep quality at a reasonable level and not cause other issues most of the time. So, what is our max_threads for the different resolutions?

(1440 + 15) / 16 / 2 = 45
(1080 + 15) / 16 / 2 = 34
(720 + 15) / 16 / 2 = 22
(480 + 15) / 16 / 2 = 15

As we can see, the developers felt it was perfectly reasonable to have up to 22 threads working on even a 720p file. Resolution (the number of lines) is critical in determining a useful worker count, as having too many workers and too few lines has negative outcomes. The algorithm controls for this.
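To tie the two rules together, here's another small sketch (again my own illustration, not x264's source) that computes the cap for each resolution and applies it to the 24-worker example from earlier, remembering that this is all integer division:

#include <stdio.h>

int main(void)
{
    int heights[] = { 1440, 1080, 720, 480 };
    int cpu_based = 24;   /* 16 logical threads * 1.5, from the earlier example */

    for (int i = 0; i < 4; i++) {
        /* max_threads = (height + 15) / 16 / 2, all integer division */
        int max_threads = (heights[i] + 15) / 16 / 2;

        /* The final value is whichever is smaller: the cpu-based number
           or the resolution-based cap. */
        int final = cpu_based < max_threads ? cpu_based : max_threads;

        printf("%4dp: max_threads = %2d, final threads = %2d\n",
               heights[i], max_threads, final);
    }
    return 0;
}

Which matches the table above: 1080p and up stay at 24, while 720p and 480p get capped.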

Threadpool

I need to make mention of the threadpool. It might sound like a nice and comfy pool for all of your workers to relax in, and it kind of is. The reason we need it is that you don't always need all of your workers, so some of them are sometimes free. When they are free, they are returned to the pool so they can quickly be asked to go do stuff again when (and if) the time comes.

This is also why we want more x264 threads than we have logical threads: it's efficient to do so, especially on HT/SMT chips, which are pretty much a given these days. We always have workers ready in the pool that can start on another piece of work as soon as there is work to be done, instead of sitting around waiting for someone to finish and losing efficiency.
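If you've never seen a thread pool, here's a toy one just to illustrate the concept of workers idling in a pool and grabbing work the moment it appears. This is purely my own sketch of the general pattern, it is absolutely not how x264 implements its pool:

#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;
static int jobs_left = 12;      /* pretend we have 12 frames to encode */
static int no_more_jobs = 0;    /* set once nothing new will arrive */

static void *worker(void *arg)
{
    long id = (long)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (jobs_left == 0 && !no_more_jobs)
            pthread_cond_wait(&work_ready, &lock);  /* relax in the pool */
        if (jobs_left == 0) {                       /* nothing left, go home */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        int job = jobs_left--;                      /* grab a job right away */
        pthread_mutex_unlock(&lock);
        printf("worker %ld encoding frame %d\n", id, job);
    }
}

int main(void)
{
    pthread_t workers[NUM_WORKERS];
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&workers[i], NULL, worker, (void *)i);

    /* Tell the pool no further jobs are coming and wake everyone up. */
    pthread_mutex_lock(&lock);
    no_more_jobs = 1;
    pthread_cond_broadcast(&work_ready);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}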

This is very broad strokes on a slightly complex subject that I am in no way an expert on, so take this interpretation with a grain of salt.

Rant over?

Is there truth to the notion that changing the threads value can impact quality? Absolutely! It's just flat-out true. But the difference is so small that I don't personally think it's a good idea to start manually setting the value. I trust the algorithm, the developers still trust it, and I trust the experts.

If I manually set the value and forget about it, and then the content changes, the resolution increases/decreases, or we change the framerate (especially for real-time), we end up with choppy content. I would take a tiny, imperceptible quality difference every day over an increased risk of choppy footage (catastrophic, in my opinion).

If you restrict the threads value, you also lower the bar at which disaster hits (encoder lag, resulting in dropped/duplicate frames, or just no data arriving). X threads might work for a 3-minute test where you are feeding it one kind of content, but remember, this has to work 100% of the time. Content will drastically change the compute power required, and you can't test every scenario, every scene change, or every rapid panning shot. Lowering the threads increases the risk that x264 won't have the power it needs to get the job done in time.

If you are strapped for CPU, or you want x264 to use fewer resources, I would recommend starting elsewhere, like using a faster preset, lowering refs, etc.

Limiting threads can be wasteful, and potentially catastrophic in real-time scenarios. I'm not saying that there is never a place for the parameter, and if you still want to be in charge of this stuff, by all means go ahead. My goal is just to shine a light on some of the misinformation I've seen about this parameter.

Let's take a scenario where I set a 6-core chip to threads=3. This will limit utilization to somewhere around 40-60% of the chip. It does not mean the same job gets done at the same speed, just with less CPU usage, than if I ran it at threads=18. It just means I've put a cap on the number of workers, which is less efficient and slower. For real-time encoding, our number one priority must always be to complete the work in time. We're not looking to have the frames arrive at some point; we need them all to be there on time.

Making the choice to lag earlier than necessary does not make much sense to me. That's more or less what bothers me about some of the misinformation regarding the topic.

Proof?!

I've made a lot of claims, and haven't really shown a lot of proof yet. Figured I would add some short tests to give perspective.

Example of threads=3 on a 6-core (12 logical thread) chip, going full speed (nothing is holding the workers back except threads). On auto, this would be threads=18 for 1080p.

That's a lot of activity spanning every thread, despite the low worker value.

I want to show a couple of other perhaps interesting graphs.

This is real-time 1080p30 on what I would consider difficult content (Tarkov gameplay). Our clip is 32.766 seconds. I'm going to guesstimate that we can afford at most about 2 seconds on top of that; 34.766s is most likely going to be too slow, resulting in lag.
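Quick back-of-the-envelope check on that budget, using the clip length and framerate from above:

#include <stdio.h>

int main(void)
{
    double clip_seconds = 32.766;   /* length of the test clip */
    double fps = 30.0;              /* 1080p30 */

    double frames = clip_seconds * fps;   /* ~983 frames total */
    double budget_ms = 1000.0 / fps;      /* ~33.3 ms per frame on average */

    printf("frames: %.0f, per-frame budget: %.1f ms\n", frames, budget_ms);
    return 0;
}

Those 983 frames are also the frame count you'll see in the OBS tests below.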

Threads=auto (18). x264 veryfast 1080p30 6mbit CBR (real-time). Time: 33.607s
Threads=6. x264 veryfast 1080p30 6mbit CBR (real-time). Time: 33.849s

CPU usage looks very close (I don't have better tools at the moment to show the usage of a single process). We're about 250ms behind threads=auto. Probably fine, and still well under 34.766s.

Threads=auto (18). x264 slow (modified), 1080p30 6mbit CBR (real-time). Time: 33.877s

OK, we maxed out some logical threads during certain parts of the encode (the more difficult parts). We still made it under 34.766s, but this is scary territory; we could easily skip frames here.

Threads=6. x264 slow (modified), 1080p30 6mbit CBR (real-time). Time: 36.653s

We did use less CPU, but look at our time. We are well above 34.766s. We would certainly skip frames here, as we did not hit real-time consistently.

If we try to replicate this with the same video input and have OBS encode it with x264 (same settings), we actually get a skipped-frames count (I don't know how to tell standalone x264 to drop frames to stay real-time).

Threads 6: 431 frames skipped, out of 983 (43.8%)
Threads auto: 1 frame skipped, out of 983 (0.1%)

OK, so let's say 6 is just too few. We have hyper-threading, so 6 cores = 12 logical threads. Let's set it to threads=12 instead. That way we can get our juicy VMAF points. Should be good, right?

In order to test this, I need to increase the load a bit to make sure we're actually close to dropping frames, so the impact is visible instead of both runs showing 0 skipped. I removed my modification (trellis=1) and let it run the unmodified "slow" preset. Everything else stayed the same.

Threads 12: 562 dropped, out of 983 (57.1%)
Threads auto: 117 dropped, out of 983 (11.9%)

So yeah, there is still an efficiency cost, and we hit the bar way earlier by restricting the threads. I know I'm repeating myself, but I would much rather risk 2 VMAF points than risk choppy footage. The workload stays the same regardless of threads. Lowering threads does not cause x264 to need less CPU; it only lowers the bar to misery.

Threads 12:

Really struggling here. The TCR shows HH:MM:SS:Frame (at 30 fps).

Auto threads (18):

Pretty major difference.

Reminder, if you happen to spot any errors, or think any of the explanations were improper/subpar, feel free to contact me, and hopefully I can improve :)
https://blog.otterbro.com/contact/