Sample frequency and bit depth

So, perhaps you’ve seen, in export settings for wave files or in the upload specifications of various distribution sites, that the file needs to be 44100 Hz (or 44.1 kHz) and 16-bit.

What’s that all about?

Let me first state two things before you continue reading.

  1. This is nothing you need to understand; it is just surplus knowledge if you are interested.
  2. As usual, I tend to oversimplify, almost to the point of being at the limit of actually being true. In this case it is the concept I aim to explain, not the exact technical specification. I am sure you can find more exact specifications with a little google-fu!

Let’s start with bit depth.

As you probably already know, sound is not actually a “wave” passing through the air; it is rather energy moving through the air, pushing the molecules against each other in a chain reaction. This creates regions of air that are denser, followed by regions that are thinner.

In order to transform this movement in the air, we have two things at our disposal: our ears and a microphone. The mechanics of the two are similar (but not identical); both have a membrane to catch the movement in the air and transform it. Let’s focus on the mic.

The membrane is designed to physically move with the air, or rather, be moved by the air. When the membrane is unaffected, it rests in its middle position (which we call the 0-position), and it can be moved (pushed) in and moved (pulled) out. In the oversimplification spirit, let’s use the number I started with: 16 (from 16-bit). This would translate into something like 16 possible positions in which to capture the current position of the membrane. Imagine 16 cameras, 8 mounted on each side of the membrane, with the sole purpose of capturing an image of where the membrane physically is at any given moment. Now, the cameras do not overlap, which means that when the 16 cameras are triggered to capture an image, the membrane will only be seen in one of them, leaving the other 15 cameras with blank pictures.
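To make that picture a little more concrete, here is a minimal sketch in Python of the “which camera saw the membrane” idea: a position between fully pulled in (-1.0) and fully pushed out (+1.0) is snapped to the nearest of a small number of discrete slots. The function name and the choice of 16 slots are just illustrative, not how a real converter is built.

```python
# A toy version of the "cameras" idea: quantize the membrane position
# (a value between -1.0 and +1.0) to one of a small number of slots.
def quantize(position: float, num_levels: int) -> int:
    """Snap a membrane position in [-1.0, 1.0] to the nearest of num_levels slots."""
    # Scale [-1.0, 1.0] onto [0, num_levels - 1] and round to the nearest slot.
    return round((position + 1.0) / 2.0 * (num_levels - 1))

# With 16 "cameras", only one of 16 indices can ever come back:
print(quantize(-1.0, 16))  # fully pulled in  -> slot 0
print(quantize(0.0, 16))   # middle position  -> slot 8
print(quantize(1.0, 16))   # fully pushed out -> slot 15
```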

The number 16 in the above example is the depth of our ability to capture the movement of the membrane. Of course a real recording does not use “only” 16 “frames”, but the mechanics behind it are the same. Should you decrease it to 8-bit, it would be half the number of (fictive) cameras, with 4 cameras on each side of the membrane, while increasing it to 32 would (fictively) give 16 cameras on each side. Naturally, the larger the number, the more precision you get in capturing the exact position. But. (Yes, there always seems to be a but!) You will also generate more data, and therefore larger files. And there is a sweet spot where the common man can’t hear the difference when increasing the bit depth. (However, it is debated whether that sweet spot is 16-bit, like the “CD standard”, or something higher such as 24- or 32-bit. I’ll leave it up to you to decide your favourite.)
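For the curious: in real digital audio the relationship is exponential rather than linear. An n-bit sample can take 2^n distinct values, so the jump from 8-bit to 16-bit is far bigger than “twice as many cameras”. A tiny Python snippet (purely for illustration) makes the real numbers visible:

```python
# In real digital audio the number of distinct positions grows exponentially
# with the bit depth: an n-bit sample can hold 2**n different values.
for bits in (8, 16, 24, 32):
    print(f"{bits}-bit: {2 ** bits:,} possible positions per sample")

# 8-bit:  256 possible positions per sample
# 16-bit: 65,536 possible positions per sample
# 24-bit: 16,777,216 possible positions per sample
# 32-bit: 4,294,967,296 possible positions per sample
```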

The other number above is 44100 Hz. This literally means 44100 snapshots during one second, where Hz stands for Hertz and is the unit used to measure cycles (or events occurring) per second. Let’s stick with events occurring per second. In our example above, this means that the cameras are triggered to capture images 44100 times each second. Another often-used sampling frequency is 48 kHz, or 48000 Hz, which would translate to triggering the cameras to take pictures 48000 times each second. Again, it is debated which is “best”: the “CD standard” of 44100 Hz or the other commonly used 48000 Hz. Again, I leave it up to you. But let’s ask the question: why 44100 or 48000? Wouldn’t 96000 times be better? Or could 10000 be enough? Again, the more times we “snap images” with all the cameras, the more data is gathered and the larger the file gets. And there is a sweet spot somewhere around 44100 or 48000 where the common man can no longer hear any difference in the captured sound.
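If you like numbers, here is a small Python sketch (again purely illustrative) of what those rates mean in practice: how little time passes between two snapshots, and how many snapshots pile up over a minute.

```python
# Sample rate is how many snapshots ("camera triggers") happen per second.
for rate in (44_100, 48_000, 96_000):
    interval_us = 1_000_000 / rate  # time between two snapshots, in microseconds
    print(f"{rate} Hz: one snapshot every {interval_us:.1f} µs, "
          f"{rate * 60:,} snapshots per minute")
```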

But of course there is a reason for being above 40000: the human hearing range. The human ear can (at best) hear up to 20000 Hz, which would be the equivalent of a very high-pitched tone. To capture a 20000 Hz tone using our cameras from the example above, you might think it is enough to take pictures 20000 times per second. But if a “wave” passes the membrane 20000 times in one second, and we take pictures 20000 times during that second, we would see the membrane at the exact same position in every picture stack, which would register as no movement at all. Doubling it to 40000 pictures per second would also (in theory) give a series of images showing the membrane in only two positions. Thus, by increasing the frequency of the image capturing a bit further, we can build a more precise “wave” from the captured pictures, where we can actually see the movement as it is. (This is the idea behind what engineers call the Nyquist theorem: the sample rate has to be more than twice the highest frequency you want to capture, which is why 44100 and 48000 sit just above 2 × 20000.)
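You can even simulate the picture-taking yourself. This little Python sketch (the function name is made up, and the phase offset only decides where in the cycle the first picture lands) samples a 20000 Hz sine wave at different rates: at 20000 snapshots per second every picture looks identical, at 40000 only two positions ever appear, and only above that does the wave shape start to emerge.

```python
import math

# Simulate "taking pictures" of a 20,000 Hz sine wave at different sample rates.
def sample_sine(freq_hz, rate_hz, count, phase=math.pi / 2):
    return [round(math.sin(2 * math.pi * freq_hz * n / rate_hz + phase), 3)
            for n in range(count)]

print(sample_sine(20_000, 20_000, 6))  # [1.0, 1.0, 1.0, ...] every picture looks the same
print(sample_sine(20_000, 40_000, 6))  # [1.0, -1.0, 1.0, ...] only two positions are seen
print(sample_sine(20_000, 96_000, 6))  # varied values: the wave shape starts to appear
```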

To render the “sound wave” (the graphical representation we can see on our screen), we take the number of cameras we have decided on (the depth) and put a “dot” in the dedicated frame for the camera that captured the membrane, leaving all the other frames empty. We then repeat that as many times as we have decided (the sample frequency), getting a graphical representation of where the membrane was at each given moment, stacked one after the other. If our sound is 1 second long, we would use 44100 sets of pictures from our 16 cameras. If our sound is one minute long, that is 60 times more information.
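To get a feel for how quickly that information adds up, here is a rough back-of-the-envelope sketch in Python of the raw (uncompressed) size of such audio: samples per second times bytes per sample times channels times duration. The numbers are only an estimate; real wave files also add a small header, and compressed formats behave differently.

```python
# Raw, uncompressed audio size: samples/second * bytes/sample * channels * duration.
def raw_audio_size_bytes(rate_hz, bit_depth, channels, seconds):
    return int(rate_hz * (bit_depth // 8) * channels * seconds)

print(f"{raw_audio_size_bytes(44_100, 16, 1, 1):,} bytes for 1 second of mono")           # 88,200 bytes
print(f"{raw_audio_size_bytes(44_100, 16, 2, 60) / 1_000_000:.1f} MB for 1 minute of stereo")  # about 10.6 MB
```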

As I said at the start, this is both surplus knowledge and oversimplified. But the basic principle is about as stated above. I hope it gives you something. If not, do like Elsa in Frozen and “Let It Go!” 😊