What do you hear when you play this sound?
Scroll down for an explanation...
Some people will hear this recording as saying "Laurel"; others will hear "Yanny". What's going on?
First off, let's clean this recording up by filtering out the computer noise in the background:
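In case you want to experiment yourself, here's roughly what that kind of cleanup looks like. This isn't the filter behind the demo above -- just a minimal numpy sketch using a brick-wall FFT band-pass, with a synthetic 60 Hz hum standing in for the background noise:

```python
import numpy as np

def bandpass_fft(signal, rate, low, high):
    """Crude brick-wall band-pass: zero every FFT bin outside [low, high] Hz."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate)
    spectrum[(freqs < low) | (freqs > high)] = 0
    return np.fft.irfft(spectrum, n=len(signal))

rate = 8000                                   # one second of 8 kHz audio
t = np.arange(rate) / rate
hum = 0.5 * np.sin(2 * np.pi * 60 * t)        # synthetic background hum
voice = np.sin(2 * np.pi * 1000 * t)          # stand-in for the speech band
cleaned = bandpass_fft(hum + voice, rate, low=300, high=3000)
```

A real denoiser would be gentler than this (brick-wall filters ring), but it's enough to cut anything outside the 300-3000 Hz band where most speech energy lives.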
When I first heard this sound, I could hear both words quite easily. However, the more I listen to it, the more sure I am that I hear "Laurel".
Human perception of speech is a complex topic which science is only just beginning to understand. We may not think of it as a complex process because our brains do all the work for us. But it is a marvel of pattern matching and inference.
For example, when understanding the speech of another person, we use visual cues to distinguish between ambiguous sounds. This is called the McGurk Effect: you can change what you hear by focussing on a video of a person speaking.
Haskins Laboratories at Yale wrote an excellent piece on speech perception. Their core idea is that no individual element is strictly necessary to understand speech; instead, we take a number of separate cues and combine them intelligently, depending on the circumstances. They have some really cool demonstrations of this.
One of the means by which we recognize speech is formants. These are resonant frequency peaks that stay roughly the same for a given vowel sound no matter who is speaking (which is why we can understand people with deep or high voices, and why sped-up or slowed-down speech sounds unnatural rather than just faster or slower).
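To make that concrete, here's a toy numpy model -- not real speech, just harmonics shaped by a resonance curve -- showing that the spectral peak stays parked at the formant even when the pitch of the voice changes:

```python
import numpy as np

def vowel_spectrum(f0, formants, rate=8000, n=8000):
    """Toy vowel: harmonics at multiples of the pitch f0, with amplitudes
    shaped by resonance peaks (the formants)."""
    freqs = np.fft.rfftfreq(n, d=1.0 / rate)
    harmonics = (np.abs(freqs / f0 - np.round(freqs / f0)) < 1e-6) & (freqs > 0)
    envelope = sum(1.0 / (1.0 + ((freqs - f) / 80.0) ** 2) for f in formants)
    return freqs, harmonics * envelope

# A deep voice (100 Hz) and a higher voice (150 Hz) making the same "vowel":
# the strongest harmonic lands near the 700 Hz formant in both cases.
for f0 in (100, 150):
    freqs, spec = vowel_spectrum(f0, formants=[700])
    print(f0, freqs[np.argmax(spec)])
```

Real vowels have several formants, but one resonance is enough to see the point: change the pitch and the harmonics all move, yet the loudest one stays pinned near the resonance.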
In the audio example above, there is no video to go with it. So perhaps instead we can try splitting it into different frequencies:
That kinda works. Let's try shifting it up/down instead.
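I don't know exactly how the shifter in this demo is implemented, but the crudest possible version is plain resampling -- reading the samples faster multiplies every frequency (and, unlike a proper pitch shifter, also shortens the clip):

```python
import numpy as np

def shift_pitch(signal, factor):
    """Resample by linear interpolation: reading the samples `factor` times
    faster multiplies every frequency by `factor` (and shortens the clip --
    a real pitch shifter would also time-stretch to keep the duration)."""
    idx = np.arange(0, len(signal) - 1, factor)
    return np.interp(idx, np.arange(len(signal)), signal)

def dominant_hz(sig, rate):
    """Frequency of the loudest FFT bin, in Hz."""
    return np.argmax(np.abs(np.fft.rfft(sig))) * rate / len(sig)

rate = 8000
t = np.arange(rate) / rate
tone = np.sin(2 * np.pi * 440 * t)   # a 440 Hz test tone
up = shift_pitch(tone, 1.5)          # now dominated by ~660 Hz
print(dominant_hz(tone, rate), dominant_hz(up, rate))
```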
So that's pretty clear. The shifted-up version sounds like "Laurel", and the shifted-down version sounds like "Yanny". But! The high frequencies sound like "Yanny", and the low ones like "Laurel". What's going on?
It turns out that this is the same phenomenon demonstrated by this famous image from the 90s:
Einstein is visible in the high frequency details, while Madonna is visible in the low frequency details. There's no single factor that determines which you perceive. If you're closer to your screen, you're likely to focus more on the smaller Einstein details, while if you're further away, you're more likely to see Madonna. But you can also squint your eyes, which will blur out the higher frequency details, or you can adjust the contrast on your monitor, which will also favour one or the other.
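The recipe for building such a hybrid image is simple: low-pass one picture, high-pass the other, and add them. Here's a sketch with random arrays standing in for the two photographs and an ideal FFT low-pass doing the splitting:

```python
import numpy as np

def lowpass(img, cutoff):
    """Keep only spatial frequencies below `cutoff` cycles/pixel (ideal 2-D low-pass)."""
    f = np.fft.fft2(img)
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    f[np.hypot(fy, fx) >= cutoff] = 0
    return np.fft.ifft2(f).real

rng = np.random.default_rng(0)
madonna = rng.random((64, 64))     # stand-ins for the two photographs
einstein = rng.random((64, 64))

low = lowpass(madonna, 0.08)                  # coarse shapes: seen from afar
high = einstein - lowpass(einstein, 0.08)     # fine detail: seen up close
hybrid = low + high
```

Blurring the hybrid (or stepping back from the screen) throws away `high` and leaves the Madonna half; sitting up close keeps the fine Einstein detail in view.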
It's certainly true that if you listen to that recording on a phone speaker, you're going to hear more of the higher frequencies. But that's not the only factor that determines which you hear. Once you've heard one sound or the other, your brain tends to reinforce your existing perceptions -- your hearing is biased towards what you expect to hear.
It's likely, given that the difference is more audible in the shifted audio than in the filtered audio, that there are ambiguous formants -- frequencies that should be fixed at one pitch for a given vowel sound but in this recording end up somewhere in between. The cues for "Yanny" sit in the higher frequencies, but here they land slightly too high, so shifting the audio down brings them into range, while emphasizing the higher frequencies simply makes those cues louder than the "Laurel" ones. For "Laurel", whose cues sit in the lower frequencies but slightly too low, the reverse is true.