Very interesting question, though it jumps from an apple to a chicken to a planet.
From my experience with OpenCV, audio signals, and spectrum analysis, what I will tell you is similar to berak's answer. I've put it on a sheet for you and other coders.
The path to follow is this one:
Make sure that, whatever language you are using, you have good access to the audio buffer.
To achieve that, either find a good library that can read any WAV file and convert it to a specific format (e.g. 44100 Hz, mono or stereo depending on your needs), or read the specification of the format you want to analyse.
Then make sure you can easily access your audio data.
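As an illustration of the "good access to the audio buffer" step, here is a minimal sketch in Python. It assumes NumPy is installed and that the file is 16-bit PCM; the function name `load_wav_mono` is made up for this example, and other sample widths or compressed formats would need a fuller library.

```python
import wave

import numpy as np


def load_wav_mono(path):
    """Read a 16-bit PCM WAV file and return (samples, sample_rate).

    samples is a float32 array scaled to [-1, 1]; multi-channel audio
    is averaged down to a single mono channel.
    """
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            # Assumption of this sketch: only 16-bit samples are handled.
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wf.getframerate()
        channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())

    # WAV stores 16-bit samples as little-endian signed integers.
    data = np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0
    if channels > 1:
        # Channels are interleaved: one row per frame, one column per channel.
        data = data.reshape(-1, channels).mean(axis=1)
    return data, sample_rate
```

Calling `samples, rate = load_wav_mono("sound.wav")` (with a file name of your own) gives you a flat array you can slice freely in the next steps.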
You'll have to decide what "time window" each spectrum analysis image will cover, e.g. 20 ms, 100 ms, or 1 s; it's up to you. One option is to take something like 10 captures per second, each only 10 or 20 ms long, so you don't process the whole sound and you save on performance. On the other hand, if you're not doing live conversion, you may want to cover the sound data in full.
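The "10 captures per second, 20 ms each" idea could be sketched like this. It assumes the samples are already in a one-dimensional NumPy array; `capture_windows` and its parameter names are invented for the example.

```python
import numpy as np


def capture_windows(samples, sample_rate, captures_per_second=10, window_ms=20):
    """Cut short, evenly spaced windows out of a longer sample array.

    Returns a 2-D array with one row per capture. With the defaults,
    only 20 ms out of every 100 ms of audio is kept.
    """
    hop = sample_rate // captures_per_second   # samples between capture starts
    size = sample_rate * window_ms // 1000     # samples inside one capture
    starts = range(0, len(samples) - size + 1, hop)
    return np.array([samples[s:s + size] for s in starts])
```

For full coverage of the sound, set the hop equal to the window size (for instance `captures_per_second=50` with `window_ms=20`), so the windows sit back to back.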
That being said, pass each of these buffers, containing a short sound sample, through an FFT algorithm. From my experience, you'll have some doubts about whether the FFT snippets you find actually work, but they generally do if they are well implemented.
Generally, an FFT method receives an amount of data (i.e. one of your sound samples) as input, and outputs the amplitude for each frequency. It looks a bit like this: 20 Hz = 2, 40 Hz = 2, 60 Hz = 10, 80 Hz = 20, 100 Hz = 30, 120 Hz = 30, etc. The frequencies are evenly spaced, and the step between them equals the sample rate divided by the number of samples in the window. So you can easily sketch a picture of the spectrum by drawing some vertical lines.
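To make the FFT step concrete, here is a rough sketch using NumPy's FFT routines rather than a hand-written one. The names `spectrum` and `draw_spectrum` are made up, and the Hann taper is my own addition (a common way to reduce leakage between neighbouring frequencies), not something the steps above require. With a 20 ms window at 44100 Hz (882 samples), the frequencies come out 50 Hz apart.

```python
import numpy as np


def spectrum(window, sample_rate):
    """Return (freqs_hz, magnitudes) for one short window of samples."""
    # Assumption of this sketch: taper the window edges with a Hann window.
    tapered = window * np.hanning(len(window))
    magnitudes = np.abs(np.fft.rfft(tapered))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / sample_rate)
    return freqs, magnitudes


def draw_spectrum(magnitudes, height=100):
    """Draw one vertical line per frequency into a grayscale image."""
    peak = magnitudes.max()
    scaled = magnitudes / peak if peak > 0 else magnitudes
    bar_heights = np.round(scaled * height).astype(int)

    # Rows are pixels from top to bottom, columns are frequencies.
    image = np.zeros((height, len(magnitudes)), dtype=np.uint8)
    for x, h in enumerate(bar_heights):
        if h > 0:
            image[height - h:, x] = 255  # white line growing from the bottom
    return image
```

The result is a plain `uint8` array, which is the same kind of image OpenCV works with, so it can be handed to functions such as `cv2.imshow` or `cv2.imwrite` if you have OpenCV installed.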
I am very limited in my mathematics and physics knowledge, but that's the path I've always followed, whether to draw the spectrum in media players I've designed or to analyse sound.
I'm pretty sure that, with some imagination, you can match it with some OpenCV algorithms and do something really hot.