
Various image features from a spectrogram

asked 2013-05-23 05:16:31 -0600

alsaxx

Hi, I am trying to see if I can approach the speech recognition (especially phoneme segmentation) problem using computer vision techniques on the spectrogram. I have some open questions, and I also invite everybody who is interested to get in touch with me for a possible collaboration.

A spectrogram looks like this. The X axis represents time, and the Y axis is the spectrum of the underlying signal. I would like to extract some features using a pure image processing approach. At first, the feature extraction parameters, such as the SNR-dependent thresholds, will be set by hand through sliders in a prototype GUI. Later I will look for a method to derive these parameters from the context as well.

So, here are the questions. Please keep in mind that I am a complete beginner in image processing:

  1. The human eye can clearly identify five main vertical shapes. Should I look into edge detection to find their start and end points? That could be an algorithm that extracts only lines that are more or less straight and gives me their start and end points, so that I could then check the slope, intensity, and length of each line. This is equivalent to onset detection in audio signals.

  2. I feel sufficiently comfortable using an audio algorithm to find the silent (dark) vertical regions. Given a successful recognition in the previous step, I would like to operate on smaller slices of the image, each representing one area of interest. One task would be to identify the slope of the brighter oblique stripes visible in the image (these are the formants of vowels). How can I achieve that? Again, this could be edge detection, but the output of Canny seems to be a series of points without information on how they are connected.

I don't need a whole recipe, but it would be nice to have suggestions on where to look, so that I avoid complex postprocessing of the wrong algorithm's output.



What does the output you'd like to have look like? I don't think that Canny is suitable for your purposes...

Ben ( 2013-05-23 08:50:18 -0600 )

I imagined some sort of parametric line

alsaxx ( 2013-05-23 08:58:55 -0600 )

Do you have images with more detail, or is that what you have to work with?

Rui Marques ( 2013-09-30 04:54:39 -0600 )

1 answer


answered 2013-05-23 09:02:34 -0600

Ben

For 1.) you might want to experiment with a Sobel filter, which gives you e.g. the image derivative in the x direction. Then set a threshold to get a binary image whose pixels indicate strong changes in the x direction. Finally, count the set pixels in every column to find the positions of clear phoneme starts/ends.
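The three steps for 1.) could be sketched roughly as follows. This is only an illustration under assumptions: it uses plain NumPy with a hand-written 3x3 Sobel kernel (in OpenCV, `cv2.Sobel` would do the derivative step), the helper names `sobel_x` and `column_edge_counts` are made up for this example, and the threshold and the "more than half of the column" rule are arbitrary values chosen for a synthetic test image, not tuned for a real spectrogram.

```python
import numpy as np

def sobel_x(img):
    """Horizontal Sobel derivative of a 2-D array (borders replicated)."""
    p = np.pad(img.astype(float), 1, mode="edge")
    # Kernel [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]] applied as a correlation.
    right = p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
    left = p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2]
    return right - left

def column_edge_counts(img, threshold):
    """Per column, count the pixels whose |d/dx| exceeds the threshold."""
    strong = np.abs(sobel_x(img)) > threshold  # binary edge image
    return strong.sum(axis=0)

# Synthetic "spectrogram": dark background, one bright block in columns 10..19.
spec = np.zeros((32, 40))
spec[:, 10:20] = 1.0

counts = column_edge_counts(spec, threshold=2.0)  # threshold is an assumption
# Columns where more than half of the pixels show a strong horizontal change.
boundaries = np.flatnonzero(counts > 0.5 * spec.shape[0])
print(boundaries)
```

On a real spectrogram the column counts will be noisy, so smoothing them or picking local maxima is probably needed before treating a column as a phoneme boundary.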

For 2.) you could compute something like a histogram of gradients for a given region.
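A minimal sketch of the idea for 2.), assuming a single magnitude-weighted orientation histogram over the whole region is enough. This is not the full HOG descriptor (no cells, no block normalisation), and the function name `orientation_histogram` and the bin count are assumptions made for this example. Note that a stripe runs perpendicular to its gradient, so the stripe direction is the dominant gradient angle minus 90 degrees.

```python
import numpy as np

def orientation_histogram(region, bins=9):
    """Magnitude-weighted histogram of gradient orientations in [0, 180) degrees."""
    gy, gx = np.gradient(region.astype(float))  # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    # Fold opposite gradient directions together (unsigned orientation).
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, edges = np.histogram(angle, bins=bins, range=(0.0, 180.0),
                               weights=magnitude)
    return hist, edges

# Synthetic region: one bright oblique stripe along the main diagonal.
r, c = np.indices((32, 32))
region = np.exp(-((c - r) ** 2) / (2 * 2.0 ** 2))

hist, edges = orientation_histogram(region)
k = int(np.argmax(hist))
dominant_gradient_angle = 0.5 * (edges[k] + edges[k + 1])  # bin centre, degrees
stripe_angle = (dominant_gradient_angle - 90.0) % 180.0    # along the stripe
print(dominant_gradient_angle, stripe_angle)
```

Angles here are measured in image coordinates, with the row index increasing downwards, so the sign of the slope has to be flipped if the spectrogram is displayed with low frequencies at the bottom. The resolution of the estimate is limited by the bin width; more bins or a weighted mean of the angles would give a finer slope.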



is there any example/tutorial for the second answer?

alsaxx ( 2013-05-23 09:12:44 -0600 )

hmm... there is a HOG based object detection algorithm implemented in OpenCV, but I don't know if and how you can use the HOG computation that must happen somewhere inside. I guess you'd have to look at the source code. Or you implement it yourself.

Ben ( 2013-05-23 10:05:26 -0600 )


Last updated: May 23 '13