Voice recognition with spectrogram

asked 2017-01-07 15:13:38 -0600

NavinM

updated 2017-01-08 00:50:50 -0600

Can we find out the total number of speakers and how long each one speaks by looking at/analysing a spectrogram? ![image description](https://drive.google.com/drive/folder...)

Just by looking at the image I can see some pattern, but I am looking for a proper solution in terms of OpenCV code (Python).

Comments

I think you're pretty lost here... in any case, the image is not accessible

LorenaGdL ( 2017-01-07 15:17:07 -0600 )

@LorenaGdL maybe lost, but maybe not

LBerger ( 2017-01-07 16:06:06 -0600 )

@LBerger processing audio definitely means working in the frequency domain. But that doesn't mean that just by looking at the spectrum image (speaking of pure computer vision) you can retrieve the audio and perform voice recognition; that's a whole different story. Doable? Maybe. Doable by the OP? I don't think so, to be honest...

LorenaGdL ( 2017-01-07 16:35:34 -0600 )

Yeah, all the analysis in @LBerger's link is being done on the actual data. The spectrogram is just to display it. Maybe you could adapt something from the illumination/reflectance separation work, but it would take a lot of alteration.

Tetragramm ( 2017-01-07 17:56:33 -0600 )
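
To illustrate the point that the analysis runs on the sample data and the spectrogram is only a view of it, here is a minimal sketch that builds a magnitude spectrogram directly from samples with NumPy. The function name `stft_magnitude`, the frame and hop sizes, and the synthetic two-tone signal are all assumptions made for illustration; nothing here comes from the linked material or from the OP's recording.

```python
import numpy as np

def stft_magnitude(samples, sample_rate, frame_len=512, hop=256):
    """Return (freqs, times, magnitude) for a 1-D signal using Hann-windowed FFT frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window to each one.
    frames = np.stack([samples[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One FFT per frame; transpose so rows are frequency bins and columns are frames.
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + frame_len / 2) / sample_rate
    return freqs, times, magnitude

# Synthetic stand-in for a recording (assumption): 1 s at 440 Hz, then 1 s at 880 Hz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.concatenate([np.sin(2 * np.pi * 440 * t),
                         np.sin(2 * np.pi * 880 * t)])

freqs, times, mag = stft_magnitude(signal, sample_rate)
dominant = freqs[mag.argmax(axis=0)]  # strongest frequency in each frame
print(dominant[0], dominant[-1])
```

The array `mag` is what a spectrogram image displays, but here it keeps its real frequency and time axes, which a screenshot of the plot does not.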

I have shared the image now. You are right that illumination/reflectance separation seems to work. What kind of alterations should we make?

NavinM ( 2017-01-08 00:52:40 -0600 )

@NavinM this is going nowhere, mainly because you show no effort at all. Have you researched the topic? Do you have any idea how to actually tackle the problem? If not, we're not here to do the work for you. Once you have specific OpenCV problems, come back.

LorenaGdL ( 2017-01-08 01:52:48 -0600 )

I am working on code to measure the separation, but it is not very good. I also want to consider the peak points (in the region of illuminance) for identifying speech. Is there a subtle way of doing it?

NavinM ( 2017-01-08 01:58:52 -0600 )
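
One possible reading of "peak points in the bright regions" is sketched below: treat the spectrogram as a 2-D magnitude array, flag the frames with enough energy, and take the strongest frequency bin in each of them. The helper names, the relative threshold of 0.1, and the toy array are assumptions for illustration only. This marks where sound is present; it does not tell speakers apart.

```python
import numpy as np

def active_frames(magnitude, rel_threshold=0.1):
    """Flag frames whose total energy exceeds a fraction of the loudest frame."""
    energy = (magnitude ** 2).sum(axis=0)
    return energy > rel_threshold * energy.max()

def peak_bins(magnitude):
    """Index of the strongest frequency bin in each frame."""
    return magnitude.argmax(axis=0)

# Toy magnitude array (assumption): 6 frequency bins x 8 frames,
# with near-silence in frames 3 and 4.
mag = np.full((6, 8), 0.01)
mag[2, 0:3] = 1.0   # bright ridge at bin 2 for frames 0-2
mag[4, 5:8] = 1.0   # bright ridge at bin 4 for frames 5-7

active = active_frames(mag)
peaks = peak_bins(mag)
print(active.astype(int))   # 1 where a frame is considered active
print(peaks[active])        # peak bin of each active frame
```

A fixed relative threshold is the simplest choice and will misbehave on noisy recordings; it is used here only to keep the sketch short.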

@Tetragramm Sorry, I think my link is not good for this problem because here it seems there is only one sensor.

@NavinM "just looking at the image, I can see some pattern": I think it is like a Rorschach figure. I agree with @LorenaGdL: come back with a method and a source file and it will be easier to help you.

LBerger ( 2017-01-08 02:21:26 -0600 )

you wouldn't do computer vision based on an acoustical description of an image, or would you?

(imho, you're simply on the wrong rail here. go and process audio data, not images.)

berak ( 2017-01-08 02:44:20 -0600 )

My purpose is to identify speakers, not what they say. Audio data has two aspects. The first is identification of speakers (speaker 1 for the first 2 min, then speaker 2 for the next 1 min, then speaker 1 for the next 1 min, and so on), which can only be done in the frequency domain that the spectrogram captures. The second is what they say, which I am not interested in at present. That is why spectrogram analysis seems crucial for my work.

NavinM ( 2017-01-08 05:39:05 -0600 )
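
The output described above (speaker 1 for 2 min, speaker 2 for 1 min, speaker 1 for 1 min) can be sketched as a run-length pass over per-frame speaker labels. This assumes the hard part, assigning a speaker label to each frame, has already been done by some audio-based method not shown here; the labels, the 30-second frame length, and the function names are hypothetical.

```python
from itertools import groupby

def label_runs(labels, frame_seconds):
    """Collapse per-frame speaker labels into (speaker, start_s, end_s) segments."""
    segments = []
    start = 0
    for speaker, run in groupby(labels):
        length = len(list(run))
        segments.append((speaker,
                         start * frame_seconds,
                         (start + length) * frame_seconds))
        start += length
    return segments

def total_duration(segments):
    """Sum the speaking time of each speaker across all segments."""
    totals = {}
    for speaker, start_s, end_s in segments:
        totals[speaker] = totals.get(speaker, 0.0) + (end_s - start_s)
    return totals

# Hypothetical labels, one per 30-second frame.
labels = ["speaker1"] * 4 + ["speaker2"] * 2 + ["speaker1"] * 2
segments = label_runs(labels, frame_seconds=30.0)
totals = total_duration(segments)
print(segments)
print(len(totals), totals)
```

The number of distinct keys in `totals` is the speaker count, and each value is that speaker's total speaking time in seconds.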