Short Term Time Domain Processing of Speech Signals

1.1 Short Term Time Domain Processing of Speech Signals

An engineering solution for processing speech is to reuse existing signal processing tools in a modified fashion. To be more specific, the tools can still assume that the signal under processing is stationary, because speech can be treated as approximately stationary when it is viewed in blocks of 10-30 msec. Hence, to process speech with these tools, the signal is analysed in blocks of 10-30 msec. Such processing is termed Short Term Processing (STP).

Short Term Processing of speech can be performed either in the time domain or in the frequency domain. The choice of domain depends on the information from the speech that we are interested in. For instance, parameters like short term energy, short term zero crossing rate and short term autocorrelation can be computed by time domain processing of speech.


1.2 Objectives:

- Load, display and manipulate speech signals.
- Apply (a) a Hamming window and (b) a Hanning window to them, and illustrate your observations.
- Understand the need for short term processing of speech.
- Find the short term energy and study its significance.
- Compute the short term zero crossing rate and study its significance.
- Compute the short term autocorrelation and study its significance.
- Estimate the pitch of speech using short term autocorrelation. Develop the pitch estimation program in MATLAB using frame sizes of 10, 50 and 100 msec, each with a shift of 10 msec.
- Perform voiced/unvoiced/silence classification of speech using short term time domain parameters.

1.3 Need for Short Term Processing of Speech
Speech is produced by a time varying vocal tract system driven by a time varying excitation. As a result, the speech signal is non-stationary in nature. Most of the signal processing tools studied in signals and systems and in signal processing courses assume a time invariant system and a time invariant excitation, i.e. a stationary signal. Hence these tools are not directly applicable to speech, because using them directly on speech violates their underlying assumption. Even if you apply them blindly and compute the output, that output is of little practical significance. For instance, the relation for total energy is a fundamental tool in signal processing. That is,

$$E_T = \sum_{n=-\infty}^{\infty} s^2(n)$$

This relation is useful for a stationary signal with finite energy. Suppose you use it to compute the total energy of a speech signal. It does give the total energy present in the signal, but that single number is of little use. From the nature of its production, we know that speech has time varying amplitude and energy. What matters for speech is therefore a tool that gives information about the time varying energy. Hence the need for a different way of processing speech.

1.4 Short Term Energy Parameter
The energy associated with speech is time varying in nature. Hence the interest in any automatic processing of speech is to know how the energy varies with time or, more specifically, the energy associated with a short term region of speech. By the nature of its production, the speech signal consists of voiced, unvoiced and silence regions. The energy of a voiced region is large compared to that of an unvoiced region, and a silence region has the least, or negligible, energy. Thus short term energy can be used for voiced, unvoiced and silence classification of speech.

The relation for the short term energy can be derived from the total energy relation defined in signal processing. The total energy of an energy signal is given by

$$E_T = \sum_{n=-\infty}^{\infty} s^2(n)$$

For short term energy computation we consider speech in frames of 10-30 msec. Let the samples in a frame be indexed n = 0 to n = N-1, where N is the frame length in samples. For energy computation the speech samples are taken to be zero outside the frame, so the sum reduces to

$$E = \sum_{n=0}^{N-1} s^2(n)$$

1.5 Short Term Zero Crossing Rate (ZCR)
Zero Crossing Rate gives information about the number of zero crossings present in a given signal. Intuitively, if a signal has many zero crossings, it is changing rapidly and may therefore contain high frequency information. Along similar lines, if it has few zero crossings, it is changing slowly and may therefore contain low frequency information. Thus ZCR gives indirect information about the frequency content of the signal.

In the case of speech, the nature of the signal changes over a few msec, for instance from voiced to unvoiced and back to voiced, and so on. To obtain useful information, ZCR needs to be computed using a typical frame size of 10-30 msec with half the frame size as shift. A speech signal for the message "she had your suit in your greasy wash water all year" and its short term ZCR are shown in Figure_2. As can be observed, for unvoiced sounds like |s| the ZCR value is significantly higher than in regions of voiced sounds like |a|, and hence ZCR can be used for distinguishing voiced and unvoiced regions.
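The frame-wise counting described above can be sketched in a few lines. This is a minimal Python illustration, not the original listing: it assumes a rectangular window, treats a zero-valued sample as positive, and normalises by the frame length (crossings per sample). The function name `short_term_zcr` and the toy signals are hypothetical.

```python
def sign(x):
    """Sign of a sample; a zero-valued sample is counted as positive (assumption)."""
    return 1 if x >= 0 else -1


def short_term_zcr(signal, frame_len, frame_shift):
    """Return the zero crossing rate (crossings per sample) of each full frame."""
    rates = []
    start = 0
    while start + frame_len <= len(signal):
        frame = signal[start:start + frame_len]
        # Count sign changes between neighbouring samples of the frame.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if sign(a) != sign(b))
        rates.append(crossings / frame_len)
        start += frame_shift
    return rates


if __name__ == "__main__":
    fast = [1, -1] * 4                    # changes sign at every sample
    slow = [1, 1, 1, 1, -1, -1, -1, -1]   # changes sign once
    print(short_term_zcr(fast, 8, 4))     # [0.875]
    print(short_term_zcr(slow, 8, 4))     # [0.125]
```

The rapidly alternating toy signal gives the higher rate, which mirrors the behaviour described for unvoiced sounds.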

1.6 Short Term Autocorrelation:
Cross correlation is a signal processing tool for measuring the similarity between two different sequences. Autocorrelation is the case where a single sequence is correlated with itself. In autocorrelation, the interest is in observing how similar the signal is to itself over time. This is achieved by shifting the sequence by different time lags and comparing each shifted copy with the original sequence as reference. Autocorrelation is a very useful tool in speech processing. However, due to the non-stationary nature of speech, a short term version of the autocorrelation is needed. It can be written as

$$R_n(k) = \sum_{m=-\infty}^{\infty} s_w(m)\, s_w(m+k)$$

where $s_w(m) = s(m)\,w(n-m)$ is the windowed version of s(m) and k is the lag. Thus, for a given windowed segment of speech, the short term autocorrelation is a sequence. The nature of this sequence is markedly different for voiced and unvoiced segments of speech. Hence information from the autocorrelation sequence can be used for discriminating voiced and unvoiced segments.
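As a rough illustration of the short term autocorrelation of one frame, the following Python sketch multiplies the frame by a window and sums the lagged products. It assumes the frame has already been cut out of the signal and defaults to a rectangular window. The name `short_term_autocorrelation` is hypothetical.

```python
def short_term_autocorrelation(frame, window=None):
    """Autocorrelation of one windowed frame for lags 0 .. len(frame) - 1."""
    n = len(frame)
    if window is None:
        window = [1.0] * n  # rectangular window (assumption)
    windowed = [s * w for s, w in zip(frame, window)]
    # Samples outside the frame are treated as zero, so each sum stops at n - k.
    return [
        sum(windowed[m] * windowed[m + k] for m in range(n - k))
        for k in range(n)
    ]


if __name__ == "__main__":
    print(short_term_autocorrelation([1, 2, 3]))  # [14.0, 8.0, 3.0]
    periodic = [1, 0, -1, 0] * 4                  # period of 4 samples
    r = short_term_autocorrelation(periodic)
    print(r[0], r[2], r[4])                       # 8.0 -7.0 6.0
```

For the periodic toy frame, the value at lag 4 (one period) is a strong positive peak, while lag 2 (half a period) is strongly negative. This is the kind of structure expected in voiced frames.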

1.7 Short Term Energy Computation

The speech signal and its sampling frequency, along with the frame size and frame shift, are the inputs needed for computing the short term energy. Using the sampling frequency, the number of samples for the given frame size and frame shift is computed. For instance, if the sampling frequency is 8 kHz and the frame size and frame shift are 20 msec and 10 msec respectively, then a frame contains 160 samples and the frame shift is 80 samples. To compute the short term energy, the input speech signal is taken in frames of 160 samples with a shift of 80 samples, and the energy is computed for each frame. The short term energy values are then plotted as a function of the time index. The STE contour follows the general shape of the amplitude envelope of the speech signal. The STE of unvoiced regions is smaller than that of voiced regions. Thus STE can be used for voiced/unvoiced classification of speech.
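The arithmetic and the frame-wise energy computation above can be checked with a short Python sketch. It assumes a rectangular window and drops any incomplete frame at the end of the signal. The names `frame_params` and `short_term_energy` and the toy signal are illustrative only.

```python
def frame_params(fs_hz, frame_ms, shift_ms):
    """Convert frame size and frame shift from milliseconds to samples."""
    frame_len = int(round(fs_hz * frame_ms / 1000.0))
    frame_shift = int(round(fs_hz * shift_ms / 1000.0))
    return frame_len, frame_shift


def short_term_energy(signal, frame_len, frame_shift):
    """Return the sum of squared samples of each full frame."""
    energies = []
    start = 0
    while start + frame_len <= len(signal):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame))
        start += frame_shift
    return energies


if __name__ == "__main__":
    frame_len, frame_shift = frame_params(8000, 20, 10)
    print(frame_len, frame_shift)  # 160 80
    # Toy signal: 160 silent samples followed by 160 samples of amplitude 0.5.
    toy = [0.0] * 160 + [0.5] * 160
    print(short_term_energy(toy, frame_len, frame_shift))  # [0.0, 20.0, 40.0]
```

The energy rises frame by frame as the window slides from the silent part into the louder part, which is the contour-following behaviour described above.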

1.8 Short Term Zero Crossing Rate (ZCR)
The input speech signal can be viewed in blocks of 10-30 msec for computing ZCR. For each block of the speech signal, the ZCR is computed using the short term ZCR relation. The ZCR value is highest in unvoiced regions and lowest in voiced regions. In silence regions the value lies between the voiced and unvoiced cases.
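Taken together, the energy and ZCR trends suggest a simple rule-based voiced/unvoiced/silence decision per frame. The Python sketch below is only one possible reading of those trends. The three threshold values are hypothetical placeholders that would have to be tuned on real recordings, and `classify_frame` is not a function from the original listings.

```python
def classify_frame(energy, zcr,
                   silence_energy=0.01,   # hypothetical threshold
                   voiced_energy=1.0,     # hypothetical threshold
                   unvoiced_zcr=0.3):     # hypothetical threshold
    """Label one frame from its short term energy and zero crossing rate."""
    if energy < silence_energy:
        return "silence"      # least or negligible energy
    if energy >= voiced_energy and zcr < unvoiced_zcr:
        return "voiced"       # large energy, few zero crossings
    return "unvoiced"         # lower energy and/or many zero crossings


if __name__ == "__main__":
    print(classify_frame(5.0, 0.05))    # voiced
    print(classify_frame(0.2, 0.45))    # unvoiced
    print(classify_frame(0.001, 0.20))  # silence
```

Energy is checked first because the text identifies silence by its negligible energy. ZCR is then used only to separate voiced from unvoiced frames.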

1.9 Hanning Window

The Hann function (often called the Hanning window) is typically used as a window function in digital signal processing to select a subset of a series of samples in order to perform a Fourier transform or other calculations.

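1.9.1 Hamming Window:

A window function is a mathematical function that is zero-valued outside some chosen interval. For instance, a function that is constant inside the interval and zero elsewhere is called a rectangular window. The Hamming window, like the Hann window, is a raised-cosine window that tapers towards the edges of the interval, but it does not reach zero at its end points.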
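For reference, the three windows used in the listings can be generated with the standard symmetric raised-cosine formulas. The Python sketch below uses only the standard library. The function names are illustrative and are not the MATLAB or Scilab built-ins.

```python
import math


def rectangular(n):
    """Rectangular window: constant inside the frame."""
    return [1.0] * n


def hamming(n):
    """Symmetric Hamming window: 0.54 - 0.46*cos(2*pi*k/(n-1))."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def hann(n):
    """Symmetric Hann (Hanning) window: 0.5 - 0.5*cos(2*pi*k/(n-1))."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def apply_window(frame, window):
    """Multiply a frame by a window of the same length, sample by sample."""
    return [s * w for s, w in zip(frame, window)]


if __name__ == "__main__":
    print([round(w, 3) for w in hamming(5)])  # [0.08, 0.54, 1.0, 0.54, 0.08]
    print([round(w, 3) for w in hann(5)])     # [0.0, 0.5, 1.0, 0.5, 0.0]
```

The end points show the difference between the two tapered windows: the Hann window falls to zero, while the Hamming window stops at 0.08.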

Source code

MATLAB Code To Record and Load a Speech Signal:

% Create a recorder: 44.1 kHz sampling rate, 16 bits per sample, 2 channels.
recObj = audiorecorder(44100, 16, 2);
get(recObj);
% Record your voice for 10 seconds.
disp('Start speaking.')
recordblocking(recObj, 10);
disp('End of Recording.');
% Play back the recording.
play(recObj);
% Store data in double-precision array.
myRecording = getaudiodata(recObj);

Short-Time Speech Measurements, Short-Time energy calculation, window length, Rectangular

% Energy is calculated every "period" samples.
% Assumes speechSignal (the samples) and t (the matching time axis) are
% already in the workspace. STenergy is a helper function that is not
% listed in this post.
period = 50;
% 4 different window lengths
winLens = [161 321 501 601];
nWindows = length(winLens);
k = 0;
for iWinLen = winLens
    k = k + 1;
    wRect = rectwin(iWinLen);
    % Short-Time energy calculation
    ienergyST = STenergy(speechSignal, wRect, iWinLen - period);
    % Display results
    subplot(nWindows, 1, k);
    delay = (iWinLen - 1)/2;
    plot(t(delay+1:period:end-delay), ienergyST);
    if (k == 1)
        title('Short-Time Energy for various Rectangular window lengths')
    end
    legend(['Window length: ', num2str(iWinLen), ' Samples']);
end

Speech waveform for Frame size 100,300,500

// Function to plot the speech waveform and compute the short term energy
function [c] = short_term_energy(Speech_signal, Fs, Frame_size, Frame_shift, window_type)
    y = Speech_signal;
    Frame_size = Frame_size/1000;    // msec to seconds
    Frame_shift = Frame_shift/1000;  // msec to seconds
    t = 1/Fs:1/Fs:(length(y)/Fs);
    subplot(4,1,1); plot(t, y);
    xtitle('Speech Waveform');
    window_length = round(Frame_size*Fs);  // samples per frame
    sample_shift = round(Frame_shift*Fs);  // samples per frame shift
    sum1 = 0; energy = 0;
    w = window(window_type, window_length);
    for i = 1:(floor(length(y)/sample_shift) - ceil(window_length/sample_shift))
        jj = 1;
        for j = (((i-1)*sample_shift)+1):(((i-1)*sample_shift)+window_length)
            // window a copy of the sample, so that overlapping frames
            // never see data that has already been windowed
            yy = y(j)*w(jj);
            sum1 = sum1 + yy*yy;
            jj = jj + 1;
        end
        energy(i) = sum1;
        sum1 = 0;
    end
    c = energy;
endfunction

[y, Fs, bits] = wavread('/var/www/scilab/wavfile/exp7.wav'); // Input: speech waveform
Frame_size = 30;   // Input: frame size in milliseconds
Frame_shift = 10;  // Input: frame shift in milliseconds
max_value = max(abs(y));
y = y/max_value;   // normalise the peak amplitude to 1

window_type = 're'; // Input: 'hm' for Hamming window, 'hn' for Hanning window and 're' for rectangular window
energy = short_term_energy(y, Fs, Frame_size, Frame_shift, window_type);
tt = 1/Fs:(Frame_shift/1000):(length(energy)*(Frame_shift/1000));
subplot(4,1,2); plot(tt, energy);
xtitle('Short Term Energy using Rectangular window');

window_type = 'hm'; // Hamming window
energy = short_term_energy(y, Fs, Frame_size, Frame_shift, window_type);
tt = 1/Fs:(Frame_shift/1000):(length(energy)*(Frame_shift/1000));
subplot(4,1,3); plot(tt, energy);
xtitle('Short Term Energy using Hamming window');

window_type = 'hn'; // Hanning window
energy = short_term_energy(y, Fs, Frame_size, Frame_shift, window_type);
tt = 1/Fs:(Frame_shift/1000):(length(energy)*(Frame_shift/1000));
subplot(4,1,4); plot(tt, energy);
xtitle('Short Term Energy using Hanning window');

Short-Time Speech Measurements, Short-Time energy calculation, window length, Hamming

// Function to plot the speech waveform and compute the short term zero-crossing rate
function [c] = short_term_zcr(Speech_signal, Fs, Frame_size, Frame_shift, window_type)
    y = Speech_signal;
    Frame_size = Frame_size/1000;    // msec to seconds
    Frame_shift = Frame_shift/1000;  // msec to seconds
    t = 1/Fs:1/Fs:(length(y)/Fs);
    subplot(4,1,1); plot(t, y);
    xtitle('Speech Waveform');
    window_length = round(Frame_size*Fs);  // samples per frame
    sample_shift = round(Frame_shift*Fs);  // samples per frame shift
    sum1 = 0; zerocrossing = 0;
    w = window(window_type, window_length);
    for i = 1:(floor(length(y)/sample_shift) - ceil(window_length/sample_shift))
        n0 = (i-1)*sample_shift + 1;       // first sample of the frame
        prev = y(n0)*w(1);
        for jj = 2:window_length
            cur = y(n0+jj-1)*w(jj);
            if (prev*cur < 0) then
                sum1 = sum1 + 1;           // sign change between neighbouring samples
            end
            prev = cur;
        end
        zerocrossing(i) = sum1/(2*window_length);
        sum1 = 0;
    end
    c = zerocrossing;
endfunction

[y, Fs, bits] = wavread('/var/www/scilab/wavfile/exp7.wav'); // Input: speech waveform
Frame_size = 30;   // Input: frame size in milliseconds
Frame_shift = 10;  // Input: frame shift in milliseconds
max_value = max(abs(y));
y = y/max_value;   // normalise the peak amplitude to 1

window_type = 're'; // Input: 'hm' for Hamming window, 'hn' for Hanning window and 're' for rectangular window
zcr = short_term_zcr(y, Fs, Frame_size, Frame_shift, window_type);
tt = 1/Fs:(Frame_shift/1000):(length(zcr)*(Frame_shift/1000));
subplot(4,1,2); plot(tt, zcr); xtitle('Short Term Zero-crossing Rate using Rectangular window');

window_type = 'hm'; // Hamming window
zcr = short_term_zcr(y, Fs, Frame_size, Frame_shift, window_type);
tt = 1/Fs:(Frame_shift/1000):(length(zcr)*(Frame_shift/1000));
subplot(4,1,3); plot(tt, zcr); xtitle('Short Term Zero-crossing Rate using Hamming window');

window_type = 'hn'; // Hanning window
zcr = short_term_zcr(y, Fs, Frame_size, Frame_shift, window_type);
tt = 1/Fs:(Frame_shift/1000):(length(zcr)*(Frame_shift/1000));
subplot(4,1,4); plot(tt, zcr); xtitle('Short Term Zero-crossing Rate using Hanning window');

Auto Correlation

// Function to plot the speech waveform and compute the short term zero-crossing rate
function [c] = short_term_zcr(Speech_signal, Fs, Frame_size, Frame_shift, window_type)
    y = Speech_signal;
    Frame_size = Frame_size/1000;    // msec to seconds
    Frame_shift = Frame_shift/1000;  // msec to seconds
    t = 1/Fs:1/Fs:(length(y)/Fs);
    subplot(5,1,1); plot(t, y);
    xtitle('Speech Waveform');
    window_length = round(Frame_size*Fs);  // samples per frame
    sample_shift = round(Frame_shift*Fs);  // samples per frame shift
    sum1 = 0; zerocrossing = 0;
    w = window(window_type, window_length);
    for i = 1:(floor(length(y)/sample_shift) - ceil(window_length/sample_shift))
        n0 = (i-1)*sample_shift + 1;       // first sample of the frame
        prev = y(n0)*w(1);
        for jj = 2:window_length
            cur = y(n0+jj-1)*w(jj);
            if (prev*cur < 0) then
                sum1 = sum1 + 1;           // sign change between neighbouring samples
            end
            prev = cur;
        end
        zerocrossing(i) = sum1/(2*window_length);
        sum1 = 0;
    end
    c = zerocrossing;
endfunction

[y, Fs, bits] = wavread('/var/www/scilab/wavfile/exp7.wav'); // Input: speech waveform
Frame_shift = 10;   // Input: frame shift in milliseconds
window_type = 're'; // Input: 'hm' for Hamming window, 'hn' for Hanning window and 're' for rectangular window
max_value = max(abs(y));
y = y/max_value;    // normalise the peak amplitude to 1

Frame_size = 10; // Input: frame size in milliseconds
zcr = short_term_zcr(y, Fs, Frame_size, Frame_shift, window_type); // calling the short term ZCR function
tt = 1/Fs:(Frame_shift/1000):(length(zcr)*(Frame_shift/1000));
subplot(5,1,2); plot(tt, zcr); xtitle('Short Term Zero-crossing Rate with 10 ms Framesize');

Frame_size = 30; // Input: frame size in milliseconds
zcr = short_term_zcr(y, Fs, Frame_size, Frame_shift, window_type); // calling the short term ZCR function
tt = 1/Fs:(Frame_shift/1000):(length(zcr)*(Frame_shift/1000));
subplot(5,1,3); plot(tt, zcr); xtitle('Short Term Zero-crossing Rate with 30 ms Framesize');

Frame_size = 50; // Input: frame size in milliseconds
zcr = short_term_zcr(y, Fs, Frame_size, Frame_shift, window_type); // calling the short term ZCR function
tt = 1/Fs:(Frame_shift/1000):(length(zcr)*(Frame_shift/1000));
subplot(5,1,4); plot(tt, zcr); xtitle('Short Term Zero-crossing Rate with 50 ms Framesize');

Frame_size = 100; // Input: frame size in milliseconds
zcr = short_term_zcr(y, Fs, Frame_size, Frame_shift, window_type); // calling the short term ZCR function
tt = 1/Fs:(Frame_shift/1000):(length(zcr)*(Frame_shift/1000));
subplot(5,1,5); plot(tt, zcr); xtitle('Short Term Zero-crossing Rate with 100 ms Framesize', 'time in seconds');

Pitch estimation by autocorrelation method

// Program to compute and plot the pitch contour of a speech waveform
[y, Fs, bits] = wavread('/var/www/scilab/wavfile/exp7.wav');
Frame_size = 30;   // Input: frame size in milliseconds
Frame_shift = 10;  // Input: frame shift in milliseconds
max_value = max(abs(y));
y = y/max_value;   // normalise the peak amplitude to 1
window_period = Frame_size/1000; window_length = round(window_period*Fs);
shift_period = Frame_shift/1000; sample_shift = round(shift_period*Fs);
pitch_freq = 0;
t = 1/Fs:1/Fs:(length(y)/Fs);
subplot(2,1,1);
plot(t, y);
xtitle('Speech signal waveform', 'time in seconds');
sum1 = 0; autocor = 0; autocorrelation = 0;
for i = 1:(floor(length(y)/sample_shift) - ceil(window_length/sample_shift))
    // copy the current frame into yy
    k = 1; yy = 0;
    for j = (((i-1)*sample_shift)+1):(((i-1)*sample_shift)+window_length)
        yy(k) = y(j);
        k = k + 1;
    end
    // autocorrelation of the frame for lags l = 0 .. length(yy)-1
    for l = 0:(length(yy)-1)
        sum1 = 0;
        for u = 1:(length(yy)-l)
            s = yy(u)*yy(u+l);
            sum1 = sum1 + s;
        end
        autocor(l+1) = sum1;
        autocorrelation(l+1, i) = autocor(l+1);
    end
    // search lags 20 to 159 samples for the largest peak
    // (2.5 msec to about 20 msec when Fs = 8 kHz; needs frames of at least 160 samples)
    auto = autocor(21:160);
    max1 = 0; sample_no = 1;
    for uu = 1:140
        if (auto(uu) > max1) then
            max1 = auto(uu);
            sample_no = uu;
        end
    end
    // auto(uu) holds lag (19 + uu), so the pitch period is (19 + sample_no)/Fs
    pitch_freq(i) = 1/((19 + sample_no)*(1/Fs));
end
[rows, cols] = size(autocorrelation);
kk = 1/Fs:shift_period:(cols*shift_period);
subplot(2,1,2); plot(kk, pitch_freq, '.'); xtitle('Pitch Contour', 'time in seconds');
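The same peak-picking idea can be written compactly for a single frame. The Python sketch below is an illustration under stated assumptions, not a port of the listing above. It derives the lag search range from an assumed pitch range of 50 to 400 Hz instead of hard-coding sample indices, and it returns 0.0 when no positive peak is found. The names `autocorrelation` and `estimate_pitch_hz` are hypothetical.

```python
import math


def autocorrelation(frame, max_lag):
    """Autocorrelation of one frame for lags 0 .. max_lag."""
    n = len(frame)
    return [sum(frame[m] * frame[m + k] for m in range(n - k))
            for k in range(max_lag + 1)]


def estimate_pitch_hz(frame, fs_hz, min_hz=50.0, max_hz=400.0):
    """Estimate pitch as fs / (lag of the largest autocorrelation peak)."""
    min_lag = max(1, int(fs_hz / max_hz))               # shortest period searched
    max_lag = min(int(fs_hz / min_hz), len(frame) - 1)  # longest period searched
    if max_lag < min_lag:
        return 0.0                                      # frame too short to search
    r = autocorrelation(frame, max_lag)
    best_lag = max(range(min_lag, max_lag + 1), key=lambda k: r[k])
    if r[best_lag] <= 0:
        return 0.0                                      # no positive peak, e.g. silence
    return fs_hz / best_lag


if __name__ == "__main__":
    fs = 8000
    # 30 msec frame (240 samples) of a 100 Hz sine as a stand-in for voiced speech.
    frame = [math.sin(2.0 * math.pi * 100.0 * n / fs) for n in range(240)]
    print(estimate_pitch_hz(frame, fs))  # approximately 100.0
```

Because the lag range follows from the sampling rate, the same function can be tried with the 10, 50 and 100 msec frame sizes mentioned in the objectives. A 10 msec frame at 8 kHz holds only 80 samples, so periods longer than 79 samples cannot be detected in it.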


Thank You for Reading My Post, I Hope It Will Be Useful For You

I Will Be Very Happy To Help You So For Queries or Any Problem Comment Below Or You Can Mail Me At Bhanu@HackingDream.net