Spectral Subtraction for Noisy Speech Signals


(a)  Load, display and manipulate speech signals.

(b)  Compute and display the spectrum of speech signals.

(c)  Determine and plot the Power Density Spectrum of Speech signals.

(d)  Develop a simple spectral subtraction filtering technique for elimination of noise.

Telephones are increasingly being used in noisy environments such as cars and airports. The aim of this project is to implement a system that reduces the background noise in a speech signal while leaving the speech itself intact; this process is called speech enhancement. The spectral subtraction technique is to be implemented for this purpose.

Algorithm: Many different algorithms have been proposed for speech enhancement: the one that we will use is known as spectral subtraction. This technique operates in the frequency domain and makes the assumption that the spectrum of the input signal can be expressed as the sum of the speech spectrum and the noise spectrum. The procedure is illustrated in the diagram and contains two tricky parts:

  1. Estimating the spectrum of the background noise

  2. Subtracting the noise spectrum from the speech

Task1: Denoising for multi-tone sinusoidal signal.

Step1: Generate a multi-tone signal with frequency components at 100 Hz, 500 Hz, 600 Hz, 800 Hz and 1000 Hz. Add AWGN for the various noise variances (20 dB, 30 dB and 40 dB). Display the signal and its spectrum.

Step2: Compute the magnitude and phase response of the signal using FFT and plot them. Find the Power Density Spectrum and plot.

Step3: Estimate the noise power.

Step4: Subtract the noise estimate (power) from the Power Density Spectrum of the signal. Determine the magnitude spectrum from the resulting spectrum.

Step5: Perform the inverse FFT (IFFT). Find the signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR).
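
The five steps of Task 1 can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not a reference solution: the dB figures are read as target signal-to-noise ratios, the sampling rate and duration are chosen freely, and the noise power is estimated from the median of the periodogram (which assumes that most frequency bins contain only white Gaussian noise). All function names are illustrative.

```python
import numpy as np

def multitone(freqs_hz, fs, duration_s):
    """Step 1: sum of unit-amplitude sinusoids at the given frequencies."""
    t = np.arange(int(fs * duration_s)) / fs
    return np.sum([np.sin(2 * np.pi * f * t) for f in freqs_hz], axis=0)

def add_awgn(x, target_snr_db, rng):
    """Step 1: add white Gaussian noise (assumption: the dB value is a target SNR)."""
    noise_var = np.mean(x ** 2) / 10 ** (target_snr_db / 10)
    return x + rng.normal(0.0, np.sqrt(noise_var), x.shape)

def estimate_noise_power(power_spectrum):
    """Step 3: per-bin noise power from the median of the periodogram.

    For white Gaussian noise the periodogram values are exponentially
    distributed, so median = mean * ln(2). Assumes few bins carry signal.
    """
    return np.median(power_spectrum) / np.log(2)

def subtract_noise(y):
    """Steps 2, 4 and 5: FFT, power subtraction, inverse FFT."""
    Y = np.fft.rfft(y)
    power = np.abs(Y) ** 2                      # power density spectrum (periodogram)
    phase = np.angle(Y)                         # noisy phase is reused unchanged
    noise = estimate_noise_power(power)
    magnitude = np.sqrt(np.maximum(power - noise, 0.0))  # negative differences set to zero
    return np.fft.irfft(magnitude * np.exp(1j * phase), n=len(y))

def snr_db(clean, estimate):
    """SNR in dB of an estimate relative to the clean reference."""
    return 10 * np.log10(np.sum(clean ** 2) / np.sum((clean - estimate) ** 2))

def psnr_db(clean, estimate):
    """PSNR in dB, using the peak absolute value of the clean reference."""
    mse = np.mean((clean - estimate) ** 2)
    return 10 * np.log10(np.max(np.abs(clean)) ** 2 / mse)
```

A typical run would build `x = multitone([100, 500, 600, 800, 1000], 8000, 1.0)`, create `y = add_awgn(x, 20, np.random.default_rng(0))`, and then compare `snr_db(x, y)` with `snr_db(x, subtract_noise(y))`.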

Task2: Denoising for male voice speech signal.

Step1: Load and display a male voice speech signal and its spectrum.

Step2: Add AWGN for the various noise variances (20 dB, 30 dB and 40 dB).

Step3: Divide the speech signal into 50 ms frames with a frame shift of 10 ms.

Step4: Compute the magnitude and phase response of the segmented speech signal using FFT and plot them. Find the Power Density Spectrum and plot.

Step5: Estimate the noise power by computing the log energy and zero-crossing rate of each frame to detect non-speech activity.

Step6: Subtract the noise estimate (power) from the Power Density Spectrum of the segmented speech signal. Determine the magnitude spectrum from the resulting spectrum.

Step7: Perform the inverse FFT (IFFT) to obtain the denoised speech signal. Find the signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR).

Step8: Repeat the above steps for various segmented speech signals.
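
The framing and noise-estimation steps of Task 2 (Step3 and Step5) can be sketched as below. The 50 ms frame length and 10 ms shift come from the task statement; the Hann window, the 3 dB energy margin and the zero-crossing threshold of 0.25 are assumptions chosen for illustration and would need tuning on real recordings. The function names are hypothetical.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=50, shift_ms=10):
    """Split x into overlapping frames of shape (n_frames, frame_len)."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // shift
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return x[idx]

def log_energy(frames, eps=1e-12):
    """Frame energy in dB; eps avoids log(0) on silent frames."""
    return 10 * np.log10(np.sum(frames ** 2, axis=1) + eps)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose signs differ."""
    negative = frames < 0
    return np.mean(negative[:, 1:] != negative[:, :-1], axis=1)

def estimate_noise_psd(frames, energy_margin_db=3.0, zcr_min=0.25):
    """Average periodogram over frames classified as non-speech.

    A frame counts as non-speech when its log energy is close to the
    lowest frame energy and its zero-crossing rate is high (noise-like).
    Both thresholds are illustrative assumptions.
    """
    energy = log_energy(frames)
    zcr = zero_crossing_rate(frames)
    non_speech = (energy < energy.min() + energy_margin_db) & (zcr > zcr_min)
    if not non_speech.any():                    # fall back to the quietest frame
        non_speech = energy <= energy.min()
    window = np.hanning(frames.shape[1])
    spectra = np.abs(np.fft.rfft(frames[non_speech] * window, axis=1)) ** 2
    return spectra.mean(axis=0), non_speech
```

The returned noise spectrum has one value per `rfft` bin and can be subtracted from the power spectrum of each windowed frame in Step6.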

Task3: Filter the noise using a Wiener filter.
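
For the Wiener filter task, one common frequency-domain formulation applies a per-bin gain H = Px / (Px + Pd), where Px and Pd are the speech and noise power spectra. The sketch below assumes that Px is approximated by the noisy power minus the noise estimate, clipped at zero; this is one simple choice among several, and the function names are illustrative.

```python
import numpy as np

def wiener_gain(noisy_power, noise_power, tiny=1e-12):
    """Per-bin Wiener gain Px / (Px + Pd), with Px = max(Py - Pd, 0)."""
    speech_power = np.maximum(noisy_power - noise_power, 0.0)
    return speech_power / np.maximum(speech_power + noise_power, tiny)

def wiener_filter_frame(frame, noise_power):
    """Apply the gain to one frame; the noisy phase is kept as it is."""
    Y = np.fft.rfft(frame)
    H = wiener_gain(np.abs(Y) ** 2, noise_power)
    return np.fft.irfft(H * Y, n=len(frame))
```

Unlike plain subtraction, the gain always lies between 0 and 1, so bins dominated by noise are attenuated rather than subtracted.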

Task4: Repeat Task 2 and Task 3 for a female voice speech signal.

Task5: Repeat Task 2 and Task 3 for a musical speech signal.


A discrete signal or discrete-time signal is a time series consisting of a sequence of quantities. In other words, it is a time series that is a function over a domain of integers. Unlike a continuous-time signal, a discrete-time signal is not a function of a continuous argument; however, it may have been obtained by sampling a continuous-time signal, in which case each value in the sequence is called a sample. When a discrete-time signal is obtained by sampling at uniformly spaced times, it has an associated sampling rate. The sampling rate is not apparent in the data sequence, and so needs to be recorded as a separate characteristic of the system.
A digital signal is a discrete-time signal for which not only the time but also the amplitude has been made discrete; in other words, its samples take on only values from a discrete set (a countable set that can be mapped one-to-one to a subset of integers). If that discrete set is finite, the discrete values can be represented with digital words of a finite width. Most commonly, these discrete values are represented as fixed-point words or floating-point words. After sampling, the process of converting a continuous-valued discrete-time signal to a digital signal is known as analogue-to-digital conversion. It usually proceeds by replacing each original sample value with an approximation selected from a given discrete set, a process known as quantization. This process loses information, and so discrete-valued signals are only an approximation of the continuous-valued discrete-time signal, which is itself only an approximation of the original continuous-valued continuous-time signal.
Amplitude modulation (AM) is a modulation technique used in electronic communication, most commonly for transmitting information via a radio carrier wave. AM works by varying the strength (amplitude) of the carrier in proportion to the waveform being sent; that waveform may, for instance, correspond to the sounds to be reproduced by a loudspeaker, or the light intensity of television pixels. This contrasts with frequency modulation, in which the frequency of the carrier signal is varied, and phase modulation, in which its phase is varied, by the modulating signal.
Discrete time views values of variables as occurring at distinct, separate "points in time", or equivalently as being unchanged throughout each non-zero region of time. Thus a variable jumps from one value to another as time moves from one time period to the next. In this framework, each variable of interest is measured once at each time period. The number of measurements between any two time periods is finite. Measurements are typically made at sequential integer values of the variable "time".
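
Sampling and quantization as described above can be illustrated with a short Python sketch. The uniform quantizer, the full-scale range of ±1 and the function names are assumptions made for the example; real analogue-to-digital converters differ in detail.

```python
import numpy as np

def sample_sine(freq_hz, fs, n_samples):
    """Discrete-time signal: a sinusoid evaluated at uniformly spaced times n / fs."""
    n = np.arange(n_samples)
    return np.sin(2 * np.pi * freq_hz * n / fs)

def quantize(x, n_bits, full_scale=1.0):
    """Digital signal: round each sample to one of 2**n_bits uniform levels.

    Levels are spaced by `step` and cover [-full_scale, full_scale - step],
    so a sample exactly at +full_scale is clipped to the top level.
    """
    step = 2 * full_scale / 2 ** n_bits
    q = np.round(x / step) * step
    return np.clip(q, -full_scale, full_scale - step)
```

Comparing `x` with `quantize(x, 8)` shows the information loss mentioned above: the error is bounded by the step size but is generally not zero.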

Project description:

The principle of spectral subtraction: spectral subtraction is based on the principle that the enhanced speech can be obtained by subtracting the estimated spectral components of the noise from the spectrum of the input noisy signal. Assuming that the noise d(n) is additive to the speech signal x(n), the noisy speech y(n) can be written as

y(n)=x(n)+d(n), for 0≤n≤N-1

where n is the time index and N is the number of samples. The objective of speech enhancement is to find an estimate of the clean speech x(n) from the given y(n), under the assumption that d(n) is uncorrelated with x(n). The input signal y(n) is segmented into K segments of the same length. The time-domain signals can be transformed to the frequency domain as

Yk(ω) = Xk(ω) + Dk(ω), for 0 ≤ k ≤ K-1

where k is the segment index, and Yk(ω), Xk(ω) and Dk(ω) denote the short-time DFT magnitudes of y(n), x(n) and d(n), respectively, raised to a power a (a = 1 corresponds to magnitude spectral subtraction, a = 2 corresponds to power spectral subtraction). If an estimate of the noise spectrum Dk(ω) can be obtained, then an approximation of the speech spectrum Xk(ω) can be obtained from Yk(ω).
Spectral subtraction is based on the principle that one can obtain an estimate of the clean signal spectrum by subtracting an estimate of the noise spectrum from the noisy speech spectrum. The noise spectrum can be estimated, and updated, during the periods when the signal is absent or when only noise is present.
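
The basic rule for a single segment can be sketched as below, with the exponent a selecting magnitude (a = 1) or power (a = 2) subtraction. This is a minimal sketch: clipping negative differences to zero and reusing the noisy phase are the usual conventions, assumed here because the text above does not spell them out, and the function name is illustrative.

```python
import numpy as np

def spectral_subtract(noisy_frame, noise_mag, a=2.0):
    """Estimate the clean frame from |X|^a = max(|Y|^a - |D|^a, 0).

    noisy_frame : one time-domain segment of y(n)
    noise_mag   : estimated noise magnitude |D| per rfft bin
    a           : 1 for magnitude subtraction, 2 for power subtraction
    """
    Y = np.fft.rfft(noisy_frame)
    diff = np.abs(Y) ** a - np.asarray(noise_mag) ** a
    clean_mag = np.maximum(diff, 0.0) ** (1.0 / a)      # negative values set to zero
    # The phase of the noisy segment is kept, since only magnitudes are estimated.
    return np.fft.irfft(clean_mag * np.exp(1j * np.angle(Y)), n=len(noisy_frame))
```

With a zero noise estimate the frame is returned unchanged, and with a very large estimate every bin is clipped to zero, which gives two easy sanity checks.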
Methods of Spectral Subtraction
The first spectral subtraction methods were introduced in the late 1970s. Over the more than 30 years since, the method has been modified and new variants have been developed. This section reviews some of these methods, starting from the earliest.

1. In 1979 Berouti [2] gave a spectral subtraction method for enhancing speech corrupted by broadband noise. As discussed above, the original method entails subtracting an estimate of the noise power spectrum from the speech power spectrum, setting negative differences to zero, recombining the new power spectrum with the original phase, and then reconstructing the time waveform. While this method reduces the broadband noise, it also usually introduces an annoying "musical noise" [11]. Berouti's method is designed to eliminate this musical noise while further reducing the background noise. It consists of subtracting an overestimate of the noise power spectrum and preventing the resultant spectral components from going below a preset minimum level (the spectral floor). The method can automatically adapt to a wide range of signal-to-noise ratios, as long as a reasonable estimate of the noise spectrum can be obtained. The technique can be described using the equation below.
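
The equation itself does not appear on this page. A commonly quoted form of the over-subtraction rule, written with the symbols defined in the next paragraph, is given here as an assumption rather than as a quotation from [2]; |Yj(ω)| is introduced to denote the noisy spectrum in frame j.

```latex
|\hat{S}_j(\omega)|^2 =
\begin{cases}
|Y_j(\omega)|^2 - \alpha\,|D_e(\omega)|^2, & \text{if } |Y_j(\omega)|^2 > (\alpha + \beta)\,|D_e(\omega)|^2 \\[4pt]
\beta\,|D_e(\omega)|^2, & \text{otherwise}
\end{cases}
```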

Here |Ŝj(ω)| denotes the enhanced spectrum estimated in frame j and |De(ω)| is the spectrum of the noise obtained during non-speech activity, with α ≥ 1 and 0 < β ≤ 1, where α is the over-subtraction factor and β is the spectral floor parameter. Parameter β controls the amount of residual noise and the amount of perceived musical noise. If β is too small, the musical noise becomes audible but the residual noise is reduced. If β is too large, the residual noise becomes audible but the musical noise associated with spectral subtraction is reduced.
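
The roles of α and β can be made concrete with a small numerical sketch: α scales the noise power before it is subtracted, and any bin whose result falls below β times the noise power is replaced by that floor. The default values of `alpha` and `beta` below are placeholders for illustration, not recommended settings, and the function name is hypothetical.

```python
import numpy as np

def oversubtract(noisy_power, noise_power, alpha=4.0, beta=0.01):
    """Over-subtraction with a spectral floor, applied per frequency bin.

    alpha >= 1 : over-subtraction factor
    0 < beta <= 1 : spectral floor, as a fraction of the noise power
    """
    diff = noisy_power - alpha * noise_power
    floor = beta * noise_power
    # Keep the subtracted value where it stays above the floor, else use the floor.
    return np.where(diff > floor, diff, floor)
```

For example, with noisy powers of 10, 4 and 1, a noise power of 1 in every bin, `alpha=3` and `beta=0.1`, the first two bins keep the differences 7 and 1, while the third bin (where the difference would be negative) is set to the floor value 0.1 instead of zero.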
Parameter α affects the amount of speech spectral distortion. If α is too large, the resulting signal will be severely distorted and intelligibility may suffer. If α is too small, noise remains in the enhanced speech signal. When α > 1, the subtraction can remove all of the broadband noise by eliminating most of the wide peaks, but the deep valleys surrounding the peaks still remain in the spectrum [1]. The valleys between peaks are no longer as deep when β > 0 as when β = 0 [4]. Berouti found that speech processed by equation (7) had less musical noise. Experimental results showed that, for the best noise reduction with the least amount of musical noise, α should be smaller for high-SNR frames and larger for low-SNR frames. In this way the method can adapt to various signal-to-noise ratios by adjusting α and β, and reduce the musical noise. The parameter values have to be set optimally so that the best enhancement performance can be achieved. This can be done using the NSS algorithm [21].

2. In the same year, 1979, S. F. Boll [3] also proposed a method for the removal of acoustic noise in speech. In this method a spectral estimator is used to compute the spectral error, and then four methods are used to minimize the error. Speech, suitably low-pass filtered and digitized, is analyzed by windowing data from half-overlapped input data buffers. The magnitude spectra of the windowed data are calculated, and the spectral noise bias calculated during non-speech activity is subtracted off. Resulting negative amplitudes are then zeroed out. Secondary residual noise suppression is then applied. A time waveform is recalculated from the modified magnitude. This waveform is then overlap-added to the previous data to generate the output speech.

Consider that a windowed noise signal n(k) has been added to a windowed speech signal s(k), with their sum denoted by x(k).
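
The main processing chain described for Boll's method (half-overlapped windowed buffers, magnitude subtraction, zeroing of negative amplitudes, overlap-add) can be sketched as follows. This is a simplified sketch, not Boll's full algorithm: the secondary residual noise suppression stage is omitted, and the Hann window, the frame length of 256 and the function names are assumptions made for the example.

```python
import numpy as np

def periodic_hann(frame_len):
    """Hann window whose half-overlapped copies sum to exactly one."""
    return np.hanning(frame_len + 1)[:-1]

def estimate_noise_mag(noise_only, frame_len=256):
    """Average magnitude spectrum over half-overlapped frames of a noise-only segment."""
    hop, window = frame_len // 2, periodic_hann(frame_len)
    starts = range(0, len(noise_only) - frame_len + 1, hop)
    mags = [np.abs(np.fft.rfft(noise_only[s:s + frame_len] * window)) for s in starts]
    return np.mean(mags, axis=0)

def boll_subtract(x, noise_mag, frame_len=256):
    """Magnitude spectral subtraction with 50% overlap-add resynthesis."""
    hop, window = frame_len // 2, periodic_hann(frame_len)
    out = np.zeros(len(x))
    for start in range(0, len(x) - frame_len + 1, hop):
        spec = np.fft.rfft(x[start:start + frame_len] * window)
        # Subtract the noise bias and zero out negative amplitudes.
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        # Recombine with the noisy phase and return to the time domain.
        frame = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame_len)
        out[start:start + frame_len] += frame       # overlap-add
    return out
```

Because the window copies sum to one at 50% overlap, a zero noise estimate returns the input unchanged everywhere except the first and last half frame, which are covered by only one window.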
Bhanu Namikaze

