Python audio signal classification with MFCC features and a neural network


I am trying to classify audio signals from speech into speech emotions. For that purpose I am extracting the MFCC features of the audio signal and feeding them into a simple neural network (a FeedForwardNetwork trained with BackpropTrainer from PyBrain). Unfortunately the results are very bad: out of the 5 classes, the network almost always seems to come up with the same class as a result.

I have 5 classes of emotions and around 7000 labeled audio files, which I divide so that 80% of each class is used to train the network and 20% to test it (see the sketch below).
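A small sketch of how I do that split (files_by_class is a hypothetical dict mapping each class label to its list of file paths):

import random

train_files, test_files = [], []
for label, files in files_by_class.items():
    random.shuffle(files)
    cut = int(0.8 * len(files))                        # 80% of each class for training
    train_files += [(f, label) for f in files[:cut]]
    test_files  += [(f, label) for f in files[cut:]]   # remaining 20% for testing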

The idea is to use small windows and extract the MFCC features from each window, so that I generate a lot of training examples. For evaluation, all the windows of one file are evaluated and a majority vote decides the predicted label (sketched below).
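Roughly like this (extract_scaled_mfcc is a placeholder for my feature pipeline, the window parameters are just examples):

import numpy

def predict_file(net, signal, window_size=2048, hop=1024):
    votes = []
    for start in xrange(0, len(signal) - window_size, hop):
        window = signal[start:start + window_size]
        feats = extract_scaled_mfcc(window)   # placeholder for the feature extraction below
        out = net.activate(feats)             # PyBrain nets return one activation per class
        votes.append(int(numpy.argmax(out)))  # the window votes for its most likely class
    return numpy.bincount(votes).argmax()     # majority vote over all windows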

training examples per class:
{0: 81310, 1: 60809, 2: 58262, 3: 105907, 4: 73182}

example of scaled MFCC features:
[ -6.03465056e-01   8.28665733e-01  -7.25728303e-01   2.88611116e-05
   1.18677218e-02  -1.65316583e-01   5.67322809e-01  -4.92335095e-01
   3.29816126e-01  -2.52946780e-01  -2.26147779e-01   5.27210979e-01
  -7.36851560e-01]

layers________________________:  13 20 5 (also tried 13 50 5 and 13 100 5)
learning rate_________________:  0.01 (also tried 0.1 and 0.3)
training epochs_______________:  10 (the error rate does not improve at all during training)

confusion matrix on the test set:
[[   0.    4.    0.  239.   99.]
 [   0.   41.    0.  157.   23.]
 [   0.   18.    0.  173.   18.]
 [   0.   12.    0.  299.   59.]
 [   0.    0.    0.   85.  132.]]

success rate overall [%]:  34.7314201619
success rate class 0 [%]:  0.0
success rate class 1 [%]:  18.5520361991
success rate class 2 [%]:  0.0
success rate class 3 [%]:  80.8108108108
success rate class 4 [%]:  60.8294930876

OK, as you can see, the distribution of the results over the classes is very bad. Classes 0 and 2 are never predicted. I assume this hints at a problem with either my network or my data.
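For reference, the success rates above come straight out of the confusion matrix; a small numpy sketch of the computation:

import numpy

conf = numpy.array([[  0.,   4.,   0., 239.,  99.],
                    [  0.,  41.,   0., 157.,  23.],
                    [  0.,  18.,   0., 173.,  18.],
                    [  0.,  12.,   0., 299.,  59.],
                    [  0.,   0.,   0.,  85., 132.]])

# overall success rate: correct predictions (diagonal) over all test examples
print conf.trace() / conf.sum() * 100            # 34.73...

# per-class success rate: diagonal entry over the row sum of that class
print conf.diagonal() / conf.sum(axis=1) * 100   # [0.0, 18.55, 0.0, 80.81, 60.83]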

I could post a lot of code here, but I think it makes more sense to show, in the following image, all the steps I take to get to the MFCC features. Please be aware that I use the whole signal without windowing here, just for illustration. Does this look OK? The MFCC values are huge, shouldn't they be much smaller? (I scale them down before feeding them into the network with a MinMaxScaler over all the data to [-2, 2]; I also tried [0, 1]. A sketch of the scaling follows below the image.)

(Image: steps from the signal to the MFCC features)
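The scaling itself looks roughly like this (scikit-learn's MinMaxScaler; train_mfccs and test_mfccs are hypothetical placeholders for my feature matrices):

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-2, 2))      # also tried (0, 1)
train_scaled = scaler.fit_transform(train_mfccs)  # fit on the training matrix only
test_scaled = scaler.transform(test_mfccs)        # reuse the fitted scaler on the test data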

This is the code I use for the mel filter bank, which I apply directly before the discrete cosine transformation that extracts the MFCC features (I got it from here on Stack Overflow):

def freqToMel(freq):
    '''
    Calculate the mel frequency for a given frequency
    '''
    return 1127.01048 * math.log(1 + freq / 700.0)

def melToFreq(mel):
    '''
    Calculate the frequency for a given mel frequency
    '''
    return 700 * (math.exp(mel / 1127.01048) - 1)

def melFilterBank(blockSize):
    # mfccFeatures, maxHz and minHz are module-level settings
    numBands = int(mfccFeatures)
    maxMel = int(freqToMel(maxHz))
    minMel = int(freqToMel(minHz))

    # create a matrix of triangular filters, one row per filter
    filterMatrix = numpy.zeros((numBands, blockSize))

    melRange = numpy.array(xrange(numBands + 2))

    melCenterFilters = melRange * (maxMel - minMel) / (numBands + 1) + minMel

    # each array index represents the center of a triangular filter
    aux = numpy.log(1 + 1000.0 / 700.0) / 1000.0
    aux = (numpy.exp(melCenterFilters * aux) - 1) / 22050
    aux = 0.5 + 700 * blockSize * aux
    aux = numpy.floor(aux)  # round down
    centerIndex = numpy.array(aux, int)  # integer values

    for i in xrange(numBands):
        start, centre, end = centerIndex[i:i + 3]
        k1 = numpy.float32(centre - start)
        k2 = numpy.float32(end - centre)
        up = (numpy.array(xrange(start, centre)) - start) / k1
        down = (end - numpy.array(xrange(centre, end))) / k2

        filterMatrix[i][start:centre] = up
        filterMatrix[i][centre:end] = down

    return filterMatrix.transpose()
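And this is roughly how I apply the filter bank and the DCT to one block afterwards (a sketch; the small epsilon against log(0) and the exact normalization are my own choices):

import numpy
from scipy.fftpack import dct

def blockToMFCC(block):
    spectrum = numpy.abs(numpy.fft.fft(block))       # magnitude spectrum, length == blockSize
    melSpectrum = numpy.dot(spectrum, melFilterBank(len(block)))
    logMelSpectrum = numpy.log(melSpectrum + 1e-10)  # epsilon avoids log(0) for silent blocks
    return dct(logMelSpectrum, type=2)               # DCT of the log mel spectrum -> MFCCs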

What can I do to get a better prediction result?

Here is an example I made for sex identification from speech. I used the Hyke dataset [1] for this example. It is just a quickly made example; if one wanted to do serious sex identification, one could probably do better. But in general the error rate decreases during training:

build data...
number of training patterns:  94956
number of test patterns:      31651
input and output dimensions:  13 2
train network...
epoch:    0   train error: 62.24%   test error: 61.84%
epoch:    1   train error: 34.11%   test error: 34.25%
epoch:    2   train error: 31.11%   test error: 31.20%
epoch:    3   train error: 30.34%   test error: 30.22%
epoch:    4   train error: 30.76%   test error: 30.75%
epoch:    5   train error: 30.65%   test error: 30.72%
epoch:    6   train error: 30.81%   test error: 30.79%
epoch:    7   train error: 29.38%   test error: 29.45%
epoch:    8   train error: 31.92%   test error: 31.92%
epoch:    9   train error: 29.14%   test error: 29.23%

I used the MFCC implementation from scikits.talkbox. Maybe the code below helps you. (Sex identification is surely an easier task than emotion detection... maybe you need more and different features.)

import glob

from scipy.io.wavfile import read as wavread
from scikits.talkbox.features import mfcc

from pybrain.datasets            import ClassificationDataSet
from pybrain.utilities           import percentError
from pybrain.tools.shortcuts     import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules   import SoftmaxLayer

def report_error(trainer, trndata, tstdata):
    trnresult = percentError(trainer.testOnClassData(), trndata['class'])
    tstresult = percentError(trainer.testOnClassData(dataset=tstdata), tstdata['class'])
    print "epoch: %4d" % trainer.totalepochs, "  train error: %5.2f%%" % trnresult, "  test error: %5.2f%%" % tstresult

def main(audio_path, coeffs=13):
    dataset = ClassificationDataSet(coeffs, 1, nb_classes=2, class_labels=['male', 'female'])
    male_files = glob.glob("%s/male_audio/*/*_1.wav" % audio_path)
    female_files = glob.glob("%s/female_audio/*/*_1.wav" % audio_path)

    print "build data..."
    for sex, files in enumerate([male_files, female_files]):
        for f in files:
            sr, signal = wavread(f)
            ceps, mspec, spec = mfcc(signal, nwin=2048, nfft=2048, fs=sr, nceps=coeffs)
            for i in range(ceps.shape[0]):
                dataset.appendLinked(ceps[i], [sex])

    tstdata, trndata = dataset.splitWithProportion(0.25)
    trndata._convertToOneOfMany()
    tstdata._convertToOneOfMany()

    print "number of training patterns: ", len(trndata)
    print "number of test patterns:     ", len(tstdata)
    print "input and output dimensions: ", trndata.indim, trndata.outdim

    print "train network..."
    fnn = buildNetwork(coeffs, int(coeffs * 1.5), 2, outclass=SoftmaxLayer, fast=True)
    trainer = BackpropTrainer(fnn, dataset=trndata, learningrate=0.005)

    report_error(trainer, trndata, tstdata)
    for i in range(100):
        trainer.trainEpochs(1)
        report_error(trainer, trndata, tstdata)

if __name__ == '__main__':
    main("/path/to/hyke/audio_data")


[1] Azarias Reda, Saurabh Panjwani and Edward Cutrell: Hyke: A Low-cost Remote Attendance Tracking System for Developing Regions, 5th ACM Workshop on Networked Systems for Developing Regions (NSDR).

