Python audio signal classification MFCC features neural network
I am trying to classify audio signals from speech into emotions. For this purpose I am extracting MFCC features of the audio signal and feeding them into a simple neural network (a FeedForwardNetwork trained with BackpropTrainer from PyBrain). Unfortunately the results are very bad. Out of the 5 classes, the network seems to come up with the same class as a result nearly every time.
I have 5 classes of emotions and around 7000 labeled audio files, which I divide so that 80% of each class is used to train the network and 20% to test it.
The idea is to use small windows and extract the MFCC features from those, to generate a lot of training examples. In the evaluation, all windows from one file are evaluated and a majority vote decides the prediction label.
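The per-file majority vote described above can be sketched as follows. This is only an illustration: `majority_vote` is a hypothetical helper name, and the per-window labels are assumed to be integer class indices that the network has already produced for each window.

```python
from collections import Counter


def majority_vote(window_labels):
    """Return the most frequent label among the windows of one file.

    On a tie, whichever label Counter.most_common lists first wins.
    """
    if not window_labels:
        raise ValueError("need at least one window prediction")
    return Counter(window_labels).most_common(1)[0][0]


# Hypothetical per-window predictions for a single audio file.
window_predictions = [3, 3, 4, 3, 1, 4, 3]
file_label = majority_vote(window_predictions)
```

With a vote like this, a file is labeled correctly as long as the right class wins among its windows, even if many individual windows are misclassified.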
    Training examples per class:
        {0: 81310, 1: 60809, 2: 58262, 3: 105907, 4: 73182}

    Example of scaled MFCC features:
        [ -6.03465056e-01   8.28665733e-01  -7.25728303e-01   2.88611116e-05
           1.18677218e-02  -1.65316583e-01   5.67322809e-01  -4.92335095e-01
           3.29816126e-01  -2.52946780e-01  -2.26147779e-01   5.27210979e-01
          -7.36851560e-01]

    Layers________________________: 13 20 5 (also tried 13 50 5 and 13 100 5)
    Learning rate_________________: 0.01 (also tried 0.1 and 0.3)
    Training epochs_______________: 10 (error rate does not improve at all during training)

    Truth table on test set:
    [[   0.    4.    0.  239.   99.]
     [   0.   41.    0.  157.   23.]
     [   0.   18.    0.  173.   18.]
     [   0.   12.    0.  299.   59.]
     [   0.    0.    0.   85.  132.]]

    Success rate overall [%]: 34.7314201619
    Success rate class 0 [%]: 0.0
    Success rate class 1 [%]: 18.5520361991
    Success rate class 2 [%]: 0.0
    Success rate class 3 [%]: 80.8108108108
    Success rate class 4 [%]: 60.8294930876
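The success rates listed here can be recomputed from the truth table, and the column sums show how often each class is predicted at all. This is a minimal sketch in plain Python; it assumes that rows are the true classes and columns the predicted ones, which is consistent with the reported per-class rates. The function names are made up for this illustration.

```python
# Truth table from the test run: rows are true classes,
# columns are predicted classes (assumed orientation).
truth_table = [
    [0, 4, 0, 239, 99],
    [0, 41, 0, 157, 23],
    [0, 18, 0, 173, 18],
    [0, 12, 0, 299, 59],
    [0, 0, 0, 85, 132],
]


def success_rates(table):
    """Return (overall, per_class) success rates in percent."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    per_class = [100.0 * row[i] / sum(row) for i, row in enumerate(table)]
    return 100.0 * correct / total, per_class


def predictions_per_class(table):
    """Column sums: how many test files were assigned to each class."""
    return [sum(column) for column in zip(*table)]


overall, per_class = success_rates(truth_table)
predicted = predictions_per_class(truth_table)
```

The column sums for classes 0 and 2 are zero, so the network never outputs those two labels for any test file.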
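OK, so now you can see that the distribution of the results over the classes is very bad. Classes 0 and 2 are never predicted. I assume this hints at a problem with either my network or, more likely, my data.
I could post a lot of code here, but I think it makes more sense to show in the following image all the steps I am taking to get to the MFCC features. Please be aware that I use the whole signal without windowing, just for illustration. Does this look OK? The MFCC values are huge; shouldn't they be much smaller? (I scale them down before feeding them into the network, with a MinMaxScaler over all data to [-2,2]; I also tried [0,1].)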
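For reference, the scaling step can be sketched as below. This is a plain NumPy stand-in for a min-max scaler, not the code used in the question. The helper names and the toy numbers are hypothetical, and unlike the "over all data" scaling described above, this variant takes the scaling parameters from the training frames only and reuses them for the test frames.

```python
import numpy as np


def fit_minmax(train_features, low=-2.0, high=2.0):
    """Compute per-coefficient scaling parameters from training data only."""
    col_min = train_features.min(axis=0)
    col_range = train_features.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return col_min, col_range, low, high


def apply_minmax(features, params):
    """Map features into [low, high] using previously fitted parameters."""
    col_min, col_range, low, high = params
    return low + (features - col_min) / col_range * (high - low)


# Hypothetical unscaled MFCC frames: rows are windows, columns are coefficients.
train = np.array([[120.0, -30.0], [80.0, 10.0], [100.0, -10.0]])
test = np.array([[90.0, 0.0]])

params = fit_minmax(train)
train_scaled = apply_minmax(train, params)
test_scaled = apply_minmax(test, params)
```

Because each coefficient is scaled by its own minimum and range, large raw MFCC values are not a problem in themselves once the scaling has been applied.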
This is the code I use for the mel filter bank, which I apply directly before the discrete cosine transform to extract the MFCC features (I got it from here: stackoverflow):
    import math
    import numpy

    # mfccfeatures, minhz and maxhz are module-level settings defined elsewhere.

    def freqtomel(freq):
        ''' calculate the mel frequency for a given frequency '''
        return 1127.01048 * math.log(1 + freq / 700.0)

    def meltofreq(mel):
        ''' calculate the frequency for a given mel frequency '''
        # inverse of freqtomel: use the mel argument, and subtract 1 outside exp()
        return 700 * (math.exp(mel / 1127.01048) - 1)

    def melfilterbank(blocksize):
        numbands = int(mfccfeatures)
        maxmel = int(freqtomel(maxhz))
        minmel = int(freqtomel(minhz))

        # create a matrix for the triangular filters, one row per filter
        filtermatrix = numpy.zeros((numbands, blocksize))

        melrange = numpy.array(xrange(numbands + 2))

        melcenterfilters = melrange * (maxmel - minmel) / (numbands + 1) + minmel

        # each array index represents the center of a triangular filter
        aux = numpy.log(1 + 1000.0 / 700.0) / 1000.0
        aux = (numpy.exp(melcenterfilters * aux) - 1) / 22050
        aux = 0.5 + 700 * blocksize * aux
        aux = numpy.floor(aux)  # round down
        centerindex = numpy.array(aux, int)  # get int values

        for i in xrange(numbands):
            start, centre, end = centerindex[i:i + 3]
            k1 = numpy.float32(centre - start)
            k2 = numpy.float32(end - centre)
            # rising and falling edges of the triangular filter
            up = (numpy.array(xrange(start, centre)) - start) / k1
            down = (end - numpy.array(xrange(centre, end))) / k2

            filtermatrix[i][start:centre] = up
            filtermatrix[i][centre:end] = down

        return filtermatrix.transpose()
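A quick round-trip check is one way to confirm that `meltofreq` really inverts `freqtomel`. The sketch below solves m = 1127.01048 · ln(1 + f/700) for f, which gives f = 700 · (exp(m/1127.01048) − 1). The test frequencies are arbitrary values chosen for illustration.

```python
import math


def freqtomel(freq):
    """Convert a frequency in Hz to the mel scale."""
    return 1127.01048 * math.log(1 + freq / 700.0)


def meltofreq(mel):
    """Convert a mel value back to Hz (inverse of freqtomel)."""
    return 700.0 * (math.exp(mel / 1127.01048) - 1)


# Round-trip a few frequencies; each error should be close to zero.
round_trip_errors = [abs(meltofreq(freqtomel(hz)) - hz)
                     for hz in (0.0, 440.0, 1000.0, 8000.0)]
```

Note that `melfilterbank` computes its filter centers inline and never calls `meltofreq`, so the inverse conversion does not affect the filter bank itself.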
What can I do to get a better prediction result?
Here I made an example of sex identification from speech. I used the Hyke dataset [1] for this example. It is just a quickly made example; anyone who wants to do serious sex identification can probably do much better. But in general the error rate decreases:
    build data...
    train network...
    number of training patterns:  94956
    number of test patterns:  31651
    input and output dimensions:  13 2
    train network...
    epoch:    0   train error: 62.24%   test error: 61.84%
    epoch:    1   train error: 34.11%   test error: 34.25%
    epoch:    2   train error: 31.11%   test error: 31.20%
    epoch:    3   train error: 30.34%   test error: 30.22%
    epoch:    4   train error: 30.76%   test error: 30.75%
    epoch:    5   train error: 30.65%   test error: 30.72%
    epoch:    6   train error: 30.81%   test error: 30.79%
    epoch:    7   train error: 29.38%   test error: 29.45%
    epoch:    8   train error: 31.92%   test error: 31.92%
    epoch:    9   train error: 29.14%   test error: 29.23%
I used the MFCC implementation from scikits.talkbox. Maybe the code below helps you. (Sex identification is surely a much easier task than emotion detection... maybe you need more and different features.)
    import glob

    from scipy.io.wavfile import read as wavread
    from scikits.talkbox.features import mfcc

    from pybrain.datasets import ClassificationDataSet
    from pybrain.utilities import percentError
    from pybrain.tools.shortcuts import buildNetwork
    from pybrain.supervised.trainers import BackpropTrainer
    from pybrain.structure.modules import SoftmaxLayer

    def report_error(trainer, trndata, tstdata):
        # percentage of misclassified frames on the training and test sets
        trnresult = percentError(trainer.testOnClassData(), trndata['class'])
        tstresult = percentError(trainer.testOnClassData(dataset=tstdata), tstdata['class'])
        print "epoch: %4d" % trainer.totalepochs, "  train error: %5.2f%%" % trnresult, "  test error: %5.2f%%" % tstresult

    def main(auido_path, coeffs=13):
        dataset = ClassificationDataSet(coeffs, 1, nb_classes=2, class_labels=['male', 'female'])
        male_files = glob.glob("%s/male_audio/*/*_1.wav" % auido_path)
        female_files = glob.glob("%s/female_audio/*/*_1.wav" % auido_path)

        print "build data..."
        for sex, files in enumerate([male_files, female_files]):
            for f in files:
                sr, signal = wavread(f)
                ceps, mspec, spec = mfcc(signal, nwin=2048, nfft=2048, fs=sr, nceps=coeffs)
                # every MFCC frame becomes one training example
                for i in range(ceps.shape[0]):
                    dataset.appendLinked(ceps[i], [sex])

        tstdata, trndata = dataset.splitWithProportion(0.25)
        trndata._convertToOneOfMany()
        tstdata._convertToOneOfMany()

        print "number of training patterns: ", len(trndata)
        print "number of test patterns: ", len(tstdata)
        print "input and output dimensions: ", trndata.indim, trndata.outdim

        print "train network..."
        fnn = buildNetwork(coeffs, int(coeffs*1.5), 2, outclass=SoftmaxLayer, fast=True)
        trainer = BackpropTrainer(fnn, dataset=trndata, learningrate=0.005)

        report_error(trainer, trndata, tstdata)
        for i in range(100):
            trainer.trainEpochs(1)
            report_error(trainer, trndata, tstdata)

    if __name__ == '__main__':
        main("/path/to/hyke/audio_data")
[1] Azarias Reda, Saurabh Panjwani and Edward Cutrell: Hyke: A Low-cost Remote Attendance Tracking System for Developing Regions, The 5th ACM Workshop on Networked Systems for Developing Regions (NSDR).