########################
# The phoneme database #
########################

1. Sources:

Dominique VAN CAPPEL
(33) 92 96 45 44
THOMSON-SINTRA, 525 route des Dolines, BP157,
F-06903 Sophia Antipolis Cedex, France

This database was used in the European ESPRIT 5516 project: ROARS.
The aim of this project is the development and the implementation of a
real-time analytical system for French and Spanish speech recognition.

2. Past Usage:

Alinat, P., Periodic Progress Report 4, ROARS Project ESPRIT II- Number
5516, February 1993, Thomson report TS. ASM 93/S/EGS/NC/079

Guerin-Dugue, A. and others, Deliverable R3-B4-P - Task B4: Benchmarks,
Technical report, Elena-NervesII "Enhanced Learning for Evolutive Neural
Architecture", ESPRIT-Basic Research Project Number 6891, June 1995

Verleysen, M. and Voz, J.L. and Thissen, P. and Legat, J.D., A statistical
Neural Network for high-dimensional vector classification, ICNN'95 - IEEE
International Conference on Neural Networks, November 1995, Perth,
Western Australia.

Voz J.L., Verleysen M., Thissen P. and Legat J.D., Suboptimal Bayesian
classification by vector quantization with small clusters,
ESANN95-European Symposium on Artificial Neural Networks, April 1995,
M. Verleysen editor, D facto publications, Brussels, Belgium.

Voz J.L., Verleysen M., Thissen P. and Legat J.D., A practical view of
suboptimal Bayesian classification, IWANN95-Proceedings of the
International Workshop on Artificial Neural Networks, June 1995, Mira,
Cabestany, Prieto editors, Springer-Verlag Lecture Notes in Computer
Science, Malaga, Spain

3. Relevant information

Most existing speech recognition systems are global systems (typically
Hidden Markov Models and Time Delay Neural Networks) which recognise
signals and do not really exploit the specific properties of speech.
In contrast, analytical systems take into account the articulatory
process leading to the different phonemes of a given language, the idea
being to deduce the presence of each phonetic feature from the acoustic
observation.

The main difficulty of analytical systems is to obtain sufficiently
reliable acoustical parameters. These acoustical measurements must:
- contain all the information relative to the phonetic feature concerned,
- be speaker independent,
- be context independent,
- be more or less robust to noise.

The primary acoustical observation is always voluminous (spectrum x N
different observation moments) and cannot be classified directly.

In ROARS, the initial database is provided by cochlear spectra, which may
be seen as the output of a filter bank having a constant DeltaF/F0, where
the central frequencies are distributed on a logarithmic scale (MEL type)
to simulate the frequency response of the auditory nerves. The filter
outputs are taken every 2 or 8 msec (integration on 4 or 16 msec)
depending on the type of phoneme observed (stationary or transitory).

The aim of the present database is to distinguish between nasal and oral
vowels. There are thus two different classes:
  Class 0 : Nasals
  Class 1 : Orals

This database contains vowels coming from 1809 isolated syllables (for
example: pa, ta, pan,...). Five different attributes were chosen to
characterize each vowel: they are the amplitudes of the first five
harmonics AHi, normalised by the total energy Ene (integrated over all
frequencies): AHi/Ene. Each harmonic is signed: positive when it
corresponds to a local maximum of the spectrum and negative otherwise.

Three observation moments have been kept for each vowel to obtain 5427
different instances:
- the observation corresponding to the maximum total energy Ene,
- the observations taken 8 msec before and 8 msec after the observation
  corresponding to this maximum total energy.
From these 5427 initial instances, the 23 instances for which the
amplitudes of the first five harmonics were all zero were removed,
leading to the 5404 instances of the present database. The patterns are
presented in a random order.

4. Summary Statistics:

  Attribute    Min     Max    Mean   Standard deviation
      1       -1.70    4.11   0.82         0.86
      2       -1.33    4.38   1.26         0.85
      3       -1.82    3.20   0.76         0.93
      4       -1.58    2.83   0.40         0.80
      5       -1.28    2.72   0.08         0.58

Class Distribution: number of instances per class
  class 0    3818 - 70.65%
  class 1    1586 - 29.35%

Correlation matrix:
  {{ 1.00,-0.10,-0.32,-0.19,-0.05},
   {-0.10, 1.00,-0.25,-0.21,-0.07},
   {-0.32,-0.25, 1.00, 0.02, 0.01},
   {-0.19,-0.21, 0.02, 1.00,-0.04},
   {-0.05,-0.07, 0.01,-0.04, 1.00}}

Attributes maximum precision: 15 bits.

The centered and reduced version of the phoneme database (each attribute
shifted to zero mean and scaled to unit variance) is on the ftp server in
the `REAL/phoneme/phoneme_CR.dat.Z' file.

5. Confusion matrix obtained with the k_NN classifier (test with the
   Leave_One_Out method).

k was set to 1 to reach the minimum error rate: 8.97 +/- 1.1%.
In the matrices below, the first row and the first column hold the class
labels; the other entries are percentages.

  {{0,  0  ,  1  },
   {0, 95.0,  5.0},
   {1, 18.5, 81.5}}

With k set to 1, this result is a bit optimistic because the database is
composed of the same phonemes taken at 3 different moments. Setting k=20
avoids this influence and provides more realistic results:

  {{0,  0  ,  1  },
   {0, 91.4,  8.6},
   {1, 27.8, 72.2}}

In this case, the total error rate is: 14.2%.

7. Result of the Principal Component Analysis:

Principal Component Analysis (PCA) is a very classical method in pattern
recognition [Duda73]. PCA reduces the sample dimension in a linear way
for the best representation in lower dimensions, keeping the maximum of
inertia. The best axis for the representation is however not necessarily
the best axis for the discrimination.
After PCA, features are selected according to the percentage of initial
inertia covered by the different axes, and the number of features is
determined by the percentage of initial inertia to keep for the
classification process. This selection method has been applied to the
phoneme_CR database.

When quasi-linear correlations exist between some initial features,
these redundant dimensions are removed by PCA and this preprocessing is
then recommended. In this case, before PCA, the determinant of the data
covariance matrix is near zero; the database is thus badly conditioned
for every process which uses this information (the quadratic classifier
for example).

The following files are available for the phoneme database:
- ``phoneme_PCA.dat.Z'', the projection of the ``phoneme_CR'' database on
  its principal components (sorted in decreasing order of the related
  inertia percentage; so, to work on the database projected on its x
  first principal components, you only have to keep the x first
  attributes of the phoneme_PCA.dat database and the class labels (last
  attribute)),
- ``phoneme_corr_circle.ps'', a graphical representation of the
  correlation between the initial attributes and the first two principal
  components,
- ``phoneme_proj_PCA.ps'', a graphical representation of the projection
  of the initial database on the first two principal components.

The table below provides the inertia percentages associated with the
eigenvalues of the principal component axes, sorted in decreasing order
of inertia percentage.

  Eigenvalue    Value      Inertia       Cumulated
                           percentage    inertia
      1         1.46471      29.30         29.30
      2         1.10934      22.19         51.49
      3         1.02830      20.57         72.06
      4         0.94158      18.83         90.89
      5         0.45516       9.10        100.00

It is thus clear that any dimensionality reduction based on PCA would
lead to an important loss of pertinent data.

[Duda73] Duda, R.O. and Hart, P.E., Pattern Classification and Scene
Analysis, John Wiley & Sons, 1973.
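
As a check on the inertia table of section 7, the percentages can be
recomputed from the eigenvalues. The Python sketch below assumes that
each inertia percentage is the eigenvalue divided by the sum of all
eigenvalues (the usual definition, not stated explicitly in this file).
The names inertia_percentages and components_needed are illustrative
only and are not part of the database distribution. The results agree
with the table to within rounding.

```python
# Eigenvalues copied from the table in section 7 (phoneme_CR database).
EIGENVALUES = [1.46471, 1.10934, 1.02830, 0.94158, 0.45516]


def inertia_percentages(eigenvalues):
    """Return (percentages, cumulated percentages) for the eigenvalues.

    Assumption: inertia percentage = 100 * eigenvalue / sum(eigenvalues).
    """
    total = sum(eigenvalues)
    percentages = [100.0 * ev / total for ev in eigenvalues]
    cumulated = []
    running = 0.0
    for p in percentages:
        running += p
        cumulated.append(running)
    return percentages, cumulated


def components_needed(eigenvalues, target_percent):
    """Smallest number of leading components reaching target_percent."""
    _, cumulated = inertia_percentages(eigenvalues)
    for count, value in enumerate(cumulated, start=1):
        # Small tolerance so a 100% target is not missed by float error.
        if value >= target_percent - 1e-9:
            return count
    return len(eigenvalues)


if __name__ == "__main__":
    pct, cum = inertia_percentages(EIGENVALUES)
    for i, (p, c) in enumerate(zip(pct, cum), start=1):
        print(f"axis {i}: inertia {p:6.2f}%  cumulated {c:6.2f}%")
    print("axes needed for 90% inertia:", components_needed(EIGENVALUES, 90.0))
```

Keeping 90% of the initial inertia already requires four of the five
axes, which illustrates why PCA brings little dimensionality reduction
for this database.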