Improving DNA Barcode-based Fish Identification System on Imbalanced Data using SMOTE

Wisnu Ananta Kusuma, Nurdevi Noviana, Lailan Sahrina Hasibuan, Mala Nurilmala

Abstract


Imbalanced data is a common problem in classification and identification tasks. It arises when the number of instances in one class far exceeds the number in the others. In our previous research, a DNA barcode-based identification system for tuna and mackerel was developed on an imbalanced dataset: there were far more tuna and mackerel samples than samples of other fish. The accuracy of the resulting classification model was therefore probably biased. This research aimed to employ the Synthetic Minority Oversampling Technique (SMOTE) to produce a balanced dataset. We used k-mer frequencies from DNA barcode sequences as features and a Support Vector Machine (SVM) as the classification method, with trinucleotide (3-mer) and tetranucleotide (4-mer) frequencies. The training dataset was taken from the Barcode of Life Database (BOLD). To evaluate the model, we compared the accuracy of models trained with and without SMOTE in classifying DNA barcode sequences obtained from the Department of Aquatic Product Technology, Bogor Agricultural University. The results showed that, at the species level, the accuracy of the model using SMOTE was 7% and 13% higher than that of the non-SMOTE model for trinucleotides (3-mers) and tetranucleotides (4-mers), respectively. The use of SMOTE as a data balancing technique is therefore expected to increase the accuracy of DNA barcode-based fish classification systems, particularly at the species level, where identification is difficult.
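The two preprocessing ideas named in the abstract, k-mer frequency features and SMOTE oversampling, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `kmer_frequencies` and `smote`, the parameter defaults, and the toy sequences are all assumptions made for illustration, and the oversampling routine is a simplified variant of SMOTE that picks a random minority sample and interpolates toward one of its nearest minority neighbours.

```python
import itertools
import random
from collections import Counter


def kmer_frequencies(sequence, k=3):
    """Return relative frequencies of all 4**k DNA k-mers in lexicographic order."""
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Windows containing ambiguous bases (for example N) are not counted.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def smote(minority, n_synthetic, k_neighbors=5, seed=0):
    """Simplified SMOTE: interpolate between a minority sample and a near neighbour."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [x for x in minority if x is not base]
        others.sort(key=lambda x: sum((a - b) ** 2 for a, b in zip(base, x)))
        neighbour = rng.choice(others[:k_neighbors])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical toy sequences standing in for a minority fish class.
minority = [kmer_frequencies(s) for s in ["ACGTACGTAC", "ACGTTCGTAC", "ACGAACGTAC"]]
synthetic = smote(minority, n_synthetic=4, k_neighbors=2)
```

With k = 3 each sequence becomes a 64-dimensional vector, and with k = 4 a 256-dimensional one. Because every synthetic sample is a convex combination of two real frequency vectors, it remains a valid frequency vector. The balanced feature set would then be passed to an SVM classifier, a step not shown here.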


Keywords


DNA barcode, imbalanced dataset, mislabeled fish, SMOTE, support vector machine


DOI: http://dx.doi.org/10.12928/telkomnika.v15i3.5011


Copyright (c) 2017 Universitas Ahmad Dahlan

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.


TELKOMNIKA Telecommunication, Computing, Electronics and Control
website: http://telkomnika.ee.uad.ac.id
online system: http://journal.uad.ac.id/index.php/TELKOMNIKA
