State of the art document clustering algorithms based on semantic similarity

Karwan Jacksi, Niyaz Salih

Abstract


The continued growth of the Internet has greatly increased the number of text documents available in electronic form, so techniques for grouping these documents into meaningful clusters have become critical. Traditional clustering methods rely on statistical features and group documents syntactically rather than semantically. As a result, dissimilar documents can end up in the same cluster because of polysemy and synonymy. An important solution is document clustering based on semantic similarity, in which documents are grouped by meaning rather than by keywords. In this research, eighty papers that use semantic similarity in different fields were reviewed; forty of them, published between 2014 and 2020 and applying semantic similarity to document clustering, were selected for in-depth study. A comprehensive literature review of the selected papers is given, together with a detailed comparison of their clustering algorithms, tools, and evaluation methods, which can support the implementation and evaluation of document clustering. The reviewed work also guides the direction of our proposed research. Finally, an intensive discussion comparing the works is presented, and the results are shown in figures.
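The contrast between keyword-based and meaning-based grouping can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed papers: the synonym table `CONCEPTS`, the two sample documents, and the helper names are hypothetical, and a real system would more plausibly draw on a lexical resource such as WordNet or on word embeddings.

```python
import math
from collections import Counter

# Hypothetical toy synonym table mapping words to a shared concept label.
# This is an assumption for illustration, not a real lexical resource.
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle",
    "buy": "purchase", "purchase": "purchase",
    "cheap": "inexpensive", "inexpensive": "inexpensive",
}

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def keyword_vector(text):
    """Bag-of-words vector built from the surface tokens only."""
    return Counter(text.lower().split())

def concept_vector(text):
    """Bag-of-concepts vector: each token is replaced by its concept label."""
    return Counter(CONCEPTS.get(tok, tok) for tok in text.lower().split())

doc_a = "buy cheap car"
doc_b = "purchase inexpensive automobile"

# The documents share no keywords, so the keyword-based score is zero.
keyword_sim = cosine(keyword_vector(doc_a), keyword_vector(doc_b))
# After mapping synonyms to concepts, the two vectors coincide.
concept_sim = cosine(concept_vector(doc_a), concept_vector(doc_b))

print(f"keyword similarity: {keyword_sim:.2f}")
print(f"concept similarity: {concept_sim:.2f}")
```

Under these assumptions the two documents look unrelated when compared by keywords but nearly identical when compared by concepts, which is the synonymy problem that semantic similarity measures aim to address.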

Keywords


clustering documents; semantic similarity; algorithms; traditional method






DOI: http://dx.doi.org/10.26555/jifo.v14i2.a17513



Copyright (c) 2020 Karwan Jacksi, Niyaz Salih

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

____________________________________
JURNAL INFORMATIKA

ISSN : 1978-0524 (print) | 2528-6374 (online)
