Data science for digital culture improvement in higher education using K-means clustering and text analytics

Dian Sa’adillah Maylawati, Tedi Priatna, Hamdan Sugilar, Muhammad Ali Ramdhani
Department of Informatics, UIN Sunan Gunung Djati Bandung, Indonesia; Faculty of Information and Communication Technology, Universiti Teknikal Malaysia Melaka, Malaysia; Department of Islamic Education, UIN Sunan Gunung Djati Bandung, Indonesia; Department of Mathematics Education, UIN Sunan Gunung Djati Bandung, Indonesia; Department of Informatics, UIN Sunan Gunung Djati Bandung, Indonesia


INTRODUCTION
In the era of technological disruption, many software products, applications, and information systems are built to support human activities. However, 37% of 1,800 software products are wasted, and 47% of these are in the field of education [1]. Many factors contribute to this: unfulfilled user requirements; software errors, faults, and failures; inadequate software quality; lack of innovation; poor application of human-computer interaction concepts; difficulty of use; a mismatch with market needs (software that is not up to date); and a limited understanding of technology whose rapid development makes trends hard to follow. This is an obstacle for higher education institutions that aim to become a Techno University [2], Digital Campus [3,4], Smart Campus [5][6][7][8], Green Campus [9], or any of the other models of digital technology-based education. No matter how sophisticated the technology on offer is, an application that is not used as planned will have no significant effect on human activity. One of the main reasons digital systems fail in practice is that users are not ready to accept such rapid technological change, so the use of technology is not cultivated and does not become a necessity.

RESEARCH METHOD
2.1. Research activities
The case studied in this research is UIN Sunan Gunung Djati Bandung, a higher education institution whose vision is to become a superior and competitive campus through the use of technology. No fewer than 58 information systems at UIN Sunan Gunung Djati Bandung support its educational activities, including admissions, academic service administration systems, financial information systems, employee information systems, e-Library, e-Learning, assembly registration systems, helpdesk systems, and various other information systems and applications. However, awareness of these information systems and the perceived need to use them are not evenly distributed across the academic community: some members still rely on others to use the systems, and some are indifferent.
The research activity depicted in Figure 1 starts with a literature study on data science, data mining, and the K-means algorithm. Questions are then compiled into a questionnaire that is distributed to stakeholders in the faculties, study programs/departments, and units of UIN Sunan Gunung Djati Bandung. The questionnaire data are processed using exploratory data analysis (EDA) and K-means clustering. The results are analyzed and interpreted to find meaningful patterns, which are used to produce a recommendation model of strategies for strengthening digital culture in the academic community of UIN Sunan Gunung Djati Bandung. This research uses Python as the programming language for EDA, K-means, and text analytics [29].
The experiments were run in Google Colaboratory (Google Colab) with Python as the programming language for EDA and text analytics. For K-means clustering, this study used Orange as the data mining tool. The visualizations of the EDA, K-means clustering, and text analytics results were produced with Python libraries and Orange.

Data collection
Data were collected with a Google Forms questionnaire of 60 questions. It was distributed to the Rectorate, Deans, Senate, Bureaus, Institutions, and SPI (as policymakers); to 9 Faculties and the Postgraduate program (involving students, lecturers, and education staff from 15 study programs in Postgraduate, 5 majors in the Faculty of Ushuluddin, 5 majors in the Faculty of Tarbiyah and Teacher Training, 6 study programs in the Faculty of Sharia and Law, 4 majors in the Faculty of Da'wah and Communication, 3 majors in the Faculty of Adab and Humanities, 1 department in the Faculty of Psychology, 7 majors in the Faculty of Science and Technology, 3 majors in the Faculty of Social and Political Sciences, and 4 study programs in the Faculty of Economics and Islamic Business); to 11 Technical Service Units; and to 5 General and Mahad Service Units. The questionnaire was designed around the constructs of the Technology Acceptance Model: perceived ease of use, perceived usefulness, behavioral intention to use, and actual system use.

Exploratory data analysis
Exploratory data analysis (EDA) is a procedure for analyzing data easily, accurately, and precisely, with statistical summaries as output, where the processing is carried out automatically by machine [30,31]. At its core, EDA provides a summary of numerical data such as the average, median, maximum value, minimum value, and quartiles. EDA aims to suggest hypotheses about the causes of observed phenomena, to assess the assumptions on which statistical conclusions are based, to support the selection of appropriate statistical techniques, and to provide a basis for further data collection. EDA results are usually visualized using graphical techniques such as box plots, histograms, Pareto diagrams, distribution plots, multidimensional scaling, principal component analysis, and interactive versions of these plots. In data mining and machine learning, EDA is usually used during pre-processing to visualize the data, find missing values, and look for correlations between variables. The pre-processing phase matters because it covers data selection, data cleaning to improve quality, data transformation, and data reduction, all of which are needed to run an efficient mining process.
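The summary statistics and correlations described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code used in this study; the function names `summarize`, `count_missing`, and `pearson`, and the sample scores, are hypothetical.

```python
import statistics


def summarize(values):
    """Return the basic EDA summary of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "q3": q3,
    }


def count_missing(values):
    """Count missing entries, represented here as None (an assumption)."""
    return sum(1 for v in values if v is None)


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


# Hypothetical per-respondent scores for two questionnaire constructs.
usefulness = [4, 3, 5, 2, 4]
ease_of_use = [3, 2, 4, 2, 3]

print(summarize(usefulness))
print(pearson(usefulness, ease_of_use))
```

In practice a data-frame library would usually replace these helpers; the standard library is used here only to keep the sketch self-contained.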

K-means algorithm
Data mining is a technique for finding important information or insight in large data sets [32]. It has four main approaches: classification, which is supervised learning; clustering, which is unsupervised learning; association rules; and semi-supervised learning, which combines classification and clustering. Data mining is used to find hidden information that is important and can be used to predict and to support decision making. Clustering is not used for prediction as classification is; instead, it produces insight into the data under study, which is then analyzed and interpreted by humans [19,33]. K-means is one of the most widely used clustering algorithms; it groups objects so that the distance between each object and the centroid of its cluster is minimized [34][35][36]. K-means is a simple algorithm with fast processing time that converges to a locally optimal clustering. The K-means algorithm is as follows:
1. Determine the number of clusters $k$.
2. Initialize the centroid of each cluster, $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$, randomly.
3. Repeat the calculations in formulas (1) and (2) until the cluster assignments no longer change.
4. For each $i$, calculate:
$$c^{(i)} := \arg\min_{j} \lVert x^{(i)} - \mu_j \rVert^2 \tag{1}$$
5. And for each $j$, calculate:
$$\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}} \tag{2}$$
where $x^{(i)}$ is the $i$-th of $m$ data points, $c^{(i)}$ is the index of the cluster to which $x^{(i)}$ is assigned, and $1\{\cdot\}$ is the indicator function.
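The assignment and update steps of K-means can be sketched as follows. This is a minimal standard-library illustration, not the Orange implementation used in this study. It assumes that the initial centroids are drawn at random from the data points and that iteration stops when the assignments no longer change; the function names and the `seed` parameter are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=300, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because the initialization is random, different seeds can give different label numbering and, on less clearly separated data, different clusterings.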

Text analytics
Text analytics is a technique for finding meaningful knowledge in text data [37]. Text analytics is not synonymous with text mining: text mining always involves a mining process, such as classification, clustering, or association rules applied to text data, so text mining is one part of text analytics. Other text analytics techniques include sentiment analysis [38][39][40], opinion mining [41], social media analysis [42][43][44][45], social network analysis [46], and web scraping and crawling [47]. Several sources state that text analytics is a part of natural language processing (NLP) [48], because NLP does not always use text as its language data; the data can also be voice/sound, image, or video. Information retrieval [49], semantic search engines [50], text similarity or string matching [51], and text summarization [52] are types of NLP that commonly use text data.

Silhouette coefficient
The clustering result should be measured to ensure that the resulting pattern is good enough. There are internal and external measurements. External measurements include the Jaccard Coefficient [53], Purity [54], Precision and Recall [55], F-Measure [56], and so on. Internal measurements include the Z-Score Index [57], Gamma and Somer's Gamma [58], the Silhouette coefficient [59], BetaCV and the Dunn index [60], and so on. The Silhouette coefficient is widely used to evaluate clustering results. It is a metric that measures cluster separation and compactness at the same time [59][61][62][63]. Formula (3) combines the average distance within a cluster and the minimum distance between an object and the other clusters. In its standard form:
$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \tag{3}$$
where $a_i$ is the average distance between object $i$ and the other objects in its own cluster $C_i$, i.e. (formula (4)):
$$a_i = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\ j \neq i} d(i, j) \tag{4}$$
and $b_i$ is the distance between object $i$ and the nearest other cluster, calculated by formula (5):
$$b_i = \min_{C \neq C_i} \frac{1}{|C|} \sum_{j \in C} d(i, j) \tag{5}$$
Silhouette coefficient values range between 1 and -1 ($-1 \le s_i \le 1$), where 1 means the grouping solution is "correct" and -1 means the grouping solution is "wrong". However, clustering does not offer a guarantee of accuracy, and its results are open to many interpretations. A Silhouette coefficient close to 1 therefore does not guarantee that the clusters are the right ones, and a low value does not necessarily mean that they are wrong.
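As an illustration of how the Silhouette coefficient is computed, the sketch below implements the standard definition with Euclidean distance, using only the Python standard library. It is not the implementation used in this study. Two conventions in it are assumptions: the per-object score is 0 for an object that is alone in its cluster, and the overall coefficient is the mean of the per-object scores.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_coefficient(points, labels):
    """Mean silhouette score over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, c in zip(points, labels):
        own = clusters[c]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for label, other in clusters.items()
            if label != c
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_coefficient(points, [0, 0, 1, 1]))  # compact, well separated
print(silhouette_coefficient(points, [0, 1, 0, 1]))  # poor grouping
```

A grouping that follows the visible structure of the data scores close to 1, while a grouping that cuts across it scores below 0.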

RESULTS AND DISCUSSIONS
3.1. Data collection
A total of 2887 responses were collected from 338 lecturers, 200 educational personnel, and 2349 students of UIN Sunan Gunung Djati Bandung. This sample meets the target of 10% of the population. To ensure the quality of the results, records with missing values and outliers were removed, leaving 2365 respondents: 298 lecturers, 128 educational personnel, and 1939 students. Of the respondents, 1348 are female and 1071 are male. The questions are grouped by the parameters of the technology acceptance model (TAM): perceived usefulness (PU), perceived ease of use (PEU), behavioral intention to use (BIU), and actual system use (ASU) [64,65]. TAM has also been used to evaluate information technology in higher education, such as e-learning or learning management systems [66,67]. This research has 6 questions related to PU, 11 questions for PEU, 10 questions for BIU, and 12 questions for ASU.
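The cleaning step can be sketched as follows. The criterion used in this study to identify outliers is not stated, so the 1.5 × IQR rule below is purely an assumption for illustration, as are the column names, the sample rows, and the representation of missing values as `None`.

```python
import statistics


def drop_missing(rows):
    """Keep only the rows in which every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_outliers(rows, column, k=1.5):
    """Drop rows whose value in `column` lies outside Q1 - k*IQR .. Q3 + k*IQR."""
    values = [r[column] for r in rows]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in rows if low <= r[column] <= high]


# Hypothetical respondent records.
rows = [
    {"age": 20, "pu": 4},
    {"age": 21, "pu": None},  # missing answer
    {"age": 22, "pu": 3},
    {"age": 19, "pu": 5},
    {"age": 23, "pu": 4},
    {"age": 95, "pu": 4},     # implausible age
]

cleaned = drop_outliers(drop_missing(rows), "age")
print(len(rows), "->", len(cleaned))
```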

Result of exploratory data analysis and data clustering
3.2.1. Exploratory data analysis result
The result of exploratory data analysis (EDA) supports several conclusions about the implementation of information technology at UIN Sunan Gunung Djati Bandung, which can be used as a basis for enhancing digital culture in higher education. This analysis is based on the EDA results and the correlation between parameters visualized in Figure 2. The results are as follows:
a. Overall, the perceived usefulness of the information systems is above average, with a value of 3.43. Perceived usefulness indicates an individual's level of confidence that technology can improve their performance [68]. This value therefore indicates that the academic community at UIN Sunan Gunung Djati Bandung is aware that digital technology can make their activities more efficient and effective. The value is supported by the following findings: 65.62% of respondents are familiar with the term "information system", 69% know about the benefits of information technology/information systems, and 62.41% know the main website that provides up-to-date information about academic activities and news. However, information sharing about the information systems is low, at only 42%. These findings show that the academic community is now aware that its academic activities cannot be separated from technology support. To improve digital culture in higher education, technology need not be used all the time, but the perception of its usefulness must always be maintained [69]. This can be achieved if the available technology supports the needs of users, so that its usefulness can be felt.

Figure 2. Correlation between TAM parameters
b. The perceived ease of use of the information systems available at UIN Sunan Gunung Djati Bandung is quite low, at 2.67, which is below the normal value. Perceived ease of use indicates whether an information system is easy to use and understand [70]. This value means that many respondents have difficulty using the systems. According to 31.07% of respondents, only 25% of the information systems provide a manual guide, and only half of the systems provide a complete manual guide and hold socialization or training on how to use the system. Furthermore, 42.16% of respondents agree that half of the information systems are easy to use, 44.95% agree that half of the systems have an attractive user interface that is easy to understand, and 45.92% feel that only half of the systems fulfill business process requirements through their available functions. Most respondents decide to use the systems even though they need more time to understand them. This result indicates that socialization, training, and manual guides are important for improving the perceived ease of use of information systems in higher education, which in turn improves digital culture.
c. Behavioral intention to use is defined as the effort or strong desire of users to try and use the system [71]. At UIN Sunan Gunung Djati Bandung, the behavioral intention to use the information systems is above the normal value, at around 3.28. This means that digital culture at UIN Sunan Gunung Djati Bandung has already been awakened. It is shown by the almost 50% of respondents who support and are enthusiastic to try a new information system when it is released, and by the 52.26% of respondents who feel that using the information systems supports their academic activities efficiently and effectively. This understanding must be continually developed and maintained to improve digital culture in higher education, because the behavioral intention to use information technology can become a habit of digital users [72], especially for the academic community using academic information systems such as the learning management system [73].
d. Across all information systems available at UIN Sunan Gunung Djati Bandung, actual system use is still quite low, at 2.65, which is below the normal value. This value means that the implementation of the information systems is still weak: 44.27% of respondents agree that only half of the systems fulfill user or business process requirements, and only 29.69% of respondents remember the link address of the system they need, even though 78.14% of respondents choose to use a digital system to support their academic activities. This result shows that it is important to involve users in system development so that their requirements are fulfilled. Every type of user has different requirements, which must be collected, analyzed, and selected so as to best meet the needs of all users. Good information system/software design produces a good-quality information system [74], and digital culture will improve if the needs of all users are accommodated.
Figures 3-5 visualize the result of K-means clustering (the clustering result and examples of cluster members). The clusters are formed from the similarity of data characteristics. Based on the silhouette coefficient with Euclidean distance, the best number of clusters for this research data is two, with 1124 respondents in Cluster 1 (C1, blue) and 1241 in Cluster 2 (C2, red). Figure 3 shows the two clusters by respondent type (Lecturer = 1; Educational Personnel = 2; Student = 3), Figure 4 by gender (Male = 1; Female = 2), and Figure 5 by age.
The centroids are initialized randomly, and the K-means algorithm runs for a maximum of 300 iterations.

Result and interpretation of the K-means algorithm
The K-means clustering result is not very reliable: many members of both the blue and the red cluster lie far from their centroid, and many members that are similar to each other fall into different clusters. The Silhouette coefficient of this clustering is low (0.125), so the clusters are difficult to interpret. However, the cluster regions are still quite clearly separated (visualized with a background color). On closer examination, several conclusions can be drawn:
a. C1 is the group of respondents with the lowest perceived ease of use (under 2). This means that 47.53% of respondents still feel they need more effort to use, adapt to, or learn the information systems because of a lack of information and socialization. On the other hand, the 52.47% of respondents in C2 feel normal or even at ease using the systems. C2 also has the highest perceived usefulness, which means that most respondents know and understand the benefits of digital technology for supporting academic activities in higher education.
b. Based on the clustering result in Figure 3, most lecturers and educational personnel are in C2, in contrast to students. This means that lecturers and educational personnel learn the systems more easily than students. It also shows that socialization has not been evenly distributed; it can be assumed that socialization and training on the use of the systems reach lecturers and educational personnel more than students.
c. Based on gender in Figure 4, there is no significant difference between the clusters, and many members of C1 and C2 lie in the same cluster region. This suggests that gender does not affect the use of technology and digital culture in higher education.
d. Because of the high variation in age (visualized in Figure 5), the clusters formed are not compact. However, the age-based result shows that respondents above 30 years of age are mostly in C2. This suggests that they can use the information systems well and feel their benefit because they are supported by good socialization and training.
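The share of respondents in each cluster follows directly from the cluster sizes reported for the K-means result (1124 in C1 and 1241 in C2, out of 2365 respondents). A short check of the arithmetic, with hypothetical variable names:

```python
cluster_sizes = {"C1": 1124, "C2": 1241}
total = sum(cluster_sizes.values())

# Percentage of all respondents that falls into each cluster.
shares = {name: round(100 * size / total, 2) for name, size in cluster_sizes.items()}

print(total)
print(shares)
```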

Result of text analytics
3.3.1. Text pre-processing
Text pre-processing is an important phase in text analytics, including text mining and natural language processing. It prepares the text data for the next phase and ensures the quality of the text data [75,76], both as input and in the results. Not every pre-processing step is always used; the selection depends on the needs of the research. Generally, text pre-processing consists of tokenizing, case folding (lowercasing), removing unwanted characters with regular expressions, stop-word removal, and stemming [77]. Every language has different characteristics, structures, and grammar, including the Indonesian language, which is the language used in this research. The text data contain the messages, impressions, and hopes of 2887 respondents. Figure 6 illustrates the word cloud based on word frequency. As shown in Figure 7, the top 15 words in the text data are: lebih (more), digital, aplikasi (application), mahasiswa (college student), system (system), semoga (hope/wish), baik (good/well), UIN, online, yg (abbreviation of yang, a relative pronoun meaning "which/that" in the Indonesian language), banyak (a lot of/many/much), tidak (no), nya (possessive pronoun in the Indonesian language), sosialisasi (socialization), and informasi (information). Words such as yg, tidak, nya, and many others that are abbreviated or belong to the stop-word category were not successfully removed. In addition, several affixes were left unchanged by the stemming process. This happens because the experiment uses the Sastrawi library for Python without customization for this case [78].
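The pre-processing steps and the word-frequency count can be sketched as follows. This is a minimal standard-library illustration and not the Sastrawi-based pipeline used in this study: stemming is omitted, and the stop-word list and sample responses are hypothetical. The list deliberately includes informal forms such as "yg" and "nya" to show one way in which a custom stop-word list could catch them.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list with informal forms added.
STOP_WORDS = {"yang", "yg", "nya", "tidak", "dan", "di", "untuk"}


def preprocess(text, stop_words=STOP_WORDS):
    """Case-fold, strip non-letters, tokenize, and remove stop words."""
    text = text.lower()                    # case folding
    text = re.sub(r"[^a-z\s]", " ", text)  # drop digits and punctuation
    tokens = text.split()                  # whitespace tokenizing
    return [t for t in tokens if t not in stop_words]


def top_words(responses, n=15):
    """Return the n most frequent tokens across all responses."""
    counts = Counter()
    for response in responses:
        counts.update(preprocess(response))
    return counts.most_common(n)


# Hypothetical free-text responses.
responses = [
    "Semoga aplikasi digital UIN lebih baik",
    "Sosialisasi aplikasi yg lebih banyak untuk mahasiswa",
    "Semoga sistem online lebih baik dan tidak error",
]

print(top_words(responses, n=5))
```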

Result and interpretation of text analytics
Based on the text analytics result for the messages and hopes of respondents (lecturers, educational personnel, and students), it can be concluded that:
a. In accordance with the low results for perceived ease of use (PEU) and actual system use (ASU), respondents (especially students) hope that socialization or training on the information systems will be comprehensive and large-scale. Introducing each system (or new system) completely and thoroughly to all end users can improve digital culture in higher education. Socialization should not be limited to certain groups, because each user has a different level of understanding of and adjustment to the information systems. Digital natives, who were born after 1980 and are familiar with digital technology, are said to understand new systems or technology faster than digital immigrants [79].
b. Socialization must be supported by a complete manual book for each information system. The survey shows that manual books are not available for every system and that the instructions within the systems are incomplete. Even where a manual book is available, it has not been shared or socialized well, so not all end users receive it or can find it easily. This applies especially to students: in accordance with the K-means clustering result, most students are in C1, the group that feels it needs more effort to use the systems because of the lack of socialization.
c. Generally, respondents hope that the information systems, digital culture (budaya digital), digital systems (system digital), and online systems at UIN Sunan Gunung Djati Bandung will be better (lebih baik) than before. This hope shows that digital culture at UIN Sunan Gunung Djati Bandung has already been awakened. It is in accordance with the BIU result, which is above average: interest in, support for, and the desire to use information technology are quite high. This needs to be supported by system functions that meet the needs of the academic community, thorough socialization, and complete manual books. Digital culture will then improve further, because the systems are easy to use, match the needs of the academic community, and have a direct impact on performance by making work more effective and efficient.

CONCLUSION
This research comprehensively evaluates the implementation of information technology in higher education. Questionnaire data from lecturers, students, and educational personnel were analyzed using exploratory data analysis (EDA), the K-means clustering algorithm, and text analytics. The EDA and K-means results show that improving digital culture, with high perceived ease of use and actual system use, requires complete and comprehensive socialization as well as a manual guide for each information system. This result agrees with the hopes of end users, who need information, knowledge, and guidelines for the information systems they use. The digital culture that has already been awakened, as shown by the behavioral intention to use information systems, should be maintained and improved through information systems whose quality fulfills user requirements.
For future work, the data need to be prepared better so that more reliable clusters can be produced; although clustering is not used for prediction, better-prepared data allow a more accurate interpretation. Other clustering methods can also be tried to obtain better clusters. In addition, a classification approach can be used to predict the type of respondent, and the result can support policymakers' decisions in higher education on information technology and digital culture improvement.