Trends in sentiment of Twitter users towards Indonesian tourism: analysis with the k-nearest neighbor method

ABSTRACT


INTRODUCTION
Indonesia has immense tourism potential and incredible diversity, making it attractive to both local and international tourists [1], [2]. In an effort to promote Indonesian tourism, the government uses the tagline "wonderful Indonesia" as the identity of Indonesian tourism [3], [4]. In the current digital era, Twitter has become one of the popular platforms for sharing information and opinions. Based on the We Are Social report in Figure 1, the number of Twitter users in Indonesia reached 18.45 million in 2022 [5]. This number is equivalent to 4.23% of the total Twitter users worldwide, which reached 436 million. These figures place Indonesia as the fifth-largest country in terms of Twitter users globally. Therefore, analyzing the sentiment of Twitter users towards Indonesian tourism using the keyword "wonderful Indonesia" is crucial for understanding public perspectives on Indonesian tourism.
In this article, the K-nearest neighbor (KNN) method is used to analyze the sentiment of Twitter users [6]. The aim of this analysis is to determine the level of satisfaction and the perspectives of Twitter users towards Indonesian tourism [7]. Syarifuddin [8] studied public opinion on Twitter regarding the government's large-scale social restrictions policy (pembatasan sosial berskala besar, PSBB), a policy of restricting movement and tightening community activities that is commonly known as lockdown in many countries. KNN, the algorithm used in this study, is also one of the algorithms employed by Syarifuddin. He utilized three algorithms, namely decision tree, KNN, and naïve bayes, with the aim of finding the best accuracy value in the prediction process. Among the three algorithms used, the decision tree algorithm yielded the best results, with an accuracy of 83.3%, precision of 79%, and recall of 87.17%.
On the other hand, in a study on sentiment towards the reopening of tourist destinations amidst the COVID-19 pandemic, the naïve bayes algorithm and the KNN algorithm were used to classify tweet data as positive or negative [9]. The findings revealed that the naïve bayes algorithm achieved the highest accuracy rate of 75.53%, with a positive precision of 71% and a positive recall of 99%. Meanwhile, the KNN algorithm obtained a highest accuracy rate of 48.66%, with a positive precision of 69% and a positive recall of 69%.
The novelty of this research is a deeper understanding of public sentiment in relation to tourism promotion efforts in Indonesia through the "wonderful Indonesia" campaign, supported by social media data analysis. Along with the development of technology, the important role of social media in shaping public opinions and views is becoming increasingly apparent. Therefore, this article discusses how the KNN method is used to analyze Twitter user sentiment towards Indonesian tourism using the keyword "wonderful Indonesia" from January 2021 to November 2022. We delve into the data collection process from Twitter and how the KNN method is implemented in sentiment analysis. Additionally, this research can provide insights into how public feelings and intentions regarding tourism can be influenced by social media, and can aid in making strategic decisions for the tourism industry [10]. It can also offer information on traveler trends and preferences, thus assisting in tourism product planning and development.

METHOD
The method used is the KNN method. As depicted in Figure 2, the KNN method broadly consists of the following steps: scraping, preprocessing, sentiment data, data splitting, training data, testing data, and data visualization [11]. The KNN method was chosen because the stages in system development using the KNN method are considered to be clearly structured.
Apart from the algorithm, the use of larger and more representative datasets is also a key factor in achieving better results. Larger datasets give the model more examples to learn from and let it adapt to a wider variety of language and user communication styles. This allows the model to produce a more accurate and generalized representation of the sentiments expressed in tweets. In addition, new techniques in natural language processing are also applied to derive more informative features from the tweet text, ultimately improving the quality of sentiment prediction.

Scraping data
Data scraping is an automation technique used to extract data from websites, databases, enterprise applications, or legacy systems, which can then be saved in a tabular or spreadsheet format [12]. In this study, data scraping was performed on Twitter using Google Colaboratory and the Python programming language [13]. The retrieved tweet data was then stored in a CSV file, resulting in a total of 16,543 entries. This stage was crucial in obtaining the necessary data for the sentiment analysis of Twitter users towards Indonesian tourism using the keyword "wonderful Indonesia". Table 1 shows the data successfully collected from the data scraping stage.

Preprocessing
Preprocessing is the stage where the obtained tweets are cleaned by removing duplicate tweets, RT, #, @, numbers, emoticons, and other symbols [14]. This stage involves several processes, including data cleansing, tokenization, removal of stopwords, and weighting [15]. The commands used in preprocessing are designed to simplify the application to the data used.

Cleansing
The cleansing stage cleans the data obtained for the sentiment analysis process. The goal is to remove noise and increase the validity of the data before performing sentiment analysis [16]. This stage includes the following processes:
- data = data.drop_duplicates(subset=[…]) to remove duplicate data, that is, entries that contain the same sentence. As a result, the data is reduced from 16,543 entries to 14,189 entries.
- df = data.reset_index(drop=True) to restore a consecutive index order.
- Tweet = re.sub('@[A-Za-z0-9]+', '', Tweet) to remove mentions (@) of usernames.
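The cleansing steps can be illustrated with a minimal sketch that uses only the Python standard library. The function names `clean_tweet` and `deduplicate` are hypothetical, the order-preserving deduplication is a stand-in for the pandas call, and the final whitespace collapse is an added assumption rather than a step reported in this study.

```python
import re


def clean_tweet(tweet: str) -> str:
    """Apply the regular-expression cleansing steps to a single tweet."""
    tweet = re.sub(r'@[A-Za-z0-9]+', '', tweet)   # remove mentions
    tweet = re.sub(r'#', '', tweet)               # remove the hashtag symbol
    tweet = re.sub(r'RT[\s]+', '', tweet)         # remove the retweet marker
    tweet = re.sub(r'https?://\S+', '', tweet)    # remove hyperlinks
    tweet = re.sub(r'www\.\S+', '', tweet)        # remove website links
    tweet = re.sub(r'[^A-Za-z ]', '', tweet)      # keep only letters and spaces
    # Assumption: collapse the leftover runs of spaces (not stated in the study).
    return ' '.join(tweet.split())


def deduplicate(tweets):
    """Drop repeated sentences while keeping the first occurrence of each."""
    return list(dict.fromkeys(tweets))


raw = [
    "RT @user123: Wonderful Indonesia! #Bali https://t.co/abc 2021",
    "RT @user123: Wonderful Indonesia! #Bali https://t.co/abc 2021",
]
cleaned = [clean_tweet(t) for t in deduplicate(raw)]
print(cleaned)
```

Because the mention pattern only covers letters and digits, a username containing an underscore would be removed only partially; widening the character class would be one way to handle that case.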

Tokenizing
Tokenization is an important step in text processing that involves breaking a text or sentence into smaller parts called tokens [17]. Each token is usually the smallest unit of language, such as a word or punctuation mark. The tokenization process aims to break text into elements that are easier for computers to manage and understand. This process is done with the following commands:
from nltk.tokenize import word_tokenize
data['tokens'] = data['text'].apply(lambda x: word_tokenize(x))
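To show what tokenization produces without requiring the NLTK tokenizer data, the following sketch uses a simple regular expression as a stand-in for `word_tokenize`. The function name `simple_tokenize` is hypothetical, and its output can differ from NLTK's on contractions and other edge cases.

```python
import re


def simple_tokenize(text: str):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Wonderful Indonesia, Bali!")
print(tokens)
```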

Remove stopwords
Stopword removal is a crucial step in text processing that aims to remove words that tend not to contribute significant meaning in a given language [18]. In Indonesian, this process often relies on library-provided stopword lists to identify and remove words that are considered common and do not provide deep meaning in the context of a sentence or document.
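A minimal sketch of stopword filtering is given below. The stopword set is a small hypothetical sample chosen for illustration only; in practice the list would come from a language library, and the function name `remove_stopwords` is likewise an assumption.

```python
# Hypothetical sample of Indonesian stopwords, for illustration only.
SAMPLE_STOPWORDS = {"yang", "di", "dan", "ke", "dari", "ini", "itu"}


def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [token for token in tokens if token.lower() not in stopwords]


filtered = remove_stopwords(["Pantai", "yang", "indah", "di", "Bali"])
print(filtered)
```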

Weighting
Weighting is an important process in text analysis that aims to assign weight or importance to words in a document or corpus based on their relative frequency or significance [19]. In text analysis, not all words contribute equally to the meaning or information contained in the text. Therefore, weighting helps distinguish frequently occurring (common) words from infrequently occurring (rare) words, thus creating a more accurate representation of the text. In this stage, the "get word weights" function is used.
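The weighting function itself is not shown in this article, so the following is only an illustrative sketch of one common scheme, TF-IDF, which matches the description of separating common words from rare ones. The choice of TF-IDF, the unsmoothed logarithm, and the function name `tf_idf` are all assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dictionary per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


docs = [
    ["wonderful", "indonesia", "bali"],
    ["wonderful", "indonesia", "mandalika"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus, a word that appears in every document receives a weight of zero, while a word that appears in only one document receives a positive weight.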

Data labeling
Labeling is the stage in which each entry is labeled to identify its sentiment [20]. Labeling is done using the TextBlob library, which provides an implementation of a sentiment analysis model. Calling 'TextBlob(text).sentiment.polarity' returns a sentiment value between -1 and 1, where -1 means negative, 0 means neutral, and 1 means positive [21]. At this stage, the labeling yields 156 negative, 9,242 neutral, and 4,791 positive entries.
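The mapping from a polarity score to a label can be sketched as follows. Splitting at zero is an assumption, since the exact thresholds are not stated in this article, and the function name `label_from_polarity` is hypothetical. The second part recomputes the class shares from the reported counts.

```python
def label_from_polarity(polarity: float) -> int:
    """Map a polarity score in [-1, 1] to -1 (negative), 0 (neutral), or 1 (positive)."""
    # Assumption: any score above zero is positive and any score below zero is negative.
    if polarity > 0:
        return 1
    if polarity < 0:
        return -1
    return 0


# Label counts reported in the labeling stage.
counts = {"negative": 156, "neutral": 9242, "positive": 4791}
total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(total, shares)
```

The polarity value passed to the function would be the result of 'TextBlob(text).sentiment.polarity' for each tweet.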

Splitting data
Splitting data is the stage in which the dataset is divided into different parts for model evaluation and testing [22]. The dataset is divided into two parts, namely the training part and the testing part [23]. Training data is used to train the model, and test data is used to evaluate the model. Data splitting is done with a ratio of 75:25 (test_size = 0.25), because this process can reduce the risk of overfitting and provide good accuracy. This process is done with the help of the scikit-learn library:
from sklearn.model_selection import train_test_split
X = data[['Subjectivity', 'Polarity']]
y = data['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
The data is then split into 10,641 training entries and 3,548 test entries.
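The reported split sizes can be checked with a small standard-library sketch. It assumes that the test count is rounded up and that the remainder goes to training, which reproduces the reported numbers. The function name `split_indices` is hypothetical, and the shuffle does not reproduce the exact row assignment that scikit-learn produces for random_state = 0.

```python
import math
import random


def split_indices(n_samples: int, test_size: float = 0.25, seed: int = 0):
    """Shuffle row indices and split them into training and test parts."""
    n_test = math.ceil(test_size * n_samples)   # assumption: test count rounded up
    n_train = n_samples - n_test
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(14189, test_size=0.25)
print(len(train_idx), len(test_idx))
```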

Training data
Training data is integral to the formation and development of algorithms in machine learning [24]. The concept involves using a subset of the overall dataset collected to train a model or algorithm so that it can understand the patterns and relationships among the variables. In the context of machine learning, training data is the basis for teaching the model how to make accurate predictions or decisions based on the information at hand. Training is done by importing the classifier with 'from sklearn.neighbors import KNeighborsClassifier' and then using the commands:
K = 3
model = KNeighborsClassifier(n_neighbors=K)
model.fit(X_train, y_train)
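To make the classification rule concrete, the following sketch implements a plain K-nearest-neighbor vote with K = 3 in pure Python. It is a simplified stand-in for `KNeighborsClassifier`, not its actual implementation; the function name `knn_predict` and the toy (subjectivity, polarity) points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query point.
    by_distance = sorted(
        (math.dist(point, x), label) for point, label in zip(X_train, y_train)
    )
    votes = Counter(label for _, label in by_distance[:k])
    # On a tie, most_common returns the label that was counted first (the nearest).
    return votes.most_common(1)[0][0]


# Toy training set of (subjectivity, polarity) pairs with labels -1, 0, 1.
X_train = [
    (0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
    (0.6, 0.5), (0.7, 0.6), (0.8, 0.7),
    (0.9, -0.6), (0.8, -0.5), (1.0, -0.7),
]
y_train = [0, 0, 0, 1, 1, 1, -1, -1, -1]

print(knn_predict(X_train, y_train, (0.7, 0.55)))
```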

Test data
Test data is a portion of the dataset used to evaluate the performance of the model built with the training data [25]. Test data does not participate in the model training process. The test data is accessed through the variables 'X_test' and 'y_test', which contain the test features and the corresponding sentiment labels. The command 'y_pred = model.predict(X_test)' is then used to make predictions, and the model's accuracy is evaluated by comparing the predicted results with the actual labels using the command 'accuracy_score(y_test, y_pred)'.
Prediction, in this context, refers to the ability to predict the class or sentiment label of texts whose sentiment is unknown. In sentiment analysis, a model is developed and trained using labeled training data, enabling it to learn patterns and trends that occur in texts with known sentiment. Once trained, the model can be used to predict the sentiment of new, unlabeled texts. After predictions are made on the test data, they can be compared with the actual labels using a confusion matrix for better understanding:
cm = confusion_matrix(y_test, y_pred)
class_names = ['Negative', 'Neutral', 'Positive']
plot_confusion_matrix(model, X_test, y_test, display_labels=class_names)
plt.xlabel('Predicted')
plt.ylabel('True')
To facilitate readability of the matrix, the sentiment labels -1, 0, and 1 found in model.classes_ are replaced with "negative," "neutral," and "positive," respectively. The results of the above commands can be seen in Figure 3.
In comparison with the previously mentioned research results, this study achieves high accuracy in predicting sentiment on the test data. Syarifuddin [8], studying public opinion regarding the PSBB policy, obtained a best accuracy of 83.3% with the decision tree algorithm. Era et al. [9], studying sentiment towards the reopening of tourist destinations with the naïve bayes and KNN algorithms, obtained accuracies of 75.53% for naïve bayes and 48.66% for KNN.
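How a confusion matrix is assembled from true and predicted labels can be sketched in a few lines of standard-library Python. The function name `build_confusion_matrix` and the six example labels are hypothetical; rows correspond to true labels and columns to predicted labels, in the order negative, neutral, positive.

```python
from collections import Counter


def build_confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns are predictions."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]


# Hypothetical labels: -1 negative, 0 neutral, 1 positive.
y_true = [-1, -1, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 1, 1]
matrix = build_confusion_matrix(y_true, y_pred, labels=[-1, 0, 1])
for row in matrix:
    print(row)
```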

Visualization
Visualization is a key step in data analysis that aims to present information visually through graphs, diagrams, plots, or word clouds. The main goal of visualization is to transform complex data into a more intuitive and understandable representation [26]. Through the use of different types of visualizations, such as bar charts, pie charts, scatter plots, or heat maps, scattered and complex data can be interpreted more clearly and effectively.

Visualization of each month
In Figure 4, November has the highest number of positive tweets. This is likely because more data is available for November than for other months. August has more negative tweets than any other month, but across January to December the number of negative tweets is much smaller than the number of positive tweets. This shows that the sentiment of Twitter users towards Indonesian tourism is predominantly positive, that is, a good response to Indonesian tourism.
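The per-month counts behind such a chart can be obtained by grouping labeled tweets by the month of their timestamp. The sketch below assumes timestamps in the "YYYY-MM-DD HH:MM:SS" form; the function name `monthly_sentiment_counts` and the example rows are hypothetical.

```python
from collections import Counter
from datetime import datetime


def monthly_sentiment_counts(rows):
    """Count tweets per (month, sentiment) pair from (timestamp, sentiment) rows."""
    counts = Counter()
    for timestamp, sentiment in rows:
        month = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m")
        counts[(month, sentiment)] += 1
    return counts


# Hypothetical rows for illustration.
rows = [
    ("2021-01-30 14:49:36", "Positive"),
    ("2021-01-15 10:00:00", "Positive"),
    ("2021-01-02 08:00:00", "Neutral"),
    ("2022-11-01 00:00:00", "Positive"),
]
counts_by_month = monthly_sentiment_counts(rows)
print(counts_by_month)
```

The resulting counts can be passed to any plotting library to draw one bar per month and sentiment.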

Word cloud visualization
Word cloud visualization is a graphical representation of a collection of words, where the size of each word corresponds to its frequency [27]. In this visualization, the most frequently occurring words in the dataset are displayed in larger sizes, while less common words are displayed in smaller sizes. Based on the word cloud in Figure 6, it can be concluded that the words "wonderful," "Indonesia," and "Mandalika" are frequently discussed topics. This is because the data obtained consists mostly of tweets promoting tourism in Indonesia, which have a neutral sentiment. During that time, there was also a lot of discussion about Mandalika as a new tourist attraction in Indonesia.
To improve the management of issues behind negative sentiment towards tourism in Indonesia, several suggestions can be implemented. Firstly, infrastructure needs to be improved by repairing poor road access, for example through better paving and maintenance. This will enhance the comfort and safety of tourists when visiting tourist destinations. Furthermore, the quality of services in the tourism sector needs to be enhanced, especially in hotels. Better training for hotel employees, improved maintenance, and a focus on customer satisfaction will help improve guest experiences and reduce negative sentiment related to poor service.
In addition, efforts are needed to improve cleanliness and environmental management in tourism. This refers to complaints related to beach cleanliness and waste management. Improving waste management, running environmental awareness campaigns, and involving the community in maintaining cleanliness can help reduce negative sentiment related to environmental cleanliness in tourism. In the context of consumer protection, it is important to enhance transparency in tourism transactions and protect consumers from fraudulent practices or unfair pricing. Clear regulations, strong law enforcement, and education for tourists are necessary to build trust and reduce negative sentiment related to consumer protection.
Lastly, coordination and communication among stakeholders such as tourism authorities, destination managers, and local communities need to be improved. Accurate information, clear directions, and open communication will help reduce confusion and enhance the tourists' experience, as well as address complaints related to directions and destination conditions. By implementing these suggestions, it is hoped that tourism management can be enhanced and the negative sentiment experienced by tourists reduced, thus providing a more positive experience and improving perceptions of tourism in Indonesia.

CONCLUSION
Based on the results of the research conducted, it is concluded that the majority of Twitter users' tweets contain neutral sentiments related to tourism in Indonesia. However, if neutral sentiments are ignored, Twitter users' views on tourism tend to be positive. The percentage of positive sentiment reaches 33.8%, while negative sentiment is only 1.1%. This trend is reflected in key words such as "wonderful", "Indonesia", and "Mandalika" that frequently appear in discussion topics. This connection can be seen from the attention given to Mandalika as one of the main tourist destinations in Indonesia in those years. The analysis using the KNN algorithm showed an accuracy of 98.28%, precision of 97.16%, recall of 98.28%, and F1-score of 97.70% on the test data, reflecting a highly accurate evaluation that can support policy purposes.
For future research, it is recommended to compare the results with other methods and algorithms in order to test and strengthen the research findings. By adopting a variety of approaches, researchers can gain a broader understanding and increase the reliability of the findings. This approach will allow future research to provide a more comprehensive and in-depth insight into Twitter users' sentiment towards tourism in Indonesia.

Figure 2. Road map of the KNN method employed

- Tweet = re.sub('#', '', Tweet) to remove the hashtag symbol (#).
- Tweet = re.sub('RT[\s]+', '', Tweet) to remove the retweet marker (RT).
- Tweet = re.sub('https?:\/\/\S+', '', Tweet) to remove hyperlinks (URLs).
- Tweet = re.sub('www\.\S+', '', Tweet) to remove website links.
- Tweet = re.sub('[^A-Za-z ]', '', Tweet) to remove characters other than the letters A-Z.
- Tweet = re.sub('(\d)', '', Tweet) to remove digit characters in the tweet string.
- Tweet = re.sub('\t', '', Tweet) to remove tab characters in the tweet string.

ISSN: 2722-3221 Comput Sci Inf Technol, Vol. 5, No. 1, March 2024: 19-28

- Correctly predicted negative label: 0 data
- Negative label predicted as neutral: 9 data
- Negative label predicted as positive: 33 data
- Neutral label predicted as negative: 0 data
- Correctly predicted neutral label: 2279 data
- Neutral label predicted as positive: 19 data
- Positive label predicted as negative: 0 data
- Positive label predicted as neutral: 0 data
- Correctly predicted positive label: 1208 data

Based on the above data, it can be concluded that the most accurate prediction is in the positive label. However, there are still errors in predicting the neutral and negative labels; in particular, none of the 42 negative test tweets was classified correctly. After that, the scores are calculated using scikit-learn. The resulting values are:
Accuracy: 0.9828072153325818
Precision: 0.9715633303499707
Recall: 0.9828072153325818
F1-score: 0.977034183850208
The built model shows excellent results in predicting sentiment on the test data. With an accuracy of 98.28%, the model is able to correctly predict almost all of the data used in the testing phase. The precision of 97.16% indicates that the majority of positive predictions made by the model are correct. A recall of 98.28% indicates that the model can accurately identify almost all true positive instances in the dataset. The F1-score of 0.977 demonstrates a good balance between precision and recall. With consistently high performance across all evaluation metrics, it can be concluded that this model is effective in predicting sentiment on the test data.
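The reported scores can be recomputed directly from the confusion-matrix counts. The sketch below assumes support-weighted averaging over the three classes, with the precision of a class that is never predicted taken as zero; under that assumption it reproduces the reported values. The function name `weighted_scores` is hypothetical.

```python
# Rows are true labels, columns are predicted labels (negative, neutral, positive).
cm = [
    [0, 9, 33],
    [0, 2279, 19],
    [0, 0, 1208],
]


def weighted_scores(cm):
    """Return accuracy and support-weighted precision, recall, and F1."""
    n = len(cm)
    total = sum(map(sum, cm))
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precision = recall = f1 = 0.0
    for i in range(n):
        support = sum(cm[i])                          # true instances of class i
        predicted = sum(cm[r][i] for r in range(n))   # predictions of class i
        p = cm[i][i] / predicted if predicted else 0.0
        r = cm[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


accuracy, precision, recall, f1 = weighted_scores(cm)
print(accuracy, precision, recall, f1)
```

With support weighting, recall always equals accuracy, which explains why the two reported values are identical. The 42 negative test tweets contribute zero to every per-class score, but their small share of the test set keeps the weighted averages high.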

Figure 4. Visualization of the number of sentiments per month

Figure 5. Sentiment percentage from January 2021 to November 2022

Trends in sentiment of Twitter users towards Indonesian tourism: analysis … (Eka Purnama Harahap)

Table 2. Result of data labeling
Love catcher in Bali is fun too. Not bad, Aqua and Wonderful Indonesia are very popular (In Indonesian) _fleurdevella Positive 2021-01-30 14:49:36 BUMN Minister Erick Thohir: Mutual Cooperation is the Strength of the Indonesian Nation, the Heritage of Our Ancestors. And we must preserve it... Wonderful Indonesia...! @erickthohir #RiseTogetherET

Table visualization
Table visualization is a graphical representation of data where the data is organized in columns and rows. Table visualizations allow researchers to clearly display relevant data [28]. The purpose of table visualization is to present data in a structured format. Table 3 shows the results of tweets with negative sentiment.

Table 3. Tweets with negative sentiment