The architecture social media and online newspaper credibility measurement for fake news detection

Social media is one of the most popular communication media in the world, especially in Indonesia. This is evidenced by the APJII survey, which shows that the majority of Indonesians use social media in their daily activities. One advantage of social media is that it disseminates information faster than conventional media. However, because that information does not pass through a filtering process, its quality is lower than that of conventional media. The proposed architecture measures the credibility of online newspapers based on time credibility, website credibility, and message credibility, and measures the credibility of social media based on time credibility, social media credibility, and message credibility. Each factor carries a different weight, and the result is a likelihood that a news item is fake or factual.


Introduction
The development of Internet technology in the world, especially in Indonesia, greatly influences the lifestyle of people, who increasingly depend on the internet in everyday life. This is reinforced by a 2017 survey of internet users in Indonesia conducted by the Indonesian Internet Service Providers Association (APJII), which shows that the number of internet users continues to increase from year to year, reaching 143.26 million people, or 54.68% of the 262 million Indonesians in 2017. Of these users, 87.13% use the internet to access social media [1]. The large number of internet users accessing social media is related to the benefits of social media, namely: (i) most social media provide news in a more timely manner and at a lower cost than traditional news media such as newspapers or television; (ii) social media allow users to share information with other users, comment directly on the news presented, and discuss the topic with other users [2].
Despite these benefits, the quality of news disseminated through social media is lower than that of news distributed through traditional media. This is influenced by one characteristic of social media: messages are delivered freely without passing through a gatekeeper [3]. As a result, social media users can easily create false news for purposes such as attacking political opponents, attacking business competitors, or vilifying other groups. Likewise, because the number of users is very large and the cost of accessing social media is low, negative news spreads very quickly to other users. The Indonesian Government, through the Ministry of Communication and Information, has anticipated the spread of false news by drafting Law Number 11 of 2008, better known as the Electronic Information and Transaction Act [4]. However, this legislation is deemed insufficient to reduce the spread of fake news on social media, because sites such as turnbackhoax.id and hoaxornot.detik.com continue to report new fake news spreading on social media.

Fake news detection approaches can be grouped into three categories [5]: (i) Knowledge-based, commonly known as fact checking, which detects fake news by verifying the facts in the news using information retrieval, the semantic web, and linked open data (LOD). (ii) Context-based, which detects false news by analyzing social networks to trace the source that first spread the news. (iii) Style-based, which detects false news using computational linguistics and natural language processing, or more specifically by detecting deception in the disseminated news. These categories represent several of the factors for detecting fake news published by the International Federation of Library Associations and Institutions (IFLA). However, one factor that has not yet been included is checking the date of publication.
The proposed system architecture consists of several stages, namely Keyword Extraction, News Scoring, and Social Media Scoring. Keyword Extraction obtains keywords from the sentences or documents supplied by the user. News Scoring measures the likelihood that a sentence or document from the user is fake news, based on information obtained from news websites or online newspapers and taking into account the time credibility, website credibility, and message credibility factors. The Social Media Scoring stage is carried out only if the News Scoring result is still below the specified threshold. It calculates the credibility of the social media posts that contain the user's news by considering the Time Credibility, Social Media Credibility, and Message Credibility factors, each with its predetermined weight. It is hoped that this architecture can produce a better credibility measurement than previous research.

Related Works
IFLA, through its blog, explains that fake news can be recognized in several steps, namely: Consider the source (to understand its mission and purpose), Read beyond the headline (to understand the whole story), Check the authors (to see if they are real and credible), Assess the supporting sources (to ensure they support the claims), Check the date of publication (to see if the story is relevant and up to date), Ask if it is a joke (to determine if it is meant to be satire), Review your own biases (to see if they are affecting your judgement), and Ask experts (to get confirmation from independent people with knowledge).
In the knowledge-based category, commonly called fact checking, several studies use information retrieval techniques. According to [5], these include [6], which uses the Text Runner [7] tool to obtain information from several online sources and applies information extraction and indexing, after which the level of inconsistency is assessed against the questions entered by the user. To improve fake news detection results, [8] proposes checking factual information obtained from several sources using a statistical model in which the more online sources discuss a piece of news, the more likely the information is to be true. That study still leaves a problem open: the credibility and reputation of the websites that carry the same information should be measured to determine which websites can be trusted. The literature review in [9] on website credibility and reputation can serve as one input for completing that earlier work. One technique it describes determines the credibility of a website from its reputation: the higher the reputation of a website, the higher its credibility.
Several studies use semantic web or LOD techniques in the knowledge-based category. The study in [10] builds a framework that tests claim models by changing predetermined parameters and analyzing how the conclusions change as a result. The study in [11] addresses a problem in fact checking, where finding the shortest path between concepts in a knowledge graph yields suboptimal measurements. It proposes a metric that assesses the truth of a statement by taking into account the length of the path between the concepts in question. The research by Shi and Weninger [12] shows that fake news can be detected using probability: the information in question must have a probable connection to the collected facts in order to obtain a higher level of credibility.
For credibility measurement devoted to social media, the research in [13][14][15] measures the credibility of social media, especially Twitter. The credibility of social media can be measured through the way an account owner interacts with others, including how often the account spreads information and how open the owner is about their identity. Furthermore, the research in [16] shows that the information shared on social media affects the perceived credibility of the sources that share it. This was shown in an experiment in which Twitter pages were shown to respondents, who then specified the source they considered to be the page owner. The experimental results show that the Tweet reviewer affects source credibility. Moreover, the research in [17] shows that false news or spam on Twitter can be eliminated using semi-supervised ranking models that score tweets according to their credibility, applied to Tweets containing information about crisis events. The credibility of information shared on social media, especially Twitter, is also addressed in [18], which uses two approaches to determine the credibility of information on Twitter (low, high, and average). The first approach calculates the degree of similarity between the information in Tweets and verified news. The second approach is based on similarity with verified news sources, in addition to a set of proposed features. Furthermore, CREDBANK was introduced in [19]. CREDBANK is a corpus of Tweets, topics, and events from Twitter whose credibility levels have been rated by humans. The corpus was collected by streaming Twitter over a period of more than three months, totaling over 1 billion tweets, and was assessed by 30 human annotators. In the field of journalism, RumorLens was introduced in [20] as a tool to help journalists identify new rumors distributed via Twitter and to show the audience that the shared information is a rumor, together with verification of the rumor found in other Tweets. The user can then determine whether the information from Twitter should be explored further, is corrected by other Tweets, or is unrelated to other Tweets.
The study in [21] detects whether information on Twitter is a rumor by performing a five-stage procedure: identify Signal Tweets, identify Signal Clusters, detect Statements, capture Non-Signal Tweets, and rank Candidate Rumor Clusters. In that study, a third of the top 50 clusters were judged to be rumors. Hoaxes can also be detected as in [22], where identity, demographic background, and linkage between documents are detected from the writing style of the source document. The contribution of that research is the detection of stylistic deception in written documents. Tests using a large feature set to distinguish regular documents from deceptive documents show 96.6% accuracy (F-measure). Moreover, the research in [23] proposes a novel time-aware ranking model that leverages multiple sources of crowd signals. The approach builds on two major novelties. The first is a unifying approach that, given a query q, mines and represents temporal evidence from multiple sources of crowd signals, so that the model can predict the temporal relevance of documents for query q. The second is a principled retrieval model that integrates temporal signals in a learning-to-rank framework to rank results according to the predicted temporal relevance. The research in [24] notes that fact finding on web sources is limited by the scarcity of related resources. It therefore retrieves articles about a claim and models the mutual interaction between the stance (i.e., support or refute) of the sources, the language style of the articles, the reliability of the sources, and the claim's temporal footprint on the web. Extensive experiments demonstrate the viability of the method and its superiority over prior work. The results show that the method works well for early detection of emerging claims, as well as for claims with limited presence on the web and social media.

Proposed System Architecture
The proposed fake news detector model, shown in Figure 1, is divided into three steps: Keyword Extraction, News Scoring, and Social Media Scoring. At the end, a credibility score is computed to judge whether the news is fake or not.

Keyword Extraction
Keyword Extraction, commonly known as text preprocessing [25], is used to obtain the important keywords from documents entered by the user. This stage is divided into four processes: Tokenizing, Filtering, Stemming, and Tagging.
a. Tokenizing is the process of separating the sentences entered by the user into individual words. The separation can be done based on spaces or punctuation in the sentence.
b. Filtering is the process of removing words that cannot serve as keywords, which are usually pronouns or conjunctions. This process relies heavily on a collection of words called the Stopword list. If a word is included in the Stopword list, the word is omitted.
c. Stemming is the process of finding the base word of every word that results from filtering. Its function is to find the base form of a word that has a prefix, a suffix, an infix, or a combination thereof. In previous research, stemming for Indonesian in particular can be done using the Sastrawi stemmer.
d. Tagging [26] is a process that corrects the results of Stemming, because stemming produces some words that do not match any base word in the Big Indonesian Dictionary (KBBI) [27]. It therefore requires a database of Indonesian base words obtained from KBBI.
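The four processes can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the stopword list, the base-word set standing in for KBBI, and the `toy_stem` affix stripper are hypothetical placeholders (a real system would use a full stopword list, the KBBI database, and a proper Indonesian stemmer such as Sastrawi). The fallback rule in `tag`, which keeps the original word when the stem is not a known base word, is likewise an assumption about how the correction step might work.

```python
import re

# Illustrative stopword list only (assumption, not the paper's list).
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "itu", "ini", "dia", "mereka"}

# Tiny stand-in for a KBBI base-word database (assumption).
BASE_WORDS = {"baca", "berita", "sebar", "makan"}


def tokenize(text):
    """Tokenizing: split text into lowercase words on spaces and punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def filter_stopwords(tokens, stopwords=STOPWORDS):
    """Filtering: drop every token that appears in the stopword list."""
    return [t for t in tokens if t not in stopwords]


def toy_stem(word):
    """Stemming placeholder: strip at most one prefix and one suffix."""
    for prefix in ("meny", "mem", "men", "me", "di", "ber", "ter"):
        if word.startswith(prefix) and len(word) - len(prefix) >= 4:
            word = word[len(prefix):]
            break
    for suffix in ("kan", "an", "i"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            word = word[: -len(suffix)]
            break
    return word


def tag(original, stem, base_words=BASE_WORDS):
    """Tagging: accept the stem only if it is a known base word."""
    return stem if stem in base_words else original


def extract_keywords(text):
    """Run tokenizing, filtering, stemming, and tagging in sequence."""
    tokens = filter_stopwords(tokenize(text))
    return [tag(tok, toy_stem(tok)) for tok in tokens]


print(extract_keywords("Menteri membaca berita yang disebarkan di Twitter."))
# ['menteri', 'baca', 'berita', 'sebar', 'twitter']
```

In this example the toy stemmer wrongly reduces "menteri" to "teri"; because "teri" is not in the base-word set, the tagging step restores the original word.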
Figure 1. System architecture approach

The results of Keyword Extraction are used in the Online News Search stage, which uses the Google Search API to find similar information on online news sites. The similarity between the collected news and the query is then calculated using a document similarity algorithm based on the Vector Space Model [28].
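$$\cos(d_j, q) = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert} = \frac{\sum_{t} w_{t,d_j}\, w_{t,q}}{\sqrt{\sum_{t} w_{t,d_j}^{2}}\; \sqrt{\sum_{t} w_{t,q}^{2}}} \tag{1}$$

where:
$\cos(d_j, q)$ = similarity between document and query
$d_j$ = document
$q$ = query

whereas the weight vector for each document can be calculated by

$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{|D|}{|\{d' \in D \mid t \in d'\}|} \tag{2}$$

where:
$\mathrm{tf}_{t,d}$ = the frequency of term t in document d
$|D|$ = the total number of documents in the document set
$|\{d' \in D \mid t \in d'\}|$ = the number of documents containing the term t.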
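As an illustration of the similarity calculation, the sketch below computes tf-idf weights and cosine similarity for tokenized documents using only the Python standard library. The function names and the choice to include the query as one member of the document set when computing document frequencies are assumptions made for this example, not details given in the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with w = tf * log(|D| / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)


# Hypothetical query keywords and two hypothetical news documents.
query = ["menteri", "baca", "berita"]
news = [
    ["menteri", "baca", "berita", "hoaks"],
    ["harga", "beras", "naik"],
]

vectors = tfidf_vectors([query] + news)
similarities = [cosine(vectors[0], v) for v in vectors[1:]]
print(similarities)  # the first document scores higher; the second scores 0.0
```

Note that a term occurring in every document receives weight log(1) = 0 under this weighting, so it contributes nothing to the similarity.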

News Scoring
News Scoring (NS), as shown in Figure 1, is computed from TC (Time Credibility), WC (Website Credibility), and MC (Message Credibility) and divided by N, the number of news documents, as shown in (3), where TC is weighted 0.4, WC 0.3, and MC 0.3. Time Credibility receives a higher weight because previous studies did not consider the time of news publication as a determinant in false news detection, even though the republication of news is one of the factors behind false news according to IFLA.
The online news search process collects several news sites that contain information similar to the query entered by the user. For each of these sites, the time of publication is the factor for Time Credibility, the website name is the factor for Website Credibility, and the content of the news is the factor for Message Credibility. The further a news item's publication time is from the earliest publication time, the smaller its Time Credibility value. If the website that publishes the news has a high level of popularity, it receives a higher Website Credibility value. For Message Credibility, a news item receives the highest value if its full content has a high level of similarity to the query entered by the user. The relationship between the three factors can be formulated as:

$$NS = \frac{\sum_{i=1}^{N} \left(0.4\, TC_i + 0.3\, WC_i + 0.3\, MC_i\right)}{N} \tag{3}$$

where
NS = News Scoring
TC = Time Credibility
WC = Website Credibility
MC = Message Credibility
N = number of news documents
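The weighted average described above can be sketched as a short Python function. The sketch assumes that each factor has already been normalized to the range 0 to 1 for every retrieved news document, and that an empty result set yields a score of 0.0. Both are assumptions of this example, as is the `NewsFactors` container.

```python
from dataclasses import dataclass


@dataclass
class NewsFactors:
    """Per-document factor scores, each assumed to lie in [0, 1]."""
    tc: float  # Time Credibility
    wc: float  # Website Credibility
    mc: float  # Message Credibility


def news_score(items, w_tc=0.4, w_wc=0.3, w_mc=0.3):
    """Average the weighted factor sum over all N retrieved news documents."""
    if not items:
        return 0.0  # assumption: no matching news means no supporting evidence
    total = sum(w_tc * i.tc + w_wc * i.wc + w_mc * i.mc for i in items)
    return total / len(items)


# Hypothetical factor values for two retrieved news documents.
docs = [NewsFactors(tc=1.0, wc=0.8, mc=0.9), NewsFactors(tc=0.5, wc=0.6, mc=0.7)]
print(news_score(docs))  # approximately 0.75
```

Keeping the weights as keyword arguments makes it straightforward to tune them later against a test dataset.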

Social Media Scoring
Similar to News Scoring, Social Media Scoring (SMS) is computed from three parameters, TC (Time Credibility), SMC (Social Media Credibility), and MC (Message Credibility), and divided by N, the number of social media documents, as shown in (4), where TC is weighted 0.4, SMC 0.3, and MC 0.3.

$$SMS = \frac{\sum_{i=1}^{N} \left(0.4\, TC_i + 0.3\, SMC_i + 0.3\, MC_i\right)}{N} \tag{4}$$

TC receives a higher weight in SMS because repeated dissemination of the same information is directly considered false news. This stage is carried out only if the News Scoring value does not reach 0.6, because 0.6 is a representative value above the midpoint between 0 and 1. However, this threshold may change depending on the results of the experiments to be carried out, so that a more accurate value can be obtained. If the NS value is more than 0.6, the news can be regarded as factual, or as having a high level of credibility. The social media used as search sites are Twitter and Facebook. The search uses the keywords entered by the user to find news containing those keywords on social media profiles on Facebook or Twitter. Scoring is based on the time of publication (Time Credibility), the credibility of the account that publishes the news (Social Media Credibility), and the relevance of the published content to the user's keywords (Message Credibility).

Credibility Score
At this stage, a conclusion is drawn from the scoring performed in the previous stages. There are two cases: the News Scoring value is more than 0.6, or the News Scoring value is less than 0.6. When the News Scoring value is more than 0.6, it is taken directly as the Credibility Score of the news entered by the user, as shown in (5).
$$\text{if } NS > 0.6: \quad CS = NS \tag{5}$$

where
NS = News Scoring
CS = Credibility Score

When the News Scoring value is less than 0.6, the Credibility Score is the sum of the News Scoring result and the Social Media Scoring result, with a weight applied to each, as in (6).

$$\text{if } NS < 0.6: \quad CS = (NS \times 0.7) + (SMS \times 0.3) \tag{6}$$

where
CS = Credibility Score
NS = News Scoring
SMS = Social Media Scoring
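The two cases can be sketched as a single Python function. Social Media Scoring is passed in as a callable so that it is evaluated only when the News Scoring value is not high enough, mirroring the staged design of the architecture. The function and parameter names are hypothetical. The text does not say how a News Scoring value of exactly 0.6 is handled, so this sketch assumes it is treated like the lower case.

```python
from typing import Callable

NS_THRESHOLD = 0.6  # provisional threshold from the text; may change after experiments


def credibility_score(
    ns: float,
    compute_sms: Callable[[], float],
    threshold: float = NS_THRESHOLD,
) -> float:
    """Combine News Scoring (NS) and Social Media Scoring (SMS) into CS."""
    if ns > threshold:
        # News evidence alone is considered sufficient.
        return ns
    # Assumption: NS equal to the threshold also falls through to this branch.
    sms = compute_sms()
    return 0.7 * ns + 0.3 * sms


# High NS: the social media stage is skipped entirely.
print(credibility_score(0.75, lambda: 0.0))  # 0.75

# Low NS: SMS is computed and blended in with weight 0.3.
print(credibility_score(0.5, lambda: 0.8))  # approximately 0.59
```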

Conclusion
This research proposes a system architecture to detect hoaxes or fake news in Indonesian-language news. The architecture is built from several stages, namely Keyword Extraction, which finds the keywords of the text entered by the user, and Scoring, which provides a credibility score for a news item based on time, online news or social media source, and message, so that the likelihood that the news is a hoax or a fact can be concluded. The proposed architecture represents several of the IFLA steps for avoiding fake news. The Consider the source, Check the author, and supporting sources steps are represented in the Website Credibility process in News Scoring and the Social Media Credibility process in Social Media Scoring. The Check the date step is represented in the Time Credibility process in News Scoring and Social Media Scoring. The Read beyond and Check your biases steps are represented in the Message Credibility process in News Scoring and Social Media Scoring. This research still needs to be tested with several datasets, and the weighting values need to be optimized based on the tested datasets.
The architecture social media and online newspaper credibility... (Rakhmat Arianto)