Psychometric properties of the SE-Revised: A Rasch model analysis


Introduction
Intelligenz Struktur Test (IST) is an intelligence test developed by Rudolf Amthauer in Frankfurt, Germany in 1953 (Adinugroho, 2016; Wiratna, 1993). It is classified as a speed test, which prioritizes speed and accuracy of work (Nur'aeni, 2012). In Indonesia, the IST is still used quite often, particularly for selection and placement in workplace and education settings (Bawono, 2008; Hamidah, 2001; Princen, 2011; Rahmawati, 2014). The IST is therefore expected to perform well as a measure, that is, to measure test-takers' abilities appropriately. The IST currently used in Indonesia was adapted from the IST-70 by Universitas Padjadjaran in 1973. Over time, an instrument needs regular evaluation to ensure that it still measures test-takers' abilities accurately. However, the IST currently in use has not undergone any revision, even though the original IST has been revised twice, from the IST-70 to the latest version, the IST 2000-Revised (Kipman, Kohlböck, & Weilguny, 2012).
The IST consists of nine subtests, each of which can stand alone to measure a specific ability (Wahyuni, Widyastuti, & Fitriyani, 2015). Previous studies have examined its measurement properties. Widianti (2008) tested convergent validity by correlating the SE subtest with the RA subtest. The results showed a significant correlation between SE and RA, meaning that SE items can measure reasoning abilities. In the test of discriminant validity, SE also correlated significantly, although relatively weakly, with WU. This indicates that SE does not measure only one aspect of reasoning but also taps the ability measured by WU. In terms of reliability, previous studies found that SE has low internal consistency (Agung & Fitri, 2016; Widianti, 2008).
Item parameter analysis indicated that the difficulty of the SE items varied, with nine items being difficult (<.3) and three items being easy (>.7) (Agung & Fitri, 2016). Elvira (2011) indicated that eleven items needed to be improved, while Rahmawati (2014) found that nine items needed to be improved. In addition, Suryani (2018) found that item number 20 was biased towards a particular gender and therefore also recommended revising it.
Because previous studies found poor validity and reliability of the SE as well as poor item quality, we set out to revise the SE items and to conduct psychometric tests on the revised SE. The psychometric properties of the SE subtest were evaluated in much the same way as in previous research (Agung & Fitri, 2016; Elvira, 2011; Rahmawati, 2014; Widianti, 2008). However, previous researchers evaluated data obtained from the original version of the SE subtest, whereas this study evaluated the revised version. Two evaluations were conducted. First, a data bank of SE test results was evaluated as the basis for revising items with poor psychometric properties. Second, after the revision, the data obtained from the revised SE test were evaluated again.
The evaluation was carried out using Item Response Theory (IRT). IRT was used instead of Classical Test Theory (CTT), which has the weakness of being test-dependent, meaning that the estimated ability of individuals is influenced by the characteristics of the items in a test (Embretson & Reise, 2000; Fan, 1998). Because test-takers' estimated ability changes from one testing occasion to another, test consistency is poor (Magno, 2009). In CTT, therefore, item characteristics are influenced by test-takers' abilities and, vice versa, test-takers' ability estimates are influenced by the characteristics of the items.
Unlike CTT, which focuses on the obtained scores, IRT does not depend on the particular sample of items or persons selected for the test (it is item-free and person-free), so that measurement is more precise and the items can be calibrated (Ariffin et al., 2010; Sumintono & Widhiarso, 2014). IRT assumes that a test-taker's performance on a test can be predicted by defining the characteristics of the individual's trait or ability, estimating the test-taker's scores on these traits (ability scores), and using these scores to predict or explain item and test performance (Hambleton & Swaminathan, 1985; Kubinger, Rasch, & Yanagida, 2011; Prieto, Alonso, & Lamarca, 2003).
Psychometric characteristics are quantitative attributes that relate to the strengths or weaknesses of the statistics obtained from tests or measurements, consisting of reliability, validity, and difficulty index (Embretson & Reise, 2000). The psychometric analysis in this study used the Rasch model approach, which covers unidimensionality, reliability, item fit, item difficulty index, and differential item functioning (DIF). The Rasch model provides diagnostic information that allows researchers to identify the difficulties of a test and then suggest corrective actions that can improve its measurement properties (Curtis & Boman, 2007; Petrillo et al., 2015).
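As a minimal illustration of the dichotomous Rasch model on which these analyses rest, the probability of a correct response depends only on the difference between person ability and item difficulty, both expressed in logits. The function name `rasch_probability` and the example values are illustrative assumptions; the analysis in this study was run in Winsteps, not with this code.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct answer under the dichotomous Rasch model.

    theta: person ability in logits; b: item difficulty in logits.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When ability equals difficulty, the chance of success is 50%.
p_equal = rasch_probability(0.0, 0.0)
# For the same person, a harder item (higher b) lowers the probability.
p_hard = rasch_probability(0.0, 1.5)
```

Because only the difference theta - b enters the formula, persons and items are placed on the same logit scale, which is what allows item difficulties and person abilities to be compared directly.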
Unidimensionality means that only one attribute or ability is measured by the set of items in a test (Bond & Fox, 2015). An instrument must therefore measure one particular ability of the test-taker. This assumption cannot be met strictly because other factors always influence test performance, such as cognitive factors, personality, motivation, anxiety level, the ability to work at a fast pace, and the tendency to guess when in doubt (Hambleton & Swaminathan, 1985; Hambleton, Swaminathan, & Rogers, 1992). However, there are circumstances in which it is necessary to think of concepts in unidimensional terms so that comparisons can be made on the basis of differences (Hagell, 2014). The minimum requirement for unidimensionality is a raw variance of 40%, which indicates good unidimensionality, while 60% indicates very good unidimensionality. The variance that cannot be explained by the instrument should ideally not exceed 15% (Sumintono & Widhiarso, 2014).
Reliability indicates the extent to which repeated measurements produce the same information, that is, without meaningful differences. Some differences will always exist; a trustworthy measurement therefore does not have to produce identical information, but the differences should be small enough to be tolerated (Azwar, 2009, 2014; Sumintono & Widhiarso, 2014). The reliability of test scores ranges from 0 to 1, where r = 0 indicates no reliability and r = 1 indicates perfect reliability (Aiken & Marnat, 2008; Azwar, 2009).
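For readers who want to see how an internal-consistency coefficient is obtained from scored responses, the following sketch computes the classical Cronbach's alpha for a persons-by-items matrix of 0/1 scores. It is not the Rasch person or item reliability reported by Winsteps; the function name and the small data set are hypothetical, and the code assumes that total scores are not all identical.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Cronbach's alpha for a persons-by-items matrix of 0/1 scores."""
    k = len(responses[0])  # number of items
    # Variance of each item across persons.
    item_vars = [pvariance([row[i] for row in responses]) for i in range(k)]
    # Variance of the persons' total scores.
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical data: four persons, three items, perfectly consistent answers.
consistent = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
alpha_consistent = cronbach_alpha(consistent)
```

In this constructed case every item ranks the persons identically, so the coefficient reaches its upper bound of 1; real test data fall below that value.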
Item fit is a "quality-control mechanism" that indicates whether items measure the intended variable in line with the unidimensionality of the construct (Bond & Fox, 2015). An item was considered fit if its Outfit Mean Square (MNSQ) was .5<MNSQ<1.5, its Outfit Z-Standard (ZSTD) was -2.0<ZSTD<+2.0, and its Point Measure Correlation (Pt Measure Corr) was .4<Pt Measure Corr<.85 (Osman et al., 2012; Rashid et al., 2008; Sumintono & Widhiarso, 2014). According to Boone et al. (2014), the MNSQ, ZSTD, and Pt Measure Corr values are used to judge how well an item measures the variable it is supposed to measure. If an item does not meet the criteria for outfit MNSQ, outfit ZSTD, and point measure correlation, the item is not good enough and needs to be adjusted or replaced. Misfit can be caused by an incorrect answer key, a large number of individuals who are poorly motivated when working on the questions, or questions with low discriminating power, which reduces the accuracy of the items (Sumintono & Widhiarso, 2015).
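The three criteria can be expressed as a simple check. The sketch below is illustrative only; the function name `fit_checks` is an assumption, while the thresholds are those stated above and the example values are the ones this paper reports for original item number 17 before and after revision.

```python
def fit_checks(mnsq: float, zstd: float, pt_corr: float) -> dict:
    """Return which of the three item-fit criteria an item satisfies."""
    return {
        "outfit_mnsq": 0.5 < mnsq < 1.5,
        "outfit_zstd": -2.0 < zstd < 2.0,
        "pt_measure_corr": 0.4 < pt_corr < 0.85,
    }

# Values reported in this paper for original item 17.
before = fit_checks(mnsq=1.54, zstd=2.5, pt_corr=0.15)   # before revision
after = fit_checks(mnsq=1.01, zstd=0.1, pt_corr=0.34)    # after revision
```

Returning the three results separately, rather than a single pass/fail flag, makes it visible when an item meets some criteria but not others.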
The item difficulty index (symbolized by b) is the logit score in the item measure table, which is sorted from the highest to the lowest logit score. High logit scores indicate high item difficulty (Sumintono & Widhiarso, 2015). The item measure table also reports the standard deviation, which, combined with the mean logit, allows items to be grouped by difficulty (Sumintono & Widhiarso, 2015). For example, an item between 0.0 logit and +1 SD is categorized as difficult, an item above +1 SD as very difficult, an item between 0.0 logit and -1 SD as easy, and an item below -1 SD as very easy. There are thus four groups of items based on level of difficulty.
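The grouping rule can be sketched as a small function. The function name is hypothetical, the mean item logit is assumed to be 0.0, and the handling of values that fall exactly on a boundary is an assumption, since the rule above does not specify it. The SD of 1.98 and the values 1.63 and 1.31 are taken from the results of this study; the negative values are hypothetical.

```python
def classify_difficulty(b: float, sd: float) -> str:
    """Group an item by its logit difficulty b, given the SD of item measures.

    Assumes a mean item logit of 0.0; boundary handling is an assumption.
    """
    if b > sd:
        return "very difficult"
    if b >= 0.0:
        return "difficult"
    if b >= -sd:
        return "easy"
    return "very easy"

sd = 1.98  # SD of the item measures reported for the SE-revised
labels = {
    1.63: classify_difficulty(1.63, sd),   # reported value
    1.31: classify_difficulty(1.31, sd),   # reported value
    -0.40: classify_difficulty(-0.40, sd), # hypothetical value
    -2.50: classify_difficulty(-2.50, sd), # hypothetical value
}
```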
Differential Item Functioning (DIF) analysis is a crucial technique for analyzing survey and test data (Boone et al., 2014). It detects whether items are biased with respect to respondents' demographic variables. DIF occurs when different groups in the sample (e.g., men and women) respond differently to an item (Pallant & Tennant, 2007). Bias in an item can be identified from the value in the 'Prob.' column. Values in this column at or below .05 (the threshold used in the statistical analysis) show that the relative location of an item differs between demographic groups, such as between males and females (Boone et al., 2014). Such a value indicates that the item is biased with respect to that demographic variable.
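The decision rule amounts to flagging every item whose probability value is at or below .05. The sketch below assumes the probabilities have already been read from the DIF output; the function name is hypothetical, the values for items 8 and 16 are those reported in the results of this study, and the value for item 3 is a hypothetical non-flagged example.

```python
def flag_dif(prob_by_item: dict, threshold: float = 0.05) -> list:
    """Return the item numbers whose DIF probability is at or below threshold."""
    return sorted(item for item, p in prob_by_item.items() if p <= threshold)

probabilities = {
    8: 0.0026,   # reported in this study
    16: 0.0143,  # reported in this study
    3: 0.4100,   # hypothetical value for an unbiased item
}
flagged = flag_dif(probabilities)
```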
This study aims to examine the psychometric characteristics of the SE-revised subtest using the Rasch model approach. The study is essential because, among the nine IST subtests, the SE subtest has problems in item parameters and reliability. Revising the SE subtest is expected to improve the quality of its measurement as well as the reliability of the IST as a whole, so that the measurement results can be used as a basis for decisions in both selection and placement.

Respondents
Respondents of this study were 159 first-year undergraduate students from a university in Riau, consisting of 46 men and 113 women. The age of respondents ranged from 17 to 22 years (mean = 19, SD = .78).

Instrument
The measuring instrument used in this study was the SE-revised. Respondents were given 6 minutes to work on the 20 questions of the SE-revised. The data consisted of responses to these 20 items. Each correct answer was scored 1 and each wrong answer was scored 0.
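The dichotomous scoring rule can be sketched as follows. The answer key and responses shown are hypothetical (the actual SE-revised key is not reproduced here), and scoring an omitted item as 0 is an assumption rather than a rule stated by the instrument.

```python
def score_item_responses(answers, answer_key):
    """Score each response 1 if it matches the key, otherwise 0."""
    return [1 if given == correct else 0
            for given, correct in zip(answers, answer_key)]

# Hypothetical three-item key and one respondent's answers.
answer_key = ["a", "c", "e"]
answers = ["a", "b", None]  # None stands for an omitted item (assumed to score 0)
scores = score_item_responses(answers, answer_key)
raw_score = sum(scores)
```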

Procedure
This study was conducted in four stages: (I) Preliminary study, carried out by evaluating the psychometric characteristics of the SE subtest based on the Rasch model, using the data bank obtained from IST testing of 293 first-year students; (II) Revision of the SE subtest based on the psychometric characteristics obtained in the preliminary stage; (III) Administration of the SE-revised to respondents in a situation that imitates the real test condition; (IV) Analysis of respondents' answers with the Rasch model to determine the psychometric characteristics of the SE-revised. Stages I and II constituted the construction of the SE-revised subtest, Stage III was data collection, and Stage IV was the analysis of the psychometric characteristics of the SE-revised.
Stage I. In the preliminary study, which used data from IST testing of new students of a Faculty of Psychology at a university in Riau, the SE fulfilled the instrument unidimensionality requirement (raw variance = 38.2%). This means that the SE subtest was able to measure the reasoning construct. However, the reliability test showed that the SE had very low reliability (α = .41), a result of the interaction between low person reliability (α = .36) and item reliability classified as exceptional (α = .99). This shows that the respondents' ability was lower than the difficulty of the items.
Rasch analysis was conducted to identify the SE items that needed to be adjusted. The results indicated one misfit item (item number 17), five items with a low difficulty index (items number 2, 3, 4, 12, 18), and one item showing DIF (item number 16). Item number 17 was classified as a misfit (MNSQ = 1.54, ZSTD = 2.5, Pt Measure Corr = .15), indicating that the item is not suitable for measuring the reasoning variable. Item number 16 showed DIF (p = .0063 < .05) in the male category, meaning that the item was more difficult for the male respondent group to answer.
Stage II. The items requiring adjustment were items number 2, 3, 4, 12, 16, 17, and 18. The SE subtest was revised by analyzing each item qualitatively in focus group discussions (FGD) with psychologists, psychometrics professors, and students conducting research in psychometrics. Changes to the items were based on collective decisions made in the FGD. Items number 12 and 18 were revised by changing the question along with its answer choices. For item number 16, only one answer choice, which was mistyped and might confuse respondents, was revised. In item 17 (misfit), one of the answer choices was changed. Items number 2, 3, and 4 were not changed because their difficulty index was expected to increase once the SE items were ordered from easy to difficult.
During this revision process, the SE items were also reordered from easy to difficult based on the item difficulty index, in accordance with the convention that intelligence tests order their items from the easiest to the most difficult (Murphy & Davidshofer, 2003). This allows respondents to answer easy items first. The resulting order of the original items is: 2, 4, 3, 6, 10, 11, 8, 9, 1, 7, 14, 19, 15, 13, 20, 5, 16, 17, 18, 12.
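Because the results refer to items by both their original and their revised numbers, the reordering can be turned into a lookup table. The sketch below uses the order listed above; the variable names are illustrative.

```python
# Original item numbers listed in their new (revised) positions 1 to 20.
revised_order = [2, 4, 3, 6, 10, 11, 8, 9, 1, 7,
                 14, 19, 15, 13, 20, 5, 16, 17, 18, 12]

# Map each original item number to its position in the SE-revised.
original_to_revised = {
    original: position
    for position, original in enumerate(revised_order, start=1)
}

# Reverse lookup: revised position to original item number.
revised_to_original = {pos: orig for orig, pos in original_to_revised.items()}
```

For example, `original_to_revised[17]` gives 18, so the original misfit item number 17 appears as item number 18 in the SE-revised, and `revised_to_original[8]` gives 9.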

Data analysis
Data analysis was carried out with the Rasch model using Winsteps 3.73 for Windows, which produces results sorted by difficulty level, from the highest to the lowest, making it easier to identify which questions are difficult and which are easy (Suryani, 2018). The analysis covered unidimensionality, reliability, item fit, item difficulty index, and DIF.

Results
The analysis of the SE-revised provided information on both the instrument and its items. Table 1 shows a raw variance of 38.5%, which is not much different from the expected 38.7% and very close to the unidimensionality requirement of 40%. This indicates that the SE-revised is capable of measuring reasoning abilities. The raw variance increased slightly from 38.2% prior to the revision. As shown in Table 1, the raw unexplained variance is 61.5%, indicating the magnitude of other factors that also affect the test, such as cognitive factors, personality, motivation, and anxiety (Hambleton & Swaminathan, 1985; Hambleton et al., 1992). Within the unexplained variance, all percentages are below 10%, meaning that the independence of the items in the test is classified as good (Wibisono, 2016).
The reliability analysis yields two outputs, person reliability and item reliability, as shown in Table 2. Cronbach's alpha is defined as the reliability of the interaction between persons and items. In Table 2, α = .61, which can be considered sufficient. Person reliability is .65, meaning that the ability of the respondents who worked on the SE-revised is classified as weak. In addition, the item reliability is .98, meaning that the items of the SE-revised are classified as exceptional. The mean person measure is -.01 (µ<.00), which indicates that the respondents' ability is lower than the item difficulty level.
The test reliability improved from .41, which was classified as poor, to .61, which falls in the sufficient category. This increase accompanied the increase in person reliability from .36 to .65 after revision, although both values are still considered relatively weak.
Table 3 shows that all items of the SE-revised subtest are fit. Most of these items fulfilled one or more of the suggested criteria. Table 3 lists the MNSQ, ZSTD, and Pt Measure Corr values of each item; item number 17 is of particular interest. Analysis of item fit indicated an increase in the quality of the items after revision. Item number 17 (item number 18 in the SE-revised) changed from Outfit MNSQ = 1.54 (>1.5) to 1.01, from Outfit ZSTD = 2.5 (>2.0) to .1, and from Pt Measure Corr = .15 (<.4) to .34 (<.4). The Pt Measure Corr of the revised item still does not meet the required criterion, but the other two criteria (Outfit MNSQ and Outfit ZSTD) are fulfilled; the item can therefore still be used and does not need to be discarded.
Based on the item difficulty estimation, the standard deviation (SD) of the item measures is 1.98. Using this SD, items can be grouped by level of difficulty: b above 1.98 is classified as very difficult, b from 0.0 to 1.98 as difficult, b from -1.98 to 0.0 as relatively easy, and b below -1.98 as very easy. The difficulty indices of items 1 to 20 are not in sequential order. Table 3 shows that three items have a logit value smaller than -2 and were therefore classified as very easy.
Analysis of the item difficulty index shows that some items have a better index than in the original form. The difficulty index of item number 12 (item number 20 in the SE-revised) improved to b = 1.63 (-2<b<+2), and that of item number 18 (item number 19 in the SE-revised) changed to b = 1.31. In addition, one item was classified as difficult after revision, namely item number 16 (item number 17 in the SE-revised), with b = 2.59.
Bias analysis was carried out for gender. Based on Table 3, two items are biased with respect to gender: item 8 (.0026<.05) and item 16 (.0143<.05). The DIF results show that item 16 (item number 5 in the original SE) is easier for male respondents to answer. Item number 8 (item number 9 in the original SE) favors women, which means that male respondents find it more difficult to answer. In other words, the probability of answering original item number 5 correctly is greater for male than for female respondents, whereas for original item number 9 the probability of answering correctly is smaller for male respondents.

Discussion
The main objective of this study was to examine the psychometric characteristics of the SE-revised using the Rasch model, focusing on unidimensionality, reliability, item fit, item difficulty, and DIF. Compared with its quality before revision, the SE-revised shows a general increase in the quality of its items and of the instrument, as indicated by the unidimensionality and reliability of the instrument and by the fact that no misfit items were observed. The examination of unidimensionality is very important for seeing the extent to which items provide independent information, which for the SE concerns aspects of reasoning (Ireland, Goh, & Ida, 2018; Kubinger, Rasch, & Yanagida, 2011). Where there are signs that other dimensions are being measured, the Rasch model indicates which items may contribute to these 'other' dimensions, thus directing researchers to investigate further and then replace or retain the items (Ishak, Osman, Mahaiyadin, Tumiran, & Anas, 2018). The increase in the general reliability of the test is in accordance with the increase in person reliability after revision. Two items, numbers 15 and 19, were changed based on the agreement of the FGD participants. After revision, items number 15 and 19 were found to have good psychometric characteristics, as seen from the results for item fit, difficulty, and DIF.
The overall quality of the SE-revised items is quite good because no misfit items were observed. This finding is in line with previous studies, which found several items in the original SE that were not functioning properly and needed improvement (Elvira, 2011; Rahmawati, 2014; Suryani, 2018). After the revision, all items of the SE were deemed fit, meaning that the quality of the items increased.
The evaluation results for the SE-revised show that the items have sufficient ability to measure the reasoning variable. However, 30% of the items did not meet the item difficulty index and DIF criteria and require review. Various factors may strongly influence the answers given by respondents, for example, factors related to test administration, the testing situation, and the form of the test materials used in the trial. Azwar (2016) explains that, in trials for data collection, the administrative situation needs to be considered and the trial should be run like the actual test. The data collection in this study, however, did not strictly control the test situation and conditions, for instance in terms of execution time, room temperature, and the seating position of the test-takers.
Another factor that may have influenced respondents' answers is that the trial respondents may already have been familiar with the questions. This is in line with Rahmawati's (2014) view that individuals may be familiar with the test because 40 years have passed since it was first adapted in Indonesia. It also suggests that the test has leaked into the general community (Rahmawati, 2014); for example, sample problems with explanations of how to answer them can easily be found on the internet.
As an implication of this research, the SE-revised can be used in testing. Because of its improved psychometric characteristics, the SE-revised can replace the old version of the SE, specifically in testing new students at one of the universities in Riau, which should lead to more accurate results. The results of this study need to be extended by further research to obtain better results.
This study has several limitations. First, the respondents were not sufficiently representative of the intended population, especially in terms of age and education level. The results therefore cannot be generalized, and the SE-revised can only be used with test-takers whose characteristics match those of the respondents in this study. Second, the number of respondents was insufficient to obtain high-quality results for a measuring instrument, so that even though the reliability of the SE-revised has increased compared with the old SE, the reliability score obtained is not ideal. Third, convergent and discriminant validity testing was not conducted.
Based on the results of this study, we offer the following suggestions for future research. First, revise the six items that require review to improve their psychometric characteristics. Second, revise low-quality items of the SE subtest by replacing the questions along with their answer choices, including replacing words or terms that are rarely used. Third, repeat research on the SE-revised to produce a test and items with good psychometric characteristics. Fourth, develop similar test tools that measure the reasoning construct and evaluate their psychometric properties with item response theory (IRT).

Conclusion
Evaluation of the SE-revised using the Rasch model indicated better psychometric quality compared with the original SE, as shown by an increase in unidimensionality and reliability. However, the quality of some SE-revised items needs to be improved. This study is therefore a first step towards improving the quality of the SE items, and further research is needed to obtain better psychometric measures of the SE.