The credibility analysis of the microteaching assessment instrument based on the responses of biology education students using the Rasch model

Abstract
This study aims to reveal the credibility of the microteaching (MT) assessment instrument by analyzing the responses of biology education students using the Rasch model. The participating biology education students acted as both assessors and practitioners of MT. The instrument used was the Instrument of Assessing the Learning Implementation (IPPP) in Book 4 of the 2013 In-Service Teacher Training Program (PLPG) Guidelines, issued by the Human Resources Development Agency (BPSDM) of the Ministry of Education and Culture. The data were analyzed by multirater analysis using Many-Facet Rasch Measurement (MFRM) version 3.83.2. The study concluded that the MT assessment instrument in biology education using IPPP was quite credible, as shown by three findings: 1) the peer ratings of students practicing MT ranged from -0.82 to 0.80 on the logit scale, which means that the practitioners were quite diverse and could be distinguished by the instrument; 2) the responses of students as assessors ranged from -1.99 to 0.94 on the logit scale, which means that the students were able to assess diversely and to understand the different rating scale categories; and 3) the calibration of the instrument showed that the 24 items can provide accurate information about practitioner performance, with logit values ranging from -0.72 to 0.66.
This is an open access article under the CC-BY-SA license.

Introduction
Microteaching (MT), also written micro-teaching, is a compulsory course for students of the Faculty of Teaching and Education (FKIP), including the Department of Biology Education at Sebelas Maret University. MT has been used in education since the 1960s. The MT course teaches prospective teachers to apply teaching skills that result from careful lesson planning (Allen, 1967). MT plays an important part in preparing prospective teachers because of its potential to emphasize the relationship between theory and learning practice (Saban & Çoklar, 2013). MT is the first teaching practice for prospective biology teachers, in which fellow students or peers act as their students in order to provide feedback (Sezen-Barrie, Tran, McDonald, & Kelly, 2014). MT thus equips prospective biology teachers by letting them practice their teaching skills and obtain feedback from their peers or supervising lecturer.
According to the curriculum structure, the MT course in the Department of Biology Education is given in semester 6, before students participate in Internship Program III in partner schools.
This placement assumes that students have already acquired substantial biology subject matter and pedagogical knowledge in earlier semesters. Even so, some lecturers hold that teaching practice in MT should still train teaching skills in separate stages, with each session covering one concept and one or two teaching skills. Other lecturers hold that MT need not isolate basic teaching skills and should instead practice teaching as a whole. In this second view, the term "micro" refers to the small number of students, the limited amount of lesson material, and the limited time allocation, not to practicing teaching skills separately. The difference between these two views affects how practitioners are assessed.
Followers of the first view usually assess MT students on each basic biology teaching skill separately. The lecturer conducts the assessment and involves other students, who give input on the specific skills being practiced and on the practitioner's performance. This study follows the second view, in which MT is practiced as whole teaching. Practitioners exercise the full range of teaching skills, and their peers act both as students and as assessors who rate according to their own perception. Peer assessment in MT functions as feedback and is followed by a discussion that focuses on improving the practitioner's mastery of teaching strategies (Faculty Development and Instructional Design Center, 1993). Such discussion helps to increase the competence of prospective biology teachers, as indicated by progress in teaching behavior, planning, the learning process, class management, communication, and evaluation (Kilic, 2010).
One instrument that is in line with the second view and can be used to assess MT practice is the Instrument of Assessing the Learning Implementation (IPPP) issued by BPSDM in 2013. This study therefore describes the reliability of the instrument based on an analysis of students' responses to it. The responses are described as maps in log-odds units (logits) for the assessors (peers), the practitioners, and the instrument, including the rating scale used. A complete description of student responses to the instrument is needed to give certainty of accuracy and a sense of fairness in MT assessment, which depends on every instrument item being well calibrated.
MT is a teaching practice activity that is scaled down in several respects: the lesson content is small, the students are few, the skills practiced are few, and the time is short. Because so many aspects are limited, student practitioners must decide not only which strategy to apply but also how to allocate their time (Pauline, 1993). MT practice is intended to improve the basic teaching skills and knowledge of prospective teachers (Cheng, 2017).
MT practice is guided by a lecturer, who also gives an assessment. Students in the group also provide assessment and input as part of the learning process. Involving peers in MT evaluation has a weakness: peer ratings tend to favor fellow participants because subjectivity operates among them (Cheng, 2017). Although MT differs from a regular class, it is quite effective for improving teaching skills (Pauline, 1993). Teaching skills can be improved in the MT environment by studying and understanding students' varied characteristics (Seidel, 2007).
The instrument used in this study was the Instrument of Assessing the Learning Implementation (IPPP). It is regarded as quite comprehensive in assessing teaching skills, covering aspects such as pre-learning, skill in involving students, assessing the learning process and results, language usage, and closing skill (Badan Pengembangan Sumber Daya Manusia Pendidikan Kebudayaan dan Penjaminan Mutu Pendidikan, 2013). The IPPP is detailed into 24 assessment items, each rated on a 1-5 scale (instrument attached).
The IPPP has been used nationally in implementing PLPG, and it can also be used as an MT assessment instrument outside that program. The practitioner's final score is obtained by summing the raw scores of all 24 items. This approach is known as classical scoring theory. Strictly speaking, the numbers selected by the assessor cannot be summed as in ordinary arithmetic because they are on an ordinal scale. Classical scoring has a further weakness: it yields no information about the interaction among the assessor, the person assessed, and the assessment instrument. Nor does it provide instrument calibration.
Several cautions apply to classical scoring that relies solely on raw scores: 1) a raw score is not a measurement result but merely a frequency description of the assessor's perception; 2) a raw score is preliminary information, and the perception recorded for each item is ordinal data that only expresses rank; 3) a raw score has weak quantitative meaning and cannot be manipulated arithmetically; 4) a raw score does not clearly describe a student's ability to perform the task; and 5) raw scores and percentages of answers are not always linear (Sumintono & Widhiarso, 2015).
If MT assessment reports only the score obtained by the practitioner, its reliability is questionable, because the assessor's ratings are on an ordinal scale (grades or levels) that cannot be manipulated arithmetically and they mix objective and subjective elements. The raw scores become more meaningful when analyzed with the many-facet (multifacet) Rasch model. The analysis covers three aspects simultaneously: 1) the practitioner (the student who practices teaching), 2) the assessor (the other students who observe the practitioner), and 3) the instrument used in the assessment. All three aspects are placed on one equal-interval measurement scale because they are expressed in logits. Multifacet Rasch analysis can also measure interactions between aspects and, through fit statistics, detect assessor effects such as restriction of range, the halo effect, and internal consistency (Kudiya, Sumintono, Sabana, & Sachari, 2018). Measurement with the Rasch model yields values in logits (log-odds units). Logits form a ruler with equal intervals, so the results resemble measurements in physics (Sumintono & Widhiarso, 2015).
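For reference, the three aspects described above are commonly combined in the rating-scale form of the many-facet Rasch model shown below. The article does not print the model, so the notation is an assumption for illustration: $B_n$ is the measure of practitioner $n$, $C_j$ the severity of assessor $j$, $D_i$ the difficulty of item $i$, $F_k$ the threshold between rating categories $k-1$ and $k$, and $P_{nijk}$ the probability that assessor $j$ gives practitioner $n$ category $k$ on item $i$.

```latex
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - C_j - D_i - F_k
```

Because every term on the right-hand side is expressed in logits, practitioners, assessors, and items can be placed on the same ruler.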
Multifacet Rasch analysis thus offers a complete and logical measurement of the three aspects of MT: the assessor, the practitioner, and the assessment instrument. The analysis serves to establish the reliability of the MT assessment instrument through the responses of biology education students.
Instrument credibility needs to be reported in order to give certainty to the measurement of MT practitioners' competence.
Against this background, the study examines the credibility of the MT assessment instrument through biology education students' responses, using the Rasch model, by addressing the following questions: 1) What is the actual description, in logits (log-odds units), of the practitioners' responses to the MT assessment instrument based on peer assessment in MT practice? 2) What is the actual description, in logits, of the students' responses as assessors to the MT assessment instrument? 3) What is the calibration result, in logit values, of every instrument item used in MT assessment?

Method
The study was conducted in the even semester of the 2018/2019 academic year in the Department of Biology Education, FKIP UNS, located in Building D of FKIP UNS, Jl. Ir. Sutami No 36A, Surakarta. The research subjects were 9 sixth-semester biology education students (practitioners) taking the MT course. With 24 instrument items, this yielded 216 data points for analysis.
This is a descriptive study that seeks to give a complete description of the credibility of the MT assessment instrument through an analysis of students' responses. The responses analyzed include students' responses as assessors, practitioners' responses, and the calibration of every instrument item, using the Rasch model in its multifacet (multirater) form.
The research instrument was the Instrument of Assessing the Learning Implementation (IPPP) (attached). The instrument is considered quite comprehensive in assessing teaching skills and consists of 8 aspects: pre-learning, lesson material mastery, learning approach/strategy, utilizing learning resources/media, skill in involving students, assessing the learning process and results, language usage, and closing skill (Badan Pengembangan Sumber Daya Manusia Pendidikan Kebudayaan dan Penjaminan Mutu Pendidikan, 2013). The IPPP is detailed into 24 assessment items, each rated on a 1-5 scale.
The stages and procedures of the research were as follows: 1) every week the students attended the MT course, and whenever one student practiced teaching in front of the class, the others acted both as students and as assessors of the practitioner using the instrument provided; 2) every week 3 of the 9 students practiced teaching for approximately 30 minutes each, so that every student completed 5 teaching practices in one semester; 3) assessment used the peer teaching assessment instrument in the 2013 PLPG manual; 4) the assessment results were tabulated according to the rules of multirater (multifacet) Rasch analysis (Linacre, 2018); 5) logit maps were derived for the assessors, the practitioners, and the instrument items, together with the interactions among these three aspects; and 6) the multifacet results were used to measure students' responses as assessors and practitioners and to calibrate the instrument items.
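The tabulation step can be pictured as a long-format list with one row per practitioner, assessor, and item. The following is a minimal sketch of one complete round of ratings, not the actual Facets input format. It assumes that assessors A to I are the same nine students as practitioners 1 to 9, in that order, and that nobody rates their own practice; this mapping and the names `rating_slots`, `PRACTITIONERS`, `ASSESSORS`, and `ITEMS` are hypothetical.

```python
from collections import Counter
from itertools import product

PRACTITIONERS = list(range(1, 10))   # practitioners 1-9
ASSESSORS = list("ABCDEFGHI")        # assessors A-I (assumed: letter k = student k)
ITEMS = list(range(1, 25))           # 24 IPPP items


def rating_slots():
    """Return every (practitioner, assessor, item) slot of one complete round.

    Assumption: a student never rates their own teaching practice, so each
    practitioner is rated by the 8 other students.
    """
    slots = []
    for practitioner, assessor, item in product(PRACTITIONERS, ASSESSORS, ITEMS):
        own_letter = ASSESSORS[practitioner - 1]
        if assessor != own_letter:
            slots.append((practitioner, assessor, item))
    return slots


slots = rating_slots()
per_practitioner = Counter(p for p, _, _ in slots)
per_assessor = Counter(a for _, a, _ in slots)
per_item = Counter(i for _, _, i in slots)

# Ratings received by practitioner 6, given by assessor C, and collected on item 4.
print(per_practitioner[6], per_assessor["C"], per_item[4])  # prints 192 192 72
```

Under these assumptions a complete round gives 192 ratings per practitioner, 192 per assessor, and 72 per item; an assessor who skips some ratings simply leaves the corresponding slots empty.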
The logit map is displayed on a logit ruler, which shows the position of every practitioner, assessment item, and assessor. For practitioners, a larger logit value means better performance and a smaller value means poorer performance. For assessors, a larger logit value means a more severe (parsimonious) assessor and a smaller value a more lenient (generous) one. For items, a larger logit value means that the item is more difficult for practitioners to achieve, and a smaller value means that it is easier. Placing every instrument item on the logit ruler therefore gives detailed information about practitioner skills and assessor responses; this is what is meant by instrument calibration (Boone & Staver, 2020).
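How positions on the logit ruler translate into ratings can be illustrated with a small sketch of the rating-scale many-facet Rasch model. It assumes the common convention in which a higher assessor measure means more severity and a higher item measure means more difficulty. The function names and the threshold values in `THRESHOLDS` are hypothetical; only the practitioner, assessor, and item logits are taken from the article.

```python
import math


def category_probabilities(ability, severity, difficulty, thresholds):
    """Probabilities of rating categories 1..len(thresholds)+1.

    Rating-scale form of the many-facet Rasch model: the log-odds of
    category k over category k-1 is ability - severity - difficulty - F_k.
    """
    log_numerators = [0.0]  # the lowest category is the reference
    for step in thresholds:
        log_numerators.append(
            log_numerators[-1] + (ability - severity - difficulty - step)
        )
    largest = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(value - largest) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_rating(ability, severity, difficulty, thresholds):
    """Model-expected rating on the 1-5 scale."""
    probs = category_probabilities(ability, severity, difficulty, thresholds)
    return sum(category * p for category, p in enumerate(probs, start=1))


# Hypothetical thresholds between categories 1-2, 2-3, 3-4, 4-5 (not from the article).
THRESHOLDS = [-1.5, -0.5, 0.5, 1.5]

best = expected_rating(0.80, 0.0, 0.0, THRESHOLDS)      # practitioner 6
poorest = expected_rating(-0.82, 0.0, 0.0, THRESHOLDS)  # practitioner 4
severe = expected_rating(0.0, 0.94, 0.0, THRESHOLDS)    # assessor C
lenient = expected_rating(0.0, -1.99, 0.0, THRESHOLDS)  # assessor I
hard = expected_rating(0.0, 0.0, 0.66, THRESHOLDS)      # item 4
easy = expected_rating(0.0, 0.0, -0.72, THRESHOLDS)     # item 20

print(best > poorest, lenient > severe, easy > hard)  # prints True True True
```

In each comparison the other two facets are held at 0 logits, so the difference in expected rating comes only from the facet being varied.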
The facets in this study were the practitioner, the assessor, and the assessment instrument (Maryati, Prasetyo, Wilujeng, & Sumintono, 2019). The consistency of all three can be measured, so that the assessment pattern can be explained thoroughly from each aspect (Andrich & Marais, 2019). The software used for the analysis was Many-Facet Rasch Measurement (MFRM) version 3.83.2.
Multirater analysis with the Rasch model can use different parameters according to the requirements and objectives of the study.

Results and Discussion
The multifacet analysis of the assessors, practitioners, instrument items, and rating scale used in MT can be summarized in an equal-interval logit scale map (Figure 1). Figure 1 shows that the logit scale ranges from -2 to 1. This scale maps the measures of the practitioners, the assessors, and the rating scale used. Based on Figure 1 and Table 1, practitioner 6 received the best assessment from peers (assessors), whereas practitioner 4 received the poorest. Figure 1 also shows that the items on which practitioners obtained the lowest scores were items 14, 18, and 4, whereas the items with the highest scores were items 15 and 20. As for the assessors, Figure 1 shows that the most severe was assessor C and the most generous was assessor I.
Practitioner's responses toward MT assessment instrument
Practitioners' responses to the MT assessment instrument are reflected in the scores they obtained in MT practice. A practitioner's total performance score is the basis for the logit value placed on the logit ruler. The logit value also depends on the probability of the ratings on each instrument item. Therefore, a practitioner with a higher total score does not always obtain a higher logit value.
The tendency of a practitioner's score pattern is treated as the practitioner's probability of obtaining a score, which is taken into account in determining the logit value. A summary of scores, logit values, fit statistics, and point measure correlations is presented in Table 1.
Table 1 shows that the practitioner with the highest practice score was practitioner 6, with a total score of 682 and a logit value of 0.80. The practitioner with the lowest practice score was practitioner 4, with a total score of 550 and a logit value of -0.82.
Therefore, practitioners' responses to the MT assessment instrument lie between -0.82 and 0.80 on the logit ruler, from practitioner 4 to practitioner 6 in Figure 1. This indicates that the assessments of practitioners' performance were spread out. In other words, the assessors could distinguish among the practitioners when giving their assessments.
Table 1. Summary of data on score acquisition, logit value, statistical fit, and point measure correlation
A clearer description of the precision of the responses to each practitioner is given by the infit statistics, combined with the outfit statistics and the point measure correlation. The assessors responded in varied ways to practitioners' performances across the 24 assessment items.
In general, every practitioner is assessed by 8 assessors, giving 24 x 8 = 192 ratings when every assessor completes the assessment. Table 2 describes the practitioners' positions based on the infit and outfit values, in both MnSq and ZStd, and the point measure correlation. The conformity of the responses to each practitioner can be seen from the infit statistic. The limit used is mean + SD = 1.00 + 0.13 = 1.13. By this criterion, the responses (assessment results) for one practitioner, practitioner 6, do not fit: the Infit MnSq is 1.28 > 1.13. This is consistent with the total score of 682, which is much higher than those of the other practitioners. However, the finding must be confirmed with the Outfit MnSq, Outfit ZStd, and point measure correlation. Practitioner 6 has an Outfit MnSq of 1.27, which is within the acceptable range, but an Outfit ZStd of +2.5, which is above the acceptance limit of +2.0. Thus one practitioner, practitioner 6, received an irregular assessment pattern. This could have occurred because practitioner 6's performance confused the assessors in giving a proper assessment.
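The infit cutoff used above is simple arithmetic and can be written out as a short sketch. The function names are hypothetical; the numbers are the mean, standard deviation, and Infit MnSq reported for the practitioners.

```python
def infit_upper_limit(mean_mnsq, sd_mnsq):
    """Upper limit for Infit MnSq, taken as mean + one standard deviation."""
    return mean_mnsq + sd_mnsq


def exceeds_infit_limit(infit_mnsq, mean_mnsq, sd_mnsq):
    """True when an Infit MnSq value lies above the mean + SD limit."""
    return infit_mnsq > infit_upper_limit(mean_mnsq, sd_mnsq)


limit = infit_upper_limit(1.00, 0.13)            # reported mean and SD
flagged = exceeds_infit_limit(1.28, 1.00, 0.13)  # practitioner 6
print(round(limit, 2), flagged)  # prints 1.13 True
```

A flag from this check is only a first signal; as the text notes, it still has to be confirmed against the outfit statistics and the point measure correlation.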
The practitioners drew varied responses on the MT assessment instrument, as shown by practitioner logit values ranging from -0.82 to 0.80. The logit values were obtained from the practitioners' performance as scored on the instrument items. The highest-scoring practitioner was practitioner 6 with a total score of 682, and the lowest-scoring was practitioner 4 with a total score of 550.
Although practitioner 6 tended to receive the highest scores and practitioner 4 the lowest, both remained within the tolerance limits. This means that the measurement of practitioners with this instrument was still reasonable. From the practitioners' point of view, this is one piece of evidence of the instrument's credibility in measuring.
The ranking in Table 2 is supported by the fact that practitioner 6 was the most diligent practitioner during MT practice and always took the assigned tasks seriously. Conversely, practitioner 4 took the assignments least seriously, often appearing careless in MT and neglecting to complete assignments. This was also affirmed by peers in the group, as reflected in the assessment data.
The precision of the pattern of practitioners' performances, as assessed on the instrument items, was judged from fit statistics, in particular the limits for Outfit MnSq, Outfit ZStd, and point measure correlation (Boone, 2016; Boone, Staver, & Yale, 2014; Engelhard & Wind, 2017). By these criteria, only one practitioner confused the assessors in giving scores: practitioner 6, with ZStd = 2.5.
Further analysis helps to clarify why practitioner 6 confused the assessors. Practitioner 6 was the coordinator of the MT group. As the person in charge, practitioner 6 relayed all MT practice instructions from the lecturer to the students and acted as the liaison with the MT supervising lecturer. This role required practitioner 6 to be more diligent and responsive.
The role was quite strategic and led the other assessors, who were also fellow practitioners, to show appreciation for practitioner 6. As a result, practitioner 6 obtained the highest total score of all practitioners. In general, peer assessment of MT practitioners was quite objective, although the coordinator role influenced the psychological dynamics among participants, so that practitioner 6 was given the highest score.

Assessor responses toward MT assessment instrument
Assessors' responses in assessing practitioners took the form of scores on every instrument item. The summary of responses based on logit values is presented in Table 3.
Table 3 shows that the most severe assessor was assessor C, with a logit value (measure) of 0.94. Assessor C gave 188 ratings with a total score of 492. Conversely, the most generous assessor was assessor I, with a logit value of -1.99; assessor I gave 191 ratings with a total score of 726. The rating pattern is treated as a probability that can be calculated, so the logit value is not determined by the total score alone. For example, assessor B, with a total score of 629, has a higher logit value (more severe) than assessor F, with a total score of 622. This is understandable given the number of ratings: assessor B gave 192 ratings, whereas assessor F gave 190.

Table 3. Summary of rater (Assessor) responses based on logit values
The precision of assessors' ratings of practitioners is reported in Table 4. The lowest infit value belonged to assessor D (Infit MnSq = 0.63) and the highest to assessor A (Infit MnSq = 1.62), so assessor precision in assessing MT ranged from 0.63 to 1.62. The conformity of assessors' responses can be observed from the infit statistic, and discrepant rating patterns from the outfit values. The acceptance ranges were 0.5 < MnSq < 1.5 for outfit mean square (MnSq), -2.0 < ZStd < +2.0 for outfit z-standard (ZStd), and 0.4 < Pt Measure Corr < 0.85 for point measure correlation. Assessor A had an Outfit MnSq of 1.62, indicating a confusing assessment pattern. On Outfit ZStd, 5 assessors were outside the tolerance limits: assessors A, D, F, G, and I, with values of 5.3, -4.2, -2.4, -2.8, and 2.6, respectively. On point measure correlation, five assessors had inconsistent patterns: assessors D, E, G, H, and I, with values of 0.27, 0.14, 0.30, 0.38, and 0.39, respectively (all below 0.4).
Judged on the three discrepancy parameters together, the assessors' ratings were still reasonable because no assessor fell outside all three criteria. In general, therefore, all assessors gave proper assessments of every practitioner they rated. The assessors rated objectively and followed an orderly pattern.
Assessors' responses in rating the practitioners in MT practice were quite varied. However, the variation was within tolerance because the rating patterns remained orderly. An assessor was tolerated as long as it did not exceed all three parameters simultaneously: Outfit MnSq, Outfit ZStd, and point measure correlation. At most, assessors exceeded the thresholds for two parameters simultaneously.
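The decision rule described above, in which an assessor is rejected only when all three criteria are exceeded at once, can be sketched as follows. The function names are hypothetical. The acceptance ranges are those stated in the text, and the example uses assessor A's reported Outfit MnSq and Outfit ZStd; assessor A's point measure correlation is not reported in the text, so the value 0.50 is a placeholder assumption.

```python
def violated_criteria(outfit_mnsq, outfit_zstd, pt_measure_corr):
    """List the fit criteria that fall outside their acceptance range."""
    violated = []
    if not 0.5 < outfit_mnsq < 1.5:
        violated.append("Outfit MnSq")
    if not -2.0 < outfit_zstd < 2.0:
        violated.append("Outfit ZStd")
    if not 0.4 < pt_measure_corr < 0.85:
        violated.append("Pt Measure Corr")
    return violated


def is_rejected(outfit_mnsq, outfit_zstd, pt_measure_corr):
    """Reject only when all three criteria are violated simultaneously."""
    return len(violated_criteria(outfit_mnsq, outfit_zstd, pt_measure_corr)) == 3


# Assessor A: MnSq and ZStd as reported; the correlation 0.50 is a placeholder.
flags_a = violated_criteria(1.62, 5.3, 0.50)
print(flags_a, is_rejected(1.62, 5.3, 0.50))
# prints ['Outfit MnSq', 'Outfit ZStd'] False
```

Under this rule an assessor with two violations is flagged for attention but still tolerated, which matches how the assessors with two exceeded thresholds are treated in the discussion.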
The analysis of rating patterns indicated that some assessors came close to rating irregularly: assessors A, D, G, and I. Assessor A exceeded two parameters, Outfit MnSq (1.62) and Outfit ZStd (5.3), and was indicated as giving parsimonious assessments (low scores) to practitioners.
Assessors D, G, and I each exceeded two thresholds simultaneously, Outfit ZStd and point measure correlation. These three assessors were indicated as rating leniently (generously giving high scores) to the practitioners or other participants in MT. The irregular rating pattern may have arisen from close relationships between participants, or it may mean that they rated without regard for objectivity.
Assessors' responses to the instrument can also be seen in their comprehension of each rating scale category. The instrument used 5 categories: 1 (very poor), 2 (poor), 3 (enough), 4 (good), and 5 (very good). The Rasch analysis found that the assessors could distinguish the meaning of the 5 categories, as shown in Figure 2.
Figure 2 shows that the peaks of the category curves are separated from one another; no two peaks coincide. This indicates that the assessors could distinguish the meaning of categories 1, 2, 3, 4, and 5. The assessors were aware of the scale and able to use it for rating (Kudiya et al., 2018). Figure 2 also gives visual information that the rating scale can be used to distinguish practitioners according to their performance in MT practice (Van Zile-Tamsen, 2017). The assessors' ratings also serve as feedback that practitioners can use to improve their performance in MT practice. MT can therefore function as guidance in the professional development of teacher candidates (Pekdağ, Dolu, Ürek, & Azizoğlu, 2020).

Analysis of MT assessment instrument
The MT assessment instrument consists of 24 items, through which the assessors rate MT practitioners' performance. The position of every item on the logit ruler is described in Table 5.
Table 5. Position of each instrument item on the logit scale ruler
Table 5 shows that the most difficult item for practitioners to fulfill is item 4, which assesses skill in relating lesson material to other relevant knowledge. From 72 ratings, item 4 collected a total score of only 214, with a logit value of 0.66. The easiest item for practitioners to achieve is item 20, which concerns conducting a final assessment in accordance with the competency (objective). From 71 ratings, item 20 collected a total score of 251, with a logit value of -0.72. The MT assessment items thus range from -0.72 to 0.66 on the logit scale.
This means that every assessment item was quite understandable to the assessors and drew varied responses.
All instrument items used in MT practice assessment can be analyzed for their precision in measuring practitioner skills. The precision of every MT assessment instrument item is presented in Table 6.
According to Table 6, the lowest infit value belongs to item 8 (Infit MnSq = 0.63) and the highest to item 12 (Infit MnSq = 1.47). The items therefore have Infit MnSq values ranging from 0.63 to 1.47, which are then checked against the outfit statistics.
Table 6. Precision in every MT assessment instrument item
The acceptance ranges were 0.5 < MnSq < 1.5 for outfit mean square (MnSq), -2.0 < ZStd < +2.0 for outfit z-standard (ZStd), and 0.4 < Pt Measure Corr < 0.85 for point measure correlation. All instrument items were within the acceptance limits for Outfit MnSq and point measure correlation. On Outfit ZStd, however, two items were outside the limits: item 8 and item 12, each with an Outfit ZStd of 2.5. Because these items exceeded only one of the three parameters, the MT assessment items were not considered confusing to the assessors in rating practitioners. All 24 instrument items contributed to assessing practitioners objectively and transparently and fulfilled the existing criteria. The analysis thus yields a psychometric description through measurement that connects person measures and item calibration (Engelhard et al., 2018).
In general, all instrument items were feasible because none exceeded the thresholds of all three discrepancy parameters. The instrument used in MT assessment was the one used in the In-Service Teacher Training Program (PLPG), issued by the Human Resources Development Agency for Education and Culture and Education Quality Assurance in 2013. It is known as the Instrument of Assessing the Learning Implementation (IPPP) and was applied nationally in all regions holding PLPG.
The instrument is considered quite representative in assessing teaching skills related to pre-learning, learning approach/strategy, utilizing learning resources/media, skill in involving students, assessing the learning process and results, language usage, and closing skill. Therefore, the MT assessment instrument that utilizes IPPP is a reliable instrument and able to give accurate information on MT practitioners' performance.

Conclusion
Based on the results and discussion, this research concludes that the MT assessment instrument in biology education using IPPP is quite reliable, as explained by the following three points: 1) MT practitioners' responses, based on peer assessment, range from -0.82 to 0.80 on the logit scale, which means that the practitioners are quite varied and can be differentiated by the instrument used; 2) students' responses as assessors in rating with the MT instrument range from -1.99 to 0.94 on the logit scale, which means that, as assessors, the students gave varied ratings based on the instrument used; 3) instrument item