Application of Rasch Measurement Model in Developing Calibrated Item Pool for the Topic of Rational Numbers

Rational Numbers is an essential topic in mathematics since it underpins the learning progression of more advanced topics. Nevertheless, previous literature shows that students have difficulties in understanding the topic for numerous reasons. Teachers' inability to provide good examples during teaching has been identified as one of the major causes. Thus, this study aims to develop a calibrated pool of items to help teachers give appropriate examples for the topic of Rational Numbers. We employed a descriptive design to describe the item statistics for the calibrated pool of items. The sample consisted of 1,292 secondary school students. We used the Rasch measurement model framework via a quantitative approach to analyse the data. The results showed that all items demonstrated acceptable quality in measuring students' ability in rational numbers while also demonstrating strong evidence of validity and reliability. Ultimately, we also provide suggestions on how teachers can use the pool of items to deliver appropriate examples in the classroom.


INTRODUCTION
Success in school and beyond is greatly influenced by mathematics proficiency (Ritchie & Bates, 2013). According to Tian and Siegler (2018), one of the prime factors that contribute to mathematics proficiency is knowledge of rational numbers. A rational number is defined as any number that can be expressed as a ratio of two integers with the denominator ≠ 0 (Blinder, 2013). For example, 2 is a rational number since it can be expressed as 2/1 or 4/2, etc. A decimal such as 0.125 is also a rational number since it can be expressed as 1/8. In general, since integers take positive, zero, and negative values, rational numbers have similar properties. Rational numbers and the concepts connected to them are essential for learning mathematics, since understanding these concepts helps students progress better in more advanced topics (Mozacco et al., 2013; Siegler et al., 2012). For example, since probabilities are widely expressed as fractions, decimals, and percentages, understanding the concept of probability, and therefore many decision-making contexts, requires an understanding of the magnitudes of these rational numbers.
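The definition above can be checked mechanically. A minimal sketch using Python's standard fractions module, with the same numbers used in the text:

```python
from fractions import Fraction

# A rational number is any number expressible as a ratio of two integers
# with a non-zero denominator; Fraction makes this concrete.
assert Fraction(2, 1) == Fraction(4, 2) == 2   # the integer 2 is rational
assert Fraction(1, 8) == Fraction(125, 1000)   # 0.125 as a ratio of integers

# 0.125 is exactly representable in binary floating point, so converting
# the float recovers the ratio 1/8 exactly.
print(Fraction(0.125))  # 1/8
```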
The following examples might give better insight into the importance of understanding rational numbers beyond the classroom. In inferential statistics, recognizing the different meanings of p = .01, p = .10, p = .05, or p = .001 requires some level of understanding of decimals. In engineering, fractions and decimals are routinely used in the conversion of units, while ratios and proportions are used in medical practice for calculating the right dosage of medication. More than that, rational numbers also play an important part in our daily life. For instance, knowledge of fractions helps us to understand discounts for items on sale, while understanding decimals encourages precision.

LITERATURE REVIEW
Despite its importance, previous literature shows that the topic of Rational Numbers is very challenging for students. Notably, there are different levels of difficulty associated with the topic. One of the most resounding difficulties relates to the "whole number bias" phenomenon, in which natural number rules are inappropriately applied to rational numbers (Ni & Zhou, 2005). For example, Sun (2019) listed difficulties in the addition, subtraction, multiplication, and division of fractions that resulted from algorithms not supported by whole number rules. Further, according to van Hoof et al. (2015), whole numbers differ from rational numbers in four distinct aspects, namely, (1) density, (2) representation, (3) number size, and (4) arithmetic operations. Applying whole number rules in these aspects may lead to systematic errors. Other than that, research by Yetim and Alkan (2013) identified basic mistakes such as failure to convert rational numbers into decimal numbers and vice versa, for instance stating that -8/5 is equal to -8.5. Besides, there is the longer-is-larger rule, where numbers with more digits are commonly considered bigger (Liu et al., 2014). To illustrate, students who adopt the rule believe that 4.9 is smaller than 4.34 since the latter has more digits.
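The two errors above, the faulty fraction-to-decimal conversion and the longer-is-larger comparison, can be made concrete with a few lines of arithmetic:

```python
# (1) -8/5 is NOT -8.5: converting a fraction to a decimal means dividing,
#     not concatenating the numerator and denominator.
assert -8 / 5 == -1.6
assert -8 / 5 != -8.5

# (2) The longer-is-larger rule fails for decimals: 4.34 has more digits
#     than 4.9, yet it is the smaller number.
assert 4.9 > 4.34
```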
Apart from the "whole number bias" phenomenon, Siegler and Lortie-Forgues (2017) identified two other sources of difficulties encountered by students, which they termed inherent and culturally contingent difficulties. Inherent sources of difficulty include difficulty in understanding individual rational numbers (such as why 1/2 is bigger than 1/3 when 3 is bigger than 2) and the relationship between rational and whole numbers. On the other hand, as the name suggests, a culturally contingent source of difficulty involves the culture within which the learners originated. It is well acknowledged that teachers' knowledge differs based on their countries of origin. For example, while Canadian pre-service teachers find it difficult to explain the concept of multiplying two fractions (Siegler & Lortie-Forgues, 2015), a large majority of their Chinese counterparts reported otherwise (Lin et al., 2013).
One of the possible explanations for the disparity is the conception of teacher professional development (TPD). While TPD in Western countries often takes place in the form of workshops that are considered remote, inconsistent, and sometimes contradictory (Guskey, 2003), the culture of professional development in East Asian countries such as China can happen at any point in the teachers' daily routine (Huang, 2006). There is also an Asian culture of learning from more experienced teachers, which in turn improves younger teachers' knowledge and skills in teaching (Li et al., 2006).
Another identified source of difficulty is textbook content. Lack of coverage in a textbook may hamper students' efforts to understand a particular mathematical concept. For example, Son and Senk (2010) found that US textbooks contained fewer examples of fraction division problems for fifth and sixth graders compared to fraction multiplication problems. This may partly explain the poorer results among American students. Apart from that, language is also decidedly an important culturally contingent source of difficulty. To exemplify, the numerical terms used in East Asian countries seem to facilitate students' better achievement in mathematics (Dowker et al., 2008).
Like other topics in mathematics, one of the important approaches to teaching rational numbers is providing examples. The primary purpose of providing examples is to assist retention through repetition of the procedure so that students develop proficiency. Likewise, it is hoped that while working on examples, students can construct new awareness and understanding with regard to both procedure and concept. There are many ways that teachers can give examples. Among the common approaches is introducing an idea or explaining a concept. Several researchers have conducted studies to explore strategies used in providing examples. For instance, Bills and Bills (2005) suggest that teachers should use simple examples first, such as those using small numbers and minimal operations, and use examples that build on students' prior knowledge to scaffold students' learning. Teachers should also use examples that allow them to attend to common errors and misconceptions (Zodik & Zaslavsky, 2008).
Yet, studies also show that teachers struggle to do just that since they depend heavily on the examples and exercises from the textbook. Compounding this issue is the fact that the difficulty of the items in the textbook has not been empirically tested. Also, there is an abundance of items in the textbook to choose from, making it a laborious task for teachers to choose the best possible items to use as classroom examples. Moreover, the literature shows that teachers are known to have poor ability to estimate the difficulty of a particular item (Impara & Plake, 1998; van de Watering & van der Rijt, 2006). Thus, teachers, especially the less experienced, might find it challenging to find items from the textbooks that suit their teaching of a particular learning standard.

Calibrated Item Pool
One of the possible solutions to this is a pool of calibrated items from which teachers can choose examples for the classroom. A calibrated item pool is defined as a group of items that have been arranged according to their difficulty. Thus, teachers might use easy items from the pool to introduce new concepts as well as to reduce misconceptions. Gradually, more difficult items can be added to increase students' understanding of a particular topic. According to the literature, a calibrated item pool can be used for a variety of purposes. One of the most important benefits is that it aids in constructing tests that are relevant to the testing objectives. To give an example, Aung and Lin (2020) established a calibrated pool of 164 mathematics items for Grade 6 children, and based on the statistics of the items in the bank, they were able to develop a psychometrically sound new 60-item test to evaluate the mathematical ability of average-ability students.
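The core idea of a calibrated pool can be sketched in a few lines: each item carries a difficulty estimate (in logits), and a teacher filters by difficulty band. The item IDs and band cut-offs below are illustrative assumptions, not values from any published pool:

```python
# A toy calibrated pool: (item_id, difficulty in logits).
# IDs and difficulties are hypothetical, for illustration only.
pool = [
    ("KB1", -1.85), ("MA3", -1.10), ("HA1", -0.42),
    ("LB4",  0.65), ("TA2",  1.20), ("TB3",  1.90),
]

def items_in_band(pool, low, high):
    """Return items whose difficulty falls in [low, high), easiest first."""
    return sorted((i for i in pool if low <= i[1] < high), key=lambda i: i[1])

easy = items_in_band(pool, float("-inf"), -0.5)   # introduce a new concept
moderate = items_in_band(pool, -0.5, 1.0)         # consolidate understanding
hard = items_in_band(pool, 1.0, float("inf"))     # extend and challenge
print([i[0] for i in easy])  # ['KB1', 'MA3']
```

Because every difficulty is on the same scale, "easy" and "hard" keep a stable meaning across topics and test forms.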
The Classical Test Theory (CTT) and the Item Response Theory (IRT) are two widely used measurement theories for developing calibrated item pools. The IRT, however, is more commonly employed for item calibration. Many researchers choose the Rasch measurement framework within the IRT family because it requires fewer parameter estimates and is thus easier to work with. For example, Bjorner et al. (2017) used the Rasch model to create a pool of high-quality items, in which a score based on only five items from the pool showed very high concordance with the score based on all items. Kallinger et al. (2019) used the same model to calibrate an item bank of anxiety-related questions for orthopedic patients. The item bank serves as the foundation for a computer-adaptive test that can assess a wide range of anxiety in orthopedic rehabilitation patients. Meanwhile, Nieto et al. (2017) used the adaptive power of a calibrated item pool to demonstrate that only one-third of the pool's questions are sufficient to assess the Five-Factor Model personality facets accurately.
Despite its potential, however, research on calibrated item pools in education is minimal. Hence, the purpose of the present study is to develop a calibrated item pool for the topic of rational numbers so that teachers can use the items effectively as examples during classroom instruction. As a result, teachers are no longer required to estimate the difficulty of items used as classroom examples. Instead, they can readily identify such items based on their difficulty statistics.

METHODOLOGY

Participants
Participants in this study consisted of 1,292 secondary school students with an average age of 13 years. The gender distribution was 590 males (45.7%) and 702 females (54.3%) from schools in the states of Kedah, Penang, and Perak in the northern part of Malaysia. The selection of the schools was based on purposive sampling, in which the researchers identified schools with various degrees of achievement in mathematics.

Instrument
This study employed ten mathematics tests that were administered to ten schools. The tests were conducted separately but were linked together by several common items using the common item non-equivalent group design (Kolen & Brennan, 2014). Altogether, we employed 81 common items to link the ten tests and 362 unique items measuring 13 topics specified in the curriculum specifications (Ministry of Education, 2016). However, only results involving the topic of Rational Numbers will be presented in this article. The tests were developed both by the researchers and by practising teachers. Content validity of the tests was verified by the head of the mathematics panel of each school. The tests included both multiple-choice and partial credit items. In the multiple-choice format, participants chose one correct answer from a list of four possible choices. One mark was given for a correct answer and no mark for an incorrect answer. In the partial credit format, the scoring was based on the completion of the steps in solving the problem. The marks for each item ranged from 1 to 4, and the total marks for each test were 100. Correspondingly, items that shared the same stem in the partial credit format were treated as different items. Examples of a multiple-choice item and a 2-mark partial credit item are given in Table 1.

Data Analysis
The quality of each item in the item pool was examined using the Rasch model software WINSTEPS 3.74. Apart from its simplicity, the model is favored over others in the IRT family, such as the 2-parameter model, because it constrains every item to the same discriminatory power, allowing students' ability to be estimated solely from item difficulty rather than from how well each item discriminates. In contrast, virtually all forms of data are accepted when utilizing the 3-parameter model because that model adjusts for any disparities in the data. Nevertheless, we believe that erratic data that do not fit the model's expectations should not be accepted for analysis in achievement tests. Similarly, guessing is not accepted and is considered a reflection of respondent unreliability.
The plan of analysis started with assessing the assumptions of the Rasch model, specifically, (1) the model-data fit and (2) the unidimensionality assumptions. This is a crucial step since the Rasch model has strict assumptions that must be met to create an equal-interval scale (Bond & Fox, 2015). The first assumption is that the data must fit the model's expectations. Model-data fit refers to the extent to which the data collected match expectations from the model. This assumption was examined using the infit and outfit mean-square (MNSQ) values generated by WINSTEPS 3.74. While both statistics are sensitive to unexpected responses, the infit MNSQ is most sensitive to unexpected responses from respondents whose ability is close to the item's difficulty (on-target), while the outfit MNSQ is most sensitive to unexpected responses from respondents far from the item's difficulty (off-target) (Linacre, 2002). According to Bond and Fox (2015), the assumption is met when the infit and outfit MNSQ values are in the range of 0.6 to 1.4. Meanwhile, unidimensionality assumes that the items in a test measure a single construct (Wright & Masters, 1982). This assumption was examined via the principal component analysis (PCA) of residuals procedure in the software. The assumption is met when the variance explained by the measurement dimension is more than 40% (Linacre, 2006).
In this study, apart from examining the assumptions, we also reported statistics at the item level, specifically, the item reliability and item separation indices. The item reliability statistic refers to the ratio of true to observed item variance (Linacre, 2006). It provides information on the consistency of the ordering of item difficulty if an instrument is administered to a comparable sample of participants: a high item reliability statistic indicates a consistent ordering of the items' difficulty, and vice versa. Meanwhile, the item separation index indicates how well the measurement spreads the items along the difficulty scale. For example, if the separation index is 2, it is possible to distinguish two statistically distinct strata of item difficulty. It should be noted that a proper measurement should distinguish these levels clearly. For a proper measurement, the item reliability index should be more than 0.94 (Fisher, 2007), while the separation index should not be less than 2.0 (Bond & Fox, 2015).
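The two indices are algebraically linked: separation G and reliability R satisfy G = sqrt(R / (1 - R)) and R = G^2 / (1 + G^2). A quick sketch shows that the two cut-offs quoted above are consistent with each other (software such as Winsteps computes both from the same variance components, so small rounding differences against reported values are expected):

```python
import math

def separation_from_reliability(r):
    """Separation index G implied by reliability R: G = sqrt(R / (1 - R))."""
    return math.sqrt(r / (1.0 - r))

def reliability_from_separation(g):
    """Inverse relation: R = G^2 / (1 + G^2)."""
    return g * g / (1.0 + g * g)

# A reliability of 0.94 corresponds to a separation of about 3.96,
# comfortably above the 2.0 minimum (which corresponds to R = 0.80).
print(round(separation_from_reliability(0.94), 2))  # 3.96
print(round(reliability_from_separation(2.0), 2))   # 0.8
```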
At the same time, statistics for each item were also reported. Apart from the item difficulty and the fit statistics, the point measure correlation (PTMEA) statistic was also included. The positive values of this statistic indicate that the particular item is working together with other items in the same direction to measure the intended construct (Bond & Fox, 2015).
Apart from the abovementioned analysis, the present study also provides information regarding the learning standards for the topic of Rational Numbers. In the curriculum, learning standards are measurable indicators of the quality of learning and achievement (Ministry of Education, 2016). The analysis is essential to identify the most difficult-to-master learning standards so that teachers can benefit from the information when providing examples in the classroom. The topic of Rational Numbers consists of 21 learning standards, the most for any topic in the curriculum (Ministry of Education, 2016). The list of learning standards for this topic is presented in Table 2. Note that learning standard 1.2.4 (Describe the laws of arithmetic operations, which are the Identity Law, Commutative Law, Associative Law, and Distributive Law) was not targeted by any item since it is supposed to be assessed orally and not through test items. Meanwhile, learning standard 1.3.1 (Represent positive and negative fractions on number lines) was also not targeted by any item because the knowledge and skills for this learning standard are similar to those of learning standard 1.4.1.

FINDINGS
In terms of model-data fit, results from the calibration of all 447 items showed that the software dropped three items due to a lack of responses from the students. Meanwhile, 21 items that exhibited infit and outfit MNSQ values outside the acceptable 0.6-1.4 guideline were manually deleted (see Table 3). Conversely, results from the PCA of residuals showed that the raw variance explained by the student and item measures was 54%, more than the intended value of 40% (Linacre, 2006). As such, we provided ample evidence that the unidimensionality assumption was also fulfilled. Besides, both the item reliability index (0.97, against the criterion of > 0.94; Fisher, 2007) and the item separation index (6.10, against the criterion of > 2.0; Bond & Fox, 2015) exceeded the intended values.

Table 4 showed statistics for all 71 items measuring 17 learning standards. Two items measured the first learning standard, 1.1.1, both in the form of multiple-choice questions (MCQ). The difficulty of items was estimated with the Winsteps software. Since the mean difficulty was set at 0, the negative sign showed that respondents had more than a 50% chance of answering HA1 and KA4 correctly, with the latter considered more difficult based on its larger value. The SE indicated the standard error of the estimation. The infit and outfit MNSQ values of 1.03 and 1.15 signify that there were only 3% and 15% variations from the model's expectations for the on-target and off-target participants, respectively. Finally, the positive value of 0.30 for the point-measure correlation (PTMEA) yielded evidence that item HA1 was working together with other items in measuring the participants' ability in Rational Numbers. In general, it seems that the teachers developed relatively easy items for this topic, since the respondents had more than a 50% chance of a correct answer for 44 (61.97%) of the items. Table 5 shows examples of easy, moderate, and difficult items.
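The link between a negative difficulty estimate and a better-than-even chance of success follows directly from the model: for a respondent at the mean ability (0 logits), the success probability on an item of difficulty b is 1 / (1 + e^b). The difficulty values below are illustrative, not estimates from this study:

```python
import math

def p_correct_at_mean(b):
    """Success probability for a respondent at the mean ability (0 logits)
    on an item of difficulty b, under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(b))

# Negative difficulties give better-than-even odds; positive ones, worse.
print(round(p_correct_at_mean(-1.0), 3))  # 0.731
print(round(p_correct_at_mean(0.0), 3))   # 0.5
print(round(p_correct_at_mean(1.5), 3))   # 0.182
```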

DISCUSSION
From Table 5, it can be observed that the results for the easiest items were duly expected. Previous studies in Malaysia have shown that students have a high mastery level when answering items that measure procedural understanding, such as items MA3 and KA2. One possible explanation was that the ability to perform a series of computational tasks has always been exposed to the students since primary school (Rittle-Johnson et al., 2001). Therefore, the students were quite familiar with the types of items and had no problem solving them.
Item HA1 was endorsed as one of the easiest-to-score since the item was very similar to the examples in the textbook. It is plausible that the teachers had gone through similar items with the students in the classroom. Materials from textbooks are routinely used as primary sources for teaching and learning activities, as demonstrated by Lepik et al. (2015). As a result, when asked again in these tests, students only needed to recall the solution steps taught in class instead of engaging in high-level cognitive tasks like interpreting or evaluating. While there was a possible explanation for the easy-to-score items, the same could not be generalized to the difficult items. This is because, based on the explanations given by the teachers, these items were considered easy, since they measure low-level learning standards such as recognizing integers (item TB3) or arranging positive and negative decimals (item L5R45). Even though item NA2 requires students to solve a problem, it is a routine problem, and the students may have discussed it with their teachers during lessons. Since the teachers themselves could not explain why these items were perceived as difficult, perhaps there is a need to retrace the students' responses and identify whether there was a problem during the teaching and learning of these items.
The calibrated pool of items developed from this study may help teachers in multiple ways. Firstly, it can be used to provide appropriate examples in the classroom. It is widely accepted that teachers should begin with easy examples to help students understand a concept before progressing to more challenging ones (Bills & Bills, 2005; Rowland, 2008). We believe that instructional scaffolding of this kind can best be implemented using the calibrated pool of items. To illustrate, to teach learning standard 1.1.2 (Recognize and describe integers), teachers may use item KB1 as the first example to convey the concept of integers (see Figure 1). This is because the item is the easiest, and it is conceivable that the students would be able to understand how to arrive at the answer. Teachers may start by explaining the definition of integers and asking the students whether -2.4 is an integer or not. Note that the answer is 'no' because it is a decimal number and not a whole number. Teachers should then ask the students why it is not an integer. Next, the teachers may proceed with the subsequent number, i.e., -7, and ask the students the same question again. After that, teachers can ask the students to identify whether 2/3, 59, 0.25, and -61 are integers or not. We would expect the students to circle 59 and -61 and provide justifications. If the students have difficulty in recognizing the integers, teachers may provide remedial activities at this early stage before the students encounter greater difficulties at a more advanced stage.
After that, teachers may use more difficult items from the pool, such as item LB4, as examples to strengthen the students' ability in recognizing and describing integers, particularly with regard to prime numbers (see Table 6). Then teachers may ask the students to try answering the more difficult items L11R82 and TA2 on their own, since both items not only involve prime numbers but also possess a similar degree of difficulty. Finally, teachers might want to assign items L1R7, SB1, and TB3 as part of homework together with other items from the textbook. Note that since TB3 was the most difficult item within this learning standard, the teacher might want to revisit it in the next class to see whether the students have difficulties in answering it.
Even though the development of the pool of items was able to help teachers with instructional scaffolding by starting with easy examples followed by more difficult ones, we believe that there is still a need to include more items. For example, we can see that only KB1 involves integers, decimals, and fractions, while the other items in this learning standard measured students' knowledge of prime numbers. Therefore, we believe there is a need to add more items that measure decimals and integers, since these two concepts are often considered challenging for students (Barnett-Clark et al., 2010; Chval et al., 2013; Idris & Narayanan, 2011; Morge, 2011; Razak et al., 2011).
Apart from facilitating teachers in providing examples, the pool of items developed in this study also has several other potential uses. For example, at the end of the topic, teachers may use the pool of items for diagnostic purposes. That is, teachers can assemble some of the items to form a diagnostic test and then chart students' performance for each learning standard. Using this approach, teachers can diagnose each student's strengths and weaknesses with regard to the specified learning standards. Teachers can then plan a more focused intervention based on the diagnostic information.
For students who demonstrated a high level of proficiency, teachers can provide enrichment activities such as assigning more challenging items from the item pool. Also, teachers can develop different forms of tests from the pool so that, even though students do not sit for the same test, their performance can still be compared. For instance, a teacher might choose ten items from the pool to create a short test on Rational Numbers to be administered to one particular class. The teacher can then select different items of similar difficulty to create another test for another class. Since all items were calibrated on a common scale, the performance of students in both classes can still be compared despite them answering different sets of test items (Holland & Dorans, 2006). This practice also effectively helps maintain the security of the test items.
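Comparing students who sat different test forms works because each ability is estimated against item difficulties already on the common scale. A maximum-likelihood sketch for the dichotomous Rasch model follows; the item difficulties and scores are illustrative, and extreme scores (0 or full marks), which have no finite estimate, are excluded:

```python
import math

def estimate_ability(difficulties, score, iters=50):
    """Newton-Raphson ML estimate of ability theta for a non-extreme raw
    score on dichotomous Rasch items with known (calibrated) difficulties."""
    assert 0 < score < len(difficulties), "extreme scores have no finite MLE"
    theta = 0.0
    for _ in range(iters):
        ps = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        info = sum(p * (1 - p) for p in ps)   # test information at theta
        theta += (score - sum(ps)) / info     # Newton step toward sum(ps) = score
    return theta

# Two classes answer different (hypothetical) item sets drawn from one
# calibrated pool; the resulting thetas are directly comparable.
theta_a = estimate_ability([-1.0, 0.0, 1.0], 2)   # 2/3 on a moderate set
theta_b = estimate_ability([0.5, 1.0, 1.5], 2)    # 2/3 on a harder set
print(theta_b > theta_a)  # True: same raw score on harder items -> higher theta
```

This is exactly why the same raw score can mean different things on different forms, yet the logit estimates remain on one scale.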

CONCLUSION
The current study described the process of developing a calibrated pool of items for the topic of Rational Numbers to facilitate teachers in giving examples in classroom learning. We were able to pool 71 high-quality test items with varying degrees of difficulty that were calibrated on a common scale. We were also able to provide evidence of the validity and reliability of all items in measuring students' ability in rational numbers. Subsequently, we presented indications that the difficulty measure of each item is highly reproducible when subsets of the pool items are administered to other groups of students. We also furnished some guidelines on how teachers could use the pool of items for giving examples as well as for other assessment purposes.
Whilst several encouraging outcomes were demonstrated, the present study is bounded by a few limitations. Firstly, this study rests on a strong assumption that the test items were administered to a relatively homogenous sample of students. We would have little knowledge about the results of the calibration if the samples were drawn from a heterogeneous population. Secondly, even with the large sample of items tested in this study, we believe that the items still represent a limited sample of stimuli for measuring students' ability in rational numbers. As such, there is still a need to develop more items before the pool can maximize its capability for formative purposes, especially with regard to learning standards 1.1.1, 1.2.2, 1.3.4, 1.4.4, and 1.5.1, since we believe each of these standards should be measured by at least three items.
Author contributions: All authors have sufficiently contributed to the study, and agreed with the results and conclusions.