Psychometric Features of the General Teacher Test under the D-Scoring Model: The Case of Teacher Certification Assessment in Saudi Arabia

Teachers' knowledge and skills on the general standards of the Saudi National Professional Teacher Standards are assessed with the General Teacher Test (GTT), administered by the National Center for Assessment (NCA) in Saudi Arabia. This paper examines the psychometric features of the GTT in the framework of a relatively new approach to test scoring, referred to as the D-scoring model, which is used with assessments at the NCA. The stability of these features across four test forms of the GTT is also examined. The study findings provide valuable information about the accuracy of GTT scores and the validity of their interpretation and of decisions regarding the licensure of teachers.


Introduction
There is a significant policy focus on the human capital of teachers in Saudi Arabia. This is motivated both by the Saudi "Vision 2030" blueprint to modernize the country's economy and society (http://vision2030.gov.sa/en) and by the substantial body of empirical evidence showing the importance of teacher quality for student achievement (Aaronson, Barrow, & Sander, 2007; Goldhaber & Hansen, 2013). One way that Saudi Arabia tries to ensure a high-quality teacher workforce is by requiring teacher candidates to pass licensure tests, often of both their general education skills and their content knowledge, as a requirement for receiving a teaching license.
The D-score of examinee s, denoted D_s, is computed as

D_s = (δ_1·X_s1 + δ_2·X_s2 + ... + δ_n·X_sn) / (δ_1 + δ_2 + ... + δ_n),   (1)

where X_si is the score (1/0) of person s on item i and δ_i is the scaled difficulty (δ-value) of item i on the D-scale. Clearly, 0 ≤ D_s ≤ 1, with D_s = 0 if the answers to all items are incorrect (X_si = 0; i = 1, ..., n) and D_s = 1 if all answers are correct (i.e., X_si = 1; i = 1, ..., n). The D-score of an examinee can be interpreted as the proportion of the ability required for total success on the test that is demonstrated by the examinee. The same interpretation holds when Equation 1 is applied, say, to test items grouped by content domains, thus allowing for valid comparison of the examinees' performance on the entire test and on its content domains.
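The computation of D-scores described in Equation 1 can be sketched as follows. This is a minimal illustration only; the δ-values and the binary response matrix are made-up, not GTT data:

```python
import numpy as np

# Hypothetical scaled item difficulties (delta-values on the 0-1 D-scale)
delta = np.array([0.2, 0.35, 0.5, 0.65, 0.8])

# Hypothetical binary response matrix: rows = examinees, columns = items
X = np.array([
    [1, 1, 1, 1, 1],   # all answers correct   -> D = 1
    [0, 0, 0, 0, 0],   # all answers incorrect -> D = 0
    [1, 1, 1, 0, 0],   # only the easier items correct
])

# Equation 1: D_s = sum(delta_i * X_si) / sum(delta_i)
D = X @ delta / delta.sum()
print(D)
```

Note that the third examinee earns credit for δ-weight 0.2 + 0.35 + 0.5 = 1.05 out of a possible 2.5, i.e., a D-score of 0.42, matching the "proportion of ability required for total success" interpretation.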

Item-Person Map on the Delta Scale
With examinees' D-scores obtained via Equation 1, an item-person map (IPM) can be built by placing the D-scores and the expected item difficulties (δ-values) on the common D-scale. To illustrate, Figure 1 shows the IPM obtained with one of the four test forms (TFs) of the GTT examined in this study. This test form, denoted TF1, consists of 75 dichotomously scored items (1 = correct response, 0 = incorrect response). The interpretation of the IPM on the D-scale is similar to the interpretation of IPMs in the framework of item response theory (IRT). As shown in Figure 1, there is a relatively good overlap between the range of examinees' D-scores and the range of expected item difficulties, δ_i (i = 1, ..., 72). That is, there is a good match between the examinees' ability measured by the test and the difficulties of its items.

Item Response Function on the D-Scale
Under the D-scoring method, the probability of a correct response on item i by person s, given the score D_s of that person on the delta scale (from 0 to 1), is estimated as a predicted item score, P̂_si, with the use of the following two-parameter logistic regression (2PLR) model:

P̂_si = exp[a_i(D_s − b_i)] / {1 + exp[a_i(D_s − b_i)]},   (3)

where D_s is the known independent variable (predictor), obtained via Equation 1, whereas a_i (slope) and b_i (location) are regression coefficients. In fact, P̂_si is the true score on item i for a person with score D_s (see Dimitrov, 2017). To illustrate, the item response functions (IRFs) of three items are depicted in Figure 2. These items are selected from the GTT test form denoted TF1, one of the four test forms of the GTT examined in this study, with IRF parameters (a = slope, b = location) given in Table 4 as follows: item 4 (a = 3.3576, b = 0.5098), item 7 (a = 5.0000, b = 0.7992), and item 8 (a = 2.5760, b = 0.2667).
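The predicted item scores for the three illustrative items can be computed as sketched below, assuming the standard 2PL slope-location form P = 1 / (1 + exp(−a(D − b))); the a, b values are those reported above for items 4, 7, and 8 of TF1:

```python
import numpy as np

def irf(D, a, b):
    """Two-parameter logistic IRF on the D-scale: P = 1 / (1 + exp(-a(D - b)))."""
    return 1.0 / (1.0 + np.exp(-a * (D - b)))

# Slope (a) and location (b) parameters reported for three TF1 items
items = {"item 4": (3.3576, 0.5098),
         "item 7": (5.0000, 0.7992),
         "item 8": (2.5760, 0.2667)}

for name, (a, b) in items.items():
    # At D = b the predicted probability of a correct response is exactly 0.5,
    # so b locates the item's difficulty on the D-scale
    print(name, round(float(irf(b, a, b)), 3), round(float(irf(0.9, a, b)), 3))
```

A steeper slope (e.g., item 7's a = 5.0) means the predicted score rises more sharply around D = b, i.e., the item discriminates more strongly between examinees just below and just above its location.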

True Values and Standard Errors of D-Scores
Note that P̂_si, obtained via Equation 3, is the "true" (expected) value of the observed binary score X_si for persons with a score D_s on the D-scale. On the other side, the "true" value of the observed D-score, denoted T(D_s), is obtained via Equation 1 by replacing the observed scores X_si with their expected values, P̂_si. That is,

T(D_s) = (δ_1·P̂_s1 + δ_2·P̂_s2 + ... + δ_n·P̂_sn) / (δ_1 + δ_2 + ... + δ_n).   (4)

The error associated with D_s, denoted e(D_s), is the difference between the score D_s and its expected (true) value:

e(D_s) = D_s − T(D_s).   (5)
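A sketch of computing the true value of a D-score and the associated error, assuming the 2PL form for the expected item scores; all δ-values, slopes, locations, and the observed D-score are made-up illustration values:

```python
import numpy as np

delta = np.array([0.2, 0.35, 0.5, 0.65, 0.8])   # hypothetical delta-values
a = np.array([3.0, 2.5, 4.0, 3.5, 2.8])         # hypothetical IRF slopes
b = np.array([0.2, 0.35, 0.5, 0.65, 0.8])       # hypothetical IRF locations

D_s = 0.42                                      # observed D-score of one person

# Expected item scores for this person under the 2PLR model (Equation 3)
P_hat = 1.0 / (1.0 + np.exp(-a * (D_s - b)))

# Equation 4: true D-score = Equation 1 with expected scores in place of X
T = (delta * P_hat).sum() / delta.sum()

# Equation 5: error = observed D-score minus its true (expected) value
e = D_s - T
print(round(float(T), 3), round(float(e), 3))
```

Repeating this for all examinees gives the error distribution whose standard deviation serves as the standard error of the D-scores discussed later.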

Testing for Item Fit
The testing for item fit under the 2PLR model (see Equation 3) is performed with the use of the mean absolute difference (MAD) statistic:

MAD_i = (1/n) Σ_{k=1}^{n} |p_ik − P̄_ik|,

where n is the number of bins (intervals) that cover the range of the D-scale (from 0 to 1), p_ik is the observed proportion of correct responses on item i for the examinees with D-scores in bin k, and P̄_ik is the average probability of a correct response on item i for the examinees with D-scores in bin k (this probability is estimated via Equation 3, with D_s being the midpoint of bin k). Typically, the D-scale is divided into 10 bins (n = 10), with the range of each bin equal to 0.1, but other approaches to "binning" can be used to make sure that there are enough examinees in each bin. In a simulation study on testing for item fit under the D-scoring model, Dimitrov and Luo (2017) found that MAD performs well with a cutting score of 0.07. Specifically, with MAD ≥ 0.07 indicating item misfit, the Type I error rate is 0.019 (i.e., a 1.9% chance that a fitting item is flagged as misfitting), whereas the Type II error rate is 0.061 (i.e., a 6.1% chance that a misfitting item is flagged as fitting).
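The MAD computation described above can be sketched as follows, using 10 equal bins of width 0.1; the item parameters, D-scores, and responses are simulated (the item fits by construction), not GTT data:

```python
import numpy as np

rng = np.random.default_rng(0)

a, b = 3.0, 0.5                      # hypothetical IRF parameters of one item
D = rng.uniform(0, 1, 2000)          # hypothetical D-scores of 2000 examinees
p_model = 1 / (1 + np.exp(-a * (D - b)))
X = rng.binomial(1, p_model)         # simulated binary responses (item fits by design)

bins = np.linspace(0, 1, 11)         # 10 bins of width 0.1 covering the D-scale
mid = (bins[:-1] + bins[1:]) / 2     # bin midpoints
k = np.clip(np.digitize(D, bins) - 1, 0, 9)   # bin index of each examinee

# Observed proportion correct per bin vs. model probability at the bin midpoint
obs = np.array([X[k == j].mean() for j in range(10)])
exp = 1 / (1 + np.exp(-a * (mid - b)))

mad = np.abs(obs - exp).mean()
print(round(float(mad), 3), "misfit" if mad >= 0.07 else "fit")
```

Because the responses are generated from the same 2PL curve that is used for the expected proportions, the resulting MAD is small and the item is (correctly) not flagged under the 0.07 cutting score.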

Data
The data in this study come from the scores of teachers on four different test forms (TF1, TF2, TF3, and TF4) of the GTT.

Variables and Measures
Each test form (TF1, TF2, TF3, and TF4) consists of dichotomously scored items (1 = correct response, 0 = incorrect response) that measure teachers' knowledge and skills on the general standards under the Saudi National Professional Teacher Standards.

Statistical Analysis
In line with the purpose of this study, the analysis examines key psychometric features of the GTT with the use of customized modules incorporated in the computer program SATSE (Atanasov & Dimitrov, 2016), which is used with assessments at the NCA.

Reliability Across Test Forms
As the D-score of a person is a linear combination of that person's binary scores (1/0) on the test items (i = 1, ..., n), the reliability of D-scores is the same as the reliability of the raw score on the test (the number of correct responses) (Dimitrov, 2016). In this study, the reliability, ρ, of the test scores was estimated via the latent variable modeling (LVM) approach using the computer program Mplus (Muthén & Muthén, 2010; see, e.g., Dimitrov, 2012, pp. 186-188; Raykov, 2007; Raykov, Dimitrov, & Asparouhov, 2010).
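The study estimates ρ via LVM in Mplus; as a rough classical illustration only (not the study's method), coefficient alpha of the raw sum score can be computed from made-up binary data in which the items share a common ability factor:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical binary response matrix: 500 examinees x 20 items,
# generated so that all items load on a common "ability" factor (theta)
theta = rng.normal(0, 1, (500, 1))
X = (theta + rng.normal(0, 1, (500, 20)) > 0).astype(int)

# Cronbach's alpha for the sum score -- a classical lower bound to
# reliability, used here only as a stand-in for the LVM estimate
n_items = X.shape[1]
item_var = X.var(axis=0, ddof=1).sum()
total_var = X.sum(axis=1).var(ddof=1)
alpha = n_items / (n_items - 1) * (1 - item_var / total_var)
print(round(float(alpha), 2))
```

Since the D-score is a fixed linear combination of the same item scores, any reliability estimate for the raw sum score applies to the D-scores as well, which is the point made in the paragraph above.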

δ-Values Across Test Forms
The stability of δ-values across the test forms (TF1, TF2, TF3, and TF4) is examined in terms of statistics such as range, mean, standard deviation, and the correlation between the δ-values of the anchor items for any two adjacent test forms in the linking sequence TF4 → TF3 → TF2 → TF1. The "anchors" are items common to adjacent test forms in this linking sequence, with TF1 being the "base" test form.
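The stability statistics described above can be sketched as follows; the δ-values here are made-up illustrations, not the GTT anchor items:

```python
import numpy as np

# Hypothetical delta-values of five anchor items shared by two adjacent forms
delta_tf2 = np.array([0.31, 0.44, 0.52, 0.60, 0.71])
delta_tf1 = np.array([0.30, 0.45, 0.53, 0.58, 0.72])

# Range, mean, and standard deviation of the anchor delta-values per form
for name, d in (("TF2", delta_tf2), ("TF1", delta_tf1)):
    print(name, float(d.min()), float(d.max()),
          round(float(d.mean()), 3), round(float(d.std(ddof=1)), 3))

# A high correlation (with near-equal values) supports comparing D-scores
# across the two forms without rescaling the delta-values
r = np.corrcoef(delta_tf2, delta_tf1)[0, 1]
print(round(float(r), 3))
```

Repeating this check for each adjacent pair in the linking sequence (TF4-TF3, TF3-TF2, TF2-TF1) is what establishes the chain back to the base form TF1.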

D-Scores Across Test Forms
As shown in the previous section, the practical equality of δ-values for the anchor items in any pair of adjacent test forms allows for valid comparisons of D-scores across the test forms. The results in Table 2 and Figure 4 show that the distribution of D-scores in terms of shape, range, mean, and standard deviation is quite stable across the test forms TF1, TF2, TF3, and TF4.
Thus, one can treat the study groups of teachers who took these forms as practically equivalent in the ability measured by the GTT. With this, the mean D-score indicates that, on average, the teachers demonstrated about 40% of the ability required for total success on the GTT.

D-Score Errors Across Test Forms
The errors associated with the D-scores are evaluated here with their standard error, SE(D_s) (see Equation 5), and the correlation between the D-scores and their true values, T(D) (see Equation 4). The results are presented in Table 3 and depicted in Figures 5 and 6. As given in Table 3, the mean standard error is stable and very small (about 0.05) across all four test forms (TF1, TF2, TF3, and TF4). From a different perspective, this finding is supported by the very high (almost perfect) positive correlations between the D-scores and their true values (see Table 3 and Figure 6). Note. r = correlation between the D-scores and their true values, T(D).

Item-Person Maps Across Test Forms
As noted earlier, the item-person map (IPM) provides information about the match between the item difficulties (δ-values) and examinees' ability levels on the D-scale. The IPMs depicted in Figure 7 show that there is a relatively good overlap between the range of examinees' D-scores and the range of expected item difficulties, δ_i (i = 1, ..., 72), with the nature of this overlap being consistent across the test forms TF1, TF2, TF3, and TF4. That is, there is a consistent and similar match between the examinees' ability measured by the GTT and the item difficulties across the four test forms. However, one may also notice that there are items with difficulty higher than 0.75 (δ > 0.75) on the D-scale, but no examinees with ability scores in that range. At the same time, there are not enough items with difficulties in the range from, say, 0.3 to 0.4 on the D-scale, whereas the highest frequency of examinees falls within this range. Therefore, test developers may decide to use more items with difficulty between 0.3 and 0.4 at the expense of items with difficulty higher than 0.75 on the D-scale. Note. A = anchor item for the two test forms; * = misfitting item (MAD ≥ 0.07).
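The kind of gap analysis described above, comparing where examinees and items sit on the D-scale, can be sketched by binning both distributions; the D-scores and δ-values below are simulated stand-ins, not the GTT data:

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.normal(0.4, 0.1, 1000).clip(0, 1)   # hypothetical examinee D-scores
delta = rng.uniform(0.1, 0.9, 75)           # hypothetical item delta-values

# Count examinees and items falling into each 0.1-wide bin of the D-scale
bins = np.linspace(0, 1, 11)
people, _ = np.histogram(D, bins)
items, _ = np.histogram(delta, bins)

for lo, hi, p, i in zip(bins[:-1], bins[1:], people, items):
    flag = " <- many examinees, consider adding items here" if p > 200 else ""
    print(f"{lo:.1f}-{hi:.1f}  examinees={p:4d}  items={i:2d}{flag}")
```

Bins with many examinees but few items (or items in bins with no examinees) point to exactly the kind of redistribution of item difficulties that the paragraph above recommends to test developers.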

Discussion and Conclusion
Assessments for teachers' certification are conducted by the NCA in Saudi Arabia with the use of multiple test forms of a General Teacher Test (GTT) and specialty tests in academic areas such as math, chemistry, physics, and so forth. The GTT, which is of interest in this study, is examined here on important psychometric features and their consistency across four test forms (TF1, TF2, TF3, and TF4).
Dependable psychometric features and their generalizability across multiple test forms are key to the accuracy of the test results and their valid interpretation for the intended purposes of the assessment. It should also be noted that the psychometric analyses were conducted in the measurement framework of the D-scoring model, which is under implementation in the assessment practice and research at the NCA (Dimitrov, 2016, 2017). Prior to the implementation of the D-scoring method in the assessment practice at the NCA, the analysis of test results was conducted in the framework of item response theory (IRT). Therefore, the examination of the psychometric features of the tests, including the GTT, was also conducted in the framework of IRT (e.g., Dimitrov & Al-Sadaawi, 2014, 2015). For this reason, the results obtained in this study, related to psychometric features of the GTT under the D-scoring method, are not directly comparable to those obtained in previous studies using IRT. It should be noted that in both scenarios (IRT and D-scoring) the psychometric features of the GTT were in support of its validity and reliability. However, unlike previous IRT-based studies of the GTT, the present study provides valuable information about the stability of psychometric features across different test forms of the GTT, thus providing support for the generalizability aspect of the validity of GTT data.
There are several main findings that stem from the results of this study. First, the GTT scores on the D-scale are sufficiently accurate for the intended purposes of the test, as indicated by their high reliability and the small errors of the D-scores, both of which were found to be very stable across the four test forms examined in this study. Second, the difficulty of the test forms was moderate (close to the average on the D-scale) and stable across the test forms in terms of basic statistics (range, mean, and standard deviation), for both the entire test forms and the sets of anchor items for any pair of adjacent test forms in the adopted linking sequence TF4 → TF3 → TF2 → TF1. Also, the δ-values of the anchor items are highly correlated and practically equal for each pair of test forms, which validates their comparison (and that of the resulting D-scores) without the necessity of rescaling the δ-values (e.g., see Dimitrov, 2017). Third, the number of items flagged for misfit is relatively small, with 3, 6, 3, and 1 misfitting items in test forms TF1, TF2, TF3, and TF4, respectively. It is important to note in this regard that none of the anchor items in the test forms is flagged for misfit. In conclusion, the findings of this study support the psychometric validity of the GTT for the intended use of this test in the framework of the D-scoring model adopted in the assessment practice at the NCA in Saudi Arabia. Also, the methodology used in this study can be useful to researchers in their work on the validation of test data under the D-scoring model.