Chapter 5

Making the Most of Your Assessment: Analysis of Test Data in jMetrik

Alexey Leontyev,*,1 Steven Pulos,2 and Richard Hyslop3

1Department of Chemistry, Computer Science, and Mathematics, Adams State University, Alamosa, Colorado 81101, United States
2School of Psychological Sciences, University of Northern Colorado, Greeley, Colorado 80639, United States
3Department of Chemistry and Biochemistry, University of Northern Colorado, Greeley, Colorado 80639, United States
*E-mail: [email protected].

The chapter provides an overview of the jMetrik program designed to analyze test data. We used a dataset of students’ responses to the Stereochemistry Concept Inventory to illustrate the functionality of jMetrik. Steps of data analysis in jMetrik include uploading data, scoring the test, scaling the test, and analysis of the test at the scale, item, and distractor level. The following chapter provides step-by-step guidance for the use of jMetrik software.

Introduction

Just as chemists need high-quality measurement tools, chemistry education researchers find themselves in situations where they need instruments that produce reliable data and support valid inferences. However, while measurements in chemistry are objective and can be made by direct observation (e.g., melting point), in educational research variables are latent and cannot be measured through direct observation. Quite often a high-quality measurement tool is crucial, for example, in quasi-experimental and experimental studies in which the performances of two or more groups are compared to determine the effect of a pedagogical intervention. The importance of high-quality assessment has been emphasized in the chemical education research literature in recent years (1, 2). Several reviews (2, 3) addressed best practices for writing multiple-choice items that may be beneficial to both researchers and practitioners in the field of chemistry education. Similar to chemistry, which is an experimental science, in chemistry education the evidence that supports the quality of multiple-choice items is obtained experimentally, by repeated administration of tests and analysis of the data sets that students produce. Towns (2) suggested using item analysis to improve item writing. This chapter covers jMetrik, a test analysis program that can be used to conduct item analysis and examine multiple estimators of the quality of multiple-choice items. We used student responses to the Stereochemistry Concept Inventory to show how jMetrik works and the types of analyses that can be performed. The Stereochemistry Concept Inventory was developed to assess organic chemistry students' knowledge of stereochemistry. It consists of 20 trial-tested multiple-choice questions that assess the most important aspects of stereochemistry. Distractors are based on students' misconceptions previously identified in a qualitative study. An example of an item from the Stereochemistry Concept Inventory is presented in Figure 1. We collected data from 439 students from various institutions across the United States. In this chapter, we show how various stages of the data analysis of the Stereochemistry Concept Inventory can be performed in jMetrik and what inferences can be drawn from the results.

Figure 1. Item 18 from the Stereochemistry Concept Inventory. Response options A, B, and C are distractors that represent misconceptions, while response option D is the correct answer.

jMetrik Overview

jMetrik is free software developed and routinely updated by J. Patrick Meyer from the Curry School of Education at the University of Virginia. The software is intended for anyone interested in the analysis of test data. The types of analysis that can be done in jMetrik range from basic (e.g., calculation of the sum score) to advanced (e.g., test equating), which makes jMetrik a suitable tool for a broad audience. The most recent version of jMetrik (4.0.3) can be downloaded from www.ItemAnalysis.com. jMetrik can be used on Mac, PC, or Linux, which is one of many advantages of this software compared to other software packages for test analysis. Analysis of data in jMetrik consists of sequential steps that are summarized in the scheme presented in Figure 2.


Figure 2. Data analysis steps in jMetrik include importing data, item scoring, test scaling, and test analysis at the scale, item, and distractor level.

While this overview is not meant to be comprehensive, it covers the mechanics of the main steps of data analysis in jMetrik and provides several examples of data interpretation. For more detailed information, we refer readers to the book by the program's developer (4), which provides comprehensive coverage of jMetrik functionality.

Analysis of Data in jMetrik

Preparing Data for the Analysis in jMetrik

The dataset should be organized with participants (cases) in rows and variables (items) in columns. Data in this format can be obtained from reading Scantron sheets. When a test is administered without Scantron sheets, data in this format have to be entered manually. As tedious as this task can be, it can produce rewarding insights into your data. Responses to web-based surveys administered through Qualtrics or SurveyMonkey can be downloaded in the desired format, which allows analysis in jMetrik. Data may also be imported from any spreadsheet or statistical software package by simply saving the file in .csv format. A dataset for analysis can contain any number of cases and up to 1024 variables. It is unlikely that a single test would contain this many questions; however, combining responses on several tests may yield datasets with a large number of variables. You are quite likely to add variables to your dataset over the course of the analysis, so it is advisable to use fewer than 1024 variables in the initial dataset uploaded to jMetrik. The dataset may contain student identification information in a separate column if you plan to obtain and report individual scores or percentiles; usually, the first column is used for this purpose. The first row may contain variable names, which are case insensitive. The dataset may contain letters or numbers for response options. Scantron software produces data in letter format (responses are coded as A, B, C, D, and E), while web-based surveys produce data in number format (responses are coded as 1, 2, 3, etc.). Since both formats can be handled, recoding is rarely necessary. However, it is important to use lowercase or uppercase letters consistently.
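As an illustration, the short Python snippet below builds a small dataset in the expected layout (hypothetical student IDs, item names, and responses) and saves it as a comma-delimited .csv file that jMetrik can import. It is only a sketch of the file format, not part of jMetrik itself.

```python
import pandas as pd

# Cases (students) in rows, items in columns; the first column holds student
# IDs, and the first row holds variable names. Responses use uppercase letters
# consistently.
responses = pd.DataFrame(
    {
        "id": ["S001", "S002", "S003"],
        "q1": ["A", "C", "C"],
        "q2": ["B", "B", "D"],
        "q3": ["C", "A", "B"],
    }
)

# Save without the pandas row index so the file contains only IDs and items.
responses.to_csv("stereochem_responses.csv", index=False)
```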

Importing Data into jMetrik


Since jMetrik stores data in a database format, you first need to create a database to hold your data. When you first start jMetrik, you will see a screen similar to the one shown in Figure 3.

Figure 3. Starting window of jMetrik.

Click Manage → New Database to create a database. You will see a popup window (Figure 4).

Figure 4. Create New Database popup window.

Type in the name for your database and press the Create button. It is advisable to keep datasets from the same test in one database to allow easy navigation between subsets of data. Click Manage → Open Database and select the database that you just created. You are now in the newly created database, which is empty at this point. To import data, click Manage → Import Data. You will see the dialog box illustrated in Figure 5.


Figure 5. Import Data dialog box.

To find your file, click the Browse button and you will see your file directory (Figure 6), where you need to find the .csv or .txt file you want to upload, as well as specify the type of delimiter used in your data (tab, comma, semicolon, or colon). You should also specify whether the first row in your dataset includes variable names.

Figure 6. Import Data file directory dialog box.

Once you find the file you want to import, click on it and its name will appear in the File Name window; then hit return and you will be returned to the previous window of the Import Data dialog box. Be sure to specify a name for the file in the Table Name window (Figure 5). Once you have done this, simply press the Import button.


Scoring Data in jMetrik

At this point the responses are just variables, and jMetrik does not recognize which responses are correct and which are not. You need to score the items. Two types of scoring are possible in jMetrik. Basic scoring is used for binary and polytomous items. Advanced scoring is used when several correct answers are possible or when you wish to award partial credit for certain responses. Another advantage of advanced item scoring is that you can produce syntax with the answer key and reuse it if you plan to have multiple databases or tables from the same test. This is an especially useful feature when data are collected at multiple sites and you plan to analyze the data and produce reports separately for each site. To do basic scoring, click Transform → Basic Item Scoring. Enter the correct answers in the first row and the total number of response options in the second row. Use Tab or Enter to switch between cells. Figure 7 shows the dialog box for basic item scoring. For example, items q1, q4, q7, and q8 have four response options and C is the correct answer, while item q3 has only three response options and B is the correct option.

Figure 7. Basic item scoring of responses to the Stereochemistry Concept Inventory. The top line contains the correct responses. The bottom line contains the number of possible responses.

At the bottom of the dialog box, there are options where you need to indicate how you want to handle omitted and not-reached responses. Omitted responses are usually scored as zero; however, you may score them separately by assigning them to a special category. Not-reached responses occur when participants stop answering questions partway through the test, which may happen with timed tests. jMetrik allows you to handle missing, omitted, and not-reached responses distinctly. There is no universal solution for these categories; you should base your decision on the nature of your dataset, the conditions under which the test was administered, and the type of analysis that you wish to perform. A fairly comprehensive description of handling missing data in education research is available (5).
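Conceptually, basic dichotomous scoring amounts to comparing each response to an answer key and recording 1 for a match and 0 otherwise. A minimal sketch of this idea in Python is shown below (the key, item names, and file are hypothetical; jMetrik performs this step internally through the dialog described above).

```python
import pandas as pd

# Hypothetical answer key for five items; in jMetrik this key is entered
# in the Basic Item Scoring dialog rather than written as code.
key = {"q1": "C", "q2": "B", "q3": "B", "q4": "C", "q5": "A"}

responses = pd.read_csv("stereochem_responses.csv")

# Score each item: 1 if the response matches the key, 0 otherwise.
# Missing responses come out as 0 here, i.e., they are treated like omitted items.
scored = pd.DataFrame(
    {item: (responses[item] == correct).astype(int) for item, correct in key.items()}
)
print(scored.head())
```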


Scoring polytomous items (e.g., responses to Likert-type scales) is done differently. In this case, in the first row put “+” for items that are scored in ascending order or “–” for items that are scored in reverse order. The second row should still contain the number of response options. When you want to award partial credit, or your multiple-choice test has several correct responses, basic scoring is not sufficient. To do advanced scoring, click Transform → Advanced Item Scoring. In the dialog box that appears, provide scores for all response options. Then select the items that match the key you entered and click Submit. The key will appear in the Scoring Syntax box. Score all items and notice that they become bold. Figure 8 shows an example of advanced item scoring for items q1, q4, q7, q8, q9, and q14 of the Stereochemistry Concept Inventory. These items have four response options (A, B, C, and D), where response C is the correct answer and is awarded 1 point. Responses A, B, and D are distractors and are thus awarded 0 points.

Figure 8. Advanced item scoring dialog box. The correct answer (response option C) is awarded 1 point, and distractors (A, B, and D) are awarded 0 points. Items with the same key are selected.

Note that the bottom of the Advanced Item Scoring dialog box contains the Scoring Syntax window. All scoring commands are recorded in this box. If you plan to reuse the key for another dataset, you can save the syntax to a separate file and reuse it. This approach is particularly useful when data on performance on the same test are collected from multiple sites.


To view which items are scored and their keys, go to the Variables tab. Occasionally, this requires clicking the “refresh data view” button in the top panel. In the Variables tab you can see which variables the program recognizes as items (binary or polytomous) and the scoring key for each item (Figure 9).

Figure 9. Variables tab contains information on which variables are treated as items and the scoring keys for all items.

Scaling of Test

Before running any analysis, the test should be scaled. In other words, responses on individual items need to be converted into meaningful aggregate numbers. Combining individual responses into an aggregate score is called scaling. Traditionally, in the educational setting, cognitive tests are scaled by calculating the sum score. To do this, click Transform → Test Scaling, then select the items that you want to include in the total score and the type of total score. You also need to name the new variable that will be added to your database as a result of scaling. Figure 10 shows the Test Scaling dialog box.


The sum score is the most commonly used composite, but it is not the only one. The current version of jMetrik (4.0.3) also allows computing average scores, Kelley scores, percentile ranks, and normalized scores. A test can be scaled in multiple ways. If you go to the Data tab, you will see that the composite scores are added at the very end of the table.

Figure 10. Test Scaling dialog box. Items that are used to produce the composite scores should be moved into the empty box. The new variable must be named, and the type of composite score should be selected.
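As a point of reference, the two most familiar composites, the sum score and the percentile rank, can be sketched in a few lines of Python (hypothetical scored data; jMetrik computes these through the Test Scaling dialog, and its exact algorithms may differ in detail).

```python
import pandas as pd

# A small 0/1 scored item matrix (rows = examinees, columns = items),
# hypothetical and analogous to the output of the scoring sketch above.
scored = pd.DataFrame(
    {"q1": [1, 0, 1], "q2": [1, 1, 0], "q3": [0, 1, 1], "q4": [1, 1, 1], "q5": [0, 0, 1]}
)

# Sum score: total number of correct responses per examinee.
sum_score = scored.sum(axis=1)

# Percentile rank: percentage of examinees scoring below a given score,
# plus half of those scoring exactly that score (one common definition).
def percentile_rank(x, scores):
    return 100.0 * ((scores < x).sum() + 0.5 * (scores == x).sum()) / len(scores)

percentiles = sum_score.apply(lambda x: percentile_rank(x, sum_score))
print(pd.DataFrame({"sum_score": sum_score, "percentile_rank": percentiles}))
```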


Figure 11. Item Analysis dialog box. Items for analysis should be dragged into the empty box.

Item Analysis in jMetrik

jMetrik provides insights into the properties of the scale, the properties of individual items, and the properties of distractors. Most of that information can be obtained through item analysis. To run item analysis, go to Analyze → Item Analysis. This opens the dialog box presented in Figure 11, where you need to select the items that you want to analyze. The Options tab presents multiple options for the analysis. The most important here are Compute item statistics, which yields difficulty and discrimination indices, and All response options, which reports these parameters for every response option, including distractors. Both options are selected by default. If you want to run item analysis only for the correct options, you need to uncheck the All response options checkbox. By default, jMetrik adjusts for spuriousness. The correction for spuriousness removes the variance due to the item itself from the total score when computing item-total correlations. We endorse using this correction, especially for tests with a small number of items. After you run the analysis, you will see the output shown in Figure 12. The output includes test-level statistics such as the total number of items, the number of examinees, minimum and maximum scores, mean, median, standard deviation, interquartile range, skewness, kurtosis, and the Kuder-Richardson coefficient 21. These statistics can be instrumental in examining the shape of the score distribution and its normality.

Figure 12. Output for the scale level analysis and reliability estimates.

jMetrik provides multiple estimates of reliability (Guttman, Cronbach, Feldt-Gilmer, Feldt-Brennan, and Raju, as seen in Figure 12), which may be used to estimate the amount of measurement error in an instrument. Reliability coefficients estimate the degree to which test scores are consistent with another hypothetical set of scores obtained by a similar process. Different estimates rest on different underlying assumptions and differ in ease of calculation. Coefficient alpha is appropriate when responses at least satisfy the assumptions of the essentially tau-equivalent model, while the Feldt-Gilmer and Feldt-Brennan estimates are suitable for congeneric models (4, 6).
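For readers who want to see what two of the coefficients in Figure 12 look like computationally, here is a minimal sketch, assuming a pandas DataFrame of 0/1 scored items (hypothetical data). It implements the textbook definitions of coefficient alpha and Guttman's λ2 and is not jMetrik's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 scored item matrix (rows = examinees, columns = items).
scored = pd.DataFrame(
    {"q1": [1, 0, 1, 1, 0], "q2": [1, 1, 0, 1, 0],
     "q3": [0, 1, 1, 1, 1], "q4": [1, 1, 1, 0, 0]}
)

k = scored.shape[1]
item_var = scored.var(ddof=1)                # item variances
total_var = scored.sum(axis=1).var(ddof=1)   # variance of the sum score

# Coefficient alpha: k/(k-1) * (1 - sum of item variances / total variance).
alpha = k / (k - 1) * (1 - item_var.sum() / total_var)

# Guttman's lambda-2: adds a term based on the squared inter-item covariances.
cov = scored.cov().to_numpy()
off_diag_sq = (cov ** 2).sum() - (np.diag(cov) ** 2).sum()
lambda2 = 1 - item_var.sum() / total_var + np.sqrt(k / (k - 1) * off_diag_sq) / total_var

print(f"alpha = {alpha:.3f}, lambda2 = {lambda2:.3f}")
```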


While coefficient alpha is the most commonly used estimator, it is unlikely that the underlying assumptions of the essentially tau-equivalent model are met. A more appropriate choice would often be Guttman's L2, which estimates a lower bound to the reliability of a test under the assumption that the scores are congeneric (4, 6). Multiple estimates of reliability provide better insight into the quality of the scale than a single one. One of the unique features of jMetrik is that confidence intervals are computed for all reliability coefficients, which provides insight into how much trust can be placed in a given reliability estimate.

If you select Compute item statistics, the output will also contain an item analysis for the selected items. The item analysis output is organized as a table (Figure 13) containing item numbers, all response options and their scores, difficulty, standard deviation, and the discrimination index. For example, item q1 from the Stereochemistry Concept Inventory had a difficulty value of 0.6424, meaning that 64% of students chose the correct answer. Difficulty values for distractors simply indicate the fraction of students who selected them. The discrimination index for item q1 was 0.1529, indicating that high-scoring students tend to select the correct option, while low-scoring students are more likely not to select it. The discrimination indices for the distractors are all negative, suggesting that low-performing students are more likely than high-performing students to select them. Generally, the higher the discrimination index, the better the item differentiates between low- and high-performing students. Popham (7) suggested cut-off values for revising items based on the discrimination index.

jMetrik also allows item analysis at the distractor level. While examining the percentages of students who endorsed distractors can be important for estimating the fractions of the population that hold the misconceptions addressed by the distractors, it is of utmost importance to examine the distractor-total correlations. These coefficients provide insight into the relation between a particular misconception and the level of ability measured by the test. Negative correlations suggest that a misconception is prevalent at lower ability levels; correlations near zero suggest that a misconception is independent of ability level; and positive correlations may suggest that the distractor is misleading to higher-ability students. Quite often, positive correlations for distractors simply indicate a coding error or a miskeyed item.

Figure 13. Item analysis of items q1 and q2 of the Stereochemistry Concept Inventory.
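To make the item-level quantities concrete: difficulty is the proportion of examinees choosing an option, and the discrimination index discussed above is an item-total (point-biserial) correlation, here corrected for spuriousness by removing the item from the total. The sketch below illustrates these calculations on hypothetical data; jMetrik's output in Figure 13 comes from its own implementation, which this sketch only approximates.

```python
import pandas as pd

# Hypothetical raw letter responses and answer key for three items.
responses = pd.DataFrame(
    {"q1": ["C", "C", "A", "C", "B"],
     "q2": ["B", "D", "B", "B", "A"],
     "q3": ["B", "B", "C", "A", "B"]}
)
key = {"q1": "C", "q2": "B", "q3": "B"}

scored = pd.DataFrame({it: (responses[it] == k).astype(int) for it, k in key.items()})
total = scored.sum(axis=1)

for item in scored.columns:
    difficulty = scored[item].mean()            # proportion correct
    rest = total - scored[item]                 # correction for spuriousness:
    discrimination = scored[item].corr(rest)    # correlate item with the rest score
    print(f"{item}: difficulty = {difficulty:.2f}, discrimination = {discrimination:.2f}")

    # Distractor-total correlations: choosing a distractor coded 1/0,
    # correlated with the rest score (negative values are desirable).
    for option in sorted(responses[item].unique()):
        if option == key[item]:
            continue
        chose = (responses[item] == option).astype(int)
        print(f"  distractor {option}: r = {chose.corr(rest):.2f}")
```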


If you wish to produce only difficulty and discrimination indices for the correct option, you should uncheck the All response options checkbox in the item analysis dialog box. Low discrimination indices may suggest ambiguously worded stems. Negative discrimination indices indicate that the most knowledgeable students answer the item incorrectly and the least knowledgeable students answer the item correctly. Often, an item with a negative discrimination index is a miskeyed item.

Figure 14. Nonparametric Characteristic Curves dialog box. You need to select the items for which you wish to perform the analysis and select All options in the bottom right box.


Nonparametric Response Curves

While in the previous section we addressed the importance of evaluating the quality of distractors by examining their correlation coefficients, such conclusions should be drawn carefully because correlation coefficients assume that the relationship between the variables is linear. As appealing as this model is, it does not provide an accurate picture when the relationship is not linear. Recently, item response curves (8, 9) have been employed to examine the quality of concept inventory items. Item response curves (IRCs) can be generated in jMetrik by plotting the probability of students selecting each response option against the total score of that group of students. Item response curves are useful tools for examining the relationship between the probability of selecting the correct response or a distractor and the person's ability as estimated by the sum score. IRCs allow evaluation of the overall quality of an item and the performance of each distractor. This analysis is especially useful in the earlier stages of test development because it makes poorly functioning distractors easy to spot. To generate item response curves, click Graph → Nonparametric Curves. This opens the Nonparametric Characteristic Curves dialog box (Figure 14). In this box, select the items for which you want to produce curves. For the Independent Variable box, select an estimator of the ability level, for example, the sum score. To get a complete picture, select All options.
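The underlying idea can be sketched in a few lines of Python: group examinees by their total score and, within each group, compute the proportion choosing each response option (hypothetical data; jMetrik produces smoothed nonparametric curves, so this raw-proportion sketch is only an approximation of what a plot like Figure 15 shows).

```python
import pandas as pd

# Hypothetical letter responses for one item and the examinees' total scores.
df = pd.DataFrame(
    {"q11": ["B", "C", "C", "A", "A", "C", "B", "A"],
     "total": [2, 5, 8, 15, 18, 7, 3, 16]}
)

# Proportion of examinees choosing each option within total-score bins.
score_bins = pd.cut(df["total"], bins=[0, 5, 10, 15, 20])
irc = pd.crosstab(score_bins, df["q11"], normalize="index")  # rows sum to 1 within a bin
print(irc)  # each column traces an empirical response curve for one option
```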

Figure 15. Nonparametric IRC for item q11 from the Stereochemistry Concept Inventory.

Figure 15 shows the IRC for item q11 from the Stereochemistry Concept Inventory. Response options B and C are distractors. Option B is the most likely response for examinees with total scores below 3, and option C is the most likely response for total scores between 3 and 13. At higher ability levels (total scores above 13), the correct answer (option A) is the most probable response. These results show that options B and C are not only plausible distractors but that there is an order to their degree of plausibility, which is suggestive of the cognitive model underlying this question.


Conclusions

jMetrik is a powerful instrument designed specifically to handle test data. In jMetrik, you can analyze your data using both classical test theory and modern psychometric methods. The graphical interface of jMetrik makes it intuitive to use even for practitioners with no prior experience with psychometric software. Data from Scantron sheets or web-based surveys can be uploaded directly into jMetrik and require very little preparation. If you decide to rerun an analysis with recoded responses or omitted items, it can be done easily with the point-and-click menus and the drag-and-drop interface.

The computational power of jMetrik allows for many useful analyses that can provide insights into the quality of tests, individual items, and their distractors. In the past, various hand calculations were employed to compare responses for individual items with total scores using high- and low-scoring groups of students. One of the most commonly used approaches is to compare responses between the upper and lower 27% of examinees. Kelley (10) suggested that approach in 1939 based on convenience and ease of calculation. Using jMetrik allows for a computational analysis that provides a more accurate assessment of the discrimination power of items because all student responses are taken into account.

As we mentioned previously, we did not intend to provide comprehensive guidance for all types of analyses that can be done in jMetrik. In addition to the analyses described in this chapter, you can perform Rasch analysis, which places items and examinees on the same measurement scale; differential item functioning analysis, which compares the performance of two groups of examinees on the same items; and score equating for participants who were administered different forms of the same test. Moreover, jMetrik is constantly updated, and we expect to see additional functionality in the future.
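For comparison with the correlation-based discrimination index that jMetrik reports, the classical upper-lower groups calculation mentioned above can be sketched as follows (hypothetical data; the D index is simply the difference in proportion correct between the top and bottom 27% of examinees).

```python
import pandas as pd

# Hypothetical 0/1 scores for one item and the examinees' total scores.
df = pd.DataFrame(
    {"item": [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0],
     "total": [18, 15, 4, 17, 6, 3, 12, 16, 5, 14, 19, 7]}
)

n = int(round(0.27 * len(df)))        # size of each extreme group
ranked = df.sort_values("total")
lower, upper = ranked.head(n), ranked.tail(n)

# Classical discrimination index: difference in proportion correct
# between the highest- and lowest-scoring 27% of examinees.
d_index = upper["item"].mean() - lower["item"].mean()
print(f"D = {d_index:.2f}")
```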

References

1. Holme, T.; Bretz, S. L.; Cooper, M.; Lewis, J.; Paek, P.; Pienta, N.; Stacy, A.; Stevens, R.; Towns, M. Chem. Educ. Res. Pract. 2010, 11, 92–97.
2. Towns, M. H. J. Chem. Educ. 2014, 91, 1426–1431.
3. Haladyna, T. M.; Downing, S. M.; Rodriguez, M. C. Appl. Meas. Educ. 2002, 15, 309–333.
4. Meyer, J. P. Applied Measurement with jMetrik; Routledge: London, 2014.
5. Cheema, J. R. Rev. Educ. Res. 2014, 84, 487–508.
6. Meyer, J. P. Understanding Measurement: Reliability; Oxford University Press: Oxford, 2010.
7. Popham, J. W. Classroom Assessment: What Teachers Need to Know; Pearson: London, 2010.
8. Brandriet, A. R.; Bretz, S. L. J. Chem. Educ. 2014, 91, 1132–1144.
9. Linenberger, K. J.; Bretz, S. L. Biochem. Mol. Biol. Educ. 2014, 42, 203–212.
10. Kelley, T. L. J. Educ. Psychol. 1939, 30, 17–24.