Dr. V.K. Maheshwari, M.A., M.Ed., Ph.D.
Roorkee, India

Rakhi Maheshwari, M.A., B.Ed.
Noida, India
It is usually desirable to evaluate the effectiveness of the items in a test, because the quality of a test depends upon its individual items. Item analysis is a post-administration examination of a test: it begins after the test has been administered and scored. Item analysis is the process of "testing the item" to ascertain specifically whether the item is functioning properly in measuring what the entire test is measuring. An item analysis tells us about the quality of an item and provides information concerning how well each item in the test functions.
Item analysis indicates which items may be too easy or too difficult and which may fail, for other reasons, to discriminate clearly between the better and the poorer examinees (Ebel, 1972). Brown (1971) mentioned that item analysis has two purposes: first, by identifying defective items, it enables us to improve the test and the evaluation procedures; second, by indicating which items or material students have and have not mastered, it helps us plan, revise, and improve instruction.
One primary goal of item analysis is to help improve the test by revising or discarding ineffective items. It involves a detailed and systematic examination of the students' responses to each item. Another important function is to ascertain what test takers do and do not know.
Item analysis describes the statistical analyses that allow measurement of the effectiveness of individual test items. An understanding of the factors which govern effectiveness (and a means of measuring them) can enable us to create more effective test questions and also to regulate and standardize existing tests. Item analysis helps to find out how difficult a test item is and how well the item discriminates between high and low scorers on the test. It also helps to detect specific technical flaws, and thus provides further information for improving test items. Finally, it helps in selecting the best items for the final test, rejecting poor items, and modifying others.
Item analysis is usually designed to help determine whether an item functions as intended with respect to discriminating between high and low achievers in a norm-referenced test, and with respect to measuring the effects of instruction in a criterion-referenced test. It is also a means of identifying items that have the desirable qualities of a measuring instrument, those that need revision for future use, and even deficiencies in the teaching/learning process. In addition, item analysis has other useful benefits, among which are providing data on which to base discussion of test results, remediation of learning deficiencies, and subsequent improvement of classroom instruction. Moreover, item analysis procedures provide a basis for increased skill in test construction.
Precautions to be taken prior to item analysis
Before starting the process of item analysis on any test, certain precautions should be exercised:
- Item analysis is a process of rejection or exclusion of test items; new items are not included in place of rejected items.
- The items included in the first format of the test should be in accordance with the norms previously determined in terms of difficulty level and discriminating power. If suitable types of items have not been included in the test before the item analysis process starts, it is unreasonable to expect them to appear in the test afterwards.
- The number of items in the first format should be about one and a half times the number specified for the final format.
- While constructing multiple-choice items, an extra alternative should be included; that is, if four alternatives have been specified, the first format should comprise five alternatives. With the help of item analysis, the least suitable alternative in each item can then be rejected.
- Adequate time should be given when the first format of a test is administered, so that students are able to attempt all items; otherwise exhaustive data will not be available for the process of item analysis.
It is worth knowing that both the validity and the reliability of any test depend ultimately on the characteristics of its items. High reliability and validity can be built into a test in advance through item analysis. Item analysis is used to study two characteristics:
a) Item difficulty:
The proportion of students who answered an item correctly, i.e. the difficulty level of each item (how hard it is).
b) Item discrimination power:
Tells whether a particular item differentiates between students who have greater and lesser command of the material tested, i.e. the discriminating power of each item (whether or not good students get the item right more often than poor students).
The difficulty level of the items
With respect to item difficulty, if most students answered an item correctly, the item was an easy one; if most students answered it incorrectly, it was a difficult one (Brown, 1983). The higher the value of the difficulty index, the easier the item. This definition is somewhat counter-intuitive and has led some researchers to refer to the index as an index of facility, or easiness, rather than as an index of difficulty.
Item difficulty is required because it is almost always necessary to present items in their order of difficulty. The easiest item is administered first so as to give a sense of accomplishment and the feeling of an optimistic start.
Item Difficulty in a Norm-referenced test:
A difficulty level represents the proportion of examinees responding correctly to an item. Measurement specialists suggest that the ideal mean difficulty for a norm-referenced achievement test is halfway between a perfect score and a chance score. For example, if there are four response options, a chance score is 25% and the ideal average difficulty is 62.5%. Measurement experts also believe that four-option multiple-choice items with difficulty levels below .5 (less than 50% passing) are too difficult: either there is a problem with the item itself or the content is not understood. Another possibility is that students are accustomed to studying for multiple-choice tests written at the rote level and may not be prepared for a test requiring higher cognitive levels; it is therefore helpful to provide students with examples of the types of items the test will include.
Item Difficulty in Criterion-referenced test:
For minimum-competency criterion-referenced tests, because a large proportion of examinees should answer correctly, the same average criterion does not apply. The difficulty for a criterion-referenced test should be consistent with the logically or empirically based predetermined criterion. For example, if 80% is the criterion identified, the difficulty levels should be similar to that percentage.
Item Difficulty in Criterion-Referenced Mastery Tests:
The desired level of difficulty of each test item is determined by the learning outcome it is designed to measure and not, as stated earlier, by the item's ability to discriminate between high and low achievers. The standard formula for determining item difficulty can still be applied here, but the results are not usually used to select test items or to manipulate item difficulty; rather, they are used for diagnostic purposes. Also, when instruction is effective, most items will have a larger difficulty index, with a large percentage of the students passing the test.
Determination of Difficulty Level
The percentage of students who answer a test item correctly is called the difficulty level of that item. In other words, the difficulty level of an item is defined as the proportion or percentage of the examinees who answered the item correctly. According to J.P. Guilford, "the difficulty value of an item is defined as the proportion or percentage of the examinees who have answered the item correctly." In this method, the index of difficulty of each item is determined on the basis of the responses of all the examinees; computing it from the entire sample gives a more accurate estimate of the difficulty level. According to Frank S. Freeman, "the difficulty value of an item may be defined as the proportion of a certain sample of subjects who actually know the answer of an item." The difficulty of an item can be determined in several ways:
- By the judgment of competent people who rank the items in order of difficulty.
- By how quickly the items can be solved.
- By the number of examinees in the group who get the item right.
The third procedure is the standard method for determining the difficulty of each item in an objective test. The level of difficulty is represented by a numerical term, which may range from zero to 100 percent. If an item is not answered correctly by any of the examinees, the item is most difficult: the difficulty value is zero percent and the proportion is also zero. If an item is answered correctly by every examinee, the item is very easy: the difficulty value is 100 percent and the proportion is one. An item answered correctly by 70 percent of the students is said to have a difficulty index of 70. As a general rule, any item whose difficulty index is lower than 10 or higher than 90 is worthless for measurement. Since difficulty refers to the percentage answering the item correctly, the smaller the percentage figure, the more difficult the item: items correctly answered by 90 percent of examinees are regarded as easy, whereas items correctly answered by only 5 percent of examinees are difficult. Items answered correctly by 100% or 0% of examinees have no differentiating significance.
Computing Difficulty level
To ascertain the difficulty level, the number of respondents answering correctly is compared with the number answering incorrectly. The proportion of the examinees responding to an item correctly is used to measure the difficulty level of that item. Several formulae have been developed to calculate the difficulty level in different situations.
The simplest method of calculating the difficulty level is to determine the proportion of the total number of examinees who have responded correctly. For this, the following formula is used:
DL= R/N
For the sake of convenience, the proportion is generally expressed in terms of percentage.
When solving multiple-choice test items, there is a possibility that students may get the right answer by fortunate guessing, without knowing what the correct answer is. In such a situation, the proportion calculated by the above formula is amended to correct for guessing. The amended formula is as follows:
DL = (R − W/(K − 1)) × 1/N
where:
DL = difficulty level
R = number of examinees answering the item correctly
W = number of examinees answering the item incorrectly
K = number of alternatives given for the item
N = total number of examinees
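These formulae can be applied directly. A minimal Python sketch is given below; the function names and the example figures are illustrative only, not taken from the text.

```python
def difficulty_simple(r, n):
    """DL = R/N: proportion of all examinees answering the item correctly."""
    return r / n

def difficulty_guess_corrected(r, w, k, n):
    """DL = (R - W/(K - 1)) x 1/N: difficulty corrected for fortunate guessing."""
    return (r - w / (k - 1)) / n

# Example: 60 of 100 examinees answer a four-option item correctly and 40 answer it wrongly.
print(difficulty_simple(60, 100))                  # 0.60
print(difficulty_guess_corrected(60, 40, 4, 100))  # (60 - 40/3)/100 = 0.4667 (approx.)
```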
The two formulae referred to above are based on the assumption that every item in the test is attempted by every examinee. However, some examinees may not get the opportunity to attempt some items due to lack of time; such items are termed un-attempted items. This causes confusion about whether the answer to an un-attempted item was unknown to the examinee, or whether the examinee simply failed to reach it because of shortage of time. This confusion can be resolved by the example below.
Suppose there are a total of 50 items in a test. An examinee answers items 1 to 16, skips items 17 and 18, answers items 19 to 46, and does not attempt any item thereafter. In this case items 47 to 50 are treated as un-attempted, because there is a probability that the examinee knew the answers but failed to attempt them due to shortage of time. Items 17 and 18, however, cannot be treated as un-attempted, because it can safely be concluded that they were not skipped because of shortage of time.
Thus three groups of examinees are formed with respect to an item: those answering it correctly, those answering it incorrectly, and those not attempting it.
For this situation the following formula is used for the determination of item difficulty.
DL = (R − W/(K − 1)) × 1/(N − NR)
where NR stands for the number of examinees who did not attempt (did not reach) the item.
The other symbols are described earlier.
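A minimal extension of the earlier sketch, reading NR as the count of examinees who did not reach the item:

```python
def difficulty_not_reached(r, w, k, n, nr):
    """DL = (R - W/(K - 1)) x 1/(N - NR): guessing-corrected difficulty computed
    over only those examinees who actually reached the item."""
    return (r - w / (k - 1)) / (n - nr)

# Example: of 100 examinees, 10 never reach the item; 60 answer it correctly and 30 wrongly.
print(difficulty_not_reached(60, 30, 4, 100, 10))  # (60 - 10)/90 = 0.5556 (approx.)
```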
If the number of examinees is very large, much time and energy would be required to compile the above data before the formula could be used. In such a situation, it is advisable to divide the examinees into three groups:
High group : 27% of the total examinees
Middle group : 46% of the total examinees
Low group : 27% of the total examinees
It is obvious that the examinees obtaining maximum marks in a test will form the high group and those obtaining the minimum marks will form the lower group.
In the formula laid down by Kelley, only the data obtained from the high 27% and the low 27% of examinees are used; thus a limited part of a large group of examinees (only 54%) is sufficient to get a correct result from the item analysis process. It is worth mentioning that Kelley's formula should not be used if the number of examinees is less than 370; this ensures that at least 100 examinees are included in each of the high and low groups.
The related formula is :
DL = (1/2) [ (RH − WH/(K − 1)) / (NH − NRH) + (RL − WL/(K − 1)) / (NL − NRL) ]
where:
RH = the number of examinees answering correctly in the higher group
NH = the total number of examinees in the higher group
RL = the number of examinees answering correctly in the lower group
The remaining symbols (WH, WL, NL, NRH, NRL) are defined in the same way for the respective groups, as described earlier.
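A minimal sketch of this computation, assuming the averaged form given above; the counts in the example are purely illustrative.

```python
def difficulty_extreme_groups(rh, wh, nh, nrh, rl, wl, nl, nrl, k):
    """Guessing-corrected difficulty estimated from the high-27% and low-27% groups,
    averaged over the two groups."""
    high = (rh - wh / (k - 1)) / (nh - nrh)
    low = (rl - wl / (k - 1)) / (nl - nrl)
    return (high + low) / 2

# Example: 100 examinees in each extreme group, a four-option item, no not-reached cases.
print(difficulty_extreme_groups(80, 20, 100, 0, 40, 60, 100, 0, k=4))
# high group: 0.733, low group: 0.200, DL: 0.467 (approx.)
```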
Though no single value of difficulty level can be prescribed for the selection of items in a test, educationists have suggested that items with a difficulty level from 40% to 60% should be treated as most suitable. It is also suggested that item difficulty levels be distributed according to the following table:
Number of Items | Difficulty level ( in % ) |
A/5 | 0 to 40 |
3A/5 | 50 to 60 |
A/5 | 60 to 90 |
Here, A= the number of total items in the achievement test.
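For example, on this scheme a 50-item test (A = 50) would contain roughly 10 items in the 0 to 40% difficulty range, 30 items in the 50 to 60% range, and 10 items in the 60 to 90% range.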
On the basis of the normal probability curve (N.P.C.), it is advisable that a test not be constructed for extreme cases, such as backward or gifted students. Therefore items solved only by the most gifted students, and items solved even by the weakest students, must be eliminated from the test, as they must be treated as too difficult or too easy.
In the context of difficulty level, the following difficulty levels are suggested for the selection of questions, as per Katz's (1959) recommendation:
S.No. | Type of item | Difficulty Level (%) |
1 | Long answer type | 50% |
2 | 5 alternatives | 60% |
3 | 4 alternatives | 62% |
4 | 3 alternatives | 66% |
5 | 2 alternatives | 75% |
The item difficulty indicates the percentage of students who got the item right in the two groups used for the analysis.
The discriminating power of the test items
Item discrimination shows whether the test items differentiate between people of varying degrees of knowledge and ability. It may be defined as the percentage of the "high" group passing the item minus the percentage of the "low" group passing the same item. The discriminating power of a test item refers to the degree to which success or failure on that item indicates possession of the ability being measured; in other words, it is the ability of the item to distinguish the better from the poorer examinees. The index of discriminating power (D) indicates the degree to which an item discriminates between high and low achievers on a single administration of the test, while the index of sensitivity to instructional effects (S) indicates the degree to which an item reflects the intended effects of instruction, determined from pre-test and post-test results. According to Marshall and Hales (1972), the discriminating power of an item may be defined as the extent to which success or failure on that item indicates possession of the achievement being measured.
In the same context, Blood and Budd (1972) defined the index of discrimination as the ability of an item on the basis of which a discrimination can be made between superior and inferior examinees. Item discrimination power is an index which indicates how well an item is able to distinguish between high achievers and low achievers, given what the test is measuring; that is, it refers to the degree to which the item discriminates between students with high and low achievement, or the degree to which a single item separates the superior from the inferior individuals in the trait or group of traits being measured. A discrimination index is meant to communicate the power of an item in separating the more capable examinees from the less capable on some latent attribute. The discriminating power is defined as a numerical value, which may range from +1 to −1. On the basis of discriminating power, items are classified into three types.
Positive Discrimination:
A positively discriminating item is one in which the percentage of correct answers is higher in the upper group than in the lower group; the index takes a positive value when a larger proportion of those in the high-scoring group get the item right compared to those in the low-scoring group. If an item is answered correctly by the superior (upper-group) examinees but not by the inferior (lower-group) examinees, the item possesses positive discrimination.
Negative Discrimination:
A negatively discriminating item is one in which the reverse occurs; the index takes a negative value when more students in the lower group than in the upper group get the item right. An item answered correctly by the inferior (lower-group) examinees but not by the superior (upper-group) examinees possesses negative discrimination.
Zero Discrimination:
The index takes a zero value when an equal number of students in both groups get the item right. If an item is answered correctly by the same number of superior and inferior examinees, it cannot discriminate between them, and its discriminating power is zero; a non-discriminating item is one in which the percentage of correct answers is about the same for the upper and lower groups. (The index takes the value 1.00 when all students in the upper group get the item right and all students in the lower group get it wrong.)
Computing Discriminating power
Like difficulty level, discriminating power can be calculated by different formulae. Few of the important formulae are as follows:
D = P × Q
where
P = percentage of examinees answering the item correctly
Q = percentage of examinees answering the item incorrectly
D = discriminating power (item validity index)
It is worth mentioning that calculations with this formula should be made on the data of the entire group. From the form of the formula it is clear that if an item is answered correctly by all examinees (that is, P = 100% and Q = 0%), its discriminating power is 0; likewise, if an item is answered wrongly by all examinees (P = 0% and Q = 100%), the discriminating power is again 0. The index reaches its maximum value of 2500 when P = Q = 50%, and items whose discriminating power approaches 2500 are considered appropriate for test construction.
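A small illustration of this product index; the figures are illustrative only.

```python
def discrimination_pq(p):
    """D = P x Q, where P = percentage answering correctly and Q = 100 - P."""
    return p * (100 - p)

print(discrimination_pq(50))   # 2500, the maximum value (P = Q = 50%)
print(discrimination_pq(70))   # 2100
print(discrimination_pq(100))  # 0, no discrimination
```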
Data collected from a large group of examinees are difficult to handle in item analysis. To deal with this, the students are listed in order of the marks obtained on the achievement test, and the group of examinees is then divided into high and low groups. Many methods are suggested for this, such as:
Groups | Number of Examinees
High | 25% | 27% | 33% | 50% |
Low | 25% | 27% | 33% | 50% |
According to Garrett, forming the high and low groups from the top and bottom 27% helps calculate the discriminating power more accurately. However, it has been observed in research that when the analysis is conducted with a small group of examinees (100 or fewer), forming the high and low groups from the top and bottom 50% is more appropriate.
The discriminating power can first be calculated with the simple difference formula D = (RH − RL)/n. The symbols used in this formula have been described before; in addition,
n = number of examinees in each of the higher and lower groups
According to statisticians, the values given by the above formula are not on a common scale, which creates difficulty in comparative studies. In that case the following formula is recommended:
D = (RH − RL − 1) / [ RT (1 − RT/(NT − NRT)) ]
This formula is called the 'chi-square test'. The symbols used in it have been described earlier; the other important points about the formula are as follows:
Here,
RT = RH + RL
NT = NH + NL
NRT = NRH + NRL
This formula should be used only when RH is greater than RL. If the value of RL is greater than RH, then the following formula is recommended:
D = (RH − RL + 1) / [ RT (1 − RT/(NT − NRT)) ]
If the calculation by the above formula gives a value of D of 10 or more, the discriminating power of the item is considered satisfactory and the item may be considered appropriate for inclusion in the test. If the value of D is negative, the item must be rejected.
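The sketch below follows the two formulae exactly as printed above; the grouping of terms, and whether the numerator should in fact be squared as in a true chi-square statistic, should be verified against the original source before the index is used.

```python
def discrimination_chi_square_type(rh, rl, nt, nrt):
    """Discrimination index of the 'chi-square' type, as printed in the text:
    D = (RH - RL -/+ 1) / (RT x (1 - RT/(NT - NRT))), with RT = RH + RL."""
    rt = rh + rl
    correction = -1 if rh > rl else 1          # -1 when RH exceeds RL, +1 otherwise
    denominator = rt * (1 - rt / (nt - nrt))
    return (rh - rl + correction) / denominator

# Illustrative counts only: RH = 40, RL = 20, 54 examinees in each group, none not-reached.
print(discrimination_chi_square_type(40, 20, nt=108, nrt=0))
```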
In addition to the above methods, several other methods have been developed by statisticians; however, the biserial correlation is considered the most authentic. Besides the biserial correlation, Symonds' method, Flanagan's product-moment coefficient method, the nomographic method, Stanley's 27%-high and 27%-low group method, Davis's discrimination index method, and others can be used to find the discriminating power.
Item analysis for teacher-made tests
The methods mentioned so far for calculating item difficulty and discriminating power are certainly inconvenient, if not difficult, for an ordinary teacher to use. Keeping the convenience of teachers in view, Diederich (1960) developed a procedure for completing the item analysis process in a shorter period, and with the same objective in mind Katz too developed a technique. Given the environment prevailing in Indian schools, it is prudent and rational to use the method described below, which proves useful for teachers.
Under this method a test is administered, with an effort made to ensure that students have enough time to attempt every item. After the test is scored, a merit list of students is prepared. From this merit list a certain percentage of the total students (generally 27%) is selected to form the high and low groups. The number of students answering each item correctly is counted in both groups, and the discriminating power and difficulty level of each item are determined with the help of the following table:
Total number of examinees = A
Number of examinees in the higher group = number of examinees in the lower group = 27/100 × A = N
Note: here the 27% method is used to form the higher and lower groups. A teacher may use other methods as per his needs; in that case, instead of 27% he can use 33% or 50%.
Item No | No. of examinees answering correctly in high group (RH) | No. of examinees answering correctly in low group (RL) | Difficulty level (RH + RL)/2N | Discriminating power (RH − RL)/N | Suitability (Yes/No) |
For the selection of items, the values of discriminating power and difficulty level have to be determined. Generally, items with a difficulty level from .3 to .7 and a discriminating power of about .5 are considered appropriate. The test constructor records his decision about the suitability of each item by writing yes/no in the last column.
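A minimal sketch of this tabulation for a single item; the cut-offs (.3 to .7 for difficulty, about .5 for discriminating power) are taken from the text, everything else is illustrative.

```python
def analyse_item(rh, rl, n):
    """Teacher-made test analysis for one item.
    rh, rl: number answering correctly in the high and low groups; n: size of each group."""
    difficulty = (rh + rl) / (2 * n)        # (RH + RL)/2N
    discrimination = (rh - rl) / n          # (RH - RL)/N
    suitable = 0.3 <= difficulty <= 0.7 and discrimination >= 0.5
    return difficulty, discrimination, "Yes" if suitable else "No"

# Example: 27% groups of 20 students each; 18 high-group and 7 low-group students answer correctly.
print(analyse_item(18, 7, 20))  # (0.625, 0.55, 'Yes')
```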
Ebel and Frisbie (1986) give the following rule of thumb for judging the quality of items in terms of the discrimination index. The table below shows the values of D, their interpretation, and the corresponding recommendations.
D= | Quality | Recommendations |
> 0.39 | Excellent | Retain |
0.30 – 0.39 | Good | Possibilities for improvement |
0.20 – 0.29 | Mediocre | Need to check/review |
0.00 – 0.19 | Poor | Discard or review in depth |
< -0.01 | Worst | Definitely discard |
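This rule of thumb can be expressed as a simple classification; the boundary values are read from the table above, with "> 0.39" taken as 0.40 and above.

```python
def classify_discrimination(d):
    """Interpretation of the discrimination index D, after Ebel and Frisbie (1986)."""
    if d >= 0.40:
        return "Excellent - retain"
    if d >= 0.30:
        return "Good - possibilities for improvement"
    if d >= 0.20:
        return "Mediocre - needs review"
    if d >= 0.00:
        return "Poor - discard or review in depth"
    return "Negative - definitely discard"

print(classify_discrimination(0.35))  # Good - possibilities for improvement
```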
Selection of Appropriate choices
For the qualitative improvement of multiple-choice tests, it is essential that the alternatives used in each item be homogeneous, that is, they should belong to the same category. To make items suitable from this point of view, it is suggested that the number of choices in each item in the first format of the test be one more than needed for the final format. With the help of the examinees' responses, the distraction power of the different choices is ascertained using the following table, and the alternative with the least distraction power is rejected. In this way the items are included in the test in an improved form.
Item No | Number selecting Alternative A | Number selecting Alternative B | Number selecting Alternative C | Number selecting Alternative D | Number selecting Alternative E | Rejected Alternative |
19 | 31 | 20 | 54 | 49 | — | B |
23 | 40 | – | 33 | 42 | 68 | C |
The above table has columns for five alternatives; clearly, the final format can have at most four. It is also worth mentioning that, while calculating the distraction power of the alternatives, the correct alternative is left out and attention is focused on the remaining alternatives. The distraction power of an alternative is proportional to the number of examinees opting for it.
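A minimal sketch of this selection rule; the dictionary layout and the counts are illustrative, taken from item 19 in the table above.

```python
def weakest_distractor(counts, correct):
    """counts: number of examinees selecting each alternative; correct: the keyed answer.
    Returns the distractor chosen by the fewest examinees (least distraction power)."""
    distractors = {alt: n for alt, n in counts.items() if alt != correct}
    return min(distractors, key=distractors.get)

# Item 19: E is the correct answer (left out of the comparison), so B (20 choosers) is rejected.
print(weakest_distractor({"A": 31, "B": 20, "C": 54, "D": 49}, correct="E"))  # B
```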
Similarly, two items may measure the same content twice, e.g. 'Who was the founder of the Mughal empire in India?' and 'Which empire was founded by Babur in India?' Both questions refer to the same content, namely that Babur established the Mughal empire in India. Of these two questions, one should be treated as redundant and excluded from the test.
The Process of Item Analysis
The correlation coefficient obtained from the point-biserial is a measure of item discrimination. The point-biserial correlation between pass/fail on each item and the total test score is used to explore item discrimination. The greater the correlation of the item, the more discriminating it is; that is, it discriminates between the higher and lower groups more effectively. For an item to be valid, its correlation with the total score should be fairly high.
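A minimal sketch of the point-biserial computation for a single item, assuming a 0/1 item-score vector and a vector of total scores; the data are illustrative only.

```python
import statistics

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a dichotomous item (0 = fail, 1 = pass)
    and the total test score: r = ((Mp - M)/SD) * sqrt(p/q)."""
    n = len(item_scores)
    p = sum(item_scores) / n                                   # proportion passing the item
    q = 1 - p
    mean_pass = statistics.mean(t for i, t in zip(item_scores, total_scores) if i == 1)
    mean_all = statistics.mean(total_scores)
    sd_all = statistics.pstdev(total_scores)                   # population SD of total scores
    return (mean_pass - mean_all) / sd_all * (p / q) ** 0.5

# Example: six examinees; the item is passed mainly by the higher scorers.
items = [1, 1, 1, 0, 0, 1]
totals = [48, 45, 40, 30, 28, 35]
print(round(point_biserial(items, totals), 2))  # 0.83 (approx.)
```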
The process of item analysis is carried out using two contrasting test groups composed of the upper and lower 25% or 27% of the testees to whom the items were administered or trial-tested. The upper and lower 25% is the optimum point at which a balance is obtained between the sensitivity of the groups in making adequate differentiation and the reliability of the results for a normal distribution. On the other hand, the upper and lower 27%, when used, give a better estimate of the actual discrimination value, since these groups are significantly different and the middle values do not discriminate sufficiently. To form the groups, the graded test papers are arranged from the highest score to the lowest score in descending order. The best 25% or 27% are picked from the top and the poorest 25% or 27% from the bottom, while the middle test papers are set aside.
To illustrate the method of item analysis, consider a class of 40 learners taking a 10-item test that has been administered and scored, with 25% test groups. The item analysis procedure might follow these basic steps (a small sketch of the computational core is given after the list):
i. Arrange the 40 test papers by ranking them in order from the highest to the lowest score.
ii. Select the best 10 papers (upper 25% of 40 students) with the highest total scores and the least 10 papers (lower 25% of 40 students) with the lowest total scores.
iii. Drop the middle 20 papers (the remaining 50% of the 40 students) because they will no longer be needed in the analysis.
iv. Draw a table to show readiness for the tallying of responses for item analysis.
v. For each of the 10 test items, tabulate the number of students in the upper and lower groups who got the answer right or who selected each alternative (for multiple choice items).
vi. Compute the difficulty of each item (percentage of students who got the item right).
vii. Compute the discriminating power of each item (difference between the number of students in the upper and lower groups who got the item right).
viii. Evaluate the effectiveness of the distracters in each item (attractiveness of the incorrect alternatives) for multiple choice test items.
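A minimal sketch of steps i to vii under the stated assumptions (papers ranked by total score, 25% extreme groups); the data structures and names are illustrative.

```python
def upper_lower_analysis(papers, answer_key, fraction=0.25):
    """papers: list of (total_score, answers) pairs, where answers is the list of chosen options.
    Returns, for each item, its difficulty (% correct in the two groups) and discrimination."""
    papers = sorted(papers, key=lambda p: p[0], reverse=True)   # step i: rank by total score
    k = int(len(papers) * fraction)
    upper, lower = papers[:k], papers[-k:]                      # steps ii-iii: keep the extreme groups
    results = []
    for i, key in enumerate(answer_key):                        # steps v-vii: tabulate and compute
        ru = sum(1 for _, ans in upper if ans[i] == key)
        rl = sum(1 for _, ans in lower if ans[i] == key)
        difficulty = (ru + rl) / (2 * k) * 100
        discrimination = (ru - rl) / k
        results.append((i + 1, difficulty, discrimination))
    return results
```

For a class of 40 learners and a 10-item test this returns one line per item, which can then be entered in a tally table such as the one described above.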
Computing Item Discrimination
The formula for computing item discrimination is given below (Gronlund, 1993: 103; Ebel and Frisbie, 1991: 231):
D = (RU − RL) / NU (or NL)
Where D = Index of discrimination.
RU= Number of examinees giving correct answers in the upper group.
RL = Number of examinees giving correct answers in the lower group.
NU or NL= Number of examinees in the upper or lower group respectively
Computing Item Discriminating Power (D): it is obtained from the formula

Item Discrimination Power (D) = (number of high scorers who got the item right (H) − number of low scorers who got the item right (L)) / total number in each group (n)

That is,

D = (H − L) / n
Item discrimination values range from −1.00 to +1.00. The higher the discrimination index, the better the item differentiates between high and low achievers.
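For instance, if 8 of the 10 students in the upper group and 3 of the 10 students in the lower group answer an item correctly, then D = (8 − 3)/10 = 0.50.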
Usually, if item discriminating power is a:
- positive value when a larger proportion of those in the high scoring group get the item right compared to those in the low scoring group.
- negative value when more students in the lower group than in the upper group get the item right.
- zero value when an equal number of students in both groups get the item right; and 1.00 when all students in the upper group get the item right and all the students in the lower group get the item wrong.
The item discrimination index compares students' performance on an item to their performance on the entire examination. The point-biserial correlation index, the method used by evaluation and testing offices to measure item discrimination, compares the performance of all students on each item to their performance on the total test. Another method of deriving an item discrimination index compares examinees' performance in the upper and lower groups on the examination (e.g. the upper and lower 27%) for each item. Item discrimination indices vary from −1.00 to +1.00, with a negative index suggesting that students who performed poorly on the exam answered the particular item correctly or, conversely, that high-performing students answered the item incorrectly. Items should have a positive discrimination index, indicating that those who scored high on the test also tended to answer the item correctly, while those who scored low on the test tended to answer it incorrectly.
A discrimination index should be evaluated with reference to the difficulty level of the item, because a correlation method is used to assess the item's success in discriminating between low- and high-achieving students. If an item is very easy or very difficult, indicating homogeneous performance, there is less variation in the scores, resulting in a reduced potential for discrimination. For example, suppose item X has a slightly negative discrimination index but is an extremely difficult item; then at least one poorly performing student answered this very difficult item correctly. Such an item could be evaluated as being too difficult or as having ambiguous options, and should be revised. It is also possible that the correct option was mis-keyed or that the content was not taught as thoroughly as the instructor intended.
The discrimination index can be used to evaluate an item from either a criterion-referenced test or a norm-referenced achievement test; however, less variation is expected on the criterion-referenced test. Students who were successful in mastering the material overall should answer the items correctly, resulting in a positive but lower discrimination index.
Ebel and Frisbie believed that the more items classified as highly or moderately discriminating, the better the test. Burroughs (1975) showed that an item which does not discriminate between the upper and lower groups contributes nothing to the establishment of an order of merit, though it may be useful for warming-up purposes. An item which is easier for weaker students than for good students would not only be a very curious item, but also one that detracts from the test's rank-ordering properties.