Kanupriya Katyal, Dr. Jagrook Dawra
Abstract:
This module deals with defining and determining the quality of test instruments and test items. Tests, as instruments of evaluation, need to be accurate, objective, practical and reliable. Further, they should be able to discriminate between good and bad performers and have a difficulty level appropriate to the group tested. This module explains each of these terms and describes how they can be measured. It specifically touches on six measures of test quality: objectivity, practicability, reliability, validity, difficulty level and discrimination index. It also discusses mathematical measures, such as mean, median, mode, standard deviation and correlation, that help in measuring test quality.
Objective:
1. To enable the reader to define the quality of a test and measure it.
a. To understand the concepts of reliability and validity in a test.
b. To understand the various measurements used in defining quality, such as mean, median, mode, standard deviation and correlation.
Introduction:
A test needs to evaluate and measure the performance of a candidate, a department or an institution.
Measurement is purely quantitative; when an individual's judgment is added, it becomes evaluation.
A test should measure what it is intended to measure, with considerable accuracy, and at the same time it should be able to discriminate between students of varied abilities.
Subjective judgment leads to inaccuracy and errors. These errors are the standard errors of measurement.
Hence, they need to be identified and eliminated.
There are several valid reasons for analyzing questions and tests that students have completed and that have already been graded. Some of the reasons include the following:
Identify content that has not been adequately covered and should be re-taught,
Provide feedback to students,
Determine if any items need to be revised in the event they are to be used again or become part of an item file or bank,
Identify items that may not have functioned as they were intended,
Direct the teacher 's attention to individual student weaknesses.
Validity and reliability are the overarching principles that govern test design. Validity is the extent to which a test measures what it intends to measure. Reliability is the extent to which the test scores are consistent.
Reliability is a property of the test as a measuring instrument. Objectivity, practicability, difficulty level and discrimination index are other measures of test quality; these are discussed in the subsequent sections.
Understanding Test Item and Test Quality
There are various forms of assessment techniques available to the examiner. They range from assessing students using a fixed-response multiple choice test to an open-response short answer, long answer or essay type of exam. These exams serve a variety of purposes. The results may be used to assess a student's strengths and weaknesses or to plan further instructional activity. They may be used for selection, placement or certification. They may be used as tools for appraisals. Regardless of the objective of assessment, all assessments need to possess certain characteristics and a certain degree of quality. A test is said to be of good quality if it satisfies the following criteria[i]:
1. Objectivity (justice): Objectivity is said to be ensured when the paper setter is given a design/method to follow. The objectivity of the 'darts' exercise would depend upon how well the task is defined to the players. A test with good objectivity would define the number of attempts, the distance from which to aim, etc.
For example, teachers at several levels of education assess students' overall learning by giving them projects. Often, students are not told anything about the scope of the work. They are also unaware of what distinguishes a good project from a bad project and how they would be graded.
It has often been observed that students' learning from a project is enhanced if the scope of the project is clearly defined and the student is also told clearly about certain specific performance characteristics arranged in levels, indicating the degree to which the standard has been met.
If a biology student is asked to maintain a journal on leaf collection, a test with good objectivity for this project would look as follows:
| Criterion | Grade A | Grade B | Grade C | Grade D |
|---|---|---|---|---|
| Appearance/Neatness | Extremely neat, with cover page; leaves dried and neatly pasted | Neat, with cover page; leaves dried and pasted | Neat, with no cover page; leaves not dried and pasted | Untidy; no cover page and leaves not dried |
| Organization | Well organized and categorized/catalogued | Organized and categorized/catalogued with some errors | Organized and categorized/catalogued with a lot of errors | Disorganized and no cataloguing |
| Information and understanding | Both common name and scientific name given; information about species/genus/family given and accurate | Both common name and scientific name given; information about species/genus/family given with some errors | Both common name and scientific name given; information about species/genus/family given with a lot of errors | Such information is missing |
Objectivity needs to be maintained not only for the test but also for test items.
2. Practicability (usability): All test instruments should be easily usable and have simple and clear instructions for administration. For example, an online test may not be practical in remote areas where internet connectivity is poor; a paper-based test would probably be more appropriate.
3. Reliability (dependability): A test instrument is said to be reliable if it produces the same result every time. It is the consistency of measurement. A measure is considered reliable if a person's scores on the same test given twice are similar. The ability of a player to consistently hit around the bull's eye is a measure of his reliability.
There are several ways by which reliability is generally measured: Test-retest, alternate form, split half, internal consistency (inter-item) and inter-rater.
a. Test-retest: This is the most conservative method of estimating reliability. In this method, the scores from repeated tests of the same participants, with the same test, are compared. The test instrument remains the same. A reliable test will produce very similar scores. Simply put, the idea behind test-retest is that you should get the same score on test 1 as you do on test 2.
For example, IQ tests typically show high test-retest reliability.
The reliability of a weighing scale in a physics experiment can be tested by recording a weight 3 to 4 times with an interval of a few minutes.
Test-retest reliability is a measure of stability.
b. Alternate form reliability: When participants are able to recall their previous responses, test-retest procedures fail. In such cases, alternate form reliability is used. As the name suggests, two or more versions of the test are constructed that are equivalent in content and difficulty.
For example, the marks in the pre-board test should be consistent with those in the board exam if there is no change in the underlying conditions between the two.
Teachers also use this technique to create replacement exams for students who have for some reason missed the main exam.
Alternate form reliability is a measure of equivalence.
c. Split-half reliability: This method compares scores from different parts of the test, such as comparing the scores from even- vs. odd-numbered questions.
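A split-half check is easy to sketch in code. The following Python fragment is a minimal sketch (it assumes Python 3.10+ for statistics.correlation, and the response data are hypothetical, not from this module); it totals the odd- and even-numbered items separately and correlates the two halves:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical responses: each inner list is one student's answers to a
# 10-item test (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0, 1, 1, 1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
]

# Total each student's score on the odd-numbered and even-numbered items.
odd_totals = [sum(r[0::2]) for r in responses]   # items 1, 3, 5, 7, 9
even_totals = [sum(r[1::2]) for r in responses]  # items 2, 4, 6, 8, 10

# A high positive correlation between the two halves indicates good
# split-half reliability.
print(correlation(odd_totals, even_totals))
```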
d. Internal consistency or inter-item reliability estimates reliability by grouping questions in a questionnaire that measure the same concept. For example, you could write two sets of three questions that measure the same concept and, after collecting the responses, run a correlation between those two groups of three questions to determine whether your instrument is reliably measuring that concept.
For example,
Vocabulary could be tested using synonyms, antonyms, sentence completion or analogies.
Understanding of Newton's laws can be tested by asking the student to state the laws or by giving him a numerical problem based on these laws.
Inter-item reliability is the degree to which different items measuring the same variable attain consistent results. Scores on different items designed to measure the same construct should be highly correlated.
e. Inter-rater reliability: Scorer reliability needs to be measured when observers use their judgment for interpretation.
For example, when analyzing live or videotaped behavior and written answers to open-ended essay type questions, different observers take measurements of the same responses. A high degree of correlation between the scores given by different observers indicates high inter-rater reliability. There are often more than two judges judging the performance of gymnasts in a sporting event, and more than one teacher is present during the viva-voce examination of a student.
A high correlation between the scores given by different judges to the gymnasts, or by different teachers to the students, indicates high inter-rater reliability.
4. Validity (accuracy): A test instrument should accurately measure what it is designed to test. It is the strength of our conclusions. Most tests are designed to measure hypothetical constructs, like intelligence or learning, which the examiner needs to operationalize. A valid test will measure the construct (learning) without being influenced by other factors (the student's motivation level). It answers the examiner's question, "Was I right in giving the student this test/test item?" In the darts example (figure 1 below), a player who is able to aim at the bull's eye correctly is valid. So, the player is valid in panels A and B of the diagram (though less reliable in B). For example, in a test intended to examine a student's understanding of Wordsworth's literary contribution, a question can be asked in the following ways:
Summarize Wordsworth's poem 'Daffodils'.
Critically evaluate Wordsworth's poem 'Daffodils'.
The first question tests the student's memory and not his/her understanding of 'Daffodils'.
Validity is also of different types:
a. Face validity: The test looks to be a good one; it reflects what teachers and students think of the test. Is it a reasonable way of assessing students? Is it too simple? Or is it too difficult? Face validity is the consensus of experts (generally) that a measure represents a concept. It is the least stringent type of validity.
b. Construct validity: A construct is an individual characteristic that we assume exists in order to explain some aspect of behavior. Whenever we wish to interpret assessment results in terms of some individual characteristic (e.g. reading comprehension, mathematics problem-solving ability), we are concerned with a construct.
Some other examples of constructs are reasoning ability, understanding of the principles of electricity, intelligence, creativity, and personality characteristics like sociability, honesty and anxiety. Constructs are often difficult to define. They are often generated from some theoretical position that the examiner assumes. For example, one examiner's model of a successful salesperson may propose that an aggressive person is likely to be a successful salesperson, whereas another examiner might opine that aggressiveness is a negative trait and a salesperson should rather be assertive.
Construct validity measures whether the test is accurately measuring a particular construct. For example, an examiner constructs a SALESPERSON scale with questions testing both aggressive and assertive behavior and administers it to certain salespeople whose performance is known. Items that have a high correlation with the performance of a salesperson indicate high construct validity, while those with low correlation indicate low construct validity.
When measuring a student's understanding of the principles of thermodynamics, if the examiner examines the adequacy (or inadequacy) of the answer, he would measure the construct appropriately. But if the examiner also examines the student on grammar, neatness, etc., the construct is not being measured appropriately.
c. Content validity: Content validity is the property of a test such that the test items sample the universe of items for which the test is designed. Content validity helps us understand whether a sample of items truly represents the entire universe of items for a particular topic. For example, a teacher gives her students a list of 200 words and would like to know whether they have learnt to spell them correctly. She may choose a sample of, say, 20 words for a small test. We would like to know how representative these 20 words are of the entire list, so that we can generalize that a student who spells 80% of these 20 words correctly would be able to spell 80% of the entire list correctly.
d. Criterion validity: Criterion validity assesses whether a test reflects a set of abilities in a current or a future setting, as measured by some other test. It is of two types: predictive (future) validity and concurrent (present) validity.
Predictive validity - the test accurately predicts performance in some subsequent situation. For example, candidates were selected to do a certain job by interviewing them. If these selected candidates also perform well in their jobs then the test method (interview) has a good predictive validity.
Concurrent validity - the test gives similar results to existing tests that have already been validated. For example, assume that interview as a method has already been validated as a good indicator for employee performance. A written technical exam shall have high concurrent validity if it also gives similar results.
For example, reading readiness test scores might be used to predict students' future achievement in reading, or a test of dictionary skills might be used to estimate students' current skill in the actual use of a dictionary.
Difference between reliability and validity: Assume that there are some individuals playing darts.
The success of their skill is based on how close to the bull's eye they can hit consistently. Let there be four persons playing, Person A, B, C and D, whose results are given in figure 1.
It can then be said from the figure that Player A is both valid and reliable: Player A not only achieves the desired result (valid) but also does so consistently (reliable).
Figure 1: Dart-board results for the four players. A: Reliable and valid. B: Valid but not reliable. C: Reliable but not valid. D: Neither reliable nor valid.

5. Difficulty level: A question paper or any test instrument is generally administered to a group of about the same age and in the same grade/standard. Thus, the test instrument must be made to a difficulty level suitable to the group. Item difficulty is simply the percentage of students taking the test who answered the item correctly. The larger the percentage getting an item right, the easier the item; the higher the difficulty index, the easier the item is understood to be.
For example, in the questions below, which item is more difficult?
a. Who was A.O. Hume?
b. Who was Mahatma Gandhi?
It is relatively easier to recognize the individual in the second question than in the first.
Similarly, an English test item that is very difficult for an elementary student may be very easy for a high school student.
The difficulty index tells us how difficult the item is, i.e. how many people got the item correct. It is calculated as follows:

D = (Uc + Lc) / T

where Uc is the number of people in the upper group who answered the item correctly, Lc is the number of people in the lower group who answered the item correctly, and T is the total number of responses to the item.
For example, in a class, if out of the top 10 students 9 gave a correct response to the question "Who is the president of India?" and out of the bottom 10 students only 4 gave a correct response, the difficulty level of the question would be:

D = (9 + 4) / 20 = 0.65 = 65%

This means that 65% of the students answered the question correctly.
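As a quick sketch of this calculation in Python (the function name difficulty_index is our own, not from the module), reproducing the worked example:

```python
def difficulty_index(upper_correct: int, lower_correct: int, total: int) -> float:
    """Item difficulty D = (Uc + Lc) / T, as defined above."""
    return (upper_correct + lower_correct) / total

# Worked example: 9 of the top 10 and 4 of the bottom 10 answered correctly,
# out of T = 20 responses.
print(difficulty_index(9, 4, 20))  # 0.65, i.e. 65%
```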
6. Discrimination value: Even though a test instrument must be suited to a homogeneous group, it should still be able to distinguish between the different ability levels of the individuals being tested. The darts test should be able to discriminate between a novice, an amateur and an expert.
A good item discriminates between those who do well on the test and those who do poorly. The item discrimination index, d, can be computed to determine the discriminating power of an item. If a test is given to a large group of people, the discriminating power of an item can be measured by comparing the number of people with high test scores who answered that item correctly with the number of people with low scores who answered the same item correctly. If a particular item is doing a good job of discriminating between those who score high and those who score low, more people in the top-scoring group will have answered the item correctly. The discrimination index d is given by:
d = (Uc − Lc) / (T/2)

where Uc is the number of people in the upper group who answered the item correctly, Lc is the number of people in the lower group who answered the item correctly, U and L are the sizes of the upper and lower groups respectively, and T is the total number of responses to the item.
For example, if 15 out of 20 persons in the upper group answered a particular question correctly and 5 out of 30 people in the lower group answered the same question correctly, then:

d = (15 − 5) / ((20 + 30)/2) = 10/25 = 0.4
The higher the discrimination index, the better the item because such a value indicates that the item discriminates in favor of the upper group, which should get more items correct.
8
An item that everyone gets correct or that everyone gets incorrect will have a discrimination index equal to zero.
When more students in the lower group than in the upper group select the right answer to an item, the item has a negative discrimination index.
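The discrimination index can be sketched the same way in Python (again, the function name is illustrative, not from the module); the call reproduces the worked example above:

```python
def discrimination_index(upper_correct: int, lower_correct: int, total: int) -> float:
    """Item discrimination d = (Uc - Lc) / (T / 2), as defined above."""
    return (upper_correct - lower_correct) / (total / 2)

# Worked example: 15 of 20 in the upper group, 5 of 30 in the lower group,
# T = 20 + 30 = 50 responses.
print(discrimination_index(15, 5, 50))  # 0.4
```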
While it is important to analyze the performance of individual test items (reliability, difficulty level, discrimination value, etc.), it is also important to analyze the overall performance of the complete test or its subsections. These criteria are measured using certain statistical measures, primarily the measures of central tendency (mean, median and mode) and the standard deviation (a measure of dispersion). The mean, median and mode show how the test scores cluster together, and the standard deviation shows how widely the scores are spread out.
Mean (also called average): For a data set, the mean is the sum of the observations divided by the number of observations:

x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ

For example, the arithmetic mean of 34, 27, 45, 55, 22, 34 (six values) is (34 + 27 + 45 + 55 + 22 + 34)/6 = 217/6 ≈ 36.167.
Median is described as the number separating the higher half of a data set from the lower half.
For example, consider the dataset {1, 2, 2, 2, 3, 9}. The median is 2 in this case.
Mode is the value that occurs the most frequently in a data set.
For example, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.
Standard deviation of a data set is a measure of the spread of its values. It is a measure of dispersion that takes every test score into account. Simply put, it is roughly the average amount by which each student's score deviates (differs) from the mean of the class. The standard deviation is usually denoted by the letter σ:

σ = √[ (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)² ]
For example, the standard deviation of 34, 27, 45, 55, 22, 34 (six values) is about 11.01 by this formula (or 12.06 if the sample formula, with n − 1 in the denominator, is used).
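All of these measures are available in Python's standard statistics module. A minimal sketch using the six example values (pstdev applies the population formula with 1/n; stdev applies the sample formula with 1/(n − 1)):

```python
import statistics

scores = [34, 27, 45, 55, 22, 34]

print(statistics.mean(scores))    # ~36.167
print(statistics.median(scores))  # 34.0
print(statistics.mode(scores))    # 34
print(statistics.pstdev(scores))  # ~11.01 (population formula, 1/n)
print(statistics.stdev(scores))   # ~12.06 (sample formula, 1/(n - 1))
```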
These measures of central tendency and dispersion show how appropriately a test has been designed for its intended purpose. They help the examiner determine the level of difficulty required and how well students of different levels can be differentiated. If the test results are skewed, with marks clustering either towards the top or towards the bottom, the examiner may conclude that the test is too easy or too difficult for the students.
Correlation: This concept lays foundations for most concepts of test analysis. It tells the examiner the extent to which two or more sets of results agree with each other.
For example,
Case 1: The results of two tests for the same set of students yielded the following results.
| Student | Test 1 rank | Test 2 rank |
|---|---|---|
| A | 1 | 1 |
| B | 2 | 2 |
| C | 3 | 3 |
| D | 4 | 4 |
| E | 5 | 5 |
This shows that the students ranked identically on the two tests, that is, all ranks are the same for both tests.
This shows a perfect positive correlation or a correlation of +1.
Case 2: If the results of two tests for the same set of students yielded the following results:
| Student | Test 1 rank | Test 3 rank |
|---|---|---|
| A | 1 | 5 |
| B | 2 | 4 |
| C | 3 | 3 |
| D | 4 | 2 |
| E | 5 | 1 |
Here the ranks are as different from each other as possible: the student who was ranked 1 in the first test was ranked last in the second test, and vice versa. This shows a perfect negative correlation, or a correlation of −1.
Case 3: If the results of two tests for the same set of students yielded the following results:
| Student | Test 1 rank | Test 4 rank |
|---|---|---|
| A | 1 | 3 |
| B | 2 | 2 |
| C | 3 | 4 |
| D | 4 | 5 |
| E | 5 | 1 |
[Figure: scatter plot of Test 1 rank (x-axis) against Test 4 rank (y-axis) for the five students.]
This graph shows that there is no visible pattern between the Test 1 Ranks and Test 4 Ranks. Hence it can be said that there is no correlation.
However, in most situations there will be some amount of association, and to measure this association, whether positive or negative, the coefficient of correlation is used. The following table may be used as a basis for interpreting the coefficient of correlation[ii]:
| Correlation | Small | Medium | Large |
|---|---|---|---|
| Negative | −0.3 to −0.1 | −0.5 to −0.3 | −1.0 to −0.5 |
| Positive | 0.1 to 0.3 | 0.3 to 0.5 | 0.5 to 1.0 |
The formula for calculating this coefficient is:

r = (nΣXY − ΣX ΣY) / √[ (nΣX² − (ΣX)²) (nΣY² − (ΣY)²) ]
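The raw-score formula above translates directly into code. A minimal Python sketch (the function name pearson_r is our own):

```python
from math import sqrt

def pearson_r(x: list[float], y: list[float]) -> float:
    """Coefficient of correlation, using the raw-score formula above."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# The two extreme cases discussed above:
print(pearson_r([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 1.0 (perfect positive)
print(pearson_r([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -1.0 (perfect negative)
```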
Points to remember:
A good test satisfies the criteria of objectivity, practicability, reliability, validity, difficulty level and discriminatory power.
Objectivity is said to be ensured when the paper setter is given a design/ method to follow.
All test instruments should be easily usable and have simple and clear instructions for administration of the instrument.
A test instrument is said to be reliable if it produces the same result every time.
A test instrument should accurately measure what it is designed to test.
The test instrument must be made to a difficulty level suitable to the group.
A test item should be able to distinguish between the different ability levels of different individuals being tested.
Exercises
Q1. A vocabulary test was conducted with persons from various age groups. Determine for the testing authority whether there is any relationship between age and the marks obtained (x = age of person, y = marks obtained).

| x (age) | y (marks) | x² | y² | xy |
|---|---|---|---|---|
| 9 | 28.4 | 81 | 806.56 | 255.6 |
| 15 | 29.3 | 225 | 858.49 | 439.5 |
| 24 | 37.6 | 576 | 1413.76 | 902.4 |
| 30 | 36.2 | 900 | 1310.44 | 1086 |
| 38 | 36.5 | 1444 | 1332.25 | 1387 |
| 46 | 35.3 | 2116 | 1246.09 | 1623.8 |
| 53 | 36.2 | 2809 | 1310.44 | 1918.6 |
| 60 | 44.1 | 3600 | 1944.81 | 2646 |
| 64 | 44.8 | 4096 | 2007.04 | 2867.2 |
| 76 | 47.2 | 5776 | 2227.84 | 3587.2 |
| Total: 415 | 375.6 | 21623 | 14457.72 | 16713.3 |

r = (nΣXY − ΣX ΣY) / √[ (nΣX² − (ΣX)²) (nΣY² − (ΣY)²) ]
r = (10 × 16713.3 − 415 × 375.6) / √[(10 × 21623 − 415²) (10 × 14457.72 − 375.6²)]
r = 11259 / √(44005 × 3501.84)
r = 11259 / 12413.6
r ≈ 0.91
The correlation coefficient is 0.91, which is very large. Hence, the testing authority can assume that there is a strong positive correlation between the age of a person and the test score obtained.
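The same result can be checked in code using the pearson_r sketch from the correlation section; the lists below are the x and y columns of the table:

```python
ages = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]
marks = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]

# pearson_r as sketched in the correlation section above
print(round(pearson_r(ages, marks), 2))  # ~0.91
```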
Q2. Using the test information given below, determine the range, mean and median of the scores, and the item difficulty and item discrimination indices of the questions. There are 6 true-false questions (Q1-Q6) and 4 multiple choice questions (Q7-Q10).
| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Correct answers | T | F | F | T | F | T | A | C | B | B |
| Amit | T | T | F | T | F | T | A | C | B | B |
| Prakash | T | F | T | T | F | T | A | C | B | B |
| Rahul | T | F | F | T | F | T | A | C | B | B |
| Gina | F | F | F | T | F | T | B | A | C | B |
| Tom | T | F | F | T | T | F | C | C | B | B |
| Ritu | T | F | T | F | T | T | A | C | B | B |
| Kriti | T | F | F | T | F | F | B | A | B | B |
| Prerna | F | F | T | T | F | T | C | C | C | B |
| Bhim | F | F | F | F | T | F | B | A | C | B |
| Arjun | T | F | T | F | T | F | C | B | C | B |
Solution to the above exercise:

| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | Total correct |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Amit | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
| Prakash | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 9 |
| Rahul | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 10 |
| Gina | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 6 |
| Tom | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 7 |
| Ritu | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 7 |
| Kriti | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 7 |
| Prerna | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 6 |
| Bhim | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 |
| Arjun | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 |
1 in the above table indicates a correct response and 0 indicates an incorrect response.
Mean = (9 + 9 + 10 + 6 + 7 + 7 + 7 + 6 + 3 + 3) / 10 = 6.7
Median (The middle score when all scores are put in rank order) = 7
Mode (Score(s) occurring most often) = 7
Range (Low score to high score) = 3-10
Arranging the above table in descending order of total score:

| | Rahul | Amit | Prakash | Tom | Ritu | Kriti | Gina | Prerna | Bhim | Arjun |
|---|---|---|---|---|---|---|---|---|---|---|
| Q1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| Q2 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Q3 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| Q4 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| Q5 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
| Q6 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| Q7 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| Q8 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| Q9 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Q10 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Total | 10 | 9 | 9 | 7 | 7 | 7 | 6 | 6 | 3 | 3 |
Let us consider students getting a score of 7 or above as the "upper group" (6 students) and those scoring below 7 as the "lower group" (4 students).
Using the formula D = (Uc + Lc) / T to calculate item difficulty:

| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Uc | 6 | 5 | 4 | 5 | 4 | 4 | 4 | 5 | 6 | 6 |
| Lc | 1 | 4 | 2 | 2 | 2 | 2 | 0 | 1 | 0 | 4 |
| D | 70% | 90% | 60% | 70% | 60% | 60% | 40% | 60% | 60% | 100% |
Discrimination index: calculating the discrimination index using the formula d = (Uc − Lc) / (T/2):

| | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Uc | 6 | 5 | 4 | 5 | 4 | 4 | 4 | 5 | 6 | 6 |
| Lc | 1 | 4 | 2 | 2 | 2 | 2 | 0 | 1 | 0 | 4 |
| d | 1.00 | 0.20 | 0.40 | 0.60 | 0.40 | 0.40 | 0.80 | 0.80 | 1.20 | 0.40 |

Note that because the upper group (6 students) and the lower group (4 students) are of unequal size here, the T/2 formula can exceed 1 (Q9) and gives a non-zero value for Q10 even though every student answered it correctly.
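The whole item analysis can be reproduced programmatically. A sketch in Python, using the 0/1 score matrix from the solution table (rows sorted by total score, upper group first):

```python
# Score matrix from the exercise: rows are students (sorted by total score,
# highest first), columns are items Q1-Q10; 1 = correct, 0 = incorrect.
scores = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # Rahul, 10
    [1, 0, 1, 1, 1, 1, 1, 1, 1, 1],  # Amit, 9
    [1, 1, 0, 1, 1, 1, 1, 1, 1, 1],  # Prakash, 9
    [1, 1, 1, 1, 0, 0, 0, 1, 1, 1],  # Tom, 7
    [1, 1, 0, 0, 0, 1, 1, 1, 1, 1],  # Ritu, 7
    [1, 1, 1, 1, 1, 0, 0, 0, 1, 1],  # Kriti, 7
    [0, 1, 1, 1, 1, 1, 0, 0, 0, 1],  # Gina, 6
    [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],  # Prerna, 6
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 1],  # Bhim, 3
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 1],  # Arjun, 3
]
upper, lower = scores[:6], scores[6:]  # score >= 7 vs. below 7
T = len(scores)

for q in range(10):
    uc = sum(row[q] for row in upper)
    lc = sum(row[q] for row in lower)
    D = (uc + lc) / T        # difficulty index
    d = (uc - lc) / (T / 2)  # discrimination index
    print(f"Q{q + 1}: Uc={uc} Lc={lc} difficulty={D:.0%} discrimination={d:.2f}")
```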
Q3. A BPO firm wants to re-examine its recruitment strategy for tele-callers. It has some past data on the job performance of existing employees and their scores on three tests taken at the time of recruitment. Examine these scores and suggest a future recruitment strategy for the firm.

| Successful tele-caller | English grammar test | Vocabulary test | Verbal ability test |
|---|---|---|---|
| 1 | 9 | 3 | 8 |
| 1 | 10 | 3 | 7 |
| 1 | 9 | 4 | 8 |
| 0 | 4 | 5 | 4 |
| 1 | 9 | 0 | 9 |
| 0 | 5 | 9 | 4 |
| 1 | 9 | 2 | 7 |
| 0 | 8 | 9 | 3 |
| 0 | 2 | 6 | 5 |
| 1 | 7 | 3 | 10 |
| 0 | 7 | 2 | 5 |
| 0 | 6 | 0 | 2 |
| 0 | 4 | 0 | 6 |
| 1 | 8 | 10 | 8 |
| 1 | 6 | 10 | 8 |
| 1 | 8 | 0 | 7 |
| 0 | 5 | 4 | 4 |
| 1 | 10 | 7 | 9 |
| 0 | 5 | 0 | 3 |
| 1 | 8 | 0 | 10 |
| 0 | 6 | 10 | 5 |
| 0 | 5 | 5 | 4 |
| 1 | 8 | 6 | 9 |
| 0 | 3 | 4 | 5 |
| 1 | 7 | 10 | 9 |
Answer: Correlation between the construct "successful tele-caller" and the test scores measures the construct validity of the tests; a high correlation would indicate the appropriateness of a test. The correlation can be obtained using the formula:

r = (nΣXY − ΣX ΣY) / √[ (nΣX² − (ΣX)²) (nΣY² − (ΣY)²) ]

| Test | Correlation with job performance |
|---|---|
| English grammar test | 0.770359 |
| Vocabulary test | −0.00542 |
| Verbal ability test | 0.897702 |
The results show that the verbal ability test is the most valid test for measuring the performance of a tele-caller, followed by the English grammar test. The vocabulary test has no correlation with job performance and can therefore be discontinued.
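To check these figures, the three correlations can be computed with the pearson_r sketch from the correlation section; the lists below are the four columns of the data table:

```python
successful = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1,
              1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
grammar = [9, 10, 9, 4, 9, 5, 9, 8, 2, 7, 7, 6, 4, 8,
           6, 8, 5, 10, 5, 8, 6, 5, 8, 3, 7]
vocabulary = [3, 3, 4, 5, 0, 9, 2, 9, 6, 3, 2, 0, 0, 10,
              10, 0, 4, 7, 0, 0, 10, 5, 6, 4, 10]
verbal = [8, 7, 8, 4, 9, 4, 7, 3, 5, 10, 5, 2, 6, 8,
          8, 7, 4, 9, 3, 10, 5, 4, 9, 5, 9]

# pearson_r as sketched in the correlation section above
for name, test in [("English grammar", grammar),
                   ("Vocabulary", vocabulary),
                   ("Verbal ability", verbal)]:
    print(f"{name}: {pearson_r(successful, test):.3f}")
# English grammar: ~0.770, Vocabulary: ~-0.005, Verbal ability: ~0.898
```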
Tips for further study:
There are statistical measures for measuring and interpreting reliability and validity, such as Cronbach's alpha and the kappa coefficient. These can be further studied from the book titled 'Statistics for the Social Sciences' by Victoria L. Mantzopoulos, published by Prentice Hall, Englewood Cliffs, NJ (1995).
Colleges like IcfaiTech College of Engineering use the principles of standard deviation, mean and range to assess the reliability of test scores between different teachers teaching the same subject. Some colleges, like IBS Hyderabad, also use such measures extensively.
Bibliographical References:
Srivastava, H.S. Challenges in Educational Evaluation. UBS Publishers' Distributors Ltd.
Entwistle, Noel. Handbook of Educational Ideas and Practices. Routledge.
Airasian, Peter W. (2000). Assessment in the Classroom: A Concise Approach. Boston: McGraw-Hill.
Linn, Robert L. & Gronlund, Norman E. (2000). Measurement and Assessment in Teaching. Upper Saddle River, NJ: Prentice-Hall, Inc.
Wiersma, William & Jurs, Stephen G. (1985). Educational Measurement and Testing. Boston: Allyn and Bacon, Inc.
Gronlund, N.E., & Linn, R.L. (1990). Measurement and Evaluation in Teaching (6th ed.). New York: Macmillan.
Wood, D.A. (1960). Test Construction: Development and Interpretation of Achievement Tests. Columbus, OH: Charles E. Merrill Books, Inc.
Nunnally, J.C. (1972). Educational Measurement and Evaluation (2nd ed.). New York: McGraw-Hill.
Alderson, J.C., Clapham, C. & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge University Press.
Salkind, N.J. (2006). Tests & Measurement for People Who Think They Hate Tests & Measurement. Sage Publications, Inc.
Linn, R.L. & Miller, M.D. (2005). Measurement and Assessment in Teaching (9th ed.). Merrill Prentice Hall.
i. Developing the perfect test is an unattainable goal for anyone in an evaluative position. Even when guidelines for constructing fair and systematic tests are followed, a plethora of factors may enter into a student's perception of the test items. Looking at an item's difficulty and discrimination will assist the test developer in determining what is wrong with individual items. Item and test analysis provide empirical data about how individual items and whole tests are performing in real test situations.
Test designers need to meet certain requirements concerning validity, objectivity and reliability for the items and for the test itself; they also have to follow some logical procedures.
ii. Even though guidelines for interpreting the coefficient of correlation have been given, all such criteria are in some ways arbitrary and should not be observed too strictly, because the interpretation of a correlation coefficient depends on the context and purposes. A correlation of 0.9 may be very low if one is verifying a physical law using high-quality instruments, but may be regarded as very high in the social sciences, where there may be a greater contribution from complicating factors.