Reliability and Validity
C.4.4 Item and Test Analysis: Determination of Test Quality
Kanupriya Katyal, Dr. Jagrook Dawra

Abstract:
This module deals with defining and determining the quality of test instruments and test items. Tests, as instruments of evaluation, need to be accurate, objective, practical and reliable. Further, they should be able to discriminate between good and bad performers and have a uniform difficulty level. This module explains each of these terms and describes how they can be measured. It specifically touches on six measures of test quality: objectivity, practicability, reliability, validity, difficulty level and discrimination index. It also covers mathematical measures, such as mean, median, mode, standard deviation and correlation, that help in measuring test quality.
Objective:
1. To enable the reader to define the quality of a test and measure it.
   a. To understand the concepts of reliability and validity in a test.
   b. To understand the various measurements used in defining quality, such as mean, median, mode, standard deviation and correlation.

Introduction:
A test needs to evaluate and measure the performance of a candidate, a department or an institution.
Measurement is purely quantitative; when an individual's judgment is added, it becomes evaluation.
A test should measure what it is intended to measure, with considerable accuracy, and at the same time it should be able to discriminate between students of varied abilities.
Subjective judgment leads to inaccuracy and errors. These errors are the standard errors of measurement. Hence, they need to be identified and eliminated.
There are several valid reasons for analyzing questions and tests that students have completed and that have already been graded. Some of these reasons are to:
•	identify content that has not been adequately covered and should be re-taught,
•	provide feedback to students,
•	determine whether any items need to be revised in the event they are to be used again or become part of an item file or bank,
•	identify items that may not have functioned as they were intended,
•	direct the teacher's attention to individual student weaknesses.

Validity and reliability are the overarching principles that govern test design. Validity is the extent to which a test measures what it intends to measure. Reliability is the extent to which the test scores are consistent.
Reliability is a property of the test as a measuring instrument. Objectivity, practicability, difficulty level and discrimination index are further measures of test quality and are discussed in the subsequent sections.
Understanding Test Item and Test Quality
There are various forms of assessment techniques available to the examiner. They range from assessing students using a fixed-response multiple-choice test to an open-response short-answer, long-answer or essay type of exam. These exams serve a variety of purposes. The results may be used to assess a student's strengths and weaknesses or to plan further instructional activity. They may be used for selection, placement or certification. They may be used as tools for appraisal. Regardless of the objective of assessment, all assessments need to possess certain characteristics and a certain degree of quality. A test is said to be of good quality if it satisfies the following criteria:[i]
1. Objectivity (justice): Objectivity is said to be ensured when the paper setter is given a design/method to follow. The objectivity of the 'darts' exercise would depend upon how well the task is defined for the players. A test with good objectivity would define the number of attempts, the distance from which to aim, etc.
For example, teachers at several levels of education assess students' overall learning by giving them projects. Often, students are not told anything about the scope of the work. They are also unaware of what distinguishes a good project from a bad one and how they will be graded.
It has often been observed that students' learning from a project is enhanced if the scope of the project is clearly defined and the student is told clearly about specific performance characteristics arranged in levels, indicating the degree to which the standard has been met.
If a biology student is asked to maintain a journal on leaf collection, a test with good objectivity for this project might look as follows:
Appearance/Neatness
  Grade A: Extremely neat, with cover page; leaves dried and neatly pasted
  Grade B: Neat, with cover page; leaves dried and pasted
  Grade C: Neat, with no cover page; leaves not dried and pasted
  Grade D: Untidy; no cover page and leaves not dried

Organization
  Grade A: Well organized and categorized/catalogued
  Grade B: Organized and categorized/catalogued, with some errors
  Grade C: Organized and categorized/catalogued, with a lot of errors
  Grade D: Disorganized, with no cataloguing

Information and understanding
  Grade A: Both common name and scientific name given; information about species/genus/family given and accurate
  Grade B: Both common name and scientific name given; information about species/genus/family given, with some errors
  Grade C: Both common name and scientific name given; information about species/genus/family given, with a lot of errors
  Grade D: Such information is missing

Objectivity needs to be maintained not only for the test but also for test items.
2. Practicability (usability): All test instruments should be easily usable and have simple and clear instructions for administration. For example, an online test may not be practical in remote areas where internet connectivity is poor; a paper-based test would probably be more appropriate.

3. Reliability (dependability): A test instrument is said to be reliable if it produces the same result every time: it is the consistency of measurement. A measure is considered reliable if a person's score on the same test given twice is similar. The ability of a player to consistently hit around the bull's eye is a measure of his reliability.
There are several ways by which reliability is generally measured: test-retest, alternate form, split half, internal consistency (inter-item) and inter-rater.
a. Test-retest: This is the most conservative method of estimating reliability. The scores from repeated administrations of the same test to the same participants are compared; the test instrument remains the same. A reliable test would produce very similar scores. Simply put, the idea behind test-retest is that you should get the same score on test 1 as you do on test 2.
For example, IQ tests typically show high test-retest reliability.
The reliability of a weighing scale in a physics experiment can be tested by recording a weight 3 to 4 times at intervals of a few minutes.
Test-retest reliability is a measure of stability. A rough sketch of this estimate is given below.
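A minimal sketch of the idea, assuming ten students sat the same test twice; the score lists are invented for illustration (statistics.correlation requires Python 3.10+):

```python
# A minimal sketch of a test-retest reliability estimate: correlate the scores
# from two administrations of the same test. The data below are hypothetical.
from statistics import correlation  # Python 3.10+

test_1 = [62, 71, 55, 80, 67, 74, 59, 88, 70, 65]  # first administration
test_2 = [60, 73, 57, 78, 69, 72, 61, 85, 72, 63]  # second administration, same students

# A Pearson correlation close to 1 indicates a stable (reliable) instrument.
print(f"test-retest reliability estimate: {correlation(test_1, test_2):.2f}")
```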

b. Alternate form reliability: When participants are able to recall their previous responses, test-retest procedures fail. In such cases, alternate form reliability is used. As the name suggests, two or more versions of the test are constructed that are equivalent in content and difficulty.
For example, the marks in a pre-board test should be consistent with the board exam if there is no change in the underlying conditions between the two.
Teachers also use this technique to create replacement exams for students who have for some reason missed the main exam.
Alternate form reliability is a measure of equivalence.

c. Split half reliability: This method compares scores from different parts of the test, such as comparing the scores from even- vs. odd-numbered questions, as sketched below.
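A minimal sketch, assuming a hypothetical 0/1 item-response matrix (rows = students, 1 = correct). The Spearman-Brown correction at the end is the standard adjustment for estimating full-length-test reliability from a half-test correlation:

```python
# A minimal sketch of split-half reliability: correlate each student's score on the
# odd-numbered items with their score on the even-numbered items. Data are hypothetical.
from statistics import correlation  # Python 3.10+

items = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 1],
]

odd_half = [sum(row[0::2]) for row in items]   # items 1, 3, 5, 7
even_half = [sum(row[1::2]) for row in items]  # items 2, 4, 6, 8

r_half = correlation(odd_half, even_half)
# Spearman-Brown correction: reliability of the full-length test from the half-test r.
r_full = 2 * r_half / (1 + r_half)
print(f"half-test r = {r_half:.2f}, corrected full-test reliability = {r_full:.2f}")
```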

d. Internal consistency (inter-item) reliability: This estimates reliability by grouping questions in a questionnaire that measure the same concept. For example, you could write two sets of three questions that measure the same concept and, after collecting the responses, run a correlation between those two groups of three questions to determine whether your instrument is reliably measuring that concept.
For example:
Vocabulary could be tested using synonyms, antonyms, sentence completion or analogies.
Understanding of Newton's laws can be tested by asking the student to state the laws or by giving him a numerical problem based on these laws.
Inter-item reliability is the degree to which different items measuring the same variable attain consistent results. Scores on different items designed to measure the same construct should be highly correlated.

e. Inter-rater reliability: Scorer reliability needs to be measured when observers use their judgment for interpretation.
For example, when analyzing live or videotaped behavior and written answers to open-ended essay-type questions, different observers score the same responses. There are often more than two judges scoring the performance of gymnasts in a sporting event, and more than one teacher is usually present during the viva-voce examination of a student.
A high correlation between the scores given by different judges to the gymnasts, or by different teachers to the students, indicates high inter-rater reliability.

4. Validity (accuracy): A test instrument should accurately measure what it is designed to test. It is the strength of our conclusions. Most tests are designed to measure hypothetical constructs like intelligence or learning, which the examiner needs to operationalize. A valid test will measure this construct (learning) without being influenced by other factors (e.g. the student's motivation level). It answers the examiner's question, "Was I right in giving the student this test/test item?" In the darts example (see figure 1 below), if the player is able to hit the bull's eye correctly, he is valid. So, he is valid in cases A and B (though he is less reliable in B). For example, in a test intended to examine a student's understanding of Wordsworth's literary contribution, a question can be asked in the following ways:


•	Summarize Wordsworth's poem 'Daffodils'.
•	Critically evaluate Wordsworth's poem 'Daffodils'.

The first question tests the student's memory and not his/her understanding of 'Daffodils'.
Validity is also of different types:
a. Face Validity: The test looks to be a good one; it is what teachers and students think of the test. Is it a reasonable way of assessing students? Is it too simple? Or is it too difficult? Face validity is the consensus of experts (generally) that a measure represents a concept. It is the least stringent type of validity.

b. Construct Validity: A construct is an individual characteristic that we assume exists in order to explain some aspect of behavior. Whenever we wish to interpret assessment results in terms of some individual characteristic (e.g. reading comprehension, mathematics problem-solving ability), we are concerned with a construct.
Some other examples of constructs are reasoning ability, understanding of the principles of electricity, intelligence, creativity, and personality characteristics like sociability, honesty and anxiety. Constructs are often difficult to define. They are often generated from some theoretical position that the examiner assumes. For example, one examiner's model of a successful salesperson may propose that an aggressive person is likely to be a successful salesperson, whereas another examiner might opine that aggressiveness is a negative trait and that a salesperson should rather be assertive.
Construct validity measures whether the test is accurately measuring a particular construct. For example, an examiner constructs a SALESPERSON scale with questions testing both aggressive and assertive behavior and administers it to certain salespeople whose performance is known. Items that have a high correlation with the performance of a salesperson indicate high construct validity, while those with low correlation indicate low construct validity.
When measuring a student's understanding of the principles of thermodynamics, if the examiner examines the adequacy (or inadequacy) of the answer, he would measure the construct appropriately. But if the examiner also examines the student on grammar, neatness, etc., the construct is not being measured appropriately.


c. Content Validity: Content validity is the property of a test such that the test items sample the universe of items for which the test is designed. It helps us understand whether a sample of items truly represents the entire universe of items for a particular topic. For example, a teacher gives her students a list of 200 words and would like to know whether they have learnt to spell them correctly. She may choose a sample of, say, 20 words for a small test. We would like to know how representative these 20 words are of the entire list, so that we can generalize that a student who spells 80% of these 20 words correctly would be able to spell 80% of the entire list correctly.

d. Criterion Validity: Criterion validity assesses whether a test reflects a set of abilities in a current or a future setting, as measured by some other test. It is of two types: predictive (future) and concurrent (present) validity.
•	Predictive validity: The test accurately predicts performance in some subsequent situation. For example, candidates were selected for a certain job by interviewing them. If the selected candidates also perform well in their jobs, then the test method (the interview) has good predictive validity.
•	Concurrent validity: The test gives similar results to existing tests that have already been validated. For example, assume that the interview as a method has already been validated as a good indicator of employee performance. A written technical exam has high concurrent validity if it gives similar results.
For example, reading readiness test scores might be used to predict students' future achievement in reading, or a test of dictionary skills might be used to estimate students' current skill in the actual use of a dictionary.
Difference between Reliability and Validity: Assume that there are some individuals playing darts. The success of their skill is based on how close to the bull's eye they can hit consistently. Let there be four players, A, B, C and D, whose results are given in figure 1. It can be said from the figure that Player A is both valid and reliable: Player A not only achieves the desired result (valid) but also does it consistently (reliable).

[Figure 1: dart-board results for four players. A: reliable and valid; B: valid but not reliable; C: reliable but not valid; D: neither reliable nor valid.]

5. Difficulty level: A question paper or any test instrument is generally administered to a group of about the same age and in the same grade/standard. Thus, the test instrument must be made to a difficulty level suitable for the group. Item difficulty is simply the percentage of students taking the test who answered the item correctly. The larger the percentage getting an item right, the easier the item: the higher the difficulty index, the easier the item is understood to be.
For example, in the questions below, which item is more difficult?
a. Who was AO Hume?
b. Who was Mahatma Gandhi?
It is relatively easier to recognize the individual in the second question than in the first.
Also, an English test item that is very difficult for an elementary student will be very easy for a high school student.
The difficulty index tells us how difficult the item is, i.e. how many people got that item correct. It is calculated as follows:

$D = \frac{U_c + L_c}{T}$

where $U_c$ is the number of people in the upper group who answered the item correctly, $L_c$ is the number of people in the lower group who answered the item correctly, and $T$ is the total number of responses to the item.
For example, in a class, if out of the top 10 students 9 gave a correct response to the question "Who is the president of India?" and out of the bottom 10 students only 4 gave a correct response, the difficulty level of the question would be:

$D = \frac{9 + 4}{20} = 0.65 = 65\%$

This means that 65% of the students answered the question correctly.
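The calculation is simple enough to script. A minimal sketch of the difficulty index defined above, reusing the worked example's numbers:

```python
# A minimal sketch of the difficulty index D = (Uc + Lc) / T described above.
def difficulty_index(upper_correct: int, lower_correct: int, total_responses: int) -> float:
    """Fraction of all examinees (upper + lower groups) who answered the item correctly."""
    return (upper_correct + lower_correct) / total_responses

# Worked example: 9 of the top 10 and 4 of the bottom 10 answered correctly.
print(difficulty_index(9, 4, 20))  # 0.65, i.e. 65% answered correctly
```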
6. Discrimination Value: Even though a test instrument must be suited to a homogeneous group, it should still be able to distinguish between the different ability levels of the individuals being tested. The darts test should be able to discriminate between a novice, an amateur and an expert.
A good item discriminates between those who do well on the test and those who do poorly. The item discrimination index d can be computed to determine the discriminating power of an item. If a test is given to a large group of people, the discriminating power of an item can be measured by comparing the number of people with high test scores who answered that item correctly with the number of people with low scores who answered the same item correctly. If a particular item is doing a good job of discriminating between those who score high and those who score low, more people in the top-scoring group will have answered the item correctly. The discrimination index d is given by:

$d = \frac{U_c - L_c}{T/2}$

where $U_c$ is the number of people in the upper group who answered the item correctly, $L_c$ is the number of people in the lower group who answered the item correctly, $U$ and $L$ are the number of people in the upper and lower groups respectively, and $T = U + L$ is the total number of responses to the item.
For example, if 15 out of 20 persons in the upper group answered a particular question correctly and 5 out of 30 people in the lower group answered the same question correctly, then:

$d = \frac{15 - 5}{(20 + 30)/2} = \frac{10}{25} = 0.4$

The higher the discrimination index, the better the item, because such a value indicates that the item discriminates in favor of the upper group, which should get more items correct.


An item that everyone gets correct or that everyone gets incorrect will have a discrimination index equal to zero.
When more students in the lower group than in the upper group select the right answer to an item, the item actually has a negative discrimination index.
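A minimal sketch of the discrimination index defined above, reusing the worked example's numbers:

```python
# A minimal sketch of the discrimination index d = (Uc - Lc) / (T / 2) described above.
def discrimination_index(upper_correct: int, lower_correct: int, total_responses: int) -> float:
    """Positive values favour the upper group; 0 means no discrimination."""
    return (upper_correct - lower_correct) / (total_responses / 2)

# Worked example: 15 of 20 upper-group and 5 of 30 lower-group examinees correct.
print(discrimination_index(15, 5, 50))  # 0.4
```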
While it is important to analyze the performance of individual test items (reliability, difficulty level, discrimination value, etc.), it is also important to analyze the overall performance of the complete test or its subsections. These criteria are measured using certain statistical measures, primarily measures of central tendency (mean, median, mode) and of dispersion (standard deviation). The mean, median and mode show how the test scores cluster together, and the standard deviation shows how widely the scores are spread out.
Mean (also called average): For a data set, the mean is the sum of the observations divided by the number of observations.

$\text{Mean} = \frac{1}{n}\sum_{i=1}^{n} x_i$

For example, the arithmetic mean of 34, 27, 45, 55, 22, 34 (six values) is (34 + 27 + 45 + 55 + 22 + 34)/6 = 217/6 ≈ 36.167.
Median is described as the number separating the higher half of a data set from the lower half.
For example, consider the dataset {1, 2, 2, 2, 3, 9}. The median is 2 in this case.
Mode is the value that occurs the most frequently in a data set.
For example, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.
Standard deviation of a data set is a measure of the spread of its values. It is a measure of dispersion that takes every test score into account. Simply put, it is the average amount by which each student's score deviates (differs) from the mean of the class. The standard deviation is usually denoted by the letter σ:

$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$

For example, the standard deviation of 34, 27, 45, 55, 22, 34 (six values) is 11.01 by this (population) formula; the sample standard deviation, which divides by n − 1 instead of n, is 12.06.
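These descriptive measures can be checked with Python's standard statistics module; note that pstdev implements the population formula shown above, while stdev divides by n − 1:

```python
# Verifying the worked examples above with the standard library.
import statistics

scores = [34, 27, 45, 55, 22, 34]

print(statistics.mean(scores))    # 36.1666... (the arithmetic mean)
print(statistics.median(scores))  # 34.0
print(statistics.mode(scores))    # 34 (occurs twice)
print(statistics.pstdev(scores))  # ~11.01 (population formula, dividing by n)
print(statistics.stdev(scores))   # ~12.06 (sample formula, dividing by n - 1)
```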
These measures of central tendency and dispersion show how appropriately a test has been designed for its intended purpose. They help the examiner determine the level of difficulty required and how well students of different levels can be differentiated. If the test results are skewed, with marks clustering towards the top or towards the bottom, the examiner may conclude that the test is too easy or too difficult for the students.

Correlation: This concept lays the foundation for most concepts of test analysis. It tells the examiner the extent to which two or more sets of results agree with each other.
For example:
Case 1: The results of two tests for the same set of students yielded the following results.

Student   Test 1 Rank   Test 2 Rank
A         1             1
B         2             2
C         3             3
D         4             4
E         5             5

This shows that the students ranked identically on the two tests, that is, all ranks are the same for both tests. This is a perfect positive correlation, or a correlation of +1.
Case 2: If the results of two tests for the same set of students yielded the following results:

Student   Test 1 Rank   Test 3 Rank
A         1             5
B         2             4
C         3             3
D         4             2
E         5             1

Here the ranks are as different from each other as it is possible to be. The student who was ranked 1 in the first test was ranked last in the second test, and vice versa. This is a perfect negative correlation, or a correlation of -1.
Case 3: If the results of two tests for the same set of students yielded the following results:

Student   Test 1 Rank   Test 4 Rank
A         1             3
B         2             2
C         3             4
D         4             5
E         5             1

[Scatter plot of Test 1 Rank against Test 4 Rank, both axes running from 0 to 6.]

The graph shows no visible pattern between the Test 1 ranks and the Test 4 ranks. Hence it can be said that there is no correlation.
However, in most situations there will be some amount of association, and to measure this association, whether positive or negative, the coefficient of correlation is used. The following table may be used as a basis for interpreting the coefficient of correlation:[ii]

Correlation   Small          Medium         Large
Negative      -0.3 to -0.1   -0.5 to -0.3   -1.0 to -0.5
Positive      0.1 to 0.3     0.3 to 0.5     0.5 to 1.0

The formula for calculating this coefficient is:

$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left(n\sum X^2 - (\sum X)^2\right)\left(n\sum Y^2 - (\sum Y)^2\right)}}$

Points to remember:
•	A good test satisfies the criteria of objectivity, practicability, reliability, validity, difficulty level and discriminatory power.
•	Objectivity is said to be ensured when the paper setter is given a design/method to follow.
•	All test instruments should be easily usable and have simple and clear instructions for administration.
•	A test instrument is said to be reliable if it produces the same result every time.
•	A test instrument should accurately measure what it is designed to test.
•	The test instrument must be made to a difficulty level suitable for the group.
•	A test item should be able to distinguish between the different ability levels of the individuals being tested.

Exercises
Q1. A vocabulary test was conducted with persons from various age groups. Determine for the testing authority whether there is any relationship between age and the marks obtained, where x = age of person and y = marks obtained.

x      y       x²      y²          xy
9      28.4    81      806.56      255.6
15     29.3    225     858.49      439.5
24     37.6    576     1413.76     902.4
30     36.2    900     1310.44     1086
38     36.5    1444    1332.25     1387
46     35.3    2116    1246.09     1623.8
53     36.2    2809    1310.44     1918.6
60     44.1    3600    1944.81     2646
64     44.8    4096    2007.04     2867.2
76     47.2    5776    2227.84     3587.2

Totals: Σx = 415, Σy = 375.6, Σx² = 21623, Σy² = 14457.72, Σxy = 16713.3

Using
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - (\sum x)^2\right)\left(n\sum y^2 - (\sum y)^2\right)}}$

r = (10 × 16713.3 − 415 × 375.6) / √{(10 × 21623 − 415²)(10 × 14457.72 − 375.6²)} = 11259 / √(44005 × 3501.84) = 11259 / 12413.6 ≈ 0.91

Thus the correlation coefficient is 0.91, which is very large. Hence the testing authority can assume that there is a strong positive correlation between the age of a person and the test score obtained.
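The computation can be checked directly (statistics.correlation requires Python 3.10+):

```python
# Verifying the Q1 result with the standard library.
from statistics import correlation

age = [9, 15, 24, 30, 38, 46, 53, 60, 64, 76]
marks = [28.4, 29.3, 37.6, 36.2, 36.5, 35.3, 36.2, 44.1, 44.8, 47.2]
print(round(correlation(age, marks), 2))  # 0.91
```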
Q2. Using the test information given below, determine the range, mean and median of the scores, and the item difficulty and item discrimination indices of the questions.
There are 6 true-false questions (1-6) and 4 multiple-choice questions (7-10).


          Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9   Q10
Correct   T    F    F    T    F    T    A    C    B    B
Amit      T    T    F    T    F    T    A    C    B    B
Prakash   T    F    T    T    F    T    A    C    B    B
Rahul     T    F    F    T    F    T    A    C    B    B
Gina      F    F    F    T    F    T    B    A    C    B
Tom       T    F    F    T    T    F    C    C    B    B
Ritu      T    F    T    F    T    T    A    C    B    B
Kriti     T    F    F    T    F    F    B    A    B    B
Prerna    F    F    T    T    F    T    C    C    C    B
Bhim      F    F    F    F    T    F    B    A    C    B
Arjun     T    F    T    F    T    F    C    B    C    B

Solution to the above exercise

          Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9   Q10   Total
Amit      1    0    1    1    1    1    1    1    1    1     9
Prakash   1    1    0    1    1    1    1    1    1    1     9
Rahul     1    1    1    1    1    1    1    1    1    1     10
Gina      0    1    1    1    1    1    0    0    0    1     6
Tom       1    1    1    1    0    0    0    1    1    1     7
Ritu      1    1    0    0    0    1    1    1    1    1     7
Kriti     1    1    1    1    1    0    0    0    1    1     7
Prerna    0    1    0    1    1    1    0    1    0    1     6
Bhim      0    1    1    0    0    0    0    0    0    1     3
Arjun     1    1    0    0    0    0    0    0    0    1     3

A 1 in the above table indicates a correct response and a 0 indicates an incorrect response.

Mean = (9 + 9 + 10 + 6 + 7 + 7 + 7 + 6 + 3 + 3) / 10 = 6.7
Median (the middle score when all scores are put in rank order) = 7
Mode (the score occurring most often) = 7
Range (low score to high score) = 3 to 10
Arranging the above table in descending order of total score:

          Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9   Q10   Total
Rahul     1    1    1    1    1    1    1    1    1    1     10
Amit      1    0    1    1    1    1    1    1    1    1     9
Prakash   1    1    0    1    1    1    1    1    1    1     9
Tom       1    1    1    1    0    0    0    1    1    1     7
Ritu      1    1    0    0    0    1    1    1    1    1     7
Kriti     1    1    1    1    1    0    0    0    1    1     7
Gina      0    1    1    1    1    1    0    0    0    1     6
Prerna    0    1    0    1    1    1    0    1    0    1     6
Bhim      0    1    1    0    0    0    0    0    0    1     3
Arjun     1    1    0    0    0    0    0    0    0    1     3

Let us consider students scoring 7 or above as the "upper group" and those scoring below 7 as the "lower group".

Using the formula $D = \frac{U_c + L_c}{T}$ to calculate item difficulty (here T = 10):

      Uc   Lc   D
Q1    6    1    70%
Q2    5    4    90%
Q3    4    2    60%
Q4    5    2    70%
Q5    4    2    60%
Q6    4    2    60%
Q7    4    0    40%
Q8    5    1    60%
Q9    6    0    60%
Q10   6    4    100%

Calculating the discrimination index using the formula $d = \frac{U_c - L_c}{T/2}$ (here T/2 = 5):

      Uc   Lc   d
Q1    6    1    1.00
Q2    5    4    0.20
Q3    4    2    0.40
Q4    5    2    0.60
Q5    4    2    0.40
Q6    4    2    0.40
Q7    4    0    0.80
Q8    5    1    0.80
Q9    6    0    1.20
Q10   6    4    0.40

(Note that d can exceed 1 when the upper and lower groups are of unequal size, as with Q9 here.)
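The whole item analysis can be reproduced from the 0/1 score matrix above; a minimal sketch:

```python
# Item analysis for Q2. `scores` is the 0/1 matrix from the solution (rows = students
# in descending total-score order; the first six rows form the upper group, the last
# four the lower group).
scores = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # Rahul   (10)
    [1, 0, 1, 1, 1, 1, 1, 1, 1, 1],  # Amit    (9)
    [1, 1, 0, 1, 1, 1, 1, 1, 1, 1],  # Prakash (9)
    [1, 1, 1, 1, 0, 0, 0, 1, 1, 1],  # Tom     (7)
    [1, 1, 0, 0, 0, 1, 1, 1, 1, 1],  # Ritu    (7)
    [1, 1, 1, 1, 1, 0, 0, 0, 1, 1],  # Kriti   (7)
    [0, 1, 1, 1, 1, 1, 0, 0, 0, 1],  # Gina    (6)
    [0, 1, 0, 1, 1, 1, 0, 1, 0, 1],  # Prerna  (6)
    [0, 1, 1, 0, 0, 0, 0, 0, 0, 1],  # Bhim    (3)
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 1],  # Arjun   (3)
]

upper, lower = scores[:6], scores[6:]
T = len(scores)

for q in range(10):
    uc = sum(row[q] for row in upper)
    lc = sum(row[q] for row in lower)
    difficulty = (uc + lc) / T            # D = (Uc + Lc) / T
    discrimination = (uc - lc) / (T / 2)  # d = (Uc - Lc) / (T / 2)
    print(f"Q{q + 1}: D = {difficulty:.0%}, d = {discrimination:.2f}")
```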

Q3. A BPO firm wants to re-examine its recruitment strategy for tele-callers. It has past data on the job performance of existing employees and their scores on three tests taken at the time of recruitment. Examine these scores and suggest a future recruitment strategy for the firm. In the table below, 1 indicates a successful tele-caller and 0 an unsuccessful one.

Successful    English        Vocabulary   Verbal
tele-caller   grammar test   test         ability test
1             9              3            8
1             10             3            7
1             9              4            8
0             4              5            4
1             9              0            9
0             5              9            4
1             9              2            7
0             8              9            3
0             2              6            5
1             7              3            10
0             7              2            5
0             6              0            2
0             4              0            6
1             8              10           8
1             6              10           8
1             8              0            7
0             5              4            4
1             10             7            9
0             5              0            3
1             8              0            10
0             6              10           5
0             5              5            4
1             8              6            9
0             3              4            5
1             7              10           9

Answer:
The correlation between the construct "successful tele-caller" and the test scores measures the construct validity of each test. A high correlation indicates the appropriateness of the test. The correlation can be obtained using the formula:

$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left(n\sum X^2 - (\sum X)^2\right)\left(n\sum Y^2 - (\sum Y)^2\right)}}$

English grammar test: 0.770359
Vocabulary test: -0.00542
Verbal ability test: 0.897702

The results show that the verbal ability test is the most valid test for measuring the performance of a tele-caller, followed by the English grammar test. The vocabulary test has almost no correlation with job performance and can therefore be discontinued.
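These three coefficients can be reproduced from the data table above (statistics.correlation requires Python 3.10+):

```python
# Reproducing the Q3 validity correlations from the data table.
from statistics import correlation

successful = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
grammar    = [9, 10, 9, 4, 9, 5, 9, 8, 2, 7, 7, 6, 4, 8, 6, 8, 5, 10, 5, 8, 6, 5, 8, 3, 7]
vocabulary = [3, 3, 4, 5, 0, 9, 2, 9, 6, 3, 2, 0, 0, 10, 10, 0, 4, 7, 0, 0, 10, 5, 6, 4, 10]
verbal     = [8, 7, 8, 4, 9, 4, 7, 3, 5, 10, 5, 2, 6, 8, 8, 7, 4, 9, 3, 10, 5, 4, 9, 5, 9]

for name, test in [("English grammar", grammar), ("Vocabulary", vocabulary), ("Verbal ability", verbal)]:
    print(f"{name}: {correlation(successful, test):.6f}")
```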
Tips for further study:
There are further statistical measures for quantifying and interpreting reliability and validity, such as Cronbach's alpha and the kappa coefficient. These can be studied from the book 'Statistics for the Social Sciences' by Victoria L. Mantzopoulos, published by Prentice Hall, Englewood Cliffs, NJ (1995). A brief sketch of Cronbach's alpha follows.
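A minimal sketch of Cronbach's alpha on a hypothetical 0/1 item matrix, using the standard formula α = k/(k−1) · (1 − Σσ²ᵢ / σ²ₜ), where σ²ᵢ are item variances and σ²ₜ is the variance of total scores:

```python
# A minimal sketch of Cronbach's alpha. The 0/1 item matrix is hypothetical
# (rows = students, columns = items, 1 = correct).
from statistics import pvariance

items = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]

k = len(items[0])  # number of items
item_vars = [pvariance([row[i] for row in items]) for i in range(k)]
total_var = pvariance([sum(row) for row in items])

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")
```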


Colleges like IcfaiTech College of Engineering make use of the principles of standard deviation, mean and range to assess the reliability of test scores between different teachers teaching the same subject. Some colleges, like IBS, Hyderabad, also use such measures extensively.
Bibliographical References:
•	Srivastava, H.S. Challenges in Education Evaluation. UBS Publishers' Distributors Ltd.
•	Entwistle, Noel. Handbook of Educational Ideas and Practices. Routledge.
•	Airasian, Peter W. (2000). Assessment in the Classroom: A Concise Approach. Boston: McGraw-Hill.
•	Linn, Robert L. & Gronlund, Norman E. (2000). Measurement and Assessment in Teaching. Upper Saddle River, NJ: Prentice-Hall, Inc.
•	Wiersma, William & Jurs, Stephen G. (1985). Educational Measurement and Testing. Boston: Allyn and Bacon, Inc.
•	Gronlund, N.E. & Linn, R.L. (1990). Measurement and Evaluation in Teaching (6th ed.). New York: Macmillan.
•	Wood, D.A. (1960). Test Construction: Development and Interpretation of Achievement Tests. Columbus, OH: Charles E. Merrill Books, Inc.
•	Nunnally, J.C. (1972). Educational Measurement and Evaluation (2nd ed.). New York: McGraw-Hill.
•	Alderson, J.C., Clapham, C. & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge University Press.
•	Salkind, N.J. (2006). Tests & Measurement for People Who Think They Hate Tests & Measurement. Sage Publications, Inc.
•	Linn, R.L. & Miller, M.D. (2005). Measurement and Assessment in Teaching (9th ed.). Merrill Prentice Hall.

[i] Developing the perfect test is an unattainable goal for anyone in an evaluative position. Even when guidelines for constructing fair and systematic tests are followed, a plethora of factors may enter into a student's perception of the test items. Looking at an item's difficulty and discrimination will assist the test developer in determining what is wrong with individual items. Item and test analysis provide empirical data about how individual items and whole tests are performing in real test situations. Test designers need to meet certain requirements concerning validity, objectivity and reliability for the items and for the test itself; they also have to follow some logical procedures.

[ii] Even though guidelines for interpreting the coefficient of correlation have been given, all such criteria are in some ways arbitrary and should not be observed too strictly, because the interpretation of a correlation coefficient depends on the context and purpose. A correlation of 0.9 may be very low if one is verifying a physical law using high-quality instruments, but may be regarded as very high in the social sciences, where there may be a greater contribution from complicating factors.

