A good intelligence test must be valid, reliable, and standardized.
Validity refers to how accurately the test captures what it attempts to measure. For intelligence tests, that is "intelligence". For example, a test that measures only language proficiency cannot be considered an intelligence test, because not everyone proficient in a certain language is "intelligent". Similarly, a test measuring mathematical ability should not rely on instructions written in cryptic English. Validity can be established in two ways. First, the test should contain a representative sample of items across the entire domain of intelligence (i.e., not just mathematical abilities, but verbal skills as well). This is where the Wechsler scales seem to fare better than the Stanford-Binet test. Second, the results should match an external criterion. Common external criteria are educational achievement, career success, and wealth; that is, intelligent people are often achievers, whether in school, work, or finances.
Reliability refers to the stability and consistency of the scores an intelligence test produces. For example, Peter took a random 50% sample of an intelligence test in his first year and answered 75% of the items correctly. He then retook the test every year, and the results were surprisingly inconsistent: he correctly answered 90% of the items in his second year, 40% in his third year, and 60% in his fourth year. Meanwhile, Annie took the intelligence test every month in her first year, and her results seemed to make no sense. When results vary significantly from one retake to the next, the test loses its ability to predict what it attempts to measure.
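The inconsistency in Peter's scores can be put into numbers. The Python sketch below is only an illustration: it takes Peter's four yearly scores from the example (75, 90, 40, 60) and reports how widely they spread, then computes a test-retest correlation for a group of test takers. The group scores and the helper names score_spread and pearson_r are hypothetical, made up for this sketch, and a Pearson correlation between two sittings is one common way, not the only way, to express test-retest reliability.

```python
from statistics import mean, pstdev


def score_spread(scores):
    """Return (range, population standard deviation) of repeated scores."""
    return max(scores) - min(scores), pstdev(scores)


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Peter's percentage of correct answers in each of his four sittings.
peter = [75, 90, 40, 60]
spread, sd = score_spread(peter)
print(f"Peter: range = {spread} points, standard deviation = {sd:.1f}")

# Hypothetical scores of five people who took the same test twice.
first_sitting = [75, 62, 88, 51, 70]
second_sitting = [78, 60, 85, 55, 72]
r = pearson_r(first_sitting, second_sitting)
print(f"Test-retest correlation for the group: r = {r:.2f}")
```

A gap of 50 percentage points between Peter's best and worst sittings is the kind of instability described above. For the hypothetical group, a correlation close to 1 would mean that people keep roughly the same rank from one sitting to the next, which is what a reliable test is expected to show.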
Standardization refers to uniformity in administering and scoring the test. An intelligence test does not consist only of the test items; it also includes the process by which the test is given and interpreted. For example, if the test requires an interview, all the interviewers