Understanding Standardized Speech and Language Tests: Sensitivity, Specificity, and More
Have you ever asked…
- Why are there discrepancies between language test scores?
- Why do we NOT get the same test scores/results across various language tests?
- Why do some tests produce scores higher than others?
- Who should be included in the normative sample (i.e., should children with disabilities be included)?
- How do we determine the purpose of a test, or whether it is appropriate for what we need?
This blog will help SLPs:
- Understand the accuracy of standardized tests;
- Understand sensitivity and specificity;
- Address the most common test-related questions and misconceptions.
Let’s begin by addressing some of the most common questions that SLPs ask: Why are there discrepancies in scores between language tests? Why do we NOT get the same scores across various language tests? Why do some tests produce higher scores than others?
Consider this case study example: Student A was referred for a speech and language re-evaluation. Initial testing results revealed standard scores (SS) ranging from 84 to 92 (a morphology subtest SS of 92, a listening comprehension SS of 89, a syntax subtest SS of 84, a receptive vocabulary SS of 91, an expressive vocabulary subtest SS of 85, etc.), leading to the conclusion that the student did not meet the state/district eligibility criteria for special education in the area of speech-language impairment. During the re-evaluation, the student was administered the Language Video Assessment Tool with the following testing profile: Following Directions SS: 78, Restating Information SS: 69, Morphology and Sentence Structure SS: 84, and Listening Comprehension SS: 63.
Why is there such a discrepancy in language scores?
- The tests that were used may be based on normative groups with different characteristics. The makeup of the normative group (i.e., the group the test was standardized on) significantly affects how the test functions. For example, if Test 1 includes people with disabilities in its normative sample (with the idea that this strategy represents the full population), and Test 2 includes only typically developing (neurotypical) individuals and excludes people with the target disorder from the normative sample, Test 2 will be more sensitive to the disorder, whereas Test 1 will be more likely to find the child with the disorder to be a member of the typically developing normative group.
- The tests that were used may not serve the same purpose. For example, one test might be a diagnostic test designed to identify/diagnose a disorder, whereas the other test might have been designed to identify strengths/weaknesses or rate the severity of the disorder.
- The tests do not share a similar DESIGN or measure the same skill. If we go back to our case study, the discrepancy in test scores could have occurred because Test 1 might have been designed to evaluate the severity of a disorder or identify strengths and weaknesses, while Test 2 may have been designed to evaluate the presence of a specific language impairment (SLI). Test 1’s normative sample may have included people with disabilities, meaning the test is less sensitive to the disorder and more likely to identify a child with the disorder as a member of the typically developing normative group. Test 2’s normative sample may have consisted only of typically developing (neurotypical) individuals and excluded people with the target disorder, making the test more sensitive to the disorder.
With this in mind, if the purpose of an assessment is to rate the severity level, then the test needs to include individuals with disabilities in the normative sample. If the purpose of an assessment is to identify a disorder, the test should only include typically developing children in the normative sample.
In the past, SLPs may have questioned why paying attention to the psychometric properties of a test is so important; we hope this is becoming clearer now. The psychometric properties of a test impact the children we are trying to serve because test scores affect our clinical decisions and determine the outcome of the assessment and the eligibility determination. Inaccurate test scores may deprive a child of services they require, or indicate that a student needs services when they do not.
So, let’s go back to our initial question: why are there discrepancies in scores between language tests? Why do we NOT get the same scores across various language tests? Some might blame it on the child’s attention skills or the child feeling sick that day, but what it really comes down to is reliability. After all, tests are developed to be reliable, and reliability studies are conducted to demonstrate that reliability. So, if two tests measure the same skill, there shouldn’t be significant discrepancies in their scores. When there are discrepancies, go back to “the purpose of the test” and ask yourself whether the tests were developed for the same purpose, and then look at “the design of the test” and ask yourself whether the tests measure similar skills.
What does the law tell us regarding selection of standardized tests?
The critical wording of IDEA says that “assessments and other evaluation materials used to assess a child … are used for purposes for which the assessments or measures are valid and reliable.”
What does this mean? We often hear specialists state that test XYZ is “a reliable and valid test, therefore it was used in this assessment.” But this is the first misconception we would like to address: just because a test is reliable and valid does not mean it is the right test to use with your student. Ask yourself: what is the purpose of the test, and why are you testing? Tests are reliable and valid only relative to a purpose. A test can be valid for one purpose and completely invalid for another. What IDEA actually points to is the purpose of the assessment, which means the test has to be validated for the purpose of the assessment!
Since most assessments are done for the purpose of identifying a disorder (e.g., an initial evaluation or eligibility review), the tests we use MUST be validated for the purpose of identifying the disorder the child is suspected of having. That means the tests we use must be diagnostic in nature.
How do we know if a test is valid for the purpose of identifying a disorder? How is this measured and displayed in test manuals? What interpretation do we want to draw from the test result? Since we are trying to figure out whether the student has a disability, this is a yes/no question: does the child have the disorder or not? It is not a question about strengths and weaknesses or about the severity of the disorder. The statistical tool that provides evidence that a test score interpretation is valid for the purpose of identifying a disorder is called discriminant analysis. This is the type of information we need to look for in test manuals.
There will be a distribution of scores for typically developing children and a separate distribution of scores for the clinical group. The two distributions might overlap a lot or only a little, but somewhere along them there is a point that maximally discriminates between the two groups; that point is what we call the cut score. Above the cut score, people are classified by the test as neurotypical, and below it, individuals are classified by the test as having an impairment.
So this is what we need: a test that statistically differentiates between the two groups.
Cut scores – It is very important to look at the test-specific cut-off score that tells us how the two groups are differentiated. Not all tests have the same cut-off score.
Sensitivity/specificity – Sensitivity is the number of children who truly have the disorder and are identified as impaired by the test, divided by the true number of impaired children. Specificity is the number of typically developing children correctly identified as not impaired, divided by the true number of typically developing children. Acceptable sensitivity and specificity values are .80 and above.
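To make these definitions concrete, here is a minimal sketch in Python (with made-up scores and a made-up cut-score range, not data from any real test or normative sample) of how a cut score, sensitivity, and specificity fit together:

```python
# Hypothetical standard scores (NOT from any real test or normative sample):
# one group of children independently known to have the disorder,
# and one group of typically developing peers.
clinical_scores = [62, 65, 70, 72, 74, 75, 78, 80, 83, 88]
typical_scores = [78, 82, 85, 88, 90, 92, 95, 98, 100, 105]

def sensitivity(cut_score, clinical):
    """Proportion of truly impaired children scoring below the cut score."""
    return sum(1 for s in clinical if s < cut_score) / len(clinical)

def specificity(cut_score, typical):
    """Proportion of typically developing children scoring at or above the cut score."""
    return sum(1 for s in typical if s >= cut_score) / len(typical)

# Scan candidate cut scores and keep the one that best separates the two groups
# (here, the one with the highest combined sensitivity + specificity).
best_cut = max(
    range(60, 110),
    key=lambda cut: sensitivity(cut, clinical_scores) + specificity(cut, typical_scores),
)

print("Cut score:", best_cut)
print("Sensitivity:", sensitivity(best_cut, clinical_scores))  # should be at or above .80
print("Specificity:", specificity(best_cut, typical_scores))   # should be at or above .80
```

This is essentially the kind of information a discriminant analysis section of a manual reports: the cut score that maximally discriminates between the two groups, and the sensitivity and specificity achieved at that cut score.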
Going back to our case study, why was there a discrepancy between the two sets of test results? Were the two tests validated for the purpose of identifying a speech-language impairment? How do we know if a test was validated for the purpose we intend to use it for? We need to review the manual!
When selecting an assessment, remember to ask yourself:
Why am I giving this test? What am I trying to learn from this test?
Instead of just grabbing whatever test is on your shelf, answer these questions first and review your test manuals to help you decide which test is right for you and your student!
Remember:
- First, be clear about your purpose for administering the test.
- Then, look for the test’s evidence that supports this purpose (i.e., sensitivity and specificity, which should be between 0.80 and 1.0).
- If the purpose of the assessment is rating the severity level, then the test needs to include individuals with disabilities in the normative sample.
Common misconceptions
- Standardized tests over or under-identify, therefore we shouldn’t use them!
- This test is reliable/valid, therefore I should use it with ALL my assessments!
- Standardized tests are inaccurate, they overidentify or under-identify! We should only use dynamic assessment!
- We should only use strength-based assessment and observations
Quick Questions and Answers
Why do some tests produce scores higher than others?
While some tests are built to detect severity, others are built to diagnose or differentiate between groups. If scores are higher on one of two tests, it could be because one test’s purpose is to evaluate the severity of the disorder whereas the other test’s purpose is diagnostic. Additionally, the normative sample the test was based on will impact the test’s sensitivity and specificity, so be sure to double-check that the sample makes sense for the type of test it is.
Should disordered students be included in the normative sample? Why or why not?
This depends on the purpose of the test. If the test’s purpose is to identify a disorder, then only typically developing students should be used in the normative sample. If the test’s purpose is to rate the severity of a disorder, then both typically developing and disordered students should be used in the normative sample.
Is it true that standardized tests are inaccurate, and they either overidentify or under-identify? Should we only use dynamic assessment?
Standardized tests can be inaccurate when a test is not used for the purpose it was designed for. When a test was not designed for the purpose of identifying a disorder, we cannot expect it to identify that disorder accurately.
How do we know if the test is designed to be diagnostic?
Take a look at the manual for the discriminant analysis, which is reported as sensitivity and specificity (both of which should always be above .80).
Why is it best practice/evidence-based practice to use cut scores in assessment reports?
Since identification of a disorder is a yes/no question, a line has to be drawn to differentiate between typical performance and performance that is impacted by a disability. Cut scores represent the numerical boundary between what is considered neurotypical/typical and what is impacted by the disability. True diagnostic tests should use cut scores for identification purposes and report sensitivity and specificity as a measure of the test’s accuracy. Standard scores and percentile ranks are measures of severity or ratings of a person’s performance. Identification of a disorder is not a continuum that rates a person’s performance; the disorder is either present or it isn’t.
Should we only use strength-based assessment and observation and never use standardized tests?
Standardized tests are typically used for the purpose of identifying a disorder. These tools are meant to help us answer the yes/no question as to whether or not a child has a disability, and for that we need a good, accurate, diagnostic standardized test. Once the disorder has been identified, the next step should be a strength-based assessment in order to help promote an enabling environment for the student and focus on self-advocacy, self-awareness, problem-solving, etc.
The purpose of a diagnostic evaluation is to:
- compare student performance to a group of neurotypical students in the same age-group;
- evaluate how the student functions in a neurotypical academic and social setting;
- determine eligibility;
- develop a profile of strengths and weaknesses; and
- determine or rule out a diagnosis.
The purpose of a strength-based evaluation is to:
- promote an enabling environment;
- focus on changing the environment, NOT the student;
- focus on self-esteem, autistic identity and autonomy;
- move the burden of change away from the student and foster acceptance and accommodation so that the student can integrate/participate as much as they wish; and
- focus on self-advocacy, self-awareness, problem-solving.