How to Understand and Pick Standardized Speech and Language Tests
by Charlotte Granitto, MS, CCC-SLP and Adriana Lavi, PhD, CCC-SLP
Normative Sample, Purpose of Assessment, Discriminant Analysis and Cut-Scores, Specificity and Sensitivity, Validity, Reliability, Response Bias, Types of Response Biases
As SLPs, we are often asked to assess students for a variety of reasons. For example, we may be asked to screen individuals to find those who are most likely to have a disorder, identify individuals who actually have a disorder, establish skill level relative to peers, document profiles across behavioral domains, and document change over time. Additionally, our assessments may be used to establish skill levels relative to a criterion, select methods for effective remediation, or to select specific targets of remediation. No matter what the reason for the assessment, it is crucial that we first ask ourselves a few important questions before we select our assessment tools.
First, it is important that we consider what potential interpretations we hope to draw from the test. We can ask ourselves what are we trying to get out of this test and what do we hope to learn from it? Next, we want to review what evidence is available to support our intended purpose and interpretations. Lastly, we want to think about how confident we can be in our interpretations. In order to find the answers to our questions, we can turn to the technical manual of each test and review the psychometric properties.
One way we can tell whether an assessment is a strong test is whether it includes adequate norms. Norm-referenced testing is a method of evaluation in which an individual’s scores on a specific test are compared to the scores of a group of test-takers (e.g., age norms) (AERA, APA, and NCME, 2014). Previous research has suggested that utilizing a normative sample can be beneficial in the identification of a disability. Additionally, research has suggested that the inclusion of children with disabilities in the normative sample may negatively impact the test’s ability to differentiate between children with disorders and children who are typically developing (Peña, Spaulding, & Plante, 2006). When reviewing a test’s normative sample, it is important to consider size, gender, race and ethnicity, age, geographic location, and whether individuals with disabilities were included in the normative sample.
Purpose of Assessment
When we consider evidence-based practice in relation to assessment tools, it is also important to remember that tests are only reliable and valid relative to a purpose. For example, a test may be valid for one purpose and invalid for another purpose. IDEA states, “(A)(iii) Assessments and other evaluation materials used to assess a child under this section are used for the purposes for which the assessments or measures are valid and reliable.”
Discriminant Analysis and Cut-Scores
In the past, it was believed that when the purpose of an assessment was to identify an individual with a disorder, we would expect the student to score on the lower end of the distribution. However, this is not always the case, and oftentimes students do not score in the lower half of the normal distribution. When we consider the normal distribution, it is skill level that exists on a continuum, not whether or not a disorder is present. If the purpose of the assessment is to identify whether a disorder is present, we must look at discriminant analysis, which compares a distribution of scores for “typically developing” individuals with a distribution of scores for “impaired” individuals. Somewhere along these two distributions there should be a point that maximally discriminates between the two groups, and this is known as the cut score. Above the cut score, individuals are classified as typically developing, and below the cut score, individuals are classified as impaired. Depending on the test, the two distributions may differ in how much or how little they overlap, but there will always be a point where maximal discrimination takes place. Cut scores are test-specific: a cut score derived for one test will only work with that specific test.
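The idea of a point of maximal discrimination can be sketched in a few lines of Python. This is only an illustration, not the procedure used by any particular test: the scores are invented, the function name best_cut_score is hypothetical, and the criterion used here (the largest value of sensitivity + specificity - 1, often called Youden's J) is just one common way to define "maximal discrimination." Scores at or above the cut are assumed to count as typically developing.

```python
def best_cut_score(typical_scores, impaired_scores):
    """Return (cut, sensitivity, specificity) for the candidate cut score
    that maximizes sensitivity + specificity - 1 (Youden's J).

    Assumption: a score below the cut is classified as impaired,
    and a score at or above the cut is classified as typical.
    """
    candidates = sorted(set(typical_scores) | set(impaired_scores))
    best = None
    for cut in candidates:
        # Share of impaired examinees correctly classified as impaired
        sens = sum(s < cut for s in impaired_scores) / len(impaired_scores)
        # Share of typical examinees correctly classified as typical
        spec = sum(s >= cut for s in typical_scores) / len(typical_scores)
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, cut, sens, spec)
    return best[1], best[2], best[3]


# Invented standard scores for illustration only
typical = [85, 90, 95, 100, 105, 110, 115]
impaired = [65, 70, 75, 80, 85, 90]

cut, sens, spec = best_cut_score(typical, impaired)
print(f"cut score: {cut}, sensitivity: {sens:.2f}, specificity: {spec:.2f}")
```

Running the same function on scores from a different test would generally produce a different cut, which is why a cut score cannot be carried over from one test to another.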
Specificity and Sensitivity
As a result of cut scores, we are provided with information on sensitivity and specificity. Sensitivity refers to the ability of a test to identify impaired individuals as impaired. It is calculated as the number of truly impaired individuals the test identifies as impaired divided by the total number of truly impaired individuals. Specificity refers to the ability of a test to identify typically developing individuals as not impaired. It is calculated as the number of truly not-impaired individuals the test identifies as not impaired divided by the total number of truly not-impaired individuals. Plante and Vance (1994) determined that we must strive for sensitivity and specificity scores to be above 80%.
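As a worked illustration of these two calculations, here is a short Python sketch. The counts are invented and the function names are hypothetical. The 80% benchmark is the one attributed to Plante and Vance (1994) above; treating exactly 80% as passing is an assumption of this sketch.

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of truly impaired individuals the test identifies as impaired."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives, false_positives):
    """Proportion of truly not-impaired individuals the test identifies as not impaired."""
    return true_negatives / (true_negatives + false_positives)


# Invented example: 50 truly impaired students, 42 flagged by the test;
# 100 truly not-impaired students, 88 correctly cleared by the test.
sens = sensitivity(true_positives=42, false_negatives=8)
spec = specificity(true_negatives=88, false_positives=12)

BENCHMARK = 0.80  # assumption: "at least 80%" counts as meeting the standard
meets_benchmark = sens >= BENCHMARK and spec >= BENCHMARK

print(f"sensitivity: {sens:.0%}, specificity: {spec:.0%}, meets benchmark: {meets_benchmark}")
```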
When considering the strength of a test, we must also evaluate content validity, which refers to whether the test provides the clinician with accurate information on the ability being tested. More specifically, content validity addresses whether or not the test actually assesses what it says it is supposed to assess. According to McCauley and Strand (2008), there should be a justification of the methods used to choose content, expert evaluation of the test’s content, and an item analysis.
Additionally, content-oriented evidence of validation addresses the relationship between a student’s learning standards and the test content. Specifically, content-sampling issues take a look at whether cognitive demands of a test are reflective of the student’s learning standard level. Content sampling may address whether the test avoids inclusion of features irrelevant to what the test item is intended to target.
Once we have established the validity of a test, we can begin to look at reliability.
First, we should evaluate internal consistency, which looks at how well the items come together to measure the skill the subtest or test is assessing.
Next, we can review test-retest reliability, which looks at the variation between scores or different evaluative measurements of the same subject/individual taking the same test during a given period of time. If the test is a strong instrument, this variation would be expected to be low. Remember to ask yourself if the condition is changing or stable over this time frame and consider the length of the time frame. If the time frame is too short, individuals can remember test items the next time they are tested, and if the test time frame is too long, you may see developmental or degenerative changes.
Next, we can look at inter-rater reliability, which evaluates the consistency between different raters with regard to their scoring of examinees on the same instrument. Inter-rater reliability should be around .80. When evaluating inter-rater reliability, it is important to consider training and judgment required to give and score the test. If there is more judgment required in the test, for example, an observational rating scale, you can be more flexible on the requirement of .80. It is also important to evaluate how clear the instructions are and if there is test-specific training available.
Lastly, we should review item-reliability, which evaluates certain characteristics of test items. Individual test items should reflect a reliable level of skill difficulty. For example, people with less ability than the skill difficulty level should fail the item and people with more ability than the skill difficulty should pass the item. Items should also do a good job of discriminating between people of these different skill levels.
A bias is defined as a tendency, inclination, or prejudice toward or against something or someone. For example, if you are interviewing for a new employer and asked to complete a personality questionnaire, you may answer the questions in a way that you think will impress the employer. These responses will of course impact the validity of the questionnaire.
Responses to questionnaires, tests, scales, and inventories may also be biased for a variety of reasons. Response bias may occur consciously or unconsciously; it may be malicious or cooperative, self-enhancing or self-effacing (Furr, 2011). When response bias does occur, the reliability and validity of our measures will be compromised. Diminished reliability and validity will in turn impact decisions we make regarding our students (Furr, 2011). Thus, psychometric damage may occur because of response bias.
Types of Response Biases
Acquiescence Bias (“Yea-Saying and Nay-Saying”) refers to when an individual consistently agrees or disagrees with a statement without taking into account what the statement means (Danner & Rammstedt, 2016).
Extremity Bias refers to when an individual consistently overuses or underuses “extreme” response options, regardless of how the individual feels towards the statement (Wetzel, Lüdtke, Zettler, & Bohnke, 2016).
Social Desirability Bias refers to when an individual responds to a statement in a way that exaggerates his or her own positive qualities (Paulhus, 2002).
Malingering refers to when an individual attempts to exaggerate problems or shortcomings (Rogers, 2008).
Random/Careless Responding refers to when an individual responds to items with very little attention or care to the content of the items (Crede, 2010).
Guessing refers to when the individual is unaware of or unable to gauge the correct answer regarding their own or someone else’s ability, knowledge, skill, etc. (Foley, 2016).
In order to protect against biases, balanced scales are utilized. A balanced scale is a test or questionnaire that includes some items that are positively keyed and some items that are negatively keyed. For a balanced scale to be useful, it must be scored appropriately, meaning the scoring key must accommodate the fact that there are both positively and negatively keyed items. To achieve this, the rating scale must keep track of the negatively keyed items and “reverse the score.” Scores are only reversed for negatively keyed items.
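Reverse scoring can be sketched in Python as follows. This is a generic illustration, not the scoring key of any published scale: it assumes responses are coded numerically from 1 to 4, and the item names and function names are hypothetical.

```python
SCALE_MIN, SCALE_MAX = 1, 4  # assumption: a 4-point scale coded 1 through 4


def score_item(raw, negatively_keyed):
    """Return the scored value of one item, reversing it if negatively keyed."""
    if not SCALE_MIN <= raw <= SCALE_MAX:
        raise ValueError(f"response {raw} is outside {SCALE_MIN}-{SCALE_MAX}")
    # Reversal maps 1 -> 4, 2 -> 3, 3 -> 2, 4 -> 1
    return (SCALE_MIN + SCALE_MAX - raw) if negatively_keyed else raw


def total_score(responses, negatively_keyed_items):
    """Sum all items, reversing only those listed as negatively keyed."""
    return sum(
        score_item(raw, item in negatively_keyed_items)
        for item, raw in responses.items()
    )


# Hypothetical items: one positively keyed, one negatively keyed
responses = {"confident_with_peers": 4, "anxious_around_peers": 1}
negatively_keyed_items = {"anxious_around_peers"}

total = total_score(responses, negatively_keyed_items)
print(total)
```

In this example the rater gave the highest rating to the positively keyed item and the lowest to the negatively keyed one, so after reversal both items contribute 4 points and point in the same direction.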
Now, let’s take a look at the psychometric properties of the IMPACT Rating Scales and analyze how these assessments compare to the psychometric standards we have discussed.
The IMPACT Rating Scales & Psychometric Properties
Since the purpose of the IMPACT Rating Scales is to help identify students who present with speech and/or language deficits, it was critical to exclude students from the normative sample who have diagnoses that are known to influence the targeted disorders (Peña, Spaulding, & Plante, 2006). For example, students who had previously been diagnosed with a specific language impairment or learning disability were not included in the normative sample. Further, students were excluded from the normative sample if they were diagnosed with autism spectrum disorder, intellectual disability, hearing loss, neurological disorders, or genetic syndromes. Students used in the normative samples for the IMPACT Rating Scales had no other diagnosed disabilities and were not receiving speech and language support or any other services. Thus, the normative sample for the IMPACT Rating Scales provides an appropriate comparison group (i.e., a group without any known disorders that might affect the targeted disorder) against which to compare students with suspected disorders. For example, clinicians can compare clinician, teacher, and parent ratings on the IMPACT Social Communication Rating Scale to this normative sample to determine whether a student is scoring within normal limits or whether their scores are indicative of a social communication disorder.
Additionally, the normative data for the IMPACT Rating Scales is based on the performance of roughly 1000 examinees (exact number differs from scale to scale), across multiple age groups and from multiple states across the United States of America. The data was collected from state and ASHA licensed speech-language pathologists (SLPs). All standardization project procedures were reviewed and approved by IntegReview IRB, an accredited and certified independent institutional review board. To ensure representation of the national population, the IMPACT Rating Scales standardization sample was selected to match the US Census data reported in the ProQuest Statistical Abstract of the United States (ProQuest, 2017). The sample was stratified within each age group by the following criteria: gender, race or ethnic group, and geographic region.
It is common practice to use single cut scores (e.g., -1.5 standard deviations) to identify disorders; however, this practice is not evidence-based, and there is evidence that advises against it (Spaulding, Plante, & Farinella, 2006). When using single cut scores (e.g., -1.5 SD, -2.5 SD, etc.), we may under-identify students with impairments on tests for which the best cut score is higher and over-identify students with impairments on tests for which the best cut score is lower. Additionally, using single cut scores may go against IDEA’s (2004) mandate, which states that assessments must be valid for the purpose for which they are used. Sensitivity and specificity are diagnostic validity statistics that explain how well a test performs. Plante and Vance (1994) set forth the standard that for a language assessment to be considered clinically beneficial, it should reach at least 80% sensitivity and specificity. Thus, strong sensitivity and specificity (i.e., 80% or stronger) is needed to support the use of a test in its identification of the presence of a disorder or impairment. All of the IMPACT Rating Scales provide multiple cut scores for different ages with strong sensitivity and specificity. Please review the technical manuals for each of the IMPACT Rating Scales to review cut scores, sensitivity, and specificity.
The validity of a test determines how well the test measures what it purports to measure. Expert opinion was solicited for the IMPACT Rating Scales. For example, twenty-nine speech-language pathologists (SLPs) reviewed the IMPACT Social Communication Rating Scale. All SLPs were licensed in the state of California, held the Certificate of Clinical Competence from the American Speech-Language-Hearing Association, and had at least 5 years of experience in assessment of children with autism and social communication deficits. Each of these experts was presented with a comprehensive overview of the rating scale descriptions, as well as rules for standardized administration and scoring. They all reviewed 6 full-length administrations. Following this, they were asked 30 questions related to the content of the rating scale and whether they believed the assessment tool to be an adequate measure of social communication skills. For instance, their opinion was solicited regarding whether the questions and the raters’ responses properly evaluated the impact of social communication skills on educational performance and social interaction. The reviewers rated each rating scale on a decimal scale. All reviewers agreed that the IMPACT Social Communication Rating Scale is a valid informal observational measure to evaluate social communication skills and to determine the impact on educational performance and social interaction in students who are between the ages of 5 and 21 years old.
Standards of fairness are crucial to the validity and comparability of the interpretation of test scores (AERA, APA, and NCME, 2014). The identification and removal of construct-irrelevant barriers maximizes each test-taker’s performance, allowing for skills to be compared to the normative sample for a valid interpretation. Test constructs, and the individuals or subgroups for whom the test is intended, must be clearly defined. In doing so, the test will be as free of construct-irrelevant barriers as possible for the individuals and/or subgroups for whom the test is intended. It is also important that simple and clear instructions are provided.
The IMPACT Rating Scales use a balanced set of questions in order to protect against response biases. A balanced scale is a test or questionnaire that includes some items that are positively keyed and some items that are negatively keyed. Here is an example taken from the IMPACT Social Communication Rating Scale. Items on this scale are rated on a 4-point scale (“never,” “sometimes,” “often,” and “typically”). Now, imagine if we asked a teacher to answer the following two items regarding one of their students:
- Appears confident and comfortable when socializing with peers.
- Does not appear overly anxious and fidgety around group of peers.
Both of these items are positively keyed because a positive response indicates a stronger level of social language skills. To minimize the potential effects of acquiescence bias (“yea-saying and nay-saying” when an individual consistently agrees or disagrees [Danner & Rammstedt, 2016]), the test creator may revise one of these items to be negatively keyed. For example:
- Appears confident and comfortable when socializing with peers.
- Appears overly anxious and fidgety around group of peers.
Now, the first item is keyed positively and the second item is keyed negatively. The revised scale, which represents a balanced scale, helps control acquiescence bias by including one item that is positively keyed and one that is negatively keyed.
The IMPACT Rating Scales are psychometrically strong informal assessment tools that will assist SLPs and IEP teams in determining the impact of a speech and/or language disorder on a child’s education and social interactions. To learn more about the IMPACT Rating Scales please visit www.videolearningsquad.com and www.videoassessmenttools.com