6c. Research Design Part 3 – Evaluating Measurement Quality: Reliability, Validity, and Trustworthiness
Dr. Rochelle Stevenson
🎯 Learning Objectives
- Define reliability and describe the types of reliability.
- Define validity and describe additional types of validity.
- Analyze the rigour of qualitative measurement using the criteria of trustworthiness and authenticity.
We began our discussion on research design in chapter 6a by examining the process of measurement and how we conceptualize and operationalize our measures. As you may recall, this process requires us to become increasingly specific about what we mean by the concepts that are central to our research question. We also talked about variables and how they relate to one another. In part 2 of the research design section (chapter 6b), we delved further into the notion of causality and how we can say whether two variables that are correlated are indeed causally related. We also outlined specific criteria for establishing causality. The time dimension we use in our study is relevant to this discussion on causality, as are specific types of validity, namely, internal validity and external validity.
Now, we are left with a very important question: once we’ve defined our terms and specified the operations for measuring them, how do we know that our measures are any good? Without some assurance of the quality of our measures, we cannot be certain that our findings will have any meaning or that our findings will mean what we think they mean. When social scientists measure concepts, they aim to achieve reliability and validity in their measures. These two aspects of measurement quality are the focus of this final chapter on research design.
Reliability
The term reliability in measurement is about consistency. A reliable measure should, in theory, yield the same result every time it is applied. To illustrate this concept, let’s look at an example. Let’s say our interest is in measuring the concepts of alcoholism and alcohol intake. What are some potential problems that could arise when attempting to measure these concepts, and how might we work to overcome them? Perhaps we’ve decided to measure alcoholism by asking people to respond to the following question: Have you ever had a problem with alcohol? If we measure alcoholism this way, then it is likely that anyone who identifies as an alcoholic would respond “yes.” This may seem like a good way to identify our group of interest, but think about how you and your peer group might respond to this question. Would participants respond differently after a wild night out compared to any other night? Could an infrequent drinker’s current headache from last night’s glass of wine influence how they answer the question this morning? How would that same person respond to the question before consuming the wine? In each case, the same person might respond differently to the same question at different points in time, so it is possible that our measure of alcoholism has a reliability problem.
One common problem of reliability with social scientific measures is memory. If we ask research participants to recall some aspect of their own past behaviour, we should try to make the recollection process as simple and straightforward for them as possible. Sticking with the topic of alcohol intake, if we ask respondents how much wine, beer, and liquor they’ve consumed each day over the course of the past three months, how likely are we to get accurate responses? Unless a person keeps a journal documenting their intake, there will very likely be some inaccuracies in their responses. On the other hand, we might get more accurate responses if we ask a participant how many drinks of any kind they have consumed in the past week.
Reliability can be an issue even when we’re not reliant on others to accurately report their behaviours. Perhaps a researcher is interested in observing how alcohol intake influences interactions in public locations. They may decide to conduct observations at a local pub by noting how many drinks patrons consume and how their behaviour changes as their intake changes. What if the researcher has to use the restroom, and the patron next to them takes three shots of tequila during the brief period the researcher is away from their seat? The reliability of this researcher’s measure of alcohol intake depends on their ability to physically observe every instance of patrons consuming drinks. If they are unlikely to be able to observe every such instance, then perhaps their mechanism for measuring this concept is not reliable.
Reliability also refers to the stability of the measure over time. If the measure yields consistent results when given multiple times, then the measure is considered reliable. For example, the Psychopathy Checklist-Revised (PCL-R) is a commonly used 20-item assessment tool in criminal justice contexts to assess risk for “serious criminality, violence, and poor treatment response” (Storey et al., 2016, p. 136). The PCL-R assesses both lifestyle factors, such as the need for stimulation, poor behaviour controls, and the failure to accept responsibility, and interpersonal factors, such as a lack of empathy, pathological lying, and manipulation (Storey et al., 2016). Using a variety of different population samples, the PCL-R has been shown to have strong test-retest reliability, meaning that scores remain relatively stable each time the PCL-R is administered. When test-retest reliability is strong, any change in results between administrations should be attributable to an intervening factor. Moosburner et al. (2024) used the PCL-R to evaluate German offenders at the time of their incarceration and then again 18 months later. They found that the scores on the PCL-R showed small but significant reductions on the retest, which Moosburner et al. (2024) attributed to the intense treatment the offenders had received.
Additionally, you may need to assess inter-rater reliability if your study involves observing people’s behaviours. Inter-rater reliability (also called inter-observer consistency) is the degree to which different observers agree on what happened. The PCL-R is used in a variety of criminal justice settings such as pre-sentence evaluations and correctional intakes; there is not just one person who administers the PCL-R in any institution. DeMatteo and Olver (2022) conducted a meta-analysis of published studies using the PCL-R, finding that across studies and populations, the PCL-R had strong inter-rater reliability. Scores of multiple observers should be consistent, though perhaps not perfectly identical. If two different observers come up with different ratings of the same concept, it is possible that one or both are not examining the concept correctly.
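Inter-rater agreement is often quantified with a statistic such as Cohen’s kappa, which corrects raw agreement for the agreement two raters would reach by chance. The ratings below are invented for illustration (they are not PCL-R data), and the function is a minimal sketch of the calculation:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    # Proportion of cases where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical risk classifications of the same ten offenders by two
# independent evaluators.
rater_a = ["high", "high", "low", "low", "high", "low", "low", "high", "low", "low"]
rater_b = ["high", "high", "low", "low", "high", "low", "high", "high", "low", "low"]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))  # 0.8
```

The two raters agree on nine of ten cases, but kappa is lower than 0.9 because some of that agreement would be expected by chance alone; this chance correction is why kappa is preferred over simple percent agreement.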
Finally, internal reliability (also called internal consistency) is an important concept, specifically when dealing with scales. The scores on each question of a scale should be correlated with each other, as they all measure parts of the same concept. Again, the PCL-R is a good example of this, with the interpersonal and lifestyle/deviance factors showing strong correlations, or relationships, among the factors (Storey et al., 2016).
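Internal consistency is commonly summarized with Cronbach’s alpha, which compares the variance of the total scale score to the variances of the individual items. The three-item scale and responses below are hypothetical, and the function is a sketch of the standard formula rather than a full-featured implementation:

```python
from statistics import pvariance  # population variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a scale; item_scores holds one list of
    responses per item, each ordered by participant."""
    k = len(item_scores)
    # Each participant's total score across all items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    item_variance = sum(pvariance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - item_variance / pvariance(totals))

# Hypothetical responses from five participants to a three-item scale;
# items that tap the same concept should rise and fall together.
items = [
    [4, 3, 5, 2, 4],  # item 1
    [5, 3, 4, 2, 5],  # item 2
    [4, 2, 5, 1, 4],  # item 3
]
alpha = cronbach_alpha(items)
print(round(alpha, 2))  # 0.94
```

Because every participant who scores high on one item also tends to score high on the others, alpha comes out well above the commonly cited 0.7 threshold, indicating strong internal consistency for this toy scale.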
Test-retest, inter-rater, and internal reliability are three important subtypes of reliability. Researchers use these types of reliability to make sure their measures are consistently measuring the concepts in their research questions.
| Reliability type | Definition | Example |
|---|---|---|
| Test-retest reliability | When a measure yields the same results when applied multiple times to the same person. | When PCL-R scores remain the same for the same offender when administered at one age and then later at another age. |
| Inter-rater reliability | When different people administering the same measure to the same person obtain the same results. | When different psychologists separately arrive at the same PCL-R score for the same offender. |
| Internal reliability | When the scores on each question in a scale are correlated with one another because they measure the same concept. | The interpersonal and lifestyle factors in the PCL-R correlate with each other. |
🧠 Stop and Take a Break!
Test your knowledge by answering a few questions on what you have read so far.
Validity
While reliability is about consistency and yielding the same results time after time, validity is about accuracy and truth. What image comes to mind when you hear the word “alcoholic”? Are you certain your image is similar to the image others have in mind? If not, then we may be facing a problem of validity.
As discussed in chapter 6b, for a measure to have validity, it must accurately measure what we think it does, and there are several facets of validity for researchers to consider. Internal validity and external validity were discussed in chapter 6b. Now, let’s talk about other types of validity. Think back to when we initially considered measuring alcoholism by asking research participants if they have ever had a problem with alcohol. We realized that this might not be the most reliable way of measuring alcoholism because the same person’s response might vary dramatically depending on how they are feeling that day. Likewise, this measure of alcoholism is not particularly valid. What is “a problem” with alcohol? For some, it might be having had a single regrettable or embarrassing moment that resulted from consuming too much. For others, the threshold for what is considered a “problem” might be different. Participants could define an alcohol problem in countless ways. If we are trying to objectively understand how many of our research participants are alcoholics, then this measure may not yield any useful results.
Through critical engagement with our measure for alcoholism, “Have you ever had a problem with alcohol?”, we have identified that it is flawed. We assessed its face validity, or whether it makes reasonable sense that the question measures what it intends to measure. Face validity is a subjective process. Sometimes, face validity is easy. For example, we would know that if a researcher tried to convey the extent of a participant’s alcoholism by reporting how tall they are, this would not be valid because a measure of a person’s height has nothing to do with alcoholism. Other times, face validity can be more difficult to assess. Perhaps we are interested in exploring the work stress of correctional officers (COs) in Canadian institutions. We could measure work stress by counting the number of times per week that a CO shows up late to work. This behaviour can be reasonably linked to increased stress at work; a person who is stressed at work may delay or avoid their workplace. But a mere count of lateness may not tell the whole story. Perhaps a CO drops their child off at daycare on certain mornings, resulting in a late arrival, or they may have a work-related meeting in the community several days a week. Neither of these situations has anything to do with work stress; therefore, on the face of it, a simple count of lateness may not be the most valid way to measure our concept of interest.
In addition, our suggested measure of work stress is problematic because it is incomplete. Content validity assesses whether the measure includes all the possible meanings of the concept. Kocsis and Lavoie (2023) explored CO experiences of work stress during the COVID-19 pandemic. Recognizing that stress has many different dimensions and meanings, they used established scales on stress, workplace safety, and resiliency, and they also created scales on pandemic-related safety precautions and pandemic-related changes in responsibilities. By utilizing multiple measures for stress, Kocsis and Lavoie (2023) included all potential indicators of work stress established by the existing research literature as well as unique indicators related to COVID-19. Because of their attention to a comprehensive measure of work stress, the study had strong content validity. As this study illustrates, a thorough review of the literature can assist us in recognizing the various dimensions and meanings that have been associated with the concepts of interest in our own research.
Another form of validity is criterion validity, which is an assessment of how well a measure is related to some other factor or concept. In other words, it involves showing how our measure predicts scores on some other measure that has been generally accepted as being valid. This type of validity includes two elements: predictive validity and concurrent validity (Carr et al., 2021). Let’s say you have created (or found) a good scale, index, or typology to measure work stress. For instance, a valid measure of work stress might be able to predict future behaviours outside of work, such as relationship breakdowns or substance use. This is called predictive validity, and it means that your measure predicts things it should be able to predict. In this case, our measure predicts that if someone has a high level of work stress, then a year later, they will be more likely than someone with low work stress to engage in substance use. If we were to measure substance use at the same time as the scales encompassing work stress, we would be assessing concurrent validity. Concurrent validity is similar to predictive validity in that it assesses how one measure predicts scores on another measure, the difference being that both measures are given at the same time. In both cases, work stress measures are linked to already established and valid measures of substance use.
Construct validity is focused on whether the constructs being measured and the various attributes of the construct (or variable) relate to each other in predictable ways. Construct validity is often rooted in the theory that informs a study. This type of validity is intricately linked to our measurement decisions: have the constructs we have chosen been properly defined, and have we selected the best operations to measure the observations of these constructs? Let’s return to the hypothetical example of researching alcoholism. We could create a scale with a few questions, such as whether the person has had multiple drinks on more than one occasion, if they have missed work or an important event because they were hungover, if they have engaged in risky or dangerous behaviour while drinking, and if others think they have a problem with alcohol. All these variables are indirect observables and are theoretically related to the construct of alcoholism. If our measure was high in construct validity, respondents with problems with alcohol would answer in similar ways to each other, while respondents without a problem with alcohol would answer in different ways. In other words, our measure would accurately capture what it means to have “a problem” with alcohol. If there is no pattern to our responses (i.e., no similarity from the responses of people who have a problem with alcohol), our measure is low in construct validity. We would have to revise our scale, going back to our theory and previous research to assess where we can and should improve our construct validity.
Perhaps we can look at another example to illustrate what construct validity is all about. Let’s say we are interested in measuring whether or not couples are in love. We gather data on the extent to which a sample of couples are holding hands in public. If we use that as our sole measure of whether or not a given couple is in love, would that be a valid measure of the complex construct of love? I think many of you would agree that it is not. Many other dimensions of this construct would need to be included to achieve a more valid measure of this construct. Therefore, a study that chooses the observation of handholding as the only measure of love would be very low on construct validity.
| Validity Type | Definition | Example |
|---|---|---|
| Face validity | Assesses whether the chosen measure makes reasonable sense. | Work stress being measured by tardiness is low on face validity. |
| Content validity | Assesses whether the measure includes all the possible meanings of the concept. | Using multiple measures for stress, such as workplace safety and resiliency, increases content validity. |
| Criterion validity | Assessment of how well a measure is related to some other concept or factor. When the criterion measure is administered later, this is predictive validity; when both measures are administered at the same time, it is concurrent validity. | Checking whether a work stress measure predicts scores on an established substance use measure, either one year later (predictive) or at the same time (concurrent). |
| Construct validity | Assessment of how well the constructs (or attributes of a construct) relate to each other in predictable and/or theoretical ways, and of how well the constructs are operationalized to accurately measure the underlying concepts of interest. | Using multiple measures of alcoholism, or love, that are theoretically connected, and then assessing how strongly they relate. |
The basic subtypes of validity presented above are enough to give you a strong sense of validity and why it is important in research, though there are certainly others you can read more about. Think of validity like a portrait: Some portraits look just like the person they are intended to represent, but other representations, such as caricatures and stick drawings, are not nearly as accurate. A portrait may not be an exact representation of how a person looks, but the extent to which it resembles the subject is important. The same goes for validity in measures: No measure is exact, but some measures are more accurate than others.
🧠 Stop and Take a Break!
When conducting research in Indigenous communities, it is important to consider existing research literature as well as consult with Indigenous Elders and communities to ensure that your research measures are valid in their view and you are not imposing outside Eurocentric or personally biased definitions. For instance, following the Supreme Court of Canada’s decision in R v. Gladue (1999), courts are to consider the unique circumstances of Indigenous offenders during sentencing, often documented in what are known as Gladue reports. These reports incorporate Indigenous perspectives on justice and aim to inform more equitable sentencing decisions. One such circumstance, relevant both in sentencing and when conducting research, is historical trauma and how it may manifest as depression, suicide, alcohol abuse, or other social problems. Indigenous community members often understandably define these social problems as resulting from historical and current inequities at the hands of the dominant culture; this is widely recognized as the more valid conceptualization, rather than framing them as an individual choice or an indication of weakness of individual character. The latter conceptualizations are not only invalid but reflect racist, colonial perceptions.
Let us look at one more example of validity concerns related to research with Indigenous peoples: the PCL-R. Though many studies have commented on the reliability of the measure when used on the general population, it is important to note that other studies and legal cases (see Ewert v. Canada, 2015) have raised concerns about the validity of the PCL-R when used with Indigenous offenders. The studies suggest that it has not been tested on large samples of Indigenous persons, and it may not accurately reflect culturally specific expressions of behaviour, emotion, or interpersonal norms. For example, Hart (1998) cautioned that the PCL-R’s Western-based constructs may pathologize culturally normative behaviours in Indigenous populations, leading to inflated scores and misclassification. Similarly, Shepherd, Adams, McEntyre, and Walker (2014) found that the PCL-R showed poor predictive validity for violent recidivism among Australian Aboriginal offenders. These findings suggest a need for culturally adapted tools that consider the unique historical, social, and cultural experiences of Indigenous peoples to ensure the scores are valid.
If you are still confused about validity and reliability, Figure 6c.1 provides a visual representation. On the top left target, the shooter’s aim is all over the place; it is neither reliable (consistent) nor valid (accurate). The top right target shows inconsistent (unreliable) shots that are nonetheless centred on the bullseye, making them accurate on average. The bottom left target demonstrates consistency, but the shooter’s aim is reliably off-target and therefore invalid. The final bottom-right target represents a reliable and valid result: the shooter hits the target accurately and consistently. This is what you should aim for in your research.

Authenticity and Trustworthiness
The standards for measurement are different in qualitative and quantitative research for an important reason. Measurement in quantitative research is done objectively or impartially; that is, the researcher has minimal influence over the measurement process. They choose a measure, apply it, and read the results. Therefore, the accuracy and consistency depend more on the measure than on the researcher.
On the other hand, qualitative researchers are deeply involved in the data analysis process. There are no external measurement tools, like quantitative scales; rather, the researcher is the measurement instrument. Researchers build connections between different ideas that participants discuss and draft an analysis that accurately reflects the depth and complexity of what participants have shared. This is challenging for researchers as it involves acknowledging their own various biases and allowing the meaning that participants shared to emerge as the data are read. This process is not concerned with objectivity, as there is always some subjectivity in qualitative analysis. Here, we are more concerned with researchers rigorously engaging in the data analysis process. Because of the subjective nature of measurement and analysis in qualitative research, we have to think about reliability and validity slightly differently.
We assess the rigour of qualitative research through authenticity and trustworthiness. Authenticity refers to the “extent to which researchers fairly and faithfully show a range of different realities” (Polit & Beck, 2020, p. 416). Authenticity is not limited to the breadth of perspectives; it also includes the recognition that each perspective or reality has meaning and worth.
Trustworthiness, which encompasses the foundation of accuracy (validity) and consistency (reliability), refers to “the degree of confidence in data, interpretation, and methods used to ensure the quality of a study” (Connelly, 2016, p. 435). Trustworthiness is made up of four criteria: credibility, dependability, confirmability, and transferability.
Credibility refers to the accuracy of the results and the confidence of participants in the interpretations of the data. Bryman and Bell (2019) frame credibility as whether the findings are believable and “ring true for the people observed” (p. 205). There are a few methods that can be used to establish credibility. Researchers may seek assistance from another qualitative researcher to review or audit their work. As you might expect, it’s difficult to view your own research without bias, so another set of eyes is often helpful. Another tool is member checking (also called respondent validation), where researchers solicit feedback from their participants. Member checking can happen at any stage of the research process, for example, a review of interview transcripts to ensure the participants’ perspectives are captured accurately or a review of a draft of the analysis or final report. Cesaroni et al. (2019) conducted talking circles with Indigenous youth, exploring their opinions on the overrepresentation of Indigenous youth in the Canadian criminal justice system. In collaboration with the community, the researchers established an Advisory Committee made up of “an Elder, a chief, an Indigenous knowledge keeper, Indigenous young people and Indigenous practitioners who work with high-risk Indigenous young people” (Cesaroni et al., 2019, p. 116). Part of the agreement with the community was that any publications and reports from the research be approved by the Advisory Committee before being shared publicly. Not only was there member engagement at various stages of the research process, but the Advisory Committee was able to assess whether the findings of Cesaroni et al. (2019) “rang true” before releasing any results to others.
Credibility is akin to validity, as it mainly speaks to the accuracy of the research product. On the other hand, the criterion of dependability is similar to reliability. As you recall, reliability is the consistency of a measure; if you give the same measure each time, you should get similar results. However, qualitative research questions and interview questions may change during the research process. How can reliability, or dependability, be achieved under such conditions?
Because qualitative research understands the importance of context, it would be impossible to control everything that makes a qualitative measure the same when given to each person. The location, the timing, or even the weather can influence participants to respond differently. To assess dependability, researchers maintain a comprehensive record of any changes and research decisions. This journal or log establishes transparency by showing that the study followed proper qualitative procedures and enables the researcher to clearly justify, describe, and account for any changes that emerge during the research process in their final report; it also enables another researcher to replicate the study should they wish. For example, when the author (Stevenson) was interviewing incarcerated men as part of her graduate work, one participant’s interview was interrupted several times due to various reasons, including a legal hearing that required use of the interview room, a fire alarm, and a lockdown on the cell range. Clearly explaining how consent was reobtained each time the interview was restarted, how the previous conversation was recapped for the participant, and how the conversation proceeded despite interruptions formed part of the methodology of the final report. This transparency enabled other researchers to assess that good practices had been followed, contributing to the dependability of the study. Some researchers may consult another qualitative researcher to examine their logs and results to ensure dependability. Others may make their data available to other researchers along with their results to “audit” the findings, though this is not common practice due to the time-consuming nature of auditing research materials.
Confirmability is connected to objectivity and refers to the degree to which the results reported are linked to the data obtained from participants and not the researcher’s bias. Although objectivity is not truly possible in social science research, the criterion of confirmability ensures that a researcher’s results are grounded in what participants said. In other words, the researcher “has acted in good faith” (Bryman & Bell, 2019, p. 206) in their analysis process, acknowledged their personal biases, and not imagined or incorrectly interpreted the data.
The final criterion of transferability is focused on whether the findings can be applied in other settings or contexts, much like generalizability in quantitative research. One way that qualitative researchers can do this is by providing thick, rich, and detailed descriptions of the perspectives, voices, and experiences captured in their data. In Cesaroni et al.’s (2019) study, they offer pages of substantial detail showcasing the voices of the Indigenous youth who participated in the research. The research included the voices of Indigenous youth from the Anishinaabe, Haudenosaunee, and Métis communities, and the rich descriptions allow an understanding of how the impacts of colonization and residential schools, the loss of history, tradition and culture, and racism and stereotypes resonate with Indigenous youth across Canada. In this way, Cesaroni et al. (2019) establish the transferability of their findings.
These four criteria for trustworthiness were created as a reaction to criticisms of qualitative research as unscientific (Lincoln & Guba, 1986). They demonstrate that qualitative research is equally as rigorous as quantitative research, as qualitative research and measurement are conducted with the same degree of care and attention as quantitative research. While the standards may be different, they speak to the goals of accurate and consistent results that reflect the views of the participants in the study.
| Criterion | Questions Being Asked |
|---|---|
| Dependability | Are the findings likely to be consistent over time or across subjects? Can another researcher conduct the same study and get similar results? |
| Confirmability | Are the findings accurately linked to the data? Have personal values or biases influenced the findings? |
| Credibility | Are you measuring what you think you are? How believable are the findings? |
| Transferability | Do the findings apply to other contexts or populations? |
Conclusion
In this chapter, we focused on how to ensure that the measures we choose in our own research are indeed of good quality. We discussed the various types of reliability that can be used to evaluate the quality of our research, and we also expanded on the types of validity. While reliability refers to the consistency of our measures, validity speaks to their accuracy.
We also discussed how to evaluate the quality of our measures when conducting qualitative research. We introduced the notions of authenticity and trustworthiness, with emphasis on the fact that qualitative research is as rigorous as quantitative research, even though the standards for establishing reliability and validity differ. The research design decisions we make around measurement and measurement quality are crucial and are directly linked to our paradigmatic stance, our research purpose, and our research question.
✅ Summary
- Reliability is a matter of consistency. What we mean when we say our measures are reliable is that they will yield the same results when repeatedly administered. There are three types of reliability: test-retest reliability, inter-rater reliability, and internal reliability.
- Validity is a matter of accuracy. The question we are dealing with here is whether our measures are actually measuring what we think they are measuring. The types of validity reviewed in this chapter are face validity, content validity, criterion validity, and construct validity.
- When conducting research in Indigenous communities, it is important to consider existing research literature as well as consult with Indigenous Elders and communities to ensure that your research measures are valid.
- Qualitative researchers assess rigour using the criteria of trustworthiness and authenticity. Trustworthiness can be broken down into four criteria: credibility, dependability, confirmability, and transferability.
- Both quantitative and qualitative research are equally rigorous, but the standards for assessing rigour differ between the two.
🖊️ Key Terms
authenticity: one of the criteria used to assess the rigour of qualitative methods. It refers to the “extent to which researchers fairly and faithfully show a range of different realities” (Polit & Beck, 2020, p. 416). Authenticity is not limited to the breadth of perspectives, as it also includes recognition that each perspective or reality has meaning and worth.
concurrent validity: one aspect of criterion validity that refers to whether a measure is able to predict outcomes from another established measure given at the same time.
confirmability: a criterion of trustworthiness in qualitative research that refers to the degree to which the results reported are linked to the data obtained from participants.
construct validity: assessment of how well the constructs (or attributes of a construct) relate to each other in predictable and/or theoretically expected ways.
content validity: a type of validity that refers to whether the measure includes all the possible meanings of the concept.
credibility: a criterion of trustworthiness in qualitative research that refers to the degree to which the results are accurate and viewed as important and believable by participants.
criterion validity: a type of validity that refers to how well a measure is related to some other concept or factor. It can be either predictive or concurrent.
dependability: a criterion of trustworthiness in qualitative research that refers to the importance of ensuring that proper qualitative procedures were followed and that any changes that emerged during the research process are accounted for, justified, and described in the final report.
face validity: a type of validity that speaks to how plausible it is that the measure actually measures what it intends to measure.
internal reliability: a type of reliability, also sometimes called internal consistency, that refers to the degree to which scores on each question of a scale are correlated with each other.
inter-rater reliability: a type of reliability that refers to the degree to which different observers agree on what happened.
member checking: a method used to establish credibility in qualitative research. It typically involves collecting feedback from participants on transcripts, analysis at various stages of the research process, or drafts of final manuscripts or reports.
meta-analysis: a type of secondary data analysis in which findings from published studies are combined and analyzed together; typically, such research takes a quantitative approach.
predictive validity: one aspect of criterion validity; it addresses whether a measure predicts things it should be able to predict in the future.
reliability: the degree of consistency and stability of our measures.
scale: a composite measure of a variable or construct composed of multiple indicators with numerical scores assigned to response categories. For example, the PCL-R is a scale with 20 related items, scored 0 to 3, measuring different aspects of the construct of psychopathy.
test-retest reliability: a type of reliability that refers to how consistent results will be when a measure is given multiple times. Ideally, any changes in the results would be due to a change or intervening factor between the tests.
transferability: a criterion of trustworthiness in qualitative research that refers to the degree to which qualitative research findings can be applied in other settings or contexts.
trustworthiness: one of the criteria used to assess the rigour of qualitative research. Trustworthiness refers to the degree of confidence in the data, interpretation, and methods used to ensure the quality of a study.
validity: the accuracy of our chosen measures. When we talk about validity, we are asking whether we are measuring what we actually intend to measure.
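The internal reliability and scale definitions above can be illustrated with a short computation. A common statistic for internal consistency is Cronbach's alpha, which asks how strongly the items of a scale hang together relative to the total score. The sketch below is not from the original chapter, and the three-item scale (scored 0 to 3, loosely echoing the PCL-R item format) and its five respondents are hypothetical:

```python
def variance(xs):
    # Population variance; any variance convention works if used consistently.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a scale.

    item_scores: one list per item, each holding that item's score
    for every respondent (all lists the same length).
    """
    k = len(item_scores)
    # Each respondent's total score across all items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    sum_item_var = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))

# Hypothetical 3-item scale, each item scored 0-3 by five respondents.
items = [
    [0, 1, 2, 3, 3],
    [1, 1, 2, 2, 3],
    [0, 2, 2, 3, 3],
]
print(round(cronbach_alpha(items), 2))  # prints 0.93
```

An alpha of 0.93 would suggest the items are highly correlated and plausibly tap a single construct; by a common rule of thumb, values above roughly 0.7 are taken to indicate acceptable internal consistency, though very high values can also signal redundant items.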
🧠 Chapter Review
Discussion Questions
- How do the types of reliability and validity align with the criteria of trustworthiness?
- Why is it important to consult and work with Indigenous Elders and communities to ensure the validity of measures used to capture their perspectives?
- How might a research measure appear to have face validity but still fail to meet content or criterion validity? Can you give an example where something “looks right” but doesn’t fully capture what it claims to measure?
- How can poor inter-rater reliability in areas such as risk assessments, sentencing recommendations, or use-of-force evaluations lead to unequal or unjust outcomes in the criminal justice system?
References
Bryman, A., & Bell, E. (2019). Social research methods (5th Canadian ed.). Oxford University Press.
Carr, D., Boyle, E. H., Cornwell, B., Correll, S., Crosnoe, R., Freese, J., & Waters, M. C. (2020). The art and science of social research (2nd ed.). W.W. Norton & Co.
Cesaroni, C., Grol, C., & Fredericks, K. (2019). Overrepresentation of Indigenous youth in Canada’s criminal justice system: Perspectives of Indigenous young people. Australian & New Zealand Journal of Criminology, 52(1), 111–128. https://doi.org/10.1177/0004865818778746
Connelly, L. M. (2016). Trustworthiness in qualitative research. Medsurg Nursing, 25(6), 435–436.
DeMatteo, D., & Olver, M. E. (2022). Use of the Psychopathy Checklist-Revised in legal contexts: Validity, reliability, admissibility, and evidentiary issues. Journal of Personality Assessment, 104(2), 234–251.
Ewert v. Canada, 2018 SCC 30, [2018] 2 S.C.R. 165. Retrieved May 26, 2025, from https://decisions.scc-csc.ca/scc-csc/scc-csc/en/item/17133/index.do
Hart, S. D. (1998). The role of psychopathy in assessing risk for violence: Conceptual and methodological issues. Legal and Criminological Psychology, 3(Part 1), 121–137. https://doi.org/10.1111/j.2044-8333.1998.tb00354.x
Kocsis, K., & Lavoie, J. (2023). Canadian correctional officers’ experiences of workplace safety and stress during the COVID-19 pandemic. Canadian Journal of Criminology and Criminal Justice, 65(1), 9–36. https://doi.org/10.3138/cjccj.2022-0015
Lincoln, Y. S., & Guba, E. G. (1986). But is it rigorous? Trustworthiness and authenticity in naturalistic evaluation. New Directions for Program Evaluation, 1986(30), 73–84.
Moosburner, M., Etzler, S., Brunner, F., Briken, P., & Rettenberger, M. (2024). Is psychopathy a dynamic risk factor? An empirical investigation of changes in psychopathic personality traits over the course of correctional treatment. Criminal Justice and Behavior, 51(2), 230–246. https://doi.org/10.1177/00938548231219804
Polit, D., & Beck, C. (2020). Essentials of nursing research: Appraising evidence for nursing practice (9th ed.). Lippincott Williams & Wilkins.
R. v. Gladue, 1999 CanLII 679 (SCC).
Shepherd, S. M., Adams, Y., McEntyre, E., & Walker, R. (2014). Violence risk assessment in Australian Aboriginal offender populations: A review of the literature. Psychology, Public Policy, and Law, 20(3), 281–293. https://doi.org/10.1037/law0000017
Storey, J. E., Hart, S. D., Cooke, D. J., & Michie, C. (2016). Psychometric properties of the Hare Psychopathy Checklist-Revised (PCL-R) in a representative sample of Canadian federal offenders. Law and Human Behavior, 40(2), 136–146. https://doi.org/10.1037/lhb0000174
Adaptation Statement
Chapter adapted from Scientific Inquiry in Social Work by Matthew DeCarlo, licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
Media Attributions
- Reliability and validity © Nevit Dilmen is licensed under a CC BY-SA (Attribution ShareAlike) license