How to Tell if a Research Article Is Strong Enough to Use
A practical quality check for BCBAs deciding whether to base a treatment decision on a paper, from a BCBA-led CEU.

Research to practice - extending past the pages
Open a paper and check the implementer line first. If the first author ran every session, treat that as a red flag. Then look at what the setting required: staffing ratios, nurses on site, laminated materials, every piece of PPE money can buy. If it is a behavioral feeding article, assume you need hospital-level training before you copy any of it. That is the quick filter. Most papers you read are not bad, but they were built for a setting you do not work in, and that gap is where treatment decisions go sideways.
You are not grading the paper for a journal club. You are deciding whether to put your name and your client on a treatment plan inspired by it. That is a different question, and it has a different rubric. This guide walks through the rubric in the order I actually use it on a Tuesday afternoon when a fieldwork student hands me a PDF and asks if we should run it.
What a strong applied paper looks like at a glance
A strong applied paper, for our purposes, is one where you can picture your team running it on Monday without lying to yourself. The participants look like your kid. The setting looks like your clinic or school. The implementers look like your RBTs, not the PhD candidate who wrote the dissertation. The intervention reads in plain enough steps that you could brief a 19-year-old technician on it in 10 minutes. And the discussion section talks about real limits, not just future directions.
If you have to squint to make any of those fit, that is not a deal breaker. It is a flag that the paper needs translation before it touches a learner. The rest of this guide is how to spot those flags fast.
Who actually ran the intervention (the implementer red flag)
This is the first place I look, and it is the cheapest signal in the whole article. Skip to the methods section, find the line that says who ran sessions, and read it carefully.
"This was done to fulfill a PhD program. First author did every implementation session."
When you see that, the paper is telling you something important. It is saying the intervention worked under near-perfect fidelity, run by the person who designed it, who knows every reason the procedure exists, and who can adjust on the fly without checking a manual.
Those are not the conditions you work in. Your team is not the first author. They are not even on the author list.
"If this intervention can be run with no issues and 100 percent fidelity with a PhD or even master's level students, we need to consider if our staff are competent to run that as well."
This does not mean throw the paper out. It means budget more time for staff training, build BST loops into the rollout, and expect treatment fidelity to drift unless you actively coach it. If the procedure is so complex that only the first author can execute it cleanly, the intervention probably is not portable as written. You will be adapting, whether you plan to or not.
What the setting required (staffing ratios, medical support, materials)
The next question is what the paper assumed about resources. A lot of beautiful research comes out of inpatient hospital settings. In those settings, every learner had two staff and one clinical supervisor assigned to them. There was a BCBA-D over a couple of cases. Two nursing staff on site at all times. Full meal service. Every piece of protective equipment money could buy. Laminated materials. Time.
That is not your school district. That is not your group home. That is not even your clinic on a Tuesday when one tech called out and another is in training.
Read the setting paragraph and ask three questions. How many adults were assigned to each learner? Was medical support standing by? What materials and prep time were assumed?
If the answers do not match your situation, you have a choice. Translate the intervention to fit your resources, or pick a different paper. What you cannot do is pretend the gap is not there and run the procedure as written. That is how you get a beautiful research design and a 30 percent effect size in practice.
Behavioral feeding is the cleanest example of this trap.
"Behavioral feeding articles, the last paragraph is basically: if you don't go to these three hospital settings and train, you probably shouldn't do this in real life."
If you skip that paragraph and lift the procedure, you are not doing evidence-based practice. You are guessing with citations.
Sample size and how it limits your conclusions
Most applied behavior analysis research uses single-case designs, so the question is not whether the study is statistically powered. It is whether the sample contains anyone who looks like your learner.
Look at the participant table. Age. Diagnosis. Language level, often reported as a VB-MAPP score. IQ if it is there. Setting. Then ask: is my learner inside or outside that range?
If your learner is inside the range, you have decent grounds to expect a similar response to the intervention. Not a guarantee. Grounds. If your learner is outside the range, especially on a variable the authors flag as prerequisite, like identity matching for a PECS rollout, you need to either teach the prerequisite first or pick a different intervention.
You do not need an exact match. You need a similar enough match that the behavioral principles in play are likely to generalize. Age and diagnosis are usually rule-out variables. Verbal repertoire and skill prerequisites are usually rule-in variables. Use them that way.
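If it helps to make the inside-or-outside question concrete, here is a minimal Python sketch of the comparison. Everything in it is an assumption for illustration: the variable names, the ranges, the learner values, and the prerequisite labels are invented, not drawn from any real paper or assessment. It only lists where a learner falls outside a sample's reported range and which author-flagged prerequisites are missing. Whether the match is "similar enough" is still your clinical call.

```python
def outside_range(learner, sample_ranges):
    """Return the variables on which the learner falls outside the
    (low, high) range reported in the paper's participant table."""
    return [
        name
        for name, (low, high) in sample_ranges.items()
        if not (low <= learner[name] <= high)
    ]


def missing_prerequisites(learner_skills, prerequisites):
    """Return author-flagged prerequisite skills the learner does not yet show."""
    return sorted(set(prerequisites) - set(learner_skills))


# Hypothetical participant ranges copied from an imaginary participant table.
sample_ranges = {"age_years": (4, 9), "language_score": (40, 95)}

# Hypothetical learner.
learner = {"age_years": 6, "language_score": 32}
learner_skills = {"motor imitation"}

print(outside_range(learner, sample_ranges))
print(missing_prerequisites(learner_skills, ["identity matching"]))
```

An empty list from both functions means the learner sits inside the reported range and has the flagged prerequisites. Anything else is a prompt to teach the prerequisite first or look at a different intervention.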
Replication: is this the first time anyone tried it
Single-study findings get a softer weight than replicated ones. That is true in any science, and it is true here. If the paper you are reading is the first published demonstration of a procedure, treat it as promising, not proven.
The fastest way to check is to look at the reference section of a recent paper on the same topic. If the procedure shows up in papers from three or four different author groups across a decade, you have replication. If it shows up only in the same lab citing itself, you do not.
A pro tip: if you are starting cold on a topic, find one solid recent review or summary paper and copy its reference list. That gives you a working bibliography in five minutes instead of two hours of search.
Replication is not the only thing that matters, but it should change how much you are willing to stretch the intervention. A well-replicated procedure tolerates more adaptation. A single-study finding deserves a tighter copy and a closer eye on data.
When a discussion section is doing too much work
The discussion section is where authors stretch. That is fine. That is what discussions are for. Your job is to notice when the stretch goes too far.
Watch for three patterns. The first is broad recommendations that the data does not support. A study with three participants in a hospital does not justify a sentence like "clinicians should use this procedure in school settings." It might justify "future work should examine generalization to school settings." Those are different claims.
The second is downplaying the limits. If the methods describe two nurses on site and the discussion never mentions staffing requirements, the authors are letting you off the hook in a way the data does not.
The third is overreach on the mechanism. If the procedure worked, the discussion should be honest about why it worked, not just what it produced. When the mechanism explanation is vague or the authors hand-wave the active ingredient, that is a sign you may not be able to reproduce the effect because nobody, including the authors, is sure what the effect depended on.
If a discussion section is doing too much work, weight the methods and results more, and weight the discussion less. The data is still useful. The recommendations are softer than they look.
When a small or messy paper is still enough to act on
Not every paper you cite has to be airtight. Sometimes a small, imperfect study is exactly what you need, because the principle it demonstrates maps cleanly onto your clinical question.
A paper with three participants and no replication can still teach you something real about a behavioral principle. That principle generalizes. The procedure may not, but the principle does. If you are pulling shaping logic from a study, you do not need that study to look exactly like your client. You need shaping to work, which it does, broadly, across populations and settings.
So a small paper is enough to act on when three things line up. First, the principle is well established outside the paper. Second, the paper's contribution is showing the principle in a context similar enough to yours that you can borrow the procedural sketch. Third, you are treating the paper as one input among several, not as the sole justification.
"Consider parsimony in our conversations."
That is the test. Are you reaching for the simplest explanation and the simplest intervention that fits the data, or are you forcing a fancy paper to do a job a basic shaping plan would do better? If the basic plan fits, use the basic plan and cite the basic paper. You do not need to impress anyone with your reference list. You need to help the client.
FAQ
Does sample size really matter in single-case design? Less than in group designs, but still some. Look for whether the sample contains anyone like your learner on the variables that matter for the intervention. Verbal repertoire, prerequisite skills, and learning history matter more than raw participant count.
Is one published article enough to justify a treatment choice? Sometimes, when the principle is well replicated outside that single article and the procedure is a clean fit. Treat one paper as one input. Layer in the broader literature on the principle, the setting fit, and parsimony before you commit.
How do I know if the authors over-reached in their discussion? Compare the participant section to the recommendations. If the data was three preschoolers in a hospital and the discussion says "clinicians" without qualifier, the authors are stretching. Trust the methods and results more than the discussion.
Should I trust an article if only the first author ran sessions? You can still learn from it, but expect a gap when your team runs the procedure. Plan for more training, a slower fidelity ramp, and ongoing coaching. Do not assume your RBT will hit the same numbers as a doctoral student.
What is a red flag I can spot in the first two minutes? Open methods, find the implementer line. If the first author ran every session and the setting had hospital-level support, you are not reading a paper you can copy. You are reading a paper you have to translate.
Try it on your next paper
Pick the paper sitting on your desk right now. Run the five checks in order: implementer, setting, sample, replication, discussion stretch. If three or more come back clean, you have something you can build on. If three or more come back flagged, you have a paper that needs translation before it touches a learner.
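The tally is simple enough to do on a sticky note. If you would rather keep it in a shared script, here is a minimal Python sketch of the same rule. The function name, the check labels, and the returned wording are my own assumptions, not part of the talk. With five checks, "three or more clean" and "three or more flagged" cover every case, so the function only needs to count the clean ones.

```python
# The five checks, in the order the guide runs them.
CHECKS = ("implementer", "setting", "sample", "replication", "discussion")


def triage(flags):
    """Apply the three-of-five rule.

    flags maps each check name to True if that check came back flagged,
    or False if it came back clean.
    """
    missing = [check for check in CHECKS if check not in flags]
    if missing:
        raise ValueError(f"checks not yet run: {missing}")

    clean = sum(1 for check in CHECKS if not flags[check])
    if clean >= 3:
        return "build on it"
    return "translate before it touches a learner"


# Hypothetical paper: first author ran every session in a hospital unit,
# but the sample, replication record, and discussion all look fine.
paper = {
    "implementer": True,
    "setting": True,
    "sample": False,
    "replication": False,
    "discussion": False,
}
print(triage(paper))
```

The count is a starting point, not a verdict. A single flag on a safety-critical item, such as a feeding protocol that assumes medical support, can outweigh four clean checks.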
Watch the full talk for three case studies on how this translation actually plays out at the bedside, in a coffee shop, and at a church.