See participant list

See Backgrounder


Welcome (Bob Carter, 0:00¹)

Dr. Carter provided an overview of why NIAMS was holding this meeting (see background), acknowledged key staff who were involved in its development, and urged participants to keep the discussions focused on the scientific needs and opportunities related to subset analysis in clinical studies. He asked everyone to think broadly about how the various statistical approaches could be applied to NIAMS mission areas and to share their contributions during the afternoon discussion.

¹ Times are noted in approximate hours and minutes on the videocast. The link does not take the viewer directly to the timepoint in the videocast.

Goals (Jeff Katz, Tor Tosteson, and Bob Carter, 0:15¹)

Having aggregate means from trials is helpful when clinicians interact with patients, but such knowledge is not entirely satisfactory because patients differ. The bulk of the roundtable will focus on the methods for making inferences that pertain to specific subgroups. Toward the end of the meeting, Dr. Tosteson will lead a session on ways that the community can analyze subgroups in the context of sustainable clinical trial ecosystems. Items to be considered include integrating new data sources, such as electronic health records, with clinical studies and other means for lowering the barriers to high-quality clinical research.

What matters in the clinic? Part 1

Physician perspective and guidelines (Arthur Kavanaugh, 0:22¹)

Using rheumatoid arthritis drugs as an example, Dr. Kavanaugh described the considerations that affect therapeutic choices: clinical efficacy, cost, convenience, patient comorbidities, adverse effects, and other therapies that a patient is taking. He emphasized that the decision to start a treatment is made in conjunction with the patient, and patients might want to stay on drug regimens that do not lead to even a 20 percent improvement in the number of tender and swollen joints and other factors (ACR20 criteria) because they are satisfied with the amount of benefit they receive. Patients often are reluctant to change therapies when there is a chance that they might get worse. Which patients can successfully taper to a lower dose, or completely off a therapy, remains another unanswered question.

Innovative designs for subset analysis in randomized clinical trials

Stratification and exploratory subset analysis (George Howard, 0:37¹)

Clinical trials usually attempt to recruit relatively homogenous populations. However, in a study with positive results, there is wide acceptance that subgroups may not share equally in the benefits: some may fail to show efficacy while others may be particularly responsive. Likewise, in a negative study, there is commonly an urge for exploratory analysis to find subgroups in which the treatment was efficacious. The long-standing approach to confirm the homogeneity of the effect or identify and document heterogeneous subgroups is to define a priori stratification factors and examine effects across levels of these factors. Examination of a posteriori factors may also be useful for hypothesis generation of potential factors that modify the observed effects. Dr. Howard cautioned that stratification cuts the available sample size, leading to less stable estimates. Coupled with increased risk of spurious findings through multiple testing, this can lead to misleading findings. Common sense is key, but attempts to find biologically plausible explanations have drawbacks. During the discussion, he noted that humans are quick to make causal associations that may not be true; any time you make a justification for an observed result, ask yourself "what would I have said if the result was the opposite?"
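The multiple-testing concern can be made concrete with a short calculation. The sketch below is illustrative only: it assumes every subgroup test is independent, is run at the same significance level, and that no true effect exists in any subgroup. The function names are hypothetical and were not part of the presentation.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false-positive finding among n independent
    null subgroup tests, each run at significance level alpha."""
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test level that keeps the family-wise rate at or below alpha
    (Bonferroni correction, one common and conservative choice)."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(k, round(familywise_error_rate(k), 3), bonferroni_alpha(k))
```

Under these assumptions, the chance of at least one spurious subgroup finding grows quickly with the number of subgroups examined, which is one reason a priori specification of a small number of stratification factors is preferred.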

Adaptive master protocols for subsets of patients with one disease (Lisa LaVange, 0:58¹)

Master protocols are useful for studying subgroups, either by finding a subgroup or by refining a previously identified subgroup. They are useful in the environment of precision medicine. They also are attractive to some patient participants because they provide many treatment options without requiring the patient to be screened separately for many studies. One overarching protocol can be used to study multiple diseases, multiple biomarker-defined patient subgroups, and multiple therapies. There are different types of master protocols: umbrella or platform trial is the term used when studying multiple drugs for one disease, and basket trial is the term when one drug is being studied in multiple disease cohorts. Exploratory master protocols are early phase trials set up to identify the best treatment for a biomarker-defined patient subgroup, and are typically not randomized. Confirmatory master protocols evaluate different therapies, each targeted to a biomarker-defined subgroup, relative to a control, for individual diseases in parallel, and typically are randomized. Master protocols capitalize on similarities among trials and shared infrastructure, which makes them efficient. Hurdles include the need for considerable up-front investment, regulatory buy-in, and sponsors that are willing to test drugs in collaboration with others. When designing master protocols, investigators should consider the strength of their hypotheses, decide which hypotheses to test, consider how adaptive they want the study to be, and define how they will analyze the results. They also should involve patient advocacy groups. Challenges with reviewing master protocols for safety and IND (investigational new drug) approval before a study begins came up during the discussion.

Methods and open-source software for optimizing adaptive enrichment designs (Michael Rosenblum, 1:24¹)

Adaptive enrichment designs specify a preplanned rule for modifying enrollment based on accrued data. This approach may be useful if the investigator suspects that treatment effects differ by subpopulation (e.g., defined by baseline biomarkers or disease severity). Dr. Rosenblum cautioned that adaptive designs are not always better than standard designs due to the tradeoffs involved. His research group has developed adaptive enrichment designs for time-to-event and other delayed outcome studies as well as free, open-source software to tailor adaptive enrichment designs to clinical research questions. Investigators input what they wish to learn and the resource constraints into the software, and the computer scans more than 1,000 potential designs and compares the top contenders to standard designs. His adaptive enrichment designs feature preplanned interim analysis with prespecified adaptation rules (e.g., recruit two subpopulations at first, then continue recruiting both if both appear to be benefiting, recruit only one if only that population appears to be benefiting, or stop the trial if neither is benefiting). The designs can handle many outcome types (e.g., binary, continuous, time-to-event). The goal is to minimize the expected sample size under prespecified power and Type 1 error constraints. While the expected sample size may be smaller than it would be using a standard design, the tradeoff is that the actual sample size could be considerably larger, and investigators will not know until the study is well underway. This budgeting issue came up during the discussion; Dr. Rosenblum recommended budgeting in case the study requires the largest number of anticipated participants.
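The kind of prespecified adaptation rule described above might be sketched as follows. This is a toy illustration, not Dr. Rosenblum's software or method: the function names and the futility threshold are hypothetical, and a real design would choose its boundaries so that power and Type 1 error constraints are met.

```python
from math import sqrt


def z_statistic(events_trt: int, n_trt: int, events_ctl: int, n_ctl: int) -> float:
    """Pooled two-sample z statistic for a difference in response
    proportions (treatment minus control)."""
    p_trt = events_trt / n_trt
    p_ctl = events_ctl / n_ctl
    pooled = (events_trt + events_ctl) / (n_trt + n_ctl)
    se = sqrt(pooled * (1 - pooled) * (1 / n_trt + 1 / n_ctl))
    return 0.0 if se == 0 else (p_trt - p_ctl) / se


def interim_decision(z_sub1: float, z_sub2: float, futility_z: float = 0.0) -> str:
    """Apply a rule fixed before the trial starts: keep enrolling only the
    subpopulations whose interim z statistic exceeds the futility threshold.
    The threshold value here is a placeholder assumption."""
    keep1 = z_sub1 > futility_z
    keep2 = z_sub2 > futility_z
    if keep1 and keep2:
        return "continue both"
    if keep1:
        return "enrich: subpopulation 1 only"
    if keep2:
        return "enrich: subpopulation 2 only"
    return "stop for futility"


if __name__ == "__main__":
    # Hypothetical interim counts, not data from any trial.
    z1 = z_statistic(30, 50, 20, 50)
    z2 = z_statistic(18, 50, 20, 50)
    print(interim_decision(z1, z2))
```

Because the rule is written down before any data are seen, the same function can be run over many simulated trials to estimate the distribution of the final sample size, which is the quantity at the heart of the budgeting discussion.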

Sequential, Multiple Assignment, Randomized Trials (SMART) (Kelley Kidwell, 1:49¹)

Dr. Kidwell introduced the concept of a dynamic treatment regimen, which is a sequence of individually tailored decision rules that specify whether, when, and how to alter the intensity, type, dose, or delivery of treatment at critical decision points during the course of care. Dynamic treatment regimens are most useful when the condition waxes and wanes with time, when the condition lacks a widely effective treatment or the treatment may be costly, burdensome, or have adherence problems, or when there is within- and between-person heterogeneity. Before implementing dynamic treatment regimens, clinicians must consider what the best first-line intervention is, what they will use as a measure of response to assess an intervention's success, when they will measure the response to the initial intervention, and what the best subsequent treatments are for non-responders and for responders. A SMART design provides a way to rigorously collect the evidence needed to estimate the key components of a dynamic treatment regimen. It is a type of multi-stage randomized design, where trial participants are randomized to a set of treatment options at critical decision points over the course of treatment. Unlike some sequential adaptive designs, all individuals participate in all stages of the trial. Subsequent randomization is based on information acquired up to that point. Researchers do not need to randomize everyone at every stage, but there must be two randomizations in sequence (e.g., first randomize participants to treatments A and B, then only randomize the non-responders to treatments E and F). Benefits of SMART designs include the ability to see treatment synergies or antagonisms and prescriptive effects (i.e., the initial treatment may elicit responses that can be used to better match individuals to subsequent treatments). Results may be more generalizable to the broader population than those of other designs.
Dr. Kidwell urged investigators who are considering SMART designs to keep their studies simple and straightforward; to power their studies for simple, important hypotheses; to restrict treatment options only by ethical, feasible, or strong scientific considerations; to define response and non-response in an easily identifiable way so that the results can be broadly translated into clinical practice; and to collect additional measures (e.g., quality of life, toxicities, adherence) necessary to develop dynamic treatment regimens in the clinic. One major drawback to SMART designs is the potentially higher operational cost of administering multiple stages of treatments. Because it can be difficult to get pharmaceutical companies to agree to participate in SMART designs based on drug interventions, many SMART designs involve behavioral interventions or multiple doses of a single drug.
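The two-stage example mentioned above (randomize to A or B, then re-randomize only non-responders to E or F) can be sketched as a small simulation. Everything here is an assumption made for illustration: the function names, the response probabilities, and the choice that responders simply continue their first-stage treatment.

```python
import random


def run_smart(n_participants: int, response_prob: dict, seed: int = 0) -> list:
    """Simulate assignment in a two-stage SMART.

    Stage 1: everyone is randomized to A or B.
    Stage 2: non-responders are re-randomized to E or F; responders
    continue their stage-1 treatment (an assumption of this sketch).
    response_prob holds hypothetical response probabilities per arm.
    """
    rng = random.Random(seed)
    records = []
    for pid in range(n_participants):
        stage1 = rng.choice(["A", "B"])
        responded = rng.random() < response_prob[stage1]
        stage2 = stage1 if responded else rng.choice(["E", "F"])
        records.append(
            {"id": pid, "stage1": stage1, "responded": responded, "stage2": stage2}
        )
    return records


def embedded_regimens(records: list) -> dict:
    """Count participants whose observed path is consistent with each
    embedded regimen 'start X; if no response, switch to Y'."""
    counts = {}
    for first in ("A", "B"):
        for rescue in ("E", "F"):
            counts[(first, rescue)] = sum(
                1
                for r in records
                if r["stage1"] == first and (r["responded"] or r["stage2"] == rescue)
            )
    return counts


if __name__ == "__main__":
    trial = run_smart(200, {"A": 0.4, "B": 0.5}, seed=1)
    print(embedded_regimens(trial))
```

Note that responders are consistent with two embedded regimens at once, which is why analyses of SMART data typically weight or replicate observations rather than compare the stage-2 arms naively.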

Desirability of outcome ranking (DOOR) (Scott Evans, 2:14¹)

The DOOR probability is the probability of a more desirable overall outcome while factoring in efficacy, toxicity, quality of life, etc., when assigned to one therapy instead of another. This can be useful in clinical decision making for both providers and patients. Dr. Evans used an example of ceftazidime-avibactam (CAZ-AVI) or colistin for the initial treatment of carbapenem-resistant Enterobacteriaceae. The DOOR in this example had four ordinal levels, ranked in descending order of desirability: 1) alive and discharged to home; 2) alive, not discharged to home, but lacking renal failure; 3) alive, not discharged to home, with renal failure; or 4) dead. Based on trial results, CAZ-AVI had a 64 percent higher chance of a more desirable outcome, defined as the proportion of patients in category 1. He then added the concept of "partial credit", scoring the levels of the DOOR similarly to an academic test (100 for most desirable; 0 for least desirable; partial credit for the two intermediate levels). This partial credit could be obtained by surveying expert clinicians or patients. If both intermediate stages were counted as 100 (same as being discharged home) and death was counted as zero, CAZ-AVI had a 16 percent advantage. If the intermediate stages were counted as zero (same as death), CAZ-AVI had a 13 percent advantage. Other combinations gave different relative results, all with the advantage to CAZ-AVI. He then showed data looking at CAZ-AVI versus colistin as a function of disease severity; the patients who were the sickest benefited from CAZ-AVI the most. As a second example, he referred to the PROVIDE study, which is a prospective multi-center observational study of adult hospitalized patients with MRSA infections.
The investigators wanted to know the vancomycin pharmacodynamic dose that is associated with optimal treatment outcomes (the idea being that if the patient does not receive enough medicine, the patient does not benefit but if the patient receives too much medicine, the patient experiences toxicity). This example had five levels of outcomes and quintiles of doses; the lower two doses appeared to be more beneficial. He then plotted the DOOR outcome as a function of dose enabling visualization of the patients that displayed the most net benefit which could then guide dose selection. In response to a question, Dr. Evans said it is possible to identify subgroups of patients who might respond by using patient factors as baseline covariates informative of DOOR outcomes and that the most challenging part of DOOR studies is determining how many levels to use and defining them.
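A DOOR probability and a partial-credit comparison can both be computed directly from the counts of patients in each ordinal category. The sketch below assumes the common convention of counting ties as one half; the function names are hypothetical, and the counts in the usage example are invented for illustration and are not the CAZ-AVI or PROVIDE results.

```python
def door_probability(counts_a: list, counts_b: list) -> float:
    """Probability that a randomly chosen patient on therapy A has a more
    desirable outcome than a randomly chosen patient on therapy B, plus
    half the probability of a tie. Both lists are ordered from the most
    desirable category to the least desirable."""
    n_a, n_b = sum(counts_a), sum(counts_b)
    wins = 0
    ties = 0
    for i, a in enumerate(counts_a):
        for j, b in enumerate(counts_b):
            if i < j:        # A lands in a more desirable category than B
                wins += a * b
            elif i == j:     # same category
                ties += a * b
    return (wins + 0.5 * ties) / (n_a * n_b)


def partial_credit_difference(counts_a: list, counts_b: list, scores: list) -> float:
    """Difference in mean partial-credit score (A minus B), where scores
    assigns 100 to the best category, 0 to the worst, and a chosen value
    to each intermediate category."""
    def mean_score(counts):
        return sum(c * s for c, s in zip(counts, scores)) / sum(counts)
    return mean_score(counts_a) - mean_score(counts_b)


if __name__ == "__main__":
    # Invented counts for four ordinal levels (best to worst).
    arm_a = [10, 15, 3, 2]
    arm_b = [5, 10, 5, 10]
    print(door_probability(arm_a, arm_b))
    for scores in ([100, 100, 100, 0], [100, 60, 40, 0], [100, 0, 0, 0]):
        print(scores, partial_credit_difference(arm_a, arm_b, scores))
```

Varying the intermediate scores, as in the loop above, mirrors the sensitivity analysis described in the talk: the conclusion is more convincing when the direction of the advantage does not depend on how partial credit is assigned.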

Discussion (2:40¹)

Dr. Tosteson provided an overview of the morning session, summarizing how each of the speakers addressed the "Whom to treat, when and with what" theme of the workshop. All speakers discussed methodologies to address "whom", which is the primary motivation of the traditional subset concepts and techniques discussed by Dr. Howard. Dr. Rosenblum's discussion of adaptive enrichment design addressed "whom" as well, using sequential methods for subgroup randomization. The SMART designs summarized by Dr. Kidwell seek to identify both the timing of sequential treatments and subject characteristics associated with greater treatment efficacy. Dr. Evans's work on DOOR outcomes points to the growing need for composite outcomes that better reflect the impact of efficacy, safety, and patient centeredness, which is particularly important with the increasing emphasis on comparative effectiveness. Innovations such as the master protocols advocated by Dr. LaVange can be leveraged to answer each dimension of our theme and are very important as we attempt to improve and streamline our clinical trial programs.

Dr. Carter further emphasized the promise of using biomarkers and an understanding of the basic disease biology to answer the question "Whom to treat, when, with what?" It is important to get as much knowledge as possible from the clinical studies that NIAMS supports through incorporating these rapidly developing methodologies.

What matters in the clinic? Part 2

Patient perspective (Suz Schrandt, 2:48¹)

Ms. Schrandt emphasized that she was speaking not only from her perspective as a person who has polyarticular juvenile rheumatoid arthritis but also from her experience with PCORI (Patient-Centered Outcomes Research Institute), the Arthritis Foundation, and patient-focused drug development at the FDA. One factor that influences how patients and their families decide on a course of treatment is how recent the diagnosis is; the relationship with the disease and the patient's learning and acceptance curves influence perceptions of treatment risks and benefits. Patients at all stages may be reluctant to start new therapies, although their motivations and concerns may differ. This reluctance also influences patients' decisions to join clinical studies. Often, the people in clinical trials are those for whom no therapy has proven effective, which means the sample is not representative of the community at large. The rheumatologic diseases community's perception of risks and benefits of drugs also varies with the era in which a person was diagnosed with the disease (i.e., pre-biologics or post-biologics). These contextual pieces directly influence the sorts of information patients are seeking from their healthcare providers and their peers. She is concerned that we are missing opportunities to collect and use longitudinal patient data: How long did a treatment work? Did it stop working, or did a patient stop taking it because of a side effect or insurance issues? A patient's health at baseline is also important. QALYs (quality-adjusted life years) assume that a person is healthy at baseline; this is not usually the case for rheumatic disease patients. Ms. Schrandt shared her own experience with a wrist fusion and a total wrist replacement to illustrate the importance of getting the perspective from other patients like her in addition to that of her surgeon.
Preference plays a role in treatment decisions, especially in the characterization of serious adverse events and nuisance side effects. She used methotrexate plus disease-modifying anti-rheumatic drugs (DMARDs) and biologics as an example; some patients/families willingly trade the very small risk of severe adverse events from biologics (infection, malignancy), to avoid the GI symptoms very common with methotrexate—even though the latter is not considered by clinicians and researchers to be a serious adverse event. In closing, she urged researchers to capture information that patients really care about (e.g., fatigue); patient reported outcomes are valuable as long as they provide information that is important to patient decision making.

Subset analyses in cohort studies

Clinical inference from subgroups in cohorts: 2 examples (Dennis Black, 3:04¹)

Dr. Black began his presentation by noting that observational studies are "less expensive than randomized clinical trials per p-value" and are more flexible so they can address multiple research questions. They are becoming increasingly feasible and valuable with the advent of electronic health records and HMO and administrative databases. He noted that three NIAMS-funded observational studies at his institution (the Osteoporotic Fractures in Men (MrOS) Study, and the Osteoarthritis Initiative) have generated more than 1,500 publications. His first example of clinical inference from subgroups in an observational study addressed bisphosphonates and atypical femoral fractures in Asians and Caucasians. The study population was 1.1 million women over 50 years of age who received care through Kaiser Permanente Southern California. There were 272 atypical femoral fractures, the incidence of which increased with the length of time a woman was on bisphosphonates. Almost 50 percent of the fractures were in Asians, who made up about 12 percent of the population (hazard ratio = 4.9 after adjustment). However, when looking at who benefited most from bisphosphonates in terms of reductions in hip fractures, the data revealed that Asians had an approximately 4-fold lower risk of experiencing a hip fracture than Caucasians did in the absence of treatment. For Caucasians, the benefits strongly outweigh the risks at all timepoints, but those findings did not hold true for Asians (paper in review). Dr. Black then talked about challenges in using observational data to obtain efficacy results consistent with randomized clinical trials. This was the predominant topic of discussion following Dr. Black's presentation.

Causal inference in follow-up of randomized controlled trial cohorts: key concepts and methods (Ellie Murray, 3:29¹)

Dr. Murray said she views clinical research on a risk-of-bias continuum where at one end is the ideal randomized controlled trial that can give a perfectly unbiased causal effect estimate while at the other end is an intractably biased reliance on "gut feeling" in the absence of evidence. Explanatory randomized controlled trials, pragmatic randomized trials, and observational studies are between those two impractical extremes. Even an exceptionally good explanatory randomized controlled trial may be subject to lack of generalizability, loss to follow-up, and non-adherence. Additionally, observational studies are subject to baseline confounding, but tools exist to address all of these sources of bias. She talked about intention-to-treat (including everyone who had been assigned to treatment A or B) and per-protocol (including everyone who had received treatment A or B) designs. Although one common criticism of the per-protocol designs is that non-adherers are different than adherers, researchers have developed strategies to overcome this confounding factor. Both intention-to-treat and per-protocol approaches might have bias due to loss to follow-up. Analysis of the per-protocol effect also may have challenges because adherence can have multiple definitions, which must be specified in advance (e.g., take treatment as appropriate, take every dose of treatment, take most doses of treatments). Dr. Murray introduced the concept of g-methods to adjust for loss to follow-up in intention-to-treat designs. G-methods (e.g., inverse probability weighting, g-formula, g-estimation) are tools that generalize estimation of causal effects to settings with treatment-confounder feedback.
Regarding subsets, intention-to-treat and per-protocol effects can be estimated within subgroups with little extra effort; subgroup effects should be reported using an additive scale; and patients and advocates should be included in a priori specification of subgroups because they bring a unique perspective. Randomized clinical trials share many issues in common with cohort studies, but can still take advantage of randomization-based methodology (e.g., instrumental variables) to improve causal inference.
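Inverse probability weighting for loss to follow-up, reported on the additive (risk difference) scale, can be sketched in a few lines. This is a deliberately simple, nonparametric version that assumes dropout depends only on treatment arm and one categorical baseline stratum; the record layout and function name are hypothetical, and real analyses usually model the probability of remaining in follow-up with many covariates.

```python
from collections import defaultdict


def ipw_risk_difference(records: list) -> float:
    """Risk difference (arm 1 minus arm 0) with inverse probability
    weights for loss to follow-up.

    Each record is a dict with keys:
      arm      -- 0 or 1, the randomized assignment
      stratum  -- a baseline category assumed to drive dropout
      observed -- True if the outcome was measured
      outcome  -- 0 or 1 (ignored when observed is False)
    """
    enrolled = defaultdict(int)
    retained = defaultdict(int)
    for r in records:
        key = (r["arm"], r["stratum"])
        enrolled[key] += 1
        retained[key] += 1 if r["observed"] else 0

    weighted_events = defaultdict(float)
    weighted_total = defaultdict(float)
    for r in records:
        if not r["observed"]:
            continue
        key = (r["arm"], r["stratum"])
        # Weight = 1 / estimated probability of staying in follow-up.
        weight = enrolled[key] / retained[key]
        weighted_events[r["arm"]] += weight * r["outcome"]
        weighted_total[r["arm"]] += weight

    risk1 = weighted_events[1] / weighted_total[1]
    risk0 = weighted_events[0] / weighted_total[0]
    return risk1 - risk0
```

Running the same function on the records from a single subgroup gives the subgroup-specific risk difference, which is the additive-scale quantity recommended for reporting.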

Real world data, real world assumptions (Patrick Heagerty, 3:54¹)

Real-world data (RWD), which relate to patient health status or the delivery of health care, are routinely collected from electronic health records, billing and claims data, patient generated reports, etc. Real-world evidence (RWE) is the clinical evidence about the usage and potential benefits or risks of a medical product; it is derived from analysis of RWD. Dr. Heagerty gave three examples of different study designs. The first illustrated variabilities in missing data among study sites. The second showed variability in CPT (current procedural terminology) codes between two sites and raised the question of whether the differences were due to variations in the types of procedures done or the codes used. The third example was from a pragmatic trial which showed that accuracy of the tool used to extract the data from medical records varied with the prevalence of the procedure. He mentioned two initiatives based on real world evidence: the FDA's national medical product monitoring system called the Sentinel Initiative and PCORnet, which combines electronic health record data with a broader purpose of studying multiple questions. Dr. Heagerty also presented an example cautioning how analyses of claims-based data that do not adequately account for unrecognized confounding can be misleading. RWD have biases associated with them which, despite large sample sizes, make it difficult to reveal small and moderate sized signals. Minimal or high-fidelity assumptions are essential for reliable results.

Machine learning in causal inference (David Benkeser, 4:15¹)

Dr. Benkeser began by framing machine learning in the context of how good science is done and where machine learning can help with the process. His first slide showed a pyramid of good science, the base of which was data (are the right data captured on the right people?), followed by the estimand (does the estimand answer questions of interest?) and pre-specification (is the question answered in a reliable way?), with methods at the peak (are the methods robust?). He next described the virtues of randomized clinical trials and set up the remainder of his presentation by asking how machine learning can allow observational/secondary data analysis to have similar rigor through pre-specification of analyses and the use of more robust methodologies. Machine learning provides flexible alternatives to logistic regression when subgroups are already known. Regression stacking (also known as super learning) balances the needs for pre-specification and robustness. Investigators pre-specify a long, robust list of ways they would like to model the outcomes process and/or the propensity score. Then they use cross validation with an a priori specified criterion to decide which approach is best or to build an ensemble of estimators. However, not all methods for causal effect estimation are compatible with machine learning. Inverse probability of treatment weighted (IPTW) estimators and g-formulas do not display standard behaviors, and the nonparametric bootstrap does not always have a strong theoretical base; doubly robust estimators are the preferred approach. Machine learning also can be applied when the subgroups are not known a priori, in which case it requires both training and validation data.
Dr. Benkeser then wrapped up his presentation with two examples of how machine learning can be applied to treatment decisions (either what would happen if we treat subsets of people according to a protocol or cases where the treatment rule of interest was not defined a priori).
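The cross-validation step in super learning can be illustrated with a "discrete" super learner, which picks the single candidate with the lowest cross-validated risk instead of building a weighted ensemble. The sketch below is an assumption-laden toy: the two candidate learners, the squared-error criterion, and all function names are chosen for illustration and are not taken from the presentation.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Candidate 1: ignore the covariate and predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Candidate 2: simple least-squares line on one covariate."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = 0.0 if sxx == 0 else sum(
        (x - mx) * (y - my) for x, y in zip(xs, ys)
    ) / sxx
    return lambda x: my + slope * (x - mx)


def cv_risk(fit, xs, ys, k=5):
    """K-fold cross-validated mean squared error (the prespecified
    criterion in this sketch). Folds are assigned by index modulo k;
    in practice the fold assignment would be randomized."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        test = [i for i in range(n) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - model(xs[i])) ** 2 for i in test)
    return total / n


def discrete_super_learner(library, xs, ys, k=5):
    """Return the name of the candidate with the lowest cross-validated
    risk, together with every candidate's risk."""
    risks = {name: cv_risk(fit, xs, ys, k) for name, fit in library.items()}
    best = min(risks, key=risks.get)
    return best, risks


if __name__ == "__main__":
    xs = [float(i) for i in range(20)]
    ys = [2.0 * x + 1.0 for x in xs]  # synthetic data for illustration
    print(discrete_super_learner({"mean": fit_mean, "linear": fit_linear}, xs, ys))
```

The key point for pre-specification is that the library and the selection criterion are fixed before the data are analyzed, so the choice among models is made by the algorithm and not by the analyst after seeing results.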

Drug Trial Snapshots: providing the most reliable information on the treatment effect a patient can expect (Mark Rothmann, 4:33¹)

The U.S. FDA Center for Drug Evaluation and Research has been doing Drug Trial Snapshots for approximately 4.5 years. Snapshots provide information to the public on who participated in clinical trials for new molecular entities and original biologics. They also include information on study design, efficacy and safety results, and whether significant differences among sex, race, and age subgroups (and other appropriate subgroups) exist. Dr. Rothmann emphasized the importance of providing consistent, simplified advice; information should not be contradictory. Recently, some Snapshots have considered the relevance of data outside the subgroup; this is where a process called shrinkage estimation comes in. Shrinkage estimated treatment effects using one-way models (within a given factor) have been used for some Snapshots. The goal is to eventually provide shrinkage estimated treatment effects using multi-way models (multiple factors or combinations of levels). Dr. Rothmann gave as motivation an example of a hypothetical new drug for which sex might be the only effect modifier and a physician has experience using the drug in two males and two females. When the physician prescribes the new drug to the next patient, a female, the doctor is likely to consider information from all four previous patients, although the outcomes from females may be a little more relevant than outcomes for males. As more patients are prescribed the drug, the relevance of the results from males to those of females depends on how similar the results are between the sexes and on the amount of data available on females. Dr. Rothmann described the two components of variability in sample treatment effects across subgroups: the within-subgroup variability of the sample estimator and the across-subgroup variability in the underlying or true parameter values. The sample estimates vary more than the true effects do, and the shrinkage estimates address within-subgroup variability in the estimates.
They provide narrower confidence intervals and address random highs and lows observed in the data. Dr. Rothmann also introduced the concepts of linking treatment effects (e.g., across studies, subgroups, products) and exchangeability (i.e., whether possible orderings of treatment effects are considered equally likely a priori).
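The idea of shrinking each subgroup estimate toward the overall effect can be sketched with a one-way random-effects model. This is an empirical Bayes illustration under stated assumptions, not necessarily the model used for the Snapshots: the between-subgroup variance is estimated with the DerSimonian-Laird method of moments, all standard errors are assumed known and positive, and the function name and example numbers are hypothetical.

```python
def shrink_subgroup_effects(estimates: list, std_errors: list) -> tuple:
    """Return (shrunken estimates, overall mean, between-subgroup variance).

    Each sample estimate y_i with standard error s_i is pulled toward the
    overall mean by the factor B_i = s_i^2 / (s_i^2 + tau^2), so noisier
    subgroups are shrunk more. Requires every s_i > 0.
    """
    k = len(estimates)
    w = [1.0 / s ** 2 for s in std_errors]
    fixed_mean = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)

    # Method-of-moments estimate of the across-subgroup variance tau^2.
    q = sum(wi * (y - fixed_mean) ** 2 for wi, y in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Overall mean under the random-effects weights.
    w_star = [1.0 / (s ** 2 + tau2) for s in std_errors]
    overall = sum(wi * y for wi, y in zip(w_star, estimates)) / sum(w_star)

    shrunken = []
    for y, s in zip(estimates, std_errors):
        b = s ** 2 / (s ** 2 + tau2)  # shrinkage factor toward the overall mean
        shrunken.append(b * overall + (1.0 - b) * y)
    return shrunken, overall, tau2


if __name__ == "__main__":
    # Hypothetical subgroup treatment effects and standard errors.
    print(shrink_subgroup_effects([0.8, 0.2], [0.3, 0.3]))
```

When the sample estimates differ no more than their standard errors would predict, the estimated between-subgroup variance is zero and every subgroup is shrunk all the way to the overall effect, which is how this approach damps random highs and lows.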

Creating a sustainable ecosystem for clinical studies (Tor Tosteson, discussant, 4:56¹)

Dr. Tosteson began the discussion with the premise that

Clinical trials have become increasingly expensive, time consuming, and seemingly "less successful" as measured by numbers of approved new drugs. How can the clinical research ecosystem best support new designs and methods for identifying efficacious and effective treatments in subsets?

He listed some issues related to clinical trials and observational studies. For clinical trials, these include possible barriers to trial initiation, conduct, reporting, and validation (e.g., increased enrollment requirements and costs, regulatory and human subjects protections, recruitment and prioritization); increased focus on patient-centered outcomes and comparative effectiveness studies; and the promise and challenges of electronic health records embedded in trials (e.g., missing or conflicting data, adequacy of electronic health record systems for research, patient identification, disruption of clinical workflows, use of HIPAA-protected data). Issues pertaining to observational research include whether prospective cohort designs can adequately anticipate subsets of interest, the promise and challenges of electronic health records for identifying subsets in which a treatment is efficacious, the impact of evolving ontologies within institutions and research networks, and promotion of and improved access to public use databases and research databases.

Following a comment from Dr. Kavanaugh, Dr. Tosteson talked about the possibilities of and opportunities to turn electronic health record systems into resources for doing clinical research. In response to a second point by Dr. Kavanaugh, Dr. Carter noted that the Institute rarely sees grant applications to test interventions requiring INDs from the FDA. Many are comparative studies of surgery versus rehabilitation or cognitive behavioral therapy versus rehabilitation.

Dr. Goodman noted that the outcome variables incorporated into clinical trials can be burdensome to participants. He encouraged people to think about ways to streamline data capture methods with the goal of reducing dropout rates and retaining study populations that better resemble those that providers see in their clinics. Ms. Schrandt emphasized the importance of including input from patients in study designs. Patients can help to identify endpoints to be measured, which can be more informative than the patient reported outcomes that researchers identify. She stressed the need to use all data that are generated at each clinical visit and by patients in their daily lives to answer real-world health care questions.

Dr. Merrill pointed out that modeling is the first step in working with subsets, particularly in a disease such as lupus.

Dr. Heagerty mentioned an article on predictive approaches to treatment effect heterogeneity (PATH) that describes ways of constructing derived variables or engineered features from a machine learning perspective that can be used to look for variations in treatment effects (Kent et al., Ann Intern Med. 2019 Nov 12). The authors put forth the concept of using a predictive risk score as an important, a priori, engineered variable that one should use when looking for treatment effect variation. In response to a second point regarding delivering results to patients, caregivers, and providers, Ms. Schrandt urged researchers to think about result dissemination at the beginning of a study; engaging patients during a study's design will help with this aspect.

Dr. Saag stated that he sees learning health care systems with embedded pragmatic trials as a way of the future. Such approaches address many of the issues raised at this roundtable while emphasizing the high-quality data that randomized controlled trials provide. He is encouraged by activities outside of NIAMS [i.e., PCORnet, Clinical and Translational Science Awards (CTSAs), a PCORI funding opportunity announcement], and asked Dr. Carter about NIAMS' plans for supporting large studies. In response, Dr. Carter agreed that the research landscape is changing, but noted that the funding landscape has not kept up. One possibility is to involve other NIH components and PCORI in funding clinical research of interest to NIAMS. Dr. Carter emphasized the importance of looking for ways to make studies more efficient. Dr. Katz reiterated that while investigators have heard several approaches to addressing questions about subgroups, powering for subgroup analyses can make studies more costly. Dr. Tosteson agreed, noting the need to balance study cost with the desire to practice data-driven precision medicine. He hopes electronic health records can be converted into systems that can be economically mined to provide good real-world data.

Dr. McAlindon felt that adherence to the randomized controlled trial heuristic is obligating researchers to construct large studies with prohibitive sample sizes and expenses, yet the logical direction of precision medicine is moving toward a subset of one patient. The ability to define factors influencing response in one patient is a technological hurdle that should be surmountable. He sees technologies such as smartphones as crucial for collecting granular data of direct relevance to patients. Virtual trials using data repositories also might hold promise for generating hypotheses when combined with machine learning.

Dr. Gelfand sees the following four steps as crucial for creating a sustainable ecosystem for clinical studies: 1) Train more people to do this kind of work. 2) Develop funding opportunities to engage patients and other stakeholders. 3) Look for additional ways NIAMS could partner with PCORI. 4) Support infrastructure for pragmatic trials that can address the NIAMS mission, which is an unfulfilled need in the United States.

Dr. Howard asked about budget flexibilities, noting Dr. Rosenblum's urging to budget for the largest possible number of participants. In response, Dr. Carter mentioned supplements may be available for unanticipated costs, but he would rather know upfront what a study's maximum cost might be.

Dr. Werth noted that small cohort studies can provide valuable information about responders and non-responders that in turn informs larger clinical trials.

Dr. Kohrt's take-home message was that a fair amount of knowledge is needed to predict differential effects among subgroups. Developing this mechanistic understanding can be hampered by NIH's policies regarding how clinical research is defined and overseen.

Dr. Callahan emphasized the importance of routine standardized measures and outcomes to make studies based on electronic health records more useful.

Application of methods for subset analysis in clinical research in NIAMS mission areas (Jeff Katz, discussant, 5:501)

Dr. Katz began by mentioning crossover challenges when conducting surgical clinical trials and how such trials could use SMART designs. Dr. Tosteson, who also has experience with surgical clinical trials, mentioned two-way crossover, in which some people who were randomized to surgery decided not to have it and some who were assigned to the non-surgical arm wanted surgery. This raises special issues related to causal inference.

Dr. Rosenblum pointed out that the ICH (International Council for Harmonisation) E9(R1) addendum partially addresses protocol deviations, which it refers to as intercurrent events. The addendum describes five strategies, each tied to a decision about what is being estimated. He also noted that the estimand for any strategy other than intention-to-treat analysis requires very strong assumptions. Dr. Murray indicated that the ICH E9(R1) addendum could have made stronger recommendations regarding loss to follow-up.

Dr. Katz then raised the issue of chronic episodic diseases, which comprise most of the NIAMS portfolio. In these cases, clinical trialists need to consider both the placebo effect and the natural history of the disease. Dr. Murray responded by citing some advantages of the per-protocol effect approach, emphasizing the importance of well-defined guidance on when to start and stop treatment and of rigorous, high-quality data collection to understand why participants started and stopped treatments.

Dr. Carter cited two examples of planning grants that NIAMS is supporting that could be modified to provide information about subgroups who will respond to different treatments. Dr. Tosteson envisions SMART designs being useful in many examples in rheumatology. Dr. LaVange agreed that SMART is ideal for defining sequential treatments and could have a place in many NIAMS studies; Dr. Rosenblum agreed. Dr. Kidwell noted the importance of intermediate outcomes that are quick to assess and agreed upon so that participants can be re-randomized; these outcomes must be parameters that can be measured across multiple sites. Dr. Merrill cautioned about potential pitfalls of SMART. She noted that administration of one drug may alter a person's phenotype, and such pharmacodynamic changes may cause the second drug to be effective when it otherwise would not have been. Modeling of potential pharmacodynamic changes can help address this issue.

Dr. Rosenblum mentioned the potential to improve precision, regardless of trial design, by accounting for baseline variables that could be correlated with clinical outcomes. In other words, if variables thought to be correlated with outcomes exist, study protocols can be developed that include these variables in the analyses (e.g., in the case of a genetic modifier, look at differences in placebo and intervention in the presence and absence of the modifier as well as analyzing all results together). The most common approach is ANCOVA, or analysis of covariance. Dr. Benkeser echoed this point, noting that investigators collect huge amounts of data that they do not use in primary and subgroup analysis.
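
The precision gain from covariate adjustment can be seen in a small simulation. The following is a minimal sketch and was not presented at the roundtable. It assumes a simulated two-arm trial with a single continuous baseline variable that is correlated with the outcome, and it fits the ANCOVA model by ordinary least squares. The data, parameter values, and function names are all hypothetical.

```python
import numpy as np


def simulate_trial(n=400, effect=0.5, rho=0.7, seed=0):
    """Simulate a 1:1 randomized trial with one prognostic baseline variable.

    `rho` is the assumed correlation between baseline and outcome
    within each arm; the outcome has unit variance within each arm.
    """
    rng = np.random.default_rng(seed)
    baseline = rng.normal(size=n)
    treat = rng.permutation(np.repeat([0, 1], n // 2))
    noise = rng.normal(scale=np.sqrt(1 - rho**2), size=n)
    outcome = effect * treat + rho * baseline + noise
    return treat, baseline, outcome


def treatment_effect(outcome, treat, covariate=None):
    """Return the OLS estimate and standard error of the treatment effect.

    Without `covariate` this is a simple difference in means;
    with it, this is an ANCOVA adjusting for the baseline variable.
    """
    columns = [np.ones_like(outcome), treat.astype(float)]
    if covariate is not None:
        columns.append(covariate)
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    residuals = outcome - X @ beta
    dof = X.shape[0] - X.shape[1]
    sigma2 = residuals @ residuals / dof
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], np.sqrt(cov_beta[1, 1])


treat, baseline, outcome = simulate_trial()
unadj_est, unadj_se = treatment_effect(outcome, treat)
adj_est, adj_se = treatment_effect(outcome, treat, baseline)
print(f"Unadjusted: estimate = {unadj_est:.3f}, SE = {unadj_se:.3f}")
print(f"ANCOVA:     estimate = {adj_est:.3f}, SE = {adj_se:.3f}")
```

Both analyses target the same treatment effect, but the adjusted standard error is smaller because the baseline variable explains part of the outcome variance. How much precision is gained in a real trial depends on how strongly the chosen baseline variables predict the outcome.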

Dr. Carter gave a hypothetical example of a trial testing physical therapy and cognitive behavioral therapy for osteoarthritis in which investigators believe that there will be a subset of patients who respond better to cognitive behavioral therapy. Dr. Carter asked whether the most reasonable design was a simple randomized controlled trial with stratification, or whether other options should be considered. Dr. Rosenblum noted that adaptive enrichment may be useful in cases where the researchers have a sense of what the modifying factors are, but this approach is not appropriate when the predictive variables are not known. Dr. Katz emphasized the importance of clarifying the research question (e.g., do you think the only people who will respond to cognitive behavioral therapy are people who have depression?). He also noted that the trial will be designed differently depending on whether the question about subgroups is the primary outcome or whether it will be addressed in post-hoc analysis for hypothesis generation.

Closing thoughts (Tor Tosteson and Jeff Katz, 6:291)

Dr. Tosteson noted that the day's topics extended beyond subgroup analysis and covered many current issues that researchers face when designing and conducting clinical studies. Studies can have both randomized controlled trial and observational study components. Good cohort studies share characteristics of good trials. As the field looks to a sustainable economic model, subgroups are part of the future with its push toward targeted therapies.

Dr. Katz emphasized the need to find design features and analytic approaches that allow investigators to make precise inferences more efficiently. Even without such innovations, the more efficient the overall trial design is, the more likely it is that researchers can draw conclusions about subsets. He thanked participants for their knowledgeable and astute perspectives.

Confirmed participants

  • David Benkeser, PhD, MPH, Emory University
  • Dennis Black, PhD, University of California, San Francisco
  • Leigh Callahan, PhD, University of North Carolina
  • Scott Evans, PhD, George Washington University
  • Joel Gelfand, MD, University of Pennsylvania
  • Stuart Goodman, MD, Stanford University
  • Patrick Heagerty, PhD, University of Washington
  • George Howard, PhD, University of Alabama
  • Jeffrey Katz, MD, Brigham and Women's Hospital (co-chair)
  • Arthur Kavanaugh, MD, University of California, San Diego
  • Kelley Kidwell, PhD, University of Michigan
  • Wendy Kohrt, PhD, University of Colorado
  • Lisa LaVange, PhD, University of North Carolina
  • Timothy McAlindon, MD, MPH, Tufts Medical Center
  • Joan Merrill, MD, Oklahoma Medical Research Foundation
  • Eleanor (Ellie) Murray, PhD, Boston University
  • Nikolay Nikolov, MD, U.S. Food and Drug Administration
  • Michael Rosenblum, PhD, The Johns Hopkins University
  • Mark Rothmann, PhD, U.S. Food and Drug Administration
  • Ken Saag, MD, MSc, University of Alabama
  • Suz Schrandt, JD, ExPPect
  • Tor Tosteson, ScD, Dartmouth (co-chair)
  • Victoria Werth, MD, University of Pennsylvania


  • Current State: Most clinical trials funded by NIAMS use traditional randomized clinical trial (RCT) designs, which measure the difference in average outcome between participants assigned to one intervention or another. However, it would be more clinically useful to understand who, within the relevant population, would likely benefit more from (or be harmed by) either intervention. This can be addressed to some extent in an RCT by stratification, but at a significant increase in costs and requiring pre-specification. Other strategies may also provide practical means to uncover “whom to treat, when, with what?”
  • Objectives: The roundtable will gather input as to how clinical studies can be designed to identify groups of individuals that are more likely to benefit from a given intervention in order to better inform clinical care. At the end of the roundtable, attendees should be able to answer the following questions:
    • In general terms, how do results from clinical studies impact treatment decisions in the clinic?
    • When should RCTs intended to test the efficacy of an intervention be designed to estimate the benefits and harms of interventions in subsets of the participants?
    • When should other study approaches to subset analysis, including the incorporation of such methods into longitudinal observational studies, be used to understand the benefits and harms of interventions in subsets of participants?
    • How can these principles be applied to clinical research relevant to the NIAMS mission?
  • Desired Outputs
    • Ideas about clinical study designs that could inform NIAMS Funding Opportunity Announcements
    • A white paper summarizing ideas from the roundtable to emphasize NIAMS interest in this area
  • Intended Audience
    • NIAMS staff
      • To inform the review of clinical study applications
      • To inform discussions with investigators about their research studies
    • Research community
      • To help investigators explore options in the design of clinical studies
      • To communicate NIAMS support for these approaches
  • Questions to be Addressed
    • Background: Update on research on clinical decision-making
      • What is the literature on how information from clinical studies can impact decisions by patients and providers in the clinic?
      • Can alternative study designs be sufficiently rigorous to be included in practice guidelines?
    • Study Design Perspective: What are the advantages and disadvantages of different clinical study designs in terms of informing the clinical care and outcomes for individual patients?
      • Study designs could include
        • RCTs with strata or Bayesian analysis for assessing likelihood of response
        • Master protocols
        • Regression discontinuity methods
        • N of 1 / sequential studies
        • Observational studies designed and analyzed using advanced causal inference approaches
        • Others?
      • Considerations could include
        • Statistical requirements (e.g., how to power?)
        • Cost/benefit of creating platforms or infrastructure
        • Building studies embedded into electronic health records
    • Clinical Research Perspective: What clinical questions in NIAMS mission areas could be best addressed by different study designs? Are any types of intervention (e.g., surgical, biopsychosocial) more likely to be appropriate for certain study designs?