This site is a compendium of R code meant to highlight the various uses of simulation to aid in the understanding of probability, statistics, and study design. I frequently draw on examples using my R package simstudy. Occasionally, I opine on other topics related to causal inference, evidence, and research more generally.

simstudy: another way to generate data from a non-standard density

One of my goals for the simstudy package is to make it as easy as possible to generate data from a wide range of distributions. The recent update created the possibility of generating data from a customized distribution specified in a user-defined function. Last week, I added two functions, genDataDist and addDataDist, that allow data generation from an empirical distribution defined by a vector of integers. (See here for how to download the latest development version.) This post provides a simple illustration of the new functionality.
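The basic idea can be sketched without the new functions: treat the integer vector as an empirical distribution and draw smooth values from its kernel density estimate. The helper below (gen_from_empirical) and its smoothing approach are illustrative assumptions, not the package’s actual implementation.

```r
library(data.table)

# an integer vector standing in for an observed (empirical) distribution
base_data <- c(1, 2, 2, 3, 3, 3, 4, 4, 7, 8, 8, 9, 10, 10)

# draw from the kernel density estimate of x: resample the observed values,
# then add Gaussian noise with the KDE bandwidth (an assumed approach)
gen_from_empirical <- function(n, x) {
  bw <- density(x, bw = "SJ")$bw
  rnorm(n, mean = sample(x, n, replace = TRUE), sd = bw)
}

dd <- data.table(id = 1:1000, y = gen_from_empirical(1000, base_data))
head(dd)
```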

[Read More]
R  simstudy 

simstudy 0.8.0: customized distributions

Over the past few years, a number of folks have asked if simstudy accommodates customized distributions. There’s been interest in truncated, zero-inflated, or even more standard distributions that haven’t been implemented in simstudy. While I’ve come up with approaches for some of the specific cases, I was never able to develop a general solution that could provide broader flexibility.

That limitation has been addressed in the latest version of simstudy, now available on CRAN. Custom distributions can now be specified in defData and defDataAdd by setting the argument dist to “custom”. To introduce the new option, I am providing a couple of examples.
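To give a flavor of what this looks like, here is a rough sketch using a truncated normal. It assumes the user-defined function is referenced by name in the formula field, with its arguments passed as a string in the variance field and n as the first argument; the exact convention may differ slightly from the package vignette.

```r
library(simstudy)

# user-defined generator: truncated normal via inverse-CDF sampling
rtnorm <- function(n, lower, upper, mu, s) {
  p_l <- pnorm(lower, mu, s)
  p_u <- pnorm(upper, mu, s)
  qnorm(runif(n, p_l, p_u), mean = mu, sd = s)
}

# reference the function by name and pass its arguments as a string
# (assumed convention - see the package vignette for the specifics)
def <- defData(
  varname = "x",
  formula = "rtnorm",
  variance = "lower = 0, upper = 10, mu = 5, s = 3",
  dist = "custom"
)

dd <- genData(1000, def)
summary(dd$x)
```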

[Read More]
R  simstudy 

simstudy enhancement: specifying idiosyncratic follow-up times for longitudinal data

A researcher reached out to me a few weeks ago. They were trying to generate longitudinal data that included irregularly spaced follow-up periods. The default periods generated by the function addPeriods in the simstudy package are {0, 1, 2, ..., n - 1}, where there are n total periods. However, when follow-up periods required more specificity, such as {0, 90, 180, 365} days from baseline, users had to manually add them. Originally, I had intended to incorporate this feature into the function, but unfortunately it slipped through the cracks. Thanks to the clear motivation provided by the researcher, I’ve implemented this enhancement. Users can now replace the default vector with their desired set of follow-up periods using the new argument periodVec. This addition is available in the development version of simstudy on GitHub.
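A minimal sketch of how this might be called (periodVec is the new argument described above; the rest of the call leans on addPeriods defaults, and the specifics are assumed):

```r
library(simstudy)

# baseline data for 5 individuals with a single covariate
def <- defData(varname = "x", formula = 0, variance = 1, dist = "normal")
dd <- genData(5, def)

# expand to long format with follow-up at 0, 90, 180, and 365 days
# instead of the default 0, 1, 2, 3
dlong <- addPeriods(dd, nPeriods = 4, periodVec = c(0, 90, 180, 365))
dlong
```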

[Read More]

Perfectly balanced treatment arm distribution in a multifactorial CRT using stratified randomization

Over two years ago, I wrote a series of posts (starting here) that described possible analytic approaches for a proposed cluster-randomized trial with a factorial design. That proposal was recently funded by NIA/NIH, and now the Emergency departments leading the transformation of Alzheimer’s and dementia care (ED-LEAD) trial is just getting underway. Since the trial is in its early planning phase, I am starting to think about how we will do the randomization, and I’m sharing some of those thoughts (and code) here.
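As a preview of the kind of code involved, here is a rough sketch of stratified, balanced assignment using simstudy’s trtAssign; the number of hospitals, the stratification variable, and the four-arm structure are placeholders rather than the actual ED-LEAD design.

```r
library(simstudy)

# 48 hospitals with a (hypothetical) size category used as the stratification variable
def <- defData(varname = "size", formula = "0.3;0.4;0.3", dist = "categorical")
dd <- genData(48, def, id = "hospital")

# balanced assignment to the 4 arms of a 2x2 factorial, stratified by size
dd <- trtAssign(dd, nTrt = 4, balanced = TRUE, strata = "size", grpName = "arm")
dd[, table(arm, size)]
```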

[Read More]

A three-arm trial using two-step randomization

Clinical Decision Support (CDS) tools are systems created to support clinical decision-making. Health care professionals using these tools can get guidance about diagnostic and treatment options when providing care to a patient. I’m currently involved with designing a trial focused on comparing a standard CDS tool with an enhanced version (CDS+). The main goal is to directly compare patient-level outcomes for those who have been exposed to the different versions of the CDS. However, we might also be interested in comparing the basic CDS with a control arm, which would suggest some type of three-arm trial.

[Read More]

Creating a nice-looking Table 1 with standardized mean differences

I’m in the middle of a perfect storm, winding down three randomized clinical trials (RCTs), with patient recruitment long finished and data collection all wrapped up. This means a lot of data analysis, presentation prep, and paper writing (and not so much blogging). One common (and not so glamorous) thread cutting across all of these RCTs is the need to generate a Table 1, the comparison of baseline characteristics that convinces readers that randomization worked its magic (i.e., that study groups are indeed “comparable”). My primary goal here is to provide some R code to automate the generation of this table, but not before highlighting some issues related to checking for balance and pointing you to a couple of really interesting papers.
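As a quick illustration of the balance metric involved (not necessarily the table-building approach taken in the post), the standardized mean difference for a continuous covariate is the difference in group means divided by the pooled standard deviation; packages such as tableone will report SMDs alongside a formatted Table 1.

```r
# SMD for a continuous baseline covariate across two study arms
smd_cont <- function(x, grp) {
  m <- tapply(x, grp, mean, na.rm = TRUE)
  v <- tapply(x, grp, var, na.rm = TRUE)
  unname((m[1] - m[2]) / sqrt((v[1] + v[2]) / 2))
}

# toy data: two arms of 100 patients, with a small difference in mean age
set.seed(123)
rx  <- rep(0:1, each = 100)
age <- rnorm(200, mean = 62 + 0.5 * rx, sd = 10)
smd_cont(age, rx)
```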

[Read More]
R 

Finding logistic models to generate data with desired risk ratio, risk difference and AUC profiles

About two years ago, someone inquired whether simstudy had the functionality to generate data from a logistic model with a specific AUC. It did not, but now it does, thanks to a paper by Peter Austin that describes a nice algorithm to accomplish this. The paper actually describes a series of related algorithms for generating coefficients that target specific prevalence rates, risk ratios, and risk differences, in addition to the AUC. simstudy has a new function logisticCoefs that implements all of these. (The Austin paper also describes an additional algorithm focused on survival outcome data and hazard ratios, but that has not been implemented in simstudy.) This post describes the new function and provides some simple examples.
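To convey the basic idea behind these algorithms (a conceptual sketch, not the simstudy implementation), targeting a specific prevalence amounts to fixing the covariate coefficients and then solving for the intercept that produces the desired marginal probability in a large simulated sample:

```r
set.seed(2023)

# large sample of covariates with assumed (fixed) log-odds coefficients
n <- 100000
x1 <- rnorm(n)
x2 <- rbinom(n, 1, 0.4)
lp <- 0.35 * x1 + 0.6 * x2

# find the intercept b0 so that the marginal prevalence hits the target
target_prev <- 0.15
f <- function(b0) mean(plogis(b0 + lp)) - target_prev
b0 <- uniroot(f, interval = c(-10, 10))$root

mean(rbinom(n, 1, plogis(b0 + lp)))  # should be close to 0.15
```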

[Read More]

A demo of power estimation by simulation for a cluster randomized trial with a time-to-event outcome

A colleague reached out for help designing a cluster randomized trial to evaluate a clinical decision support tool for primary care physicians (PCPs), which aims to improve care for high-risk patients. The outcome will be a time-to-event measure, collected at the patient level. The unit of randomization will be the PCP, and one of the key design issues is settling on the number to randomize. Surprisingly, I’ve never been involved with a study that required a clustered survival analysis. So, this particular sample size calculation is new for me, which led to the development of simulations that I can share with you. (There are some analytic solutions to this problem, but there doesn’t seem to be a consensus about the best approach to use.)
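Here is a pared-down sketch of what a single replication of such a simulation might look like. The cluster-level variance, Weibull parameters, follow-up window, and effect size are all placeholder assumptions, and the analysis here is a Cox model with robust standard errors rather than whatever model the post ultimately uses.

```r
library(simstudy)
library(data.table)
library(survival)

# one replication: generate clustered survival data and test for a treatment effect
one_rep <- function(n_pcp = 40, n_per_pcp = 50, log_hr = log(0.75)) {

  # PCP-level data: random effect to induce within-cluster correlation, 1:1 randomization
  defc <- defData(varname = "b", formula = 0, variance = 0.25, dist = "normal")
  dc <- genData(n_pcp, defc, id = "pcp")
  dc <- trtAssign(dc, nTrt = 2, grpName = "rx")

  # patient-level records nested within PCPs
  dd <- genCluster(dc, cLevelVar = "pcp", numIndsVar = n_per_pcp, level1ID = "id")

  # Weibull proportional-hazards event times (assumed scale/shape), censoring at 2 years
  dd[, lp := log_hr * rx + b]
  dd[, tte := (-log(runif(.N)) / (0.1 * exp(lp)))^(1 / 1.2)]
  dd[, `:=`(obs_time = pmin(tte, 2), event = as.integer(tte <= 2))]

  # Cox model with robust (sandwich) standard errors to account for clustering by PCP
  fit <- coxph(Surv(obs_time, event) ~ rx + cluster(pcp), data = dd)
  summary(fit)$coefficients["rx", "Pr(>|z|)"]
}

# estimated power: proportion of replications with p < 0.05
pvals <- replicate(200, one_rep())
mean(pvals < 0.05)
```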

[Read More]

Generating variable cluster sizes to assess power in cluster randomized trials

In recent discussions with collaborators at the NIA IMPACT Collaboratory about setting the sample size for a proposed cluster randomized trial, the question of variable cluster sizes has come up repeatedly. Given a fixed overall sample size, it is generally better (in terms of statistical power) if the sample is equally distributed across the different clusters; highly variable cluster sizes increase the standard errors of effect size estimates and reduce the ability to determine whether an intervention or treatment is effective.
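One simple way to build variable cluster sizes into a simulation with simstudy is to draw each cluster’s size from a distribution and then expand to the individual level; the negative binomial mean and dispersion below are arbitrary values chosen for illustration.

```r
library(simstudy)

# cluster-level definition: each site's size drawn from a negative binomial
# distribution (mean 50, dispersion 0.3), so sizes can vary quite a bit
defc <- defData(varname = "m", formula = 50, variance = 0.3, dist = "negBinomial")
dc <- genData(20, defc, id = "site")

# expand to individual-level records, one row per person nested within a site
dd <- genCluster(dc, cLevelVar = "site", numIndsVar = "m", level1ID = "id")
dc[, range(m)]  # spread of the generated cluster sizes
```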

[Read More]

Implementing a one-step GEE algorithm for very large cluster sizes in R

Very large data sets can present estimation problems for some statistical models, particularly ones that cannot avoid matrix inversion. For example, generalized estimating equations (GEE) models, which are used when individual observations are correlated within groups, can pose severe computational challenges when the cluster sizes get too large. GEEs are often used when repeated measures for an individual are collected over time; the individual is considered the cluster in this analysis. Estimation in this case is not really an issue because the cluster sizes are typically quite small. However, when individuals are themselves grouped into larger clusters, we also need to account for that correlation. Unfortunately, if these group/cluster sizes are too large - perhaps bigger than 1000 - traditional GEE estimation may simply not be feasible.
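The bottleneck is inverting the working correlation matrix for each cluster. For an exchangeable structure there is a closed-form inverse, which is part of what makes approaches for very large clusters workable; the snippet below simply verifies that identity and is not the one-step algorithm described in the post.

```r
# exchangeable working correlation R = (1 - rho) * I + rho * J has the closed-form inverse
#   R^{-1} = ( I - rho / (1 + (m - 1) * rho) * J ) / (1 - rho)
# so no O(m^3) solve is needed, even when the cluster size m is in the thousands
R_inv_closed <- function(m, rho) {
  (diag(m) - matrix(rho / (1 + (m - 1) * rho), m, m)) / (1 - rho)
}

# spot check against a direct (expensive) solve on a modest-sized matrix
m <- 500
rho <- 0.05
R <- (1 - rho) * diag(m) + rho
max(abs(solve(R) - R_inv_closed(m, rho)))  # ~ 0, up to numerical error
```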

[Read More]
R