ouR data generation

Horses for courses, or to each model its own (causal effect)

Posted on November 28, 2018

In my previous post, I described a (relatively) simple way to simulate observational data in order to compare different methods to estimate the causal effect of some exposure or treatment on an outcome. The underlying data generating process (DGP) included a possibly unmeasured confounder and an instrumental variable. (If you haven’t already, you should probably take a quick look.)

A key point in considering causal effect estimation is that the average causal effect depends on the individuals included in the average. If we are talking about the causal effect for the population - that is, comparing the average outcome if everyone in the population received treatment against the average outcome if no one in the population received treatment - then we are interested in the average causal effect (ACE).

[Read More]

R

Generating data to explore the myriad causal effects that can be estimated in observational data analysis

Posted on November 20, 2018

I’ve been inspired by two recent talks describing the challenges of using instrumental variable (IV) methods. IV methods are used to estimate the causal effects of an exposure or intervention when there is unmeasured confounding. This estimated causal effect is very specific: the complier average causal effect (CACE). But, the CACE is just one of several possible causal estimands that we might be interested in. For example, there’s the average causal effect (ACE) that represents a population average (not just based the subset of compliers). Or there’s the average causal effect for the exposed or treated (ACT) that allows for the fact that the exposed could be different from the unexposed.

[Read More]

R

Causal mediation estimation measures the unobservable

Posted on November 6, 2018

I put together a series of demos for a group of epidemiology students who are studying causal mediation analysis. Since mediation analysis is not always so clear or intuitive, I thought, of course, that going through some examples of simulating data for this process could clarify things a bit.

Quite often we are interested in understanding the relationship between an exposure or intervention on an outcome. Does exposure $A$ (could be randomized or not) have an effect on outcome $Y$ ?

[Read More]

R

Cross-over study design with a major constraint

Posted on October 23, 2018

Every new study presents its own challenges. (I would have to say that one of the great things about being a biostatistician is the immense variety of research questions that I get to wrestle with.) Recently, I was approached by a group of researchers who wanted to evaluate an intervention. Actually, they had two, but the second one was a minor tweak added to the first. They were trying to figure out how to design the study to answer two questions: (1) is intervention $A$ better than doing nothing and (2) is $A^+$ , the slightly augmented version of $A$ , better than just $A$ ?

[Read More]

R

In regression, we assume noise is independent of all measured predictors. What happens if it isn't?

Posted on October 9, 2018

A number of key assumptions underlie the linear regression model - among them linearity and normally distributed noise (error) terms with constant variance In this post, I consider an additional assumption: the unobserved noise is uncorrelated with any covariates or predictors in the model.

In this simple model:

$Y_i = \beta_0 + \beta_1X_i + e_i,$

$Y_i$ has both a structural and stochastic (random) component. The structural component is the linear relationship of $Y$ with $X$ . The random element is often called the $error$ term, but I prefer to think of it as $noise$ . $e_i$ is not measuring something that has gone awry, but rather it is variation emanating from some unknown, unmeasurable source or sources for each individual $i$ . It represents everything we haven’t been able to measure.

[Read More]

R

simstudy update: improved correlated binary outcomes

Posted on September 25, 2018

An updated version of the simstudy package (0.1.10) is now available on CRAN. The impetus for this release was a series of requests about generating correlated binary outcomes. In the last post, I described a beta-binomial data generating process that uses the recently added beta distribution. In addition to that update, I’ve added functionality to genCorGen and addCorGen, functions which generate correlated data from non-Gaussian or normally distributed data such as Poisson, Gamma, and binary data. Most significantly, there is a newly implemented algorithm based on the work of Emrich & Piedmonte, which I mentioned the last time around.

[Read More]

R

Binary, beta, beta-binomial

Posted on September 11, 2018

I’ve been working on updates for the simstudy package. In the past few weeks, a couple of folks independently reached out to me about generating correlated binary data. One user was not impressed by the copula algorithm that is already implemented. I’ve added an option to use an algorithm developed by Emrich and Piedmonte in 1991, and will be incorporating that option soon in the functions genCorGen and addCorGen. I’ll write about that change some point soon.

[Read More]

R

The power of stepped-wedge designs

Posted on August 28, 2018

Just before heading out on vacation last month, I put up a post that purported to compare stepped-wedge study designs with more traditional cluster randomized trials. Either because I rushed or was just lazy, I didn’t exactly do what I set out to do. I did confirm that a multi-site randomized clinical trial can be more efficient than a cluster randomized trial when there is variability across clusters. (I compared randomizing within a cluster with randomization by cluster.) But, this really had nothing to with stepped-wedge designs.

[Read More]

R

Multivariate ordinal categorical data generation

Posted on August 15, 2018

An economist contacted me about the ability of simstudy to generate correlated ordinal categorical outcomes. He is trying to generate data as an aide to teaching cost-effectiveness analysis, and is hoping to simulate responses to a quality-of-life survey instrument, the EQ-5D. The particular instrument has five questions related to mobility, self-care, activities, pain, and anxiety. Each item has three possible responses: (1) no problems, (2) some problems, and (3) a lot of problems. Although the instrument has been designed so that each item is orthogonal (independent) from the others, it is impossible to avoid correlation. So, in generating (and analyzing) these kinds of data, it is important to take this into consideration.

[Read More]

R

Randomize by, or within, cluster?

Posted on July 19, 2018

I am involved with a stepped-wedge designed study that is exploring whether we can improve care for patients with end-stage disease who show up in the emergency room. The plan is to train nurses and physicians in palliative care. (A while ago, I described what the stepped wedge design is.)

Under this design, 33 sites around the country will receive the training at some point, which is no small task (and fortunately as the statistician, this is a part of the study I have little involvement). After hearing about this ambitious plan, a colleague asked why we didn’t just randomize half the sites to the intervention and conduct a more standard cluster randomized trial, where a site would either get the training or not. I quickly simulated some data to see what we would give up (or gain) if we had decided to go that route. (It is actually a moot point, since there would be no way to simultaneously train 16 or so sites, which is why we opted for the stepped-wedge design in the first place.)

[Read More]

R