I learned statistics and probability by simulating data. Sure, I did the occasional proof, but I never believed the results until I saw it in a simulation. I guess I have it backwards, but I that’s just the way I am. And now that I am a so-called professional, I continue to use simulation to understand models, to do sample size estimates and power calculations, and of course to teach. Sure - I’ll use the occasional formula when one exists, but I always feel the need to check it with simulation. It’s just the way I am.
Since I found myself constantly setting up simulations, over time I developed ways to make the process a bit easier. Those processes turned into a package, which I called simstudy, which obviously means simulating study data. The purpose here in this blog entyr is to introduce the basic idea behind simstudy, and provide a relatively brief example that actually comes from a question a user posed about generating correlated longitudinal data.
The basic idea
Simulation using simstudy has two primary steps. First, the user defines the data elements of a data set either in an external csv file or internally through a set of repeated definition statements. Second, the user generates the data, using these definitions. Data generation can be as simple as a cross-sectional design or prospective cohort design, or it can be more involved, extending to allow simulators to generate observed or randomized treatment assignment/exposures, survival data, longitudinal/panel data, multi-level/hierarchical data, datasets with correlated variables based on a specified covariance structure, and to data sets with missing data based on a variety of missingness patterns.
The key to simulating data in simstudy is the creation of series of data defintion tables that look like this:
varname | formula | variance | dist | link |
---|---|---|---|---|
nr | 7 | 0 | nonrandom | identity |
x1 | 10;20 | 0 | uniform | identity |
y1 | nr + x1 * 2 | 8 | normal | identity |
y2 | nr - 0.2 * x1 | 0 | poisson | log |
xCat | 0.3;0.2;0.5 | 0 | categorical | identity |
g1 | 5+xCat | 1 | gamma | log |
a1 | -3 + xCat | 0 | binary | logit |
Here’s the code that is used to generate this definition, which is stored as a data.table :
def <- defData(varname = "nr", dist = "nonrandom", formula=7, id = "idnum")
def <- defData(def,varname="x1",dist="uniform",formula="10;20")
def <- defData(def,varname="y1",formula="nr + x1 * 2",variance=8)
def <- defData(def,varname="y2",dist="poisson",
formula="nr - 0.2 * x1",link="log")
def <- defData(def,varname="xCat",formula = "0.3;0.2;0.5",dist="categorical")
def <- defData(def,varname="g1", dist="gamma",
formula = "5+xCat", variance = 1, link = "log")
def <- defData(def, varname = "a1", dist = "binary" ,
formula="-3 + xCat", link="logit")
To create a simple data set based on these definitions, all one needs to do is execute a single genData
command. In this example, we generate 500 records that are based on the definition in the def
table:
dt <- genData(500, def)
dt
## idnum nr x1 y1 y2 xCat g1 a1
## 1: 1 7 19.41283 47.02270 17 2 2916.191027 0
## 2: 2 7 17.93997 44.71866 29 1 2015.266701 0
## 3: 3 7 17.53885 43.96838 31 3 775.414175 0
## 4: 4 7 14.39268 37.89804 67 1 6.478367 0
## 5: 5 7 11.08339 30.20787 120 3 1790.472665 1
## ---
## 496: 496 7 15.57788 34.36490 45 3 13016.890610 1
## 497: 497 7 11.06041 26.95209 132 1 592.113025 0
## 498: 498 7 18.49925 46.86620 25 1 84.521499 0
## 499: 499 7 11.89009 33.60175 87 3 4947.826967 1
## 500: 500 7 18.70241 44.96824 33 2 695.362237 0
There’s a lot more functionality in the package, and I’ll be writing about that in the future. But here, I just want give a little more introduction by way of an example that came in from around the world a couple of days ago. (I’d say the best thing about building a package is hearing from folks literally all over the world and getting to talk to them about statistics and R. It is really incredible to be able to do that.)
Going a bit further: simulating a prosepctive cohort study with repeated measures
The question was, can we simulate a study with two arms, say a control and treatment, with repeated measures at three time points: baseline, after 1 month, and after 2 months? Of course.
This was what I sent back to my correspondent:
# Define the outcome
ydef <- defDataAdd(varname = "Y", dist = "normal",
formula = "5 + 2.5*period + 1.5*T + 3.5*period*T",
variance = 3)
# Generate a "blank" data.table with 24 observations and
# assign them to groups
set.seed(1234)
indData <- genData(24)
indData <- trtAssign(indData, nTrt = 2, balanced = TRUE, grpName = "T")
# Create a longitudinal data set of 3 records for each id
longData <- addPeriods(indData, nPeriods = 3, idvars = "id")
longData <- addColumns(dtDefs = ydef, longData)
longData[, T := factor(T, labels = c("No", "Yes"))]
# Let's look at the data
ggplot(data = longData, aes(x = factor(period), y = Y)) +
geom_line(aes(color = T, group = id)) +
scale_color_manual(values = c("#e38e17", "#8e17e3")) +
xlab("Time")
If we generate a data set based on 1,000 indviduals and estimate a linear regression model we see that the parameter estimates are quite good. However, my correspondent wrote back saying she wanted correlated data, which makes sense. We can see from the alpha estimate of approximately 0.02 (at the bottom of the output), we don’t have much correlation:
# Fit a GEE model to the data
fit <- geeglm(Y~factor(T)+period+factor(T)*period,
family= gaussian(link= "identity"),
data=longData, id=id, corstr = "exchangeable")
summary(fit)
##
## Call:
## geeglm(formula = Y ~ factor(T) + period + factor(T) * period,
## family = gaussian(link = "identity"), data = longData, id = id,
## corstr = "exchangeable")
##
## Coefficients:
## Estimate Std.err Wald Pr(>|W|)
## (Intercept) 4.98268 0.07227 4753.4 <2e-16 ***
## factor(T)Yes 1.48555 0.10059 218.1 <2e-16 ***
## period 2.53946 0.05257 2333.7 <2e-16 ***
## factor(T)Yes:period 3.51294 0.07673 2096.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Estimated Scale Parameters:
## Estimate Std.err
## (Intercept) 2.952 0.07325
##
## Correlation: Structure = exchangeable Link = identity
##
## Estimated Correlation Parameters:
## Estimate Std.err
## alpha 0.01737 0.01862
## Number of clusters: 1000 Maximum cluster size: 3