Date: 2020-05-15 09:31:12

This tutorial introduces a number of basic concepts in data simulation using the statistical package SPSS 26.0. I hope the concepts introduced here serve as the building blocks you need to simulate your own data.

Simulating data gives you power over every aspect of the data that result. That means you have to make a lot of decisions. Therefore, both SPSS and this tutorial separate the process of data simulation into a planning step (Example 1) and a data-generation step (Example 2) at first, and then these steps are merged together in Examples 3 and 4. The latter two examples extend the basic simulation tutorial to data that could be analyzed with a multilevel model (Example 3) or a structural equation model (Example 4).

The first two examples assume that you are familiar with the general linear model (GLM). The last two examples assume you have familiarity with multilevel or structural equation modelling, but they are optional (i.e., you could stop reading after Example 2).

Examples 2 through 4 each have accompanying syntax files linked below.

You should be able to open each file, select all, and run the syntax. If you have trouble, then please let me know and hopefully I can debug it.

Why Simulate Data?

Simulated data is useful for many reasons, including practicing an analysis, testing a draft of your analysis syntax before data collection (e.g., for preregistration), and conducting a power analysis. The examples today assume that you are simulating data to test a particular analysis syntax.

Making a Plan for the Simulation

Steps for Planning a Data Simulation

  1. Write out your final model in analysis syntax

  2. For each variable identified in Step 1, identify the following properties:

    2.1. Variable name

    2.2. Distribution, including relevant parameters like location and spread

    2.3. Relationship to the other variables, considering effect size

  3. Decide on the size of your sample

Example 1

1. This example simulates data that could be analyzed with a GLM, specifically a moderated regression model. In this example, physical stress (\(physical\_stress_i\)) will be predicted from a variable representing random assignment to a control condition or intervention (\(condition_i\)), neuroticism (\(centered\_neuroticism_i\)), and their interaction. That GLM syntax should look like this:

* Planned GLM Syntax.
  GLM physical_stress WITH condition centered_neuroticism
  /DESIGN=condition centered_neuroticism condition*centered_neuroticism
  /PRINT PARAMETER.

2.1. From the syntax above, we see that there are 3 variables needed for the analysis:

  • physical_stress

  • condition

  • centered_neuroticism

2.2. Define each variable’s properties

When you simulate data, all the variable properties are made up by you. You truly just make them up, but you want them to be plausible. So, start by identifying the distribution that the variable should follow and then make up parameters for that distribution. If you already know plausible parameters (e.g., from existing data or published papers), then use those. Otherwise, when creating the table below, I simply made up parameters that seemed plausible and reflected my expectations for each variable. These choices are explained in further detail below Table 1.

Table 1

Construct | Variable Name | Range | Distribution
Physical Stress | physical_stress | {1, 2, …, 7} | \(\sim N(\mu = 3.15)\)
Intervention Condition | condition | {-1, 1} | \(\sim B(n = 1, p = .5)\)
Neuroticism | centered_neuroticism | {-3, -2.875, …, 3} | \(\sim N(\mu = -1, \sigma = .8)\)

Physical Stress. With physical stress, I was imagining a single-item, 1 - 7 Likert scale that people would use to answer, “Have you felt worn down physically from stress?” (1 = Not at All, 7 = Extremely). The values will be integers ranging between 1 - 7. I chose a mean value of 3.15 arbitrarily, but I generally guess that most people are below the midpoint (i.e., \(\mu < 4\)) but not at the floor (i.e., \(\mu > 1\)). Normally, we would define a standard deviation for a variable we are simulating, but physical stress is the dependent variable, and the dependent variable’s variance will be set by the simulation process. (This is because the dependent variable’s variance is conditional on the effect sizes that we will define in planning step 2.3. In other words, the dependent variable’s variance is partially explained by the independent variables.)

Intervention Condition. Intervention condition is going to be a 2-level categorical variable that represents random assignment to a control condition or a mindfulness-based cognitive therapy intervention. Participants have an equal chance of being assigned to either condition. Therefore, intervention condition will be a Binomially distributed variable with one trial (i.e., assignment to the intervention or not) and a 50% probability of assignment to the intervention condition.

Neuroticism. I decided that the neuroticism variable would mimic an average of 8 variables (i.e., a composite score), where each variable is a 1 - 7 Likert scale. Note that I can choose to (a) just simulate the final composite score already centered (e.g., a variable with values ranging between -3 to +3 in increments of 0.125) or (b) simulate the 8 variables in their raw scale (i.e., integers ranging from 1 - 7), take their mean, and then centre it. I chose the former option (i.e., simulate one value that mimics the final composite score), because I am simulating these data specifically to test some draft analysis syntax (i.e., I only need the variables in their final form). I might also make this choice if I were simulating data for a power analysis. However, I would make the other choice if I were simulating data to test a full draft analysis syntax (i.e., including data-preparation syntax) or to reproduce a dataset. In those cases, I almost always want to simulate the raw data and then apply all the data preparation to the simulated data that I would apply to real data. Ultimately, as a construct, neuroticism is normally distributed, so we will use a normal distribution to generate these values. I decided neuroticism should have a mean of -1 to reflect self-enhancement and other response-bias errors that likely deflate neuroticism ratings. I chose the SD of .8 arbitrarily.
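To make option (b) concrete, here is a minimal sketch of the post-simulation steps it would involve. This is only an illustration under stated assumptions: the item names n1 through n8 are hypothetical, and I assume they have already been simulated as 1 - 7 integers. The AGGREGATE idiom adds the grand mean to every row so it can then be subtracted.

* Hypothetical sketch of option (b): build the composite from raw items, then centre it.
* Assumes items n1 to n8 were already simulated as integers from 1 to 7.
COMPUTE neuroticism = MEAN(n1, n2, n3, n4, n5, n6, n7, n8).
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /neuroticism_mean = MEAN(neuroticism).
COMPUTE centered_neuroticism = neuroticism - neuroticism_mean.
EXECUTE.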

2.3. Declare relationships between variables

Finally, you must determine how all the variables will relate to each other in your simulated data set. These relationships usually represent your a priori hypotheses and you need them to simulate the data. The hypothesized relationships between the variables are listed in Table 2.

In the specific case of this example, there is an interaction term between variables in the GLM. Therefore, I must also consider how the product of \(condition_i\) and \(centered\_neuroticism_i\) relates to all other variables, too.

Table 2

 | physical_stress | condition | centered_neuroticism | condition * centered_neuroticism
physical_stress | 1 | | |
condition | d = .63 | 1 | |
centered_neuroticism | r = .50 | d = .00 | 1 |
condition * centered_neuroticism | r = -.30 | d = .00 | r = .00 | 1

In this table, I defined expected effect sizes using whatever effect size (e.g., Cohen’s d, Pearson r) made the most sense for the comparison. For example, experimental condition is a 2-level category, and I wanted an effect size that captured the mean difference between the 2 levels of condition. Therefore, I used Cohen’s d, because Cohen’s d standardizes differences between two means. However, all effect sizes will be converted to correlation coefficients, r, for the simulation; the effect sizes will be incorporated by using them like standardized slopes in a GLM (Example 2). I will convert the d values to r values after generating the \(condition_i\) variable, because I need to know the sample size of each level of \(condition_i\) to apply the d-to-r conversion equation with the most accuracy.

3. Determine Sample Size

The expected effect sizes are all medium or large, but I chose a sample size of N = 750 as a rough guess. I chose 750 because there are three effect sizes of interest declared above (one for each slope in our model), and we know that correlation estimates stabilize around sample sizes of N = 250 (Schönbrodt & Perugini, 2013). This is not a legitimate power analysis, but a precise sample size does not really matter for this specific example.

Conduct the Simulation

I have three general tips for conducting simulations:

  1. One general tip for simulating data is that you should go in the order of your process. For example, you usually want to simulate the independent variable(s) before the dependent variable, because you will use the independent variable(s) to simulate the dependent variable.

  2. A second general tip for simulating data is to always “set a random seed.” We use Monte Carlo simulation, which requires random numbers. The computer generates pseudo-random numbers for the Monte Carlo simulation using a really complex algorithm. This algorithm gets started with a starting number that “seeds” the random process, like you would plant a seed in a garden and watch it grow. 🌱 So, we call this starting number the “random seed” for the randomizer that drives our simulation. If you set the random seed to a specific number, then your simulation will yield identical results every time you run it. Sometimes you want identical results each time the simulation is run, but sometimes you specifically do not want that; just think about it for a moment. If you want other people to reproduce your results (e.g., simulations conducted for research or teaching), then you definitely need a random seed. You can use literally any number for the random seed. It does not matter what number you give it. When setting the random seed in the example below, I just banged my fingers against the number keys on my keyboard to get the number you see. This is how you generally set the seed for your randomizer in SPSS:

* Set a random seed.
SET SEED=91285103.

  3. My final tip is to always return to the analysis that you intend to run with the simulated data whenever you are unsure of what to do next. The variables involved in that analysis are the ones that you must simulate. Let yourself use your model as a guide. This is our hypothesized linear model, integrating the information from Tables 1 and 2 with our initial GLM syntax:

\[physical\_stress_i = 3.15 + .30 * condition_i + .50 * centered\_neuroticism_i - .30 * condition_i * centered\_neuroticism_i + e_i\]

In this equation, everything that has a subscript i needs to be simulated. That means the dependent variable, two independent variables, and the residuals (\(e_i\)), too.

Steps of Data Simulation

  1. Simulate data

  2. Test the simulated data

  3. Save dataset

Example 2

1. Simulate the data

I will demonstrate how to use Monte Carlo simulation and the General Linear Model (GLM) to create a simulated dataset. We will simulate the independent variables independently using Monte Carlo simulation, and we will combine these simulated independent variables together using our hypothesized model to simulate the dependent variable. This simple approach (using the GLM) allows you to simulate complex multivariate relationships.

Simulation in SPSS.

General Notes on Simulation in SPSS.

SPSS makes some pretty advanced concepts in simulation easy. As you get started simulating data with SPSS, there are some general things to know:

  1. You can access the Simulation dialogue by following Analyze → Simulation. Through syntax, simulation is conducted with the SIMPLAN command.

  2. In order to access the Simulation features of SPSS, you must have an existing data set open.[1] If you see the error message off to the right, then you simply need to open some data. It does not matter what type of data you open or how it is formatted; it does not actually have to be used in the simulation. Another option is to type something into a few cells in an empty SPSS data set, and then the Simulation features will become accessible (see the sketch after this list).

  3. I have only explored SPSS simulation features a little bit, so I am sure that I am doing some things in a clunky way. I would love to hear your tips as you dive further into the world of simulating data!
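As mentioned in point 2 above, the Simulation menu only becomes available once some data set is open. A minimal sketch of a syntax-only workaround is to create a throwaway placeholder dataset; the variable name and value here are arbitrary and are not used by the simulation itself.

* Hypothetical placeholder dataset, just to make the Simulation menu available.
DATA LIST FREE / dummy.
BEGIN DATA
1
END DATA.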

Define Variables, Parameters, and Relationships.

We are going to start by defining \(condition_i\) and \(centered\_neuroticism_i\) together. I will briefly review our plan for those two variables.

Condition. We defined condition as a Binomial variable with 1 trial and .5 probability of success in Table 1. You can conduct a Monte Carlo simulation in SPSS using a 1-trial Binomial distribution by simply selecting the Binomial distribution as the variable’s distribution. The simulation will output values of 0 and 1. Especially because this variable will be used in an interaction term, I will effect-code the variable into values of -1 and 1 after generating it (see West, Aiken, & Krull, 1996).

Neuroticism. Above, we decided that we would directly simulate one neuroticism variable that mimicked a composite variable made from 8 variables. If you take the average of 8 integer variables, then you get values that differ, at the lowest resolution, by 0.125. To imitate that property, I do a little fancy conversion after simulating the data. Otherwise, this is a classic Monte Carlo simulation of a normally distributed variable.

Let’s simulate them!

1.1. Go to Analyze → Simulation.

1.2. Selecting the Simulation menu item brings up the “Model Source” dialogue box. Click “Create Simulated Data.”

1.3. You will be brought to the Model tab of the Simulation Builder. Toward the bottom of the dialogue box, you can define “Fields to be Simulated.” First you must declare the names of all the variables that you want to simulate in this Model tab. In our example, we will declare \(condition_i\) and \(centered\_neuroticism_i\) here, along with a \(residuals_i\) variable that will hold the model’s error term. In the next step (1.4), we will define the properties of the variables declared in the Model tab.

We want to create one new variable for every variable we want to simulate. Toward the bottom of the Simulation Builder, click “New…” under the “Fields to be Simulated:” box. A new dialogue box will appear titled “Defined Inputs.” Enter the name for each variable (i.e., \(condition_i\), \(centered\_neuroticism_i\), or \(residuals_i\)), indicate that it is “Input to be simulated,” and then define its type and, optionally, its values. I intend to effect-code the categorical variable, condition, so I would prefer for SPSS to treat it as numeric; \(centered\_neuroticism_i\) and \(residuals_i\) are certainly numeric variables. I do not define any parameters beyond that. To complete the definition of each new variable, click “Continue” to return to the Model tab of the Simulation Builder.

The only purpose of this “Model” tab is for you to declare the names of the variables that you want to simulate. You will define the variable properties in the Simulation tab. Before I clicked on the Simulation Tab, my Model Tab view looked like this:

1.4 Click the “Simulation” tab at the top of the Simulation Builder dialogue box. You will see the three variables you declared in the Model tab here, but with new fields for you to enter more information. For each variable, you can define its distribution, that distribution’s properties, and its minimum and maximum values. Personally, I think this interface makes it really easy to do complex simulations in SPSS. The image below shows the completed dialogue. I set most of the parameters according to the values I developed in my simulation plan (Table 1), and I additionally declared minimum and maximum values of -3 and 3 for \(centered\_neuroticism_i\) to ensure the data were plausible. We will still need to do a little post-processing to mimic the 8-item scale for neuroticism, but otherwise, the SPSS Simulation Plan allowed us to do a lot of steps at once.

Along the left side of the Simulation tab is a menu of other settings. Select “Advanced Options” to (a) set the random seed and (b) reduce the maximum number of cases to 1000. Notice in the image below that the random seed is set to 91285103 in the bottom right corner of the Advanced Options simulation item. As stated above, I selected this number by typing numbers without looking (i.e., arbitrarily; pretty much any number works). We already decided that we wanted the simulated dataset to have a sample size of 750, but I believe that SPSS will not simulate fewer than 1000 cases at a time. This means I will have to remove the extra 250 cases after the simulation is run. All the same, manually trimming down 1000 cases is easier than manually trimming down 10000, so that is why I changed the “Maximum Number of Cases” parameter to 1000. That said, the syntax-based way to trim cases is the method demonstrated in this example.

At this point, click “Run” (to go ahead and run the simulation plan) or “Paste” (to save the simulation plan syntax to a syntax file). Once the simulation plan is run, you should have a new data set with 3 variables (i.e., \(condition_i\), \(centered\_neuroticism_i\), \(residuals_i\)) sitting on your computer! 💻

At this point, we are done with the Simulation Builder, so it is a good time to delete the extra cases SPSS simulated. Remember that SPSS will not simulate fewer than 1000 observations. We limited the number of simulated observations to 1000 in the Advanced Options item of the Simulation tab in the Simulation Builder to make it easier to delete them manually. If you wanted to do that, you would highlight the unwanted rows, right-click on the selected area, and select “Clear” from the menu (see image below).

However, you can also delete the extra cases through syntax. I find that much easier.

**** Delete Extra Cases ****.
FILTER OFF.
USE 1 thru 750 /permanent.
EXECUTE.

Now, it is time to post-process the simulated variables. First, for \(centered\_neuroticism_i\), I need to convert the raw simulated values, which are continuous, into the values that would result from averaging 8 variables rated on 1 - 7 Likert scales (i.e., all decimal parts being multiples of .125 = 1/8). After I do that, I mean-centre neuroticism. Next, I will effect-code \(condition_i\). Finally, I will create an “ID” number for every observation, because (a) I wanted to show you how to do that and (b) it will make things easier when we get to the multilevel example (Example 3).

*--- Post-Process Simulated Data ----.
**** Neuroticism ****.
* Give neuroticism the resolution of an 8-item scale.
DATASET ACTIVATE DataSet2.
COMPUTE centered_neuroticism = RND(centered_neuroticism,.125).
EXECUTE.

* Mean-centre centered_neuroticism.
** 1. Calculate the mean and store it in a new variable called "neuroticism_mean".
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /neuroticism_mean = MEAN(centered_neuroticism).
EXECUTE.

** 2. Use that mean value to centre neuroticism.
COMPUTE centered_neuroticism = centered_neuroticism - neuroticism_mean.
EXECUTE.

**** Condition ****.
* Effect-code condition.
RECODE condition (0 = -1) (1 = 1).

**** Create a Real Variable for ID (Useful for Multilevel Example Later) ****.
COMPUTE id=$CASENUM.
FORMAT id (F8.0).
EXECUTE.

After doing this recoding, the data look like this:

Simulate the Dependent Variable

Now that we have our independent variables generated, we need to create a dependent variable that is related to the independent variables in the manner that we specified. The general linear model (GLM) is meant to predict outcomes, so it can be used as a simulation tool. We will use it in this way! We will combine the independent variables together in a linear equation to predict the dependent variable.

We begin by combining the information across Tables 1 and 2 with the primary model. At this point, I convert the predicted d = 0.63 to a Pearson r = .30 using the equation below. Note that the relative sample sizes of each group matter in this conversion, so first I will find out how many observations were assigned to each level of condition in the first 750 simulated observations.

*--- Count Observations in Each Condition ----.
FREQUENCIES VARIABLES=condition.

In my output, the frequency table output by SPSS said there were 371 people in the control condition (-1) and 379 people in the intervention condition (1). Your output may show a different condition count from 371/379. As of the most current version of this tutorial, I cannot figure out how to get the randomizer to simulate exactly the same values for condition each time. It is baffling to me, but stochastic processes are wild beasts … anyway, I will try to figure it out and update this tutorial, but for now let’s use the numbers \(n_1\) = 371 and \(n_2\) = 379.

\[r = {d\over \sqrt{d^2+{(N^2-2N)\over{n_1 n_2}}}} = {0.63\over \sqrt{0.63^2 + {(750^2 - 2*750)\over{371*379}}}} = {0.63\over \sqrt{0.3969+3.989787}} = 0.3008\]

Note that if your category levels have equal sample sizes (i.e., \(n_1 = n_2\)), then you can use the constant “4” in the following simpler equation, because that is approximately the result you would get from the preceding equation when the sample sizes are equal across category levels.

\[r = {d\over \sqrt{d^2+4}}\]
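To see where that constant of 4 comes from, substitute \(n_1 = n_2 = N/2\) into the sample-size term of the full equation; for any reasonably large N, the term is approximately 4:

\[{(N^2 - 2N)\over{n_1 n_2}} = {N(N - 2)\over{N^2/4}} = {4(N - 2)\over N} \approx 4\]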

I will also note that there are a couple of effect sizes defined in Cohen’s d in Table 2 that equal zero. Cohen’s d and Pearson r are equivalent at zero. So, no calculations are required to convert effect sizes of d = 0 to r.

Now we are ready to simulate the dependent variable. We originally stated that physical stress is a function of neuroticism, condition, and their interaction, plus some error (Equation 1).

\[\begin{equation} \label{eq:1} physical\_stress_i = 3.15 + .30*condition_i + .50*centered\_neuroticism_i - .30 * condition_i * centered\_neuroticism_i + e_i \end{equation}\]

This is the equation we will use to generate values of physical stress. We already simulated all these variables, so the only task that remains is to simulate our dependent variable!

*--- Predicting Dependent Variable (DV) From Linear Model ----.
COMPUTE physical_stress = 3.15 + .30 * condition + .5 * centered_neuroticism + -.3 * condition * centered_neuroticism + residuals.
EXECUTE.

*--- Inspect New Variable ----.
GRAPH
  /HISTOGRAM(NORMAL) = physical_stress.

The resulting values we get for physical stress are normally distributed, ranging between \(-\infty\) and \(\infty\). Ultimately, we want \(physical\_stress_i\) to be a positive integer that ranges between 1 and 7. We cannot impose minimum and maximum values as we did in the Simulation Builder, because the dependent variable is computed from the model rather than drawn directly from a distribution. So, we will round the predicted values to create integers and then replace any values outside the range of 1 to 7 with the nearest extreme value (i.e., 1 or 7). Please expect that both truncating and reducing the resolution of the variable in this way will deflate the effect sizes a bit.

*--- Round Predicted Values to Create Integers ----.
COMPUTE physical_stress = RND(physical_stress, 1).

*--- Replace any values that exceed a plausible range with the most extreme values ----.
RECODE physical_stress (Lowest thru 1=1) (7 thru Highest=7) (ELSE=Copy) INTO physical_stress.
EXECUTE.

*--- Inspect New Variable ----.
GRAPH
  /HISTOGRAM(NORMAL) = physical_stress.

One Last Thing: Calculate Proper SD for the Residuals

Okay, so on the one hand, this is all awesome. We simulated all our variables, and their values are plausible. There is the remaining issue of effect size, however. If you want your variables to be related to each other to a specific degree (e.g., you are simulating data for a power analysis), then we have an additional step.

We used the correlation effect sizes from Table 2 as the slopes for the GLM, but this method of simulation treats those estimates like unstandardized slopes. This means that the effect sizes we defined in Table 2 are not being applied properly right now. This becomes obvious when you focus on the standardized slopes for the primary model.

*--- Look at Effect Sizes ----.
COMPUTE conditionXneuroticism = condition*centered_neuroticism.
EXECUTE.

REGRESSION
  /DEPENDENT physical_stress
  /METHOD=ENTER condition centered_neuroticism conditionXneuroticism.

You see that the unstandardized slopes are around the effect sizes we specified, whereas the standardized slopes are not. In fact, the standardized slopes do not even retain the relative magnitude of partial effect sizes for each independent variable (e.g., condition has a stronger effect size than neuroticism, but we defined it the other way around). This occurs because the slopes that we specified for the GLM are unstandardized slopes unless we make them standardized. The way to do this is to (a) calculate how much variance should be explained by the predictors in the model, given that the slopes represent standardized slopes, and then (b) set the residual variance to explain the remaining unexplained variance.

a) Calculate how much variance should be explained by the predictors in the model

There are three slopes in our model: the main effect of \(condition_i\) (\(b_1 = .30\)), the main effect of \(centered\_neuroticism_i\) (\(b_2 = .50\)), and the interaction between \(condition_i\) and \(centered\_neuroticism_i\) (\(b_3 = -.30\)). The squared correlation coefficient equals the variance explained by that effect. Since we expect all predictors to be uncorrelated, we can simply sum the squared values of our three slopes to find out how much variance our model is supposed to explain, \(R^2_\beta\).[2]

\[R^2_\beta = .30^2 + .50^2 + (-.30)^2 = .09 + .25 + .09 = 0.43\]

If our predictors explain 43% of the variance according to the effect sizes defined in Table 2, then that means that 57% of the variance is left unexplained. We will set the variance of the residuals, \(\sigma^2_{e_i}\), to account for this remaining amount of variance.

\[\sigma^2_{e_i} = 1 - R^2_\beta = 1 - .43 = 0.57\]

Now, we can set the residual variance to explain the remaining variance. Note that we need to provide a standard deviation, not a variance, when we define the new residuals in the Simulation Builder. The standard deviation of the residuals, \(\sigma_{e_i}\), will be the square root of the residual variance.

\[\sigma_{e_i} = \sqrt{\sigma^2_{e_i}} = \sqrt{0.57} = 0.7549834\]

b) Set the residual variance to the remaining variance in the model

Now that we know the correct value for the SD of the residuals, we simulate the residuals again, this time saving them in a variable called \(new\_residuals_i\). This new residuals variable is normally distributed with a mean of zero and a standard deviation of 0.7549834. This is what the Simulation Fields item of the Simulation tab in Simulation Builder looks like for these new residuals.

Once these new residuals are simulated, they need to be merged back with the original data set. First, the extra 250 cases simulated by SPSS must be deleted. Then, the new residuals can be added to the previously simulated dataset with the following syntax.

**** Delete Extra Cases ****.
* Delete extra simulated observations.
DATASET ACTIVATE DataSet3.
FILTER OFF.
USE 1 thru 750 /permanent.
EXECUTE.

* Merge new residuals with old residuals.
DATASET ACTIVATE DataSet3.
MATCH FILES /FILE=*
  /FILE='DataSet2'.
EXECUTE.
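As an aside, if you would rather skip the second trip through the Simulation Builder and the merge entirely, a minimal syntax-only sketch is to compute the new residuals directly in the original simulated dataset with SPSS’s RV.NORMAL function. This is an assumption: I take the original data to still be in DataSet2, and this route draws from the pseudo-random stream of your current session rather than from the simulation plan’s seed.

* Hypothetical alternative: draw the new residuals directly in the original dataset.
DATASET ACTIVATE DataSet2.
COMPUTE new_residuals = RV.NORMAL(0, 0.7549834).
EXECUTE.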

Using these new residuals, all our variables should relate to each other to the degree that we specified in Table 2 when we simulate the Dependent Variable. So, let’s do it!

*--- Predicting Dependent Variable (DV) From Linear Model ----.
COMPUTE physical_stress = 3.15 + .30 * condition + .5 * centered_neuroticism + -.3 * condition * centered_neuroticism + new_residuals.
EXECUTE.

*--- Round Predicted Values to Create Integers ----.
COMPUTE physical_stress = RND(physical_stress, 1).

*--- Replace any values that exceed a plausible range with the most extreme values ----.
RECODE physical_stress (Lowest thru 1=1) (7 thru Highest=7) (ELSE=Copy) INTO physical_stress.
EXECUTE.

*--- Look at Effect Sizes Again ----.
COMPUTE conditionXneuroticism = condition*centered_neuroticism.
EXECUTE.

REGRESSION
  /DEPENDENT physical_stress
  /METHOD=ENTER condition centered_neuroticism conditionXneuroticism.

Note that the only reason I simulated the dependent variable twice was to emphasize the issue with residuals and effect size. In a real simulation, you would only simulate the dependent variable once.

2. Test the simulated data

You want to test the data in various ways. We have already been inspecting values generated along the way, which is important. Additionally, I want to make sure that my planned analysis syntax runs on the data we simulated. If it runs, then we can move on to Step 3. If the analysis syntax returns an error, then there is either a problem with the simulated data or a problem with the draft analysis syntax; at least one of those two things will need to be fixed. So, I would figure out the problem, fix it, and then try to run this syntax again. This process is illustrated in Example 4.

*--- Test the Simulation ----.
**** Planned GLM Syntax ****.

DATASET ACTIVATE DataSet3.
  GLM physical_stress WITH condition centered_neuroticism
  /DESIGN=condition centered_neuroticism condition*centered_neuroticism
  /PRINT PARAMETER.

Indeed, the syntax ran without errors. That was the goal of the test. ✅

3. Finalize Data Organization and Export Data

Once you are satisfied with the data you have simulated, you are ready to save or export it. You are finished!

*--- Export Simulated Data ----.
SAVE OUTFILE='Simulated Data.sav'
  /COMPRESSED.

The syntax above saved a file called “Simulated Data.sav” in the top level of my User home directory.

Extensions to Datasets for Advanced Analyses

Lastly, I want to give you the tools to extend your simulations skills to generate data for “advanced” analyses. Specifically, I’ll demonstrate how to extend the example to hierarchical data for multilevel modelling and how to generate a big correlation matrix to simulate data for a path analysis and structural equation modelling (SEM).

Example 3: Simulate Data for Multilevel Modelling

This example assumes that you are familiar with multilevel modelling (a.k.a., “mixed models,” “hierarchical linear modelling,” “random effects models”).

It is easy to take the regular GLM example and make it into a multilevel one. The GLM that drove Examples 1 and 2 was specified like this:

\[y_i = b_0 + b_1 x_{1_i} + b_2 x_{2_i} + b_3 x_{1_i} x_{2_i} + e_i\]

To simulate a multilevel dataset, we will pretend that we had a study design that was pretty similar to Examples 1 and 2, except that we measured physical stress every day for 7 days. We will keep the same model specification for the fixed effects, but we will estimate a random intercept for each participant. The equation for this multilevel model would be:

\[ y_{ij} = b_0 + b_1 x_{1_j} + b_2 x_{2_j} + b_3 x_{1_j} x_{2_j} + u_{0_j} + e_{ij} \]

Note that the only real difference between the equations is the extra residual for the intercept, \(u_{0_j}\). Therefore, to take our between-subjects simulation and turn it into a multilevel one, we really only need to simulate these extra residuals. Like our other residuals, these residuals should vary around zero. Their variance should be whatever you want the intercept variance to be, relative to the residual variance. In the following simulation, I will arbitrarily say that the intercept will have a variance of .40 (SD = 0.6324555).
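For reference, the SD in parentheses is just the square root of that chosen intercept variance:

\[\sigma_{u_{0_j}} = \sqrt{0.40} = 0.6324555\]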

We are ready to simulate our multilevel data. Remember that the only thing that differed between the equations for our regular GLM and the MLM was the addition of the intercept residuals, \(u_{0_j}\). Random intercepts are estimated for each participant and are invariant within participants. So, we will simulate the intercept residuals now and merge them with our new participant-level dataset.

First, we will create the participant-level (Level 2) data set that keeps only the \(condition_i\) and \(centered\_neuroticism_i\) independent variables from before, because those variables will still only vary between participants (and are invariant within participants) in our new study design. Condition (\(condition_i\)) is a level-2 variable, and thus the way we simulated it before is still good. Neuroticism (\(centered\_neuroticism_i\)) is here for the same reason condition is: It is a participant-level variable, so the way it was simulated before works equally well in the multilevel context. The reason you can simulate the Level 2 variables for a hierarchical dataset the same way you simulate variables for a non-hierarchical dataset is that we assume all Level 2 units were randomly sampled from the population and thus are independent. Values in the participant-level dataset will later be repeated across multiple rows. The syntax below reads in the data file from Example 2 and keeps only the variables we need for this example.

**** Read Data from Example 2 ****.
* Note that you will probably need to change the path below to match where Simulated Data.sav (from Example 2) is saved on your computer.
GET FILE='Simulated Data.sav'
  /KEEP id condition centered_neuroticism. 
DATASET NAME ParticipantData WINDOW=FRONT.

Next, following the more detailed instructions from Example 2, we will generate the intercept residuals as a Normally distributed variable that has a mean of 0 and a standard deviation of 0.6324555. We declare a variable called “per_participant_intercept” and then set its properties accordingly.

Now, we will delete the extra cases from the newly simulated data and then merge it back with the participant-level data set.

**** Merge two files together ****.
* Delete extra simulated observations.
DATASET ACTIVATE DataSet4.
FILTER OFF.
USE 1 thru 750 /permanent.
EXECUTE.

* Merge new residuals with old residuals.
DATASET ACTIVATE ParticipantData.
MATCH FILES /FILE=*
  /FILE='DataSet4'.
EXECUTE.

The second step is to simulate the Level 1 residuals and merge the Level 2 dataset together with them. The multilevel model assumes that, after the random effects are taken into account, the Level 1 residuals are conditionally independent from each other. So, you can use a straightforward Monte Carlo simulation to generate the Level 1 residuals. Since our example design is a 7-day observation period, I will need to simulate 750 * 7 = 5250 residuals to reflect each of these observations. I will save these residuals into a Level 1 dataset.

Note: As with the previous example using an ordinary GLM, if we want our slopes to reflect the effect sizes, we need to reduce the variance of the Level 1 residuals by the variance of the Level 2 residuals. We simulated the random intercept to have a variance of 0.40 (SD = 0.6324555). The unexplained variance we estimated for the residuals in Example 2 was 0.57 (SD = .7549834). That is now the total variance that needs to be accounted for by the intercept and Level 1 residuals. So, to find the new residual variance, I subtract the intercept variance from the previous residual variance (\(\sigma^2_{e_{ij}} = \sigma^2_{e_i} - \sigma^2_{u_{0_j}} = 0.57 - 0.4 = 0.17\)). From this recalculation, I set the SD of the Level 1 residuals to \(\sqrt{0.17}\) = 0.4123106.

In the Advanced Options item, I set the maximum number of cases to 5250 (i.e., my total number of Level 1 observations = 750 participants X 7 observations/participant). I again used the random seed of 91285103, just because I am merging these data with other data simulated with the same random seed.

Finally, I want to name the simulated data set something in particular to make it easier to merge. You can name the output data set by navigating to the Save item, then “A new dataset,” and then typing the name you want for the data set in the “Dataset name:” field. I am going to call this data set, “ObservationData.”
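If you would rather generate these Level 1 residuals without the Simulation Builder, a minimal syntax-only sketch is below. It is an alternative under stated assumptions, not the route shown in the screenshots: it builds a brand-new active dataset of 5250 cases with an INPUT PROGRAM, draws the residuals with RV.NORMAL using the SD of 0.4123106 calculated above, and names the result ObservationData.

* Hypothetical syntax-only alternative for generating the Level 1 residuals.
SET SEED=91285103.
INPUT PROGRAM.
LOOP #obs = 1 TO 5250.
  COMPUTE residuals = RV.NORMAL(0, 0.4123106).
  END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
DATASET NAME ObservationData WINDOW=FRONT.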

The last step is to merge the two data sets. We want each participant to have 7 observations. Thus, every 7 residual estimates should belong to a new id number. The easiest way to match the two data sets will be with a common identifier variable. ParticipantData already has a variable, \(id_j\), that contains a number from 1 - 750 that uniquely identifies each row in ParticipantData. To match up these data with ObservationData, we will create the same id variable in ObservationData where each value of \(id_j\) is repeated across 7 rows of ObservationData.

*--- Create an ID variable to match participant id ----.
DATASET ACTIVATE ObservationData.
COMPUTE id = TRUNC(($CASENUM - 1) / 7) + 1.
EXECUTE.

After that, the data sets can be properly merged to create the final dataset shown below.

DATASET ACTIVATE ObservationData.
SORT CASES BY id.
DATASET ACTIVATE ParticipantData.
SORT CASES BY id.
DATASET ACTIVATE ObservationData.
MATCH FILES /FILE=*
  /TABLE='ParticipantData'
  /BY id.
EXECUTE.

Finally, we can use our multilevel equation to predict values of the dependent variable of \(physical\_stress_{ij}\), taking into account a clustering of physical stress within participants.

*--- Predicting Dependent Variable (DV) From Multilevel Model ----.
COMPUTE physical_stress = (3.15 + per_participant_intercept) + .30 * condition + .5 * centered_neuroticism + -.3 * condition * centered_neuroticism + residuals.
EXECUTE.

After computing the dependent variable, we will do the same operations as we did in Example 2 to clean it up (i.e., make it an integer that ranges between 1 - 7).

*--- Round Predicted Values to Create Integers ----.
COMPUTE physical_stress = RND(physical_stress, 1).

*--- Replace any values that exceed a plausible range with the most extreme values ----.
RECODE physical_stress (Lowest thru 1=1) (7 thru Highest=7) (ELSE=Copy) INTO physical_stress.
EXECUTE.

Finally, the multilevel data simulation is complete! Here is what the data look like.

As a final step to ensure our simulated data is sufficient, let us test it by running it in an MLM!

*--- Test the Model ----.
MIXED physical_stress WITH condition centered_neuroticism
 /FIXED = condition centered_neuroticism condition*centered_neuroticism
 /RANDOM = INTERCEPT | SUB(id)
 /PRINT = SOLUTION.

Sweet. It worked. Export the data, and you are DONE! ✅

*--- Export Simulated MLM Data ----.
SAVE OUTFILE='Simulated MLM Data.sav'
  /COMPRESSED.

Example 4: Simulate Correlation Matrix for SEM

This example assumes that you are familiar with structural equation modelling (SEM).

Ultimately, SEM is an analysis of a covariance matrix. Some of the first statistical packages used for SEM required a covariance or correlation matrix as input instead of your raw data. Therefore, to simulate data that would be suitable for SEM, you just need to create a set of correlated variables.

Plan the Simulation

1. Similar to the other examples, you want to begin the simulation by specifying your final model. The best way to specify an SEM is to draw it! Since I am currently “isolating” with two small kids, multitasking takes on a whole different meaning. Aaaaand, I have no time. So, I drew the example SEM while playing with my daughter. 😊 🖍 🎨 👧

2.1. The structural model above represents the idea that both physical health and mental health contribute to well-being. The variables represented in rectangles are the measured variables, and that means they are the variables we need to simulate. This is the list of variables to simulate:

  • WB1 - WB4
  • sleep
  • nutrition
  • exercise
  • mood
  • relationships
  • enrichment

2.2. Next, we need to elaborate the properties of all variables that need to be simulated (Table 3). As with the previous examples, I am choosing these properties somewhat arbitrarily. The strategy I use is to think of what these data would be like if I collected them for real, and I make the simulated data as plausible and similar as possible. I elaborate my reasoning for all variables beneath Table 3.

Table 3

Latent Construct | Measured Variable | Range | Distribution
Well-being | WB1 | {1, 2, …, 7} | \(\sim N(\mu = 4.85, \sigma = 0.76)\)
Well-being | WB2 | {1, 2, …, 7} | \(\sim N(\mu = 4.85, \sigma = 0.76)\)
Well-being | WB3 | {1, 2, …, 7} | \(\sim N(\mu = 4.85, \sigma = 0.76)\)
Well-being | WB4 | {1, 2, …, 7} | \(\sim N(\mu = 4.85, \sigma = 0.76)\)
Physical Health | sleep | {0, 1, …, \(\infty\)} | \(\sim Pois(\lambda = 438)\)
Physical Health | nutrition | {0, …, 100} | \(\sim N(\mu = 68.3, \sigma = 8.3)\)
Physical Health | exercise | {0, 1, …, \(\infty\)} | \(\sim Pois(\lambda = 20)\)
Mental Health | mood | {1.0, 1.1, …, 7} | \(\sim N(\mu = 4.6, \sigma = 1.1)\)
Mental Health | relationships | {1.0, 1.1, …, 7} | \(\sim N(\mu = 5.3, \sigma = .75)\)
Mental Health | enrichment | {1.0, 1.1, …, 7} | \(\sim N(\mu = 3.8, \sigma = .85)\)

Well-Being (WB1 - WB4). In this example, the measurement model for the latent variable, Well-Being, is estimated from responses to an imaginary scale of well-being. In this imaginary scale, each item is measured on a 7-point Likert scale from 1 - 7, where low values represent extremely poor well-being and high values represent extremely good well-being. We will assume the scale, and thus each item in the scale, has a mean of M = 4.85 and a standard deviation of SD = 0.76. I will say that this set of variables will be correlated with each other to an extent that implies excellent scale reliability, Cronbach’s \(\alpha\) = .90.

Physical Health (sleep, nutrition, exercise). Sleep and exercise are expected to be single-item, Poisson-distributed measures that ask how many minutes people slept the previous night and how many minutes they exercised that day. We expect that people will sleep an average of 7.3 hours per night (438 minutes) and exercise about 20 minutes per day. However, we will say that the \(nutrition\) variable is a calculated score that combines things like servings per food group, total calories, and dietary fibre. This imaginary nutrition score will be an approximately Normally distributed variable that ranges from 0 to 100. Altogether, we will generate these three measured variables of the physical health latent variable assuming a good internal consistency of Cronbach’s \(\alpha\) = .83.

Mental Health (mood, relationships, enrichment). All of the “mental health” variables are assumed to themselves be composite scores built from many different variables. To make things easy on myself, I decided to think of each measured variable for mental health as an average of 10 items, which means that I can simulate a normal distribution for each composite and then just round the values to the tenths place to mimic the numbers that would result from averaging a 10-item Likert scale. I will simulate the three measured variables for the mental health construct using the same internal consistency as I used for the measured variables of physical health (Cronbach’s \(\alpha\) = .83).

2.3 Define relationships between variables

We need to define the relationships between all variables.

Structural Model Predictions. The relationships between measured variables from different latent variables will be represented with the same correlations as the expected standardized path estimate for the structural model. I will simulate a structural model that has a path of \(\beta\) = .55 from physical health to well-being and \(\beta\) = .40 from mental health to well-being. I anticipate the correlation (i.e., standardized estimate of the covariance path) between physical health and mental health to be about \(\beta\) = .30. This creates the values you see in Table 4a.

Table 4a: Correlations between latent variables (Structural Model)

 | Well-Being | Physical Health | Mental Health
Well-Being | 1 | .55 | .40
Physical Health | | 1 | .30
Mental Health | | | 1

Measurement Model Predictions. Each latent variable defined above has its own measurement model composed of either three (mental health and physical health) or four (well-being) measured variables. For each measurement model, I will assume that all manifest variables are correlated with each other to some average degree. So, I will simulate each set of variables to have an average correlation that matches the Cronbach’s \(\alpha\) values I defined in step 2.2.

I used Cronbach’s \(\alpha\) to estimate a good value for the average correlations between measured variables in each measurement model, because the equation for the standardized Cronbach’s \(\alpha\) includes the average inter-item correlation, \(\bar{r}\), and the number of items, \(K\). To calculate the average correlation, we must solve the equation for Cronbach’s standardized \(\alpha\) for the average inter-item correlation, \(\bar{r}\).

\[\alpha = {K\bar{r}\over{1 + (K-1)\bar{r}}}\]

\[\alpha = {K\over{{1\over{\bar{r}}} + (K-1)}}\]

\[\alpha\left({1\over{\bar{r}}} + (K-1)\right) = K\]

\[{1\over{\bar{r}}} + (K-1) = {K\over\alpha}\]

\[{1\over{\bar{r}}} = {K\over\alpha} - (K-1)\]

\[1 = \left({K\over\alpha} - (K-1)\right)\bar{r}\]

\[\bar{r} = {1\over{{K\over\alpha} - (K-1)}}\]

Now, we can use this equation to solve for the average correlation for variables within each measurement model.

Table 4b

Latent Variable | Predicted Cronbach’s \(\alpha\) | Number of Items, \(K\) | \(\bar{r} = {1\over{{K\over{\alpha}}-(K-1)}}\)
Well-Being | .90 | 4 | .6923
Physical Health | .83 | 3 | .6194
Mental Health | .83 | 3 | .6194

Bringing these decisions together, we are ready to generate a big correlation matrix that combines the information from Tables 4a and 4b (Table 5). The correlations between variables that belong to the same measurement model match \(\bar{r}\) from Table 4b, and the correlations between variables that belong to different measurement models correspond to the standardized paths between latent variables in the structural model, defined as correlations in Table 4a.

Table 5

 | WB1 | WB2 | WB3 | WB4 | sleep | nutrition | exercise | mood | relationships | enrichment
WB1 | 1 | .6923 | .6923 | .6923 | .55 | .55 | .55 | .40 | .40 | .40
WB2 | | 1 | .6923 | .6923 | .55 | .55 | .55 | .40 | .40 | .40
WB3 | | | 1 | .6923 | .55 | .55 | .55 | .40 | .40 | .40
WB4 | | | | 1 | .55 | .55 | .55 | .40 | .40 | .40
sleep | | | | | 1 | .6194 | .6194 | .30 | .30 | .30
nutrition | | | | | | 1 | .6194 | .30 | .30 | .30
exercise | | | | | | | 1 | .30 | .30 | .30
mood | | | | | | | | 1 | .6194 | .6194
relationships | | | | | | | | | 1 | .6194
enrichment | | | | | | | | | | 1

3. Decide on a sample size for the simulation

I will base the sample size for the simulated data on the general guideline of 10 observations per estimated parameter in the model (Kline, 1998). Counting the parameters: 2 structural paths + 1 covariance path + (10 measurement paths - 3 paths fixed to 1, one per measurement model) + 3 latent variable variances/disturbances + 10 disturbances for the measured variables = 23 parameters to estimate. Therefore, I will simulate a sample ten times that size (N = 230).

Run the Simulation!

Taking together the information in Tables 3 and 5, I am ready to simulate a set of correlated variables for an SEM analysis.

It is super easy to simulate correlated variables in SPSS. It is even easy to simulate variables from different distributions, like the mixture of Poisson- and Normally distributed variables we have here. The first step is to define all the variables, using the values in Table 3, and then define the correlations between them, using the values from Table 5. As with our previous examples, this will be done through the Simulation Builder.

Navigate to the Simulation Builder by navigating to Analyze → Simulation. Select that you want to “Create Simulated Data” and then click “Continue.” You will be brought to the Model tab of the Simulation Builder. As with Example 2, enter the names of all the variables that you want to simulate by repeatedly clicking the “New…” Button under the “Fields to Simulate:” section. At the end, it should look very close to this:

Next, click on the Simulation tab to enter the properties of all variables that we defined in Table 3, following the detailed instructions in Example 2.

Up until now, this example is essentially the same as Example 2. Now is where it diverges, though. In the Simulation Tab, select the “Correlations” item from the left-hand menu. The Correlations screen allows you to define a big correlation matrix between all your variables. Thus, creating correlated data will be as easy as transcribing Table 5 into SPSS. This is what my completed Correlations screen looked like. However, note that SPSS will sort your variables in ASCII-based alphabetical order (i.e., A - Z then a - z), so the correlation matrix in the screenshot below looks a little different from Table 5. If you squint at them both long enough, you will see how they map onto each other.

Finally, click on the “Advanced Options” items in the left-hand menu in order to change the maximum number of cases to 1000 and set the random seed. For Example 4, we will use “3240692” as the random seed. These completed options are shown below.

Click “Run” or “Paste” to run the simulation! This is what the simulated data look like, fresh out of the simulation.

Now that we have a good set of correlated data, it is time to do some post-processing. First, we make the data plausible. Second, we finalize the data set by deleting cases 231 - 1000 (we have to delete these rows because SPSS cannot simulate fewer than 1000 cases, but we only want 230).

First, we make the data adhere to the properties that we believe they would have if they were collected in the real world. In this example, we only need to round the well-being and nutrition variables to be integers and round the mental health variables to the tenths place to mimic averages of 10-item scales.

*--- Post-Process Simulated Data ----.
COMPUTE WB1 = RND(WB1, 1).
COMPUTE WB2 = RND(WB2, 1).
COMPUTE WB3 = RND(WB3, 1).
COMPUTE WB4 = RND(WB4, 1).
COMPUTE nutrition = RND(nutrition, 1).
COMPUTE relationships = RND(relationships,.1).
COMPUTE mood = RND(mood,.1).
COMPUTE enrichment = RND(enrichment,.1).
EXECUTE.

Finally, inspect the final dataset.

That looks about right! Finalize and then save the data, so you can test it. To achieve our desired sample size of N = 230, remove the extra rows (i.e., rows 231 - 1000).
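A minimal sketch of that deletion, following the same case-range pattern used in the earlier examples (the exact dataset you activate first is whatever SPSS named your simulated SEM data):

**** Delete Extra Cases ****.
* Keep only the first 230 simulated observations.
FILTER OFF.
USE 1 thru 230 /permanent.
EXECUTE.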

Save the dataset. I saved it as “Simulated SEM Data.sav”.

Test the simulation

Finally, you would test the simulated data. If you are simulating SEM data in SPSS, then you are probably using AMOS software to do your SEM. Unfortunately, I do not have access to AMOS, so I read the simulated data into R and used the \(lavaan\) package to run an SEM on our simulated data. This is the R syntax that I used, but I expect that you would get very similar numbers in AMOS, if not exactly the same numbers (I forget how AMOS’ defaults compare to lavaan’s defaults).

#--- Load the Lavaan Package ----
# install.packages(c("lavaan", "semPlot", "haven"))
library(lavaan)
library(semPlot)
library(haven)

#--- Read Data ----
simulated_correlated_data = read_sav("Simulated SEM Data.sav")

#--- Define the Model ----
sem_model = '
WELL_BEING =~ WB1 + WB2 + WB3 + WB4
PHYSICAL_HEALTH =~ sleep + exercise + nutrition
MENTAL_HEALTH =~ mood + relationships + enrichment
WELL_BEING ~ PHYSICAL_HEALTH + MENTAL_HEALTH
PHYSICAL_HEALTH  ~~ MENTAL_HEALTH
'

#--- Estimate the Model ----
sem_analysis = sem(sem_model, data= simulated_correlated_data)
## Warning in lav_model_estimate(lavmodel = lavmodel, lavpartable = lavpartable, :
## lavaan WARNING: the optimizer warns that a solution has NOT been found!

Notice that there is a convergence problem! We receive a message from R saying that convergence was not achieved. Indeed, looking at the output of the model, many parameters are not estimated.

summary(sem_analysis) # Some parameters are not estimated
## lavaan 0.6-5 did NOT end normally after 3565 iterations
## ** WARNING ** Estimates below are most likely unreliable
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         23
##                                                       
##   Number of observations                          1000
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                    NA
##   Degrees of freedom                                NA
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard errors                             Standard
## 
## Latent Variables:
##                      Estimate  Std.Err  z-value  P(>|z|)
##   WELL_BEING =~                                         
##     WB1                 1.000                           
##     WB2                 1.094       NA                  
##     WB3                 1.004       NA                  
##     WB4                 1.027       NA                  
##   PHYSICAL_HEALTH =~                                    
##     sleep               1.000                           
##     exercise            0.240       NA                  
##     nutrition           0.457       NA                  
##   MENTAL_HEALTH =~                                      
##     mood                1.000                           
##     relationships       0.609       NA                  
##     enrichment          0.720       NA                  
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   WELL_BEING ~                                        
##     PHYSICAL_HEALT  -22.253       NA                  
##     MENTAL_HEALTH   211.897       NA                  
## 
## Covariances:
##                      Estimate  Std.Err  z-value  P(>|z|)
##   PHYSICAL_HEALTH ~~                                    
##     MENTAL_HEALTH       6.873       NA                  
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .WB1               0.290       NA                  
##    .WB2               0.240       NA                  
##    .WB3               0.273       NA                  
##    .WB4               0.256       NA                  
##    .sleep           345.809       NA                  
##    .exercise         15.035       NA                  
##    .nutrition        50.648       NA                  
##    .mood              0.438       NA                  
##    .relationships     0.245       NA                  
##    .enrichment        0.299       NA                  
##    .WELL_BEING       43.003       NA                  
##     PHYSICAL_HEALT   65.229       NA                  
##     MENTAL_HEALTH     0.723       NA

One thing I notice in the summary output is that the variances of two of our variables, sleep and nutrition, are relatively quite large. That can mess up SEM substantially and even lead to impossible estimates. All the other variables are on scales in the ones range (e.g., 1 - 7), and I want every variable to be in that range. So, back in my SPSS syntax, I am going to divide sleep (on the scale of hundreds of minutes) by 100 and nutrition (on the scale of tens) by 10.

*--- Get Variables Onto the Same Scale ----.
COMPUTE sleep = sleep/100.
COMPUTE nutrition = nutrition/10.
EXECUTE.

After doing those transformations, we try to estimate the SEM again.

## lavaan 0.6-5 ended normally after 48 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                         23
##                                                       
##   Number of observations                          1000
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                68.207
##   Degrees of freedom                                32
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                              4741.833
##   Degrees of freedom                                45
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.992
##   Tucker-Lewis Index (TLI)                       0.989
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -10144.282
##   Loglikelihood unrestricted model (H1)     -10110.178
##                                                       
##   Akaike (AIC)                               20334.564
##   Bayesian (BIC)                             20447.442
##   Sample-size adjusted Bayesian (BIC)        20374.393
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.034
##   90 Percent confidence interval - lower         0.023
##   90 Percent confidence interval - upper         0.045
##   P-value RMSEA <= 0.05                          0.994
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.019
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard errors                             Standard
## 
## Latent Variables:
##                      Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   WELL_BEING =~                                                           
##     WB1                 1.000                               0.584    0.735
##     WB2                 1.094    0.046   23.842    0.000    0.639    0.793
##     WB3                 1.005    0.045   22.505    0.000    0.588    0.748
##     WB4                 1.030    0.045   23.069    0.000    0.602    0.766
##   PHYSICAL_HEALTH =~                                                      
##     sleep               1.000                               0.173    0.815
##     exercise           18.723    0.769   24.338    0.000    3.244    0.748
##     nutrition           3.674    0.142   25.933    0.000    0.636    0.794
##   MENTAL_HEALTH =~                                                        
##     mood                1.000                               0.819    0.777
##     relationships       0.660    0.030   21.670    0.000    0.541    0.754
##     enrichment          0.786    0.035   22.148    0.000    0.644    0.784
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   WELL_BEING ~                                                          
##     PHYSICAL_HEALT    2.469    0.136   18.106    0.000    0.732    0.732
##     MENTAL_HEALTH     0.169    0.023    7.422    0.000    0.236    0.236
## 
## Covariances:
##                      Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   PHYSICAL_HEALTH ~~                                                      
##     MENTAL_HEALTH       0.063    0.006   10.224    0.000    0.443    0.443
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .WB1               0.291    0.016   18.708    0.000    0.291    0.460
##    .WB2               0.241    0.014   17.050    0.000    0.241    0.371
##    .WB3               0.273    0.015   18.415    0.000    0.273    0.441
##    .WB4               0.255    0.014   17.912    0.000    0.255    0.413
##    .sleep             0.015    0.001   15.093    0.000    0.015    0.335
##    .exercise          8.262    0.465   17.754    0.000    8.262    0.440
##    .nutrition         0.238    0.015   16.125    0.000    0.238    0.370
##    .mood              0.440    0.030   14.711    0.000    0.440    0.396
##    .relationships     0.222    0.014   15.785    0.000    0.222    0.431
##    .enrichment        0.259    0.018   14.357    0.000    0.259    0.385
##    .WELL_BEING        0.087    0.010    8.708    0.000    0.255    0.255
##     PHYSICAL_HEALT    0.030    0.002   14.710    0.000    1.000    1.000
##     MENTAL_HEALTH     0.671    0.051   13.194    0.000    1.000    1.000

Now the model fits well! This all looks good. A bonus is that you have seen an example of how simulating data and testing it with your intended analysis script allows you to debug the script before you even start collecting data! 💡

One way or the other, you have now simulated data for a structural equation model that actually fits. You can call this job done! ✅

Conclusion

Being able to simulate data is a liberating skill for any psychological scientist. It is practical, at the very least, and can become an integral component of your research program, at the most. Simulated data is useful for important things like testing draft analysis scripts and conducting power analyses.

I did my best to simulate data that would be similar to the data I would get from observing the natural world. Sometimes this is more difficult than others, because research tasks are, almost by definition, highly specialized. So, do your best, and it is OK if some of your variables are not exactly as they would be if you collected real data; how much these deviations matter depends on your goals.

Hopefully, the things you learned in this tutorial provide the building blocks you need to begin simulating data of your own! 💖


  1. I am sure there is a reason for this, but I haven’t figured out what it is yet.↩︎

  2. If the predictors were correlated, then the square of their correlation should be included in this sum.↩︎