Open In App

Regression Tree with Simulated Data - rpart package

Last Updated : 19 Apr, 2025
Comments
Improve
Suggest changes
1 Likes
Like
Report

Regression trees are a type of decision tree used to predict continuous values. They work by splitting the data into smaller groups based on the input features, and then predicting the average value (or mean) for each group. The goal is to divide the data in a way that reduces the difference between the predicted values and the actual values.

The tree is built by making decisions at each step to split the data in the best way possible, based on which split reduces the prediction error the most. Once the tree is built, it can predict outcomes by following the splits from the root to the leaves, where the final prediction is the average of the values in that leaf.

Building a Regression Tree with rpart

In this example, we will demonstrate how to build a regression tree using the rpart package in R. We will begin by simulating some data and fitting a regression tree to it. Follow these steps to simulate the data:

1. Simulating Data

First, we will generate three predictor variables (x1, x2, x3) and one outcome variable (y). The relationship between these variables is intentionally complex.

R
# Set the seed for reproducibility 
set.seed(123) 
n <- 1000 

x1 <- runif(n) 
x2 <- runif(n) 
x3 <- runif(n) 

y <- 2*x1 + 3*x2^2 + 0.5*sin(2*pi*x3) + rnorm(n, mean = 0, sd = 1) 

sim_data <- data.frame(x1 = x1, x2 = x2, x3 = x3, y = y)

2. Visualizing Data

Next, we can visualize the relationship between each predictor and the outcome variable using ggplot2. These plots illustrate the complex relationships between the predictors and the outcome.

R
library(ggplot2)

ggplot(sim_data, aes(x = x1, y = y)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

ggplot(sim_data, aes(x = x2, y = y)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

ggplot(sim_data, aes(x = x3, y = y)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

Output:

Plot For x1:

x1
Plot for X1

Plot For x2:

x2
Plot for X2

Plot For X3:

x3
Plot for X3

3. Building the Regression Tree

Now, we can use the rpart function from the rpart package to build a regression tree. The function requires the formula specifying the outcome and predictors, the data, and control options to tune the tree's complexity. We use minsplit = 10 to ensure that nodes have at least 10 observations, and cp = 0.01 to control the complexity of the tree.

R
library(rpart)

tree <- rpart(y ~ x1 + x2 + x3, 
              data = sim_data, 
              control = rpart.control(minsplit = 10, cp = 0.01))

4. Visualizing the Tree

Once the tree is built, we can visualize it using the plot() and text() functions. This plot displays the tree structure, with each node showing the predictor and the split value. The mean value of y for each node is also displayed.

R
plot(tree)
text(tree)

Output:

DTR
Structure of Decision Tree

5. Making Predictions

Finally, we can use the predict() function to make predictions for new observations. For example:

R
new_data <- data.frame(x1 = 0.5, x2 = 0.5, x3 = 0.5)

y_pred=predict(tree, new_data)
cat(y_pred)

Output:

 1.92548848961463

In this article, we showed how to build a regression tree in R using the rpart package. We simulated nonlinear data, visualized relationships with ggplot2, built a tree model, and used it for prediction.


Explore