the plan
Implementing a linear regression algorithm from scratch in R is an excellent way to gain a deeper understanding of both the statistical concepts behind regression and the computational methods used to perform it. The goal is to walk through each step required to build a simple linear regression model without relying on any external packages or built-in functions like lm() 😅. We'll look at the mathematical foundations, the step-by-step implementation in R, and ways to evaluate the model.
how to think about glm
a matrix
To more easily understand and interpret regression methods, think of a set of data points with \(n\) variables as matrices of the following form:
\[ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1n} \\ 1 & x_{21} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{n} \end{bmatrix} + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{m} \end{bmatrix} \]
where the matrix \(X\) contains one column for each of the \(n\) variables (plus a leading column of 1s for the intercept) and one row for each of the \(m\) observations; each variable has a corresponding \(\beta\) coefficient, and each observation has its own error term \(\epsilon\).
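As a concrete sketch of that layout in base R (a minimal illustration; mtcars, hp, and wt reappear in the plot below, while using mpg as the outcome is an assumption made here for demonstration):

data(mtcars)
y <- mtcars$mpg                       # outcome vector of length m (illustrative choice)
X <- cbind(1, mtcars$hp, mtcars$wt)   # m x (n + 1) design matrix: intercept column + n = 2 variables
colnames(X) <- c("intercept", "hp", "wt")
dim(X)   # m rows (observations), n + 1 columns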
a vector in 2d space
Additionally, you can think of each column in the data as a geometric vector, so that each variable is represented spatially. The vector's length describes the variable's variability¹ and the angle² between two vectors, or variables, represents the association between them.
¹ The squared length of a variable's vector is the sum of squares associated with that variable, \(|y|^{2} = SS_{y}\).
² \(r_{xy} = \mathrm{corr}(x, y) = \cos\angle(x, y)\)
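This geometric reading is easy to verify numerically; a minimal check on hp and wt from mtcars (note that the correlation-cosine identity assumes the variables are mean-centered first):

hp_c <- mtcars$hp - mean(mtcars$hp)   # center the variables so length and angle
wt_c <- mtcars$wt - mean(mtcars$wt)   # line up with SS and correlation

sum(hp_c^2)                           # squared vector length = SS for hp
sum(hp_c * wt_c) /
  (sqrt(sum(hp_c^2)) * sqrt(sum(wt_c^2)))   # cosine of the angle between them
cor(mtcars$hp, mtcars$wt)             # identical to the cosine above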
A linear combination of two variables is then shown by their vector sum. For example, \(x_{1} + 2x_{2}\) is represented by scaling the vector \(x_{2}\) by 2 and adding it head-to-tail to \(x_{1}\). The resulting vector carries the properties of the new variable: its variability and its correlation with the others.
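That claim can also be checked directly; a small sketch (standardizing the two mtcars variables with scale() first is an assumption made here so they sit on a comparable footing):

x1 <- scale(mtcars$hp)[, 1]   # standardized horsepower
x2 <- scale(mtcars$wt)[, 1]   # standardized weight
z  <- x1 + 2 * x2             # the linear combination x1 + 2*x2

var(z)                                    # variability of the new variable
var(x1) + 4 * var(x2) + 4 * cov(x1, x2)   # same value, via Var(x1 + 2*x2)
cor(z, x1)                                # its correlation with x1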
library(ggplot2)

data(mtcars)
# Standardize magnitude for horsepower (hp) and weight (wt)
mtcars$hp_norm <- mtcars$hp / max(mtcars$hp)
mtcars$wt_norm <- mtcars$wt / max(mtcars$wt)
# Create a data frame with the end point of each car's vector
vectors <- data.frame(
  x1  = mtcars$hp_norm,
  y1  = mtcars$wt_norm,
  car = rownames(mtcars)
)
# happy_clrs and theme_happy() are the author's custom palette and theme,
# defined outside this snippet
vector_plot <- ggplot(vectors) +
  geom_segment(aes(x = 0, y = 0,
                   xend = x1, yend = y1),
               arrow = arrow(length = unit(0.15, "cm")),
               linewidth = .5, color = happy_clrs[6]) +
  geom_text(aes(x = jitter(x1, amount = 0.1), y = jitter(y1, amount = 0.1),
                label = car),
            hjust = 0.1, vjust = .5,
            size = 6, color = happy_clrs[4],
            check_overlap = TRUE) +
  xlim(0, 1.5) +
  ylim(0, 1.5) +
  coord_fixed() +
  theme_happy() +
  labs(title = "Car Horsepower (hp) vs. Weight (wt)",
       x = NULL,
       y = NULL)
# ggsave("img/vectors.png", vector_plot, width = 6, height = 4.5)
knitr::include_graphics("img/vectors.png")
the general linear model
The GLM represents one outcome variable as a linear combination of predictor variables \(x_{1}, \dots, x_{n}\). In simple linear regression, the linear relationship between the dependent variable \(y\) and the independent variable \(x\) is modeled as:
\[ y = \beta_{0} + \beta_{1}x + \epsilon \]
where:
\(\beta_{0}\) represents the \(y\)-intercept,
\(\beta_{1}\) represents the slope of the regression line, and
\(\epsilon\) is the error term, representing the difference between the observed and predicted values.
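With the model written this way, the least-squares fit can be computed from scratch; a minimal sketch (using wt and mpg from mtcars as stand-ins for \(x\) and \(y\), and calling lm() only as a sanity check, not as part of the fit):

x <- mtcars$wt
y <- mtcars$mpg

# least-squares estimates computed directly from their closed forms
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # slope
b0 <- mean(y) - b1 * mean(x)                                      # intercept

c(intercept = b0, slope = b1)
coef(lm(mpg ~ wt, data = mtcars))   # same values, as a cross-check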
multiple linear regression
In multiple linear regression, the model extends to accommodate more than one independent variable:
\[ y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{n}x_{n} + \epsilon, \]
where \(n\) is the number of independent variables.
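In matrix form, the least-squares solution to this model is given by the normal equations, \(\hat{\beta} = (X^{T}X)^{-1}X^{T}y\); a sketch of that computation in base R (the choice of hp and wt as predictors and mpg as the outcome is again illustrative):

X <- cbind(1, mtcars$hp, mtcars$wt)   # design matrix with an intercept column
y <- mtcars$mpg

# solve X'X beta = X'y directly, rather than forming the inverse explicitly
beta_hat <- solve(crossprod(X), crossprod(X, y))
rownames(beta_hat) <- c("intercept", "hp", "wt")
beta_hat

Using solve() on the crossproducts avoids explicitly inverting \(X^{T}X\), which is both cheaper and more numerically stable than a literal translation of the formula.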