
bare-bones linear regression

custom lm implementation
Author: JP Monteagudo
Published: Tuesday, October 1, 2024

the plan

Implementing a linear regression algorithm from scratch in R is an excellent way to gain a deeper understanding of both the statistical concepts behind regression and the computational methods used to perform it. The goal is to walk through each step required to build a simple linear regression model without relying on any external packages or built-in functions like lm() 😅. We'll look at the mathematical foundations, a step-by-step implementation in R, and ways to evaluate the model.

how to think about glm

a matrix

To more easily understand and interpret regression methods, think of a set of data points with \(n\) variables as matrices of the following form:

\[
\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix} =
\begin{bmatrix}
1 & x_{11} & \cdots & x_{1n} \\
1 & x_{21} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{m1} & \cdots & x_{mn}
\end{bmatrix}
\begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{n} \end{bmatrix} +
\begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{m} \end{bmatrix}
\]

where the matrix \(X\) contains one column per variable (\(n\) variables measured on \(m\) observations), each variable has a corresponding \(\beta\) coefficient, and each observation \((x, y)\) carries its own error term \(\epsilon\).
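As a concrete sketch of that layout (borrowing the mtcars data used in the plot below, with hp and wt as the two predictors), the outcome vector and design matrix can be assembled by hand:

Code
data(mtcars)

y <- mtcars$mpg                 # outcome vector, one entry per observation
X <- cbind(intercept = 1,       # column of ones paired with beta_0
           hp = mtcars$hp,
           wt = mtcars$wt)

dim(X)  # m = 32 rows (observations), n + 1 = 3 columns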

a vector in 2d space

Additionally, you can think of each column in the data as a geometric vector: each variable becomes a vector in space, whose length describes its variability¹, and the angle² between two vectors (or variables) represents the association between them.

¹ The squared length of the vector is the sum of squares associated with the variable, \(|y|^2 = SS_{y}\).

² \(r_{xy} = \text{corr}(x_{i}, y_{j}) = \cos\angle(x_{i}, y_{j})\)
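To make the angle interpretation concrete, here is a quick sanity check (a minimal sketch using the hp and wt columns of mtcars, which also appear in the plot below): once the variables are centered, the cosine of the angle between their vectors equals the Pearson correlation.

Code
data(mtcars)

# Center the variables so the angle is measured between deviation vectors
x <- mtcars$hp - mean(mtcars$hp)
y <- mtcars$wt - mean(mtcars$wt)

# Cosine of the angle: dot product divided by the product of vector lengths
cos_angle <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

cos_angle                  # ~ 0.659
cor(mtcars$hp, mtcars$wt)  # identical value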

A linear combination of two variables is then shown by their vector sum. For example, \(x_{1} + 2x_{2}\) corresponds to scaling \(x_{2}\) to twice its length and adding it head-to-tail to \(x_{1}\). The resulting vector represents the properties of the new variable: its variability and its correlation with the others.
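As a small sketch of this (again with centered hp and wt from mtcars), the combination behaves like any other variable, with its own sum of squares and its own correlations with the components:

Code
data(mtcars)

x1 <- mtcars$hp - mean(mtcars$hp)  # centered component vectors
x2 <- mtcars$wt - mean(mtcars$wt)

z <- x1 + 2 * x2   # the linear combination is itself a new variable

sum(z^2)    # squared length: the new variable's sum of squares
cor(z, x1)  # association with each component
cor(z, x2)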

Code
library(ggplot2)

data(mtcars)

# Standardize magnitude for horsepower (hp) and weight (wt)
mtcars$hp_norm <- mtcars$hp/max(mtcars$hp)
mtcars$wt_norm <- mtcars$wt/max(mtcars$wt)

# Create a data frame with start and end point vectors
vectors <- data.frame(
  x1 = mtcars$hp_norm,
  y1 = mtcars$wt_norm,
  car = rownames(mtcars)
)

# happy_clrs and theme_happy() are the author's custom palette and plot theme
vector_plot <- ggplot(vectors) +
  geom_segment(aes(x = 0,y = 0,
                   xend = x1, yend = y1),
  arrow = arrow(length = unit(0.15,"cm")),
  linewidth = .5, color = happy_clrs[6]) +
  geom_text(aes(x = jitter(x1, amount = 0.1), y = jitter(y1, amount = 0.1), 
                label = car),
            hjust = 0.1, vjust = .5,
            size = 6, color = happy_clrs[4],
            check_overlap = TRUE) +
  xlim(0,1.5) +
  ylim(0,1.5) +
  coord_fixed() +
  theme_happy() +
  labs(title = "Car Horsepower (hp) vs. Weight (wt)",
       x = NULL,
       y = NULL)

# ggsave("img/vectors.png", vector_plot, width = 6, height = 4.5)

knitr::include_graphics("img/vectors.png")

the general linear model

The GLM is a framework in which one outcome variable is expressed as a linear combination of predictor variables \(X_{1}, \ldots, X_{n}\). In simple linear regression, the linear relationship between the dependent variable \(y\) and the independent variable \(x\) is modeled as:

\[ y = \beta_{0} + \beta_{1}x + \epsilon \]

where:
\(\beta_{0}\) represents the \(y\)-intercept,
\(\beta_{1}\) represents the slope of the regression line, and
\(\epsilon\) is the error term, representing the difference between the observed and predicted values.
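As a preview of the from-scratch approach, here is a minimal sketch of the standard least-squares estimates of these two coefficients (assuming mtcars, with wt predicting mpg):

Code
data(mtcars)

x <- mtcars$wt
y <- mtcars$mpg

# Least-squares estimates:
#   beta_1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   beta_0 = mean(y) - beta_1 * mean(x)
beta_1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta_0 <- mean(y) - beta_1 * mean(x)

c(beta_0, beta_1)             # ~ 37.285 and -5.344
coef(lm(mpg ~ wt, mtcars))    # the built-in fit agrees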

multiple linear regression

In multiple linear regression, the model extends to accommodate more than one independent variable.

\[ y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{n}x_{n} + \epsilon, \]

where \(n\) is the number of variables.
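In matrix form, the same least-squares solution is \(\hat{\beta} = (X^{T}X)^{-1}X^{T}y\); a brief sketch (same mtcars predictors as above):

Code
data(mtcars)

X <- cbind(1, mtcars$hp, mtcars$wt)  # intercept plus two predictors
y <- mtcars$mpg

# Solve the normal equations (X'X) beta = X'y rather than inverting X'X
beta_hat <- solve(crossprod(X), crossprod(X, y))

drop(beta_hat)                           # beta_0, beta_1, beta_2
coef(lm(mpg ~ hp + wt, data = mtcars))   # same estimates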

Citation

For attribution, please cite this work as:
Monteagudo, JP. 2024. “Bare-Bones Linear Regression.” October 1, 2024. https://www.jpmonteagudo.com/blog/2024/09.