## the plan

Implementing a linear regression algorithm from scratch in R is an excellent way to gain a deeper understanding of both the statistical concepts behind regression and the computational methods used to perform it. The goal is to walk through each step required to build a simple linear regression model without relying on any external packages or built-in functions like `lm()` 😅. We'll look at the mathematical foundations, a step-by-step implementation in R, and ways to evaluate the model.

## how to think about glm

#### a matrix

To more easily understand and interpret regression methods, think of a set of data points with \(n\) variables as matrices of the following form:

\[ \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1n} \\ 1 & x_{21} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{n} \end{bmatrix} + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{m} \end{bmatrix} \]

where matrix \(X\) contains \(n\) columns of variables for \(m\) observations, each variable has one corresponding \(\beta\) coefficient, and each observation \((x,y)\) has an error term \(\epsilon\).

#### a vector in 2d space

Additionally, you can think of each column in the data as a geometric vector, where each variable is represented by a spatial vector. Its length describes its variability^{1}, and the angle^{2} between two vectors (or variables) represents the association between them.

^{1} the squared length of a vector is the sum of squares associated with the variable, \(|y|^2 = SS_{y}\)

^{2} \(r_{xy} = \text{corr}(x, y) = \cos\angle(x, y)\)

A linear combination of two variables is then given by their vector sum. For example, \(x_{1} + 2x_{2}\) is the vector obtained by adding \(x_{1}\) to \(x_{2}\) scaled by a factor of two.

The resulting vector represents the properties of the new variable: its variability and its correlation with the others.
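Footnote 2 above can be checked directly in R. A minimal sketch, using `hp` and `wt` from `mtcars` as two example variables: after mean-centering, the cosine of the angle between the two variable vectors equals their correlation.

```r
# Verify footnote 2: for mean-centered variables, the correlation
# between two variables equals the cosine of the angle between
# their vectors. Illustrated here with hp and wt from mtcars.
data(mtcars)
x <- mtcars$hp - mean(mtcars$hp)  # center so vectors start at the origin
y <- mtcars$wt - mean(mtcars$wt)

# cos(angle) = (x . y) / (|x| |y|)
cos_angle <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

all.equal(cos_angle, cor(mtcars$hp, mtcars$wt))  # TRUE
```

The \((n-1)\) factors in the covariance and standard deviations cancel, which is why the dot-product formula reproduces `cor()` exactly.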

## Code

```r
library(ggplot2)

data(mtcars)

# Standardize magnitude for horsepower (hp) and weight (wt)
mtcars$hp_norm <- mtcars$hp / max(mtcars$hp)
mtcars$wt_norm <- mtcars$wt / max(mtcars$wt)

# Create a data frame with end-point coordinates for each vector
vectors <- data.frame(
  x1 = mtcars$hp_norm,
  y1 = mtcars$wt_norm,
  car = rownames(mtcars)
)

# happy_clrs and theme_happy() are a custom palette and theme
vector_plot <- ggplot(vectors) +
  geom_segment(aes(x = 0, y = 0,
                   xend = x1, yend = y1),
               arrow = arrow(length = unit(0.15, "cm")),
               linewidth = .5, color = happy_clrs[6]) +
  geom_text(aes(x = jitter(x1, amount = 0.1), y = jitter(y1, amount = 0.1),
                label = car),
            hjust = 0.1, vjust = .5,
            size = 6, color = happy_clrs[4],
            check_overlap = TRUE) +
  xlim(0, 1.5) +
  ylim(0, 1.5) +
  coord_fixed() +
  theme_happy() +
  labs(title = "Car Horsepower (hp) vs. Weight (wt)",
       x = NULL,
       y = NULL)

# ggsave("img/vectors.png", vector_plot, width = 6, height = 4.5)
knitr::include_graphics("img/vectors.png")
```

## the general linear model

The GLM is a method by which one outcome variable is modeled as a linear combination of predictor variables \(x_{1}, \ldots, x_{n}\). In simple linear regression, the **linear** relationship between the dependent variable \(y\) and the independent variable \(x\) is modeled as:

\[ y = \beta_{0} + \beta_{1}x + \epsilon \]

where:

- \(\beta_{0}\) represents the \(y\)-intercept,
- \(\beta_{1}\) represents the slope of the regression line, and
- \(\epsilon\) is the error term, representing the difference between the observed and predicted values.
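The two coefficients have a closed-form least-squares solution, \(\beta_{1} = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sum(x - \bar{x})^2}\) and \(\beta_{0} = \bar{y} - \beta_{1}\bar{x}\), which can be computed without `lm()`. A minimal sketch, using `wt` and `mpg` from `mtcars` as an example predictor and outcome:

```r
# Simple linear regression without lm(), via the closed-form
# least-squares estimates:
#   beta1 = cov(x, y) / var(x)
#   beta0 = mean(y) - beta1 * mean(x)
data(mtcars)
x <- mtcars$wt   # example predictor: weight
y <- mtcars$mpg  # example outcome: miles per gallon

beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)

# Residuals (the epsilon term): observed minus predicted values
residuals <- y - (beta0 + beta1 * x)

c(intercept = beta0, slope = beta1)
# agrees with coef(lm(mpg ~ wt, data = mtcars))
```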

### multiple linear regression

In multiple linear regression, the model extends to accommodate more than one predictor:

\[ y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{n}x_{n} + \epsilon, \]

where \(n\) is the number of predictor variables.
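In matrix form, the least-squares coefficients solve the normal equations \((X^{T}X)\beta = X^{T}y\). A sketch of this, again using `mtcars` with `wt` and `hp` as two example predictors:

```r
# Multiple linear regression via the normal equations,
# beta = (X'X)^{-1} X'y, without lm().
data(mtcars)
X <- cbind(1, mtcars$wt, mtcars$hp)  # design matrix with intercept column
y <- mtcars$mpg

beta <- solve(t(X) %*% X, t(X) %*% y)  # solve (X'X) beta = X'y
rownames(beta) <- c("beta0", "beta1", "beta2")
beta
# agrees with coef(lm(mpg ~ wt + hp, data = mtcars))
```

Using `solve(A, b)` rather than `solve(A) %*% b` avoids explicitly inverting \(X^{T}X\), which is both faster and more numerically stable.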