This is absolutely fine! When you fit a regression, R automatically treats character (categorical) columns as factors, but converting them explicitly with factor() gives you more control, for example over the order of the levels. And while a factor is stored in a data frame as integer codes, in a regression each level is represented by a binary 0/1 variable, so the coefficient for a level is only added to the prediction when an observation has that level.
The summary shows one fewer level than the factor contains because each displayed level is measured against a "base" (reference) level, whose effect is absorbed into the intercept.
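To see the 0/1 coding and the base level directly, here is a minimal sketch using model.matrix(). The toy data frame and the ownership2 column are made up for illustration; only the ownership name and its three levels come from the output further down.

```r
# Hypothetical toy data; only the column name and levels mirror the thread.
toy <- data.frame(
  ownership = c("Public", "Private nonprofit", "Private for-profit", "Public"),
  stringsAsFactors = FALSE
)

# Explicit conversion: levels are sorted alphabetically by default, so
# "Private for-profit" comes first and becomes the base level.
toy$ownership <- factor(toy$ownership)
lvls <- levels(toy$ownership)
print(lvls)

# In the data frame, the factor is stored as integer codes...
codes <- as.integer(toy$ownership)
print(codes)

# ...but the design matrix used by lm() has an intercept plus one 0/1
# column for each level other than the base level.
X <- model.matrix(~ ownership, data = toy)
print(X)

# The extra control from working with a factor: pick a different base level.
toy$ownership2 <- relevel(toy$ownership, ref = "Public")
X2 <- model.matrix(~ ownership2, data = toy)
print(colnames(X2))
```

With relevel(), Public drops out of the displayed coefficients instead, and the other two levels are then measured relative to it.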
I went a bit deeper into some model info in another thread. I'll grab that link and edit my comment here with it. (It's this thread.)
Also, to explore your question a bit further:
Just using ownership:
Call:
lm(formula = default_rate ~ poly(SAT_avg, 2) + ownership, data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-6.7986 -1.3758 -0.2358  1.0202 14.9166

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                   6.307      1.302   4.844 1.52e-06 ***
poly(SAT_avg, 2)1           -72.817      2.626 -27.726  < 2e-16 ***
poly(SAT_avg, 2)2            36.799      2.610  14.101  < 2e-16 ***
ownershipPrivate nonprofit   -1.215      1.307  -0.929    0.353
ownershipPublic              -1.215      1.310  -0.928    0.354
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.604 on 837 degrees of freedom
Multiple R-squared:  0.54,  Adjusted R-squared:  0.5378
F-statistic: 245.6 on 4 and 837 DF,  p-value: < 2.2e-16
Now using factor(ownership):
Call:
lm(formula = default_rate ~ poly(SAT_avg, 2) + factor(ownership),
    data = train)

Residuals:
    Min      1Q  Median      3Q     Max
-6.7986 -1.3758 -0.2358  1.0202 14.9166

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)
(Intercept)                           6.307      1.302   4.844 1.52e-06 ***
poly(SAT_avg, 2)1                   -72.817      2.626 -27.726  < 2e-16 ***
poly(SAT_avg, 2)2                    36.799      2.610  14.101  < 2e-16 ***
factor(ownership)Private nonprofit   -1.215      1.307  -0.929    0.353
factor(ownership)Public              -1.215      1.310  -0.928    0.354
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.604 on 837 degrees of freedom
Multiple R-squared:  0.54,  Adjusted R-squared:  0.5378
F-statistic: 245.6 on 4 and 837 DF,  p-value: < 2.2e-16
The coefficients stay the same; only their labels change. So Private for-profit is the base level (it is the one missing from the output), and when ownership is Public, -1.215 is added to the predicted default_rate.
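If you want to check this equivalence yourself without the original data, the sketch below fits both formulas on simulated data. The data frame sim and its numbers are invented for illustration and are unrelated to the train set above.

```r
# Simulated stand-in data (NOT the thread's dataset); ownership is kept
# as a plain character column on purpose.
set.seed(1)
n <- 60
sim <- data.frame(
  SAT_avg = runif(n, 800, 1500),
  ownership = rep(c("Private for-profit", "Private nonprofit", "Public"),
                  length.out = n),
  stringsAsFactors = FALSE
)
sim$default_rate <- 10 - 0.005 * sim$SAT_avg +
  ifelse(sim$ownership == "Public", -1, 0) + rnorm(n)

# Same model twice: once with the character column as-is, once wrapped
# in factor().
fit_chr <- lm(default_rate ~ poly(SAT_avg, 2) + ownership, data = sim)
fit_fct <- lm(default_rate ~ poly(SAT_avg, 2) + factor(ownership), data = sim)

# The estimates match; only the coefficient labels differ.
same_estimates <- all.equal(unname(coef(fit_chr)), unname(coef(fit_fct)))
print(same_estimates)
print(names(coef(fit_chr)))
print(names(coef(fit_fct)))
```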
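The equation would essentially be default_rate ≈ 6.307 - 72.817*P1(SAT_avg) + 36.799*P2(SAT_avg) - 1.215*(Private nonprofit) - 1.215*(Public)
where P1 and P2 are the two columns produced by poly(SAT_avg, 2), and Private nonprofit and Public are binary 0/1 variables. Note that poly() builds orthogonal polynomials by default, so P1 and P2 are not the raw SAT_avg and SAT_avg^2; for coefficients on the raw terms you would fit with poly(SAT_avg, 2, raw = TRUE), which gives different coefficient values but the same fitted values.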
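As a rough check on how the prediction is assembled, the sketch below rebuilds the fitted values as design matrix times coefficients. It also shows that the default poly() columns are an orthogonal basis rather than raw SAT_avg and SAT_avg^2. The data are simulated and the names d, fit, and fit_raw are hypothetical.

```r
# Simulated stand-in data (NOT the thread's dataset).
set.seed(2)
n <- 50
d <- data.frame(
  SAT_avg = runif(n, 800, 1500),
  ownership = factor(rep(c("Private for-profit", "Private nonprofit", "Public"),
                         length.out = n))
)
d$default_rate <- 12 - 0.006 * d$SAT_avg + rnorm(n)

fit <- lm(default_rate ~ poly(SAT_avg, 2) + ownership, data = d)

# Prediction = intercept + poly columns * their coefficients
#            + 0/1 indicator columns * their coefficients.
X <- model.matrix(fit)
manual <- as.vector(X %*% coef(fit))
manual_matches <- all.equal(manual, unname(fitted(fit)))
print(manual_matches)

# The first poly() column is a centred, rescaled basis, not raw SAT_avg.
P <- poly(d$SAT_avg, 2)
poly_is_raw <- isTRUE(all.equal(as.vector(P[, 1]), d$SAT_avg))
print(poly_is_raw)

# raw = TRUE puts the coefficients on SAT_avg and SAT_avg^2 directly;
# the coefficient values change, but the fitted values do not.
fit_raw <- lm(default_rate ~ poly(SAT_avg, 2, raw = TRUE) + ownership, data = d)
same_fit <- all.equal(unname(fitted(fit)), unname(fitted(fit_raw)))
print(same_fit)
```

The orthogonal default is usually the numerically safer choice; raw = TRUE is mainly useful when you want to write the equation out in terms of SAT_avg itself.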