class: center, middle, inverse, title-slide # Week 07 - Regression ## Model Fit, LM for Randomised Experiments, Multiple Predictors, Merging Data Sets in R
### Danilo Freire ### 8th March 2019 --- <style> .remark-slide-number { position: inherit; } .remark-slide-number .progress-bar-container { position: absolute; bottom: 0; height: 6px; display: block; left: 0; right: 0; } .remark-slide-number .progress-bar { height: 100%; background-color: #EB811B; } .orange { color: #EB811B; } </style> # Today's Agenda .font150[ * Review of least squares * How to assess model fit with `\(R^2\)` * Analyse randomised experiments with `lm()` * Merge data sets with `merge(x, y, by = )` ] --- # Review .font150[ * Linear regression minimises the squared distance of the points * A few things to remember: * .orange[Regression equation]: `\(Y = \alpha + \beta X + \epsilon\)` - `\(\alpha\)` is the intercept (value of `\(Y\)` when `\(X = 0\)`) - `\(\beta\)` is the slope (increase/decrease in `\(Y\)` when `\(X\)` varies by one unit) - `\(\epsilon\)` is the error term (difference from points to fitted line; mean zero) ] --- # Linear Models .font120[ ```r bivariate <- read.csv("https://raw.githubusercontent.com/pols1600/pols1600.github.io/master/datasets/prediction/bivariate_data.csv") bivariate <- subset(bivariate, Year == 2010) model.cm <- lm(Child.Mortality ~ log(GDP), data = bivariate) model.cm ``` ``` ## ## Call: ## lm(formula = Child.Mortality ~ log(GDP), data = bivariate) ## ## Coefficients: ## (Intercept) log(GDP) ## 276.58 -26.13 ``` * `\(Y = 276.58 - 26.13X + \epsilon\)` ] --- # R-Squared .font140[ * How well does our regression line fit the data? * `\(R^2\)`, _coefficient of determination_ `$$R^2 = 1 - \frac{\textsf{SSR}}{\textsf{Total sum of squares (TSS)}} \ = \ 1 - \frac{\sum_{i=1}^n \hat\epsilon_i^2}{\sum_{i=1}^n (Y_i - \overline{Y})^2}$$` * `\(R^2\)` is also defined as the _explained variance_ in Y * Goes from 0 to 1 * We can use `summary(model)` to see it ] --- # Example .font110[ ```r summary(model1) ``` ``` ## ## Call: ## lm(formula = Child.Mortality ~ log(GDP), data = bivariate) ## ## Residuals: ## Min 1Q Median 3Q Max ## -46.733 -15.699 -4.320 6.974 133.772 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 276.582 15.534 17.80 <2e-16 *** ## log(GDP) -26.127 1.691 -15.45 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 26.78 on 168 degrees of freedom ## (21 observations deleted due to missingness) *## Multiple R-squared: 0.587, Adjusted R-squared: 0.5845 ## F-statistic: 238.7 on 1 and 168 DF, p-value: < 2.2e-16 ``` ] --- # Caveats .font150[ * `\(R^2\)` cannot be used to compare models estimated with different samples * `\(R^2\)` .orange[does not] indicate whether: - _the independent variable causes the dependent variable_ - the model is correctly specified - the model suffers from ommitted variable bias - there are enough data points to make a valid conclusion ] --- # Linear Models and RCTs .font150[ * .orange[When the data come from a randomised experiment,] model parameters have a causal interpretation * Treatment status as the independent variable (0 or 1) - 0 indicates control group - 1 indicates treatment group ] -- .font150[ * `\(Y = \alpha + \beta \times treatment + \epsilon\)` ] -- .font150[ * What is the interpretation of `\(\alpha\)` here? * What is the interpretation of `\(\beta\)` here? ] --- # Women as Policy Makers .font120[ * Do women promote different policies than men? * Observational studies: compare policies adopted by female politicians with those adopted by male politicians * Randomised natural experiment: - one third of village council heads reserved for women - assigned at the level of Gram Panchayat (village council) since mid-1990s - each GP has multiple villages * Hypothesis: female politicians represent the interests of female voters * Female voters complain about drinking water while male voters complain about irrigation ] --- # Data | Name | Description | | :----------- | :------------------------------------------------------------------------------------------------------------------------- | | `GP` | An identifier for the Gram Panchayat (village council) | | `village` | identifier for each village | | `reserved` | binary variable indicating whether the GP was reserved for women leaders or not | | `female` | binary variable indicating whether the GP had a female leader or not | | `irrigation` | variable measuring the number of new or repaired irrigation facilities in the village since the reserve policy started | | `water` | variable measuring the number of new or repaired drinking-water facilities in the village since the reserve policy started | ```r women <- read.csv("https://raw.githubusercontent.com/pols1600/pols1600.github.io/master/datasets/prediction/women.csv") names(women) ``` ``` ## [1] "GP" "village" "reserved" "female" "irrigation" ## [6] "water" ``` --- # Models .font110[ * Does the reservation policy increase female politicians? ```r tapply(women$female, women$reserved, mean) ``` ``` ## 0 1 ## 0.07476636 1.00000000 ``` * Does it change the policy outcomes? ```r # drinking-water facilities tapply(women$water, women$reserved, mean) ``` ``` ## 0 1 ## 14.73832 23.99074 ``` ```r # irrigation facilities tapply(women$irrigation, women$reserved, mean) ``` ``` ## 0 1 ## 3.387850 3.018519 ``` ] --- # Slope Coefficient = Difference-in-Means .font120[ ```r tapply(women$water, women$reserved, mean) ``` ``` ## 0 1 *## 14.73832 23.99074 ``` ```r mean(women$water[women$reserved == 1]) - mean(women$water[women$reserved == 0]) ``` ``` *## [1] 9.252423 ``` ```r lm(water ~ reserved, data = women) ``` ``` ## ## Call: ## lm(formula = water ~ reserved, data = women) ## ## Coefficients: ## (Intercept) reserved *## 14.738 9.252 ``` ] --- # Resume Experiment Revisited .font100[ ```r resume <- read.csv("https://raw.githubusercontent.com/pols1600/pols1600.github.io/master/datasets/causality/resume.csv") model1 <- lm(call ~ race, data = resume) model1 ``` ``` ## ## Call: ## lm(formula = call ~ race, data = resume) ## ## Coefficients: ## (Intercept) racewhite ## 0.06448 0.03203 ``` ```r tapply(resume$call, resume$race, mean) ``` ``` ## black white ## 0.06447639 0.09650924 ``` ```r # baseline (black) + the effect of having a white-sounding name 0.06448 + 0.03203 ``` ``` ## [1] 0.09651 ``` ] --- # Call Rates and Gender .font110[ * Let's add a gender indicator ```r model2 <- lm(call ~ sex, data = resume) model2 ``` ``` ## ## Call: ## lm(formula = call ~ sex, data = resume) ## ## Coefficients: ## (Intercept) sexmale ## 0.082488 -0.008645 ``` ```r tapply(resume$call, resume$sex, mean) ``` ``` ## female male ## 0.08248799 0.07384342 ``` ] --- # Race + Gender .font120[ ```r model3 <- lm(call ~ race + sex, data = resume) model3 ``` ``` ## ## Call: ## lm(formula = call ~ race + sex, data = resume) ## ## Coefficients: ## (Intercept) racewhite sexmale ## 0.066534 0.032130 -0.009128 ``` ```r # Call rates for a white male 0.066534 + 0.032130 - 0.009128 ``` ``` ## [1] 0.089536 ``` * Regression Equation: `\(Y = 0.066 + 0.032*white - 0.009*male + \epsilon\)` ] --- # Interpreting Multiple Predictors .font150[ * .orange[Ceteris Paribus]: _holding everything else constant_ * Let's interpret the coefficient for _white_ in `\(Y = 0.066 + 0.032 \times white - 0.009 \times male + \epsilon\)` * We say: "_all else equal, having a white-sounding name increases the change of getting a job call in about 3%. Since candidates with black-sounding names a job call rate of about 6%, candidates with white-sounding names are 50% more likely to get a call_" ] --- # Adjusted R-Squared .font140[ * When we have more than one independent variable, we need to modify the `\(R^2\)` formula to account for those additional variables * `\(R^2\)` measures the overall impact of _all_ variables at once, but some might just add noise to the model * Every predictor added to a model increases `\(R^2\)` and never decreases it * Adjusted `\(R^2\)` compensates for the addition of variables and only increases _if the new term enhances the model_ * It is usually lower than regular `\(R^2\)` but not much ] --- # Adjusted R-Squared .font110[ ```r summary(model.cm) ``` ``` ## ## Call: ## lm(formula = Child.Mortality ~ log(GDP), data = bivariate) ## ## Residuals: ## Min 1Q Median 3Q Max ## -46.733 -15.699 -4.320 6.974 133.772 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 276.582 15.534 17.80 <2e-16 *** ## log(GDP) -26.127 1.691 -15.45 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 26.78 on 168 degrees of freedom ## (21 observations deleted due to missingness) *## Multiple R-squared: 0.587, Adjusted R-squared: 0.5845 ## F-statistic: 238.7 on 1 and 168 DF, p-value: < 2.2e-16 ``` ] --- # Merging Data Sets in R .font110[ * We often have to merge data from different sources * If the data refer to the same units/indiviuals, .orange[it is essential that both data sets have the same identification variable] * Use `merge(x, y, by =)` function ] ```r pres2008 <- read.csv("https://raw.githubusercontent.com/pols1600/pols1600.github.io/master/datasets/prediction/pres08.csv") names(pres2008) ``` ``` ## [1] "state.name" "state" "Obama" "McCain" "EV" ``` ```r intrade08 <- read.csv("https://raw.githubusercontent.com/pols1600/pols1600.github.io/master/datasets/prediction/intrade08.csv") names(intrade08) ``` ``` ## [1] "X" "day" "statename" "PriceD" "VolumeD" "PriceR" ## [7] "VolumeR" "state" ``` ```r *pres.merged <- merge(pres2008, intrade08, by = "state") ``` --- # Merging Data Sets in R .font110[ ```r names(pres.merged) ``` ``` ## [1] "state" "state.name" "Obama" "McCain" "EV" ## [6] "X" "day" "statename" "PriceD" "VolumeD" ## [11] "PriceR" "VolumeR" ``` ```r head(pres.merged) ``` ``` ## state state.name Obama McCain EV X day statename PriceD ## 1 AK Alaska 38 59 3 24073 2008-02-28 Alaska 7.5 ## 2 AK Alaska 38 59 3 25195 2008-03-21 Alaska 7.5 ## 3 AK Alaska 38 59 3 8773 2007-05-04 Alaska 10.0 ## 4 AK Alaska 38 59 3 25603 2008-03-29 Alaska 8.0 ## 5 AK Alaska 38 59 3 25246 2008-03-22 Alaska 7.5 ## 6 AK Alaska 38 59 3 23665 2008-02-20 Alaska 7.5 ## VolumeD PriceR VolumeR ## 1 0 92.5 0 ## 2 0 92.5 0 ## 3 0 90.0 0 ## 4 0 92.0 0 ## 5 0 92.5 0 ## 6 0 92.5 0 ``` ] --- class: inverse, center, middle # Questions? <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Homework .font150[ * `swirl()` PREDICTION2 ] --- class: inverse, center, middle # See you! <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html>