# Analysis of Housing Choices and Determinants of Income with Consumer Expenditure Survey data in R

Instructions:
-Data analysis should be done in R. Any packages and functions are allowed. Show all your work.
-Models that can be used include KNN, linear discriminant analysis (LDA), OLS, logistic regression, trees, random forest, boosting, local linear regression, generalized additive models (GAM), generalized linear models (GLM), etc.
-Other concepts include the Bayes classifier, cross validation, bootstrap, shrinkage methods (Ridge and LASSO), splines, etc.
-Submit one document with no more than 6 pages of writing (1.15 spacing, font size 11). All code and output should be in appendices, which do not count toward the 6-page limit. There should be 3 appendices:
o Output in table and figure formats. Tables and figures should be numbered and labeled.
o Code for your main analyses
o Data cleaning/preparation code
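
As an illustration of the cross validation concept listed above, the following is a minimal sketch of 5-fold cross validation for a classifier. It assumes simulated data with hypothetical variable names (`income`, `fam_size`, `housing`) and uses LDA from the `MASS` package purely as an example; any of the listed models could be substituted inside the loop.

```r
library(MASS)  # provides lda(); MASS ships with standard R installations

set.seed(1)
n <- 300

# Simulated data; all variable names are hypothetical, not survey columns
dat <- data.frame(
  income   = rnorm(n, mean = 50, sd = 15),
  fam_size = sample(1:5, n, replace = TRUE)
)
score <- 0.05 * dat$income + 0.4 * dat$fam_size + rnorm(n)
dat$housing <- cut(score,
                   breaks = quantile(score, c(0, 1/3, 2/3, 1)),
                   labels = c("apartment", "townhouse", "detached"),
                   include.lowest = TRUE)

# Assign each observation to one of k folds at random
k <- 5
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, predict the held-out fold
cv_error <- numeric(k)
for (i in 1:k) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit   <- lda(housing ~ income + fam_size, data = train)
  pred  <- predict(fit, newdata = test)$class
  cv_error[i] <- mean(pred != test$housing)
}

# Cross-validated misclassification rate (average over folds)
mean_cv_error <- mean(cv_error)
mean_cv_error
```

Repeating the loop for each candidate model with the same `fold` assignment gives error rates that can be compared directly.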
Data:
Consumer Expenditure Survey: http://www.bls.gov/cex/pumdhome.htm
Data files are in CSV format. Data documentation is available under Most Recent PUMD Release at http://www.bls.gov/cex/pumdhome.htm.
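
A minimal sketch of reading and preparing a CSV file in R might look like the following. The file here is a small temporary stand-in written by the code itself, and the column names (`id`, `income`, `tenure`, `fam_size`) are hypothetical; the actual file and variable names should be taken from the data documentation.

```r
# Hypothetical stand-in for a downloaded CSV file; column names are
# illustrative only and do not come from the survey documentation.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,income,tenure,fam_size",
  "1,52000,owner,3",
  "2,31000,renter,1",
  "3,,renter,2"
), tmp)

# Read the file, treating empty fields as missing values
cex <- read.csv(tmp, na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Basic preparation: convert the categorical variable to a factor
# and drop rows with any missing value
cex$tenure <- factor(cex$tenure)
cex_clean  <- cex[complete.cases(cex), ]

str(cex_clean)
```

Dropping incomplete rows is only one possible choice; whether to drop or impute missing values is a data preparation decision worth documenting in the appendix.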
Questions:
1. Housing type choice
a. Build a model that best predicts housing type choice. What is the best model? Why is it the best?
b. What are the steps you take to get the model in a.? Explain in detail.
c. What are the top 3 determinants of housing type choice? Do you use the model in a. to answer this question? If yes, why? If no, why not? Explain how you decide which determinants are most important.
2. Determinants of income
a. What are possible techniques that allow you to study determinants of income?
b. How do you select the features in your models?
c. Which model do you prefer? Why? What are the top 3 determinants of income?
d. Let X be an independent variable/feature. Can you use your model in c. to answer how much income will change given a one-unit increase in X? Explain your reasons. If your answer is no, build a model that would let you answer the question.
e. Interpret the results from the model in d. (or the model in c. if you did not build a new model in d.).
f. Distinguish between endogenous and exogenous variables in your model. Why might this distinction be important?
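
For question 2d, one model that gives a direct answer to the "one-unit increase in X" question is a linear regression, where the coefficient on X is the expected change in income for a one-unit increase in X with the other features held fixed. The sketch below assumes simulated data with hypothetical variables (`educ_years`, `age`, `income`) and a true coefficient of 3 built in by construction; it is an illustration, not a suggested specification for the survey data.

```r
set.seed(42)
n <- 500

# Simulated data; variable names and effect sizes are assumptions
dat <- data.frame(
  educ_years = sample(8:20, n, replace = TRUE),
  age        = sample(25:64, n, replace = TRUE)
)
# By construction, one more year of education raises income by 3 units
dat$income <- 10 + 3 * dat$educ_years + 0.5 * dat$age + rnorm(n, sd = 5)

# OLS: each coefficient is the expected change in income for a one-unit
# increase in that feature, holding the other feature constant
fit <- lm(income ~ educ_years + age, data = dat)

coef(summary(fit))          # estimates, standard errors, t-values, p-values
confint(fit, "educ_years")  # 95% confidence interval for the coefficient

beta_educ <- unname(coef(fit)["educ_years"])
beta_educ
```

The coefficient has a causal reading only if X is exogenous, which connects to question 2f; if X is endogenous, the estimate reflects association rather than the effect of changing X.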

