library(tidycensus)
library(tidyverse)
library(dplyr)
There are several ways to load data into the R environment for analysis. Using the tidycensus package, we can query the Census Bureau directly for data after obtaining a Census API key (http://api.census.gov/data/key_signup.html). From the tidycensus package, we use the census_api_key() function and specify install=TRUE so that the API key is saved to the .Renviron file and is available in future R sessions. To find the variables we want from the 2000 Census Summary File 3 (SF3) (https://www.census.gov/census2000/sumfile3.html), we use the load_variables() function. Note that tidycensus can also load ACS (American Community Survey) data, as well as data from other census years.
census_api_key("YOUR API KEY HERE", install=TRUE)
vars <- load_variables(2000, "sf3", cache=TRUE)
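As a quick check that the key was stored, the sketch below looks for it in the current session. It assumes tidycensus saves the key under the environment variable name CENSUS_API_KEY, and that a newly installed key may not be visible until .Renviron is reloaded or R is restarted; the messages are illustrative only.

```r
# Look up the Census API key in the current R session.
# Assumption: tidycensus stores the key as the environment variable CENSUS_API_KEY.
key <- Sys.getenv("CENSUS_API_KEY")

if (nzchar(key)) {
  message("Census API key found (", nchar(key), " characters).")
} else {
  message("No key found. Run census_api_key() with install=TRUE, then ",
          "reload with readRenviron(\"~/.Renviron\") or restart R.")
}
```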
To explore the variables, you can use the View() function, which opens the table in a viewer (in RStudio, one where you can filter and search the variables available to you).
View(vars)
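Outside of an interactive viewer, the same search can be done in code. The sketch below assumes the table returned by load_variables() has name, label, and concept columns. So that it runs without an API key, it uses a small stand-in table (vars_demo, a hypothetical name) whose label and concept text is made up for illustration and is not copied from the SF3 documentation.

```r
library(dplyr)
library(stringr)

# Stand-in for `vars`; the label and concept text here is illustrative only.
vars_demo <- tibble(
  name    = c("P001001", "P087002", "P049007"),
  label   = c("Total", "Income below poverty level", "Male: Manufacturing"),
  concept = c("TOTAL POPULATION", "POVERTY STATUS BY AGE", "SEX BY INDUSTRY")
)

# Keep rows whose concept or label mentions "poverty", ignoring case.
poverty_vars <- vars_demo %>%
  filter(str_detect(concept, regex("poverty", ignore_case = TRUE)) |
           str_detect(label, regex("poverty", ignore_case = TRUE)))

print(poverty_vars)
```

With the real table, the same filter can be applied to vars in place of vars_demo.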
First six observations from loaded Census data
To download selected variables from the decennial Census, we use the get_decennial() function. We specify the geography as "county" and the output as "wide", so that each column is a variable and each row is a county in the United States. To load ACS data, we could use the get_acs() function instead. Below we select only a few variables for illustration.
povdata <- get_decennial(geography="county",
                         variables=c(totpopn="P001001",
                                     povty="P087002",
                                     ag_male="P049003",
                                     ag_female="P051045",
                                     manu_male="P049007",
                                     manu_female="P049034",
                                     retail_male="P049009",
                                     retail_female="P049036"),
                         year=2000, output="wide")
str(povdata)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3219 obs. of 10 variables:
## $ GEOID : chr "01001" "01003" "01005" "01007" ...
## $ NAME : chr "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
## $ totpopn : num 43671 140415 29038 20826 51024 ...
## $ povty : num 4738 14018 7032 4091 5930 ...
## $ ag_male : num 355 929 376 382 583 261 362 335 211 282 ...
## $ ag_female : num 96 237 49 65 114 99 56 81 30 88 ...
## $ manu_male : num 2378 5507 2067 1370 3041 ...
## $ manu_female : num 851 2388 1131 522 1391 ...
## $ retail_male : num 1297 4026 530 343 1319 ...
## $ retail_female: num 1220 4913 541 448 1316 ...
head(povdata)
## # A tibble: 6 x 10
## GEOID NAME totpopn povty ag_male ag_female manu_male manu_female
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 Auta~ 43671 4738 355 96 2378 851
## 2 01003 Bald~ 140415 14018 929 237 5507 2388
## 3 01005 Barb~ 29038 7032 376 49 2067 1131
## 4 01007 Bibb~ 20826 4091 382 65 1370 522
## 5 01009 Blou~ 51024 5930 583 114 3041 1391
## 6 01011 Bull~ 11714 3405 261 99 460 339
## # ... with 2 more variables: retail_male <dbl>, retail_female <dbl>
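To see what output="wide" changes, the sketch below reshapes a small long-format table by hand. It assumes the default (long) layout has one row per county and variable, with variable and value columns; the four values are taken from the first two counties printed above, and the names long_demo and wide_demo are hypothetical.

```r
library(dplyr)
library(tidyr)

# Long layout: one row per county-variable pair.
long_demo <- tibble(
  GEOID    = c("01001", "01001", "01003", "01003"),
  NAME     = c("Autauga County", "Autauga County",
               "Baldwin County", "Baldwin County"),
  variable = c("totpopn", "povty", "totpopn", "povty"),
  value    = c(43671, 4738, 140415, 14018)
)

# Wide layout: one row per county, one column per variable.
wide_demo <- long_demo %>%
  pivot_wider(names_from = variable, values_from = value)

print(wide_demo)
```

The wide layout is convenient here because the next step combines several variables from the same county in a single mutate() call.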
Clean the data using dplyr by converting the explanatory variables and the response to rates per 1,000 of the total population:
pov <- povdata %>%
  dplyr::mutate(poverty = (povty/totpopn)*1000,
                ag = ((ag_male + ag_female)/totpopn)*1000,
                manu = ((manu_male + manu_female)/totpopn)*1000,
                retail = ((retail_male + retail_female)/totpopn)*1000,
                STATE = substr(GEOID,1,2)) %>%
  dplyr::select(GEOID, STATE, NAME, totpopn, poverty, ag, manu, retail)
write.csv(pov, "../data/supplement_poverty.csv")
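One thing to watch when the file is read back is that GEOID and STATE are codes with leading zeros, which read.csv() converts to numbers by default. The sketch below is a hypothetical round trip through a temporary file, using two rows built from the values printed above; it passes colClasses to keep the codes as text and, unlike the call above, sets row.names=FALSE so no extra index column is written.

```r
library(dplyr)

# Two counties from the output above, processed the same way as `pov`.
demo <- tibble(
  GEOID   = c("01001", "01003"),
  NAME    = c("Autauga County", "Baldwin County"),
  totpopn = c(43671, 140415),
  povty   = c(4738, 14018)
) %>%
  mutate(poverty = (povty / totpopn) * 1000,  # rate per 1,000 of total population
         STATE   = substr(GEOID, 1, 2)) %>%
  select(GEOID, STATE, NAME, totpopn, poverty)

# Write to a temporary file (hypothetical path) and read it back.
tmp <- tempfile(fileext = ".csv")
write.csv(demo, tmp, row.names = FALSE)

naive <- read.csv(tmp)  # GEOID is parsed as a number; leading zeros are lost
kept  <- read.csv(tmp, colClasses = c(GEOID = "character", STATE = "character"))

print(naive$GEOID)
print(kept$GEOID)
```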