library(tidycensus)
library(tidyverse)
library(dplyr)
There are several ways to load data into the R environment for analysis. Using the tidycensus package, we can query the Census Bureau directly for data after obtaining a Census API key (http://api.census.gov/data/key_signup.html).
From the tidycensus package, we use the census_api_key() function and specify install=TRUE so that the API key is saved to the .Renviron file and remains available in future R sessions. To see which variables are available in the 2000 Census Summary File 3 (SF3) (https://www.census.gov/census2000/sumfile3.html), we use the load_variables() function. Note that tidycensus can also load ACS (American Community Survey) data as well as other census years.
census_api_key("YOUR API KEY HERE", install=TRUE)
vars <- load_variables(2000, "sf3", cache=TRUE)
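Before making any data requests, it can help to confirm that the key is actually visible to the current session. The sketch below assumes the key is stored under the environment variable CENSUS_API_KEY in ~/.Renviron, which is where census_api_key(install=TRUE) writes it; a freshly installed key may not be picked up until R is restarted or the file is re-read. The names renviron and has_key are only illustrative.

```r
# Re-read ~/.Renviron if it exists, so a newly installed key becomes visible
# without restarting R.
renviron <- path.expand("~/.Renviron")
if (file.exists(renviron)) {
  readRenviron(renviron)
}

# tidycensus looks the key up under the CENSUS_API_KEY environment variable.
has_key <- nzchar(Sys.getenv("CENSUS_API_KEY"))

if (!has_key) {
  message("No CENSUS_API_KEY found; run census_api_key() first.")
}
```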
To explore the variables, you can use the View() function to browse and filter the variables available to you.
View(vars)
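View() is interactive, so a scripted search can be handy as well. The following is a minimal sketch, assuming a table with the same three columns that load_variables() returns (name, label, concept). The helper search_vars() and the stand-in table vars_demo are hypothetical, and the labels in vars_demo are placeholders rather than text copied from the Census API.

```r
# Stand-in for the table returned by load_variables(); same column names,
# but the labels and concepts here are placeholders for illustration only.
vars_demo <- data.frame(
  name    = c("P001001", "P087002", "P049007"),
  label   = c("Total population", "Income below poverty level", "Male: Manufacturing"),
  concept = c("TOTAL POPULATION", "POVERTY STATUS", "SEX BY INDUSTRY"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose label or concept matches a pattern,
# ignoring case.
search_vars <- function(vars, pattern) {
  hit <- grepl(pattern, vars$label, ignore.case = TRUE) |
    grepl(pattern, vars$concept, ignore.case = TRUE)
  vars[hit, , drop = FALSE]
}

print(search_vars(vars_demo, "poverty"))
```

With the real table, the same call would be search_vars(vars, "poverty").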
To download selected variables from the decennial Census, we use the get_decennial() function. We set the geography to “county” and the output to “wide”, so each column is a variable and each row is a county in the United States. To load ACS data, we could use the get_acs() function instead. Below we select only a few variables for illustration.
povdata <- get_decennial(geography="county",
                         variables=c(totpopn="P001001",
                                     povty="P087002",
                                     ag_male="P049003",
                                     ag_female="P051045",
                                     manu_male="P049007",
                                     manu_female="P049034",
                                     retail_male="P049009",
                                     retail_female="P049036"),
                         year=2000,
                         sumfile="sf3",  # request Summary File 3 explicitly
                         output="wide")
str(povdata)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3219 obs. of 10 variables:
## $ GEOID : chr "01001" "01003" "01005" "01007" ...
## $ NAME : chr "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
## $ totpopn : num 43671 140415 29038 20826 51024 ...
## $ povty : num 4738 14018 7032 4091 5930 ...
## $ ag_male : num 355 929 376 382 583 261 362 335 211 282 ...
## $ ag_female : num 96 237 49 65 114 99 56 81 30 88 ...
## $ manu_male : num 2378 5507 2067 1370 3041 ...
## $ manu_female : num 851 2388 1131 522 1391 ...
## $ retail_male : num 1297 4026 530 343 1319 ...
## $ retail_female: num 1220 4913 541 448 1316 ...
head(povdata)
## # A tibble: 6 x 10
## GEOID NAME totpopn povty ag_male ag_female manu_male manu_female
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01001 Auta~ 43671 4738 355 96 2378 851
## 2 01003 Bald~ 140415 14018 929 237 5507 2388
## 3 01005 Barb~ 29038 7032 376 49 2067 1131
## 4 01007 Bibb~ 20826 4091 382 65 1370 522
## 5 01009 Blou~ 51024 5930 583 114 3041 1391
## 6 01011 Bull~ 11714 3405 261 99 460 339
## # ... with 2 more variables: retail_male <dbl>, retail_female <dbl>
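To make the “wide” layout requested above concrete, here is a small sketch contrasting it with the long layout that tidycensus uses by default (output="tidy", with one row per county and variable, held in variable and value columns). It uses base R only and a hand-typed data frame with values for the first two counties taken from the output above, so no API call is needed. The object names long and wide are illustrative.

```r
# Long ("tidy") layout: one row per county-variable pair.
# Values are the first two counties from the head(povdata) output above.
long <- data.frame(
  GEOID    = c("01001", "01001", "01003", "01003"),
  NAME     = c("Autauga County", "Autauga County",
               "Baldwin County", "Baldwin County"),
  variable = c("totpopn", "povty", "totpopn", "povty"),
  value    = c(43671, 4738, 140415, 14018),
  stringsAsFactors = FALSE
)

# Base R reshape() spreads the variable column into one column per variable.
wide <- reshape(long, idvar = c("GEOID", "NAME"),
                timevar = "variable", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))  # drop the "value." prefix
rownames(wide) <- NULL

print(wide)
```

The wide form is convenient here because the later cleaning step combines several variables from the same county in a single row.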
Clean the data using dplyr by converting the explanatory variables and the response to rates per 1,000 total population, and by extracting the state FIPS code from the first two characters of GEOID:
pov <- povdata %>%
  dplyr::mutate(poverty = (povty/totpopn)*1000,
                ag = ((ag_male + ag_female)/totpopn)*1000,
                manu = ((manu_male + manu_female)/totpopn)*1000,
                retail = ((retail_male + retail_female)/totpopn)*1000,
                STATE = substr(GEOID,1,2)) %>%
  dplyr::select(GEOID, STATE, NAME, totpopn, poverty, ag, manu, retail)
write.csv(pov, "../data/supplement_poverty.csv")
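As a quick sanity check on the rescaling, the sketch below repeats the poverty and STATE calculations by hand for the first two counties shown in the head(povdata) output. It uses base R and a small hand-typed data frame (demo, a hypothetical name) rather than a live API call.

```r
# First two rows of povdata, typed in from the head() output.
demo <- data.frame(
  GEOID   = c("01001", "01003"),
  totpopn = c(43671, 140415),
  povty   = c(4738, 14018),
  stringsAsFactors = FALSE
)

# Same arithmetic as the mutate() call: people in poverty per 1,000 residents.
demo$poverty <- (demo$povty / demo$totpopn) * 1000

# The first two characters of a county GEOID are the state FIPS code.
demo$STATE <- substr(demo$GEOID, 1, 2)

print(demo)
```

Because the multiplier is 1000 rather than 100, a poverty value of about 108 for Autauga County corresponds to roughly 10.8 percent of the total population.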