Using tidycensus

library(tidycensus)
library(tidyverse)
library(dplyr)

There are several ways to load data into the R environment for analysis. Using the tidycensus package, we can query the Census Bureau directly for data after obtaining a Census API key (http://api.census.gov/data/key_signup.html).

From the tidycensus package, we use the census_api_key() function and specify install=TRUE so that the API key is saved to the .Renviron file and is available in future R sessions. To see which variables are available in the 2000 Census Summary File 3 (SF3) (https://www.census.gov/census2000/sumfile3.html), we use the load_variables() function. Note that tidycensus can also load ACS (American Community Survey) data, as well as data from other census years.

census_api_key("YOUR API KEY HERE", install=TRUE)

vars <- load_variables(2000, "sf3", cache=TRUE)
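Before querying, it can help to confirm that the key is visible to the current session. tidycensus stores the key in the CENSUS_API_KEY environment variable, so a base R check is enough. The following is a minimal sketch; the helper name has_census_key() is made up for illustration and is not part of tidycensus.

```r
# Hypothetical helper (not part of tidycensus): TRUE if a Census API key
# is visible to the current R session.
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_API_KEY"))
}

if (!has_census_key()) {
  message("No Census API key found; run census_api_key() and restart R, ",
          "or call readRenviron(\"~/.Renviron\").")
}
```

A key installed with install=TRUE only becomes visible after R is restarted or .Renviron is re-read, which is why the message suggests both.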

To explore the variables, you can open the table with the View() function and then browse and filter what is available to you.

View(vars)
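View() is interactive, so a scripted search can be useful when the variable table is large. The sketch below assumes that vars has the label and concept columns returned by load_variables(); the helper find_vars() is a hypothetical name introduced here, written in base R so that it needs no extra packages.

```r
# Hypothetical helper: keep rows whose label or concept matches a pattern.
# Assumes `vars` has the columns `label` and `concept`, as returned by
# load_variables().
find_vars <- function(vars, pattern) {
  hit <- grepl(pattern, vars$label, ignore.case = TRUE) |
    grepl(pattern, vars$concept, ignore.case = TRUE)
  vars[hit, ]
}

# Example, once `vars` has been loaded as above:
# find_vars(vars, "poverty")
```

The match is case-insensitive, and rows with a missing concept are simply treated as non-matching on that column.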

First six observations from loaded Census data

To download selected variables from the decennial Census, we use the get_decennial() function. We set the geography to "county" and the output to "wide", so that each column is a variable and each row is a county in the United States. To load ACS data, we could use the get_acs() function instead. Below we select only a few variables for illustration.

povdata <- get_decennial(geography = "county",
                         variables = c(totpopn = "P001001",
                                       povty = "P087002",
                                       ag_male = "P049003",
                                       ag_female = "P051045",
                                       manu_male = "P049007",
                                       manu_female = "P049034",
                                       retail_male = "P049009",
                                       retail_female = "P049036"),
                         year = 2000, output = "wide")
str(povdata)

## Classes 'tbl_df', 'tbl' and 'data.frame':    3219 obs. of  10 variables:
##  $ GEOID        : chr  "01001" "01003" "01005" "01007" ...
##  $ NAME         : chr  "Autauga County" "Baldwin County" "Barbour County" "Bibb County" ...
##  $ totpopn      : num  43671 140415 29038 20826 51024 ...
##  $ povty        : num  4738 14018 7032 4091 5930 ...
##  $ ag_male      : num  355 929 376 382 583 261 362 335 211 282 ...
##  $ ag_female    : num  96 237 49 65 114 99 56 81 30 88 ...
##  $ manu_male    : num  2378 5507 2067 1370 3041 ...
##  $ manu_female  : num  851 2388 1131 522 1391 ...
##  $ retail_male  : num  1297 4026 530 343 1319 ...
##  $ retail_female: num  1220 4913 541 448 1316 ...

head(povdata)

## # A tibble: 6 x 10
##   GEOID NAME  totpopn povty ag_male ag_female manu_male manu_female
##   <chr> <chr>   <dbl> <dbl>   <dbl>     <dbl>     <dbl>       <dbl>
## 1 01001 Auta~   43671  4738     355        96      2378         851
## 2 01003 Bald~  140415 14018     929       237      5507        2388
## 3 01005 Barb~   29038  7032     376        49      2067        1131
## 4 01007 Bibb~   20826  4091     382        65      1370         522
## 5 01009 Blou~   51024  5930     583       114      3041        1391
## 6 01011 Bull~   11714  3405     261        99       460         339
## # ... with 2 more variables: retail_male <dbl>, retail_female <dbl>
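For ACS data, a get_acs() call looks much the same as the get_decennial() call. The following is only a sketch: the wrapper name get_county_income(), the year 2017, and the choice of variable B19013_001 (median household income) are assumptions made for illustration and do not come from the analysis above. Running it requires the tidycensus package, an installed API key, and network access.

```r
# Sketch only: the wrapper name, year, and variable are illustrative choices.
get_county_income <- function(year = 2017) {
  tidycensus::get_acs(geography = "county",
                      variables = c(medincome = "B19013_001"),  # median household income
                      year = year,
                      output = "wide")
}

# income <- get_county_income()
# head(income)
```

With output = "wide", get_acs() appends E (estimate) and M (margin of error) to each variable name, so this call would yield the columns medincomeE and medincomeM.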

Clean the data using dplyr by converting the response and explanatory variables to rates per 1,000 residents and extracting the two-digit state FIPS code from GEOID:

pov <- povdata %>%
    dplyr::mutate(poverty = (povty/totpopn)*1000,
           ag = ((ag_male + ag_female)/totpopn)*1000,
           manu = ((manu_male + manu_female)/totpopn)*1000,
           retail = ((retail_male +  retail_female)/totpopn)*1000,
           STATE = substr(GEOID,1,2)) %>%
    dplyr::select(GEOID, STATE, NAME, totpopn, poverty, ag, manu, retail)
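The mutate() call multiplies each share by 1000, so the derived columns are counts per 1,000 residents. As a worked check, the sketch below reproduces the calculation for Autauga County using the values shown in the str() output; the helper rate_per_1000() is a hypothetical name introduced here.

```r
# Hypothetical helper mirroring the mutate() step: count per 1,000 residents.
rate_per_1000 <- function(count, total) {
  (count / total) * 1000
}

# Autauga County values from the str() output above
totpopn <- 43671
povty <- 4738
ag_male <- 355
ag_female <- 96

poverty <- rate_per_1000(povty, totpopn)
ag <- rate_per_1000(ag_male + ag_female, totpopn)
state <- substr("01001", 1, 2)  # first two characters of GEOID

round(poverty, 1)  # 108.5
round(ag, 1)       # 10.3
state              # "01"
```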

write.csv(pov, "../data/supplement_poverty.csv")
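One thing to watch when reading the file back: by default read.csv() converts GEOID and STATE to numbers, which drops the leading zeros ("01001" becomes 1001). The sketch below shows one way to avoid this with colClasses. It uses a two-row stand-in for pov and a temporary file rather than the file written above.

```r
# Two-row stand-in for `pov`, using GEOID and NAME values shown earlier
pov_demo <- data.frame(GEOID = c("01001", "01003"),
                       STATE = c("01", "01"),
                       NAME = c("Autauga County", "Baldwin County"),
                       stringsAsFactors = FALSE)

tmp <- tempfile(fileext = ".csv")  # temporary path, not the tutorial's file
write.csv(pov_demo, tmp, row.names = FALSE)

# Read the codes as character so "01001" does not become 1001
pov_back <- read.csv(tmp,
                     colClasses = c(GEOID = "character", STATE = "character"),
                     stringsAsFactors = FALSE)
pov_back$GEOID
```

Setting row.names = FALSE is a choice made for this sketch; the write.csv() call above keeps the default, which writes the row numbers as an extra first column.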


Maria Kamenetsky

January 2019