This web report is accessible at https://tin6150.github.io/phw251_group_z/milestone6_groupZ.html
Code used for analsys and visual generation is available at our github repo.
~~ This is a “verbose” version, where all code blocks are shown ~~
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
#library(kableExtra) ## no longer needed in this version
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(formattable) # for currency()
##
## Attaching package: 'formattable'
##
## The following object is masked from 'package:plotly':
##
## style
library(DT) # and NOT data.table
# omit warning messages from library load
# https://stackoverflow.com/questions/45399587/how-to-remove-warning-messages-in-r-markdown-document
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
## Loading & Cleaning data : Mortality
#**Suggestions from Lauren:
# -confirm we're not double counting deaths in mortality data
# Yes we are filtering by Total Population
# -summarize mortality rates by county and use rates instead of counts before joining
# Okay now calculated Rate as count / pop12
# Loading Mortality Data
mortality_path <- 'data/ca_county_mortality.csv'
mortality_data_raw <- read_csv(mortality_path,
na= c("", "NA", "-"),
show_col_types=F)
mortality_data <- mortality_data_raw %>% mutate_all(~replace( ., is.na(.), 0))
# View(mortality_data)
# There are NA values in mortality_data, so we need to replace NA w/ 0
mortality_data %>% select( Year ) %>% unique()
## # A tibble: 7 × 1
## Year
## <dbl>
## 1 2014
## 2 2015
## 3 2016
## 4 2017
## 5 2018
## 6 2019
## 7 2020
# Mortality data is from 2014 - 2020.
mortality_data %>% select( "Cause_Desc" ) %>% unique() %>% tally()
## # A tibble: 1 × 1
## n
## <int>
## 1 15
# contain 15 diseases, but one is "All causes (total)"
# subset mortality data for chronic conditions only
chronic_desc = c(
"Chronic lower respiratory diseases",
"Diabetes mellitus",
"Diseases of heart",
"Essential hypertension and hypertensive renal disease",
"Chronic liver disease and cirrhosis",
"Alzheimer's disease",
"Malignant neoplasms",
"Nephritis, nephrotic syndrome and nephrosis",
"Cerebrovascular diseases",
"Parkinson's disease"
)
str( chronic_desc )
## chr [1:10] "Chronic lower respiratory diseases" "Diabetes mellitus" ...
chronic_mortality_data = mortality_data %>%
filter( Cause_Desc %in% chronic_desc )
num_rows_befor_filter = mortality_data %>% tally
num_rows_after_filter = chronic_mortality_data %>% tally
#num_rows_after_filter
num_rows_filtered = num_rows_befor_filter - num_rows_after_filter
num_rows_filtered
## n
## 1 58464
#View(chronic_mortality_data)
# chronic case count by county in the last 5 years
# Question prompt:
## Second, OHE’s director would also like to include a total count of mortality
## from chronic health conditions
## over the past few years into the county level analysis.
chronic_by_county = chronic_mortality_data %>%
filter( Strata == "Total Population") %>%
filter( Geography_Type == "Occurrence" ) %>%
select( -Geography_Type ) %>% # Geo type restricted to "Occurrence"
filter( Year >= 2016 ) %>% # 2016-2020 = 5 years
group_by( County ) %>%
summarize( sum_count5 = sum(Count), # this is a 5 year count!
avg_count = sum(Count) / 5 )
# Loading HCAI funding Data
funding_path = 'data/hcai_healthcare_construction.csv'
funding_data <- read_csv( funding_path,
na= c("", "NA", "-"),
show_col_types=F )
CtyColl = select(funding_data, c("Collection of Counties")) %>%
unique()
CtyColl
## # A tibble: 4 × 1
## `Collection of Counties`
## <chr>
## 1 Bay Area Counties
## 2 <NA>
## 3 Greater Sacramento Counties
## 4 Greater Los Angeles Counties
## there are 4 collection of counties,
## there aren't likely needed for data eval,
## but add lots of NA, so dropping them
#1 Bay Area Counties
#2 NA
#3 Greater Sacramento Counties
#4 Greater Los Angeles Counties
# finding where in the data frame there is an 'na'
# https://www.geeksforgeeks.org/find-columns-and-rows-with-na-in-r-dataframe/
## funding_data_no_CtyColl = select(funding_data, -c("Collection of Counties"))
funding_data = select(funding_data, -c("Collection of Counties"))
## which(is.na(funding_data_no_CtyColl), arr.ind=T)
# above returns 0 rows, ie
# we find that only the column "Collection of Counties" has 'na'
# we will leave this for now since it may just be a colloquial reference
# unimportant for our data analysis.
# no replacement for na with 0 will be done on this data frame.
# the Costs column has human data, eg $50,890,315.00
# and we need to strip the dollar sign, the commas,
# and convert them to numbers.
# ref: https://stackoverflow.com/questions/31944103/convert-currency-with-commas-into-numeric
# we create a new column for this called "Numeric_Cost",
# but could have potentially done an in-place replacement
funding_data = funding_data %>%
mutate(Numeric_Cost = as.numeric(
gsub( '[$,]', '', funding_data[["Total Costs of OSHPD Projects"]] )
))
funding_data = funding_data %>%
mutate( county_name = str_sub( County, 6 )) %>%
subset(select = -c(County)) %>%
rename(County = county_name)
# subset HCAI funding data for those "In Closure" state
funding_data_closure <- funding_data %>%
filter(`OSHPD Project Status` == "In Closure") %>%
filter(`Data Generation Date` == as_date("2022-08-11") )
## 2022-08-11 is the latest avail data
funding_data_closure %>% arrange( Numeric_Cost ) %>% head( 10 )
## # A tibble: 10 × 6
## `Data Generation Date` `OSHPD Project Status` Total …¹ Numbe…² Numer…³ County
## <dttm> <chr> <chr> <dbl> <dbl> <chr>
## 1 2022-08-11 00:00:00 In Closure 0 0 0 Alpine
## 2 2022-08-11 00:00:00 In Closure 0 0 0 Amador
## 3 2022-08-11 00:00:00 In Closure 0 0 0 Butte
## 4 2022-08-11 00:00:00 In Closure 0 0 0 Calav…
## 5 2022-08-11 00:00:00 In Closure 0 0 0 Colusa
## 6 2022-08-11 00:00:00 In Closure 0 0 0 Del N…
## 7 2022-08-11 00:00:00 In Closure 0 0 0 Glenn
## 8 2022-08-11 00:00:00 In Closure 0 0 0 Humbo…
## 9 2022-08-11 00:00:00 In Closure 0 0 0 Imper…
## 10 2022-08-11 00:00:00 In Closure 0 0 0 Inyo
## # … with abbreviated variable names ¹`Total Costs of OSHPD Projects`,
## # ²`Number of OSHPD Projects`, ³Numeric_Cost
#View(funding_data)
#View(funding_data_closure)
funding_data %>% select( `Data Generation Date` ) %>% unique() %>% tally() ## 231 rows
## # A tibble: 1 × 1
## n
## <int>
## 1 231
selected_counties = c(
"Inyo",
"Siskiyou",
"Mariposa",
"Modoc",
"Plumas"
)
funding_data_selected_counties = funding_data %>%
#filter( County == "Alameda" ) %>%
#filter( County == "Inyo" ) %>%
#filter( County %in% selected_counties ) %>%
filter( `OSHPD Project Status` == "In Closure" |
`OSHPD Project Status` == "In Construction" ) %>%
#filter( `OSHPD Project Status` == "Pending Construction" ) %>%
#filter( `OSHPD Project Status` == "In Review" ) %>%
filter( `Data Generation Date` ==
as_date( "2022-08-11" )
#as_date( "2020-01-01" )
#as_date( "2016-01-01" )
)
chk1 =
ggplot( data = funding_data_selected_counties,
aes( x = `Data Generation Date`,
y = Numeric_Cost ) ) +
geom_bar( stat="identity", position="dodge" ) +
labs( title = 'Prj cost over time' ) +
facet_wrap( ~ County )
chk1
## some number may linger in closure for longer than other?
## so taking average of closure amount over say 5 months may lead to skewed data
## really just make sense to just use the last data point.
funding_data_allSources = funding_data %>%
#filter( County == "Alameda" ) %>%
#filter( County == "Mariposa" ) %>%
#filter( County == "Inyo" ) %>%
filter( County %in% selected_counties ) %>%
filter( `Data Generation Date` >
#as_date( "2022-01-01" )
#as_date( "2020-01-01" )
as_date( "2016-01-01" )
)
chk_allSrc =
ggplot( data = funding_data_allSources,
aes( x = `Data Generation Date`,
y = Numeric_Cost ) ) +
geom_bar( stat="identity", position="dodge" ) +
labs( title = '1 county, prj stage' ) +
facet_wrap( ~ `OSHPD Project Status` )
chk_allSrc
## there is likely more to the story about these costs
## expect to see approved project to have a peak in Review
## then the peak move to Pending Construction, In Construction
## then In Closure
## but maybe their cost is finance book remaining budget account?
## data doesn't fully make sense
## but it is just homework, take a snapshot of "In Closure", and let it be.
# Load demographics data
demographics_path = 'data/ca_county_demographics.csv'
demographics_data = read_csv( demographics_path,
na = c("", "NA", "-"),
show_col_types=F )
# the first column contains id number, but it is unamed, so renaming it
# rest of the columns have reasonable names, so left them as is.
demographics_data = rename( demographics_data, id="...1")
# View(demographics_data)
# massaging demographics data
demographics_data <- demographics_data %>%
select(id, name, pop2012, pop12_sqmi, med_age, owner_occ, renter_occ) %>%
rename( `County` = name )
## Determining Renters vs Homeowners Ratio
demographics_data = demographics_data %>%
mutate( rent_own_ratio = renter_occ / owner_occ )
demographics_data <-
mutate(demographics_data,
rural_class=if_else( pop12_sqmi < 20, "rural", "not rural", missing=NULL))
# column to show whether county was rural or not
rural_not_rural <- demographics_data %>%
select( County, rural_class, pop12_sqmi, rent_own_ratio, med_age )
rural_counties = rural_not_rural %>%
filter( rural_class == "rural" )
#rename( `Class` = rural_class )
#### consider a "low pop density" rather than a strict "rural"
#### --> not really good, 49 counties in this grouping
avg_pop_density = mean( demographics_data$pop12_sqmi )
demographics_data = demographics_data %>%
mutate( low_pop = if_else( pop12_sqmi < avg_pop_density ,
TRUE,
FALSE,
NULL ))
high_rental_ratio =
rural_counties %>% arrange( desc( rent_own_ratio ) )
high_med_age =
rural_counties %>% arrange( desc( med_age ) )
## Determine any counties satisfy triple whammy rule
avg_rent_ratio = mean( demographics_data$rent_own_ratio )
avg_age = mean( demographics_data$med_age )
demographics_data = demographics_data %>%
mutate( high_med_age = if_else( med_age > avg_age ,
TRUE,
FALSE,
NULL )) %>%
mutate( high_rental = if_else( rent_own_ratio > avg_rent_ratio ,
TRUE,
FALSE,
NULL ))
triple_whammy = demographics_data %>%
filter( rural_class == "rural" &
high_rental == TRUE &
high_med_age == TRUE )
### so 0 county satisfy the triple whammy rule
### need to modify selection criteria
## Use only RURAL counties to calculate their averages, and what's above average
## Determine any RURAL counties satisfy double whammy rule
#View(rural_counties)
avg_rent_ratio_rural = mean( rural_counties$rent_own_ratio )
avg_age_rural = mean( rural_counties$med_age )
rural_demographics_data = demographics_data %>%
mutate( high_med_age_rural = if_else( med_age > avg_age_rural ,
TRUE,
FALSE,
NULL )) %>%
mutate( high_rental_rural = if_else( rent_own_ratio > avg_rent_ratio_rural ,
TRUE,
FALSE,
NULL ))
#-- c( avg_rent_ratio, avg_rent_ratio_rural, avg_age, avg_age_rural )
#-- sn50 = rural_demographics_data %>% filter( rural_class == "rural" )
rural_double_whammy = rural_demographics_data %>%
filter(
high_rental_rural == TRUE &
high_med_age_rural == TRUE )
rural_double_whammy # only 3 counties in this list: Lake, Siskiyou, Inyo
## # A tibble: 3 × 14
## id County pop2012 pop12…¹ med_age owner…² rente…³ rent_…⁴ rural…⁵ low_pop
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
## 1 3 Lake 65253 49.1 45 17472 9076 0.519 not ru… TRUE
## 2 33 Siskiyou 45200 7.12 46.8 12629 6876 0.544 rural TRUE
## 3 58 Inyo 18611 1.82 45.5 5121 2928 0.572 rural TRUE
## # … with 4 more variables: high_med_age <lgl>, high_rental <lgl>,
## # high_med_age_rural <lgl>, high_rental_rural <lgl>, and abbreviated variable
## # names ¹pop12_sqmi, ²owner_occ, ³renter_occ, ⁴rent_own_ratio, ⁵rural_class
rural_demographics_data %>% arrange( rent_own_ratio, med_age ) %>% head( 6 )
## # A tibble: 6 × 14
## id County pop2012 pop12…¹ med_age owner…² rente…³ rent_…⁴ rural…⁵ low_pop
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
## 1 42 Calaver… 46212 44.6 49.1 14520 4366 0.301 not ru… TRUE
## 2 38 Amador 38354 63.3 48.2 10883 3686 0.339 not ru… TRUE
## 3 51 El Dora… 182494 102. 43.5 51391 18832 0.366 not ru… TRUE
## 4 15 Nevada 99951 103. 47.5 29890 11637 0.389 not ru… TRUE
## 5 32 Sierra 3226 3.35 51 1065 417 0.392 rural TRUE
## 6 36 Alpine 1148 1.54 46.4 357 140 0.392 rural TRUE
## # … with 4 more variables: high_med_age <lgl>, high_rental <lgl>,
## # high_med_age_rural <lgl>, high_rental_rural <lgl>, and abbreviated variable
## # names ¹pop12_sqmi, ²owner_occ, ³renter_occ, ⁴rent_own_ratio, ⁵rural_class
## determine old age as > avg age for full data set
## then pick rural countries,
## high rental in such rural counties.
#View(rural_counties)
focus_demographics_data = demographics_data %>%
mutate( high_rental_rural = if_else( rent_own_ratio > avg_rent_ratio_rural ,
TRUE,
FALSE,
NULL )) %>%
mutate( high_med_age = if_else( med_age > avg_age ,
TRUE,
FALSE,
NULL ))
#-- c( avg_rent_ratio, avg_rent_ratio_rural, avg_age, avg_age_rural )
#-- sn50 = focus_demographics_data %>% filter( rural_class == "rural" )
focus_dem = rural_demographics_data %>%
filter(
high_rental_rural == TRUE &
high_med_age == TRUE ) %>%
select( County, rural_class, high_med_age, high_rental_rural )
rural_demographics_data %>%
select( County, rural_class, med_age, high_med_age, high_rental_rural ) %>%
arrange( high_rental_rural, desc(med_age) )
## # A tibble: 58 × 5
## County rural_class med_age high_med_age high_rental_rural
## <chr> <chr> <dbl> <lgl> <lgl>
## 1 Sierra rural 51 TRUE FALSE
## 2 Plumas rural 49.5 TRUE FALSE
## 3 Mariposa rural 49.2 TRUE FALSE
## 4 Trinity rural 49.2 TRUE FALSE
## 5 Calaveras not rural 49.1 TRUE FALSE
## 6 Amador not rural 48.2 TRUE FALSE
## 7 Nevada not rural 47.5 TRUE FALSE
## 8 Tuolumne not rural 47.1 TRUE FALSE
## 9 Alpine rural 46.4 TRUE FALSE
## 10 Modoc rural 46 TRUE FALSE
## # … with 48 more rows
#View(focus_dem) # now have 13 counties to choose from, but not all rural
## decide if need to output any of these tables
## likely need to refine further
high_rental_ratio %>% head( 6 )
## # A tibble: 6 × 5
## County rural_class pop12_sqmi rent_own_ratio med_age
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Mono rural 4.60 0.787 37.2
## 2 Colusa rural 18.8 0.634 33.5
## 3 Inyo rural 1.82 0.572 45.5
## 4 Siskiyou rural 7.12 0.544 46.8
## 5 Lassen rural 7.42 0.526 37
## 6 Mariposa rural 12.6 0.472 49.2
high_med_age %>% head( 6 )
## # A tibble: 6 × 5
## County rural_class pop12_sqmi rent_own_ratio med_age
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Sierra rural 3.35 0.392 51
## 2 Plumas rural 7.65 0.440 49.5
## 3 Mariposa rural 12.6 0.472 49.2
## 4 Trinity rural 4.38 0.420 49.2
## 5 Siskiyou rural 7.12 0.544 46.8
## 6 Alpine rural 1.54 0.392 46.4
rural_double_whammy %>% head( 6 )
## # A tibble: 3 × 14
## id County pop2012 pop12…¹ med_age owner…² rente…³ rent_…⁴ rural…⁵ low_pop
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <lgl>
## 1 3 Lake 65253 49.1 45 17472 9076 0.519 not ru… TRUE
## 2 33 Siskiyou 45200 7.12 46.8 12629 6876 0.544 rural TRUE
## 3 58 Inyo 18611 1.82 45.5 5121 2928 0.572 rural TRUE
## # … with 4 more variables: high_med_age <lgl>, high_rental <lgl>,
## # high_med_age_rural <lgl>, high_rental_rural <lgl>, and abbreviated variable
## # names ¹pop12_sqmi, ²owner_occ, ³renter_occ, ⁴rent_own_ratio, ⁵rural_class
focus_dem %>% head( 13 )
## # A tibble: 13 × 4
## County rural_class high_med_age high_rental_rural
## <chr> <chr> <lgl> <lgl>
## 1 Lake not rural TRUE TRUE
## 2 Marin not rural TRUE TRUE
## 3 Mendocino not rural TRUE TRUE
## 4 Napa not rural TRUE TRUE
## 5 San Francisco not rural TRUE TRUE
## 6 San Luis Obispo not rural TRUE TRUE
## 7 San Mateo not rural TRUE TRUE
## 8 Shasta not rural TRUE TRUE
## 9 Siskiyou rural TRUE TRUE
## 10 Sonoma not rural TRUE TRUE
## 11 Tehama not rural TRUE TRUE
## 12 Del Norte not rural TRUE TRUE
## 13 Inyo rural TRUE TRUE
#It doesn't seems like we have high rental (ie > avg_rent) and old age (>avg_age)
#so we will have to judge subjectively
#-Tin
# calc chronic mortality rate using census pop12 as denominator
# First need to Join Chronics mortality data (by county, last 5 years)
# and demographics data
# the avg_count was already div by 5 earlier in the code.
demographics_chronic = left_join( demographics_data,
chronic_by_county,
by = "County" ) %>%
mutate( pct = ( avg_count / pop2012 ) * 100 ) %>%
mutate( per100k = ( avg_count / pop2012 ) * 100000 )
# pct = "prevalence" percentage
avg_chronic_mortality = mean( demographics_chronic$pct )
avg_chronic_mortality_per100k = mean( demographics_chronic$per100k )
fund_dem_chron = inner_join( demographics_chronic,
funding_data_closure,
by = "County" ) %>%
mutate( fund_per_cap = Numeric_Cost / pop2012 )
viz_fund_dem_chron = fund_dem_chron %>%
select( County,
pop2012,
pop12_sqmi,
rural_class,
#low_pop,
med_age,
high_med_age,
rent_own_ratio,
high_rental,
avg_count,
pct, # prevalence,
per100k, # prevalence, per 100k pop
Numeric_Cost,
fund_per_cap,
`Number of OSHPD Projects`,
)
# grabbing data from raw source
mortality_data_raw %>%
filter( County == "Siskiyou" ) %>%
filter( Year == 2020 ) %>%
filter( Strata == "Total Population" ) %>%
filter( Geography_Type == "Occurrence" ) %>%
filter( Cause_Desc %in% chronic_desc)
## # A tibble: 10 × 10
## Year County Geography…¹ Strata Strat…² Cause Cause…³ Count Annot…⁴ Annot…⁵
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 2020 Siskiyou Occurrence Total… Total … ALZ Alzhei… 26 NA <NA>
## 2 2020 Siskiyou Occurrence Total… Total … CAN Malign… 121 NA <NA>
## 3 2020 Siskiyou Occurrence Total… Total … CLD Chroni… 39 NA <NA>
## 4 2020 Siskiyou Occurrence Total… Total … DIA Diabet… 15 NA <NA>
## 5 2020 Siskiyou Occurrence Total… Total … HTD Diseas… 107 NA <NA>
## 6 2020 Siskiyou Occurrence Total… Total … HYP Essent… NA 1 Cell s…
## 7 2020 Siskiyou Occurrence Total… Total … LIV Chroni… NA 1 Cell s…
## 8 2020 Siskiyou Occurrence Total… Total … NEP Nephri… NA 1 Cell s…
## 9 2020 Siskiyou Occurrence Total… Total … PAR Parkin… NA 1 Cell s…
## 10 2020 Siskiyou Occurrence Total… Total … STK Cerebr… 26 NA <NA>
## # … with abbreviated variable names ¹Geography_Type, ²Strata_Name, ³Cause_Desc,
## # ⁴Annotation_Code, ⁵Annotation_Desc
# 26+121+39+15+107+26
# 334
# viz_fund_dem_chron is just fund_dem_chron with fewer columns
scm1 = viz_fund_dem_chron %>%
filter( County == "Siskiyou" )
chronic_by_county %>%
filter( County == "Siskiyou" )
## # A tibble: 1 × 3
## County sum_count5 avg_count
## <chr> <dbl> <dbl>
## 1 Siskiyou 1614 323.
scm2 = demographics_chronic %>%
filter( County == "Siskiyou" )
# Spot check for Mortality rate of Siskiyou
#pop12 for siskiyou is 45200
#sum_count for siskiyou is 1614
# as fraction is .03517 (ie 3.6%)
# as per 100,000 then 1614*100000/45200 = 3570.796
## this was a 5 years count ... fixed with avg_count
#### for viz, ponder about using these, or derivative thereof
# viz_fund_dem_chron is just fund_dem_chron with fewer columns
feel_for_data = viz_fund_dem_chron %>%
filter( rural_class == "rural" )
#filter( low_pop == TRUE ) # no good, 49 counties in this grouping
feel_for_data %>%
arrange( desc( fund_per_cap ) ) %>% head( 10 )
## # A tibble: 10 × 14
## County pop2012 pop12…¹ rural…² med_age high_…³ rent_…⁴ high_…⁵ avg_c…⁶ pct
## <chr> <dbl> <dbl> <chr> <dbl> <lgl> <dbl> <lgl> <dbl> <dbl>
## 1 Lassen 35039 7.42 rural 37 FALSE 0.526 FALSE 70.4 0.201
## 2 Marip… 18455 12.6 rural 49.2 TRUE 0.472 FALSE 80.6 0.437
## 3 Modoc 9791 2.33 rural 46 TRUE 0.459 FALSE 39.4 0.402
## 4 Mono 14418 4.60 rural 37.2 FALSE 0.787 TRUE 12.2 0.0846
## 5 Plumas 20000 7.65 rural 49.5 TRUE 0.440 FALSE 82.6 0.413
## 6 Sierra 3226 3.35 rural 51 TRUE 0.392 FALSE 0 0
## 7 Siski… 45200 7.12 rural 46.8 TRUE 0.544 FALSE 323. 0.714
## 8 Alpine 1148 1.54 rural 46.4 TRUE 0.392 FALSE 0 0
## 9 Colusa 21780 18.8 rural 33.5 FALSE 0.634 FALSE 56.6 0.260
## 10 Trini… 14063 4.38 rural 49.2 TRUE 0.420 FALSE 43.4 0.309
## # … with 4 more variables: per100k <dbl>, Numeric_Cost <dbl>,
## # fund_per_cap <dbl>, `Number of OSHPD Projects` <dbl>, and abbreviated
## # variable names ¹pop12_sqmi, ²rural_class, ³high_med_age, ⁴rent_own_ratio,
## # ⁵high_rental, ⁶avg_count
### above are all 0... which implies no fundings for them... should give them some love!?
The California Department of Public Health Office of Health Equity (OHE) recently issued a new policy to create a public-private partnership to improve healthcare facilities in five rural counties across the state. Our team will evaluate and recommend which counties should receive development funding proposals based on equitable selection criteria created by OHE. Specifically, we will explore data to identify which rural counties have more non-homeowners, aging individuals, higher chronic mortality rates, and have received minimal funding from the Department of Health Care Access and Information.
We used 3 datasets for this project:
The first dataset is from the 2012 Census and contains demographics information for each of the 58 counties in California. It includes information such as population per square mile, median age, number of households who are renters vs owners, ethnicity, genders, etc. We calculated the renter to owner ratio for each county. We then calculated the average age and population density for the whole state and visually inspected the data to see how each county stack up. We ended up using the National Rural Development Partnership’s definition to determine if a given county’s population density is to be classified as rural, for which there were 11.
The second dataset is the mortality surveillance obtained from the CA
Open Data Portal. It contains a breakdown of total mortality for each
county by 15 disease areas. We used the CDC definition
of chronic diseases as filter, for which 10 fit the criteria. The
data range from 2014 to 2020, but we were tasked to focus on the last 5
years, thus we applied a filter with Year >= 2016
. As
tasked, we also performed filters with
Geography_Type == "Occurrence"
and used
Strata == "Total Population"
to avoid over counting. Any
missing data were replaced with 0. Once the data was cleaned, we summed
all the disease occurrences within each county. We joined this with the
demographics data to obtain a mortality rate of chronic conditions over
5 years for each county.
The third dataset is the HCAI funding, also obtained from the CA Open Data Portal. It contains healthcare spending for each county in 4 stages of project progression, updated about every 2 weeks. We focused on the latest available data, which was Aug 11, 2022, and those with state of “In Closure”. Many rural counties showed up with $0 amount. As we went back to double check our selection code, we confirmed that funding was higher in large populous counties such as those around the greater Los Angeles and San Francisco. However, we also noted that while many rural counties had no funding for “In Closure”, they did report funding for projects “In Construction”. While our main variable focuses on “In Closure” to help our improvement plan drive new spending for rural counties with high and variable mortality rates, we also compared “In Construction” funding to help narrow selection.
After cleaning and filtering the 3 datasets above, we joined them by county name, whereby we can see which counties had high renters, high chronic mortality rates, and the funding they received.
Table 1 shows demographics and disease profile for the California’s
rural counties, which according to the National
Rural Development Partnership, are counties with population density
below 20 person per square miles.
There are 11 counties in California that qualify per this definition,
and they are highlighted in green.
Highlighted in blue are places where age, renter to owner ratio, or Chronic Mortality is higher than the state-wide average. Note that mortality rate is calculated based on the latest available population data: 2012. Number of Chronic cases for each county is actually the average number of yearly cases between 2016-2020, and disease selection is per CDC guideline.
As background reference, across all 58 CA counties, we found these statistics:
We observe that while only one of these counties have rent:owner
ratio higher than the state-wide average, many are still fairly
high.
They all have $0 in the latest HCAI funding that are in the Closure
state.
Figure 1 visualizes demographic and funding characteristics to further rank and narrow the selection of rural counties. For each county, the first two subplots depict median age and rent:own ratio, respectively. The third subplot depicts the HCAI funding amount each county received on projects with a status of “in construction” as of August 2022. Since all rural counties had no funding for projects “in closure”, our team felt it was important to explore funding for projects “in construction”. Together these scatter plots provide a visual comparison of where each county falls on the measurement scales of each criteria.
Figure 2 is a bar graph of Mortality Rate for Chronic diseases across the 11 rural counties in CA (as defined by National Rural Development Partnership, 20 person/sq.mile)
The list for Chronic disease is selected according to CDC
definition.
We note that we don’t have disease data for Alpine or Sierra county.
The Mortality Rate numbers are running yearly averages over the 5 year period of 2016-2020.
For the new public-private partnership to improve healthcare, the OHE director wanted to focus on rural areas that have high rental rates and high median age. However, no county perfectly fit all three attributes. Therefore, we offer visualizations and analysis with a holistic view of which counties best fit the selection criteria. Our first step narrowed down which counties are “rural” as defined by the National Rural Development Partnership. Table 1 list these the 11 rural counties, as well as highlight those that have a median age higher than the state-wide average (there were 8, and the ONE director wants to narrow down to a list of 5).
Using Figures 1 and 2 we compared demographics, funding and chronic mortality rates between counties. Notably, Figure 1, subplot 3, displays funding received for projects “In Construction” as of August 2022. Lassen, Inyo and Siskiyou reported funding of “In Construction” projects over 4 million dollars. Counties that reported zero dollar amounts in both categories include Alpine, Colusa, Mariposa, Mono, Modoc, and Sierra. While Plumas and Trinity received funding, it was under 1 million dollars. Together with Figure 2, we can see that Trinity has a lower chronic mortality rate than Plumas. We also see that Lassen, Colusa and Mono have lower chronic mortality rates, as well as median ages below the state-wide average. Since Lassen also has high funding of projects in construction, we do not recommend funding the county at this time. Additionally, without adequate data on chronic mortality rates in Sierra and Alpine counties, it is hard to justify allocating funding at this time, as we don’t know what would be most beneficial. We recommend further study in those areas to capture mortality and assess need for improvement projects.
The counties we selected have the highest chronic mortality rates of rural counties, have median ages above the state-wide average, and fit at least one other selection criteria better than those not selected. We propose funding development projects for healthcare facility improvement in Inyo, Mariposa, Modoc, Plumas and Siskiyou counties. As shown in Figure 2, we recommend these future improvement projects focus on heart disease and cancer, as they the most common disease reported.