This web report is accessible at https://tin6150.github.io/phw251_group_z/milestone6_groupZ.html

Code used for analsys and visual generation is available at our github repo.

~~ This is a “verbose” version, where all code blocks are shown ~~

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(dplyr)
#library(kableExtra) ## no longer needed in this version
library(lubridate)

## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggplot2)
library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(formattable)   # for currency()

## 
## Attaching package: 'formattable'
## 
## The following object is masked from 'package:plotly':
## 
##     style

library(DT)            # and NOT data.table


# omit warning messages from library load
# https://stackoverflow.com/questions/45399587/how-to-remove-warning-messages-in-r-markdown-document
knitr::opts_chunk$set(warning = FALSE, message = FALSE)

## Loading & Cleaning data : Mortality

#**Suggestions from Lauren: 
#  -confirm we're not double counting deaths in mortality data 
#     Yes we are filtering by Total Population
#  -summarize mortality rates by county and use rates instead of counts before joining 
#     Okay now calculated Rate as count / pop12 

# Loading Mortality Data 

mortality_path <- 'data/ca_county_mortality.csv'
mortality_data_raw <- read_csv(mortality_path, 
                           na= c("", "NA", "-"),
                           show_col_types=F)

mortality_data <- mortality_data_raw %>% mutate_all(~replace( ., is.na(.), 0))

# View(mortality_data)
# There are NA values in mortality_data, so we need to replace NA w/ 0

mortality_data %>% select( Year ) %>% unique()

## # A tibble: 7 × 1
##    Year
##   <dbl>
## 1  2014
## 2  2015
## 3  2016
## 4  2017
## 5  2018
## 6  2019
## 7  2020

# Mortality data is from 2014 - 2020.

mortality_data %>% select( "Cause_Desc" ) %>% unique() %>% tally()

## # A tibble: 1 × 1
##       n
##   <int>
## 1    15

# contain 15 diseases, but one is "All causes (total)"


# subset mortality data for chronic conditions only

chronic_desc = c( 
  "Chronic lower respiratory diseases",
  "Diabetes mellitus",
  "Diseases of heart",
  "Essential hypertension and hypertensive renal disease",
  "Chronic liver disease and cirrhosis",
  "Alzheimer's disease",
  "Malignant neoplasms",
  "Nephritis, nephrotic syndrome and nephrosis",
  "Cerebrovascular diseases",
  "Parkinson's disease"
  )

str( chronic_desc )

##  chr [1:10] "Chronic lower respiratory diseases" "Diabetes mellitus" ...

chronic_mortality_data = mortality_data %>% 
  filter( Cause_Desc %in% chronic_desc )

num_rows_befor_filter = mortality_data %>% tally
num_rows_after_filter = chronic_mortality_data %>% tally
#num_rows_after_filter

num_rows_filtered = num_rows_befor_filter - num_rows_after_filter
num_rows_filtered

##       n
## 1 58464

#View(chronic_mortality_data)

# chronic case count by county in the last 5 years
# Question prompt:
## Second, OHE’s director would also like to include a total count of mortality 
## from chronic health conditions 
## over the past few years into the county level analysis. 


chronic_by_county = chronic_mortality_data %>%
  filter(  Strata == "Total Population") %>%
  filter(  Geography_Type == "Occurrence" ) %>%
  select( -Geography_Type ) %>% # Geo type restricted to "Occurrence"
  filter( Year >= 2016 )     %>%   # 2016-2020 = 5 years
  group_by( County ) %>%
  summarize( sum_count5 = sum(Count),       # this is a 5 year count!
             avg_count  = sum(Count) / 5 )

# Loading HCAI funding Data 

funding_path = 'data/hcai_healthcare_construction.csv'
funding_data <- read_csv(  funding_path, 
                           na= c("", "NA", "-"),
                           show_col_types=F       )

CtyColl = select(funding_data, c("Collection of Counties")) %>%
  unique()

CtyColl

## # A tibble: 4 × 1
##   `Collection of Counties`    
##   <chr>                       
## 1 Bay Area Counties           
## 2 <NA>                        
## 3 Greater Sacramento Counties 
## 4 Greater Los Angeles Counties

## there are 4 collection of counties, 
## there aren't likely needed for data eval,
## but add lots of NA, so dropping them
#1 Bay Area Counties           
#2 NA                          
#3 Greater Sacramento Counties 
#4 Greater Los Angeles Counties

# finding where in the data frame there is an 'na'
# https://www.geeksforgeeks.org/find-columns-and-rows-with-na-in-r-dataframe/
## funding_data_no_CtyColl = select(funding_data, -c("Collection of Counties"))
funding_data = select(funding_data, -c("Collection of Counties"))

## which(is.na(funding_data_no_CtyColl), arr.ind=T)
# above returns 0 rows, ie
# we find that only the column "Collection of Counties" has 'na'
# we will leave this for now since it may just be a colloquial reference
# unimportant for our data analysis.
# no replacement for na with 0 will be done on this data frame.



# the Costs column has human data, eg $50,890,315.00
# and we need to strip the dollar sign, the commas, 
# and convert them to numbers.  
# ref: https://stackoverflow.com/questions/31944103/convert-currency-with-commas-into-numeric
# we create a new column for this called "Numeric_Cost", 
# but could have potentially done an in-place replacement
funding_data = funding_data %>% 
  mutate(Numeric_Cost = as.numeric(
    gsub( '[$,]', '', funding_data[["Total Costs of OSHPD Projects"]] ) 
  ))

funding_data = funding_data %>%
  mutate( county_name = str_sub( County, 6 )) %>%
  subset(select = -c(County)) %>%
  rename(County = county_name)

# subset HCAI funding data for those "In Closure" state

funding_data_closure <- funding_data %>%
  filter(`OSHPD Project Status` == "In Closure") %>%
  filter(`Data Generation Date` == as_date("2022-08-11")  )

  ## 2022-08-11 is the latest avail data

funding_data_closure %>% arrange( Numeric_Cost ) %>% head( 10 )

## # A tibble: 10 × 6
##    `Data Generation Date` `OSHPD Project Status` Total …¹ Numbe…² Numer…³ County
##    <dttm>                 <chr>                  <chr>      <dbl>   <dbl> <chr> 
##  1 2022-08-11 00:00:00    In Closure             0              0       0 Alpine
##  2 2022-08-11 00:00:00    In Closure             0              0       0 Amador
##  3 2022-08-11 00:00:00    In Closure             0              0       0 Butte 
##  4 2022-08-11 00:00:00    In Closure             0              0       0 Calav…
##  5 2022-08-11 00:00:00    In Closure             0              0       0 Colusa
##  6 2022-08-11 00:00:00    In Closure             0              0       0 Del N…
##  7 2022-08-11 00:00:00    In Closure             0              0       0 Glenn 
##  8 2022-08-11 00:00:00    In Closure             0              0       0 Humbo…
##  9 2022-08-11 00:00:00    In Closure             0              0       0 Imper…
## 10 2022-08-11 00:00:00    In Closure             0              0       0 Inyo  
## # … with abbreviated variable names ¹`Total Costs of OSHPD Projects`,
## #   ²`Number of OSHPD Projects`, ³Numeric_Cost

#View(funding_data)
#View(funding_data_closure)

funding_data %>% select( `Data Generation Date` ) %>% unique() %>% tally() ## 231 rows

## # A tibble: 1 × 1
##       n
##   <int>
## 1   231

selected_counties = c(
  "Inyo",
  "Siskiyou",
  "Mariposa",
  "Modoc",
  "Plumas"
)

funding_data_selected_counties = funding_data %>% 
  #filter( County == "Alameda" ) %>%
  #filter( County == "Inyo" ) %>%
  #filter( County %in% selected_counties  ) %>%
  filter( `OSHPD Project Status` == "In Closure" |
            `OSHPD Project Status` == "In Construction"    ) %>%
  #filter( `OSHPD Project Status` == "Pending Construction"  ) %>%
  #filter( `OSHPD Project Status` == "In Review"  ) %>%
  filter( `Data Generation Date` == 
            as_date( "2022-08-11" ) 
            #as_date( "2020-01-01" ) 
            #as_date( "2016-01-01" ) 
          )

chk1 = 
  ggplot( data = funding_data_selected_counties,
          aes( x = `Data Generation Date`, 
               y = Numeric_Cost             ) ) +
  geom_bar( stat="identity", position="dodge" ) +
  labs( title = 'Prj cost over time' ) + 
  facet_wrap( ~ County )


chk1

## some number may linger in closure for longer than other?
## so taking average of closure amount over say 5 months may lead to skewed data
## really just make sense to just use the last data point.


funding_data_allSources = funding_data %>% 
  #filter( County == "Alameda" ) %>%
  #filter( County == "Mariposa" ) %>%
  #filter( County == "Inyo" ) %>%
  filter( County %in% selected_counties  ) %>%
  filter( `Data Generation Date` > 
            #as_date( "2022-01-01" ) 
            #as_date( "2020-01-01" ) 
            as_date( "2016-01-01" ) 
          )


chk_allSrc = 
  ggplot( data = funding_data_allSources,
          aes( x = `Data Generation Date`, 
               y = Numeric_Cost             ) ) +
  geom_bar( stat="identity", position="dodge" ) +
  labs( title = '1 county, prj stage' ) + 
  facet_wrap( ~ `OSHPD Project Status` )


chk_allSrc

## there is likely more to the story about these costs
## expect to see approved project to have a peak in Review
## then the peak move to Pending Construction, In Construction
## then In Closure
## but maybe their cost is finance book remaining budget account?
## data doesn't fully make sense
## but it is just homework, take a snapshot of "In Closure", and let it be.

# Load demographics data

demographics_path = 'data/ca_county_demographics.csv'
demographics_data = read_csv( demographics_path, 
                              na = c("", "NA", "-"), 
                              show_col_types=F )

# the first column contains id number, but it is unamed, so renaming it
# rest of the columns have reasonable names, so left them as is.
demographics_data = rename( demographics_data, id="...1")

# View(demographics_data)


# massaging demographics data

demographics_data <- demographics_data %>%
  select(id, name, pop2012, pop12_sqmi, med_age, owner_occ, renter_occ) %>%
  rename( `County` = name )


## Determining Renters vs Homeowners Ratio
demographics_data = demographics_data %>%
  mutate( rent_own_ratio = renter_occ / owner_occ )


demographics_data <- 
  mutate(demographics_data, 
         rural_class=if_else( pop12_sqmi < 20, "rural", "not rural", missing=NULL))

# column to show whether county was rural or not
rural_not_rural <- demographics_data %>% 
  select( County, rural_class, pop12_sqmi, rent_own_ratio, med_age )  

rural_counties = rural_not_rural   %>%
  filter( rural_class == "rural" ) 
  #rename( `Class`  = rural_class )


#### consider a "low pop density" rather than a strict "rural"
#### --> not really good, 49 counties in this grouping

avg_pop_density = mean( demographics_data$pop12_sqmi )
demographics_data = demographics_data %>% 
  mutate( low_pop = if_else( pop12_sqmi < avg_pop_density , 
                                    TRUE,
                                    FALSE,
                                    NULL   ))

high_rental_ratio = 
  rural_counties %>% arrange( desc( rent_own_ratio  ) ) 

high_med_age = 
  rural_counties %>% arrange( desc( med_age  ) ) 

## Determine any counties satisfy triple whammy rule

avg_rent_ratio = mean( demographics_data$rent_own_ratio )
avg_age        = mean( demographics_data$med_age  )

demographics_data = demographics_data %>% 
  mutate( high_med_age = if_else( med_age > avg_age , 
                                    TRUE,
                                    FALSE,
                                    NULL   )) %>%
  mutate( high_rental = if_else( rent_own_ratio > avg_rent_ratio , 
                                    TRUE,
                                    FALSE,
                                    NULL   )) 
  
triple_whammy = demographics_data %>%
  filter( rural_class  == "rural" &
          high_rental  == TRUE    &  
          high_med_age == TRUE       )

### so 0 county satisfy the triple whammy rule
### need to modify  selection criteria

## Use only RURAL counties to calculate their averages, and what's above average

## Determine any RURAL counties satisfy double whammy rule
#View(rural_counties)

avg_rent_ratio_rural = mean( rural_counties$rent_own_ratio )
avg_age_rural        = mean( rural_counties$med_age  )



rural_demographics_data = demographics_data %>% 
  mutate( high_med_age_rural = if_else( med_age > avg_age_rural , 
                                    TRUE,
                                    FALSE,
                                    NULL   )) %>%
  mutate( high_rental_rural = if_else( rent_own_ratio > avg_rent_ratio_rural , 
                                    TRUE,
                                    FALSE,
                                    NULL   )) 

#-- c( avg_rent_ratio, avg_rent_ratio_rural, avg_age, avg_age_rural )
#-- sn50 =  rural_demographics_data %>% filter( rural_class == "rural" )

rural_double_whammy = rural_demographics_data %>%
  filter( 
          high_rental_rural  == TRUE    &  
          high_med_age_rural == TRUE       )



rural_double_whammy  # only 3 counties in this list: Lake, Siskiyou, Inyo

## # A tibble: 3 × 14
##      id County   pop2012 pop12…¹ med_age owner…² rente…³ rent_…⁴ rural…⁵ low_pop
##   <dbl> <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>   <lgl>  
## 1     3 Lake       65253   49.1     45     17472    9076   0.519 not ru… TRUE   
## 2    33 Siskiyou   45200    7.12    46.8   12629    6876   0.544 rural   TRUE   
## 3    58 Inyo       18611    1.82    45.5    5121    2928   0.572 rural   TRUE   
## # … with 4 more variables: high_med_age <lgl>, high_rental <lgl>,
## #   high_med_age_rural <lgl>, high_rental_rural <lgl>, and abbreviated variable
## #   names ¹pop12_sqmi, ²owner_occ, ³renter_occ, ⁴rent_own_ratio, ⁵rural_class

rural_demographics_data %>% arrange( rent_own_ratio, med_age ) %>% head( 6 )

## # A tibble: 6 × 14
##      id County   pop2012 pop12…¹ med_age owner…² rente…³ rent_…⁴ rural…⁵ low_pop
##   <dbl> <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>   <lgl>  
## 1    42 Calaver…   46212   44.6     49.1   14520    4366   0.301 not ru… TRUE   
## 2    38 Amador     38354   63.3     48.2   10883    3686   0.339 not ru… TRUE   
## 3    51 El Dora…  182494  102.      43.5   51391   18832   0.366 not ru… TRUE   
## 4    15 Nevada     99951  103.      47.5   29890   11637   0.389 not ru… TRUE   
## 5    32 Sierra      3226    3.35    51      1065     417   0.392 rural   TRUE   
## 6    36 Alpine      1148    1.54    46.4     357     140   0.392 rural   TRUE   
## # … with 4 more variables: high_med_age <lgl>, high_rental <lgl>,
## #   high_med_age_rural <lgl>, high_rental_rural <lgl>, and abbreviated variable
## #   names ¹pop12_sqmi, ²owner_occ, ³renter_occ, ⁴rent_own_ratio, ⁵rural_class

## determine old age as > avg age for full data set
## then pick rural countries, 
## high rental in such rural counties.


#View(rural_counties)


focus_demographics_data = demographics_data %>% 
  mutate( high_rental_rural = if_else( rent_own_ratio > avg_rent_ratio_rural , 
                                    TRUE,
                                    FALSE,
                                    NULL   )) %>%
  mutate( high_med_age = if_else( med_age > avg_age , 
                                    TRUE,
                                    FALSE,
                                    NULL   ))

#-- c( avg_rent_ratio, avg_rent_ratio_rural, avg_age, avg_age_rural )
#-- sn50 =  focus_demographics_data %>% filter( rural_class == "rural" )

focus_dem = rural_demographics_data %>%
  filter( 
          high_rental_rural  == TRUE    &  
          high_med_age == TRUE       )    %>%
  select( County, rural_class, high_med_age, high_rental_rural )


rural_demographics_data %>%
  select( County, rural_class, med_age, high_med_age, high_rental_rural ) %>%
  arrange( high_rental_rural, desc(med_age) )

## # A tibble: 58 × 5
##    County    rural_class med_age high_med_age high_rental_rural
##    <chr>     <chr>         <dbl> <lgl>        <lgl>            
##  1 Sierra    rural          51   TRUE         FALSE            
##  2 Plumas    rural          49.5 TRUE         FALSE            
##  3 Mariposa  rural          49.2 TRUE         FALSE            
##  4 Trinity   rural          49.2 TRUE         FALSE            
##  5 Calaveras not rural      49.1 TRUE         FALSE            
##  6 Amador    not rural      48.2 TRUE         FALSE            
##  7 Nevada    not rural      47.5 TRUE         FALSE            
##  8 Tuolumne  not rural      47.1 TRUE         FALSE            
##  9 Alpine    rural          46.4 TRUE         FALSE            
## 10 Modoc     rural          46   TRUE         FALSE            
## # … with 48 more rows

#View(focus_dem)  # now have 13 counties to choose from, but not all rural

## decide if need to output any of these tables
## likely need to refine further 

high_rental_ratio %>% head( 6 )

## # A tibble: 6 × 5
##   County   rural_class pop12_sqmi rent_own_ratio med_age
##   <chr>    <chr>            <dbl>          <dbl>   <dbl>
## 1 Mono     rural             4.60          0.787    37.2
## 2 Colusa   rural            18.8           0.634    33.5
## 3 Inyo     rural             1.82          0.572    45.5
## 4 Siskiyou rural             7.12          0.544    46.8
## 5 Lassen   rural             7.42          0.526    37  
## 6 Mariposa rural            12.6           0.472    49.2

high_med_age %>% head( 6 )

## # A tibble: 6 × 5
##   County   rural_class pop12_sqmi rent_own_ratio med_age
##   <chr>    <chr>            <dbl>          <dbl>   <dbl>
## 1 Sierra   rural             3.35          0.392    51  
## 2 Plumas   rural             7.65          0.440    49.5
## 3 Mariposa rural            12.6           0.472    49.2
## 4 Trinity  rural             4.38          0.420    49.2
## 5 Siskiyou rural             7.12          0.544    46.8
## 6 Alpine   rural             1.54          0.392    46.4

rural_double_whammy %>% head( 6 )

## # A tibble: 3 × 14
##      id County   pop2012 pop12…¹ med_age owner…² rente…³ rent_…⁴ rural…⁵ low_pop
##   <dbl> <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>   <lgl>  
## 1     3 Lake       65253   49.1     45     17472    9076   0.519 not ru… TRUE   
## 2    33 Siskiyou   45200    7.12    46.8   12629    6876   0.544 rural   TRUE   
## 3    58 Inyo       18611    1.82    45.5    5121    2928   0.572 rural   TRUE   
## # … with 4 more variables: high_med_age <lgl>, high_rental <lgl>,
## #   high_med_age_rural <lgl>, high_rental_rural <lgl>, and abbreviated variable
## #   names ¹pop12_sqmi, ²owner_occ, ³renter_occ, ⁴rent_own_ratio, ⁵rural_class

focus_dem %>% head( 13 )

## # A tibble: 13 × 4
##    County          rural_class high_med_age high_rental_rural
##    <chr>           <chr>       <lgl>        <lgl>            
##  1 Lake            not rural   TRUE         TRUE             
##  2 Marin           not rural   TRUE         TRUE             
##  3 Mendocino       not rural   TRUE         TRUE             
##  4 Napa            not rural   TRUE         TRUE             
##  5 San Francisco   not rural   TRUE         TRUE             
##  6 San Luis Obispo not rural   TRUE         TRUE             
##  7 San Mateo       not rural   TRUE         TRUE             
##  8 Shasta          not rural   TRUE         TRUE             
##  9 Siskiyou        rural       TRUE         TRUE             
## 10 Sonoma          not rural   TRUE         TRUE             
## 11 Tehama          not rural   TRUE         TRUE             
## 12 Del Norte       not rural   TRUE         TRUE             
## 13 Inyo            rural       TRUE         TRUE

#It doesn't seems like we have high rental (ie > avg_rent) and old age (>avg_age)
#so we will have to judge subjectively
#-Tin

# calc chronic mortality rate using census pop12 as denominator

# First need to Join Chronics mortality data (by county, last 5 years)
# and demographics data
# the avg_count was already div by 5 earlier in the code.

demographics_chronic = left_join( demographics_data, 
                                  chronic_by_county, 
                                  by = "County"      ) %>% 
  mutate( pct     = ( avg_count / pop2012 ) * 100 ) %>%
  mutate( per100k = ( avg_count / pop2012 ) * 100000 ) 

# pct = "prevalence" percentage

avg_chronic_mortality         = mean( demographics_chronic$pct )
avg_chronic_mortality_per100k = mean( demographics_chronic$per100k )

fund_dem_chron = inner_join( demographics_chronic,
                             funding_data_closure, 
                             by = "County" ) %>%
  mutate( fund_per_cap = Numeric_Cost / pop2012 )

viz_fund_dem_chron = fund_dem_chron %>%
  select( County, 
          pop2012,
          pop12_sqmi,
          rural_class,  
          #low_pop,
          med_age,
          high_med_age,
          rent_own_ratio,
          high_rental,
          avg_count,
          pct,                    # prevalence,         
          per100k,                # prevalence,  per 100k pop
          Numeric_Cost, 
          fund_per_cap,
          `Number of OSHPD Projects`,
          )

# grabbing data from raw source 

mortality_data_raw %>%
  filter( County == "Siskiyou" ) %>%
  filter( Year == 2020 ) %>%
  filter( Strata == "Total Population" ) %>% 
  filter( Geography_Type == "Occurrence" ) %>% 
  filter( Cause_Desc %in% chronic_desc)

## # A tibble: 10 × 10
##     Year County   Geography…¹ Strata Strat…² Cause Cause…³ Count Annot…⁴ Annot…⁵
##    <dbl> <chr>    <chr>       <chr>  <chr>   <chr> <chr>   <dbl>   <dbl> <chr>  
##  1  2020 Siskiyou Occurrence  Total… Total … ALZ   Alzhei…    26      NA <NA>   
##  2  2020 Siskiyou Occurrence  Total… Total … CAN   Malign…   121      NA <NA>   
##  3  2020 Siskiyou Occurrence  Total… Total … CLD   Chroni…    39      NA <NA>   
##  4  2020 Siskiyou Occurrence  Total… Total … DIA   Diabet…    15      NA <NA>   
##  5  2020 Siskiyou Occurrence  Total… Total … HTD   Diseas…   107      NA <NA>   
##  6  2020 Siskiyou Occurrence  Total… Total … HYP   Essent…    NA       1 Cell s…
##  7  2020 Siskiyou Occurrence  Total… Total … LIV   Chroni…    NA       1 Cell s…
##  8  2020 Siskiyou Occurrence  Total… Total … NEP   Nephri…    NA       1 Cell s…
##  9  2020 Siskiyou Occurrence  Total… Total … PAR   Parkin…    NA       1 Cell s…
## 10  2020 Siskiyou Occurrence  Total… Total … STK   Cerebr…    26      NA <NA>   
## # … with abbreviated variable names ¹Geography_Type, ²Strata_Name, ³Cause_Desc,
## #   ⁴Annotation_Code, ⁵Annotation_Desc

# 26+121+39+15+107+26
# 334

# viz_fund_dem_chron is just fund_dem_chron with fewer columns

scm1 = viz_fund_dem_chron %>% 
  filter( County == "Siskiyou" )

chronic_by_county %>% 
  filter( County == "Siskiyou" )

## # A tibble: 1 × 3
##   County   sum_count5 avg_count
##   <chr>         <dbl>     <dbl>
## 1 Siskiyou       1614      323.

scm2 = demographics_chronic %>%
  filter( County == "Siskiyou" )

# Spot check for Mortality rate of Siskiyou 
#pop12      for siskiyou is 45200
#sum_count  for siskiyou is  1614
# as fraction is .03517  (ie 3.6%)
# as per 100,000 then 1614*100000/45200 = 3570.796
## this was a 5 years count ... fixed with avg_count

#### for viz, ponder about using these, or derivative thereof 

# viz_fund_dem_chron is just fund_dem_chron with fewer columns

feel_for_data = viz_fund_dem_chron %>% 
  filter( rural_class == "rural" ) 
  #filter( low_pop == TRUE )         # no good, 49 counties in this grouping

feel_for_data %>% 
  arrange( desc( fund_per_cap ) ) %>% head( 10 )

## # A tibble: 10 × 14
##    County pop2012 pop12…¹ rural…² med_age high_…³ rent_…⁴ high_…⁵ avg_c…⁶    pct
##    <chr>    <dbl>   <dbl> <chr>     <dbl> <lgl>     <dbl> <lgl>     <dbl>  <dbl>
##  1 Lassen   35039    7.42 rural      37   FALSE     0.526 FALSE      70.4 0.201 
##  2 Marip…   18455   12.6  rural      49.2 TRUE      0.472 FALSE      80.6 0.437 
##  3 Modoc     9791    2.33 rural      46   TRUE      0.459 FALSE      39.4 0.402 
##  4 Mono     14418    4.60 rural      37.2 FALSE     0.787 TRUE       12.2 0.0846
##  5 Plumas   20000    7.65 rural      49.5 TRUE      0.440 FALSE      82.6 0.413 
##  6 Sierra    3226    3.35 rural      51   TRUE      0.392 FALSE       0   0     
##  7 Siski…   45200    7.12 rural      46.8 TRUE      0.544 FALSE     323.  0.714 
##  8 Alpine    1148    1.54 rural      46.4 TRUE      0.392 FALSE       0   0     
##  9 Colusa   21780   18.8  rural      33.5 FALSE     0.634 FALSE      56.6 0.260 
## 10 Trini…   14063    4.38 rural      49.2 TRUE      0.420 FALSE      43.4 0.309 
## # … with 4 more variables: per100k <dbl>, Numeric_Cost <dbl>,
## #   fund_per_cap <dbl>, `Number of OSHPD Projects` <dbl>, and abbreviated
## #   variable names ¹pop12_sqmi, ²rural_class, ³high_med_age, ⁴rent_own_ratio,
## #   ⁵high_rental, ⁶avg_count

### above are all 0... which implies no fundings for them... should give them some love!?

Final Report

Problem Statement

The California Department of Public Health Office of Health Equity (OHE) recently issued a new policy to create a public-private partnership to improve healthcare facilities in five rural counties across the state. Our team will evaluate and recommend which counties should receive development funding proposals based on equitable selection criteria created by OHE. Specifically, we will explore data to identify which rural counties have more non-homeowners, aging individuals, higher chronic mortality rates, and have received minimal funding from the Department of Health Care Access and Information.

Methods

We used 3 datasets for this project:

CA demographics
CA mortality surveillance
HCAI healthcare construction funding

The first dataset is from the 2012 Census and contains demographics information for each of the 58 counties in California. It includes information such as population per square mile, median age, number of households who are renters vs owners, ethnicity, genders, etc. We calculated the renter to owner ratio for each county. We then calculated the average age and population density for the whole state and visually inspected the data to see how each county stack up. We ended up using the National Rural Development Partnership’s definition to determine if a given county’s population density is to be classified as rural, for which there were 11.

The second dataset is the mortality surveillance obtained from the CA Open Data Portal. It contains a breakdown of total mortality for each county by 15 disease areas. We used the CDC definition of chronic diseases as filter, for which 10 fit the criteria. The data range from 2014 to 2020, but we were tasked to focus on the last 5 years, thus we applied a filter with Year >= 2016. As tasked, we also performed filters with Geography_Type == "Occurrence" and used Strata == "Total Population" to avoid over counting. Any missing data were replaced with 0. Once the data was cleaned, we summed all the disease occurrences within each county. We joined this with the demographics data to obtain a mortality rate of chronic conditions over 5 years for each county.

The third dataset is the HCAI funding, also obtained from the CA Open Data Portal. It contains healthcare spending for each county in 4 stages of project progression, updated about every 2 weeks. We focused on the latest available data, which was Aug 11, 2022, and those with state of “In Closure”. Many rural counties showed up with $0 amount. As we went back to double check our selection code, we confirmed that funding was higher in large populous counties such as those around the greater Los Angeles and San Francisco. However, we also noted that while many rural counties had no funding for “In Closure”, they did report funding for projects “In Construction”. While our main variable focuses on “In Closure” to help our improvement plan drive new spending for rural counties with high and variable mortality rates, we also compared “In Construction” funding to help narrow selection.

After cleaning and filtering the 3 datasets above, we joined them by county name, whereby we can see which counties had high renters, high chronic mortality rates, and the funding they received.

Results

Rural Counties Profiles

Table 1 shows demographics and disease profile for the California’s rural counties, which according to the National Rural Development Partnership, are counties with population density below 20 person per square miles.
There are 11 counties in California that qualify per this definition, and they are highlighted in green.

Highlighted in blue are places where age, renter to owner ratio, or Chronic Mortality is higher than the state-wide average. Note that mortality rate is calculated based on the latest available population data: 2012. Number of Chronic cases for each county is actually the average number of yearly cases between 2016-2020, and disease selection is per CDC guideline.

As background reference, across all 58 CA counties, we found these statistics:

average population density: 665 person per square mile.
average rent:own ratio is: 64.7%
average age is: 38.5
average Chronic Mortality per 100k: 495

We observe that while only one of these counties have rent:owner ratio higher than the state-wide average, many are still fairly high.
They all have $0 in the latest HCAI funding that are in the Closure state.

Demographics and Funding of Rural Counties

Figure 1 visualizes demographic and funding characteristics to further rank and narrow the selection of rural counties. For each county, the first two subplots depict median age and rent:own ratio, respectively. The third subplot depicts the HCAI funding amount each county received on projects with a status of “in construction” as of August 2022. Since all rural counties had no funding for projects “in closure”, our team felt it was important to explore funding for projects “in construction”. Together these scatter plots provide a visual comparison of where each county falls on the measurement scales of each criteria.

Mortality Rate of Rural County

Figure 2 is a bar graph of Mortality Rate for Chronic diseases across the 11 rural counties in CA (as defined by National Rural Development Partnership, 20 person/sq.mile)

The list for Chronic disease is selected according to CDC definition.
We note that we don’t have disease data for Alpine or Sierra county.

The Mortality Rate numbers are running yearly averages over the 5 year period of 2016-2020.

Discussion

For the new public-private partnership to improve healthcare, the OHE director wanted to focus on rural areas that have high rental rates and high median age. However, no county perfectly fit all three attributes. Therefore, we offer visualizations and analysis with a holistic view of which counties best fit the selection criteria. Our first step narrowed down which counties are “rural” as defined by the National Rural Development Partnership. Table 1 list these the 11 rural counties, as well as highlight those that have a median age higher than the state-wide average (there were 8, and the ONE director wants to narrow down to a list of 5).

Using Figures 1 and 2 we compared demographics, funding and chronic mortality rates between counties. Notably, Figure 1, subplot 3, displays funding received for projects “In Construction” as of August 2022. Lassen, Inyo and Siskiyou reported funding of “In Construction” projects over 4 million dollars. Counties that reported zero dollar amounts in both categories include Alpine, Colusa, Mariposa, Mono, Modoc, and Sierra. While Plumas and Trinity received funding, it was under 1 million dollars. Together with Figure 2, we can see that Trinity has a lower chronic mortality rate than Plumas. We also see that Lassen, Colusa and Mono have lower chronic mortality rates, as well as median ages below the state-wide average. Since Lassen also has high funding of projects in construction, we do not recommend funding the county at this time. Additionally, without adequate data on chronic mortality rates in Sierra and Alpine counties, it is hard to justify allocating funding at this time, as we don’t know what would be most beneficial. We recommend further study in those areas to capture mortality and assess need for improvement projects.

The counties we selected have the highest chronic mortality rates of rural counties, have median ages above the state-wide average, and fit at least one other selection criteria better than those not selected. We propose funding development projects for healthcare facility improvement in Inyo, Mariposa, Modoc, Plumas and Siskiyou counties. As shown in Figure 2, we recommend these future improvement projects focus on heart disease and cancer, as they the most common disease reported.

Milestone 6 Group Z

Aubrey Robinson, Elizabeth Guzman, Tin Ho

Due: 2022-12-11