PHW251 Bonus Problem Set

The knitted HTML is accessible at: https://tin6150.github.io/psg/R_phw251_bonus_problem_set.html

Part 1

Question 1

Below is a function that ought to return “I’m an even number!” if x is an even number. However, we’re having trouble receiving a value despite x == 4, which we know is an even number. Fix the code chunk and explain why this error is occurring. You will have to change the eval=FALSE option in the code chunk header to get the chunk to knit in your PDF.

NOTE: %% is the “modulo” operator, which returns the remainder when you divide the left number by the right number. For example, try 2 %% 2 (should equal 0 as 2/2 = 1 with no remainder) and 5 %% 2 (should equal 1 as 5/2 = 2 with a remainder of 1).
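
As a quick check of the operator, the sketch below evaluates the two examples from the note and then applies %% to a whole vector, since the operator works element-wise. The names `nums` and `is_even` are only for illustration.

```r
# %% returns the remainder after division
2 %% 2                 # 0: 2 divides evenly
5 %% 2                 # 1: 5 = 2 * 2 + 1

# %% is vectorized, so it tests every element at once
nums <- c(1, 2, 3, 4)  # illustrative values
is_even <- nums %% 2 == 0
is_even                # FALSE TRUE FALSE TRUE
```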

return_even = function(x){
  if (x %% 2 == 0) {
    return("I'm an even number!")
  }
}

X = 4
return_even( X )

## [1] "I'm an even number!"

EXPLAIN THE ISSUE HERE

The function return_even() requires an argument, so calling it without one fails. Passing X as the argument, as in return_even(X), fixes the call.

Question 2

R functions are not able to access global variables unless we provide them as inputs.

Below is a function that determines if a number is odd and adds 1 to that number. The function ought to return that value, but we can’t seem to access the value. Debug the code and explain why this error is occurring. Does it make sense to try and call odd_add_1 after running the function?

return_odd = function(y){
  if (y %% 2 != 0) {
    odd_add_1 = y + 1
    return(odd_add_1)   # return the value explicitly; odd_add_1 here is local to the function
  }
}


odd_add_1 = return_odd(3)
odd_add_1

EXPLAIN THE ISSUE HERE

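The variable odd_add_1 created inside return_odd() is local to the function and disappears when the function finishes. Calling odd_add_1 afterwards only makes sense if the function's return value is first saved into a variable of that name, as in odd_add_1 = return_odd(3).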
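
The scoping behaviour can be seen in a minimal sketch. The function `add_one_if_odd` and the variable `local_sum` are hypothetical names chosen for this illustration, and the last line assumes no other object called `local_sum` exists in the session.

```r
# Hypothetical function: `local_sum` exists only while the function runs
add_one_if_odd <- function(y) {
  if (y %% 2 != 0) {
    local_sum <- y + 1   # local variable
    return(local_sum)
  }
  NULL                   # even input: nothing to add
}

value <- add_one_if_odd(3)   # capture the returned value
value                        # 4
exists("local_sum")          # FALSE: the local variable is gone
```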

Question 3

BMI calculations and conversions:

- metric: \(BMI = weight (kg) / [height (m)]^2\)
- imperial: \(BMI = 703 * weight (lbs) / [height (in)]^2\)
- 1 foot = 12 inches
- 1 cm = 0.01 meter

Below is a function bmi_imperial() that calculates BMI and assumes the weight and height inputs are from the imperial system (height in feet and weight in pounds).

df_colorado <- read_csv("data/colorado_data.csv")

## Rows: 24 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): location, gender
## dbl (3): height, weight, date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

bmi_imperial <- function(height, weight){
  return ( (703 * weight)/(height * 12)^2 )
}

# calculate bmi for the first observation
bmi_imperial(df_colorado$height[1], df_colorado$weight[1])

## [1] 42.62802

#bmi_imperial(df_colorado$height,    df_colorado$weight   )

Write a function called bmi_metric() that calculates BMI based on the metric system. You can test your function with the Taiwan data set excel file in the data folder, which has height in cm and weight in kg.

# your code here

df_taiwan = read_xlsx("data/taiwan_data.xlsx")

# height is given in cm, so convert to m before squaring
bmi_metric = function( height, weight ) {
  return( weight / (height/100)^2 )
}

# uncomment the line below to test the bmi calculation on the first row
bmi_metric(df_taiwan$height[1], df_taiwan$weight[1])

## [1] 21.45357

#bmi_metric(df_taiwan$height,    df_taiwan$weight   )

Question 4

Can you write a function called calculate_bmi() that combines both of the BMI functions? You will need to figure out a way to determine which calculation to perform based on the values in the data.

# your code here

# Assume any height below 9 is in feet (imperial);
# otherwise it is in cm (metric).

# When called inside mutate(), the function receives whole column vectors,
# not one value per row, so a plain if statement cannot be used.
# if_else() is vectorized and picks a branch per element;
# do not wrap its branches in return().
# Avoid base R ifelse(): it is less strict about types and can give unexpected results.

calculate_bmi = function( height, weight ) {
  if_else( height >= 9,
    bmi_metric(   height, weight ),
    bmi_imperial( height, weight )
  )
}
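
The point about vectorization can be illustrated with a small sketch, assuming dplyr is installed. The `heights` and `weights` vectors are made-up values, one imperial and one metric, and the two formulas are written inline so the block stands alone.

```r
library(dplyr)

# Illustrative inputs: first person in feet/pounds, second in cm/kg
heights <- c(5.5, 170)
weights <- c(150, 65)

# if_else() tests the condition for every element and picks
# the matching value from the second or third argument
bmi <- if_else(
  heights >= 9,
  weights / (heights / 100)^2,       # metric branch
  703 * weights / (heights * 12)^2   # imperial branch
)
round(bmi, 2)                        # 24.21 22.49
```

Both branches are computed for the full vector before if_else() selects from them, which is harmless here because each formula returns a number for any positive height.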

Question 5

Use your function calculate_bmi() to answer the following questions:

What is the average BMI of the individuals in the Colorado data set?

45.61

# your code here

avg_colo = df_colorado %>% 
  #mutate( bmi = bmi_imperial(height, weight) ) %>%
  # if_else() inside calculate_bmi() is vectorized, so each row gets the formula that matches its height
  mutate( bmi = calculate_bmi(height, weight) ) %>%
  summarize( avg = mean(bmi))

avg_colo

What is the average BMI of the individuals in the Taiwan data set?

22.99

# your code here


avg_taiwan = df_taiwan %>% 
  #mutate( bmi = bmi_metric(height, weight) ) %>%
  mutate( bmi = calculate_bmi(height, weight) ) %>%
  summarize( avg = mean(bmi))

avg_taiwan  # expect 22.99, well below the Colorado average (these are practice data, not representative populations)

Question 6

Combine the Colorado and Taiwan data sets into one data frame and calculate the BMI for every row using your calculate_bmi() function. Print the first six rows and the last six rows of that new data set.

# your code here

combo_df = rbind( df_colorado, df_taiwan )

combo_df = combo_df %>% 
  mutate( bmi = calculate_bmi( height, weight   ) ) 

combo_df %>% head(6)

combo_df %>% tail(6)

avg_all = combo_df %>%
  summarize( avg = mean(bmi))

avg_all

Question 7

Make a boxplot that shows the BMI distribution of the combined data, separated by location on the x-axis. Use a theme of your choice, put a title on your graph, and hide the y-axis title.

NOTE: These data are for practice only and are not representative populations, which is why we aren’t comparing them with statistical tests. It would not be responsible to draw any conclusions from this graph!

# your code here

gplot7 = ggplot(
  data = combo_df, 
  aes( x = location, 
       y = bmi ) 
  ) + 
  geom_boxplot(
    color = "black",
    fill = "sienna2"
  ) +
  theme_minimal() +
  labs(
    title = "BMI of Colorado vs Taiwan",
    x = "Location",
    y = NULL   # NULL removes the axis title instead of leaving an empty label
  )


gplot7

Part 2

Question 8

Recall the patient data from a healthcare facility that we used in Part 2 of Problem Set 7.

We had four tables that were relational to each other and the following keys linking the tables together:

patient_id: patients, schedule
visit_id: schedule, visits
doctor_id: visits, doctors

Use a join to find out which patients have no visits on the schedule.

# your code here

patients_schedule = left_join( 
  x = patients , 
  y = schedule, 
  by = "patient_id" )

#str(patients)
#str(patients_schedule)

# look for entries with visit_id as NA
noSchedule = patients_schedule %>%
  filter( is.na(visit_id) )

noSchedule %>% head(10)
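
An alternative sketch of the same idea uses anti_join(), which keeps only the rows of the first table that have no match in the second. The tables `patients_demo` and `schedule_demo` are hypothetical miniatures, not the course data, and the block assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables
patients_demo <- tibble(patient_id = c(1, 2, 3))
schedule_demo <- tibble(patient_id = c(1, 3), visit_id = c(10, 11))

# anti_join() keeps rows of x that have no match in y
no_visits <- anti_join(patients_demo, schedule_demo, by = "patient_id")
no_visits$patient_id   # 2
```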

Question 9

With this data, can you tell if those patients with no visits on the schedule have been assigned to a doctor? Why or why not?

# your code here? (optional)


noSchedule_visits = left_join(
  x = noSchedule,
  y = visits,
  by = "visit_id"
)

str( noSchedule_visits )

## spec_tbl_df [7 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ patient_id     : num [1:7] 1013 1015 1017 1023 1033 ...
##  $ age            : num [1:7] 32 71 72 48 64 38 66
##  $ race_ethnicity : chr [1:7] NA "White" "Asian" "Asian" ...
##  $ gender_identity: chr [1:7] "man" "man" "woman" "man" ...
##  $ height         : num [1:7] 180 151 191 4.82 6.43 4.86 4.95
##  $ weight         : num [1:7] 58 62 62 203 256 ...
##  $ visit_id       : num [1:7] NA NA NA NA NA NA NA
##  $ date           : chr [1:7] NA NA NA NA ...
##  $ follow_up      : chr [1:7] NA NA NA NA ...
##  $ doctor_id      : num [1:7] NA NA NA NA NA NA NA
##  - attr(*, "spec")=
##   .. cols(
##   ..   patient_id = col_double(),
##   ..   age = col_double(),
##   ..   race_ethnicity = col_character(),
##   ..   gender_identity = col_character(),
##   ..   height = col_double(),
##   ..   weight = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

# now look for rows where follow_up is NA or doctor_id is NA

noSchedule_visits %>% head(10)

YOUR ANSWER HERE

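No. The doctor_id column lives only in the visits table, and a patient reaches that table only through a visit_id on the schedule. Patients with no scheduled visit have NA for visit_id, so the join returns NA for doctor_id, and the data cannot show whether a doctor has been assigned.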

Question 10

Assume those patients need primary care and haven’t been assigned a doctor yet. Which primary care doctors have the fewest visits? Rank them from least to most visits.

# your code here

doctors_visits = left_join(
  x = doctors,
  y = visits,
  by = "doctor_id"
)

# str( doctors_visits ) 


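doc_work_by_rank = doctors_visits %>%
  group_by( doctor_id ) %>% 
  # count only real visits: after the left join, a doctor with no visits has one row with visit_id = NA
  summarize( num_visits = sum( !is.na(visit_id) ) )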

doc_work_by_rank_with_name = 
  full_join(
    x = doctors,
    y = doc_work_by_rank,
    by = "doctor_id"
  ) 

doc_work_by_rank_with_name %>% arrange( num_visits ) %>%  head( 8 )
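
One detail to keep in mind when counting visits after a left join: a doctor with no visits still contributes one row, with NA in visit_id. The sketch below, on hypothetical tables `doctors_demo` and `visits_demo` and assuming dplyr is installed, compares counting rows with counting non-missing visit IDs.

```r
library(dplyr)

# Hypothetical data: doctor 3 has no visits
doctors_demo <- tibble(doctor_id = c(1, 2, 3))
visits_demo  <- tibble(visit_id = c(10, 11, 12), doctor_id = c(1, 1, 2))

counts <- left_join(doctors_demo, visits_demo, by = "doctor_id") %>%
  group_by(doctor_id) %>%
  summarize(
    n_rows   = length(visit_id),       # counts the NA row too
    n_visits = sum(!is.na(visit_id))   # counts real visits only
  ) %>%
  arrange(n_visits)
counts
```

In this toy data, doctor 3 gets n_rows = 1 but n_visits = 0, so only the second count ranks the doctor with no visits below a doctor with exactly one.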

Part 3

Recall in Problem Set 5, Part 2, we were working with data from New York City that tested children under 6 years old for elevated blood lead levels (BLL). You can read more about the data on their website.

About the data:

All NYC children are required to be tested for lead poisoning at around age 1 and age 2, and to be screened for risk of lead poisoning, and tested if at risk, up until age 6. These data are an indicator of children younger than 6 years of age tested in NYC in a given year with blood lead levels (BLL) of 5 mcg/dL or greater. In 2012, CDC established that a blood lead level of 5 mcg/dL is the reference level for exposure to lead in children. This level is used to identify children who have blood lead levels higher than most children’s levels. The reference level is determined by measuring the NHANES blood lead distribution in US children ages 1 to 5 years, and is reviewed every 4 years.

Question 11

Load in a cleaned-up version of the blood lead levels data:

bll_nyc_per_1000 <- read_csv("data/bll_nyc_per_1000.csv")

## Rows: 20 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): borough_id
## dbl (2): time_period, bll_5plus_1k
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Create a formattable table (example below) that shows the elevated blood lead levels per 1000 tested across 2013-2016. If the BLL increases from the previous year, turn the text red. If the BLL decreases from the previous year, turn the text green. To accomplish this color changing, you may want to create three indicator variables that check the value between years (e.g. use if_else). If you have used conditional formatting in Excel or Google Sheets, the concept is the same, but with R.

Note: If you are using if_else (hint hint) and checking by the year, you will likely need to use backticks (also called left quotes or grave accents) to reference the variable, because a column name such as 2014 is otherwise read as a number.
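
A minimal base R sketch of the backtick rule follows. The data frame `bll_demo` and the column `up_2014` are illustrative names; the two rows reuse the Bronx and Queens values for 2013 and 2014 from the formattable output later in this document.

```r
# Two boroughs, with years as column names
bll_demo <- data.frame(
  borough_id = c("Bronx", "Queens"),
  `2013` = c(20.1, 18.2),
  `2014` = c(18.7, 18.5),
  check.names = FALSE   # keep "2013" and "2014" as the column names
)

# Backticks make R read 2013 and 2014 as column names, not numbers
bll_demo$up_2014 <- as.integer(bll_demo$`2014` > bll_demo$`2013`)
bll_demo$up_2014        # 0 1: Bronx fell, Queens rose
```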

We have also provided you a function that you can use within your formattable table to reference this indicator variable to help reduce the code. However, you do not have to use this, and feel free to change the hex colors.

The formattable table renders correctly in HTML, just not in PDF. The knitted HTML is accessible at: https://tin6150.github.io/psg/R_phw251_bonus_problem_set.html

# in case plotly was loaded below, detach it first:
# style() conflicts when both libraries are loaded
# detach("package:plotly", unload=TRUE)

# your code here

#str( bll_nyc_per_1000 )

bll_wide = bll_nyc_per_1000 %>% pivot_wider(
  names_from = "time_period",
  values_from = "bll_5plus_1k"
)



# provided helper: returns red if indicator == 1, green otherwise
up_down = function(indicator) {
  return(ifelse( indicator == 1, "#fd626e", "#03d584"))
}

# R has no function overloading: this second definition replaces the one above.
# x and y can be vectors; the comparison is element-wise and returns a vector of colors:
# red if the value rose or stayed the same (x <= y), green if it fell.
up_down = function(x, y) {
  return(ifelse( x <= y, "#fd626e", "#03d584"))
}

# careful using this fn, need to explicitly pass a vector of data
styler = function(x,y) {
  return( formatter( "span", style = ~style( color = up_down(x,y))))
}

# your `formattable code` here
# the code works in RStudio, but does not knit correctly to PDF output

table_4_html_not_pdf = formattable( 
  bll_wide,
  list( 
    `2014` = styler(bll_wide$`2013`, bll_wide$`2014`),   
    `2015` = styler(bll_wide$`2014`, bll_wide$`2015`),   
    `2016` = styler(bll_wide$`2015`, bll_wide$`2016`)
    #`2016` = styler("2015","2016")  # does not work: compares the strings, not the columns, so the coloring is wrong
    #`2016` = styler(`2015`,`2016`)  # does not work: `2015` is not resolved to a column vector that styler can read
    )
)
table_4_html_not_pdf

borough_id	2013	2014	2015	2016
Bronx	20.1	18.7	15.7	15.0
Brooklyn	30.2	26.8	22.6	22.3
Manhattan	15.2	14.1	10.6	8.1
Queens	18.2	18.5	15.4	14.3
Staten Island	17.6	17.1	12.0	14.8

# this one works (original code), but it is repetitive and does not make full use of the function
old_table_4_html_not_pdf = formattable( 
  bll_wide,
  list( 
    `2014` = formatter( "span", style = ~style( color = up_down(`2013`,`2014`) ) ),   
    `2015` = formatter( "span", style = ~style( color = up_down(`2014`,`2015`) ) ),   
    `2016` = formatter( "span", style = ~style( color = up_down(`2015`,`2016`) ) )
    )
)

Question 12

Starting with the data frame bll_nyc_per_1000 create a table with the DT library showing elevated blood lead levels per 1000 tested in 2013-2016 by borough. Below is an example of the table to replicate.

The datatable renders correctly in HTML, just not in PDF. The knitted HTML is accessible at: https://tin6150.github.io/psg/R_phw251_bonus_problem_set.html

# your code here

bll_renamed = bll_nyc_per_1000 %>%
  rename(
    `Borough` = borough_id,
    `Year`    = time_period,
    `BLL > 5` = bll_5plus_1k
  )

datatable( 
  bll_renamed,
  caption = "New York City: Elevated Blood Lead Levels 2013-2016 by Borough",
  rownames = F
)

Part 4

Question 13

For this question, we will use suicide rates data that comes from the CDC.

Replicate the graph below using plotly.

The plot_ly graph renders correctly in HTML, just not in PDF. The knitted HTML is accessible at: https://tin6150.github.io/psg/R_phw251_bonus_problem_set.html

# formattable and plotly conflict when loaded together because both define style()
# https://stackoverflow.com/questions/39319427/using-formattable-and-plotly-simultaneously
#detach("package:formattable", unload=TRUE)

library(plotly)

## Warning: package 'plotly' was built under R version 4.2.2

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:formattable':
## 
##     style

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

df_suicide <- read_csv("data/Suicide Mortality by State.csv")

## Rows: 300 Columns: 6

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): STATE, NAME, URL
## dbl (2): YEAR, RATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# your code here


# total deaths per state across all years, largest first (exploratory; not used in the plot)
top_suicide = df_suicide %>%
  group_by(STATE) %>%
  summarize( total = sum( DEATHS ) ) %>%
  arrange( desc(total) )

# not sure of state selection criteria used, so pick by hand
state_list = c( 'AZ', 'CA', 'FL', 'HI', 'MI', 'NY', 'WY' )

df_suicide_select = df_suicide %>%
  filter( STATE %in% state_list )

fig_q13 = plot_ly( df_suicide_select ) %>%
  add_trace( 
    x = ~YEAR,
    y = ~RATE,
    color = ~STATE,
    type = "scatter",
    mode = "lines",
    line = list( dash = "dash" )  # https://plotly.com/r/line-charts/
  )



fig_q13

Question 14

Create an interactive (choropleth) map with plotly similar to the one presented on the CDC website. On the CDC map you can hover over each state and see the state name in bold, the death rate, and deaths. We can do much of this with plotly. As a challenge, use only the 2018 data to create an interactive US map colored by the suicide rates of that state. When you hover over the state, you should see the state name in bold, the death rate, and the number of deaths.

Some key search terms that may help:

choropleth
hover text plotly r
hover text bold plotly r
plotly and html
html bold
html subtitle

Below is an image of an example final map when you hover over California.

Here is the shell of the map to get you started. Copy the plotly code into the chunk below this one and customize it.

The plot_ly map renders correctly in HTML, just not in PDF. The knitted HTML is accessible at: https://tin6150.github.io/psg/R_phw251_bonus_problem_set.html

# your code here


# data pulled from CDC website described above 
df_suicide <- read_csv("data/Suicide Mortality by State.csv") %>%
  filter(YEAR == 2018)

## Rows: 300 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): STATE, NAME, URL
## dbl (2): YEAR, RATE
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## add hover pop-up text as an extra column of the data frame
df_suicide$hover = with(
  df_suicide,
  paste0( '<b>', NAME, '</b><br>',       # state name in bold, as the question asks
          'Death Rate: ', RATE,   '<br>',
          'Deaths: '    , DEATHS
          ) 
)

map_q14 = plot_ly(df_suicide,
        type="choropleth",
        locationmode = "USA-states") %>%
  layout(geo=list(scope="usa"))   %>% add_trace(
    z            = ~RATE
    , text       = ~hover
    , locations  = ~STATE
    , color      = ~RATE
    , colors     = 'Purples'
  )  %>%   
  layout(      
    title = paste(
      "Suicide Mortality by State in 2018"
      , "<br> The number of deaths per 100,000 total population"
) )


map_q14 # map renders well in RStudio, but not in the PDF.