title: “Sn50 random notes on R”
subtitle: “many while taking stat or epi classes”
date: ‘2022’
output: html_document

This is my notebook/scratchpad for R notes. The basic stuff was in psg/R.html but as class uses more advance constructs, and the graphics, it is just easier to have Rmd to play with the code in R studio as well. Knit to html and the result is still visible in the web :)

(gis class may need a Rgis.Rmd, dont put those here)

xref: - PHW251 wk11 has lot of visuals, plot_ly, DT data table, - PHW251 bonus problem set practiced on some of them, as well as a gis choropleth.

This example from wk 13 of PHW251 on epicurves by Manasa Susarla: PHW251 R for Public Health–Fall 2022 git repo

ref for Rmd, knit options, latex: ds4ps

~~~~

easystats produces nice plots and way less coding that ggplot!

./weekly material/week_13/epi_R_handbook/epi_R_handbook.Rmd

also discussed in The Epidemiologist R Handbook is by Neale Batra et al. and is also available on the web for free.

(covered glm - generalized linear model, for multivariate?)

# author: "Manasa Susarla"
# epidemic curvee (wk 13 of PHW251 EpidemicCurves.Rmd)

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.0
## ✔ tibble  3.1.8     ✔ stringr 1.5.0
## ✔ tidyr   1.2.1     ✔ forcats 0.5.2
## ✔ readr   2.1.3     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(magrittr)
## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(pacman)
pacman::p_load(rio)
pacman::p_load(incidence2)

Ebola_data <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/case_linelists/linelist_cleaned.xlsx")
head(Ebola_data)
##   case_id generation date_infection date_onset date_hospitalisation
## 1  5fe599          4     2014-05-08 2014-05-13           2014-05-15
## 2  8689b7          4           <NA> 2014-05-13           2014-05-14
## 3  11f8ea          2           <NA> 2014-05-16           2014-05-18
## 4  b8812a          3     2014-05-04 2014-05-18           2014-05-20
## 5  893f25          3     2014-05-18 2014-05-21           2014-05-22
## 6  be99c8          3     2014-05-03 2014-05-22           2014-05-23
##   date_outcome outcome gender age age_unit age_years age_cat age_cat5
## 1         <NA>    <NA>      m   2    years         2     0-4      0-4
## 2   2014-05-18 Recover      f   3    years         3     0-4      0-4
## 3   2014-05-30 Recover      m  56    years        56   50-69    55-59
## 4         <NA>    <NA>      f  18    years        18   15-19    15-19
## 5   2014-05-29 Recover      m   3    years         3     0-4      0-4
## 6   2014-05-24 Recover      f  16    years        16   15-19    15-19
##                               hospital       lon      lat infector source wt_kg
## 1                                Other -13.21574 8.468973   f547d6  other    27
## 2                              Missing -13.21523 8.451719     <NA>   <NA>    25
## 3 St. Mark's Maternity Hospital (SMMH) -13.21291 8.464817     <NA>   <NA>    91
## 4                        Port Hospital -13.23637 8.475476   f90f5f  other    41
## 5                    Military Hospital -13.22286 8.460824   11f8ea  other    36
## 6                        Port Hospital -13.22263 8.461831   aec8ec  other    56
##   ht_cm ct_blood fever chills cough aches vomit temp time_admission       bmi
## 1    48       22    no     no   yes    no   yes 36.8           <NA> 117.18750
## 2    59       22  <NA>   <NA>  <NA>  <NA>  <NA> 36.9          09:36  71.81844
## 3   238       21  <NA>   <NA>  <NA>  <NA>  <NA> 36.9          16:48  16.06525
## 4   135       23    no     no    no    no    no 36.8          11:22  22.49657
## 5    71       23    no     no   yes    no   yes 36.9          12:60  71.41440
## 6   116       21    no     no   yes    no   yes 37.6          14:13  41.61712
##   days_onset_hosp
## 1               2
## 2               1
## 3               2
## 4               2
## 5               1
## 6               1
##First make sure that your date column is in Date Format
Ebola_data$date_onset <- as.Date(Ebola_data$date_onset)

#Step 1: Create an incidence object
epi_day <- incidence(x = Ebola_data, date_index = date_onset, interval = "day")

#Step 2: Plot
plot(epi_day)

# so much simpler than ggplot !

epiR and epitools (wk13 PHW251 Yifan)

dummy=1

TBD

string, regex

string cheatsheet

for use | /| / // . \. # literal period ! \! ? \? \ \\ ( \() ) \) t # tab \n # newline d # :digits:

[[:punct:]] [[:alpha:]] [[:ASCII:]] # ascii chars (7 bit?, ie the 127 chars set) [^[:digit:]] # everything but numerals ^ means complement in reqgex,

\0 NULL octal code char, 1-3 digits hex code char, 1-2 digits unicode char, 1-4 digits unicode char, 1-8 digits

p_load(stringr)

# hoping to find equiv of $1 $2 perl equiv of (\w+).fasta 

str1="chi_A11.fasta"
str2="A1.fasta"


str_extract( str1, "([\\w,\\d]+)\\.fasta$" )
## [1] "chi_A11.fasta"
# nah


# by using () around the pattern, str_match return additional array elements
base = str_match( str1, "([\\w\\d_]+)\\.fasta$" )
base
##      [,1]            [,2]     
## [1,] "chi_A11.fasta" "chi_A11"
base[2]  # this does get the basename without the extension
## [1] "chi_A11"
base = str_match( str1, "([\\w\\d_]+)\\.(fasta)$" )
base
##      [,1]            [,2]      [,3]   
## [1,] "chi_A11.fasta" "chi_A11" "fasta"
base[3]
## [1] "fasta"
baseB = str_match( str1, "[\\w\\d_]+\\.fasta$" )
baseB
##      [,1]           
## [1,] "chi_A11.fasta"
result =  str1 %>% str_match( "([\\w\\d_]+)\\.fasta$" )
result[2]
## [1] "chi_A11"
# the following don't work 
#result =  str1 %>% str_match( "([\\w\\d_]+)\\.fasta$" )[2]
#result[2]

# so how to grab pattern 2?! into a col of a df??

df_strEg = data_frame( str1, str2 )
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
df_strEg2 = data_frame(id = 1:7,
       name = c('this and it pretty long is a',
                'name is a',
                'so and so and so and so  and so',
                "chi_B99.fasta",
                "C9.fasta",
                'this is a',
                'this is a variabel name')) 


df_strEg2 %>% 
  mutate_at( "name", str_match, "([\\w\\d_]+)\\.fasta$"  )
## # A tibble: 7 × 2
##      id name[,1]      [,2]   
##   <int> <chr>         <chr>  
## 1     1 <NA>          <NA>   
## 2     2 <NA>          <NA>   
## 3     3 <NA>          <NA>   
## 4     4 chi_B99.fasta chi_B99
## 5     5 C9.fasta      C9     
## 6     6 <NA>          <NA>   
## 7     7 <NA>          <NA>
  # this can't get to [2] :-\
#df_strEg2

df_strEg2$name %>% 
  str_match( "([\\w\\d_]+)\\.(fasta)$" )
##      [,1]            [,2]      [,3]   
## [1,] NA              NA        NA     
## [2,] NA              NA        NA     
## [3,] NA              NA        NA     
## [4,] "chi_B99.fasta" "chi_B99" "fasta"
## [5,] "C9.fasta"      "C9"      "fasta"
## [6,] NA              NA        NA     
## [7,] NA              NA        NA
# ok! this works as desired!
df_result3 = df_strEg2 %>% 
  mutate( base = str_match( name, "([\\w\\d_]+)\\.fasta$"  )[,2] )
  
df_result3
## # A tibble: 7 × 3
##      id name                            base   
##   <int> <chr>                           <chr>  
## 1     1 this and it pretty long is a    <NA>   
## 2     2 name is a                       <NA>   
## 3     3 so and so and so and so  and so <NA>   
## 4     4 chi_B99.fasta                   chi_B99
## 5     5 C9.fasta                        C9     
## 6     6 this is a                       <NA>   
## 7     7 this is a variabel name         <NA>

named colors in R

table in R.html, maybe move here… except that r studio like to mess with tab

color name compatibility approx hex code
red r,p?,h? 0xFF0000 ?
green r,p?,h? 0x00FF00 ?
blue r,p?,h? 0x0000ff ?

darkorange r darkcyan r darklateblue r

lightgreen r

compat: r = R, p = python, h = HTML

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:rio':
## 
##     export
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
x = c( 1, 2, 3 )
y = x

tried_color_list=c(
"grey",
"lightgrey"
)

plot1 = plot_ly(
  x = ~x,
  y = ~y,
  type = "bar",
  marker = list( color = tried_color_list[1] ),    # add color as item 1 and try running code :)
  data = as.data.frame( tried_color_list )
)


plot1

Vector vs scalar

(PHW251 bonus problem Q5)

# understand these comparison first:
myMix = c( 5.1, 5.2, 11.1, 11.2 )
myMix >= 9
## [1] FALSE FALSE  TRUE  TRUE
myMix >= as.vector(9)
## [1] FALSE FALSE  TRUE  TRUE
as.numeric( myMix >= 9 )
## [1] 0 0 1 1
# ie, result is a list.  IF-statement cant handle list in the THEN and ELSE parts
# manually create some dataframe from scratch as example data

h1 = c( 123,  124)    # cm
h2 = c(  5.6,   6.6)  # ft
w1 = c( 71,    82)
w2 = c(202,   176)

df_Taiwan = data_frame( h1, w1 ) %>% rename( height=h1, weight=w1 )
df_Colo  =  data_frame( h2, w2 ) %>% rename( height=h2, weight=w2 )  


df_both = rbind( df_Taiwan, df_Colo ) # if col name mismatch then cant row bind

df_both
## # A tibble: 4 × 2
##   height weight
##    <dbl>  <dbl>
## 1  123       71
## 2  124       82
## 3    5.6    202
## 4    6.6    176
str(df_Taiwan)                 
## tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
##  $ height: num [1:2] 123 124
##  $ weight: num [1:2] 71 82
bmi_metric = function( height, weight ) {
    return( weight / (height/100)^2 )           # single line return here is fine, even for vectorization
}

bmi_imperial <- function(height, weight){
  bmi = (703 * weight)/(height * 12)^2          # this form is not required
  return(bmi)
}


scalar_calculate_bmi = function( height, weight ) {
  #str( height )
  #if( as.numeric(height >= 9 )) {    # this return a list if height is a list...   # need to return via apply()?
  if( height >= 9 ) {    # if height is a list, it return a list, which return complain is not length 1, which is fine, want this to be scalar comparison only
    #return( bmi_metric(   height,weight ) )   ## for IF-statement, even NOT using the return still cant handle vector
    bmi = bmi_metric(   height,weight )
  } else {
    #return( bmi_imperial( height,weight ) )
    bmi = bmi_imperial( height,weight ) 
  }
  return( bmi )
}


scalar_calculate_bmi( df_Taiwan$height[1], df_Taiwan$weight[1] )   # cuz of the if-statement, only handle scalar input
## [1] 46.92974
#scalar_calculate_bmi( df_Taiwan$height,    df_Taiwan$weight    )   # cuz of the if-statement, only handle scalar input



vector_calculate_bmi = function( height, weight ) {
  if_else( height >= as.vector(9),  
    ( bmi_metric(   height,weight ) ),   # do NOT use RETURN in front of these statement, which force return of lenght 1
    ( bmi_imperial( height,weight ) )
  )
}

## see PHW251_Fall2022/problem sets/problem set bonus/problem_set_bonus_beforeCleanUp_lotsOfDetailsOfSubtleVectorization.Rmd
## for all the crazy things i tried 

df_both %>% mutate(
  bmi = vector_calculate_bmi( height, weight )
)
## # A tibble: 4 × 3
##   height weight   bmi
##    <dbl>  <dbl> <dbl>
## 1  123       71  46.9
## 2  124       82  53.3
## 3    5.6    202  31.4
## 4    6.6    176  19.7
# function overloading works in R too :)
# x and y could be vectors, it will do element wise comparison, and return a vector
up_green_down_red = function(x, y) {
  return(ifelse( x <= y, "#fd626e", "#03d584"))
}

stat table calculator

(from R.html)

F(x) = P(X ≤ x) Answer What is P(X < 27.4) when X has Normal distribution with mean 50 and standard deviation of 20 aka N(50,20). N often styled as cursive cap N. if variance (σ) is given, use the relation s.d. = sqrt(sigma) pnorm( q=27.4, mean=50, sd=20, lower.tail=TRUE ) pnorm( 27.4, 50, 20 )

https://www.bing.com/images/search?view=detailV2&ccid=0m5W4I7i&id=D280CB81A472BCC65596A2A6AAAF0FFFE9900FFE&thid=OIP.0m5W4I7iNeVuNWAygEAEvQHaFj&mediaurl=https%3A%2F%2Fimage.slidesharecdn.com%2Fbusinessstatisticschapter9-160519114719%2F95%2Fbusiness-statistics-chapter-9-28-638.jpg%3Fcb%3D1463658507&cdnurl=https%3A%2F%2Fth.bing.com%2Fth%2Fid%2FR.d26e56e08ee235e56e356032804004bd%3Frik%3D%252fg%252bQ6f8Pr6qmog%26pid%3DImgRaw%26r%3D0&exph=479&expw=638&q=when+to+reject+null+hypothesis+p-value&simid=608004671995271148&form=IRPRST&ck=7010A98F3A1E4976BC1BA694406D033F&selectedindex=1&ajaxhist=0&ajaxserp=0&pivotparams=insightsToken%3Dccid_BYpTD9%252Ba*cp_1F329B75E25E5E43155CF677D62073D5*mid_EF64F8CFAC485FCB18A3749907B64FD3C796FF50*simid_608042167058510911*thid_OIP.BYpTD9-aYUl4li4z6AsFTAHaFj&vt=0&sim=11&iss=VSI

# R use log for natural log, but can redefine/alias ln as:
`ln` = log


qnorm( 0.05, lower.tail=F )  # 1.65
## [1] 1.644854
# eg: got p-value of 0.0668 >= alpha of 0.05, do not reject (NULL)     ie not stat significant
# test stat of 1.50 is below the 1.645 cut off, do not reject (NULL)


pnorm( q=.95, mean=0, sd=1, lower.tail=F )
## [1] 0.1710561
pnorm( q=.05, mean=0, sd=1, lower.tail=T )
## [1] 0.5199388
paup1 = 3.841
paup1 = -2*( ln(558162.2) - ln(605229.7) ) 

pchisq( q=paup1, df=2 )
## [1] 0.07776799
dchisq( x=.995, df=1 )
## [1] 0.2431851
dchisq( .975, df=1 )
## [1] 0.2481357
# most freq used chisq 95% df=1 is 3.841 # JH memorize this
qchisq( 0.95, df=1 )
## [1] 3.841459
qchisq( 0.05, df=1, lower.tail=F )
## [1] 3.841459
pchisq( 3.841,    df=1 )
## [1] 0.9499863
pchisq( 3.841459, df=1 )
## [1] 0.95
# pchisq always upper.tail from STAT142... but situation/context??
# pchisq( 3.841459, df=1, lower.tail=F )

qchisq( 0.95, df=2   )  # 5.991465
## [1] 5.991465
qchisq( 0.95, df=1.5 )  # 4.980195
## [1] 4.980195
qchisq( 0.95, df=1.1 )  # 4.083997
## [1] 4.083997
qchisq( 0.95, df=1   )  # 3.841
## [1] 3.841459
qchisq( 0.95, df=0.9 )  # 3.58
## [1] 3.588808
qchisq( 0.95, df=0.5 )  # 2.420232 
## [1] 2.420232
qchisq( 0.95, df=0.1 )  # 0.5318646
## [1] 0.5318646
qchisq( 0.95, df=0   )  # 0
## [1] 0
paup1 = -2*( ln(558162.2) - ln(605229.7) ) 
paup2 = -2*( log(558162.2) - log(605229.7) ) 


#-ln L  2909255.8  # nst=2
#-ln L  2891898.8  # nst=6

paup_hky = -2*( ln(2909255.8) - ln(2891898.8) )

paup_hky # -0.011968
## [1] -0.011968

Rmd/Rstudio quirkyness

the triple-backticks r code block. either don’t have name in them, but if there is, make sure they are unique, or knit will fail. I suppose if not knitting then don’t matter.

Good resources

  • R for Health Data Science is an online book by Ewen Harrison and Riinu Pius. It’s available online in its entirety, for free
  • The Epidemiologist R Handbook is by Neale Batra et al. and is also available on the web for free.

  • xref