title: “Sn50 random notes on R” |
subtitle: “many while taking stat or epi classes” |
date: ‘2022’ |
output: html_document |
This is my notebook/scratchpad for R notes. The basic stuff was in psg/R.html but as class uses more advance constructs, and the graphics, it is just easier to have Rmd to play with the code in R studio as well. Knit to html and the result is still visible in the web :)
(gis class may need a Rgis.Rmd, dont put those here)
xref: - PHW251 wk11 has lot of visuals, plot_ly, DT data table, - PHW251 bonus problem set practiced on some of them, as well as a gis choropleth.
This example from wk 13 of PHW251 on epicurves by Manasa Susarla: PHW251 R for Public Health–Fall 2022 git repo
ref for Rmd, knit options, latex: ds4ps
~~~~
easystats produces nice plots and way less coding that ggplot!
./weekly material/week_13/epi_R_handbook/epi_R_handbook.Rmd
also discussed in The Epidemiologist R Handbook is by Neale Batra et al. and is also available on the web for free.
(covered glm - generalized linear model, for multivariate?)
# author: "Manasa Susarla"
# epidemic curvee (wk 13 of PHW251 EpidemicCurves.Rmd)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ stringr 1.5.0
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ readr 2.1.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(magrittr)
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
library(pacman)
pacman::p_load(rio)
pacman::p_load(incidence2)
Ebola_data <- import("https://github.com/appliedepi/epirhandbook_eng/raw/master/data/case_linelists/linelist_cleaned.xlsx")
head(Ebola_data)
## case_id generation date_infection date_onset date_hospitalisation
## 1 5fe599 4 2014-05-08 2014-05-13 2014-05-15
## 2 8689b7 4 <NA> 2014-05-13 2014-05-14
## 3 11f8ea 2 <NA> 2014-05-16 2014-05-18
## 4 b8812a 3 2014-05-04 2014-05-18 2014-05-20
## 5 893f25 3 2014-05-18 2014-05-21 2014-05-22
## 6 be99c8 3 2014-05-03 2014-05-22 2014-05-23
## date_outcome outcome gender age age_unit age_years age_cat age_cat5
## 1 <NA> <NA> m 2 years 2 0-4 0-4
## 2 2014-05-18 Recover f 3 years 3 0-4 0-4
## 3 2014-05-30 Recover m 56 years 56 50-69 55-59
## 4 <NA> <NA> f 18 years 18 15-19 15-19
## 5 2014-05-29 Recover m 3 years 3 0-4 0-4
## 6 2014-05-24 Recover f 16 years 16 15-19 15-19
## hospital lon lat infector source wt_kg
## 1 Other -13.21574 8.468973 f547d6 other 27
## 2 Missing -13.21523 8.451719 <NA> <NA> 25
## 3 St. Mark's Maternity Hospital (SMMH) -13.21291 8.464817 <NA> <NA> 91
## 4 Port Hospital -13.23637 8.475476 f90f5f other 41
## 5 Military Hospital -13.22286 8.460824 11f8ea other 36
## 6 Port Hospital -13.22263 8.461831 aec8ec other 56
## ht_cm ct_blood fever chills cough aches vomit temp time_admission bmi
## 1 48 22 no no yes no yes 36.8 <NA> 117.18750
## 2 59 22 <NA> <NA> <NA> <NA> <NA> 36.9 09:36 71.81844
## 3 238 21 <NA> <NA> <NA> <NA> <NA> 36.9 16:48 16.06525
## 4 135 23 no no no no no 36.8 11:22 22.49657
## 5 71 23 no no yes no yes 36.9 12:60 71.41440
## 6 116 21 no no yes no yes 37.6 14:13 41.61712
## days_onset_hosp
## 1 2
## 2 1
## 3 2
## 4 2
## 5 1
## 6 1
##First make sure that your date column is in Date Format
Ebola_data$date_onset <- as.Date(Ebola_data$date_onset)
#Step 1: Create an incidence object
epi_day <- incidence(x = Ebola_data, date_index = date_onset, interval = "day")
#Step 2: Plot
plot(epi_day)
# so much simpler than ggplot !
epiR and epitools (wk13 PHW251 Yifan)
dummy=1
for use | /| / // . \. # literal period ! \! ? \? \ \\ ( \() ) \) t # tab \n # newline d # :digits:
[[:punct:]] [[:alpha:]] [[:ASCII:]] # ascii chars (7 bit?, ie the 127 chars set) [^[:digit:]] # everything but numerals ^ means complement in reqgex,
\0 NULL octal code char, 1-3 digits hex code char, 1-2 digits unicode char, 1-4 digits unicode char, 1-8 digits
p_load(stringr)
# hoping to find equiv of $1 $2 perl equiv of (\w+).fasta
str1="chi_A11.fasta"
str2="A1.fasta"
str_extract( str1, "([\\w,\\d]+)\\.fasta$" )
## [1] "chi_A11.fasta"
# nah
# by using () around the pattern, str_match return additional array elements
base = str_match( str1, "([\\w\\d_]+)\\.fasta$" )
base
## [,1] [,2]
## [1,] "chi_A11.fasta" "chi_A11"
base[2] # this does get the basename without the extension
## [1] "chi_A11"
base = str_match( str1, "([\\w\\d_]+)\\.(fasta)$" )
base
## [,1] [,2] [,3]
## [1,] "chi_A11.fasta" "chi_A11" "fasta"
base[3]
## [1] "fasta"
baseB = str_match( str1, "[\\w\\d_]+\\.fasta$" )
baseB
## [,1]
## [1,] "chi_A11.fasta"
result = str1 %>% str_match( "([\\w\\d_]+)\\.fasta$" )
result[2]
## [1] "chi_A11"
# the following don't work
#result = str1 %>% str_match( "([\\w\\d_]+)\\.fasta$" )[2]
#result[2]
# so how to grab pattern 2?! into a col of a df??
df_strEg = data_frame( str1, str2 )
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## ℹ Please use `tibble()` instead.
df_strEg2 = data_frame(id = 1:7,
name = c('this and it pretty long is a',
'name is a',
'so and so and so and so and so',
"chi_B99.fasta",
"C9.fasta",
'this is a',
'this is a variabel name'))
df_strEg2 %>%
mutate_at( "name", str_match, "([\\w\\d_]+)\\.fasta$" )
## # A tibble: 7 × 2
## id name[,1] [,2]
## <int> <chr> <chr>
## 1 1 <NA> <NA>
## 2 2 <NA> <NA>
## 3 3 <NA> <NA>
## 4 4 chi_B99.fasta chi_B99
## 5 5 C9.fasta C9
## 6 6 <NA> <NA>
## 7 7 <NA> <NA>
# this can't get to [2] :-\
#df_strEg2
df_strEg2$name %>%
str_match( "([\\w\\d_]+)\\.(fasta)$" )
## [,1] [,2] [,3]
## [1,] NA NA NA
## [2,] NA NA NA
## [3,] NA NA NA
## [4,] "chi_B99.fasta" "chi_B99" "fasta"
## [5,] "C9.fasta" "C9" "fasta"
## [6,] NA NA NA
## [7,] NA NA NA
# ok! this works as desired!
df_result3 = df_strEg2 %>%
mutate( base = str_match( name, "([\\w\\d_]+)\\.fasta$" )[,2] )
df_result3
## # A tibble: 7 × 3
## id name base
## <int> <chr> <chr>
## 1 1 this and it pretty long is a <NA>
## 2 2 name is a <NA>
## 3 3 so and so and so and so and so <NA>
## 4 4 chi_B99.fasta chi_B99
## 5 5 C9.fasta C9
## 6 6 this is a <NA>
## 7 7 this is a variabel name <NA>
table in R.html, maybe move here… except that r studio like to mess with tab
color name | compatibility | approx hex code |
---|---|---|
red | r,p?,h? | 0xFF0000 ? |
green | r,p?,h? | 0x00FF00 ? |
blue | r,p?,h? | 0x0000ff ? |
darkorange r darkcyan r darklateblue r
lightgreen r
compat: r = R, p = python, h = HTML
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:rio':
##
## export
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
x = c( 1, 2, 3 )
y = x
tried_color_list=c(
"grey",
"lightgrey"
)
plot1 = plot_ly(
x = ~x,
y = ~y,
type = "bar",
marker = list( color = tried_color_list[1] ), # add color as item 1 and try running code :)
data = as.data.frame( tried_color_list )
)
plot1
(PHW251 bonus problem Q5)
# understand these comparison first:
myMix = c( 5.1, 5.2, 11.1, 11.2 )
myMix >= 9
## [1] FALSE FALSE TRUE TRUE
myMix >= as.vector(9)
## [1] FALSE FALSE TRUE TRUE
as.numeric( myMix >= 9 )
## [1] 0 0 1 1
# ie, result is a list. IF-statement cant handle list in the THEN and ELSE parts
# manually create some dataframe from scratch as example data
h1 = c( 123, 124) # cm
h2 = c( 5.6, 6.6) # ft
w1 = c( 71, 82)
w2 = c(202, 176)
df_Taiwan = data_frame( h1, w1 ) %>% rename( height=h1, weight=w1 )
df_Colo = data_frame( h2, w2 ) %>% rename( height=h2, weight=w2 )
df_both = rbind( df_Taiwan, df_Colo ) # if col name mismatch then cant row bind
df_both
## # A tibble: 4 × 2
## height weight
## <dbl> <dbl>
## 1 123 71
## 2 124 82
## 3 5.6 202
## 4 6.6 176
str(df_Taiwan)
## tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
## $ height: num [1:2] 123 124
## $ weight: num [1:2] 71 82
bmi_metric = function( height, weight ) {
return( weight / (height/100)^2 ) # single line return here is fine, even for vectorization
}
bmi_imperial <- function(height, weight){
bmi = (703 * weight)/(height * 12)^2 # this form is not required
return(bmi)
}
scalar_calculate_bmi = function( height, weight ) {
#str( height )
#if( as.numeric(height >= 9 )) { # this return a list if height is a list... # need to return via apply()?
if( height >= 9 ) { # if height is a list, it return a list, which return complain is not length 1, which is fine, want this to be scalar comparison only
#return( bmi_metric( height,weight ) ) ## for IF-statement, even NOT using the return still cant handle vector
bmi = bmi_metric( height,weight )
} else {
#return( bmi_imperial( height,weight ) )
bmi = bmi_imperial( height,weight )
}
return( bmi )
}
scalar_calculate_bmi( df_Taiwan$height[1], df_Taiwan$weight[1] ) # cuz of the if-statement, only handle scalar input
## [1] 46.92974
#scalar_calculate_bmi( df_Taiwan$height, df_Taiwan$weight ) # cuz of the if-statement, only handle scalar input
vector_calculate_bmi = function( height, weight ) {
if_else( height >= as.vector(9),
( bmi_metric( height,weight ) ), # do NOT use RETURN in front of these statement, which force return of lenght 1
( bmi_imperial( height,weight ) )
)
}
## see PHW251_Fall2022/problem sets/problem set bonus/problem_set_bonus_beforeCleanUp_lotsOfDetailsOfSubtleVectorization.Rmd
## for all the crazy things i tried
df_both %>% mutate(
bmi = vector_calculate_bmi( height, weight )
)
## # A tibble: 4 × 3
## height weight bmi
## <dbl> <dbl> <dbl>
## 1 123 71 46.9
## 2 124 82 53.3
## 3 5.6 202 31.4
## 4 6.6 176 19.7
# function overloading works in R too :)
# x and y could be vectors, it will do element wise comparison, and return a vector
up_green_down_red = function(x, y) {
return(ifelse( x <= y, "#fd626e", "#03d584"))
}
(from R.html)
F(x) = P(X ≤ x) Answer What is P(X < 27.4) when X has Normal distribution with mean 50 and standard deviation of 20 aka N(50,20). N often styled as cursive cap N. if variance (σ) is given, use the relation s.d. = sqrt(sigma) pnorm( q=27.4, mean=50, sd=20, lower.tail=TRUE ) pnorm( 27.4, 50, 20 )
# R use log for natural log, but can redefine/alias ln as:
`ln` = log
qnorm( 0.05, lower.tail=F ) # 1.65
## [1] 1.644854
# eg: got p-value of 0.0668 >= alpha of 0.05, do not reject (NULL) ie not stat significant
# test stat of 1.50 is below the 1.645 cut off, do not reject (NULL)
pnorm( q=.95, mean=0, sd=1, lower.tail=F )
## [1] 0.1710561
pnorm( q=.05, mean=0, sd=1, lower.tail=T )
## [1] 0.5199388
paup1 = 3.841
paup1 = -2*( ln(558162.2) - ln(605229.7) )
pchisq( q=paup1, df=2 )
## [1] 0.07776799
dchisq( x=.995, df=1 )
## [1] 0.2431851
dchisq( .975, df=1 )
## [1] 0.2481357
# most freq used chisq 95% df=1 is 3.841 # JH memorize this
qchisq( 0.95, df=1 )
## [1] 3.841459
qchisq( 0.05, df=1, lower.tail=F )
## [1] 3.841459
pchisq( 3.841, df=1 )
## [1] 0.9499863
pchisq( 3.841459, df=1 )
## [1] 0.95
# pchisq always upper.tail from STAT142... but situation/context??
# pchisq( 3.841459, df=1, lower.tail=F )
qchisq( 0.95, df=2 ) # 5.991465
## [1] 5.991465
qchisq( 0.95, df=1.5 ) # 4.980195
## [1] 4.980195
qchisq( 0.95, df=1.1 ) # 4.083997
## [1] 4.083997
qchisq( 0.95, df=1 ) # 3.841
## [1] 3.841459
qchisq( 0.95, df=0.9 ) # 3.58
## [1] 3.588808
qchisq( 0.95, df=0.5 ) # 2.420232
## [1] 2.420232
qchisq( 0.95, df=0.1 ) # 0.5318646
## [1] 0.5318646
qchisq( 0.95, df=0 ) # 0
## [1] 0
paup1 = -2*( ln(558162.2) - ln(605229.7) )
paup2 = -2*( log(558162.2) - log(605229.7) )
#-ln L 2909255.8 # nst=2
#-ln L 2891898.8 # nst=6
paup_hky = -2*( ln(2909255.8) - ln(2891898.8) )
paup_hky # -0.011968
## [1] -0.011968
the triple-backticks r code block. either don’t have name in them, but if there is, make sure they are unique, or knit will fail. I suppose if not knitting then don’t matter.
The Epidemiologist R Handbook is by Neale Batra et al. and is also available on the web for free.