dabestr/vignettes/tutorial_basics.Rmd at 26ff23f9f537fa47c170672c55f532784687a77b · ACCLAB/dabestr · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
---
title: "Tutorial: Basics"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Tutorial: Basics}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette documents the basic functionalities of dabestr. It illustrates the
order in which the functions are meant to be used procedurally.

The dataset is first processed into the dabestr format using the `load()` function. Next, the effect sizes are calculated using the effect_size() function. Finally, the estimation plots are generated using `dabest_plot()`.

```{r setup, warning = FALSE, message = FALSE}
library(dabestr)
library(dplyr)
```

## Create dataset for demo
Here, we create a dataset to illustrate how dabest functions. In this dataset,
each column corresponds to a group of observations.
```{r}
set.seed(12345) # Fix the seed so the results are replicable.
# pop_size = 10000 # Size of each population.
N <- 20

# Create samples
c1 <- rnorm(N, mean = 3, sd = 0.4)
c2 <- rnorm(N, mean = 3.5, sd = 0.75)
c3 <- rnorm(N, mean = 3.25, sd = 0.4)

t1 <- rnorm(N, mean = 3.5, sd = 0.5)
t2 <- rnorm(N, mean = 2.5, sd = 0.6)
t3 <- rnorm(N, mean = 3, sd = 0.75)
t4 <- rnorm(N, mean = 3.5, sd = 0.75)
t5 <- rnorm(N, mean = 3.25, sd = 0.4)
t6 <- rnorm(N, mean = 3.25, sd = 0.4)

# Add a `gender` column for coloring the data.
gender <- c(rep("Male", N / 2), rep("Female", N / 2))

# Add an `id` column for paired data plotting.
id <- 1:N

# Combine samples and gender into a DataFrame.
df <- tibble::tibble(
  `Control 1` = c1, `Control 2` = c2, `Control 3` = c3,
  `Test 1` = t1, `Test 2` = t2, `Test 3` = t3, `Test 4` = t4, `Test 5` = t5, `Test 6` = t6,
  Gender = gender, ID = id
)

df <- df %>%
  tidyr::gather(key = Group, value = Measurement, -ID, -Gender)
```

Note that we have 9 groups (3 *Control* samples and 6 *Test* samples). Our dataset also has a non-numerical column indicating gender, and another column indicating the identity of each observation.

This is known as a *long* dataset. See this [writeup](https://simonejdemyr.com/r-tutorials/basics/wide-and-long/) for more details.
```{r}
knitr::kable(head(df))
```

## Step 1: Loading Data
Before generating estimation plots and deriving confidence intervals for our effect sizes, we must first load the data and the corresponding groups.

To achieve this, we merely provide the DataFrame to the `load()` function, along with 'x' and 'y' representing the columns containing the treatment groups and measurement values, respectively. Additionally, we need to specify the two groups you wish to compare in the `idx` argument, either as a vector or a list.
```{r}
two_groups_unpaired <- load(df,
  x = Group, y = Measurement,
  idx = c("Control 1", "Test 1")
)
```

Printing this `dabestr` object gives you a gentle greeting, as well as the comparisons that can be computed.
```{r}
print(two_groups_unpaired)
```

### Changing statistical parameters
You can change the width of the confidence interval that will be produced by manipulating the `ci` argument.
```{r}
two_groups_unpaired_ci90 <- load(df,
  x = Group, y = Measurement,
  idx = c("Control 1", "Test 1"), ci = 90
)
```

```{r}
print(two_groups_unpaired_ci90)
```

## Step 2: Effect sizes
`dabestr` features a range of effect sizes:

- the mean difference (`mean_diff()`)
- the median difference (`median_diff()`)
- Cohen’s d (`cohens_d()`)
- Hedges’ g (`hedges_g()`)
- Cliff’s delta (`cliffs_delta()`)

The output of the `load()` function, a `dabest` object, is then passed into
these `effect_size()` functions as a parameter.

```{r}
two_groups_unpaired.mean_diff <- mean_diff(two_groups_unpaired)

print(two_groups_unpaired.mean_diff)
```

For each comparison, the type of effect size is reported (here, it’s the “unpaired mean difference”). The confidence interval is reported as:

_[confidenceIntervalWidth LowerBound, UpperBound]_

This confidence interval is generated through bootstrap resampling. See [Bootstrap Confidence Intervals](https://acclab.github.io/DABEST-python/blog/posts/bootstraps/bootstraps.html) for more details.

### P-values and statistical tests
Permutation P values are only provided to allow analysts to satisfy a customary requirement of scientific journals. DABEST's provision of P values does not constitute an endorsement of P values or null-hypothesis significance testing (NHST). If users need to include these in a study, we recommend that they:

1. avoid performing NHST, i.e. do not compare P to an alpha.
2. never refer to the P values in the 'Results' text.
3. state in their Methods section that *"No null-hypothesis significance testing was performed; P values are provided for legacy purposes only."*

## Step 3: Producing estimation plots
To produce a **Gardner-Altman estimation plot**, simply use the `dabest_plot()` function. You can read more about its genesis and design inspiration at [Robust and Beautiful Statistical Visualization](https://acclab.github.io/DABEST-python/blog/posts/robust-beautiful/robust-beautiful.html).

`dabest_plot()` requires only one compulsory parameter to run: the `dabest_effectsize_obj` obtained from the `effect_size()` function. This means that you can quickly create plots for various effect sizes with ease.
```{r}
dabest_plot(two_groups_unpaired.mean_diff)
```

Instead of a Gardner-Altman plot, you can produce a **Cumming estimation plot** by setting `float_contrast = FALSE` in the `dabest_plot()` function. This will plot the bootstrap effect sizes below the raw data, and will also display the mean (gap) and ± standard deviation of each group (vertical ends) as gapped lines. This design was inspired by Edward Tufte’s dictum to maximise the data-ink ratio.

```{r, eval = FALSE}
dabest_plot(two_groups_unpaired.mean_diff,
  float_contrast = FALSE,
  contrast_ylim = c(-0.3, 1.3)
)
```

```{r, echo = FALSE}
pp_plot <- dabest_plot(two_groups_unpaired.mean_diff,
  float_contrast = FALSE,
  contrast_ylim = c(-0.3, 1.3)
)

cowplot::plot_grid(
  plotlist = list(NULL, pp_plot, NULL),
  nrow = 1,
  ncol = 3,
  rel_widths = c(2.5, 5, 2.5)
)
```

The `dabestr` package also implements a variety of estimation plot designs aimed at depicting common experimental designs.

The **multi-two-group estimation plot** tiles two or more Cumming plots horizontally. To create this plot, you can pass a _nested list_ to the `idx` parameter when invoking the `load()` function for the first time.

As a result, the lower axes in the Cumming plot is effectively a [forest plot](https://en.wikipedia.org/wiki/Forest_plot), commonly used in meta-analyses to aggregate and compare data from different experiments.

```{r, warning = FALSE}
multi_2group <- load(df,
  x = Group, y = Measurement,
  idx = list(
    c("Control 1", "Test 1"),
    c("Control 2", "Test 2")
  )
)
multi_2group %>%
  mean_diff() %>%
  dabest_plot()
```

The **shared control plot** displays another common experimental paradigm, where several test samples are compared against a common reference sample.

This type of Cumming plot is automatically generated if the vector passed to the parameter `idx` has more than two data columns.
```{r, warnings = FALSE}
shared_control <- load(df,
  x = Group, y = Measurement,
  idx = c(
    "Control 1", "Test 1", "Test 2", "Test 3",
    "Test 4", "Test 5", "Test 6"
  )
)

print(shared_control)
```

```{r, warnings = FALSE}
shared_control.mean_diff <- mean_diff(shared_control)

print(shared_control.mean_diff)
```

```{r}
dabest_plot(shared_control.mean_diff)
```

The `dabestr` package empowers you to robustly perform statistical analyses and elegantly present complex visualizations.

```{r, warnings = FALSE}
multi_groups <- load(df,
  x = Group, y = Measurement,
  idx = list(
    c("Control 1", "Test 1"),
    c("Control 2", "Test 2", "Test 3"),
    c("Control 3", "Test 4", "Test 5", "Test 6")
  )
)

print(multi_groups)
```

```{r, warnings = FALSE}
multi_groups.mean_diff <- mean_diff(multi_groups)

print(multi_groups.mean_diff)
```

```{r, warnings = FALSE}
dabest_plot(multi_groups.mean_diff)
```

***
## **NOTE: Using wide datasets**

>Currently `dabestr` does not support the use of 'wide' data. To convert datasets from 'wide' to 'long'/'tidy', consider taking a look at [gather()](https://tidyr.tidyverse.org/reference/gather.html) as part of the [tidyr](https://tidyr.tidyverse.org) package.