# Exact Match Scores

In [2]:
library("ggpubr")
library(readr)
library(ggplot2)
library(tidyverse)
library(ARTool)
library(emmeans)
library(multcomp)
library(car)
library(rstatix)

Loading required package: ggplot2

-- Attaching packages --------------------------------------- tidyverse 1.3.1 --

v tibble  3.1.5     v dplyr   1.0.7
v tidyr   1.1.4     v stringr 1.4.0
v purrr   0.3.4     v forcats 0.5.1

-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Loading required package: mvtnorm

Loading required package: survival

Loading required package: TH.data

Loading required package: MASS


Attaching package: 'MASS'


The following object is masked from 'package:dplyr':

    select



Attaching package: 'TH.data'


The following object is masked from 'package:MASS':

    geyser


Loading required package: carData


Attaching package: 'car'


Th

In [3]:
em_scores <- read_csv("em_scores.csv") %>%
    rename(question = `...1`) %>%
    pivot_longer(!question, names_to=c("retriever", "reader"), names_sep="_", values_to="em")

em_scores$retriever <- as.factor(em_scores$retriever)
em_scores$reader <- as.factor(em_scores$reader)

head(em_scores)

New names:
* `` -> ...1

Rows: 59 Columns: 5
-- Column specification --------------------------------------------------------
Delimiter: ","
dbl (5): ...1, faiss_dpr, faiss_longformer, es_dpr, es_longformer

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.


question,retriever,reader,em
<dbl>,<fct>,<fct>,<dbl>
0,faiss,dpr,0
0,faiss,longformer,0
0,es,dpr,0
0,es,longformer,0
1,faiss,dpr,0
1,faiss,longformer,0


To decide which statistical tests are appropriate, we first check for normality with a Shapiro-Wilk test. Every configuration with FAISS as retriever or DPR as reader had zero exact matches, so the Shapiro-Wilk test cannot be computed for those groups. Nonetheless, a distribution in which all values are identical is not normally distributed. For the remaining groups, the results below show $p$-values below 0.001, so we reject the null hypothesis of normality and conclude that none of the exact match scores are normally distributed.

In [14]:
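# Shapiro-Wilk test per retriever. Groups whose scores are all zero are
# dropped first, because shapiro.test() fails on constant data.
em_scores %>%
    dplyr::select(!question) %>%  # qualified: MASS masks dplyr's select()
    group_by(retriever) %>%
    filter(sum(em) > 0) %>%
    summarise(sw.stat = shapiro.test(em)$statistic,
              sw.p = shapiro.test(em)$p.value)
# Same test per reader.
em_scores %>%
    dplyr::select(!question) %>%
    group_by(reader) %>%
    filter(sum(em) > 0) %>%
    summarise(sw.stat = shapiro.test(em)$statistic,
              sw.p = shapiro.test(em)$p.value)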

retriever,sw.stat,sw.p
<fct>,<dbl>,<dbl>
es,0.2503666,6.788451e-22


reader,sw.stat,sw.p
<fct>,<dbl>,<dbl>
longformer,0.2503666,6.788451e-22
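
The claim that the Shapiro-Wilk test cannot be computed for an all-zero group can be checked with a minimal sketch in base R. The vector `all_zero` and its length of 59 are illustrative stand-ins for one of the zero-score groups, not values taken from the data above.

```r
# Hypothetical stand-in for a group in which every score is zero.
all_zero <- rep(0, 59)

# shapiro.test() raises an error on constant input, so catch it
# and keep the message instead of stopping the notebook.
sw_result <- tryCatch(
    shapiro.test(all_zero),
    error = function(e) conditionMessage(e)
)

# A character string here means the test could not be computed.
is.character(sw_result)
```

This is why the cell above filters out groups with `sum(em) > 0` being false before calling `shapiro.test()`.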


Since our data are not normally distributed, we cannot use a standard ANOVA to compare our results. Instead, we use an aligned rank transform (ART) test, which is a non-parametric version of a factorial repeated measures ANOVA.
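
To make the idea behind the aligned rank transform more concrete, the following is a rough base-R sketch of the align-and-rank steps for a 2x2 design. It is not how `ARTool` is implemented internally; the data frame `toy`, its `score` column, and all helper variable names are made up for illustration, and the design is assumed to be balanced.

```r
# Hypothetical balanced 2x2 toy data (not the scores analysed above).
set.seed(1)
toy <- expand.grid(rep = 1:10,
                   retriever = c("es", "faiss"),
                   reader = c("dpr", "longformer"))
toy$score <- rexp(nrow(toy)) + ifelse(toy$retriever == "es", 0.5, 0)

# Means needed for alignment.
grand  <- mean(toy$score)
cell   <- ave(toy$score, toy$retriever, toy$reader)  # cell means
m_retr <- ave(toy$score, toy$retriever)              # retriever means
m_read <- ave(toy$score, toy$reader)                 # reader means

# Step 1: residuals, i.e. each score with all effects stripped out.
residual <- toy$score - cell

# Step 2: add back only the estimated effect of interest.
toy$aligned_retriever   <- residual + (m_retr - grand)
toy$aligned_reader      <- residual + (m_read - grand)
toy$aligned_interaction <- residual + (cell - m_retr - m_read + grand)

# Step 3: rank the aligned scores (ties receive average ranks).
toy$rank_retriever <- rank(toy$aligned_retriever)

# Step 4: run a full factorial ANOVA on the ranks, but read off only
# the row for the effect the scores were aligned for.
fit <- anova(lm(rank_retriever ~ retriever * reader, data = toy))
fit["retriever", ]
```

The same ranking and ANOVA steps would be repeated for `aligned_reader` and `aligned_interaction`, each time interpreting only the matching row.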

In [4]:
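# Aligned rank transform on the exact match scores (the `em` column).
model.acc <- art(em ~ retriever * reader, data = em_scores)
anova(model.acc)
# Post-hoc contrasts for each main effect.
art.con(model.acc, ~ retriever)
art.con(model.acc, ~ reader)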

,Term,Df,Df.res,Sum Sq,Sum Sq.res,F value,Pr(>F)
,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
retriever,retriever,1,232,11564,263081,10.1978,0.001600976
reader,reader,1,232,11564,263081,10.1978,0.001600976
retriever:reader,retriever:reader,1,232,11564,263081,10.1978,0.001600976


NOTE: Results may be misleading due to involvement in interactions



 contrast   estimate   SE  df t.ratio p.value
 es - faiss       14 4.38 232   3.193  0.0016

Results are averaged over the levels of: reader 

NOTE: Results may be misleading due to involvement in interactions



 contrast         estimate   SE  df t.ratio p.value
 dpr - longformer      -14 4.38 232  -3.193  0.0016

Results are averaged over the levels of: retriever 

From these results, we can see that both the retriever and the reader have a significant effect on the exact match score ($F(1, 232) = 10.20$, $p = 0.0016$ for both). However, there is also a significant interaction between the retriever and the reader ($F(1, 232) = 10.20$, $p = 0.0016$), so the main effects should be interpreted with caution. The three tests coincide because only the combination of ElasticSearch and Longformer produced any exact matches. The post-hoc analysis of contrasts shows that ElasticSearch performs better than FAISS ($p = 0.0016$) and Longformer performs better than DPR ($p = 0.0016$).
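
The identical rows in the ANOVA table above can look suspicious at first. The sketch below suggests why this can happen in a balanced 2x2 design when only one cell contains non-zero scores. It uses a plain ANOVA on raw values rather than ART, and the data frame `toy` with five hypothetical exact matches is invented for illustration.

```r
# Hypothetical balanced 2x2 design: 59 scores per cell, all zero
# except five exact matches in the es + longformer cell.
toy <- expand.grid(question = 1:59,
                   retriever = c("es", "faiss"),
                   reader = c("dpr", "longformer"))
toy$em <- 0
hit <- toy$retriever == "es" &
    toy$reader == "longformer" &
    toy$question <= 5
toy$em[hit] <- 1

# With a single non-zero cell, both main effects and the interaction
# are driven by the same cell mean, so their sums of squares match.
tab <- anova(lm(em ~ retriever * reader, data = toy))
tab[c("retriever", "reader", "retriever:reader"), c("Sum Sq", "F value")]
```

In such a design the main effects and the interaction cannot be separated, which is one more reason to read the main effects with the interaction in mind.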