## Outline

1. We collect a dataset of (user_question, answer_context, dialogue_history -> answer) examples.
2. We duplicate a small portion of the dataset and remove answer_context from the copies.
3. We augment answer_context with non-answer contexts picked by a reasonably well-performing retrieval system: variable ordering, consistent number of contexts.
4. We train the model for exact-match generation.
   - Also evaluate the exact-match ratio.
   - Separately evaluate with full-context questions.
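
The exact-match ratio from the last step could be computed roughly as follows. This is a minimal sketch: the normalization (lowercasing and whitespace collapsing) and the names `normalize` and `exact_match_ratio` are assumptions for illustration, not something this notebook defines.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed, deliberately simple normalization)."""
    return " ".join(text.lower().split())


def exact_match_ratio(predictions: list[str], references: list[str]) -> float:
    """Share of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical model outputs compared against reference answers.
preds = ["The Colts", "cleveland  browns", "the Packers"]
refs = ["The Colts", "Cleveland Browns", "the Packers."]
print(exact_match_ratio(preds, refs))
```

Under this normalization a trailing period still counts as a mismatch, so a stricter or looser normalization changes the reported ratio.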

### 1. Positive contexts collection

In [1]:
import datasets

In [2]:
canard_train = datasets.load_dataset("json", data_files="datasets/CANARD_Release/train.json")["train"]

Using custom data configuration default-8d557d41fc795903
Found cached dataset json (/home/xstefan3/.cache/huggingface/datasets/json/default-8d557d41fc795903/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)



In [3]:
canard_train

Dataset({
    features: ['History', 'QuAC_dialog_id', 'Question', 'Question_no', 'Rewrite'],
    num_rows: 31526
})

In [4]:
canard_train[0]

{'History': ['Johnny Unitas', '1964 MVP season'],
 'QuAC_dialog_id': 'C_2ba58216460d43aa986fc0e897537239_0',
 'Question': 'what team did unitas play for',
 'Question_no': 1,
 'Rewrite': 'what team did Johnny Unitas play for?'}

In [5]:
quac_train = datasets.load_dataset("quac", split="train")

Found cached dataset quac (/home/xstefan3/.cache/huggingface/datasets/quac/plain_text/1.1.0/4170258e7e72d7c81bd6441b3f3489ea1544f0ff226ce61e22bb00c6e9d01fb6)


In [6]:
quac_train_df = quac_train.to_pandas().set_index("dialogue_id", drop=True)
quac_train_df.head(2)

| dialogue_id | wikipedia_page_title | background | section_title | context | turn_ids | questions | followups | yesnos | answers | orig_answers |
|---|---|---|---|---|---|---|---|---|---|---|
| C_69758fcdfc1f46baba0e92c0f3b0919c_1 | Malayali | The Malayali people or Keralite people (also s... | Geographic distribution and population | According to the Indian census of 2001, there ... | [C_69758fcdfc1f46baba0e92c0f3b0919c_1_q#0, C_6... | [Where is Malayali located?, What other langua... | [2, 1, 1, 1, 1, 1, 1] | [2, 2, 2, 2, 2, 0, 2] | {'texts': [['30,803,747 speakers of Malayalam ... | {'texts': ['30,803,747 speakers of Malayalam i... |
| C_69758fcdfc1f46baba0e92c0f3b0919c_0 | Malayali | The Malayali people or Keralite people (also s... | Language and literature | Malayalam is the language spoken by the Malaya... | [C_69758fcdfc1f46baba0e92c0f3b0919c_0_q#0, C_6... | [what language do they speak?, Do they speak a... | [0, 0, 0, 0, 0, 0, 0] | [2, 2, 2, 2, 2, 2, 2] | {'texts': [['Malayalam is the language spoken ... | {'texts': ['Malayalam is the language spoken b... |


In [7]:
quac_train_df.loc['C_2ba58216460d43aa986fc0e897537239_0'][["questions", "answers"]].values

array([array(['what team did unitas play for',
              'how many games did the colts win',
              'who did they play in the playoffs', 'did they win the super bowl',
              'who did they play in the super bowl', 'what were unitas stats'],
             dtype=object)                                                       ,
       {'texts': array([array(['The Colts'], dtype=object),
              array(['the Colts ran off 10 straight victories to finish with a 12-2 record.'],
                    dtype=object)                                                             ,
              array(['Cleveland Browns'], dtype=object),
              array(['losing 27-0.'], dtype=object),
              array(['the Packers.'], dtype=object),
              array(['Gary Cuozzo also suffered a season-ending injury the following'],
                    dtype=object)                                                      ],
             dtype=object), 'answer_starts': array([array([920], d

In [8]:
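def answer_for_question(questions: list, answers: dict, question: str) -> str:
    """Return the first reference answer for `question` within one QuAC dialogue."""
    # each turn holds a list of reference answers; keep the first one per turn
    first_answers = [turn_answers[0] for turn_answers in answers["texts"]]
    assert question in questions
    assert len(first_answers) == len(questions)

    # if the same question text occurs twice in a dialogue, the first occurrence wins
    return next(a for q, a in zip(questions, first_answers) if q == question)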
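
To illustrate the lookup, the sketch below runs a compact re-declaration of the function on toy data shaped like the QuAC row printed in cell [7]. The `toy_` names are hypothetical, and the function is repeated only so that the snippet is self-contained.

```python
def answer_for_question(questions, answers, question):
    """Return the first reference answer of the turn whose question text equals `question`."""
    first_answers = [turn_answers[0] for turn_answers in answers["texts"]]
    assert len(first_answers) == len(questions)
    return first_answers[list(questions).index(question)]


# Toy data shaped like one QuAC dialogue (values copied from the output of cell [7]).
toy_questions = ["what team did unitas play for", "how many games did the colts win"]
toy_answers = {"texts": [["The Colts"],
                         ["the Colts ran off 10 straight victories to finish with a 12-2 record."]]}

print(answer_for_question(toy_questions, toy_answers, "what team did unitas play for"))
# prints: The Colts
```

The match is on the exact question string, so it relies on CANARD's `Question` field being identical to the corresponding entry in QuAC's `questions`.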

In [9]:
canard_train = canard_train.map(lambda row: 
{
    "true_contexts": quac_train_df.loc[row["QuAC_dialog_id"]]["context"],
    "true_page": quac_train_df.loc[row["QuAC_dialog_id"]]["wikipedia_page_title"],
    "answer": answer_for_question(*quac_train_df.loc[row["QuAC_dialog_id"]][["questions", "answers"]].values, row["Question"])
})


In [10]:
canard_train[0]

{'History': ['Johnny Unitas', '1964 MVP season'],
 'QuAC_dialog_id': 'C_2ba58216460d43aa986fc0e897537239_0',
 'Question': 'what team did unitas play for',
 'Question_no': 1,
 'Rewrite': 'what team did Johnny Unitas play for?',
 'true_contexts': "The 1964 season would see the Colts return to the top of the Western Conference. After dropping their season opener to the Minnesota Vikings, the Colts ran off 10 straight victories to finish with a 12-2 record. The season was one of Unitas' best as he finished with 2,824 yards passing, a league-best 9.26 yards per pass attempt, 19 touchdown passes and only 6 interceptions. He was named the NFL's Most Valuable Player by the AP and UPI for a second time. However, the season would end on a disappointing note for the Colts as they were upset by the Cleveland Browns in the 1964 NFL Championship Game, losing 27-0.  Unitas resumed his torrid passing in 1965, as he threw for 2,530 yards, 23 touchdowns and finished with a league-high and career best 97

In [11]:
import random

canard_negative_subsample = canard_train.select(random.sample(range(len(canard_train)), k=len(canard_train) // 4))

len(canard_negative_subsample), len(canard_train)

(7881, 31526)

In [12]:
canard_negative_subsample = canard_negative_subsample.map(lambda row: {"true_contexts": ""})


In [13]:
canard_negative_subsample[0]

{'History': ["Dinesh D'Souza",
  "Hillary's America: The Secret History of the Democratic Party",
  "Is Hillary's America a documentary?",
  "On July 25, 2016, D'Souza released the documentary film Hillary's America:",
  'Was it released in theaters?',
  "I don't know.",
  'What was the documentary about?',
  'The film criticizes the Democratic Party and Hillary Clinton,'],
 'QuAC_dialog_id': 'C_31bfdcd402d44289a6206d9b34765869_0',
 'Question': 'How did the critics feel about it?',
 'Question_no': 4,
 'Rewrite': "How did the critics feel about the film Hillary's America?",
 'true_contexts': '',
 'true_page': "Dinesh D'Souza",
 'answer': 'The film was universally panned by professional film critics.'}

In [14]:
canard_train = datasets.concatenate_datasets([canard_train, canard_negative_subsample])

len(canard_train)

39407
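
The sizes reported above can be checked with a few lines of arithmetic: sampling a quarter of the original rows means that roughly a fifth of the concatenated dataset has no true context. The variable names are illustrative only.

```python
n_original = 31526               # rows in the CANARD training split (cell [3])
n_no_context = n_original // 4   # subsample with the context removed (cell [11])
n_total = n_original + n_no_context

print(n_no_context, n_total)               # 7881 39407
print(round(n_no_context / n_total, 3))    # 0.2
```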

### 2. Negative contexts collection

We use BM25 to collect a realistic set of contexts that an IR search would retrieve for each question.
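
For intuition about the ranking, here is a minimal pure-Python sketch of BM25 scoring. It is not the `BM25PlusSystem` used below, whose internals are not shown in this notebook. The whitespace tokenizer, the parameter defaults, and the smoothed IDF are assumptions. Setting `delta` above zero adds the constant lower bound that distinguishes the BM25+ variant.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75,
                delta: float = 0.0) -> list[float]:
    """Score every document against the query; delta > 0 gives a BM25+ style score."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in set(query.lower().split()):
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * (tf * (k1 + 1) / (tf + k1 * length_norm) + delta)
        scores.append(score)
    return scores


docs = [
    "the colts ran off ten straight victories",
    "malayalam is the language spoken by the malayali people",
    "le guin was influenced by fantasy writers",
]
scores = bm25_scores("what team did the colts play for", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, docs[best])
```

Rare terms such as "colts" receive a higher IDF than frequent ones such as "the", which is why the first passage ranks highest for this query.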

In [15]:
from BM25_irsystem import BM25PlusSystem, SimpleDocProcessing

In [16]:
from pv211_utils.trec.entities import TrecDocumentBase, TrecQueryBase

In [17]:
documents = {str(i): TrecDocumentBase(document_id=i, body=context) for i, context in enumerate(quac_train_df.context)}

irsystem = BM25PlusSystem(documents, preprocessing=SimpleDocProcessing())

In [18]:
def get_negative_question_responses(question: str, num_responses: int = 5) -> list:
    """Return up to `num_responses` unique top-ranked contexts for `question`."""
    # TODO: add contexts' titles
    # Note: the true context is not filtered out here, so it may appear among the results.
    unique_responses = []

    for response_doc in irsystem.search(TrecQueryBase(query_id=0, title="", body=question, narrative="")):
        if response_doc.body not in unique_responses:
            unique_responses.append(response_doc.body)
        if len(unique_responses) >= num_responses:
            break

    return unique_responses
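
The de-duplication loop can be exercised without the IR system by replacing the search with a fixed ranked list. In this sketch, `take_unique` and `ranked` are hypothetical stand-ins, and the `seen` set is an optional variation on the list membership test used above.

```python
from typing import Iterable


def take_unique(ranked_bodies: Iterable[str], num_responses: int = 5) -> list[str]:
    """Keep the first `num_responses` distinct texts, preserving the ranking order."""
    unique = []
    seen = set()  # set membership is O(1), unlike `in` on a list
    for body in ranked_bodies:
        if body not in seen:
            seen.add(body)
            unique.append(body)
        if len(unique) >= num_responses:
            break
    return unique


# Stand-in for the ranked output of a search that contains repeated passages.
ranked = ["ctx A", "ctx B", "ctx A", "ctx C", "ctx B", "ctx D"]
print(take_unique(ranked, num_responses=3))
# prints: ['ctx A', 'ctx B', 'ctx C']
```

If the ranked list holds fewer distinct passages than requested, the function returns all of them rather than failing.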

In [19]:
canard_train_augm = canard_train.map(
    lambda row: {"retrieved_contexts": get_negative_question_responses(row["Question"], num_responses=4) 
                                                  if row["true_contexts"] else get_negative_question_responses(row["Question"], num_responses=5)},
    # keep_in_memory=True,
    # num_proc=60
)
canard_train_augm.save_to_disk("canard_train_augm_full.hf")


In [20]:
datasets.load_from_disk("canard_train_augm_full.hf")

Dataset({
    features: ['History', 'QuAC_dialog_id', 'Question', 'Question_no', 'Rewrite', 'true_contexts', 'true_page', 'answer', 'retrieved_contexts'],
    num_rows: 39407
})
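
The outline asks for a variable ordering and a consistent number of contexts. The notebook does not show how `true_contexts` and `retrieved_contexts` are merged, so the following is only one possible sketch: the function name, the per-example seed, and the removal of a retrieved copy of the true context are all assumptions.

```python
import random


def assemble_contexts(true_context: str, retrieved: list[str], seed: int) -> list[str]:
    """Merge the true context (if any) with the retrieved ones and shuffle the result."""
    contexts = [c for c in retrieved if c != true_context]
    if true_context:
        contexts.append(true_context)
    # A per-example seed keeps the shuffle reproducible while the position
    # of the true context still varies across examples.
    random.Random(seed).shuffle(contexts)
    return contexts


# Hypothetical row with a true context and four retrieved contexts.
row = {"true_contexts": "gold passage",
       "retrieved_contexts": ["neg 1", "neg 2", "neg 3", "neg 4"]}
contexts = assemble_contexts(row["true_contexts"], row["retrieved_contexts"], seed=0)
print(len(contexts), "gold passage" in contexts)
# prints: 5 True
```

Rows without a true context carry five retrieved contexts, so both kinds of rows yield five contexts. If the retriever already returned the true context, this sketch produces a list that is one item shorter.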

## Test dataset generation

In [21]:
import datasets

# make sure that we test with conversations that the model has not seen before
canard_test = datasets.load_dataset("json", data_files="datasets/CANARD_Release/test.json")["train"]
quac_test = datasets.load_dataset("quac", split="validation")
quac_test_df = quac_test.to_pandas().set_index("dialogue_id", drop=True)

Using custom data configuration default-a7ce477a9c57a36e
Found cached dataset json (/home/xstefan3/.cache/huggingface/datasets/json/default-a7ce477a9c57a36e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)



Found cached dataset quac (/home/xstefan3/.cache/huggingface/datasets/quac/plain_text/1.1.0/4170258e7e72d7c81bd6441b3f3489ea1544f0ff226ce61e22bb00c6e9d01fb6)
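
Whether the test conversations are really unseen can be verified by comparing dialogue ids across the splits. The sketch below uses toy ids taken from the outputs in this notebook, and `overlapping_ids` is a hypothetical helper. In practice the ids could be read from the `QuAC_dialog_id` columns of the two CANARD splits.

```python
def overlapping_ids(train_ids, test_ids) -> set:
    """Return dialogue ids that occur in both splits (an empty set means no leakage)."""
    return set(train_ids) & set(test_ids)


# Toy ids; in the notebook these could come from the QuAC_dialog_id columns.
train_ids = ["C_2ba58216460d43aa986fc0e897537239_0", "C_31bfdcd402d44289a6206d9b34765869_0"]
test_ids = ["C_420bfcf5d8b344a480ac654f08ee55ad_1"]

overlap = overlapping_ids(train_ids, test_ids)
print(len(overlap))
# prints: 0
```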


In [22]:
# check the match on QuAC_dialog_id
canard_test[102]

{'History': ['Ursula K. Le Guin',
  'Influences',
  'what influenced her?',
  'Le Guin was influenced by fantasy writers,',
  'who were they?',
  'J. R. R. Tolkien, by science fiction writers,',
  'how did they influence her?',
  'her influences, she replied: Once I learned to read, I read everything. I read all the famous fantasies'],
 'QuAC_dialog_id': 'C_420bfcf5d8b344a480ac654f08ee55ad_1',
 'Question': 'which other fantasy writer influenced her?',
 'Question_no': 4,
 'Rewrite': 'Besides J. R. R. Tolkien which other fantasy writer influenced Le Guin?'}

In [23]:
quac_test_df.loc["C_420bfcf5d8b344a480ac654f08ee55ad_1"]

wikipedia_page_title                                    Ursula K. Le Guin
background              Ursula Kroeber Le Guin (; October 21, 1929 - J...
section_title                                                  Influences
context                 Le Guin was influenced by fantasy writers, inc...
turn_ids                [C_420bfcf5d8b344a480ac654f08ee55ad_1_q#0, C_4...
questions               [what influenced her?, who were they?, how did...
followups                                     [0, 0, 0, 0, 1, 0, 0, 0, 1]
yesnos                                        [2, 2, 2, 2, 2, 0, 2, 2, 2]
answers                 {'texts': [['Le Guin was influenced by fantasy...
orig_answers            {'texts': ['Le Guin was influenced by fantasy ...
Name: C_420bfcf5d8b344a480ac654f08ee55ad_1, dtype: object

In [24]:
canard_test = canard_test.map(lambda row: 
{
    "true_contexts": quac_test_df.loc[row["QuAC_dialog_id"]]["context"],
    "true_page": quac_test_df.loc[row["QuAC_dialog_id"]]["wikipedia_page_title"],
    "answer": answer_for_question(*quac_test_df.loc[row["QuAC_dialog_id"]][["questions", "answers"]].values, row["Question"])
})

Loading cached processed dataset at /home/xstefan3/.cache/huggingface/datasets/json/default-a7ce477a9c57a36e/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-e93159eee5490cc9.arrow


### 2. Negative contexts collection


In [25]:
# We initialize a new IR system over the test contexts only (pessimistic scenario)
from BM25_irsystem import BM25PlusSystem, SimpleDocProcessing

In [26]:
from pv211_utils.trec.entities import TrecDocumentBase, TrecQueryBase

In [27]:
documents = {str(i): TrecDocumentBase(document_id=i, body=context) for i, context in enumerate(quac_test_df.context)}

irsystem = BM25PlusSystem(documents, preprocessing=SimpleDocProcessing())

In [28]:
canard_test_augm = canard_test.map(
    lambda row: {"retrieved_contexts": get_negative_question_responses(row["Question"], num_responses=4) 
                                                  if row["true_contexts"] else get_negative_question_responses(row["Question"], num_responses=5)},
    # keep_in_memory=True,
    # num_proc=60
)
canard_test_augm.save_to_disk("canard_test_augm_full.hf")

  0%|          | 0/5571 [00:00<?, ?ex/s]

Saving the dataset (0/1 shards):   0%|          | 0/5571 [00:00<?, ? examples/s]