# Formatting your dialogue data for the Llamipa parser

This is a collection of scripts that can regenerate the Llamipa data from the MSDC, or help you format your
own dialogue data for use with the Llamipa parser.

To start, the dialogue data must follow the MSDC format, where each dialogue is a JSON object with "id"
and "edus" fields. If the dialogue is already annotated for discourse structure, it also has a "relations" field
(see the corpus: https://huggingface.co/datasets/linagora/MinecraftStructuredDialogueCorpus).

**Make sure to include a dummy 0 move, "Mission has started.",
at the beginning of each dialogue.**

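As a rough illustration of the requirements above, the sketch below checks one dialogue object for the required fields and for the dummy first move. It is not part of the provided scripts. The helper name `check_dialogue` and the assumption that each EDU is an object with its utterance stored under a `text` key are made up for this example, so compare them with the corpus files before relying on them.

```python
DUMMY_TEXT = "mission has started"


def check_dialogue(dialogue):
    """Return a list of problems found in one dialogue object (empty if none)."""
    problems = []
    for field in ("id", "edus"):
        if field not in dialogue:
            problems.append(f"missing field: {field}")

    edus = dialogue.get("edus") or []
    if "edus" in dialogue and not edus:
        problems.append("empty edus list")
    if edus:
        # Assumption: each EDU is a dict whose utterance is stored under "text".
        first = str(edus[0].get("text", "")).strip().rstrip(".").lower()
        if first != DUMMY_TEXT:
            problems.append("first EDU is not the dummy move")
    return problems


good = {
    "id": "log3566",
    "edus": [
        {"turn": 0, "speaker": "Builder", "text": "Mission has started."},
        {"turn": 1, "speaker": "Architect", "text": "Hi!"},
    ],
}
bad = {"id": "my_dialogue", "edus": [{"turn": 0, "speaker": "Architect", "text": "Hi!"}]}

print(check_dialogue(good))
print(check_dialogue(bad))
```

The check only reports problems and does not insert the dummy move itself, because adding an EDU at position 0 may also require renumbering turns and relation endpoints in annotated data.
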
## STEP 1:

Use the dialogue JSON data to create an intermediate format, where each
speaker turn is a single object containing all discourse units.

    [
      {
        "id": "log3566",
        "turns": [
          {
            "turn": 0,
            "speaker": "Builder",
            "edus": [
              "Mission has started."
            ]
          },
          {
            "turn": 1,
            "speaker": "Architect",
            "edus": [
              "Hi!"
            ]
          },
          {
            "turn": 2,
            "speaker": "Builder",
            "edus": [
              "Hi", "What are we building today?"
            ]
          },...
        ]
      }
    ]

The `format.py` script takes the dialogue JSON as input and outputs a `<turns>.json` file.

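To make the transformation in this step concrete, here is a minimal sketch of how a flat EDU list could be grouped into speaker turns. It is an illustration, not the code in `format.py`. The function name `edus_to_turns` and the per-EDU keys `turn`, `speaker` and `text` are assumptions for this example.

```python
import json


def edus_to_turns(dialogue):
    """Group a dialogue's flat EDU list into one object per speaker turn."""
    turns = []
    for edu in dialogue["edus"]:
        # Assumption: EDUs arrive in order and carry "turn", "speaker" and "text" keys.
        if turns and turns[-1]["turn"] == edu["turn"]:
            # Same turn as the previous EDU: extend that turn's list.
            turns[-1]["edus"].append(edu["text"])
        else:
            turns.append(
                {"turn": edu["turn"], "speaker": edu["speaker"], "edus": [edu["text"]]}
            )
    return {"id": dialogue["id"], "turns": turns}


dialogue = {
    "id": "log3566",
    "edus": [
        {"turn": 0, "speaker": "Builder", "text": "Mission has started."},
        {"turn": 1, "speaker": "Architect", "text": "Hi!"},
        {"turn": 2, "speaker": "Builder", "text": "Hi"},
        {"turn": 2, "speaker": "Builder", "text": "What are we building today?"},
    ],
}

turns_doc = edus_to_turns(dialogue)
print(json.dumps([turns_doc], indent=2))
```
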
Note: The script assumes that non-linguistic actions have the same format as in the MSDC, e.g.:

    {
      "turn": 22,
      "speaker": "Builder",
      "edus": ["place purple 5 1 -5, place purple 5 1 -4, place purple 5 2 -5, place purple 4 1 -5"]
    }

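If you need to produce or sanity-check action strings like the one above, the following sketch splits one action EDU into individual moves. It assumes each move is a comma-separated chunk of the form `verb color n n n` with three integer coordinates, which is inferred from the example only. The helper name `split_actions` is hypothetical.

```python
def split_actions(edu_text):
    """Split an action EDU string into (verb, color, n1, n2, n3) tuples."""
    moves = []
    for chunk in edu_text.split(","):
        parts = chunk.split()
        # Assumption: every move has exactly five whitespace-separated tokens.
        if len(parts) != 5:
            raise ValueError(f"unexpected action format: {chunk!r}")
        verb, color = parts[0], parts[1]
        n1, n2, n3 = (int(p) for p in parts[2:])
        moves.append((verb, color, n1, n2, n3))
    return moves


actions = split_actions("place purple 5 1 -5, place purple 5 1 -4")
print(actions)
```
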
## STEP 2:

If using *UNANNOTATED* data, use the `<turns>.json` to create a `<parser>.jsonl`
file formatted for Llamipa. Script: `format_unannotated.py`

If using *ANNOTATED* data, use the `<turns>.json` and the original data JSON to create a `<parser>.jsonl`
file formatted for Llamipa. Script: `format_annotated.py`

**Make sure that the relation type representations in the `map_rels_str` dictionary in the `format_rels`
function match those in your data.**

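One way to check this before running the script is to list the relation types that occur in your data but have no entry in the mapping. The sketch below assumes each relation is an object with a `type` key. The two-entry mapping and its labels are placeholders for illustration, not the actual contents of `map_rels_str`, and the helper name `unmapped_relation_types` is hypothetical.

```python
def unmapped_relation_types(dialogues, map_rels_str):
    """Return the sorted relation types found in the data but absent from the mapping."""
    # Assumption: each relation is a dict with its label stored under "type".
    seen = {rel["type"] for d in dialogues for rel in d.get("relations", [])}
    return sorted(seen - set(map_rels_str))


# Placeholder mapping for illustration only; copy the real one from format_annotated.py.
example_map = {"Acknowledgement": "ACK", "Elaboration": "ELAB"}

dialogues = [
    {
        "id": "log3566",
        "relations": [
            {"x": 1, "y": 2, "type": "Acknowledgement"},
            {"x": 2, "y": 3, "type": "Question-answer_pair"},
        ],
    }
]

missing = unmapped_relation_types(dialogues, example_map)
print(missing)
```

Any type printed here needs either a new entry in the mapping or a rename in your data.
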
The `DISTANCE` variable is set to 15 EDUs, the value used for Llamipa training
and testing, but it can be changed to support contexts of different lengths.

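As an illustration of what a distance limit of this kind implies, the sketch below shows one plausible reading: the context for a target EDU holds at most `DISTANCE` preceding EDUs, and relations whose endpoints are further apart than that are dropped. This is an assumption about the behaviour, not a copy of the scripts. The function names and the `x`/`y` relation keys are hypothetical.

```python
DISTANCE = 15


def context_window(edus, target_index, distance=DISTANCE):
    """Return the EDUs preceding target_index, at most `distance` of them."""
    start = max(0, target_index - distance)
    return edus[start:target_index]


def within_distance(relations, distance=DISTANCE):
    """Keep only relations whose endpoints are at most `distance` EDUs apart."""
    # Assumption: "x" and "y" are the EDU indices of the source and target.
    return [r for r in relations if abs(r["y"] - r["x"]) <= distance]


edus = [f"edu {i}" for i in range(20)]
relations = [
    {"x": 0, "y": 16, "type": "Elaboration"},
    {"x": 3, "y": 5, "type": "Acknowledgement"},
]

print(len(context_window(edus, 18)))
print(within_distance(relations))
```
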
Note: If generating data for incremental parsing, make sure to add the space marker between
dialogues (line 109 in `format_unannotated.py` and line 153 in `format_annotated.py`). If you are instead
generating data for testing the parser with gold structure context, comment that line out.