# Formatting your dialogue data for the Llamipa parser

This is a collection of scripts that can regenerate the Llamipa data from the Minecraft Structured Dialogue
Corpus (MSDC), or help you format your own dialogue data for use with the Llamipa parser.

To start, the dialogue data must follow the MSDC format, where each dialogue is a JSON object with "id"
and "edus" fields. If the dialogue is already annotated for discourse structure, it must also have a
"relations" field (see the corpus: https://huggingface.co/datasets/linagora/MinecraftStructuredDialogueCorpus).

**Make sure to include a dummy 0 move, "Mission has Started",
at the beginning of each dialogue.**
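
As a rough illustration of the two requirements above, the sketch below checks for the "id" and "edus" fields and prepends the dummy move when it is missing. The helper name `ensure_dummy_move`, the per-EDU keys `"speaker"` and `"text"`, the default speaker, and the exact wording of the dummy text are all assumptions made for this example; check them against your own data and the MSDC files.

```python
import copy

# Assumed wording of the dummy move; adjust to match your data.
DUMMY_TEXT = "Mission has started."


def ensure_dummy_move(dialogue, speaker="Builder"):
    """Return a copy of `dialogue` whose first EDU is the dummy 0 move.

    Hypothetical helper: assumes each EDU is a dict with "speaker" and "text".
    """
    for field in ("id", "edus"):
        if field not in dialogue:
            raise ValueError(f"dialogue is missing required field {field!r}")

    result = copy.deepcopy(dialogue)  # leave the caller's data untouched
    edus = result["edus"]

    # Compare case-insensitively and ignore a trailing period.
    first = edus[0]["text"].strip().rstrip(".").lower() if edus else ""
    if first != DUMMY_TEXT.rstrip(".").lower():
        edus.insert(0, {"speaker": speaker, "text": DUMMY_TEXT})
    return result
```

Inserting an EDU shifts every later EDU index by one, so a helper like this is only safe on unannotated data; for annotated data the indices in "relations" would have to be shifted as well.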

## STEP 1:

Use the dialogue JSON data to create an intermediate format, where each
speaker turn is a single object containing all discourse units.


    [
        {
            "id": "log3566",
            "turns": [
                {
                    "turn": 0,
                    "speaker": "Builder",
                    "edus": [
                        "Mission has started."
                    ]
                },
                {
                    "turn": 1,
                    "speaker": "Architect",
                    "edus": [
                        "Hi!"
                    ]
                },
                {
                    "turn": 2,
                    "speaker": "Builder",
                    "edus": [
                        "Hi", "What are we building today?"
                    ]
                },...
            ]
        }
    ]

The `format.py` script takes the dialogue JSON as input and outputs a `<turns>.json` file.
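
To make the intermediate format concrete, here is a minimal sketch of how EDUs could be grouped into turns. It is not the code in `format.py`: it assumes each EDU is a dict with `"speaker"` and `"text"` keys and treats a turn as a maximal run of consecutive EDUs from the same speaker, which may differ from how the actual script delimits turns. The function name `edus_to_turns` is made up for this example.

```python
from itertools import groupby


def edus_to_turns(dialogue):
    """Group consecutive same-speaker EDUs into numbered turns.

    Assumption: a turn is a maximal run of EDUs by one speaker.
    """
    turns = []
    runs = groupby(dialogue["edus"], key=lambda edu: edu["speaker"])
    for turn_no, (speaker, run) in enumerate(runs):
        turns.append(
            {
                "turn": turn_no,
                "speaker": speaker,
                "edus": [edu["text"] for edu in run],
            }
        )
    return {"id": dialogue["id"], "turns": turns}


example = {
    "id": "log3566",
    "edus": [
        {"speaker": "Builder", "text": "Mission has started."},
        {"speaker": "Architect", "text": "Hi!"},
        {"speaker": "Builder", "text": "Hi"},
        {"speaker": "Builder", "text": "What are we building today?"},
    ],
}
grouped = edus_to_turns(example)
```

On this input, `grouped["turns"]` contains three turns, and the last one holds both Builder EDUs, mirroring the sample shown above.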

Note: The script assumes that non-linguistic actions are in the same format as in the MSDC, e.g.:

    {
        "turn": 22,
        "speaker": "Builder",
        "edus": ["place purple 5 1 -5, place purple 5 1 -4, place purple 5 2 -5, place purple 4 1 -5"]
    }
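
If your own action logs look different, a quick check like the following can confirm that an action string follows the pattern in the sample above. The sketch assumes each action is `<verb> <color> <x> <y> <z>` and that actions are separated by commas; this is inferred from the example only, and `parse_actions` is a hypothetical helper, not part of the scripts.

```python
def parse_actions(edu_text):
    """Split an action EDU into (verb, color, x, y, z) tuples.

    Assumed pattern, inferred from the sample: "place purple 5 1 -5, ...".
    """
    actions = []
    for chunk in edu_text.split(","):
        parts = chunk.split()
        if len(parts) != 5:
            raise ValueError(f"unexpected action format: {chunk.strip()!r}")
        verb, color, x, y, z = parts
        # int() raises ValueError if a coordinate is not an integer.
        actions.append((verb, color, int(x), int(y), int(z)))
    return actions


moves = parse_actions("place purple 5 1 -5, place purple 5 1 -4")
```

A `ValueError` here is a hint that the actions need to be reformatted before running `format.py`.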


## STEP 2: 

If using *UNANNOTATED* data, use the `<turns>.json` to create a `<parser>.jsonl`
file formatted for Llamipa. Script: `format_unannotated.py`

If using *ANNOTATED* data, use the `<turns>.json` and the original data JSON to create a `<parser>.jsonl`
file formatted for Llamipa. Script: `format_annotated.py`

**Make sure that the relation type representations in the `map_rels_str` dictionary in the `format_rels`
function match those in your data.**
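
One way to check this before running the script is to list the relation types in your data that the dictionary does not cover. The sketch below assumes each entry in "relations" is a dict with a `"type"` key; the mapping shown is a two-entry placeholder, not the real contents of `map_rels_str`, and `unmapped_relation_types` is a hypothetical helper.

```python
def unmapped_relation_types(dialogues, mapping):
    """Return relation types found in the data but absent from `mapping`."""
    seen = {
        rel["type"]
        for dialogue in dialogues
        for rel in dialogue.get("relations", [])  # unannotated dialogues are skipped
    }
    return sorted(seen - set(mapping))


# Placeholder mapping for illustration only; copy the real one from format_rels.
placeholder_map = {"Acknowledgement": "ACK", "Question-answer_pair": "QAP"}

sample = [
    {"id": "a", "relations": [{"type": "Acknowledgement"}, {"type": "Result"}]},
    {"id": "b"},
]
missing = unmapped_relation_types(sample, placeholder_map)
```

Any name that ends up in `missing` needs either an entry in the dictionary or a rename in the data.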

The `DISTANCE` variable is set to 15 EDUs, the value used for Llamipa training
and testing, but it can be changed to support contexts of different lengths.
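
As one plausible reading of what a distance limit does, the sketch below keeps only relations whose endpoints are at most `DISTANCE` EDUs apart. It assumes each relation is a dict with integer `"x"` and `"y"` EDU indices, and it treats the boundary as inclusive; neither assumption is confirmed against the scripts, and `within_distance` is a hypothetical name.

```python
DISTANCE = 15  # same value as in the scripts


def within_distance(relations, distance=DISTANCE):
    """Keep relations whose source and target are at most `distance` EDUs apart."""
    return [rel for rel in relations if abs(rel["y"] - rel["x"]) <= distance]


rels = [
    {"x": 0, "y": 3, "type": "a"},   # span 3: kept
    {"x": 2, "y": 20, "type": "b"},  # span 18: dropped
    {"x": 5, "y": 20, "type": "c"},  # span 15: kept (inclusive boundary assumed)
]
kept = within_distance(rels)
```

Running a filter like this with different values shows how many gold relations a given context length would leave out of reach.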

Note: If generating data for incremental parsing, make sure to add the space marker between
dialogues (line 109 in `format_unannotated.py` and line 153 in `format_annotated.py`). If generating
data for testing the parser with gold structure context, comment this out instead.