
	# Formatting your dialogue data for the Llamipa parser.

	This is a collection of scripts which can regenerate the Llamipa data from the MSDC, or can help you to format your
	own dialogue data for use with the Llamipa parser.

	To start, the dialogue data must follow the MSDC format, where each dialogue is a JSON object with "id"
	and "edus" fields. If the dialogue is already annotated for discourse structure, it also has a "relations" field
	(see the corpus: https://huggingface.co/datasets/linagora/MinecraftStructuredDialogueCorpus).

	**Make sure to include a dummy move 0, "Mission has started.",
	at the beginning of each dialogue.**
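
	Before running the scripts, it can help to check that each dialogue has the fields named above. The following is a minimal sketch, not part of the Llamipa scripts: `check_dialogue` is a hypothetical helper, and it assumes that "edus" and "relations" are stored as JSON lists.

```python
def check_dialogue(dialogue):
    """Return a list of problems found in one MSDC-style dialogue object."""
    problems = []
    # "id" and "edus" are required for every dialogue.
    for field in ("id", "edus"):
        if field not in dialogue:
            problems.append(f"missing required field: {field!r}")
    # Assumption: "edus" is a list of discourse units.
    if "edus" in dialogue and not isinstance(dialogue["edus"], list):
        problems.append("'edus' should be a list")
    # "relations" is only expected when the dialogue is annotated.
    if "relations" in dialogue and not isinstance(dialogue["relations"], list):
        problems.append("'relations' should be a list")
    return problems


example = {"id": "log3566", "edus": [], "relations": []}
print(check_dialogue(example))
```

	An empty list means the top-level fields are in place; it does not check the contents of the individual EDUs or the dummy first move.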

	## STEP 1: Use the dialogue json data to create an intermediate format, where each
	speaker turn is a single object containing all discourse units.


	{
	  "id": "log3566",
	  "turns": [
	    {
	      "turn": 0,
	      "speaker": "Builder",
	      "edus": [
	        "Mission has started."
	      ]
	    },
	    {
	      "turn": 1,
	      "speaker": "Architect",
	      "edus": [
	        "Hi!"
	      ]
	    },
	    {
	      "turn": 2,
	      "speaker": "Builder",
	      "edus": [
	        "Hi", "What are we building today?"
	      ]
	    },
	    ...
	  ]
	}

	The `format.py` script takes the dialogue JSON as input and outputs a <turns>.json file.
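
	To make the intermediate format concrete, here is a rough sketch of how EDUs could be grouped into speaker turns. It is not the code in `format.py`: the function name `group_into_turns` and the input EDU keys "speaker" and "text" are assumptions made for illustration, and the real script may decide turn boundaries differently.

```python
def group_into_turns(dialogue):
    """Merge consecutive EDUs by the same speaker into one turn object."""
    turns = []
    for edu in dialogue["edus"]:
        # Hypothetical EDU layout: {"speaker": ..., "text": ...}
        if turns and turns[-1]["speaker"] == edu["speaker"]:
            turns[-1]["edus"].append(edu["text"])
        else:
            turns.append({
                "turn": len(turns),
                "speaker": edu["speaker"],
                "edus": [edu["text"]],
            })
    return {"id": dialogue["id"], "turns": turns}


dialogue = {
    "id": "log3566",
    "edus": [
        {"speaker": "Builder", "text": "Mission has started."},
        {"speaker": "Architect", "text": "Hi!"},
        {"speaker": "Builder", "text": "Hi"},
        {"speaker": "Builder", "text": "What are we building today?"},
    ],
}
print(group_into_turns(dialogue))
```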

	Note: The script assumes the non-linguistic actions are of the same format as in the MSDC, e.g.:

	{
	"turn": 22,
	"speaker": "Builder",
	"edus": ["place purple 5 1 -5, place purple 5 1 -4, place purple 5 2 -5, place purple 4 1 -5"]
	}
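
	If your own data encodes actions differently, a quick check like the one below can flag strings that do not match the pattern in the example above. This is only a sketch: `parse_actions` is a hypothetical helper, and reading each action as a verb, a colour, and three integers is an assumption drawn from that single example.

```python
def parse_actions(edu_text):
    """Split a comma-separated action string into (verb, colour, a, b, c) tuples."""
    moves = []
    for chunk in edu_text.split(","):
        parts = chunk.split()
        # Assumption: every action has exactly five whitespace-separated tokens.
        if len(parts) != 5:
            raise ValueError(f"unexpected action format: {chunk!r}")
        verb, colour, a, b, c = parts
        moves.append((verb, colour, int(a), int(b), int(c)))
    return moves


print(parse_actions("place purple 5 1 -5, place purple 5 1 -4"))
```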


	## STEP 2:

	If using UNANNOTATED data, use the <turns>.json to create a <parser>.jsonl
	file formatted for Llamipa. Script: `format_unannotated.py`

	If using ANNOTATED data, use the <turns>.json and the original data json to create a <parser>.jsonl
	file formatted for Llamipa. Script: `format_annotated.py`

	**Make sure that the relation type representations in the `map_rels_str` dictionary in the `format_rels`
	function match those in your data.**
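
	One way to check this before formatting is to list the relation types in your data that the mapping does not cover. The sketch below is illustrative only: `unknown_relation_types` is a hypothetical helper, the "type" key on each relation is an assumption about your data, and the mapping shown is a made-up stand-in for the real `map_rels_str`.

```python
def unknown_relation_types(relations, map_rels_str):
    """Return the sorted relation types that have no entry in the mapping."""
    return sorted({rel["type"] for rel in relations if rel["type"] not in map_rels_str})


# Hypothetical mapping and relations, for illustration only.
map_rels_str = {"Comment": "COM", "Acknowledgement": "ACK"}
relations = [
    {"type": "Comment"},
    {"type": "Acknowledgment"},  # spelling differs from the mapping key
]
print(unknown_relation_types(relations, map_rels_str))
```

	Any type this reports needs either a new entry in the dictionary or a rename in the data.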

	The DISTANCE variable is set to 15 edus, which is what was used for Llamipa training
	and testing, but can be changed to support contexts of different lengths.
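
	As a rough illustration of what a 15-EDU context means, the sketch below keeps only the EDUs that fall within the window before a given position. It is one plausible reading of the setting, not the logic of the formatting scripts; `context_window` is a hypothetical name.

```python
DISTANCE = 15  # same value as the setting described above


def context_window(edus, index, distance=DISTANCE):
    """Return the EDUs preceding position `index`, at most `distance` of them."""
    start = max(0, index - distance)
    return edus[start:index]


edus = [f"edu_{i}" for i in range(20)]
print(len(context_window(edus, 18)))
print(context_window(edus, 2))
```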

	Note: If generating data for incremental parsing, make sure to add the space marker between
	dialogues (line 109 in `format_unannotated.py` and line 153 in `format_annotated.py`). If generating
	data for testing the parser with gold structure context, comment this out instead.