---
library_name: transformers
tags: []
---

# ServiceNow Table Answering


## Introduction
ServiceNow is a platform that helps businesses automate their processes and workflows, offering several solutions such as ITSM. Today, ServiceNow users generally need to apply filters or build dashboards to inspect data in ServiceNow tables such as incidents and problems. Building dashboards and reports often requires help from developers and can be a hassle when all you need is a quick piece of information. Dashboards are useful for visual representation, but it would also be convenient to simply ask a chatbot questions about the data. I created some sample tables with a selection of ServiceNow fields and used them as data. The task is to build a custom LLM chat assistant that takes in data from ServiceNow tables such as incident, change, and problem, and responds to user queries in natural language.

## Training Data

For this project, the training data was structured around ServiceNow ITSM tables, specifically the Incident, Change, and Problem tables. I used a subset of fields from each table; for example, the Problem table has a problem id, priority, status, root cause, and resolved-at field. Since I can't use official data from in-use ServiceNow instances, which contain private information, I generated a synthetic dataset with custom code. I then structured that data in SQA format, which is the best format for the model I was using, TAPAS. For this, I saved each table in a CSV file. Each record in the final refined dataset contains an id, a question, a table_file, answer_coordinates if the answer is in the table itself, the actual answer, and a float answer if the answer is a numeric value not present in the data, such as a count. There is also an aggregation_label field, which I set right before the training process but after the train/validation/test split. I used train_test_split() to obtain the training, validation, and test data, with a seed of 42:

Example of how the training data appears:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/67885e8302ab11c0b0ed0853/-8piWOY40wzTk3qU1tmRS.png)

```python
from sklearn.model_selection import train_test_split

# First split off a held-out test set
train_val_data, test_data = train_test_split(data, test_size=0.1, random_state=42)
# Then split train+validation into train and validation
train_data, val_data = train_test_split(train_val_data, test_size=0.1, random_state=42)
```
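
For illustration, here is a minimal sketch of one synthetic table saved to CSV alongside a single SQA-style record that references it. The file names, values, and exact field names below are hypothetical and only meant to show the shape of the data:

```python
import pandas as pd

# Hypothetical synthetic Problem table (illustrative values only)
problem_table = pd.DataFrame({
    "problem_id": ["PRB1000", "PRB1001"],
    "priority": ["High", "Low"],
    "status": ["Open", "Resolved"],
    "root_cause": ["Network outage", "Disk failure"],
    "resolved_at": ["", "2023-04-12"],
})
problem_table.to_csv("problem_0.csv", index=False)

# One SQA-style record referencing that CSV. answer_coordinates are (row, column)
# positions in the table; float_answer is only filled when the answer is a number
# that is not a cell value, such as a count.
record = {
    "id": "problem-0-q0",
    "question": "What is the root cause of PRB1000?",
    "table_file": "problem_0.csv",
    "answer_coordinates": [(0, 3)],   # row 0, column "root_cause"
    "answer_text": ["Network outage"],
    "float_answer": float("nan"),
}
```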

## Training Method
I used full fine-tuning. The model did not really need broad generalization abilities: its primary purpose is to take ServiceNow tables and answer queries based on those tables. Keeping some generalization ability would be nice, but it isn't strictly necessary. PEFT could also work to limit catastrophic forgetting, but since generalization is not hugely important here, full fine-tuning was acceptable. The drawback I expected was some generalization loss, but that did not really materialize.

These are the training arguments/hyperparameters I used. I tried higher epoch counts, but they usually produced worse results:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="tapas-servicenow",      # assumed output directory (not in the original arguments)
    num_train_epochs=1,                 # number of training epochs
    per_device_train_batch_size=32,     # batch size per device during training
    per_device_eval_batch_size=64,      # batch size per device during evaluation
    learning_rate=0.00001,
    warmup_steps=100,                   # warmup steps for the learning rate scheduler
    weight_decay=0.01,                  # strength of weight decay
    evaluation_strategy="steps",        # evaluate every 'eval_steps'
    eval_steps=50,                      # evaluation frequency in steps
    logging_steps=50,                   # log every 50 steps
    save_steps=150,                     # save a checkpoint every 150 steps
    save_total_limit=2,
    load_best_model_at_end=True,        # load the best model when finished training
    metric_for_best_model="eval_loss",  # metric used to select the best model
)
```
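
These arguments are in the format of Hugging Face TrainingArguments. Below is a minimal sketch of how they could be wired into a Trainer; the dataset variable names are assumptions, and this is not necessarily the exact training script used:

```python
from transformers import Trainer

# Minimal sketch: train_dataset / val_dataset are assumed to be the tokenized
# SQA-style training and validation splits produced earlier.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
trainer.train()
```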

## Evaluation
I used three benchmarks: the WikiTableQuestions (WTQ) dataset, the TabFact dataset, and SQA. Fine-tuning did not harm the results on the WTQ validation set or on TabFact, where both the pre-trained and fine-tuned models scored accuracies of 0.3405 and 0.5005, respectively. It slightly improved the results on SQA. On the test set of the synthetic dataset, however, there was a large jump in accuracy after fine-tuning, from 0.2933 to 0.4667.

| Model                                                | Test Set of Synthetic Dataset | Benchmark 1 (WTQ Validation Set)         | Benchmark 2 (TabFact) | Benchmark 3 (SQA) |
|------------------------------------------------------|-------------------------------|------------------------------------------|-----------------------|-------------------|
| google/tapas-base-finetuned-wtq (before Fine-tuning) | 0.2933                        | 0.3405                                   | 0.5005                | 0.2512            | 
| google/tapas-base-finetuned-wtq (Fine-tuned)         | 0.4667                        | 0.3405                                   | 0.5005                | 0.2525            | 
| mistralai/Mistral-7B-Instruct-v0.3                   | 0                             | Exact Match: 0.0346 / Fuzzy Match: 0.4744| 0.4995                | 0.0296                  | 
| meta-llama/Llama-3.2-1B                              | 0.0133                        | Exact Match: 0.0593 / Fuzzy Match: 0.2769| 0.4995                | 0.0238            | 
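
For the LLM baselines on WTQ, answers were scored with both exact and fuzzy matching. Below is a minimal sketch of how such a comparison could be implemented; the normalization and the 0.8 similarity threshold are assumptions, not the exact scoring code behind the numbers above:

```python
from difflib import SequenceMatcher

def exact_match(prediction: str, gold: str) -> bool:
    # Case- and whitespace-insensitive equality
    return prediction.strip().lower() == gold.strip().lower()

def fuzzy_match(prediction: str, gold: str, threshold: float = 0.8) -> bool:
    # Similarity ratio between normalized strings; the threshold is an assumption
    ratio = SequenceMatcher(None, prediction.strip().lower(), gold.strip().lower()).ratio()
    return ratio >= threshold
```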

## Usage and Intended Uses

This model is designed for question answering over tabular data. It is primarily intended for querying ITSM tables (change, problem, and incident). It can answer questions such as the most common issues or the number of records in various categories.

```python
import pandas as pd
from transformers import TapasTokenizer, TapasForQuestionAnswering

saved_path = "am5uc/ServiceNow_Table_Question_Answering"
tokenizer = TapasTokenizer.from_pretrained(saved_path)
model = TapasForQuestionAnswering.from_pretrained(saved_path)

question = "How many Hardware Upgrade changes are still pending?"
table_df = pd.DataFrame({
    "change_id": [
        "CHG3000",
        "CHG3001",
        "CHG3002",
        "CHG3003"
    ],
    "category": [
        "Security Patch",
        "Software Update",
        "Hardware Upgrade",
        "Software Update"
    ],
    "status": [
        "Rejected",
        "In Progress",
        "In Progress",
        "Completed"
    ],
    "approved_by": [
        "",
        "Manager2",
        "",
        "Admin1"
    ],
    "implementation_date": [
        "",
        "",
        "",
        "2023-05-30"
    ]
})

# Tokenize both Question and Table together
inputs = tokenizer(table=table_df, queries=[question], padding='max_length', return_tensors='pt')

# Model prediction
# --- Helper function ---
def get_final_answer(model, tokenizer, inputs, table_df):
    outputs = model(**inputs)

    logits = outputs.logits
    logits_agg = outputs.logits_aggregation

    predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
        inputs,
        logits.detach(),
        logits_agg=logits_agg.detach()
    )

    aggregation_operators = ["NONE", "SUM", "AVERAGE", "COUNT"]

    agg_op_idx = predicted_aggregation_indices[0] if predicted_aggregation_indices else 0
    agg_op = aggregation_operators[agg_op_idx]

    predicted_cells = []
    for coord in predicted_answer_coordinates[0]:
        cell_value = table_df.iat[coord[0], coord[1]]
        predicted_cells.append(cell_value)

    if agg_op == "COUNT":
        answer = len(predicted_cells)
    elif agg_op == "SUM":
        try:
            answer = sum(float(cell) for cell in predicted_cells)
        except ValueError:
            answer = "Could not SUM non-numeric cells"
    elif agg_op == "AVERAGE":
        try:
            answer = sum(float(cell) for cell in predicted_cells) / len(predicted_cells)
        except ValueError:
            answer = "Could not AVERAGE non-numeric cells"
    else:  # NONE
        answer = predicted_cells

    return agg_op, answer, predicted_cells

_, answer, _ = get_final_answer(model, tokenizer, inputs, table_df)

print(answer)

```

## Prompt Format
The prompt for the TAPAS model is a natural-language question paired with a structured table passed in as a pandas DataFrame. TAPAS does not work with a standalone text prompt; it generally requires both a question and a table DataFrame.

```python
question = "How many Hardware Upgrade changes are still pending?"
table_df = pd.DataFrame({
    "change_id": [
        "CHG3000",
        "CHG3001",
        "CHG3002",
        "CHG3003"
    ],
    "category": [
        "Security Patch",
        "Software Update",
        "Hardware Upgrade",
        "Software Update"
    ],
    "status": [
        "Rejected",
        "In Progress",
        "In Progress",
        "Completed"
    ],
    "approved_by": [
        "",
        "Manager2",
        "",
        "Admin1"
    ],
    "implementation_date": [
        "",
        "",
        "",
        "2023-05-30"
    ]
})

inputs = tokenizer(table=table_df, queries=[question], padding='max_length', return_tensors='pt')

```

Alternatively, you can define the table as a dictionary (for example, loaded from JSON) and convert it with `table = pd.DataFrame(table)` before passing it to the tokenizer.
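
A minimal sketch of that variant (the JSON file name is hypothetical):

```python
import json
import pandas as pd

# Load the table definition from a JSON file (hypothetical file name)
with open("change_table.json") as f:
    table = json.load(f)

# TAPAS expects string-valued cells, so cast everything to str
table_df = pd.DataFrame(table).astype(str)
inputs = tokenizer(table=table_df, queries=[question], padding="max_length", return_tensors="pt")
```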

## Expected Output Format
You tokenize the question and table, run the model, and post-process the outputs with a helper function that returns the aggregation operation, the answer, and the predicted cells. The middle value of the returned tuple is the predicted answer.
```python
# Tokenize both Question and Table together
inputs = tokenizer(table=table_df, queries=[question], padding='max_length', return_tensors='pt')

# Model prediction
# --- Helper function ---
def get_final_answer(model, tokenizer, inputs, table_df):
    outputs = model(**inputs)

    logits = outputs.logits
    logits_agg = outputs.logits_aggregation

    predicted_answer_coordinates, predicted_aggregation_indices = tokenizer.convert_logits_to_predictions(
        inputs,
        logits.detach(),
        logits_agg=logits_agg.detach()
    )

    aggregation_operators = ["NONE", "SUM", "AVERAGE", "COUNT"]

    agg_op_idx = predicted_aggregation_indices[0] if predicted_aggregation_indices else 0
    agg_op = aggregation_operators[agg_op_idx]

    predicted_cells = []
    for coord in predicted_answer_coordinates[0]:
        cell_value = table_df.iat[coord[0], coord[1]]
        predicted_cells.append(cell_value)

    if agg_op == "COUNT":
        answer = len(predicted_cells)
    elif agg_op == "SUM":
        try:
            answer = sum(float(cell) for cell in predicted_cells)
        except ValueError:
            answer = "Could not SUM non-numeric cells"
    elif agg_op == "AVERAGE":
        try:
            answer = sum(float(cell) for cell in predicted_cells) / len(predicted_cells)
        except ValueError:
            answer = "Could not AVERAGE non-numeric cells"
    else:  # NONE
        answer = predicted_cells

    return agg_op, answer, predicted_cells

_, answer, _ = get_final_answer(model, tokenizer, inputs, table_df)
print(answer)
```
If the question asks for a count, such as how many changes have been completed, the answer is a single number.
If it asks about the most common incident status or root cause, the answer is the status or root cause the model predicts.


![image/png](https://cdn-uploads.huggingface.co/production/uploads/67885e8302ab11c0b0ed0853/ZfRuXoHTvMPZH9NQmi94G.png)

## Limitations
The model still does not come close to 100% accuracy; a larger model could possibly help. I have also run into issues with larger tables, so table size may be a limitation, and again a larger model might help there. In addition, the model needs to receive a question together with a table in DataFrame format, so extra preprocessing is required. Increasing the number of training samples could also improve results.


## Model Card Authors

Abhinandan Mekap