mav23 committed on
Commit
92c2372
·
verified ·
1 Parent(s): 9fb923e

Upload folder using huggingface_hub

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +189 -0
  3. prem-1b-sql.Q4_0.gguf +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ prem-1b-sql.Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,189 @@
1
+ ---
2
+ library_name: transformers
3
+ datasets:
4
+ - premai-io/spider
5
+ - premai-io/domains
6
+ - premai-io/birdbench
7
+ - gretelai/synthetic_text_to_sql
8
+
9
+ metrics:
10
+ - accuracy
11
+ base_model:
12
+ - deepseek-ai/deepseek-coder-1.3b-instruct
13
+ pipeline_tag: text2text-generation
14
+ ---
15
+
16
+ # Prem-1B-SQL
17
+
18
+ - Read the blogpost [here](https://blog.premai.io/prem-1b-sql-fully-local-performant-slm-for-text-to-sql/)
19
+ - PremSQL Library | [GitHub](https://github.com/premAI-io/premsql)
20
+
21
+ Prem-1B-SQL is one of the first fully local Text-to-SQL models developed by Prem AI. As a 1B-parameter model,
22
+ it fits easily on low-end GPUs (and on CPUs when quantized). We believe that AI-assisted data analysis should take a local-first
23
+ approach, because exposing databases to third-party closed-source models can lead to data security breaches. We will be publishing some
24
+ of the public benchmark results of this model very soon. We will also keep iterating on this model for better results.
25
+
26
+ - **Developed by:** [Prem AI](https://www.premai.io/)
27
+ - **License:** MIT
28
+
29
+ ## Results
30
+
31
+ We evaluated our model on two popular benchmark datasets: BirdBench and Spider. BirdBench consists of a public validation dataset (with 1534 data points) and a private test dataset. Spider comes with only a public validation dataset. Here are the results:
32
+
33
+ | Dataset | Execution Accuracy |
34
+ |--------------------------|--------------------|
35
+ | BirdBench (validation) | 46% |
36
+ | BirdBench (private test) | 51.54% |
37
+ | Spider | 85% |
38
+
39
+ The BirdBench dataset is split by difficulty level. Here is a detailed view of the private test results at each level.
40
+
41
+ | Difficulty | Count | EX | Soft F1 |
42
+ |-------------|-------|---------|---------|
43
+ | Simple | 949 | 60.70 | 61.48 |
44
+ | Moderate | 555 | 47.39 | 49.06 |
45
+ | Challenging | 285 | 29.12 | 31.83 |
46
+ | Total | 1789 | 51.54 | 52.90 |
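
The Total row can be cross-checked against the per-difficulty rows. The sketch below assumes the totals are count-weighted averages of the three difficulty levels (the table does not say so explicitly); the helper name `weighted_average` is illustrative only, and the numbers are copied from the table above.

```python
# Per-difficulty rows copied from the table above: (count, EX, Soft F1).
rows = {
    "Simple": (949, 60.70, 61.48),
    "Moderate": (555, 47.39, 49.06),
    "Challenging": (285, 29.12, 31.83),
}


def weighted_average(rows, column):
    """Count-weighted mean of one score column (0 = EX, 1 = Soft F1)."""
    total = sum(count for count, *_ in rows.values())
    weighted = sum(count * scores[column] for count, *scores in rows.values())
    return weighted / total


total_count = sum(count for count, *_ in rows.values())
overall_ex = weighted_average(rows, 0)
overall_f1 = weighted_average(rows, 1)

# Compare these against the Total row of the table.
print(total_count, round(overall_ex, 2), round(overall_f1, 2))
```

Under that assumption the three rows reproduce the reported totals (1789 questions, 51.54 EX, 52.90 Soft F1), which also shows how heavily the 949 simple questions weigh on the overall score.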
47
+
48
+
49
+ Here is a more detailed comparison of popular closed- and open-source models.
50
+
51
+ | Model | # Params (in Billion) | BirdBench Test Scores |
52
+ |-------------------------------|-----------------------|-----------------------|
53
+ | AskData + GPT-4o (current winner) | NA | 72.39 |
54
+ | DeepSeek coder 236B | 236 | 56.68 |
55
+ | GPT-4 (2023) | NA | 54.89 |
56
+ | **PremSQL 1B (ours)** | 1 | 51.54 |
57
+ | Qwen 2.5 7B Instruct | 7 | 51.1 |
58
+ | Claude 2 Base (2023) | NA | 49.02 |
59
+
60
+
61
+ ## How to use Prem-1B-SQL
62
+
63
+ Since the model is built on transformers, it can be used directly with that library. However, running Text-to-SQL is not as simple
64
+ as running a normal LLM, because the model's input prompt is tightly coupled to the database it targets. So we have developed PremSQL,
65
+ a fully open-source library that is:
66
+
67
+ - **Local-First**: Avoid third-party closed-source providers and keep your data secure.
68
+ - **Customizable Datasets**: Create, fine-tune, and evaluate models with built-in or custom datasets.
69
+ - **Robust Executors and Evaluators**: Easily connect to databases and assess model performance.
70
+ - **Advanced Generators**: Convert natural language prompts into executable SQL queries.
71
+ - **Error Handling and Self-Correction**: Automatically correct SQL queries during inference.
72
+ - **Fine-Tuning Support**: Fine-tune models with LoRA, QLoRA, or full fine-tuning strategies.
73
+ - **End-to-End Pipelines**: Seamlessly integrate all components for autonomous data analysis.
74
+
75
+ To install PremSQL, create a new environment and run:
76
+
77
+ ```bash
78
+ pip install -U premsql
79
+ ```
80
+
81
+ Please [check out our documentation](https://docs.premai.io/premsql/introduction) for more details on using the library.
82
+
83
+ ### Running Prem-1B-SQL using PremSQL Pipelines
84
+
85
+ The easiest way to use this model is through PremSQL pipelines. All you need to do is provide the database path (for SQLite databases)
86
+ or the DB connection URI, and then connect it to the model. Here is how you do that:
87
+
88
+ ```python
89
+ from premsql.pipelines import SimpleText2SQLAgent
90
+ from premsql.generators import Text2SQLGeneratorHF
91
+ from premsql.executors import SQLiteExecutor
92
+
93
+ # Provide a SQLite file here or see documentation for more customization
94
+ dsn_or_db_path = "./data/db/california_schools.sqlite"
95
+
96
+ agent = SimpleText2SQLAgent(
97
+     dsn_or_db_path=dsn_or_db_path,
98
+     generator=Text2SQLGeneratorHF(
99
+         model_or_name_or_path="premai-io/prem-1B-SQL",
100
+         experiment_name="simple_pipeline",
101
+         device="cuda:0",
102
+         type="test"
103
+     ),
104
+ )
105
+
106
+ question = "please list the phone numbers of the direct charter-funded schools that are opened after 2000/1/1"
107
+
108
+ response = agent.query(question)
109
+ response["table"]
110
+ ```
111
+
112
+ Under the hood, it automatically connects to your database and does all the heavy lifting for you, such as prompt creation and execution.
113
+
114
+
115
+ ### Running Prem-1B-SQL using PremSQL Generators
116
+
117
+ You can also run the model using PremSQL Generators. This is helpful when you want to generate in
118
+ bulk over a dataset. Here is an example:
119
+
120
+ ```python
121
+ from premsql.generators import Text2SQLGeneratorHF
122
+ from premsql.datasets import Text2SQLDataset
123
+
124
+ # Define a dataset
125
+ bird_dataset = Text2SQLDataset(
126
+     dataset_name='bird', split="validation", force_download=False,
127
+     dataset_folder="/path/to/dataset"
128
+ ).setup_dataset(num_rows=10, num_fewshot=3)
129
+
130
+ # Define a generator
131
+ generator = Text2SQLGeneratorHF(
132
+     model_or_name_or_path="premai-io/prem-1B-SQL",
133
+     experiment_name="test_generators",
134
+     device="cuda:0",
135
+     type="test"
136
+ )
137
+
138
+ # Generate on the dataset prepared above (limited to 10 rows by num_rows)
139
+ responses = generator.generate_and_save_results(
140
+     dataset=bird_dataset,
141
+     temperature=0.1,
142
+     max_new_tokens=256
143
+ )
144
+
145
+ print(responses)
146
+ ```
147
+
148
+ ### Using Execution-Guided Decoding
149
+
150
+ This strategy executes the generated SQL against the DB and, if it fails, uses the error message for correction, repeating until it gets a valid result or the retries run out.
151
+
152
+
153
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/637b0075806b18943e4ba357/_5rdIQZwyaUFb84xKW_AV.png)
154
+
155
+ ```python
156
+ from premsql.executors import SQLiteExecutor
157
+
158
+ executor = SQLiteExecutor()
159
+ response = generator.generate_and_save_results(
160
+     dataset=bird_dataset,
161
+     temperature=0.1,
162
+     max_new_tokens=256,
163
+     force=True,
164
+     executor=executor,
165
+     max_retries=5  # this is optional (default is already set to 5)
166
+ )
167
+ ```
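
The retry loop described in this section can be illustrated without PremSQL. The following is a minimal sketch using only Python's standard `sqlite3` module; it is not PremSQL's implementation. The names `execution_guided_decode`, `generate_sql`, and `fake_generator` are hypothetical, and the toy generator stands in for the model so the example runs without downloading any weights.

```python
import sqlite3


def execution_guided_decode(generate_sql, conn, question, max_retries=5):
    """Call generate_sql until its SQL executes or the retries run out.

    generate_sql(question, error) is a hypothetical callable standing in for
    the model: error is None on the first attempt and the most recent
    database error message on later attempts.
    """
    sql, error = None, None
    for attempt in range(1, max_retries + 1):
        sql = generate_sql(question, error)
        try:
            rows = conn.execute(sql).fetchall()
            return {"sql": sql, "rows": rows, "attempts": attempt, "error": None}
        except sqlite3.Error as exc:
            error = str(exc)  # fed back into the next generation attempt
    return {"sql": sql, "rows": None, "attempts": max_retries, "error": error}


# Toy stand-in for the model: the first attempt misspells the table name,
# and any attempt that receives an error message uses the correct name.
def fake_generator(question, error):
    if error is None:
        return "SELECT phone FROM school"
    return "SELECT phone FROM schools"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, phone TEXT)")
conn.execute("INSERT INTO schools VALUES ('Example Charter', '555-0100')")

result = execution_guided_decode(fake_generator, conn, "List school phone numbers")
print(result["attempts"], result["sql"], result["rows"])
```

In this sketch the first query fails because the table does not exist, the error message is passed back to the generator, and the second query succeeds. A real generator would include the error text in the prompt for the next attempt.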
168
+
169
+
170
+ You can also fine-tune Prem-1B-SQL using HuggingFace Transformers or [PremSQL Tuners](https://docs.premai.io/premsql/tuners).
171
+ Please [check out our documentation](https://docs.premai.io/premsql/introduction) to learn more about PremSQL and all the features
172
+ we provide.
173
+
174
+
175
+ ## Datasets used to train the model
176
+
177
+ Prem-1B-SQL is trained using the following datasets:
178
+
179
+ 1. [BirdBench Training dataset](https://bird-bench.github.io/) | Uploaded on [PremSQL datasets on HF](https://huggingface.co/datasets/premai-io/birdbench)
180
+ 2. [Spider dataset](https://yale-lily.github.io/spider) | Uploaded on [PremSQL datasets on HF](https://huggingface.co/datasets/premai-io/spider)
181
+ 3. [Domain specialization dataset, gathered and uploaded to PremSQL datasets](https://huggingface.co/datasets/premai-io/domains)
182
+ 4. [Gretel AI synthetic dataset](https://huggingface.co/datasets/gretelai/synthetic_text_to_sql?row=0)
183
+
184
+ Additionally, we built error-handling datasets on top of these datasets so that the model learns from its errors and self-corrects.
185
+
186
+
187
+ ## Evaluation results of Prem-1B-SQL
188
+
189
+ The results of Prem-1B-SQL on some public benchmarks will be published soon.
prem-1b-sql.Q4_0.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1a3afdea34bf49d328a7c30d10ee9d536b1e69713a263e91829b137ec94f909f
3
+ size 775937600
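
The three added lines are a Git LFS pointer rather than the model weights themselves: they record the pointer spec version, the SHA-256 digest of the real file, and its size in bytes. As a small illustration, the sketch below parses the pointer text shown above; the function name `parse_lfs_pointer` is hypothetical and the parsing assumes the simple `key value` line layout seen here.

```python
# Pointer text copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1a3afdea34bf49d328a7c30d10ee9d536b1e69713a263e91829b137ec94f909f
size 775937600
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
size_mib = pointer["size_bytes"] / 2**20  # bytes to mebibytes
print(pointer["hash_algorithm"], len(pointer["digest"]), f"{size_mib:.1f} MiB")
```

The size field gives the download size of the quantized `prem-1b-sql.Q4_0.gguf` file, a little under 740 MiB.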