Merge pull request #23 from tools4ds/add-midterm
apps/ai_tutor/config/config.yml
CHANGED
@@ -11,7 +11,7 @@ vectorstore:
 db_option : 'FAISS' # str [FAISS, Chroma, RAGatouille, RAPTOR]
 db_path : 'vectorstores' # str
 model : 'sentence-transformers/all-MiniLM-L6-v2' # str [sentence-transformers/all-MiniLM-L6-v2, text-embedding-ada-002']
-search_top_k :
+search_top_k : 5 # int
 score_threshold : 0.2 # float
 
 faiss_params: # Not used as of now
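The hunk above only changes the value of `search_top_k`; how the retriever consumes it is not shown in this diff. As a minimal sketch, assuming scores are similarities where higher is better and that `score_threshold` is applied before the top-k cut, the two settings could interact as follows. The function `select_hits` and the sample data are hypothetical, not code from the repository.

```python
from typing import List, Tuple


def select_hits(
    scored_docs: List[Tuple[str, float]],
    search_top_k: int = 5,
    score_threshold: float = 0.2,
) -> List[Tuple[str, float]]:
    """Drop hits below the threshold, then keep the best search_top_k."""
    # Assumption: a higher score means a more relevant document.
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:search_top_k]


# Hypothetical (document id, similarity score) pairs.
candidates = [
    ("a", 0.91), ("b", 0.15), ("c", 0.42), ("d", 0.20),
    ("e", 0.77), ("f", 0.33), ("g", 0.58),
]
hits = select_hits(candidates)
print(hits)
```

If the vector store returns distances instead of similarities, the comparison and the sort order would both need to be reversed.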
apps/ai_tutor/config/prompts.py
CHANGED
@@ -73,7 +73,8 @@ prompts = {
 "If you don't know the answer, do your best without making things up. Keep the conversation flowing naturally.\n"
 "Provide links from the source_file metadata. Use the source context that is most relevant.\n"
 "Speak in a friendly and engaging manner, like talking to a friend. Avoid sounding repetitive or robotic.\n"
-"If the student is asking about a question on an assignment, lead them towards solving the problem themselves, NOT giving them the answer directly. Be very subtle about it."
+"If the student is asking about a question on an assignment OR the midterm, lead them towards solving the problem themselves, NOT giving them the answer directly. Be very subtle about it."
+"Do NOT try to give solutions about the midterm it has been released, but is still active."
 "\n\n"
 "user\n"
 "Context:\n{context}\n\n"
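The strings in this hunk carry no trailing commas, which suggests they are adjacent literals that Python joins into one string. Under that assumption, the rewritten line 76 still ends without a newline, so the new line 77 is appended directly after `it.` with no separator. The sketch below reproduces only the tail of the prompt; `new_tail` and `fixed_tail` are hypothetical names, and the extra `\n` in `fixed_tail` is one possible fix, not something the commit contains.

```python
# Adjacent string literals inside parentheses are concatenated by Python.
new_tail = (
    "If the student is asking about a question on an assignment OR the midterm, "
    "lead them towards solving the problem themselves, NOT giving them the answer "
    "directly. Be very subtle about it."
    "Do NOT try to give solutions about the midterm it has been released, but is still active."
    "\n\n"
)

# Hypothetical fix: end the first sentence block with a newline.
fixed_tail = (
    "If the student is asking about a question on an assignment OR the midterm, "
    "lead them towards solving the problem themselves, NOT giving them the answer "
    "directly. Be very subtle about it.\n"
    "Do NOT try to give solutions about the midterm it has been released, but is still active."
    "\n\n"
)

# Show the seam between the two instructions in each variant.
print(repr(new_tail[new_tail.index("about it."):][:20]))
print(repr(fixed_tail[fixed_tail.index("about it."):][:20]))
```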
apps/ai_tutor/storage/data/midterm.txt
ADDED
@@ -0,0 +1,147 @@
+# NBA Shot Data Midterm Challenge
+
+Welcome to the NBA Shot Data Midterm Challenge! This competition is designed to test your skills in exploratory data analysis (EDA), clustering, machine learning, feature engineering, and data visualization. The goal is to analyze NBA shot data, uncover insights, and predict shot success. Below are the competition guidelines and tasks.
+
+## Overview
+
+In this midterm, you will:
+
+1. Perform **exploratory data analysis**, visualize trends, and identify outliers.
+2. **Cluster players** to create player profiles based on their shot attributes.
+3. Build a **shot prediction model**, using machine learning (but no deep learning) with engineered features to improve accuracy.
+4. Participate in a **visualization challenge** to explain insights through creative and informative visualizations.
+
+Part 3 includes a **leaderboard** that ranks participants based on the performance of their prediction models. You get full credit for beating the baseline accuracy score, and bonus points if you're ranked in the top 10.
+
+## Submission Guidelines
+
+*We expect you to work individually on this midterm.*
+
+*The goal of this midterm is to give you an opportunity to learn these tools and practice using them. The grade you receive is somewhat secondary. Using GenAI to answer the questions will rob you of a very important learning and practice opportunity. For that reason we ask that you turn off GenAI autocomplete for any development environment you use. You can use the course AI assistant for more general, peripheral help, for example to explain concepts. Your use of that tool is logged, just so you know. Do not use any other GenAI tools for the midterm.*
+
+For all parts, write code and accompanying analysis in the `ds701-fa24-midterm.ipynb`. Ensure that your analysis is detailed and presented in the manner of a professional report. For part 4 (prediction task), you also have to create a submission file for the [Kaggle competition](https://www.kaggle.com/t/6b1722486c10455f967352fbf62682b2). It should follow the format of `submission.csv` which has two columns: ID and SHOT_MADE.
+
+## Tasks
+
+### 1. Exploratory Analysis, Visualization, and Outlier Detection
+
+This part involves discovering interesting patterns and outliers in the dataset. Use your data wrangling and visualization skills to answer questions.
+
+Deliverables:
+
+- See the notebook `ds701-fa24-midterm.ipynb` for specific questions to answer.
+- Provide visualizations (e.g., histograms, box plots, heatmaps) that clearly illustrate your findings in the notebook, creating code cells as you like.
+
+### 2. Player Clustering: Defining Player Profiles
+
+For this task, you will apply clustering techniques to identify distinct player profiles based on their shooting behavior. You could use features like:
+
+- **Shot Types**: `SHOT_TYPE`, `ACTION_TYPE`
+- **Shot Success Rate**: `SHOT_MADE`
+- **Shot Distance**: `SHOT_DISTANCE`
+
+But you're not limited to these!
+
+You can use clustering algorithms like **K-Means**, **Gaussian Mixture Models**, or others. What we want to see is an understanding of what the clusters say about the data.
+
+Deliverables:
+
+- **Cluster analysis**: Describe the player clusters and how they might correspond to properties about the players or shots they take.
+- Provide **visual representations** of your clusters (e.g., scatter plots with labeled clusters).
+
+Grading:
+
+- Your analysis will be judged based on clarity, and the depth of the insights provided.
+
+### 3. Prediction Task: Shot Success Modeling (No Deep Learning)
+
+In this task, your goal is to predict whether a shot will be made (`SHOT_MADE`). You will build a machine learning model that predicts shot success based on a variety of features.
+
+**Important Notes**:
+
+You are NOT allowed to use deep learning methods!
+
+You are encouraged to perform **feature engineering** to create new and useful features for your models.
+
+Suggested Feature Engineering Ideas:
+
+- **Spatial Features**: Derive the shot angle from the basket using `LOC_X`, `LOC_Y`.
+- **Shot Difficulty**: Combine features such as `SHOT_DISTANCE`, `LOC_X`, `LOC_Y`, and `SHOT_TYPE` to create a new feature representing shot difficulty.
+- **Player Fatigue**: Estimate player fatigue based on `GAME_DATE` (players are potentially more fatigued later in the season) and how late in the game the shot occurs (`MINS_LEFT`, `SECS_LEFT`).
+
+Deliverables:
+
+- A machine learning model that predicts shot success. Provide a summary of your model’s performance, including metrics like **accuracy**, **precision**, **recall**, or **F1-score**.
+- **Feature engineering discussion**: Document any new features you created and how they improved your model.
+
+Put the model summary and report into the notebook, no need for a separate document!
+
+### 4. Visualization Challenge (BONUS)
+
+In this challenge, we want you to go beyond basic visualizations. This is an opportunity to showcase your creativity and data storytelling abilities. This is an optional section but you can earn up to 5 bonus points.
+
+Suggested Visualizations:
+
+- **Shot Heatmap**: Create a heatmap showing shot success rates across different areas of the court for players or teams.
+
+- Example: _"Where are players most successful on the court?"_
+
+- **Clutch Shot Analysis**: Analyze performance when the game is close or when time is running out.
+
+- Example: _"Which players perform well in high-pressure situations, such as late in the game?"_
+
+## Dataset
+
+The dataset contains the following columns:
+
+```{python}
+['SEASON_1', 'SEASON_2', 'TEAM_ID', 'TEAM_NAME', 'PLAYER_ID', 'PLAYER_NAME',
+'POSITION_GROUP', 'POSITION', 'GAME_DATE', 'GAME_ID', 'HOME_TEAM', 'AWAY_TEAM',
+'EVENT_TYPE', 'SHOT_MADE', 'ACTION_TYPE', 'SHOT_TYPE', 'BASIC_ZONE',
+'ZONE_NAME', 'ZONE_ABB', 'ZONE_RANGE', 'LOC_X', 'LOC_Y', 'SHOT_DISTANCE',
+'QUARTER', 'MINS_LEFT', 'SECS_LEFT']
+```
+
+## Resources
+
+To make court visualizations easier, we've provided a function `draw_court` in the `utils.py` file. Could prove useful for Question 4 (Advanced Visualization)!
+
+Here's an example of plotting Kobe Bryant's career shots:
+
+```{python}
+from utils import draw_court
+
+bryant_id = 977 # Kobe Bryant
+game_df = df[(df['PLAYER_ID'] == bryant_id)]
+
+plt.figure(figsize=(8, 8))
+# The 'game_df['LOC_X'] * 10 and (game_df['LOC_Y'] - 5) * 10' operations were chosen arbitrarily, we suggest you keep them.
+plt.scatter(game_df['LOC_X'] * 10, (game_df['LOC_Y'] - 5) * 10, c=game_df['SHOT_MADE'], cmap='coolwarm', alpha=0.5)
+draw_court(outer_lines=True, color="darkblue")
+
+# These make sure the court is of a reasonable size
+plt.xlim(-300, 300)
+plt.ylim(-100, 500)
+
+plt.axis('off')
+plt.show()
+```
+
+## Grading
+
+- **Exploratory Analysis and Visualizations** (10 pts): These will be graded on correctness and creativity.
+- Parts 1-6 are each 1 point.
+- Part 7 is worth 4 points. You will be graded on the quality of your description and the validity of your insight.
+- **Clustering** (15 pts): Clustering will be evaluated based on the cohesion and separation of the clusters, as well as the accompanying discussion.
+- 5 pts for creating a valid clustering.
+- 5 pts for providing high quality visualizations of the clusters.
+- 5 pts for providing a clear interpretation and explanation of the clusters.
+- **Prediction Task** (10 pts): The leaderboard will rank participants based on their model performance (e.g., accuracy) on a held-out test set.
+
+- 5 pts for creating and validating a model
+- 5 pts for explanation of feature and model choices
+- 2 bonus pts for finishing top 10 in the leaderboard
+
+- **Visualization** (3 possible bonus pts): To receive the full bonus credit you must provide a creative and informative visualization.
+
+Good luck, and may the best analysis win!
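The added README suggests deriving a shot angle from `LOC_X` and `LOC_Y` and using `MINS_LEFT` and `SECS_LEFT` as a fatigue signal, without giving formulas. A minimal sketch of both follows. It assumes the basket sits at `LOC_X = 0`, `LOC_Y = 5`, which is inferred from the `(LOC_Y - 5) * 10` shift in the plotting example together with the further assumption that `draw_court` places the hoop at the origin; the README states neither. The helper names `shot_angle_degrees` and `seconds_left_in_quarter` are hypothetical.

```python
import math


def shot_angle_degrees(loc_x: float, loc_y: float, hoop_y: float = 5.0) -> float:
    """Angle between the shot and the line straight out from the basket.

    0 means straight on; +/-90 means along the baseline.
    """
    # Assumption: the basket sits at (0, hoop_y) in LOC_X/LOC_Y units.
    return math.degrees(math.atan2(loc_x, loc_y - hoop_y))


def seconds_left_in_quarter(mins_left: int, secs_left: int) -> int:
    """Combine MINS_LEFT and SECS_LEFT into one time-remaining feature."""
    return mins_left * 60 + secs_left


print(shot_angle_degrees(0.0, 20.0))    # straight-on shot
print(shot_angle_degrees(10.0, 15.0))   # shot from the side, on the positive LOC_X half
print(seconds_left_in_quarter(2, 30))
```

On a full DataFrame the same angle could be computed column-wise with `numpy.arctan2` rather than row by row.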
apps/ai_tutor/storage/data/urls.txt
CHANGED
@@ -1,2 +1,3 @@
 https://tools4ds.github.io/fa2024/
-https://tools4ds.github.io/DS701-Course-Notes/
+https://tools4ds.github.io/DS701-Course-Notes/
+https://raw.githubusercontent.com/tools4ds/ds701_fa2024_assignments/refs/heads/main/midterm/README.md