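Data analyst agent: get your data's insights in the blink of an eye ✨

This is an advanced tutorial. We recommend working through the other cookbook first!

In this notebook we will build a data analyst agent: a code agent armed with data analysis libraries that can load and transform dataframes, extract insights from them, and even plot the results!

Let's say I want to analyze the data from the Kaggle Titanic challenge in order to predict the survival of individual passengers. Before digging into it myself, I want an autonomous agent to prepare the analysis for me by extracting trends and plotting some figures to find insights.

Let's set up this system.

Run the line below to install the required dependencies: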

!pip install seaborn "transformers[agents]"

We first create the agent. We use a ReactCodeAgent (read the documentation to learn more about agent types), so we do not even need to give it any tools: it can run code directly.

We only need to make sure it can use data-science libraries, which we do by passing them to the additional_authorized_imports parameter: ["numpy", "pandas", "matplotlib.pyplot", "seaborn"].
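To make the idea of an import allowlist concrete, here is a minimal sketch of such a check. It is not how transformers implements it: the helper name find_unauthorized_imports, and the rule that a module is accepted when it or one of its parent packages is listed, are assumptions made purely for illustration.

```python
import ast


def find_unauthorized_imports(source, authorized):
    """Return the modules imported by `source` that the allowlist does not cover.

    Hypothetical helper: a module counts as allowed when its own dotted name,
    or the name of one of its parent packages, appears in `authorized`.
    """
    allowed = set(authorized)

    def is_allowed(module):
        parts = module.split(".")
        prefixes = (".".join(parts[:i]) for i in range(1, len(parts) + 1))
        return any(prefix in allowed for prefix in prefixes)

    rejected = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have no module name; treat them as not allowed.
            names = [node.module or ""]
        else:
            continue
        rejected.extend(name for name in names if not is_allowed(name))
    return rejected


authorized = ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]
snippet = (
    "import pandas as pd\n"
    "import matplotlib.pyplot as plt\n"
    "import subprocess\n"
)
print(find_unauthorized_imports(snippet, authorized))
```

With this rule, the generated code may import pandas and matplotlib.pyplot, while subprocess is reported because nothing in the list covers it.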

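In general, when passing libraries in additional_authorized_imports, make sure they are installed in your local environment, since the Python interpreter can only use libraries that are installed.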
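One way to verify this before creating the agent is to ask Python's import machinery whether each module can be found. The sketch below uses only the standard library; missing_imports is a hypothetical helper name, not part of transformers.

```python
import importlib.util


def missing_imports(modules):
    """Return the entries of `modules` that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            # For a dotted name such as "matplotlib.pyplot", find_spec imports
            # the parent package first and raises if that package is absent.
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        if spec is None:
            missing.append(name)
    return missing


wanted = ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]
not_installed = missing_imports(wanted)
if not_installed:
    print("Install these before running the agent:", not_installed)
else:
    print("All authorized imports are available.")
```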

⚙ Our agent will be powered by meta-llama/Meta-Llama-3.1-70B-Instruct through the HfEngine class, which uses HF's Inference API: the Inference API makes it quick and easy to run any open-source model.

from transformers.agents import HfEngine, ReactCodeAgent
from huggingface_hub import login
import os

login(os.getenv("HUGGINGFACEHUB_API_TOKEN"))

llm_engine = HfEngine("meta-llama/Meta-Llama-3.1-70B-Instruct")

agent = ReactCodeAgent(
    tools=[],
    llm_engine=llm_engine,
    additional_authorized_imports=["numpy", "pandas", "matplotlib.pyplot", "seaborn"],
    max_iterations=10,
)

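Data analysis 📊🤔

When running the agent, we provide it with additional notes taken from the competition and pass them as a keyword argument (kwarg) to the run method: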

import os

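# Create the output folder; do not fail if it already exists (e.g. when re-running the cell)
os.makedirs("./figures", exist_ok=True)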

additional_notes = """
### Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
"""

analysis = agent.run(
    """You are an expert data analyst.
Please load the source file and analyze its content.
According to the variables you have, begin by listing 3 interesting questions that could be asked on this data, for instance about specific correlations with survival rate.
Then answer these questions one by one, by finding the relevant numbers.
Meanwhile, plot some figures using matplotlib/seaborn and save them to the (already existing) folder './figures/': take care to clear each figure with plt.clf() before doing another plot.

In your final answer: summarize these correlations and trends
After each number derive real worlds insights, for instance: "Correlation between is_december and boredness is 1.3453, which suggest people are more bored in winter".
Your final answer should have at least 3 numbered and detailed parts.
""",
    additional_notes=additional_notes,
    source_file="titanic/train.csv",
)
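As far as we can tell, extra keyword arguments given to run (here additional_notes and source_file) are kept in the agent's state and are also mentioned in the task text that the model sees. The sketch below mimics that behavior in a simplified way. build_task_with_arguments is a hypothetical helper, and the exact sentence appended to the task is an assumption, so check the library source for the real wording.

```python
def build_task_with_arguments(task, **kwargs):
    """Return (task_text, initial_state) for a run call with extra kwargs.

    Hypothetical illustration only; this is not the transformers implementation.
    """
    # The kwargs become the initial state the agent starts from.
    state = dict(kwargs)
    if kwargs:
        # Assumed wording: tell the model which named arguments it was given.
        task += f"\nYou have been provided with these initial arguments: {kwargs}."
    return task, state


task_text, initial_state = build_task_with_arguments(
    "Please load the source file and analyze its content.",
    source_file="titanic/train.csv",
)
print(task_text)
print(initial_state)
```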

>>> print(analysis)

Here are the correlations and trends found in the data:

1. **Correlation between age and survival rate**: The correlation is -0.0772, which suggests that as age increases, the survival rate decreases. This implies that older passengers were less likely to survive the Titanic disaster.

2. **Relationship between Pclass and survival rate**: The survival rates for each Pclass are:
   - Pclass 1: 62.96%
   - Pclass 2: 47.28%
   - Pclass 3: 24.24%
   This shows that passengers in higher socio-economic classes (Pclass 1 and 2) had a significantly higher survival rate compared to those in the lower class (Pclass 3).

3. **Relationship between fare and survival rate**: The correlation is 0.2573, which suggests a moderate positive relationship between fare and survival rate. This implies that passengers who paid higher fares were more likely to survive the disaster.

Impressive, isn't it? You could also give your agent a visualizer tool so that it can reflect on its own plots!
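Since the agent's numbers come from code it wrote itself, it can be worth spot-checking them by hand. The sketch below shows the two computations behind the answer above, a per-group survival rate and a Pearson correlation, in plain Python. The six rows are made up for illustration and are not taken from the Titanic data; with pandas, the equivalent calls would be df.groupby("Pclass")["Survived"].mean() and df["Fare"].corr(df["Survived"]).

```python
from collections import defaultdict
from math import sqrt


def survival_rate_by_group(rows, group_key, target_key="Survived"):
    """Share of rows with target == 1 inside each group."""
    totals = defaultdict(int)
    survived = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        survived[row[group_key]] += row[target_key]
    return {group: survived[group] / totals[group] for group in sorted(totals)}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up sample rows (NOT real Titanic records), only to exercise the functions.
toy_rows = [
    {"Pclass": 1, "Fare": 80.0, "Survived": 1},
    {"Pclass": 1, "Fare": 60.0, "Survived": 1},
    {"Pclass": 2, "Fare": 20.0, "Survived": 1},
    {"Pclass": 2, "Fare": 15.0, "Survived": 0},
    {"Pclass": 3, "Fare": 8.0, "Survived": 0},
    {"Pclass": 3, "Fare": 7.0, "Survived": 0},
]

rates = survival_rate_by_group(toy_rows, "Pclass")
fare_corr = pearson(
    [row["Fare"] for row in toy_rows],
    [row["Survived"] for row in toy_rows],
)
print(rates)
print(round(fare_corr, 4))
```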

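Data scientist agent: run predictions 🛠️

👉 Now let's go one step further: we will let our model perform predictions on the data.

To do so, we also let it use sklearn by adding it to additional_authorized_imports.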

agent = ReactCodeAgent(
    tools=[],
    llm_engine=llm_engine,
    additional_authorized_imports=[
        "numpy",
        "pandas",
        "matplotlib.pyplot",
        "seaborn",
        "sklearn",
    ],
    max_iterations=12,
)

output = agent.run(
    """You are an expert machine learning engineer.
Please train a ML model on "titanic/train.csv" to predict the survival for rows of "titanic/test.csv".
Output the results under './output.csv'.
Take care to import functions and modules before using them!
""",
    additional_notes=additional_notes + "\n" + analysis,
)

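The test predictions that the agent produced above, once submitted to Kaggle, score 0.78229, which ranks #2824 out of 17,360 participants and is better than what I had painfully achieved when I first tried the challenge years ago.

Your result will vary, but in any case I find it very impressive to achieve this with an agent in a few seconds.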
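Because the agent writes output.csv on its own, a quick format check before uploading can save a rejected submission. The sketch below assumes the usual Kaggle Titanic submission layout, a header of PassengerId,Survived followed by one 0/1 prediction per test passenger. That layout and the helper name check_submission are assumptions, not something stated in this notebook.

```python
import csv

# Assumed header of a Kaggle Titanic submission file.
EXPECTED_COLUMNS = ["PassengerId", "Survived"]


def check_submission(path, expected_rows=None):
    """Return a list of human-readable problems found in a submission CSV."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != EXPECTED_COLUMNS:
            problems.append(f"unexpected header: {reader.fieldnames}")
            return problems
        rows = list(reader)
    # Predictions must be the literal labels 0 or 1 (not 0.0, True, etc.).
    bad = [row for row in rows if row["Survived"] not in ("0", "1")]
    if bad:
        problems.append(f"{len(bad)} rows have a Survived value other than 0 or 1")
    if expected_rows is not None and len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    return problems
```

Calling check_submission("./output.csv", expected_rows=418) returns an empty list when the file looks fine. The value 418 is our assumption about the number of rows in titanic/test.csv, so adjust it to match your copy of the data.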

🚀 The above is just a naive attempt at a data analyst agent: it can certainly be improved in many ways to better fit your use case!
