Information Extraction with Haystack and NuExtract

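Author: Stefano Fiorucci

In this notebook, we show how to use language models to automate information extraction from textual data.

🎯 Goal: create an application that extracts specific information from a given text or URL, following a user-defined structure.

🧰 Stack

Haystack 🏗️: a customizable orchestration framework for building LLM applications. We will use Haystack to build the information extraction pipeline.
NuExtract: a small language model, fine-tuned specifically for structured data extraction.

Install dependencies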

! pip install haystack-ai trafilatura transformers pyvis

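Components

Haystack has two main concepts: components and pipelines.

🧩 Components are building blocks that each perform a single task: file conversion, text generation, embedding creation, and so on.

➿ Pipelines let you define how data flows through your LLM application by combining components into a directed (possibly cyclic) graph.

Next, we introduce the individual components of our information extraction application. Afterwards, we integrate them into a pipeline.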
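To make the idea of a pipeline as a graph of components concrete, here is a toy sketch in plain Python. It is not Haystack's implementation or API: the three functions, the `graph` dictionary, and the `run_toy_pipeline` helper are hypothetical stand-ins, and the sketch assumes that each component has at most one predecessor.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Toy "components": each function performs a single task.
def fetch(url):
    # Pretend download; a real fetcher would perform an HTTP request.
    return f"<html><body>Page at {url}</body></html>"

def html_to_text(html):
    # Naive tag stripping, for illustration only.
    return re.sub(r"<[^>]+>", "", html)

def build_prompt(text):
    return f"### Text:\n{text}"

components = {
    "fetcher": fetch,
    "converter": html_to_text,
    "prompt_builder": build_prompt,
}

# Directed graph: each node maps to the set of nodes it receives data from.
graph = {
    "fetcher": set(),
    "converter": {"fetcher"},
    "prompt_builder": {"converter"},
}

def run_toy_pipeline(graph, components, first_input):
    outputs = {}
    # static_order() yields every node after its predecessors.
    for name in TopologicalSorter(graph).static_order():
        predecessors = graph[name]
        data = outputs[next(iter(predecessors))] if predecessors else first_input
        outputs[name] = components[name](data)
    return outputs

outputs = run_toy_pipeline(graph, components, "https://example.com/")
print(outputs["prompt_builder"])
```

The point of the sketch is only the separation of concerns: components know nothing about each other, and the graph alone decides who receives whose output.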

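LinkContentFetcher and HTMLToDocument: extract text from web pages

In our experiment, we will extract data from startup funding announcements found on the web.

To download web pages and extract text, we use two components:

LinkContentFetcher: fetches the content of some URLs and returns a list of content streams (as ByteStream objects).
HTMLToDocument: converts HTML sources into textual Documents.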

>>> from haystack.components.fetchers import LinkContentFetcher
>>> from haystack.components.converters import HTMLToDocument


>>> fetcher = LinkContentFetcher()

>>> streams = fetcher.run(urls=["https://example.com/"])["streams"]

>>> converter = HTMLToDocument()
>>> docs = converter.run(sources=streams)

>>> print(docs)

{'documents': [Document(id=65bb1ce4b6db2f154d3acfa145fa03363ef93f751fb8599dcec3aaf75aa325b9, content: 'This domain is for use in illustrative examples in documents. You may use this domain in literature ...', meta: {'content_type': 'text/html', 'url': 'https://example.com/'})]}

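HuggingFaceLocalGenerator: load and try the model

We use HuggingFaceLocalGenerator, a text generation component that loads a model hosted on Hugging Face using the Transformers library.

Haystack supports many other generators, including HuggingFaceAPIGenerator (compatible with the Hugging Face APIs and TGI).

We load NuExtract, a model fine-tuned from microsoft/Phi-3-mini-4k-instruct to perform structured data extraction from text. The model has 3.8B parameters. Other variants are also available: NuExtract-tiny (0.5B) and NuExtract-large (7B).

The model is loaded in bfloat16 precision so that it fits in Colab. As suggested in the model card, this causes negligible performance loss compared to FP32.

Notes on Flash Attention

At inference time, you will probably see a warning: "You are not running the flash-attention implementation".

GPUs available in free environments like Colab or Kaggle do not support Flash Attention, so we decided not to use it in this notebook.

If your GPU architecture supports Flash Attention, you can install it and get a speedup as follows: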

pip install flash-attn --no-build-isolation

Then add "attn_implementation": "flash_attention_2" to model_kwargs.
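As a minimal sketch of where that entry goes, the hypothetical helper below builds the dictionary passed as `huggingface_pipeline_kwargs`. The helper name and its parameters are assumptions for illustration; only the `model_kwargs`, `torch_dtype`, and `attn_implementation` keys come from this notebook.

```python
def build_pipeline_kwargs(dtype, use_flash_attention=False):
    """Build the dict passed as `huggingface_pipeline_kwargs` (hypothetical helper)."""
    model_kwargs = {"torch_dtype": dtype}
    if use_flash_attention:
        # Only enable this after installing flash-attn on a supported GPU.
        model_kwargs["attn_implementation"] = "flash_attention_2"
    return {"model_kwargs": model_kwargs}

# A string stands in for torch.bfloat16 so the sketch runs without PyTorch.
print(build_pipeline_kwargs("bfloat16-placeholder", use_flash_attention=True))

# In the notebook you would pass the real dtype, for example:
# generator = HuggingFaceLocalGenerator(
#     model="numind/NuExtract",
#     huggingface_pipeline_kwargs=build_pipeline_kwargs(torch.bfloat16, use_flash_attention=True),
# )
```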

from haystack.components.generators import HuggingFaceLocalGenerator
import torch

generator = HuggingFaceLocalGenerator(model="numind/NuExtract",
                                      huggingface_pipeline_kwargs={"model_kwargs": {"torch_dtype":torch.bfloat16}})

# load the model now (warm_up is invoked automatically when the generator is part of a Pipeline)
generator.warm_up()

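The model supports a specific prompt structure, which can be inferred from the model card.

Let's manually create a prompt to try the model. Later, we will see how to build the prompt dynamically for different inputs.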

>>> prompt="""<|input|>\n### Template:
... {
...     "Car": {
...         "Name": "",
...         "Manufacturer": "",
...         "Designers": [],
...         "Number of units produced": "",
...     }
... }
... ### Text:
... The Fiat Panda is a city car manufactured and marketed by Fiat since 1980, currently in its third generation. The first generation Panda, introduced in 1980, was a two-box, three-door hatchback designed by Giorgetto Giugiaro and Aldo Mantovani of Italdesign and was manufactured through 2003 — receiving an all-wheel drive variant in 1983. SEAT of Spain marketed a variation of the first generation Panda under license to Fiat, initially as the Panda and subsequently as the Marbella (1986–1998).

... The second-generation Panda, launched in 2003 as a 5-door hatchback, was designed by Giuliano Biasio of Bertone, and won the European Car of the Year in 2004. The third-generation Panda debuted at the Frankfurt Motor Show in September 2011, was designed at Fiat Centro Stilo under the direction of Roberto Giolito and remains in production in Italy at Pomigliano d'Arco.[1] The fourth-generation Panda is marketed as Grande Panda, to differentiate it with the third-generation that is sold alongside it. Developed under Stellantis, the Grande Panda is produced in Serbia.

... In 40 years, Panda production has reached over 7.8 million,[2] of those, approximately 4.5 million were the first generation.[3] In early 2020, its 23-year production was counted as the twenty-ninth most long-lived single generation car in history by Autocar.[4] During its initial design phase, Italdesign referred to the car as il Zero. Fiat later proposed the name Rustica. Ultimately, the Panda was named after Empanda, the Roman goddess and patroness of travelers.
... <|output|>
... """

>>> result = generator.run(prompt=prompt)
>>> print(result)

{'replies': ['{\n    "Car": {\n        "Name": "Fiat Panda",\n        "Manufacturer": "Fiat",\n        "Designers": [\n            "Giorgetto Giugiaro",\n            "Aldo Mantovani",\n            "Giuliano Biasio",\n            "Roberto Giolito"\n        ],\n        "Number of units produced": "over 7.8 million"\n    }\n}\n']}

Nice ✅

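PromptBuilder: generate prompts dynamically

The PromptBuilder is initialized with a Jinja2 prompt template and renders it by filling in the values passed as keyword arguments.

Our prompt template reproduces the structure shown in the model card.

During our experiments, we found that the indentation of the schema is particularly important for good results. This probably stems from how the model was trained.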

from haystack.components.builders import PromptBuilder
from haystack import Document

prompt_template = '''<|input|>
### Template:
{{ schema | tojson(indent=4) }}
{% for example in examples %}
### Example:
{{ example | tojson(indent=4) }}\n
{% endfor %}
### Text
{{documents[0].content}}
<|output|>
'''

prompt_builder = PromptBuilder(template=prompt_template)

>>> example_document = Document(content="The Fiat Panda is a city car...")

>>> example_schema = {
...     "Car": {
...         "Name": "",
...         "Manufacturer": "",
...         "Designers": [],
...         "Number of units produced": "",
...     }
... }

>>> prompt=prompt_builder.run(documents=[example_document], schema=example_schema)["prompt"]

>>> print(prompt)

<|input|>
### Template:
{
    "Car": {
        "Designers": [],
        "Manufacturer": "",
        "Name": "",
        "Number of units produced": ""
    }
}

### Text
The Fiat Panda is a city car...
<|output|>

Works well ✅
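The printed prompt shows the schema with four-space indentation and alphabetically sorted keys. The sketch below reproduces that formatting with the standard library only, on the assumption that Jinja2's `tojson` filter sorts keys by default, as `json.dumps(..., sort_keys=True)` does. It is meant to show what the model receives with and without indentation, not to replace the PromptBuilder.

```python
import json

schema = {
    "Car": {
        "Name": "",
        "Manufacturer": "",
        "Designers": [],
        "Number of units produced": "",
    }
}

# Compact form: everything on one line, no indentation for the model to follow.
compact = json.dumps(schema, sort_keys=True)

# Indented form: matches the layout of the template section in the printed prompt.
indented = json.dumps(schema, sort_keys=True, indent=4)

print(compact)
print(indented)
```

Key order changes relative to the Python dictionary because of the sorting; the content of the schema is unchanged.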

OutputAdapter

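You may have noticed that the extraction result is the first element of the replies list and is a JSON string.

We would like to have a dictionary for each source document.

To perform this transformation in a pipeline, we can use the OutputAdapter.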

>>> import json
>>> from haystack.components.converters import OutputAdapter


>>> adapter = OutputAdapter(template="""{{ replies[0]| replace("'",'"') | json_loads}}""",
...                         output_type=dict,
...                         custom_filters={"json_loads": json.loads})

>>> print(adapter.run(**result))

{'output': {'Car': {'Name': 'Fiat Panda', 'Manufacturer': 'Fiat', 'Designers': ['Giorgetto Giugiaro', 'Aldo Mantovani', 'Giuliano Biasio', 'Roberto Giolito'], 'Number of units produced': 'over 7.8 million'}}}
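In plain Python, the adapter template corresponds roughly to replacing single quotes with double quotes and then calling `json.loads` on the first reply. The sketch below shows that step in isolation, with a variant that first tries to parse the reply unchanged, because blanket quote replacement can corrupt text that contains apostrophes. The `parse_reply` helper and the sample replies are hypothetical and not part of Haystack.

```python
import json

def parse_reply(reply):
    """Parse a model reply into a dict (hypothetical helper, not a Haystack API)."""
    try:
        # Valid JSON needs no preprocessing.
        return json.loads(reply)
    except json.JSONDecodeError:
        # Fallback mirroring the template: swap single quotes for double quotes.
        return json.loads(reply.replace("'", '"'))

valid_reply = '{"Car": {"Name": "Fiat Panda", "Designers": ["Giorgetto Giugiaro"]}}'
single_quoted_reply = "{'Car': {'Name': 'Fiat Panda'}}"
apostrophe_reply = '{"Plant": "Pomigliano d\'Arco"}'

print(parse_reply(valid_reply))
print(parse_reply(single_quoted_reply))
# Parsed as-is, so the apostrophe inside the value survives.
print(parse_reply(apostrophe_reply))
```

If you adopt something like this, it could be registered through `custom_filters` in the same way `json.loads` is registered above.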

Information Extraction Pipeline

Build the Pipeline

We can now create our Pipeline by adding and connecting the individual components.

from haystack import Pipeline

ie_pipe = Pipeline()
ie_pipe.add_component("fetcher", fetcher)
ie_pipe.add_component("converter", converter)
ie_pipe.add_component("prompt_builder", prompt_builder)
ie_pipe.add_component("generator", generator)
ie_pipe.add_component("adapter", adapter)

ie_pipe.connect("fetcher", "converter")
ie_pipe.connect("converter", "prompt_builder")
ie_pipe.connect("prompt_builder", "generator")
ie_pipe.connect("generator", "adapter")

# IN CASE YOU NEED TO RECREATE THE PIPELINE FROM SCRATCH, YOU CAN UNCOMMENT THIS CELL

# ie_pipe = Pipeline()
# ie_pipe.add_component("fetcher", LinkContentFetcher())
# ie_pipe.add_component("converter", HTMLToDocument())
# ie_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
# ie_pipe.add_component("generator", HuggingFaceLocalGenerator(model="numind/NuExtract",
#                                       huggingface_pipeline_kwargs={"model_kwargs": {"torch_dtype":torch.bfloat16}})
# )
# ie_pipe.add_component("adapter", OutputAdapter(template="""{{ replies[0]| replace("'",'"') | json_loads}}""",
#                                          output_type=dict,
#                                          custom_filters={"json_loads": json.loads}))

# ie_pipe.connect("fetcher", "converter")
# ie_pipe.connect("converter", "prompt_builder")
# ie_pipe.connect("prompt_builder", "generator")
# ie_pipe.connect("generator", "adapter")

Let's review our pipeline setup:

>>> ie_pipe.show()

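Define the sources and the extraction schema

We select a list of URLs related to recent startup funding announcements.

In addition, we define a schema for the structured information to extract.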

urls = ["https://techcrunch.com/2023/04/27/pinecone-drops-100m-investment-on-750m-valuation-as-vector-database-demand-grows/",
        "https://techcrunch.com/2023/04/27/replit-funding-100m-generative-ai/",
        "https://www.cnbc.com/2024/06/12/mistral-ai-raises-645-million-at-a-6-billion-valuation.html",
        "https://techcrunch.com/2024/01/23/qdrant-open-source-vector-database/",
        "https://www.intelcapital.com/anyscale-secures-100m-series-c-at-1b-valuation-to-radically-simplify-scaling-and-productionizing-ai-applications/",
        "https://techcrunch.com/2023/04/28/openai-funding-valuation-chatgpt/",
        "https://techcrunch.com/2024/03/27/amazon-doubles-down-on-anthropic-completing-its-planned-4b-investment/",
        "https://techcrunch.com/2024/01/22/voice-cloning-startup-elevenlabs-lands-80m-achieves-unicorn-status/",
        "https://techcrunch.com/2023/08/24/hugging-face-raises-235m-from-investors-including-salesforce-and-nvidia",
        "https://www.prnewswire.com/news-releases/ai21-completes-208-million-oversubscribed-series-c-round-301994393.html",
        "https://techcrunch.com/2023/03/15/adept-a-startup-training-ai-to-use-existing-software-and-apis-raises-350m/",
        "https://www.cnbc.com/2023/03/23/characterai-valued-at-1-billion-after-150-million-round-from-a16z.html"]


schema={
    "Funding": {
        "New funding": "",
        "Investors": [],
    },
    "Company": {
        "Name": "",
        "Activity": "",
        "Country": "",
        "Total valuation": "",
        "Total funding": ""
    }
}

Run the Pipeline!

We pass the required data to each component.

Note that most components receive their data from previously executed components.

from tqdm import tqdm

extracted_data=[]

for url in tqdm(urls):
    result = ie_pipe.run({"fetcher":{"urls":[url]},
                          "prompt_builder": {"schema":schema}})

    extracted_data.append(result["adapter"]["output"])
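Fetching a page or parsing the model's reply can fail for an individual URL. Here is a minimal sketch of a more tolerant loop, assuming you would rather skip a failing URL than stop the whole run. `extract_all` and `fake_extract` are hypothetical; in this notebook, the callable would wrap the `ie_pipe.run(...)` call.

```python
def extract_all(urls, extract_one):
    """Run `extract_one` on each URL, collecting results and failures separately."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(extract_one(url))
        except Exception as exc:  # broad on purpose: network, parsing, or model errors
            failures.append((url, repr(exc)))
    return results, failures

# Stand-in for the pipeline call, so the sketch runs without a model.
def fake_extract(url):
    if "broken" in url:
        raise ValueError("could not parse model output")
    return {"Company": {"Name": url.rsplit("/", 1)[-1]}}

results, failures = extract_all(
    ["https://example.com/acme", "https://example.com/broken"], fake_extract
)
print(results)
print(failures)

# With the real pipeline, the callable could look like this:
# lambda url: ie_pipe.run({"fetcher": {"urls": [url]},
#                          "prompt_builder": {"schema": schema}})["adapter"]["output"]
```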

Let's inspect some of the extracted data.

extracted_data[:2]
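Since the model generates free-form JSON, a quick structural check can complement reading the output by eye. The sketch below uses a hypothetical `missing_keys` helper and a made-up record to report which schema fields are absent from a result.

```python
def missing_keys(schema, record, prefix=""):
    """Return dotted paths of schema keys that are absent from `record`."""
    missing = []
    for key, expected in schema.items():
        path = f"{prefix}{key}"
        if not isinstance(record, dict) or key not in record:
            missing.append(path)
        elif isinstance(expected, dict):
            # Descend into nested sections of the schema.
            missing.extend(missing_keys(expected, record[key], prefix=f"{path}."))
    return missing

schema = {
    "Funding": {"New funding": "", "Investors": []},
    "Company": {
        "Name": "",
        "Activity": "",
        "Country": "",
        "Total valuation": "",
        "Total funding": "",
    },
}

# Made-up record for illustration; it is not real pipeline output.
record = {
    "Funding": {"New funding": "$10M"},
    "Company": {"Name": "ExampleCo", "Activity": "databases"},
}

print(missing_keys(schema, record))
```

The check only looks at structure. It says nothing about whether the extracted values are correct.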

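Data exploration and visualization

Let's explore the extracted data to assess its correctness and gain some insights.

Dataframe

We start by creating a Pandas DataFrame. For simplicity, we flatten the extracted data.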

def flatten_dict(d, parent_key=''):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key} - {k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key).items())
        elif isinstance(v, list):
            items.append((new_key, ', '.join(v)))
        else:
            items.append((new_key, v))
    return dict(items)

import pandas as pd

df = pd.DataFrame([flatten_dict(el) for el in extracted_data])
df = df.sort_values(by='Company - Name')

df

[Image: dataframe]

Apart from some errors in the "Company - Country" column, the extracted data looks good.
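To make the flattening step concrete, the following self-contained sketch repeats `flatten_dict` and applies it to a made-up record shaped like our schema. The record values are hypothetical, not pipeline output.

```python
def flatten_dict(d, parent_key=""):
    items = []
    for k, v in d.items():
        # Nested keys are joined with " - ", e.g. "Company - Name".
        new_key = f"{parent_key} - {k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key).items())
        elif isinstance(v, list):
            # Lists (such as investors) become one comma-separated string.
            items.append((new_key, ", ".join(v)))
        else:
            items.append((new_key, v))
    return dict(items)

record = {
    "Funding": {"New funding": "$10M", "Investors": ["Investor A", "Investor B"]},
    "Company": {"Name": "ExampleCo", "Country": ""},
}

flat = flatten_dict(record)
for column, value in flat.items():
    print(f"{column!r}: {value!r}")
```

Each flattened key becomes a DataFrame column, which is why the table can be sorted by `Company - Name`.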

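Construct a simple graph

To understand the relationships between companies and investors, we construct a graph and visualize it.

First, we build the graph using NetworkX.

NetworkX is a Python package that lets us create and manipulate networks/graphs in a simple way.

Our simple graph will have companies and investors as nodes. If an investor and a company appear in the same document, we connect them with an edge.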

import networkx as nx

# Create a new graph
G = nx.Graph()

# Add nodes and edges
for el in extracted_data:
    company_name = el["Company"]["Name"]
    G.add_node(company_name, label=company_name, title="Company")

    investors = el["Funding"]["Investors"]
    for investor in investors:
        if not G.has_node(investor):
            G.add_node(investor, label=investor, title="Investor", color="red")
        G.add_edge(company_name, investor)

Next, we use Pyvis to visualize the graph.

Pyvis is a Python package for interactive visualization of networks/graphs. It integrates nicely with NetworkX.

from pyvis.network import Network
from IPython.display import display, HTML


net = Network(notebook=True, cdn_resources='in_line')
net.from_nx(G)

net.show('simple_graph.html')
display(HTML('simple_graph.html'))

[Image: graph visualization]

It looks like Andreessen Horowitz appears quite frequently in the selected funding announcements 😊
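One way to back this observation with numbers is to count in how many records each investor appears. The sketch uses only the standard library and made-up records with the same shape as `extracted_data`. With the NetworkX graph, the same information should be available as the degree of each investor node (`G.degree`).

```python
from collections import Counter

# Made-up records for illustration; they are not real pipeline output.
records = [
    {"Company": {"Name": "ExampleCo"}, "Funding": {"Investors": ["Investor A", "Investor B"]}},
    {"Company": {"Name": "SampleAI"}, "Funding": {"Investors": ["Investor A"]}},
    {"Company": {"Name": "DemoLabs"}, "Funding": {"Investors": ["Investor A", "Investor C"]}},
]

investor_counts = Counter(
    investor
    for record in records
    # set() avoids counting an investor twice within the same record
    for investor in set(record["Funding"]["Investors"])
)

print(investor_counts.most_common(1))
```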

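Conclusion and ideas

In this notebook, we demonstrated how to set up an information extraction system using a small language model (NuExtract) and Haystack, a customizable orchestration framework for LLM applications.

How can we use the extracted data?

Some ideas:

The extracted data can be added to the original documents stored in a Document Store. This enables more powerful search through metadata filtering.
Building on the previous idea, you can do RAG (Retrieval-Augmented Generation) by extracting metadata from the query, as described in this blog post.
Store the documents and the extracted data in a knowledge graph and perform Graph RAG (Neo4j-Haystack integration).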
