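# Async Ingestion Pipeline + Metadata Extraction

LlamaIndex recently introduced async metadata extraction. Let's compare how fast metadata extraction runs in an ingestion pipeline on a newer and an older version of LlamaIndex.

We will test a pipeline on the classic Paul Graham essay.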


In [None]:
%pip install llama-index-embeddings-openai
%pip install llama-index-llms-openai

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

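## New LlamaIndex Ingestion

With LlamaIndex v0.9.7 or later, we can take advantage of the improved async metadata extraction inside an ingestion pipeline.

**NOTE:** Restart your notebook after installing a new version!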


In [None]:
!pip install "llama_index>=0.9.7"

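**NOTE:** The `num_workers` kwarg controls how many requests can be in flight at the same time, using an async semaphore. Setting it higher may increase speed, but it can also lead to timeouts or rate limits, so set it wisely.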
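To make the semaphore idea concrete, here is a minimal standalone sketch using only `asyncio`. It is not LlamaIndex's implementation: `fake_llm_call` and `run_bounded` are hypothetical names, and the sleep stands in for a real request. The point is only that a semaphore sized to `num_workers` caps how many jobs run at once.

```python
import asyncio


async def fake_llm_call(i: int, delay: float = 0.01) -> str:
    # Hypothetical stand-in for one metadata-extraction request.
    await asyncio.sleep(delay)
    return f"result-{i}"


async def run_bounded(num_jobs: int, num_workers: int) -> tuple[list[str], int]:
    # The semaphore allows at most `num_workers` jobs past this point at once.
    semaphore = asyncio.Semaphore(num_workers)
    in_flight = 0
    peak = 0

    async def worker(i: int) -> str:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)  # track the highest concurrency seen
            try:
                return await fake_llm_call(i)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(worker(i) for i in range(num_jobs)))
    return list(results), peak


results, peak = asyncio.run(run_bounded(num_jobs=18, num_workers=8))
print(f"{len(results)} jobs finished, peak concurrency {peak}")
```

Inside a notebook, where an event loop is already running, replace the `asyncio.run(...)` call with `await run_bounded(num_jobs=18, num_workers=8)`.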


In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import TitleExtractor, SummaryExtractor
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode


def build_pipeline():
    llm = OpenAI(model="gpt-3.5-turbo-1106", temperature=0.1)

    transformations = [
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        TitleExtractor(
            llm=llm, metadata_mode=MetadataMode.EMBED, num_workers=8
        ),
        SummaryExtractor(
            llm=llm, metadata_mode=MetadataMode.EMBED, num_workers=8
        ),
        OpenAIEmbedding(),
    ]

    return IngestionPipeline(transformations=transformations)

In [None]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham").load_data()

In [None]:
import time

times = []
for _ in range(3):
    time.sleep(30)  # help prevent rate limits/timeouts and keep each run fair
    pipeline = build_pipeline()
    start = time.time()
    nodes = await pipeline.arun(documents=documents)
    end = time.time()
    times.append(end - start)

print(f"Average time: {sum(times) / len(times)}")

100%|██████████| 5/5 [00:01<00:00,  3.99it/s]
100%|██████████| 18/18 [00:07<00:00,  2.36it/s]
100%|██████████| 5/5 [00:01<00:00,  2.97it/s]
100%|██████████| 18/18 [00:06<00:00,  2.63it/s]
100%|██████████| 5/5 [00:01<00:00,  3.84it/s]
100%|██████████| 18/18 [01:07<00:00,  3.75s/it]


Average time: 31.196589946746826


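The current `openai` Python client package is a little unstable: async jobs sometimes time out, which skews the average. You can see that the last progress bar took about a minute instead of the 6 or 7 seconds of the earlier runs.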
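One way to see how a single slow run distorts the result is to compare the mean with the median. The timings below are illustrative values loosely read off the progress bars, not measured per-run totals, so treat this as a sketch of the effect rather than a re-analysis.

```python
from statistics import mean, median

# Illustrative per-run timings in seconds (assumed, not measured values).
run_times = [8.0, 7.0, 68.0]

# The mean is pulled toward the one slow run; the median is not.
print(f"mean: {mean(run_times):.1f}s, median: {median(run_times):.1f}s")
```

When timeouts are a concern, reporting the median or running more repetitions gives a steadier picture than a three-run average.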


## Old LlamaIndex Ingestion

Now let's compare against an older version of LlamaIndex, which used "fake" async for metadata extraction.
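As a rough illustration of the difference, the sketch below assumes that "fake" async means a coroutine that simply calls blocking code, so concurrent jobs still run one after another. The function names are hypothetical, and the sleeps stand in for LLM requests. It is not taken from either LlamaIndex version.

```python
import asyncio
import time


def blocking_extract(delay: float) -> str:
    # Hypothetical stand-in for a synchronous LLM request.
    time.sleep(delay)
    return "metadata"


async def fake_async_extract(delay: float) -> str:
    # "Fake" async: declared async, but blocks the event loop while it works.
    return blocking_extract(delay)


async def real_async_extract(delay: float) -> str:
    # Real async: yields to the event loop while waiting.
    await asyncio.sleep(delay)
    return "metadata"


async def timed(extract, n: int, delay: float) -> float:
    # Launch n jobs "concurrently" and measure the wall-clock time.
    start = time.perf_counter()
    await asyncio.gather(*(extract(delay) for _ in range(n)))
    return time.perf_counter() - start


async def main() -> tuple[float, float]:
    fake = await timed(fake_async_extract, n=5, delay=0.05)
    real = await timed(real_async_extract, n=5, delay=0.05)
    return fake, real


fake_elapsed, real_elapsed = asyncio.run(main())
print(f"fake async: {fake_elapsed:.2f}s, real async: {real_elapsed:.2f}s")
```

The fake variant takes roughly `n * delay` because each job blocks the loop, while the real variant overlaps the waits. In a notebook, use `await main()` instead of `asyncio.run(main())`.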

**NOTE:** Restart your notebook after switching versions!


In [None]:
!pip install "llama_index<0.9.6"

Collecting llama_index<0.9.6
  Obtaining dependency information for llama_index<0.9.6 from https://files.pythonhosted.org/packages/ac/3c/dee8ec4fecaaeabbd8a61ade9ddb6af09d05553c2a0acbebd1b559eaeb30/llama_index-0.9.5-py3-none-any.whl.metadata
  Downloading llama_index-0.9.5-py3-none-any.whl.metadata (8.2 kB)
Downloading llama_index-0.9.5-py3-none-any.whl (893 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 893.9/893.9 kB 6.0 MB/s eta 0:00:00
Installing collected packages: llama_index
  Attempting uninstall: llama_index
    Found existing installation: llama-index 0.9.8.post1
    Uninstalling llama-index-0.9.8.post1:
      Successfully uninstalled llama-index-0.9.8.post1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
trulens-eval 0.18.0 requires llama-index==0.8.69, but you have llama-index 0.9

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "sk-..."

In [None]:
# Pre-0.10 releases do not ship the `llama_index.core` namespace,
# so the 0.9.x import paths are used here.
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI
from llama_index.ingestion import IngestionPipeline
from llama_index.extractors import TitleExtractor, SummaryExtractor
from llama_index.text_splitter import SentenceSplitter
from llama_index.schema import MetadataMode


def build_pipeline():
    llm = OpenAI(model="gpt-3.5-turbo-1106", temperature=0.1)

    transformations = [
        SentenceSplitter(chunk_size=1024, chunk_overlap=20),
        TitleExtractor(llm=llm, metadata_mode=MetadataMode.EMBED),
        SummaryExtractor(llm=llm, metadata_mode=MetadataMode.EMBED),
        OpenAIEmbedding(),
    ]

    return IngestionPipeline(transformations=transformations)

In [None]:
from llama_index import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data/paul_graham").load_data()

In [None]:
import time

times = []
for _ in range(3):
    time.sleep(30)  # help prevent rate limits/timeouts and keep each run fair
    pipeline = build_pipeline()
    start = time.time()
    nodes = await pipeline.arun(documents=documents)
    end = time.time()
    times.append(end - start)

print(f"Average time: {sum(times) / len(times)}")

Extracting titles:   0%|          | 0/5 [00:00<?, ?it/s]

Extracting summaries:   0%|          | 0/18 [00:00<?, ?it/s]

Extracting titles:   0%|          | 0/5 [00:00<?, ?it/s]

Extracting summaries:   0%|          | 0/18 [00:00<?, ?it/s]

Extracting titles:   0%|          | 0/5 [00:00<?, ?it/s]

Extracting summaries:   0%|          | 0/18 [00:00<?, ?it/s]

Average time: 106.17690531412761
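Putting the two reported averages side by side, the newer version comes out roughly 3.4x faster in this run. The short calculation below uses only the two averages printed above. Since the new-version average was inflated by one slow run, the ratio is best read as a rough figure.

```python
# Averages reported by the two timing cells above, in seconds.
new_avg = 31.196589946746826
old_avg = 106.17690531412761

speedup = old_avg / new_avg
print(f"speedup: {speedup:.1f}x")
```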
