OpenAI's Large Text Embedding Model: text-embedding-3-large

2024-08-01 · 3 min read

Introduction to text-embedding-3-large

text-embedding-3-large is OpenAI's large text embedding model. It creates embeddings with up to 3072 dimensions. Compared with OpenAI's other text embedding models, such as text-embedding-ada-002, text-embedding-3-large offers stronger performance and lower pricing.

Let's take a quick look at some basics.

How to use the text-embedding-3-large model

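There are three main ways to generate vector embeddings with the text-embedding-3-large model:

OpenAI Embedding: the Python SDK provided by OpenAI.

Zilliz Cloud Pipelines: a built-in feature of Zilliz Cloud (managed Milvus) that seamlessly integrates the text-embedding-3-large model. It is a ready-to-use solution that simplifies creating and retrieving text vector embeddings.

PyMilvus: the Python SDK for Milvus, which seamlessly integrates the text-embedding-3-large model.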

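Generate vector embeddings via the OpenAI SDK.

from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()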
results = client.embeddings.create(
   input=[
   "Artificial intelligence was founded as an academic discipline in 1956.",
   "Alan Turing was the first person to conduct substantial research in AI.",
   "Born in Maida Vale, London, Turing was raised in southern England."
   ],
   model="text-embedding-3-large"
   )
# The response object holds one item per input string in its `data` list
embeddings = [item.embedding for item in results.data]

For more information, refer to OpenAI's embedding guide.
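Once embeddings have been generated, a common next step is to compare them. The following is a minimal sketch of cosine-similarity ranking, not part of the OpenAI SDK: `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, and the short toy vectors stand in for real 3072-dimensional embeddings so that the snippet runs without an API key.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], docs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query, d)) for i, d in enumerate(docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors used in place of real embeddings (assumption)
toy_docs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
toy_query = [0.8, 0.6, 0.0]

ranking = rank_by_similarity(toy_query, toy_docs)
print(ranking)  # document 1 ranks first, then document 0, then document 2
```

With real data, `toy_docs` would be replaced by the document embeddings and `toy_query` by the embedding of the query string.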

Generate vector embeddings via PyMilvus.

from pymilvus import model

openai_ef = model.dense.OpenAIEmbeddingFunction(
   model_name="text-embedding-3-large", # Specify the model name
   api_key="YOUR_API_KEY", # Provide your OpenAI API key
   dimensions=512 # Set the embedding dimensionality
   )

# Generate embeddings for documents
docs = [
   "Artificial intelligence was founded as an academic discipline in 1956.",
   "Alan Turing was the first person to conduct substantial research in AI.",
   "Born in Maida Vale, London, Turing was raised in southern England."
]
docs_embeddings = openai_ef.encode_documents(docs)
print("Docs embeddings:", docs_embeddings)

# Generate embeddings for queries
queries = ["When was artificial intelligence founded",
          "Where was Alan Turing born?"]
query_embeddings = openai_ef.encode_queries(queries)
print("Queries embeddings:", query_embeddings)

For more information, refer to our PyMilvus embedding model documentation.
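The `dimensions=512` argument in the PyMilvus example requests embeddings shorter than the full 3072 dimensions. OpenAI's embedding guide also describes shortening a text-embedding-3 vector after the fact by keeping its leading components and re-normalizing to unit length. The following is a minimal sketch of that idea under stated assumptions: `shorten_embedding` is a hypothetical helper name, and a random vector stands in for a real embedding so that the snippet runs offline.

```python
import math
import random
from typing import List, Sequence


def shorten_embedding(vec: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]


# Random stand-in for a 3072-dimensional embedding (assumption, not model output)
random.seed(0)
full = [random.gauss(0.0, 1.0) for _ in range(3072)]

short = shorten_embedding(full, 512)
short_norm = math.sqrt(sum(x * x for x in short))
print(len(short), round(short_norm, 6))
```

Passing the target size when the embedding is created, as the PyMilvus example does, is usually simpler; manual truncation is mainly useful when full-size vectors are already stored.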

Generate vector embeddings via Zilliz Cloud Pipelines.

import requests

ZILLIZ_REGION = "gcp-us-west1" # Change region if needed
ZILLIZ_API_KEY = "*****" # Use your Zilliz API Key
PROJECT_ID = "proj-xxxxx" # Use your Project ID
CLUSTER_ID = "inxx-xxxxx" # Use your Cluster ID

domain = f"https://controller.api.{ZILLIZ_REGION}.zillizcloud.com/v1/pipelines"
headers = {
           "Authorization": f"Bearer {ZILLIZ_API_KEY}",
           "Accept": "application/json",
           "Content-Type": "application/json"
       }

################### Ingestion Pipeline ###################
pipe_config = {
       "name": "my_text_ingestion_pipeline",
       "projectId": PROJECT_ID,
       "clusterId": CLUSTER_ID,
       "collectionName": "my_collection",
       "description": "A pipeline that generates text embeddings and stores additional fields.",
       "type": "INGESTION", 
       "functions": [
           {
               "name": "index_my_text",
               "action": "INDEX_TEXT",
               "language": "ENGLISH",
               "embedding": "openai/text-embedding-3-large" # Change this value to switch embedding service
           }
       ]  
   }

# Create ingestion pipeline
response = requests.post(domain, headers=headers, json=pipe_config)
pipe_id = response.json()["data"]["pipelineId"]
print(response.json())
# {
#     "code": 200,
#     "data": {
#     "pipelineId": "pipe-xxxxx",
#     "name": "my_text_ingestion_pipeline",
#     "type": "INGESTION",
#     "createTimestamp": 1721115598000,
#     "description": "A pipeline that generates text embeddings and stores additional fields.",
#     "status": "SERVING",
#     "functions": [{
#         "name":"index_my_text",
#         "action":"INDEX_TEXT",
#         "inputFields": ["text_list"],
#         "language": "ENGLISH",
#         "embedding": "openai/text-embedding-3-large"
#         }],
#     "clusterId":"inxxx-xxxxx",
#     "collectionName":"my_collection"
#     }
# }

# Run ingestion pipeline
url = f"{domain}/{pipe_id}/run"
data = {
   "text_list": ["text 1", "text 2"]
   }
response = requests.post(url, headers=headers, json={"data": data})
print(response.json())
# {
#     "code": 200,
#     "data": {
#         "num_entities": 2,
#         "usage":{"embedding":6},
#         "ids": [450930478333400937,450930478333400938]
#         }
# }

################### Search Pipeline ###################
pipe_config = {
       "name": "my_text_search_pipeline",
       "projectId": PROJECT_ID,
       "description": "A pipeline that receives text and search for semantically similar texts",
       "type": "SEARCH",
       "functions": [
           {
               "name": "search_text",
               "action": "SEARCH_TEXT",
               "clusterId": CLUSTER_ID,
               "collectionName": "my_collection",
               "embedding": "openai/text-embedding-3-large",
               "reranker": "zilliz/bge-reranker-base" # optional, remove this config if you want to disable reranker
           }
       ]
   }

# Create search pipeline
response = requests.post(domain, headers=headers, json=pipe_config)
pipe_id = response.json()["data"]["pipelineId"]
print(response.json())
# {
#     "code": 200,
#     "data": {
#     "pipelineId": "pipe-xxxxx",
#     "name": "my_text_search_pipeline",
#     "type": "SEARCH",
#     "createTimestamp": 1721117024000,
#     "description": "A pipeline that receives text and search for semantically similar texts",
#     "status": "SERVING",
#     "functions": [{
#         "name":"search_text",
#         "action":"SEARCH_TEXT",
#         "inputFields": ["query_text"],
#         "embedding": "openai/text-embedding-3-large",
#         "reranker":"zilliz/bge-reranker-base",
#         "clusterId":"inxxx-xxxxx",
#         "collectionName":"my_collection"
#         }]
#     }
# }

# Run search pipeline
url = f"{domain}/{pipe_id}/run"
data = {
   "query_text": "example query"
   }
params = {
         "limit": 2,
         "offset": 0,
         "outputFields": [],
         "filter": ""
     }
response = requests.post(url, headers=headers, json={"data": data, "params": params})
print(response.json())
# {
#     "code": 200,
#     "data": {
#         "result": [
#             {"id": 450930478333400938, "distance": 0.07672341167926788, "text": "text 2"},
#             {"id": 450930478333400937, "distance": 0.028605177998542786, "text": "text 1"}
#             ],
#         "usage": {"embedding": 2, "rerank": 16}
#         }
# }
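The example above indexes straight into `response.json()["data"]`, which fails with an unhelpful `KeyError` if a request does not succeed. The following is a hedged sketch of a small guard, based only on the sample responses shown above: `unwrap` and `top_texts` are hypothetical helper names, treating any `code` other than 200 as a failure is an assumption, and the shape of `sample_error` is invented for illustration.

```python
from typing import Any, Dict, List


def unwrap(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return payload["data"] when the call succeeded, otherwise raise."""
    # Assumption: success is signalled by code == 200 plus a "data" field,
    # as in the sample responses shown in the comments above.
    if payload.get("code") != 200 or "data" not in payload:
        raise RuntimeError(f"Pipeline request failed: {payload}")
    return payload["data"]


def top_texts(search_payload: Dict[str, Any]) -> List[str]:
    """Extract the matched texts from a search pipeline response."""
    return [hit["text"] for hit in unwrap(search_payload)["result"]]


# Payloads copied from the sample responses above (trimmed)
sample_create = {"code": 200, "data": {"pipelineId": "pipe-xxxxx", "status": "SERVING"}}
sample_search = {
    "code": 200,
    "data": {
        "result": [
            {"id": 450930478333400938, "distance": 0.07672341167926788, "text": "text 2"},
            {"id": 450930478333400937, "distance": 0.028605177998542786, "text": "text 1"},
        ],
        "usage": {"embedding": 2, "rerank": 16},
    },
}
# Hypothetical error payload; the real error format may differ
sample_error = {"code": 401, "message": "hypothetical error"}

pipe_id = unwrap(sample_create)["pipelineId"]
texts = top_texts(sample_search)
print(pipe_id, texts)
```

In the request code above, `unwrap(response.json())` could take the place of the direct `response.json()["data"]` lookups.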

For more information, refer to the following resources: