{"cells": [
 {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [
  "## Code search using embeddings\n",
  "\n",
  "This notebook shows how OpenAI embeddings (here, the `text-embedding-3-small` model) can be used to implement semantic code search. For this demonstration, we use our own [openai-python code repository](https://github.com/openai/openai-python). We implement a simple version of file parsing that extracts functions from Python files so that they can be embedded, indexed, and queried.\n"]},
 {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [
  "### Helper functions\n",
  "\n",
  "We first set up some simple parsing functions that let us extract the important information from the codebase.\n"]},
 {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [
  "import pandas as pd\n",
  "from pathlib import Path\n",
  "\n",
  "DEF_PREFIXES = ['def ', 'async def ']\n",
  "NEWLINE = '\\n'\n",
  "\n",
  "def get_function_name(code):\n",
  "    \"\"\"\n",
  "    Extract the function name from a line beginning with 'def' or 'async def'.\n",
  "    \"\"\"\n",
  "    for prefix in DEF_PREFIXES:\n",
  "        if code.startswith(prefix):\n",
  "            return code[len(prefix): code.index('(')]\n",
  "\n",
  "\n",
  "def get_until_no_space(all_lines, i):\n",
  "    \"\"\"\n",
  "    Collect lines starting at index i until a line outside the function definition is found.\n",
  "    \"\"\"\n",
  "    ret = [all_lines[i]]\n",
  "    for j in range(i + 1, len(all_lines)):\n",
  "        if len(all_lines[j]) == 0 or all_lines[j][0] in [' ', '\\t', ')']:\n",
  "            ret.append(all_lines[j])\n",
  "        else:\n",
  "            break\n",
  "    return NEWLINE.join(ret)\n",
  "\n",
  "\n",
  "def get_functions(filepath):\n",
  "    \"\"\"\n",
  "    Yield all functions in a Python file.\n",
  "    Only unindented 'def' and 'async def' lines are matched, so methods inside classes are skipped.\n",
  "    \"\"\"\n",
  "    with open(filepath, 'r') as file:\n",
  "        all_lines = file.read().replace('\\r', NEWLINE).split(NEWLINE)\n",
  "        for i, l in enumerate(all_lines):\n",
  "            for prefix in DEF_PREFIXES:\n",
  "                if l.startswith(prefix):\n",
  "                    code = get_until_no_space(all_lines, i)\n",
  "                    function_name = get_function_name(code)\n",
  "                    yield {\n",
  "                        'code': code,\n",
  "                        'function_name': function_name,\n",
  "                        'filepath': filepath,\n",
  "                    }\n",
  "                    break\n",
  "\n",
  "\n",
  "def extract_functions_from_repo(code_root):\n",
  "    \"\"\"\n",
  "    Extract the functions from every .py file in the repository.\n",
  "    \"\"\"\n",
  "    code_files = list(code_root.glob('**/*.py'))\n",
  "\n",
  "    num_files = len(code_files)\n",
  "    print(f'Total number of .py files: {num_files}')\n",
  "\n",
  "    if num_files == 0:\n",
  "        print('Verify openai-python repo exists and code_root is set correctly.')\n",
  "        return None\n",
  "\n",
  "    all_funcs = [\n",
  "        func\n",
  "        for code_file in code_files\n",
  "        for func in get_functions(str(code_file))\n",
  "    ]\n",
  "\n",
  "    num_funcs = len(all_funcs)\n",
  "    print(f'Total number of functions extracted: {num_funcs}')\n",
  "\n",
  "    return all_funcs\n"]},
 {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [
  "### Data loading\n",
  "\n",
  "We first load the openai-python folder and extract the needed information using the functions defined above.\n"]},
 {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Total number of .py files: 51\n", "Total number of functions extracted: 97\n"]}], "source": [
  "# Set user root directory to the 'openai-python' repository\n",
  "root_dir = Path.home()\n",
  "\n",
  "# Assumes the 'openai-python' repository exists in the user's root directory\n",
  "code_root = root_dir / 'openai-python'\n",
  "\n",
  "# Extract all functions from the repository\n",
  "all_funcs = extract_functions_from_repo(code_root)\n"]},
 {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [
  "Now that we have our content, we can pass the data to the `text-embedding-3-small` model and get back our vector embeddings.\n"]},
 {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n",
"| | code | function_name | filepath | code_embedding |\n",
"|---|---|---|---|---|\n",
"| 0 | def _console_log_level():\\n if openai.log i... | _console_log_level | openai/util.py | [0.005937571171671152, 0.05450401455163956, 0.... |\n",
"| 1 | def log_debug(message, **params):\\n msg = l... | log_debug | openai/util.py | [0.017557814717292786, 0.05647840350866318, -0... |\n",
"| 2 | def log_info(message, **params):\\n msg = lo... | log_info | openai/util.py | [0.022524144500494003, 0.06219055876135826, -0... |\n",
"| 3 | def log_warn(message, **params):\\n msg = lo... | log_warn | openai/util.py | [0.030524108558893204, 0.0667714849114418, -0.... |\n",
"| 4 | def logfmt(props):\\n def fmt(key, val):\\n ... | logfmt | openai/util.py | [0.05337328091263771, 0.03697286546230316, -0.... |\n", "