{"cells": [{"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["## 使用嵌入进行代码搜索\n", "\n", "本笔记展示了如何使用Ada嵌入来实现语义代码搜索。在这个演示中,我们使用我们自己的[openai-python代码库](https://github.com/openai/openai-python)。我们实现了一个简单版本的文件解析和从python文件中提取函数的功能,这些函数可以被嵌入、索引和查询。\n"]}, {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["### 辅助函数\n", "\n", "我们首先设置一些简单的解析函数,这些函数允许我们从我们的代码库中提取重要信息。\n"]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["import pandas as pd\n", "from pathlib import Path\n", "\n", "DEF_PREFIXES = ['def ', 'async def ']\n", "NEWLINE = '\\n'\n", "\n", "def get_function_name(code):\n", " \"\"\"\n", " 从以'def'或'async def'开头的行中提取函数名。\n", " \"\"\"\n", " for prefix in DEF_PREFIXES:\n", " if code.startswith(prefix):\n", " return code[len(prefix): code.index('(')]\n", "\n", "\n", "def get_until_no_space(all_lines, i):\n", " \"\"\"\n", " 获取所有行,直到找到函数定义之外的行。\n", " \"\"\"\n", " ret = [all_lines[i]]\n", " for j in range(i + 1, len(all_lines)):\n", " if len(all_lines[j]) == 0 or all_lines[j][0] in [' ', '\\t', ')']:\n", " ret.append(all_lines[j])\n", " else:\n", " break\n", " return NEWLINE.join(ret)\n", "\n", "\n", "def get_functions(filepath):\n", " \"\"\"\n", " 获取Python文件中的所有函数。\n", " \"\"\"\n", " with open(filepath, 'r') as file:\n", " all_lines = file.read().replace('\\r', NEWLINE).split(NEWLINE)\n", " for i, l in enumerate(all_lines):\n", " for prefix in DEF_PREFIXES:\n", " if l.startswith(prefix):\n", " code = get_until_no_space(all_lines, i)\n", " function_name = get_function_name(code)\n", " yield {\n", " 'code': code,\n", " 'function_name': function_name,\n", " 'filepath': filepath,\n", " }\n", " break\n", "\n", "\n", "def extract_functions_from_repo(code_root):\n", " \"\"\"\n", " 从仓库中提取所有.py文件中的函数。\n", " \"\"\"\n", " code_files = list(code_root.glob('**/*.py'))\n", "\n", " num_files = len(code_files)\n", " print(f'Total number of .py files: {num_files}')\n", "\n", " if num_files == 0:\n", " print('Verify openai-python repo exists and code_root is set correctly.')\n", " return None\n", "\n", " all_funcs = [\n", " func\n", " for code_file in code_files\n", " for func in get_functions(str(code_file))\n", " ]\n", "\n", " num_funcs = len(all_funcs)\n", " print(f'Total number of functions extracted: {num_funcs}')\n", "\n", " return all_funcs\n"]}, {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["# 数据加载\n", "\n", "我们首先加载openai-python文件夹,并使用我们上面定义的函数提取所需的信息。\n"]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Total number of .py files: 51\n", "Total number of functions extracted: 97\n"]}], "source": ["# Set user root directory to the 'openai-python' repository\n", "root_dir = Path.home()\n", "\n", "# Assumes the 'openai-python' repository exists in the user's root directory\n", "code_root = root_dir / 'openai-python'\n", "\n", "# 从仓库中提取所有功能\n", "all_funcs = extract_functions_from_repo(code_root)\n"]}, {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["现在我们已经有了我们的内容,我们可以将数据传递给`text-embedding-3-small`模型,并获得我们的向量嵌入。\n"]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [{"data": {"text/html": ["
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
codefunction_namefilepathcode_embedding
0def _console_log_level():\\n if openai.log i..._console_log_levelopenai/util.py[0.005937571171671152, 0.05450401455163956, 0....
1def log_debug(message, **params):\\n msg = l...log_debugopenai/util.py[0.017557814717292786, 0.05647840350866318, -0...
2def log_info(message, **params):\\n msg = lo...log_infoopenai/util.py[0.022524144500494003, 0.06219055876135826, -0...
3def log_warn(message, **params):\\n msg = lo...log_warnopenai/util.py[0.030524108558893204, 0.0667714849114418, -0....
4def logfmt(props):\\n def fmt(key, val):\\n ...logfmtopenai/util.py[0.05337328091263771, 0.03697286546230316, -0....
\n", "
"], "text/plain": [" code function_name \\\n", "0 def _console_log_level():\\n if openai.log i... _console_log_level \n", "1 def log_debug(message, **params):\\n msg = l... log_debug \n", "2 def log_info(message, **params):\\n msg = lo... log_info \n", "3 def log_warn(message, **params):\\n msg = lo... log_warn \n", "4 def logfmt(props):\\n def fmt(key, val):\\n ... logfmt \n", "\n", " filepath code_embedding \n", "0 openai/util.py [0.005937571171671152, 0.05450401455163956, 0.... \n", "1 openai/util.py [0.017557814717292786, 0.05647840350866318, -0... \n", "2 openai/util.py [0.022524144500494003, 0.06219055876135826, -0... \n", "3 openai/util.py [0.030524108558893204, 0.0667714849114418, -0.... \n", "4 openai/util.py [0.05337328091263771, 0.03697286546230316, -0.... "]}, "execution_count": 3, "metadata": {}, "output_type": "execute_result"}], "source": ["from utils.embeddings_utils import get_embedding\n", "\n", "df = pd.DataFrame(all_funcs)\n", "df['code_embedding'] = df['code'].apply(lambda x: get_embedding(x, model='text-embedding-3-small'))\n", "df['filepath'] = df['filepath'].map(lambda x: Path(x).relative_to(code_root))\n", "df.to_csv(\"data/code_search_openai-python.csv\", index=False)\n", "df.head()\n"]}, {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["### 测试\n", "\n", "让我们使用一些简单的查询来测试我们的端点。如果您熟悉`openai-python`存储库,您会发现我们可以很容易地通过简单的英文描述找到我们要查找的函数。\n", "\n", "我们定义了一个`search_functions`方法,该方法接受包含嵌入、查询字符串和一些其他配置选项的数据。搜索我们的数据库的过程如下:\n", "\n", "1. 首先,我们使用`text-embedding-3-small`对我们的查询字符串(`code_query`)进行嵌入。这里的理由是,像'a function that reverses a string'这样的查询字符串和像'def reverse(string): return string[::-1]'这样的函数在嵌入时会非常相似。\n", "2. 然后,我们计算我们的查询字符串嵌入与数据库中所有数据点之间的余弦相似度。这会给出每个点与我们的查询之间的距离。\n", "3. 最后,我们按照它们与我们的查询字符串的距离对所有数据点进行排序,并返回函数参数中请求的结果数量。\n"]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": ["from utils.embeddings_utils import cosine_similarity\n", "\n", "def search_functions(df, code_query, n=3, pprint=True, n_lines=7):\n", " embedding = get_embedding(code_query, model='text-embedding-3-small')\n", " df['similarities'] = df.code_embedding.apply(lambda x: cosine_similarity(x, embedding))\n", "\n", " res = df.sort_values('similarities', ascending=False).head(n)\n", "\n", " if pprint:\n", " for r in res.iterrows():\n", " print(f\"{r[1].filepath}:{r[1].function_name} score={round(r[1].similarities, 3)}\")\n", " print(\"\\n\".join(r[1].code.split(\"\\n\")[:n_lines]))\n", " print('-' * 70)\n", "\n", " return res\n"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["openai/validators.py:format_inferrer_validator score=0.453\n", "def format_inferrer_validator(df):\n", " \"\"\"\n", " This validator will infer the likely fine-tuning format of the data, and display it to the user if it is classification.\n", " It will also suggest to use ada and explain train/validation split benefits.\n", " \"\"\"\n", " ft_type = infer_task_type(df)\n", " immediate_msg = None\n", "----------------------------------------------------------------------\n", "openai/validators.py:infer_task_type score=0.37\n", "def infer_task_type(df):\n", " \"\"\"\n", " Infer the likely fine-tuning task type from the data\n", " \"\"\"\n", " CLASSIFICATION_THRESHOLD = 3 # min_average instances of each class\n", " if sum(df.prompt.str.len()) == 0:\n", " return \"open-ended generation\"\n", "----------------------------------------------------------------------\n", "openai/validators.py:apply_validators 
score=0.369\n", "def apply_validators(\n", " df,\n", " fname,\n", " remediation,\n", " validators,\n", " auto_accept,\n", " write_out_file_func,\n", "----------------------------------------------------------------------\n"]}], "source": ["res = search_functions(df, 'fine-tuning input data validation logic', n=3)\n"]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["openai/validators.py:get_common_xfix score=0.487\n", "def get_common_xfix(series, xfix=\"suffix\"):\n", " \"\"\"\n", " Finds the longest common suffix or prefix of all the values in a series\n", " \"\"\"\n", " common_xfix = \"\"\n", " while True:\n", " common_xfixes = (\n", " series.str[-(len(common_xfix) + 1) :]\n", " if xfix == \"suffix\"\n", " else series.str[: len(common_xfix) + 1]\n", "----------------------------------------------------------------------\n", "openai/validators.py:common_completion_suffix_validator score=0.449\n", "def common_completion_suffix_validator(df):\n", " \"\"\"\n", " This validator will suggest to add a common suffix to the completion if one doesn't already exist in case of classification or conditional generation.\n", " \"\"\"\n", " error_msg = None\n", " immediate_msg = None\n", " optional_msg = None\n", " optional_fn = None\n", "\n", " ft_type = infer_task_type(df)\n", "----------------------------------------------------------------------\n"]}], "source": ["res = search_functions(df, 'find common suffix', n=2, n_lines=10)\n"]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["openai/cli.py:tools_register score=0.391\n", "def tools_register(parser):\n", " subparsers = parser.add_subparsers(\n", " title=\"Tools\", help=\"Convenience client side tools\"\n", " )\n", "\n", " def help(args):\n", " parser.print_help()\n", "\n", " parser.set_defaults(func=help)\n", "\n", " sub = subparsers.add_parser(\"fine_tunes.prepare_data\")\n", " sub.add_argument(\n", " \"-f\",\n", " \"--file\",\n", " required=True,\n", " help=\"JSONL, JSON, CSV, TSV, TXT or XLSX file containing prompt-completion examples to be analyzed.\"\n", " \"This should be the local file path.\",\n", " )\n", " sub.add_argument(\n", " \"-q\",\n", "----------------------------------------------------------------------\n"]}], "source": ["res = search_functions(df, 'Command line interface for fine-tuning', n=1, n_lines=20)\n"]}], "metadata": {"kernelspec": {"display_name": "openai", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5"}, "orig_nbformat": 4}, "nbformat": 4, "nbformat_minor": 2}