{"cells": [{"cell_type": "markdown", "metadata": {"id": "0wjP9mrldJsd"}, "source": ["## 在Kangas中可视化嵌入\n", "\n", "在这个Jupyter Notebook中,我们构建一个包含数据和嵌入投影到2维空间中的Kangas DataGrid。\n"]}, {"cell_type": "markdown", "metadata": {"id": "4tPKQqqldJsj"}, "source": ["## 什么是Kangas?\n", "\n", "[Kangas](https://github.com/comet-ml/kangas/)是一个面向数据科学家的开源、混合媒体、类似数据框的工具。它是由[Comet](https://comet.com/)开发的,Comet是一家旨在帮助减少将模型投入生产过程中的摩擦的公司。\n"]}, {"cell_type": "markdown", "metadata": {"id": "6sNsB2iFdJsk"}, "source": ["### 1. 设置\n", "\n", "要开始使用,我们需要使用pip安装kangas,并导入它。\n"]}, {"cell_type": "code", "execution_count": 1, "metadata": {"colab": {"base_uri": "https://localhost:8080/"}, "id": "N8gi529adL-f", "outputId": "c12e9973-a179-41e3-c5a8-f241804d99ad"}, "outputs": [], "source": ["%pip install kangas --quiet\n"]}, {"cell_type": "code", "execution_count": 2, "metadata": {"id": "htxjXThodRxD"}, "outputs": [], "source": ["import kangas as kg\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 2. 构建Kangas数据表格\n", "\n", "我们使用原始数据和嵌入来创建一个Kangas数据表格。数据由一系列评论行组成,而嵌入由1536个浮点值组成。在这个示例中,我们直接从github获取数据,以防您不是在OpenAI的存储库中运行此笔记本。\n", "\n", "我们使用Kangas将CSV文件读入数据表格中,以便进行进一步处理。\n"]}, {"cell_type": "code", "execution_count": 3, "metadata": {"colab": {"base_uri": "https://localhost:8080/"}, "id": "0SxWlRTrdVJq", "outputId": "d36c3a14-2e80-4315-e285-f39f6b008976"}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Loading CSV file 'fine_food_reviews_with_embeddings_1k.csv'...\n"]}, {"name": "stderr", "output_type": "stream", "text": ["1001it [00:00, 2412.90it/s]\n", "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2899.16it/s]\n"]}], "source": ["data = kg.read_csv(\"https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/fine_food_reviews_with_embeddings_1k.csv\")\n"]}, {"cell_type": "markdown", "metadata": {}, 
"source": ["我们可以查看CSV文件的字段:\n"]}, {"cell_type": "code", "execution_count": 4, "metadata": {"colab": {"base_uri": "https://localhost:8080/"}, "id": "bzhQgoRGeMCp", "outputId": "791c4e40-fb28-409e-d1e9-20b753fb1215"}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["DataGrid (in memory)\n", " Name : fine_food_reviews_with_embeddings_1k\n", " Rows : 1,000\n", " Columns: 9\n", "# Column Non-Null Count DataGrid Type \n", "--- -------------------- --------------- --------------------\n", "1 Column 1 1,000 INTEGER \n", "2 ProductId 1,000 TEXT \n", "3 UserId 1,000 TEXT \n", "4 Score 1,000 INTEGER \n", "5 Summary 1,000 TEXT \n", "6 Text 1,000 TEXT \n", "7 combined 1,000 TEXT \n", "8 n_tokens 1,000 INTEGER \n", "9 embedding 1,000 TEXT \n"]}], "source": ["data.info()\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["并查看第一行和最后一行:\n"]}, {"cell_type": "code", "execution_count": 5, "metadata": {"colab": {"base_uri": "https://localhost:8080/", "height": 349}, "id": "Q95N832aeaBr", "outputId": "aaea2816-e5a1-4e52-f228-c3e6aca6fa3e"}, "outputs": [{"data": {"text/html": ["\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
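<table>
<tr><th>row-id</th><th>Column 1</th><th>ProductId</th><th>UserId</th><th>Score</th><th>Summary</th><th>Text</th><th>combined</th><th>n_tokens</th><th>embedding</th></tr>
<tr><td>1</td><td>0</td><td>B003XPF9BO</td><td>A3R7JR3FMEBXQB</td><td>5</td><td>where does one</td><td>Wanted to save</td><td>Title: where do</td><td>52</td><td>[0.007018072064</td></tr>
<tr><td>2</td><td>297</td><td>B003VXHGPK</td><td>A21VWSCGW7UUAR</td><td>4</td><td>Good, but not W</td><td>Honestly, I hav</td><td>Title: Good, bu</td><td>178</td><td>[-0.00314055196</td></tr>
<tr><td>3</td><td>296</td><td>B008JKTTUA</td><td>A34XBAIFT02B60</td><td>1</td><td>Should advertis</td><td>First, these sh</td><td>Title: Should a</td><td>78</td><td>[-0.01757248118</td></tr>
<tr><td>4</td><td>295</td><td>B000LKTTTW</td><td>A14MQ40CCU8B13</td><td>5</td><td>Best tomato sou</td><td>I have a hard t</td><td>Title: Best tom</td><td>111</td><td>[-0.00139322795</td></tr>
<tr><td>5</td><td>294</td><td>B001D09KAM</td><td>A34XBAIFT02B60</td><td>1</td><td>Should advertis</td><td>First, these sh</td><td>Title: Should a</td><td>78</td><td>[-0.01757248118</td></tr>
<tr><td>...</td></tr>
<tr><td>996</td><td>623</td><td>B0000CFXYA</td><td>A3GS4GWPIBV0NT</td><td>1</td><td>Strange inflamm</td><td>Truthfully wasn</td><td>Title: Strange</td><td>110</td><td>[0.000110913533</td></tr>
<tr><td>997</td><td>624</td><td>B0001BH5YM</td><td>A1BZ3HMAKK0NC</td><td>5</td><td>My favorite and</td><td>You've just got</td><td>Title: My favor</td><td>80</td><td>[-0.02086931467</td></tr>
<tr><td>998</td><td>625</td><td>B0009ET7TC</td><td>A2FSDQY5AI6TNX</td><td>5</td><td>My furbabies LO</td><td>Shake the conta</td><td>Title: My furba</td><td>47</td><td>[-0.00974910240</td></tr>
<tr><td>999</td><td>619</td><td>B007PA32L2</td><td>A15FF2P7RPKH6G</td><td>5</td><td>got this for th</td><td>all i have hear</td><td>Title: got this</td><td>50</td><td>[-0.00521062919</td></tr>
<tr><td>1000</td><td>999</td><td>B001EQ5GEO</td><td>A3VYU0VO6DYV6I</td><td>5</td><td>I love Maui Cof</td><td>My first experi</td><td>Title: I love M</td><td>118</td><td>[-0.00605782261</td></tr>
</table>
<p>[1000 rows x 9 columns]</p>
<p>* Use DataGrid.save() to save to disk<br>
** Use DataGrid.show() to start user interface</p>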
"], "text/plain": [" row-id Column 1 ProductId UserId Score Summary Text combined n_tokens embedding \n", " 1 0 B003XPF9BO A3R7JR3FMEBXQB 5 where does one Wanted to save Title: where do 52 [0.007018072064 \n", " 2 297 B003VXHGPK A21VWSCGW7UUAR 4 Good, but not W Honestly, I hav Title: Good, bu 178 [-0.00314055196 \n", " 3 296 B008JKTTUA A34XBAIFT02B60 1 Should advertis First, these sh Title: Should a 78 [-0.01757248118 \n", " 4 295 B000LKTTTW A14MQ40CCU8B13 5 Best tomato sou I have a hard t Title: Best tom 111 [-0.00139322795 \n", " 5 294 B001D09KAM A34XBAIFT02B60 1 Should advertis First, these sh Title: Should a 78 [-0.01757248118 \n", "...\n", " 996 623 B0000CFXYA A3GS4GWPIBV0NT 1 Strange inflamm Truthfully wasn Title: Strange 110 [0.000110913533 \n", " 997 624 B0001BH5YM A1BZ3HMAKK0NC 5 My favorite and You've just got Title: My favor 80 [-0.02086931467 \n", " 998 625 B0009ET7TC A2FSDQY5AI6TNX 5 My furbabies LO Shake the conta Title: My furba 47 [-0.00974910240 \n", " 999 619 B007PA32L2 A15FF2P7RPKH6G 5 got this for th all i have hear Title: got this 50 [-0.00521062919 \n", " 1000 999 B001EQ5GEO A3VYU0VO6DYV6I 5 I love Maui Cof My first experi Title: I love M 118 [-0.00605782261 \n", "\n", " [1000 rows x 9 columns] \n", "\n", "* Use DataGrid.save() to save to disk\n", "** Use DataGrid.show() to start user interface"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["data\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["现在,我们创建一个新的DataGrid,将数字转换为嵌入式:\n"]}, {"cell_type": "code", "execution_count": 8, "metadata": {"id": "Bu0erP68dvLU"}, "outputs": [], "source": ["import ast # 将字符串形式的数字列表转换为数字列表\n", "\n", "dg = kg.DataGrid(\n", " name=\"openai_embeddings\",\n", " columns=data.get_columns(),\n", " converters={\"Score\": str},\n", ")\n", "for row in data:\n", " embedding = ast.literal_eval(row[8])\n", " row[8] = kg.Embedding(\n", " embedding, \n", " name=str(row[3]), \n", " text=\"%s - %.10s\" % (row[3], row[4]),\n", " 
projection=\"umap\",\n", " )\n", " dg.append(row)\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["新的DataGrid现在具有一个带有适当数据类型的嵌入列。\n"]}, {"cell_type": "code", "execution_count": 9, "metadata": {"colab": {"base_uri": "https://localhost:8080/"}, "id": "gd6Od4Bmhijy", "outputId": "9aa38221-0272-4a63-e393-706e0a0c5879"}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["DataGrid (in memory)\n", " Name : openai_embeddings\n", " Rows : 1,000\n", " Columns: 9\n", "# Column Non-Null Count DataGrid Type \n", "--- -------------------- --------------- --------------------\n", "1 Column 1 1,000 INTEGER \n", "2 ProductId 1,000 TEXT \n", "3 UserId 1,000 TEXT \n", "4 Score 1,000 TEXT \n", "5 Summary 1,000 TEXT \n", "6 Text 1,000 TEXT \n", "7 combined 1,000 TEXT \n", "8 n_tokens 1,000 INTEGER \n", "9 embedding 1,000 EMBEDDING-ASSET \n"]}], "source": ["dg.info()\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["我们只需保存数据表格,就完成了。\n"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["dg.save()\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 3. 
渲染2D投影\n", "\n", "要在笔记本中直接渲染数据,只需展示它。请注意,每一行都包含一个嵌入投影。\n", "\n", "向右滚动以查看每行的嵌入投影。\n", "\n", "投影空间中点的颜色代表分数。\n"]}, {"cell_type": "code", "execution_count": 11, "metadata": {"colab": {"base_uri": "https://localhost:8080/", "height": 771}, "id": "Z8j-GdpiijU0", "outputId": "20a0b1ca-3059-4384-cd8c-b32b1aa1c270"}, "outputs": [{"data": {"text/html": ["\n", " \n", " "], "text/plain": [""]}, "metadata": {}, "output_type": "display_data"}], "source": ["dg.show()\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["按“分数”分组,查看每个组的行。\n"]}, {"cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [{"data": {"text/html": ["\n", " \n", " "], "text/plain": [""]}, "metadata": {}, "output_type": "display_data"}], "source": ["dg.show(group=\"Score\", sort=\"Score\", rows=5, select=\"Score,embedding\")\n"]}, {"cell_type": "markdown", "metadata": {"id": "vLIxfmK5dJsq"}, "source": ["这个数据网格的示例托管在这里:https://kangas.comet.com/?datagrid=/data/openai_embeddings.datagrid\n"]}], "metadata": {"accelerator": "TPU", "colab": {"gpuType": "V100", "machine_shape": "hm", "provenance": []}, "gpuClass": "standard", "kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11"}, "vscode": {"interpreter": {"hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"}}}, "nbformat": 4, "nbformat_minor": 4}