{"cells": [{"cell_type": "markdown", "metadata": {"id": "0wjP9mrldJsd"}, "source": ["## Visualizing embeddings in Kangas\n", "\n", "In this Jupyter Notebook, we build a Kangas DataGrid that contains the data together with the embeddings projected into 2 dimensions.\n"]}, {"cell_type": "markdown", "metadata": {"id": "4tPKQqqldJsj"}, "source": ["## What is Kangas?\n", "\n", "[Kangas](https://github.com/comet-ml/kangas/) is an open source, mixed-media, dataframe-like tool for data scientists. It was developed by [Comet](https://comet.com/), a company that aims to reduce the friction of moving models into production.\n"]}, {"cell_type": "markdown", "metadata": {"id": "6sNsB2iFdJsk"}, "source": ["### 1. Setup\n", "\n", "To get started, we install kangas with pip and import it.\n"]}, {"cell_type": "code", "execution_count": 1, "metadata": {"colab": {"base_uri": "https://localhost:8080/"}, "id": "N8gi529adL-f", "outputId": "c12e9973-a179-41e3-c5a8-f241804d99ad"}, "outputs": [], "source": ["%pip install kangas --quiet\n"]}, {"cell_type": "code", "execution_count": 2, "metadata": {"id": "htxjXThodRxD"}, "outputs": [], "source": ["import kangas as kg\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["### 2. Constructing a Kangas DataGrid\n", "\n", "We create a Kangas DataGrid from the original data and the embeddings. The data consists of rows of reviews, and each embedding consists of 1536 floating-point values. In this example, we fetch the data directly from GitHub, in case you aren't running this notebook inside OpenAI's repository.\n", "\n", "We use Kangas to read the CSV file into a DataGrid for further processing.\n"]}, {"cell_type": "code", "execution_count": 3, "metadata": {"colab": {"base_uri": "https://localhost:8080/"}, "id": "0SxWlRTrdVJq", "outputId": "d36c3a14-2e80-4315-e285-f39f6b008976"}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Loading CSV file 'fine_food_reviews_with_embeddings_1k.csv'...\n"]}, {"name": "stderr", "output_type": "stream", "text": ["1001it [00:00, 2412.90it/s]\n", "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2899.16it/s]\n"]}], "source": ["data = kg.read_csv(\"https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/fine_food_reviews_with_embeddings_1k.csv\")\n"]}, {"cell_type": "markdown", "metadata": {}, 
"source": ["We can review the fields of the CSV file:\n"]}, {"cell_type": "code", "execution_count": 4, "metadata": {"colab": {"base_uri": "https://localhost:8080/"}, "id": "bzhQgoRGeMCp", "outputId": "791c4e40-fb28-409e-d1e9-20b753fb1215"}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["DataGrid (in memory)\n", " Name : fine_food_reviews_with_embeddings_1k\n", " Rows : 1,000\n", " Columns: 9\n", "# Column Non-Null Count DataGrid Type \n", "--- -------------------- --------------- --------------------\n", "1 Column 1 1,000 INTEGER \n", "2 ProductId 1,000 TEXT \n", "3 UserId 1,000 TEXT \n", "4 Score 1,000 INTEGER \n", "5 Summary 1,000 TEXT \n", "6 Text 1,000 TEXT \n", "7 combined 1,000 TEXT \n", "8 n_tokens 1,000 INTEGER \n", "9 embedding 1,000 TEXT \n"]}], "source": ["data.info()\n"]}, {"cell_type": "markdown", "metadata": {}, "source": ["And get a glimpse of the first and last rows:\n"]}, {"cell_type": "code", "execution_count": 5, "metadata": {"colab": {"base_uri": "https://localhost:8080/", "height": 349}, "id": "Q95N832aeaBr", "outputId": "aaea2816-e5a1-4e52-f228-c3e6aca6fa3e"}, "outputs": [{"data": {"text/html": ["
row-id | Column 1 | ProductId | UserId | Score | Summary | Text | combined | n_tokens | embedding |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one | Wanted to save | Title: where do | 52 | [0.007018072064 |
2 | 297 | B003VXHGPK | A21VWSCGW7UUAR | 4 | Good, but not W | Honestly, I hav | Title: Good, bu | 178 | [-0.00314055196 |
3 | 296 | B008JKTTUA | A34XBAIFT02B60 | 1 | Should advertis | First, these sh | Title: Should a | 78 | [-0.01757248118 |
4 | 295 | B000LKTTTW | A14MQ40CCU8B13 | 5 | Best tomato sou | I have a hard t | Title: Best tom | 111 | [-0.00139322795 |
5 | 294 | B001D09KAM | A34XBAIFT02B60 | 1 | Should advertis | First, these sh | Title: Should a | 78 | [-0.01757248118 |
...
996 | 623 | B0000CFXYA | A3GS4GWPIBV0NT | 1 | Strange inflamm | Truthfully wasn | Title: Strange | 110 | [0.000110913533 |
997 | 624 | B0001BH5YM | A1BZ3HMAKK0NC | 5 | My favorite and | You've just got | Title: My favor | 80 | [-0.02086931467 |
998 | 625 | B0009ET7TC | A2FSDQY5AI6TNX | 5 | My furbabies LO | Shake the conta | Title: My furba | 47 | [-0.00974910240 |
999 | 619 | B007PA32L2 | A15FF2P7RPKH6G | 5 | got this for th | all i have hear | Title: got this | 50 | [-0.00521062919 |
1000 | 999 | B001EQ5GEO | A3VYU0VO6DYV6I | 5 | I love Maui Cof | My first experi | Title: I love M | 118 | [-0.00605782261 |
[1000 rows x 9 columns]
* Use DataGrid.save() to save to disk
** Use DataGrid.show() to start user interface |