{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "Tce3stUlHN0L" }, "source": [ "##### Copyright 2024 Google LLC." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "tuOe1ymfHZPu" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "R5DkeFMP75as" }, "source": [ "# Multimodal Live API - Quickstart" ] }, { "cell_type": "markdown", "metadata": { "id": "tqktCVDm1yFo" }, "source": [ "\n", " \n", "
\n", " Run in Google Colab\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "iS0rHk3RBrtA" }, "source": [ "This notebook demonstrates simple usage of the Gemini 2.0 Multimodal Live API. For an overview of new capabilities refer to the [Gemini 2.0 docs](https://ai.google.dev/gemini-api/docs/models/gemini-v2).\n", "\n", "\n", "This notebook implements a simple turn-based chat where you send messages as text, and the model replies with audio. The API is capable of much more than that. The goal here is to demonstrate with **simple code**.\n", "\n", "If you aren't looking for code, and just want to try multimedia streaming use [Live API in Google AI Studio](https://aistudio.google.com/app/live).\n", "\n", "The [Next steps](#next_steps) section at the end of this tutorial provides links to additional resources." ] }, { "cell_type": "markdown", "metadata": { "id": "Mfk6YY3G5kqp" }, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": { "id": "d5027929de8f" }, "source": [ "### Install SDK\n", "\n", "The new **[Google Gen AI SDK](https://ai.google.dev/gemini-api/docs/sdks)** provides programmatic access to Gemini 2.0 (and previous models) using both the [Google AI for Developers](https://ai.google.dev/gemini-api/docs) and [Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/overview) APIs. With a few exceptions, code that runs on one platform will run on both.\n", "\n", "More details about this new SDK on the [documentation](https://ai.google.dev/gemini-api/docs/sdks) or in the [Getting started](../gemini-2/get_started.ipynb) notebook." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "46zEFO2a9FFd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/110.9 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m110.9/110.9 kB\u001b[0m \u001b[31m3.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25h\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/168.2 kB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m168.2/168.2 kB\u001b[0m \u001b[31m6.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", "\u001b[?25h" ] } ], "source": [ "!pip install -U -q google-genai" ] }, { "cell_type": "markdown", "metadata": { "id": "CTIfnvCn9HvH" }, "source": [ "### Set up your API key\n", "\n", "To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see [Authentication](../quickstarts/Authentication.ipynb) for an example." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "A1pkoyZb9Jm3" }, "outputs": [], "source": [ "from google.colab import userdata\n", "import os\n", "\n", "os.environ['GOOGLE_API_KEY'] = userdata.get('GOOGLE_API_KEY')" ] }, { "cell_type": "markdown", "metadata": { "id": "3Hx_Gw9i0Yuv" }, "source": [ "### Initialize SDK client\n", "\n", "The client will pick up your API key from the environment variable.\n", "To use the live API you need to set the client version to `v1alpha`." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "HghvVpbU0Uap" }, "outputs": [], "source": [ "from google import genai\n", "client = genai.Client(http_options= {'api_version': 'v1alpha'})" ] }, { "cell_type": "markdown", "metadata": { "id": "QOov6dpG99rY" }, "source": [ "### Select a model\n", "\n", "Multimodal Live API are a new capability introduced with the [Gemini 2.0](https://ai.google.dev/gemini-api/docs/models/gemini-v2) model. It won't work with previous generation models.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "27Fikag0xSaB" }, "outputs": [], "source": [ "MODEL = \"gemini-2.0-flash-exp\"" ] }, { "cell_type": "markdown", "metadata": { "id": "GOOZsm7i9io6" }, "source": [ "### Import\n", "\n", "Import all the necessary modules." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "Yd1vs3cP8EmS" }, "outputs": [], "source": [ "import asyncio\n", "import base64\n", "import contextlib\n", "import datetime\n", "import os\n", "import json\n", "import wave\n", "import itertools\n", "\n", "from IPython.display import display, Audio\n", "\n", "from google import genai\n", "from google.genai import types\n", "\n", "async def async_enumerate(it):\n", " n = 0\n", " async for item in it:\n", " yield n, item\n", " n +=1" ] }, { "cell_type": "markdown", "metadata": { "id": "jj7gDzfDOq4h" }, "source": [ "## Text to Text\n", "\n", "The simplest way to use the Live API is as a text-to-text chat interface, but it can do **a lot** more than this." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "dDfslcyIOqgI" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "> Hello? Gemini are you there? \n", "\n", "- Yes\n", "- , I'm here! How can I help you today?\n", "\n" ] } ], "source": [ "config={\n", " \"generation_config\": {\"response_modalities\": [\"TEXT\"]}}\n", "\n", "async with client.aio.live.connect(model=MODEL, config=config) as session:\n", " message = \"Hello? Gemini are you there?\"\n", " print(\"> \", message, \"\\n\")\n", " await session.send(message, end_of_turn=True)\n", "\n", " # For text responses, When the model's turn is complete it breaks out of the loop.\n", " turn = session.receive()\n", " async for chunk in turn:\n", " if chunk.text is not None:\n", " print(f'- {chunk.text}')" ] }, { "cell_type": "markdown", "metadata": { "id": "rvpmur4lKfOv" }, "source": [ "## Simple text to audio" ] }, { "cell_type": "markdown", "metadata": { "id": "jjkzgogvG1q0" }, "source": [ "The simplest way to playback the audio in Colab, is to write it out to a `.wav` file. So here is a simple wave file writer:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "7mEDGwJfLRrm" }, "outputs": [], "source": [ "@contextlib.contextmanager\n", "def wave_file(filename, channels=1, rate=24000, sample_width=2):\n", " with wave.open(filename, \"wb\") as wf:\n", " wf.setnchannels(channels)\n", " wf.setsampwidth(sample_width)\n", " wf.setframerate(rate)\n", " yield wf" ] }, { "cell_type": "markdown", "metadata": { "id": "DGuKQSurN7F4" }, "source": [ "The next step is to tell the model to return audio by setting `\"response_modalities\": [\"AUDIO\"]` in the `GenerationConfig`. \n", "\n", "When you get a response from the model, then you write out the data to a `.wav` file." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "id": "VFD4VleVKj1-" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "> Hello? Gemini are you there? 
\n", "\n", "audio/pcm;rate=24000\n", "............" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "config={\n", " \"generation_config\": {\"response_modalities\": [\"AUDIO\"]}}\n", "\n", "\n", "\n", "async with client.aio.live.connect(model=MODEL, config=config) as session:\n", " file_name = 'audio.wav'\n", " with wave_file(file_name) as wav:\n", " message = \"Hello? Gemini are you there?\"\n", " print(\"> \", message, \"\\n\")\n", " await session.send(message, end_of_turn=True)\n", "\n", " turn = session.receive()\n", " async for n,response in async_enumerate(turn):\n", " if response.data is not None:\n", " wav.writeframes(response.data)\n", "\n", " if n==0:\n", " print(response.server_content.model_turn.parts[0].inline_data.mime_type)\n", " print('.', end='')\n", "\n", "\n", "display(Audio(file_name, autoplay=True))\n" ] }, { "cell_type": "markdown", "metadata": { "id": "QutDG7r78Zf-" }, "source": [ "## Towards Async Tasks\n" ] }, { "cell_type": "markdown", "metadata": { "id": "YfEQZrtZY_90" }, "source": [ "The real power of the Live API is that it's real time, and interruptable. You can't get that full power in a simple sequence of steps. To really use the functionality you will move the `send` and `recieve` operations (and others) into their own [async tasks](https://docs.python.org/3/library/asyncio-task.html).\n", "\n", "Because of the limitations of Colab this tutorial doesn't totally implement the interactive async tasks, but it does implement the next step in that direction:\n", "\n", "- It separates the `send` and `receive`, but still runs them sequentially. \n", "- In the next tutorial you'll run these in separate `async` tasks.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "QUBet__tZF0o" }, "source": [ "Setup a quick logger to make debugging easier (switch to `setLevel('DEBUG')` to see debugging messages)." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "id": "bWTaU8j-X3AJ" }, "outputs": [], "source": [ "import logging\n", "\n", "logger = logging.getLogger('Live')\n", "logger.setLevel('INFO')" ] }, { "cell_type": "markdown", "metadata": { "id": "ERqyY0IFN8G9" }, "source": [ "The class below implements the interaction with the Live API." 
] }, { "cell_type": "code", "execution_count": 45, "metadata": { "id": "3zAjMOZXFuxI" }, "outputs": [], "source": [ "class AudioLoop:\n", " def __init__(self, turns=None, config=None):\n", " self.session = None\n", " self.index = 0\n", " self.turns = turns\n", " if config is None:\n", " config={\n", " \"generation_config\": {\n", " \"response_modalities\": [\"AUDIO\"]}}\n", " self.config = config\n", "\n", " async def run(self):\n", " logger.debug('connect')\n", " async with client.aio.live.connect(model=MODEL, config=self.config) as session:\n", " self.session = session\n", "\n", " async for sent in self.send():\n", " # Ideally send and recv would be separate tasks.\n", " await self.recv()\n", "\n", " async def _iter(self):\n", " if self.turns:\n", " for text in self.turns:\n", " print(\"message >\", text)\n", " yield text\n", " else:\n", " print(\"Type 'q' to quit\")\n", " while True:\n", " text = await asyncio.to_thread(input, \"message > \")\n", "\n", " # If the input returns 'q' quit.\n", " if text.lower() == 'q':\n", " break\n", "\n", " yield text\n", "\n", " async def send(self):\n", " async for text in self._iter():\n", " logger.debug('send')\n", "\n", " # Send the message to the model.\n", " await self.session.send(text, end_of_turn=True)\n", " logger.debug('sent')\n", " yield text\n", "\n", " async def recv(self):\n", " # Start a new `.wav` file.\n", " file_name = f\"audio_{self.index}.wav\"\n", " with wave_file(file_name) as wav:\n", " self.index += 1\n", "\n", " logger.debug('receive')\n", "\n", " # Read chunks from the socket.\n", " turn = self.session.receive()\n", " async for n, response in async_enumerate(turn):\n", " logger.debug(f'got chunk: {str(response)}')\n", "\n", " if response.data is None:\n", " logger.debug(f'Unhandled server message! - {response}')\n", " else:\n", " wav.writeframes(response.data)\n", " if n == 0:\n", " print(response.server_content.model_turn.parts[0].inline_data.mime_type)\n", " print('.', end='')\n", "\n", " print('\\n')\n", "\n", " display(Audio(file_name, autoplay=True))\n", " await asyncio.sleep(2)\n" ] }, { "cell_type": "markdown", "metadata": { "id": "AwNPuC_rAHAc" }, "source": [ "There are 3 methods worth describing here:" ] }, { "cell_type": "markdown", "metadata": { "id": "tXPhEdHIPBif" }, "source": [ "**`run` - The main loop**\n", "\n", "This method:\n", "\n", "- Opens a `websocket` connecting to the Live API.\n", "- Calls the initial `setup` method.\n", "- Then enters the main loop where it alternates between `send` and `recv` until send returns `False`.\n", "- The next tutorial will demonstrate how to stream media and run these asynchronously." ] }, { "cell_type": "markdown", "metadata": { "id": "oCg1qFf0PV44" }, "source": [ "**`send` - Sends input text to the api**\n", "\n", "The `send` method collects input text from the user, wraps it in a `client_content` message (an instance of `BidiGenerateContentClientContent`), and sends it to the model.\n", "\n", "If the user sends a `q` this method returns `False` to signal that it's time to quit." ] }, { "cell_type": "markdown", "metadata": { "id": "tLukmBhPPib4" }, "source": [ "**`recv` - Collects audio from the API and plays it**\n", "\n", "The `recv` method collects audio chunks in a loop and writes them to a `.wav` file. It breaks out of the loop once the model sends a `turn_complete` method, and then plays the audio.\n", "\n", "To keep things simple in Colab it collects **all** the audio before playing it. 
, { "cell_type": "markdown", "metadata": { "id": "ietchD8GbcXt" }, "source": [ "## Next steps\n", "<a name=\"next_steps\"></a>\n", "\n", "This tutorial just shows basic usage of the Live API, using the Python GenAI SDK.\n", "\n", "- If you aren't looking for code and just want to try multimedia streaming, use the [Live API in Google AI Studio](https://aistudio.google.com/app/live).\n", "- If you want to see how to set up streaming, interruptible audio and video using the Live API, see the [Audio and Video input Tutorial](../gemini-2/live_api_starter.py).\n", "- If you're interested in the low-level details of using the websockets directly, see the [websocket version of this tutorial](../gemini-2/websockets/live_api_starter.ipynb).\n", "- Try the [Tool use in the live API tutorial](../gemini-2/live_api_tool_use.ipynb) for a walkthrough of Gemini 2.0's new tool-use capabilities.\n", "- There is a [Streaming audio in Colab example](../gemini-2/websockets/live_api_streaming_in_colab.ipynb), but this is more of a **demo** and is **not optimized for readability**.\n", "- Other nice Gemini 2.0 examples can also be found in the [Cookbook](https://github.com/google-gemini/cookbook/blob/main/gemini-2/), in particular the [video understanding](./video_understanding.ipynb) and the [spatial understanding](./spatial_understanding.ipynb) ones.\n" ] } ], "metadata": { "colab": { "collapsed_sections": [ "Tce3stUlHN0L" ], "name": "live_api_starter.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }