{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Import data from JSON files\n", "\n", "Load structured data from JSON files into Pixeltable tables for processing and analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem\n", "\n", "You have data in JSON format—from APIs, exports, or application logs. You need to load this data for processing with AI models or combining with other data sources.\n", "\n", "| Source | Records | Use case |\n", "|--------|---------|----------|\n", "| api_response.json | 1,000 | Analyze API data |\n", "| user_events.json | 50,000 | Process event logs |\n", "| products.json | 500 | Enrich with AI descriptions |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Solution\n", "\n", "**What's in this recipe:**\n", "\n", "- Import JSON files directly into tables\n", "- Import from URLs (APIs, remote files)\n", "- Handle nested JSON structures\n", "\n", "You use `pxt.create_table()` with a `source` parameter to create a table from a JSON file or URL. The JSON must be an array of objects, where each object becomes a row." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:36:17.051779Z", "iopub.status.busy": "2025-12-12T02:36:17.051676Z", "iopub.status.idle": "2025-12-12T02:36:19.698988Z", "shell.execute_reply": "2025-12-12T02:36:19.698601Z" } }, "outputs": [], "source": [ "%pip install -qU pixeltable" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:36:19.715489Z", "iopub.status.busy": "2025-12-12T02:36:19.715231Z", "iopub.status.idle": "2025-12-12T02:36:21.283202Z", "shell.execute_reply": "2025-12-12T02:36:21.282863Z" } }, "outputs": [], "source": [ "import json\n", "import pixeltable as pxt\n", "import tempfile\n", "from pathlib import Path" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create sample JSON file\n", "\n", "First, create a sample JSON file to demonstrate the import process:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:36:21.285291Z", "iopub.status.busy": "2025-12-12T02:36:21.285083Z", "iopub.status.idle": "2025-12-12T02:36:21.289672Z", "shell.execute_reply": "2025-12-12T02:36:21.289375Z" } }, "outputs": [], "source": [ "# Create sample JSON data (array of objects)\n", "sample_data = [\n", " {\n", " 'id': 1,\n", " 'title': 'Introduction to ML',\n", " 'author': 'Alice',\n", " 'tags': ['ml', 'intro'],\n", " 'rating': 4.5,\n", " },\n", " {\n", " 'id': 2,\n", " 'title': 'Deep Learning Basics',\n", " 'author': 'Bob',\n", " 'tags': ['dl', 'neural'],\n", " 'rating': 4.8,\n", " },\n", " {\n", " 'id': 3,\n", " 'title': 'NLP Fundamentals',\n", " 'author': 'Carol',\n", " 'tags': ['nlp', 'text'],\n", " 'rating': 4.2,\n", " },\n", " {\n", " 'id': 4,\n", " 'title': 'Computer Vision',\n", " 'author': 'Dave',\n", " 'tags': ['cv', 'images'],\n", " 'rating': 4.6,\n", " },\n", " {\n", " 'id': 5,\n", " 'title': 'Reinforcement Learning',\n", " 'author': 'Eve',\n", " 'tags': ['rl', 'agents'],\n", " 'rating': 4.3,\n", " },\n", "]\n", "\n", "# Save to temporary JSON file\n", "temp_dir = tempfile.mkdtemp()\n", "json_path = Path(temp_dir) / 'articles.json'\n", "\n", "with open(json_path, 'w') as f:\n", " json.dump(sample_data, f, indent=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import JSON file\n", "\n", "Use `create_table` with `source` to create a table directly from a JSON file:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:36:21.291410Z", "iopub.status.busy": "2025-12-12T02:36:21.291293Z", "iopub.status.idle": "2025-12-12T02:36:21.541336Z", "shell.execute_reply": "2025-12-12T02:36:21.541002Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/pjlb/.pixeltable/pgdata\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Created directory 'json_demo'.\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create a fresh directory\n", "pxt.drop_dir('json_demo', force=True)\n", "pxt.create_dir('json_demo')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:36:21.543056Z", "iopub.status.busy": "2025-12-12T02:36:21.542966Z", "iopub.status.idle": "2025-12-12T02:36:22.624357Z", "shell.execute_reply": "2025-12-12T02:36:22.623877Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created table 'articles'.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Inserting rows into `articles`: 0 rows [00:00, ? rows/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Inserting rows into `articles`: 5 rows [00:00, 538.52 rows/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Inserted 5 rows with 0 errors.\n" ] } ], "source": [ "# Import JSON file into a new table\n", "articles = pxt.create_table(\n", " 'json_demo/articles',\n", " source=str(json_path),\n", " source_format='json', # Explicitly specify format when using local file paths\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:36:22.626739Z", "iopub.status.busy": "2025-12-12T02:36:22.626282Z", "iopub.status.idle": "2025-12-12T02:36:22.642776Z", "shell.execute_reply": "2025-12-12T02:36:22.642302Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idtitleauthortagsrating
1Introduction to MLAlice["ml", "intro"]4.5
2Deep Learning BasicsBob["dl", "neural"]4.8
3NLP FundamentalsCarol["nlp", "text"]4.2
4Computer VisionDave["cv", "images"]4.6
5Reinforcement LearningEve["rl", "agents"]4.3
" ], "text/plain": [ " id title author tags rating\n", "0 1 Introduction to ML Alice [ml, intro] 4.5\n", "1 2 Deep Learning Basics Bob [dl, neural] 4.8\n", "2 3 NLP Fundamentals Carol [nlp, text] 4.2\n", "3 4 Computer Vision Dave [cv, images] 4.6\n", "4 5 Reinforcement Learning Eve [rl, agents] 4.3" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View imported data\n", "articles.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import from URL\n", "\n", "You can import JSON directly from a URL—useful for APIs and remote data:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:36:22.645514Z", "iopub.status.busy": "2025-12-12T02:36:22.645176Z", "iopub.status.idle": "2025-12-12T02:36:22.793348Z", "shell.execute_reply": "2025-12-12T02:36:22.793018Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created table 'posts'.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Inserting rows into `posts`: 0 rows [00:00, ? rows/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Inserting rows into `posts`: 100 rows [00:00, 15623.57 rows/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Inserted 100 rows with 0 errors.\n" ] } ], "source": [ "# Import from a public JSON URL\n", "# Using JSONPlaceholder API as an example\n", "posts = pxt.create_table(\n", " 'json_demo/posts',\n", " source='https://jsonplaceholder.typicode.com/posts',\n", " source_format='json', # Required for URL sources\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:36:22.795044Z", "iopub.status.busy": "2025-12-12T02:36:22.794940Z", "iopub.status.idle": "2025-12-12T02:36:22.807864Z", "shell.execute_reply": "2025-12-12T02:36:22.807510Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
userIdidtitlebody
11sunt aut facere repellat provident occaecati excepturi optio reprehenderitquia et suscipit\n", "suscipit recusandae consequuntur expedita et cum\n", "reprehenderit molestiae ut ut quas totam\n", "nostrum rerum est autem sunt rem eveniet architecto
12qui est esseest rerum tempore vitae\n", "sequi sint nihil reprehenderit dolor beatae ea dolores neque\n", "fugiat blanditiis voluptate porro vel nihil molestiae ut reiciendis\n", "qui aperiam non debitis possimus qui neque nisi nulla
13ea molestias quasi exercitationem repellat qui ipsa sit autet iusto sed quo iure\n", "voluptatem occaecati omnis eligendi aut ad\n", "voluptatem doloribus vel accusantium quis pariatur\n", "molestiae porro eius odio et labore et velit aut
14eum et est occaecatiullam et saepe reiciendis voluptatem adipisci\n", "sit amet autem assumenda provident rerum culpa\n", "quis hic commodi nesciunt rem tenetur doloremque ipsam iure\n", "quis sunt voluptatem rerum illo velit
15nesciunt quas odiorepudiandae veniam quaerat sunt sed\n", "alias aut fugiat sit autem sed est\n", "voluptatem omnis possimus esse voluptatibus quis\n", "est aut tenetur dolor neque
" ], "text/plain": [ " userId id title \\\n", "0 1 1 sunt aut facere repellat provident occaecati e... \n", "1 1 2 qui est esse \n", "2 1 3 ea molestias quasi exercitationem repellat qui... \n", "3 1 4 eum et est occaecati \n", "4 1 5 nesciunt quas odio \n", "\n", " body \n", "0 quia et suscipit\\nsuscipit recusandae consequu... \n", "1 est rerum tempore vitae\\nsequi sint nihil repr... \n", "2 et iusto sed quo iure\\nvoluptatem occaecati om... \n", "3 ullam et saepe reiciendis voluptatem adipisci\\... \n", "4 repudiandae veniam quaerat sunt sed\\nalias aut... " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View first few rows\n", "posts.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import from Python dictionaries\n", "\n", "Use `create_table` with a list of dictionaries as `source`—useful when you have data in memory:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:36:22.809856Z", "iopub.status.busy": "2025-12-12T02:36:22.809726Z", "iopub.status.idle": "2025-12-12T02:36:22.879195Z", "shell.execute_reply": "2025-12-12T02:36:22.878827Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created table 'events'.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Inserting rows into `events`: 0 rows [00:00, ? rows/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Inserting rows into `events`: 3 rows [00:00, 988.06 rows/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Inserted 3 rows with 0 errors.\n" ] } ], "source": [ "# Import from a list of dictionaries\n", "events = [\n", " {\n", " 'event': 'page_view',\n", " 'user_id': 101,\n", " 'timestamp': '2024-01-15T10:30:00',\n", " },\n", " {\n", " 'event': 'click',\n", " 'user_id': 101,\n", " 'timestamp': '2024-01-15T10:31:00',\n", " },\n", " {\n", " 'event': 'purchase',\n", " 'user_id': 102,\n", " 'timestamp': '2024-01-15T10:32:00',\n", " },\n", "]\n", "\n", "event_table = pxt.create_table('json_demo/events', source=events)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:36:22.881533Z", "iopub.status.busy": "2025-12-12T02:36:22.881399Z", "iopub.status.idle": "2025-12-12T02:36:22.891633Z", "shell.execute_reply": "2025-12-12T02:36:22.891236Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
eventuser_idtimestamp
page_view1012024-01-15T10:30:00
click1012024-01-15T10:31:00
purchase1022024-01-15T10:32:00
" ], "text/plain": [ " event user_id timestamp\n", "0 page_view 101 2024-01-15T10:30:00\n", "1 click 101 2024-01-15T10:31:00\n", "2 purchase 102 2024-01-15T10:32:00" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View imported events\n", "event_table.collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Add computed columns\n", "\n", "Once imported, you can enrich the data with computed columns:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:36:22.893534Z", "iopub.status.busy": "2025-12-12T02:36:22.893424Z", "iopub.status.idle": "2025-12-12T02:36:22.941814Z", "shell.execute_reply": "2025-12-12T02:36:22.941390Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Added 5 column values with 0 errors.\n" ] }, { "data": { "text/plain": [ "5 rows updated, 10 values computed." ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Add a computed column combining title and author\n", "articles.add_computed_column(\n", " summary=articles.title + ' by ' + articles.author\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2025-12-12T02:36:22.943716Z", "iopub.status.busy": "2025-12-12T02:36:22.943589Z", "iopub.status.idle": "2025-12-12T02:36:22.953969Z", "shell.execute_reply": "2025-12-12T02:36:22.953604Z" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
titleauthorsummary
Introduction to MLAliceIntroduction to ML by Alice
Deep Learning BasicsBobDeep Learning Basics by Bob
NLP FundamentalsCarolNLP Fundamentals by Carol
Computer VisionDaveComputer Vision by Dave
Reinforcement LearningEveReinforcement Learning by Eve
" ], "text/plain": [ " title author summary\n", "0 Introduction to ML Alice Introduction to ML by Alice\n", "1 Deep Learning Basics Bob Deep Learning Basics by Bob\n", "2 NLP Fundamentals Carol NLP Fundamentals by Carol\n", "3 Computer Vision Dave Computer Vision by Dave\n", "4 Reinforcement Learning Eve Reinforcement Learning by Eve" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View with computed column\n", "articles.select(\n", " articles.title, articles.author, articles.summary\n", ").collect()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explanation\n", "\n", "**JSON format requirements:**\n", "\n", "The JSON file must contain an array of objects at the top level:\n", "\n", "```json\n", "[\n", " {\"col1\": \"value1\", \"col2\": 123},\n", " {\"col1\": \"value2\", \"col2\": 456}\n", "]\n", "```\n", "\n", "**Source types supported:**\n", "\n", "| Source | Example |\n", "|--------|---------|\n", "| JSON file path | `source='/path/to/data.json'` |\n", "| JSON URL | `source='https://api.example.com/data'` |\n", "| List of dicts | `source=[{'a': 1}, {'a': 2}]` |\n", "\n", "**Nested JSON handling:**\n", "\n", "Nested objects and arrays are stored as JSON columns. You can access nested fields using Pixeltable's JSON path syntax in computed columns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## See also\n", "\n", "- [Import CSV files](https://docs.pixeltable.com/howto/cookbooks/data/data-import-csv) - For CSV and Excel imports\n", "- [Import Parquet files](https://docs.pixeltable.com/howto/cookbooks/data/data-import-parquet) - For Parquet data\n", "- [Extract fields from JSON](https://docs.pixeltable.com/howto/cookbooks/core/workflow-json-extraction) - Parse LLM response fields" ] } ], "metadata": { "kernelspec": { "display_name": "pixeltable", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 2 }