---
title: Things I Learned - 09 Feb 2025
date: 2025-02-09T00:00:00+00:00
categories:
  - til
description: 'I learned teaching techniques from IIT Madras, automated video highlights with OpusClip, and summarized DuckCon #6. I also found that DeepSeek R1 excels at text-to-CAD and gathered OpenAI’s latest prompting advice for reasoning models.'
keywords: [duckdb, opusclip, deepseek r1, ibis, data catalogs, postman alternatives, reasoning models, openai prompting]
---

This week, I learned:

- Lessons from discussions at IIT Madras:
  - Even in recorded video tutorials, asking students a question and pausing to give them time to think can be effective.
  - When you put students in front of real clients, engagement increases dramatically.
  - Most teaching assistants would like to help diligent students among the bottom half (more than the top decile of students).
  - However, there is a fraction of poor performers who do not care, and are best ignored. Their engagement and effort are a good measure of their interest.
  - Defining a minimal set of principles that we want to teach helps us measure if we've helped the bottom half at least meet those objectives.
  - Teaching is hard. Even after explanations, students, even ENGAGED students, tend to make basic mistakes.
- ChatGPT does a good job of spotting errors in architectural and structural diagrams. In fact, spotting errors in large diagrams is a theme with potential use cases. Source: Dan Becker.
- R1 seems good at text-to-CAD. Even better than Sonnet. Source: Dan Becker
- OpenAI advises a few different prompting techniques for reasoning models. [OpenAI](https://platform.openai.com/docs/guides/reasoning#advice-on-prompting):
  - Avoid examples unless zero-shot prompting fails.
  - Avoid chain-of-thought. These models do that internally anyway.
  - Short, direct prompts are better than detailed prompts.
- [GitHub models](https://github.com/marketplace/models) is free for anyone to try.
  The model catalog is _extensive_ and even includes `o3-mini`, which was launched this week (though in limited preview).
- The data catalog space is led by proprietary solutions:
  - [Alation Data Catalog](https://www.alation.com/data-catalog/): Market leader; growing steadily in enterprise use
  - [Collibra Data Catalog](https://www.collibra.com/data-catalog): Widely adopted with steady growth
  - [AWS Glue Data Catalog](https://aws.amazon.com/glue/features/data-catalog/): Growing rapidly as AWS expands its data services
  - [Informatica Enterprise Data Catalog](https://www.informatica.com/products/data-catalog/): Long established and stable, though facing newer alternatives
  - [Microsoft Purview Unified Catalog](https://www.microsoft.com/en-us/microsoft-365/enterprise-data-catalog): Experiencing fast growth driven by cloud momentum
  - [Atlan Data Catalog](https://www.atlan.com/data-catalog): Relatively new but gaining fast traction among tech-forward organizations
- [OpusClip](https://www.opus.pro/) automatically creates short clips from long videos. I ran it on [Programming Minecraft with WebSockets in Python](https://youtu.be/NgvtJZDcY) to get this [short 30-second clip](https://www.youtube.com/shorts/v3W2cjTWY-Y). 30 minutes. 100% automated.
- Alternatives to Postman:
  - [Hoppscotch](https://hoppscotch.io/) – A web-based/desktop API client supporting REST, GraphQL, and WebSockets. It’s lightweight, open-source, and self-hostable.
  - [HTTPie](https://httpie.io/app) – A web-based API client along with a friendly command-line tool for API interaction.
  - [Insomnia](https://insomnia.rest/) (or its fork Insomnium) – A popular cross-platform API client with a minimal interface and plugin ecosystem.
  - [Bruno](https://www.usebruno.com/) – A desktop open-source API client that stores collections as files (ideal for Git versioning).
  - [Milkman](https://milkman.dev/) – A desktop open-source workbench for managing API requests.
- Here is the summary of [DuckCon #6](https://www.youtube.com/watch?v=Sb9DFclZRpg) on 31 Jan 2025 in Amsterdam. I copied the transcript from [YouTubeTranscript](https://youtubetranscript.com/) and passed it through Gemini 2.0 Flash Exp with the system prompt: "Summarize this transcript from the DuckDB conference without missing any points. Cover every point mentioned. A lot of spelling errors that sound like DuckDB are likely to be DuckDB".
  - Introduction & Welcome:
    - **DuckCon #6:** This is the 6th DuckDB conference, held in their hometown. The first DuckCon was online due to the pandemic.
    - **Live Streaming:** This is the first time DuckCon is being live-streamed, chosen to accommodate global time zones (especially China and the US).
    - **Global Reach:** The live stream is intended to reach users in areas where in-person DuckCons are unlikely.
    - **Q&A:** Slido (qa.duckdb.org) will be used for Q&A, with upvoting to prioritize questions.
    - **Sponsors:** Thanks to gold sponsor monday.com and silver sponsors Real and Crunchy Data.
    - **DuckCon Purpose:** DuckCon is a place for users to connect, share experiences, and provide feedback to the DuckDB team.
    - **Inspiration:** The team is inspired by the community's use of DuckDB and how far the project has come.
    - **Mission Statement:** DuckDB aims to make large datasets less intimidating and more accessible, moving away from fear of data to confidence in handling it.
    - **Motivation:** The project was born from seeing people struggle with data that didn't fit in Excel and the lack of user-friendly tools.
    - **Industry Trends:** Single-node processing capabilities have grown faster than the size of useful datasets.
    - **Data Singularity:** A prediction that most data analysis queries can run on a single node is now a reality.
    - **Real-World Data Sizes:** Analysis of Snowflake and Redshift data shows that 99.9% of datasets are under 300GB.
    - **Raspberry Pi Benchmark:** The industry-standard TPC-H benchmark (scale factor 300, ~300GB) can run on a Raspberry Pi using DuckDB.
    - **Single Node Growth:** Single-node processing power is rapidly increasing, allowing for larger datasets to be handled.
    - **Adoption Numbers:**
      - **32 Million Extension Installs:** 32 million DuckDB extension installs in the last month.
      - **1.8 Million Unique Website Visitors:** 1.8 million unique visitors per month to the DuckDB website.
      - **Bluesky Community:** Growing community on Bluesky, with the hashtag `#dataBS`.
  - Technical Updates (Mark):
    - **Extension Ecosystem:** Focus on enabling the community to build and share extensions.
    - **Community Extensions:** Making it easier to create and use community-built extensions.
    - **DuckDB v1.2 (Harlequin Duck):** Releasing next week, named after the Harlequin duck.
    - **CSV Reader Improvements:** Significant improvements to the CSV reader.
    - **Friendlier SQL:** Improvements to the SQL experience.
    - **CLI Autocomplete:** Reworked and improved CLI autocomplete.
    - **Performance Optimizations:** Many queries are now faster due to performance work.
    - **C API for Extensions:** Introducing a C API to make building extensions easier.
    - **Logging Features:** Improved logging for production use.
    - **Lakehouse Focus:** The main focus for the year is on lakehouse formats and related features.
  - Q&A (Mark & Hannes):
    - **Doubling Team:** If the team doubled, they would focus on client integrations and other projects, not a major architectural change.
    - **Partitioning:** Near-term plans to add support for partitioning, related to lakehouse formats.
    - **DuckDB WASM:** The WASM ecosystem is evolving, with exciting possibilities for in-browser use.
    - **Financial/Pharmaceutical Industries:** DuckDB could replace some SAS workflows due to its cost-effectiveness and capabilities.
    - **Lakehouse & MotherDuck:** Lakehouse work is separate from MotherDuck, though MotherDuck will likely support lakehouse features.
    - **Contributing to Extensions:** Plans to make it easier to contribute to extensions, including support for Rust and Go.
  - Airport Extension (Rusty):
    - **Analogy:** The airport extension allows DuckDB to "fly" to remote servers using Apache Arrow Flight.
    - **Functionality:** Supports select, insert, update, and delete operations on remote data sources.
    - **Motivation:** To reduce the burden of writing extensions and enable faster development using existing code.
    - **Arrow Flight:** Uses Arrow Flight for communication, enabling connections to various data sources.
    - **Demo 1: Delta Lake:**
      - Attaches to a flight server for Delta Lake access.
      - Allows creating schemas, tables, and performing standard SQL operations.
      - Uses Python and delta-rs (Rust implementation of Delta Lake).
      - Supports predicate pushdown and C integration with the DuckDB catalog.
    - **Demo 2: AutoGluon:**
      - Integrates the AutoGluon AutoML package.
      - Predicts Hacker News post votes using a trained model.
      - Demonstrates table-returning functions for model fitting and prediction.
      - No C++ code required, just Python.
    - **Demo 3: Geocoding:**
      - Uses a geocoder service to convert addresses to coordinates and vice versa.
      - Demonstrates scalar UDFs for vectorized requests.
      - Uses a Python example for a simple uppercase function.
    - **Features:**
      - List flights, take flights.
      - Catalog integration.
      - Select, update, delete.
      - Scalar UDFs.
      - Table in/out functions.
      - Authentication for row/column filtering.
    - **Availability:** Requires DuckDB 1.2, MIT licensed, available on GitHub.
  - Q&A (Rusty):
    - **Most Proud Extension:** Airport is the most fun, but the AWS API wrapper also brings joy.
    - **Extension Resources:** The GitHub DuckDB extension template and reading others' source code are helpful.
    - **Airport & Other Extensions:** Airport is separate and can be used alongside other extensions like spatial or httpfs.
    - **Graph Support:** Graph database support is planned, with examples like Kuzu, Neptune, and Neo4j.
    - **Licensing:** Airport is MIT licensed, compatible with Apache license.
    - **Scaling Out:** Airport can be used to query multiple DuckDB instances on different machines.
  - Ibis & Geospatial (Naty):
    - **Naty Clementi:** Senior software engineer at Nvidia, working on open-source projects like Ibis.
    - **Ibis:** Open-source Python library for data wrangling, with a DataFrame API and interfaces to 15+ engines, including DuckDB.
    - **DuckDB for Geospatial:** DuckDB is fast, has a geospatial extension, and supports various geospatial formats.
    - **GeoParquet:** Becoming a standard for geospatial data, enabling cloud data warehouse interoperability and compression.
    - **GeoArrow:** A way of representing geospatial vector data in memory for faster processing.
    - **Ibis Benefits:** Allows writing Python instead of SQL, with deferred execution determined by the engine.
    - **Demo:**
      - Uses Overture Maps data in GeoParquet format.
      - Filters data using bounding boxes.
      - Demonstrates geospatial operations like ST_Distance and ST_Transform.
      - Plots data using Lonboard.
      - Shows how to find points of interest near a location (e.g., the Van Gogh Museum).
    - **Ibis & DuckDB:** Ibis uses DuckDB for the parquet reader and lets DuckDB do the heavy lifting.
    - **Ibis Optimizations:** Ibis does type checking but doesn't do query optimization, leaving that to the engine.
    - **Ibis in Browser:** Ibis works in the browser through DuckDB WASM.
  - Q&A (Naty):
    - **Linear Interpolation:** Ibis ML module can help with regression-related tasks.
    - **Missing Features:** No major features are missing in the DuckDB/Ibis geospatial setup, with minimal overhead.
    - **Parquet Reader:** Ibis uses DuckDB's parquet reader.
    - **Query Optimization:** Ibis does not optimize SQL queries, leaving that to DuckDB.
    - **Ibis in Browser:** Ibis works in the browser through DuckDB WASM.
  - Rill & Metrics Layer (Mike):
    - **Rill:** A BI tool optimized for DuckDB, with instant slicing and dicing, BI as code, and a metrics-first philosophy.
    - **Metrics-First:** Design metrics models, and Rill autogenerates dashboards and user experiences.
    - **Live Demo:**
      - Downloaded Rill using a curl command.
      - Created a new project called "DuckCon 6".
      - Imported a parquet file of GitHub commit data.
      - Used AI to generate a metrics model and dashboard.
      - Showed the dashboard with trends and filtering.
    - **Metrics as Building Blocks:** Metrics are flexible, fast, and intuitive.
    - **SQL for Metrics:** Metrics should be defined in SQL, not other languages.
    - **Visual Metrics Editor:** Rill has a visual editor for defining metrics using DuckDB SQL.
    - **Metric Stack:**
      - **Legacy:** Data warehouses, traditional BI tools, inconsistent metrics, full table scans.
      - **DuckDB Powered:** Consistent metrics, fast OLAP queries, SQL everywhere.
    - **Challenges:** Data modeling is hard, metric changes can be expensive, single-node scale has limits.
    - **AI & Metrics:** AI can assist in metrics modeling, optimization, and conversational data exploration.
  - Q&A (Mike):
    - **Complex Metrics:** Rill works well with complex metrics involving multiple sources and transformations by joining tables in DuckDB.
    - **60 FPS Dashboards:** Users can feel the difference with faster dashboards.
    - **Defining Metrics:** Metrics are defined in the Rill UI using SQL expressions.
    - **Replacing ChatGPT:** Considering locally run self-hosted models for privacy.
  - Stock Data Analysis (Ryan):
    - **Two Takeaways:** Simple finance data flows with trade data and a tool called qStudio.
    - **Ryan Hamilton:** 14 years building large data platforms in banks.
    - **Bank Data:** Data from exchanges, market data providers, and internal systems.
    - **Use Cases:** Backtesting, data analysis, and report generation.
    - **qStudio:** A Java desktop application that connects to 30 databases, including DuckDB.
    - **Demo:**
      - Loaded a 6GB CSV file of trade data into DuckDB.
      - Showed basic queries, pivoting, and candlestick charts.
      - Demonstrated time-based aggregation and moving averages.
      - Showed a basic trading strategy using window functions.
    - **DuckDB Benefits:** Fast, easy to use, great for time-based analysis.
  - Q&A (Ryan):
    - **kdb+ vs. DuckDB:** kdb+ is for large data, DuckDB is more approachable with strong Python integration.
    - **XML Files:** Offloading processing to DuckDB, not planning XML integration.
  - Lightning Talks:
    - **Zuk (Jared):**
      - Search engine research using DuckDB.
      - Python-based experiments with SQL.
      - Removing document lengths for faster search engines.
    - **DuckPGQ (Daniel):**
      - Graph analytics in DuckDB using SQL property graph queries (SQL/PGQ).
      - Visual graph syntax for pattern matching and path finding.
      - Outperforms Neo4j on analytical queries.
    - **Yat (Kristoff):**
      - Smallest DuckDB SQL orchestrator.
      - Runs SQL queries in a folder in the correct order.
      - Generates a mermaid diagram for lineage.
    - **Grafana & DuckDB (Sam):**
      - Lessons learned from using DuckDB in Grafana.
      - Security incident due to shell commands and file access.
      - Importance of reading the documentation.
    - **Cloud Slur (Adam):**
      - Syncing query engine for bank transaction data.
      - Uses LLM to convert human language to SQL.
      - Uses DuckDB in the browser, Node.js, and Python.
    - **Healthcare Data (Tony):**
      - Data engineering use cases in healthcare.
      - Dynamic data masking system using DuckDB and Snowflake.
      - Data integration pipeline using DuckDB and Arrow streams.
  - Closing Remarks:
    - **Michael Simons:** Co-author of the DuckDB in Action book, will be signing books.
    - **Poster Session:** A poster session will follow the talks.
    - **Sponsors:** Thanks again to the sponsors.
    - **Social Event:** The conference will now move to the social event.
- [ibis](https://duckdb.org/docs/guides/python/ibis.html) is a Python library that works with _multiple_ dataframe backends like DuckDB, Polars, and Pandas.
- With just 3 annotators and 50-100 samples, you can figure out if an LLM can replace human annotators systematically. [arXiv](https://arxiv.org/pdf/2501.10970) [ChatGPT explanation](https://chatgpt.com/share/679f21a4-d700-800c-b1f1-987b56b6fe0a)
- Curiosity and agency may be the differentiator in a world of LLMs (not experience, knowledge, or ability), since LLMs will democratize expertise. [Jack Clark](https://importai.substack.com/p/import-ai-397-deepseek-means-ai-proliferation)
- "AI/human combined work can be copyrighted as long as a human is adding, changing or selecting elements. Prompts alone do not usually produce copyrighted work." - [Copyright and Artificial Intelligence, Jan 2025, US Copyright Office](https://copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-2-Copyrightability-Report.pdf) via [Ethan Mollick](https://bsky.app/profile/did:plc:flxq4uyjfotciovpw3x3fxnu/post/3lgxlnzgbss2j)
  - **Human Authorship is Essential:** Works created solely by AI are not copyrightable.
  - **AI can be used as a Tool:** Using AI as a tool does not negate copyright protection, as long as the final work reflects sufficient human creativity.
  - **Prompts Alone are Insufficient:** Simply providing prompts to an AI system, even detailed ones, is generally not enough to establish authorship. Prompts are considered instructions or ideas, which are not copyrightable.
  - **Expressive Inputs:** When a human author provides their own expressive content (like a drawing, photo, or text) as input to an AI system, and that content is perceptible in the output, the human author can claim copyright in that portion of the output.
  - **Modifying and Arranging AI-Generated Content:** Humans can claim copyright in the creative selection, coordination, and arrangement of AI-generated material, as well as in creative modifications to AI-generated outputs.
  - **No Need for New Legislation:** The report concludes that existing copyright law is adequate to address the copyrightability of AI-generated works, and no new legislation is needed at this time.
  - **Case-by-Case Analysis:** Copyrightability will be determined on a case-by-case basis, considering the specific facts of each work and the extent of human contribution.
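
The "time-based aggregation and moving averages" from the stock data demo in the DuckCon summary can be sketched with a SQL window function. This is a rough illustration, not the talk's actual code: the `trades` table, its columns, and the sample rows are made up, and it uses Python's built-in `sqlite3` module (which needs SQLite 3.25+ for window functions) as a stand-in so it runs without installing anything. The same `AVG(...) OVER (...)` query should also work in DuckDB.

```python
import sqlite3

# Hypothetical sample trades; the 6GB CSV from the talk is not reproduced here.
TRADES = [
    ("2025-01-31 09:00", 100.0),
    ("2025-01-31 09:01", 102.0),
    ("2025-01-31 09:02", 101.0),
    ("2025-01-31 09:03", 105.0),
    ("2025-01-31 09:04", 107.0),
]


def moving_average(rows, window=3):
    """Return (ts, price, trailing average over the last `window` rows)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE trades (ts TEXT, price REAL)")
    con.executemany("INSERT INTO trades VALUES (?, ?)", rows)
    # The frame covers the current row plus the (window - 1) rows before it.
    query = f"""
        SELECT ts, price,
               AVG(price) OVER (
                   ORDER BY ts
                   ROWS BETWEEN {int(window) - 1} PRECEDING AND CURRENT ROW
               ) AS moving_avg
        FROM trades
        ORDER BY ts
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for ts, price, avg in moving_average(TRADES):
        print(ts, price, round(avg, 2))
```

The first rows average over fewer than `window` values because the frame is clipped at the start of the table. A trading signal like the one mentioned in the talk could compare `price` against `moving_avg`, but that part is left out here.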