{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "YjOR1n15wn1K" }, "source": [ "\n", " \"Open\n", "\n", "\n", " \"View\n", "\n", "\n", "\n", " \"View\n", "\n", "\n", "# Intro to `qsv count`" ] }, { "cell_type": "markdown", "metadata": { "id": "9CFiuF_abuL9" }, "source": [ "In this notebook we'll be covering examples of using [qsv's `count` command](https://github.com/jqnatividad/qsv/blob/master/src/cmd/count.rs).\n", "\n", "This notebook uses **qsv**, an open-source CSV data wrangling toolkit available as a command line tool. You may learn more at https://github.com/jqnatividad/qsv." ] }, { "cell_type": "markdown", "metadata": { "id": "3vsTulc0OGqi" }, "source": [ "## Table of Contents\n", "\n", "1. [Setup](#1)\n", " - 1.1 [Environment Notes](#1.1)\n", " - 1.2 [Downloading qsv](#1.2)\n", " - 1.3 [Downloading a CSV Data Set](#1.3)\n", "2. [Let's Use `qsv count`!](#2)\n", " - [2.1 Option: `--help`](#2.1)\n", " - [2.2 Running `qsv count` On Our CSV](#2.2)\n", " - [2.3 Option: `--human-readable, -H`](#2.3)\n", " - [2.4 Option: `--no-headers, -n`](#2.4)\n", " - [2.5 Option: `--width`](#2.5)\n", " - [2.5.1 Understanding the `--width` Option's Output](#2.5.1)\n", "3. [Bash Use Cases](#3)\n", " - [3.1 String Interpolation](#3.1)\n", "4. [Python Use Cases](#4)\n", " - [4.1 Running `qsv count` on User's Input File Path](#4.1)\n", "5. [Conclusion](#5)" ] }, { "cell_type": "markdown", "metadata": { "id": "iUJGEpSUMA7R" }, "source": [ "\n", "## Part 1: Setup" ] }, { "cell_type": "markdown", "metadata": { "id": "v05J1AsdXAgT" }, "source": [ "\n", "### 1.1 Environment Notes\n", "\n", " - The notebook was run on Google Colab based on an Ubuntu 22.04 LTS environment, so you may need to modify the commands if you're running on a different OS (i.e. Windows) or missing any dependencies.\n", " - Commands are prepended by an exclamation point `!` in this Jupyter notebook environment to execute them, but should be removed when using Bash on a terminal." ] }, { "cell_type": "markdown", "metadata": { "id": "jThnX2bkBvZj" }, "source": [ "\n", "### 1.2 Downloading qsv\n", "\n", "First, let's download qsv into our notebook from the [releases page](https://github.com/jqnatividad/qsv/releases). We'll use qsv 0.111.0:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "5E4Jy22ozjM8", "outputId": "8ec7d56b-9f65-4c16-aa00-c73f579f88bf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", " 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0\n", "100 73.3M 100 73.3M 0 0 42.5M 0 0:00:01 0:00:01 --:--:-- 76.7M\n", "Archive: qsv-0.111.0-x86_64-unknown-linux-gnu.zip\n", " inflating: qsv-0.111.0-files/README \n", " inflating: qsv-0.111.0-files/qsv \n", " inflating: qsv-0.111.0-files/qsv_glibc-2.31 \n", " inflating: qsv-0.111.0-files/qsv_glibc-2.31_rust_version_info.txt \n", " inflating: qsv-0.111.0-files/qsv_nightly \n", " inflating: qsv-0.111.0-files/qsv_nightly_rust_version_info.txt \n", " inflating: qsv-0.111.0-files/qsvdp \n", " inflating: qsv-0.111.0-files/qsvdp_glibc-2.31 \n", " inflating: qsv-0.111.0-files/qsvdp_nightly \n", " inflating: qsv-0.111.0-files/qsvlite \n", " inflating: qsv-0.111.0-files/qsvlite_glibc-2.31 \n", " inflating: qsv-0.111.0-files/qsvlite_nightly \n" ] } ], "source": [ "# Downloading the .zip file that contains qsv\n", "!curl -LO https://github.com/jqnatividad/qsv/releases/download/0.111.0/qsv-0.111.0-x86_64-unknown-linux-gnu.zip\n", "# Unzipping the .zip file into a folder\n", "!unzip -o qsv-0.111.0-x86_64-unknown-linux-gnu.zip -d qsv-0.111.0-files\n", "# Moving the qsv binary file from the folder into /bin to use the qsv command anywhere on our system\n", "!cp qsv-0.111.0-files/qsv /bin" ] }, { "cell_type": "markdown", "metadata": { "id": "9SjFA9yu0tVu" }, "source": [ "\n", "### 1.3 Downloading a CSV Data Set\n", "\n", "Here is the main CSV data set I'll be using:\n", "\n", "| Data set | Source | Download Link | Rounded size |\n", "| --------- | ------ | ------------ | ------------ |\n", "| Indicators of Anxiety or Depression Based on Reported Frequency of Symptoms During Last 7 Days | https://catalog.data.gov/dataset/indicators-of-anxiety-or-depression-based-on-reported-frequency-of-symptoms-during-last-7- | https://data.cdc.gov/api/views/8pt5-q6wp/rows.csv?accessType=DOWNLOAD | 2.1 MB |\n", "\n", "Let's download the data set into our notebook as `data.csv`." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "mZO_FS7QzLN3", "outputId": "fd374e58-4377-4ae7-a56e-f5b900e2c613" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " % Total % Received % Xferd Average Speed Time Time Time Current\n", " Dload Upload Total Spent Left Speed\n", "100 2144k 0 2144k 0 0 1894k 0 --:--:-- 0:00:01 --:--:-- 1895k\n" ] } ], "source": [ "# Downloading the .csv file as data.csv\n", "!curl https://data.cdc.gov/api/views/8pt5-q6wp/rows.csv?accessType=DOWNLOAD -o data.csv" ] }, { "cell_type": "markdown", "metadata": { "id": "EFMTcysGCIv4" }, "source": [ "\n", "## Part 2: Let's Use `qsv count`!" ] }, { "cell_type": "markdown", "metadata": { "id": "pbLT5IRy2SBD" }, "source": [ "Time to explore with `qsv count`! Let's start by simply getting the help message for `qsv count`." ] }, { "cell_type": "markdown", "metadata": { "id": "-Yz5GgW-B_bY" }, "source": [ "\n", "### 2.1 Option: `--help`" ] }, { "cell_type": "markdown", "metadata": { "id": "-38wM5HbzAKS" }, "source": [ "As with any qsv command, we'll use the `--help` option to get the help message:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7OO8ACo2wa3K", "outputId": "d8e1bc84-b75a-4e17-c07a-8b226489ccee" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Prints a count of the number of records in the CSV data.\n", "\n", "Note that the count will not include the header row (unless --no-headers is\n", "given).\n", "\n", "For examples, see https://github.com/jqnatividad/qsv/blob/master/tests/test_count.rs.\n", "\n", "Usage:\n", " qsv count [options] []\n", " qsv count --help\n", "\n", "count options:\n", " -H, --human-readable Comma separate row count.\n", " --width Also return the length of the longest record.\n", " The count and width are separated by a semicolon.\n", "\n", "Common options:\n", " -h, --help Display this message\n", " -n, --no-headers When set, the first row will be included in\n", " the count.\n" ] } ], "source": [ "!qsv count --help" ] }, { "cell_type": "markdown", "metadata": { "id": "1Dy4LH5_YDxl" }, "source": [ "\n", "### 2.2 Running `qsv count` On Our CSV" ] }, { "cell_type": "markdown", "metadata": { "id": "3Q4nR9fHSqUx" }, "source": [ "We may start by getting the default output for `qsv count` by running it on our data set. This should get us the number of non-header records (rows) in our CSV:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_iRwWwae1Hj9", "outputId": "f1a7bd66-f53c-4113-831e-96e2b17da5c6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "13671\n" ] } ], "source": [ "!qsv count data.csv" ] }, { "cell_type": "markdown", "metadata": { "id": "cUEH69_x2dfB" }, "source": [ "That's 13,671 non-header rows of data!" ] }, { "cell_type": "markdown", "metadata": { "id": "oy4QaBBlCPmR" }, "source": [ "\n", "### 2.3 Option: `--human-readable`, `-H`\n", "\n", "With the `--human-readable` option (or its alias `-H`), qsv should automatically add commas in the appropriate places to help us read the number better." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "hnIGo5Gq2cLp", "outputId": "29043891-0204-4f9c-fa95-e308615c7681" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "13,671\n" ] } ], "source": [ "!qsv count data.csv --human-readable" ] }, { "cell_type": "markdown", "metadata": { "id": "NAV9od0K3ADL" }, "source": [ "\n", "### 2.4 Option: `--no-headers`, `-n`\n", "\n", "What if we we want to also include the header row in the count (therefore counting all the rows in the CSV)?\n", "\n", "We can use the `--no-headers` (or `-n`) option to include the header row in the count.\n", "\n", "We should expect `13,672` as our output, including the commas by also using `-H`." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "_vhmXFlI2-ta", "outputId": "115e04e7-6710-45a1-c4ef-797a16321068" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "13,672\n" ] } ], "source": [ "!qsv count data.csv -n -H" ] }, { "cell_type": "markdown", "metadata": { "id": "kF1uMdsC3p-R" }, "source": [ "\n", "### 2.5 Option: `--width`\n", "\n", "There's one more option that you might not expect.\n", "\n", "What if we wanted to find out how long the longest row is in our data set, based on the number of characters it has?\n", "\n", "The `--width` option should return the length of the longest record. The count and width are separated by a semicolon." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tukInTsZ3otd", "outputId": "4685e56d-70f7-4bbd-a923-7ea3cfcaebc9" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "13671;237\n" ] } ], "source": [ "!qsv count data.csv --width" ] }, { "cell_type": "markdown", "metadata": { "id": "fwDXGWYwCn5F" }, "source": [ "\n", "#### 2.5.1 Understanding the `--width` Option's Output" ] }, { "cell_type": "markdown", "metadata": { "id": "ahfwSjfZ4HAs" }, "source": [ "The longest record has 237 characters. But you may have some questions about this width output:\n", "\n", "- Does the width include the header if we don't specify the `--no-headers` option?\n", "- Does the width include the commas within the rows that separate the field values?\n", "\n", "Let's find out with this simple CSV file we'll name `sample.csv`:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "uKsH1x6U9pvg", "outputId": "7f4140cf-b15f-4170-9833-0d8f07c57131" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "letter,number\n", "alpha,13\n", "beta,24\n" ] } ], "source": [ "# Write our data to sample.csv\n", "!echo 'letter,number' > sample.csv\n", "!echo 'alpha,13' >> sample.csv\n", "!echo 'beta,24' >> sample.csv\n", "# Display the data from sample.csv\n", "!cat sample.csv" ] }, { "cell_type": "markdown", "metadata": { "id": "NxVMGJ5F_iK5" }, "source": [ "First let's use `--width` without `--no-headers`.\n", "\n", "Our initial assumption is that if the headers are not included then we should get `8` as the width because there are `8` total characters in the row `alpha,13` when you also include the comma `,`." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "xDMu7_Bw9U-r", "outputId": "441c25f4-2b11-472d-9e9b-a0786ab09607" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2;9\n" ] } ], "source": [ "!qsv count sample.csv --width" ] }, { "cell_type": "markdown", "metadata": { "id": "RcdXNt09ARte" }, "source": [ "Hmm... We get 9. Why is that?\n", "\n", "In our CSV data, there's a sort of hidden character at the end of each row: the newline character `\\n`. This is included as a character in the width for our rows, so we simply add 1 to our estimate of 8. We can also see that the comma is included in the width output.\n", "\n", "To further verify both of these claims, let's run the command with `--no-headers` to try and include the header row in the width output. Based on what we've learned so far, we can expect that all characters including the commas `,` between field values and the newline `\\n` at the end of the longest row are included in the width output. So for the header row `letter,number` we should expect a width of `12 + 1 + 1 = 14`:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "iaXU60RT9tmZ", "outputId": "34d33cbe-179a-4d97-bdf8-28a194ac223c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3;14\n" ] } ], "source": [ "!qsv count sample.csv --width --no-headers" ] }, { "cell_type": "markdown", "metadata": { "id": "zMbfd8p_BhGm" }, "source": [ "Awesome! Now that you have a better understanding of `qsv count`, try it out for yourself!" ] }, { "cell_type": "markdown", "metadata": { "id": "yepclZC7Esqs" }, "source": [ "\n", "## Part 3: Bash Use Cases" ] }, { "cell_type": "markdown", "metadata": { "id": "aDlQAqa122Sm" }, "source": [ "\n", "### 3.1 String Interpolation\n", "\n", "Let's say I want to write a sentence that dynamically includes the count of a CSV file within it. For example, I want to print out:\n", "\n", "```\n", "There are 1,000,000 non-header rows of data in the data set!\n", "```\n", "\n", "The `1,000,000` is arbitrary, that is, it should be the output from using `qsv count` on a CSV file. Here's a Bash script using the `echo` command we can use to achieve this:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "SNqWnmsdEuog", "outputId": "72a054e2-6ff0-4717-8a03-a216df539671" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 13,671 non-header rows of data in the data set!\n" ] } ], "source": [ "!echo \"There are $(qsv count data.csv -H) non-header rows of data in the data set!\"" ] }, { "cell_type": "markdown", "metadata": { "id": "32ZRTPBLF0kq" }, "source": [ "\n", "## Part 4: Python Use Cases" ] }, { "cell_type": "markdown", "metadata": { "id": "Zdsw69HF2z4p" }, "source": [ "\n", "### 4.1 Running `qsv count` on User's Input File Path\n", "\n", "Let's say we want to run a Python script where the user can simply enter the path to the CSV file (in our case we can just write `data.csv`) and then get the output of running `qsv count` on it. We can use the `subprocess` module to run `qsv` commands and print the output. Here's a sample script with comments to help understand how it works:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1qVofYiSFTeF", "outputId": "7d785e15-317e-4763-ff98-761f16d06c86" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Enter the path to your CSV file: data.csv\n", "There are 13,671 non-header rows of data in the data set!\n" ] } ], "source": [ "import subprocess\n", "\n", "# Get user input for the CSV path\n", "csv_path = input('Enter the path to your CSV file: ')\n", "\n", "# Run qsv count on the CSV file with the -H option\n", "command = [\"qsv\", \"count\", csv_path, \"-H\"]\n", "\n", "# Get the qsv count output\n", "subprocess_output = subprocess.run(command, capture_output=True)\n", "# stdout - standard stream of output from our subprocess that runs count\n", "# decode - convert the stdout output from bytes to string\n", "# strip - remove any spaces/newline characters in the output\n", "count = subprocess_output.stdout.decode().strip()\n", "\n", "# Print the output of qsv count within a sentence\n", "print(f\"There are {count} non-header rows of data in the data set!\")" ] }, { "cell_type": "markdown", "metadata": { "id": "dnywoxa9KB30" }, "source": [ "With this script we can now simply provide the file path and then get the `qsv count` output for it! Of course you may expand on this script with improvements such as:\n", "\n", "- Verifying the file exists and is a CSV.\n", "- Error handling with a try/except block and printing `stderr`." ] }, { "cell_type": "markdown", "metadata": { "id": "glGYBjB5Zw-M" }, "source": [ "\n", "## 5. Conclusion\n", "\n", "In this notebook we covered example usage of `qsv count` for tallying the number of rows in a CSV file. We discussed all the options that are available for `qsv count`, and we also went further to discover how `qsv count` can be integrated in Bash and Python." ] } ], "metadata": { "colab": { "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }