---
title: "Converting lazy data frames into Parquet files (Python version)"
author: Ian D. Gow
date: 2026-03-23
date-format: "D MMMM YYYY"
categories: [Python, Parquet, db2pq, WRDS, "Tidy Finance"]
format:
  html:
    colorlinks: true
    number-sections: true
  pdf:
    number-sections: true
    colorlinks: true
    geometry:
      - left=2cm
      - right=2cm
    papersize: a4
    filters:
      - ../abstract.lua
    include-in-header: ../preamble.tex
    mainfont: TeX Gyre Pagella
    mathfont: TeX Gyre Pagella Math
execute:
  cache: true
  freeze: auto
keywords:
  - Python
  - Polars
  - Arrow
  - db2pq
  - WRDS
jupyter: db2pq-test
---

::: {.abstract}
In a note I published yesterday ([*Converting lazy data frames into Parquet files*](../published/lazy_tbl_to_pq.html)), I showed how my `db2pq` R package can be used to turn “lazy data frames” produced by the R package `dbplyr` into Parquet files on disk without needing to load the data into memory.
Today, I wondered if I could do the same with my `db2pq` *Python* package.
In this note, I turn "lazy data frames" produced by Ibis into Parquet files on disk.
:::

::: {.callout-note}
The source for this note is available in [a folder on GitHub](https://github.com/iangow/notes/tree/main/db2pq_test) that contains two files:

- [`ibis_to_pq.qmd`](https://raw.githubusercontent.com/iangow/notes/main/db2pq_test/ibis_to_pq.qmd)
- [`pyproject.toml`](https://raw.githubusercontent.com/iangow/notes/main/db2pq_test/pyproject.toml)

To run this note locally with `uv`, put the two files above in a directory on your computer; I will refer to this directory as the **project directory** in this note.

For the convenience of *Tidy Finance* readers, I follow the approach laid out [here](https://www.tidy-finance.org/python/setting-up-your-environment.html) closely.
The *Tidy Finance* instructions start with installation of "`uv`, a modern Python package and project manager."
You should install `uv` using the instructions provided by *Tidy Finance*; these are taken from the [homepage for `uv`](https://docs.astral.sh/uv/getting-started/installation/).

From the project directory, run the following command:

```bash
uv sync
```

Next, you should follow the instructions provided by *Tidy Finance* for setting things up to connect to WRDS:

> Create a `.env` file in your project directory.
> For the purpose of this book, we create and save the following variables (where user and password are our private login credentials):
>
> ```
> WRDS_USER=user
> WRDS_PASSWORD=password
> ```

Once you have done the above, you should be able to run the code in this note.
For example, you could run `uv run jupyter lab` and copy and paste the commands shown here (in order).
Alternatively, if you are comfortable with Quarto, you could open `ibis_to_pq.qmd` and treat it like a notebook.
:::

# Motivating example

According to [*Tidy Finance with Python*](https://www.tidy-finance.org/python/wrds-crsp-and-compustat.html#daily-crsp-data):

> The daily CRSP data file is substantially larger than monthly data and can exceed 20 GB.
> This has two important implications: you cannot hold all the daily return data in your memory (hence it is not possible to save the entire dataset to your local folder), and in our experience, the download usually crashes (or never stops) because it is too much data for the WRDS cloud to prepare and send to your R session.
>
> There is a solution to this challenge.
> As with many big data problems, you can split up the big task into several smaller tasks that are easier to handle.
> That is, instead of downloading data about all stocks at once, download the data in small batches of stocks consecutively.

Below I show how one can simplify the code dramatically (no batches!) and download the data faster (for me, it takes *less than a minute*) and with no significant burden on either the CPU or RAM.
An issue with translating the `lazy_tbl_to_pq()` function from the R version of my `db2pq` package is that there is no equivalent of `dbplyr` for Python.
While Python Polars has its `LazyFrame`, the *Tidy Finance* task involves extracting data from a PostgreSQL database, and Polars has no way of creating lazy data frames from database connections.
The closest analogue is surely Ibis.
So I added a function `ibis_to_pq()` to my Python package `db2pq` and in this note I take it for a spin.

I start by creating a connection to the WRDS PostgreSQL database.
Here I follow the [*Tidy Finance* approach](https://www.tidy-finance.org/python/wrds-crsp-and-compustat.html#accessing-wrds) closely.
But so long as `wrds` is a `psycopg`-backed SQLAlchemy engine for the WRDS PostgreSQL database, you should be able to use whatever approach you are used to.

```{python}
from sqlalchemy import create_engine
import os
from dotenv import load_dotenv

load_dotenv()

connection_string = (
    "postgresql+psycopg://"
    f"{os.getenv('WRDS_USER')}:{os.getenv('WRDS_PASSWORD')}"
    "@wrds-pgdata.wharton.upenn.edu:9737/wrds"
)

wrds = create_engine(connection_string, pool_pre_ping=True)
```

Having established `wrds`, I load the packages I will be using.

```{python}
import polars as pl
import ibis
from ibis import _
from db2pq import ibis_to_pq
```

Then I turn `wrds` into an Ibis instance with WRDS PostgreSQL as its **backend**.
You can learn more about Ibis [here](https://ibis-project.org).

```{python}
db = ibis.postgres.from_connection(
    wrds.connect().connection.driver_connection
)
```

I then set up Ibis **lazy tables** for the three tables I will use here.^[The R version of the note did not use `ff.factors_daily` from WRDS, but I do so here because I don't have the same `copy_inline()` functionality I had when using R in yesterday's note.]
```{python}
dsf = db.table("dsf_v2", database="crsp")
stksecurityinfohist = db.table("stksecurityinfohist", database="crsp")
factors_daily = db.table("factors_daily", database="ff")
```

Ibis offers an "interactive" option that makes it easier to work with lazy tables.

```{python}
ibis.options.interactive = True
```

Let's look at `factors_daily`:

```{python}
factors_daily
```

And then `dsf`:

```{python}
dsf
```

```{python}
start_date = "1960-01-01"
end_date = "2024-12-31"
```

For clarity, in translating the original *Tidy Finance* query, I break out the `security` sub-query from the main query.

```{python}
security = (
    stksecurityinfohist
    .filter(
        _.sharetype == "NS",
        _.securitytype == "EQTY",
        _.securitysubtype == "COM",
        _.usincflg == "Y",
        _.issuertype.isin(["ACOR", "CORP"]),
        _.primaryexch.isin(["N", "A", "Q"]),
        _.conditionaltype.isin(["RW", "NW"]),
        _.tradingstatusflg == "A",
    ).select(
        "permno",
        "secinfostartdt",
        "secinfoenddt",
    )
)
```

I next construct the main `crsp_daily` query.

```{python}
crsp_daily = (
    dsf
    .filter(_.dlycaldt.between(start_date, end_date))
    .inner_join(security, dsf.permno == security.permno)
    .filter(_.dlycaldt.between(_.secinfostartdt, _.secinfoenddt))
    .select(
        permno=_.permno,
        date=_.dlycaldt,
        ret=_.dlyret,
    )
    .drop_null(["permno", "date", "ret"])
    .left_join(
        factors_daily
        .select("date", "rf")
        .rename(risk_free="rf"),
        _.date == factors_daily.date,
    )
    .mutate(
        ret_excess=ibis.greatest(_.ret - _.risk_free, -1)
    )
    .select(
        "permno",
        "date",
        ret=_.ret.cast("float64"),
        ret_excess=_.ret_excess.cast("float64"),
    )
)
```

Let's take a peek at `crsp_daily`, which is still a lazy table:

```{python}
crsp_daily
```

The next step is to create a Parquet file from `crsp_daily` using `ibis_to_pq()`.

```{python}
#| cache: true
#| eval: true
%%time
ibis_to_pq(crsp_daily, "crsp_daily.parquet")
```

For some reason, `ibis_to_pq()` does not perform as well as its sibling `lazy_tbl_to_pq()` in the R version of `db2pq`.
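The `ret_excess` calculation in the query above uses `ibis.greatest()` to floor the excess return at $-1$ (a loss of 100%, the worst possible simple return). Here is a minimal pure-Python sketch of that clamp, using made-up numbers rather than CRSP data:

```{python}
def excess_return(ret: float, rf: float) -> float:
    # Mirror ibis.greatest(ret - rf, -1): the excess return is
    # floored at -1, since you cannot lose more than 100%.
    return max(ret - rf, -1.0)

excess_return(0.05, 0.0002)   # ordinary case: just ret - rf
excess_return(-1.0, 0.0002)   # floored at -1
```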
This performance is a bit disappointing given that I had `ibis_to_pq()` use the ADBC driver rather than the default `psycopg` driver used by Ibis.
I was surprised that Ibis did not support the Arrow driver.

```{python}
#| eval: true
#| cache: true
%%time
ibis_to_pq(factors_daily, "factors_ff3_daily.parquet")
```

## Using the data

I figured that it would be a little dull to do no more than simply create the Parquet files.
But looking through *Tidy Finance with Python*, the only use of `crsp_daily` seemed to be in the section [*Estimating Beta Using Daily Returns*](https://www.tidy-finance.org/r/beta-estimation.html#estimating-beta-using-daily-returns).
There *Tidy Finance* say:

> We then create a connection to the daily CRSP data, but we don’t load the whole table into our memory.
> We only extract all distinct `permno` because we loop the beta estimation over batches of stocks with size 500.
> To estimate the CAPM over a consistent lookback window while accommodating different return frequencies, we adjust the minimum required number of observations accordingly.
> Specifically, we require at least 1,000 daily returns over a five‑year period for a valid estimation.
> This threshold is consistent with the monthly requirement of 48 observations out of 60 months, given that there are roughly 252 trading days in a year.

The *Tidy Finance* code uses pandas, but I will use Python Polars, which does much better with larger data sets.
I start by creating `LazyFrame` instances of the two tables that are used in calculating betas.

```{python}
crsp_daily = pl.scan_parquet("crsp_daily.parquet")
factors_ff3_daily = pl.scan_parquet("factors_ff3_daily.parquet")
```

The next step is to implement the approach described above.
Python Polars can handle the details of creating the windows and so on, but we don't really want to have to hand over all the data to Statsmodels to do the regressions.^[The original pandas code uses `smf.ols()` from Statsmodels.]
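The 1,000-observation threshold quoted above is easy to sanity-check with a little arithmetic: 48 months at roughly $252 / 12 = 21$ trading days per month gives 1,008 days, which rounds naturally to 1,000.

```{python}
trading_days_per_year = 252
months_required = 48  # out of 60 months in the monthly setting

# 48 months of trading days, at 252 / 12 = 21 trading days per month
approx_daily_obs = months_required * trading_days_per_year // 12
approx_daily_obs  # 1008, close to the 1,000-observation threshold
```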
But if you're reading *Tidy Finance with Python*, it seems reasonable to assume that you know how to calculate a univariate regression coefficient, so let's just use Polars expressions and do it by hand.

Unfortunately, it turns out that we will be calculating about 2.5 million betas over windows of up to about 1,250 days each, and that will be a lot of data even for Polars!^[There will be a lot of overlap in these windows and I had thought that Polars would just take the data and process the windows at the start of each month as it passed through for each `permno`, but it seems it literally stacks copies of the data for each window, and I quickly ran out of RAM when I tried this.]
So I ended up using batches of 500, just as *Tidy Finance* do.
I make a function that calculates $\beta$ for 500 `permno` values at a time.

The key element here is `group_by_dynamic()`: for each `permno`, this creates a sequence of monthly time windows over `date`, then aggregates within each window.
More specifically:

- `group_by="permno"` means each stock is handled separately.
- `every="1mo"` means start a new window every month.
- `period="60mo"` means each window spans 60 months.
- `offset="-60mo"` shifts each window backward by 60 months relative to its label/start schedule, so each monthly output row looks back over the previous 60 months rather than forward.
- `closed="left"` means the left endpoint is included and the right endpoint is excluded.
- `label="right"` means the output date attached to each aggregated window is the right edge of the window.

So in plain English: for each stock, every month, compute summary statistics using up to 60 months of prior data, and label the result with the date at the right edge of that lookback window.
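To make those window semantics concrete, here is a small pure-Python sketch (the helper names are mine, not from Polars) of what one `label="right"`, `closed="left"` window covers:

```{python}
from datetime import date

def month_add(d: date, months: int) -> date:
    # Shift a month-start date by a whole number of months.
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)

def window_for_label(label: date, period_months: int = 60) -> tuple[date, date]:
    # With label="right" and closed="left", the window labelled `label`
    # covers [label - 60 months, label): the start date is included,
    # the label date itself is excluded.
    return month_add(label, -period_months), label

start, end = window_for_label(date(2020, 1, 1))
# The row labelled 2020-01-01 summarizes data from 2015-01-01
# up to (but not including) 2020-01-01.
```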
*Tidy Finance* do not provide any indication of how long their code takes to run, but let's see how long this takes:

```{python}
%%time
min_obs = 1000
batch_size = 500

def get_betas_batch(permno_batch: pl.DataFrame) -> pl.LazyFrame:
    return (
        crsp_daily
        .join(permno_batch.lazy(), on="permno", how="semi")
        .select("permno", "date", "ret_excess")
        .join(
            factors_ff3_daily.select(
                "date",
                pl.col("mktrf").cast(pl.Float64),
                pl.col("rf").cast(pl.Float64),
            ),
            on="date",
            how="inner",
        )
        .select(
            "permno",
            "date",
            "ret_excess",
            mkt_excess=pl.col("mktrf") - pl.col("rf"),
        )
        .sort("permno", "date")
        .group_by_dynamic(
            "date",
            every="1mo",
            period="60mo",
            offset="-60mo",
            group_by="permno",
            closed="left",
            label="right",
        )
        .agg(
            beta=pl.cov("mkt_excess", "ret_excess")
            / pl.col("mkt_excess").var(),
            mean_ret=pl.col("ret_excess").mean(),
            mean_mkt=pl.col("mkt_excess").mean(),
            n=pl.len(),
        )
        .filter(pl.col("n") >= min_obs)
        .with_columns(
            alpha=pl.col("mean_ret") - pl.col("beta") * pl.col("mean_mkt")
        )
        .drop("mean_ret", "mean_mkt")
    )

permnos = (
    crsp_daily
    .select("permno")
    .unique()
    .collect()
)

betas = pl.concat(
    (
        get_betas_batch(permnos.slice(i, batch_size)).collect()
        for i in range(0, permnos.height, batch_size)
    ),
    rechunk=True,
)
```

So about 15 seconds for the whole lot.
And let's take a peek at the data:

```{python}
betas
```