---
title: "Missing Form APs?"
author: Ian D. Gow
date: 2024-12-01
date-format: "D MMMM YYYY"
categories: [Research methods, PCAOB, XBRL, SEC]
keywords:
  - PCAOB
  - Form AP
  - XBRL
  - SEC
  - Research methods
number-sections: true
format:
  html:
    colorlinks: true
    csl: jfe.csl
    fig-format: png
  typst:
    colorlinks: true
    linkcolor: "#0000FF"
    fig-format: pdf
    margin:
      x: 2cm
    papersize: a4
    mainfont: Libertinus Serif
    mathfont: New Computer Modern Math
    monofont: Menlo
bibliography: papers.bib
execute:
  freeze: auto
---

In May of 2024, I [posted on LinkedIn](https://www.linkedin.com/posts/iangow_working-with-form-ap-data-activity-7196971381010432001-Isyc) a brief note about working with data from Form APs filed with the PCAOB.
In a comment on my LinkedIn posting, [Olga Usvyatsky](https://www.linkedin.com/in/olgausvyatsky/) suggested that "the variation is not limited to firms' names - for instance, I find errors in reporting CIK codes of the clients intriguing."
I thought it would be interesting to investigate the issue raised by Olga, but doing so would be greatly facilitated by an alternative source for data on auditor-client relationships.

Recently, I discovered that it is relatively straightforward to process XBRL data filed using SEC EDGAR using data sets prepared by the SEC and posted on its website.
There are two data sets: the [*Financial Statements*](https://www.sec.gov/data-research/sec-markets-data/financial-statement-data-sets) and [*Financial Statement and Notes*](https://www.sec.gov/data-research/financial-statement-notes-data-sets) data sets, with the latter being roughly ten times as large as the former.
For the task we consider here, we need to use the *Financial Statement and Notes* data set.^[Run `source("https://raw.githubusercontent.com/iangow/notes/refs/heads/main/get_dera_notes.R")` to get these data.]

In essence, I compare the auditors listed in firms' 10-K filings with data on Form APs, with a focus on my success rate in matching the two.
In @tbl-top-non-matches, we can see that the auditor with the greatest number of non-matches is B F Borgers CPA PC, an auditor featured in a tongue-in-cheek *Financial Times* [article by George Steer](https://www.ft.com/content/2556fab5-d168-4d68-aedb-9be4b5ab739d) that showed that the auditor Ben Borgers had used 14 different names---including the name "Ben F orgers"---on Form AP filings.^[The analysis of the Form AP data in the FT article had been conducted by independent researcher [Stephen Walker](https://www.stephenwalker.me) and I showed how to reproduce Stephen's analysis in [an earlier note](../form_ap_names.html).]
While more research would be needed to investigate the reasons for "missing" data, having Borgers emerge as the "winner" yet again suggests that missing Form AP filings might be another red flag worth pursuing.

This note was written using [Quarto](https://quarto.org) and compiled with [RStudio](https://posit.co/products/open-source/rstudio/), an integrated development environment (IDE) for working with R.
The source code for this note is available [here](https://raw.githubusercontent.com/iangow/notes/main/published/missing_form_aps.qmd) and the latest version of this PDF is [here](https://raw.githubusercontent.com/iangow/notes/main/published/missing_form_aps.pdf).

```{r}
#| message: false
#| include: false
library(tidyverse)
library(DBI)
library(farr)
library(tinytable)
library(dbplyr)
```

```{r}
#| include: false
db <- dbConnect(duckdb::duckdb())

load_parquet <- function (conn, table, schema = "",
                          data_dir = Sys.getenv("DATA_DIR"))
{
    file_path <- file.path(data_dir, schema, paste0(table, ".parquet"))
    df_sql <- paste0("SELECT * FROM read_parquet('", file_path,
        "', union_by_name=true)")
    dplyr::tbl(conn, dplyr::sql(df_sql))
}

sub <- load_parquet(db, "sub_*", "dera_notes") |>
  select(adsh, cik, form)
txt <- load_parquet(db, "txt_*", "dera_notes")
form_aps <- load_parquet(db, "form_aps", "pcaob")
```

## XBRL tags

A key concept in XBRL is the **tag**.
Each value will be associated with, *inter alia*, a tag that indicates what the value represents.
For example, a value might be tagged as `AssetsCurrent` to indicate that the value represents the total of current assets.^[The *inter alia* refers to other things use to narrow down what the value relates to, such as the filing (`adsh`, also known as the **accession number**), the period (`ddate`), and the applicable XBRL version (`version`).]

To understand tags related to auditors, I begin by examining the tags that begin with the text `Auditor`.
From @tbl-tags, it can be seen that there are three such tags in common use: `AuditorName`, `AuditorLocation`, and `AuditorFirmId`.^[In @tbl-tags, we do see `AuditorFirmId1` and `AuditorFirmId2`, but given how rare these tags are, we can ignore these for present purposes.]

From @tbl-tags, it appears that not every filing with non-missing `AuditorName` has non-missing `AuditorFirmId`.
@tbl-missing-id-names provides more data on the distribution of missing values for these two fields, where `has_name` and `has_id` indicate non-missing values of `AuditorName` and `AuditorFirmId`, respectively.

@tbl-missing-ids provides some information on some cases where `AuditorName` is present, but `AuditorFirmId` is not.
While there's no clear pattern to these data, they do suggest that firms are not always diligent in including `AuditorFirmId` in XBRL filings.

From @tbl-forms, it can be seen that most filings containing information in `AuditorFirmId` are on variants of Form 10-K.
So I focus on Form 10-K filings (and variants) in the analysis in this note.^[Specifically, I consider 10-K, 10-K/A, 10-KT, and 10-KT/A filings.]

From the Form APs, I collect data on audit firm IDs, the CIKs of issuers and fiscal period-end dates.

I then merge data from these 10-K filings with data on Form APs using **accession numbers** and period-end dates.
@tbl-top-non-matches shows that roughly 1--2% of filings on Form 10-Ks involving Big Four auditors (`firm_id` values of 34, 42, 238, and 185), Grant Thornton (248) or BDO (243) appear not to have corresponding matches in the Form APs data.

While there might be a perfectly innocent explanation, it might be worth digging deeper to understand why (say) Gries & Associates, Yusufali & Associates, and Heaton & Company have such high rates of unmatched filings.

```{r}
#| echo: false
#| label: tbl-tags
#| tbl-cap: Tags beginning with `Auditor` on `txt`
txt |>
  filter(str_detect(tag, "^Auditor")) |>
  count(tag, sort = TRUE) |>
  collect() |>
  tt() |>
  style_tt(align = "lr") |>
  format_tt(digits = 0,
            num_mark_big = ",")
```

```{r}
#| label: missing
#| cache: true
#| include: false
missing <-
  txt |>
  filter(tag %in% c("AuditorName", "AuditorFirmId")) |>
  select(adsh, tag, ddate, value) |>
  pivot_wider(names_from = "tag",
              values_from = "value") |>
  mutate(has_name = !is.na(AuditorName),
        has_id = !is.na(AuditorFirmId)) |>
  collect()
```

```{r}
#| echo: false
#| label: tbl-missing-id-names
#| tbl-cap: Distribution of missingness of `AuditorName` and `AuditorFirmId`
#| cache: true
#| dependson: missing
missing |>
  count(has_name, has_id) |>
  collect() |>
  tt() |>
  style_tt(align = "ccr") |>
  format_tt(escape = TRUE,
            digits = 0,
            num_mark_big = ",")
```

```{r}
#| echo: false
#| label: tbl-missing-ids
#| tbl-cap: Sample of auditor names with missing IDs
#| cache: true
#| dependson: missing
missing |>
  filter(has_name, !has_id) |>
  count(AuditorName, sort = TRUE) |>
  filter(n > 2) |>
  tt() |>
  style_tt(align = "lr") |>
  format_tt(digits = 0,
            num_mark_big = ",",
            escape = TRUE)
```

```{r}
#| include: false
auditor_id_tags_raw <-
  txt |>
  filter(tag == "AuditorFirmId", is.na(coreg)) |>
  select(adsh, ddate, tag, value) |>
  pivot_wider(names_from = "tag", values_from = "value") |>
  rename(firm_id = AuditorFirmId) |>
  mutate(firm_id = as.integer(firm_id))

auditor_id_dates <-
  txt |>
  filter(tag == "DocumentPeriodEndDate", is.na(coreg)) |>
  select(adsh, ddate, tag, value) |>
  pivot_wider(names_from = "tag", values_from = "value") |>
  mutate(DocumentPeriodEndDate = as.Date(DocumentPeriodEndDate),
         month = floor_date(DocumentPeriodEndDate, "month"))

auditor_id_tags <-
  auditor_id_tags_raw |>
  inner_join(auditor_id_dates, by = join_by(adsh, ddate)) |>
  filter(!is.na(firm_id), !is.na(month)) |>
  compute()
```

```{r}
#| echo: false
#| label: tbl-forms
#| tbl-cap: Forms with information on `AuditorFirmId`
auditor_id_tags |>
  inner_join(sub, by = join_by(adsh)) |>
  count(form) |>
  arrange(desc(n)) |>
  collect() |>
  tt() |>
  style_tt(align = "lr") |>
  format_tt(escape = TRUE,
            digits = 0,
            num_mark_big = ",")
```

```{r}
#| include: false
auditor_ids <-
  auditor_id_tags |>
  inner_join(sub, by = join_by(adsh)) |>
  filter(form %in% c("10-K", "10-K/A", "10-KT", "10-KT/A")) |>
  rename(issuer_cik = cik) |>
  compute()
```

```{r}
#| include: false
auditor_ids_form_aps <-
  form_aps |>
  select(firm_id, issuer_cik, fiscal_period_end_date) |>
  mutate(issuer_cik = as.double(issuer_cik),
         month = floor_date(fiscal_period_end_date, "month")) |>
  compute()
```

```{r}
#| include: false
auditor_names <-
  form_aps |>
  select(firm_id, firm_name, audit_report_date) |>
  group_by(firm_id) |>
  window_order(desc(audit_report_date)) |>
  mutate(firm_name = first(firm_name)) |>
  distinct(firm_id, firm_name) |>
  compute()
```

```{r}
#| include: false
match_data <-
  auditor_ids |>
  left_join(auditor_ids_form_aps,
            by = join_by(issuer_cik, month),
            suffix = c("", "_aps")) |>
  inner_join(auditor_names, by = join_by(firm_id)) |>
  mutate(matched = !is.na(firm_id_aps)) |>
  compute()
```

```{r}
#| label: tbl-top-non-matches
#| tbl-cap: Number of cases without Form AP filings by auditor
#| echo: false
match_data |>
  group_by(firm_id, firm_name) |>
  summarize(unmatched = sum(as.integer(!matched), na.rm = TRUE),
            total = n(),
            .groups = "drop") |>
  mutate(`% unmatched` = unmatched / total * 100) |>
  arrange(desc(unmatched)) |>
  collect(n = 20) |>
  tt() |>
  style_tt(align = "rlrrd") |>
  format_tt(escape = TRUE) |>
  format_tt(j = c(3, 4), digits = 0, num_mark_big = ",") |>
  format_tt(j = 5,  num_fmt = "decimal", digits = 2, num_zero = TRUE)
```

```{r}
#| include: false
match_data |>
  filter(firm_id %in% c("34", "42", "238", "185"),
         !matched)
```