# Harvest items from a search in RecordSearch

Ever searched for items in RecordSearch and wanted to save the results as a CSV file, or in some other machine-readable format? This notebook makes it easy to save the results of an item search as a downloadable dataset. You can even download all the images from items that have been digitised, or save the complete files as PDFs!

RecordSearch doesn't currently have an option for downloading machine-readable data. So to get collection metadata in a structured form, we have to resort to screen-scraping. This notebook uses the [RecordSearch Data Scraper](https://wragge.github.io/recordsearch_data_scraper/) to do most of the work.

Notes:

* The RecordSearch Data Scraper caches results to improve efficiency. This also makes it easy to resume a failed harvest. If you want to completely refresh a harvest, then delete the `cache_db.sqlite` file to start from scratch.
* The harvesting function below automatically slices large searches (greater than 20,000 results) into smaller chunks to work around RecordSearch's 20,000 result limit. This should work in most cases. If it doesn't, try changing the `control_range` list below. Each entry in this list is used as a prefix (with a trailing '*' for wildcard matches) for the `control` value.
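
To make the chunking idea concrete, here is a minimal sketch of prefix refinement run against toy data. It is not the notebook's harvesting code: `chunk_prefixes` and `count_in_toy_data` are hypothetical names, the counting function stands in for a real RecordSearch query, and the limit is set to 2 instead of 20,000 so the splitting is visible. Unlike the notebook's functions, this sketch also skips prefixes that match nothing.

```python
import string

# The same characters the notebook uses to build control symbol prefixes
CONTROL_RANGE = list(string.digits) + list(string.ascii_uppercase) + ["/"]


def chunk_prefixes(count_matches, limit, prefix=""):
    """
    Return wildcard prefixes whose result counts are all within `limit`.
    A prefix that matches too many items is extended by one character
    and tried again. `count_matches` is a hypothetical stand-in for a
    RecordSearch query that reports the number of matching items.
    """
    chunks = []
    for char in CONTROL_RANGE:
        candidate = prefix + char
        total = count_matches(candidate)
        if total > limit:
            # Too many results, so refine this prefix further
            chunks.extend(chunk_prefixes(count_matches, limit, candidate))
        elif total > 0:
            chunks.append(candidate + "*")
    return chunks


# Toy control symbols standing in for the items in a series
symbols = ["A1", "A2", "A3", "B1", "1/2"]


def count_in_toy_data(prefix):
    """Count the toy control symbols that start with the given prefix."""
    return sum(1 for symbol in symbols if symbol.startswith(prefix))


print(chunk_prefixes(count_in_toy_data, limit=2))
# ['1*', 'A1*', 'A2*', 'A3*', 'B*']
```

Only the 'A' prefix exceeds the toy limit, so it alone is split into longer prefixes.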

## Available search parameters

The available search parameters are the same as those in RecordSearch's Advanced Search form. There are lots of them, but you'll probably only end up using a few, like `kw` and `series`. Note that you can use \* for wildcard searches, just as you can in the web interface. So setting `kw` to 'wragge\*' will find both 'wragge' and 'wragges'.

See the [RecordSearch Data Scraper documentation](https://wragge.github.io/recordsearch_data_scraper/scrapers.html#RSItemSearch) for more information on search parameters.

* `kw` – string containing keywords to search for
* `kw_options` – how to interpret `kw`, possible values are:
 * 'ALL' – return results containing all of the keywords (default)
 * 'ANY' – return results containing any of the keywords
 * 'EXACT' – treat `kw` as a phrase rather than a list of words
* `kw_exclude` – string containing keywords to exclude from search
* `kw_exclude_options` – how to interpret `kw_exclude`, possible values are:
 * 'ALL' – exclude results containing all of the keywords (default)
 * 'ANY' – exclude results containing any of the keywords
 * 'EXACT' – treat `kw_exclude` as a phrase rather than a list of words
* `search_notes` – set to 'on' to search item notes as well as metadata
* `series` – search for items in this series
* `series_exclude` – exclude items from this series
* `control` – search for items matching this control symbol
* `control_exclude` – exclude items matching this control symbol
* `item_id` – search for items with this item ID number (formerly called `barcode`)
* `date_from` – search for items with a date (year) greater than or equal to this, eg. '1935'
* `date_to` – search for items with a date (year) less than or equal to this
* `formats` – limit search to items in a particular format, see possible values below
* `formats_exclude` – exclude items in a particular format, see possible values below
* `locations` – limit search to items held in a particular location, see possible values below
* `locations_exclude` – exclude items held in a particular location, see possible values below
* `access` – limit to items with a particular access status, see possible values below
* `access_exclude` – exclude items with a particular access status, see possible values below
* `digital` – set to `True` to limit to items that are digitised


Possible values for `formats` and `formats_exclude`: 

* 'Paper files and documents'
* 'Index cards'
* 'Bound volumes'
* 'Cartographic records'
* 'Photographs'
* 'Microforms'
* 'Audio-visual records'
* 'Audio records'
* 'Electronic records'
* '3-dimensional records'
* 'Scientific specimens'
* 'Textiles'

Possible values for `locations` and `locations_exclude`:

* 'NAT, ACT'
* 'Adelaide'
* 'Australian War Memorial'
* 'Brisbane'
* 'Darwin'
* 'Hobart'
* 'Melbourne'
* 'Perth'
* 'Sydney'

Possible values for `access` and `access_exclude`:

* 'OPEN'
* 'OWE'
* 'CLOSED'
* 'NYE'
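
Because some parameters only accept values from a fixed list, it can be handy to check your keyword arguments before starting a long harvest. The sketch below is a hypothetical helper, not part of the RecordSearch Data Scraper. It assumes that `kw_options`, `kw_exclude_options`, `access`, and `access_exclude` each take a single string that must match one of the values listed above exactly.

```python
# Allowed values copied from the lists in this notebook
ALLOWED = {
    "kw_options": {"ALL", "ANY", "EXACT"},
    "kw_exclude_options": {"ALL", "ANY", "EXACT"},
    "access": {"OPEN", "OWE", "CLOSED", "NYE"},
    "access_exclude": {"OPEN", "OWE", "CLOSED", "NYE"},
}


def check_params(**kwargs):
    """
    Return a list of problems found in the supplied search parameters.
    Hypothetical helper: parameters without a fixed list of values
    (such as `kw` or `series`) are not checked.
    """
    problems = []
    for name, value in kwargs.items():
        allowed = ALLOWED.get(name)
        if allowed is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not one of {sorted(allowed)}")
    return problems


params = {"kw": "wragge*", "series": "A1", "access": "OPEN", "digital": True}
print(check_params(**params))
# []

print(check_params(access="open"))
# ["access='open' is not one of ['CLOSED', 'NYE', 'OPEN', 'OWE']"]
```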

There are some additional parameters that affect the way the search results are delivered.

* `record_detail` – controls the amount of information included in each item record, possible values:
 * 'brief' (default) – just the info in the search results
 * 'digitised' – add the number of pages if the file is digitised (slower)
 * 'full' – get the full individual record for each result, includes number of digitised pages and access examination details (slowest)
 
Note that if you want to harvest all the digitised page images from a search, you need to set `record_detail` to either 'digitised' or 'full'.

## How your harvest is saved

Once it's downloaded all the results, the harvesting function creates a directory for the harvest and saves three files inside:

* `metadata.json` – this is a summary of your harvest, including the parameters you used and the date it was run
* `results.ndjson` – this is the harvested data with each record saved as a JSON object on a new line
* `results.csv` – the harvested data with any duplicates removed saved as a CSV file (if you've saved 'full' records, the list of `access_decision_reasons` will be saved as a pipe-separated string)
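
If you want to work with `results.ndjson` outside of this notebook, each line can be parsed as a separate JSON object. The following is a small sketch using only the standard library. The records are made-up examples that borrow a few field names from the harvest results, and the file is written to a temporary directory, not a real harvest directory.

```python
import json
import tempfile
from pathlib import Path

# Made-up records using a few of the field names found in harvest results
sample = [
    {"title": "Example item one", "identifier": "1", "series": "A1"},
    {"title": "Example item two", "identifier": "2", "series": "A1"},
]

with tempfile.TemporaryDirectory() as tmp:
    ndjson_path = Path(tmp, "results.ndjson")

    # Write one JSON object per line
    with ndjson_path.open("w") as ndjson_file:
        for record in sample:
            ndjson_file.write(json.dumps(record) + "\n")

    # Read the file back, parsing each non-empty line separately
    with ndjson_path.open() as ndjson_file:
        records = [json.loads(line) for line in ndjson_file if line.strip()]

print(len(records))
# 2
print(records[0]["title"])
# Example item one
```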

The `metadata.json` file looks something like this:

```json
{
 "date_harvested": "2021-05-22T22:05:10.705184",
 "search_params": {"results_per_page": 20, "sort": 9, "record_detail": "digitised"},
 "search_kwargs": {"kw": "wragge"},
 "total_results": 208,
 "total_harvested": 208,
 "total_after_deduplication": 208
}
```

The 'total' values represent slightly different things:

* `total_results`: the number of matching results RecordSearch thinks there are
* `total_harvested`: the number of results actually harvested
* `total_after_deduplication`: the number of records left after duplicates are removed from the harvested results

Duplicate records sometimes occur when items have an alternative control symbol. The CSV creation process removes any duplicates.
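
The relationship between the harvested and deduplicated totals can be illustrated with a few made-up records. The notebook itself removes duplicates with pandas when it creates the CSV file. The sketch below swaps in a standard library approach that treats two records as duplicates only when every field matches, and `deduplicate` is a hypothetical name.

```python
import json

# Made-up records: the third is an exact copy of the first
records = [
    {"identifier": "1", "series": "A1", "control_symbol": "1901/1"},
    {"identifier": "2", "series": "A1", "control_symbol": "1901/2"},
    {"identifier": "1", "series": "A1", "control_symbol": "1901/1"},
]


def deduplicate(records):
    """Return the records with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for record in records:
        # Serialise with sorted keys so field order doesn't affect the comparison
        key = json.dumps(record, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


unique = deduplicate(records)
totals = {"harvested": len(records), "deduped": len(unique)}
print(totals)
# {'harvested': 3, 'deduped': 2}
```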

The fields in the results files are:

* `title`
* `identifier` 
* `series` 
* `control_symbol`
* `digitised_status`
* `digitised_pages` – if `record_detail` is set to 'digitised' or 'full'
* `access_status`
* `access_decision_reasons` – if `record_detail` is set to 'full'
* `location`
* `retrieved` – date/time when this record was retrieved from RecordSearch
* `contents_date_str`
* `contents_start_date`
* `contents_end_date`
* `access_decision_date_str` – if `record_detail` is set to 'full'
* `access_decision_date` – if `record_detail` is set to 'full'

See below for information on saving digitised images and PDFs.

## Import what we need

In [None]:
import json
import string
import time
from datetime import datetime
from pathlib import Path

import pandas as pd
import requests
from IPython.display import HTML, FileLink, display
from recordsearch_data_scraper.scrapers import RSItemSearch
from slugify import slugify
from tqdm.auto import tqdm

# This is a workaround for a problem with tqdm adding space to cells
HTML(
 """
 
"""
)

## Define some functions

In [None]:
# This is basically a list of letters and numbers that we can use to build up control symbol values.
control_range = (
 [str(number) for number in range(0, 10)]
 + [letter for letter in string.ascii_uppercase]
 + ["/"]
)


def get_results(data_dir, **kwargs):
    """
    Save all the results from a search using the given parameters.
    If there are more than 20,000 results, return False.
    Otherwise, save the results to the harvest directory and return True.
    """
    s = RSItemSearch(**kwargs)
    if s.total_results == "20,000+":
        return False
    else:
        with tqdm(total=s.total_results, leave=False) as pbar:
            more = True
            while more:
                data = s.get_results()
                if data["results"]:
                    save_to_ndjson(data_dir, data["results"])
                    pbar.update(len(data["results"]))
                    time.sleep(0.5)
                else:
                    more = False
        return True


def refine_controls(current_control, data_dir, **kwargs):
    """
    Add additional letters/numbers to the control symbol wildcard search
    until the number of results is less than 20,000.
    Then harvest the results, which are appended to `results.ndjson`
    in the harvest directory. Nothing is returned.
    """
    for control in control_range:
        new_control = current_control.strip("*") + control + "*"
        # print(new_control)
        kwargs["control"] = new_control
        results = get_results(data_dir, **kwargs)
        # print(total)
        if results is False:
            refine_controls(new_control, data_dir, **kwargs)


def create_data_dir(search, today):
    """
    Create a directory for the harvested data -- using the date and search parameters.
    """
    params = search.params.copy()
    params.update(search.kwargs)
    search_param_str = slugify(
        "_".join(
            sorted(
                [
                    f"{k}_{v}"
                    for k, v in params.items()
                    if v is not None and k not in ["results_per_page", "sort"]
                ]
            )
        )
    )
    data_dir = Path("harvests", f'{today.strftime("%Y%m%d_%H%M%S")}_{search_param_str}')
    data_dir.mkdir(exist_ok=True, parents=True)
    return data_dir


def save_to_ndjson(data_dir, results):
    """
    Save results into a single, newline delimited JSON file.
    """
    output_file = Path(data_dir, "results.ndjson")
    with output_file.open("a") as ndjson_file:
        for result in results:
            ndjson_file.write(json.dumps(result) + "\n")


def save_metadata(search, data_dir, today, totals):
    """
    Save information about the harvest to a JSON file.
    """
    metadata = {
        "date_harvested": today.isoformat(),
        "search_params": search.params,
        "search_kwargs": search.kwargs,
        "total_results": search.total_results,
        "total_harvested": totals["harvested"],
        "total_after_deduplication": totals["deduped"],
    }

    with Path(data_dir, "metadata.json").open("w") as md_file:
        json.dump(metadata, md_file)


def save_csv(data_dir):
    """
    Save the harvested results as a CSV file, removing any duplicates.
    """
    output_file = Path(data_dir, "results.csv")
    input_file = Path(data_dir, "results.ndjson")
    df = pd.read_json(input_file, lines=True)
    harvested = df.shape[0]
    # Flatten list
    try:
        df["access_decision_reasons"] = (
            df["access_decision_reasons"].dropna().apply(lambda l: " | ".join(l))
        )
    except KeyError:
        pass
    # Remove any duplicates
    df.drop_duplicates(inplace=True)
    df.to_csv(output_file, index=False)
    deduped = df.shape[0]
    return {"harvested": harvested, "deduped": deduped}


def harvest_search(**kwargs):
    """
    Harvest all the items from a search using the supplied parameters.
    If there are more than 20,000 results, it will use control symbol
    wildcard values to try and split the results into harvestable chunks.
    """
    # Initialise the search
    search = RSItemSearch(**kwargs)
    today = datetime.now()
    data_dir = create_data_dir(search, today)
    # If there are more than 20,000 results, try chunking using control symbols
    if search.total_results == "20,000+":
        # Loop through the letters and numbers
        for control in control_range:
            # print(control)
            # Add letter/number as a wildcard value
            kwargs["control"] = f"{control}*"
            # Try getting the results
            results = get_results(data_dir, **kwargs)
            # print(results)
            if results is False:
                # If there's still more than 20,000, add more letters/numbers to the control symbol!
                refine_controls(control, data_dir, **kwargs)
    else:
        # If there's less than 20,000 results, save them all
        get_results(data_dir, **kwargs)
    totals = save_csv(data_dir)
    save_metadata(search, data_dir, today, totals)
    print(f"Harvest directory: {data_dir}")
    display(FileLink(Path(data_dir, "metadata.json")))
    display(FileLink(Path(data_dir, "results.ndjson")))
    display(FileLink(Path(data_dir, "results.csv")))
    return data_dir


def save_images(harvest_dir):
    """
    Download an image of every page from each digitised item in the harvest.
    Images are saved in an `images` subdirectory of the harvest directory.
    """
    df = pd.read_csv(Path(harvest_dir, "results.csv"))
    with tqdm(
        total=df.loc[df["digitised_status"] == True].shape[0], desc="Files"
    ) as pbar:
        for item in df.loc[df["digitised_status"] == True].itertuples():
            image_dir = Path(
                f"{harvest_dir}/images/{slugify(item.series)}-{slugify(str(item.control_symbol))}-{item.identifier}"
            )

            # Create the folder (and parent if necessary)
            image_dir.mkdir(exist_ok=True, parents=True)

            # Loop through the page numbers
            for page in tqdm(
                range(1, int(item.digitised_pages) + 1), desc="Images", leave=False
            ):

                # Define the image filename using the item ID and page number
                filename = Path(f"{image_dir}/{item.identifier}-{page}.jpg")

                # Check to see if the image already exists (useful if rerunning a failed harvest)
                if not filename.exists():
                    # If it doesn't already exist then download it
                    img_url = f"https://recordsearch.naa.gov.au/NaaMedia/ShowImage.asp?B={item.identifier}&S={page}&T=P"
                    response = requests.get(img_url)
                    try:
                        response.raise_for_status()
                    except requests.exceptions.HTTPError:
                        pass
                    else:
                        filename.write_bytes(response.content)

                    time.sleep(0.5)
            pbar.update(1)


def save_pdfs(harvest_dir):
    """
    Download a PDF of each digitised item in the harvest.
    PDFs are saved in a `pdfs` subdirectory of the harvest directory.
    """
    df = pd.read_csv(Path(harvest_dir, "results.csv"))
    pdf_dir = Path(harvest_dir, "pdfs")
    pdf_dir.mkdir(exist_ok=True, parents=True)
    with tqdm(
        total=df.loc[df["digitised_status"] == True].shape[0], desc="Files"
    ) as pbar:
        for item in df.loc[df["digitised_status"] == True].itertuples():
            pdf_file = Path(
                pdf_dir,
                f"{slugify(item.series)}-{slugify(str(item.control_symbol))}-{item.identifier}.pdf",
            )
            # Skip files that have already been downloaded
            if not pdf_file.exists():
                pdf_url = f"https://recordsearch.naa.gov.au/SearchNRetrieve/NAAMedia/ViewPDF.aspx?B={item.identifier}&D=D"
                response = requests.get(pdf_url)
                try:
                    response.raise_for_status()
                except requests.exceptions.HTTPError:
                    pass
                else:
                    pdf_file.write_bytes(response.content)
                time.sleep(0.5)
            pbar.update(1)

## Start a harvest

Insert your search parameters in the brackets below.

Examples:

* `data_dir = harvest_search(kw='rabbit')`
* `data_dir = harvest_search(kw='rabbit', digital=True)`
* `data_dir = harvest_search(record_detail='full', kw='rabbit', series='A1')`
* `data_dir = harvest_search(series='B13')`

If you're running a long harvest, there's a good chance it will get interrupted at some point. Don't worry, just run the cell below again. The scraper caches your results, so it won't need to start from scratch.

In [None]:
data_dir = harvest_search(kw="wragge exhibit", record_detail="digitised")

## Saving images from digitised files

Once you've saved all the metadata from your search, you can use it to download images from all the items that have been digitised.

Note that you can only save the images if you set the `record_detail` parameter to 'digitised' or 'full' in the original harvest.

The function below will look for all items in the harvest results that are marked as digitised, and then use each item's `digitised_pages` value to download an image for every page. The images will be saved in an `images` subdirectory, inside the original harvest directory.
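
To show how the page images are addressed, this sketch builds the list of image URLs for a single item without downloading anything. The URL pattern is the one used by `save_images()` in this notebook. The function name `page_image_urls` and the identifier '12345' are made up for illustration.

```python
def page_image_urls(identifier, digitised_pages):
    """
    Build one image URL per page of a digitised item.
    Page numbers start at 1. `digitised_pages` may be a float when it
    has been read from a CSV file, so it is converted to an integer.
    """
    base_url = "https://recordsearch.naa.gov.au/NaaMedia/ShowImage.asp"
    return [
        f"{base_url}?B={identifier}&S={page}&T=P"
        for page in range(1, int(digitised_pages) + 1)
    ]


# '12345' is a made-up identifier, not a real item
urls = page_image_urls("12345", 3.0)
print(len(urls))
# 3
print(urls[0])
# https://recordsearch.naa.gov.au/NaaMedia/ShowImage.asp?B=12345&S=1&T=P
```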

In [None]:
# Supply the path to the directory containing the harvested data
# This is the value returned by the `harvest_search()` function.
# eg: 'harvests/20210522_digital_True_kw_wragge_record_detail_full'
save_images(data_dir)

## Saving digitised files as PDFs

You can also save digitised files as PDFs. The function below will save any digitised files in the results to a `pdfs` subdirectory within the harvest directory.

In [None]:
# Supply the path to the directory containing the harvested data
# This is the value returned by the `harvest_search()` function.
# eg: 'harvests/20210522_digital_True_kw_wragge_record_detail_full'
save_pdfs(data_dir)

----

Created by [Tim Sherratt](https://timsherratt.org/) for the [GLAM Workbench](https://glam-workbench.github.io/). Support me by becoming a [GitHub sponsor](https://github.com/sponsors/wragge)!