This notebook shows an example of using a checkpoint inside an input function in Snakemake 8.18.2. It is based on a [write-up](https://edwards.flinders.edu.au/how-to-use-snakemake-checkpoints/) written for an outdated version of the library.

Acknowledgments: I would like to thank [Wayne](https://stackoverflow.com/users/8508004/wayne) for his help with the implementation.

In [1]:
! snakemake --version

8.18.2


# Case

In [1]:
! cat Snakefile

OUTDIR = "first_directory"
SNDDIR = "second_directory"

SMP = None

def get_file_names(wildcards):
    ck_output = checkpoints.make_five_files.get(**wildcards).output[0]
    global SMP
    SMP, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    return expand(os.path.join(ck_output, "{SAMPLE}.txt"), SAMPLE=SMP)

def get_second_files(wildcards):
    ck_output = checkpoints.make_five_files.get(**wildcards).output[0]
    SMP2, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    return expand(os.path.join(SNDDIR, "{SM}.tsv"), SM=SMP2)


rule all:
    input:
        get_second_files,
        expand("list_of_files_{SAMPLE}.txt",SAMPLE=SMP)


checkpoint make_five_files:
    output:
        directory(OUTDIR)
    params:
        o = OUTDIR
    shell:
        """
        mkdir {output};
        for D in $(seq 1 5); do
            touch {params.o}/$RANDOM.txt
        done
        """

rule copy_files:
    input:
        get_file_names
    output:
        os.path.join(SNDDIR, "{SAMPLE}.tsv")
    shell:
        """
        touch {output}
        """


rule list_all_files:
    input:
        expand(os.path.join(SNDDIR, "{SAMPLE}.tsv"), 

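In the Snakefile above, the {SAMPLE} wildcard in rule all is filled from the SMP variable. However, SMP is None when the Snakefile is parsed, and the expand(...) call in rule all is evaluated at that moment, so it requests list_of_files_None.txt. SMP is only updated later, when get_file_names runs after the checkpoint has finished. By then it should hold the list of file names in the OUTDIR directory, but the targets of rule all have already been fixed.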
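The effect can be reproduced in plain Python. The sketch below is only an illustration: `expand_like` is a hypothetical, heavily simplified stand-in for Snakemake's `expand`, not its real implementation. It shows that an argument is evaluated at the moment of the call, so a later change to the global variable does not affect the targets that were already built.

```python
# Hypothetical, simplified stand-in for Snakemake's expand(); NOT the real implementation.
def expand_like(pattern, **values):
    """Fill one named placeholder with each supplied value."""
    (name, value), = values.items()
    # A scalar (including None) is treated as a single value.
    items = value if isinstance(value, (list, tuple)) else [value]
    return [pattern.format(**{name: item}) for item in items]


SMP = None  # value of the global while the Snakefile is being parsed

# The argument is evaluated right here, while SMP is still None.
targets = expand_like("list_of_files_{SAMPLE}.txt", SAMPLE=SMP)

# Updating SMP later (as the input function does) does not change `targets`.
SMP = ["10536", "16393"]

print(targets)
```

The printed list still contains only `list_of_files_None.txt`, which matches the stray file that appears in the working directory.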

In [3]:
!snakemake -c 1 --debug-dag

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
candidate job all
    wildcards:
candidate job make_five_files
    wildcards:
selected job make_five_files
    wildcards:
candidate job list_all_files
    wildcards: SAMPLE=None
candidate job copy_files
    wildcards: SAMPLE=None
candidate job make_five_files
    wildcards:
selected job make_five_files
    wildcards:
selected job copy_files
    wildcards: SAMPLE=None
selected job list_all_files
    wildcards: SAMPLE=None
selected job all
    wildcards:
candidate job copy_files
    wildcards: SAMPLE=None
selected job copy_files
    wildcards: SAMPLE=None
candidate job all
    wildcards:
candidate job copy_files
    wildcards: SAMPLE=10536
selected job copy_files
    wildcards: SAMPLE=10536
candidate job copy_files
    wildcards: SAMPLE=16393
selected job copy_files
    wildcards: SAMPLE=16393

The log shows that once make_five_files has run, the {SAMPLE} wildcard takes the correct values (for example, 10536 and 16393). Before that, however, copy_files and list_all_files are selected several times with SAMPLE=None, which is incorrect: it produces the stray files list_of_files_None.txt and None.tsv seen in the directory listings below.

In [5]:
!ls

SM2.ipynb Snakefile first_directory list_of_files_None.txt second_directory


In [2]:
!ls first_directory/

10536.txt 16393.txt 16815.txt 2323.txt 6362.txt


In [4]:
!ls second_directory/

10536.tsv 16393.tsv 16815.tsv 2323.tsv 6362.tsv None.tsv


# Solution

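The solution uses an input function that returns the list of sample names found in the checkpoint's output directory, and passes that function itself (rather than a precomputed list) to expand in rule all. The list is therefore only built after the checkpoint has run. The OUTDIR directory is also listed as an input of rule all.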
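The idea of deferring the directory listing can be sketched outside Snakemake. In the example below, `list_sample_names` and `build_targets` are hypothetical helper names chosen for illustration, and a temporary directory stands in for the checkpoint output. The point is that the listing function is handed over as a callable and only invoked once the files exist.

```python
import os
import tempfile


def list_sample_names(directory):
    """Return the base names of all .txt files in `directory`, sorted."""
    return sorted(
        name.split(".")[0]
        for name in os.listdir(directory)
        if name.endswith(".txt")
    )


def build_targets(directory, get_names):
    """Call `get_names` only now, i.e. after the directory has been filled."""
    return [os.path.join(directory, f"{name}.doc") for name in get_names(directory)]


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the checkpoint: create files whose names are not known upfront.
    for name in ("10576", "23600", "notes"):
        open(os.path.join(tmp, f"{name}.txt"), "w").close()
    # A file with another extension is ignored by the listing.
    open(os.path.join(tmp, "ignored.log"), "w").close()

    names = list_sample_names(tmp)
    targets = [os.path.relpath(t, tmp) for t in build_targets(tmp, list_sample_names)]

print(names)
print(targets)
```

Sorting is added here only to make the output deterministic; `os.listdir` returns entries in arbitrary order, which is why the list printed in the Snakemake log is unsorted.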

In [6]:
! cat Snakefile

import os

OUTDIR = "first_directory"


def get_txt_files(wildcards):
    ck_output = checkpoints.make_five_files.get(**wildcards).output[0]
    names = [file.split(".")[0] for file in os.listdir(ck_output) if file.endswith('.txt')]
    print(names)
    return names


rule all:
    input:
        OUTDIR,
        expand(f"{OUTDIR}/"+"{SMP}.doc", SMP=get_txt_files)

checkpoint make_five_files:
    output:
        directory(OUTDIR)
    params:
        o = OUTDIR
    shell:
        """
        mkdir {output};
        for D in $(seq 1 5); do
            touch {params.o}/$RANDOM.txt
        done
        """

rule copy_files:
    input:
        "{SMP}.txt"
    output:
        out = "{SMP}.tsv"
    shell:
        """
        touch {output.out}
        """

rule copy2:
    input:
        "{SMP}.tsv"
    output:
        "{SMP}.doc"
    shell:
        """
        touch {output}
        """

In [2]:
! snakemake -c 1 -F

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job                count
---------------  -------
all                    1
make_five_files        1
total                  2

Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localcheckpoint make_five_files:
    output: first_directory
    jobid: 1
    reason: Forced execution
    resources: tmpdir=/tmp
DAG of jobs will be updated after completion.

[Thu Aug 22 17:43:06 2024]
Finished job 1.
1 of 2 steps (50%) done
['10576', '23600', '21071', '24382', '17800']
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:43:06 2024]
localrule copy_files:
    input: first_directory/10576.txt
    output: first_directory/10576.tsv
    jobid: 5
    reason: Forc

In [3]:
! ls 

first_directory Snakefile Untitled.ipynb


In [4]:
! ls first_directory/

10576.doc 17800.doc 21071.doc 23600.doc 24382.doc
10576.tsv 17800.tsv 21071.tsv 23600.tsv 24382.tsv
10576.txt 17800.txt 21071.txt 23600.txt 24382.txt


Next, let's create the derived files in a new directory. Its name is set by the SNDDIR variable at the top of the Snakefile.
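As a rough sketch of the intended layout, the snippet below maps one sample name to the file expected at each stage. The helper name `stage_paths` is hypothetical and the stage-to-rule mapping simply mirrors the rules in this notebook (checkpoint, copy_files, copy2). Only the .txt files stay in OUTDIR, while the .tsv and .doc files go to SNDDIR.

```python
OUTDIR = "first_directory"
SNDDIR = "second_directory"


def stage_paths(sample):
    """Return the file expected at each stage for one sample name."""
    return {
        "txt": f"{OUTDIR}/{sample}.txt",  # created by the checkpoint
        "tsv": f"{SNDDIR}/{sample}.tsv",  # created by copy_files
        "doc": f"{SNDDIR}/{sample}.doc",  # created by copy2
    }


paths = stage_paths("5114")
for stage, path in paths.items():
    print(stage, path)
```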

In [7]:
! cat Snakefile

import os

OUTDIR = "first_directory"
SNDDIR = "second_directory"


def get_txt_files(wildcards):
    ck_output = checkpoints.make_five_files.get(**wildcards).output[0]
    names = [file.split(".")[0] for file in os.listdir(ck_output) if file.endswith('.txt')]
    print(names)
    return names


rule all:
    input:
        OUTDIR,
        expand(f"{SNDDIR}/"+"{SMP}.doc", SMP=get_txt_files)


checkpoint make_five_files:
    output:
        directory(OUTDIR)
    params:
        o = OUTDIR
    shell:
        """
        mkdir {output};
        for D in $(seq 1 5); do
            touch {params.o}/$RANDOM.txt
        done
        """

rule copy_files:
    input:
        f"{OUTDIR}/"+"{SMP}.txt"
    output:
        out = f"{SNDDIR}/"+"{SMP}.tsv"
    shell:
        """
        touch {output.out}
        """

rule copy2:
    input:
        f"{SNDDIR}/"+"{SMP}.tsv"
    output:
        f"{SNDDIR}/"+"{SMP}.doc"
    shell:
        """
        touch {output}
        """




In [14]:
! snakemake -c 1 -F

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job                count
---------------  -------
all                    1
make_five_files        1
total                  2

Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:10 2024]
localcheckpoint make_five_files:
    output: first_directory
    jobid: 1
    reason: Forced execution
    resources: tmpdir=/tmp
DAG of jobs will be updated after completion.

[Thu Aug 22 17:52:11 2024]
Finished job 1.
1 of 2 steps (50%) done
['5114', '11823', '32502', '6262', '21833']
Select jobs to execute...
Execute 1 jobs...

[Thu Aug 22 17:52:11 2024]
localrule copy_files:
    input: first_directory/5114.txt
    output: second_directory/5114.tsv
    jobid: 5
    reason: Forced

In [15]:
! ls 

first_directory second_directory Snakefile Untitled.ipynb


In [16]:
! ls first_directory/

11823.txt 21833.txt 32502.txt 5114.txt 6262.txt


In [17]:
! ls second_directory/

11823.doc 21833.doc 32502.doc 5114.doc 6262.doc
11823.tsv 21833.tsv 32502.tsv 5114.tsv 6262.tsv
