say 'Datasets are DataLad'"'"'s core data type. We will explore the concept of datasets by creating one with datalad create, using an optional configuration template and a description.'
run '### Code snippet 1
datalad create --description "course on DataLad-101 on my private Laptop" -c text2git DataLad-101'
say 'DataLad reports what it is doing while a command runs. At the end there is a summary; in this case it is "ok". What is inside of a newly created dataset? We list its contents with ls.'
run '### Code snippet 2
cd DataLad-101
ls # ls does not show any output, because the dataset is empty.'
say 'GIT LOG, SHASUM, MESSAGE: A dataset is version controlled. This means that edits to a file are recorded together with information about the change, the author, and the time, and that previous states of the dataset can be restored. Let'"'"'s take a look at the history, even though it is still small at this point.'
run '### Code snippet 3
git log'
say 'DataLad, git-annex, and Git create hidden files and directories in your dataset. Make sure not to delete them!'
run '### Code snippet 4
ls -a # show also hidden files'
say 'The dataset is empty, so let'"'"'s put some PDFs inside. First, create a directory to store them in:'
run '### Code snippet 5
mkdir books'
say 'The tree command shows us the directory structure of the dataset. Apart from the new directory, it is empty.'
run '### Code snippet 6
tree'
say 'We use wget to download a few books from the web. Careful: this is a longish real command!'
run '### Code snippet 7
cd books && wget -nv https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download -O TLCL.pdf && wget -nv https://www.gitbook.com/download/pdf/book/swaroopch/byte-of-python -O byte-of-python.pdf && cd ../'
say 'Here they are:'
run '### Code snippet 8
tree'
say 'What has happened to our dataset with this new content? We can use datalad status to find out:'
run '### Code snippet 9
datalad status'
say 'At the moment the files are untracked and thus unknown to any version control system. In order to version control the PDFs, we need to save them. We attach a meaningful summary of the change with the -m option:'
run '### Code snippet 10
datalad save -m "add books on Python and Unix to read later"'
say 'The save command reports what has been added to the dataset. Now we can see what this action looks like in our dataset'"'"'s history:'
run '### Code snippet 11
git log -p -n 1'
say 'It is inconvenient that we saved two books together - we should have saved them as independent modifications of the dataset. To see how single modifications can be saved, let'"'"'s download another book.'
run '### Code snippet 12
cd books && wget -nv https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf && cd ../'
say 'Check the dataset state with the status command frequently.'
run '### Code snippet 13
datalad status'
say 'To save a single modification, provide a path to it!'
run '### Code snippet 14
datalad save -m "add reference book about git" books/progit.pdf'
say 'Let'"'"'s view the growing history (concise with the --oneline option):'
run '### Code snippet 15
# make the output a bit more concise with the --oneline option
git log --oneline'
say 'Let'"'"'s find out how we can modify files in a dataset. Let'"'"'s create a text file with notes about the DataLad commands we learned. The cat << EOT construct is a here document: everything up to the closing EOT marker is written into notes.txt.'
run '### Code snippet 16
cat << EOT > notes.txt
One can create a new dataset with '"'"'datalad create [--description] PATH'"'"'.
The dataset is created empty.
EOT'
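say 'Optionally, we can display the file to verify what the here document wrote into it. This is just an extra sanity check and assumes we are still in the root of DataLad-101, where notes.txt was created.'
run '### optional check
# assumes the current directory is the root of DataLad-101
cat notes.txt'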
say 'As expected, there is a new file in the dataset. At first the file is untracked. We can save it without a path specification because it is the only existing modification.'
run '### Code snippet 17
datalad status'
run '### Code snippet 18
datalad save -m "Add notes on datalad create"'
say 'Now let'"'"'s start to modify this text file by adding more notes to it. Think of this as a code file that you keep adding functions to:'
run '### Code snippet 19
cat << EOT >> notes.txt
The command "datalad save [-m] PATH" saves the file (modifications) to history.
Note to self: Always use informative, concise commit messages.
EOT'
run '### Code snippet 20
datalad status'
say 'The modification can simply be saved as well.'
run '### Code snippet 21
datalad save -m "add note on datalad save"'
say 'And the history gives an accurate record of what happened to this file.'
run '### Code snippet 22
git log -p -n 2'
say 'ADINA DO A PAUSE HERE!!!'
say 'The next challenge is to install an existing dataset from the web as a subdataset. First, we create a location for it.'
run '### Code snippet 23
# we are in the root of DataLad-101
mkdir recordings'
say 'We need to install the dataset as a subdataset. For this, we use the datalad install command with the --dataset (-d) and --source (-s) options as well as a path. Without the -d option, the dataset would not be registered as a subdataset!'
run '### Code snippet 24
datalad install -d . -s https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow'
say 'Let'"'"'s take a look at the directory structure after the installation.'
run '### Code snippet 25
tree -d # we limit the output to directories'
say 'And now let'"'"'s look into one of these seminar series folders: There are hundreds of mp3 files, yet the download only took a few seconds! How can that be?'
run '### Code snippet 26
cd recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking
ls'
say 'Upon installation of a DataLad dataset, DataLad retrieves only small files and metadata. Therefore the dataset is tiny in size, and for now the large files are not functional - try opening one.'
run '### Code snippet 27
cd ../ # in longnow/
du -sh # Unix command to show size of contents'
say 'But how large would the dataset be if we had all the content?'
run '### Code snippet 28
datalad status --annex'
say 'Now let'"'"'s finally get some content into this dataset. This is done with the datalad get command.'
run '### Code snippet 29
datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3'
say 'datalad status can also summarize how much of the content is already present locally:'
run '### Code snippet 30
datalad status --annex all'
say 'Let'"'"'s get a few more files. Note how already obtained files are not downloaded again:'
run '### Code snippet 31
datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 \
  Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 \
  Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3'
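say 'Optionally, we can repeat the earlier du -sh size check to see how much space the files we just obtained take up locally, compared to what we saw before getting any data. This assumes we are still in the root of the longnow subdataset.'
run '### optional check
# assumes the current directory is still recordings/longnow
du -sh'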
say 'On dataset nesting: You have seen the history of DataLad-101. But the subdataset has a standalone history as well! We can find out who created it.'
run '### Code snippet 32
git log --reverse'
say 'We can make a note about this:'
run '### Code snippet 33
# in the root of DataLad-101:
cd ../../
cat << EOT >> notes.txt
The command '"'"'datalad install [--source] PATH'"'"' installs a dataset from e.g., a URL or a path.
If you install a dataset into an existing dataset (as a subdataset), remember to specify the root of the superdataset with the '"'"'-d'"'"' option.
EOT
datalad save -m "Add note on datalad install"'
say 'The superdataset only stores the version (the commit shasum) of the subdataset. Let'"'"'s take a look at what the superdataset'"'"'s history looks like.'
run '### Code snippet 34
git log -p -n 2'
say 'We can find this shasum in the subdataset'"'"'s history: it'"'"'s the most recent change.'
run '### Code snippet 35
cd recordings/longnow
git log --oneline'
run '### Code snippet 36
cd ../../'
say 'Let'"'"'s create a data analysis project with the yoda procedure.'
run '### Code snippet 123
# inside of DataLad-101
datalad create -c yoda --dataset . midterm_project'
say 'Now install input data as a subdataset.'
run '### Code snippet 124
cd midterm_project
# we are in midterm_project, thus -d . points to the root of it.
datalad install -d . --source https://github.com/datalad-handbook/iris_data.git input/'
say 'Here is what the directory structure looks like now.'
run '### Code snippet 125
cd ../
tree -d
cd midterm_project'
say 'Let'"'"'s create code for an analysis.'
run '### Code snippet 126
cat << EOT > code/script.py
import pandas as pd
import seaborn as sns
import datalad.api as dl
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

data = "input/iris.csv"

# make sure that the data is obtained (get will also install linked sub-ds!):
dl.get(data)

# prepare the data as a pandas dataframe
df = pd.read_csv(data)
attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
df.columns = attributes

# create a pairplot to plot pairwise relationships in the dataset
plot = sns.pairplot(df, hue='"'"'class'"'"')
plot.savefig('"'"'pairwise_relationships.png'"'"')

# perform a K-nearest-neighbours classification with scikit-learn
# Step 1: split data in test and training dataset (20:80)
array = df.values
X = array[:,0:4]
Y = array[:,4]
test_size = 0.20
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)

# Step 2: fit the model and make predictions on the test dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_test)

# Step 3: save the classification report
report = classification_report(Y_test, predictions, output_dict=True)
df_report = pd.DataFrame(report).transpose().to_csv('"'"'prediction_report.csv'"'"')
EOT'
say 'datalad status will show a new file.'
run '### Code snippet 127
datalad status'
say 'Save the analysis to the history.'
run '### Code snippet 128
datalad save -m "add script for kNN classification and plotting" --version-tag ready4analysis code/script.py'
say 'The datalad run command can execute a command reproducibly: it records the command, its inputs, and its outputs in the dataset history.'
run '### Code snippet 129
datalad run -m "analyze iris data with classification analysis" \
  --input "input/iris.csv" \
  --output "prediction_report.csv" \
  --output "pairwise_relationships.png" \
  "python3 code/script.py"'
say 'Let'"'"'s take a look at the history.'
run '### Code snippet 130
git log --oneline'
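say 'Optionally, we can look at the most recent commit in full: datalad run stores a machine-readable record of the executed command together with its inputs and outputs in that commit.'
run '### optional check
# inspect the run record in the last commit
git log -p -n 1'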
say 'Finally, let'"'"'s create some human-readable information about the project.'
run '### Code snippet 131
# with the >| redirection we are replacing existing contents in the file
cat << EOT >| README.md
# Midterm YODA Data Analysis Project

## Dataset structure

- All inputs (i.e. building blocks from other sources) are located in input/.
- All custom code is located in code/.
- All results (i.e., generated files) are located in the root of the dataset:
  - "prediction_report.csv" contains the main classification metrics.
  - "pairwise_relationships.png" is a plot of the relations between features.
EOT'
say 'The README file is now modified.'
run '### Code snippet 132
datalad status'
say 'Let'"'"'s save this change.'
run '### Code snippet 133
datalad save -m "Provide project description" README.md'
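say 'Optionally, a final datalad status can confirm that everything has been saved and the dataset is in a clean state again.'
run '### optional check
datalad status'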