say 'Datasets are datalads core data type. We will explore the concepts of datasets by creating one with datalad create. optional configuration template and a description' run '### Code snippet 1 datalad create -c text2git DataLad-101' say 'Datalad informs about what it is doing during a command. At the end is a summary, in this case it is ok. What is inside of a newly created dataset? We list contents with ls.' run '### Code snippet 2 cd DataLad-101 ls # ls does not show any output, because the dataset is empty.' say 'GIT LOG, SHASUM, MESSAGE: A dataset is version controlled. This means, edits done to a file are associated with information about the change, the author, and the time + ability to restore previous states of the dataset. Let'"'"'s take a look into the history, even if it is small atm' run '### Code snippet 3 git log' say 'Datalad, git-annex, and git create hidden files and directories in your dataset. Make sure to not delete them!' run '### Code snippet 4 ls -a # show also hidden files' say 'The dataset is empty, lets put some PDFs inside. First, create a directory to store them in:' run '### Code snippet 5 mkdir books' say 'The tree command shows us the directory structure in the dataset. Apart from the directory, its empty.' run '### Code snippet 6 tree' say 'We use wget to download a few books from the web. CAVE: longish realcommand!' run '### Code snippet 7 cd books && wget -nv https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download -O TLCL.pdf && wget -nv https://edisciplinas.usp.br/pluginfile.php/3252353/mod_resource/content/1/b_Swaroop_Byte_of_python.pdf -O byte-of-python.pdf && cd ../' say 'Here they are:' run '### Code snippet 8 tree' say 'What has happened to our dataset now with this new content? We can use datalad status to find out:' run '### Code snippet 9 datalad status' say 'ATM the files are untracked and thus unknown to any version control system. In order to version control the PDFs we need to save them. We attach a meaningful summary of this with the -m option:' run '### Code snippet 10 datalad save -m "add books on Python and Unix to read later"' say 'Save command reports what has been added to the dataset. Now we can see how this action looks like in our dataset'"'"'s history:' run '### Code snippet 11 git log -p -n 1' say 'Its inconvenient that we saved two books together - we should have saved them as independent modifications of the dataset. To see how single modifications can be saved, let'"'"'s download another book' run '### Code snippet 12 cd books && wget -nv https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf && cd ../' say 'Check the dataset state with the status command frequently' run '### Code snippet 13 datalad status' say 'To save a single modification, provide a path to it!' run '### Code snippet 14 datalad save -m "add reference book about git" books/progit.pdf' say 'Let'"'"'s view the growing history (concise with the --oneline option):' run '### Code snippet 15 # lets make the output a bit more concise with the --oneline option git log --oneline' say 'finally, datalad-download-url' run '### Code snippet 16 datalad download-url http://www.tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf \ --dataset . \ -m "add beginners guide on bash" \ -O books/bash_guide.pdf' run '### Code snippet 17 ls books' run '### Code snippet 18 datalad status' say 'Let'"'"'s find out how we can modify files in dataset. Lets create a text file with notes about the DataLad commands we learned. (maybe explain here docs)' run '### Code snippet 19 cat << EOT > notes.txt One can create a new dataset with '"'"'datalad create [--description] PATH'"'"'. The dataset is created empty EOT' say 'As expected, there is a new file in the dataset. At first the file is untracked. We can save without a path specification because it is the only existing modification' run '### Code snippet 20 datalad status' run '### Code snippet 21 datalad save -m "Add notes on datalad create"' say 'Now let'"'"'s start to modify this text file by adding more notes to it. Think about this being a code file that you add functions to:' run '### Code snippet 22 cat << EOT >> notes.txt The command "datalad save [-m] PATH" saves the file (modifications) to history. Note to self: Always use informative, concise commit messages. EOT' run '### Code snippet 23 datalad status' say 'The modification can be saved as well' run '### Code snippet 24 datalad save -m "add note on datalad save"' say 'An the history gives an accurate record of what happened to this file' run '### Code snippet 25 git log -p -n 2' say 'The next challenge is to clone an existing dataset from the web as a subdataset. First, we create a location for this' run '### Code snippet 26 # we are in the root of DataLad-101 mkdir recordings' say 'We need to clone the dataset as a subdataset. For this, we use the datalad clone command with a --dataset option and a path. Else the dataset would not be registered as a subdataset!' run '### Code snippet 27 datalad clone --dataset . \ https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow' say 'Let'"'"'s take a look at the directory structure after cloning' run '### Code snippet 28 tree -d # we limit the output to directories' say 'And now lets look into these seminar series folders: There are hundreds of mp3 files, yet the download only took a few seconds! How can that be?' run '### Code snippet 29 cd recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking ls' say 'Upon cloning of a DataLad dataset, DataLad retrieves only small files and metadata. Therefore the dataset is tiny in size. The files are non-functional now atm (Try opening one)' run '### Code snippet 30 cd ../ # in longnow/ du -sh # Unix command to show size of contents' say 'But how large would the dataset be if we had all the content?' run '### Code snippet 31 datalad status --annex' say 'Now let'"'"'s finally get some content in this dataset. This is done with the datalad get command' run '### Code snippet 32 datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3' say 'Datalad status can also summarize how much of the content is already present locally:' run '### Code snippet 33 datalad status --annex all' say 'Let'"'"'s get a few more files. Note how already obtained files are not downloaded again:' run '### Code snippet 34 datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 \ Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 \ Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3' say 'On Dataset nesting: You have seen the history of DataLad-101. But the subdataset has a standalone history as well! We can find out who created it!' run '### Code snippet 35 git log --reverse' say 'We can make a note about this:' run '### Code snippet 36 # in the root of DataLad-101: cd ../../ cat << EOT >> notes.txt The command '"'"'datalad clone URL/PATH [PATH]'"'"' installs a dataset from e.g., a URL or a path. If you install a dataset into an existing dataset (as a subdataset), remember to specify the root of the superdataset with the '"'"'-d'"'"' option. EOT datalad save -m "Add note on datalad clone"' say 'The superdataset only stores the version of the subdataset. Let'"'"'s take a look into how the superdataset'"'"'s history looks like' run '### Code snippet 37 git log -p' say 'We can find this shasum in the subdatasets history: it'"'"'s the most recent change' run '### Code snippet 38 cd recordings/longnow git log --oneline' run '### Code snippet 39 cd ../../' say 'Let'"'"'s create a data analysis project with a yoda procedure' run '### Code snippet 126 # inside of DataLad-101 datalad create -c yoda --dataset . midterm_project' say 'Now clone input data as a subdataset' run '### Code snippet 127 cd midterm_project # we are in midterm_project, thus -d . points to the root of it. datalad clone -d . https://github.com/datalad-handbook/iris_data.git input/' say 'here is how the directory structure looks like' run '### Code snippet 128 cd ../ tree -d cd midterm_project' say 'Let'"'"'s create code for an analysis' run '### Code snippet 129 cat << EOT > code/script.py import pandas as pd import seaborn as sns import datalad.api as dl from sklearn import model_selection from sklearn.neighbors import KNeighborsClassifier from sklearn.metrics import classification_report data = "input/iris.csv" # make sure that the data are obtained (get will also install linked sub-ds!): dl.get(data) # prepare the data as a pandas dataframe df = pd.read_csv(data) attributes = ["sepal_length", "sepal_width", "petal_length","petal_width", "class"] df.columns = attributes # create a pairplot to plot pairwise relationships in the dataset plot = sns.pairplot(df, hue='"'"'class'"'"', palette='"'"'muted'"'"') plot.savefig('"'"'pairwise_relationships.png'"'"') # perform a K-nearest-neighbours classification with scikit-learn # Step 1: split data in test and training dataset (20:80) array = df.values X = array[:,0:4] Y = array[:,4] test_size = 0.20 seed = 7 X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed) # Step 2: Fit the model and make predictions on the test dataset knn = KNeighborsClassifier() knn.fit(X_train, Y_train) predictions = knn.predict(X_test) # Step 3: Save the classification report report = classification_report(Y_test, predictions, output_dict=True) df_report = pd.DataFrame(report).transpose().to_csv('"'"'prediction_report.csv'"'"') EOT' say 'datalad status will show a new file' run '### Code snippet 130 datalad status' say 'Save the analysis to the history' run '### Code snippet 131 datalad save -m "add script for kNN classification and plotting" --version-tag ready4analysis code/script.py' say 'The datalad run command can reproducibly execute a command reproducibly' run '### Code snippet 132 datalad run -m "analyze iris data with classification analysis" \ --input "input/iris.csv" \ --output "prediction_report.csv" \ --output "pairwise_relationships.png" \ "python3 code/script.py"' say 'Let'"'"'s take a look at the history' run '### Code snippet 133 git log --oneline' say 'create human readable information for your project' run '### Code snippet 134 # with the >| redirection we are replacing existing contents in the file cat << EOT >| README.md # Midterm YODA Data Analysis Project ## Dataset structure - All inputs (i.e. building blocks from other sources) are located in input/. - All custom code is located in code/. - All results (i.e., generated files) are located in the root of the dataset: - "prediction_report.csv" contains the main classification metrics. - "output/pairwise_relationships.png" is a plot of the relations between features. EOT' say 'The README file is now modified' run '### Code snippet 135 datalad status' say 'Let'"'"'s save this change' run '### Code snippet 136 datalad save -m "Provide project description" README.md' say 'Computational reproducibility: add a software container' run '### Code snippet 137 # we are in the midterm_project subdataset datalad containers-add midterm-software --url shub://adswa/resources:1' say 'The software container got added to your datasets history' run '### Code snippet 138 git log -n 1 -p' say 'The analysis can be rerun in a software container' run '### Code snippet 139 datalad containers-run -m "rerun analysis in container" \ --container-name midterm-software \ --input "input/iris.csv" \ --output "prediction_report.csv" \ --output "pairwise_relationships.png" \ "python3 code/script.py"' say 'Here is how that looks like in the history:' run '### Code snippet 140 git log -p -n 1' say 'Save the change in the superdataset' run '### Code snippet 141 cd ../ datalad status' say 'Save the change in the superdataset' run '### Code snippet 142 datalad save -d . -m "add container and execute analysis within container" midterm_project'