say 'Datasets are DataLad'"'"'s core data type. We will explore the concept of datasets by creating one with datalad create, using an optional configuration template and a description.'
run '### Code snippet 1
datalad create --description "course on DataLad-101 on my private Laptop" -c text2git DataLad-101'
say 'DataLad reports what it is doing while a command runs. At the end there is a summary; in this case it is "ok". What is inside of a newly created dataset? We list its contents with ls.'
run '### Code snippet 2
cd DataLad-101
ls # ls does not show any output, because the dataset is empty.'
say 'GIT LOG, SHASUM, MESSAGE: A dataset is version controlled. This means that edits to a file are recorded together with information about the change, the author, and the time, and that previous states of the dataset can be restored. Let'"'"'s take a look at the history, even though it is still small at this point.'
run '### Code snippet 3
git log'
say 'DataLad, git-annex, and Git create hidden files and directories in your dataset. Make sure not to delete them!'
run '### Code snippet 4
ls -a # show also hidden files'
say 'The dataset is empty, so let'"'"'s put some PDFs inside. First, create a directory to store them in:'
run '### Code snippet 5
mkdir books'
say 'The tree command shows us the directory structure of the dataset. Apart from the new directory, it is empty.'
run '### Code snippet 6
tree'
say 'We use wget to download a few books from the web. Careful: this is a longish real command!'
run '### Code snippet 7
cd books && wget -nv https://sourceforge.net/projects/linuxcommand/files/TLCL/19.01/TLCL-19.01.pdf/download -O TLCL.pdf && wget -nv https://www.gitbook.com/download/pdf/book/swaroopch/byte-of-python -O byte-of-python.pdf && cd ../'
say 'Here they are:'
run '### Code snippet 8
tree'
say 'What has happened to our dataset with this new content? We can use datalad status to find out:'
run '### Code snippet 9
datalad status'
say 'At the moment the files are untracked and thus unknown to any version control system. In order to version control the PDFs, we need to save them. We attach a meaningful summary of the change with the -m option:'
run '### Code snippet 10
datalad save -m "add books on Python and Unix to read later"'
say 'The save command reports what has been added to the dataset. Now we can see what this action looks like in our dataset'"'"'s history:'
run '### Code snippet 11
git log -p -n 1'
say 'It is inconvenient that we saved two books together - we should have saved them as independent modifications of the dataset. To see how single modifications can be saved, let'"'"'s download another book.'
run '### Code snippet 12
cd books && wget -nv https://github.com/progit/progit2/releases/download/2.1.154/progit.pdf && cd ../'
say 'Check the dataset state with the status command frequently.'
run '### Code snippet 13
datalad status'
say 'To save a single modification, provide a path to it!'
run '### Code snippet 14
datalad save -m "add reference book about git" books/progit.pdf'
say 'Let'"'"'s view the growing history (concise with the --oneline option):'
run '### Code snippet 15
# make the output a bit more concise with the --oneline option
git log --oneline'
say 'Let'"'"'s find out how we can modify files in a dataset. Let'"'"'s create a text file with notes about the DataLad commands we learned. The cat << EOT construct is a here document: everything up to the closing EOT marker is written into notes.txt.'
run '### Code snippet 16
cat << EOT > notes.txt
One can create a new dataset with '"'"'datalad create [--description] PATH'"'"'.
The dataset is created empty.
EOT'
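say 'Optionally, we can display the file to verify what the here document wrote into it. This is just an extra sanity check and assumes we are still in the root of DataLad-101, where notes.txt was created.'
run '### optional check
# assumes the current directory is the root of DataLad-101
cat notes.txt'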
say 'As expected, there is a new file in the dataset. At first the file is untracked. We can save it without a path specification because it is the only existing modification.'
run '### Code snippet 17
datalad status'
run '### Code snippet 18
datalad save -m "Add notes on datalad create"'
say 'Now let'"'"'s start to modify this text file by adding more notes to it. Think of this as a code file that you keep adding functions to:'
run '### Code snippet 19
cat << EOT >> notes.txt
The command "datalad save [-m] PATH" saves the file (modifications) to history.
Note to self: Always use informative, concise commit messages.
EOT'
run '### Code snippet 20
datalad status'
say 'The modification can simply be saved as well.'
run '### Code snippet 21
datalad save -m "add note on datalad save"'
say 'And the history gives an accurate record of what happened to this file.'
run '### Code snippet 22
git log -p -n 2'
say 'ADINA DO A PAUSE HERE!!!'
say 'The next challenge is to install an existing dataset from the web as a subdataset. First, we create a location for it.'
run '### Code snippet 23
# we are in the root of DataLad-101
mkdir recordings'
say 'We need to install the dataset as a subdataset. For this, we use the datalad install command with the --dataset (-d) and --source (-s) options as well as a path. Without the -d option, the dataset would not be registered as a subdataset!'
run '### Code snippet 24
datalad install -d . -s https://github.com/datalad-datasets/longnow-podcasts.git recordings/longnow'
say 'Let'"'"'s take a look at the directory structure after the installation.'
run '### Code snippet 25
tree -d # we limit the output to directories'
say 'And now let'"'"'s look into one of these seminar series folders: There are hundreds of mp3 files, yet the download only took a few seconds! How can that be?'
run '### Code snippet 26
cd recordings/longnow/Long_Now__Seminars_About_Long_term_Thinking
ls'
say 'Upon installation of a DataLad dataset, DataLad retrieves only small files and metadata. Therefore the dataset is tiny in size, and for now the large files are not functional - try opening one.'
run '### Code snippet 27
cd ../ # in longnow/
du -sh # Unix command to show size of contents'
say 'But how large would the dataset be if we had all the content?'
run '### Code snippet 28
datalad status --annex'
say 'Now let'"'"'s finally get some content into this dataset. This is done with the datalad get command.'
run '### Code snippet 29
datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3'
say 'datalad status can also summarize how much of the content is already present locally:'
run '### Code snippet 30
datalad status --annex all'
say 'Let'"'"'s get a few more files. Note how already obtained files are not downloaded again:'
run '### Code snippet 31
datalad get Long_Now__Seminars_About_Long_term_Thinking/2003_11_15__Brian_Eno__The_Long_Now.mp3 \
  Long_Now__Seminars_About_Long_term_Thinking/2003_12_13__Peter_Schwartz__The_Art_Of_The_Really_Long_View.mp3 \
  Long_Now__Seminars_About_Long_term_Thinking/2004_01_10__George_Dyson__There_s_Plenty_of_Room_at_the_Top__Long_term_Thinking_About_Large_scale_Computing.mp3'
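say 'Optionally, we can repeat the earlier du -sh size check to see how much space the files we just obtained take up locally, compared to what we saw before getting any data. This assumes we are still in the root of the longnow subdataset.'
run '### optional check
# assumes the current directory is still recordings/longnow
du -sh'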
say 'On dataset nesting: You have seen the history of DataLad-101. But the subdataset has a standalone history as well! We can find out who created it.'
run '### Code snippet 32
git log --reverse'
say 'We can make a note about this:'
run '### Code snippet 33
# in the root of DataLad-101:
cd ../../
cat << EOT >> notes.txt
The command '"'"'datalad install [--source] PATH'"'"' installs a dataset from e.g., a URL or a path.
If you install a dataset into an existing dataset (as a subdataset), remember to specify the root of the superdataset with the '"'"'-d'"'"' option.
EOT
datalad save -m "Add note on datalad install"'
say 'The superdataset only stores the version (the commit shasum) of the subdataset. Let'"'"'s take a look at what the superdataset'"'"'s history looks like.'
run '### Code snippet 34
git log -p -n 2'
say 'We can find this shasum in the subdataset'"'"'s history: it'"'"'s the most recent change.'
run '### Code snippet 35
cd recordings/longnow
git log --oneline'
run '### Code snippet 36
cd ../../'
say 'Let'"'"'s create a data analysis project with the yoda procedure.'
run '### Code snippet 123
# inside of DataLad-101
datalad create -c yoda --dataset . midterm_project'
say 'Now install input data as a subdataset.'
run '### Code snippet 124
cd midterm_project
# we are in midterm_project, thus -d . points to the root of it.
datalad install -d . --source https://github.com/datalad-handbook/iris_data.git input/'
say 'Here is what the directory structure looks like now.'
run '### Code snippet 125
cd ../
tree -d
cd midterm_project'
say 'Let'"'"'s create code for an analysis.'
run '### Code snippet 126
cat << EOT > code/script.py
import pandas as pd
import seaborn as sns
import datalad.api as dl
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

data = "input/iris.csv"

# make sure that the data is obtained (get will also install linked sub-ds!):
dl.get(data)

# prepare the data as a pandas dataframe
df = pd.read_csv(data)
attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
df.columns = attributes

# create a pairplot to plot pairwise relationships in the dataset
plot = sns.pairplot(df, hue='"'"'class'"'"')
plot.savefig('"'"'pairwise_relationships.png'"'"')

# perform a K-nearest-neighbours classification with scikit-learn
# Step 1: split data in test and training dataset (20:80)
array = df.values
X = array[:,0:4]
Y = array[:,4]
test_size = 0.20
seed = 7
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)

# Step 2: fit the model and make predictions on the test dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_test)

# Step 3: save the classification report
report = classification_report(Y_test, predictions, output_dict=True)
df_report = pd.DataFrame(report).transpose().to_csv('"'"'prediction_report.csv'"'"')
EOT'
say 'datalad status will show a new file.'
run '### Code snippet 127
datalad status'
say 'Save the analysis to the history.'
run '### Code snippet 128
datalad save -m "add script for kNN classification and plotting" --version-tag ready4analysis code/script.py'
say 'The datalad run command can execute a command reproducibly: it records the command, its inputs, and its outputs in the dataset history.'
run '### Code snippet 129
datalad run -m "analyze iris data with classification analysis" \
  --input "input/iris.csv" \
  --output "prediction_report.csv" \
  --output "pairwise_relationships.png" \
  "python3 code/script.py"'
say 'Let'"'"'s take a look at the history.'
run '### Code snippet 130
git log --oneline'
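say 'Optionally, we can look at the most recent commit in full: datalad run stores a machine-readable record of the executed command together with its inputs and outputs in that commit.'
run '### optional check
# inspect the run record in the last commit
git log -p -n 1'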
say 'Finally, let'"'"'s create some human-readable information about the project.'
run '### Code snippet 131
# with the >| redirection we are replacing existing contents in the file
cat << EOT >| README.md
# Midterm YODA Data Analysis Project

## Dataset structure

- All inputs (i.e. building blocks from other sources) are located in input/.
- All custom code is located in code/.
- All results (i.e., generated files) are located in the root of the dataset:
  - "prediction_report.csv" contains the main classification metrics.
  - "pairwise_relationships.png" is a plot of the relations between features.
EOT'
say 'The README file is now modified.'
run '### Code snippet 132
datalad status'
say 'Let'"'"'s save this change.'
run '### Code snippet 133
datalad save -m "Provide project description" README.md'
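say 'Optionally, a final datalad status can confirm that everything has been saved and the dataset is in a clean state again.'
run '### optional check
datalad status'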