{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "99af7aa1", "metadata": {}, "source": [ "# Getting started with `musif`\n", "\n", "[Download the Getting started tutorial Jupyter notebook here](https://raw.githubusercontent.com/DIDONEproject/musif/main/docs/source/Tutorial.ipynb)\n", "\n", "\n", "`musif` is a Python library for analyzing music scores. It is a tool for extracting features at scale from MusicXML and MuseScore files.\n", "\n", "`musif` was born in the context of the [ERC Project \"DIDONE\"](https://didone.eu/) and, consequently,\n", "it is specialized in 18th-century Italian opera arias. However, it can also work with other repertoires.\n", "\n", "This tutorial is an introduction for people who are not experts in programming. If you are already an expert, just skip to the [Data Section](#data) and then go to the [Advanced Tutorial](https://musif.didone.eu/Tutorial_20poprock.html).\n", "\n", "\n", "## Installation\n", "\n", "First, you should install [`Python`](https://www.python.org/downloads/) >= 3.10. An easy way to do this is by using [`Anaconda`](https://www.anaconda.com/products/distribution), especially if you are not used to the command-line interface.\n", "Once you have installed `anaconda`:\n", "1. Launch the `anaconda-navigator`\n", "2. [Create an environment](https://docs.anaconda.com/navigator/getting-started/#managing-environments) selecting python version >= 3.10\n", "3. Switch to the newly created environment by clicking on its name\n", "\n", "\n", "To install `musif`:\n", "1. [Download this notebook](https://raw.githubusercontent.com/DIDONEproject/musif/main/docs/source/Tutorial.ipynb).\n", "2. Start `jupyter` in your Anaconda environment.\n", "3. Open this tutorial.\n", "4. Run the following cell by clicking on it and pressing Ctrl+Enter." ] }, { "attachments": {}, "cell_type": "markdown", "id": "91eac574", "metadata": {}, "source": [ "In the installation cell below, the `!` prefix runs the rest of the line as a command in the terminal. 
After running it, you may need to restart the notebook (click the circular arrow ↻ in the top bar, near the icons ▶ and ⏹)\n", "\n", "To run this tutorial:\n", "1. In the `Home` tab of the `anaconda-navigator`, select \"All applications\" and the newly created environment in the options at the top.\n", "2. Click on `Install`, next to the `Jupyter` icon\n", "3. Once installed, click on `launch` next to the `notebook` icon; a web interface will open in the browser\n", "4. [Download](https://raw.githubusercontent.com/DIDONEproject/musif/main/docs/source/Tutorial.ipynb) this notebook by right-clicking the link and selecting \"save as...\"\n", "5. Navigate to the downloaded file from the web interface and open it\n", "6. Run the following cell by clicking on it and pressing Ctrl+Enter " ] }, { "cell_type": "code", "execution_count": null, "id": "a7710973", "metadata": {}, "outputs": [], "source": [ "! pip install musif" ] }, { "cell_type": "code", "execution_count": 2, "id": "9a1257d3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Version: 1.2.3\n" ] } ], "source": [ "import musif\n", "print('Version: ', musif.__version__)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "01cd214c", "metadata": {}, "source": [ "## Introduction\n", "\n", "If you are new to Python, we suggest reading an introductory tutorial, for instance, [this one](https://www.w3schools.com/python/default.asp). \n", "\n", "In the following, we introduce some technical terminology that may help you understand technical documentation while working with `musif`:\n", "\n", "* A _function_ is a way to represent code that is convenient for humans. You can think of functions as mathematical functions, with some input and some output. 
Some programming languages call them _procedures_; Python does not, but the name captures what functions ultimately are: sequences of commands that the computer has to execute.\n", "\n", "* An _object_ is a computational way to represent information _and_ code in the memory of a computer; you can think of objects as concepts from the real world: objects have properties (in Python usually named _attributes_ or _fields_) and functionalities (named _methods_). For instance, an object could be a vehicle, which has some properties (length, maximum speed, number of wheels) and some functionalities (accelerate, decelerate, stop). Objects can also have specializations (named _children_ or _subclasses_): in our example, one _child_ of vehicle could be the car and another could be the bike: they have different properties and apply the functionalities in different ways. The vehicle, the car, and the bike may all have instances: the car that you use every day to go to work is different from your friend's even if they have exactly the same properties, because they are two different concrete objects. Technically, those two cars are two _instances_ of the same _class_. To create an instance, you call a function, generically named _constructor_, which takes the properties of the new object as arguments and returns the instance. In Python, the constructor is called by using the name of the class itself. To use `musif`, you don't need to know a lot about objects, but a little knowledge helps when you search the web.\n", "\n", "* A _DataFrame_ is another way to represent information for computers. DataFrames are designed to be extremely efficient, even if some aspects of the information can sometimes get lost. They are mainly used for data science problems. You can think of a _DataFrame_ as a table, with rows and columns. Usually, rows are _instances_ while columns are _properties_. In data science, these words often become _samples_ and _features_/_variables_. 
A typical operation is to select only certain columns (properties) or only certain rows (instances), either to extract a subset of the data or to modify the data itself.\n", "\n", "* Don't be scared to use web search engines such as Google: searching the web in a proper way is one of the most important skills a programmer has!\n", "\n", "### Main objects\n", "\n", "When using `musif`, you will usually interface with two objects:\n", "1. [`FeaturesExtractor()`](API/musif.extract.html#musif.extract.extract.FeaturesExtractor), which reads music scores and computes a DataFrame containing all the extracted features. In the simplest case, each row represents a music score, while each column represents a feature.\n", "2. [`DataProcessor()`](API/musif.process.html#musif.process.processor.DataProcessor), which takes the DataFrame with all the features in it and post-processes it to clean, improve, and possibly modify some of the features.\n", "\n", "These two objects take as input two different configurations that modify their behavior. In other words, the constructors of `FeaturesExtractor` and `DataProcessor` can accept a wide range of arguments.\n", "\n", "But let's proceed step by step!" ] }, { "cell_type": "code", "execution_count": 3, "id": "b8119d50", "metadata": {}, "outputs": [], "source": [ "import urllib.request\n", "import zipfile\n", "from pathlib import Path\n", "\n", "# download the example dataset and unpack it into the `data` folder\n", "data_dir = Path(\"data\")\n", "dataset_path = \"dataset.zip\"\n", "urllib.request.urlretrieve(\"https://zenodo.org/record/4027957/files/AnatomyComposerAttributionMIDIFilesAndFeatureData_1_0.zip?download=1\", dataset_path)\n", "with zipfile.ZipFile(dataset_path, 'r') as zip_ref:\n", "    zip_ref.extractall(data_dir)\n", "# point `data_dir` to the folder that contains the MIDI files\n", "data_dir = data_dir / Path('AnatomyComposerAttributionMIDIFilesAndFeatureData_1_0') / Path('MIDI/')\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cab6f329", "metadata": {}, "source": [ "## Configuration\n", "\n", "Let's create a configuration for our experiment. 
Configurations can be expressed using a `yaml` file or with key-value arguments. `yaml` files are designed for complex projects, while key-value arguments are perfect for simple situations like this.\n", "\n", "Key-value arguments are similar to a dictionary: there is a _key_, which must be unique in the dictionary; each _key_ is associated with a _value_, which can be repeated. Python can retrieve a value using its key in a very efficient way!\n", "\n", "First, we'll need to import the class that describes a configuration:" ] }, { "cell_type": "code", "execution_count": 4, "id": "7fe4511f", "metadata": {}, "outputs": [], "source": [ "from musif.config import ExtractConfiguration\n", "\n", "config = ExtractConfiguration(\n", "    None,\n", "    data_dir = data_dir,\n", "    basic_modules=[\"scoring\"],\n", "    features = [\"core\", \"ambitus\", \"melody\", \"tempo\", \n", "                \"density\", \"texture\", \"lyrics\", \"scale\", \n", "                \"key\", \"dynamics\", \"rhythm\"],\n", "    parallel = -1 # use > 1 if you wish to use parallelization (runs faster, uses more memory)\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b7c511bb", "metadata": {}, "source": [ "In the cell above, we also called the constructor of this class to obtain a configuration object, `config`." ] }, { "attachments": {}, "cell_type": "markdown", "id": "64a59048", "metadata": {}, "source": [ "## Feature extraction\n", "\n", "Now that we have our configuration, we pass it to the function that creates `FeaturesExtractor` objects. This function has exactly the same name as the class, `FeaturesExtractor`:" ] }, { "cell_type": "code", "execution_count": 5, "id": "b900810d", "metadata": {}, "outputs": [], "source": [ "from musif.extract.extract import FeaturesExtractor\n", "extractor = FeaturesExtractor(config)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "343373ac", "metadata": {}, "source": [ "Before starting the extraction, we also need to tell MuseScore the type of files it should look for. 
In this case, we want it to look for files with extension `'.mid'`. By default, it would look for `.mscx` files, so we need to change it:" ] }, { "attachments": {}, "cell_type": "markdown", "id": "756f12ba", "metadata": {}, "source": [ "Now, we can start the extraction using the method `extract`. It will return a `DataFrame`:" ] }, { "cell_type": "code", "execution_count": null, "id": "e286b65b", "metadata": {}, "outputs": [], "source": [ "df = extractor.extract()" ] }, { "cell_type": "code", "execution_count": 7, "id": "da54809a", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape df: (175, 927)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " FamilyGen_Density FamilyGen_Notes FamilyGen_NotesMean \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " FamilyGen_NumberOfFilteredParts FamilyGen_NumberOfParts \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " FamilyGen_SoundingDensity FamilyGen_SoundingMeasures \\\n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " FamilyGen_SoundingMeasuresMean FamilyInstrumentation FamilyScoring ... \\\n", "0 ww ww ... \n", "1 ww ww ... \n", "2 ww ww ... \n", "3 ww ww ... \n", "4 ww ww ... \n", "\n", " SoundFl_TrimmedIntervallicMean SoundFl_TrimmedIntervallicStd \\\n", "0 -0.113402 1.468171 \n", "1 -0.165948 1.620333 \n", "2 -0.106242 1.634416 \n", "3 -0.095768 1.578589 \n", "4 -0.073604 1.623796 \n", "\n", " SoundScoring Tempo TempoGrouped1 TempoGrouped2 TimeSignature \\\n", "0 fl None 2/1 \n", "1 fl None 2/1 \n", "2 fl None 2/1 \n", "3 fl None 2/1 \n", "4 fl None 2/1 \n", "\n", " TimeSignatureGrouped Voices WindowId \n", "0 other 0 \n", "1 other 0 \n", "2 other 0 \n", "3 other 0 \n", "4 other 0 \n", "\n", "[5 rows x 927 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('Shape df: ', df.shape)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 8, "id": "84672e45", "metadata": {}, "outputs": [], "source": [ "df.set_index('Id', inplace=True)\n", "df.drop(['level_0', 'index'], axis=1, errors = 'ignore', inplace = True)" ] }, { "cell_type": "code", "execution_count": 9, "id": "c6a2720e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape df: (175, 926)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " FamilyGen_Density FamilyGen_Notes FamilyGen_NotesMean \\\n", "Id \n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " FamilyGen_NumberOfFilteredParts FamilyGen_NumberOfParts \\\n", "Id \n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " FamilyGen_SoundingDensity FamilyGen_SoundingMeasures \\\n", "Id \n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", "\n", " FamilyGen_SoundingMeasuresMean FamilyInstrumentation FamilyScoring ... \\\n", "Id ... \n", "0 ww ww ... \n", "1 ww ww ... \n", "2 ww ww ... \n", "3 ww ww ... \n", "4 ww ww ... \n", "\n", " SoundFl_TrimmedIntervallicMean SoundFl_TrimmedIntervallicStd \\\n", "Id \n", "0 -0.113402 1.468171 \n", "1 -0.165948 1.620333 \n", "2 -0.106242 1.634416 \n", "3 -0.095768 1.578589 \n", "4 -0.073604 1.623796 \n", "\n", " SoundScoring Tempo TempoGrouped1 TempoGrouped2 TimeSignature \\\n", "Id \n", "0 fl None 2/1 \n", "1 fl None 2/1 \n", "2 fl None 2/1 \n", "3 fl None 2/1 \n", "4 fl None 2/1 \n", "\n", " TimeSignatureGrouped Voices WindowId \n", "Id \n", "0 other 0 \n", "1 other 0 \n", "2 other 0 \n", "3 other 0 \n", "4 other 0 \n", "\n", "[5 rows x 926 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('Shape df: ', df.shape)\n", "df.head()" ] }, { "cell_type": "markdown", "id": "fc3daafc", "metadata": {}, "source": [ "### Finding features\n", "Yay! We've successfully used `musif` to extract the desired features from the scores. All of these features are now stored in the `df` variable, and we can access them from there.\n", "How can we find certain features? All types of features and their corresponding definitions can be found at https://musif.didone.eu/Feature_definition.html. 
There you can find each type of feature along with a regular expression (_RegEx_) and a brief explanation stating what that set of features describes. To find a specific set of features based on its regex:" ] }, { "cell_type": "code", "execution_count": 10, "id": "88976ce7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PartFlI_IntervalA-2_Per PartFlI_IntervalA1_Per PartFlI_IntervalA2_Per \\\n", "Id \n", "0 \n", "1 \n", "2 \n", "3 \n", "4 \n", ".. ... ... ... \n", "170 \n", "171 \n", "172 \n", "173 \n", "174 \n", "\n", " PartFlI_IntervalA3_Per PartFlI_IntervalA4_Per PartFlI_IntervalM-2_Per \\\n", "Id \n", "0 0.114943 \n", "1 0.188312 \n", "2 0.245713 \n", "3 0.137931 \n", "4 0.190211 \n", ".. ... ... ... \n", "170 0.214841 \n", "171 0.192255 \n", "172 0.196581 \n", "173 0.254902 \n", "174 0.214286 \n", "\n", " PartFlI_IntervalM-3_Per PartFlI_IntervalM-6_Per \\\n", "Id \n", "0 0.034483 \n", "1 0.045455 \n", "2 0.044699 \n", "3 0.062069 \n", "4 0.055034 \n", ".. ... ... \n", "170 0.02509 \n", "171 0.048058 \n", "172 0.025641 \n", "173 0.039216 \n", "174 0.038961 \n", "\n", " PartFlI_IntervalM-7_Per PartFlI_IntervalM-9_Per ... \\\n", "Id ... \n", "0 ... \n", "1 ... \n", "2 ... \n", "3 ... \n", "4 ... \n", ".. ... ... ... \n", "170 ... \n", "171 ... \n", "172 ... \n", "173 ... \n", "174 ... \n", "\n", " PartFlI_IntervalsMajorDesc_Per PartFlI_IntervalsMinorAll_Per \\\n", "Id \n", "0 0.172837 0.068966 \n", "1 0.218958 0.162338 \n", "2 0.282851 0.285156 \n", "3 0.191059 0.236842 \n", "4 0.249377 0.190083 \n", ".. ... ... \n", "170 0.243784 0.254902 \n", "171 0.243046 0.275132 \n", "172 0.24005 0.220588 \n", "173 0.281809 0.19697 \n", "174 0.243536 0.220779 \n", "\n", " PartFlI_IntervalsMinorAsc_Per PartFlI_IntervalsMinorDesc_Per \\\n", "Id \n", "0 0.034483 0.034483 \n", "1 0.045455 0.116883 \n", "2 0.109375 0.15 \n", "3 0.075862 0.122807 \n", "4 0.057851 0.132231 \n", ".. ... ... \n", "170 0.108696 0.137255 \n", "171 0.126984 0.142857 \n", "172 0.101695 0.117647 \n", "173 0.090909 0.106061 \n", "174 0.077922 0.133333 \n", "\n", " PartFlI_IntervalsPerfectAll_Per PartFlI_IntervalsPerfectAsc_Per \\\n", "Id \n", "0 0.512123 0.106447 \n", "1 0.372618 0.105962 \n", "2 0.189659 0.045257 \n", "3 0.375296 0.107478 \n", "4 0.273048 0.084658 \n", ".. ... ... 
\n", "170 0.282986 0.070828 \n", "171 0.189175 0.064503 \n", "172 0.244157 0.031732 \n", "173 0.226009 0.051943 \n", "174 0.313252 0.084427 \n", "\n", " PartFlI_IntervalsPerfectDesc_Per PartFlI_IntervalsWithinOctaveAll_Per \\\n", "Id \n", "0 0.101868 1.0 \n", "1 0.062201 1.0 \n", "2 0.024475 1.0 \n", "3 0.068547 1.0 \n", "4 0.039822 1.0 \n", ".. ... ... \n", "170 0.051396 1.0 \n", "171 0.047243 1.0 \n", "172 0.034152 1.0 \n", "173 0.031067 1.0 \n", "174 0.049106 1.0 \n", "\n", " PartFlI_IntervalsWithinOctaveAsc_Per \\\n", "Id \n", "0 0.324497 \n", "1 0.360532 \n", "2 0.404641 \n", "3 0.373753 \n", "4 0.415049 \n", ".. ... \n", "170 0.39817 \n", "171 0.479253 \n", "172 0.417226 \n", "173 0.408321 \n", "174 0.374877 \n", "\n", " PartFlI_IntervalsWithinOctaveDesc_Per \n", "Id \n", "0 0.371695 \n", "1 0.435014 \n", "2 0.475432 \n", "3 0.426976 \n", "4 0.436383 \n", ".. ... \n", "170 0.441068 \n", "171 0.443318 \n", "172 0.404501 \n", "173 0.44868 \n", "174 0.445405 \n", "\n", "[175 rows x 66 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.filter(regex='Part.+_Interval.+_Per')" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9b29fedc", "metadata": {}, "source": [ "## Post-processing\n", "\n", "Most of the features we have computed actually need some post-processing, for instance to replace `NaN` with 0, merge columns, or remove features created while computing other features.\n", "\n", "For this, we need a further step. In the next cells we will:\n", "1. Instantiate a `DataProcessor` object using:\n", "    * the generated DataFrame\n", "    * a post-processing configuration, here the example `yaml` file (`None` can be passed in place of the yaml file/configuration object to use the default configuration)\n", "2. Call the method `process()` of that object to start the post-processing of the features\n", "3. Retrieve the post-processed data from the field `data`\n", "4. Print the size of the DataFrame." 
] }, { "cell_type": "code", "execution_count": null, "id": "926bc78a", "metadata": {}, "outputs": [], "source": [ "try:\n", "    import google.colab\n", "    IN_COLAB = True\n", "except ImportError:\n", "    IN_COLAB = False\n", "\n", "# Check if in colab\n", "if IN_COLAB:\n", "    print('in colab')\n", "    import urllib.request\n", "    # Replace with the raw URL of the YAML file on GitHub\n", "    github_url = \"https://raw.githubusercontent.com/DIDONEproject/musif/main/config_postprocess_example.yml\" \n", "    # Replace with the desired local file name\n", "    local_file_name = \"config_postprocess_example.yml\" \n", "    urllib.request.urlretrieve(github_url, local_file_name)\n", "    print(f\"File downloaded to: {local_file_name}\")\n", "else:\n", "    local_file_name = \"../../config_postprocess_example.yml\" \n" ] }, { "cell_type": "code", "execution_count": 11, "id": "35c706fa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[1;37m\n", "Post-processing data...\u001b[0m\n" ] } ], "source": [ "from musif.process.processor import DataProcessor\n", "\n", "# use the configuration file located by the previous cell\n", "processed_df = DataProcessor(df, local_file_name).process().data\n" ] }, { "cell_type": "code", "execution_count": 12, "id": "7df5c5aa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Shape post-processed df: (175, 585)\n" ] } ], "source": [ "# with `.shape` you can see the number of rows and columns of the dataframe\n", "print('Shape post-processed df: ', processed_df.shape)" ] }, { "cell_type": "markdown", "id": "64d30975", "metadata": {}, "source": [ "Now we have `processed_df`, a DataFrame that synthesizes the extracted information by deleting and/or combining columns of the extracted `df`. We can adjust and modify post-processing parameters by changing the `yaml` file or by inserting key-value arguments. 
That way we can choose which columns are deleted or modified before using `processed_df` in further experiments." ] }, { "cell_type": "markdown", "id": "ebaa76ac", "metadata": {}, "source": [] }, { "attachments": {}, "cell_type": "markdown", "id": "7cc69508", "metadata": {}, "source": [ "## Statistical processing\n", "\n", "Let's try to analyze the features statistically. We will set up a feature-learning approach with an autoencoder architecture.\n", "\n", "For this, we will use `sklearn` and its Multilayer Perceptron, so you will need to [install](https://docs.anaconda.com/navigator/getting-started/#managing-packages) the `scikit-learn` and `seaborn` packages in your anaconda environment.\n", "\n", "In the next cell, the topic becomes a little more technical, but it's just an example to show that you can use this DataFrame for statistical analysis. We will first remove redundant information (the `FileName` and the `Id` columns that were automatically assigned by the `FeaturesExtractor`). \n", "\n", "Then, we will create a model which:\n", "1. Assigns a number to each feature that has strings as values (`OrdinalEncoder`).\n", "2. Standardizes the features to get comparable values.\n", "3. Trains a simple feed-forward fully connected autoencoder with ReLU activations and the L-BFGS optimizer.\n", "\n", "The objective is to learn a 2D space where the extracted features can be represented with as little information loss as possible." 
] }, { "cell_type": "code", "execution_count": null, "id": "f3018334", "metadata": {}, "outputs": [], "source": [ "!pip install scikit-learn seaborn" ] }, { "cell_type": "code", "execution_count": 14, "id": "82adba52", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean Absolute Error: 0.49524726832323923\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/opt/anaconda3/envs/musif_tutorials/lib/python3.10/site-packages/sklearn/neural_network/_multilayer_perceptron.py:545: ConvergenceWarning: lbfgs failed to converge (status=1):\n", "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n", "\n", "Increase the number of iterations (max_iter) or scale the data as shown in:\n", " https://scikit-learn.org/stable/modules/preprocessing.html\n", " self.n_iter_ = _check_optimize_result(\"lbfgs\", opt_res, self.max_iter)\n" ] } ], "source": [ "from sklearn.neural_network import MLPRegressor\n", "from sklearn.preprocessing import StandardScaler, OrdinalEncoder\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.metrics import mean_absolute_error\n", "\n", "# removing FileName and Id\n", "if 'FileName' in processed_df:\n", "    del processed_df['FileName']\n", "if 'Id' in processed_df:\n", "    del processed_df['Id']\n", "\n", "preprocessor = make_pipeline(\n", "    OrdinalEncoder(), # encode each column's values as integer category codes\n", "    StandardScaler(), # subtract the mean and scale to unit variance\n", ")\n", "\n", "model = make_pipeline(\n", "    preprocessor,\n", "    MLPRegressor( \n", "        # 2-unit bottleneck in the middle; the output layer is added automatically and sized from the targets\n", "        hidden_layer_sizes=(128, 32, 8, 2, 8, 32, 128, 396),\n", "        activation=\"relu\",\n", "        solver=\"lbfgs\",\n", "        max_iter=100,\n", "        tol=0.1,\n", "        random_state=934,\n", "        max_fun=10**6\n", "        # shuffle=True \n", "    )\n", ")\n", "\n", "# the autoencoder is trained to reconstruct its own (preprocessed) input\n", "y_true = preprocessor.fit_transform(processed_df)\n", "\n", "# the next call will take some time...\n", "model.fit(processed_df, y_true)\n", "y_hat = 
model.predict(processed_df)\n", "print(f\"Mean Absolute Error: {mean_absolute_error(y_true, y_hat)}\")" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f75abded", "metadata": {}, "source": [ "Now, we will attach a method `transform` to the `MLPRegressor` which returns the activations of the inner layer with 2 outputs, which we interpret as latent features. To compare the features, we will scale them to [0, 1].\n", "\n", "Then, we plot the music scores according to the learned feature space." ] }, { "cell_type": "code", "execution_count": 15, "id": "278cdf51", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(175, 2)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import MinMaxScaler\n", "\n", "mlpregressor = model['mlpregressor']\n", "\n", "def mytransform_method(X):\n", "    activations = [None for _ in range(mlpregressor.n_layers_)]\n", "    activations[0] = X\n", "    # index -6 selects the activations of the 2-unit bottleneck layer\n", "    X = mlpregressor._forward_pass(activations)[-6]\n", "    return MinMaxScaler().fit_transform(X)\n", "\n", "mlpregressor.transform = mytransform_method\n", "\n", "learned_features = model.transform(processed_df)\n", "\n", "learned_features.shape" ] }, { "cell_type": "code", "execution_count": 16, "id": "9e4a575f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 413, "width": 547 } }, "output_type": "display_data" } ], "source": [ "import seaborn\n", "seaborn.scatterplot(x=learned_features[:, 0], y=learned_features[:, 1])" ] }, { "cell_type": "markdown", "id": "c1bab95d", "metadata": {}, "source": [ "For comparison, let's plot the 2D features learned by a standard PCA (the final scaler is added so that the result can be compared with the autoencoder):" ] }, { "cell_type": "code", "execution_count": 17, "id": "00e64c38", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 413, "width": 547 } }, "output_type": "display_data" } ], "source": [ "from sklearn.decomposition import PCA\n", "\n", "pca_pipeline = make_pipeline(\n", " preprocessor, PCA(2), MinMaxScaler()\n", ")\n", "data_pca = pca_pipeline.fit_transform(processed_df)\n", "ax = seaborn.scatterplot(x=data_pca[:, 0], y=data_pca[:, 1])\n" ] } ], "metadata": { "celltoolbar": "Edit Metadata", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.8" }, "vscode": { "interpreter": { "hash": "3fac8f976d2c4ee2bc715215bb95954c479b3e9d5186e3f6df5c366b6270a4af" } } }, "nbformat": 4, "nbformat_minor": 5 }