{ "cells": [ { "cell_type": "markdown", "id": "262c9f37-85ab-4148-b7d0-085a72540699", "metadata": {}, "source": [ "In this example we ask workers to record given texts via voice recorder.\n", "\n", "This is annotation task, since there are an unlimited number of options for recording on a single text.\n", "\n", "This example has two features in addition to usual annotation task:\n", "- We need some place to store files with worker's audio recordings. Crowdom uses S3 object storage for this purpose, so you will be asked to specify your storage – it's endpoint, bucket, path and credentials.\n", "- (_experimental feature_) Worker's recordings are checked by the ASR model, and not by other workers, as in the usual case. We recognize worker recordings by the ASR and compare the given transcript with the source text. If the distance between these two texts is too big, we consider task corresponding to the source text performed incorrectly.\n", "\n", "You may want to first study [audio transcript](../audio_transcript/audio_transcript.ipynb) example because it contains more detailed description of annotation tasks pipeline. " ] }, { "cell_type": "markdown", "id": "679d0511-9124-4798-acab-2f59fdbbf150", "metadata": {}, "source": [ "# Setup" ] }, { "cell_type": "code", "execution_count": null, "id": "dd492869-e62e-4db2-a2ec-6c2f748c31f0", "metadata": {}, "outputs": [], "source": [ "from datetime import timedelta\n", "import os\n", "import pandas as pd\n", "from typing import List, Tuple\n", "\n", "import toloka.client as toloka\n", "\n", "from crowdom import base, datasource, client, objects, pricing, params as labeling_params" ] }, { "cell_type": "code", "execution_count": 2, "id": "eaf4a3d8-cfef-4915-a4aa-51763c483f4d", "metadata": {}, "outputs": [], "source": [ "import yaml\n", "import logging.config" ] }, { "cell_type": "code", "execution_count": 37, "id": "0b1204e0-390f-4e39-99c4-385844303a12", "metadata": {}, "outputs": [], "source": [ "with open('logging.yaml') as f:\n", " logging.config.dictConfig(yaml.full_load(f.read()))" ] }, { "cell_type": "code", "execution_count": 4, "id": "922702ba-b407-4c5a-bb48-18f6be81d4c5", "metadata": {}, "outputs": [], "source": [ "from IPython.display import clear_output, display" ] }, { "cell_type": "code", "execution_count": 5, "id": "dd4d7220-23b0-4c4e-8234-cdfd462abbf7", "metadata": {}, "outputs": [], "source": [ "token = os.getenv('TOLOKA_TOKEN') or input('Enter your token: ')\n", "clear_output()" ] }, { "cell_type": "code", "execution_count": 6, "id": "2821010b-ad50-4517-a4b3-942626ca1083", "metadata": { "tags": [] }, "outputs": [], "source": [ "toloka_client = client.create_toloka_client(token=token)" ] }, { "cell_type": "code", "execution_count": 7, "id": "a58dea4f-5055-4152-92b1-e7f598fd3769", "metadata": {}, "outputs": [], "source": [ "s3 = datasource.S3(\n", " endpoint='storage.yandexcloud.net',\n", " bucket=input('Enter your S3 bucket: '),\n", " path=input('Enter path in bucket to store audio recordings: '),\n", " access_key_id=os.getenv('AWS_ACCESS_KEY_ID') or input('Enter your AWS access key ID: '),\n", " secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY') or input('Enter your AWS secret access key: '),\n", ")\n", "clear_output()" ] }, { "cell_type": "code", "execution_count": 8, "id": "fda807bd-c182-403a-a2d7-f0d4eb51e342", "metadata": {}, "outputs": [], "source": [ "function = base.AnnotationFunction(inputs=(objects.Text,), outputs=(objects.Audio,))" ] }, { "cell_type": "code", "execution_count": 9, "id": "ed25ead5-0ff6-4ff1-abb3-bd8ba394d72b", "metadata": {}, "outputs": [], "source": [ "import markdown2" ] }, { "cell_type": "code", "execution_count": 10, "id": "cef5451e-3531-4809-9b68-4ce775f24bbf", "metadata": {}, "outputs": [], "source": [ "instruction = {}\n", "for worker_lang in ['RU']:\n", " with open(f'instruction_{worker_lang}.md') as f:\n", " instruction[worker_lang] = markdown2.markdown(f.read())" ] }, { "cell_type": "code", "execution_count": 11, "id": "0f9f2ec8-0614-492c-a46d-7d09179f6ff1", "metadata": {}, "outputs": [], "source": [ "task_spec = base.TaskSpec(\n", " id='voice-recording',\n", " function=function,\n", " name=base.LocalizedString({\n", " 'EN': 'Voice recording',\n", " 'RU': 'Запись речи на диктофон',\n", " }),\n", " description=base.LocalizedString({\n", " 'EN': 'Speak the recordings into a voice recorder',\n", " 'RU': 'Нужно наговорить записи на диктофон.',\n", " }),\n", " instruction=instruction,\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "id": "46de38e6-0b41-4383-8d89-a35e6d3ae546", "metadata": {}, "outputs": [], "source": [ "lang = 'EN'" ] }, { "cell_type": "code", "execution_count": 13, "id": "e41cdcf5-0092-4c30-9ce0-f2b9013ceb14", "metadata": {}, "outputs": [], "source": [ "task_spec_en = client.AnnotationTaskSpec(task_spec, lang)" ] }, { "cell_type": "code", "execution_count": 14, "id": "cefc0e35-c481-4fe6-bd9e-d05e6e44f089", "metadata": {}, "outputs": [], "source": [ "task_duration_hint = timedelta(seconds=10)" ] }, { "cell_type": "code", "execution_count": null, "id": "6082d190-246a-4a67-9de8-8694a6771eae", "metadata": {}, "outputs": [], "source": [ "client.define_task(task_spec_en, toloka_client)" ] }, { "cell_type": "code", "execution_count": 55, "id": "9f4414b9-397b-47f6-8bc8-79606103ba84", "metadata": {}, "outputs": [], "source": [ "input_objects = datasource.read_tasks('tasks.json', task_spec_en.task_mapping)\n", "\n", "# Checks are provided by ASR model, which is not controlled in usual way. We need at least one control task by technical reasons.\n", "control_objects = datasource.read_tasks('control_tasks.json', task_spec_en.check.task_mapping, has_solutions=True)" ] }, { "cell_type": "markdown", "id": "b2e09271-76b7-4638-a23f-0b6cc0296196", "metadata": { "tags": [] }, "source": [ "# Model definition" ] }, { "cell_type": "markdown", "id": "dc659fdd-a674-47b3-84e2-a241d1541810", "metadata": {}, "source": [ "We will use Yandex Speechkit ASR model in this example. You can use any model you want." ] }, { "cell_type": "code", "execution_count": null, "id": "54ecf6bd-fcfc-46ab-bb76-62f12d44ee21", "metadata": {}, "outputs": [], "source": [ "# TODO: publish in pypi.org\n", "%pip install -i http://pypi.yandex-team.ru/simple/ yandex-speechkit" ] }, { "cell_type": "code", "execution_count": null, "id": "04087eea-0152-4b8d-90f4-523979b00af8", "metadata": {}, "outputs": [], "source": [ "%pip install pylev" ] }, { "cell_type": "code", "execution_count": 21, "id": "fb1782f4-1116-4cc0-a39f-2b28d15897de", "metadata": {}, "outputs": [], "source": [ "from speechkit.common.utils import configure_credentials\n", "from speechkit.common import Product\n", "from speechkit import model_repository\n", "from speechkit.stt import RecognitionConfig, AudioProcessingType" ] }, { "cell_type": "code", "execution_count": 22, "id": "367a2a6f-3a5c-46a6-8864-d4cb2b4c3a97", "metadata": {}, "outputs": [], "source": [ "from IPython.display import clear_output\n", "configure_credentials(yc_ai_token=f'Api-Key {input(\"Enter your ASR model API key: \")}')\n", "clear_output()" ] }, { "cell_type": "code", "execution_count": 23, "id": "415205c6-28db-4521-8e1f-0532bc5e7d42", "metadata": {}, "outputs": [], "source": [ "model = model_repository.recognition_model(product=Product.Yandex)" ] }, { "cell_type": "code", "execution_count": 24, "id": "3ea3fd7b-d551-4645-ad6b-345059fc9ee1", "metadata": {}, "outputs": [], "source": [ "lang_map = {'EN': 'en-US', 'RU': 'ru-RU'}\n", "config = RecognitionConfig(mode=AudioProcessingType.Full, language=lang_map[lang])" ] }, { "cell_type": "code", "execution_count": 25, "id": "4a26f340-c5d4-48ae-a13d-6c9f2c66871e", "metadata": {}, "outputs": [], "source": [ "from multiprocessing.pool import ThreadPool\n", "from pydub import AudioSegment\n", "import io" ] }, { "cell_type": "code", "execution_count": 26, "id": "3492d9c1-01bd-4e67-a1fc-49fc8807dc76", "metadata": {}, "outputs": [], "source": [ "def recognize_record(s3_url) -> str:\n", " file_name = s3_url.split('/')[-1]\n", " audio_bytes = s3.client.get_object(Bucket=s3.bucket, Key=f'{s3.path}/{file_name}')['Body'].read()\n", " audio = AudioSegment.from_wav(io.BytesIO(audio_bytes))\n", " result = model.transcribe(audio, config)\n", " return ' '.join(chunk.raw_text for chunk in result)" ] }, { "cell_type": "code", "execution_count": 27, "id": "8f9924ea-e4ce-4283-aca2-751233ce829b", "metadata": {}, "outputs": [], "source": [ "import pylev" ] }, { "cell_type": "code", "execution_count": 28, "id": "faec93e2-5993-4b48-8417-f9da47e4e3f0", "metadata": {}, "outputs": [], "source": [ "def levenshtein(hypothesis: str, reference: str) -> float:\n", " return float(pylev.levenshtein(hypothesis, reference)) / max(len(hypothesis), len(reference), 1)" ] }, { "cell_type": "code", "execution_count": 57, "id": "0753e1e9-9e97-472d-a056-2192231e2b45", "metadata": {}, "outputs": [], "source": [ "logger = logging.getLogger('crowdom')" ] }, { "cell_type": "markdown", "id": "d6702e22-a2f1-4278-b04b-b22e7f7afdf2", "metadata": {}, "source": [ "Model is specified by it's name (see below) and Python function, which provides implementation of task function for a batch of source items.\n", "\n", "Since model is used for worker recordings evaluation, we transform task function\n", "\n", "```\n", "f(Text) = Audio\n", "```\n", "\n", "into it's evaluation form\n", "\n", "```\n", "f(Text, Audio) = BinaryEvaluation\n", "```\n", "\n", "If we consider worker's recordings as accurate, we return `BinaryEvaluation(True)` for it." ] }, { "cell_type": "code", "execution_count": 62, "id": "2f2ee1e9-1265-43ef-a36e-d2325cb5c9d8", "metadata": {}, "outputs": [], "source": [ "def recognize_voice_recordings(tasks: List[Tuple[objects.Text, objects.Audio]]) -> List[Tuple[base.BinaryEvaluation]]:\n", " if not tasks:\n", " return\n", " pool = ThreadPool(processes=min(len(tasks), 40))\n", " recognized_texts = pool.map(recognize_record, [audio.url for _, audio in tasks])\n", " results = []\n", " for (source_text, audio), recognized_text in zip(tasks, recognized_texts):\n", " distance = levenshtein(recognized_text, source_text.text)\n", " verdict = (base.BinaryEvaluation(distance <= 0.5),)\n", " results.append(verdict)\n", " logger.debug('\\n' + f\"\"\"\n", "audio: {audio.url}\n", "source text: {source_text.text}\n", "recognized text: {recognized_text}\n", "distance: {distance}\"\"\".strip() + '\\n')\n", " return results" ] }, { "cell_type": "code", "execution_count": 63, "id": "95d206a2-5051-431b-99df-c062e23d565c", "metadata": {}, "outputs": [], "source": [ "from crowdom import worker" ] }, { "cell_type": "code", "execution_count": 64, "id": "6c34c922-4807-4d55-9fb0-944f74f9170d", "metadata": {}, "outputs": [], "source": [ "model_worker = worker.Model(name='asr:general', func=recognize_voice_recordings)" ] }, { "cell_type": "markdown", "id": "6d248a21-2fcf-421b-ace1-8ef08aa45b2c", "metadata": {}, "source": [ "# Labeling" ] }, { "cell_type": "code", "execution_count": 66, "id": "79aff013-62e6-42e0-bd29-48f9bbe01935", "metadata": {}, "outputs": [], "source": [ "params_form = labeling_params.get_annotation_interface(\n", " task_spec=task_spec_en,\n", " check_task_duration_hint=task_duration_hint,\n", " annotation_task_duration_hint=task_duration_hint,\n", " toloka_client=toloka_client,\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "id": "db5e4013", "metadata": {}, "outputs": [], "source": [ "check_params, annotation_params = params_form.get_params()" ] }, { "cell_type": "code", "execution_count": 16, "id": "e1eeba85", "metadata": {}, "outputs": [], "source": [ "check_params.control_tasks_count = 1 # we need to create pool, even without opening it, with one stub control task\n", "check_params.model = model_worker # specify your model worker" ] }, { "cell_type": "code", "execution_count": null, "id": "67a7ae7c-7052-40d0-944c-6ba6b44af1a7", "metadata": {}, "outputs": [], "source": [ "artifacts = client.launch_annotation(\n", " task_spec_en,\n", " annotation_params,\n", " check_params,\n", " input_objects,\n", " control_objects,\n", " toloka_client,\n", " s3=s3,\n", ")" ] }, { "cell_type": "code", "execution_count": 70, "id": "f6be197e-190c-4647-b6fb-4d726e83c365", "metadata": {}, "outputs": [], "source": [ "results = artifacts.results" ] }, { "cell_type": "code", "execution_count": 72, "id": "5910da39-e788-40f5-8081-b0b5b79c4b9c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>text</th>\n", " <th>audio</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>hello</td>\n", " <td>https://storage.yandexcloud.net/test/voice-recording/B2FA010A-80AD-4B9B-936F-E74AF44FDB42.wav</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>no thanks</td>\n", " <td>https://storage.yandexcloud.net/test/voice-recording/509A1B21-E035-47CE-BE9C-C102470F159A.wav</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>your order is accepted</td>\n", " <td>https://storage.yandexcloud.net/test/voice-recording/99BB1FB8-B9CC-4BDB-9DC2-9DFC6E404C98.wav</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " text \\\n", "0 hello \n", "1 no thanks \n", "2 your order is accepted \n", "\n", " audio \n", "0 https://storage.yandexcloud.net/test/voice-recording/B2FA010A-80AD-4B9B-936F-E74AF44FDB42.wav \n", "1 https://storage.yandexcloud.net/test/voice-recording/509A1B21-E035-47CE-BE9C-C102470F159A.wav \n", "2 https://storage.yandexcloud.net/test/voice-recording/99BB1FB8-B9CC-4BDB-9DC2-9DFC6E404C98.wav " ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results.predict()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }