{ "cells": [ { "cell_type": "markdown", "id": "f01acd4a", "metadata": {}, "source": [ "## Table of Contents\n", "\n", "1. [Initial Setup](#chapter1)\n", "    * [1.1. Install pdf2docx](#section_1_1)\n", "    * [1.2. Create an S3 bucket in the same region](#section_1_2)\n", "    * [1.3. Create Amazon Translate Batch Service Policy](#section_1_3)\n", "    * [1.4. Create Amazon Translate Batch Service Role](#section_1_4)\n", "    * [1.5. Attach the policy to the Service Role](#section_1_5)\n", " \n", "----\n", " \n", "2. [Upload multiple files to S3](#chapter2)\n", "    * [2.1. Widget to upload multiple files](#section_2_1)\n", "    * [2.2. Write to S3 bucket](#section_2_2)\n", "\n", "----\n", " \n", "3. [Translate Japanese documents to English using Batch Translation](#chapter3)\n", "    * [3.1. Create and start the batch translation job](#section_3_1)\n", "    * [3.2. Check the status of the job](#section_3_2)\n", "\n", "----\n", " \n", "4. [Verify and clean up](#chapter4)\n", "    * [4.1. Verify the document created in S3](#section_4_1)\n", "    * [4.2. Clean up (optional)](#section_4_2)\n", " \n", "----" ] }, { "cell_type": "markdown", "id": "df706d13", "metadata": {}, "source": [ "### 1. Initial Setup \n", "Run this section to install the required library and to create the IAM policy and role needed as prerequisites." ] }, { "cell_type": "markdown", "id": "5d9c0711", "metadata": {}, "source": [ "#### 1.1 Install pdf2docx \n", "Install the [pdf2docx](https://pdf2docx.com/) library to convert PDF files to docx, because [Amazon Translate](https://aws.amazon.com/translate/) does not currently support the PDF format."
] }, { "cell_type": "code", "execution_count": null, "id": "edea2905", "metadata": {}, "outputs": [], "source": [ "!pip3 install pdf2docx" ] }, { "cell_type": "markdown", "id": "c7c24079", "metadata": {}, "source": [ "#### 1.2 Create an S3 bucket in the same region \n", "_For example, since this notebook focuses on Japanese-to-English translation, the bucket and prefix can be named accordingly:_\n", "\n", "_Choose a unique bucket name._\n", "\n", "bucket_name='translate-ja-en-kunal'\n", "\n", "in_prefix_name='Japanese/input'\n", "\n", "**Enter a unique bucket name before running the cell below.**" ] }, { "cell_type": "code", "execution_count": null, "id": "add7958e", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "from pprint import pprint\n", "\n", "# Enter the unique S3 bucket name before running\n", "bucket_name='translate-ja-en-kunal'\n", "\n", "# Create the bucket in the region of the current session\n", "my_region = boto3.session.Session().region_name\n", "s3_client = boto3.client('s3', region_name=my_region)\n", "if my_region == 'us-east-1':\n", "    # us-east-1 is the default location and rejects an explicit LocationConstraint\n", "    response = s3_client.create_bucket(Bucket=bucket_name)\n", "else:\n", "    location = {'LocationConstraint': my_region}\n", "    response = s3_client.create_bucket(Bucket=bucket_name, CreateBucketConfiguration=location)\n", "pprint(response)" ] }, { "cell_type": "markdown", "id": "aac5ae0e", "metadata": {}, "source": [ "#### 1.3 Create Amazon Translate Batch Service Policy \n", "_Enter the bucket name created above and a policy name._\n", "_For example:_\n", "\n", "bucket_name='translate-ja-en-kunal'\n", "\n", "PolicyName='AmazonTranslateServicePolicy-Japanese-English-Document-Translation'\n", "\n", "Description='Amazon Translate service role policy for Batch'" ] }, { "cell_type": "code", "execution_count": null, "id": "0990d3b3", "metadata": { "scrolled": true }, "outputs": [], "source": [ "import json\n", "\n", "client = boto3.client('iam')\n", "# You may use the same Policy Name as long as it is not taken in your account\n", "policy_name='AmazonTranslateServicePolicy-Japanese-English-Document-Translation'\n", "policy_desc='Amazon Translate service role policy for Batch'\n", "\n", 
"policy_document={\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Action\": [\n", " \"s3:GetObject\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:s3:::\" + bucket_name + \"/*\",\n", " \"arn:aws:s3:::\" + bucket_name + \"/*\"\n", " ],\n", " \"Effect\": \"Allow\"\n", " },\n", " {\n", " \"Action\": [\n", " \"s3:ListBucket\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:s3:::\" + bucket_name,\n", " \"arn:aws:s3:::\" + bucket_name\n", " ],\n", " \"Effect\": \"Allow\"\n", " },\n", " {\n", " \"Action\": [\n", " \"s3:PutObject\"\n", " ],\n", " \"Resource\": [\n", " \"arn:aws:s3:::\" + bucket_name + \"/*\"\n", " ],\n", " \"Effect\": \"Allow\"\n", " }\n", " ]\n", "}\n", "\n", "response = client.create_policy(\n", " PolicyName=policy_name,\n", " PolicyDocument=json.dumps(policy_document),\n", " Description=policy_desc\n", ")\n", "policy_response=response\n", "policy_arn=policy_response['Policy']['Arn']\n", "\n", "print(\"Bucket Name\",bucket_name)\n", "print(\"Policy Name:\",policy_name)\n", "print(\"Policy Arn:\",policy_arn)" ] }, { "cell_type": "markdown", "id": "aa56c53d", "metadata": {}, "source": [ "#### 1.4 Create Amazon Translate Batch Service Role \n", "_Enter a role name and description_\n", "_For example:-_\n", "\n", "RoleName='AmazonTranslateServiceRole-Japanese-English-Document-Translation'\n", "\n", "Description='Amazon Translate service role for Batch.'" ] }, { "cell_type": "code", "execution_count": null, "id": "4bffa5bd", "metadata": {}, "outputs": [], "source": [ "import boto3\n", "import json\n", "\n", "client = boto3.client('iam')\n", "# You may use the same Policy Name as long as it is not taken in your account\n", "role_name='AmazonTranslateServiceRole-Japanese-English-Document-Translation'\n", "role_desc='Amazon Translate service role for Batch.'\n", "\n", "trust_relationship_policy={\n", " \"Version\": \"2012-10-17\",\n", " \"Statement\": [\n", " {\n", " \"Effect\": \"Allow\",\n", " \"Principal\": {\n", " \"Service\": 
\"translate.amazonaws.com\"\n", " },\n", " \"Action\": \"sts:AssumeRole\"\n", " }\n", " ]\n", "}\n", "\n", "\n", "response = client.create_role(\n", " Path='/service-role/',\n", " RoleName=role_name,\n", " AssumeRolePolicyDocument=json.dumps(trust_relationship_policy),\n", " Description=role_desc\n", ")\n", "role_response=response\n", "role_arn=role_response['Role']['Arn']\n", "\n", "print(\"Role Name:\",role_name)\n", "print(\"Role Arn:\",role_arn)" ] }, { "cell_type": "markdown", "id": "8c36082c", "metadata": {}, "source": [ "#### 1.5 Attach the policy to the Service Role " ] }, { "cell_type": "code", "execution_count": null, "id": "9b783599", "metadata": {}, "outputs": [], "source": [ "# Attach a role policy\n", "client.attach_role_policy(\n", " PolicyArn=policy_arn,\n", " RoleName=role_name\n", ")" ] }, { "cell_type": "markdown", "id": "8b9923e2", "metadata": {}, "source": [ "----" ] }, { "cell_type": "markdown", "id": "ec9092b3", "metadata": {}, "source": [ "### 2. Upload multiple files to S3 \n", "\n", "Upload multiple Japanese documents to be translated from desktop.\n", "Accepted formats are _docx_, _pdf_" ] }, { "cell_type": "markdown", "id": "4ed7b5a0", "metadata": {}, "source": [ "#### 2.1 Widget to upload multiples \n", "\n", "Accepted formats are _docx_, _pdf_" ] }, { "cell_type": "code", "execution_count": null, "id": "32e9220c", "metadata": {}, "outputs": [], "source": [ "# Create the upload widget to upload the file from local\n", "# Click to upload files (docx / pdf)\n", "from ipywidgets import FileUpload\n", "from IPython.display import display\n", "upload = FileUpload(accept='.docx,.pdf', multiple=True)\n", "display(upload)" ] }, { "cell_type": "markdown", "id": "d5ce353d", "metadata": {}, "source": [ "#### 2.2 Write to S3 bucket \n", "\n", "* docx will be written to S3\n", "* pdf will be converted to docx before writing" ] }, { "cell_type": "code", "execution_count": null, "id": "8f0ecaf8", "metadata": {}, "outputs": [], "source": [ "from 
pdf2docx import parse\n", "import os\n", "\n", "# Translation input and output prefixes in S3\n", "in_prefix_name='Japanese/input'\n", "out_prefix_name='Japanese/output'\n", "\n", "s3 = boto3.resource('s3', region_name=my_region)\n", "# upload.value is a dict keyed by file name (ipywidgets 7.x API)\n", "for name, md in upload.value.items():\n", "    # If the file type is pdf, convert it to docx before uploading\n", "    if md['metadata']['type'] == 'application/pdf':\n", "        with open(name, 'wb') as file:\n", "            file.write(md['content'])\n", "        filename, file_extension = os.path.splitext(name)\n", "        newfilename = filename + '.docx'\n", "        parse(name, newfilename, start=0, end=None)\n", "        s3.Bucket(bucket_name).upload_file(newfilename,os.path.join(in_prefix_name,newfilename))\n", "        # Remove the temporary local copies\n", "        os.remove(name)\n", "        os.remove(newfilename)\n", "    else:\n", "        # docx content is written to S3 directly, without a local copy\n", "        s3.Object(bucket_name, os.path.join(in_prefix_name,name)).put(Body=md['content'])" ] }, { "cell_type": "markdown", "id": "b1ec9023", "metadata": {}, "source": [ "----" ] }, { "cell_type": "markdown", "id": "a7b93748", "metadata": {}, "source": [ "### 3. Translate Japanese documents to English using Batch Translation\n", "\n", "#### 3.1 Create and start the batch translation job" ] }, { "cell_type": "code", "execution_count": null, "id": "7f56ce80", "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "\n", "client = boto3.client('translate')\n", "\n", "# A timestamp suffix keeps the job name unique across runs\n", "now=datetime.now().strftime(\"%m%d%Y%H%M%S\")\n", "job_name='japanese-to-english-multi-pages' + '-' + now\n", "content_type='application/vnd.openxmlformats-officedocument.wordprocessingml.document'\n", "\n", "job_response = client.start_text_translation_job(\n", "    JobName=job_name,\n", "    InputDataConfig={\n", "        'S3Uri': os.path.join('s3://',bucket_name,in_prefix_name),\n", "        'ContentType': content_type\n", "    },\n", "    OutputDataConfig={\n", "        'S3Uri': os.path.join('s3://',bucket_name,out_prefix_name)\n", "    },\n", "    DataAccessRoleArn=role_arn,\n", "    SourceLanguageCode='ja',\n", "    TargetLanguageCodes=[\n", "        'en',\n", "    ]\n", ")\n", "job_id=job_response['JobId']\n", 
"job_status=job_response['JobStatus']\n", "\n", "\n", "print(\"JobId\",job_id)\n", "print(\"JobStatus\",job_status)\n", "print(\"Job Name\",job_name)\n", "pprint(job_response)" ] }, { "cell_type": "markdown", "id": "3af66f72", "metadata": {}, "source": [ "#### 3.2 Check the status of the job \n", "\n", "Keep checking on the JobStatus which will change from **SUBMITTED** --> **IN_PROGRESS** --> **COMPLETED**" ] }, { "cell_type": "code", "execution_count": null, "id": "32956400", "metadata": {}, "outputs": [], "source": [ "# Get job status\n", "status_response = client.describe_text_translation_job(\n", " JobId=job_id\n", ")\n", "\n", "job_status=status_response['TextTranslationJobProperties']['JobStatus']\n", "print(\"Job Name\",job_name)\n", "print(\"Job Status\",job_status)\n", "pprint(status_response)" ] }, { "cell_type": "markdown", "id": "e7535346", "metadata": {}, "source": [ "----" ] }, { "cell_type": "markdown", "id": "b698a37d", "metadata": {}, "source": [ "### 4. Verify and clean up \n", "\n", "Verify the translated document created in S3 and then clean up resources (optional)." ] }, { "cell_type": "markdown", "id": "31aa6869", "metadata": {}, "source": [ "#### 4.1 Verify the document created in s3 \n", "\n", "Verify the translated document created in s3 location" ] }, { "cell_type": "code", "execution_count": null, "id": "81dfe1c2", "metadata": {}, "outputs": [], "source": [ "print(os.path.join('s3://',bucket_name,out_prefix_name))" ] }, { "cell_type": "markdown", "id": "3a7bc6ff", "metadata": {}, "source": [ "#### 4.2 Clean up (optional) \n", "\n", "Clean up the resources created after you are done." 
] }, { "cell_type": "code", "execution_count": null, "id": "77300b2f", "metadata": {}, "outputs": [], "source": [ "print(\"Reminder: the following resources were created by this notebook in region {} and need to be cleaned up after you are done.\".format(my_region))\n", "print(\"Bucket Name:\",bucket_name)\n", "print(\"Policy Name:\",policy_name)\n", "print(\"Policy Arn:\",policy_arn)\n", "print(\"Role Name:\",role_name)\n", "print(\"Role Arn:\",role_arn)" ] }, { "cell_type": "markdown", "id": "a4fd4444", "metadata": {}, "source": [ "##### All Done!" ] } ], "metadata": { "kernelspec": { "display_name": "conda_python3", "language": "python", "name": "conda_python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 5 }