{
"cells": [
{
"cell_type": "markdown",
"id": "f01acd4a",
"metadata": {},
"source": [
"## Table of Contents\n",
"\n",
"1. [Initial Setup](#chapter1)\n",
" * [1.1. Install pdf2docx](#section_1_1)\n",
" * [1.2. Create an S3 bucket in the same region](#section_1_2)\n",
" * [1.3. Create Amazon Translate Batch Service Policy](#section_1_3)\n",
" * [1.4. Create Amazon Translate Batch Service Role](#section_1_4)\n",
" * [1.5. Attach the policy to the Service Role](#section_1_5)\n",
" \n",
"----\n",
" \n",
"2. [Upload multiple files to S3](#chapter2)\n",
" * [2.1. Widget to upload multiple files](#section_2_1)\n",
" * [2.2. Write to S3 bucket](#section_2_2)\n",
"\n",
"----\n",
" \n",
"3. [Translate Japanese documents to English using Batch Translation](#chapter3)\n",
" * [3.1. Create and start the batch translation job](#section_3_1)\n",
" * [3.2. Check the status of the job](#section_3_2)\n",
"\n",
"----\n",
" \n",
"4. [Verify and clean up](#chapter4)\n",
" * [4.1. Verify the document created in S3](#section_4_1)\n",
" * [4.2. Clean Up (optional)](#section_4_2)\n",
" \n",
"----"
]
},
{
"cell_type": "markdown",
"id": "df706d13",
"metadata": {},
"source": [
"### 1. Initial Setup \n",
"Run this section to install any libraries necessary and and IAM policy or roles needed as a pre-requisite"
]
},
{
"cell_type": "markdown",
"id": "5d9c0711",
"metadata": {},
"source": [
"#### 1.1 Install pdf2docx \n",
"Install the library [pdf2docx](https://pdf2docx.com/) to convert pdf to docx as [Amazon Translate](https://aws.amazon.com/translate/) do not currently support pdf formats."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "edea2905",
"metadata": {},
"outputs": [],
"source": [
"!pip3 install pdf2docx"
]
},
{
"cell_type": "markdown",
"id": "c7c24079",
"metadata": {},
"source": [
"#### 1.2 Create an S3 bucket in the same region \n",
"_For example since this focusses on Japanaese to English Translation we can name the prefixes accordingly.:-_\n",
"\n",
"_Choose a unique bucket name_\n",
"\n",
"bucket_name='translate-ja-en-kunal'\n",
"\n",
"in_prefix_name='Japanese/input'\n",
"\n",
"**Enter a unique bucket name before running the below cell**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "add7958e",
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"from pprint import pprint\n",
"\n",
"# Enter the unique S3 bucket name before running\n",
"bucket_name='translate-ja-en-kunal'\n",
"\n",
"my_region = boto3.session.Session().region_name\n",
"s3_client = boto3.client('s3', region_name=my_region)\n",
"location = {'LocationConstraint': my_region}\n",
"response=s3_client.create_bucket(Bucket=bucket_name,CreateBucketConfiguration=location)\n",
"pprint(response)"
]
},
{
"cell_type": "markdown",
"id": "aac5ae0e",
"metadata": {},
"source": [
"#### 1.3 Create Amazon Translate Batch Service Policy \n",
"_Enter the bucket name created above, Policy Name_\n",
"_For example:-_\n",
"\n",
"bucket_name='translate-ja-en-kunal'\n",
"\n",
"PolicyName='AmazonTranslateServicePolicy-Japanese-English-Document-Translation'\n",
"\n",
"Description='Amazon Translate service role policy for Batch'"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0990d3b3",
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"import json\n",
"\n",
"client = boto3.client('iam')\n",
"# You may use the same Policy Name as long as it is not taken in your account\n",
"policy_name='AmazonTranslateServicePolicy-Japanese-English-Document-Translation'\n",
"policy_desc='Amazon Translate service role policy for Batch'\n",
"\n",
"policy_document={\n",
" \"Version\": \"2012-10-17\",\n",
" \"Statement\": [\n",
" {\n",
" \"Action\": [\n",
" \"s3:GetObject\"\n",
" ],\n",
" \"Resource\": [\n",
" \"arn:aws:s3:::\" + bucket_name + \"/*\",\n",
" \"arn:aws:s3:::\" + bucket_name + \"/*\"\n",
" ],\n",
" \"Effect\": \"Allow\"\n",
" },\n",
" {\n",
" \"Action\": [\n",
" \"s3:ListBucket\"\n",
" ],\n",
" \"Resource\": [\n",
" \"arn:aws:s3:::\" + bucket_name,\n",
" \"arn:aws:s3:::\" + bucket_name\n",
" ],\n",
" \"Effect\": \"Allow\"\n",
" },\n",
" {\n",
" \"Action\": [\n",
" \"s3:PutObject\"\n",
" ],\n",
" \"Resource\": [\n",
" \"arn:aws:s3:::\" + bucket_name + \"/*\"\n",
" ],\n",
" \"Effect\": \"Allow\"\n",
" }\n",
" ]\n",
"}\n",
"\n",
"response = client.create_policy(\n",
" PolicyName=policy_name,\n",
" PolicyDocument=json.dumps(policy_document),\n",
" Description=policy_desc\n",
")\n",
"policy_response=response\n",
"policy_arn=policy_response['Policy']['Arn']\n",
"\n",
"print(\"Bucket Name\",bucket_name)\n",
"print(\"Policy Name:\",policy_name)\n",
"print(\"Policy Arn:\",policy_arn)"
]
},
{
"cell_type": "markdown",
"id": "aa56c53d",
"metadata": {},
"source": [
"#### 1.4 Create Amazon Translate Batch Service Role \n",
"_Enter a role name and description_\n",
"_For example:-_\n",
"\n",
"RoleName='AmazonTranslateServiceRole-Japanese-English-Document-Translation'\n",
"\n",
"Description='Amazon Translate service role for Batch.'"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4bffa5bd",
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"import json\n",
"\n",
"client = boto3.client('iam')\n",
"# You may use the same Policy Name as long as it is not taken in your account\n",
"role_name='AmazonTranslateServiceRole-Japanese-English-Document-Translation'\n",
"role_desc='Amazon Translate service role for Batch.'\n",
"\n",
"trust_relationship_policy={\n",
" \"Version\": \"2012-10-17\",\n",
" \"Statement\": [\n",
" {\n",
" \"Effect\": \"Allow\",\n",
" \"Principal\": {\n",
" \"Service\": \"translate.amazonaws.com\"\n",
" },\n",
" \"Action\": \"sts:AssumeRole\"\n",
" }\n",
" ]\n",
"}\n",
"\n",
"\n",
"response = client.create_role(\n",
" Path='/service-role/',\n",
" RoleName=role_name,\n",
" AssumeRolePolicyDocument=json.dumps(trust_relationship_policy),\n",
" Description=role_desc\n",
")\n",
"role_response=response\n",
"role_arn=role_response['Role']['Arn']\n",
"\n",
"print(\"Role Name:\",role_name)\n",
"print(\"Role Arn:\",role_arn)"
]
},
{
"cell_type": "markdown",
"id": "8c36082c",
"metadata": {},
"source": [
"#### 1.5 Attach the policy to the Service Role "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b783599",
"metadata": {},
"outputs": [],
"source": [
"# Attach a role policy\n",
"client.attach_role_policy(\n",
" PolicyArn=policy_arn,\n",
" RoleName=role_name\n",
")"
]
},
{
"cell_type": "markdown",
"id": "8b9923e2",
"metadata": {},
"source": [
"----"
]
},
{
"cell_type": "markdown",
"id": "ec9092b3",
"metadata": {},
"source": [
"### 2. Upload multiple files to S3 \n",
"\n",
"Upload multiple Japanese documents to be translated from desktop.\n",
"Accepted formats are _docx_, _pdf_"
]
},
{
"cell_type": "markdown",
"id": "4ed7b5a0",
"metadata": {},
"source": [
"#### 2.1 Widget to upload multiples \n",
"\n",
"Accepted formats are _docx_, _pdf_"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32e9220c",
"metadata": {},
"outputs": [],
"source": [
"# Create the upload widget to upload the file from local\n",
"# Click to upload files (docx / pdf)\n",
"from ipywidgets import FileUpload\n",
"from IPython.display import display\n",
"upload = FileUpload(accept='.docx,.pdf', multiple=True)\n",
"display(upload)"
]
},
{
"cell_type": "markdown",
"id": "d5ce353d",
"metadata": {},
"source": [
"#### 2.2 Write to S3 bucket \n",
"\n",
"* docx will be written to S3\n",
"* pdf will be converted to docx before writing"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8f0ecaf8",
"metadata": {},
"outputs": [],
"source": [
"from pdf2docx import parse\n",
"import os\n",
"\n",
"# Translation input and out file prefix in S3\n",
"in_prefix_name='Japanese/input'\n",
"out_prefix_name='Japanese/output' \n",
"\n",
"s3 = boto3.resource('s3', region_name=my_region)\n",
"for name, md in upload.value.items():\n",
"# If the file type is pdf, convert to docx \n",
" if md['metadata']['type'] == 'application/pdf':\n",
" with open (name, 'wb') as file:\n",
" file.write(md['content'])\n",
" filename, file_extension = os.path.splitext(name)\n",
" newfilename = filename + '.docx'\n",
" parse(name, newfilename, start=0, end=None)\n",
" s3.Bucket(bucket_name).upload_file(newfilename,os.path.join(in_prefix_name,newfilename))\n",
" os.remove(name)\n",
" os.remove(newfilename)\n",
" \n",
" else:\n",
" with open (name, 'wb') as file:\n",
" s3.Object(bucket_name, os.path.join(in_prefix_name,name)).put(Body=md['content'])"
]
},
{
"cell_type": "markdown",
"id": "b1ec9023",
"metadata": {},
"source": [
"----"
]
},
{
"cell_type": "markdown",
"id": "a7b93748",
"metadata": {},
"source": [
"#### 3.1 Create and start the batch translation job "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f56ce80",
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"client = boto3.client('translate')\n",
"\n",
"now=datetime.now().strftime(\"%m%d%Y%H%M%S\")\n",
"job_name='japanese-to-english-multi-pages' + '-' + now\n",
"content_type='application/vnd.openxmlformats-officedocument.wordprocessingml.document'\n",
"\n",
"job_response = client.start_text_translation_job(\n",
" JobName=job_name,\n",
" InputDataConfig={\n",
" 'S3Uri': os.path.join('s3://',bucket_name,in_prefix_name),\n",
" 'ContentType': content_type\n",
" },\n",
" OutputDataConfig={\n",
" 'S3Uri': os.path.join('s3://',bucket_name,out_prefix_name)\n",
" },\n",
" DataAccessRoleArn=role_arn,\n",
" SourceLanguageCode='ja',\n",
" TargetLanguageCodes=[\n",
" 'en',\n",
" ]\n",
")\n",
"job_id=job_response['JobId']\n",
"job_status=job_response['JobStatus']\n",
"\n",
"\n",
"print(\"JobId\",job_id)\n",
"print(\"JobStatus\",job_status)\n",
"print(\"Job Name\",job_name)\n",
"pprint(job_response)"
]
},
{
"cell_type": "markdown",
"id": "3af66f72",
"metadata": {},
"source": [
"#### 3.2 Check the status of the job \n",
"\n",
"Keep checking on the JobStatus which will change from **SUBMITTED** --> **IN_PROGRESS** --> **COMPLETED**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "32956400",
"metadata": {},
"outputs": [],
"source": [
"# Get job status\n",
"status_response = client.describe_text_translation_job(\n",
" JobId=job_id\n",
")\n",
"\n",
"job_status=status_response['TextTranslationJobProperties']['JobStatus']\n",
"print(\"Job Name\",job_name)\n",
"print(\"Job Status\",job_status)\n",
"pprint(status_response)"
]
},
{
"cell_type": "markdown",
"id": "e7535346",
"metadata": {},
"source": [
"----"
]
},
{
"cell_type": "markdown",
"id": "b698a37d",
"metadata": {},
"source": [
"### 4. Verify and clean up \n",
"\n",
"Verify the translated document created in S3 and then clean up resources (optional)."
]
},
{
"cell_type": "markdown",
"id": "31aa6869",
"metadata": {},
"source": [
"#### 4.1 Verify the document created in s3 \n",
"\n",
"Verify the translated document created in s3 location"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81dfe1c2",
"metadata": {},
"outputs": [],
"source": [
"print(os.path.join('s3://',bucket_name,out_prefix_name))"
]
},
{
"cell_type": "markdown",
"id": "3a7bc6ff",
"metadata": {},
"source": [
"#### 4.2 Clean up (optional) \n",
"\n",
"Clean up the resources created after you are done."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "77300b2f",
"metadata": {},
"outputs": [],
"source": [
"print(\"Reminder : Following are the resources which you created in this Notebook which needs to be cleaned up after you are done in region,{}.\".format(my_region))\n",
"print(\"Bucket Name\",bucket_name)\n",
"print(\"Policy Name:\",policy_name)\n",
"print(\"Policy Arn:\",policy_arn)\n",
"print(\"Role Name:\",role_name)\n",
"print(\"Role Arn:\",role_arn)"
]
},
{
"cell_type": "markdown",
"id": "a4fd4444",
"metadata": {},
"source": [
"##### All Done!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}