com.ibm.streamsx.hdfs

The **streamsx.hdfs** toolkit provides operators that can read and write data from Hadoop Distributed File System (**HDFS**) version 2 or later. It also supports copying files from the local file system to the remote HDFS and from HDFS back to the local file system.

**HDFS2FileSource** This operator opens a file on HDFS and sends out its contents in tuple format on its output port.

**HDFS2FileSink** This operator writes tuples that arrive on its input port to the output file that is named by the **file** parameter.

**HDFS2DirectoryScan** This operator repeatedly scans an HDFS directory and writes the names of new or modified files that are found in the directory to the output port. The operator sleeps between scans. (A read sketch that combines this operator with **HDFS2FileSource** is shown after the copy examples below.)

**HDFS2FileCopy** This operator copies files from the local file system to the remote HDFS and from HDFS to the local disk.

The operators in this toolkit use the Hadoop Java APIs to access HDFS and WEBHDFS. The operators support the following Hadoop distributions:

* Apache Hadoop version 2.7
* Apache Hadoop version 3.0 or higher
* Cloudera distribution including Apache Hadoop version 4 (CDH4) and version 5 (CDH5)
* Hortonworks Data Platform (HDP) version 2.6 or higher
* Hortonworks Data Platform (HDP) version 3.0 or higher
* IBM Analytic Engine 1.1 (HDP 2.7)
* IBM Analytic Engine 1.2 (HDP 3.1)

Note: The reference platforms that were used for testing are Hadoop 2.7.3 and HDP.

You can access Hadoop remotely by specifying the `webhdfs://hdfshost:webhdfsport` schema in the URI that you use to connect to WEBHDFS. For example:

    () as lineSink1 = HDFS2FileSink(LineIn) {
        param
            hdfsUri      : "webhdfs://your-hdfs-host-ip-address:8443";
            hdfsUser     : "clsadmin";
            hdfsPassword : "PASSWORD";
            file         : "LineInput.txt";
    }

Or with `"hdfs://your-hdfs-host-ip-address:8020"` as the **hdfsUri**:

    () as lineSink1 = HDFS2FileSink(LineIn) {
        param
            hdfsUri  : "hdfs://your-hdfs-host-ip-address:8020";
            hdfsUser : "hdfs";
            file     : "LineInput.txt";
    }

The following example copies the file test.txt from the local path ./data/ into the /user/hdfs/works directory of HDFS. The **credentials** parameter is a JSON string that contains `user`, `password`, and `webhdfs`.

    stream<boolean succeed> copyFromLocal = HDFS2FileCopy() {
        param
            localFile                : "test.txt";
            hdfsFile                 : "/user/hdfs/works/test.txt";
            deleteSourceFile         : false;
            overwriteDestinationFile : true;
            direction                : copyFromLocalFile;
            credentials              : $credentials;
    }
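The opposite direction, `copyToLocalFile`, is also supported. The following is a minimal sketch, not taken from a shipped sample; the file paths are placeholders:

    stream<boolean succeed> copyToLocal = HDFS2FileCopy() {
        param
            hdfsFile                 : "/user/hdfs/works/test.txt"; // source file on HDFS
            localFile                : "test-copy.txt";             // destination on the local disk
            deleteSourceFile         : false;
            overwriteDestinationFile : true;
            direction                : copyToLocalFile;
            credentials              : $credentials;
    }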
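To read files, **HDFS2DirectoryScan** can feed **HDFS2FileSource**. The following is a minimal sketch, assuming the default schemas (one `rstring` file name per tuple from the scan, one `rstring` line per tuple from the source); the URI, user, directory, and sleep values are placeholders:

    stream<rstring fileName> scannedFiles = HDFS2DirectoryScan() {
        param
            hdfsUri   : "hdfs://your-hdfs-host-ip-address:8020";
            hdfsUser  : "hdfs";
            directory : "/user/hdfs/works";
            sleepTime : 5.0; // seconds to sleep between two scans
    }

    // Each incoming tuple names a file to open; its contents are emitted line by line.
    stream<rstring line> lines = HDFS2FileSource(scannedFiles) {
        param
            hdfsUri  : "hdfs://your-hdfs-host-ip-address:8020";
            hdfsUser : "hdfs";
    }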
**Kerberos configuration**

For Apache Hadoop 2.x, CDH, and HDP, you can optionally configure these operators to use the Kerberos protocol to authenticate users that read and write to HDFS. Kerberos authentication provides a more secure way of accessing HDFS by providing user authentication.

To use Kerberos authentication, you must configure the **authPrincipal** and **authKeytab** operator parameters at compile time. The **authPrincipal** parameter specifies the Kerberos principal, which is typically the principal that is created for the Streams instance owner. The **authKeytab** parameter specifies the keytab file that is created for the principal.

Kerberos authentication requires a **principal** and a **keytab** for each user. If you use **Ambari** to configure your Hadoop server, you can create principals and keytabs via Ambari (Enable Kerberos). More details about the Kerberos configuration: https://developer.ibm.com/hadoop/2016/08/18/overview-of-kerberos-in-iop-4-2/

Copy the created keytab onto the local Streams server, for example into the `etc` directory of your SPL application. Before you start your SPL application, you can check the keytab with the **kinit** tool:

    kinit -k -t KeytabPath Principal

KeytabPath is the full path to the keytab file. For example:

    kinit -k -t /home/streamsadmin/workspace/myproject/etc/hdfs.headless.keytab hdfs-hdp2@HDP2.COM

In this case **HDP2.COM** is the **Kerberos realm** and the user is **hdfs**.

Here is an SPL example that writes a file to the Hadoop server with a Kerberos configuration:

    () as lineSink1 = HDFS2FileSink(LineIn) {
        param
            authKeytab    : "etc/hdfs.headless.keytab";
            authPrincipal : "hdfs-hdp2@HDP2.COM";
            configPath    : "etc";
            file          : "LineInput.txt";
    }

The HDFS configuration file **core-site.xml** has to be copied into the local `etc` directory.
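As an alternative to hard-coding connection credentials in SPL, the `credentials` JSON can be stored in a Streams application configuration and referenced with the **appConfigName** parameter (see the version 5.0.0 notes below). The following is a minimal sketch, assuming an application configuration named `hdfsconfig` has been created (for example with `streamtool mkappconfig`) that contains a `credentials` property:

    // Assumes the application configuration "hdfsconfig" holds a "credentials"
    // property whose value is a JSON string with user, password, and webhdfs.
    () as lineSink1 = HDFS2FileSink(LineIn) {
        param
            appConfigName : "hdfsconfig";
            file          : "LineInput.txt";
    }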
+ Developing and running applications that use the HDFS Toolkit

To create applications that use the HDFS Toolkit, you must configure either Streams Studio or the SPL compiler to be aware of the location of the toolkit.

# Before you begin

* Install IBM InfoSphere Streams. Configure the product environment variables by entering the following command:

    source product-installation-root-directory/streams-version/bin/streamsprofile.sh

# About this task

After the location of the toolkit is communicated to the compiler, the SPL artifacts that are specified in the toolkit can be used by an application. The application can include a use directive to bring the necessary namespaces into scope. Alternatively, you can fully qualify the operators that are provided by the toolkit with their namespaces as prefixes.

# Procedure

1. Verify that InfoSphere Streams has access to the toolkit's `impl/lib/ext` directory, where the Hadoop libraries are located. The JAR libraries are downloaded via `maven` when the toolkit is built. The new version of the HDFS toolkit does not necessarily need a Hadoop client installation.
2. Build your application. You can use the `sc` command or Streams Studio.
3. Start the InfoSphere Streams instance.
4. Run the application. You can submit the application as a job by using the **streamtool submitjob** command or by using Streams Studio.

+ What is new

++ What is new in version 4.3.0

* The HDFS2FileSink operator provides a new parameter **tempFile**. The **tempFile** parameter specifies the name of the file that the operator writes to. When the file is closed, it is renamed to the final file name defined by the **file** parameter.

++ What is new in version 4.4.1

* The pom.xml file has been upgraded to use Hadoop version 2.8.5 libraries.

++ What is new in version 5.0.0

* The streamsx.hdfs toolkit has been improved with a new operator, **HDFS2FileCopy**.
* `HDFS2FileCopy` copies files in two directions:
  * `copyFromLocalFile` : copies a file from the local disk to the HDFS file system.
  * `copyToLocalFile` : copies a file from the HDFS file system to the local disk.
* It also supports two new parameters for all operators:
  * `credentials` : a JSON string that contains key/value pairs for `user`, `password`, and `webhdfs`.
  * `appConfigName` : a Streams application configuration that supplies the `credentials` JSON string (a sketch appears after the Kerberos section above).
* The toolkit has been adapted to support HDP 2.x and HDP 3.x file systems.
* The com.ibm.streamsx.hdfs.client.webhdfs Java package has been adapted to support `knox` HttpURLConnection.
* The pom.xml file has been upgraded to use Hadoop version 3.1.0 libraries.

++ What is new in version 5.1.0

* Check whether the core-site.xml file exists in **configPath**.

++ What is new in version 5.2.0

* The pom.xml file has been upgraded to use jackson.core libraries version 2.10.2.

++ What is new in version 5.2.1

* A new error message has been added for translation.

++ What is new in version 5.2.2

* pom.xml updated to use commons-configuration2 version 2.7.

++ What is new in version 5.2.3

* pom.xml updated to use commons-codec version 1.14.

++ What is new in version 5.3.0

* pom.xml updated to use Hadoop version 3.3 and guava version 29.0-jre libraries.

++ What is new in version 5.3.1

* pom.xml updated to use guava version 30.0-jre.

++ What is new in version 5.3.2

* pom.xml updated to use slf4j libraries version 1.7.36.

++ What is new in version 5.3.3

* pom.xml updated to use the latest Apache libraries.

++ What is new in version 5.3.4

* Vulnerability issues in third-party libraries have been fixed:
  * Hadoop libraries upgraded to version 3.3.6
  * commons-cli upgraded to 1.5.0
  * commons-codec upgraded to 1.16.1