Setting up an Amazon Redshift Cluster and Accessing It Using Pentaho Kettle


Amazon Redshift is a fully managed and highly scalable data warehouse service in the cloud. You can start with a few hundred GB of data and scale up to a petabyte or more. Redshift sits in the database section of Amazon Web Services (AWS) and is based on PostgreSQL. Redshift lets you scale the warehouse as your data grows, so it becomes easier to focus on the analysis part of a BI solution. Read the official Redshift documentation here for more.

In this blog, I will explain the steps to set up an Amazon Redshift cluster (at least a single test instance) and the process to connect to it using PDI.

Step 1: Registering for AWS and login

First of all, visit Amazon AWS and register for an AWS account. Simply provide all the basic information.

Step 2: AWS Management Console

This is the home page for AWS and the place where you can view all the product services provided by Amazon. Check a sample screenshot here.

Step 3: Creating a Redshift Cluster

The first step to create a data warehouse is to launch a set of nodes, called an Amazon Redshift cluster. After you provision your cluster, you can upload your data set and then perform data analysis queries.

Log into Redshift from the AWS Management Console and click on “Launch Cluster”.

Next, you will be asked to provide details like the following:

  1. Cluster Details:

In the above image:

  • Cluster Identifier: The unique name you give to your cluster.
  • Database Name: The name of your database. The default is “dev”; you can give a name of your choice.
  • Database Port: The port on which the database accepts connections.
  • Master Username and Password: The admin credentials for the cluster.

2. Node Configurations:

Once you are done with the cluster configuration, the next step is to define the nodes. For this blog, I am using a single-node cluster with the dc1.large node type. Based on your usage, you can select from a list of node types to get the memory and storage configuration you need.

In the above image, the dc1.large node type is selected for a Single Node cluster with one compute node.

3. Additional Configurations:

In the Additional Configurations section (after you hit “Continue” in step 2), you are asked to choose the VPC (Virtual Private Cloud), security details, database encryption, etc. All nodes are placed inside a VPC by default, and AWS provides a default VPC configuration. This default configuration needs a small tweak, which is explained below.

For now, simply click CONTINUE and you are done with the single-node Redshift cluster configuration. Below is an image of a test cluster that I have created.

[Screenshot: summary of the test cluster configuration]
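If you prefer to script this step instead of clicking through the console, the same single-node cluster can be created with the AWS SDK for Java. This is only a minimal sketch, assuming the aws-java-sdk-redshift dependency is on the classpath; the cluster identifier, database name, and credentials below are placeholders, not the values from the cluster shown above.

```java
import com.amazonaws.services.redshift.AmazonRedshift;
import com.amazonaws.services.redshift.AmazonRedshiftClientBuilder;
import com.amazonaws.services.redshift.model.CreateClusterRequest;

public class CreateTestCluster {
    public static void main(String[] args) {
        // The client picks up credentials from the default provider chain
        // (environment variables, ~/.aws/credentials, etc.)
        AmazonRedshift redshift = AmazonRedshiftClientBuilder.defaultClient();

        // Mirrors the console choices above: single-node cluster, dc1.large node type,
        // database "dev" on port 5439. All names and passwords are placeholders.
        CreateClusterRequest request = new CreateClusterRequest()
                .withClusterIdentifier("my-test-cluster")
                .withClusterType("single-node")
                .withNodeType("dc1.large")
                .withDBName("dev")
                .withPort(5439)
                .withMasterUsername("masteruser")
                .withMasterUserPassword("MasterPassword1")
                .withPubliclyAccessible(true);

        System.out.println("Cluster status: " + redshift.createCluster(request).getClusterStatus());
    }
}
```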

Step 4: Changing the Security Group in EC2

You cannot connect to the Redshift database through clients unless the security group allows inbound access from your IP on the database’s TCP port.

EC2 allows users to rent virtual computers on which to run their own computer applications. EC2 allows scalable deployment of applications by providing a web service through which a user can boot an Amazon Machine Image to create a virtual machine, which Amazon calls an “instance”, containing any software desired. More on EC2 here.

Now click on the EC2 Dashboard (from the AWS Management Console) and select the Security Groups tab on the right. There you will find the security groups that are available. You need to change the inbound rules for the security group.

In the above image, for the default VPC security group created during the Redshift cluster configuration, I have added a few extra rules: the SSH and ALL TCP types are publicly accessible. This will enable clients or third-party tools to easily access the Redshift cluster.

Ideally, if you are using the cluster for serious work or confidential data, it is recommended to restrict the inbound rule to the public IP address from which you will connect to the cluster. If the cluster is only for testing (as in my case), it is OK to skip the IP restriction and accept all IPs.

Click here for more on authorizing the inbound rule.
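If you want to script the inbound rule instead of editing it in the EC2 console, here is a minimal sketch using the AWS SDK for Java. The security group ID is a placeholder, and opening the port to 0.0.0.0/0 is only appropriate for a throwaway test cluster; for real data, replace the CIDR with your own IP.

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.AuthorizeSecurityGroupIngressRequest;

public class OpenRedshiftPort {
    public static void main(String[] args) {
        AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

        // Allow inbound TCP on the Redshift port (5439 here) from any address.
        // "sg-xxxxxxxx" stands in for the VPC security group attached to the cluster.
        AuthorizeSecurityGroupIngressRequest request = new AuthorizeSecurityGroupIngressRequest()
                .withGroupId("sg-xxxxxxxx")
                .withIpProtocol("tcp")
                .withFromPort(5439)
                .withToPort(5439)
                .withCidrIp("0.0.0.0/0"); // test clusters only; restrict to your IP otherwise

        ec2.authorizeSecurityGroupIngress(request);
        System.out.println("Inbound rule added for port 5439");
    }
}
```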

Step 5: Using PDI/Kettle to Access Redshift

Accessing Redshift from Kettle is straightforward once you have configured the cluster as described above. Since Redshift is essentially a PostgreSQL database, Kettle can access the cluster using the PostgreSQL database type.

In the Database Connection dialog of Kettle, select the Connection Type as PostgreSQL. Give the Host Name as the cluster host name (you can find it in the “Connect Client” section of the Redshift dashboard). The rest of the details are the same as what you provided during the cluster configuration step.

[Screenshot: PDI Database Connection dialog with the PostgreSQL connection type]

Also make sure that you have the PostgreSQL JDBC driver available in the /lib folder of the Kettle installation.
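These same settings can be verified outside of Kettle with a few lines of plain JDBC, which is handy when a connection fails and you want to rule out the tool itself. This is a minimal sketch, assuming the PostgreSQL JDBC driver is on the classpath; the endpoint, database name, and credentials are placeholders for the values from your own cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RedshiftViaPostgresDriver {
    public static void main(String[] args) throws Exception {
        // Host and port come from the cluster endpoint shown in the Redshift dashboard;
        // "dev" is the database name chosen during cluster configuration.
        String url = "jdbc:postgresql://my-test-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev";

        try (Connection conn = DriverManager.getConnection(url, "masteruser", "MasterPassword1");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT current_database()")) {
            while (rs.next()) {
                System.out.println("Connected to database: " + rs.getString(1));
            }
        }
    }
}
```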

Generic Connectivity:

AWS Redshift also provides its own custom JDBC/ODBC driver files for connecting from clients. In the Redshift dashboard, move to the “Connect Client” section and select the drivers. Download the driver as per your requirement.

Place the downloaded jar in the /lib folder of Kettle and, in the Connection Type list of the Database Connection dialog, select the “Generic database” type.

The Custom connection URL is the JDBC URL string, which you will find in the JDBC configuration section of the cluster’s connection details.

The Custom driver class name is the driver class that Redshift uses. For the Redshift JDBC 4.0 driver the class name is com.amazon.redshift.jdbc4.Driver, and for JDBC 4.1 it is com.amazon.redshift.jdbc41.Driver.
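For reference, the same connection through the Amazon driver looks like the sketch below. The jdbc:redshift:// URL mirrors the one shown in the cluster’s connection details; the endpoint, database, and credentials are again placeholders, and the Redshift JDBC 4.1 jar is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class RedshiftViaAmazonDriver {
    public static void main(String[] args) throws Exception {
        // Register the Redshift JDBC 4.1 driver explicitly
        // (use com.amazon.redshift.jdbc4.Driver with the JDBC 4.0 jar instead).
        Class.forName("com.amazon.redshift.jdbc41.Driver");

        String url = "jdbc:redshift://my-test-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com:5439/dev";

        try (Connection conn = DriverManager.getConnection(url, "masteruser", "MasterPassword1")) {
            System.out.println("Connection open: " + !conn.isClosed());
        }
    }
}
```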

This will help you to connect to Redshift using PDI/Kettle.

Do post your thoughts and comments 🙂


[More on: Loading Data to AWS S3 using Pentaho DI HERE]


17 thoughts on “Setting up an Amazon Redshift Cluster and Accessing It Using Pentaho Kettle”

    1. Hi Tom. No, there seems to be no separate plugin for DynamoDB in PDI. I have checked the Pentaho marketplace and there seems to be no dedicated plugin. But that doesn’t mean you cannot connect to DynamoDB using PDI.
      Download the DynamoDB client jar from AWS first. Next, select “Generic database” from the database options and give the necessary connection credentials. The connection URL would be the client string given by AWS, the driver would be jdbc.dynamodb.DynamoDBDriver (just check once before using this), plus the user/password. The connection is similar to connecting to Redshift using the generic database type, as I have explained in this blog. You may also check this blog; I found it while looking into DynamoDB.


  1. Hi Rishu,

    Thanks for the quick response. I have tried looking to download the DynamoDB client jar from AWS but I couldn’t find it. Either I failed to look properly (which I don’t believe!) or the client jar is not available. Could you please point me to the download resource?

    I have also checked out the blog link. The trial version runs great, but I certainly do not want to pay for an open-source technology, which is contrary to what open source is supposed to be.

    Thanks,
    Tom


  2. Hi Rishu,

    Even I’m trying to find a way to connect to DynamoDB through Pentaho. Could you please give the link where I can download the “DynamoDB client jar from AWS” which you were talking about?


  3. From Connect Client, I’ve downloaded the JDBC driver. I also have the postgresql driver (jar) and placed them both in the lib/ dir. I defined a new connection as you have posted (my host name is from the “Get cluster connection URL” / JDBC URL, but without the leading “jdbc:redshift://”), and I am getting the following in the logs:

    2016/04/06 15:55:12 – aws_redshift – New database connection defined
    2016/04/06 15:55:12 – aws_redshift – Attempting to find connection in Named Datasources
    2016/04/06 15:55:12 – aws_redshift – Unable to find datasource in Named Datasources. Finally will try to attempt connecting using class

    And then moments later, a (failed) connection report shows :
    Error connecting to database [aws_redshift] : org.pentaho.di.core.exception.KettleDatabaseException:
    Error occurred while trying to connect to the database

    Error connecting to database: (using class org.postgresql.Driver)
    Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

    org.pentaho.di.core.exception.KettleDatabaseException:
    Error occurred while trying to connect to the database

    Error connecting to database: (using class org.postgresql.Driver)
    Connection refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.

    Hostname : my-dw-sandbox.bl4hblah.redshift.amazonaws.com
    Port : 5432
    Database name : blueberry


    1. I am not sure what you mean by ‘Connect Client’, but it seems you are having an issue with the driver. I assume there is a conflict between the postgresql driver and the redshift driver. Try removing the postgresql driver and then try connecting.


  4. Hi Rishu,
    I am trying to build a data pipeline between S3 and Redshift, but it is unable to connect to Redshift and returns a Connection Timeout error.
    The username and password are correct, and I am using the following string:
    jdbc:postgresql://hw-1.xxxxxx454.eu-west-1.redshift.amazonaws.com:xxxx/?tcpKeepAlive=true
    Could it be a problem with security groups or VPC security?
    Thanks in advance 🙂


  5. Hi Rishu,
    I am trying to connect Pentaho Schema Workbench to Amazon Redshift, but when I do, Schema Workbench seems to read the whole Redshift database catalog, which in my case has around 15 schemas, and that makes it really slow to do anything.
    Do you know if there is a way to force the Schema Workbench database connection to read only one schema?
    Thank you!


  6. Hi Rishu,

    Excellent post! I have been able to configure Pentaho 6.0 with Redshift as the target and EMR Hive as the source. I am quite disappointed with the read/write I/O performance while pushing data into Redshift; Hive is pretty good by comparison. Do you have any clues to improve read/write performance in Redshift? Right now it is taking hours to push half a million records … thanks indeed, great post.

