Amazon Redshift is a fully managed and highly scalable data-warehouse service in the cloud. You can start with a few hundred GB of data and scale up to a petabyte or more. Redshift falls under the database section of Amazon Web Services (AWS) and is based on PostgreSQL. Redshift lets you scale the cluster as your data grows, making it easier to focus on the analysis part of your BI solutions. Read the official Redshift documentation here for more.
In this blog, I will explain the steps to set up an Amazon Redshift cluster (at least a single test instance) and the process to connect to it using PDI.
Step 1: Registering for AWS and login
First of all, visit Amazon AWS and register for an AWS account. Simply provide all the basic information.
Step 2: AWS Management Console
This is the AWS home page, where you can view all the product services Amazon provides. Check a sample screenshot here.
Step 3: Creating a Redshift Cluster
The first step to create a data warehouse is to launch a set of nodes, called an Amazon Redshift cluster. After you provision your cluster, you can upload your data set and then perform data analysis queries.
Log into Redshift from the AWS Management Console and click on “Launch Cluster”.
Next, you will be asked to provide details such as:
1. Cluster Details:
- Cluster Identifier: A unique name for your cluster.
- Database Name: The name of your database. The default is dev, but you can give a name of your choice.
- Database Port: The port on which the database accepts connections.
- Master Username and Password: The administrative credentials for the cluster.
2. Node Configurations:
Once you are done with the cluster configuration, the next step is to define the nodes. For this blog, I am using a single-node cluster with the dc1.large node type. Based on your usage, you can select from a list of node types to get the memory and storage configuration you need.
3. Additional Configurations:
In the Additional Configurations section (after you hit “Continue” from step 2), you are asked to choose the VPC (Virtual Private Cloud), security details, database encryption, etc. All the nodes are placed in a VPC by default, and AWS provides a default VPC configuration. We need to tweak this default configuration, as explained below.
For now, simply click CONTINUE, and you are done with the single-node Redshift cluster configuration. Below is an image of a test cluster that I have created.
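For repeatable setups, the same single-node cluster can also be provisioned from the AWS CLI. This is only a sketch: the identifier, username, and password below are placeholders, and it assumes the AWS CLI is installed and configured with your credentials.

```shell
# Sketch: provision a single-node Redshift cluster from the AWS CLI.
# All identifiers and credentials below are placeholders -- replace them.
aws redshift create-cluster \
  --cluster-identifier my-test-cluster \
  --cluster-type single-node \
  --node-type dc1.large \
  --db-name dev \
  --port 5439 \
  --master-username masteruser \
  --master-user-password 'ChangeMe123'
```

The parameters mirror the console fields from the steps above: cluster identifier, database name, port, and master credentials.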
Step 4: Changing the Security User Group in EC2
You cannot connect to the Redshift database through clients unless the security group allows access to the cluster's IP and TCP port.
EC2 allows users to rent virtual computers on which to run their own applications. It provides scalable deployment of applications through a web service: a user can boot an Amazon Machine Image to create a virtual machine, which Amazon calls an “instance”, containing any software desired. More on EC2 here.
Now click on the EC2 Dashboard (from the AWS Management Console) and select the Security Groups tab on the right. There you will find the available security groups. You need to change the inbound rules for the security group.
In the above image, for the default VPC security group created during the Redshift cluster configuration, I have added a few extra rules. I have made the SSH and All TCP types publicly accessible. This enables clients or third-party tools to easily access the Redshift cluster.
Ideally, if you are using the cluster for serious work or confidential data, it is recommended to restrict the inbound rule to the specific public IP address from which you connect. If the cluster is only for test purposes (as in my case), it is OK to skip the IP restriction and accept all IPs.
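If you prefer the CLI, the same inbound rules can be added with `aws ec2 authorize-security-group-ingress`. This is a sketch: the security-group ID is a placeholder, and the open 0.0.0.0/0 ranges are only appropriate for test clusters, as noted above.

```shell
# Sketch: open SSH and all TCP ports on the security group.
# sg-0123456789abcdef0 is a placeholder security-group ID.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 22 --cidr 0.0.0.0/0        # SSH

aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 0-65535 --cidr 0.0.0.0/0   # All TCP (test clusters only)
```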
Click here for more on authorizing the inbound rule.
Step 5: Using PDI/Kettle to Access Redshift
Accessing Redshift from Kettle is straightforward once you have configured the cluster as described above. Since Redshift is based on PostgreSQL, Kettle can access the cluster using the PostgreSQL database type.
In Kettle's Database Connection dialog, select the Connection Type PostgreSQL. Give the cluster host name as the Host Name (you can find it in the “Connect Client” section of the Redshift dashboard). The rest of the details are the same as those you provided during the cluster configuration step.
Also make sure that the PostgreSQL JDBC driver is available in the /lib folder of your Kettle installation.
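Before configuring Kettle, you can sanity-check connectivity from the command line with psql. This is a sketch: the endpoint, username, and database name are placeholders for the values from your own cluster's “Connect Client” section.

```shell
# Sketch: verify the cluster is reachable before configuring Kettle.
# The endpoint, user, and database below are placeholders.
psql -h examplecluster.abc123xyz.us-east-1.redshift.amazonaws.com \
     -p 5439 -U masteruser -d dev \
     -c "SELECT version();"
```

If this prompts for the master password and returns a version string, the security-group rules from Step 4 are working.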
AWS Redshift also provides custom Redshift JDBC/ODBC drivers for client connections. In the Redshift dashboard, go to the “Connect Client” section and download the driver as per your requirement.
Custom Driver is the driver class name that Redshift uses. For Redshift JDBC 4.0 the class name is com.amazon.redshift.jdbc4.Driver, and for JDBC 4.1 it is com.amazon.redshift.jdbc41.Driver.
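When using the custom driver, the connection URL uses the jdbc:redshift:// scheme rather than the jdbc:postgresql:// one Kettle builds for the PostgreSQL type. A sketch of the URL format, with a placeholder endpoint:

```shell
# Sketch: assemble the JDBC URL for the custom Redshift driver.
# HOST is a placeholder endpoint in the format Redshift uses.
HOST="examplecluster.abc123xyz.us-east-1.redshift.amazonaws.com"
PORT=5439
DBNAME="dev"
echo "jdbc:redshift://${HOST}:${PORT}/${DBNAME}"
```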
This will help you to connect to Redshift using PDI/Kettle.
Do post your thoughts and comments 🙂
[More on :Loading Data to AWS S3 using Pentaho DI in HERE ]