Streaming data from the Twitter API matters a great deal from a data analytics perspective. Getting the pulse of your user community on the web and across different geographies is valuable input when making big decisions. Pentaho Kettle provides a few steps to read or stream data from Twitter, and there is already a Twitter sample in the PDI installation directory. That sample may no longer work, however, because the authentication system of the Twitter API has changed: Twitter now requires OAuth for third-party access to its data.
So in this blog I will share a few steps to stream Twitter data using OAuth:
STEP 1: Register an Application on Twitter (if you haven’t done so yet):
The very first step is to register an application on Twitter. Click on this link and register an application for yourself.
STEP 2: The Authentication details of the App:
Once you have registered your app on Twitter, you will find a few details shown. Check the images below:
In the above images, my application name is EnigmaRishu, and Twitter provides various keys and access tokens. These keys and tokens are required in the request header when calling the Twitter API from PDI.
STEP 3: Building a Transformation:
Define Parameters: This Data Grid step is where we define all the parameters required for the authentication process. The parameters are documented on the Twitter developer site. The token and key values must be those of your own registered application, so they differ from user to user.
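To make the parameter list a little more concrete, here is a minimal sketch of the standard OAuth 1.0a fields, written in Python rather than as a PDI Data Grid. The function name and the credential strings are placeholders of mine, not part of the sample transformation. Note that oauth_nonce and oauth_timestamp are not issued by Twitter: they are generated fresh for every request, and oauth_signature is computed afterwards from the other values.

```python
import secrets
import time


def build_oauth_params(consumer_key, access_token):
    """Return the OAuth 1.0a parameters that accompany each request.

    oauth_signature is deliberately absent: it is computed later from
    these values plus the request method, URL, and query parameters.
    """
    return {
        "oauth_consumer_key": consumer_key,        # from the app's settings page
        "oauth_token": access_token,               # from the app's settings page
        "oauth_signature_method": "HMAC-SHA1",
        "oauth_version": "1.0",
        "oauth_nonce": secrets.token_hex(16),      # random, unique per request
        "oauth_timestamp": str(int(time.time())),  # seconds since the epoch
    }


# Placeholder credentials, for illustration only.
params = build_oauth_params("YOUR_CONSUMER_KEY", "YOUR_ACCESS_TOKEN")
```

Only the consumer key and the access token come from the application page; everything else is either a constant or generated at request time.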
Generate Header: This is a JavaScript step that includes code developed by Paul Johnston (a JavaScript implementation of the Secure Hash Algorithm, SHA-1) and by Netflix to sign the request with the keys and tokens. All we need to do is pass the secure data to the header along with the query.
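For readers who want to see what the header generation boils down to, here is a rough sketch of OAuth 1.0a HMAC-SHA1 signing using only the Python standard library. It is not the JavaScript used in the transformation; the function names, the search URL, and all credential values are assumptions for illustration.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value):
    """Percent-encode per RFC 3986 (only unreserved characters stay as-is)."""
    return quote(str(value), safe="-._~")


def signature_base_string(method, url, params):
    """Join the method, the URL, and the sorted, encoded parameters."""
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    param_string = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), pct(url), pct(param_string)])


def sign(base_string, consumer_secret, token_secret):
    """HMAC-SHA1 the base string; the key is the two secrets joined by '&'."""
    key = f"{pct(consumer_secret)}&{pct(token_secret)}".encode("utf-8")
    digest = hmac.new(key, base_string.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def authorization_header(oauth_params, signature):
    """Format the oauth_* values as an Authorization header value."""
    fields = dict(oauth_params, oauth_signature=signature)
    parts = [f'{pct(k)}="{pct(v)}"' for k, v in sorted(fields.items())]
    return "OAuth " + ", ".join(parts)


# Placeholder values, for illustration only.
oauth_params = {
    "oauth_consumer_key": "YOUR_CONSUMER_KEY",
    "oauth_token": "YOUR_ACCESS_TOKEN",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_version": "1.0",
    "oauth_nonce": "0123456789abcdef0123456789abcdef",
    "oauth_timestamp": "1451900000",
}
query = {"q": "pentaho"}
url = "https://api.twitter.com/1.1/search/tweets.json"  # assumed endpoint

# The signature covers the oauth_* values and the query parameters together.
base = signature_base_string("GET", url, {**oauth_params, **query})
signature = sign(base, "YOUR_CONSUMER_SECRET", "YOUR_TOKEN_SECRET")
header = authorization_header(oauth_params, signature)
```

The query parameters are part of the signed data, so the query sent in the request has to match the one that was signed exactly.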
Calling Twitter REST Client: This is a REST Client step, which calls the REST API of Twitter to read the tweets. The URL is the combination of the Twitter search URL and the query string. Check the sample code (linked below) for more details.
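As a small illustration of how the request URL is put together, the sketch below joins a search URL and a query string in Python. The endpoint shown is the v1.1 search endpoint, which I am assuming matches what the transformation calls; the function name and the example query are made up.

```python
from urllib.parse import quote, urlencode

SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"  # assumed endpoint


def build_search_url(query, count=100):
    """Append a percent-encoded query string to the search URL."""
    params = {"q": query, "count": count}
    # quote_via=quote encodes spaces as %20 instead of '+', which keeps the
    # URL consistent with the percent-encoding used when signing.
    return SEARCH_URL + "?" + urlencode(params, quote_via=quote)


url = build_search_url("#pentaho kettle")
```

Characters such as '#' and spaces have to be encoded before the URL is handed to the REST Client step.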
Output – Twitter Result: The final search result, with all the Twitter data. The raw data is in JSON format. If you want to analyze the JSON further, use a step such as JSON Input to parse each section of the data.
Having said that, Twitter limits the streaming data for security reasons. More details are in the official documentation here.
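To give an idea of what parsing that JSON involves, here is a minimal Python sketch that flattens a search response into rows. The sample payload is invented and trimmed to a handful of fields; a real response carries many more, and the selection of fields here is only an assumption about what might be useful.

```python
import json

# Invented sample, shaped like a v1.1 search response but heavily trimmed.
sample_response = """
{
  "statuses": [
    {"id_str": "1001", "created_at": "Mon Jan 04 10:00:00 +0000 2016",
     "text": "Trying out Pentaho Kettle", "user": {"screen_name": "example_user"}},
    {"id_str": "1000", "created_at": "Mon Jan 04 09:59:00 +0000 2016",
     "text": "Reading tweets with PDI", "user": {"screen_name": "another_user"}}
  ],
  "search_metadata": {"count": 2}
}
"""


def extract_tweets(raw_json):
    """Return one flat dict per tweet found under the 'statuses' key."""
    payload = json.loads(raw_json)
    rows = []
    for status in payload.get("statuses", []):
        rows.append({
            "id": status.get("id_str"),
            "created_at": status.get("created_at"),
            "user": status.get("user", {}).get("screen_name"),
            "text": status.get("text"),
        })
    return rows


rows = extract_tweets(sample_response)
```

In PDI the same flattening would be done with JSON paths in the JSON Input step rather than with code.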
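One practical consequence is that a single search request returns at most 100 tweets, so collecting more means asking for older pages with the max_id parameter. The sketch below shows that paging loop in Python. The fetch_page callable is hypothetical: it stands in for a signed request made with the given max_id and count, and here it is replaced by a fake that serves 250 made-up tweets. Rate limits still apply to every request.

```python
def collect_tweets(fetch_page, limit, page_size=100):
    """Collect up to `limit` tweets by walking backwards with max_id.

    fetch_page(max_id, count) is a hypothetical callable returning a list
    of tweet dicts, newest first, with IDs no greater than max_id.
    """
    collected = []
    max_id = None
    while len(collected) < limit:
        page = fetch_page(max_id, min(page_size, limit - len(collected)))
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # Ask next for tweets strictly older than the oldest one seen.
        max_id = min(int(tweet["id_str"]) for tweet in page) - 1
    return collected


def fake_fetch(max_id, count):
    """Stand-in for the real request: serves made-up tweets with IDs 250 to 1."""
    top = 250 if max_id is None else min(max_id, 250)
    return [{"id_str": str(i)} for i in range(top, max(top - count, 0), -1)]


tweets = collect_tweets(fake_fetch, limit=230)
```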
PDI Sample code:
The sample PDI transformation explained above is in the GitHub repo here.


Can you let me know how I can import the KTR file into Kettle and see the design?
Hi, can you give me more details on exactly what you are looking to achieve? Sorry, I didn’t quite get the question!
Hi! Can you help me? When I run the transformation, I can only get 100 tweets. Is there a configuration to increase the number of tweets returned?
Hi angel, of course there is a limit to the number of tweets you can access. Twitter has a limit chart based on the access rate, due to security concerns. You can check this link for more: https://dev.twitter.com/rest/public/rate-limits. Hope it helps 🙂
Hello Rishu,
Thanks for putting this together and sharing! Unfortunately I keep getting errors like this (PDI 6.01, Sun JDK 7):
2015/12/23 17:33:14 – Generate Header.0 – TypeError: Cannot call method “indexOf” of null (script#659)
Any ideas on how to work around this? I’ve made the logging more verbose and added some additional debug logging which shows the URL is constructed correctly, but in some cases when the function ‘getBaseString’ is called, message.action is empty…
@mike Can you share your code with me? It’s a bit difficult to debug this from the error log alone. You can send me the code at: rishu.shrivastava@gmail.com
What to enter in oauth_nonce, oauth_signature, oauth_timestamp in the grid?
@Andreas You first need to register your app on the Twitter dev site. Once done, you will find the details there.
I registered my app, but there are no values shown for oauth_nonce/signature/timestamp. I can’t see them in your screenshots either.
Hi, @Rishu. I’m using Spoon 5.4.0.1 with api.twitter.com v1.1. The problem is: 2016/06/28 13:10:55 – Calling Twitter REST Client.0 – Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target.
I’ve already generated the certificate for api.twitter.com and updated cacerts in JAVA_HOME. It seems to have had no effect on Spoon.
Do you know why this is happening? Thank you so much.
Hi Rishu,
Thank you for this post!
I’ve used your workflow to connect to the Magento 2 OAuth 1.0 API with a few modifications, but unfortunately I got an “Invalid signature” response; the same request and parameters work fine in the Firefox REST client…
Did you face this problem during your build?
Thanks, Thomas