What is the PDI Client (Spoon) in Pentaho Data Integration (Kettle)?

Original article can be found here (source): Artificial Intelligence on Medium

How to use Spoon and its features? How to customize the tool as per our preference?

We open Spoon using spoon.bat (Windows) or spoon.sh (Linux/macOS), depending on your OS. Both scripts are located in the ~/data-integration/ folder. Once you double-click the right one, you should see the screen below.
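If you would rather pick the right script from code than by hand, here is a minimal Python sketch. The function name `spoon_launcher` is my own invention for illustration, and the folder passed in is assumed to be wherever you unpacked PDI; the sketch only builds the path and does not start Spoon.

```python
import platform
from pathlib import Path


def spoon_launcher(install_dir, system=None):
    """Return the path of the Spoon start script for the given OS.

    `system` defaults to the current OS name as reported by
    platform.system() ("Windows", "Linux" or "Darwin").
    """
    system = system or platform.system()
    # Windows uses the batch file; Linux and macOS use the shell script.
    script = "spoon.bat" if system == "Windows" else "spoon.sh"
    return Path(install_dir) / script


if __name__ == "__main__":
    # Assumption: PDI is unpacked in ~/data-integration, as in this post.
    print(spoon_launcher(Path.home() / "data-integration"))
```

You could hand the returned path to `subprocess.run` to start the client, but double-clicking the file works just as well.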

Disclaimer: There are tons of features in PDI and the PDI Client, and I will not be able to go through all of them. However, I will cover the key and most useful ones.

Spoon welcome screen

It’s a self-explanatory and minimalist welcome screen with four options – work, learn, extend, discover. Let’s click on ‘New transformation’ and see the actual work screen. Below is the screen that you will see, of course without those ugly red lines that I have created for our understanding.

PDI screen by clicking on ‘New Transformation’ from the work section.

The PDI screen can be divided into five segments, as shown above. Let’s understand each segment.

  1. Save and Open: The entire navigation bar lives in this segment. We can create, open and save transformations or jobs here.
  2. PDI Plugins: PDI comes with a lot of preloaded plugins. Plugins are the building blocks (steps) that we add to our workflows. For example, we can choose ‘Microsoft Excel Input’ as a plugin to read Excel files.
  3. PDI Canvas: As shown in the above image, we just need to drag and drop a step (plugin) onto the canvas to create an ETL workflow.
  4. Execute and Debug: We can use these tools to execute and debug our workflows. PDI Client shows real-time logs on the screen itself, which helps in tweaking the configurations as per the requirements.
  5. PDI Repositories: We can have two types of repositories, viz. File and Database. These repositories help us manage the transformation and job files. We will have a separate blog explaining the repositories.

Well, PDI Client (Spoon) allows us to customize the tool as per our preferences.

Click on ‘Tools’ in the navigation bar and then on ‘Options’. You will see the screen below, with a bunch of options that you can tweak as per your preference.

These are some basic settings that you can tweak.

Besides, you can tweak the background colour, font size and font family as per your preferences. However, I like the default minimalist design and colour tone, so I leave them as they are.

You can tweak the font and background colour as well.

How to create transformations and jobs

In PDI, we can create either transformations or jobs. Both are useful for performing various data sourcing, manipulation and loading tasks. In the work section, we can also open an existing transformation (.ktr) or job (.kjb) file.

Let’s create a simple transformation file and a simple job file by clicking on ‘New Transformation’ and ‘New Job’ in the work section.

I will add two simple “dummy” steps, which do nothing, and save the file. You can press ‘Ctrl+S’ to save.

Dragged a dummy step from the PDI Plugin window. You can search for it at the top.

Below are the two files that I saved. Although they carry the .kjb and .ktr extensions, they are simple XML files that store the steps (plugins) and configurations used; the Java code then reads them to run your flows. In our case, the file should contain the XML for the ‘Dummy (do nothing)’ step.

Files saved on my desktop with .kjb and .ktr extensions

Let’s open one in a text editor to check the content. Ideally, though, we are supposed to open these files using PDI Client (Spoon).

Opened the file in a text editor. You can open it in a simple notepad.

As you can see, these are plain XML files storing our configurations. If you search for the XML element ‘<step>’, you will find our step ‘Dummy (do nothing)’.

Our saved step ‘Dummy (do nothing)’
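Since the file is plain XML, any XML parser can list the steps in it. Below is a minimal Python sketch using the standard library. Note that `SAMPLE_KTR` is a hand-written, heavily trimmed stand-in for a saved file, not a copy of the real one; a real .ktr has many more elements, and the `<name>` and `<type>` children shown here are my assumption of the relevant part. The function name `list_steps` is hypothetical as well.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed stand-in for a saved .ktr file (assumption:
# each <step> carries a <name> and a <type> child element).
SAMPLE_KTR = """
<transformation>
  <info>
    <name>dummy_transformation</name>
  </info>
  <step>
    <name>Dummy (do nothing)</name>
    <type>Dummy</type>
  </step>
  <step>
    <name>Dummy (do nothing) 2</name>
    <type>Dummy</type>
  </step>
</transformation>
"""


def list_steps(ktr_xml):
    """Return a (name, type) tuple for every <step> element in the XML text."""
    root = ET.fromstring(ktr_xml.strip())
    return [(s.findtext("name"), s.findtext("type")) for s in root.iter("step")]


for name, step_type in list_steps(SAMPLE_KTR):
    print(f"{name} [{step_type}]")
```

To try it on your own file, read the file's text (for example with `pathlib.Path(...).read_text()`) and pass it to `list_steps` instead of the sample.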

We will create more such flows using multiple steps in our future blogs.

How to add documentation with the flow for future reference?

Documentation is one of the core parts of any programming work. It helps us understand a piece of a program in simple text. PDI allows us to write notes at the transformation/job level and at the step level. I would encourage you to write a brief description of the transformation/job and a detailed note on the role of each specific step (input parameters, logic and output).

You can simply right-click on the canvas and choose the option ‘New Note’ to write at transformation/job level. Below is an example.

A description at the transformation level helps explain the reasoning behind the flow

We can right-click on a particular step and choose the ‘Description’ option to elaborate on the step.

Helps other developers decode the steps
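Because notes and descriptions are saved along with the flow, they can also be pulled out of the file and reviewed outside Spoon. The Python sketch below assumes that canvas notes are stored as `<notepad>` elements with a `<note>` child and that a step's description sits in a `<description>` child of `<step>`; please check these element names against one of your own saved files before relying on them. The sample XML and the function name `collect_docs` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample. The element names below are assumptions about
# how the notes and descriptions are saved, not a copy of a real file.
SAMPLE_KTR = """
<transformation>
  <notepads>
    <notepad>
      <note>Placeholder flow with two dummy steps.</note>
    </notepad>
  </notepads>
  <step>
    <name>Dummy (do nothing)</name>
    <type>Dummy</type>
    <description>Does nothing; replace with a real input step.</description>
  </step>
</transformation>
"""


def collect_docs(ktr_xml):
    """Return (canvas notes, {step name: step description})."""
    root = ET.fromstring(ktr_xml.strip())
    notes = [n.findtext("note", default="") for n in root.iter("notepad")]
    descriptions = {
        s.findtext("name"): s.findtext("description", default="")
        for s in root.iter("step")
    }
    return notes, descriptions


notes, descriptions = collect_docs(SAMPLE_KTR)
print("Notes:", notes)
for step_name, text in descriptions.items():
    print(f"{step_name}: {text}")
```

A step without a description simply shows up with an empty string, which makes undocumented steps easy to spot.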

What is a PDI marketplace?

Although PDI comes with a lot of plugins out of the box, we can also install custom-built plugins available in the Marketplace. Click on the Marketplace option in the ‘Extend’ section of the Welcome screen.

Please make sure a plugin is a stable release before using it in production. You can read its description and rating.

There are tons of plugins available in the marketplace.

You will have to restart the PDI Client (Spoon) after the installation. All plugins are stored in the ~/data-integration/plugins or ~/data-integration/lib folder.
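To check what actually landed in the plugins folder after an install, you can list its sub-folders. This is a small Python sketch that assumes each plugin lives in its own sub-folder, which may not hold for every plugin; the function name `list_plugin_folders` is hypothetical.

```python
from pathlib import Path


def list_plugin_folders(plugins_dir):
    """Return the sorted names of the sub-folders inside plugins_dir.

    Returns an empty list if the folder does not exist.
    """
    root = Path(plugins_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Assumption: PDI is unpacked in ~/data-integration, as in this post.
    for name in list_plugin_folders(Path.home() / "data-integration" / "plugins"):
        print(name)
```

Running it before and after a Marketplace install is a quick way to see which folder a plugin added.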

What is a PDI community?

PDI has a growing community of users and contributors. Whenever I have struggled with a particular problem, I have turned to the Pentaho Community Forum for solutions, much like Stack Overflow for other languages. Most of the time, others have already faced the same problem and solved it. Here’s the link to the forum

Conclusion

I hate theory and learn only by implementing a particular use case. We will do exactly that from the next blog post onward. We will create and solve three real-world use cases, in which we will use multiple PDI steps and understand data types, if statements, loops, etc.

See you in the next post. Happy ETL