Introduction
Have you ever worked with a data preparation platform? Or have you struggled to handle large volumes of data?
Trifacta is a platform for data preparation. Many modern businesses employ it for its reliability and speed: it empowers everyone to prepare diverse and messy data faster. You can use Trifacta's interactive platform to explore data visually, and the prepared output can feed machine learning, analytics, or standard reporting.
In this article, we will explore Trifacta and its Dataprep offering in detail.

Getting Started with Trifacta
Object Overview
In this section, we will explore the objects that can be created in Trifacta and how they relate to one another. You transform sample data by creating the following objects:
- Flows: A flow is the basic unit for organizing your work. It is a container that holds one or more datasets, which can be associated with recipes and other objects. A flow is also the means of packaging Dataprep by Trifacta objects for different types of actions, such as:
- Copying
- Creating relationships
- Executing pre-configured jobs
- Creating references between external flows and recipes.
- Imported datasets: These are references to data imported into the platform. The data itself may not exist within the platform; an imported dataset can reference a database table, a single file, multiple files, or other data types. When added to a flow, it becomes usable.
  - It can be referenced in recipes.
  - It can be created from the Import Data page.
- Recipes: A recipe is a user-defined sequence of steps applied to transform a dataset. Outputs and references are associated with each recipe in a flow. A recipe object is created from an imported dataset, or from another recipe; by creating one recipe from another, you can chain recipes together. Recipes are interpreted by Dataprep.
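To make these relationships concrete, here is a minimal conceptual sketch in Python. This is not Trifacta's actual object model; all class and field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ImportedDataset:
    # A reference to source data (file, database table, ...); the data
    # itself is not copied into the platform.
    name: str
    source_uri: str

@dataclass
class Recipe:
    # A user-defined sequence of transformation steps, created from an
    # imported dataset or chained from another recipe.
    name: str
    steps: List[str] = field(default_factory=list)
    source: Optional["Recipe"] = None

@dataclass
class Flow:
    # The basic unit of organization: a container holding datasets and
    # their associated recipes.
    name: str
    datasets: List[ImportedDataset] = field(default_factory=list)
    recipes: List[Recipe] = field(default_factory=list)

# A flow holding one dataset and two chained recipes.
sales = ImportedDataset("sales", "s3://example-bucket/sales.csv")
clean = Recipe("clean-sales", steps=["remove unused columns", "fix types"])
enrich = Recipe("enrich-sales", steps=["join region lookup"], source=clean)
flow = Flow("sales-prep", datasets=[sales], recipes=[clean, enrich])
```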
Import Basics
Dataprep by Trifacta can import a variety of flat file formats, as well as data from other distributed sources. The platform stores a reference to the imported data as an imported dataset. The steps are as follows:
- Log in to the application.
- Click Library in the menu bar. Click Import Data.
- To add a dataset, follow these steps:
- Select the location where your source data is stored.
- Upload:
  - Select Upload to upload a file from your local desktop. You can also select multiple files to upload.
  - Select the files needed for your source and click Open.
- Backend Storage (like S3):
  - Navigate to and select the files for your source. You can select one or more files.
  - Click the plus icon next to a file's name to add the dataset to the queue.
- Select the Add to new flow checkbox. Dataprep creates a new flow for you, with the imported datasets added to it.
- Click Continue to begin working with the dataset.
- The imported dataset and its flow are created, and you can begin working with the dataset on the Transformer page.
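Imports can also be scripted instead of performed through the UI. The sketch below uses Python's requests library and follows the general shape of Trifacta's v4 REST API, but the endpoint path, payload fields, URL, and token are assumptions; verify them against the API documentation for your deployment.

```python
import requests

BASE_URL = "https://example.trifacta.net"  # hypothetical deployment URL
TOKEN = "YOUR_ACCESS_TOKEN"                # hypothetical API token

# Register a file in backend storage (e.g. S3) as an imported dataset.
resp = requests.post(
    f"{BASE_URL}/v4/importedDatasets",     # endpoint name is an assumption
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "uri": "s3://example-bucket/raw/orders.csv",  # hypothetical path
        "name": "orders",
    },
    timeout=30,
)
resp.raise_for_status()
print("Created imported dataset:", resp.json().get("id"))
```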
Transform Basics
The Transformer page opens when you edit your dataset's recipe. This is where you begin wrangling tasks on a sample of your dataset and build your transformation recipe, with results shown in real time against the sample. Once you are happy with what you see, you can run a job against the complete dataset.
Goal
Data transformation is complete when you have done all of the following:
- Deleted any incorrect, missing, or incomplete values from your data
- Added data from other datasets to your dataset as needed
- Changed your dataset's values to follow the desired schema
- Executed a job against the entire dataset
- Exported the results of your recipe and dataset for use in downstream systems
Recommended Methods for Building Recipes
Dataprep by Trifacta supports the following methods for building recipes on the Transformer page. These methods are listed in order of ease of use:
- Select something: When you choose data elements on the Transformer page, you are presented with a list of suggested steps based on the selection, or on patterns that match it. You can select columns, or one or more values from a column.
- Toolbar and column menus: Through the Transformer toolbar or the column context menus on the Transformer page, you can access pre-configured transformations.
- Search and browse for transformations: You can assemble recipe steps using the Transform Builder and the Search panel, through a simple, menu-driven interface.
Sample
Loading large datasets in Dataprep can overload the browser or degrade its performance, so the application works on a sample of the data. When the results are satisfying, you can apply the job to the entire dataset. By default, the sample is the first set of rows of the source data.
Cleanse
This phase addresses issues in data quality, which can be categorized as follows:
- Consistency
- Validity
- Reliability
When data is imported, it often contains parts that are not required in the final output: extra rows, columns, or specific values. This phase therefore focuses on the following activities (a rough code analogy follows the list):
- Changing data types
- Removing unused columns
- Correcting mismatched and missing data
- Improving the validity, consistency, and reliability of the data
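As an analogy only (this is pandas, not Trifacta's recipe syntax), the same cleansing activities might look like this; the column names and data are made up:

```python
import pandas as pd

# Hypothetical raw data with the kinds of problems the cleanse phase targets.
df = pd.DataFrame({
    "quantity": ["3", "seven", None],
    "region": [" East", None, "WEST "],
    "internal_notes": ["a", "b", "c"],
})

# Change data types: store quantity as a nullable integer.
df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce").astype("Int64")

# Remove unused columns.
df = df.drop(columns=["internal_notes"])

# Correct mismatched and missing data: fill missing regions and
# normalize inconsistent casing and whitespace.
df["region"] = df["region"].fillna("unknown").str.strip().str.lower()
print(df)
```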
Assess Data Quality
You can develop data quality standards that apply to the specific details of your dataset. For example, if your dataset contains square footage for commercial rental properties, a data quality rule could check the sqFt field for values less than zero. Values that break the rule are highlighted in red in the rule's data quality bar, for simple review and triage.
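For intuition, the sqFt rule above can be expressed as a simple check in pandas. Dataprep itself surfaces failing values visually rather than in code, and the data here is invented:

```python
import pandas as pd

properties = pd.DataFrame({"sqFt": [1200, -50, 900, 0]})

# Data quality rule: square footage must not be negative.
violations = properties[properties["sqFt"] < 0]
print(f"{len(violations)} row(s) violate the sqFt >= 0 rule")
print(violations)
```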
Modify
After cleansing your data, you may need to modify the datasets to format them properly for the target systems. You can also specify the level of aggregation or make other modifications.
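For example, if a target system expects one row per region rather than one row per order, this is where you would aggregate. A rough pandas analogy with invented data:

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["east", "east", "west"],
    "amount": [100.0, 250.0, 75.0],
})

# Aggregate to the level the target system expects: one row per region.
by_region = orders.groupby("region", as_index=False)["amount"].sum()
print(by_region)
```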
Enrichment
Before you deliver the dataset to the target system, you may need to augment or enhance it, for example by adding new columns or values from other datasets.
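A common enrichment is a lookup join that pulls columns from a second dataset into your primary one. Again as a pandas analogy with made-up column names:

```python
import pandas as pd

orders = pd.DataFrame({"region": ["east", "west"], "amount": [350.0, 75.0]})
managers = pd.DataFrame({"region": ["east", "west"], "manager": ["Ana", "Raj"]})

# Enrich orders with the regional manager from the lookup dataset.
enriched = orders.merge(managers, on="region", how="left")
print(enriched)
```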
Sampling
Data shown on the Transformer page is a sample of your entire dataset.
- If the dataset is small enough, the sample is your complete dataset.
- For larger datasets, Dataprep by Trifacta automatically creates an initial sample from the first set of rows of the dataset (see the sketch below for why this matters).
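The choice of sample matters because a first-rows sample can miss values that appear only later in the data. A quick pandas illustration with an invented dataset and an arbitrary sample size:

```python
import pandas as pd

df = pd.DataFrame({"status": ["ok"] * 9_000 + ["error"] * 1_000})

head_sample = df.head(1_000)                      # like the default first-rows sample
random_sample = df.sample(1_000, random_state=0)  # a random alternative

print(head_sample["status"].unique())    # ['ok']: the errors are missed
print(random_sample["status"].unique())  # almost always ['ok' 'error']
```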
Profile
As part of the transformation process, you can create and examine visual profiles of specific columns, or of your entire dataset. These interactive profiles make it much easier to find outliers, anomalies, and other problems in your data.
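Dataprep's profiles are visual and interactive, but the underlying idea, per-column summary statistics and value distributions, can be approximated in pandas for intuition (the data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "east", "west", None],
    "amount": [100.0, 250.0, 75.0, 75.0],
})

print(df.describe(include="all"))    # per-column summary statistics
print(df["region"].value_counts())   # value distribution, like a histogram
print(df.isna().mean())              # share of missing values per column
```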
Running Job Basics
This section gives an overview of the basics of running a job.
Configure Job
When you are ready to test your recipe against the entire dataset, click Run on the Transformer page. On the Run Job page, you specify the output format and any compression to apply. Compression is not required unless you are working with a large dataset.
Run Job
Click Run to queue the job for execution. You can track its progress on the Job Details page. If visual profiling was enabled for the job, click the Profile tab. When the job completes, the results can be seen in the Output Destinations tab.
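Jobs can also be queued and monitored programmatically. As with the import sketch earlier, the endpoints, payload, and status values below follow the general pattern of Trifacta's v4 REST API but are assumptions to verify against your deployment's documentation:

```python
import time
import requests

BASE_URL = "https://example.trifacta.net"                # hypothetical URL
HEADERS = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}  # hypothetical token

# Queue a job for a recipe's wrangled dataset (the id is hypothetical).
resp = requests.post(
    f"{BASE_URL}/v4/jobGroups",            # endpoint name is an assumption
    headers=HEADERS,
    json={"wrangledDataset": {"id": 42}},
    timeout=30,
)
resp.raise_for_status()
job_id = resp.json()["id"]

# Poll until the job finishes, mirroring the Job Details page.
while True:
    status = requests.get(
        f"{BASE_URL}/v4/jobGroups/{job_id}/status",
        headers=HEADERS,
        timeout=30,
    ).json()
    print("Job status:", status)
    if status in ("Complete", "Failed", "Canceled"):  # assumed status names
        break
    time.sleep(10)
```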
Iterate
On the Job Details page, the Profile tab lets you review the effects of your recipe's transformations across the entire dataset. Data histograms and statistics provide a broad understanding of how well your transformation recipe works.
Export Basics
You can export the transformed data once you have improved your recipe and obtained a satisfactory result.
Steps
- Click the Job History icon present in the left navbar.
- Click the Job identifier on the Job History page. It will open the job on the Job Details page.
- Select the Output Destinations tab.
Export by
- Direct file download: Select the file you want to download, then click Download the result in the menu on the right-hand side.
- Create a new dataset: New datasets can be created from a generated output. Click the file, then click Create imported dataset in its right-hand context menu.
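Once an output file has been generated, it can also be pulled straight into downstream tooling. A minimal sketch, assuming the job published a CSV to an S3 path you can read (the path is hypothetical, and reading s3:// URLs with pandas requires the s3fs package):

```python
import pandas as pd

# Load the published job output for use in a downstream system.
result = pd.read_csv("s3://example-bucket/outputs/orders_clean.csv")
print(result.head())
```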