A Workflow is a sequence of connected steps that defines a process of transforming and manipulating your data. Data Studio provides a number of default Workflow steps. The area where Workflows are viewed or edited is called the Workflow canvas.
If your Workflow is complex or tackles many actions, it can become difficult to manage and test. You can resolve this by creating re-usable Workflows.
Duplicate data to multiple outputs.
The rows from the source data will be duplicated (branched) into two outputs. To create more than two duplicates of the data, you can link several Branch steps.
This step allows you to convert column(s) into rows by clicking Selected columns in the column list on the left.
Specify the following for the new columns:
To do the reverse, use the Rows to columns Workflow step.
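For reference, the reshaping this step performs is the classic unpivot (melt). A minimal pandas sketch, using invented phone-number columns (pandas is only an illustration here, not what Data Studio runs):

```python
import pandas as pd

df = pd.DataFrame({
    "Customer ID": [1, 2],
    "Home phone": ["0113 111111", "0113 222222"],
    "Work phone": ["0113 333333", "0113 444444"],
})

# Each selected column becomes a (name, value) pair on its own row.
rows = df.melt(id_vars=["Customer ID"],
               value_vars=["Home phone", "Work phone"],
               var_name="Phone type", value_name="Phone number")
print(rows)
```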
This step allows you to export your Workflow's output data to a file or an External system that uses a JDBC connection.
Each Workflow can contain multiple Export steps, so you can export data at any point. Each of these steps can be configured individually.
Click Export Settings in the step dialog to specify where and how the data will be exported.
Specify the Exporter (.csv, .txt, .xls, .xlsx or .json).
Enable Overwrite existing to allow a file with the same name to be exported more than once (the latest copy will replace the previous one). If this option is disabled, you will get an error when trying to export to the same file name more than once.
Column name style determines how the column names will be displayed in the export file:
Specify the Character set (most sets are supported).
"INSERT INTO tablename (column1, …) VALUES (value1, …)"
"UPDATE tablename SET (column1=value1, …) WHERE (columnName=value, … )"
"DELETE FROM tablename WHERE columnName=value, … "
Filter your data by selecting a column and entering a comparison value to match against using an operation like Equals, Contains or In Range. To filter on multiple columns and/or multiple comparison values, create an advanced Function that returns a boolean (true or false) value.
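As an illustration of what such a boolean Function does, here is a pandas sketch (not Data Studio's Function syntax; the column names are invented):

```python
import pandas as pd

df = pd.DataFrame({"Country": ["UK", "US", "AU", "UK"],
                   "Age": [25, 67, 40, 17]})

# An advanced filter is an expression returning true/false per row,
# here combining two columns and multiple comparison values.
passes = df["Country"].isin(["UK", "AU"]) & df["Age"].between(18, 65)
print(df[passes])
```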
In Explore mode, right-click on a value in a column and select one of the options Keep or Remove to show or exclude all rows containing the chosen value.
The Split Workflow step is the same as Filter but has the benefit of two outputs to see the rows that pass the filter and the rows that fail the filter.
The Find duplicates step uses powerful standardization and matching algorithms to group together records containing similar contact data (e.g. name, address, email, phone) and keep that information within a duplicate store. Each group of records, known as a cluster, is assigned a unique cluster ID and a match level. The step provides out-of-the-box functionality for the United Kingdom, Australia, and the United States, but is also completely configurable down to the most granular name and contact elements.
Fires a user-defined event from the Workflow.
You can kick off custom events at any point in your Workflow.
Tick the Only fire if input has rows checkbox for the step to be kicked off only when the input has data.
For example, you want to kick off an event when the Validate step has failures. Select the 'Validation failed' custom event and specify the reason for the failure in the 'Reason' event data item. This reason will now be available as an event data item when creating/editing notifications.
Another example is using a Filter step to identify records that you have particular interest in. The Fire event step can then follow the Filter step but only fire when the Filter step is producing records.
Grouping will reduce the number of rows in a table.
The default output of a Group step is the Count aggregate, which returns the total count of rows. This count will then be broken down by each unique value in the group column.
Moving columns from ungrouped to grouped will show the distinct/deduplicated values in the column (along with the optional Count of the number of times each combination of values appears in the data).
The Add aggregate button is used to create new column(s) containing a value or metric for each group.
Give the new aggregate column a name or optional suffix value if applying an aggregate on Multiple columns, then select the type of aggregate to use.
Aggregate | Description |
---|---|
Count | Returns a count of the number of items in each group. |
Count excluding nulls | Returns a count of the number of items in each group ignoring null values. |
First value | Returns the first value in the column for each group. |
First populated value | Returns the first value in the column for each group that is not null. |
Last value | Returns the last value (based on input row order) in the column for each group. |
Minimum value | Returns the minimum value in the column for each group. |
Maximum value | Returns the maximum value in the column for each group. |
Grouped values | Returns a comma separated list of all values in each group. The order is as they appear in the input. |
Sorted grouped values | Returns a comma separated list of all values in each group sorted A to Z. |
Reverse grouped values | Returns a comma separated list of all values in each group sorted Z to A. |
Average | Returns the average (mean) of the values in each group. |
Median | Returns the median (based on a sort of the values) of the values in each group. |
Sum | Returns the sum of the values in each group. |
Standard deviation | Returns the standard deviation of the values in each group. |
Standard deviation (population) | Returns the standard deviation of the values in each group where it is known that the values represent the whole population. |
Variance | Returns the statistical variance of the values in each group. |
Variance (population) | Returns the statistical variance of the values in each group where it is known that the values represent the whole population. |
Group ID | Returns a numeric value that can be used to identify each group. |
Non-numeric values will be excluded (ignored) from numeric aggregation calculations. Zero will be returned by default if the values are all non-numeric.
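To make the aggregate behaviors concrete, here is a rough pandas equivalent of a Group step with several aggregates (illustrative only; the column names are invented and this is not how Data Studio computes them):

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "South"],
    "Sales": [100.0, 150.0, 80.0, None, 120.0],
})

# One row per group; each aggregate becomes a new column.
result = df.groupby("Region").agg(
    count=("Sales", "size"),            # Count (all rows in the group)
    count_non_null=("Sales", "count"),  # Count excluding nulls
    minimum=("Sales", "min"),           # Minimum value
    maximum=("Sales", "max"),           # Maximum value
    average=("Sales", "mean"),          # Average (nulls ignored)
    total=("Sales", "sum"),             # Sum
    grouped=("Sales", lambda s: ", ".join(s.dropna().astype(str))),  # Grouped values
)
print(result)
```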
In Explore mode you can drill into a particular group to see the rows that belong to it by right-clicking and selecting View rows in group.
This ability to drill down into grouped values will be retained if the step is connected to any of the following subsequent steps: Branch, Chart, Export, Filter, Fire event, Output, Sample, Sort, Split, Take Snapshot, Transform and Union.
Harmonization is the process of deduplication: taking multiple records that are deemed to represent the same underlying real-world subject and producing a single resulting record.
Join two inputs into one based on one or more columns from each input.
To specify the columns you want to join, click Left side columns or Right side columns.
Use the Retention type menus to specify different combinations of join. The top menu is for the first join input and the bottom one for the second join input.
Click on the Venn diagram to specify the type of join you want to apply:
The Join step provides three matching options which determine how keys will be compared:
In the join dialog, click Show step results to view the results of the join. Specify which values from the joined columns to use in any following steps: all the matches, only duplicates or only singletons.
Click to see suggested join keys. Joins will be suggested only where column names match exactly on the two sides of the join. Click on a suggested join to apply it.
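The retention-type combinations correspond to the standard join types. A small pandas sketch of the Venn-diagram choices (an illustration under invented data, not Data Studio's join engine):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cyd"]})
right = pd.DataFrame({"id": [2, 3, 4], "city": ["Leeds", "York", "Bath"]})

# The Venn-diagram choices correspond to the standard join types.
inner = left.merge(right, on="id", how="inner")     # matches only
left_all = left.merge(right, on="id", how="left")   # keep all left rows
full = left.merge(right, on="id", how="outer")      # keep all rows from both sides
print(full)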
Allows you to match values from a column in the first input (source) to a column in the second input (the lookup table / lookup input). Values are returned from a second column from the lookup table after an operation is applied to the set of values for the matched rows.
You can use any source in a Workflow as the input for this step: a Dataset, View, or the output from another step. You can even use the same output node for both the source and the input (if you want to do a self-lookup).
To add multiple lookup tables, click Add additional lookup table.
Click Definitions to open up the configuration panel, and Add definitions to allow you to define a new lookup. Multiple lookup definitions can be created in a single Lookup step. A lookup definition consists of the following options:
The name of the resulting new column from the lookup definition.
A lookup type is the operation applied to the value (or values, if more than one match exists in the lookup table) returned by the Lookup. These types/operations are applied to the set of values for the return column where the lookup column matches the lookup value.
Lookup type | Description |
---|---|
Count | Returns a count of the number of matches that the lookup value has in the lookup column. |
Count excluding nulls | Returns a count of the number of matches that the lookup value has in the lookup column ignoring nulls. |
Exists | Returns true if the lookup value has a match in the lookup column, false if not. |
Not exists | Returns false if the lookup value has a match in the lookup column, true if not. |
First value | Returns the first value (based on input row order) from the return column where the lookup value matches the lookup column. |
First populated value | Returns the first non-null value (based on input row order) from the return column where the lookup value matches the lookup column. |
Last value | Returns the last value (based on input row order) from the return column where the lookup value matches the lookup column. |
Minimum value | Returns the minimum value (based on a sort of the values) from the return column where the lookup value matches the lookup column. |
Maximum value | Returns the maximum value (based on a sort of the values) from the return column where the lookup value matches the lookup column. |
Sum | Returns the sum of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be counted as '0'. |
Average | Returns the average (mean) of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric. |
Median | Returns the median (based on a sort of the values) of the values in the return column where the lookup value matches the lookup column. |
Variance | Returns the statistical variance of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric. |
Variance (population) | Returns the statistical variance of the values in the return column where the lookup value matches the lookup column where it is known that the values represent the whole population. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric. |
Standard deviation | Returns the standard deviation of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric. |
Standard deviation (population) | Returns the standard deviation of the values in the return column where the lookup value matches the lookup column where it is known that the values represent the whole population. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric. |
All values | Returns a comma separated list of all values in the return column where the lookup value matches the lookup column. The order is as they appear in the input. |
Reverse values | Returns a comma separated list of all values in the return column where the lookup value matches the lookup column. The order is a reverse sort of the values. |
Sorted values | Returns a comma separated list of all values in the return column where the lookup value matches the lookup column. The order is a sort of the values. |
The column from the lookup table that you want the above operations to act on. This option is not present if you have selected the Exists or Not Exists lookup types, as no value from the lookup table is returned.
This is where you define how you want to match values between your lookup value and lookup column. The selected match type will apply to both sides.
Match type | Description |
---|---|
Exact | Will only match if the value and datatype exactly match. |
Ignore datatype | Will match if the values match, ignoring the datatype (e.g. alphanumeric '1' will match numeric value 1). |
Case-insensitive | Will match if the values match, ignoring the casing (e.g. 'experian' will match 'experIAN'). |
Denoise | Will match if the values match after special characters / whitespace etc. are removed (e.g. 'Experian' will match 'Ex'per ian'). |
Format | Will match if the formats of the two values are the same (e.g. '123abc' will match '456def'). |
Soundex | Will match if the values match after going through the Soundex phonetic algorithm. |
Metaphone | Will match if the values match after going through the Metaphone phonetic algorithm. |
Double metaphone | Will match if the values match after going through the Double metaphone phonetic algorithm. |
If selected, each value from the return column will only be used once per lookup value match group. Only present for lookup types which have the potential to operate across multiple values, for example Average and All values.
The value that is returned if no match for the lookup value is found. This field can be left blank in which case a null value is returned. Some lookup types will never use this (e.g. Exists).
The column from the lookup table that contains the values you want to match on.
The column from the source that contains the values you want to match on.
Allows you to define additional pairs of lookup columns (additional match criteria) to further narrow matches.
A source Dataset contains customer information alongside a column with a "discount code":
Customer ID | Code |
---|---|
1 | A |
2 | A |
3 | B |
4 | C |
5 | A |
6 | 9 |
Another Dataset contains the "discount code" value alongside the rate that each code represents. This is the data we will look up into:
Discount Code | Rate |
---|---|
A | 5 |
B | 10 |
C | 15 |
We want to define a lookup that returns the appropriate rate for each discount code. The rate will be added as a new column in the customer (source) Dataset.
To do this, in a Workflow add a Lookup step with the customer dataset as the source input and the discount code dataset as the lookup input.
The lookup definition is set up as follows:
Resulting table:
Customer ID | Code | Discount Rate |
---|---|---|
1 | A | 5 |
2 | A | 5 |
3 | B | 10 |
4 | C | 15 |
5 | A | 5 |
6 | 9 | NO MATCH |
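The same lookup can be sketched in pandas: build a mapping from the lookup table, apply it to the source column, and fall back to the no-match value (an illustration of the logic only):

```python
import pandas as pd

customers = pd.DataFrame({"Customer ID": [1, 2, 3, 4, 5, 6],
                          "Code": ["A", "A", "B", "C", "A", "9"]})
rates = pd.DataFrame({"Discount Code": ["A", "B", "C"],
                      "Rate": [5, 10, 15]})

# Map the lookup column to the return column, apply it to the source
# column, then substitute the no-match value.
mapping = rates.set_index("Discount Code")["Rate"]
customers["Discount Rate"] = customers["Code"].map(mapping).fillna("NO MATCH")
print(customers)
```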
Lookups and Joins can both achieve a lot of the same things when combining two data sources. However, their performance is optimized for different scenarios, so for large data volumes it is important to understand these before choosing which one to use for your Workflow.
In general, we recommend lookups for use cases where the lookup table is small (e.g. a domain file of a few thousand, typically unique, values) or when you need to perform an operation on the results (e.g. calculate an average for the matched values). Joins are better suited for large-to-large file operations.
More specifically, for high-cardinality columns (e.g. true unique IDs), lookups can take at least twice as long as an equivalent join. For low-cardinality (i.e. low uniqueness) / high row count keys, lookup indexes will build quickly compared to the corresponding join index.
This step is used to map columns in your data to a target schema definition. The target schema must be a Dataset that is already loaded into Data Studio.
Map to target is often used in combination with the Export step to populate a target database schema but can also be used to rename multiple columns easily.
Auto-map functionality will attempt to map columns automatically where possible, based on column names in the source (input columns) and the target (output columns). You can also map columns manually.
If an input column is not selected in the mapping for a given target column, the values in the target column will be null.
If Include in output is unchecked for a target column, that column is not included in the step's output. Use this setting when mapping to a schema that includes auto-generated columns in the target database.
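Conceptually the step behaves like the following pandas sketch, in which the target schema, the column mapping and the column names are all invented for illustration:

```python
import pandas as pd

source = pd.DataFrame({"cust_id": [1, 2], "full_name": ["Ann", "Bob"]})

# Hypothetical target schema and column mapping.
target_columns = ["CustomerId", "Name", "CreatedDate"]
mapping = {"cust_id": "CustomerId", "full_name": "Name"}

# Unmapped target columns (CreatedDate) come out null; dropping a column
# from target_columns mirrors unchecking Include in output.
mapped = source.rename(columns=mapping).reindex(columns=target_columns)
print(mapped)
```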
When a re-usable Workflow uses the Output step, the step's data will be available in other Workflows.
Click Show step results to view the result from the previous step that connects to the Output step.
Profile your data at any point in a Workflow.
We examine each value and identify various attributes (such as type, min/max values, count) and determine its uniqueness and completeness.
Click Profiled columns to select the column(s) you want to profile. In the profiled columns list on the left, select the required ones and click Apply. To see the results, click Show step results in the step dialog.
You can now export or take a snapshot of your profiled data to track data uniqueness over time.
View up to three inputs side by side (horizontally or vertically). Nest these steps for more flexibility over layout.
Click on Vertical split to change it to horizontal.
To make a report available to Data Consumer users, click Expose to Data Consumer users.
To view the report, click Show report.
This step allows you to convert row(s) into columns by clicking Settings.
The step results will consist of the new column(s) and all other columns except for the Column with rows to split in new columns and Column with rows to split in column values.
To do the reverse, use the Columns to rows Workflow step.
You can use this step together with Group and Export to create a pivot table.
For example, you have the following car sales data:
Year | Brand | Sold | Product ID | Model |
---|---|---|---|---|
2019 | Honda | 10 | 1 | HRV |
2019 | Mazda | 15 | 2 | CX5 |
2020 | Honda | 7 | 3 | Civic Si |
2019 | BMW | 3 | 4 | 3 Series |
2019 | Honda | 5 | 5 | CRV |
2020 | Mazda | 50 | 6 | Mazda 3 |
What you're interested in is the number of Honda and Mazda cars sold per year:
Year | Honda | Mazda |
---|---|---|
2019 | 15 | 15 |
2020 | 7 | 50 |
To achieve that:
Add a Source step with the relevant Dataset.
Connect the Rows to columns step.
Click Settings to configure it:
- Select Brand as the Column to split.
- Select Sold as the Values column.
- Click Profile for new values. The values Honda, Mazda, BMW should be automatically populated.
- Remove the BMW value and save changes.
Connect the Group step.
Click Grouping to configure it:
- Remove Count from the Grouped and aggregated columns.
- Move Year from the Ungrouped columns to the Grouped and aggregated columns.
- Click Add aggregate and enter the following values:
  - Column selection type: Multiple
  - Selected columns: Honda and Mazda
  - Aggregate: Sum
- Click Apply twice, to save changes and to apply changes to the groupings.
Connect the Export step.
Click Show step results to preview the results and export as required.
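For comparison, the same pivot can be expressed in a few lines of pandas (a sketch of the logic, not how Data Studio executes the steps):

```python
import pandas as pd

sales = pd.DataFrame({
    "Year":  [2019, 2019, 2020, 2019, 2019, 2020],
    "Brand": ["Honda", "Mazda", "Honda", "BMW", "Honda", "Mazda"],
    "Sold":  [10, 15, 7, 3, 5, 50],
})

# Split Brand values into columns, summing Sold per Year,
# then keep only the Honda and Mazda columns (dropping BMW).
pivot = sales.pivot_table(index="Year", columns="Brand",
                          values="Sold", aggfunc="sum")
print(pivot[["Honda", "Mazda"]].reset_index())
```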
Limit the number of rows or take a random sample of rows.
Examples for consecutive rows:
Examples for non-consecutive rows:
Sort your data in ascending/descending order.
Move columns from 'unsorted columns' to 'sorted columns', set the sort order (min to max or max to min) for each column using Flip sort direction or the up/down arrow on the right of each 'sorted column'.
Select a 'sorted column' and use the 'move up'/'move down' buttons to change the priority order if sorting on multiple columns.
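A multi-column sort with a per-column direction, sketched in pandas (column names invented; only an illustration of the behavior):

```python
import pandas as pd

df = pd.DataFrame({"Country": ["UK", "US", "UK"], "Age": [40, 30, 25]})

# The list order sets the sort priority; ascending gives the
# direction (min to max or max to min) per column.
print(df.sort_values(by=["Country", "Age"], ascending=[True, False]))
```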
Your data source, often the first step in a Workflow. Select the required Dataset to start manipulating and transforming your data.
The options you see in the Source step will depend on the properties of the source you select.
Enable the Delete batches on completion setting to automatically remove processed batches from a Dataset when the Workflow successfully completes. This option is only available if the Dataset is multi-batch and has the Allow automatic batch deletion option enabled.
Enable the Allow auto-refresh setting to automatically load a new batch of data into the selected Dataset when the Workflow is executed. If a View is selected, its underlying Dataset will be refreshed, even if that Dataset is in a different Space.
When Allow auto-refresh is enabled, you have the option to Stop execution when refresh failed. This setting causes a Workflow execution to fail if the refresh could not be completed, for example if the External System could not be accessed due to network or authentication issues. When the setting is unchecked, the Workflow will process the existing data if the Dataset could not be refreshed.
To further improve data quality, Data Studio can capture lineage metadata of data sources. Lineage metadata can be included in Workflow outputs and subsequently used for further integration or processing.
In the Source step, the available lineage metadata depends on the source type and can be included in the output using the dropdown under More Options after a Dataset has been selected.
Combine two inputs into one by presenting the values next to each other.
Data in the columns is not combined or merged; duplicate column names are renamed with an appended numeric value. The output (combined values from two sources) can then be used as input to any further step(s).
Click Show step results to view the results.
Apply a true/false filter to one input, splitting it into passing and failing rows.
To create a filter, click Filter in the step dialog and Create in the left-hand menu. Specify the necessary filters and click Apply to save the changes.
Click Show passing/failing rows to view the values that passed or failed the filter. The passing and failing results can be individually linked to any further step(s).
Snapshots are copies of data that allow you to track changes over time or store your results for use elsewhere. A Snapshot is a type of Dataset that's created by and stored in Data Studio.
Connect the Take Snapshot step's input to the data you'd like to capture in a Snapshot. When the Workflow is run, a new batch of data will be created.
This data can then be added to an existing or new Dataset:
Select an existing one from the Dataset list.
Use Create new Dataset to create a new one.
Name - add a name as it will appear in Data Studio.
Summary - an optional short summary of the Dataset.
Description - an optional, longer description.
Dataset type - choose one of the options: Single batch will write the latest data to the Dataset and keep no history. Multi batch will retain older data in the Dataset allowing for trend analysis.
Add Batch Timestamp Column - When selected, an additional column will be added to the Dataset, recording the timestamp that the batch was created. This option is useful for trend analysis.
Allow automatic batch deletion - This will ensure that data batches which have been used in a Workflow are deleted after they have been processed (i.e. after the Workflow using those batches has been run). This option is intended to be used so that no batch is processed through a Workflow twice. It's used in conjunction with the Delete batches on completion setting in the Source step.
Publish to ODBC - Make the Dataset visible through ODBC connections.
Snapshots can be used like other Datasets:
Manipulate and configure your data.
This step allows you to add/remove various Functions and arrange your columns.
Combine two inputs into one output by presenting the values on top of each other. The source columns with matching headings will be combined.
The output (combined values from two sources) can then be used as input to any further step(s).
Click Show step results to view the results.
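The behavior resembles a concatenation in which matching headings line up and missing columns are null-filled. A pandas sketch (invented data, for illustration only):

```python
import pandas as pd

first = pd.DataFrame({"Name": ["Ann"], "City": ["Leeds"]})
second = pd.DataFrame({"Name": ["Bob"], "Country": ["UK"]})

# Matching headings (Name) are combined; columns missing from one
# source (City, Country) are null-filled for that source's rows.
print(pd.concat([first, second], ignore_index=True))
```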
Use business rules to measure the quality, completeness and accuracy of your data.
A rule is a Function that returns a true or false value, which translates to a pass or fail. This can be either a default Function (like 'Contains' or 'Is Date') or a custom one.
The easiest way to get started is to use Suggest validation rules, which profiles the data and then suggests formulas to be applied to columns as validation rules. Uncheck any suggestions that are not useful, then apply to create the rules, separated into rule groups and ready for further changes if required.
In Explore mode, Profile your data to see the Suggest rules action button, which will create a new Workflow containing a Validate step.
On the Validate step, click Rules to view and edit any existing rules and to Add rule or Add group.
Each rule has a name and summary, a parent rule group and a Function definition. The Function can either be a pre-existing Function in Data Studio, or a custom Function created in the Function builder. When building a custom Function, pick one of the relevant columns (or a parameter) as the input value and ensure the Function returns true/false as the result.
Rule groups represent logical collections of validation rules and can have different thresholds set. For example, a group of compliance rules may have a fail threshold of 99%, but a data quality group of rules may use a lower threshold and be increased over time as data and processes are improved.
The Status result column for each rule is based on these thresholds, so Red below the fail threshold, Green at or above the pass threshold, and Amber in between.
Each group has a name and description, pass and fail threshold, plus an optional column to weight the results. As well as counting the number of Passed rows and Failed rows, the results will also include Passed weight and Failed weight columns, which contain a sum of the chosen column for all the passing/failing rows. For example, weighting each rule's results by total sales allows you to prioritize data quality issues for the highest spending customers with the most overall impact.
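A rough sketch of how a rule, thresholds, and a weighting column interact (a pandas illustration; the rule, thresholds and column names are invented, and Data Studio's actual computation is richer):

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "bad-value", "c@y.com", ""],
    "total_sales": [1000, 5000, 200, 300],
})

# A rule is a function returning true/false per row.
passed = df["email"].str.contains("@", na=False)

pass_rate = passed.mean()                             # fraction of passing rows
passed_weight = df.loc[passed, "total_sales"].sum()   # Passed weight
failed_weight = df.loc[~passed, "total_sales"].sum()  # Failed weight

# Group thresholds decide the Red/Amber/Green status.
fail_threshold, pass_threshold = 0.70, 0.95
if pass_rate >= pass_threshold:
    status = "Green"
elif pass_rate < fail_threshold:
    status = "Red"
else:
    status = "Amber"
print(pass_rate, passed_weight, failed_weight, status)
```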
Validation results are available in several formats:
Similar to the Source and Profile step, the Validate step has a Source metadata dropdown that enables you to include the lineage metadata of input data. This information (such as file name or batch ID) is useful to include as additional columns in the aggregated results.
This step validates and enriches addresses in bulk.
Addresses will be cleaned by verifying them against the official postally-correct address files for the relevant country. Cleaned addresses are assigned a match result, based on the accuracy of the original address. You can define layouts specifying the number, content and format of the address output columns. Choose one of the available Additional datasets to enrich your data. The datasets that are available to you depend on your license.
Validate emails based on the format or domain address.
Select the Email column and pick one of the two available Validation type options:
Format Check: Checks whether the value matches a valid email format. Returns either true or false.
Examples of valid and invalid email formats:
Format | Result |
---|---|
info@gmail.com | Valid |
first.second-name@gmail.com | Valid |
first.name+tag@gmail.com | Valid |
name@info@gmail.com | Invalid |
name"not"right@test.com | Invalid |
another.test.com | Invalid |
name@incorrect_domain.com | Invalid |
com.domain@name | Invalid |
first_address@gmail.com, second_address@gmail.com | Invalid |
Only one email can be validated at once; lists of emails as seen in the last example will be rejected.
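A simplified format check can be expressed as a regular expression. The pattern below reproduces the table's results but is deliberately stricter than the full RFC grammar and is not the exact rule Data Studio applies:

```python
import re

# A simplified pattern: one local part, an "@", and a dotted domain of
# letters, digits and hyphens. It rejects quoted local parts, multiple
# "@" signs, underscores in domains, and lists of addresses.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

for value in ["info@gmail.com", "first.name+tag@gmail.com",
              "name@info@gmail.com", "another.test.com",
              "name@incorrect_domain.com"]:
    print(value, bool(EMAIL_RE.match(value)))
```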
Domain Level: Checks whether the value has a domain that exists and is an email server. This option returns both an overall validity result (true or false) in the Email domain: Result column, and additional information in the Email domain: Error column describing the reason for failure. The possible outcomes are:
Error | Result | Description |
---|---|---|
Valid | True | Domain exists and is a mail server. |
Bad format | False | Email format is invalid. |
Invalid domain | False | Domain validation check failed. The domain may not exist, or may have been flagged as illegitimate, disposable, harmful, nondeterministic or unverifiable. |
Invalid name | False | Local part validation failed. For example it may have been identified as a spam trap or role account such as "admin@server.com". |
Failed to perform DNS lookup | False | An error occurred when attempting to perform the DNS lookup. |
Domain level validation results are cached with results refreshed every 30 days. The cache validity is configurable in Settings > Workflow steps by changing the Email validation cache validity setting.
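At its simplest, a domain-level check asks whether the domain publishes MX records. A minimal sketch using the third-party dnspython package (Data Studio's actual check is richer, also covering disposable and flagged domains, role accounts, and result caching):

```python
# Requires the dnspython package: pip install dnspython
import dns.exception
import dns.resolver

def domain_accepts_mail(address: str) -> bool:
    """Rough domain-level check: does the domain publish MX records?"""
    domain = address.rsplit("@", 1)[-1]
    try:
        dns.resolver.resolve(domain, "MX")  # raises if no MX records exist
        return True
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer,
            dns.resolver.NoNameservers, dns.exception.Timeout):
        return False

print(domain_accepts_mail("info@gmail.com"))  # True for a real mail domain
```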
Click Show step results to view the results.
Validate global phone numbers using the Validate phone numbers step in Workflows.
Connect the step to the source step and specify the following:
Click Show step results to view the results. The following columns will be appended to your data:
Each step in the Workflow has the option to Show step results. This opens the grid to view the data up to that stage in the Workflow, where you can Explore and apply any Actions. If the Action(s) have a corresponding Workflow step, click Add to Workflow to insert a new Workflow step into the current Workflow from whichever step was previously opened.
Workflow steps can be hidden from the step list that is shown when editing a Workflow, so that only the steps relevant to the use cases of that Space appear. Clicking the Spaces menu options then Workflow steps allows you to turn both default and custom steps on/off from appearing in the list.
Workflow canvas is the main area of the page where Workflows are viewed or edited.
Designer users can build and optimize Workflows by dragging Workflow steps around the Workflow canvas.
Zoom In/Out within a Workflow to show more/fewer steps on a page. The current zoom-level is noted on the mini map in the bottom right corner. Hovering over a step in the mini map shows the step name and clicking on a step moves the canvas to show that step.
Each Workflow step has a menu with the Edit properties option which allows you to change:
Notes are very useful for adding detailed descriptions to a Workflow or around some specific Workflow steps. Notes are also shown in the Workflow mini map in the bottom right corner, which can aid navigation.
To add a note, double-click anywhere on the Workflow canvas. Change the color or resize the note then click away to save the note. Double-click an existing note to edit it.
Using Markdown text in a note allows you to style the contents, e.g. add links or create tables and lists.
For larger Workflows, it can be useful to search for a specific step using the keyboard shortcut Ctrl+F, which will highlight any matching steps and move the canvas to show any steps appearing off screen.
Create an HTML report containing a comprehensive and detailed summary of the Workflow by clicking the Report button:
There are different ways to execute (or run) a Workflow:
Executing a Workflow manually opens a dialog allowing you to specify the following execution options:
Executing a Workflow creates a Job in the Jobs list page.
A Workflow parameter is a placeholder value that can be used within different Workflow steps to apply values dynamically when running a Workflow. For example, create a mandatory parameter Initials that prompts a user to enter their initials which are then included as part of the filename on an Export step.
A Workflow can be re-used, embedded in another Workflow, and can be shared between Spaces. When a Workflow (the child Workflow) is used in another Workflow (the parent), it will appear in a similar way to a Workflow step.
To make a Workflow re-usable:
If you have also published and shared the Workflow, you can re-use (embed) it in Workflows in other Spaces.
The re-usable Workflow will essentially be made into a Workflow step consisting of three parts:
Workflow element | Step element |
---|---|
Replaceable source | Input node |
Output step | Output node |
Re-configurable Workflow parameter | Step setting |
Any Source steps in the child Workflow where Can supply source when executed is set will become input nodes when the Workflow is used as a step in a parent Workflow. Similarly, any Output steps will become output nodes.
If the input is connecting to an output with missing columns, a column mapping table will be shown, and you may need to manually re-map the columns.
Values for Workflow parameters used in the child Workflow can be passed in from the parent Workflow if the Runtime configurable setting is enabled.