A Dataset stores the schema (column layout) of the data that will be loaded, including the parsing options, settings, data tags and data batch details.

You don't have to store data within Data Studio. You can create an empty Dataset with column headers and load data in batches. This lets you re-use the same Dataset by replacing data batches whenever needed.

Data can be loaded into a Dataset from multiple sources. Each Source should have a similar schema, but Sources can be of different types (e.g. a file and a database table). For files, each Source has its own dropzone folder and can use a different encryption key.

There are a few reasons this could be useful:

  • Getting started with a file while waiting for database credentials.
  • Allowing multiple teams/users to upload to a Dataset without them needing access to the same file folder.
  • Switching from a development to a production data integration.

Viewing and creating a source

You can view the Sources for a Dataset from the Datasets list screen using Actions > Sources. From here you can Add another source, which runs through the usual Dataset loading wizard. Adding a new Source will load data into the Dataset.

Setting a default source

One of the Sources will be the Default. It is used for any automatic data refresh and provides the metadata in Workflow steps.

You can specify which Source should be the Default by selecting Actions > Set Default from the list of Sources.

Deleting a source

To delete a Source, go to the Dataset's Source list, select Actions > Delete for the Source that's no longer required, and confirm the deletion.

Dataset dropzones

The Dataset dropzone directory name is the Source's unique External label. To rename the directory/folder, update the External label of the Source (not of the Dataset).

Data within Datasets is stored in one or more batches. Each batch has an ID, date, timestamp and row count.

When creating or editing a Dataset, choose whether it will contain a single batch or multiple batches:

  • Overwrite (default) to create a single-batch Dataset. New data loaded into this Dataset will overwrite any existing data, so only the latest data will be available.
  • Add to create a multi-batch Dataset. New data loaded into this Dataset will be added/appended to existing data. Older batches of data will remain, so this is useful for historical data accumulated over time, such as weekly sales metrics or daily validation results.

A multi-batch Dataset allows you to track data vintage and trend changes over time. It also allows older data to be deleted automatically.

Selecting and viewing batches

You can view the batch details within a Dataset from the Datasets list screen using Actions > Show batches.

The Source step in a Workflow provides an option to select All batches (default), last N batches, Batches created in the latest time period or to take a sample batch over a time period.

Deleting batches

To delete batches manually, go to the batch details within a Dataset in the Datasets list using Actions > Show batches. Select one or more batches, click Delete, and confirm the deletion.

Datasets also have an Allow automatic batch deletion setting. Checking it enables the Delete batches on completion option in any Source step that uses this Dataset. When the Workflow is executed, batches that are processed successfully can then be deleted automatically, ensuring that no batch of data is processed through a Workflow twice.

Automatic batch deletion

Each multi-batch Dataset has an optional limit on the number of batches it can hold. When new data is loaded and the number of batches would exceed this limit, the oldest batch is deleted automatically. If the setting is left blank, batches are never deleted automatically.

Automatically delete old data

For compliance or housekeeping reasons, we recommend automatically deleting old data. The Automatically delete older batches after N days setting is checked daily for every Dataset that has it enabled, and any batches older than today minus the specified number of days are deleted.

Create a Dataset

To create a Dataset, go to Datasets and click Add Dataset. This will launch a wizard to guide you through the process.

First, choose to either Upload file from a server import directory or connect to an External system such as a cloud file store or a database.

Loading a file

Data Studio supports most common file formats: .csv, .txt, .xls, .xlsx, .json, .psv, .sas7bdat, .dat, .metro2, as well as encrypted or password-protected files with these extensions: .aes, .enc, .gpg, .pgp, .zip, .xls, .xlsx.

The metadata for fixed-width files (.dat or .txt) is loaded from a .csv file. The following columns are accepted in this .csv file; the order of the columns does not matter, as long as the headers are labeled correctly. An example appears after the table.

Col name Required? Description
COLUMN Required The name of the column. Headers in the data file are not used.
START_POSITION Required Start position of the column. 1 indexed, i.e. the first character in the file is position 1.
END_POSITION Required End position of the column. This is the last character included in the column. An END_POSITION of 12 will include the character at position 12.
LENGTH Required Length of the column. This is needed for validation.
TYPE Required The data type of the column. ALPHANUMERIC, NUMERIC or DATE.
SUMMARY Optional Column description.
TAGS Optional List of Data tags to be assigned to the column. Multiple values must be surrounded with double quote and separated with comma. E.g. "Name, Surname".
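
A minimal sketch of such a metadata file, written here with Python's standard csv module (the column names, positions, lengths and tags below are examples only, not required values):

```python
import csv

# Hypothetical metadata describing a fixed-width file with two columns.
# Column names, positions, lengths and tags are examples only.
rows = [
    ["CUSTOMER_NAME", 1, 30, 30, "ALPHANUMERIC", "Full customer name", "Name, Surname"],
    ["SIGNUP_DATE", 31, 38, 8, "DATE", "Date the customer signed up", ""],
]

with open("fixed_width_metadata.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["COLUMN", "START_POSITION", "END_POSITION",
                     "LENGTH", "TYPE", "SUMMARY", "TAGS"])
    # csv.writer quotes the multi-value TAGS cell ("Name, Surname") automatically.
    writer.writerows(rows)
```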

Data Studio will analyze the file and automatically determine the best parser options to extract the data from the file, but these can be manually amended.

File parser is auto-selected based on the file extension, but a different parser can be selected from the list.

Character set is auto-selected by comparing the first 64k characters against a universal library.

Language and region is auto-selected to match the host machine OS locale, unless overridden in the Data Studio installation config files.
This setting controls how date and numeric values are parsed on load (when data types are detected automatically). For example:

  • Date: 'English (United Kingdom)' will parse the value "01/02/2020" as 1st February 2020, 'English (United States)' will parse the same value as 2nd January 2020
  • Numeric: 'English (United Kingdom)' will parse the value "1,001" as the number 1001, 'German' will parse the same value as "1.001"
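
The same ambiguity can be reproduced outside Data Studio. The sketch below uses Python's datetime module to show the day-first (United Kingdom) and month-first (United States) readings of the value "01/02/2020"; it illustrates the locale behavior described above rather than Data Studio's internal parser:

```python
from datetime import datetime

value = "01/02/2020"

# Day-first reading, as with an 'English (United Kingdom)' locale
uk = datetime.strptime(value, "%d/%m/%Y")   # 1st February 2020

# Month-first reading, as with an 'English (United States)' locale
us = datetime.strptime(value, "%m/%d/%Y")   # 2nd January 2020

print(uk.date(), us.date())                 # 2020-02-01 2020-01-02
```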

Delimiter and Quote character are auto-selected from the first 20 rows of data, but can be selected from the list or a custom value entered.

Import data from row can be used with .csv, .psv, and .txt delimited files to specify a number of rows to skip.

Use first row for column names is selected by default; if unselected, columns will be named 'Column 1', 'Column 2', etc.

Parse newline characters within quoted sections as new rows will always terminate a row at an end-of-line character. By default, newline characters within Quote characters (e.g. paragraphs of text) are incorporated into the cell data.

Column datatype options:

  • Automatically detect the type of data for each column - sets the datatype of each column based on the first 20 rows. This impacts how data is parsed later on using the Language and region setting. For example, a Numeric column will automatically convert "0123" to "123", and whitespace-only values will be replaced with null (see the sketch after this list).
  • Treat all columns as alphanumeric - will set Alphanumeric for all columns.
  • Allow columns to support mixed datatypes - will set Unknown datatype for all columns.
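
A rough illustration of the Numeric parsing behavior described in the first option (this mimics the documented effect; it is not Data Studio's actual parser):

```python
def parse_numeric(value: str):
    """Illustrative only: mimic the documented Numeric-column behavior."""
    if value.strip() == "":
        return None        # whitespace-only values become null
    return int(value)      # "0123" becomes 123 (leading zeros are dropped)

print(parse_numeric("0123"))   # 123
print(parse_numeric("   "))    # None
```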

Date only

Valid Input Comments
02-Jan-1970 Date elements separated by hyphen, shortened month name.
04-January-1970 Date elements separated by hyphen, using full month name.
15.Dec.1970 Date elements separated by period.
15/Dec/1970 Date elements separated by forward slash.
01-01-20 For 2-digit years the watershed will determine the century. Using default watershed, this will be parsed to 2020.
01-01-80 Will be parsed to 1980 using the default watershed.
1970-12-15 Year precedence.
12/15/2020 Month precedence. The locale (language and region) is used to set the month precedence setting in the parsing rules which determines the possible expected formats. So 10/28/2020 will not parse with a UK locale (but will for US) because 28 is an invalid month value, and there would be no US locale parsing as a fallback.

Date and time

Valid Input Comments
01-01-1970T12:34:56 Basic date/time without millis, using T time designator
01-01-1970 12:34:56 Basic date/time without millis
01-01-1970T12:34 Basic date/time without seconds, using T time designator
01-01-1970 12:34 Basic date/time without seconds
01-01-1970 1234 Hour and minute without separator. Becomes 01/01/1970 12:34
01-01-1970 234 Hour and minute without separator. Becomes 01/01/1970 02:34
01-01-1970T12:34:56.987 With milliseconds
01-01-1970T12:34:56.9 Becomes 01/01/1970 12:34:56.900
01-01-1970T123456.987 Without time separators. Becomes 01/01/1970 12:34:56.987
01-01-1970T10:00:00Z With timezone Zulu
01-01-1970 10:00:00+01:00 Becomes 01/01/1970 09:00:00
01-01-1970 10:00:00+01:00[Europe/Paris] Becomes 01/01/1970 09:00:00
01-01-1970T10:00:00-01:00 Becomes 01/01/1970 11:00:00
02-Jan-1990 12:01:02Z[UTC] Becomes 02/01/1990 12:01:02

Time only

Valid Input Comments
1:02:03.004 Time with milliseconds
1:02:03 Time with hours, minutes and seconds
10:02 Time with hours and minutes

Some database systems and file formats can store time zone-aware date and time values.

In the following CSV example, row 1 is a timestamp with a +3 hour timezone offset and row 2 is in UTC. Data Studio is not timezone aware, so when timestamps are loaded into Data Studio, values are converted to UTC.

row timestamp in CSV file timestamp loaded into Data Studio
1 01/01/2013 01:00:00+0300 31-Dec-2012 22:00:00
2 01/01/2013 01:00:00 01-Jan-2013 01:00:00
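
The same conversion can be reproduced with Python's standard library. This sketch is independent of Data Studio's internals and simply shows the offset arithmetic for the two rows above:

```python
from datetime import datetime, timezone

# Row 1: a timestamp with a +03:00 offset is converted to UTC
row1 = datetime.fromisoformat("2013-01-01T01:00:00+03:00")
print(row1.astimezone(timezone.utc))   # 2012-12-31 22:00:00+00:00

# Row 2: no offset information, so the value is kept as-is (already UTC)
row2 = datetime.fromisoformat("2013-01-01T01:00:00")
print(row2)                            # 2013-01-01 01:00:00
```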

A watershed (also known as a century cutoff) is used to calculate the century that any two digit year is in. The default is a floating watershed of 50 years. This can be configured in Settings > Data handling with the option to use a fixed or floating watershed with configurable values for both.

  • A fixed watershed will interpret any two-digit year greater than the given value as being in the last century.
  • A floating watershed is similar, but relative to the current year: the given value is added to the last two digits of the current year, and any two-digit year above that cutoff is considered to be in the last century. For example, if the floating watershed value is 10 and the current year is 2022, the cutoff is 22 + 10 = 32, so years after 32 are converted to 19XX and years up to 32 are converted to 20XX. This is useful when parsing older dates such as dates of birth (see the sketch below).
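
A minimal sketch of the century calculation described above, assuming the behavior as documented (this is not Data Studio's actual implementation):

```python
def expand_two_digit_year(yy, watershed=50, current_year=2022, floating=True):
    """Expand a two-digit year to four digits using a fixed or floating watershed."""
    # Fixed: the cutoff is the configured value itself.
    # Floating: the cutoff is relative to the current year.
    cutoff = (current_year % 100 + watershed) % 100 if floating else watershed
    return 1900 + yy if yy > cutoff else 2000 + yy

# Floating watershed of 10 in 2022: the cutoff is 32
print(expand_two_digit_year(33, watershed=10))   # 1933
print(expand_two_digit_year(20, watershed=10))   # 2020

# Default floating watershed of 50: "01-01-20" -> 2020, "01-01-80" -> 1980
print(expand_two_digit_year(20), expand_two_digit_year(80))   # 2020 1980
```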

Preview data

Once loaded, a preview is shown containing the first few rows of data, allowing you to check the file/table and settings are correct before proceeding. For multiple tables or files like Excel containing more than one sheet, the first few rows of each table are shown in the preview.

Annotate columns

Each column in a Dataset can have a summary description and Data tags assigned. The Optional setting, if selected, will allow data to be loaded in future without this column being included.

An existing Dataset’s column annotations can be changed by selecting Actions > Annotate columns.

Data tags

Data tags are labels shown next to column headers. As well as making it easier to find the correct column, they help improve the user experience by providing relevant defaults and suggestions in Workflow steps.

We recommend using Auto tag, which will analyze the data and apply data tags automatically.

Column datatypes

The Datatype of a column can have a value of Alphanumeric, Date, List, Numeric, Record, or Unknown. This can change how file values are automatically parsed and help users to quickly identify and find the correct column in future. The Record datatype is used for storing values in JSON format.

The first time data is loaded into a Dataset, the datatypes are suggested by looking at the first 20 lines of data, but these can be updated during creation or in future from the Actions > Annotate columns page.

If required, there are a number of conversion Functions that can be applied as part of a Workflow to ensure all values in a column are of the same type.

Dataset details and settings

Datasets have a Name, Summary, Description and an External Label that can be updated from the Actions > Edit details page.

Other settings on this page:

  • Allow auto-refresh – allows the Source step of a Workflow to automatically download new data during Workflow execution.
  • Publish to OData – allows a third-party application (with an API key) to use this Dataset as a source.
  • Enable dropzone – set up a folder that will be periodically checked for a new file and automatically load it into the Dataset.
  • Decryption key – used to process files that are encrypted or password protected.

To reduce disk space usage (by up to 80%), Data Studio will automatically compress your data as it's being loaded, but compression may impact Workflow performance. The default compression applied can be updated in Settings > Performance. Each Dataset can use a compression level different from the default. Any change of compression will be applied to data loaded from that point onwards.

  • Column-based compression is best for fast iterations of a single column, such as data validation or transformation where a Workflow is processing every value in a column.
  • Row-based compression often takes up the least disk space and can be faster to compress the data. It creates a file that indexes each row, making it quick to access individual cells within the row.
  • No compression – the Dataset, including the data, will take up more disk space than the size of the file being loaded. However, loading data and processing Workflows will be fastest, so this is the best option for processing data in Data Studio without storing it long-term.

Loading new data

The Dataset batch settings will determine if data will overwrite or be added to any existing data.

To load a new version of a file into a file Dataset:

  • Manually: from the Datasets list screen, go to Actions > Upload new data, or click into a Dataset (to check it's the correct one), then click Upload data and select the required file. If the file structure has changed (e.g. the schema no longer matches or spreadsheet names have changed), you will be prompted to configure and annotate columns.
  • Automatically: using a Dataset dropzone.

To load new data into an external system Dataset:

  • Manually: from the Datasets list screen, go to Actions > Refresh data (you can select multiple Datasets to refresh).
  • Automatically: ensure the Dataset has Allow auto-refresh enabled, then create a Workflow with a Source step that uses this Dataset as input and has the Allow auto-refresh option enabled.

Sharing Datasets

A Dataset cannot be shared with another Space, so the data will only be visible to users with access to that Space.

However, data can be shared through the use of Views. To give a user or group of users access to a View but not its underlying source data, share the View into a Space they have access to.

The ability to share a View with another Space is a particularly useful way to filter out or obfuscate sensitive data before making it available for processing by another set of users. It also allows data owners to create different representations of a given dataset, suppressing irrelevant columns or applying logical sorting in advance so downstream users can get more immediate value from the data.

You can also export data from any Dataset in a Space and import it to another Space, either in the same Environment or a different one.

Data tags allow you to enrich source data by specifying additional information about what type of data each column contains. Having tagged columns improves the user experience by providing sensible defaults and suggestions in Workflow steps such as Find duplicates and Validate addresses. Data Studio can use machine learning algorithms to automatically tag columns.

Tags are saved as part of the Dataset, so any new batch of data added to the Dataset will retain the same tags.

Data tags appear next to the column name in data grids, so if you're not familiar with your data they provide an overview at a glance.

There are two types of data tags:

  • System defined (cannot be modified, used to drive system behavior)
  • User defined

Tags in Data Studio are hierarchical: a parent tag (e.g. Address) can have multiple children (e.g. Postal Code, City).

While you can't modify system defined tags, you can create a child tag (e.g. PO box) and assign it to a system tag (e.g. Address). Once created, it will appear in the User defined list under that system parent tag.

Create a tag

To create new tags follow these steps:

  1. Go to System > Data tags.
  2. Click the Create new tag button to add a new one.
  3. Fill in the name and, optionally, assign a parent tag.
  4. Click Apply.

The newly created tag will appear in the list.

Manage tags

To manage existing tags:

  1. Go to System > Data tags.
  2. On the required tag, click Options > Edit details.
  3. Change the name and, optionally, the parent tag.
  4. Click Apply.

The updated tag will appear in the list.

Any data tag can have multiple training datasets. A training dataset is used to train a fingerprint, allowing Data Studio to learn how to recognize data with similar properties and tag it automatically.

Training datasets

To create or view the Training datasets for a data tag:

  1. Go to System > Data tags.
  2. On the required data tag, click on Options > Manage training datasets.
  3. The list of training datasets for the selected data tag will appear.

When creating a new training dataset you will need to define:

  • Name for the training dataset.
  • Source type, either an existing Dataset or an existing View.
  • Name of the Dataset or View to use.
  • Column from that source that contains the values to be used for training.
  • Threshold, a percentage value between 35 and 99 that determines how similar the data must be to the fingerprint in order for the tag to be applied automatically.
  • Whether Strict length validation should be performed. If set, any tagged data must closely match the mean length of the training data.

Data Studio will train a fingerprint using the specified source and column. Once the fingerprint is trained, it can be used for auto tagging new Datasets. When auto tagging, Data Studio will analyze the input columns and tag them if a fingerprint match is found.

Training datasets can be disabled, which will prevent the training data from being included in the auto tagging process.

You may want to change the Dataset schema after creating it. You can do it manually or automatically.

Manual schema changes

You can apply schema changes to a Dataset when uploading new or refreshing data:

  1. Go to Datasets.
  2. Click the required Dataset's Options.
  3. Select Upload new data/Refresh data (the option will depend on whether the selected Dataset was created from a file upload or an external system connection).
  4. (For file upload Datasets) Click Browse for file and select the required file.
  5. Depending on the compatibility of the new data, you'll see the following options:
  • If compatible, you can click Apply to load your data immediately, or click Configure to edit column annotations or parser settings.
  • If incompatible, you must click Configure to review and resolve any differences before the data can be added to the Dataset.

Proceed to configure the Dataset. You'll be presented with a consolidated list of existing, new and missing columns.

To completely exclude a column, tick the Exclude checkbox: new columns will not be added and existing/missing ones will be permanently removed from the Dataset.

Optional columns

Set all columns that may be missing in future uploads or data refreshes as optional. When manually uploading/refreshing data, you'll be notified but able to continue. Missing optional columns will not prevent files being loaded to Dropzones or Workflows being run with Auto-refresh sources enabled. All missing optional columns will be populated with empty data.

Automatic schema changes

  • Dropzone files
    You can add new data to a Dataset by copying files to the designated Dropzone. If no issues are found, the data will be parsed and uploaded. The result of the upload will fire any related events.
  • Auto-refresh data sources
    New data can be added to a Dataset during the auto-refresh process. Any changes detected will fire related events.

Schema change Events

Data Studio will automatically create the following events, which you can set up notifications for:

  • Mandatory columns which are present in the Dataset but missing from the new data will fire a Dataset automatic load failed event. Your data will not be loaded.
  • Optional columns which are present in the Dataset but missing from the new data will not prevent loading, if there are no other errors. These missing columns will appear blank in the Dataset.
  • Columns in the new data which are not present in the Dataset will fire a Dataset automatic load warning. Your data will load, if there are no other errors. These columns will not be automatically added to the Dataset.
  • When your data loads, a Dataset loaded event will be fired.

Snapshot schema changes

You may want to update the schema if any columns have been added or removed when writing data to Snapshots.

Updating Snapshot schema

When the Take Snapshot step's input schema doesn't match the selected target Dataset schema, a warning will be shown. Click Update schema: {X} schema change(s) to open the Update Schema dialog (where {X} is the number of changes found) to see a consolidated list of existing, new and missing columns.

Review and triage the changes before they take effect:

  • If new columns are added to the input schema, the Take Snapshot step will display a warning. You can Include or Exclude the new column(s) in/from the target Dataset. Until this is done, the step will execute but will only write data for the existing columns.
  • If columns have been removed from the input schema, the Snapshot step will display a warning and will become invalid, causing Workflow execution to fail. You'll have to either change the input or resolve the warning manually.

Data encryption is the process of encoding data in order to protect the confidentiality, integrity, and authenticity of information during transmission. Aperture Data Studio is designed with integrated security features, and one of these is data encryption - the capability of encrypting exported data and decrypting imported data.

Public key encryption is an industry-standard cryptographic system. In public key encryption, a public key is used to encrypt data, and a corresponding private key is used to decrypt it.

The supported encrypted file types are:

  • Advanced Encryption Standard (AES)
  • Microsoft Excel (.XLS and .XLSX)
  • GNU Privacy Guard (GPG)
  • Pretty Good Privacy (PGP)
  • .ZIP files

Additionally, Aperture Data Studio also supports encryption key management. Encryption keys can be assigned to specific users or roles through RBAC. Users with this capability are able to save, export, and import encryption keys in a secured manner.

Aperture Data Studio supports password-based encryption for Advanced Encryption Standard (AES) encrypted files.

AES is a symmetric-key algorithm where the same key is used to encrypt and decrypt data.

An optional salt value can be used when turning the password into a secret key. A salt is a random value used as an additional input to the one-way function that hashes the password.
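
As a general illustration of how a password and salt are turned into a key (this is not Data Studio's specific key-derivation scheme, just a standard-library sketch), Python's hashlib provides PBKDF2:

```python
import hashlib
import os

password = b"correct horse battery staple"   # example password
salt = os.urandom(16)                         # random salt

# Derive a 256-bit key from the password and salt using PBKDF2-HMAC-SHA256.
# The iteration count and key length are illustrative choices.
key = hashlib.pbkdf2_hmac("sha256", password, salt, 200_000, dklen=32)
print(key.hex())
```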

Data Studio can load data from password-protected Excel (.XLS and .XLSX) files. The files can be protected using an encryption facility based on a specified password. In order to extract the original content from Excel files, the correct password has to be provided.

Aperture Data Studio supports public key encryptions for GNU Privacy Guard (GPG) encrypted files.

GPG is a key-based encryption method which utilizes two cryptographic keys; a public key to encrypt a file and a corresponding private key to decrypt it.

When generating a GPG key pair, a passphrase has to be specified. This is required by the private key; only an authorized user with knowledge of the passphrase can use the private key. This provides a two-layer protection against unauthorized access to the contents of the encrypted file.

Aperture Data Studio supports public key encryption for Pretty Good Privacy (PGP) encrypted files.

PGP is a key-based encryption method which utilizes two cryptographic keys; a public key to encrypt a file and a corresponding private key to decrypt it.

When generating a PGP key pair, a passphrase has to be specified. This is required by the private key; only an authorized user with knowledge of the passphrase can use the private key. This provides a two-layer protection against unauthorized access to the contents of the encrypted file.

Aperture Data Studio supports password-based encryption for encrypted .ZIP files. Compressed .ZIP files can be protected using an encryption facility based on a specified password. In order to extract the original content from the ZIP archive, the correct password has to be provided.
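
For illustration, Python's standard zipfile module can extract from a password-protected archive in a similar way (it supports the legacy ZipCrypto scheme only; the archive, member and password names below are placeholders):

```python
import zipfile

# Placeholders: archive name, member name and password are examples only.
with zipfile.ZipFile("export.zip") as archive:
    data = archive.read("customers.csv", pwd=b"example-password")

print(len(data), "bytes extracted")
```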

Prerequisites

To manage data encryption operations, ensure that:

  • you are assigned the Create Key Encryption capability.
  • the Manage permission is assigned to users who will manage (edit, add or delete) keys in Data Studio.
  • the Use permission is assigned to all users who should be able to view and use the keys for importing/exporting encrypted files in Data Studio.

Creating data encryption keys

To create a key:

  1. Navigate to System > Data encryption and select Add new data encryption.
  2. Select the encryption type.
  3. Give the encryption key a name and optional summary.
  4. Specify the usage type of the key. You may select either one or both usage types but at least one must be selected.
  5. Depending on the encryption type selected, complete the fields necessary to configure the data encryption. Find out more about AES, Excel, GPG, PGP and ZIP files.
  6. Click Next.
  7. Assign the required users.
  8. Click Finish to save changes.

Editing data encryption keys

To edit a key:

  1. Navigate to System > Data encryption.
  2. Select the required key and click on the hamburger menu in the Options column.
  3. Select Edit details to update the configuration of the key. You can also Enable, Disable or Delete it.
  4. Click Finish to save changes.

Using data encryption keys

When importing or exporting any encrypted files, a valid encryption key is required to ensure that the operation is permitted for the user.

Data encryption keys can be pinned to a particular usage in Datasets or Workflows, so that the same encryption key is used consistently.

Load encrypted Datasets using a data encryption key

When adding a Dataset from an encrypted file, you will need to select an existing encryption key or enter a new encryption key to be associated with the Dataset. The data encryption keys available for selection are only those that your user has permission to use.

Once set, this data encryption key will be used whenever the Dataset is refreshed by any user, even for those with no direct permission to access the data encryption key.

Data encryption keys are available in different Spaces within the same Environment.

To upload a locally stored encrypted file:

  1. Go to Datasets.
  2. Click Add Dataset.
  3. Select Upload file for a locally stored encrypted file and click Next.
  4. Click Browse for file, select the required file, and click Next.
  5. Select the required Encryption mode - either Select a decryption key or Enter a new decryption key. If the first option is chosen, existing data encryption keys will be available for selection through a dropdown chooser.
  6. If Enter a new decryption key is selected, enter the required parameters to decrypt the file.
  7. Enable Add to existing keys to save the encryption key for future use. Provide a new encryption key name.
  8. Continue with the Dataset upload process.

Export an encrypted file from a Workflow

When exporting records using the Export step in the Workflow designer, specify the data encryption key to be used to encrypt the output file. You can also set the required encryption type for the exported file.

Other users with access to the Space can also modify the data encryption key used in the Export step. Once the key has been set, it will be used whenever the Workflow is executed (or scheduled for execution), so it will also be accessible to users with no direct permission to access the data encryption key.

You can also export encrypted files to an external system in the same way. Find out how to export data to an External system.

This feature allows Datasets to be automatically refreshed at the point of Workflow execution. The latest data will be fetched from the origin.

Source types that support Auto-refresh sources

Auto-refresh sources will work with External systems where the source table is still available. File uploads and dropzone files are excluded because once a Dataset is created, the origin file is not retained.

Valid data source types for Auto-refresh sources are JDBC, Azure BLOB, HDFS, Amazon S3, and SFTP.

Views will inherit the refresh setting from their base Dataset, so a Workflow that uses a View as source can also be configured to refresh that View's underlying Dataset on execution, even if the view is shared to another Space.

Enable Auto-refresh sources

There are three settings required in order to enable Auto-refresh sources. These settings have a cumulative effect: all three must be enabled for data to be refreshed. When all three are enabled and the Workflow is executed, it will run with the latest data from the origin.

When creating or editing a Dataset

Provided your Dataset is from a supported data source type:

  1. Go to Datasets.
  2. Find the Dataset or Add Dataset.
  3. On the Edit details page, tick the setting Allow auto-refresh.

When creating or editing a Workflow

On the Source step, tick Allow auto-refresh. On this step you can also control the behavior if the source refresh fails during Workflow execution: checking Stop execution when refresh failed will cause the Workflow execution to fail if the Dataset could not be refreshed from the External system.

When executing the Workflow

In the Run Workflow dialog or in a Schedule:

  • From the Workflow, click Run and tick Refresh sources.
  • In the edit/create page for a Schedule, tick Refresh sources.

Having this level of control allows for scenarios such as:

  • Ensuring the latest data is always used during scheduled or manual executions of a Workflow.
  • The creation of a Dataset that can never be automatically reloaded (e.g. definitive reference data).
  • The creation of a View that can never be automatically reloaded.
  • A one-off test execution of a Workflow that does not reload Datasets.
  • A Schedule that, regardless of the Workflow configuration, will never reload Datasets.

Snapshots are copies of data that allow you to track changes over time or store your results for use elsewhere. A Snapshot is a type of Dataset that's created by and stored in Data Studio.

Taking Snapshots

Connect the Take Snapshot step's input to the data you'd like to capture in a Snapshot. When the Workflow is run, a new batch of data will be created.

This data can then be added to an existing or new Dataset:

  • Select an existing one from the Dataset list.

  • Use Create new Dataset to create a new one.

    • Name - add a name as it will appear in Data Studio.
    • Summary - an optional short summary of the Dataset.
    • Description - an optional, longer description.
    • Dataset type - choose one of the options: Single batch will write the latest data to the Dataset and keep no history. Multi batch will retain older data in the Dataset allowing for trend analysis.
    • Interactivity - This option will only be available if you're making a copy of Results by rule or Results by group from the Validate step. Selecting Interactive (with drill down) will result in a Dataset that allows users to drill into the underlying data and view the passing/failing rows for the rules/rule groups.
    • Add Batch Timestamp Column - When selected, an additional column will be added to the Dataset, recording the timestamp that the batch was created. This option is useful for trend analysis.
    • Allow automatic batch deletion - This will ensure that data batches which have been used in a Workflow are deleted after they have been processed (i.e. after the Workflow using those batches has been run). This option is intended to be used so that no batch is processed through a Workflow twice. It's used in conjunction with the Delete batches on completion setting in the Source step.
    • Allow auto-refresh - Not applicable to Snapshots.
    • Publish to ODBC - Make the Dataset visible through ODBC connections.
    • Compression level - Choose one of the options: Row-based to store data by row, Column-based to store data by column or None to not use compression for this Snapshot. The value selected by default is determined by what's been specified in Settings > Performance > Compression. Find out more about the compression levels.

Using Snapshots

Snapshots can be used like other Datasets:

  • As a Workflow source - add a Source step to a Workflow and select the required Snapshot Dataset.
  • As a View source - choose the required Snapshot Dataset when creating a View.

Snapshot schema changes

You may want to update the schema if columns have been added/removed when writing data to a Snapshot.

Resolving Snapshot warnings

A summary of the unresolved changes will appear as a warning on the step. To fix this, you can change the input or resolve the conflicts. Click Update schema: {X} schema change(s) to open the Update Schema dialog (where {X} is the number of changes found) to see a consolidated list of existing, new and missing columns.

You can now:

  • Include or Exclude new columns in the Snapshot's schema. You can change this at any point later.
  • Set all missing mandatory columns as Optional. Optional columns will be retained and filled with empty values in the target Dataset.

Click Apply to update the schema. The step will become valid and you'll be able to view the step results/run the Workflow.

Datasets in Data Studio can be accessed by other systems and applications using the OData protocol. Many applications are able to consume data from OData APIs directly. Alternatively, developers can access the API itself (see a list of supported programming languages).

Make Datasets available via OData

For security reasons, Datasets will not be accessible via OData unless both the Dataset and the Space it belongs to have been configured to allow access.

To allow Datasets within a Space to be exposed via OData:

  1. Click on the drop-down arrow in the top menu and select View all Spaces.
  2. Click on the Space's Options.
  3. Click Edit Details.
  4. Enable Allow publishing to ODBC and OData.
  5. Click Apply.

Then, to enable publishing on each required Dataset:

  1. Go to Datasets.
  2. Click on the Dataset's Options.
  3. Click Edit Details.
  4. Enable Publish to ODBC and OData.
  5. Click Apply.

Connect to the API

When you connect using OData, you will need to provide the following information:

  • URL - http[s]://<server>/odata/<environment label> (see manage Environments for details of how to determine the environment label).
  • Authentication Type - Basic.
  • Username - your Data Studio username (typically your email address).
  • Password - a valid Data Studio API key.

Supported OData parameters

If you are using an application to read OData, you won't need to understand the details of the protocol. However for developers who wish to access the API from code, it may be useful to know the subset of OData query parameters that are currently supported by Data Studio:

  • $count can be used to determine the total number of rows available within a Dataset.
  • $top and $skip can be used to retrieve data in batches/pages, rather than requesting the entire dataset in one request.

Please consult the OData Standard or the documentation for the OData library you are using for information about how to supply these parameters.
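
For developers, here is a hedged sketch of paging through a published Dataset with Python's requests library, using the URL format and credentials described above. The server name, environment label and entity name are placeholders, and the exact entity path for a Dataset will depend on your deployment:

```python
import requests

BASE_URL = "https://datastudio.example.com/odata/myenvironment"   # placeholder
AUTH = ("user@example.com", "YOUR_DATA_STUDIO_API_KEY")           # username + API key

ENTITY = "CustomerDataset"   # placeholder name of a published Dataset
PAGE_SIZE = 500

rows, skip = [], 0
while True:
    # $top and $skip retrieve the data in pages rather than in one request.
    resp = requests.get(
        f"{BASE_URL}/{ENTITY}",
        params={"$top": PAGE_SIZE, "$skip": skip},
        auth=AUTH,
    )
    resp.raise_for_status()
    page = resp.json().get("value", [])
    if not page:
        break
    rows.extend(page)
    skip += PAGE_SIZE

print(f"Fetched {len(rows)} rows")
```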