A Dataset stores the schema (column layout) of the data that will be loaded, including the parsing options, settings, data tags and data batch details.
You don't have to store data within Data Studio. You can create an empty Dataset with column headers and load data in batches. This lets you re-use the same Dataset by replacing data batches whenever needed.
Data can be loaded into a Dataset from multiple sources. Each Source of data should have a similar schema, but can be of a different type (e.g. a file and a database table). For files, each Source has its own dropzone folder and can use a different encryption key.
There are a few reasons this could be useful:
You can view the Sources for a Dataset from the Datasets list screen using Actions > Sources. From here you can Add another source, which runs through the usual Dataset loading wizard. Adding a new Source will load data into the Dataset.
One of the Sources will be the Default. This will be used by any automatic data refresh or as the metadata in Workflow steps.
You can specify which Source should be the Default by selecting Actions > Set Default from the list of Sources.
To delete a Source, go to the Source list for a Dataset, select Actions > Delete for the Source that is no longer required, and confirm the deletion.
The Dataset dropzone directory name is taken from the Source's unique External label. To rename the directory/folder, update the External label of the Source (not of the Dataset).
Data within Datasets is stored in one or more batches. Each batch has an ID, date, timestamp and row count.
When creating or editing a Dataset, specify whether it will contain a single batch or multiple batches by choosing one of the following:
A multi-batch Dataset allows you to track data vintage and trend changes over time. It also allows older data to be automatically deleted.
You can view the batch details within a Dataset from the Datasets list screen using Actions > Show batches.
The Source step in a Workflow provides an option to select All batches (default), last N batches, Batches created in the latest time period or to take a sample batch over a time period.
To delete batches manually, go to the batch details within a Dataset in the Datasets list using Actions > Show batches. Select one or more batches, then click Delete and confirm the deletion.
Datasets also have an Allow automatic batch deletion setting. Checking this provides the Delete batches on completion option in the Source step using this Dataset. When the Workflow is executed, it can be set to automatically delete any batches of data that are processed successfully, which will ensure that no batch of data is processed through a Workflow twice.
Each multi-batch Dataset has an optional limit on the number of batches it can have. When new data is loaded into that Dataset, if the specified number of batches would exceed the limit, the oldest batch will be automatically deleted. If this setting is left blank, no automatic deletion of batches will occur.
For compliance or housekeeping reasons, we recommend automatically deleting old data. The Automatically delete older batches after N days setting is checked daily for every Dataset that has it enabled, and any batches older than today minus the specified number of days are deleted.
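The age check is simple date arithmetic; the following Python sketch illustrates the cutoff (the retention period and batch dates are hypothetical examples, not Data Studio internals):

```python
from datetime import date, timedelta

retention_days = 30                      # the "N days" setting (example value)
batches = {                              # batch ID -> load date (examples only)
    "batch-001": date(2024, 1, 5),
    "batch-002": date(2024, 2, 20),
    "batch-003": date(2024, 3, 18),
}

cutoff = date.today() - timedelta(days=retention_days)
to_delete = [batch_id for batch_id, loaded in batches.items() if loaded < cutoff]
print(f"Batches older than {cutoff}: {to_delete}")
```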
To create a Dataset, go to Datasets and click Add Dataset. This will launch a wizard to guide you through the process.
First, choose to either Upload file from a server import directory or connect to an External system such as a cloud file store or a database.
Data Studio supports most common file formats: .csv, .txt, .xls, .xlsx, .json, .psv, .sas7bdat, .dat, .metro2, as well as encrypted or password-protected files with these extensions: .aes, .enc, .gpg, .pgp, .zip, .xls, .xlsx.
The metadata for fixed-width files (.dat or .txt) is loaded from a .csv file. The following columns are accepted in this .csv file; the order of the columns does not matter, as long as the headers are labelled correctly. An illustrative example follows the table.
Col name | Required? | Description |
---|---|---|
COLUMN | Required | The name of the column. Headers in the data file are not used. |
START_POSITION | Required | Start position of the column. 1 indexed, i.e. the first character in the file is position 1. |
END_POSITION | Required | End position of the column. This is the last character included in the column. An END_POSITION of 12 will include the character at position 12. |
LENGTH | Required | Length of the column. This is needed for validation. |
TYPE | Required | The data type of the column. ALPHANUMERIC, NUMERIC or DATE. |
SUMMARY | Optional | Column description. |
TAGS | Optional | List of Data tags to be assigned to the column. Multiple values must be surrounded with double quotes and separated with commas, e.g. "Name, Surname". |
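As a hedged illustration of these semantics (START_POSITION is 1-indexed and END_POSITION is inclusive), the Python sketch below slices a sample fixed-width record using a hypothetical metadata definition; the column names and data are examples only, not part of Data Studio.

```python
# Hypothetical fixed-width metadata, mirroring the columns described above.
metadata = [
    # COLUMN,      START_POSITION, END_POSITION, LENGTH, TYPE
    ("FIRST_NAME",  1,             10,           10,     "ALPHANUMERIC"),
    ("SURNAME",    11,             25,           15,     "ALPHANUMERIC"),
    ("DOB",        26,             33,            8,     "DATE"),
]

# A sample fixed-width record matching the layout above (examples only).
record = "JOHN".ljust(10) + "SMITH".ljust(15) + "19700102"

for name, start, end, length, col_type in metadata:
    assert length == end - start + 1      # LENGTH must agree with the positions (validation)
    value = record[start - 1:end]         # 1-indexed, inclusive range converted to a Python slice
    print(f"{name} ({col_type}): {value!r}")
```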
Data Studio will analyze the file and automatically determine the best parser options to extract the data from the file, but these can be manually amended.
File parser is auto-selected based on the file extension, but a different parser can be selected from the list.
Character set is auto-selected by comparing the first 64k characters against a universal library.
Language and region is auto-selected to match the host machine OS locale, unless overridden in the Data Studio installation config files.
This setting controls how date and numeric values are parsed on load (when data types are detected automatically). For example, the locale determines the expected date formats and whether a comma or a point is treated as the decimal separator in numeric values.
Delimiter and Quote character are auto-selected from the first 20 rows of data, but a different value can be selected from the list or a custom value entered.
Import data from row can be used with .csv, .psv, and .txt delimited files to specify a number of rows to skip.
Use first row for column names - selected by default; if unselected, columns will be named 'Column 1', 'Column 2', and so on.
Parse newline characters within quoted sections as new rows - always terminates a row at an end-of-line character. By default, newline characters within Quote characters (e.g. paragraphs of text) will be incorporated into the cell data.
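As an analogy only (using Python's csv module rather than Data Studio's parser), the sketch below shows how the delimiter and quote character settings interact, and how a newline inside a quoted section is kept within a single cell by default, as described above.

```python
import csv
import io

# A pipe-delimited sample where the second field of row 1 contains a quoted newline.
raw = 'id|comment\n1|"first line\nsecond line"\n2|plain value\n'

# Delimiter and quote character play the same role as the parser settings above.
reader = csv.reader(io.StringIO(raw), delimiter="|", quotechar='"')
for row in reader:
    print(row)

# ['id', 'comment']
# ['1', 'first line\nsecond line']   <- the quoted newline stays inside the cell
# ['2', 'plain value']
```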
Column datatype options:
Valid Input | Comments |
---|---|
02-Jan-1970 | Date elements separated by hyphen, shortened month name. |
04-January-1970 | Date elements separated by hyphen, using full month name. |
15.Dec.1970 | Date elements separated by period. |
15/Dec/1970 | Date elements separated by forward slash. |
01-01-20 | For 2-digit years the watershed will determine the century. Using the default watershed, this will be parsed to 2020.
01-01-80 | Will be parsed to 1980 using the default watershed. |
1970-12-15 | Year precedence. |
12/15/2020 | Month precedence. The locale (language and region) sets the month precedence in the parsing rules, which determines the expected formats. For example, 10/28/2020 will not parse with a UK locale (but will with a US locale) because 28 is not a valid month value and US-style parsing is not used as a fallback.
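As a rough analogy in Python (not Data Studio's parser), explicit month-first and day-first formats show why a value such as 10/28/2020 cannot be read day-first:

```python
from datetime import datetime

value = "10/28/2020"

# Month-first (US-style) parsing succeeds.
print(datetime.strptime(value, "%m/%d/%Y").date())   # 2020-10-28

# Day-first (UK-style) parsing raises ValueError: 28 cannot be read as a month.
try:
    datetime.strptime(value, "%d/%m/%Y")
except ValueError as err:
    print(f"Day-first parse failed: {err}")
```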
Valid Input | Comments |
---|---|
01-01-1970T12:34:56 | Basic date/time without milliseconds, using T time designator
01-01-1970 12:34:56 | Basic date/time without milliseconds
01-01-1970T12:34 | Basic date/time without seconds, using T time designator |
01-01-1970 12:34 | Basic date/time without seconds |
01-01-1970 1234 | Hour and minute without separator. Becomes 01/01/1970 12:34 |
01-01-1970 234 | Hour and minute without separator. Becomes 01/01/1970 02:34 |
01-01-1970T12:34:56.987 | With milliseconds |
01-01-1970T12:34:56.9 | Becomes 01/01/1970 12:34:56.900 |
01-01-1970T123456.987 | Without time separators. Becomes 01/01/1970 12:34:56.987 |
01-01-1970T10:00:00Z | With timezone Zulu |
01-01-1970 10:00:00+01:00 | Becomes 01/01/1970 09:00:00 |
01-01-1970 10:00:00+01:00[Europe/Paris] | Becomes 01/01/1970 09:00:00 |
01-01-1970T10:00:00-01:00 | Becomes 01/01/1970 11:00:00 |
02-Jan-1990 12:01:02Z[UTC] | Becomes 02/01/1990 12:01:02 |
Valid Input | Comments |
---|---|
1:02:03.004 | Time with milliseconds |
1:02:03 | Time with hours, minutes and seconds |
10:02 | Time with hours and minutes |
Some database systems and file formats can store time zone-aware date and time values.
In the following CSV example, row 1 is a timestamp with a +3 hour timezone offset and row 2 is in UTC. Data Studio is not timezone aware: when timestamps are loaded into Data Studio, values are converted to UTC.
row | timestamp in CSV file | timestamp loaded into Data Studio |
---|---|---|
1 | 01/01/2013 01:00:00+0300 | 31-Dec-2012 22:00:00 |
2 | 01/01/2013 01:00:00 | 01-Jan-2013 01:00:00 |
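The conversion in row 1 can be reproduced with standard datetime arithmetic. The Python sketch below is only an illustration of the UTC normalisation, not Data Studio's loader:

```python
from datetime import datetime, timezone

# Row 1 from the table above: a timestamp with a +03:00 offset.
value = "01/01/2013 01:00:00+0300"
parsed = datetime.strptime(value, "%d/%m/%Y %H:%M:%S%z")

# Normalise to UTC, as Data Studio does on load.
print(parsed.astimezone(timezone.utc))   # 2012-12-31 22:00:00+00:00
```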
A watershed (also known as a century cutoff) is used to calculate the century that any two digit year is in. The default is a floating watershed of 50 years. This can be configured in Settings > Data handling with the option to use a fixed or floating watershed with configurable values for both.
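The sketch below shows one plausible interpretation of a floating 50-year watershed, consistent with the examples above (01-01-20 becoming 2020 and 01-01-80 becoming 1980); it is an assumption for illustration, not Data Studio's exact rule:

```python
from datetime import date

def expand_two_digit_year(yy: int, watershed: int = 50) -> int:
    """Illustrative floating watershed: choose the century so that the
    resulting year falls roughly within `watershed` years of today."""
    current = date.today().year
    candidate = (current // 100) * 100 + yy           # same century as today
    if candidate < current - watershed:
        candidate += 100                              # too far in the past -> next century
    elif candidate >= current + (100 - watershed):
        candidate -= 100                              # too far in the future -> previous century
    return candidate

print(expand_two_digit_year(20))   # 2020 (run with a 2020s system date)
print(expand_two_digit_year(80))   # 1980 (run with a 2020s system date)
```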
Once loaded, a preview is shown containing the first few rows of data, allowing you to check that the file/table and settings are correct before proceeding. For multiple tables, or files such as an Excel workbook containing more than one sheet, the first few rows of each table are shown in the preview.
Each column in a Dataset can have a summary description and Data tags assigned. The Optional setting, if selected, will allow data to be loaded in future without this column being included.
An existing Dataset’s column annotations can be changed by selecting Actions > Annotate columns.
Data tags are labels shown next to column headers. As well as making it easier to find the correct column, they help improve the user experience by providing relevant defaults and suggestions in Workflow steps.
We recommend using Auto tag, which will analyze the data and automatically apply data tags.
The Datatype of a column can have a value of Alphanumeric, Date, List, Numeric, Record, or Unknown. This can change how file values are automatically parsed and help users to quickly identify and find the correct column in future. The Record datatype is used for storing values in JSON format.
The first time data is loaded into a Dataset, the datatypes are suggested by looking at the first 20 lines of data, but these can be updated during creation or in future from the Actions > Annotate columns page.
If required, there are a number of conversion Functions that can be applied as part of a Workflow to ensure all values in a column are of the same type.
Datasets have a Name, Summary, Description and an External Label that can be updated from the Actions > Edit details page.
Other settings on this page:
To reduce disk space usage (by up to 80%), Data Studio will automatically compress your data as it's being loaded in, but compression may impact Workflow performance. The default compression applied can be updated in Settings > Performance. Each Dataset can have a different compression level from the default. Any change of compression will be applied to data loaded from that point onwards.
The Dataset batch settings will determine if data will overwrite or be added to any existing data.
To load a new version of a file into a file Dataset:
To load new data into an external system Dataset:
A Dataset cannot be shared with another Space, so the data will only be visible to users with access to that Space.
However, data can be shared through the use of Views. To give a user or group of users access to a View but not its underlying source data, you can share the View into a Space which they have access to.
The ability to share a View to another Space is a particularly useful way to filter out or obfuscate sensitive data before making it available for processing by another set of users. It also allows data owners to create different representations of a given Dataset, suppressing irrelevant columns or applying logical sorting in advance so that downstream users can get more immediate value from the data.
You can also export data from any Dataset in a Space and import it to another Space, either in the same Environment or a different one.
Data tags allow you to enrich source data by specifying additional information about what type of data each column contains. Having tagged columns improves the user experience by providing sensible defaults and suggestions in Workflow steps such as Find duplicates and Validate addresses. Data Studio can use machine learning algorithms to automatically tag columns.
Tags are saved as part of the Dataset, so any new batch of data added to the Dataset will retain the same tags.
Data tags appear next to the column name in data grids, so if you're not familiar with your data they provide an overview at a glance.
There are two types of data tags:
Tags in Data Studio are hierarchical: a parent tag (e.g. Address) can have multiple children (e.g. Postal Code, City).
While you can't modify system defined tags, you can create a child tag (e.g. PO box) and assign it to a system tag (e.g. Address). Once created, it will appear in the User defined list under that system parent tag.
To create new tags follow these steps:
The newly created tag will appear in the list.
To manage existing tags:
Any data tag can have multiple training datasets. A training dataset is used to train a fingerprint, which allows Data Studio to learn how to recognize data with similar properties so that it can be auto tagged.
To create or view the Training datasets for a data tag:
When creating a new training dataset you will need to define:
Threshold suggestion
Data Studio will train a fingerprint using the specified source and column. Once the fingerprint is trained, it can be used for auto tagging new Datasets. When auto tagging, Data Studio will analyze the input columns and tag them if a fingerprint match is found.
Training datasets can be disabled, which will prevent the training data from being included in the auto tagging process.
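Conceptually, auto tagging applies each trained fingerprint to a column and assigns the tag when the match score meets the threshold. The sketch below is purely illustrative (the scoring function, tag names and threshold value are hypothetical, not Data Studio's implementation):

```python
import re

def auto_tag(column_values, fingerprints, threshold=0.8):
    """Assign every tag whose fingerprint score meets the threshold."""
    return [tag for tag, score_fn in fingerprints.items()
            if score_fn(column_values) >= threshold]

# A toy "fingerprint": the fraction of values that look like UK postcodes.
postcode_like = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$", re.IGNORECASE)
fingerprints = {
    "Postal Code": lambda vals: sum(bool(postcode_like.match(v)) for v in vals) / len(vals),
}

print(auto_tag(["SW1A 1AA", "EC1A 1BB", "N1 9GU"], fingerprints))   # ['Postal Code']
```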
You may want to change the Dataset schema after creating it. You can do it manually or automatically.
You can apply schema changes to a Dataset when uploading new or refreshing data:
Proceed to configure the Dataset. You'll be presented with a consolidated list of existing, new and missing columns.
To completely exclude a column, tick the Exclude checkbox: new columns will not be added and existing/missing ones will be permanently removed from the Dataset.
Set any columns that may be missing in future uploads or data refreshes as optional. When manually uploading or refreshing data, you'll be notified but will be able to continue. Missing optional columns will not prevent files from being loaded to Dropzones or Workflows from being run with Auto-refresh sources enabled. All missing optional columns will be populated with empty data.
Data Studio will automatically create the following events (which you can set up notifications for):
You may want to update the schema if any columns have been added or removed when writing data to Snapshots.
When the Take Snapshot step's input schema doesn't match the selected target Dataset schema, a warning will be shown. Click Update schema: {X} schema change(s) (where {X} is the number of changes found) to open the Update Schema dialog and see a consolidated list of existing, new and missing columns.
Review and triage the changes before they take effect:
Data encryption is the process of encoding data in order to protect the confidentiality, integrity, and authenticity of information during transmission. Aperture Data Studio is designed with integrated security features, and one of these is data encryption - the capability of encrypting exported data and decrypting imported data.
Public key encryption is an industry-standard cryptographic system: a public key is used for encrypting data, whereas a corresponding private key is used for decrypting it.
The supported encrypted file types are:
Additionally, Aperture Data Studio also supports encryption key management. Encryption keys can be assigned to specific users or roles through RBAC. Users with this capability are able to save, export, and import encryption keys in a secured manner.
Aperture Data Studio supports password-based encryption for Advanced Encryption Standard (AES) encrypted files.
AES is a symmetric-key algorithm where the same key is used to encrypt and decrypt data.
An optional salt value can be used when turning a password into a secret key. A salt is a random value used as an additional input to a one-way function that hashes data such as a password.
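As a general illustration of how a password and salt are turned into a secret key (a standard PBKDF2 construction; not necessarily the exact scheme Data Studio uses):

```python
import hashlib
import os

password = b"correct horse battery staple"   # example only
salt = os.urandom(16)                         # random salt, an extra input to the derivation

# Derive a 256-bit key suitable for AES-256 from the password and salt.
key = hashlib.pbkdf2_hmac("sha256", password, salt, 600_000, dklen=32)
print(key.hex())
```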
Data Studio can load data from password-protected Excel (.XLS and .XLSX) files. The files can be protected using an encryption facility based on a specified password. In order to extract the original content from Excel files, the correct password has to be provided.
Aperture Data Studio supports public key encryptions for GNU Privacy Guard (GPG) encrypted files.
GPG is a key-based encryption method which utilizes two cryptographic keys: a public key to encrypt a file and a corresponding private key to decrypt it.
When generating a GPG key pair, a passphrase has to be specified. This is required by the private key; only an authorized user with knowledge of the passphrase can use the private key. This provides two layers of protection against unauthorized access to the contents of the encrypted file.
Aperture Data Studio supports public key encryption for Pretty Good Privacy (PGP) encrypted files.
PGP is a key-based encryption method which utilizes two cryptographic keys: a public key to encrypt a file and a corresponding private key to decrypt it.
When generating a PGP key pair, a passphrase has to be specified. This is required by the private key; only an authorized user with knowledge of the passphrase can use the private key. This provides two layers of protection against unauthorized access to the contents of the encrypted file.
To manage data encryption operations, ensure that:
To create a key:
To edit a key:
When importing or exporting any encrypted files, a valid encryption key is required to ensure that the operation is permitted for the user.
Data encryption keys can be pinned to a particular usage in Datasets or Workflows, so that the same encryption key is used consistently.
When adding a Dataset from an encrypted file, you will need to select an existing encryption key or enter a new encryption key to be associated with the Dataset. The data encryption keys available for selection are only those that your user has permission to use.
Once set, this data encryption key will be used whenever the Dataset is refreshed by any user, even for those with no direct permission to access the data encryption key.
Data encryption keys are available in different Spaces within the same Environment.
To upload a locally stored encrypted file:
When exporting records using the Export step in the Workflow designer, specify the data encryption key to be used to encrypt the output file. You can also set the required encryption type for the exported file.
Other users with access to the Space can also modify the data encryption key used in the Export step. Once the key has been set, it will be used whenever the Workflow is executed (or scheduled for execution), so it will also be accessible to users with no direct permission to access the data encryption key.
You can also export encrypted files to an external system in the same way. Find out how to export data to an External system.
This feature allows Datasets to be automatically refreshed at the point of Workflow execution. The latest data will be fetched from the origin.
Auto-refresh sources will work with External systems where the source table is still available. File uploads and dropzone files are excluded because once a Dataset is created, the origin file is not retained.
Valid data source types for Auto-refresh sources are JDBC, Azure BLOB, HDFS, Amazon S3, and SFTP.
Views will inherit the refresh setting from their base Dataset, so a Workflow that uses a View as a source can also be configured to refresh that View's underlying Dataset on execution, even if the View is shared to another Space.
There are three settings required in order to enable Auto-refresh sources:
These settings have a cumulative effect and you need all three to be enabled in order for data to be refreshed.
When all of these options are enabled, and the Workflow is executed, it will run with the latest data from the origin.
Provided your Dataset is from a supported data source type:
On the Source step, tick Allow auto-refresh. On this step you can also control what the behaviour should be if the source refresh fails during Workflow execution. Checking Stop execution when refresh failed will cause a Workflow execution to fail if the Dataset could not be refreshed from the External system.
In the Run Workflow dialog or Schedule:
Having this level of control allows for scenarios such as:
Snapshots are copies of data that allow you to track changes over time or store your results for use elsewhere. A Snapshot is a type of Dataset that's created by and stored in Data Studio.
Connect the Take Snapshot step's input to the data you'd like to capture in a Snapshot. When the Workflow is run, a new batch of data will be created.
This data can then be added to an existing or new Dataset:
Select an existing one from the Dataset list.
Use Create new Dataset to create a new one.
Snapshots can be used like other Datasets:
You may want to update the schema if columns have been added/removed when writing data to a Snapshot.
A summary of the unresolved changes will appear as a warning on the step. To fix this, you can change the input or resolve the conflicts. Click Update schema: {X} schema change(s) (where {X} is the number of changes found) to open the Update Schema dialog and see a consolidated list of existing, new and missing columns.
You can now:
Click Apply to update the schema. The step will become valid and you'll be able to view the step results/run the Workflow.
Datasets in Data Studio can be accessed by other systems and applications using the OData protocol. Many applications are able to consume data from OData APIs directly. Alternatively, developers can access the API itself (see a list of supported programming languages).
For security reasons, Datasets will not be accessible via OData unless both the Dataset and the Space it belongs to have been configured to allow access.
When you connect using OData, you will need to provide the following information:
Setting | Value |
---|---|
URL | http[s]://<server>/odata/<environment label> (see manage Environments for details of how to determine the environment label) |
Authentication Type | Basic |
Username | Your Data Studio username (typically your email address) |
Password | A valid Data Studio API key |
If you are using an application to read OData, you won't need to understand the details of the protocol. However, for developers who wish to access the API from code, it may be useful to know the subset of OData query parameters that are currently supported by Data Studio:
Please consult the OData Standard or the documentation for the OData library you are using for information about how to supply these parameters.
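For developers, the hedged sketch below shows a minimal connection using Python's requests library and the settings described above; the server address, environment label and API key are placeholders to be replaced with your own values:

```python
import requests

# Placeholder values: substitute your own server, environment label,
# Data Studio username and API key (the API key is the Basic auth password).
BASE_URL = "https://datastudio.example.com/odata/default"
USERNAME = "user@example.com"
API_KEY = "your-data-studio-api-key"

# Request the OData service document, which lists the available entity sets.
response = requests.get(BASE_URL, auth=(USERNAME, API_KEY),
                        headers={"Accept": "application/json"})
response.raise_for_status()
print(response.json())
```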