S3
This page guides you through the process of setting up the S3 destination connector.
Prerequisites
List of required fields:
- S3 Bucket Name
- S3 Bucket Path
- S3 Bucket Region
If you are using STS Assume Role, you must provide the following:
- Role ARN
Otherwise, if you are using AWS credentials you must provide the following:
- Access Key ID
- Secret Access Key
If you are using an Instance Profile, you may omit the Access Key ID and Secret Access Key, as well as the Role ARN.
Additionally, the following prerequisites are required:
- Allow connections from the HeroPixel server to your AWS S3/Minio S3 cluster (if they exist in separate VPCs).
- An S3 bucket with credentials, a Role ARN, or an instance profile with read/write permissions configured for the host (EC2, EKS).
- Enforce encryption of data in transit
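The credential rules above can be sketched as a small decision function. This is only an illustration: the function name, the returned labels, and the precedence (Role ARN first, then access keys, then instance profile) are assumptions made for the example, not documented HeroPixel behavior.

```python
from typing import Optional


def pick_auth_mode(
    role_arn: Optional[str] = None,
    access_key_id: Optional[str] = None,
    secret_access_key: Optional[str] = None,
) -> str:
    """Return a label for the authentication path implied by the given fields.

    The precedence used here is an assumption for illustration only.
    """
    if role_arn:
        # STS Assume Role only needs the Role ARN.
        return "assume_role"
    if access_key_id and secret_access_key:
        # AWS credentials need both halves of the key pair.
        return "access_key"
    if access_key_id or secret_access_key:
        raise ValueError("Access Key ID and Secret Access Key must be provided together")
    # Nothing provided: fall back to the instance profile of the host.
    return "instance_profile"
```

Calling `pick_auth_mode()` with no arguments returns `"instance_profile"`, which mirrors the case where all three credential fields are omitted.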
Setup guide
Step 1: Set up S3
Sign in to your AWS account.
Prepare the S3 bucket that will be used as the destination. See the AWS documentation to create an S3 bucket.
NOTE: If the S3 cluster is not configured to use TLS, the connection to Amazon S3 silently reverts to an unencrypted connection. HeroPixel recommends that all connections be configured to use TLS/SSL, in support of AWS's shared responsibility model.
Create a bucket policy
- Open the IAM console.
- In the IAM dashboard, select Policies, then click Create Policy.
- Select the JSON tab, then paste the following JSON into the Policy editor (be sure to substitute in your bucket name):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject",
"s3:PutObjectAcl",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:AbortMultipartUpload",
"s3:GetBucketLocation"
],
"Resource": [
"arn:aws:s3:::YOUR_BUCKET_NAME/*",
"arn:aws:s3:::YOUR_BUCKET_NAME"
]
}
]
}
At this time, object-level permissions alone are not sufficient to successfully authenticate the connection. Please ensure you include the bucket-level permissions as provided in the example above.
- Give your policy a descriptive name, then click Create policy.
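If you prefer to generate the policy document rather than edit the JSON by hand, a minimal sketch is shown below. The function name `build_bucket_policy` and the bucket name `my-example-bucket` are placeholders invented for this example; the actions and resource ARNs are copied from the policy above.

```python
import json

# Actions copied from the policy shown in this guide.
S3_ACTIONS = [
    "s3:PutObject",
    "s3:GetObject",
    "s3:DeleteObject",
    "s3:PutObjectAcl",
    "s3:ListBucket",
    "s3:ListBucketMultipartUploads",
    "s3:AbortMultipartUpload",
    "s3:GetBucketLocation",
]


def build_bucket_policy(bucket_name: str) -> dict:
    """Build the policy document with both object-level and bucket-level resources."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": S3_ACTIONS,
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",  # objects in the bucket
                    f"arn:aws:s3:::{bucket_name}",    # the bucket itself
                ],
            }
        ],
    }


if __name__ == "__main__":
    # "my-example-bucket" is a placeholder bucket name.
    print(json.dumps(build_bucket_policy("my-example-bucket"), indent=2))
```

The printed JSON can be pasted into the Policy editor in place of the hand-edited version.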
Authentication Option 1: Using an IAM Role (Most secure)
S3 authentication using an IAM role must be enabled by a member of the HeroPixel team. If you'd like to use this feature, please contact the tech team for more information.
- In the IAM dashboard, click Roles, then Create role.
- Choose the appropriate trust entity and attach the policy you created.
- Set up a trust relationship for the role. For example, for the AWS account trusted entity, use the default AWS account on your instance (it will be used to assume the role). To use an External ID, set it as an environment variable with export AWS_ASSUME_ROLE_EXTERNAL_ID="{your-external-id}". Edit the trust relationship policy to reflect this:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::{your-aws-account-id}:user/{your-username}"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "{your-external-id}"
}
}
}
]
}
- Choose the AWS account trusted entity type.
- Set up a trust relationship for the role. This allows the HeroPixel instance's AWS account to assume this role. You will also need to specify an external ID, which is a secret key that the trusting service (HeroPixel) and the trusted role (the role you're creating) both know. This ID is used to prevent the "confused deputy" problem. The External ID should be your HeroPixel workspace ID, which can be found in the URL of your workspace page. Edit the trust relationship policy to include the external ID:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::094410056844:user/delegated_access_user"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "{your-heropixel-workspace-id}"
}
}
}
]
}
- Complete the role creation and note the Role ARN.
- Select Attach policies directly, then find and check the box for your new policy. Click Next, then Add permissions.
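The trust relationship policies in this section share one shape: a principal that may call sts:AssumeRole, restricted by an external ID condition. A minimal sketch that builds such a document is shown below. The function name and the example ARN and ID are hypothetical; substitute the principal and external ID that apply to your setup.

```python
import json


def build_trust_policy(principal_arn: str, external_id: str) -> dict:
    """Build a trust policy that lets `principal_arn` assume the role.

    The external ID condition is what guards against the "confused deputy" problem.
    """
    if not external_id:
        raise ValueError("external_id must not be empty")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    doc = build_trust_policy(
        "arn:aws:iam::123456789012:user/example-user",
        "example-external-id",
    )
    print(json.dumps(doc, indent=2))
```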
Authentication Option 2: Using an IAM User
Use an existing Access Key ID and Secret Access Key, or create new ones.
- In the IAM dashboard, click Users. Select an existing IAM user or create a new one by clicking Add users.
- If you are using an existing IAM user, click the Add permissions dropdown menu and select Add permissions. If you are creating a new user, you will be taken to the Permissions screen after selecting a name.
- Select Attach policies directly, then find and check the box for your new policy. Click Next, then Add permissions.
- After successfully creating your user, select the Security credentials tab and click Create access key. You will be prompted to select a use case and add optional tags to your access key. Click Create access key to generate the keys.
Step 2: Set up the S3 destination connector in HeroPixel
- Go to your local HeroPixel page.
- In the left navigation bar, click Destinations. In the top-right corner, click + new destination.
- On the destination setup page, select S3 from the Destination type dropdown and enter a name for this connector.
- Configure fields:
  - Access Key Id
    - See the AWS documentation on how to generate an access key.
    - See the AWS documentation on how to create an instance profile.
    - We recommend creating a HeroPixel-specific user. This user will require read and write permissions to objects in the staging bucket.
    - If the Access Key and Secret Access Key are not provided, the authentication will rely either on the Role ARN using STS Assume Role or on the instance profile.
  - Secret Access Key
    - Corresponding key to the above key id.
  - Make sure your S3 bucket is accessible from the machine running HeroPixel.
    - This depends on your networking setup.
    - You can check the AWS S3 documentation for a tutorial on how to properly configure access to your S3 bucket.
    - If you use instance profile authentication, make sure the role has permission to read/write on the bucket.
    - The easiest way to verify that HeroPixel is able to connect to your S3 bucket is via the check connection tool in the UI.
  - S3 Bucket Name
    - See the AWS documentation to create an S3 bucket.
  - S3 Bucket Path
    - Subdirectory under the above bucket to sync the data into.
  - S3 Bucket Region
    - See the AWS documentation for all region codes.
  - S3 Path Format
    - Additional string format on how to store data under S3 Bucket Path. Default value is ${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}_${EPOCH}_.
  - S3 Endpoint
    - Leave empty if using AWS S3, fill in the S3 URL if using Minio S3.
  - S3 Filename pattern
    - The pattern allows you to set the file-name format for the S3 staging file(s). The following placeholder combinations are currently supported: {date}, {date:yyyy_MM}, {timestamp}, {timestamp:millis}, {timestamp:micros}, {part_number}, {sync_id}, {format_extension}. Please don't use empty spaces or unsupported placeholders, as they won't be recognized.
- Click Set up destination.
The full path of the output data with the default S3 Path Format ${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}_${EPOCH}_ is:
<bucket-name>/<bucket-path>/<source-namespace-if-exists>/<stream-name>/<upload-date>_<epoch>_<partition-id>.<format-extension>
For example:
testing_bucket/data_output_path/public/users/2021_01_01_1234567890_0.csv.gz
↑              ↑                ↑      ↑     ↑          ↑          ↑ ↑
|              |                |      |     |          |          | format extension
|              |                |      |     |          |          unique incremental part id
|              |                |      |     |          milliseconds since epoch
|              |                |      |     upload date in YYYY_MM_DD
|              |                |      stream name
|              |                source namespace (if it exists)
|              bucket path
bucket name
The rationale behind this naming pattern is:
- Each stream has its own directory.
- The data output files can be sorted by upload time.
- The upload time is composed of a date part and a millisecond part so that it is both readable and unique.
It is possible to further customize the path by using the available variables to format the bucket path:
- ${NAMESPACE}: Namespace where the stream comes from, or the one configured by the connection namespace fields.
- ${STREAM_NAME}: Name of the stream.
- ${YEAR}: Year in which the sync wrote the output data.
- ${MONTH}: Month in which the sync wrote the output data.
- ${DAY}: Day on which the sync wrote the output data.
- ${HOUR}: Hour in which the sync wrote the output data.
- ${MINUTE}: Minute in which the sync wrote the output data.
- ${SECOND}: Second in which the sync wrote the output data.
- ${MILLISECOND}: Millisecond in which the sync wrote the output data.
- ${EPOCH}: Milliseconds since epoch at which the sync wrote the output data.
- ${UUID}: Random UUID string.
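To see how a path format expands, the variables above can be emulated with Python's `string.Template`, which happens to use the same `${NAME}` syntax. This is a sketch and not the connector's own implementation: the function name, the zero-padding of date parts, and the use of UTC are assumptions.

```python
import calendar
import uuid
from datetime import datetime, timezone
from string import Template

DEFAULT_PATH_FORMAT = "${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}_${EPOCH}_"


def expand_path_format(path_format: str, namespace: str, stream_name: str, when: datetime) -> str:
    """Substitute the documented variables into a path format string.

    `when` must be a timezone-aware datetime. Zero-padding is an assumption.
    """
    millis = when.microsecond // 1000
    # Integer arithmetic avoids floating-point rounding in the epoch value.
    epoch_millis = calendar.timegm(when.utctimetuple()) * 1000 + millis
    values = {
        "NAMESPACE": namespace,
        "STREAM_NAME": stream_name,
        "YEAR": f"{when.year:04d}",
        "MONTH": f"{when.month:02d}",
        "DAY": f"{when.day:02d}",
        "HOUR": f"{when.hour:02d}",
        "MINUTE": f"{when.minute:02d}",
        "SECOND": f"{when.second:02d}",
        "MILLISECOND": f"{millis:03d}",
        "EPOCH": str(epoch_millis),
        "UUID": str(uuid.uuid4()),
    }
    return Template(path_format).substitute(values)


if __name__ == "__main__":
    when = datetime(2021, 1, 1, tzinfo=timezone.utc)
    # Prints: public/users/2021_01_01_1609459200000_
    print(expand_path_format(DEFAULT_PATH_FORMAT, "public", "users", when))
```

A custom format such as `${STREAM_NAME}/${YEAR}/${MONTH}/${DAY}/` can be passed in the same way to preview a date-partitioned layout.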
Note:
- Multiple / characters in the S3 path are collapsed into a single / character.
- If the output bucket contains too many files, the part id variable uses a UUID instead. It uses a sequential ID otherwise.
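The slash-collapsing rule matters most when a variable such as the namespace is empty. A small sketch of the idea follows. The helper name `join_s3_path` is hypothetical, and whether the connector also strips leading or trailing slashes is not stated here, so the sketch leaves them alone.

```python
import re


def join_s3_path(*parts: str) -> str:
    """Join path parts with "/" and collapse any run of "/" into one."""
    return re.sub(r"/+", "/", "/".join(parts))


if __name__ == "__main__":
    # An empty namespace would otherwise leave a double slash in the path.
    print(join_s3_path("data_output_path", "", "users"))
```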
Please note that the stream name may contain a prefix if one is configured on the connection. A data sync may create multiple files, as the output files can be partitioned by size (targeting a size of 200 MB compressed or lower).
Supported sync modes
Feature | Support | Notes |
---|---|---|
Full Refresh Sync | ✅ | Warning: this mode deletes all previously synced data in the configured bucket path. |
Incremental - Append Sync | ✅ | Warning: HeroPixel provides at-least-once delivery. |
Incremental - Append + Deduped | ❌ | |
Namespaces | ❌ | Setting a specific bucket path is equivalent to having separate namespaces. |
The HeroPixel S3 destination allows you to sync data to AWS S3 or Minio S3. Each stream is written to its own directory under the bucket.
⚠️ Please note that under "Full Refresh Sync" mode, data in the configured bucket and path will be wiped out before each sync. We recommend provisioning a dedicated S3 resource for this sync to prevent unexpected data deletion from misconfiguration. ⚠️
Supported Output schema
Each stream will be output to its dedicated directory according to the configuration. The complete datastore of each stream includes all the output files under that directory. You can think of the directory as the equivalent of a table in the database world.
- Under Full Refresh Sync mode, old output files will be purged before new files are created.
- Under Incremental - Append Sync mode, new output files will be added that only contain the new data.
Avro
Apache Avro serializes data in a compact binary format. Currently, the HeroPixel S3 Avro connector always uses the binary encoding, and assumes that all data records follow the same schema.
Configuration
Here are the available compression codecs:
- No compression
- deflate
  - Compression level
    - Range [0, 9]. Defaults to 0.
    - Level 0: no compression & fastest.
    - Level 9: best compression & slowest.
- bzip2
- xz
  - Compression level
    - Range [0, 9]. Defaults to 6.
    - Level 0-3 are fast with medium compression.
    - Level 4-6 are fairly slow with high compression.
    - Level 7-9 are like level 6 but use bigger dictionaries and have higher memory requirements. Unless the uncompressed size of the file exceeds 8 MiB, 16 MiB, or 32 MiB, it is a waste of memory to use the presets 7, 8, or 9, respectively.
- zstandard
  - Compression level
    - Range [-5, 22]. Defaults to 3.
    - Negative levels are 'fast' modes akin to lz4 or snappy.
    - Levels above 9 are generally for archival purposes.
    - Levels above 18 use a lot of memory.
  - Include checksum
    - If set to true, a checksum will be included in each data block.
- snappy
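The level ranges and defaults listed above can be captured in a small lookup table, which is handy for validating a configuration before it is submitted. This is a sketch: the codec keys, the table layout, and the function name are assumptions for illustration and do not reflect the connector's internal representation.

```python
from typing import Optional

# (minimum level, maximum level, default level), taken from the list above.
CODEC_LEVELS = {
    "deflate": (0, 9, 0),
    "xz": (0, 9, 6),
    "zstandard": (-5, 22, 3),
}

# Codecs for which no compression level is documented.
CODECS_WITHOUT_LEVEL = {"no compression", "bzip2", "snappy"}


def resolve_compression_level(codec: str, level: Optional[int] = None) -> Optional[int]:
    """Return the level to use for `codec`, or None if the codec takes no level."""
    if codec in CODECS_WITHOUT_LEVEL:
        if level is not None:
            raise ValueError(f"{codec} does not take a compression level")
        return None
    if codec not in CODEC_LEVELS:
        raise ValueError(f"unknown codec: {codec}")
    low, high, default = CODEC_LEVELS[codec]
    if level is None:
        return default
    if not low <= level <= high:
        raise ValueError(f"{codec} level must be in [{low}, {high}], got {level}")
    return level
```

For example, `resolve_compression_level("xz")` falls back to the documented default of 6, while `resolve_compression_level("deflate", 12)` raises a `ValueError`.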
Data schema
Under the hood, a HeroPixel data stream in JSON schema is first converted to an Avro schema, then the JSON object is converted to an Avro record. Because the data stream can come from any data source, the JSON to Avro conversion process has arbitrary rules and limitations.
CSV
Like most of the other HeroPixel destination connectors, the output usually has three columns: a UUID, an emission timestamp, and the data blob. With the CSV output, it is possible to normalize (flatten) the data blob to multiple columns.
Column | Condition | Description |
---|---|---|
_airbyte_ab_id | Always exists | A uuid assigned by HeroPixel to each processed record. |
_airbyte_emitted_at | Always exists. | A timestamp representing when the event was pulled from the data source. |
_airbyte_data | When no normalization (flattening) is needed, all data reside under this column as a json blob. | |
root level fields | When root level normalization (flattening) is selected, the root level fields are expanded. | |
For example, given the following json object from a source:
{
"user_id": 123,
"name": {
"first": "John",
"last": "Doe"
}
}
With no normalization, the output CSV is:
_airbyte_ab_id | _airbyte_emitted_at | _airbyte_data |
---|---|---|
26d73cde-7eb1-4e1e-b7db-a4c03b4cf206 | 1622135805000 | { "user_id": 123, "name": { "first": "John", "last": "Doe" } } |
With root level normalization, the output CSV is:
_airbyte_ab_id | _airbyte_emitted_at | user_id | name |
---|---|---|---|
26d73cde-7eb1-4e1e-b7db-a4c03b4cf206 | 1622135805000 | 123 | { "first": "John", "last": "Doe" } |
Output files can be compressed. The default option is GZIP compression. If compression is selected, the output filename will have an extra extension (GZIP: .csv.gz).
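The difference between the two CSV layouts described in this section can be sketched in a few lines. The function below is illustrative only: its name and signature are made up for this example, and it assumes that nested values are kept as JSON text after root level flattening, as the example tables suggest.

```python
import json

META_COLUMNS = ["_airbyte_ab_id", "_airbyte_emitted_at"]


def to_csv_columns(record_id, emitted_at, data, flatten_root=False):
    """Return (header, row) for one record in either CSV layout."""
    if not flatten_root:
        # No normalization: the whole record goes into one JSON blob column.
        header = META_COLUMNS + ["_airbyte_data"]
        return header, [record_id, emitted_at, json.dumps(data)]
    # Root level normalization: one column per top-level field.
    header = META_COLUMNS + list(data.keys())
    cells = [json.dumps(v) if isinstance(v, (dict, list)) else v for v in data.values()]
    return header, [record_id, emitted_at] + cells


if __name__ == "__main__":
    record = {"user_id": 123, "name": {"first": "John", "last": "Doe"}}
    record_id = "26d73cde-7eb1-4e1e-b7db-a4c03b4cf206"
    print(to_csv_columns(record_id, 1622135805000, record))
    print(to_csv_columns(record_id, 1622135805000, record, flatten_root=True))
```

Only the root level is expanded in the second layout, so the nested `name` object remains a single column.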
JSON Lines (JSONL)
JSON Lines is a text format with one JSON object per line. Each line has the following structure:
{
"_airbyte_ab_id": "<uuid>",
"_airbyte_emitted_at": "<timestamp-in-millis>",
"_airbyte_data": "<json-data-from-source>"
}
For example, given the following two json objects from a source:
[
{
"user_id": 123,
"name": {
"first": "John",
"last": "Doe"
}
},
{
"user_id": 456,
"name": {
"first": "Jane",
"last": "Roe"
}
}
]
They will be like this in the output file:
{ "_airbyte_ab_id": "26d73cde-7eb1-4e1e-b7db-a4c03b4cf206", "_airbyte_emitted_at": "1622135805000", "_airbyte_data": { "user_id": 123, "name": { "first": "John", "last": "Doe" } } }
{ "_airbyte_ab_id": "0a61de1b-9cdd-4455-a739-93572c9a5f20", "_airbyte_emitted_at": "1631948170000", "_airbyte_data": { "user_id": 456, "name": { "first": "Jane", "last": "Roe" } } }
Output files can be compressed. The default option is GZIP compression. If compression is selected, the output filename will have an extra extension (GZIP: .jsonl.gz).
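A downstream consumer can read a compressed JSONL output file with the standard library alone. The sketch below writes a small file in the layout described above and reads it back. The helper names and the local file path are assumptions; in practice the file would first be downloaded from the bucket.

```python
import gzip
import json
import os
import tempfile
import uuid


def write_jsonl_gz(path, records, emitted_at_millis):
    """Write records as gzip-compressed JSON Lines, one wrapped object per line."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            line = {
                "_airbyte_ab_id": str(uuid.uuid4()),
                "_airbyte_emitted_at": emitted_at_millis,
                "_airbyte_data": record,
            }
            f.write(json.dumps(line) + "\n")


def read_jsonl_gz(path):
    """Read a .jsonl.gz file back into a list of dicts, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.jsonl.gz")
        write_jsonl_gz(path, [{"user_id": 123}, {"user_id": 456}], 1622135805000)
        for row in read_jsonl_gz(path):
            print(row["_airbyte_data"])
```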
Parquet
Configuration
The following configuration is available to configure the Parquet output:
Parameter | Type | Default | Description |
---|---|---|---|
compression_codec | enum | UNCOMPRESSED | Compression algorithm. Available candidates are: UNCOMPRESSED, SNAPPY, GZIP, LZO, BROTLI, LZ4, and ZSTD. |
block_size_mb | integer | 128 (MB) | Block size (row group size) in MB. This is the size of a row group being buffered in memory. It limits the memory usage when writing. Larger values will improve the IO when reading, but consume more memory when writing. |
max_padding_size_mb | integer | 8 (MB) | Max padding size in MB. This is the maximum size allowed as padding to align row groups. This is also the minimum size of a row group. |
page_size_kb | integer | 1024 (KB) | Page size in KB. The page size is for compression. A block is composed of pages. A page is the smallest unit that must be read fully to access a single record. If this value is too small, the compression will deteriorate. |
dictionary_page_size_kb | integer | 1024 (KB) | Dictionary Page Size in KB. There is one dictionary page per column per row group when dictionary encoding is used. The dictionary page size works like the page size but for dictionary. |
dictionary_encoding | boolean | true | Dictionary encoding. This parameter controls whether dictionary encoding is turned on. |
These parameters are related to the ParquetOutputFormat. See the Java doc for more details. Also see the Parquet documentation for their recommended configurations (512 - 1024 MB block size, 8 KB page size).
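The defaults in the table above can be kept together in a small typed structure, which makes it easy to override one value and convert sizes to bytes. This is a sketch under stated assumptions: the class name is hypothetical, and the conversion treats 1 MB as 1024 * 1024 bytes and 1 KB as 1024 bytes, which the table does not specify.

```python
from dataclasses import dataclass

PARQUET_CODECS = {"UNCOMPRESSED", "SNAPPY", "GZIP", "LZO", "BROTLI", "LZ4", "ZSTD"}


@dataclass(frozen=True)
class ParquetOutputConfig:
    """Parquet output parameters with the defaults listed in the table."""

    compression_codec: str = "UNCOMPRESSED"
    block_size_mb: int = 128
    max_padding_size_mb: int = 8
    page_size_kb: int = 1024
    dictionary_page_size_kb: int = 1024
    dictionary_encoding: bool = True

    def __post_init__(self):
        if self.compression_codec not in PARQUET_CODECS:
            raise ValueError(f"unsupported codec: {self.compression_codec}")

    @property
    def block_size_bytes(self) -> int:
        # Assumes binary megabytes (1 MB = 1024 * 1024 bytes).
        return self.block_size_mb * 1024 * 1024

    @property
    def page_size_bytes(self) -> int:
        # Assumes binary kilobytes (1 KB = 1024 bytes).
        return self.page_size_kb * 1024
```

For instance, `ParquetOutputConfig(compression_codec="SNAPPY", block_size_mb=512)` overrides two values and keeps the remaining defaults.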
Data schema
Under the hood, a HeroPixel data stream in JSON schema is first converted to an Avro schema, then the JSON object is converted to an Avro record, and finally the Avro record is output in the Parquet format. Because the data stream can come from any data source, the JSON to Avro conversion process has arbitrary rules and limitations.
In order for everything to work correctly, the user whose "S3 Key Id" and "S3 Access Key" are used must also have access to both the bucket and its contents. Policies to use:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "s3:*",
"Resource": [
"arn:aws:s3:::YOUR_BUCKET_NAME/*",
"arn:aws:s3:::YOUR_BUCKET_NAME"
]
}
]
}