tokenizer transform

The tokenizer transform accepts log events and allows you to tokenize a field's value by splitting on white space, ignoring special wrapping characters, and zipping the tokens into ordered field names.

Config File

vector.toml
[transforms.my_transform_id]
# REQUIRED - General
type = "tokenizer" # must be: "tokenizer"
inputs = ["my-source-id"]
field_names = ["timestamp", "level", "message"]
# OPTIONAL - General
drop_field = true # default
field = "message" # default
# OPTIONAL - Types
[transforms.my_transform_id.types]
status = "int"
duration = "float"
success = "bool"
timestamp = "timestamp|%s"
# Alternative timestamp formats (a key may appear only once per TOML table):
# timestamp = "timestamp|%+"
# timestamp = "timestamp|%F"
# timestamp = "timestamp|%a %b %e %T %Y"

Options

Key

Type

Description

REQUIRED - General

type

string

The component type. Required. Must be: "tokenizer"

inputs

[string]

A list of upstream source or transform IDs. See Config Composition for more info. Required. Example: ["my-source-id"]

field_names

[string]

The field names assigned to the resulting tokens, in order. Required. Example: (see above)

OPTIONAL - General

drop_field

bool

If true, the field is dropped after parsing. Default: true

field

string

The field to tokenize. Default: "message"

OPTIONAL - Types

types.*

string

A definition of mapped field types. The key is the field name and the value is the type. strftime specifiers are supported for the timestamp type. Required. Enum: "string", "int", "float", "bool", and "timestamp|strftime"

Examples

Given the following log line:

log
{
"message": "5.86.210.12 - zieme4647 [19/06/2019:17:20:49 -0400] \"GET /embrace/supply-chains/dynamic/vertical\" 201 20574"
}

And the following configuration:

vector.toml
[transforms.<transform-id>]
type = "tokenizer"
field = "message"
field_names = ["remote_addr", "ident", "user_id", "timestamp", "message", "status", "bytes"]

A log event will be emitted with the following structure:

{
// ... existing fields
"remote_addr": "5.86.210.12",
"user_id": "zieme4647",
"timestamp": "19/06/2019:17:20:49 -0400",
"message": "GET /embrace/supply-chains/dynamic/vertical",
"status": "201",
"bytes": "20574"
}

A few things to note about the output:

  1. The message field was overwritten.

  2. The ident field was dropped since it contained a "-" value.

  3. All values are strings by default. Use the types option to coerce them.

  4. Special wrapper characters, such as wrapping [...] and "..." characters, were dropped.

How It Works

Blank Values

Both " " and "-" are considered blank values and their mapped field will be set to null.
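As a rough illustration of the rule above, the following Python sketch zips tokens with field names and maps blank tokens to null (None). The names BLANK_VALUES and zip_fields are hypothetical and are not part of Vector; the sketch only mirrors the behavior described here.

```python
# Hypothetical sketch: not Vector's implementation.
# The blank values are taken from the description above.
BLANK_VALUES = {" ", "-"}


def zip_fields(field_names, tokens):
    """Pair each token with its field name, in order.

    Tokens that are considered blank are stored as None (null).
    """
    event = {}
    for name, token in zip(field_names, tokens):
        event[name] = None if token in BLANK_VALUES else token
    return event


event = zip_fields(
    ["remote_addr", "ident", "user_id"],
    ["5.86.210.12", "-", "zieme4647"],
)
```

Here the "ident" entry ends up as None because its token is "-".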

Environment Variables

Environment variables are supported through all of Vector's configuration. Simply add ${MY_ENV_VAR} in your Vector configuration file and the variable will be replaced before being evaluated.

You can learn more in the Environment Variables section.
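To make the substitution step concrete, here is a minimal Python sketch of replacing ${NAME} placeholders in configuration text before it is parsed. The function name interpolate is hypothetical, and leaving unset variables untouched is an assumption of this sketch, not documented Vector behavior.

```python
import re

# Matches placeholders of the form ${MY_ENV_VAR}.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def interpolate(config_text, env):
    """Replace each ${NAME} with env[NAME] before the config is evaluated.

    Assumption: placeholders for unset variables are left as-is.
    """
    def replace(match):
        return env.get(match.group(1), match.group(0))

    return PLACEHOLDER.sub(replace, config_text)


rendered = interpolate('field = "${MY_ENV_VAR}"', {"MY_ENV_VAR": "message"})
```

In practice the mapping passed as env would be the process environment, for example os.environ.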

Special Characters

In order to extract raw values and remove wrapping characters, we must treat certain characters as special. These characters will be discarded:

  • "..." - Quotes are used to wrap phrases. Spaces are preserved, but the wrapping quotes will be discarded.

  • [...] - Brackets are used to wrap phrases. Spaces are preserved, but the wrapping brackets will be discarded.

  • \ - Can be used to escape the above characters. Vector will treat escaped characters as literals.
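The rules above can be sketched in Python as follows. This is an illustrative approximation, not Vector's actual parser: the function name tokenize is hypothetical, and the sketch assumes that a wrapper is only recognized at the start of a token.

```python
# Hypothetical sketch of the splitting rules described above.
CLOSERS = {'"': '"', "[": "]"}  # opening wrapper -> closing wrapper


def tokenize(line):
    """Split on white space, keeping "..." and [...] phrases together."""
    tokens = []
    current = []      # characters of the token being built
    in_token = False  # True once a token has started
    closer = None     # closing wrapper we are waiting for, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == "\\" and i + 1 < len(line):
            # Escaped character: keep the next character literally.
            current.append(line[i + 1])
            in_token = True
            i += 2
            continue
        if closer is not None:
            if ch == closer:
                closer = None        # discard the closing wrapper
            else:
                current.append(ch)   # spaces inside wrappers are preserved
        elif ch in CLOSERS and not in_token:
            closer = CLOSERS[ch]     # discard the opening wrapper
            in_token = True
        elif ch.isspace():
            if in_token:
                tokens.append("".join(current))
                current, in_token = [], False
        else:
            current.append(ch)
            in_token = True
        i += 1
    if in_token:
        tokens.append("".join(current))
    return tokens


line = (
    '5.86.210.12 - zieme4647 [19/06/2019:17:20:49 -0400] '
    '"GET /embrace/supply-chains/dynamic/vertical" 201 20574'
)
tokens = tokenize(line)
```

Applied to the log line from the Examples section, the sketch yields seven tokens, with the bracketed timestamp and the quoted request each kept as a single token.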

Types

By default, extracted (parsed) fields all contain string values. You can coerce these values into types via the types table as shown in the Config File example above. For example:

[transforms.my_transform_id]
# ...
# OPTIONAL - Types
[transforms.my_transform_id.types]
status = "int"
duration = "float"
success = "bool"
timestamp = "timestamp|%s"
# Alternative timestamp formats (a key may appear only once per TOML table):
# timestamp = "timestamp|%+"
# timestamp = "timestamp|%F"
# timestamp = "timestamp|%a %b %e %T %Y"

The available types are:

Type

Description

bool

Coerces to a true/false boolean. The 1/0 and t/f values are also coerced.

float

Coerces to a 64-bit float.

int

Coerces to a 64-bit integer.

string

Coerces to a string. Generally not necessary since values are extracted as strings.

timestamp

Coerces to a Vector timestamp. strftime specifiers must be used to parse the string.
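The coercions listed above can be approximated with the following Python sketch. The function name coerce is hypothetical, and Python's datetime.strptime stands in for Vector's timestamp parsing; Python supports only a subset of strftime specifiers (for example, not %s or %+), so the format used here is spelled out explicitly.

```python
from datetime import datetime

# Values accepted for booleans, per the table above (true/false, 1/0, t/f).
TRUE_VALUES = {"true", "t", "1"}
FALSE_VALUES = {"false", "f", "0"}


def coerce(value, type_spec):
    """Coerce a string token according to a type such as "int" or "timestamp|<format>"."""
    kind, _, fmt = type_spec.partition("|")
    if kind == "string":
        return value
    if kind == "int":
        return int(value)
    if kind == "float":
        return float(value)
    if kind == "bool":
        lowered = value.lower()
        if lowered in TRUE_VALUES:
            return True
        if lowered in FALSE_VALUES:
            return False
        raise ValueError(f"not a boolean: {value!r}")
    if kind == "timestamp":
        # Assumption: the format uses specifiers that Python understands.
        return datetime.strptime(value, fmt)
    raise ValueError(f"unknown type: {type_spec!r}")


status = coerce("201", "int")
ts = coerce("19/06/2019:17:20:49 -0400", "timestamp|%d/%m/%Y:%H:%M:%S %z")
```

With the example log line, this turns the "status" token into an integer and the "timestamp" token into a timezone-aware datetime.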

Troubleshooting

The best place to start troubleshooting is the Vector logs, typically located at /var/log/vector.log. Then follow the Troubleshooting Guide.

If the Troubleshooting Guide does not resolve your issue, please:

  1. If you encountered a bug, please file a bug report.

  2. If you encountered a missing feature, please file a feature request.

  3. If you need help, join our chat/forum community. You can post a question and search previous questions.
