Table of Contents
engineConfiguration
engineConfigurationToGroups
groupConfiguration
groupConfigurationToChannels
channelConfiguration
channelConfigurationToCompressionLevels
samplesBucketSize and samplesBucketSize_*
samples and samples_*
This manual is divided into five chapters (not counting this introduction). The first chapter introduces the concepts of Apache Cassandra in general and the Cassandra Archiver in particular. The second chapter summarizes the new features of the present version of the Cassandra Archiver and outlines the changes compared to older versions. The third chapter describes the few steps needed to set up a basic installation of the Cassandra Archiver. The fourth chapter gives more detailed instructions on how to install the Cassandra Archiver. Finally, the fifth chapter explains how to configure the archiver.
This chapter presents the changes in newer versions of the Cassandra Archiver in comparison to previous versions. If you are new to the Cassandra Archiver, you can safely skip this chapter and proceed with the next chapter.
Version 2.0 of the Cassandra Archiver changes the structure of the database used to store the samples. The new structure brings significant improvements for the performance of Cassandra, in particular for large databases.
Existing databases cannot be converted to the new format. Instead, the new database has to be set up first and then the data needs to be transferred manually.
Other changes mainly affect the internal code structure which has been tidied up. In particular, major changes have been made to the sample compressor, which also profits from the new database structure.
Finally, an option for changing the consistency levels used for reads and writes has been added. Expert users can use this setting to tune the availability trade-off between read and write scenarios.
Version 2.1 of the Cassandra Archiver uses Apache Cassandra 1.2 instead of version 1.1 that was used for earlier releases. This way users of the Cassandra Archiver can benefit from the many useful features that are new in Cassandra 1.2 (e.g. virtual nodes).
Another change affects the format that channel values are stored in. This format has been optimized, so that integers typically need considerably less space than before. Cassandra Archiver 2.1 can read values in the old format, but values written by Cassandra Archiver 2.1 cannot be read by previous versions. This change does not affect the database structure, thus no special actions have to be taken when upgrading.
This chapter introduces the concepts behind the Cassandra Archiver and Apache Cassandra. First, column-oriented database systems and Apache Cassandra are presented briefly. Subsequently, the Cassandra Archiver for Control System Studio is introduced. Finally, the structure of the keyspace storing data for the Cassandra Archiver is explained. You might want to skip this last section if you are reading this manual for the first time and are just interested in getting started with the Cassandra Archiver.
Apache Cassandra is a column-oriented database management system (CDBMS), which is optimized for storing large amounts (tera- or even petabytes) of data grouped in column families. It is a special form of a key-value store. Unlike a relational database management system (RDBMS), it is not optimized for storing relational data or modifying data in a transactional way. The main advantages of a CDBMS compared to an RDBMS are superior read-write performance, linear scalability and high availability at low operation costs.
In a CDBMS, data is stored in column families. Each column family contains an arbitrary number of rows, which (for a multi-node setup) are distributed over all cluster nodes. Each row is identified by a unique row key. Each row contains one or more columns. Each column is identified by a column name that must be unique within the respective row. Each column can, but does not have to, store a value. Row keys, column names and column values are stored as arrays of bytes. The meaning of the bytes depends on the application accessing the database. Therefore, data types are opaque to the CDBMS.
In a multi-node setup, data is distributed across the nodes, so that the amount of data stored is not limited by the disk space of a single computer. Typically, low-price servers which are not fault-tolerant are used and the same data is stored on multiple nodes (typically three). The database clients and servers have built-in facilities that automatically switch to a different node if the first node fails. Therefore, the database cluster is fault-tolerant and highly available, although cheap, unreliable computers are used.
This document does not provide a detailed introduction into column-oriented database management systems or Apache Cassandra. For understanding the concepts of a CDBMS, the original paper about Google Bigtable by Chang et al. is a good starting point.
If you want to set up a cluster of Cassandra servers or are interested in advanced configuration options and performance tuning, you should read the Apache Cassandra Documentation provided by DataStax. However, this manual describes the basic steps needed to set up a single-node Cassandra cluster for getting started with the Cassandra Archiver.
The Cassandra Archiver for Control System Studio is a set of plugins that extend the existing archive reader and writer architecture so that a database hosted by Apache Cassandra can be used instead of a traditional RDBMS like MySQL or Oracle.
By using a column-oriented database management system, huge amounts of channel samples can be archived. The Cassandra Archiver uses two column families for storing all samples of all channels for a certain compression level. Several samples of a channel are aggregated in a so-called bucket and each bucket is stored in one row. Each row storing a bucket is identified by a key that aggregates the channel name, the size of the bucket and the start time-stamp of the bucket (both in nanoseconds). Each column of a bucket row stores one sample, using the column's name for the sample's time-stamp and the column's value for the sample's value and meta-data (e.g. alarm severity). As data stored in column families is compressed by default before being written to disk, the space requirements of the database are reduced. Due to the way Cassandra stores data, the read and write performance of the database is not reduced by using compression. In fact, using compression can even slightly increase the data throughput.
The Cassandra Archiver can be regarded as a hybrid between the RDB Archiver and the Channel Archiver. Like the RDB Archiver, the Cassandra Archiver uses an existing, well-tested database management system for storing data. However, like the Channel Archiver, the Cassandra Archiver uses a storage format that is more optimized for storing channel samples and can provide high write and read rates.
The HyperArchiver uses a concept similar to that of the Channel Archiver. However, it uses Hypertable to store the samples and MySQL to store the configuration, while the Cassandra Archiver stores the configuration and the samples in the same database, simplifying installation and maintenance. For a HyperArchiver setup where the Hypertable server is not running on the same node as the archive engine, the source code of the HyperArchiver has to be modified, because important configuration values are hard-coded. Unlike Apache Cassandra, which does not have a single point of failure, Hypertable has a master server, which, when down, causes the whole cluster to fail. Besides, Cassandra is implemented in pure Java and thus 100 percent platform independent, while Hypertable needs to be compiled for each supported platform. In summary, the Cassandra Archiver is easier to set up and maintain and more reliable than the HyperArchiver, making it the better choice for most scenarios.
This section explains the various column families which are used to store the configuration and samples. If you are not interested in the details, you can simply skip this section and read on at the next chapter. The information in this section is not needed for setting up the Cassandra Archiver.
For row keys and column names, which have several parts, a composite type is used.
The engineConfiguration column family stores information about archive engines. The engine name, which must be unique, is used as the row key. Each row has columns with the names url and description storing the URL and the description of the respective archive engine.
The engineConfigurationToGroups column family maps engines to their respective archive groups. The engine name is used as the row key. A column exists for each group in the archive engine, using the name of the group as the column name.
The groupConfiguration column family stores the configuration for each group. The row key is a combination of the engine name and the group name. The column enablingChannel stores the name of the channel that enables or disables the group.
The groupConfigurationToChannels column family maps archive groups to the channels they contain. The row key is the same as used for the groupConfiguration column family. A column exists for each channel in the archive group, using the name of the channel as the column name.
The channelConfiguration column family stores information about channels. The channel name, which must be unique, is used as the row key. Each row has columns with the names engine, group, sampleMode, samplePeriod, sampleDelta and lastSampleTime, storing the engine and group each channel is associated with, the sampling options and the time of the last raw sample that has been written for the channel.
The channelConfigurationToCompressionLevels column family maps channels to their respective compression levels. The channel name is used as the row key. A column exists for each compression level that is configured for the respective channel. Each compression level is stored as a column, where the column name is the compression period and the column value is the retention period. The special compression level that stores raw samples always exists, even if there is no column with a column name of zero (this is the compression period internally assigned to the raw compression level).
The samplesBucketSize column family stores the bucket size of the sample buckets for all channels in the raw compression level. The row key is the channel name. Each column stores a bucket size, using the time the bucket size started to be used as the column name and the bucket size as the column value.
The same bucket size is usually used for many buckets in a row and the time-stamp used for a bucket size is not aligned with the time-stamp of one of these buckets. The time-stamp for the bucket size just means that samples with a time-stamp greater than or equal to the time-stamp for this bucket size are stored in a bucket of this size, unless there is a bucket size with a greater time-stamp that is still less than or equal to the sample's time-stamp.
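The lookup rule described above can be sketched in a few lines of Python. This is only an illustration and not the archiver's actual code: the function name bucket_size_for and the list-of-pairs representation of a samplesBucketSize row are assumptions made for this example.

```python
from bisect import bisect_right


def bucket_size_for(sample_time, bucket_sizes):
    """Return the bucket size that applies to a sample.

    bucket_sizes is a list of (start_time, bucket_size) pairs sorted by
    start_time. It stands in for the columns of one samplesBucketSize row
    (hypothetical representation, chosen for this sketch only).
    """
    start_times = [start_time for start_time, _ in bucket_sizes]
    # Index of the last entry whose start time is <= sample_time.
    index = bisect_right(start_times, sample_time) - 1
    if index < 0:
        return None  # no bucket size was in effect at that time
    return bucket_sizes[index][1]
```

A sample whose time-stamp is equal to the start time of an entry already uses that entry's bucket size, which matches the "greater than or equal to" wording above.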
This database structure is used because Cassandra does not perform well if many "skinny" rows are stored. Therefore, several samples have to be aggregated in one row. On the other hand, at most 2^31-1 columns can be stored in a row, which is clearly not enough for channels that change at a very high rate or are intended to be stored for a long period of time. In fact, the practical limit for the number of columns in a row is even lower than that. Typically, a good number of columns for one row is on the order of a few million.
The Cassandra Archiver tries to reach this number by choosing the bucket size (the period of time that is stored in one bucket) so that a bucket holds about one million samples at the scan period of a channel, that is, by dividing one million by the expected update rate. The only drawback of this database structure is that there is no way to tell what the time-stamp of the newest sample is. Therefore, the Cassandra Archiver only accepts samples that have a time-stamp at most two hours ahead of the current time and, when searching for samples, only looks for samples with a time-stamp at most four hours ahead. The extra two hours allow for a certain clock skew between participating systems.
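The sizing rule and the two time windows can be illustrated as follows. This is a minimal sketch under stated assumptions, not the implementation used by the archiver: it works in seconds rather than nanoseconds, it assumes that a bucket should hold about one million samples (so the bucket duration is one million times the period, which equals one million divided by the update rate), and all names are invented for this example.

```python
# Assumed target number of samples (columns) per bucket row.
TARGET_SAMPLES_PER_BUCKET = 1_000_000
# Time windows taken from the text, expressed in seconds.
MAX_WRITE_AHEAD_SECONDS = 2 * 3600
MAX_READ_AHEAD_SECONDS = 4 * 3600


def bucket_size_seconds(period_seconds):
    """Duration of one bucket for a channel with the given period."""
    return TARGET_SAMPLES_PER_BUCKET * period_seconds


def accepts_sample(sample_time, now):
    """A sample is accepted if it is at most two hours ahead of now."""
    return sample_time <= now + MAX_WRITE_AHEAD_SECONDS


def latest_search_time(now):
    """Readers look for samples up to four hours ahead of now."""
    return now + MAX_READ_AHEAD_SECONDS
```

For a channel with a period of 0.5 seconds, this sketch yields buckets that span 500000 seconds, a little less than six days.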
For each compression level, a column family with the name samplesBucketSize_*, where the asterisk is replaced by the compression period in seconds, is created. The structure of this column family is exactly the same as that of the samplesBucketSize column family.
The samples column family stores the actual raw samples for all channels. The row key is a combination of the channel name, the bucket's length and the bucket's start time-stamp. Each column stores a sample, using the sample's time-stamp as the column name and the sample's value and meta-data as the column value. The column value is a single blob aggregating all data of a sample except the time-stamp.
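How a sample ends up in a bucket row can be pictured with the following Python sketch. It is not the archiver's storage code: the in-memory dictionary stands in for the samples column family, the names are invented for this example, and the sketch assumes that buckets start at multiples of their size, which this manual does not state.

```python
from collections import namedtuple

# Row key of a bucket: channel name, bucket size and bucket start.
BucketKey = namedtuple("BucketKey", ["channel", "bucket_size", "bucket_start"])


def bucket_key(channel, sample_time_ns, bucket_size_ns):
    """Build the row key of the bucket that holds the given sample."""
    # Assumption: buckets are aligned to multiples of the bucket size.
    bucket_start = sample_time_ns - sample_time_ns % bucket_size_ns
    return BucketKey(channel, bucket_size_ns, bucket_start)


def store_sample(rows, channel, sample_time_ns, bucket_size_ns, value_blob):
    """Store one sample: column name = time-stamp, column value = blob."""
    key = bucket_key(channel, sample_time_ns, bucket_size_ns)
    rows.setdefault(key, {})[sample_time_ns] = value_blob
    return key
```

Two samples of the same channel whose time-stamps fall into the same bucket share one row and differ only in their column names.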
For each compression level, a column family with the name samples_*, where the asterisk is replaced by the compression period in seconds, is created. The structure of this column family is exactly the same as that of the samples column family.
For setting up a simple test environment for the Cassandra Archiver, four steps are needed. First, Apache Cassandra has to be installed. Second, the Cassandra Archiver Engine and the accompanying tools have to be installed. Third, the keyspace used by the Cassandra Archiver has to be set up and an initial archiver engine configuration has to be imported. Finally, the Cassandra Archiver Reader has to be installed in Control System Studio.
All the steps needed to install and configure the Cassandra Archiver are described in Chapter 5, Installation and Chapter 6, Configuration. If you are using a simple setup, where the Archive Engine, the Apache Cassandra Server and Control System Studio are all running on the same host, you can simply skip the sections marked as optional in these two chapters.
This section describes the steps needed for setting up the Apache Cassandra server for use with the Cassandra Archiver.
Important | |
---|---|
Earlier versions of the Cassandra Archiver (before version 2.0.0) used a different database structure which required an order-preserving partitioner to be used. Newer versions of the Cassandra Archiver (starting with version 2.0.0) no longer require this and in fact should not be installed on a cluster with an order-preserving partitioner, because this will cause hot-spots. |
You can download Apache Cassandra from the project's website or use one of the builds provided by DataStax. You should choose the newest version of the binary download from the 1.2 branch. Apache Cassandra is implemented in Java, so that the binary download is the same for all platforms. You need a Java Runtime Environment version 6 or higher in order to run Cassandra.
For the rest of this document, we assume that Cassandra is installed in /path/to/cassandra. The actual location depends on which of the provided binary packages you use.
Apache Cassandra stores its configuration in /path/to/cassandra/conf. For a simple, single-node configuration, there are two relevant files: cassandra.yaml and log4j-server.properties.
Before starting Cassandra, you either have to change the paths where Cassandra stores its data, or you have to create the directories used by default and make sure that the user running Cassandra can write to these directories.
There are four directories Cassandra uses to store data. The first three are configured in cassandra.yaml. The option data_file_directories is set to /var/lib/cassandra/data by default and defines where the actual data from the various column families is saved. The option saved_caches_directory defaults to /var/lib/cassandra/saved_caches and is used to store cached data. The third option is the commitlog_directory, which defaults to /var/lib/cassandra/commitlog. This directory is used for storing the write-ahead log. If you aim for maximum performance, you might want to consider storing the commit log on a different disk than the data directories. For simple test setups, however, storing the commit log on the same disk is fine.
The last directory is configured in log4j-server.properties and is used to store the server log. The option log4j.appender.R.File defaults to /var/log/cassandra/system.log. In contrast to the other options, this option specifies a file and not a directory.
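If you keep the default locations, the directories can be prepared with a small helper like the one below. This is a sketch, not part of Cassandra or the Cassandra Archiver: the function name is invented for this example, and it has to be run as the user that will run Cassandra (creating directories below /var usually requires additional privileges and raises an error otherwise).

```python
import os

# Default locations named in this section. The last entry is the
# directory that contains the default server log file.
DEFAULT_DIRECTORIES = [
    "/var/lib/cassandra/data",
    "/var/lib/cassandra/saved_caches",
    "/var/lib/cassandra/commitlog",
    "/var/log/cassandra",
]


def ensure_writable_directories(paths):
    """Create missing directories and return those that are not writable."""
    not_writable = []
    for path in paths:
        os.makedirs(path, exist_ok=True)
        # Writing into a directory needs write and search permission.
        if not os.access(path, os.W_OK | os.X_OK):
            not_writable.append(path)
    return not_writable
```

Calling ensure_writable_directories(DEFAULT_DIRECTORIES) returns an empty list if all four locations exist and are writable for the current user.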
Note | |
---|---|
If the Cassandra server, the Cassandra Archiver Engine and the Control System Studio client are all running on the same machine, you can skip this step. |
There are five configuration options regarding the network interface used by the Cassandra server. The first three options (storage_port, ssl_storage_port and listen_address) are only relevant for a multi-node Cassandra cluster and thus outside the scope of this manual.
The other two options (rpc_address and rpc_port) are relevant if you want to run Control System Studio or the archive engine on different machines than the Cassandra server. By default, rpc_address is configured to only listen on the loopback interface. You should change this to the IP address of the network interface your machine uses to connect to the rest of the network. If you are sure that your hostname and IP address configuration is correct (in particular, that /etc/hosts and /etc/hostname are configured correctly), you can also set a blank value to make Cassandra determine the right IP address by itself.
The rpc_port option only needs to be changed if you run two or more Cassandra servers on the same host, or if a different service uses the same port. By default, TCP port 9160 is used for the Thrift service. If you change this port number, you also have to adjust the setting in the archive engine and archive reader configurations.
Note | |
---|---|
Configuring the authentication options is completely optional. By default, Cassandra grants full write-access to all connections without any authentication. If using Cassandra in a production environment, you might want to use authentication for better security however. |
Cassandra's security system is divided into two components: authentication and authorization. Authentication is the task of checking credentials provided by a client and assigning a principal. Authorization is the task of checking whether a specific principal may perform a certain operation.
By default Cassandra is distributed with an authenticator which accepts any credentials and an authority which grants any permission to any principal.
The SimpleAuthenticator and SimpleAuthority are part of the Cassandra source code but are not distributed with the binary distribution. For your convenience, a JAR file with the compiled versions of the two classes is distributed with the Cassandra Archiver in the cassandra-simpleauth directory. Copy this JAR to the lib directory of the Cassandra installation and add the following two lines to the end of the cassandra-env.sh configuration file:
JVM_OPTS="$JVM_OPTS -Dpasswd.properties=$CASSANDRA_CONF/passwd.properties"
JVM_OPTS="$JVM_OPTS -Daccess.properties=$CASSANDRA_CONF/access.properties"
Besides adding these system properties, you also have to adjust the authenticator and authority options in cassandra.yaml to refer to org.apache.cassandra.auth.SimpleAuthenticator and org.apache.cassandra.auth.SimpleAuthority respectively.
You also have to create the configuration files passwd.properties and access.properties in the conf directory of the Cassandra installation.
The passwd.properties file uses a simple syntax where the property name is the username and the property value is the clear-text password for the user. The following example defines four users with different passwords:
admin=superSafePassword
archive-read=somePassword
archive-write=someDifferentPassword
archive-config=anotherPassword
The access.properties file uses a syntax where the property name represents a privilege and the property value is a comma-separated list of principals which are granted that privilege. The following example assigns four levels of privileges: The user admin may perform any operation, the user archive-read may read data from the column families in the cssArchive keyspace, the user archive-write may write data to the column families samples and samplesBucketSize in the cssArchive keyspace, and the user archive-config may write data to the configuration column families in the cssArchive keyspace:
<modify-keyspaces>=admin
cssArchive.<ro>=archive-read,archive-write,archive-config
cssArchive.<rw>=admin
cssArchive.engineConfiguration.<ro>=archive-read,archive-write
cssArchive.engineConfiguration.<rw>=archive-config,admin
cssArchive.engineConfigurationToGroups.<ro>=archive-read,archive-write
cssArchive.engineConfigurationToGroups.<rw>=archive-config,admin
cssArchive.groupConfiguration.<ro>=archive-read,archive-write
cssArchive.groupConfiguration.<rw>=archive-config,admin
cssArchive.groupConfigurationToChannels.<ro>=archive-read,archive-write
cssArchive.groupConfigurationToChannels.<rw>=archive-config,admin
cssArchive.channelConfiguration.<ro>=archive-read,archive-write
cssArchive.channelConfiguration.<rw>=archive-config,admin
cssArchive.channelConfigurationToCompressionLevels.<ro>=archive-read,archive-write
cssArchive.channelConfigurationToCompressionLevels.<rw>=archive-config,admin
cssArchive.samples.<ro>=archive-read,archive-config
cssArchive.samples.<rw>=archive-write,admin
cssArchive.samplesBucketSize.<ro>=archive-read,archive-config
cssArchive.samplesBucketSize.<rw>=archive-write,admin
The lines for the samples and samplesBucketSize families have to be repeated for all the column families used for compression levels (e.g. samples_5 and samplesBucketSize_5 for the compression level with a compression period of five seconds).
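Since these lines follow a fixed pattern, they can be generated instead of being typed by hand. The following Python sketch shows one way to do this. The function name is invented for this example, and the principals simply mirror the ones used for the samples and samplesBucketSize families in the example file above.

```python
def compression_level_access_lines(compression_periods, keyspace="cssArchive"):
    """Return access.properties lines for the given compression periods.

    compression_periods are the compression periods in seconds, e.g. [30, 300].
    """
    lines = []
    for period in compression_periods:
        for family in ("samples", "samplesBucketSize"):
            name = "{}.{}_{}".format(keyspace, family, period)
            # Same principals as for the raw sample column families.
            lines.append(name + ".<ro>=archive-read,archive-config")
            lines.append(name + ".<rw>=archive-write,admin")
    return lines
```

For example, print("\n".join(compression_level_access_lines([30, 300]))) produces the eight lines needed for compression periods of 30 and 300 seconds, which can then be appended to access.properties.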
While you can grant write permissions for the whole keyspace to the archive-write user, this does not work for read permissions due to the limitations of the SimpleAuthority.
Warning | |
---|---|
The |
The Cassandra server can be started using the script /path/to/cassandra/bin/cassandra. You can use the -f flag to start Cassandra in the foreground (recommended when testing Cassandra for the first time).
In order to use the Cassandra Archiver, you first have to create the keyspace and the column families used by the archiver. You can do this by starting /path/to/cassandra/bin/cassandra-cli -h <hostname or IP address of your Cassandra server>. If you enabled authentication for your Cassandra server, you have to specify additional parameters. Call cassandra-cli --help for a list of all supported command-line parameters.
Once you successfully started the Cassandra CLI and it is connected to the Cassandra server, you can execute the following command to create the keyspace for the Cassandra Archiver.
CREATE KEYSPACE cssArchive;
Instead of cssArchive you can use a different name for the keyspace. However, if you do not use the default keyspace name, you will have to configure the keyspace name for the tools using the Cassandra server. The column-family names are fixed and cannot be changed.
After downloading the binary distribution from the Cassandra Archiver website you should unpack the archive. The archive contains four directories:
archive-engine
archive-cleanup-tool
archive-config-tool
css-plugins
While the programs in the first three directories can be used as-is, the files in the css-plugins directory have to be copied to the plugins directory of your Control System Studio installation. The plugins have been developed for version 3.1 of CSS, so they might not work with other versions.
In order to keep your CSS installation small you might want to consider using the JSON Archive Proxy instead of installing the Cassandra Archiver plugins in CSS directly. The JSON Archive Proxy only needs two plugins (instead of about twenty for the Cassandra Archiver) and can help you decouple the actual store type and version used from your CSS installation.
If the Cassandra server is not running on the same host as the archive engine, you have configured Cassandra to listen on a different port than the default port, or you enabled authentication, you have to create a plug-in customization file.
While the archive config-tool and the archive cleanup-tool can also be configured using command-line parameters, the use of a plug-in customization file is mandatory for the archive engine. For Control System Studio, no plug-in customization file is needed, because all options can be set in the archive URL.
The plug-in customization file is usually called plugin_customization.ini and placed in the root directory of the software it is used for. Here is an example of a plug-in customization file specifying the relevant options for the Cassandra Archiver:
; Comma-Separated List of Cassandra Servers.
; You can specify only one server, but if you have a cluster
; with several nodes, you want to list more here for fail-over.
com.aquenos.csstudio.archive.cassandra/hosts=first-host.example.com,second-host.example.com
; Thrift Port for the Cassandra Server(s).
com.aquenos.csstudio.archive.cassandra/port=9160
; Cassandra Keyspace Name.
com.aquenos.csstudio.archive.cassandra/keyspace=cssArchive
; Cassandra Username
com.aquenos.csstudio.archive.cassandra/username=myCassandraWriteUser
; Cassandra Password
com.aquenos.csstudio.archive.cassandra/password=myPassword
; Number of Compressor Worker Threads
com.aquenos.csstudio.archive.writer.cassandra/numCompressorWorkers=1
; Consistency Levels
;com.aquenos.csstudio.archive.cassandra/readDataConsistencyLevel=QUORUM
;com.aquenos.csstudio.archive.cassandra/writeDataConsistencyLevel=QUORUM
;com.aquenos.csstudio.archive.cassandra/readMetaDataConsistencyLevel=QUORUM
;com.aquenos.csstudio.archive.cassandra/writeMetaDataConsistencyLevel=QUORUM
The hosts property has to be specified if the Cassandra server is not running on the same host as the archive engine or if you use a multi-node Cassandra cluster.
The port property has to be specified if you do not use the default Thrift port.
The keyspace property has to be specified if you are using a different keyspace name than cssArchive.
The username and password properties have to be specified if you enabled authentication for the Cassandra server.
The numCompressorWorkers property (note the different bundle name) specifies how many threads run in parallel to perform the sample compression and deletion (see Section 6.2, “Compression Levels”). The default setting is 1. This number can be increased if the compression process does not catch up with the generation of new data (usually because the same archive engine is handling a lot of channels). If this number is set to zero, the compression process is disabled. This means that no data for compression levels is generated and old samples are not deleted. This option was introduced in version 1.2.0. In earlier versions there is always exactly one compressor thread.
The readDataConsistencyLevel, writeDataConsistencyLevel, readMetaDataConsistencyLevel and writeMetaDataConsistencyLevel properties specify the consistency levels being used for reading data (samples), writing data, reading meta-data (sample bucket sizes and configuration information) and writing meta-data respectively. For most scenarios (in particular for a single-node setup) there is no need to change these parameters. If you change them, you should be very careful to apply the changes to all programs (the archive engine, the archive configuration tool and the archive clean-up tool) at the same time. You should also make sure that the number of replicas read plus the number of replicas written for a certain category of data is always greater than the replication factor (the number of replicas stored for each piece of data). If the sum is less than or equal to the replication factor, reads may return inconsistent data.
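The rule for the replica counts can be checked with a few lines of Python before changing the consistency levels. This is a sketch for illustration only: the function names are invented, and only the consistency levels ONE, TWO, THREE, QUORUM and ALL are covered. A quorum is computed as more than half of the replicas.

```python
def replicas_for_level(level, replication_factor):
    """Number of replicas that have to answer for a consistency level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError("consistency level not covered by this sketch: " + level)


def reads_are_consistent(read_level, write_level, replication_factor):
    """True if replicas read plus replicas written exceed the replication factor."""
    read_replicas = replicas_for_level(read_level, replication_factor)
    write_replicas = replicas_for_level(write_level, replication_factor)
    return read_replicas + write_replicas > replication_factor
```

With a replication factor of three, reading and writing with QUORUM passes the check (2 + 2 > 3), while reading and writing with ONE does not (1 + 1 is not greater than 3).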
In order to tell a program to use the plugin_customization.ini you can use the command-line parameter -pluginCustomization plugin_customization.ini.
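To double-check which connection settings a customization file actually defines, the file can be inspected with a short script. The following Python sketch is not part of the Cassandra Archiver. It treats lines starting with a semicolon or a hash sign as comments, and it assumes localhost as the host when the hosts property is missing, which this manual only implies. All function names are invented for this example.

```python
CASSANDRA_BUNDLE = "com.aquenos.csstudio.archive.cassandra"


def read_plugin_customization(text):
    """Parse key=value lines, skipping blank lines and comment lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in ";#":
            continue
        key, separator, value = line.partition("=")
        if separator:
            settings[key.strip()] = value.strip()
    return settings


def cassandra_connection_settings(settings):
    """Extract hosts, port and keyspace, falling back to the defaults."""
    prefix = CASSANDRA_BUNDLE + "/"
    return {
        # Assumption: without a hosts property the local host is used.
        "hosts": settings.get(prefix + "hosts", "localhost").split(","),
        "port": int(settings.get(prefix + "port", "9160")),
        "keyspace": settings.get(prefix + "keyspace", "cssArchive"),
    }
```

The text of the file can be obtained with open("plugin_customization.ini").read() and passed to read_plugin_customization.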
Note | |
---|---|
A configuration has to be loaded into the database before the archive engine can be started. Refer to Section 6.3, “Loading the Configuration” for details about how to load a configuration. |
The archive engine can be started by changing to the directory where it is installed (usually archive-engine) and executing ArchiveEngine.sh.
You will have to specify a few parameters, e.g. ./ArchiveEngine.sh -engine MyEngineName -data workspace. Two instances of the archive engine cannot share the same engine name or workspace, so make sure the parameter values are unique within your cluster.
Call ./ArchiveEngine.sh -help for a full list of available command-line options. If the Cassandra database is not running on the same host as the archive engine, you are using a non-default keyspace name, or you enabled authentication, you can specify a plug-in customization file using the -pluginCustomization parameter. See Section 5.2.2, “Configuration” for details on how to define plug-in customization options.
Follow the instructions in Section 5.2.1, “Download and Installation” for installing the plugins needed to integrate the Cassandra Archive Reader into the data browser in Control System Studio.
The Cassandra Archive Reader is configured the same way as the other archive readers in Control System Studio:
In Control System Studio, open the preferences window. In the tree on the left, navigate to the preference page that contains the list Archive Data Server URLs. Now you can add the URL of the Cassandra database to that list.
The URLs supported by the Cassandra Archive Reader have the format cassandra://<hosts>:<port>/<keyspace>?username=<username>&password=<password>.
In EBNF the syntax is:
|
The symbols used but not defined here, are defined in RFC 3986.
For a multi-node Cassandra setup, the list of hosts should include all hosts which export the service via Thrift. In this case the client can try all available hosts and continue operation if some of the hosts are down. The port specified here must be the same as the Thrift port specified in the Cassandra configuration (see the section called “Configuring the Network Interface”). This port must be the same for all nodes in the cluster.
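An archive URL in this format can be assembled with the following Python sketch, which also takes care of escaping special characters in the username and password. The sketch rests on two assumptions that this manual does not confirm: several hosts are separated by commas (as in the hosts property of the plug-in customization file), and the query part can be left out when authentication is not used. The function name is invented for this example.

```python
from urllib.parse import quote, urlencode


def cassandra_archive_url(hosts, port=9160, keyspace="cssArchive",
                          username=None, password=None):
    """Build a cassandra:// archive URL from its parts."""
    # Assumption: multiple hosts are given as a comma-separated list.
    url = "cassandra://{}:{}/{}".format(
        ",".join(hosts), port, quote(keyspace, safe=""))
    if username is not None:
        query = {"username": username, "password": password or ""}
        # quote_via=quote encodes spaces as %20 instead of "+".
        url += "?" + urlencode(query, quote_via=quote)
    return url
```

For a single local server without authentication, cassandra_archive_url(["localhost"]) returns cassandra://localhost:9160/cssArchive.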
Basically, the configuration format used by the Cassandra Archiver is the same that is used by the RDB Archiver. However, the syntax is extended by a new tag used to configure compression levels.
For explaining the syntax of the configuration file, we use a simple example:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<engineconfig>
  <group>
    <name>firstGroup</name>
    <channel>
      <name>firstChannel</name>
      <period>0.5</period>
      <monitor/>
      <compression-level retention-period="86400"/>
      <compression-level compression-period="30"/>
      <compression-level compression-period="300"/>
    </channel>
    <channel>
      <enable/>
      <name>secondChannel</name>
      <period>1</period>
      <scan/>
    </channel>
  </group>
  <group>
    <name>anotherGroup</name>
    <channel>
      <name>someOtherChannel</name>
      <period>10</period>
      <scan/>
      <compression-level compression-period="30"/>
      <compression-level compression-period="300"/>
    </channel>
  </group>
  <group>
    <name>__disabled_channels</name>
    <channel>
      <name>someOldChannel</name>
      <period>5</period>
      <scan/>
      <compression-level compression-period="30"/>
    </channel>
  </group>
</engineconfig>
Every engine configuration is enclosed by the engineconfig tag. Within the engineconfig there must be at least one group. Each group must have a name. The group name must be unique within the engine configuration.
Within a group, there can be an arbitrary number of channel tags. Each channel must specify a name. The channel name must be unique across all engine configurations.
A channel
must also specify a period
and either
the scan
or monitor
mode.
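The structural rules above can be checked mechanically before a file is imported. The following is a minimal sketch, not part of the Cassandra Archiver: the function name check_engine_config is hypothetical, and it can only check channel names within a single file, whereas the archiver requires them to be unique across all engine configurations.

```python
import xml.etree.ElementTree as ET


def check_engine_config(xml_text):
    """Return a list of rule violations found in one engine configuration."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "engineconfig":
        problems.append("root element must be engineconfig")
    groups = root.findall("group")
    if not groups:
        problems.append("at least one group is required")
    seen_groups = set()
    seen_channels = set()
    for group in groups:
        group_name = (group.findtext("name") or "").strip()
        if not group_name:
            problems.append("group without a name")
        elif group_name in seen_groups:
            problems.append("duplicate group name: " + group_name)
        seen_groups.add(group_name)
        for channel in group.findall("channel"):
            channel_name = (channel.findtext("name") or "").strip()
            if not channel_name:
                problems.append("channel without a name in group " + group_name)
            elif channel_name in seen_channels:
                # Only detects duplicates inside this one file.
                problems.append("duplicate channel name: " + channel_name)
            seen_channels.add(channel_name)
            if channel.find("period") is None:
                problems.append("channel " + channel_name + " has no period")
            if channel.find("scan") is None and channel.find("monitor") is None:
                problems.append("channel " + channel_name + " needs scan or monitor")
    return problems
```

An empty list means that none of the rules checked here are violated; it does not guarantee that the archive config tool will accept the file.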
In scan mode, the period specifies the interval (as a floating point number in seconds) between the snapshots taken from the channel. If the channel has not changed since the last snapshot, the new snapshot is discarded.
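The scan-mode behaviour can be illustrated with a short sketch. It is an illustration only, not the archiver's code: the function name scan_snapshots is hypothetical, and treating "has not changed" as "has the same value" is an assumption.

```python
def scan_snapshots(readings):
    """Keep only the snapshots that differ from the last stored snapshot.

    readings: iterable of (timestamp, value) pairs, one per scan period.
    """
    stored = []
    last_value = object()  # sentinel that compares unequal to any real value
    for timestamp, value in readings:
        if value != last_value:
            stored.append((timestamp, value))
            last_value = value
        # Otherwise the channel has not changed and the snapshot is discarded.
    return stored
```

For a channel scanned once per second that holds the value 1 for two scans, then 2 for two scans, then 1 again, only the first snapshot of each run is kept.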
In monitor mode, every change received for the channel is saved. In this case, the period specifies the expected time between changes. This value is used to allocate the queue that stores new samples before they are written to the database. If the specified period is too long and the channel actually changes more often, samples might be lost because the queue fills up. If the specified period is much shorter than the actual time between changes, more memory than needed is allocated for the channel. The size of the buckets storing the samples is also determined from the configured period. Choosing a period that is too far off the actual period will result either in too many buckets being created (hurting read and write performance) or in too many samples being stored in a single bucket (which means that rows will grow very big). Therefore, you should set the period to the average time between changes that you expect. If you are worried about losing samples during bursts, you can increase the org.csstudio.archive.engine/buffer_reserve property.
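The trade-off described above can be made concrete with a small model of a bounded queue. This is a sketch under stated assumptions, not the archiver's implementation: the class SampleQueue, the function capacity_for, the write_interval parameter, and the sizing rule (expected samples per write interval multiplied by a reserve factor) are all hypothetical, and the sketch assumes that a full queue rejects the newest sample.

```python
import math
from collections import deque


class SampleQueue:
    """Fixed-capacity buffer that counts the samples it could not accept."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.samples = deque()
        self.lost = 0

    def add(self, sample):
        if len(self.samples) >= self.capacity:
            self.lost += 1  # queue is full, so this sample is lost
            return False
        self.samples.append(sample)
        return True

    def drain(self):
        """Hand all queued samples to the writer and empty the queue."""
        batch = list(self.samples)
        self.samples.clear()
        return batch


def capacity_for(period, write_interval, reserve):
    # Hypothetical rule: expected samples per write interval times a reserve factor.
    return max(1, math.ceil(write_interval / period * reserve))
```

Under this model, a configured period that is longer than the real time between changes yields a capacity that is too small, so add starts returning False before the next drain; a larger reserve factor delays that point at the cost of memory.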
The compression-level tag is optional and its meaning is discussed in the next section.
The group name __disabled_channels is reserved for special use: you can use this group for channels that you want to disable permanently (e.g. because the corresponding device has been removed), but that you still want to see in the data browser. Moving a channel to this group has essentially the same effect as moving it to a group with an enabling channel that is always false, but it avoids the warning about the channel being disconnected.
Unlike the RDB Archiver, the Cassandra Archiver does not compress samples for each read request, but stores the compressed samples instead. The advantage is that queries requesting samples for a long period have to read less data and can thus be answered more quickly.
The compression levels are configured independently for each channel. Each compression-level (except the compression level that stores the raw samples and has an implicit compression period of zero) must specify a compression-period attribute. This interval (an integer number of seconds) specifies the time between two compressed samples. If two consecutive samples have the same (average) value as well as the same minimum and maximum bounds, the second sample is not saved. All compressed samples are aligned to January 1st, 1970, 00:00:00 UTC. This way, the compressed samples from two different channels that use the same compression period are aligned with respect to each other.
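The alignment and the suppression of repeated samples can be sketched as follows. This is an illustration under assumptions, not the archiver's sample compressor: the function name compress is hypothetical, the average is taken as a plain arithmetic mean, and each compressed sample is labelled with the start of its interval.

```python
def compress(samples, compression_period):
    """Compress raw samples into (interval_start, mean, minimum, maximum) tuples.

    samples: list of (timestamp, value) pairs with integer timestamps in
    seconds since January 1st, 1970, 00:00:00 UTC, sorted by time.
    """
    buckets = {}
    for timestamp, value in samples:
        # Align to the epoch: every interval starts at a multiple of the period.
        start = timestamp - (timestamp % compression_period)
        buckets.setdefault(start, []).append(value)
    result = []
    for start in sorted(buckets):
        values = buckets[start]
        entry = (sum(values) / len(values), min(values), max(values))
        if result and result[-1][1:] == entry:
            continue  # same mean, minimum and maximum as the previous sample
        result.append((start,) + entry)
    return result
```

Because the interval start depends only on the timestamp and the period, two channels that use a compression period of 30 seconds both produce samples at multiples of 30 seconds, regardless of when their raw samples arrived.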
The retention-period attribute is optional for the compression-level tag. If a positive retention period (in integer seconds) is defined, samples that are older than the newest sample minus the specified period are deleted. The retention-period attribute is also valid for the special raw compression level. The raw compression level is always defined, even if you do not specify a compression-level for it. By default its retention period is zero (meaning that samples are never deleted).
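The retention rule can be written down in a few lines. This is a minimal sketch with a hypothetical function name, operating on plain timestamps rather than on the archiver's database.

```python
def apply_retention(timestamps, retention_period):
    """Return the sample timestamps (in seconds) that survive the retention rule.

    A retention period of zero means that samples are never deleted.
    """
    if retention_period <= 0 or not timestamps:
        return list(timestamps)
    # The cutoff is relative to the newest sample, not to the current time.
    cutoff = max(timestamps) - retention_period
    return [t for t in timestamps if t >= cutoff]
```

Note that the cutoff moves only when a newer sample arrives, so a channel that stops producing samples keeps its last retention period of data.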
Important | |
---|---|
When specifying a retention period, you have to make sure that all compressed samples have been calculated before the samples needed for this calculation are deleted. Compressed samples are usually calculated from the compression level with the next shorter compression period that is an even integer fraction of the compression period of the level to be calculated. If no such compression level exists, the raw samples are used instead. As a rule of thumb, the retention period for any compression level should be at least double the largest compression period for the same channel. |
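
The selection of the source level and the rule of thumb can be sketched as follows. Both function names are hypothetical, and the sketch reads "even integer fraction" as "divides the target compression period without remainder", which is an interpretation of the text above.

```python
def source_period(target_period, configured_periods):
    """Return the compression period whose samples feed target_period.

    The result 0 denotes the raw samples.
    """
    candidates = [
        p for p in configured_periods
        if 0 < p < target_period and target_period % p == 0
    ]
    # The "next shorter" level is the longest one among the valid candidates.
    return max(candidates) if candidates else 0


def retention_is_safe(retention_period, configured_periods):
    """Rule of thumb: retain at least double the largest compression period."""
    if retention_period <= 0:
        return True  # zero means that samples are never deleted
    return retention_period >= 2 * max(configured_periods, default=0)
```

For the example configuration with compression periods of 30 and 300 seconds, the 300-second level would be computed from the 30-second level, and the 30-second level from the raw samples. The raw retention period of 86400 seconds satisfies the rule of thumb.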
To load or update an engine configuration, you have to use the archive config tool, which is distributed in the archive-config-tool directory of the binary distribution. To import an engine configuration file, you can call ./ArchiveConfigTool.sh -engine myEngineName -config myEngineConfig.xml -import. If you want to replace the configuration of an existing engine, you have to add the -replace_engine parameter. Replacing an engine configuration will first delete the existing configuration and then import the new configuration. Thus, it is equivalent to first using the -delete_config parameter and then importing the configuration with the -replace_engine parameter.
Deleting an engine configuration will never delete the samples associated with the engine's channels. However, if a channel does not exist in the configuration, there is no way to retrieve its samples using the archive reader. Therefore, if you want to be able to retrieve historic data, you should move channels to a disabled group instead of deleting them completely. If you eventually want to delete the samples for deleted channels, you have to use the clean-up tool.
If the default connection parameters (Cassandra host is localhost, port is 9160, keyspace name is cssArchive and no authentication is used) are not correct for your setup, you either have to specify the connection parameters as command-line parameters, or you have to specify a plug-in customization file.
Call ./ArchiveConfigTool.sh -help for a list of all supported command-line parameters.
Important | |
---|---|
No channels should be modified using the archive configuration tool while the engine for the respective channel is running (in particular no channel should be deleted). This means that you should not use the |
If you want to delete the samples for non-existing channels or want to clean up small inconsistencies that can occur if a write operation is interrupted, you can use the clean-up tool.
The clean-up tool is distributed in the archive-cleanup-tool directory of the binary distribution. You can start it by invoking ./ArchiveCleanUpTool.sh. If you are using non-default connection parameters, the same considerations as for the archive config tool apply.
Important | |
---|---|
The run of the clean-up tool can take a very long time. During this time you should not use the archive config tool, because new configurations added by the config tool and the respective samples might be deleted by the clean-up tool. However, the archive engine can run while the clean-up process is running. |