Cassandra Archiver Manual

This manual is divided into four chapters (not counting this introduction). The first chapter introduces the concepts of Apache Cassandra in general and the Cassandra Archiver in particular. The second chapter describes the few steps needed to set up a basic installation of the Cassandra Archiver. The third chapter gives more detailed instructions on how to install the Cassandra Archiver. Finally, the fourth chapter explains how to configure the archiver.

Chapter 2. Concept

Table of Contents

2.1. Apache Cassandra

2.2. Cassandra Archiver for Control System Studio

2.3. Database Structure

2.3.1. Column Family engineConfiguration
2.3.2. Column Family engineConfigurationToGroups
2.3.3. Column Family groupConfiguration
2.3.4. Column Family groupConfigurationToChannels
2.3.5. Column Family channelConfiguration
2.3.6. Column Family channelConfigurationToCompressionLevels
2.3.7. Column Family compressionLevelConfiguration
2.3.8. Column Family samples

This chapter introduces the concepts behind the Cassandra Archiver and Apache Cassandra. First, column-oriented database systems and Apache Cassandra are presented briefly. Subsequently, the Cassandra Archiver for Control System Studio is introduced. Finally, the structure of the keyspace storing data for the Cassandra Archiver is explained. You might want to skip this last section if you are reading this manual for the first time and are just interested in getting started with the Cassandra Archiver.

2.1. Apache Cassandra

Apache Cassandra is a column-oriented database management system (CDBMS), which is optimized for storing large amounts (tera- or even petabytes) of data grouped in column families. It is a special form of a key-value store. Unlike a relational database management system (RDBMS), it is not optimized for storing relational data or modifying data in a transactional way. The main advantages of a CDBMS compared to an RDBMS are superior read-write performance, linear scalability and high availability at low operating costs.

In a CDBMS, data is stored in column families. Each column family contains an arbitrary number of rows, which (for a multi-node setup) are distributed over all cluster nodes. Each row is identified by a unique row key. Each row contains one or more columns. Each column is identified by a column name that must be unique within the respective row. Each column can, but does not have to, store a value. Row keys, column names and column values are stored as arrays of bytes. The meaning of the bytes depends on the application accessing the database. Therefore, data types are completely transparent to the CDBMS.

In a multi-node setup, data is distributed across the nodes, so that the amount of data stored is not limited by the disk space of a single computer. Typically, low-price servers which are not fault-tolerant are used and the same data is stored on multiple nodes (typically three). The database clients and servers have built-in facilities that automatically switch to a different node if the first node fails. Therefore, the database cluster is fault-tolerant and highly available, although cheap, unreliable computers are used.

This document does not provide a detailed introduction to column-oriented database management systems or Apache Cassandra. For understanding the concepts of a CDBMS, the original paper about Google Bigtable by Chang et al. is a good starting point.

If you want to set up a cluster of Cassandra servers or are interested in advanced configuration options and performance tuning, you should read the Apache Cassandra Documentation provided by DataStax. However, this manual describes the basic steps needed to set up a single-node Cassandra cluster for getting started with the Cassandra Archiver.

2.2. Cassandra Archiver for Control System Studio

The Cassandra Archiver for Control System Studio is a set of plugins that extend the existing archive reader and writer architecture so that a database hosted by Apache Cassandra can be used instead of a traditional RDBMS like MySQL or Oracle.

By using a column-oriented database management system, huge amounts of channel samples can be archived. The Cassandra Archiver uses one column family for storing all channel samples. Each row stores one sample and is identified by a key that aggregates the channel name, the time-stamp of the sample and the compression-level name. The columns of each row store the sample's value and meta-data (e.g. alarm severity). As the data in the samples column family is compressed before being written to disk, the space requirements of the database are reduced. Due to the way Cassandra stores data, the read and write performance of the database is not reduced by using compression. In fact, using compression can even slightly increase the data throughput.

The Cassandra Archiver can be regarded as a hybrid between the RDB Archiver and the Channel Archiver. Like the RDB Archiver, the Cassandra Archiver uses an existing, well-tested database management system for storing data. However, like the Channel Archiver, the Cassandra Archiver uses a storage format that is optimized for storing channel samples and can provide high write and read rates.

The HyperArchiver uses a concept similar to that of the Channel Archiver. However, it uses Hypertable to store the samples and MySQL to store the configuration, while the Cassandra Archiver stores the configuration and the samples in the same database, simplifying installation and maintenance. For a HyperArchiver setup where the Hypertable server is not running on the same node as the archive engine, the source code of the HyperArchiver has to be modified, because important configuration values are hard-coded. Unlike Apache Cassandra, which does not have a single point of failure, Hypertable has a master server, which, when down, causes the whole cluster to fail. Besides, Cassandra is implemented in pure Java and thus 100 percent platform independent, while Hypertable needs to be compiled for each supported platform. In summary, the Cassandra Archiver is easier to set up and maintain and more reliable than the HyperArchiver, making it the better choice for most scenarios.

2.3. Database Structure

This section explains the various column families which are used to store the configuration and samples. If you are not interested in the details, you can simply skip this section and read on at the next chapter. The information in this section is not needed for setting up the Cassandra Archiver.

For row keys which have several parts, the various parts are separated by a null byte. All row keys are prefixed with a (binary) MD5 hash followed by a null byte in order to make sure that they are evenly distributed across the cluster nodes. The MD5 hash is calculated by concatenating the constituent byte arrays of the key (without a separating null byte) and then calculating the MD5 hash of the resulting byte array.
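The key layout described above can be sketched in Python as follows. This is only an illustration, not the archiver's own code: the helper name make_row_key, the example names and the UTF-8 encoding of the key parts are assumptions that this manual does not specify.

```python
import hashlib

SEPARATOR = b"\x00"  # null byte between the parts of a row key


def make_row_key(parts):
    """Return MD5(parts concatenated without separator) + NUL + parts joined by NUL."""
    # The hash is computed over the plain concatenation of the parts.
    digest = hashlib.md5(b"".join(parts)).digest()  # 16 raw bytes, not hex
    return digest + SEPARATOR + SEPARATOR.join(parts)


# Hypothetical engine and group names, encoded as UTF-8 (an assumption).
key = make_row_key(["myEngine".encode("utf-8"), "myGroup".encode("utf-8")])
```

The resulting key starts with the 16-byte binary hash, followed by a null byte and the null-separated parts.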

2.3.1. Column Family `engineConfiguration`

The engineConfiguration column family stores information about archive engines. The engine name, which must be unique, is used as the row key. Each row has columns with the names url and description storing the URL and the description of the respective archive engine.

2.3.2. Column Family `engineConfigurationToGroups`

The engineConfigurationToGroups column family maps engines to their respective archive groups. The engine name is used as the row key. A column exists for each group in the archive engine, using the name of the group as the column name.

2.3.3. Column Family `groupConfiguration`

The groupConfiguration column family stores the configuration for each group. The row key is a combination of the engine name and the group name. The column enablingChannel stores the name of the channel that enables or disables the group.

2.3.4. Column Family `groupConfigurationToChannels`

The groupConfigurationToChannels column family maps archive groups to the channels they contain. The row key is the same as used for the groupConfiguration column family. A column exists for each channel in the archive group, using the name of the channel as the column name.

2.3.5. Column Family `channelConfiguration`

The channelConfiguration column family stores information about channels. The channel name, which must be unique, is used as the row key. Each row has columns with the names engine, group, sampleMode, samplePeriod, sampleDelta and lastSampleTime. They store the engine and group the channel is associated with, the sampling options, and the time of the last raw sample that has been written for the channel.

2.3.6. Column Family `channelConfigurationToCompressionLevels`

The channelConfigurationToCompressionLevels column family maps channels to their respective compression levels. The channel name is used as the row key. A column exists for each compression-level that is configured for the respective channel. However, the special "raw" compression level always exists, even if there is no column.

2.3.7. Column Family `compressionLevelConfiguration`

The compressionLevelConfiguration column family stores the configuration for each compression level of a channel. The row key is a combination of the channel name and the compression-level name. The columns compressionPeriod, retentionPeriod, lastSavedSampleTime and nextSampleTime store the period between samples (not for the "raw" compression level), the time after which samples are deleted, the time-stamp of the latest sample and the time-stamp of the next sample to be calculated (not for the "raw" compression level).

2.3.8. Column Family `samples`

The samples column family stores the actual samples for the different channels. The row key is a combination of the compression-level name, the channel name and the time-stamp. However, the time-stamp is not included when calculating the MD5 hash.
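A sketch of how such a sample key could be assembled, under the assumption (not stated in this manual) that the time-stamp is encoded as an unsigned 64-bit big-endian integer. The helper name, the UTF-8 encoding and the example channel are hypothetical as well. Because the time-stamp does not enter the hash, all samples of one channel and compression level share the same key prefix and differ only in the trailing time-stamp bytes.

```python
import hashlib
import struct


def make_sample_key(level, channel, timestamp):
    """Row key for a sample; the time-stamp is left out of the MD5 hash."""
    level_b = level.encode("utf-8")      # encoding is an assumption
    channel_b = channel.encode("utf-8")
    # Assumed encoding: unsigned 64-bit big-endian, so byte order equals time order.
    ts_b = struct.pack(">Q", timestamp)
    digest = hashlib.md5(level_b + channel_b).digest()
    return digest + b"\x00" + b"\x00".join([level_b, channel_b, ts_b])


early = make_sample_key("raw", "firstChannel", 1000)
late = make_sample_key("raw", "firstChannel", 2000)
```

With this encoding, early sorts before late in plain byte order, which is what makes range queries over a time interval possible.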

The columns severity and status exist for all rows and store the alarm severity and status of the sample.

For samples of the type IDoubleValue, the column valueDouble stores the value(s). For samples that are not in the "raw" compression level, the valueDoubleMin and valueDoubleMax columns store the minimum and maximum value in the compression interval.

For samples of the type IEnumValue the valueEnum column stores the value(s) of the sample. If the names associated with the different enum states are known, they are stored in the metaDataEnumStates column.

For samples of the type ILongValue the valueLong column stores the value(s) of the sample.

For samples of the type IStringValue the valueString column stores the value(s) of the sample.

If the sample has meta-data of the type INumericMetaData associated with it, the columns metaDataNumDispLow, metaDataNumDispHigh, metaDataNumWarnLow, metaDataNumWarnHigh, metaDataNumAlarmLow, metaDataNumAlarmHigh, metaDataNumPrecision and metaDataNumUnits store the meta-information for the sample.

For all samples except the first sample for a given channel and compression-level the column precedingSampleTime stores the timestamp of the sample directly preceding the sample.
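The precedingSampleTime column effectively forms a backward-linked list of samples. The following sketch walks such a chain using a plain dictionary as a hypothetical stand-in for the rows of one channel and compression level; it does not talk to Cassandra, and the time-stamps and values are made up.

```python
# Hypothetical in-memory stand-in: time-stamp -> columns of that row.
rows = {
    100: {"valueLong": 1},
    200: {"valueLong": 2, "precedingSampleTime": 100},
    300: {"valueLong": 3, "precedingSampleTime": 200},
}


def walk_backwards(rows, start):
    """Yield time-stamps from `start` back to the first sample."""
    current = start
    while current is not None:
        yield current
        # The first sample has no precedingSampleTime column.
        current = rows[current].get("precedingSampleTime")


chain = list(walk_backwards(rows, 300))
```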

Chapter 3. Getting Started

For setting up a simple test environment for the Cassandra Archiver, four steps are needed. First, Apache Cassandra has to be installed. Second, the Cassandra Archiver Engine and the accompanying tools have to be installed. Third, the keyspace used by the Cassandra Archiver has to be set up and an initial archiver engine configuration has to be imported. Finally, the Cassandra Archiver Reader has to be installed in Control System Studio.

All the steps needed to install and configure the Cassandra Archiver are described in Chapter 4, Installation and Chapter 5, Configuration. If you are using a simple setup, where the Archive Engine, the Apache Cassandra Server and Control System Studio are all running on the same host, you can simply skip the sections marked as optional in these two chapters.

Chapter 4. Installation

Table of Contents

4.1. Apache Cassandra

4.1.1. Download and Unpacking
4.1.2. Configuration

4.2. Cassandra Archive Engine and Tools

4.2.1. Download and Installation
4.2.2. Configuration
4.2.3. Starting the Archive Engine

4.3. Cassandra Archive Reader for Control System Studio

4.3.1. Download and Installation
4.3.2. Configuration

4.1. Apache Cassandra

This section describes the steps needed for setting up the Apache Cassandra server for use with the Cassandra Archiver.

	Important
	This section contains important information about configuration options that must be set for the Cassandra Archiver to work correctly. Thus, you should carefully read this section (in particular the section called “Configuring the Partitioner”), even if you already have a running Cassandra server.

4.1.1. Download and Unpacking

You can download Apache Cassandra from the project's website. You should choose the newest version of the binary download from the 1.0 branch, having a filename like apache-cassandra-1.0.x-bin.tar.gz. Apache Cassandra is implemented in Java, so that the binary download is the same for all platforms. You need a Java Runtime Environment version 6 or higher in order to run Cassandra.

After downloading the tarball, extract it to some place on your hard-disk. For the rest of this document, we assume that you unpacked it to /path/to/cassandra.

4.1.2. Configuration

Apache Cassandra stores its configuration in /path/to/cassandra/conf. For a simple, single-node configuration, there are two relevant files: cassandra.yaml and log4j-server.properties.

Data Paths

Before starting Cassandra, you either have to change the paths where Cassandra stores its data, or you have to create the directories used by default and make sure that the user running Cassandra can write to these directories.

There are four directories Cassandra uses to store data. The first three are configured in cassandra.yaml. The option data_file_directories is set to /var/lib/cassandra/data by default and defines where the actual data from the various column families is saved. The option saved_caches_directory defaults to /var/lib/cassandra/saved_caches and is used to store cached data. The third option is the commitlog_directory, which defaults to /var/lib/cassandra/commitlog. This directory is used for storing the write-ahead log. If you aim for maximum performance, you might want to consider storing the commit-log on a different disk than the data directories. For most setups however, storing the commit-log on the same disk is fine.

The last directory is configured in log4j-server.properties and is used to store the server log. The option log4j.appender.R.File defaults to /var/log/cassandra/system.log. In contrast to the other options, this option specifies the file and not a directory.

Configuring the Partitioner

	Important
	An order-preserving partitioner must be used for the Cassandra Archiver. The partitioner cannot be changed after data has been stored in the database, therefore you have to change this option before starting Cassandra for the first time.

In cassandra.yaml the partitioner option has to be changed to refer to org.apache.cassandra.dht.ByteOrderedPartitioner. The Cassandra Archiver uses key-range queries for retrieving samples in a specific time range, so an ordered partitioner must be used. If you have other applications which do not use range queries, you should run them on a different Cassandra cluster using the random partitioner. Using applications which are designed for use with the random partitioner on a cluster with an ordered partitioner will lead to unequal data distribution across the cluster nodes and bad read and write performance.
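To illustrate why byte ordering matters, the following sketch models a byte-ordered store as a sorted Python list and answers a time-range query with two binary searches. It is a toy model, not Cassandra client code: the key prefix, the 64-bit big-endian time-stamp encoding and the function names are assumptions, and the MD5 prefix is omitted because it is constant for one channel and compression level.

```python
import bisect
import struct

# Hypothetical key prefix for one channel and compression level
# (the constant hash prefix is left out for brevity).
PREFIX = b"raw\x00firstChannel\x00"


def key_for(timestamp):
    # Assumed encoding: 64-bit big-endian, so keys sort in time order.
    return PREFIX + struct.pack(">Q", timestamp)


# Byte-ordered storage keeps rows sorted by their raw key bytes.
stored = sorted(key_for(t) for t in (50, 10, 40, 20, 30))


def range_query(start, end):
    """Return the time-stamps of all stored keys with start <= t <= end."""
    lo = bisect.bisect_left(stored, key_for(start))
    hi = bisect.bisect_right(stored, key_for(end))
    return [struct.unpack(">Q", k[-8:])[0] for k in stored[lo:hi]]


result = range_query(15, 40)
```

If the whole key were hashed, as a random partitioner does, neighbouring time-stamps would no longer be neighbours in storage and such a contiguous slice would not exist.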

Configuring the Network Interface

	Note
	If the Cassandra server, the Cassandra Archiver Engine and the Control System Studio client are all running on the same machine, you can skip this step.

There are five configuration options regarding the network interface used by the Cassandra server. The first three options (storage_port, ssl_storage_port and listen_address) are only relevant for a multi-node Cassandra cluster and thus outside the scope of this manual.

The other two options (rpc_address and rpc_port) are relevant if you want to run Control System Studio or the archive engine on different machines than the Cassandra server. By default rpc_address is configured to only listen on the loopback interface. You should change this to the IP address of the network interface your machine uses to connect to the rest of the network. If you are sure that your hostname and IP address configuration is correct (in particular that /etc/hosts and /etc/hostname are configured correctly), you can also set a blank value to make Cassandra determine the right IP address by itself.

The rpc_port option only needs to be changed if you run two or more Cassandra servers on the same host, or if a different service uses the same port. By default TCP port 9160 is used for the Thrift service. If you change this port number, you also have to adjust the setting in the archive engine and archive reader configurations.

Configuring Authentication Options

	Note
	Configuring the authentication options is completely optional. By default, Cassandra grants full write-access to all connections without any authentication. If using Cassandra in a production environment, you might want to use authentication for better security however.

Cassandra's security system is divided into two components: authentication and authorization. Authentication is the task of checking credentials provided by a client and assigning a principal. Authorization is the task of checking whether a specific principal may perform a certain operation.

By default Cassandra is distributed with an authenticator which accepts any credentials and an authority which grants any permission to any principal.

The SimpleAuthenticator and SimpleAuthority are part of the Cassandra source code but are not distributed with the binary distribution.

For your convenience, a JAR file with the compiled versions of the two classes is distributed with the Cassandra Archiver in the cassandra-simpleauth directory.

Copy this JAR to the lib directory of the Cassandra installation and add the following two lines to the end of the cassandra-env.sh configuration file:

JVM_OPTS="$JVM_OPTS -Dpasswd.properties=$CASSANDRA_CONF/passwd.properties"
JVM_OPTS="$JVM_OPTS -Daccess.properties=$CASSANDRA_CONF/access.properties"

Besides adding these system properties, you also have to adjust the authenticator and authority options in cassandra.yaml to refer to org.apache.cassandra.auth.SimpleAuthenticator and org.apache.cassandra.auth.SimpleAuthority respectively.

You also have to create the configuration files passwd.properties and access.properties in the conf directory of the Cassandra installation.

The passwd.properties file uses a simple syntax where the property name is the username and the property value is the clear-text password for the user. The following example defines four users with different passwords:

admin=superSafePassword
archive-read=somePassword
archive-write=someDifferentPassword
archive-config=anotherPassword

The access.properties file uses a syntax where the property name represents a privilege and the property value is a comma-separated list of principals which are granted that privilege. The following example assigns four levels of privileges: the user admin may perform any operation, the user archive-read may read data from the column families in the cssArchive keyspace, the user archive-write may write data to the column families samples, channelConfiguration and compressionLevelConfiguration in the cssArchive keyspace, and the user archive-config may write data to any column family in the cssArchive keyspace:

<modify-keyspaces>=admin
cssArchive.<ro>=archive-read,archive-write,archive-config
cssArchive.<rw>=admin
cssArchive.engineConfiguration.<ro>=archive-read,archive-write
cssArchive.engineConfiguration.<rw>=archive-config,admin
cssArchive.engineConfigurationToGroups.<ro>=archive-read,archive-write
cssArchive.engineConfigurationToGroups.<rw>=archive-config,admin
cssArchive.groupConfiguration.<ro>=archive-read,archive-write
cssArchive.groupConfiguration.<rw>=archive-config,admin
cssArchive.groupConfigurationToChannels.<ro>=archive-read,archive-write
cssArchive.groupConfigurationToChannels.<rw>=archive-config,admin
cssArchive.channelConfiguration.<ro>=archive-read
cssArchive.channelConfiguration.<rw>=archive-config,archive-write,admin
cssArchive.channelConfigurationToCompressionLevels.<ro>=archive-read,archive-write
cssArchive.channelConfigurationToCompressionLevels.<rw>=archive-config,admin
cssArchive.compressionLevelConfiguration.<ro>=archive-read
cssArchive.compressionLevelConfiguration.<rw>=archive-config,archive-write,admin
cssArchive.samples.<ro>=archive-read
cssArchive.samples.<rw>=archive-config,archive-write,admin
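Since the two files have to agree on the principal names, a small consistency check can be useful before restarting the server. The sketch below is a simplified reader for name=value lines (it does not implement the full Java properties syntax and does not reproduce how SimpleAuthority evaluates privileges); the function name and the shortened access list are chosen for illustration only.

```python
def parse_properties(text):
    """Parse simple name=value lines; blank lines and # comments are skipped."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        result[name.strip()] = value.strip()
    return result


passwd = parse_properties("""
admin=superSafePassword
archive-read=somePassword
archive-write=someDifferentPassword
archive-config=anotherPassword
""")

# Shortened excerpt of the access.properties example.
access = parse_properties("""
<modify-keyspaces>=admin
cssArchive.<ro>=archive-read,archive-write,archive-config
cssArchive.samples.<rw>=archive-config,archive-write,admin
""")

# Collect principals that are granted a privilege but have no password entry.
unknown = {
    principal
    for principals in access.values()
    for principal in principals.split(",")
    if principal not in passwd
}
```

An empty unknown set means that every principal named in the access list also has an entry in the password file.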

Starting the Server

The Cassandra server can be started using the script /path/to/cassandra/bin/cassandra. You can use the -f flag to start Cassandra in the foreground (recommended when testing Cassandra for the first time).

Creating the Keyspace for the Cassandra Archiver

In order to use the Cassandra Archiver, you first have to create the keyspace and the column families used by the archiver. You can do this by starting /path/to/cassandra/bin/cassandra-cli -h <hostname or IP address of your Cassandra server>. If you enabled authentication for your Cassandra server, you have to specify additional parameters. Call cassandra-cli --help for getting a list of all supported command-line parameters.

Once you successfully started the Cassandra CLI and it is connected to the Cassandra server, you can execute the following commands to create the keyspace and the column families for the Cassandra Archiver.

CREATE KEYSPACE cssArchive;
USE cssArchive;
CREATE COLUMN FAMILY engineConfiguration;
CREATE COLUMN FAMILY engineConfigurationToGroups;
CREATE COLUMN FAMILY groupConfiguration;
CREATE COLUMN FAMILY groupConfigurationToChannels;
CREATE COLUMN FAMILY channelConfiguration;
CREATE COLUMN FAMILY channelConfigurationToCompressionLevels;
CREATE COLUMN FAMILY compressionLevelConfiguration;
CREATE COLUMN FAMILY samples WITH
  compression_options = {
    sstable_compression: DeflateCompressor,
    chunk_length_kb: 256
  };

Instead of cssArchive you can use a different name for the keyspace. However, you will have to configure the keyspace name for the tools using the Cassandra server, if you do not use the default keyspace name. The column-family names are fixed and cannot be changed.

You can change the chunk_length_kb option for the samples column family. Choosing the right chunk length is a trade-off between the optimal compression ratio and the best performance for random reads. Using a value of 256 kilobytes should be okay for most environments, because on one hand random reads of samples are rare, so there is no significant benefit from using a smaller chunk size. On the other hand, for the kind of data typically stored in the samples column family, increasing the chunk size will not improve the compression ratio significantly.

4.2. Cassandra Archive Engine and Tools

4.2.1. Download and Installation

After downloading the binary distribution from the Cassandra Archiver website you should unpack the archive. The archive contains four directories:

archive-engine
archive-cleanup-tool
archive-config-tool
css-plugins

While the programs in the first three directories can be used as-is, the files in the css-plugins directory have to be copied to the plugins directory of your Control System Studio installation. The plugins have been developed for version 3.0.2 of CSS, so they might not work with other versions.

4.2.2. Configuration

If the Cassandra server is not running on the same host as the archive engine, you have configured Cassandra to listen on a different port than the default port, or you enabled authentication, you have to create a plug-in customization file.

While the archive config-tool and the archive cleanup-tool can also be configured using command-line parameters, the use of a plug-in customization file is mandatory for the archive engine. For Control System Studio, no plug-in customization file is needed, because all options can be set in the archive URL.

The plug-in customization file is usually called plugin_customization.ini and placed in the root directory of the software it is used for. Here is an example of a plug-in customization file specifying the relevant options for the Cassandra Archiver:

; Comma-Separated List of Cassandra Servers.
; You can specify only one server, but if you have a cluster
; with several nodes, you want to list more here for fail-over.
com.aquenos.csstudio.archive.cassandra/hosts=first-host.example.com,second-host.example.com

; Thrift Port for the Cassandra Server(s).
com.aquenos.csstudio.archive.cassandra/port=9160

; Cassandra Keyspace Name.
com.aquenos.csstudio.archive.cassandra/keyspace=cssArchive

; Cassandra Username
com.aquenos.csstudio.archive.cassandra/username=myCassandraWriteUser

; Cassandra Password
com.aquenos.csstudio.archive.cassandra/password=myPassword

The hosts property has to be specified if the Cassandra server is not running on the same host as the archive engine or if you use a multi-node Cassandra cluster.

The port property has to be specified if you do not use the default Thrift port.

The keyspace property has to be specified if you are using a different keyspace name than cssArchive.

The username and password properties have to be specified if you enabled authentication for the Cassandra server.

In order to tell a program to use the plugin_customization.ini you can use the command-line parameter -pluginCustomization plugin_customization.ini.

4.2.3. Starting the Archive Engine

	Note
	A configuration has to be loaded into the database before the archive engine can be started. Refer to Section 5.3, “Loading the Configuration” for details about how to load a configuration.

The archive engine can be started by changing to the directory where it is installed (usually archive-engine) and executing ArchiveEngine.sh. You will have to specify a few parameters, e.g. ./ArchiveEngine.sh -engine MyEngineName -data workspace. Two instances of the archive engine cannot share the same engine name or workspace, so make sure the parameter values are unique within your cluster.

Call ./ArchiveEngine.sh -help for a full list of available command-line options. If the Cassandra database is not running on the same host as the archive engine, you are using a non-default keyspace name, or you enabled authentication, you can specify a plug-in customization file using the -pluginCustomization parameter. See Section 4.2.2, “Configuration” for details on how to define plugin customization options.

4.3. Cassandra Archive Reader for Control System Studio

4.3.1. Download and Installation

Follow the instructions in Section 4.2.1, “Download and Installation” for installing the plugins needed to integrate the Cassandra Archive Reader into the data browser in Control System Studio.

4.3.2. Configuration

The Cassandra Archive Reader is configured the same way as the other archive readers in Control System Studio:

In Control System Studio, go to CSS → Preferences.... This will open the preferences window. In the tree to the left select CSS Applications → Trends → Data Browser. Now you can add the URL of the Cassandra database to the list Archive Data Server URLs.

The URLs supported by the Cassandra Archive Reader have the format cassandra://<hosts>:<port>/<keyspace>?username=<username>&password=<password>.

In EBNF the syntax is:

[1]	url	`::=`	"cassandra://", host, { ",", host }, [":", port ], keyspace, [ "?", "username", "=", username, "&", "password", "=", password ] ;
[2]	keyspace	`::=`	path-absolute ;	/* The keyspace must not only be a valid path according to the URL specifications but (after URL decoding) also be a valid Cassandra keyspace name. */
[3]	username	`::=`	parameter value ;
[4]	password	`::=`	parameter value ;
[5]	parameter value	`::=`	{ pchar } ;

The symbols used but not defined here are defined in RFC 3986.

For a multi-node Cassandra setup, the list of hosts should include all hosts which export the service via Thrift. In this case the client can try all available hosts and continue operation if some of the hosts are down. The port specified here must be the same as the Thrift port specified in the Cassandra configuration (see the section called “Configuring the Network Interface”). This port must be the same for all nodes in the cluster.
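The URL format can be assembled programmatically, for example when generating Control System Studio preferences for several hosts. The following sketch is an illustration only: the function name is hypothetical, and percent-encoding the username and password is an assumption about how the reader decodes them, which this manual states only for the keyspace.

```python
from urllib.parse import quote


def build_archive_url(hosts, keyspace, port=None, username=None, password=None):
    """Assemble a cassandra:// archive URL from its parts."""
    url = "cassandra://" + ",".join(hosts)
    if port is not None:
        url += ":" + str(port)
    url += "/" + quote(keyspace, safe="")
    if username is not None and password is not None:
        # Percent-encode characters that would be ambiguous in a query value.
        url += "?username=" + quote(username, safe="")
        url += "&password=" + quote(password, safe="")
    return url


url = build_archive_url(
    ["first-host.example.com", "second-host.example.com"],
    "cssArchive",
    port=9160,
    username="archive-read",
    password="somePassword",
)
```

For a single local server without authentication, build_archive_url(["localhost"], "cssArchive") yields the short form without port and query part.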

Chapter 5. Configuration

Table of Contents

5.1. Configuration Format
5.2. Compression Levels
5.3. Loading the Configuration
5.4. Cleaning up the Database

5.1. Configuration Format

The configuration format used by the Cassandra Archiver is basically the same as the one used by the RDB Archiver. However, the syntax is extended by a new tag used to configure compression levels.

For explaining the syntax of the configuration file, we use a simple example:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<engineconfig>
  <group>
    <name>firstGroup</name>
    <channel>
      <name>firstChannel</name>
      <period>0.5</period>
      <monitor/>
      <compression-level name="raw" retention-period="86400"/>
      <compression-level name="30s" compression-period="30"/>
      <compression-level name="5m" compression-period="300"/>
    </channel>
    <channel>
      <enable/>      
      <name>secondChannel</name>
      <period>1</period>
      <scan/>
    </channel>
  </group>
  <group>
    <name>anotherGroup</name>
    
    <channel>
      <name>someOtherChannel</name>
      <period>10</period>
      <scan/>
      <compression-level name="30s" compression-period="30"/>
      <compression-level name="5m" compression-period="300"/>
    </channel>
  </group>
</engineconfig>

Every engine configuration is enclosed by the engineconfig tag. Within the engineconfig there must be at least one group. Each group must have a name. The group name must be unique within the engine configuration.

Within a group, there can be an arbitrary number of channel tags. Each channel must specify a name. The channel name must be unique across all engine configurations.

A channel must also specify a period and either the scan or monitor mode.

In scan mode, the period specifies the interval (as a floating point number in seconds) between the snapshots taken from the channel. If the channel has not changed since the last snapshot the new snapshot is discarded.

In monitor mode, every change received for the channel is saved. In this case, period specifies the expected change rate. This is used to allocate the queue, which stores new samples, before they are written to the database. If the specified period is too long and the actual change rate is higher, samples might be lost, because the queue fills up. If the specified period is much shorter than the actual change period, more memory than needed is allocated for the channel. As computer memory is rather cheap today, you should rather choose this value too small than too big.

The compression-level tag is optional and its meaning is discussed in the next section.
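The structural rules above (at least one group, unique group names, a name, a period and exactly one sampling mode per channel) can be checked before importing a file. The sketch below uses Python's standard XML parser; the function name and messages are hypothetical, and it does not replace the validation performed by the archive config tool. Uniqueness of channel names across all engine configurations cannot be checked from a single file and is left out.

```python
import xml.etree.ElementTree as ET

CONFIG = """<engineconfig>
  <group>
    <name>firstGroup</name>
    <channel>
      <name>firstChannel</name>
      <period>0.5</period>
      <monitor/>
    </channel>
  </group>
</engineconfig>"""


def check_config(xml_text):
    """Return a list of problems found in an engine configuration."""
    problems = []
    root = ET.fromstring(xml_text)
    groups = root.findall("group")
    if not groups:
        problems.append("no group defined")
    group_names = [g.findtext("name") for g in groups]
    if len(set(group_names)) != len(group_names):
        problems.append("duplicate group name")
    for channel in root.iter("channel"):
        name = channel.findtext("name")
        if name is None:
            problems.append("channel without name")
        if channel.find("period") is None:
            problems.append(f"{name}: missing period")
        modes = [m for m in ("scan", "monitor") if channel.find(m) is not None]
        if len(modes) != 1:
            problems.append(f"{name}: needs exactly one of scan or monitor")
    return problems


problems = check_config(CONFIG)
```

An empty list means that none of the checked rules is violated.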

5.2. Compression Levels

Unlike the RDB Archiver, the Cassandra Archiver does not perform compression of samples for each read request, but stores the compressed samples instead. This has the advantage that for queries requesting samples for a long period, less data has to be read and thus the query can be answered more quickly.

The compression levels are independently configured for each channel. If no compression levels are configured, only raw samples are saved and they are never deleted. Each compression-level tag must have a name attribute. The compression-level name must be unique within the channel configuration. The special name raw is reserved for the raw samples, which are not calculated but represent the samples received from the channel.

Each compression-level except the raw level must specify a compression-period attribute. This interval (an integer number of seconds) specifies the time between two compressed samples. If two consecutive samples have the same (average) value as well as the same minimum and maximum bounds, the second sample is not saved. All compressed samples are aligned to January 1st, 1970, 00:00:00 UTC. This way, the compressed samples from two different channels that use the same compression period are aligned with respect to each other. The compression-period attribute is not valid for the special raw compression level.
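The effect of the alignment can be sketched in Python as follows. The sketch assumes that "aligned" means that every compressed sample has a timestamp which is a whole multiple of the compression period, counted in seconds since January 1st, 1970, 00:00:00 UTC. The function names are hypothetical, and whether the archiver stamps a compressed sample at the start or at another point of its interval is not stated here.

```python
def align_down(timestamp, compression_period):
    """Return the latest aligned time that is not after timestamp.

    Both arguments are integer seconds; timestamp counts from the Unix epoch.
    """
    return timestamp - timestamp % compression_period


def aligned_timestamps(start, end, compression_period):
    """Return all aligned times t with start <= t < end."""
    # Round start up to the next multiple of the compression period.
    first = -(-start // compression_period) * compression_period
    return list(range(first, end, compression_period))


# Two channels whose raw data starts at different times, both using a
# compression period of 300 seconds, share the same compressed timestamps.
print(aligned_timestamps(1000, 2000, 300))
print(aligned_timestamps(1100, 2000, 300))
```

Both calls return the same list, which is why compressed samples from different channels can be compared directly when they use the same compression period.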

The retention-period attribute is optional for the compression-level tag. If a positive retention period (in integer seconds) is defined, samples that are older than the newest sample minus the specified period are deleted. The retention-period attribute is also valid for the special raw compression level.
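The retention rule can be written down as a short Python sketch. It assumes that samples are (timestamp, value) pairs with integer timestamps in seconds; the function name is hypothetical and the code only models the rule, it does not touch the database.

```python
def apply_retention(samples, retention_period=None):
    """Return the samples that survive the given retention period.

    samples: list of (timestamp_in_seconds, value) pairs.
    retention_period: integer seconds; None or a non-positive value
    means that no samples are deleted.
    """
    if not samples or retention_period is None or retention_period <= 0:
        return list(samples)
    newest = max(timestamp for timestamp, _ in samples)
    cutoff = newest - retention_period
    # Samples older than the cutoff are deleted, all others are kept.
    return [(t, v) for t, v in samples if t >= cutoff]


samples = [(0, 1.0), (100, 1.1), (200, 1.2), (300, 1.3)]
print(apply_retention(samples, retention_period=150))
```

Note that the cutoff is relative to the newest sample and not to the current time, so a channel that stops delivering samples does not lose its most recent data.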

Important

When specifying a retention period for raw samples, you have to make sure that all compressed samples have been calculated before the raw samples are deleted. As compressed samples are always calculated from raw samples, they cannot be calculated if the raw samples are deleted too early. As a rule of thumb, the retention period for the raw samples should be at least twice the largest compression period for the same channel.
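The rule of thumb can be checked before a configuration is imported. The following Python sketch assumes that the compression levels of one channel have already been read into a dictionary; the dictionary layout and the function name are assumptions made for this example, not part of the archiver.

```python
def raw_retention_is_safe(levels):
    """Check the rule of thumb for the raw retention period of one channel.

    levels maps a compression-level name to a dict with the optional keys
    "compression_period" and "retention_period" (integer seconds).
    """
    raw_retention = levels.get("raw", {}).get("retention_period")
    if raw_retention is None or raw_retention <= 0:
        # Without a positive retention period raw samples are never deleted.
        return True
    periods = [
        level["compression_period"]
        for name, level in levels.items()
        if name != "raw" and "compression_period" in level
    ]
    if not periods:
        return True
    return raw_retention >= 2 * max(periods)


levels = {
    "raw": {"retention_period": 86400},
    "30s": {"compression_period": 30},
    "1h": {"compression_period": 3600},
}
print(raw_retention_is_safe(levels))
```

In this example the raw samples are kept for one day, which is well above twice the largest compression period of one hour.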

5.3. Loading the Configuration

For loading or updating an engine configuration, you have to use the archive config tool, which is distributed in the archive-config-tool directory of the binary distribution. For importing an engine configuration file, you can call ./ArchiveConfigTool.sh -engine myEngineName -config myEngineConfig.xml -import. If you want to replace the configuration of an existing engine, you have to add the -replace_engine parameter.

Replacing an engine configuration will first delete the existing configuration and then import the new configuration. Thus, it is equivalent to first using the -delete_config parameter and then importing the new configuration with the -import parameter.

Deleting an engine configuration will never delete the samples associated with the engine's channels. However, if a channel does not exist in the configuration, there is no way to retrieve its samples using the archive reader. Therefore, if you want to be able to retrieve historic data, you should move channels to a disabled group instead of deleting them completely. If you finally want to delete the samples for deleted channels, you have to use the clean-up tool.
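When the import is triggered from a script, the command line described in this section can be assembled programmatically. The following Python sketch only builds the argument list; the function name is hypothetical, and the only parameters it uses are the ones named in this section (-engine, -config, -import and -replace_engine).

```python
def build_import_command(engine_name, config_file, replace=False,
                         tool="./ArchiveConfigTool.sh"):
    """Build the argument list for importing an engine configuration.

    replace=True adds -replace_engine, which is needed if a configuration
    for the engine already exists.
    """
    command = [tool, "-engine", engine_name, "-config", config_file, "-import"]
    if replace:
        command.append("-replace_engine")
    return command


print(build_import_command("myEngineName", "myEngineConfig.xml"))
print(build_import_command("myEngineName", "myEngineConfig.xml", replace=True))
```

The resulting list can be passed to subprocess.run from the Python standard library, with the archive-config-tool directory as the working directory.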

If the default connection parameters (Cassandra host is localhost, port is 9160, keyspace name is cssArchive and no authentication is used) are not correct for your setup, you either have to specify the connection parameters as command-line parameters or you have to specify a plug-in customization file. Call ./ArchiveConfigTool.sh -help for a list of all supported command-line parameters.

5.4. Cleaning up the Database

If you want to delete the samples for non-existing channels or want to clean up small inconsistencies, which can occur if a write operation is interrupted, you can use the clean-up tool.

The clean-up tool is distributed in the archive-cleanup-tool directory of the binary distribution. You can start it by invoking ./ArchiveCleanUpTool.sh. If you are using non-default connection parameters, the same considerations as for the archive config tool apply.

Important

A run of the clean-up tool can take a very long time. During this time, you should not use the archive config tool, because new configurations added by the config tool and the respective samples might be deleted by the clean-up tool. However, the archive engine can run while the clean-up process is running.

Cassandra Archiver Manual

Sebastian Marsching

Chapter 1. Introduction

Chapter 2. Concept

2.1. Apache Cassandra

2.2. Cassandra Archiver for Control System Studio

2.3. Database Structure

2.3.1. Column Family engineConfiguration

2.3.2. Column Family engineConfigurationToGroups

2.3.3. Column Family groupConfiguration

2.3.4. Column Family groupConfigurationToChannels

2.3.5. Column Family channelConfiguration

2.3.6. Column Family channelConfigurationToCompressionLevels

2.3.7. Column Family compressionLevelConfiguration

2.3.8. Column Family samples

Chapter 3. Getting Started

Chapter 4. Installation

4.1. Apache Cassandra

4.1.1. Download and Unpacking

4.1.2. Configuration

Data Paths

Configuring the Partitioner

Configuring the Network Interface

Configuring Authentication Options

Starting the Server

Creating the Keyspace for the Cassandra Archiver

4.2. Cassandra Archive Engine and Tools

4.2.1. Download and Installation

4.2.2. Configuration

4.2.3. Starting the Archive Engine

4.3. Cassandra Archive Reader for Control System Studio

4.3.1. Download and Installation

4.3.2. Configuration

Chapter 5. Configuration

5.1. Configuration Format

5.2. Compression Levels

5.3. Loading the Configuration

5.4. Cleaning up the Database
