Cassandra Archiver Manual

Sebastian Marsching

aquenos GmbH

Table of Contents

1. Introduction
2. News
2.1. What's New in Cassandra Archiver 2.0
2.2. What's New in Cassandra Archiver 2.1
2.3. What's New in Cassandra Archiver 2.2
2.4. What's New in Cassandra Archiver 2.3
3. Concept
3.1. Apache Cassandra
3.2. Cassandra Archiver for Control System Studio
3.3. Database Structure
3.3.1. Column Family engineConfiguration
3.3.2. Column Family engineConfigurationToGroups
3.3.3. Column Family groupConfiguration
3.3.4. Column Family groupConfigurationToChannels
3.3.5. Column Family channelConfiguration
3.3.6. Column Family channelConfigurationToCompressionLevels
3.3.7. Column Families samplesBucketSize and samplesBucketSize_*
3.3.8. Column Families samples and samples_*
4. Getting Started
5. Installation
5.1. Apache Cassandra
5.1.1. Download and Unpacking
5.1.2. Configuration
5.2. Cassandra Archive Engine and Tools
5.2.1. Download and Installation
5.2.2. Configuration
5.2.3. Starting the Archive Engine
5.3. Cassandra Archive Reader for Control System Studio
5.3.1. Download and Installation
5.3.2. Configuration
6. Configuration
6.1. Configuration Format
6.2. Compression Levels
6.3. Loading the Configuration
6.4. Cleaning up the Database

Chapter 1. Introduction

This manual is divided into five chapters (not counting this introduction). The first chapter summarizes the new features of the present version of the Cassandra Archiver and outlines the changes compared to older versions. The second chapter introduces the concepts of Apache Cassandra in general and the Cassandra Archiver in particular. The third chapter describes the few steps needed to set up a basic installation of the Cassandra Archiver. The fourth chapter gives more detailed instructions on how to install the Cassandra Archiver. Finally, the fifth chapter explains how to configure the archiver.

Chapter 2. News

This chapter presents the changes in newer versions of the Cassandra Archiver in comparison to previous versions. If you are new to the Cassandra Archiver, you can safely skip this chapter and proceed with the next chapter.

2.1. What's New in Cassandra Archiver 2.0

Version 2.0 of the Cassandra Archiver changes the structure of the database used to store the samples. The new structure brings significant improvements for the performance of Cassandra, in particular for large databases.

Existing databases cannot be converted to the new format. Instead, the new database has to be set up first and then the data needs to be transferred manually.

Other changes mainly affect the internal code structure which has been tidied up. In particular, major changes have been made to the sample compressor, which also profits from the new database structure.

Finally, an option for changing the consistency levels used for reads and writes has been added. Expert users can use this setting to tune the trade-off between availability and consistency for read and write operations.

2.2. What's New in Cassandra Archiver 2.1

Version 2.1 of the Cassandra Archiver uses Apache Cassandra 1.2 instead of version 1.1 that was used for earlier releases. This way users of the Cassandra Archiver can benefit from the many useful features that are new in Cassandra 1.2 (e.g. virtual nodes).

Another change affects the format that channel values are stored in. This format has been optimized, so that integers typically need considerably less space than before. Cassandra Archiver 2.1 can read values in the old format, but values written by Cassandra Archiver 2.1 cannot be read by previous versions. This change does not affect the database structure, thus no special actions have to be taken when upgrading.

2.3. What's New in Cassandra Archiver 2.2

Version 2.2 of the Cassandra Archiver brings an improved version of the sample compressor algorithm. Instead of polling the database for new samples, new samples are now sent to the compressor thread using a queue, thus significantly reducing the number of read requests needed. The retention mechanism has also been optimized, so that samples are now deleted about once every four hours instead of constantly. These retention runs and compression runs for channels without new samples are spread out over a period of about four hours, thus avoiding load peaks. In summary, these changes reduce the system load for compressed samples significantly.

The database structure and format have not changed compared to version 2.1, thus older versions of the reader can be used in combination with version 2.2 of the archive engine.

2.4. What's New in Cassandra Archiver 2.3

Version 2.3 of the Cassandra Archiver improves the way the retention period is applied. In earlier versions, individual samples would be deleted when they expired. This caused many tombstones to accumulate in the Cassandra database, affecting performance and possibly even making queries for the corresponding channel impossible due to limitations in Apache Cassandra.

The new algorithm only deletes complete “buckets” of samples, thus significantly reducing the number of tombstones created. In addition to that, the way the archive reader reads samples has been improved slightly, so that it is less prone to be affected by tombstones due to the deletion of old samples.

The database structure and format have not changed compared to versions 2.1 and 2.2, thus older versions of the reader can be used in combination with version 2.3 of the archive engine.

Chapter 3. Concept

This chapter introduces the concepts behind the Cassandra Archiver and Apache Cassandra. First, column-oriented database systems and Apache Cassandra are presented briefly. Subsequently, the Cassandra Archiver for Control System Studio is introduced. Finally, the structure of the keyspace storing data for the Cassandra Archiver is explained. You might want to skip this last section if you are reading this manual for the first time and are just interested in getting started with the Cassandra Archiver.

3.1. Apache Cassandra

Apache Cassandra is a column-oriented database management system (CDBMS), which is optimized for storing large amounts (tera- or even petabytes) of data grouped in column families. It is a special form of a key-value store. Unlike a relational database management system (RDBMS), it is not optimized for storing relational data or modifying data in a transactional way. The main advantages of a CDBMS compared to an RDBMS are superior read-write performance, linear scalability and high availability at low operation costs.

In a CDBMS, data is stored in column families. Each column family contains an arbitrary number of rows, which (for a multi-node setup) are distributed over all cluster nodes. Each row is identified by a unique row key. Each row contains one or more columns. Each column is identified by a column name that must be unique within the respective row. Each column can, but does not have to, store a value. Row keys, column names and column values are stored as arrays of bytes. The meaning of the bytes depends on the application accessing the database. Therefore, data types are completely transparent to the CDBMS.
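As a rough illustration of this data model, a column family can be pictured as a map from row keys to maps of column names to values, all of them byte arrays. The following Python sketch is only a mental model and not how Cassandra is implemented; the class name, the method names and the example URL are hypothetical.

```python
# Illustrative in-memory model of a column family; this is not Cassandra code.
class ColumnFamily:
    def __init__(self):
        # row key (bytes) -> {column name (bytes) -> column value (bytes)}
        self._rows = {}

    def insert(self, row_key: bytes, column_name: bytes, value: bytes = b"") -> None:
        # A column may carry an empty value; its name alone can be the information.
        self._rows.setdefault(row_key, {})[column_name] = value

    def get_slice(self, row_key: bytes, start: bytes, end: bytes):
        # Return the columns of one row whose names lie in [start, end],
        # ordered by name (here by plain byte-wise comparison).
        row = self._rows.get(row_key, {})
        return [(name, row[name]) for name in sorted(row) if start <= name <= end]


cf = ColumnFamily()
cf.insert(b"engine1", b"url", b"http://localhost:4812/")  # example value only
cf.insert(b"engine1", b"description", b"Test engine")
columns = cf.get_slice(b"engine1", b"a", b"z")
```

Reading a range of columns from a single row, as get_slice does here, is the access pattern the Cassandra Archiver relies on when it stores samples under time-stamp column names.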

In a multi-node setup, data is distributed across the nodes, so that the amount of data stored is not limited by the disk space of a single computer. Typically, low-price servers which are not fault-tolerant are used and the same data is stored on multiple nodes (typically three). The database clients and servers have built-in facilities that automatically switch to a different node if the first node fails. Therefore, the database cluster is fault-tolerant and highly available, although cheap, unreliable computers are used.

This document does not provide a detailed introduction into column-oriented database management systems or Apache Cassandra. For understanding the concepts of a CDBMS, the original paper about Google Bigtable by Chang et al. is a good starting point.

If you want to set up a cluster of Cassandra servers or are interested in advanced configuration options and performance tuning, you should read the Apache Cassandra Documentation provided by DataStax. However, this manual describes the basic steps needed to set up a single-node Cassandra cluster for getting started with the Cassandra Archiver.

3.2. Cassandra Archiver for Control System Studio

The Cassandra Archiver for Control System Studio is a set of plugins that extend the existing archive reader and writer architecture so that a database hosted by Apache Cassandra can be used instead of a traditional RDBMS like MySQL or Oracle.

By using a column-oriented database management system, huge amounts of channel samples can be archived. The Cassandra Archiver uses two column families for storing all samples of all channels for a certain compression level. Several samples of a channel are aggregated in a so-called bucket and each bucket is stored in one row. Each row storing a bucket is identified by a key that aggregates the channel name, the size of the bucket and the start time-stamp of the bucket (both in nanoseconds). Each column of a bucket row stores one sample, using the column's name for the sample's time-stamp and the column's value for the sample's value and meta-data (e.g. alarm severity). As data stored in column families is compressed by default before being written to disk, the space requirements of the database are reduced. Due to the way Cassandra stores data, the read and write performance of the database is not reduced by using compression. In fact, using compression can even slightly increase the data throughput.

The Cassandra Archiver can be regarded as a hybrid between the RDB Archiver and the Channel Archiver. Like the RDB Archiver, the Cassandra Archiver uses an existing, well-tested database management system for storing data. However, like the Channel Archiver, the Cassandra Archiver uses a storage format that is more optimized for storing channel samples and can provide high write and read rates.

The HyperArchiver uses a similar concept as the Channel Archiver. However, it uses Hypertable to store the samples and MySQL to store the configuration, while the Cassandra Archiver stores the configuration and the samples in the same database, simplifying installation and maintenance. For a HyperArchiver setup where the Hypertable server is not running on the same node as the archive engine, the source code of the HyperArchiver has to be modified, because important configuration values are hard-coded. Unlike Apache Cassandra, which does not have a single point of failure, Hypertable has a master server, which, when down, causes the whole cluster to fail. Besides, Cassandra is implemented in pure Java and is thus 100 percent platform independent, while Hypertable needs to be compiled for each supported platform. In summary, the Cassandra Archiver is easier to set up and maintain and more reliable than the HyperArchiver, making it the better choice for most scenarios.

3.3. Database Structure

This section explains the various column families which are used to store the configuration and samples. If you are not interested in the details, you can simply skip this section and read on at the next chapter. The information in this section is not needed for setting up the Cassandra Archiver.

For row keys and column names that consist of several parts, a composite type is used.

3.3.1. Column Family engineConfiguration

The engineConfiguration column family stores information about archive engines. The engine name, which must be unique, is used as the row key. Each row has columns with the names url and description storing the URL and the description of the respective archive engine.

3.3.2. Column Family engineConfigurationToGroups

The engineConfigurationToGroups column family maps engines to their respective archive groups. The engine name is used as the row key. A column exists for each group in the archive engine, using the name of the group as the column name.

3.3.3. Column Family groupConfiguration

The groupConfiguration column family stores the configuration for each group. The row key is a combination of the engine name and the group name. The column enablingChannel stores the name of the channel that enables or disables the group.

3.3.4. Column Family groupConfigurationToChannels

The groupConfigurationToChannels column family maps archive groups to the channels they contain. The row key is the same as used for the groupConfiguration column family. A column exists for each channel in the archive group, using the name of the channel as the column name.
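The two mapping column families described so far can be followed from an engine to its channels. The Python sketch below models them as plain dictionaries to show the lookup path; it does not talk to Cassandra, the function name is hypothetical, and the engine, group and channel names are merely the example names used elsewhere in this manual.

```python
# engineConfigurationToGroups: row key = engine name, one column per group.
engine_to_groups = {
    "MyEngineName": ["firstGroup", "anotherGroup"],
}

# groupConfigurationToChannels: row key = (engine name, group name),
# one column per channel.
group_to_channels = {
    ("MyEngineName", "firstGroup"): ["firstChannel", "secondChannel"],
    ("MyEngineName", "anotherGroup"): ["someOtherChannel"],
}


def channels_of_engine(engine):
    """Collect the channels of all groups that belong to the given engine."""
    channels = []
    for group in engine_to_groups.get(engine, []):
        channels.extend(group_to_channels.get((engine, group), []))
    return channels


all_channels = channels_of_engine("MyEngineName")
```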

3.3.5. Column Family channelConfiguration

The channelConfiguration column family stores information about channels. The channel name, which must be unique, is used as the row key. Each row has columns with the names engine, group, sampleMode, samplePeriod, sampleDelta and lastSampleTime, storing the engine and group that the channel is associated with, the sampling options and the time of the last raw sample that has been written for the channel.

3.3.6. Column Family channelConfigurationToCompressionLevels

The channelConfigurationToCompressionLevels column family maps channels to their respective compression levels. The channel name is used as the row key. A column exists for each compression-level that is configured for the respective channel. Each compression level is stored as a column, where the column name is the compression period and the column value is the retention period. The special compression level that stores raw samples always exists, even if there is no column with a column name of zero (this is the compression period internally assigned to the raw compression-level).
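The implicit raw compression level can be handled as in the following Python sketch. It assumes that the columns of a row have already been read into a dictionary; the function names are hypothetical, and using None for "no retention period configured" is an assumption made for this example only. The naming of the sample column families follows the scheme described in the sections on the samples column families.

```python
RAW_COMPRESSION_PERIOD = 0  # period internally assigned to the raw level


def compression_levels(columns):
    """columns: {compression period in s: retention period in s} read from the row.

    Returns the same mapping, but always including the raw level.
    """
    levels = dict(columns)
    # The raw level exists even without a column of its own.
    # ASSUMPTION: None stands for "no retention period configured".
    levels.setdefault(RAW_COMPRESSION_PERIOD, None)
    return levels


def samples_column_family(compression_period):
    """Name of the column family holding the samples of one compression level."""
    if compression_period == RAW_COMPRESSION_PERIOD:
        return "samples"
    return "samples_%d" % compression_period


levels = compression_levels({30: 86400, 300: 604800})
```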

3.3.7. Column Families samplesBucketSize and samplesBucketSize_*

The samplesBucketSize column family stores the bucket size of the sample buckets for all channels in the raw compression-level. The row key is the channel name. Each column stores a bucket size, using the time the bucket size started to be used as the column name and the bucket size as the column value.

The same bucket size is usually used for many buckets in a row and the time-stamp used for a bucket size is not aligned with the time-stamp of one of these buckets. The time-stamp for the bucket size just means that samples with a time-stamp greater than or equal to the time-stamp for this bucket size are stored in a bucket of this size, unless there is a bucket size with a greater time-stamp that is still less than or equal to the sample's time-stamp.
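The lookup rule described above amounts to finding the last bucket-size entry whose time-stamp is less than or equal to the sample's time-stamp. A minimal Python sketch, assuming the columns of a samplesBucketSize row have been read into a list sorted by time-stamp (the function name and the numbers are hypothetical):

```python
import bisect


def bucket_size_for(sample_time_ns, size_changes):
    """Return the bucket size that applies to a sample.

    size_changes: list of (start_time_ns, bucket_size_ns), sorted by start time.
    """
    starts = [start for start, _ in size_changes]
    # Index of the last entry whose start time is <= the sample's time-stamp.
    index = bisect.bisect_right(starts, sample_time_ns) - 1
    if index < 0:
        return None  # no bucket size was in effect at that time
    return size_changes[index][1]


# A bucket size of 1000 ns applies from time 0, one of 2000 ns from time 5000.
changes = [(0, 1000), (5000, 2000)]
```

With these example values, a sample at time 4999 belongs to a bucket of size 1000, while a sample at time 5000 already belongs to a bucket of size 2000.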

This database structure is used because Cassandra does not perform well if many “skinny” rows are stored. Therefore, several samples have to be aggregated in one row. On the other hand, a maximum number of 2^31-1 columns can be stored in a row, which is clearly not enough for channels that change at a very high rate or are intended to be stored for a long period of time. In fact, the practical limit for the number of columns in a row is even lower than that. Typically, a good number of columns for one row is on the order of a few million.

The Cassandra Archiver tries to reach this number by determining the bucket size (the period of time that is stored in one bucket) by dividing one million by the expected update rate of a channel (that is, by multiplying the scan period by one million). The only drawback of this database structure is that there is no way to tell what the time-stamp of the newest sample is. Therefore, the Cassandra Archiver only accepts samples that have a time-stamp which is a maximum of two hours ahead in time and only looks for samples with a time-stamp that is a maximum of four hours ahead when searching for samples. The extra two hours allow for a certain clock skew between participating systems.
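The two rules from this paragraph can be sketched as follows. The sketch assumes a target of one million samples per bucket and an update rate given in samples per second, so that the bucket size is one million divided by that rate; the function and constant names are hypothetical and the actual implementation may round or limit the values differently.

```python
NANOS_PER_SECOND = 1_000_000_000
TARGET_SAMPLES_PER_BUCKET = 1_000_000              # figure named in the text
MAX_WRITE_AHEAD_NS = 2 * 3600 * NANOS_PER_SECOND   # two hours
MAX_READ_AHEAD_NS = 4 * 3600 * NANOS_PER_SECOND    # four hours


def bucket_size_ns(update_rate_hz):
    """Time span in which about TARGET_SAMPLES_PER_BUCKET samples are expected."""
    return int(TARGET_SAMPLES_PER_BUCKET / update_rate_hz * NANOS_PER_SECOND)


def accepts(sample_time_ns, now_ns):
    """Samples that lie more than two hours in the future are rejected."""
    return sample_time_ns <= now_ns + MAX_WRITE_AHEAD_NS


def search_limit(now_ns):
    """Latest time-stamp that is considered when searching for samples."""
    return now_ns + MAX_READ_AHEAD_NS
```

For a channel that updates once per second, this yields a bucket spanning one million seconds, which is about eleven and a half days.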

For each compression level, a column family with the name samplesBucketSize_*, where the asterisk is replaced by the compression period in seconds, is created. The structure of this column family is exactly the same as that of the samplesBucketSize column family.

3.3.8. Column Families samples and samples_*

The samples column family stores the actual raw samples for all channels. The row key is a combination of the channel name, the bucket's length and the bucket's start time-stamp. Each column stores a sample, using the sample's time-stamp as the column name and the sample's value and meta-data as the column value. The column value is a single blob aggregating all data of a sample except the time-stamp.
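To make the row layout concrete, the following Python sketch derives the row key for a given sample. Note that the way the bucket start is computed here (aligning it to a multiple of the bucket size) is an assumption made for this example and is not specified in this manual; the function name and the representation of the composite key as a tuple are hypothetical as well.

```python
def bucket_row_key(channel, sample_time_ns, bucket_size_ns):
    """Row key (channel name, bucket length, bucket start) for one sample."""
    # ASSUMPTION: buckets start at multiples of their size.
    bucket_start_ns = sample_time_ns - (sample_time_ns % bucket_size_ns)
    return (channel, bucket_size_ns, bucket_start_ns)


def sample_column(sample_time_ns, value_blob):
    """Column for one sample: name = time-stamp, value = opaque blob."""
    return (sample_time_ns, value_blob)


key = bucket_row_key("firstChannel", 2500, 1000)
column = sample_column(2500, b"\x00\x01")  # the blob content is a placeholder
```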

For each compression level, a column family with the name samples_*, where the asterisk is replaced by the compression period in seconds, is created. The structure of this column family is exactly the same as that of the samples column family.

Chapter 4. Getting Started

For setting up a simple test environment for the Cassandra Archiver, four steps are needed. First, Apache Cassandra has to be installed. Second, the Cassandra Archiver Engine and the accompanying tools have to be installed. Third, the keyspace used by the Cassandra Archiver has to be set up and an initial archiver engine configuration has to be imported. Finally, the Cassandra Archiver Reader has to be installed in Control System Studio.

All the steps needed to install and configure the Cassandra Archiver are described in Chapter 5, Installation and Chapter 6, Configuration. If you are using a simple setup, where the Archive Engine, the Apache Cassandra Server and Control System Studio are all running on the same host, you can simply skip the sections marked as optional in these two chapters.

Chapter 5. Installation

5.1. Apache Cassandra

This section describes the steps needed for setting up the Apache Cassandra server for use with the Cassandra Archiver.

Important

Earlier versions of the Cassandra Archiver (before version 2.0.0) used a different database structure which required an order-preserving partitioner to be used. Newer versions of the Cassandra Archiver (starting with version 2.0.0) do not require this any longer and in fact should not be installed on a cluster with an order-preserving partitioner, because this will cause hot-spots.

5.1.1. Download and Unpacking

You can download Apache Cassandra from the project's website or use one of the builds provided by DataStax. You should choose the newest version of the binary download from the 1.2 branch. Apache Cassandra is implemented in Java, so that the binary download is the same for all platforms. You need a Java Runtime Environment version 6 or higher in order to run Cassandra.

For the rest of this document, we assume that Cassandra is installed in /path/to/cassandra. The actual location depends on which of the provided binary packages you use.

5.1.2. Configuration

Apache Cassandra stores its configuration in /path/to/cassandra/conf. For a simple, single-node configuration, there are two relevant files: cassandra.yaml and log4j-server.properties.

Data Paths

Before starting Cassandra, you either have to change the paths where Cassandra stores its data, or you have to create the directories used by default and make sure that the user running Cassandra can write to these directories.

There are four directories Cassandra uses to store data. The first three are configured in cassandra.yaml. The option data_file_directories is set to /var/lib/cassandra/data by default and defines where the actual data from the various column families is saved. The option saved_caches_directory defaults to /var/lib/cassandra/saved_caches and is used to store cached data. The third option is the commitlog_directory, which defaults to /var/lib/cassandra/commitlog. This directory is used for storing the write-ahead log. If you aim for maximum performance, you might want to consider storing the commit-log on a different disk than the data directories. For simple test setups however, storing the commit-log on the same disk is fine.

The last directory is configured in log4j-server.properties and is used to store the server log. The option log4j.appender.R.File defaults to /var/log/cassandra/system.log. In contrast to the other options, this option specifies the file and not a directory.
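Before the first start, it can be helpful to check that all four locations exist and are writable. The following Python sketch performs such a check for the default paths listed above; the function name is hypothetical, and the script has to be run as the user that will run Cassandra for the result to be meaningful.

```python
import os

# Default locations named above; adjust if you changed the configuration.
DEFAULT_DIRECTORIES = [
    "/var/lib/cassandra/data",
    "/var/lib/cassandra/saved_caches",
    "/var/lib/cassandra/commitlog",
    "/var/log/cassandra",  # directory that contains system.log
]


def unwritable(directories):
    """Return the directories that are missing or not writable by this user."""
    return [d for d in directories
            if not (os.path.isdir(d) and os.access(d, os.W_OK))]


if __name__ == "__main__":
    for directory in unwritable(DEFAULT_DIRECTORIES):
        print("missing or not writable:", directory)
```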

Configuring the Network Interface

Note

If the Cassandra server, the Cassandra Archiver Engine and the Control System Studio client are all running on the same machine, you can skip this step.

There are five configuration options regarding the network interface used by the Cassandra server. The first three options (storage_port, ssl_storage_port and listen_address) are only relevant for a multi-node Cassandra cluster and thus outside the scope of this manual.

The other two options (rpc_address and rpc_port) are relevant if you want to run Control System Studio or the archive engine on different machines than the Cassandra server. By default, rpc_address is configured to only listen on the loopback interface. You should change this to the IP address of the network interface your machine uses to connect to the rest of the network. If you are sure that your hostname and IP address configuration is correct (in particular, that /etc/hosts and /etc/hostname are configured correctly), you can also set a blank value to make Cassandra determine the right IP address by itself.

The rpc_port option only needs to be changed if you run two or more Cassandra servers on the same host or if a different service uses the same port. By default, TCP port 9160 is used for the Thrift service. If you change this port number, you also have to adjust the setting in the archive engine and archive reader configurations.
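After changing rpc_address or rpc_port, you may want to verify from the machine running the archive engine or Control System Studio that the Thrift port can be reached. The following Python sketch only tests whether a TCP connection can be established; it does not speak the Thrift protocol, and the function name is hypothetical.

```python
import socket


def thrift_port_reachable(host, port=9160, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unknown host name, ...
        return False


if __name__ == "__main__":
    # Replace the host name with the address of your Cassandra server.
    print(thrift_port_reachable("localhost"))
```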

Configuring Authentication Options

Note

Configuring the authentication options is completely optional. By default, Cassandra grants full write-access to all connections without any authentication. If you use Cassandra in a production environment, however, you might want to enable authentication for better security.

Cassandra's security system is divided into two components: authentication and authorization. Authentication is the task of checking credentials provided by a client and assigning a principal. Authorization is the task of checking whether a specific principal may perform a certain operation.

By default Cassandra is distributed with an authenticator which accepts any credentials and an authority which grants any permission to any principal.

If you want to enable authentication and authorization, there is a built-in mechanism (starting with Cassandra 1.2) that stores the required data in the Cassandra database. Please refer to the Cassandra manual for configuring authentication and authorization.

You will usually want to configure two users for the Cassandra Archiver: one with read-access, used by the archive reader, and one with write-access, used by the archive engine, configuration tool and clean-up tool.

You can create the two users using the following commands in cqlsh:

CREATE USER archive_write WITH PASSWORD 'somePassword' NOSUPERUSER;
CREATE USER archive_read WITH PASSWORD 'anotherPassword' NOSUPERUSER;

You can grant the privileges using the following commands:

GRANT ALTER PERMISSION ON KEYSPACE "cssArchive" TO archive_write;
GRANT CREATE PERMISSION ON KEYSPACE "cssArchive" TO archive_write;
GRANT DROP PERMISSION ON KEYSPACE "cssArchive" TO archive_write;
GRANT MODIFY PERMISSION ON KEYSPACE "cssArchive" TO archive_write;
GRANT SELECT PERMISSION ON KEYSPACE "cssArchive" TO archive_write;
GRANT SELECT PERMISSION ON KEYSPACE "cssArchive" TO archive_read;

Starting the Server

The Cassandra server can be started using the script /path/to/cassandra/bin/cassandra. You can use the -f flag to start Cassandra in the foreground (recommended when testing Cassandra for the first time).

Creating the Keyspace for the Cassandra Archiver

In order to use the Cassandra Archiver, you first have to create the keyspace and the column families used by the archiver. You can do this by starting /path/to/cassandra/bin/cassandra-cli -h <hostname or IP address of your Cassandra server>. If you enabled authentication for your Cassandra server, you have to specify additional parameters. Call cassandra-cli --help to get a list of all supported command-line parameters.

Once you successfully started the Cassandra CLI and it is connected to the Cassandra server, you can execute the following command to create the keyspace for the Cassandra Archiver.

CREATE KEYSPACE cssArchive;

For compatibility with older versions of the Cassandra Archiver, the default keyspace name used is cssArchive. However, the use of mixed-case keyspace names is discouraged in recent Apache Cassandra versions and can cause problems when using CQL to access the keyspace. Therefore, you should prefer the keyspace name css_archive when deploying a new cluster. If you do not use the keyspace name cssArchive, you will have to configure the keyspace name for the tools using the Cassandra server. See Section 5.2.2, “Configuration” for instructions on how to configure the keyspace name by creating a plugin_customization.ini for the archive engine, archive clean-up tool and archive configuration tool.

5.2. Cassandra Archive Engine and Tools

5.2.1. Download and Installation

After downloading the binary distribution from the Cassandra Archiver website you should unpack the archive. The archive contains four directories:

  • archive-engine

  • archive-cleanup-tool

  • archive-config-tool

  • css-plugins

While the programs in the first three directories can be used as-is, the files in the css-plugins directory have to be copied to the plugins directory of your Control System Studio installation. The plugins have been developed for version 3.1 of CSS, so they might not work with other versions.

In order to keep your CSS installation small you might want to consider using the JSON Archive Proxy instead of installing the Cassandra Archiver plugins in CSS directly. The JSON Archive Proxy only needs two plugins (instead of about twenty for the Cassandra Archiver) and can help you decouple the actual store type and version used from your CSS installation.

If you want to use the Cassandra Archiver with version 3.2 of CSS, you can use version 1.x of the JSON Archive Proxy Server together with the reader plugins of the Cassandra archiver. You can then install version 2.x of the JSON Archive Proxy reader plugins into CSS 3.2 and connect them to the version 1.x server.

5.2.2. Configuration

If the Cassandra server is not running on the same host as the archive engine, you have configured Cassandra to listen on a different port than the default port, or you enabled authentication, you have to create a plug-in customization file.

While the archive config-tool and the archive cleanup-tool can also be configured using command-line parameters, the use of a plug-in customization file is mandatory for the archive engine. For Control System Studio, no plug-in customization file is needed, because all options can be set in the archive URL.

The plug-in customization file is usually called plugin_customization.ini and placed in the root directory of the software it is used for. Here is an example of a plug-in customization file specifying the relevant options for the Cassandra Archiver:

; Comma-Separated List of Cassandra Servers.
; You can specify only one server, but if you have a cluster
; with several nodes, you want to list more here for fail-over.
com.aquenos.csstudio.archive.cassandra/hosts=first-host.example.com,second-host.example.com

; Thrift Port for the Cassandra Server(s).
com.aquenos.csstudio.archive.cassandra/port=9160

; Cassandra Keyspace Name.
com.aquenos.csstudio.archive.cassandra/keyspace=cssArchive

; Cassandra Username
com.aquenos.csstudio.archive.cassandra/username=myCassandraWriteUser

; Cassandra Password
com.aquenos.csstudio.archive.cassandra/password=myPassword

; Number of Compressor Worker Threads
com.aquenos.csstudio.archive.writer.cassandra/numCompressorWorkers=1

; Consistency Levels
;com.aquenos.csstudio.archive.cassandra/readDataConsistencyLevel=QUORUM
;com.aquenos.csstudio.archive.cassandra/writeDataConsistencyLevel=QUORUM
;com.aquenos.csstudio.archive.cassandra/readMetaDataConsistencyLevel=QUORUM
;com.aquenos.csstudio.archive.cassandra/writeMetaDataConsistencyLevel=QUORUM

The hosts property has to be specified if the Cassandra server is not running on the same host as the archive engine or if you use a multi-node Cassandra cluster.

The port property has to be specified if you do not use the default Thrift port.

The keyspace property has to be specified if you are using a different keyspace name than cssArchive.

The username and password properties have to be specified if you enabled authentication for the Cassandra server.

The numCompressorWorkers property (note the different bundle name) specifies how many threads run in parallel to perform the sample compression and deletion (see Section 6.2, “Compression Levels”). The default setting is 1. This number can be increased if the compression process does not catch up with the generation of new data (usually because the same archive engine is handling a lot of channels). If this number is set to zero, the compression process is disabled. This means that no data for compression levels is generated and old samples are not deleted. This option was introduced in version 1.2.0. In earlier versions, there is always exactly one compressor thread.

The readDataConsistencyLevel, writeDataConsistencyLevel, readMetaDataConsistencyLevel and writeMetaDataConsistencyLevel properties specify the consistency levels being used for reading data (samples), writing data, reading meta-data (sample bucket sizes and configuration information) and writing meta-data respectively. For most scenarios (in particular for a single-node setup) there is no need to change these parameters. If you change them, you should be very careful to apply the changes to all programs (the archive engine, the archive configuration tool and the archive clean-up tool) at the same time. You should also make sure that, for each category of data, the number of replicas that have to respond to a read plus the number of replicas that have to acknowledge a write is always greater than the total number of replicas (the replication factor). If the sum is less than or equal to the number of replicas, reads may return inconsistent data.
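The rule about the number of replicas can be checked with a few lines of Python. This is a sketch under the assumption that only the consistency levels ONE, TWO, THREE, QUORUM and ALL are used and that QUORUM requires a majority of the replicas; the function names are hypothetical.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that have to respond for a given consistency level."""
    counts = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": replication_factor // 2 + 1,  # majority of the replicas
        "ALL": replication_factor,
    }
    return counts[level]


def reads_see_latest_write(read_level, write_level, replication_factor):
    """True if every read overlaps with every write in at least one replica."""
    r = replicas_required(read_level, replication_factor)
    w = replicas_required(write_level, replication_factor)
    return r + w > replication_factor
```

For example, with three replicas the default combination of QUORUM for reads and QUORUM for writes satisfies the rule (2 + 2 > 3), while ONE for both reads and writes does not (1 + 1 <= 3).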

In order to tell a program to use the plugin_customization.ini you can use the command-line parameter -pluginCustomization plugin_customization.ini.

5.2.3. Starting the Archive Engine

Note

A configuration has to be loaded into the database before the archive engine can be started. Refer to Section 6.3, “Loading the Configuration” for details about how to load a configuration.

The archive engine can be started by changing to the directory where it is installed (usually archive-engine) and executing ArchiveEngine.sh. You will have to specify a few parameters, e.g. ./ArchiveEngine.sh -engine MyEngineName -data workspace. Two instances of the archive engine cannot share the same engine name or workspace, so make sure the parameter values are unique within your cluster.

Call ./ArchiveEngine.sh -help for a full list of available command-line options. If the Cassandra database is not running on the same host as the archive engine, you are using a non-default keyspace name, or you enabled authentication, you can specify a plug-in customization file using the -pluginCustomization parameter. See Section 5.2.2, “Configuration” for details on how to define plugin customization options.

5.3. Cassandra Archive Reader for Control System Studio

5.3.1. Download and Installation

Follow the instructions in Section 5.2.1, “Download and Installation” for installing the plugins needed to integrate the Cassandra Archive Reader into the data browser in Control System Studio.

5.3.2. Configuration

The Cassandra Archive Reader is configured the same way as the other archive readers in Control System Studio:

In Control System Studio, go to CSS → Preferences.... This will open the preferences window. In the tree on the left, select CSS Applications → Trends → Data Browser. Now you can add the URL of the Cassandra database to the list Archive Data Server URLs.

The URLs supported by the Cassandra Archive Reader have the format cassandra://<hosts>:<port>/<keyspace>?username=<username>&password=<password>.

In EBNF the syntax is:

[1] url ::= "cassandra://", host, { ",", host }, [ ":", port ], "/", keyspace, [ "?", "username", "=", username, "&", "password", "=", password ] ;
[2] keyspace ::= path-absolute ; /* The keyspace must not only be a valid path according to the URL specifications but (after URL decoding) also be a valid Cassandra keyspace name. */
[3] username ::= parameter value ;
[4] password ::= parameter value ;
[5] parameter value ::= { pchar } ;

The symbols used but not defined here are defined in RFC 3986.

For a multi-node Cassandra setup, the list of hosts should include all hosts which export the service via Thrift. In this case the client can try all available hosts and continue operation if some of the hosts are down. The port specified here must be the same as the Thrift port specified in the Cassandra configuration (see the section called “Configuring the Network Interface”). This port must be the same for all nodes in the cluster.
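As an illustration of the URL format, the following Python sketch assembles a URL from its parts. The helper build_archive_url and the host names are hypothetical and not part of the Cassandra Archiver. The sketch percent-encodes every character outside the unreserved set of RFC 3986, which is stricter than the grammar requires, and it does not handle IPv6 host literals.

```python
from urllib.parse import quote


def build_archive_url(hosts, keyspace, port=None, username=None, password=None):
    """Assemble a cassandra:// URL following the EBNF above (hypothetical helper)."""
    if not hosts:
        raise ValueError("at least one host is required")
    url = "cassandra://" + ",".join(hosts)
    if port is not None:
        url += ":" + str(port)
    # Percent-encode everything outside the unreserved set so that the
    # result only contains characters allowed by the grammar.
    url += "/" + quote(keyspace, safe="")
    # The grammar only allows username and password together.
    if username is not None and password is not None:
        url += "?username=" + quote(username, safe="")
        url += "&password=" + quote(password, safe="")
    return url


# Two nodes that both listen on the default Thrift port.
print(build_archive_url(["node1", "node2"], "cssArchive", port=9160))
```

The resulting string can be pasted into the list Archive Data Server URLs.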

Chapter 6. Configuration

6.1. Configuration Format

Basically, the configuration format used by the Cassandra Archiver is the same as the one used by the RDB Archiver. However, the syntax is extended by a new tag used to configure compression levels.

For explaining the syntax of the configuration file, we use a simple example:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<engineconfig>
  <group>
    <name>firstGroup</name>
    <channel>
      <name>firstChannel</name>
      <period>0.5</period>
      <monitor/>
      <compression-level retention-period="86400"/>
      <compression-level compression-period="30"/>
      <compression-level compression-period="300"/>
    </channel>
    <channel>
      <enable/>      
      <name>secondChannel</name>
      <period>1</period>
      <scan/>
    </channel>
  </group>
  <group>
    <name>anotherGroup</name>
    
    <channel>
      <name>someOtherChannel</name>
      <period>10</period>
      <scan/>
      <compression-level compression-period="30"/>
      <compression-level compression-period="300"/>
    </channel>
  </group>
  <group>
    <name>__disabled_channels</name>
    
    <channel>
      <name>someOldChannel</name>
      <period>5</period>
      <scan/>
      <compression-level compression-period="30"/>
    </channel>
  </group>
</engineconfig>

Every engine configuration is enclosed by the engineconfig tag. Within the engineconfig there must be at least one group. Each group must have a name. The group name must be unique within the engine configuration.

Within a group, there can be an arbitrary number of channel tags. Each channel must specify a name. The channel name must be unique across all engine configurations.

A channel must also specify a period and either the scan or monitor mode.
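The structural rules described so far can be checked before a file is imported. The following Python sketch is a hypothetical pre-check and not part of the archive config tool. It assumes that a channel must contain exactly one of the scan and monitor tags, and it cannot verify that channel names are unique across all engine configurations because it only sees a single file.

```python
import xml.etree.ElementTree as ET


def check_engine_config(xml_text):
    """Return a list of rule violations found in an engine configuration."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "engineconfig":
        problems.append("root element must be engineconfig")
    groups = root.findall("group")
    if not groups:
        problems.append("at least one group is required")
    seen_groups = set()
    for group in groups:
        group_name = (group.findtext("name") or "").strip()
        if not group_name:
            problems.append("group without a name")
        elif group_name in seen_groups:
            problems.append("duplicate group name: " + group_name)
        seen_groups.add(group_name)
        for channel in group.findall("channel"):
            channel_name = (channel.findtext("name") or "").strip()
            if not channel_name:
                problems.append("channel without a name in group " + group_name)
            if channel.find("period") is None:
                problems.append("channel " + channel_name + " has no period")
            modes = [m for m in ("scan", "monitor") if channel.find(m) is not None]
            if len(modes) != 1:
                problems.append("channel " + channel_name
                                + " needs exactly one of scan or monitor")
    return problems


example = ("<engineconfig><group><name>firstGroup</name>"
           "<channel><name>firstChannel</name><period>0.5</period><monitor/></channel>"
           "</group></engineconfig>")
print(check_engine_config(example))
```

An empty list means that none of the checked rules is violated.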

In scan mode, the period specifies the interval (as a floating point number in seconds) between the snapshots taken from the channel. If the channel has not changed since the last snapshot, the new snapshot is discarded.
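The effect of scan mode on the stored samples can be pictured with a small Python sketch. It is a simplified model and not the engine's implementation: the function name is hypothetical, and the sketch compares only the value of a snapshot, while the real engine may also take other properties of a sample into account.

```python
def scan_mode_samples(snapshots):
    """Keep a snapshot only if its value differs from the previously kept one.

    snapshots: iterable of (timestamp_seconds, value) pairs, one per period.
    """
    kept = []
    for timestamp, value in snapshots:
        if kept and kept[-1][1] == value:
            continue  # unchanged since the last snapshot: discard
        kept.append((timestamp, value))
    return kept


# Period 0.5 s: five snapshots are taken, three of them are stored.
snapshots = [(0.0, 1), (0.5, 1), (1.0, 2), (1.5, 2), (2.0, 1)]
print(scan_mode_samples(snapshots))
```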

In monitor mode, every change received for the channel is saved. In this case, period specifies the expected interval between changes. It is used to size the queue that stores new samples before they are written to the database. If the specified period is too long and the actual change rate is higher, samples might be lost because the queue fills up. If the specified period is much shorter than the actual change period, more memory than needed is allocated for the channel. The size of the buckets storing the samples is also determined from the configured period. Choosing a period that is too far off the actual period results either in too many buckets being created (hurting read and write performance) or in too many samples being stored in a single bucket (which means that rows grow very big). You should therefore set the period to the average interval between changes that you expect. If you are worried about losing samples during bursts, you can increase the org.csstudio.archive.engine/buffer_reserve property.

The compression-level tag is optional and its meaning is discussed in the next section.

The group name __disabled_channels is reserved for special use: you can use this group for channels that you want to disable permanently (e.g. because the corresponding device has been removed) but still want to see in the data browser. Moving a channel to this group basically has the same effect as moving it to a group with an enabling channel that is always false, but it avoids the warning about the channel being disconnected.

6.2. Compression Levels

Unlike the RDB Archiver, the Cassandra Archiver does not compress samples for each read request, but stores the compressed samples instead. This has the advantage that queries requesting samples for a long period have to read less data and can thus be answered more quickly.

The compression levels are configured independently for each channel. Each compression-level tag (except the one for the compression level that stores the raw samples, which has an implicit compression period of zero) must specify a compression-period attribute. This interval (an integer number of seconds) specifies the time between two compressed samples. If two consecutive samples have the same (average) value as well as the same minimum and maximum bounds, the second sample is not saved. All compressed samples are aligned to January 1st, 1970, 00:00:00 UTC. This way, the compressed samples of two different channels that use the same compression period are aligned with respect to each other.
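The alignment amounts to rounding a timestamp down to a multiple of the compression period, counted from the epoch. The following Python sketch shows this arithmetic. The function name is hypothetical, and whether the archiver labels a compressed sample with the start, the middle or the end of its interval is not specified here, so the sketch only computes the grid.

```python
def aligned_interval_start(timestamp, compression_period):
    """Start of the epoch-aligned interval that contains the timestamp.

    Both arguments are integer seconds; timestamp counts from
    January 1st, 1970, 00:00:00 UTC.
    """
    if compression_period <= 0:
        raise ValueError("compression period must be a positive number of seconds")
    return timestamp - timestamp % compression_period


# Raw samples of two channels that arrive at different times fall into the
# same 300 s interval, so their compressed samples line up.
print(aligned_interval_start(1000000250, 300))
print(aligned_interval_start(1000000299, 300))
```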

The retention-period attribute is optional for the compression-level tag. If a positive retention period (in integer seconds) is defined, samples that are older than the newest sample minus the specified period are deleted. The retention-period attribute is also valid for the special raw compression level.

The raw compression-level is always defined, even if you do not specify a compression-level for it. By default its retention period is zero (meaning that samples are never deleted).
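The retention rule can be written down as a short Python sketch. It only illustrates which samples fall outside the retention period and is not the archiver's implementation; the function name is hypothetical.

```python
def samples_to_delete(sample_times, retention_period):
    """Timestamps (integer seconds) that fall outside the retention period.

    A retention period of zero means that nothing is ever deleted.
    """
    if retention_period <= 0 or not sample_times:
        return []
    # The cutoff is relative to the newest sample, not to the current time.
    cutoff = max(sample_times) - retention_period
    return [t for t in sample_times if t < cutoff]


# With retention-period="86400", only samples more than one day older
# than the newest sample are removed.
print(samples_to_delete([0, 50000, 86400, 100000], 86400))
```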

Important

When specifying a retention period, you have to make sure that all compressed samples have been calculated before the samples needed for this calculation are deleted. Compressed samples are usually calculated from the compression level with the next shorter compression period that is an integer fraction of the compression period of the level to be calculated (that is, that divides it evenly). If no such compression level exists, the raw samples are used instead. As a rule of thumb, the retention period for any compression level should be at least double the largest compression period for the same channel.
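The following Python sketch illustrates how the source level for a compression level might be selected and how the rule of thumb can be checked. Both functions are hypothetical. The sketch interprets the selection rule as "the longest configured period that is shorter than the target period and divides it without remainder", which is an assumption about the archiver's behaviour.

```python
def source_period(target_period, configured_periods):
    """Compression period of the level a target level is calculated from.

    Returns 0 if the raw samples have to be used.
    """
    candidates = [p for p in configured_periods
                  if 0 < p < target_period and target_period % p == 0]
    return max(candidates) if candidates else 0


def retention_is_safe(retention_period, configured_periods):
    """Rule of thumb: retain at least twice the largest compression period."""
    if retention_period == 0:  # zero means that samples are never deleted
        return True
    return retention_period >= 2 * max(configured_periods, default=0)


# With compression periods of 30 s and 300 s, the 300 s level is calculated
# from the 30 s level, and the 30 s level from the raw samples.
print(source_period(300, [30, 300]), source_period(30, [30, 300]))
# A retention period of one day is far above 2 * 300 s.
print(retention_is_safe(86400, [30, 300]))
```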

6.3. Loading the Configuration

For loading or updating an engine configuration, you have to use the archive config tool, which is distributed in the archive-config-tool directory of the binary distribution. For importing an engine configuration file, you can call ./ArchiveConfigTool.sh -engine myEngineName -config myEngineConfig.xml -import. If you want to replace the configuration of an existing engine, you have to add the -replace_engine parameter. Replacing an engine configuration will first delete the existing configuration and then import the new configuration. Thus, it is equivalent to first using the -delete_config parameter and then importing the configuration with the -replace_engine parameter.

Deleting an engine configuration will never delete the samples associated with the engine's channels. However, if a channel does not exist in the configuration, there is no way to retrieve its samples using the archive reader. Therefore, if you want to be able to retrieve historic data, you should move channels to a disabled group instead of deleting them completely. If you finally want to delete the samples of deleted channels, you have to use the clean-up tool.

If the default connection parameters (Cassandra host is localhost, port is 9160, keyspace name is cssArchive and no authentication is used) are not correct for your setup, you either have to specify the connection parameters as command-line parameters, or you have to specify a plug-in customization file. Call ./ArchiveConfigTool.sh -help for a list of all supported command-line parameters.

Important

No channels should be modified using the archive configuration tool while the engine for the respective channel is running (in particular, no channel should be deleted). This means that you should not use the -import or -delete_config actions for an archive engine that is running, and that you should not use the -steal_channels option while any engine is running. You also should not modify the configuration while the archive clean-up tool is running.

6.4. Cleaning up the Database

If you want to delete the samples of non-existing channels or want to clean up small inconsistencies, which can occur if a write operation is interrupted, you can use the clean-up tool.

The clean-up tool is distributed in the archive-cleanup-tool directory of the binary distribution. You can start it by invoking ./ArchiveCleanUpTool.sh. If you are using non-default connection parameters, the same considerations as for the archive config tool apply.

Important

A run of the clean-up tool can take a very long time. During this time you should not use the archive config tool, because new configurations added by the config tool and the respective samples might be deleted by the clean-up tool. However, the archive engine can run while the clean-up process is running.