Appendix A. CQL table layout

Table of Contents
1. Table channels

The Cassandra PV Archiver stores its data in several CQL tables, listed in Table A.1, “Cassandra PV Archiver CQL tables”.

Table A.1. Cassandra PV Archiver CQL tables
Table nameDescription
cluster_servers Status and location information for Cassandra PV Archiver servers
pending_channel_operations_by_server Protective entries for channels in order to prevent concurrent modifications
channels Channel configuration and information about sample buckets for each channel
channels_by_server Channel configuration and state for all channels associated with each server (for faster startup)
generic_data_store Generic configuration information (e.g. credentials for the administrative user interface)

In addition to these tables, each control-system support has one ore more tables. Please refer to the documentation of the respective control-system support for details. Most of the tables listed earlier are considered internal to the operation of the Cassandra PV Archiver and thus are not discussed in greater detail. Only the channels table is relevant for accessing data stored in the archive. This table is discussed in Section 1, “Table channels”.

1. Table channels

The channels table stores configuration information and information about sample buckets for each channel. The table’s structure is described by Table A.2, “Columns of table channels”.

Table A.2. Columns of table channels
Column nameColumn typeData typeDescription
channel_namePartition keytext Channel name.
decimation_levelClustering Keyint Decimation level (identified by the decimation period in seconds). Zero indicates raw samples.
bucket_start_timeClustering Keybigint Start time of the sample bucket (in nanoseconds since epoch, which is January 1st, 1970, 00:00:00 UTC).
bucket_end_timeRegularbigint End time of the sample bucket (in nanoseconds since epoch, which is January 1st, 1970, 00:00:00 UTC).
channel_data_idStaticuuid Data ID associated with the channel. This information is used to identify associated data in the control-system support’s table(s).
control_system_typeStatictext ID of the control-system support used for the channel.
decimation_levelsStaticset<int> Set containing all decimation levels that exist for the channel (identified by their decimation periods in seconds).
server_idStaticuuid UUID of the server to which the channel belongs.

The channel name is used as the partition key and the decimation level and bucket start time are used as clustering keys. This means that for each channel, there is a partition and for each sample bucket there is a row in this partition. The ordering of the clustering keys (decimation level first, bucket start time second) makes it possible to search for sample buckets for a specific decimation level that are in a certain time range. All configuration information is stored in static columns (columns that are shared among all rows in the partition) because this information obviously does not depend on the sample bucket.

The bucket end time is a regular column and thus it is not possible to search by end time. However, the end time is typically just one nanosecond before the start time of the following bucket (it is guaranteed to be strictly less than the start time of the next bucket). Therefore, there is usually no need to search based on the end time.

When reading samples, one has to search for the sample buckets that store the samples for the relevant period of time. One can use a query like the following to search for all sample buckets that start in a certain period of time:

SELECT * FROM channels WHERE
  channel_name = 'myChannel' AND
  decimation_level = 0 AND
  bucket_start_time >= 1468429000000000000 AND
  bucket_start_time <= 1468431000000000000
  ORDER BY decimation_level ASC;

In this example, myChannel is the name of the channel and we search for sample buckets storing raw samples (decimation period of 0) and starting between the time stamps 1468429000000000000 and 1468431000000000000. It might seem strange to order by the decimation_level column when we actually want to order by the bucket_start_time column. However, Cassandra (currently) only allows specifying the first column of a composite clustering key in the ORDER BY clause. The ORDER BY clause still has the intended effect of (also) ordering by the bucket_start_time column.

Typically, one also needs the sample bucket that starts before the lower time-stamp, unless there is a sample bucket starting right at the lower limit of the search period, which will only happen by chance. One can retrieve information about this sample bucket with a query like the following:

SELECT * FROM channels WHERE
  channel_name = 'myChannel' AND
  decimation_level = 0 AND
  bucket_start_time < 1468429000000000000
  ORDER BY decimation_level DESC
  LIMIT 1;

We are only interested in the first sample bucket just before our lower limit, which is why we use descending order and limit the results to a single row.

Once we know the sample buckets, we can retrieve the corresponding samples from the control-system support’s table(s). We need the channel_data_id, decimation_level, and bucket_start_time in order to identify the sample bucket in the control-system support’s table(s). When querying these tables, the time stamp of the samples should be limited to the range specified by the bucket_start_time and bucket_end_time, unless the limits imposed by the time period that is queried are more narrow. Always using these limits ensures that we do not read samples that have accidentally been written into a sample bucket where they do not belong. Usually, such samples should not exist, but it is better to be safe.