The Cassandra PV Archiver stores its data in several CQL tables, listed in Table A.1, “Cassandra PV Archiver CQL tables” .
Table name | Description |
---|---|
cluster_servers | Status and location information for Cassandra PV Archiver servers |
pending_channel_operations_by_server | Protective entries for channels in order to prevent concurrent modifications |
channels | Channel configuration and information about sample buckets for each channel |
channels_by_server | Channel configuration and state for all channels associated with each server (for faster startup) |
generic_data_store | Generic configuration information (e.g. credentials for the administrative user interface) |
In addition to these tables, each control-system support has one
ore more
tables.
Please refer to the documentation of the respective
control-system support
for details.
Most of the tables listed
earlier are considered internal to the operation
of the Cassandra
PV Archiver and thus are not discussed in greater detail.
Only the
channels
table is relevant for accessing data
stored in the archive. This
table is discussed in
Section 1, “Table channels”
.
The
channels
table stores configuration information
and information about
sample buckets for each channel.
The table’s structure is
described by
Table A.2, “Columns of table channels”
.
Column name | Column type | Data type | Description |
---|---|---|---|
channel_name | Partition key | text | Channel name. |
decimation_level | Clustering Key | int | Decimation level (identified by the decimation period in seconds). Zero indicates raw samples. |
bucket_start_time | Clustering Key | bigint | Start time of the sample bucket (in nanoseconds since epoch, which is January 1 st , 1970, 00:00:00 UTC). |
bucket_end_time | Regular | bigint | End time of the sample bucket (in nanoseconds since epoch, which is January 1 st , 1970, 00:00:00 UTC). |
channel_data_id | Static | uuid | Data ID associated with the channel. This information is used to identify associated data in the control-system support’s table(s). |
control_system_type | Static | text | ID of the control-system support used for the channel. |
decimation_levels | Static | set<int> | Set containing all decimation levels that exist for the channel (identified by their decimation periods in seconds). |
server_id | Static | uuid | UUID of the server to which the channel belongs. |
The channel name is used as the partition key and the decimation level and bucket start time are used as clustering keys. This means that for each channel, there is a partition and for each sample bucket there is a row in this partition. The ordering of the clustering keys (decimation level first, bucket start time second) makes it possible to search for sample buckets for a specific decimation level that are in a certain time range. All configuration information is stored in static columns (columns that are shared among all rows in the partition) because this information obviously does not depend on the sample bucket.
The bucket end time is a regular column and thus it is not possible to search by end time. However, the end time is typically just one nanosecond before the start time of the following bucket (it is guaranteed to be strictly less than the start time of the next bucket). Therefore, there is usually no need to search based on the end time.
When reading samples, one has to search for the sample buckets that store the samples for the relevant period of time. One can use a query like the following to search for all sample buckets that start in a certain period of time:
SELECT * FROM channels WHERE channel_name = 'myChannel' AND decimation_level = 0 AND bucket_start_time >= 1468429000000000000 AND bucket_start_time <= 1468431000000000000 ORDER BY decimation_level ASC;
In this example,
myChannel
is the name of the channel
and we search for sample buckets
storing raw samples (decimation period of
0
) and starting between the time stamps
1468429000000000000
and
1468431000000000000
.
It might seem strange to order by the
decimation_level
column when we actually want to
order by the
bucket_start_time
column.
However, Cassandra (currently) only allows specifying the
first column
of a composite clustering key in the
ORDER BY
clause.
The
ORDER BY
clause still has the intended effect of
(also) ordering by the
bucket_start_time
column.
Typically, one also needs the sample bucket that starts before the lower time-stamp, unless there is a sample bucket starting right at the lower limit of the search period, which will only happen by chance. One can retrieve information about this sample bucket with a query like the following:
SELECT * FROM channels WHERE channel_name = 'myChannel' AND decimation_level = 0 AND bucket_start_time < 1468429000000000000 ORDER BY decimation_level DESC LIMIT 1;
We are only interested in the first sample bucket just before our lower limit, which is why we use descending order and limit the results to a single row.
Once we know the sample buckets, we can retrieve the
corresponding
samples from the control-system support’s table(s).
We need the
channel_data_id
,
decimation_level
, and
bucket_start_time
in order to identify the sample
bucket in the control-system
support’s table(s).
When querying these tables, the time stamp of
the samples should be
limited to the range specified by the
bucket_start_time
and
bucket_end_time
, unless the limits imposed by the
time period that is queried
are more narrow.
Always using these limits ensures that we do not
read samples that have
accidentally been written into a sample
bucket where they do not belong.
Usually, such samples should not
exist, but it is better to be safe.