The configuration options used by the Cassandra PV Archiver
server are
controlled through a configuration file in the
YAML
format.
The configuration file is located in the
conf
directory of the binary distribution or in the
/etc/cassandra-pv-archiver
directory when using the
Debian package.
In either case, the
configuration file is called
cassandra-pv-archiver.yaml
.
It is not an error if the configuration file does not exist at
the
expected location.
In this case the server starts using
default values for all
configuration options.
The path to the configuration file can be overridden by
specifying the
--config-file
command line option to the
cassandra-pv-archiver-server
script.
When this configuration option is specified, the default
location is not
used.
Unlike the configuration file in the default
location, a configuration
file specified with the
--config-file
option must exist
and the server does not start if it is missing.
The configuration options are organized in a hierarchy. For the rest of this document, the first level of this hierarchy is called the section. The hierarchical path to a configuration option can either be specified inline or through indentation. For example, specifying
level1a:
  option1: value1
  level2:
    option1: value2
level1b:
  option1: value3
is equivalent to specifying
level1a.option1: value1
level1a.level2.option1: value2
level1b.option1: value3
The default values specified in this document are the default values that are used when a configuration option is not specified at all, not the value of the option that is specified in the configuration file distributed as part of the binary distribution or Debian package.
This section only describes the part of the configuration that is stored in the per-server configuration file, not the configuration that is stored in the database. Regarding the latter one, please refer to Section 4, “Administrative user interface” .
The
cassandra
section configures the server’s
connection to the Cassandra
cluster.
The
cassandra.hosts
option specifies the list
of hosts which are used for
initially establishing the connection
with the Cassandra
cluster.
This list does not have to contain all Cassandra
hosts because all
hosts in the cluster are detected
automatically once the
connection to at least one host has
been established.
However, it is still a good idea to specify
more than one host here
because this will ensure that the
connection can be established
even if one of the hosts is
down when the Cassandra PV Archiver
server is started.
By default, the list only contains
localhost
.
The list of hosts has to be specified as a YAML list, using
the
regular or the inline list syntax. For example, a list
specifying
three hosts might look like this:
cassandra:
  hosts:
    - server1.example.com
    - server2.example.com
    - server3.example.com
The
cassandra.port
option specifies the port
number on which the Cassandra hosts
are listening for incoming
connections (for Cassandra’s
native protocol).
The default value is 9042, which is also
the default value used by
Cassandra.
The
cassandra.keyspace
option specifies the name
of the keyspace in which the
Cassandra PV Archiver stores its data.
The default value is
pv_archive
.
While strictly speaking mixed-case names are allowed, the
use of
such names is discouraged because many tools have
problems with them
and they typically require quoting.
For this
reason, the keyspace name should be all lower-case when
possible.
The
cassandra.username
option specifies the
username that is specified when
authenticating with the Cassandra
cluster.
When empty, the
connection to the Cassandra cluster is established
without
trying to authenticate the client.
The default value is the
empty string (no authentication).
The
cassandra.password
option specifies the
password that is specified when
authenticating with the Cassandra
cluster.
The password is
only used when the username is not empty.
The default value
is the empty string.
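Taken together, a cassandra section that connects to a remote cluster with authentication enabled might look like the following sketch; the hostnames, keyspace name, and credentials are placeholder values, not recommendations:

```yaml
cassandra:
  hosts:
    - cassandra1.example.com
    - cassandra2.example.com
  port: 9042
  keyspace: pv_archive
  # leave username empty (or omit it) to connect without authentication
  username: archiver
  password: secret
```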
The
cassandra.fetchSize
option specifies the
default fetch size that is used when
reading data from the Cassandra
database.
The fetch size
specifies how many rows are read from the database in
a
single page.
Specifying a larger value typically improves
performance when
processing a query that returns many rows,
but results in more
memory usage in both the database server
and the client because the
full page of rows has to be kept
in memory.
The default value is zero, which causes the default fetch size of the Cassandra driver to be used. As of version 3.1.4 of the Cassandra driver, that default fetch size is 5000 rows. If specified, this option has to be set to an integer between 0 and 2147483647.
The fetch size specified here is only used for queries that do not explicitly specify a fetch size.
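For example, the default fetch size could be reduced as follows; the value of 1000 rows is a hypothetical value chosen purely for illustration:

```yaml
cassandra:
  # read at most 1000 rows per page instead of the driver's default
  fetchSize: 1000
```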
The
cassandra.useLocalConsistencyLevel
option
specifies whether data-center-local consistency levels are used
for all database
operations.
The default value is
false
.
This option only has an effect when the Cassandra cluster
is
distributed across multiple data centers.
By setting this
option to
true
, the
LOCAL_QUORUM
consistency level is used where
usually the
QUORUM
consistency level would be
used.
In the same way, the
LOCAL_SERIAL
consistency
level is used instead of the
SERIAL
consistency
level.
This option must only be enabled if only a single data center makes modifications to the data and all other data centers only use the database for read access. In this case, enabling this option can reduce the latency of operations because the client only has to wait for nodes local to the data center. The most likely scenario is a situation where all nodes running the Cassandra PV Archiver servers are in a single data center, but there is a second data center to which all data is replicated for disaster recovery.
Important: Never enable this option when there is more than one data center that is used for write access to the database. In this case, enabling this option will lead to data corruption because operations that are expected to result in a consistent state might actually leave inconsistencies. This option merely provides a performance optimization, so in case of doubt, leave it at its default value of false.
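In the disaster-recovery scenario described above, where only a single data center is used for write access, the option would be enabled like this:

```yaml
cassandra:
  # only safe when exactly one data center modifies the data
  useLocalConsistencyLevel: true
```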
The
server
section configures the archiving server
(for example the ID
assigned to each server instance and on which
address and ports
the archiving server listens).
While the address and port
settings can usually be left at their
defaults the server’s ID
has to be set.
Each server in the cluster is identified by a unique ID
(UUID).
As this UUID has to be unique for each server, there
is no
reasonable default value, but it has to be specified
explicitly.
The server’s UUID can be specified using the
server.uuid
option.
Alternatively, it can be specified by passing the
--server-uuid
parameter to the server’s start
script.
Important: Starting two server instances with the same UUID results in data corruption, regardless of whether these instances are started on the same host or different hosts. For this reason, care should be taken to ensure that each UUID is only used for exactly one process.
As an alternative to specifying the server’s UUID in the
configuration file or on the command line, it is possible to
have a
separate file that specifies the UUID.
The path to this
file can be specified with the
server.uuidFile
option.
If this file exists, it is expected to contain a
single line with
the UUID that is then used as the server’s
UUID.
If this file does not exist, the server tries to create
it on
startup, using a randomly generated UUID.
By default
this option is not set so that the server expects an
explicitly specified UUID.
This option is particularly useful
in an environment where servers
are deployed automatically
and should thus automatically generate a
UUID the first time
they are started.
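A minimal server section therefore either sets the UUID explicitly or points to a UUID file; both values below are placeholder examples:

```yaml
server:
  # either specify the UUID explicitly (must be unique per server) ...
  uuid: 550e8400-e29b-41d4-a716-446655440000
  # ... or have one generated and stored in a file on first start
  # (do not set both options; the path is a hypothetical example):
  # uuidFile: /var/lib/cassandra-pv-archiver/server-uuid.txt
```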
The
server.listenAddress
option specifies the IP
address (or the hostname resolving to
the IP address) on which the
server listens for incoming
connections.
If it is empty (the default), the server listens
on the first
non-loopback address that is found.
This means
that typically, this option only has to be set for
servers
that have more than one (non-loopback) interface.
The specified address is used for the administrative user-interface, the archive-access interface, and the inter-node communication interface. In addition to the specified address, the administrative user-interface and the archive-access interface are also made available on the loopback address.
This option should never be set to
localhost
,
127.0.0.1
,
::1
, or any other
loopback address because other servers will
try to contact the
server on the specified address and
obviously this will lead to
unexpected results when the
address is a loopback address.
The
server.adminPort
option specifies the TCP
port number on which the
administrative user-interface is made
available.
The default
is port 4812.
The
server.archiveAccessPort
option specifies the
TCP port number on which the
archive-access interface is made
available.
The default is
port 9812.
The archive-access interface is the web-interface
through which
clients access the data stored in the archive.
The
server.interNodeCommunicationPort
option
specifies the TCP port number on which the inter-node
communication
interface is made available.
The default is port
9813.
As the name suggests, the inter-node communication
interface is
used for internal communication between
Cassandra PV Archiver
servers that is needed in order to
coordinate the cluster operation
(for example in case of
configuration changes).
The
server.interNodeCommunicationRequestTimeout
option specifies the timeout used for the communication
between
nodes.
The timeout is specified in milliseconds.
If
chosen too low, complex requests (e.g. a request to modify
the
configuration of many channels when importing a
configuration file)
may time out.
If chosen too high, requests
will take a very long time before
timing out in case of a
sudden server crash or network disruption.
The default value is 900000 milliseconds (15 minutes). Valid values are integer numbers between 1 and 2147483647.
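Putting the network-related options together, a server section for a host with several interfaces might look like this; the listen address is a placeholder, while the port numbers and the timeout shown are the defaults:

```yaml
server:
  # address on which all three interfaces are made available
  listenAddress: 192.0.2.10
  adminPort: 4812
  archiveAccessPort: 9812
  interNodeCommunicationPort: 9813
  # 15 minutes, in milliseconds
  interNodeCommunicationRequestTimeout: 900000
```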
The
throttling
section contains options for
throttling database operations.
The Cassandra PV Archiver server tries to run database
operations in
parallel in order to reduce the effective latency
of complex
operations (e.g. operations involving many channels).
However, depending on the exact configuration of the Cassandra
cluster
(for example the size of the cluster, network bandwidth
and latency,
hardware used for the cluster, load caused by
other applications), the
number of operations that can safely
be run in parallel might differ.
When running too many operations in parallel, this results in some of the operations timing out. This can be avoided by reducing the number of operations allowed to run in parallel. On the other hand, when operations never time out, one might try to increase the limits in order to improve the performance.
The limits can be controlled separately for read and write
operations
and for operations touching the channels’ meta-data
(for example the
configuration and information about sample
buckets) and the actual
samples.
Operations modifying channel
meta-data are typically carried out using
the
SERIAL
consistency level, so in this case write
operations typically
are more expensive than read operations.
Thus the limit for
write operations should be lower than the limit for
read
operations.
In the case of operations dealing with actual
samples, read operations
typically are more expensive than
write operation (due to how
Cassandra works internally), so the
limit for read operations should be
lower than the limit for
write operations.
Note: When trying to optimize the throttling settings, it can be helpful to connect to the Cassandra PV Archiver server via JMX (for example using JConsole from the JDK). The current number of operations that are running and waiting is exposed via MBeans, so that it is possible to monitor how changing the throttling parameters affects the operation.
The
throttling.maxConcurrentChannelMetaDataReadStatements
configuration option controls how many read operations for
channel
meta-data should be allowed to run in parallel.
Usually, these are statements reading from the
channels
,
channels_by_server
,
and
pending_channel_operations_by_server
tables.
Typically, this limit should be greater than the
limit set by the
throttling.maxConcurrentChannelMetaDataWriteStatements
option.
The default value is 64.
The
throttling.maxConcurrentChannelMetaDataWriteStatements
configuration option controls how many write operations for
channel
meta-data should be allowed to run in parallel.
Usually, these are statements writing to the
channels
,
channels_by_server
,
and
pending_channel_operations_by_server
tables.
Typically, such operations are light-weight
transactions and thus
this limit should be less than the
limit set by the
throttling.maxConcurrentChannelMetaDataReadStatements
option.
The default value is 16.
The
throttling.maxConcurrentControlSystemSupportReadStatements
configuration option controls how many read operations of the
control-system supports (all of them combined) are allowed
to run in
parallel.
Usually, these are statements that read
actual samples and thus read
from the tables used by the
control-system support(s).
Typically, this limit should be
less than the limit set by the
throttling.maxConcurrentControlSystemSupportWriteStatements
option, but significantly greater than the limit set by the
throttling.maxConcurrentChannelMetaDataReadStatements
option.
The default value is 128.
The
throttling.maxConcurrentControlSystemSupportWriteStatements
configuration option controls how many write operations of the
control-system supports (all of them combined) are allowed
to run in
parallel.
Usually, these are statements that write
actual samples (for each
sample that is written, an
INSERT
statement is
triggered) and that thus write to the tables
used by the
control-system support(s).
Typically, this limit
should be greater than the limit set by the
throttling.maxConcurrentControlSystemSupportReadStatements
option and significantly greater than the limits set by the
throttling.maxConcurrentChannelMetaDataReadStatements
and
throttling.maxConcurrentChannelMetaDataWriteStatements
options.
The default value is 512.
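The relations between the four limits described above can be summarized in a single throttling section; the values shown are the defaults and merely illustrate the recommended proportions:

```yaml
throttling:
  # meta-data: reads may run with more parallelism than writes,
  # because the writes are typically light-weight transactions
  maxConcurrentChannelMetaDataReadStatements: 64
  maxConcurrentChannelMetaDataWriteStatements: 16
  # samples: writes may run with more parallelism than reads
  maxConcurrentControlSystemSupportReadStatements: 128
  maxConcurrentControlSystemSupportWriteStatements: 512
```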
The
throttling.sampleDecimation.maxFetchedSamplesInMemory
configuration option controls how many samples may be
fetched into
memory when generating decimated samples.
The sample decimation process can consume a lot of memory when generating decimated samples from already existing source samples for many channels. The number of samples that may be fetched into memory is directly connected to memory usage: each fetched sample occupies about 1 KB of memory (for scalar Channel Access samples), so one million samples are roughly equivalent to 1 GB of memory.
As the exact number of samples returned by a fetch operation
cannot
be known in advance, this threshold might actually be
exceeded
slightly.
The
maxRunningFetchOperations
option can be used to control by how much the threshold may
be exceeded.
The default value for this option is 1000000 samples.
The
throttling.sampleDecimation.maxRunningFetchOperations
configuration option controls how many fetch operations may
run in
parallel when generating decimated samples.
As the exact number of samples returned by a fetch operation
cannot
be known in advance, the threshold set by the
maxFetchedSamplesInMemory
option might actually be exceeded slightly.
This
configuration option can be used to control by how much the
threshold may be exceeded.
The maximum number of running fetch
operations multiplied by the
fetch size
is the maximum number of samples by which the limit might be
exceeded.
The default value for this option is 20.
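Both sample-decimation options are specified in a sub-section of the throttling section; the values shown are the defaults:

```yaml
throttling:
  sampleDecimation:
    # roughly 1 GB of memory at about 1 KB per scalar sample
    maxFetchedSamplesInMemory: 1000000
    maxRunningFetchOperations: 20
```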
The
controlSystemSupport
section contains the
configuration options for the various
control-system supports.
For each available control-system
support, this section has a
corresponding sub-section.
The
configuration options in these sub-sections are not handled by
the Cassandra PV Archiver server itself but passed as-is to
the
respective control-system support.
For this reason, the
names of the available options entirely depend
on the
respective control-system support.
Please refer to the
documentation of the respective control-system
support for
details.
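Structurally, such a configuration looks like the following sketch. The sub-section name and the option shown here are hypothetical; the actual names have to be taken from the documentation of the respective control-system support:

```yaml
controlSystemSupport:
  # one sub-section per control-system support, named after its ID
  channel_access:
    # options are passed as-is to the control-system support
    someOption: someValue
```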
For example, the documentation for the Channel Access
control-system
support is available in
Appendix D, Channel Access control-system support
.
The Cassandra PV Archiver server is based on the Spring Boot framework. For this reason, the options supported for configuring logging are actually the same ones that are supported by Spring Boot. These options are documented in the Spring Boot Reference Guide. The Cassandra PV Archiver server uses Logback as its logging backend, so the specifics of how to configure Logback for Spring Boot might also be interesting.
In order to get started more easily, this section contains a few pointers on how the logging configuration can be modified.
The log level can be set both globally and for specific subtrees of the class hierarchy. When specifying different log levels for different parts of the hierarchy, more specific definitions (the ones covering a smaller sub-tree of the hierarchy) take precedence over more general definitions.
The available log levels are
ERROR
,
WARN
,
INFO
,
DEBUG
, and
TRACE
.
Each log level contains the preceding log levels (for
example
the log level
INFO
also contains
ERROR
and
WARN
).
The log level for the root of the hierarchy (that is used
for all
loggers that do not have a more specific definition)
is set through
the
logging.level.root
option.
By default, this log level is set to
INFO
.
This results in a lot of diagnostic messages being logged,
so you
might want to consider reducing it to
WARN
.
The log level for individual parts of the hierarchy can be
set by
using a configuration option containing the path to
the respective
hierarchy level.
For example, in order to
enable DEBUG messages for all classes in
the
com.aquenos.cassandra.pvarchiver
package (and
its sub-packages), one could set
logging.level.com.aquenos.cassandra.pvarchiver
to
DEBUG
.
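For example, the global log level could be reduced to WARN while DEBUG messages are enabled for the archiver's own classes:

```yaml
logging:
  level:
    root: WARN
    com.aquenos.cassandra.pvarchiver: DEBUG
```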
The path to the log file can be specified using the
logging.file
option.
If no log file is specified (the default),
log
messages are only written to the standard output.
In order to
log to more than one log file (for example depending
on the
log level or the class writing the log message) or in order
to disable logging to the standard output, one has to
specify a
custom logback configuration file (see the next
section).
When the configuration options directly available through
the
Cassandra PV Archiver server configuration file are not
sufficient,
one can specify a custom Logback configuration
file.
The path to this file is specified using the
logging.config
option.
The
information
available in the
Spring Boot Reference Guide
might be useful when using this option.
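For example, the two logging options might be combined as follows; the paths are placeholders:

```yaml
logging:
  # write log messages to this file in addition to the standard output
  file: /var/log/cassandra-pv-archiver/server.log
  # when the simple options are not sufficient, a custom Logback
  # configuration file can be specified instead:
  # config: /etc/cassandra-pv-archiver/logback.xml
```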
In addition to the configuration options that can be specified
in the
server’s configuration file, there are two environment
variables that
can be passed to the server’s startup script.
When using the Debian package, these environment variables
should be
set in the file
/etc/default/cassandra-pv-archiver-server
.
The first environment variable is
JAVA_HOME
.
It specifies the path to the JRE.
When starting the Java
process, the server’s startup scripts uses the
$JAVA_HOME/bin/java
executable
(
%JAVA_HOME%/bin/java.exe
on Windows).
When
JAVA_HOME
is not set, the startup script uses the
java
executable that is in the search
PATH
of the shell executing the startup script.
The second environment variable is
JAVA_OPTS
.
When set, the value of this environment variable is added to
the
parameters passed to the
java
executable.
It can be used to configure JVM options like the
maximum heap size.
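On a system using the Debian package, both environment variables could be set in /etc/default/cassandra-pv-archiver-server as in the following sketch; the JRE path and heap size are examples only:

```shell
# use a specific JRE instead of the java executable found on the PATH
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

# pass additional options to the JVM, e.g. a maximum heap size of 4 GB
JAVA_OPTS="-Xmx4g"
```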