The configuration options used by the Cassandra PV Archiver server are
controlled through a configuration file in the
YAML format.
The configuration file is located in the conf
directory of the binary distribution or in the
/etc/cassandra-pv-archiver
directory when using the
Debian package.
In either case, the configuration file is called
cassandra-pv-archiver.yaml.
It is not an error if the configuration file does not exist at the
expected location.
In this case the server starts using default values for all
configuration options.
The path to the configuration file can be overridden by specifying the
--config-file
command line option to the
cassandra-pv-archiver-server
script.
When this command line option is specified, the default location is not
used.
Unlike the configuration file in the default location, a configuration
file specified with the --config-file
option must exist;
the server does not start if it is missing.
The configuration options are organized in a hierarchy. For the rest of this document, the first level of this hierarchy is called the section. The hierarchical path to a configuration option can either be specified inline or through indentation. For example, specifying
    level1a:
      option1: value1
      level2:
        option1: value2
    level1b:
      option1: value3

is equivalent to specifying

    level1a.option1: value1
    level1a.level2.option1: value2
    level1b.option1: value3
The default values specified in this document are the default values that are used when a configuration option is not specified at all, not the value of the option that is specified in the configuration file distributed as part of the binary distribution or Debian package.
This section only describes the part of the configuration that is stored in the per-server configuration file, not the configuration that is stored in the database. Regarding the latter one, please refer to Section 4, “Administrative user interface”.
The cassandra
section configures the server’s
connection to the Cassandra cluster.
The cassandra.hosts
option specifies the list
of hosts which are used for initially establishing the connection
with the Cassandra cluster.
This list does not have to contain all Cassandra hosts because all
hosts in the cluster are detected automatically once the
connection to at least one host has been established.
However, it is still a good idea to specify more than one host here
because this will ensure that the connection can be established
even if one of the hosts is down when the Cassandra PV Archiver
server is started.
By default, the list only contains localhost.
The list of hosts has to be specified as a YAML list, using the
regular or the inline list syntax. For example, a list specifying
three hosts might look like this:
    cassandra:
      hosts:
        - server1.example.com
        - server2.example.com
        - server3.example.com
The cassandra.port
option specifies the port
number on which the Cassandra hosts are listening for incoming
connections (for Cassandra’s native protocol).
The default value is 9042, which is also the default value used by
Cassandra.
The cassandra.keyspace
option specifies the name
of the keyspace in which the Cassandra PV Archiver stores its data.
The default value is pv_archive.
While strictly speaking mixed-case names are allowed, the use of
such names is discouraged because many tools have problems with them
and they typically require quoting.
For this reason, the keyspace name should be all lower-case when
possible.
The cassandra.username
option specifies the
username that is specified when authenticating with the Cassandra
cluster.
When empty, the connection to the Cassandra cluster is established
without trying to authenticate the client.
The default value is the empty string (no authentication).
The cassandra.password
option specifies the
password that is specified when authenticating with the Cassandra
cluster.
The password is only used when the username is not empty.
The default value is the empty string.
The cassandra.fetchSize
option specifies the
default fetch size that is used when reading data from the Cassandra
database.
The fetch size specifies how many rows are read from the database in
a single page.
Specifying a larger value typically improves performance when
processing a query that returns many rows, but results in more
memory usage in both the database server and the client because the
full page of rows has to be kept in memory.
The default value is zero, which causes the default fetch size of the Cassandra driver to be used. As of version 3.1.4 of the Cassandra driver, that default fetch size is 5000 rows. If specified, this option has to be set to an integer between 0 and 2147483647.
The fetch size specified here is only used for queries that do not explicitly specify a fetch size.
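As an illustration of the options described so far, a cassandra section might look like the following sketch. The host names, username, and password are placeholders chosen for this example; the remaining values are the defaults stated above. The host list uses the inline list syntax.

```yaml
cassandra:
  # Inline list syntax; the host names are placeholders.
  hosts: [server1.example.com, server2.example.com]
  # Default port of Cassandra's native protocol.
  port: 9042
  # Default keyspace name (all lower-case, as recommended).
  keyspace: pv_archive
  # Placeholder credentials; leave the username empty to disable
  # authentication.
  username: archiver
  password: change-me
  # 0 means that the default fetch size of the Cassandra driver is used.
  fetchSize: 0
```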
The cassandra.useLocalConsistencyLevel
option
specifies whether the data-center-local variants of the consistency
levels are used for database operations.
The default value is false.
This option only has an effect when the Cassandra cluster is
distributed across multiple data centers.
When this option is set to true, the
LOCAL_QUORUM
consistency level is used where
usually the QUORUM
consistency level would be
used.
In the same way, the LOCAL_SERIAL
consistency
level is used instead of the SERIAL
consistency
level.
This option must only be enabled if only a single data center makes modifications to the data and all other data centers only use the database for read access. In this case, enabling this option can reduce the latency of operations because the client only has to wait for nodes local to the data center. The most likely scenario is a situation where all nodes running the Cassandra PV Archiver servers are in a single data center, but there is a second data center to which all data is replicated for disaster recovery.
Important | |
---|---|
Never enable this option when there is more than one data center that is used for write access to the database. In this case, enabling this option will lead to data corruption because operations that are expected to result in a consistent state might actually leave inconsistencies. This option merely provides a performance optimization, so in case of doubt, leave it at its default value of false. |
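For the disaster-recovery scenario described above (all Cassandra PV Archiver servers in one data center, a second data center used only as a replication target), the option could be enabled as in this sketch. It assumes that no process in any other data center ever writes to the keyspace.

```yaml
cassandra:
  # Only safe if a single data center performs all writes.
  # Uses LOCAL_QUORUM / LOCAL_SERIAL instead of QUORUM / SERIAL.
  useLocalConsistencyLevel: true
```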
The server
section configures the archiving server
(for example the ID assigned to each server instance and on which
address and ports the archiving server listens).
While the address and port settings can usually be left at their
defaults, the server’s ID has to be set.
Each server in the cluster is identified by a unique ID (UUID).
As this UUID has to be unique for each server, there is no
reasonable default value, so it has to be specified explicitly.
The server’s UUID can be specified using the
server.uuid
option.
Alternatively, it can be specified by passing the
--server-uuid
parameter to the server’s start
script.
Important | |
---|---|
Starting two server instances with the same UUID results in data corruption, regardless of whether these instances are started on the same host or different hosts. For this reason, care should be taken to ensure that each UUID is only used for exactly one process. |
As an alternative to specifying the server’s UUID in the
configuration file or on the command line, it is possible to have a
separate file that specifies the UUID.
The path to this file can be specified with the
server.uuidFile
option.
If this file exists, it is expected to contain a single line with
the UUID that is then used as the server’s UUID.
If this file does not exist, the server tries to create it on
startup, using a randomly generated UUID.
By default this option is not set so that the server expects an
explicitly specified UUID.
This option is particularly useful in an environment where servers
are deployed automatically and should thus automatically generate a
UUID the first time they are started.
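A minimal sketch of the UUID file approach might look like this. The path is a hypothetical example and not a default of the software; the directory has to be writable by the server process so that the file can be created on first startup.

```yaml
server:
  # Hypothetical path. If the file is missing, the server tries to
  # create it on startup with a randomly generated UUID.
  uuidFile: /var/lib/cassandra-pv-archiver/server-uuid.txt
```

When copying or cloning a deployed server, this file must not be copied along with it, because two instances would then share the same UUID.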
The server.listenAddress
option specifies the IP
address (or the hostname resolving to the IP address) on which the
server listens for incoming connections.
If it is empty (the default), the server listens on the first
non-loopback address that is found.
This means that typically, this option only has to be set for
servers that have more than one (non-loopback) interface.
The specified address is used for the administrative user-interface, the archive-access interface, and the inter-node communication interface. In addition to the specified address, the administrative user-interface and the archive-access interface are also made available on the loopback address.
This option should never be set to localhost,
127.0.0.1, ::1, or any other
loopback address because other servers will try to contact the
server on the specified address, and this will lead to
unexpected results when the address is a loopback address.
The server.adminPort
option specifies the TCP
port number on which the administrative user-interface is made
available.
The default is port 4812.
The server.archiveAccessPort
option specifies the
TCP port number on which the archive-access interface is made
available.
The default is port 9812.
The archive-access interface is the web-interface through which
clients access the data stored in the archive.
The server.interNodeCommunicationPort
option
specifies the TCP port number on which the inter-node communication
interface is made available.
The default is port 9813.
As the name suggests, the inter-node communication interface is
used for internal communication between Cassandra PV Archiver
servers that is needed in order to coordinate the cluster operation
(for example in case of configuration changes).
The server.interNodeCommunicationRequestTimeout
option specifies the timeout used for the communication between
nodes.
The timeout is specified in milliseconds.
If chosen too low, complex requests (e.g. a request to modify the
configuration of many channels when importing a configuration file)
may time out.
If chosen too high, requests will take a very long time before
timing out in case of a sudden server crash or network disruption.
The default value is 900000 milliseconds (15 minutes). Valid values are integer numbers between 1 and 2147483647.
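Put together, the address, port, and timeout options of the server section might be written as in the following sketch. The listen address is a placeholder from the documentation address range; the port numbers and the timeout are the defaults stated above, so they only need to be specified when they differ.

```yaml
server:
  # Placeholder address; must not be a loopback address.
  listenAddress: 192.0.2.10
  # Administrative user-interface (default).
  adminPort: 4812
  # Archive-access interface (default).
  archiveAccessPort: 9812
  # Inter-node communication interface (default).
  interNodeCommunicationPort: 9813
  # Timeout in milliseconds (default: 15 minutes).
  interNodeCommunicationRequestTimeout: 900000
```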
The throttling
section contains options for
throttling database operations.
The Cassandra PV Archiver server tries to run database operations in
parallel in order to reduce the effective latency of complex
operations (e.g. operations involving many channels).
However, depending on the exact configuration of the Cassandra cluster
(for example the size of the cluster, network bandwidth and latency,
hardware used for the cluster, load caused by other applications), the
number of operations that can safely be run in parallel might differ.
Running too many operations in parallel causes some of the operations to time out. This can be avoided by reducing the number of operations allowed to run in parallel. On the other hand, when operations never time out, one might try to increase the limits in order to improve the performance.
The limits can be controlled separately for read and write operations
and for operations touching the channels’ meta-data (for example the
configuration and information about sample buckets) and the actual
samples.
Operations modifying channel meta-data are typically carried out using
the SERIAL
consistency level, so in this case write
operations typically are more expensive than read operations.
Thus the limit for write operations should be lower than the limit for
read operations.
In the case of operations dealing with actual samples, read operations
typically are more expensive than write operations (due to how
Cassandra works internally), so the limit for read operations should be
lower than the limit for write operations.
Note | |
---|---|
When trying to optimize the throttling settings, it can be helpful to connect to the Cassandra PV Archiver server via JMX (for example using JConsole from the JDK). The current number of operations that are running and waiting is exposed via MBeans, so that it is possible to monitor how changing the throttling parameters affects the operation. |
The
throttling.maxConcurrentChannelMetaDataReadStatements
configuration option controls how many read operations for channel
meta-data should be allowed to run in parallel.
Usually, these are statements reading from the
channels, channels_by_server,
and pending_channel_operations_by_server
tables.
Typically, this limit should be greater than the limit set by the
throttling.maxConcurrentChannelMetaDataWriteStatements
option.
The default value is 64.
The
throttling.maxConcurrentChannelMetaDataWriteStatements
configuration option controls how many write operations for channel
meta-data should be allowed to run in parallel.
Usually, these are statements writing to the
channels, channels_by_server,
and pending_channel_operations_by_server
tables.
Typically, such operations are light-weight transactions and thus
this limit should be less than the limit set by the
throttling.maxConcurrentChannelMetaDataReadStatements
option.
The default value is 16.
The
throttling.maxConcurrentControlSystemSupportReadStatements
configuration option controls how many read operations the
control-system supports (all of them combined) are allowed to run in
parallel.
Usually, these are statements that read actual samples and thus read
from the tables used by the control-system support(s).
Typically, this limit should be less than the limit set by the
throttling.maxConcurrentControlSystemSupportWriteStatements
option, but significantly greater than the limit set by the
throttling.maxConcurrentChannelMetaDataReadStatements
option.
The default value is 128.
The
throttling.maxConcurrentControlSystemSupportWriteStatements
configuration option controls how many write operations the
control-system supports (all of them combined) are allowed to run in
parallel.
Usually, these are statements that write actual samples (for each
sample that is written, an INSERT
statement is
triggered) and that thus write to the tables used by the
control-system support(s).
Typically, this limit should be greater than the limit set by the
throttling.maxConcurrentControlSystemSupportReadStatements
option and significantly greater than the limits set by the
throttling.maxConcurrentChannelMetaDataReadStatements
and
throttling.maxConcurrentChannelMetaDataWriteStatements
options.
The default value is 512.
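The four statement limits described above, written out with their default values, would look like the following sketch. It is meant as a starting point for tuning; the ordering of the values follows the recommendations given above (meta-data writes lowest, sample writes highest).

```yaml
throttling:
  # Channel meta-data: reads are cheaper than (SERIAL) writes.
  maxConcurrentChannelMetaDataReadStatements: 64
  maxConcurrentChannelMetaDataWriteStatements: 16
  # Samples: writes are cheaper than reads.
  maxConcurrentControlSystemSupportReadStatements: 128
  maxConcurrentControlSystemSupportWriteStatements: 512
```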
The
throttling.sampleDecimation.maxFetchedSamplesInMemory
configuration option controls how many samples may be fetched into
memory when generating decimated samples.
The sample decimation process might consume a lot of memory when generating decimated samples from already existing source samples for a lot of channels. The amount of samples that may be fetched into memory is directly connected to memory usage. Each fetched sample occupies about 1 KB of memory (for scalar Channel Access samples), so one million samples are roughly equivalent to 1 GB of memory.
As the exact number of samples returned by a fetch operation cannot
be known in advance, this threshold might actually be exceeded
slightly.
The
maxRunningFetchOperations
option can be used to control by how much the threshold may be exceeded.
The default value for this option is 1000000 samples.
The
throttling.sampleDecimation.maxRunningFetchOperations
configuration option controls how many fetch operations may run in
parallel when generating decimated samples.
As the exact number of samples returned by a fetch operation cannot
be known in advance, the threshold set by the
maxFetchedSamplesInMemory
option might actually be exceeded slightly.
This configuration option can be used to control by how much the
threshold may be exceeded.
The maximum number of running fetch operations multiplied by the
fetch size
is the maximum number of samples by which the limit might be exceeded.
The default value for this option is 20.
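The two sample decimation options, written out with their default values, might look like the following sketch. The overshoot estimate in the comments assumes that fetch operations use the driver's default fetch size of 5000 rows mentioned earlier; a control-system support that specifies its own fetch size would change this number.

```yaml
throttling:
  sampleDecimation:
    # About 1 GB of memory at roughly 1 KB per scalar sample.
    maxFetchedSamplesInMemory: 1000000
    # Worst-case overshoot, assuming a fetch size of 5000 rows:
    # 20 * 5000 = 100000 samples (roughly 100 MB).
    maxRunningFetchOperations: 20
```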
The controlSystemSupport
section contains the
configuration options for the various control-system supports.
For each available control-system support, this section has a
corresponding sub-section.
The configuration options in these sub-sections are not handled by
the Cassandra PV Archiver server itself but passed as-is to the
respective control-system support.
For this reason, the names of the available options entirely depend
on the respective control-system support.
Please refer to the documentation of the respective control-system
support for details.
For example, the documentation for the Channel Access control-system
support is available in Appendix D, Channel Access control-system support.
The Cassandra PV Archiver server is based on the Spring Boot framework. For this reason, the options supported for configuring logging are actually the same ones that are supported by Spring Boot. These options are documented in the Spring Boot Reference Guide. The Cassandra PV Archiver server uses Logback as its logging backend, so the specifics of how to configure Logback for Spring Boot might also be of interest.
In order to get started more easily, this section contains a few pointers on how the logging configuration can be modified.
The log level can be set both globally and for specific subtrees of the class hierarchy. When specifying different log levels for different parts of the hierarchy, more specific definitions (the ones covering a smaller sub-tree of the hierarchy) take precedence over more general definitions.
The available log levels are ERROR,
WARN, INFO,
DEBUG, and TRACE.
Each log level contains the preceding log levels (for example
the log level INFO
also contains
ERROR
and WARN).
The log level for the root of the hierarchy (that is used for all
loggers that do not have a more specific definition) is set through
the logging.root.level
option.
By default, this log level is set to INFO.
This results in a lot of diagnostic messages being logged, so you
might want to consider reducing it to WARN.
The log level for individual parts of the hierarchy can be set by
using a configuration option containing the path to the respective
hierarchy level.
For example, in order to enable DEBUG messages for all classes in
the com.aquenos.cassandra.pvarchiver
package (and
its sub-packages), one could set
logging.com.aquenos.cassandra.pvarchiver.level
to
DEBUG.
The path to the log file can be specified using the
logging.file
option.
If no log file is specified (the default),
log messages are only written to the standard output.
In order to log to more than one log file (for example depending
on the log level or the class writing the log message) or in order
to disable logging to the standard output, one has to specify a
custom Logback configuration file (see the next section).
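For instance, writing log messages to a file could be configured as in this sketch. The path is a hypothetical example; the directory has to exist and be writable by the server process.

```yaml
logging:
  # Hypothetical path to the log file.
  file: /var/log/cassandra-pv-archiver/server.log
```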
When the configuration options directly available through the
Cassandra PV Archiver server configuration-file are not sufficient,
one can specify a custom Logback configuration file.
The path to this file is specified using the
logging.config
option.
The
information
available in the
Spring Boot Reference Guide
might be useful when using this option.
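A custom Logback configuration file could be referenced as in the following sketch. The path and file name are assumptions made for this example; the content of the referenced file is a regular Logback configuration and is not shown here.

```yaml
logging:
  # Hypothetical path to a custom Logback configuration file.
  config: /etc/cassandra-pv-archiver/logback.xml
```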
In addition to the configuration options that can be specified in the
server’s configuration file, there are two environment variables that
can be passed to the server’s startup script.
When using the Debian package, these environment variables should be
set in the file
/etc/default/cassandra-pv-archiver-server.
The first environment variable is JAVA_HOME.
It specifies the path to the JRE.
When starting the Java process, the server’s startup script uses the
$JAVA_HOME/bin/java
executable
(%JAVA_HOME%/bin/java.exe
on Windows).
When JAVA_HOME
is not set, the startup script uses the
java
executable that is in the search
PATH
of the shell executing the startup script.
The second environment variable is JAVA_OPTS.
When set, the value of this environment variable is added to the
parameters passed to the java
executable.
It can be used to configure JVM options like the maximum heap size.