This section gives some hints on how to fix certain problems that might appear while running the Cassandra PV Archiver server. Readers may skip this section and come back later in case they experience one of the problems.
Apache Cassandra limits the time that is spent trying to process a statement. When a statement cannot be processed within this time limit, it fails with a timeout error. Such an error might appear as a message like “Cassandra timeout during write query at consistency SERIAL…” (or a similar message) that is displayed when trying to apply configuration changes, or as the error message for a channel that is in the error state.
Typically, statements time out because the Cassandra cluster is overloaded with requests and thus cannot process all of them in a timely manner. In this case, reducing the number of statements that are run in parallel can help alleviate the problem. When a write statement with a consistency level of SERIAL fails, this is most likely caused by the throttling.maxConcurrentChannelMetaDataWriteStatements option being set to a value that is too large. Please refer to Section 3.3, “Throttling” for details regarding the throttling of concurrent statements.
Timeouts when reading data might also occur because too many tombstones are present. In this case, there typically is a corresponding message in the log file of the Cassandra server. Please refer to Section 5.4, “Too many tombstones” for details about handling tombstones.
There are two ways in which channels can be listed: all channels in the cluster can be listed, or only the channels managed by a certain server. It can happen that these two lists get out of sync, so that channels are shown in the list of all channels, but not in the list for a specific server.
The reason for this is that the two lists are retrieved in different ways. The list of all channels is generated by reading the channels from the database (technically speaking, there is a cache layer involved, but typically this layer is not responsible for the inconsistencies). The per-server list, on the other hand, is retrieved from the server’s in-memory configuration when the server is online.
When adding or removing channels fails, it can happen that the operation actually succeeded up to a point where the channel already exists in the database, but the server’s in-memory configuration has not been updated.
When a channel that has been removed still exists in the per-server list, but has been removed from the list of all channels, forcing a reinitialization of the channel usually fixes the problem. When, on the other hand, a channel that has been added exists in the list of all channels but is missing from the per-server list, the only way to solve this is by restarting the affected server.
Usually, either problem only occurs when some database operations fail due to a transient database problem or timeouts. Please refer to Section 5.1, “Timeouts” for more information about how to fix timeouts.
Some operations regarding channels (in particular configuration changes and the creation of new sample buckets) require special protection in order to avoid data corruption. Without this protection, data corruption could happen when the server crashes after the operation has started but before it has completed. Because of how Cassandra applies data changes and due to possible clock skew in distributed systems, this mechanism has to ensure that no other modification is attempted for a certain amount of time after such an operation failed.
This means that any further modifications (including the archiving of samples) are blocked for up to ten minutes after an operation has failed. When the channel is initialized while such an operation is still pending, it switches to the error state with an error message like “The channel cannot be initialized because an operation of type … is pending”. When trying to make changes to the channel’s configuration, a similar message is displayed.
There is only one way to resolve this issue: waiting until the protection period has passed. Usually, the channel is initialized again automatically once this period is over. Otherwise, a reinitialization can be triggered from the administrative UI.
There is a very similar message after moving a channel from one server to another. In this case, further modifications are also blocked in order to allow for some clock skew between servers. In contrast to the issue described earlier, the protection period is very short in this case and the channel is typically put back in operation after less than 30 seconds.
When deleting data from a Cassandra database, this data is actually not deleted immediately. Instead, special markers (so-called tombstones) are inserted in order to mark the data as deleted. Due to how Cassandra works internally, these tombstones might not be present on all nodes when some of the nodes were down while the data was being deleted. In this case, it is important that the tombstones are replicated to these nodes before they can safely be removed (together with the data that has been marked as deleted).
How long tombstones are kept is configured in Cassandra by setting the GC grace period. It is very important that nodetool repair (which ensures consistent replication) is run more frequently than the time specified by the GC grace period. After the GC grace period has passed, a failed node must not be brought back online because this would result in deleted data suddenly reappearing, which in the context of the Cassandra PV Archiver could lead to data corruption.
When reading data, Cassandra has to keep all the tombstones it finds on the way, so that data presented by other nodes can be checked against these tombstones (because it might actually have been marked as deleted). Keeping track of these tombstones consumes memory on the coordinator node and affects performance, which is why Cassandra limits the number of tombstones that it allows before aborting a query. Even before hitting this limit, Cassandra starts logging a warning message to inform the user that a high number of tombstones has been detected. Such a message might look like “Read … live rows and … tombstone cells for query SELECT * FROM … WHERE server_id = … LIMIT 5000 (see tombstone_warn_threshold)”.
In the Cassandra PV Archiver, there are three tables where such a problem is likely to appear: the pending_channel_operations_by_server, channels, and channels_by_server tables. The pending_channel_operations_by_server table and (even though less likely) the channels_by_server table are affected when a large number of channels is modified, in particular when they are added or removed. The channels table might be affected when a large number of samples is deleted in a rather short period of time (typically because samples are archived at a very high data rate).
In general, reducing the GC grace period is a good way to avoid such a situation, but the GC grace period must only be reduced if anti-entropy repairs (nodetool repair) are run correspondingly more often.
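As a rough sketch, the GC grace period can be inspected and adjusted per table on the CQL shell (after switching to the keyspace used by the Cassandra PV Archiver). The value of 345600 seconds (four days) is only an example; the chosen value must always be larger than the interval at which nodetool repair is run:
-- Show the current table definition, which includes gc_grace_seconds.
DESCRIBE TABLE channels_by_server;
-- Example only: reduce the GC grace period to four days (345600 seconds).
ALTER TABLE channels_by_server WITH gc_grace_seconds = 345600;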
For problems with the pending_channel_operations_by_server table, there is a workaround that involves manually deleting all data from that table. Before using this workaround, one has to ensure that all Cassandra PV Archiver servers have been shut down for at least ten minutes (and stay shut down while applying the workaround) and that all Cassandra database nodes are up. One can then use the following statement on the CQL shell after switching to the keyspace used by the Cassandra PV Archiver:
TRUNCATE pending_channel_operations_by_server;
This statement deletes all data for this table, including all tombstones. This is why it is important that all Cassandra nodes are up and running. After applying this statement, the Cassandra PV Archiver servers can be started again.
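As an optional sanity check (a sketch, run in the same keyspace), the following query should return no rows after the truncation:
-- Should return no rows once the table has been truncated.
SELECT * FROM pending_channel_operations_by_server LIMIT 10;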
When this problem appears for the channels_by_server table, adding a new server and moving all channels from the affected server to the new server can help. After this, the affected server can be brought up again with a new UUID (the old UUID should not be reused in order to avoid hitting the problem again).
When this problem appears for the channels table, renaming the channel and then renaming it back to the original name might help. However, sometimes this workaround will not show any effect. In this case, one can only wait until the GC grace period has passed.
The Cassandra PV Archiver server (and Apache Cassandra, too) relies on well-synchronized server clocks. When the clock skew between servers is too large or when the clock of a server skips back in time, this results in an error message like “The system clock of this server is skewed by at least … ms compared to server … - shutting down now” or “System clock skipped back - shutting down now”. In this case, one should check the mechanism (typically NTP) that is used for synchronizing the server clocks.
A clock that is running ahead of the correct time should only be adjusted by slewing it, not by stepping it back to an earlier point in time. Stepping back to an earlier point in time is problematic because Apache Cassandra decides which update has been applied last by checking the time stamp associated with the update. This means that going back to an earlier time can result in data being written but being superseded by data that was actually written earlier, yet appears newer because it carries a more recent time stamp.
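The following CQL sketch illustrates this behavior with a purely hypothetical table (the table, values, and time stamps are not part of the Cassandra PV Archiver schema and only serve to demonstrate Cassandra’s last-write-wins resolution):
-- Hypothetical table, only used to illustrate time-stamp-based conflict resolution.
CREATE TABLE IF NOT EXISTS clock_skew_demo (id int PRIMARY KEY, value text);
-- First write, issued while the clock was still ahead (larger time stamp).
INSERT INTO clock_skew_demo (id, value) VALUES (1, 'written first') USING TIMESTAMP 2000;
-- Second write, issued after the clock stepped back (smaller time stamp).
INSERT INTO clock_skew_demo (id, value) VALUES (1, 'written second') USING TIMESTAMP 1000;
-- Returns 'written first': the update with the newer time stamp wins, even though it was written earlier.
SELECT value FROM clock_skew_demo WHERE id = 1;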
When trying to sign in to the administrative UI, one might get an error message like “You could not be signed in. Please check the username and password”. Typically, this message indicates that the username or password were wrong, but this message might also be displayed when they are actually correct. In this case, the reason is that the credentials cannot be verified because the server cannot read from the Cassandra database.
For this reason, when presumably correct credentials are rejected while trying to sign in, one should go to the dashboard of the administrative UI and verify that the server is actually connected to the Cassandra database cluster.
When one cannot sign in to the administrative UI any longer because the password has been lost, one might have to reset this password. This can be done by connecting to the Cassandra database with the CQL shell, switching to the keyspace used by the Cassandra PV Archiver, and issuing the following statement:
DELETE FROM generic_data_store WHERE component_id = ad5e517b-4ab6-4c4e-8eed-5d999de7484f AND item_key = 'admin' IF EXISTS;
This deletes the entry for the admin user from the database. As this user is always assumed to exist, even if it is not present in the database, the Cassandra PV Archiver server will assume that it again uses the default password admin.
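To verify that the entry has actually been removed, one can run the following query (in the same keyspace and with the same component_id and item_key as in the DELETE statement above); it should not return a row:
-- Should return no row once the admin entry has been deleted.
SELECT * FROM generic_data_store WHERE component_id = ad5e517b-4ab6-4c4e-8eed-5d999de7484f AND item_key = 'admin';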
After signing in using the default password, one can immediately change the password back to a secure one.