5. Troubleshooting

This section gives some hints on how to fix certain problems that might appear while running the Cassandra PV Archiver server. Readers may skip this section and come back to it later if they experience one of these problems.

5.1. Timeouts

Apache Cassandra limits the time that is spent trying to process a statement. When a statement cannot be processed within this time limit, it fails with a timeout error. Such an error might appear as a message like “Cassandra timeout during write query at consistency SERIAL…” (or a similar message), either displayed when trying to apply configuration changes or shown as the error message for a channel that is in the error state.

Typically, statements time out because the Cassandra cluster is overloaded with requests and thus cannot process all of them in a timely manner. In this case, reducing the number of statements that are run in parallel can help alleviate the problem. When a write statement with a consistency level of SERIAL fails, this is most likely caused by the throttling.maxConcurrentChannelMetaDataWriteStatements option being set too high. Please refer to Section 3.3, “Throttling” for details regarding the throttling of concurrent statements.
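
Assuming the server is configured through its YAML configuration file, lowering this limit might look like the following sketch. The option name is the one mentioned above; the exact file layout and the value shown are purely illustrative and should be checked against Section 3.3, “Throttling”:

# Reduce the number of channel meta-data write statements that may run in parallel.
# The value 16 is only an example; choose a value that matches the capacity of the
# Cassandra cluster.
throttling:
  maxConcurrentChannelMetaDataWriteStatements: 16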

Timeouts when reading data might also occur because too many tombstones are present. In this case, there typically is a corresponding message in the log file of the Cassandra server. Please refer to Section 5.4, “Too many tombstones” for details about handling tombstones.

5.2. Inconsistencies in the channels list

There are two ways in which channels can be listed: all channels in the cluster can be listed, or only the channels managed by a specific server can be listed. It can happen that these two lists get out of sync, so that channels are shown in the list of all channels, but not in the list for a specific server (or vice versa).

The reason for this is that the two lists are retrieved in different ways. The list of all channels is generated by reading the channels from the database (technically, there is a cache layer involved, but this layer is typically not responsible for the inconsistencies). The per-server list, on the other hand, is retrieved from the server’s in-memory configuration while the server is online.
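
In order to find out which of the two views is stale, it can help to look at what the database actually stores for the affected server. The following query is only a sketch: it assumes that the channels_by_server table holds the per-server channel configuration and is partitioned by a server_id column (as the query quoted in Section 5.4, “Too many tombstones” suggests), and it has to be run on the CQL shell after switching to the keyspace used by the Cassandra PV Archiver:

-- Replace 00000000-0000-0000-0000-000000000000 with the UUID of the affected server.
-- This only reads data and shows the per-server configuration as stored in the database.
SELECT * FROM channels_by_server
  WHERE server_id = 00000000-0000-0000-0000-000000000000;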

When adding or removing channels fails, it can happen that the operation actually succeeded up to the point where the channel already exists in (or has already been removed from) the database, but the server’s in-memory configuration has not been updated yet.

When a channel that has been removed still appears in the per-server list, but is gone from the list of all channels, forcing a reinitialization of the channel usually fixes the problem. When, on the other hand, a channel that has been added appears in the list of all channels but is missing from the per-server list, the only way to resolve this is restarting the affected server.

Usually, either problem only occurs when database operations fail due to a transient database problem or a timeout. Please refer to Section 5.1, “Timeouts” for more information about how to fix timeouts.

5.3. Pending channel operations

Some operations regarding channels (in particular configuration changes and the creation of new sample buckets) require special protection in order to avoid data corruption. Without this protection, data could be corrupted when a server crashes after such an operation has started but before it has completed. Because of how Cassandra applies data changes and because of possible clock skew in a distributed system, this mechanism has to ensure that no other modification is attempted for a certain amount of time after such an operation has failed.

This means that any further modifications (including the archiving of samples) are blocked for up to ten minutes after an operation has failed. When the channel is initialized while such an operation is pending, it switches to the error state with an error message like “The channel cannot be initialized because an operation of type … is pending”. When trying to make changes to the channel’s configuration, a similar message is displayed.

There is only one way to resolve this issue: waiting until the protection period has passed. Usually, the channel is automatically initialized again after the period has passed. If it is not, a reinitialization can be triggered from the administrative UI.

A very similar message might appear after moving a channel from one server to another. In this case, further modifications are also blocked in order to allow for some clock skew between the servers. In contrast to the issue described earlier, the protection period is very short in this case, and the channel is typically put back into operation in less than 30 seconds.
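
To check which operations are still recorded as pending for a server, the corresponding table can be inspected from the CQL shell. This is merely a diagnostic sketch under the assumption that the pending_channel_operations_by_server table (see Section 5.4, “Too many tombstones”) is partitioned by a server_id column; it only reads data and does not change anything:

-- Replace 00000000-0000-0000-0000-000000000000 with the UUID of the affected server.
-- Lists the operations that are still recorded as pending for that server.
SELECT * FROM pending_channel_operations_by_server
  WHERE server_id = 00000000-0000-0000-0000-000000000000;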

5.4. Too many tombstones

When data is deleted from a Cassandra database, it is not actually deleted immediately. Instead, special markers (so-called tombstones) are inserted in order to mark the data as deleted. Due to how Cassandra works internally, these tombstones might not be present on all nodes when some of the nodes were down while the data was being deleted. In this case, it is important that the tombstones are replicated to these nodes before they can safely be removed (together with the data that has been marked as deleted).

How long tombstones are kept is configured in Cassandra through the GC grace period. It is very important that nodetool repair (which ensures consistent replication) is run more frequently than the GC grace period. After the GC grace period has passed, a failed node must not be brought back online, because this would result in deleted data suddenly reappearing, which in the context of the Cassandra PV Archiver could lead to data corruption.

When reading data, Cassandra has to keep track of all the tombstones it encounters, so that data returned by other nodes can be checked against these tombstones (because that data might actually have been marked as deleted). Keeping track of these tombstones consumes memory on the coordinator node and hurts performance, which is why Cassandra limits the number of tombstones that it allows before aborting a query. Even before hitting this limit, Cassandra starts logging a warning message to inform the user that a high number of tombstones has been detected. Such a message might look like “Read … live rows and … tombstone cells for query SELECT * FROM … WHERE server_id = … LIMIT 5000 (see tombstone_warn_threshold)”.

In the Cassandra PV Archiver, there are three tables where such a problem is likely to appear: the pending_channel_operations_by_server, channels, and channels_by_server tables. The pending_channel_operations_by_server table and (though less likely) the channels_by_server table are affected when a large number of channels is modified, in particular when channels are added or removed. The channels table might be affected when a large number of samples is deleted within a rather short period of time (typically because samples are archived at a very high data rate).

In general, reducing the GC grace period is a good way to avoid such a situation, but the GC grace period must only be reduced if anti-entropy repairs (nodetool repair) are run correspondingly more often.
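
As a rough sketch, the GC grace period can be changed per table on the CQL shell using a standard ALTER TABLE statement. The channels_by_server table and the value of three days (in seconds) are only examples; such a setting is only safe if nodetool repair completes on all nodes more often than once within that period:

-- Set the GC grace period for the channels_by_server table to three days.
-- Only do this if anti-entropy repairs run more frequently than this period.
ALTER TABLE channels_by_server WITH gc_grace_seconds = 259200;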

For problems with the pending_channel_operations_by_server table, there is a workaround that involves manually deleting all data from that table. Before using this workaround, one has to ensure that all Cassandra PV Archiver servers have been shut down for at least ten minutes (and stay shut down while the workaround is applied) and that all Cassandra database nodes are up. One can then use the following statement on the CQL shell after switching to the keyspace used by the Cassandra PV Archiver:

TRUNCATE pending_channel_operations_by_server;

This statement deletes all data from this table, including all tombstones, which is why it is important that all Cassandra nodes are up and running. After applying this statement, the Cassandra PV Archiver servers can be started again.

When this problem appears for the channels_by_server table, adding a new server and moving all channels from the affected server to the new server can help. Afterwards, the affected server can be brought back up with a new UUID (the old UUID should not be reused in order to avoid hitting the problem again).

When this problem appears for the channels table, renaming the affected channel and then renaming it back to its original name might help. However, sometimes this workaround will not have any effect. In this case, one can only wait until the GC grace period has passed.

5.5. Too large clock skew

The Cassandra PV Archiver server (and Apache Cassandra, too) relies on well-synchronized server clocks. When the clock skew between servers is too large or when the clock of a server skips back in time, this results in an error message like “The system clock of this server is skewed by at least … ms compared to server … - shutting down now” or “System clock skipped back - shutting down now”. In this case, one should check the mechanism (typically NTP) that is used for synchronizing the server clocks.

A clock that has leaped ahead should only be corrected by slewing it, never by stepping it back to an earlier point in time. Stepping the clock back is problematic because Apache Cassandra decides which update was applied last by comparing the time stamps associated with the updates. Going back to an earlier time can therefore result in data being written but then being superseded by data that was actually written earlier, yet appears newer because it carries a more recent time stamp.

5.6. Credentials are not accepted

When trying to sign in to the administrative UI, one might get an error message like “You could not be signed in. Please check the username and password”. Typically, this message indicates that the username or password is wrong, but it might also be displayed when the credentials are actually correct. In this case, the reason is that the credentials cannot be verified because the server cannot read from the Cassandra database.

For this reason, when presumably correct credentials are rejected while trying to sign in, one should go to the dashboard of the administrative UI and verify that the server is actually connected to the Cassandra database cluster.

5.7. Resetting a lost password

When one can no longer sign in to the administrative UI because the password has been lost, the password has to be reset. This can be done by connecting to the Cassandra database with the CQL shell, switching to the keyspace used by the Cassandra PV Archiver, and issuing the following statement:

DELETE FROM generic_data_store WHERE
  component_id = ad5e517b-4ab6-4c4e-8eed-5d999de7484f AND
  item_key = 'admin'
  IF EXISTS;

This deletes the entry for the admin user from the database. As this user is always assumed to exist, even when it is not present in the database, the Cassandra PV Archiver server will fall back to the default password admin for it. After signing in with the default password, one should immediately change the password back to a secure one.
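
To verify that the entry has indeed been removed (or to check whether it exists in the first place), a query that simply mirrors the key columns of the DELETE statement above can be used; the table and column names are taken from that statement and are not an additional assumption:

-- Shows the stored entry for the admin user, if any.
-- After the DELETE statement above has been applied, this should return no rows.
SELECT * FROM generic_data_store WHERE
  component_id = ad5e517b-4ab6-4c4e-8eed-5d999de7484f AND
  item_key = 'admin';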