This section gives some hints on how to fix certain problems that might appear while running the Cassandra PV Archiver server. Readers may skip this section and come back later in case they experience one of the problems.
Apache Cassandra limits the time that is spent trying to process a statement. When a statement cannot be processed within this time limit, it fails with a timeout error. Such an error may appear as a message like “Cassandra timeout during write query at consistency SERIAL…” (or a similar message) when trying to apply configuration changes, or as the error message of a channel that is in the error state.
Typically, statements time out because the Cassandra cluster is overloaded with requests and thus cannot process all of them in a timely manner. In this case, reducing the number of statements that are run in parallel can help alleviate the problem. When a write statement with a consistency level of SERIAL fails, this is most likely caused by the throttling.maxConcurrentChannelMetaDataWriteStatements option being set too high. Please refer to Section 3.3, “Throttling” for details regarding the throttling of concurrent statements.
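For example, assuming that the server reads its options from a YAML configuration file in which the dotted option path maps to nested keys (this is an assumption; adjust the snippet to however the option is set in your installation), the limit could be lowered as follows, where the value 10 is purely illustrative:
throttling:
  # Illustrative limit; reduce it until the SERIAL write timeouts disappear.
  maxConcurrentChannelMetaDataWriteStatements: 10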
Timeouts when reading data might also occur because of too many tombstones being present. In this case, there typically is a corresponding message in the log file of the Cassandra server. Please refer to Section 5.4, “Too many tombstones” for details about handling tombstones.
Channels can be listed in two ways: all channels in the cluster can be listed, or only the channels managed by a certain server can be listed. It can happen that these two lists get out of sync, so that channels are shown in the list of all channels, but not in the list for a specific server.
The reason for this is that the two lists are retrieved in different ways. The all channels list is generated by getting the channels from the database (technically speaking, there is a cache layer involved, but typically this layer is not responsible for the inconsistencies). The per-server list, on the other hand, is retrieved from the server’s in-memory configuration when the server is online.
When adding or removing channels fails, it can happen that the operation actually succeeded up to a point where the channel already exists in the database, but the server’s in-memory configuration has not been updated.
When a channel that has been removed still exists in the per-server list, but has been removed from the all channels list, forcing a reinitialization of the channel usually fixes the problem. When, on the other hand, a channel that has been added exists in the all channels list but is missing in the per-server list, the only way to solve this is by restarting the affected server.
Usually, either problem only occurs when some database operations fail due to a transient database problem or timeouts. Please refer to Section 5.1, “Timeouts” for more information about how to fix timeouts.
Some operations regarding channels (in particular configuration changes and the creation of new sample buckets) require special protection in order to avoid data corruption. Without this protection, data corruption could happen when the server crashes after the operation has started but before it has completed. Because of how Cassandra applies data changes and due to possible clock skew in distributed systems, this mechanism has to ensure that no other modification is attempted for a certain amount of time after such an operation failed.
This means that any further modifications (including the archiving of samples) are blocked for up to ten minutes after an operation has failed. When the channel is initialized during this period, it switches to the error state with an error message like “The channel cannot be initialized because an operation of type … is pending”. When trying to make changes to the channel’s configuration, a similar message is displayed.
There is only one way to resolve this issue: waiting until the protection period has passed. Usually, the channel is automatically initialized again after the period has passed. Otherwise, a reinitialization can be triggered from the administrative UI.
A very similar message may appear after moving a channel from one server to another. In this case, further modifications are also blocked in order to allow for some clock skew between servers. In contrast to the issue described earlier, the protection period is very short and the channel is typically put back into operation after less than 30 seconds.
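If in doubt which operations are still pending for a server, the corresponding table can be inspected from the CQL shell. The following query is only a sketch: it assumes that the pending_channel_operations_by_server table is partitioned by the server’s UUID in a server_id column (as the queries quoted in the Cassandra log suggest) and that the keyspace used by the Cassandra PV Archiver has already been selected:
-- Replace the zero UUID with the UUID of the affected server.
SELECT * FROM pending_channel_operations_by_server WHERE server_id = 00000000-0000-0000-0000-000000000000;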
When deleting data from a Cassandra database, this data is actually not deleted immediately. Instead, special markers (so-called tombstones) are inserted in order to mark the data as deleted. Due to how Cassandra works internally, these tombstones might not be present on all nodes when some of the nodes were down while the data was being deleted. In this case, it is important that the tombstones are replicated to these nodes before they can safely be removed (together with the data that has been marked as deleted).
How long tombstones are kept is configured in Cassandra through the GC grace period. It is very important that nodetool repair (which ensures consistent replication) is run more frequently than the time specified by the GC grace period. After the GC grace period has passed, a failed node must not be brought back online because this would result in deleted data suddenly reappearing, which in the context of the Cassandra PV Archiver could lead to data corruption.
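For example, a regularly scheduled anti-entropy repair, run on every node from a cron job or a similar scheduler, could look like the following sketch; the keyspace name pv_archive is only a placeholder for the keyspace actually used by the Cassandra PV Archiver:
# Run this on every Cassandra node, more frequently than the GC grace period.
nodetool repair pv_archive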
When reading data, Cassandra has to keep all the tombstones it encounters, so that data returned by other nodes can be checked against these tombstones (because that data might actually have been marked as deleted). Keeping track of these tombstones consumes memory on the coordinator node and affects performance, which is why Cassandra limits the number of tombstones that it allows before aborting a query. Even before hitting this limit, Cassandra starts logging a warning message to inform the user that a high number of tombstones has been detected. Such a message might look like “Read … live rows and … tombstone cells for query SELECT * FROM … WHERE server_id = … LIMIT 5000 (see tombstone_warn_threshold)”.
In the Cassandra PV Archiver, there are three tables where such a problem is likely to appear: the pending_channel_operations_by_server, channels, and channels_by_server tables. The pending_channel_operations_by_server table and (even though less likely) the channels_by_server table are affected when a large number of channels is modified, in particular when channels are added or removed. The channels table might be affected when a large number of samples is deleted within a rather short period of time (typically because samples are archived at a very high data rate).
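To check whether one of these tables is actually affected, the tombstone statistics reported by nodetool can be helpful. The following command is only a sketch (again using pv_archive as a placeholder for the actual keyspace name); the relevant figures are the average and maximum tombstones per slice in its output:
# On older Cassandra versions, the command is called nodetool cfstats instead.
nodetool tablestats pv_archive.pending_channel_operations_by_server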
In general, reducing the GC grace period helps to avoid such a situation, but the GC grace period must only be reduced if anti-entropy repairs are also run more frequently.
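As an illustration, the GC grace period is a per-table setting that can be changed from the CQL shell. The value of 259200 seconds (three days) below is only an example and must stay longer than the interval at which repairs are actually run on all nodes:
ALTER TABLE channels WITH gc_grace_seconds = 259200;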
For problems with the pending_channel_operations_by_server table, there is a workaround that involves manually deleting all data from that table. Before using this workaround, one has to ensure that all Cassandra PV Archiver servers have been shut down for at least ten minutes (and stay shut down while the workaround is applied) and that all Cassandra database nodes are up. One can then use the following statement on the CQL shell after switching to the keyspace used by the Cassandra PV Archiver:
TRUNCATE pending_channel_operations_by_server;
This statement deletes all data for this table, including all tombstones. This is why it is important that all Cassandra nodes are up and running. After applying this statement, the Cassandra PV Archiver servers can be started again.
When this problem appears for the channels_by_server table, adding a new server and moving all channels from the affected server to the new server can help. After this, the affected server can be brought up again with a new UUID (the old UUID should not be reused in order to avoid hitting the problem again).
When this problem appears for the channels table, renaming the channel and then renaming it back to the original name might help. However, sometimes this workaround will not show any effect. In this case, one can only wait until the GC grace period has passed.
The Cassandra PV Archiver server (and Apache Cassandra, too) relies on well-synchronized server clocks. When the clock skew between servers is too large or when the clock of a server skips back in time, this results in an error message like “The system clock of this server is skewed by at least … ms compared to server … - shutting down now” or “System clock skipped back - shutting down now”. In this case, one should check the mechanism (typically NTP) that is used for synchronizing the server clocks.
A clock that leaps forward should only be corrected by slewing it, not by stepping it back to an earlier point in time. Stepping back to an earlier point in time is problematic because Apache Cassandra decides which update has been applied last by checking the time stamp associated with the update. This means that going back to an earlier time can result in newly written data being superseded by data that was written earlier but appears newer because of its more recent time stamp.
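How exactly the clock synchronization can be checked depends on the tool in use. As a sketch, with a classic ntpd setup one might run the following command and verify that an upstream server has been selected (marked with an asterisk) and that the reported offset is small; other tools such as chrony provide comparable commands:
# Lists the NTP peers known to the local ntpd; the line starting with '*' is the selected time source.
ntpq -p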
When trying to sign in to the administrative UI, one might get an error message like “You could not be signed in. Please check the username and password”. Typically, this message indicates that the username or password were wrong, but this message might also be displayed when they are actually correct. In this case, the reason is that the credentials cannot be verified because the server cannot read from the Cassandra database.
For this reason, when presumably correct credentials are rejected while trying to sign in, one should go to the dashboard of the administrative UI and verify that the server is actually connected to the Cassandra database cluster.
When one cannot sign in to the administrative UI any longer because the password has been lost, one might have to reset this password. This can be done by connecting to the Cassandra database with the CQL shell, switching to the keyspace used by the Cassandra PV Archiver, and issuing the following statement:
DELETE FROM generic_data_store WHERE component_id = ad5e517b-4ab6-4c4e-8eed-5d999de7484f AND item_key = 'admin' IF EXISTS;
This deletes the entry for the admin user from the database. As this user is always assumed to exist, even if it is not in the database, the Cassandra PV Archiver server will assume that it uses the default password admin again. After signing in using the default password, one can immediately change the password back to a secure one.
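To verify that the entry has actually been removed before trying to sign in again, one can run a query like the following from the CQL shell (still using the keyspace of the Cassandra PV Archiver); it should not return any row:
SELECT * FROM generic_data_store WHERE component_id = ad5e517b-4ab6-4c4e-8eed-5d999de7484f AND item_key = 'admin';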