Runbooks¶

This documentation page contains runbooks for the alerts sent out by the Anapaya products.

Alerts¶

The names of the alerts follow a uniform structure to provide a quick insight into their meaning:

Namespace: This is the overall part that is having a problem.
Subsystem: Within the namespace what component is affected.
Subject: The subject that is having an issue.
State: The state of the subject, explaining in what way it is not conforming to expectations.

BGPDaemonDisconnected¶

In general, the gateway is connected to a BGP daemon (FRR) and uses it to publish the routes received via SGRP to the local network. This alert fires when the connection between the gateway and the BGP daemon breaks. Typically, this is caused by the BGP daemon being dead. Given that the appliance automatically restarts the crashed daemon, it is likely that the alert was triggered by the daemon crashlooping on startup.

Alternatively, the daemon may be stuck and thus the connection cannot be established.

When the alert is firing, the routes received via SGRP cannot be published to the local network. This means that the remote IP prefixes are not reachable any more through IP-in-SCION tunneling.

Actions¶

Look at the logs of the frr service to understand what the state of the daemon is.
Restart the service if needed.

For more detailed debugging, you can connect to FRR’s interactive console and refer to the official FRR documentation for further information.
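The checks above can be collected into one helper. This is a minimal sketch, not an Anapaya-provided tool: it assumes FRR runs as a systemd unit named `frr`, that its logs are available through `journalctl`, and that the user may run `vtysh` (this usually needs root or membership in the FRR vty group). The function name `frr_overview` and the `RUN` variable are hypothetical.

```shell
# Hypothetical helper: print the FRR service state, its recent logs, and the
# BGP session summary. Assumes a systemd unit named "frr" (an assumption).
# Set RUN=echo to print the commands instead of running them.
frr_overview() {
    ${RUN:-} systemctl status frr --no-pager
    ${RUN:-} journalctl -u frr --since "1 hour ago" --no-pager
    ${RUN:-} vtysh -c "show bgp summary"
}
```

Running `RUN=echo frr_overview` first shows exactly which commands would be executed on the host.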

BGPDaemonUnresponsive¶

In general, the gateway is connected to a BGP daemon (FRR) and uses it to publish the routes received via SGRP to the local network. This alert fires when the connection between the gateway and the BGP daemon exists, but the gateway cannot publish paths to the daemon. Most likely, the BGP daemon is stuck.

When the alert is firing, the routes received via SGRP cannot be published to the local network. This means that the remote IP prefixes are not reachable anymore through IP-in-SCION tunneling.

Actions¶

Look at the logs of the frr service to understand what the state of the daemon is.
Restart the service if needed.

For more detailed debugging, you can connect to FRR’s interactive console and refer to the official FRR documentation.

BGPDaemonDown¶

This alert fires when the BGP daemon (FRR) on the appliance is down.

Actions¶

Look at the logs of the frr service to understand why the daemon is not running.
Restart the service if needed.

For more detailed debugging, you can connect to FRR’s interactive console and refer to the official FRR documentation.

BGPPeerDown¶

This alert fires when a BGP peer of the appliance’s BGP daemon is down. If the appliance has multiple BGP peers configured, this might not be critical.

Actions¶

Check if the peer IP is reachable from the appliance using the IP ping command.

For more detailed debugging, you can connect to FRR’s interactive console and refer to the official FRR documentation.

BGPUnexportedRoutes¶

This alert fires when there is a mismatch between the number of routes the IP-in-SCION tunneling can route and what is advertised to BGP neighbors.

NOTE: If you have a custom FRR configuration and it is expected that certain neighbors do not receive all routes, this alert will always fire for those neighbors. It is recommended to silence the alert for those neighbors.

Actions¶

Run appliance-cli debug frr non-advertised-routes --fix on the affected host to automatically fix it. To run the fix only for a specific neighbor, use the --neighbor flag.

BGPUnexportedRoutesLegacy¶

Same as BGPUnexportedRoutes, but for appliance versions to which the fix for the alert has not yet been backported.

ClusterTopologySyncFetchError¶

This alert fires if there is an error when fetching topology information from a remote node.

Actions¶

Check the network connectivity between the two appliances using the IP ping command.
1. Check the logs of the appliance-controller to find information on the other appliance and its address.
2. If the appliance-controller logs do not have this information, check the cluster section of the appliance configuration for all the necessary information. Please refer to Cluster for further information on the cluster configuration fields.
```
appliance-cli get config
```
In case the connectivity between the two appliances does not work, then there might be an underlying network problem. Please check and troubleshoot the internal network setup to resolve the issue.
If the connectivity works, check the logs of the appliance-controller to further investigate this issue.

ClusterTopologySyncInterfaceMergeConflict¶

This alert fires if there are conflicts in the topology merge for the specified interface. This indicates a severe misconfiguration of appliances and means that multiple appliances have the same SCION interfaces configured.

Actions¶

Follow the instructions on how to fix a topology synchronization misconfiguration.

ClusterTopologySyncServiceMergeConflict¶

This alert fires if there is a conflict in the topology merge. It indicates that multiple appliances have the same address for the SCION control service configured.

Actions¶

Follow the instructions on how to fix a topology synchronization misconfiguration.

DataplaneControlSyncFailing¶

This alert fires if the dataplane control process could not apply the desired configuration. This should only happen after a new configuration has been pushed.

Actions¶

Check the logs of the dataplane-control process. Identify the problematic configuration and correct it.

ManagementSoftwareChecksumInconsistent¶

The appliance-controller periodically checks whether the installed software version is consistent with the version on the disk. This alert fires if the checksum of the installed software package does not match the checksum in the signatures file.

Actions¶

Check the checksum of the installed package version: appliance-cli info
Check if the signatures file for the same version exists (anapaya-scion-{version}.tar.gz.signatures.json):
```
ls -l /var/anapaya/installer/packages
```
In case the file exists, compare the sha256sum inside the file to the one from the first step.
If the file does not exist, upload the package signatures.
Compute the checksum of the package to check if it matches the one from above:
```
sha256sum anapaya-scion-{version}.tar.gz
```
If the checksum does not match, reinstall the package.
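The comparison in the steps above can be scripted. This is a minimal sketch, assuming the expected SHA-256 value has been copied by hand from the signatures file. The layout of that JSON file is not described here, so the sketch does not parse it. The function name `verify_package_checksum` is hypothetical.

```shell
# Hypothetical helper: compare the SHA-256 of a package file with an expected value.
# Usage: verify_package_checksum <package-file> <expected-sha256>
verify_package_checksum() {
    # sha256sum prints "<hash>  <file>"; keep only the hash.
    actual="$(sha256sum "$1" | awk '{print $1}')"
    if [ "$actual" = "$2" ]; then
        echo "OK"
    else
        echo "MISMATCH: got $actual"
        return 1
    fi
}
```

The function returns a non-zero status on a mismatch, so it can be used directly in a conditional before deciding to reinstall the package.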

ManagementSoftwareLicenseExpiresSoon¶

This alert fires if the license for the Anapaya software expires in less than 30 days.

Actions¶

Follow the Anapaya license installation guide to install a new license.

SCIONBeaconOriginationError¶

The beacon origination fails for a given SCION interface. The alert fires after a prolonged period of beacon origination errors. If beacons can’t be originated, connectivity to other ASes via the affected interface will eventually break.

Actions¶

Check the logs of the control service for the given ISD-AS. Look for the line Unable to originate on interface. It should contain more details about the problem.
In case beacon creation failed:
1. Make sure that the AS certificate is valid, through the /cppki/certificates API endpoint.
2. Check that time synchronization is working properly.
In case the creation of the beacon sender or the sending of the beacon failed, make sure that the connection to the neighboring AS is working properly by following the basic troubleshooting guide.

SCIONApplicationRegistrationError¶

SCION applications are having errors while registering with the dispatcher. If this issue persists, SCION control traffic will no longer be sent or received.

Actions¶

Check the logs of the dispatcher for the given appliance.
Restart the dispatcher.

SCIONBeaconPropagationError¶

The beacon propagation fails for a given SCION interface. The alert fires after a prolonged period of beacon propagation errors. Beacon propagation is needed to keep connectivity to downstream and sibling core ASes. This means that if beacon propagation does not work for an extended period of time, connectivity will eventually break.

Actions¶

Check the logs of the control service for the given ISD-AS. Look for the lines Unable to propagate beacons and Error propagating beacons on interface. These logs should contain more details about the problem.
In case the problem is related to beacon extension:
1. Make sure that the AS certificate is valid, through the /cppki/certificates API endpoint.
2. Check that time synchronization is working properly.
In case the creation of the beacon sender or the sending of the beacon failed, make sure that the connection to the neighboring AS is working properly by following the basic troubleshooting guide.

SCIONBeaconPropagationInternalError¶

This alert fires if there is an internal error when attempting to do beacon propagation. Beacon propagation is needed to keep connectivity to downstream and sibling core ASes. This means that if beacon propagation does not work for an extended period of time, connectivity will eventually break.

Since the error is internal, this indicates that there is an issue with the local beacon database.

Actions¶

Check the logs of the control service for the given ISD-AS. Look for the Unable to propagate beacons line. This log should contain more details about what exactly is the error.
The beacon database is in a docker volume. To check the underlying disk:
1. Find the name of the control service container.
2. Check the underlying disk using the following command (replace the <control-name> with the actual control service name):
```
findmnt $(docker inspect <control-name> -f '{{.GraphDriver.Data.MergedDir}}')
```
  In a normally working system, this should list a disk with options rw which means the disk is read-writeable. If that isn’t the case, the underlying system disk may have an issue.
Try to restart the control service.
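The read-write check on the findmnt output can be automated. This is a minimal sketch with a hypothetical function name, `mount_is_rw`. It only inspects a comma-separated mount option string, such as the one `findmnt -n -o OPTIONS` prints.

```shell
# Hypothetical helper: succeed if a comma-separated mount option list contains "rw".
# Usage: mount_is_rw "<options>"
mount_is_rw() {
    case ",$1," in
        *,rw,*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Example (run on the appliance; <control-name> as in the step above):
#   opts="$(findmnt -n -o OPTIONS "$(docker inspect <control-name> -f '{{.GraphDriver.Data.MergedDir}}')")"
#   mount_is_rw "$opts" && echo "disk is read-write" || echo "disk is NOT read-write"
```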

SCIONBeaconReceiveError¶

There is an increased number of errors in the process that receives beacons from neighboring ASes. This can be either because an AS is sending bogus beacons that cannot be verified or because the beacons cannot be stored locally.

Actions¶

Check the result label of the metric involved in the alert.
If there are a lot of err_verify counts, it means that a neighbor is sending bogus beacons that can’t be validated or verified:
1. Check the logs of the control service for the given ISD-AS.
2. Check that time synchronization is working properly.
3. Contact the neighboring AS to inquire why it is sending non-verifiable beacons.
If there are a lot of err_db or other error labels:
1. Enable debug logs on the control service for the given ISD-AS.
2. Check the logs of the control service for the given ISD-AS for the detailed errors. The errors can either be related to an issue with the disk, or the neighboring AS sending bogus beacons.
3. If the issue is related to the latter, contact the neighboring AS to inquire why it is sending non-verifiable beacons.

SCIONCAUnavailable¶

This alert fires if the given ISD-AS is unable to connect to the CA backend.

Actions¶

For the SCION CA backend provided by Anapaya:

Check if the ca-frontend service is running.
Get the ca-frontend management API address from the start of the log and check if the API is reachable using:
```
curl <management_addr>/api/v1/config
```

For any SCION CA backend:

If the API is reachable, check the connection to the CA backend. Depending on the CA backend used (Anapaya SCION CA or the respective third-party CA), try to investigate the connectivity issue using the respective documentation.
If you are using an external CA backend, verify the connectivity with the external service and check the configured credentials.

CertificateRenewalFailing¶

This alert fires if the renewal of AS certificates fails on the appliance.

Actions¶

For the SCION CA backend provided by Anapaya:

Check if the ca-frontend service is running.

For any SCION CA backend:

Check the connection to the CA backend. Depending on the CA backend used (Anapaya SCION CA or the respective third-party CA), try to investigate the connectivity issue using the respective documentation.
If you are using an external CA backend, verify the connectivity with the external service and check the configured credentials.

SCIONCryptoASCertificateExpiresSoon¶

If the AS certificate for the given ISD-AS expires in less than 48 hours, this alert is triggered. This indicates that the automatic renewal process failed to renew the AS certificate.

Renewal happens when the certificate has passed 3/4 of its lifetime. Therefore, if this alert triggers, the automatic renewal process failed to renew the AS certificate for at least six hours.

Actions¶

Check the Certificate/TRC Provisioning section to manually renew your AS certificate.
If the above step also fails, the error message gives a more detailed insight into the cause of the error.

SCIONCryptoTRCDiskWriteError¶

This alert fires when writing a TRC to the disk fails.

Actions¶

Check the logs of the control service.
Look for the logs Failed to write TRC to disk... or Failed to stat TRC file on disk... indicating that writing the TRC to the disk failed.
Check the appended error message to get the root cause of this issue.
Check the available disk space and whether the disk is mounted read-only. If either is the problem, address the cause to avoid further issues.

SCIONCryptoTRCExpiresSoon¶

This alert fires if the TRC for the given ISD-AS expires in less than 30 days.

Actions¶

If you are a non-voting party in the ISD, contact the responsible party of the TRC to check if there is an underlying issue.
If you are a voting party, follow the actions below.
1. Inspect the TRC to verify that it expires soon.
2. Initiate the TRC update process with a quorum of voting members. The details of this process depend on the governance rules of your respective ISD.

SCIONInterfaceStateDown¶

This alert fires when the SCION interface to the neighboring AS is down.

Actions¶

Check if the administrative_state for the given interface is set as expected in the appliance config:

appliance-cli get config -f 'body.config.scion.ases[isd_as == "<ISD-AS>"].neighbors[neighbor_isd_as == "<neighbor ISD-AS>"].interfaces[interface_id == <interface_id>]'

Check the actual state of the SCION interface:

appliance-cli get debug/scion/interfaces -f 'body.interfaces[local.interface_id == <interface_id>]'

If the states do not match, make sure your network interface configuration is correct.
```
appliance-cli get config
```
Determine whether the underlying IP connectivity is the issue. Get the local and remote interface addresses from the appliance configuration and use the IP ping command: ping <remote> -I <local>.
Check the Anapaya/router dashboard to see if BFD packets are being sent and received (BFD graphs).

SCIONInterfaceStateFlapping¶

The Bidirectional Forwarding Detection (BFD) protocol is used to continuously check the health of the links. If the BFD session state on a given interface changes too frequently, this alert is triggered.

Actions¶

Determine whether the underlying connection is the issue. Use the IP ping command with the option -i 0.01 to see if there are any anomalies when the BFD flap happens. You can get the destination address of the relevant neighbor mentioned in the alert from the appliance configuration.
```
appliance-cli get config
```
Consider relaxing the BFD timers by updating the detection_multiplier in the Bidirectional Forwarding Detection section of the appliance configuration.

SCIONNeighborPathsMissing¶

This alert fires when there is no SCION connectivity from this appliance to one of its direct neighbors.

Actions¶

Follow the basic troubleshooting guide to investigate why there is a SCION connectivity issue.

GatewayFlowsCloseToLimit¶

The number of flows that the gateway tracks is close to the limit. If the limit is reached, the gateway will stop tracking new flows.

Actions¶

If the increase of flows is expected, increase the limit in the appliance configuration.
If the increase of flows is not expected, investigate the reason for the increase.

GatewayFlowsNotExported¶

Flow metrics are not being exported. This means that the reports about gateway usage will be missing data, which, in turn, could lead to the loss of business revenue.

The data is kept for some time (30 minutes by default) so that the operator has the opportunity to resolve the issue without losing any data.

Actions¶

Check the gateway logs for errors related to exporting flow metrics.
Depending on the logs, try to resolve the issue.

GatewayNetlinkListenerNotSubscribed¶

The gateway is not subscribed to netlink route updates and can thus not learn and redistribute routes received from BGP peers. This indicates that something with the underlying Linux operating system is not working as expected.

Actions¶

Restart the gateway and check if the problem gets resolved.
If the problem persists, reboot the appliance:
```
sudo reboot
```

GatewayNetlinkListenerErrors¶

The gateway is subscribed to netlink route updates, but the netlink listener is missing route updates and might thus not correctly redistribute all route updates to its remotes. This indicates that something with the underlying Linux operating system is not working as expected.

Actions¶

Restart the gateway and check if the problem gets resolved.
If the problem persists, reboot the appliance:
```
sudo reboot
```

GatewayASCertificateExpiresSoon¶

The AS certificate of the gateway is expiring soon. Once the AS certificate expires the IP-in-SCION tunneling connectivity no longer works.

Actions¶

Check the Certificate/TRC Provisioning section to manually renew your AS certificate.
If the above step also fails, the error message gives a more detailed insight into the cause of the error.

SCIONSegmentRegistrationError¶

Received beacons are converted to segments and registered in the path segment database. Up and core segments are registered in the local database. Down segments are registered at the originating core AS. If this alert fires, then segment registration fails. First, check the segment type from the alert description and depending on it follow the actions below.

Actions for down segments¶

Check the logs of the control service for the given ISD-AS. Look for the line Unable to register segment. It contains more details about the problem.
Make sure that the connection to the originating core AS is working properly by following the basic troubleshooting guide.

Actions for up/core segments¶

This alert should not trigger for those segment types. Contact Anapaya Support.

SCIONSegmentRegistrationInternalError¶

This alert fires if there is an internal error when attempting to register a segment. Segment registration is needed to keep connectivity to other ASes. That means if segment registration does not work for an extended period of time, connectivity will eventually break.

Since the error is internal, this indicates that there is an issue with the local beacon or segment database.

Actions¶

Check the logs of the control service for the error to get more insight into what the error could be.
The beacon database is in a docker volume. To check the underlying disk:
1. Find the name of the control service container.
2. Check the underlying disk using the following command (replace the <control-name> with the actual control service name):
findmnt $(docker inspect <control-name> -f '{{.GraphDriver.Data.MergedDir }}')
In a normally working system, this should list a disk with options rw which means the disk is read-writeable. If that isn’t the case, the underlying system disk may have an issue.
Try to restart the control service.

SCIONSyncFailing¶

This alert fires if there has been no successful synchronization of the shard database in the last hour. The synchronization is part of the SCION control service for the given ISD-AS.

Actions¶

Look at the logs of the control service. You will find the reason for the failed synchronization in the logs. The process that needs to be followed to resolve the issue depends on the failure case.

SCIONSyncFetchFailing¶

This alert fires if the SCION control service fails to receive beacons or path segments from some of its peers.

Actions¶

Check the logs of the control service and look for the entry Fetch from network failed..., which will help to determine the underlying issue.
If the issue is network related, make sure the network connectivity is working as expected using the IP ping command for the destination mentioned in the logs.
In case of a parsing error, investigate why the other party is sending malformed objects.

SCIONSyncStoreFailing¶

This alert fires if the SCION control service fails to store objects which is equivalent to a database write error.

Actions¶

Check the logs of the control service and look for the entry Write to database failed..., which will help to determine the underlying issue. The error could be caused due to an issue with the disk.
If the error is related to the disk, check the underlying disk using the following command (replace the <control-name> with the actual control service name):
```
findmnt $(docker inspect <control-name> -f '{{.GraphDriver.Data.MergedDir}}')
```
In a normally working system, this should list a disk with options rw which means the disk is read-writeable. If that isn’t the case, the underlying system disk may have an issue.

ServiceApplianceControllerConfigInvalid¶

This alert triggers if the appliance successfully validated the configuration, but failed when applying it. This indicates a missing validation in the appliance and therefore that the appliance could not be correctly configured. The reason is not a misconfiguration, but the lack of validation in this case.

Contact Anapaya¶

In this case, please contact Anapaya Customer Support and provide the following information:

General information
Logs of the appliance-controller

ServiceApplianceControllerPanic¶

This alert fires when something in the appliance-controller went unexpectedly wrong and resulted in a panic.

The appliance-controller is the centerpiece of the Anapaya appliance. Based on the appliance configuration, the appliance-controller automatically configures all the required SCION infrastructure services.

Actions¶

Make sure the system is up and running. Check if the appliance-controller is running. The appliance-installer should have restarted it in case of a failure.
If the appliance-controller is not running, check the logs of the appliance-controller to get an insight into why it is crashing.
Check the release notes for any known issues with the installed release. If there are known issues, consider updating the host to the latest version to resolve the issue.

Contact Anapaya¶

Please contact Anapaya Customer Support and provide the following information:

Installed SCION package version: appliance-cli version
Logs of the appliance-controller
Logs of the appliance-installer

SystemClockSyncFailing¶

The system is not able to synchronize its clock. For the clock synchronization, the system service (systemd-timesyncd) is used as an NTP client.

Actions¶

Follow the timesyncd checks to check if the corresponding system service is running correctly.

SystemClockSkew¶

The system clock is skewed by more than 0.05 seconds.

Actions¶

Follow the timesyncd checks to check if the corresponding system service is running correctly.

SystemDiskSpaceBootLow¶

This alert fires when the free disk space for the boot partition (/boot) is less than 100MiB.

Actions¶

Analyze disk space usage for the /boot partition to find what takes up the most space.
Free up space using the APT related commands.
If the problem persists, consider the possibility that the machine is underprovisioned to meet the resource requirements. Contact Anapaya Customer Support with the hardware configuration.

SystemDiskSpaceRootLow¶

This alert fires when the free disk space for the root partition (/) is less than 10%.

Actions¶

Analyze disk space usage for the root (/) partition to find what takes up the most space.
Free up space using the APT related commands.
Remove unused docker images.
Increase size of root partition.
If the problem persists, consider the possibility that the machine is underprovisioned to meet the resource requirements. Contact Anapaya Customer Support with the hardware configuration.

SystemNTPSyncServiceNotRunning¶

This alert fires if the systemd-timesyncd service is not running. The service is used to synchronize the local system clock with a remote Network Time Protocol (NTP) server.

Actions¶

Follow the timesyncd checks to check if the corresponding system service is running correctly and restart it if required.

SystemResourcesAverageLoadHigh¶

This alert fires if the system is heavily loaded on average in the last 15 minutes. Make sure the system is running properly and try to determine if some processes are using more CPU than usual.

Configuration¶

Check the number of CPU cores:
```
nproc
```
If the number is two, make sure you have configured the appliance to not use any workers for the dataplane.

Actions¶

Identify the processes causing the high CPU load:

ps -eo pcpu,pid,user,args --sort -%cpu | head -10

Assess whether the process should be using that amount of resources. If this is not the case, one option is to restart the process. Note that this might cause a service interruption.
If the problem persists, consider the possibility that the machine is underprovisioned to meet the resource requirements. Contact Anapaya Customer Support with the hardware configuration.
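To put the load in relation to the core count, the 15-minute load average can be divided by the number of cores. This is a minimal sketch that reads the standard Linux `/proc/loadavg` file. The function name `load_per_core` is hypothetical, and the threshold at which the alert fires is not stated here.

```shell
# Hypothetical helper: print the 15-minute load average divided by the core count.
# The third field of /proc/loadavg is the 15-minute average.
# Usage: load_per_core [loadavg-file] [cores]
load_per_core() {
    awk -v cores="${2:-$(nproc)}" '{ printf "%.2f\n", $3 / cores }' "${1:-/proc/loadavg}"
}
```

As a rule of thumb, a value that stays above 1 means that, on average, more tasks were runnable than there are cores.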

SystemResourcesFreeMemoryLow¶

This alert fires if the system runs low on memory. Make sure the system is running properly and try to free up memory by cleaning caches and buffers.

Actions¶

Check the summary of RAM usage:
```
free -h
```

Identify the processes that require the most memory:

ps -eo pmem,pid,user,args --sort -%mem | head -10

Assess whether the process should be using that amount of resources. If this is not the case, one option is to restart the process. Note that this might cause a service interruption.
If the problem persists, consider the possibility that the machine is underprovisioned to meet the resource requirements. Contact Anapaya Customer Support with the hardware configuration.
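Because caches and buffers can be reclaimed by the kernel, the `MemAvailable` value is often a more useful figure than plain free memory. This is a minimal sketch that reads the standard Linux `/proc/meminfo` file. The function name `mem_available_pct` is hypothetical.

```shell
# Hypothetical helper: print MemAvailable as a percentage of MemTotal.
# Usage: mem_available_pct [meminfo-file]
mem_available_pct() {
    awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2} END { printf "%d\n", a * 100 / t }' "${1:-/proc/meminfo}"
}
```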

SystemServiceDown¶

This alert fires if the job specified in the alert cannot be reached by the monitoring system.

The appliance job works as a proxy for all other jobs, so when the alert concerns the appliance job it may be that there is overall no network connectivity from the monitoring system to the given host.

Otherwise, this alert means that one of the services on the Anapaya appliance is not working. Services are docker containers managed by the appliance itself. If the service fails, it will be automatically restarted. However, this alert indicates that the service is not available for a prolonged amount of time. Common cases where this alert fires are (1) when the service is in a crashloop trying to start up and failing repeatedly and (2) when the appliance notifications are disabled and the appliance-controller does not restart the containers.

Actions¶

If the problem is with the connectivity to the appliance, try to ping it from the monitoring host. You can find the address to ping in the instance label of the alert.
If the address is reachable, check the scrape_duration_seconds metric value, which helps to determine whether the scraping failed entirely or whether it is flapping. In the latter case, it may be a latency or a bandwidth problem.
If the connectivity is fine, check the logs of the failing service. If the service is repeatedly crashing you should be able to find the reason in the logs.
If there are no logs from the service it may also be that the appliance-controller is not able to start the service at all. Check the logs of the appliance-controller to find out why this is the case.
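Since the services are docker containers, their state and recent output can be inspected with the docker CLI. This is a minimal sketch, assuming that the docker CLI is available on the host, that the container name matches the service name, and that the container logs are reachable through `docker logs`. None of these is confirmed here. The function name `service_status` and the `DOCKER` variable are hypothetical.

```shell
# Hypothetical helper: show the status and the recent logs of a service container.
# DOCKER can be overridden, for example DOCKER=echo for a dry run.
# Usage: service_status <container-name>
service_status() {
    "${DOCKER:-docker}" ps -a --filter "name=$1" --format '{{.Names}}\t{{.Status}}'
    "${DOCKER:-docker}" logs --since 1h --tail 50 "$1"
}
```

A status that keeps showing a very short uptime across repeated calls is a typical sign of a crashloop.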

SystemServiceRestarted¶

This alert fires if the job specified in the alert has restarted recently. This may be a legitimate restart (e.g. because a new version of the service was installed or a configuration change required a restart) or it may be that the service crashed.

Note

If the service is functional after the restart, the alert will stop firing in a few minutes.

Actions¶

Check the logs of the service. You should be able to find the reason why the service exited.

SystemServiceDebugLogsEnabled¶

This alert fires if the job specified in the alert has been running in debug log mode for 24 hours. The debug log level should only be used for troubleshooting.

Actions¶

Revert the log level to the default info state for the given service.

SystemVPPMemoryUsageHigh¶

VPP preallocates memory on start up. This alert is triggered when 90% of the preallocated memory is used.

Actions¶

Look at the historic VPP memory usage.
1. It could have grown organically with increased usage, or it may have grown when a new feature was enabled.
2. If the memory usage slowly grows each time after the dataplane restart, it is probably a bug and Anapaya support should be contacted.
To get more information about the VPP memory usage, run:
```
appliance-cli get debug/vpp/memory
```

SystemVPPPacketReceiveErrors¶

This alert fires if a VPP interface is experiencing errors when receiving packets. This could be due to a faulty NIC, faulty cables, faulty SFP, wrong CRC, L1 configuration mismatch, etc.

Actions¶

Check the logs of the dataplane for error entries that highlight the issue.
Check the state of the interface in question using
```
appliance-cli get debug/vpp/hardware
```
Check the networking hardware itself for issues.

SystemVPPPacketTransmitErrors¶

This alert fires if a VPP interface is experiencing errors when transmitting packets due to NIC and/or carrier errors such as faulty cable, L1 configuration mismatch, etc.

Actions¶

Check the logs of the dataplane for error entries that highlight the issue.
Check the state of the interface in question using
```
appliance-cli get debug/vpp/hardware
```
Check the networking hardware itself for issues.

SystemVPPReceiveBuffersLow¶

This alert fires if VPP is running out of buffers. This is probably related to a bug in the software.

Actions¶

Escalate to Anapaya support.
To get more information about the VPP buffers, run:
```
appliance-cli get debug/vpp/buffers
```

SystemVPPReceiveQueueFull¶

This alert fires if a VPP interface is missing packets because the queue is full. This could be caused by VPP worker threads not being able to process the received traffic.

Actions¶

Check the VPP runtime data:
```
appliance-cli get debug/vpp/runtime
```
1. Look at the Vectors/Call column. This is the average number of packets processed in one go. The maximum number of packets in one batch is 256. Therefore, if the CPU is not keeping up with the traffic, the numbers will be close to 256.
2. If this is the case, either configure VPP to use more cores, or, if not possible, consider splitting the traffic between multiple machines.
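The "close to 256" check can be made explicit. This is a minimal sketch in which the value is read by hand from the Vectors/Call column, since the output format of the runtime command is not described here. The 0.9 default fraction is an assumption, not an Anapaya-defined threshold, and the function name `vectors_near_limit` is hypothetical.

```shell
# Hypothetical helper: succeed if a Vectors/Call value is within a given
# fraction (default 0.9, an assumption) of the 256-packet batch maximum.
# Usage: vectors_near_limit <vectors-per-call> [fraction]
vectors_near_limit() {
    awk -v v="$1" -v f="${2:-0.9}" 'BEGIN { exit !(v >= 256 * f) }'
}
```

For example, `vectors_near_limit 250 && echo "workers are saturated"` prints the message, while a value such as 12.5 does not.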

SystemVPPBuffersLowAvailability¶

This alert fires if the VPP process is missing buffers. This is probably because VPP did not allocate enough buffers for all its interfaces and workers.

Actions¶

Verify the number:
```
appliance-cli get debug/vpp/buffers
```
Allocate more buffers in hugepages. Note that this option needs at least version v0.34.3.

TunnelingReceivedInvalidPrefixes¶

This alert fires if a remote AS announced an invalid or non-canonical prefix. The prefix is ignored and the gateway otherwise works normally.

Actions¶

Inform the remote AS that they are trying to announce an invalid prefix.

TunnelingDomainNoAlivePaths¶

This alert fires if there are no alive paths for a specific domain and traffic matcher. Given the above, the packets that match the specified domain cannot be delivered to the destination.

Actions¶

Click on the Graph link in the alert. You will see four time series. Total is the number of paths to the relevant remote ASes as returned by the SCION daemon. Eligible is the number of those paths that pass the path filters. Monitored is the number of those paths that are chosen to be monitored by the gateway. Alive is the number of those paths on which the probes are passing through.
If total is zero, investigate why paths to the remote ASes are not present. This may be a network connectivity issue.
Otherwise, if eligible is zero, check the paths to the relevant remote ASes using scion showpaths. All those paths are filtered out by the path filters. Is that what is intended? If not, adapt the path filter. If so, find out why the paths that would pass the filters are missing.
Otherwise, if monitored is zero, this is probably a bug in the gateway. If there are eligible paths, at least some of them should be monitored. Contact Anapaya support.
Otherwise, if alive is zero, this means that the paths are monitored but the probes are not passing through. Investigate the network connectivity.
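The decision sequence above can be written down as a small helper. This is a minimal sketch: the four counts are read by hand from the graph, and the function name `diagnose_paths` and its messages are hypothetical.

```shell
# Hypothetical helper: map the four path counts from the alert graph to the next step.
# Usage: diagnose_paths <total> <eligible> <monitored> <alive>
diagnose_paths() {
    if [ "$1" -eq 0 ]; then
        echo "no paths: check why paths to the remote ASes are missing"
    elif [ "$2" -eq 0 ]; then
        echo "all paths filtered: review the path filters"
    elif [ "$3" -eq 0 ]; then
        echo "nothing monitored: likely a gateway bug, contact support"
    elif [ "$4" -eq 0 ]; then
        echo "probes failing: investigate network connectivity"
    else
        echo "alive paths present"
    fi
}
```

The checks run in the same order as the steps, so the first zero value determines the result.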

TunnelingDomainNoRemoteGateways¶

This alert means that there are paths to the remote ASes, but no remote gateways can be reached.

Check whether any remote gateways are announced by the relevant remote ASes:

appliance-cli get debug/scion-tunneling/discovery

If the gateways are announced but are not reachable, it may be that either the remote gateway instances are down or that there is a connectivity issue inside of the remote LAN.

TunnelingDomainHealthyPathsMissing¶

This alert fires if there are either no paths for a specific domain or if the paths that are available are not alive (probes don’t pass through). Given the above, the packets that match the specified domain cannot be delivered to the destination.

Actions¶

Find the available paths by running:

appliance-cli inspect scion-tunneling summary --all-paths \
  --domain <domain>

Inspect the individual paths. Some of them may be dead (no probes are passing through), or expired.
1. If the paths are dead, you should investigate the network connectivity.
2. If they are expired, you should investigate why the paths are not being updated.

TunnelingTrafficPolicyPathsExpiringSoon¶

This alert fires if the last remaining path for a specific domain and traffic matcher expires within three hours. After it expires the packets that match the domain and the traffic matcher cannot be delivered to the destination.

The expiration of paths is set to six hours by default, but paths should be updated much more often. If only three hours are remaining, it means there is a problem with refreshing paths.

Actions¶

Check whether there are other firing alerts, which should give more context on what the problem is.
Ensure that all services and especially the daemon or control services are running.
Check if there are problems with certificate renewal, which could have led to this issue.
1. Make sure that the AS certificate is valid, through the /cppki/certificates API endpoint.
2. If it is not, check the Certificate/TRC Provisioning section to manually renew your AS certificate.
3. Check that time synchronization is working properly.