Troubleshooting & Runbooks¶

This documentation page contains runbooks for the alerts sent out by the Anapaya products as well as common operations that are helpful when troubleshooting.

Alerts¶

The names of the alerts follow a uniform structure to provide a quick insight into their meaning:

Namespace: This is the overall part that is having a problem.
Subsystem: Within the namespace what component is affected.
Subject: The subject that is having an issue.
State: The state of the subject, explaining in what way it is not conforming to expectations.

BGPDaemonDisconnected¶

In general, the gateway is connected to a BGP daemon (FRR) and uses it to publish the routes received via SGRP to the local network. This alert fires when the connection between the gateway and the BGP daemon breaks. Typically, this is caused by the BGP daemon being dead. Given that the appliance automatically restarts the crashed daemon, it is likely that the alert was triggered by the daemon crashlooping on startup.

Alternatively, the daemon may be stuck and thus the connection cannot be established.

When the alert is firing, the routes received via SGRP cannot be published to the local network. This means that the remote IP prefixes are not reachable any more through IP-in-SCION tunneling.

Actions¶

Look at the logs of the frr service to understand what the state of the daemon is.
Restart the service if needed.

For more detailed debugging, you can connect to FRR’s interactive console and refer to the official FRR documentation for further information.

BGPDaemonUnresponsive¶

In general, the gateway is connected to a BGP daemon (FRR) and uses it to publish the routes received via SGRP to the local network. This alert fires when the connection between the gateway and the BGP daemon exists, but the gateway cannot publish paths to the daemon. Most likely, the BGP daemon is stuck.

When the alert is firing, the routes received via SGRP cannot be published to the local network. This means that the remote IP prefixes are not reachable anymore through IP-in-SCION tunneling.

Actions¶

Look at the logs of the frr service to understand what the state of the daemon is.
Restart the service if needed.

For more detailed debugging, you can connect to FRR’s interactive console and refer to the official FRR documentation.

BGPDaemonDown¶

This alert fires when the BGP daemon (FRR) on the appliance is down.

Actions¶

Look at the logs of the frr service to understand why the daemon is not running.
Restart the service if needed.

For more detailed debugging, you can connect to FRR’s interactive console and refer to the official FRR documentation.

BGPPeerDown¶

This alert fires when a BGP peer of the appliance’s BGP daemon is down. If the appliance has multiple BGP peers configured, this might not be critical.

Actions¶

Check if the peer IP is reachable from the appliance using the IP ping command.

For more detailed debugging, you can connect to FRR’s interactive console and refer to the official FRR documentation.

ClusterTopologySyncFetchError¶

This alert fires if there is an error when fetching topology information from a remote node.

Actions¶

Check the network connectivity between the two appliances using the IP ping command.
1. Check the logs of the appliance-controller to find information on the other appliance and its address.
2. If the appliance-controller logs do not have this information, check the cluster section of the appliance configuration for all the necessary information. Please refer to Cluster for further information on the cluster configuration fields.
```
appliance-cli get config
```
In case the connectivity between the two appliances does not work, then there might be an underlying network problem. Please check and troubleshoot the internal network setup to resolve the issue.
If the connectivity works, check the logs of the appliance-controller to further investigate this issue.

ClusterTopologySyncInterfaceMergeConflict¶

This alert fires if there are conflicts in the topology merge for the specified interface. This indicates a severe misconfiguration of appliances and means that multiple appliances have the same SCION interfaces configured.

Actions¶

Follow the instructions on how to fix a topology synchronization misconfiguration.

ClusterTopologySyncServiceMergeConflict¶

This alert fires if there is a conflict in the topology merge. It indicates that multiple appliances have the same address for the SCION control service configured.

Actions¶

Follow the instructions on how to fix a topology synchronization misconfiguration.

DataplaneControlSyncFailing¶

This alert fires if the dataplane control process could not apply the desired configuration. This should only happen after a new configuration has been pushed.

Actions¶

Check the logs of the dataplane-control process. Identify the problematic configuration and correct it.

ManagementSoftwareChecksumInconsistent¶

The appliance-controller periodically checks whether the installed software version is consistent with the version on the disk. This alert fires if the checksum of the installed software package does not match the checksum in the signatures file.

Actions¶

Check the checksum of the installed package version: appliance-cli info
Check if the signatures file for the same version exists (anapaya-scion-{version}.tar.gz.signatures.json):
```
ls -l /var/anapaya/installer/packages
```
In case the file exists, compare the sha256sum inside the file to the one from the first step.
If the file does not exist, upload the package signatures.
Compute the checksum of the package to check if it matches the one from above:
```
sha256sum anapaya-scion-{version}.tar.gz
```
If the checksum does not match, reinstall the package.
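The comparison in the steps above can be scripted. The following is a minimal sketch and not part of the appliance tooling: `verify_sha256` is a hypothetical helper name, and the expected checksum is passed in by hand because the layout of the signatures file is not described here.

```shell
# verify_sha256 FILE EXPECTED
# Hypothetical helper: compares the SHA-256 checksum of FILE with EXPECTED
# (a hex string copied from the signatures file or from `appliance-cli info`).
verify_sha256() {
  actual=$(sha256sum "$1" | awk '{ print $1 }')
  if [ "$actual" = "$2" ]; then
    echo "match"
  else
    echo "mismatch: expected $2, got $actual"
    return 1
  fi
}

# Example (run in the directory that contains the package):
#   verify_sha256 anapaya-scion-{version}.tar.gz <checksum-from-signatures-file>
```

A non-zero exit status indicates a mismatch, in which case the package should be reinstalled as described above.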

SCIONBeaconOriginationError¶

The beacon origination fails for a given SCION interface. The alert fires after a prolonged period of beacon origination errors. If beacons can’t be originated, connectivity to other ASes via the affected interface will eventually break.

Actions¶

Check the logs of the control service for the given ISD-AS. Look for the line Unable to originate on interface. It should contain more details about the problem.
In case beacon creation failed:
1. Make sure that the AS certificate is valid, through the /cppki/certificates API endpoint.
2. Check that time synchronization is working properly.
In case the creation of the beacon sender or the sending of the beacon failed, make sure that the connection to the neighboring AS is working properly by following the basic troubleshooting guide.

SCIONApplicationRegistrationError¶

SCION applications are having errors while registering with the dispatcher. If this issue persists, SCION control traffic will no longer be sent or received.

Actions¶

Check the logs of the dispatcher for the given appliance.
Restart the dispatcher.

SCIONBeaconPropagationError¶

The beacon propagation fails for a given SCION interface. The alert fires after a prolonged period of beacon propagation errors. Beacon propagation is needed to keep connectivity to downstream and sibling core ASes. This means that if beacon propagation does not work for an extended period of time, connectivity will eventually break.

Actions¶

Check the logs of the control service for the given ISD-AS. Look for the lines Unable to propagate beacons and Error propagating beacons on interface. These logs should contain more details about the problem.
In case the problem is related to beacon extension:
1. Make sure that the AS certificate is valid, through the /cppki/certificates API endpoint.
2. Check that time synchronization is working properly.
In case the creation of the beacon sender or the sending of the beacon failed, make sure that the connection to the neighboring AS is working properly by following the basic troubleshooting guide.

SCIONBeaconPropagationInternalError¶

This alert fires if there is an internal error when attempting to do beacon propagation. Beacon propagation is needed to keep connectivity to downstream and sibling core ASes. This means that if beacon propagation does not work for an extended period of time, connectivity will eventually break.

Since the error is internal, this indicates that there is an issue with the local beacon database.

Actions¶

Check the logs of the control service for the given ISD-AS. Look for the Unable to propagate beacons line. This log should contain more details about what exactly the error is.
The beacon database is in a docker volume. To check the underlying disk:
1. Find the name of the control service container.
2. Check the underlying disk using the following command (replace the <control-name> with the actual control service name):
```
findmnt $(docker inspect <control-name> -f '{{ .GraphDriver.Data.MergedDir}}')
```
  In a normally working system, this should list a disk with options rw which means the disk is read-writeable. If that isn’t the case, the underlying system disk may have an issue.
Try to restart the control service.
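To script the read-write check described above, the mount options can be printed on their own and tested. This is a sketch under assumptions: `is_mounted_rw` is a hypothetical helper, and `-n -o OPTIONS` are standard `findmnt` flags added here to print only the options column.

```shell
# is_mounted_rw OPTIONS
# Hypothetical helper: succeeds if the comma-separated mount options
# (for example "rw,relatime") contain the "rw" flag.
is_mounted_rw() {
  case ",$1," in
    *,rw,*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (replace <control-name> with the actual control service name):
#   dir=$(docker inspect <control-name> -f '{{ .GraphDriver.Data.MergedDir }}')
#   is_mounted_rw "$(findmnt -n -o OPTIONS "$dir")" || echo "disk is not read-write"
```

The same check applies wherever this page asks you to confirm that the disk is listed with the rw option.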

SCIONBeaconReceiveError¶

There is an increased number of errors regarding the process that receives beacons from neighboring ASes. This can be either because an AS is sending bogus beacons that cannot be verified or because the beacons cannot be stored locally.

Actions¶

Check the result label of the metric involved in the alert.
If there are a lot of err_verify counts, it means that a neighbor is sending bogus beacons that can’t be validated or verified:
1. Check the logs of the control service for the given ISD-AS.
2. Check that time synchronization is working properly.
3. Contact the neighboring AS to inquire why it is sending non-verifiable beacons.
If there are a lot of err_db or other error labels:
1. Enable debug logs on the control service for the given ISD-AS.
2. Check the logs of the control service for the given ISD-AS for the detailed errors. The errors can either be related to an issue with the disk, or the neighboring AS sending bogus beacons.
3. If the issue is related to the latter, contact the neighboring AS to inquire why it is sending non-verifiable beacons.

SCIONCAUnavailable¶

This alert fires if the given ISD-AS is unable to connect to the CA backend.

Actions¶

For the SCION CA backend provided by Anapaya:

Check if the ca-frontend service is running.
Get the ca-frontend management API address from the start of the log and check if the API is reachable using:
```
curl <management_addr>/api/v1/config
```
If the API is reachable, check the connection to the CA backend. Depending on the CA backend used (Anapaya SCION CA or the respective third-party CA), try to investigate the connectivity issue using the respective documentation.
If you are using an external CA backend, verify the connectivity with the external service and check the configured credentials.

SCIONCryptoASCertificateExpiresSoon¶

If the AS certificate for the given ISD-AS expires in less than 48 hours, this alert is triggered. This indicates that the automatic renewal process failed to renew the AS certificate.

Renewal happens when the certificate has passed 3/4 of its lifetime. Therefore, if this alert triggers, the automatic renewal process failed to renew the AS certificate for at least six hours.

Actions¶

Check the Certificate/TRC Provisioning section to manually renew your AS certificate.
If the above step also fails, the error message gives a more detailed insight into the cause of the error.

SCIONCryptoTRCDiskWriteError¶

This alert fires when writing a TRC to the disk fails.

Actions¶

Check the logs of the control service.
Look for the logs Failed to write TRC to disk... or Failed to stat TRC file on disk... indicating that writing the TRC to the disk failed.
Check the appended error message to get the root cause of this issue.
Check the available disk space and whether the disk is mounted read-only. If that is the case, try to mitigate the cause to avoid any further problems.

SCIONCryptoTRCExpiresSoon¶

This alert fires if the TRC for the given ISD-AS expires in less than 30 days.

Actions¶

If you are a non-voting party in the ISD, contact the responsible party of the TRC to check if there is an underlying issue.
If you are a voting party, follow the actions below.
1. Inspect the TRC to verify that it expires soon.
2. Initiate the TRC update process with a quorum of voting members. The details of this process depend on the governance rules of your respective ISD.

SCIONInterfaceStateDown¶

This alert fires when the SCION interface to the neighboring AS is down.

Actions¶

Check if the administrative_state for the given interface is set as expected in the appliance config:

appliance-cli get config -f 'body.config.scion.ases[isd_as == "<ISD-AS>"].neighbors[neighbor_isd_as == "<neighbor ISD-AS>"].interfaces[interface_id == <interface_id>]'

Check the actual state of the SCION interface:

appliance-cli get debug/scion/interfaces -f 'body.interfaces[local.interface_id == <interface_id>]'

If the states do not match, make sure your network interface configuration is correct.
```
appliance-cli get config
```
Determine whether the underlying IP connectivity is the issue. Get the local and remote interface addresses from the appliance configuration and use the IP ping command: ping <remote> -I <local>.
Check the Anapaya/router dashboard to see if BFD packets are being sent and received (BFD graphs).

SCIONInterfaceStateFlapping¶

The Bidirectional Forwarding Detection (BFD) protocol is used to continuously check the health of the links. If the BFD session state on a given interface changes too frequently, this alert is triggered.

Actions¶

Determine whether the underlying connection is the issue. Use the IP ping command with the option -i 0.01 to see if there are any anomalies when the BFD flap happens. You can get the destination address of the relevant neighbor mentioned in the alert from the appliance configuration.
```
appliance-cli get config
```
Consider relaxing the BFD timers by updating the detection_multiplier in the Bidirectional Forwarding Detection section of the appliance configuration.
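To quantify anomalies from the fast ping run, the loss figure can be pulled out of the summary line that `ping` prints. This is a minimal sketch, assuming the iputils `ping` summary format; `packet_loss_percent` is a hypothetical helper name, and `-c 1000` is added here only to bound the run.

```shell
# packet_loss_percent: reads ping output on stdin and prints the
# packet loss percentage from the summary line.
packet_loss_percent() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example (very short intervals may require root privileges):
#   sudo ping -i 0.01 -c 1000 <neighbor-address> | packet_loss_percent
```

Running this while the BFD session flaps, and again while it is stable, gives two figures to compare.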

SCIONNeighborPathsMissing¶

This alert fires when there is no SCION connectivity from this appliance to one of its direct neighbors.

Actions¶

Follow the basic troubleshooting guide to investigate why there is a SCION connectivity issue.

GatewayFlowsNotExported¶

Flow metrics are not being exported. This means that the reports about gateway usage will be missing data, which, in turn, could lead to the loss of business revenue.

The data is kept for some time (30 minutes by default) so that the operator has the opportunity to resolve the issue without losing any data.

Actions¶

Check the gateway logs for errors related to exporting flow metrics.
Depending on the logs, try to resolve the issue.

GatewayNetlinkListenerNotSubscribed¶

The gateway is not subscribed to netlink route updates and can thus not learn and redistribute routes received from BGP peers. This indicates that something with the underlying Linux operating system is not working as expected.

Actions¶

Restart the gateway and check if the problem gets resolved.
If the problem persists, reboot the appliance:
```
sudo reboot
```

GatewayNetlinkListenerErrors¶

The gateway is subscribed to netlink route updates, but the netlink listener is missing route updates and might thus not correctly redistribute all route updates to its remotes. This indicates that something with the underlying Linux operating system is not working as expected.

Actions¶

Restart the gateway and check if the problem gets resolved.
If the problem persists, reboot the appliance:
```
sudo reboot
```

SCIONSegmentRegistrationError¶

Received beacons are converted to segments and registered in the path segment database. Up and core segments are registered in the local database. Down segments are registered at the originating core AS. This alert fires when segment registration fails. First, check the segment type in the alert description, then follow the matching actions below.

Actions for down segments¶

Check the logs of the control service for the given ISD-AS. Look for the line Unable to register segment. It contains more details about the problem.
Make sure that the connection to the originating core AS is working properly by following the basic troubleshooting guide.

Actions for up/core segments¶

This alert should not trigger for those segment types. Contact Anapaya Support.

SCIONSegmentRegistrationInternalError¶

This alert fires if there is an internal error when attempting to register a segment. Segment registration is needed to keep connectivity to other ASes. That means if segment registration does not work for an extended period of time, connectivity will eventually break.

Since the error is internal, this indicates that there is an issue with the local beacon or segment database.

Actions¶

Check the logs of the control service for error entries that give more insight into the cause.
The beacon database is in a docker volume. To check the underlying disk:
1. Find the name of the control service container.
2. Check the underlying disk using the following command (replace the <control-name> with the actual control service name):
findmnt $(docker inspect <control-name> -f '{{ .GraphDriver.Data.MergedDir }}')
In a normally working system, this should list a disk with options rw which means the disk is read-writeable. If that isn’t the case, the underlying system disk may have an issue.
Try to restart the control service.

SCIONSyncFailing¶

This alert fires if there has been no successful synchronization of the shard database in the last hour. The synchronization is part of the SCION control service for the given ISD-AS.

Actions¶

Look at the logs of the control service. You will find the reason for the failed synchronization in the logs. The process that needs to be followed to resolve the issue depends on the failure case.

SCIONSyncFetchFailing¶

This alert fires if the SCION control service fails to receive beacons or path segments from some of its peers.

Actions¶

Check the logs of the control service and look for the entry Fetch from network failed..., which will help to determine the underlying issue.
If the issue is network related, make sure the network connectivity is working as expected using the IP ping command for the destination mentioned in the logs.
In case of a parsing error, investigate why the other party is sending malformed objects.

SCIONSyncStoreFailing¶

This alert fires if the SCION control service fails to store objects, which is equivalent to a database write error.

Actions¶

Check the logs of the control service and look for the entry Write to database failed..., which will help to determine the underlying issue. The error could be caused by an issue with the disk.
If the error is related to the disk, check the underlying disk using the following command (replace the <control-name> with the actual control service name):
```
findmnt $(docker inspect <control-name> -f '{{ .GraphDriver.Data.MergedDir}}')
```
In a normally working system, this should list a disk with options rw which means the disk is read-writeable. If that isn’t the case, the underlying system disk may have an issue.

ServiceApplianceControllerConfigInvalid¶

This alert triggers if the appliance successfully validated the configuration, but failed when applying it. This indicates a missing validation in the appliance and therefore that the appliance could not be correctly configured. The reason is not a misconfiguration, but the lack of validation in this case.

Contact Anapaya¶

In this case, please contact Anapaya Customer Service and provide the following information:

General information
Logs of the appliance-controller

ServiceApplianceControllerPanic¶

This alert fires when something in the appliance-controller went unexpectedly wrong and resulted in a panic.

The appliance-controller is the centerpiece of the Anapaya appliance. Based on the appliance configuration, the appliance-controller automatically configures all the required SCION infrastructure services.

Actions¶

Make sure the system is up and running. Check if the appliance-controller is running. The appliance-installer should have restarted it in case of a failure.
If the appliance-controller is not running, check the logs of the appliance-controller to gain insight into why it is crashing.
Check the release notes for any known issues with the installed release. If there are known issues, consider updating the host to the latest version to resolve the issue.

Contact Anapaya¶

Please contact Anapaya Customer Support and provide the following information:

Installed SCION package version: appliance-cli version
Logs of the appliance-controller
Logs of the appliance-installer

SystemClockSyncFailing¶

The system is not able to synchronize its clock. For the clock synchronization, the system service (systemd-timesyncd) is used as an NTP client.

Actions¶

Follow the timesyncd checks to check if the corresponding system service is running correctly.
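As a quick first look before following the referenced checks, the synchronization state reported by systemd can be tested from a script. This is a sketch and not a replacement for those checks; `ntp_synchronized` is a hypothetical helper that reads the output of `timedatectl show`.

```shell
# ntp_synchronized: reads `timedatectl show` output on stdin and
# succeeds only if systemd reports the clock as NTP-synchronized.
ntp_synchronized() {
  grep -q '^NTPSynchronized=yes$'
}

# Example:
#   systemctl is-active systemd-timesyncd
#   timedatectl show | ntp_synchronized || echo "clock is not synchronized"
```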

SystemDiskSpaceBootLow¶

This alert fires when the free disk space for the boot partition (/boot) is less than 100MiB.

Actions¶

Analyze disk space usage for the /boot partition to find what takes up the most space.
Free up space using the APT-related commands.
If the problem persists, consider the possibility that the machine is underprovisioned to meet the resource requirements. Contact Anapaya Customer Support with the hardware configuration.

SystemDiskSpaceRootLow¶

This alert fires when the free disk space for the root partition (/) is less than 10%.

Actions¶

Analyze disk space usage for the root (/) partition to find what takes up the most space.
Free up space using the APT-related commands.
Remove unused docker images.
Increase size of root partition.
If the problem persists, consider the possibility that the machine is underprovisioned to meet the resource requirements. Contact Anapaya Customer Support with the hardware configuration.
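To compare the current state with the 10% threshold, the free percentage can be derived from `df`. This is a rough sketch: `free_percent_from_df` is a hypothetical helper, and the figure it prints may differ slightly from the metric behind the alert, since `df` rounds the value and accounts for reserved blocks.

```shell
# free_percent_from_df: reads `df -P <path>` output on stdin and prints
# 100 minus the Capacity (use%) column of the first filesystem line.
free_percent_from_df() {
  awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Example:
#   df -P / | free_percent_from_df
#   sudo du -xh --max-depth=1 / | sort -rh | head -10   # largest top-level directories
```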

SystemNTPSyncServiceNotRunning¶

This alert fires if the systemd-timesyncd service is not running. The service is used to synchronize the local system clock with a remote Network Time Protocol (NTP) server.

Actions¶

Follow the timesyncd checks to check if the corresponding system service is running correctly and restart it if required.

SystemResourcesAverageLoadHigh¶

This alert fires if the system is heavily loaded on average in the last 15 minutes. Make sure the system is running properly and try to determine if some processes are using more CPU than usual.

Configuration¶

Check the number of CPU cores:
```
nproc
```
If the number is two, make sure you have configured the appliance to not use any workers for the dataplane.

Actions¶

Identify the processes causing the high CPU load:

ps -eo pcpu,pid,user,args --sort -%cpu | head -10

Assess whether each process should be using that amount of resources. If this is not the case, one option is to restart the process. Note that this might cause a service interruption.
If the problem persists, consider the possibility that the machine is underprovisioned to meet the resource requirements. Contact Anapaya Customer Support with the hardware configuration.
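A load average is easier to judge relative to the number of cores. The sketch below divides the 15-minute load average by the core count; `load_per_core` is a hypothetical helper, and the threshold used by the alert is not stated here, so treat a sustained value well above 1.00 only as a rough indicator.

```shell
# load_per_core LOAD CORES
# Prints LOAD divided by CORES with two decimals.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Example (the third field of /proc/loadavg is the 15-minute average):
#   load_per_core "$(cut -d ' ' -f 3 /proc/loadavg)" "$(nproc)"
```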

SystemResourcesFreeMemoryLow¶

This alert fires if the system runs low on memory. Make sure the system is running properly and try to free up memory by cleaning caches and buffers.

Actions¶

Check the summary of RAM usage:
```
free -h
```

Identify the processes that require the most memory:

ps -eo pmem,pid,user,args --sort -%mem | head -10

Assess whether each process should be using that amount of resources. If this is not the case, one option is to restart the process. Note that this might cause a service interruption.
If the problem persists, consider the possibility that the machine is underprovisioned to meet the resource requirements. Contact Anapaya Customer Support with the hardware configuration.
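To put a number on how low memory is, the share of memory the kernel considers available can be read from `/proc/meminfo`. This is a minimal sketch; `mem_available_percent` is a hypothetical helper, and the alert's own threshold is not stated here.

```shell
# mem_available_percent: reads /proc/meminfo-style input on stdin and
# prints MemAvailable as an integer percentage of MemTotal.
mem_available_percent() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# Example:
#   mem_available_percent < /proc/meminfo
```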

SystemServiceDown¶

This alert fires if the job specified in the alert cannot be reached by the monitoring system.

The appliance job works as a proxy for all other jobs, so when the alert concerns the appliance job it may be that there is overall no network connectivity from the monitoring system to the given host.

Otherwise, this alert means that one of the services on the Anapaya appliance is not working. Services are docker containers managed by the appliance itself. If the service fails, it will be automatically restarted. However, this alert indicates that the service is not available for a prolonged amount of time. Common cases where this alert is fired are (1) when the service is in a crashloop trying to start up and failing repeatedly and (2) when the appliance notifications are disabled and the appliance-controller does not restart the containers.

Actions¶

If the problem is with the connectivity to the appliance you may try to ping it from the monitoring host. You can find the address to ping in the instance label of the alert.
If the address is reachable, you can check the value of the scrape_duration_seconds metric, which is helpful to determine whether the scraping failed entirely or whether it is flapping. In the latter case, it may be a latency or a bandwidth problem.
If the connectivity is fine, check the logs of the failing service. If the service is repeatedly crashing you should be able to find the reason in the logs.
If there are no logs from the service it may also be that the appliance-controller is not able to start the service at all. Check the logs of the appliance-controller to find out why this is the case.

SystemServiceRestarted¶

This alert fires if the job specified in the alert has restarted recently. This may be a legitimate restart (e.g. because a new version of the service was installed or a configuration change required a restart) or it may be that the service crashed.

Note

If the service is functional after the restart, the alert will stop firing in a few minutes.

Actions¶

Check the logs of the service. You should be able to find the reason why the service exited.

SystemServiceDebugLogsEnabled¶

This alert fires if the job specified in the alert has been running in debug log mode for 24 hours. The debug log level should only be used for troubleshooting.

Actions¶

Revert the log level to the default info state for the given service.

SystemVPPMemoryUsageHigh¶

VPP preallocates memory on start up. This alert is triggered when 90% of the preallocated memory is used.

Actions¶

Look at the historic VPP memory usage.
1. It could have grown organically with increased usage or it may have grown when a new feature was enabled.
2. If the memory usage slowly grows each time after the dataplane restart, it is probably a bug and Anapaya support should be contacted.
To get more information about the VPP memory usage, run:
```
appliance-cli get debug/vpp/memory
```

SystemVPPPacketReceiveErrors¶

This alert fires if a VPP interface is experiencing errors when receiving packets. This could be due to a faulty NIC, faulty cables, faulty SFP, wrong CRC, L1 configuration mismatch, etc.

Actions¶

Check the logs of the dataplane for error entries that highlight the issue.
Check the state of the interface in question using
```
appliance-cli get debug/vpp/hardware
```
Check the networking hardware itself for issues.

SystemVPPPacketTransmitErrors¶

This alert fires if a VPP interface is experiencing errors when transmitting packets due to NIC and/or carrier errors such as faulty cable, L1 configuration mismatch, etc.

Actions¶

Check the logs of the dataplane for error entries that highlight the issue.
Check the state of the interface in question using
```
appliance-cli get debug/vpp/hardware
```
Check the networking hardware itself for issues.

SystemVPPReceiveBuffersLow¶

This alert fires if VPP is running out of buffers. This is probably related to a bug in the software.

Actions¶

Escalate to Anapaya support.
To get more information about the VPP buffers, run:
```
appliance-cli get debug/vpp/buffers
```

SystemVPPReceiveQueueFull¶

This alert fires if a VPP interface is missing packets because the queue is full. This could be caused by VPP worker threads not being able to process the received traffic.

Actions¶

Check the VPP runtime data:
```
appliance-cli get debug/vpp/runtime
```
1. Look at the Vectors/Call column. This is the average number of packets processed in one go. The maximum number of packets in one batch is 256. Therefore, if the CPU is not keeping up with the traffic, the numbers will be close to 256.
2. If this is the case, either configure VPP to use more cores, or, if not possible, consider splitting the traffic between multiple machines.

TunnelingReceivedInvalidPrefixes¶

This alert fires if a remote AS announced an invalid or non-canonical prefix. The prefix is ignored and the gateway otherwise works normally.

Actions¶

Inform the remote AS that they are trying to announce an invalid prefix.

TunnelingDomainHealthyPathsMissing¶

This alert fires if there are either no paths for a specific domain or if the paths that are available are not alive (probes don’t pass through). Given the above, the packets that match the specified domain cannot be delivered to the destination.

Actions¶

Find the available paths by running:

appliance-cli inspect scion-tunneling summary --all-paths \
  --domain <domain>

Inspect the individual paths. Some of them may be dead (no probes are passing through), or expired.
1. If the paths are dead, you should investigate the network connectivity.
2. If they are expired, you should investigate why the paths are not being updated.

TunnelingTrafficPolicyPathsExpiringSoon¶

This alert fires if the last remaining path for a specific domain and traffic matcher expires within three hours. After it expires the packets that match the domain and the traffic matcher cannot be delivered to the destination.

The expiration of paths is set to six hours by default, but paths should be updated much more often. If only three hours are remaining, it means there is a problem with refreshing paths.

Actions¶

Check whether there are other firing alerts, which should give more context on what the problem is.
Ensure that all services and especially the daemon or control services are running.
Check if there are problems with certificate renewal, which could have led to this issue.
1. Make sure that the AS certificate is valid, through the /cppki/certificates API endpoint.
2. If it is not, check the Certificate/TRC Provisioning section to manually renew your AS certificate.
3. Check that time synchronization is working properly.

Common Operations¶

Gather appliance information¶

To collect appliance-related information for Anapaya Customer Support:

SSH to the given machine.
Collect general information by running:
```
appliance-cli info > appliance.info
```
Fetch the appliance configuration by running:
```
appliance-cli get config > config.json
```

Warning

The appliance config contains secrets, so please remove them before sending the information to anyone!

Gather general host information¶

To collect host-related information for Anapaya Customer Support:

SSH to the given machine.
Run
```
sudo lshw
```

Check docker services¶

To check whether the services (run as docker containers) are running:

SSH to the given machine

Use docker ps -a:

$ docker ps -a
CONTAINER ID  IMAGE                  COMMAND                 CREATED     STATUS     PORTS  NAMES
c718397beaf9  scion-all:v0.32.2      "/app/scion-all netw…"  7 days ago  Up 7 days         dataplane-control
5beecfb5d081  vpp-dataplane:v0.32.2  "/usr/bin/vpp -c /sh…"  7 days ago  Up 7 days         dataplane
...

The output of the command shows whether the service is up and for how long it has been running. If the service is up for a very short amount of time, there is a chance that it is crashlooping.
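On hosts with many containers, it can help to list only the services that are not up. This is a minimal sketch using the `--format` option of `docker ps`; `not_running` is a hypothetical helper name.

```shell
# not_running: reads "name<TAB>status" lines on stdin and prints the names
# of containers whose status does not start with "Up".
not_running() {
  awk -F '\t' '$2 !~ /^Up/ { print $1 }'
}

# Example:
#   docker ps -a --format '{{.Names}}\t{{.Status}}' | not_running
```

An empty result means every container reports an Up status. A container that is crashlooping can still show as Up with a very short uptime, so the uptime column remains worth checking.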

For further information please refer to the official Docker documentation.

Change log level¶

To change the log level to debug to gather more information when investigating an issue:

SSH to the given machine.
Run the following command to change the log level of a specific service to debug:
```
appliance-cli services log level <service-name> debug
```

Warning

Don’t forget to revert your changes after troubleshooting.

Inspect docker service logs¶

To inspect the logs of services running as docker containers:

SSH to the given machine.
If needed, you can use the following command to see the list of services:
```
docker ps -a
```
Inspect the logs by running the following command:
```
docker logs <service-name>
```

Note

To see only the recent logs use:

```
docker logs <service-name> --since=<time-duration>
```

For example, to check the logs of the last minute, run:

```
docker logs <service-name> --since=1m
```

Note

The logs are printed to stderr.

To save the logs in a file use:

```
docker logs <service-name> 2> <filename>
```

To grep through the logs use:

```
docker logs <service-name> 2>&1 | grep <query>
```

For further information please refer to the official Docker documentation.

Restart a service¶

To restart a service you can use the appliance-cli:

```
appliance-cli post debug/services/${service_name}/restart
```

where ${service_name} is the name of the service you want to restart. To get the possible values for the ${service_name}, use the following command:

```
appliance-cli get debug/services
```

Note

Alternatively, you can restart a service by running the following commands:

SSH to the given machine.
Run docker restart <service-name>.

Note

The Anapaya appliance restarts failed services automatically, so manual restarting is likely to be useful only when the service is stuck and/or unresponsive.

For further information, please refer to the official Docker documentation.

Clean up docker images¶

To remove docker images that are no longer used:

SSH to the given machine.
List all docker images by running:
```
docker image ls
```
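Remove old unused images by running the following command. By default it removes only dangling (untagged) images; add the -a flag to remove every image that no container uses: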
```
docker image prune
```

For further information please refer to the official Docker documentation.

Connect to the BGP daemon’s interactive console¶

To connect to the BGP daemon’s shell:

SSH to the given machine.
Open the interactive console by running:
```
docker exec -it frr vtysh
```

For further information on the console please refer to the official FRR documentation.

Check systemd services¶

To check if the systemd services are running:

SSH to the given machine.

Run systemctl list-units '<service-name>':

```
$ systemctl list-units 'appliance*'
UNIT                        LOAD   ACTIVE SUB     DESCRIPTION
appliance-host.service      loaded active running Anapaya Appliance Host Service
appliance-installer.service loaded active running Anapaya Appliance Installer
...
```

To get a more detailed overview of a specific service, use systemctl status <service-name>:

```
$ systemctl status appliance-installer.service
● appliance-installer.service - Anapaya Appliance Installer
   Loaded: loaded (/etc/systemd/system/appliance-installer.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2023-02-13 08:26:29 UTC; 7min ago
 Main PID: 166 (appliance-insta)
    Tasks: 13 (limit: 38262)
   CGroup: /system.slice/appliance-installer.service
           └─166 /usr/bin/appliance-installer --config /etc/anapaya/installer/appliance-installer.toml
...
```

With these commands you can see whether the service is active and running and for how long it has been running.
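To spot problem units without reading the whole table, the output can be filtered. The following is a minimal sketch under the assumption that every unit of interest is a long-running service that should be in the "active running" state (oneshot units that report "active exited" would also be listed). The function name `units_not_running` is hypothetical; the `--all`, `--plain` and `--no-legend` flags are standard systemctl options.

```shell
# Reads `systemctl list-units --plain --no-legend` output on stdin and prints
# every unit whose ACTIVE/SUB columns are not "active running".
units_not_running() {
  awk '$3 != "active" || $4 != "running" { print $1 ": " $3 " (" $4 ")" }'
}

# Usage on the appliance (--all also lists inactive units):
#   systemctl list-units --all --plain --no-legend 'appliance*' | units_not_running
```

An empty result means that all matching units are active and running.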

Note

A systemd service can be restarted using systemctl restart <service-name>.

Inspect systemd service logs¶

To view the systemd service logs:

SSH to the given machine.
If needed, you can list the appliance-related services by running:
```
systemctl list-units 'appliance*'
```
Inspect the logs by running:
```
journalctl -eu <service-name>
```

Note

To see only the recent logs use the --since flag. For example, to see only the logs from today use journalctl -eu <service-name> --since today.

To show the most recent 20 entries, use the -n 20 option.

Note

The logs are printed to stdout.

To save the logs in a file use:

```
journalctl -u <service-name> > <filename>
```

To grep through the logs use:

```
journalctl -eu <service-name> | grep <query>
```

Check the systemd-timesyncd service¶

The appliance uses the systemd-timesyncd service, which acts as an NTP client and connects to a pool of NTP servers for time synchronization. The following actions are starting points for troubleshooting systemd-timesyncd.service. For further information please refer to the official documentation.

SSH to the given machine.
Check if the system clock is synchronized and if the NTP service is active using:
```
timedatectl status
```
Check if the service is running.
Check the log of the service.

Restart the service:

```
systemctl restart systemd-timesyncd.service
```

Find the configured NTP servers:

```
grep NTP /etc/systemd/timesyncd.conf
```
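The synchronization check from the steps above can be automated, for example in a health-check script. The following is a minimal sketch: the function name `clock_is_synchronized` is hypothetical, and it assumes that `timedatectl status` reports the state on a line labelled "System clock synchronized" (or "NTP synchronized" on older systemd versions).

```shell
# Reads `timedatectl status` output on stdin and succeeds (exit code 0) only
# if the clock is reported as synchronized.
clock_is_synchronized() {
  grep -Eq '^[[:space:]]*(System clock|NTP) synchronized:[[:space:]]*yes[[:space:]]*$'
}

# Usage on the appliance:
#   if timedatectl status | clock_is_synchronized; then
#     echo "clock synchronized"
#   else
#     echo "clock NOT synchronized, check systemd-timesyncd.service"
#   fi
```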

Disk usage analysis¶

This section lists commands that help when investigating why a machine is running out of disk space.

SSH to the given machine.
Check the available space:
```
df -h <path>
```
Check the list of the current files:
```
ls -l <path>
```
The du command can be used to get a more detailed overview of which directory consumes how much space. You can vary the max-depth option or the starting directory. The grep filter keeps only entries whose size is in the megabyte or gigabyte range:
```
du -cha --max-depth=1 / | grep -E "^[0-9.,]+[MG]"
```

For further information about the du command please refer to the official documentation.
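When the goal is to find the biggest consumers quickly, sorting the du output is often more convenient than filtering it. The following is a minimal sketch; the function name `largest_entries` and its defaults are assumptions made for this example.

```shell
# Prints the N largest entries directly below a directory, largest first.
# Arguments: directory (default /) and number of lines to print (default 10).
# Sizes are reported in KiB so that a plain numeric sort is sufficient.
largest_entries() {
  le_dir="${1:-/}"
  le_count="${2:-10}"
  # -x keeps du on a single filesystem; errors for unreadable paths are hidden.
  du -xk --max-depth=1 "$le_dir" 2>/dev/null | sort -rn | head -n "$le_count"
}

# Usage on the appliance:
#   sudo sh -c '. ./disk.sh; largest_entries /var 5'   # if saved as disk.sh
#   largest_entries "$HOME" 5
```

The first line of the output is the total for the starting directory itself. Repeating the call on the largest subdirectory narrows the search down step by step.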

Clean up disk space¶

There are several ways to free up disk space. The options below are grouped by what is consuming the space.

Systemd journal logs¶

Check how much disk space the systemd journal logs use:
```
journalctl --disk-usage
```
Clear the logs that are older than 3 days:
```
sudo journalctl --vacuum-time=3d
```

Docker images¶

Remove unused docker images as described in Clean up docker images.

Fix topology synchronization error¶

Appliances in a cluster share their topology information with each other. This either happens statically through configuration or dynamically through an exchange protocol. For further information on how to configure topology synchronization in the appliance configuration, refer to Topology Synchronization. The instructions below should help to identify a misconfiguration.

Check the logs of the appliance-controller service. The logs should contain an error describing the misconfiguration.
Fix the misconfigured appliances and update them.

Inspect SCION paths used for IP-in-SCION tunneling¶

While troubleshooting SCION connectivity, it is often useful to check the available paths for each domain. This section provides an overview of how to achieve this.

SSH to the given machine.
Show the currently available paths for all domains and traffic matchers by running the following command. This also shows whether the path is alive, dead (no probes are passing through), expired or similar.
```
appliance-cli inspect scion-tunneling summary --all-paths
```

Show the currently used paths for a specific domain:

```
appliance-cli inspect scion-tunneling summary --all-paths \
  --domain <domain>
```

For the used paths for a specific traffic matcher within the given domain, run:

```
appliance-cli inspect scion-tunneling summary --all-paths \
  --domain <domain> --traffic-matcher <traffic matcher>
```

Ping the underlay network¶

When investigating an issue, it is often helpful to determine whether the underlying IP connectivity is the problem.

For further information, please refer to the official ping documentation.

Tip

The ping command runs indefinitely unless you limit the number of packets to send:

```
ping -c <number> <destination>
```

To change the source address, pass either the address or the interface name:

```
ping <destination> -I <interface/address>
```

The default time interval between successive packet transmissions is one second. You can specify a custom interval in seconds:

```
ping -i <interval> <destination>
```
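When ping is run from a script, the packet loss in the summary line is usually the value of interest. The following is a minimal sketch that assumes the summary format printed by the common Linux ping (iputils), for example "3 packets transmitted, 3 received, 0% packet loss, time 2003ms". The function name `packet_loss_percent` is hypothetical.

```shell
# Reads ping output on stdin and prints the packet loss percentage taken from
# the summary line. Prints nothing if no summary line is found.
packet_loss_percent() {
  sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage on the appliance:
#   ping -c 5 <destination> | packet_loss_percent
```

A value of 0 indicates that the underlay delivered every probe, while 100 means that no reply came back at all.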