Troubleshooting & Runbooks¶
This documentation page contains runbooks for the alerts sent out by the Anapaya products as well as common operations that are helpful when troubleshooting.
Alerts¶
The names of the alerts follow a uniform structure to provide a quick insight into their meaning:
Namespace: This is the overall part that is having a problem.
Subsystem: Within the namespace what component is affected.
Subject: The subject that is having an issue.
State: The state of the subject, explaining in what way it is not conforming to expectations.
BGPDaemonDisconnected¶
In general, the gateway is connected to a BGP daemon (FRR) and uses it to publish the routes received via SGRP to the local network. This alert fires when the connection between the gateway and the BGP daemon breaks. Typically, this is caused by the BGP daemon being dead. Given that the appliance automatically restarts the crashed daemon, it is likely that the alert was triggered by the daemon crashlooping on startup.
Alternatively, the daemon may be stuck and thus the connection cannot be established.
When the alert is firing, the routes received via SGRP cannot be published to the local network. This means that the remote IP prefixes are not reachable any more through IP-in-SCION tunneling.
Actions¶
Check the logs of the frr service to understand what state the daemon is in.
Restart the service if needed.
For more detailed debugging, you can connect to FRR’s interactive console and refer to the official FRR documentation for further information.
BGPDaemonUnresponsive¶
In general, the gateway is connected to a BGP daemon (FRR) and uses it to publish the routes received via SGRP to the local network. This alert fires when the connection between the gateway and the BGP daemon exists, but the gateway cannot publish paths to the daemon. Most likely, the BGP daemon is stuck.
When the alert is firing, the routes received via SGRP cannot be published to the local network. This means that the remote IP prefixes are not reachable anymore through IP-in-SCION tunneling.
Actions¶
Look at the logs of the frr service to understand what state the daemon is in.
Restart the service if needed.
For more detailed debugging, you can connect to FRR’s interactive console and refer to the official FRR documentation.
BGPDaemonDown¶
This alert fires when the BGP daemon (FRR) on the appliance is down.
Actions¶
Look at the logs of the frr service to understand why the daemon is not running.
Restart the service if needed.
For more detailed debugging, you can connect to FRR’s interactive console and refer to the official FRR documentation.
BGPPeerDown¶
This alert fires when a BGP peer of the appliance’s BGP daemon is down. If the appliance has multiple BGP peers configured, this might not be critical.
Actions¶
Check if the peer IP is reachable from the appliance using the IP ping command.
For more detailed debugging, you can connect to FRR’s interactive console and refer to the official FRR documentation.
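The two checks above could be combined into a small helper. This is only a sketch: the function name bgp_peer_check is hypothetical, the container name frr is assumed from the console command in the Common Operations section, and show bgp summary is a standard FRR vtysh command rather than anything specific to the appliance.

```shell
# Hypothetical helper: ping a BGP peer, then print FRR's view of its sessions.
# Assumes the FRR container is named "frr", as in the console section.
bgp_peer_check() {
    peer="$1"
    if ping -c 3 -W 2 "$peer" >/dev/null 2>&1; then
        echo "peer $peer: reachable"
    else
        echo "peer $peer: NOT reachable"
    fi
    # vtysh -c runs a single command non-interactively.
    docker exec frr vtysh -c 'show bgp summary'
}

# Usage (on the appliance): bgp_peer_check 192.0.2.1
```

If the peer is reachable but its session is not established in the summary, the FRR documentation is the next place to look.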
ClusterTopologySyncFetchError¶
This alert fires if there is an error when fetching topology information from a remote node.
Actions¶
Check the network connectivity between the two appliances using the IP ping command.
Check the logs of the appliance-controller to find information on the other appliance and its address.
If the appliance-controller logs do not have this information, check the cluster section of the appliance configuration for all the necessary information. Please refer to Cluster for further information on the cluster configuration fields.
appliance-cli get config
In case the connectivity between the two appliances does not work, then there might be an underlying network problem. Please check and troubleshoot the internal network setup to resolve the issue.
If the connectivity works, check the logs of the appliance-controller to further investigate this issue.
ClusterTopologySyncInterfaceMergeConflict¶
This alert fires if there are conflicts in the topology merge for the specified interface. This indicates a severe misconfiguration of appliances and means that multiple appliances have the same SCION interfaces configured.
Actions¶
Follow the instructions on how to fix a topology synchronization misconfiguration.
ClusterTopologySyncServiceMergeConflict¶
This alert fires if there is a conflict in the topology merge. It indicates that multiple appliances have the same address for the SCION control service configured.
Actions¶
Follow the instructions on how to fix a topology synchronization misconfiguration.
DataplaneControlSyncFailing¶
This alert fires if the dataplane control process could not apply the desired configuration. This should only happen after a new configuration has been pushed.
Actions¶
Check the logs of the dataplane-control process. Identify the problematic configuration and correct it.
ManagementSoftwareChecksumInconsistent¶
The appliance-controller periodically checks whether the installed software version is consistent with the version on the disk. This alert fires if the checksum of the installed software package does not match the checksum in the signatures file.
Actions¶
Check the checksum of the installed package version:
appliance-cli info
Check if the signatures file for the same version exists (anapaya-scion-{version}.tar.gz.signatures.json):
ls -l /var/anapaya/installer/packages
In case the file exists, compare the sha256sum inside the file to the one from the first step.
If the file does not exist, upload the package signatures.
Compute the checksum of the package to check if it matches the one from above:
sha256sum anapaya-scion-{version}.tar.gz
If the checksum does not match, reinstall the package.
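The comparison in the steps above can be scripted. The following is a minimal sketch: verify_sha256 is a hypothetical helper name, and the expected checksum is passed in by hand because the internal layout of the signatures file is not described here.

```shell
# Hypothetical helper: compare a file's SHA-256 digest with an expected value.
# The expected value is supplied manually (e.g. copied from the signatures file).
verify_sha256() {
    file="$1"
    expected="$2"
    actual="$(sha256sum "$file" | awk '{print $1}')"
    if [ "$actual" = "$expected" ]; then
        echo "OK: checksum matches"
    else
        echo "MISMATCH: expected $expected, got $actual"
        return 1
    fi
}

# Usage: verify_sha256 anapaya-scion-{version}.tar.gz <sha256-from-signatures-file>
```

The function returns a non-zero status on a mismatch, so it can be used in a shell condition.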
SCIONBeaconOriginationError¶
The beacon origination fails for a given SCION interface. The alert fires after a prolonged period of beacon origination errors. If beacons can’t be originated, connectivity to other ASes via the affected interface will eventually break.
Actions¶
Check the logs of the control service for the given ISD-AS. Look for the line Unable to originate on interface. It should contain more details about the problem.
In case beacon creation failed:
Make sure that the AS certificate is valid, through the /cppki/certificates API endpoint.
Check that time synchronization is working properly.
In case the creation of the beacon sender or the sending of the beacon failed, make sure that the connection to the neighboring AS is working properly by following the basic troubleshooting guide.
SCIONApplicationRegistrationError¶
SCION applications are having errors while registering with the dispatcher. If this issue persists, SCION control traffic will no longer be sent or received.
Actions¶
Check the logs of the dispatcher for the given appliance.
SCIONBeaconPropagationError¶
The beacon propagation fails for a given SCION interface. The alert fires after a prolonged period of beacon propagation errors. Beacon propagation is needed to keep connectivity to downstream and sibling core ASes. This means that if beacon propagation does not work for an extended period of time, connectivity will eventually break.
Actions¶
Check the logs of the control service for the given ISD-AS. Look for the lines Unable to propagate beacons and Error propagating beacons on interface. These logs should contain more details about the problem.
In case the problem is related to beacon extension:
Make sure that the AS certificate is valid, through the /cppki/certificates API endpoint.
Check that time synchronization is working properly.
In case the creation of the beacon sender or the sending of the beacon failed, make sure that the connection to the neighboring AS is working properly by following the basic troubleshooting guide.
SCIONBeaconPropagationInternalError¶
This alert fires if there is an internal error when attempting to do beacon propagation. Beacon propagation is needed to keep connectivity to downstream and sibling core ASes. This means that if beacon propagation does not work for an extended period of time, connectivity will eventually break.
Since the error is internal, this indicates that there is an issue with the local beacon database.
Actions¶
Check the logs of the control service for the given ISD-AS. Look for the Unable to propagate beacons line. This log should contain more details about the exact error.
The beacon database is in a docker volume. To check the underlying disk:
Find the name of the control service container.
Check the underlying disk using the following command (replace <control-name> with the actual control service name):
findmnt $(docker inspect <control-name> -f '{{ .GraphDriver.Data.MergedDir }}')
In a normally working system, this should list a disk with the option rw, which means the disk is mounted read-write. If that isn’t the case, the underlying system disk may have an issue.
Try to restart the control service.
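The rw check on the findmnt output can also be done programmatically. This is a sketch under assumptions: mount_is_rw is a hypothetical helper, and it relies on findmnt's -n (no header) and -o OPTIONS (print only the mount options) flags to obtain a comma-separated option list.

```shell
# Hypothetical helper: succeed if a comma-separated mount option list contains "rw".
mount_is_rw() {
    case ",$1," in
        *,rw,*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Usage (on the appliance), printing only the OPTIONS column of findmnt:
#   opts="$(findmnt -no OPTIONS "$(docker inspect <control-name> -f '{{ .GraphDriver.Data.MergedDir }}')")"
#   mount_is_rw "$opts" || echo "disk is not mounted read-write: $opts"
```

Wrapping the option list in commas ensures that only a standalone rw option matches, not an option that merely ends in rw.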
SCIONBeaconReceiveError¶
There is an increased number of errors in the process that receives beacons from neighboring ASes. This can be either because an AS is sending bogus beacons that cannot be verified or because the beacons cannot be stored locally.
Actions¶
Check the result label of the metric involved in the alert.
If there are a lot of err_verify counts, it means that a neighbor is sending bogus beacons that can’t be validated or verified:
Check the logs of the control service for the given ISD-AS.
Check that time synchronization is working properly.
Contact the neighboring AS to inquire why it is sending non-verifiable beacons.
If there are a lot of err_db or other error labels:
Enable debug logs on the control service for the given ISD-AS.
Check the logs of the control service for the given ISD-AS for the detailed errors. The errors can either be related to an issue with the disk, or the neighboring AS sending bogus beacons.
If the issue is related to the latter, contact the neighboring AS to inquire why it is sending non-verifiable beacons.
SCIONCryptoASCertificateExpiresSoon¶
If the AS certificate for the given ISD-AS expires in less than 48 hours, this alert is triggered. This indicates that the automatic renewal process failed to renew the AS certificate.
Renewal happens when the certificate has passed 3/4 of its lifetime. Therefore, if this alert triggers, the automatic renewal process failed to renew the AS certificate for at least six hours.
Actions¶
Check the Certificate/TRC Provisioning section to manually renew your AS certificate.
If the above step also fails, the error message gives a more detailed insight into the cause of the error.
SCIONCryptoTRCDiskWriteError¶
This alert fires when writing a TRC to the disk fails.
Actions¶
Check the logs of the control service.
Look for the logs Failed to write TRC to disk... or Failed to stat TRC file on disk... indicating that writing the TRC to the disk failed.
Check the appended error message to get the root cause of this issue.
Check the disk space and if the disk is configured as read-only. If that is the case, try to mitigate the cause to avoid any further problems.
SCIONCryptoTRCExpiresSoon¶
This alert fires if the TRC for the given ISD-AS expires in less than 30 days.
Actions¶
If you are a non-voting party in the ISD, contact the responsible party of the TRC to check if there is an underlying issue.
If you are a voting party, follow the actions below.
Inspect the TRC to verify that it expires soon.
Initiate the TRC update process with a quorum of voting members. The details of this process depend on the governance rules of your respective ISD.
SCIONInterfaceStateDown¶
This alert fires when the SCION interface to the neighboring AS is down.
Actions¶
Check if the administrative_state for the given interface is set as expected in the appliance config:
appliance-cli get config -f 'body.config.scion.ases[isd_as == "<ISD-AS>"].neighbors[neighbor_isd_as == "<neighbor ISD-AS>"].interfaces[interface_id == <interface_id>]'
Check the actual state of the SCION interface:
appliance-cli get debug/scion/interfaces -f 'body.interfaces[local.interface_id == <interface_id>]'
If the states do not match, make sure your network interface configuration is correct.
appliance-cli get config
Determine whether the underlying IP connectivity is the issue. Get the local and remote interface addresses from the appliance configuration and use the IP ping command:
ping <remote> -I <local>
Check the Anapaya/router dashboard to see if BFD packets are being sent and received (BFD graphs).
SCIONInterfaceStateFlapping¶
The Bidirectional Forwarding Detection (BFD) protocol is used to continuously check the health of the links. If the BFD session state on a given interface changes too frequently, this alert is triggered.
Actions¶
Determine whether the underlying connection is the issue. Use the IP ping command with the option -i 0.01 to see if there are any anomalies when the BFD flap happens. You can get the destination address of the relevant neighbor mentioned in the alert from the appliance configuration:
appliance-cli get config
Consider relaxing the BFD timers by updating the detection_multiplier in the Bidirectional Forwarding Detection section of the appliance configuration.
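To quantify anomalies during a fast ping run, the packet loss figure can be pulled out of ping's summary line. The following is a sketch: ping_loss_percent is a hypothetical helper, the -c 1000 packet count in the usage line is an arbitrary choice, and depending on the ping version such short intervals may require elevated privileges.

```shell
# Hypothetical helper: extract the packet loss percentage from ping's summary
# line, e.g. "1000 packets transmitted, 998 received, 0.2% packet loss, time 10012ms".
ping_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage: ping -i 0.01 -c 1000 <remote> | ping_loss_percent
```

Loss that coincides with the time of a BFD flap points at the underlying link rather than at the BFD configuration.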
SCIONNeighborPathsMissing¶
This alert fires when there is no SCION connectivity from this appliance to one of its direct neighbors.
Actions¶
Follow the basic troubleshooting guide to investigate why there is a SCION connectivity issue.
GatewayFlowsNotExported¶
Flow metrics are not being exported. This means that the reports about gateway usage will be missing data, which, in turn, could lead to the loss of business revenue.
The data is kept for some time (30 minutes by default) so that the operator has the opportunity to resolve the issue without losing any data.
Actions¶
Check the gateway logs for an exporting flow metrics error.
Depending on the logs, try to resolve the issue.
GatewayNetlinkListenerNotSubscribed¶
The gateway is not subscribed to netlink route updates and can thus not learn and redistribute routes received from BGP peers. This indicates that something with the underlying Linux operating system is not working as expected.
Actions¶
Restart the gateway and check if the problem gets resolved.
If the problem persists, reboot the appliance:
sudo reboot
GatewayNetlinkListenerErrors¶
The gateway is subscribed to netlink route updates, but the netlink listener is missing route updates and might thus not correctly redistribute all route updates to its remotes. This indicates that something with the underlying Linux operating system is not working as expected.
Actions¶
Restart the gateway and check if the problem gets resolved.
If the problem persists, reboot the appliance:
sudo reboot
SCIONSegmentRegistrationError¶
Received beacons are converted to segments and registered in the path segment database. Up and core segments are registered in the local database. Down segments are registered at the originating core AS. If this alert fires, then segment registration fails. First, check the segment type from the alert description and depending on it follow the actions below.
Actions for down segments¶
Check the logs of the control service for the given ISD-AS. Look for the line Unable to register segment. It contains more details about the problem.
Make sure that the connection to the originating core AS is working properly by following the basic troubleshooting guide.
Actions for up/core segments¶
This alert should not trigger for those segment types. Contact Anapaya Support.
SCIONSegmentRegistrationInternalError¶
This alert fires if there is an internal error when attempting to register a segment. Segment registration is needed to keep connectivity to other ASes. That means if segment registration does not work for an extended period of time, connectivity will eventually break.
Since the error is internal, this indicates that there is an issue with the local beacon or segment database.
Actions¶
Check the logs of the control service for the error to get more insight into what the error could be.
The beacon database is in a docker volume. To check the underlying disk:
Find the name of the control service container.
Check the underlying disk using the following command (replace <control-name> with the actual control service name):
findmnt $(docker inspect <control-name> -f '{{ .GraphDriver.Data.MergedDir }}')
In a normally working system, this should list a disk with the option rw, which means the disk is mounted read-write. If that isn’t the case, the underlying system disk may have an issue.
Try to restart the control service.
SCIONSyncFailing¶
This alert fires if there has been no successful synchronization of the shard database in the last hour. The synchronization is part of the SCION control service for the given ISD-AS.
Actions¶
Look at the logs of the control service. You will find the reason for the failed synchronization in the logs. The process that needs to be followed to resolve the issue depends on the failure case.
SCIONSyncFetchFailing¶
This alert fires if the SCION control service fails to receive beacons or path segments from some of its peers.
Actions¶
Check the logs of the control service and look for the entry Fetch from network failed..., which will help to determine the underlying issue.
If the issue is network related, make sure the network connectivity is working as expected using the IP ping command for the destination mentioned in the logs.
In case of a parsing error, investigate why the other party is sending malformed objects.
SCIONSyncStoreFailing¶
This alert fires if the SCION control service fails to store objects which is equivalent to a database write error.
Actions¶
Check the logs of the control service and look for the entry Write to database failed..., which will help to determine the underlying issue. The error could be caused by an issue with the disk.
If the error is related to the disk, check the underlying disk using the following command (replace <control-name> with the actual control service name):
findmnt $(docker inspect <control-name> -f '{{ .GraphDriver.Data.MergedDir }}')
In a normally working system, this should list a disk with the option rw, which means the disk is mounted read-write. If that isn’t the case, the underlying system disk may have an issue.
ServiceApplianceControllerConfigInvalid¶
This alert triggers if the appliance successfully validated the configuration, but failed when applying it. This indicates a missing validation in the appliance and therefore that the appliance could not be correctly configured. The reason is not a misconfiguration, but the lack of validation in this case.
Contact Anapaya¶
In this case, please contact Anapaya Customer Service and provide the following information:
Logs of the appliance-controller
ServiceApplianceControllerPanic¶
This alert fires when something in the appliance-controller went unexpectedly wrong and resulted in a panic.
The appliance-controller is the centerpiece of the Anapaya appliance. Based on the appliance configuration, the appliance-controller automatically configures all the required SCION infrastructure services.
Actions¶
Make sure the system is up and running. Check if the appliance-controller is running. The appliance-installer should have restarted it in case of a failure.
If the appliance-controller is not running, check the logs of the appliance-controller to get an insight into why it is crashing.
Check the release notes for any known issues with the installed release. If there are known issues, consider updating the host to the latest version to resolve the issue.
Contact Anapaya¶
Please contact Anapaya Customer Support and provide the following information:
SystemClockSyncFailing¶
The system is not able to synchronize its clock. For the clock synchronization, the system service (systemd-timesyncd) is used as an NTP client.
Actions¶
Follow the timesyncd checks to check if the corresponding system service is running correctly.
SystemDiskSpaceBootLow¶
This alert fires when the free disk space for the boot partition (/boot) is less than 100MiB.
Actions¶
Analyze disk space usage for the /boot partition to find what takes up the most space.
Free up space using the APT related commands.
If the problem persists, consider the possibility that the machine is underprovisioned to meet the resource requirements. Contact Anapaya Customer Support with the hardware configuration.
SystemDiskSpaceRootLow¶
This alert fires when the free disk space for the root partition (/) is less than 10%.
Actions¶
Analyze disk space usage for the root (/) partition to find what takes up the most space.
Free up space using the APT related commands.
Increase size of root partition.
If the problem persists, consider the possibility that the machine is underprovisioned to meet the resource requirements. Contact Anapaya Customer Support with the hardware configuration.
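As a rough illustration of the disk space analysis, the free percentage can be derived from the POSIX-format output of df. This is a sketch: df_free_percent is a hypothetical helper, and the value it prints is only an approximation of what the alert evaluates, since df rounds the used percentage.

```shell
# Hypothetical helper: read `df -P <path>` output on stdin and print the
# percentage of free space (100 minus the "Capacity" column).
df_free_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Usage:
#   df -P / | df_free_percent                      # the alert threshold for / is 10% free
#   sudo du -xh --max-depth=1 / | sort -h | tail   # largest top-level directories on /
```

The du invocation uses -x to stay on the root filesystem, so other mounted partitions do not distort the result.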
SystemNTPSyncServiceNotRunning¶
This alert fires if the systemd-timesyncd service is not running. The service is used to synchronize the local system clock with a remote Network Time Protocol (NTP) server.
Actions¶
Follow the timesyncd checks to check if the corresponding system service is running correctly and restart it if required.
SystemResourcesAverageLoadHigh¶
This alert fires if the system is heavily loaded on average in the last 15 minutes. Make sure the system is running properly and try to determine if some processes are using more CPU than usual.
Configuration¶
Check the number of CPU cores:
nproc
If the number is two, make sure you have configured the appliance to not use any workers for the dataplane.
Actions¶
Identify the processes causing the high CPU load:
ps -eo pcpu,pid,user,args --sort -%cpu | head -10
Assess if the process should be using this amount of resources. If this is not the case, one option is to restart the process. Note that this might cause a service interruption.
If the problem persists, consider the possibility that the machine is underprovisioned to meet the resource requirements. Contact Anapaya Customer Support with the hardware configuration.
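A common rule of thumb is to compare the load average with the number of CPU cores. The sketch below does that; load_exceeds_cores is a hypothetical helper, and the "load greater than core count" threshold is an assumption for illustration, not the rule the alert actually uses.

```shell
# Hypothetical helper: succeed if the load average exceeds the number of cores.
# The "load > cores" threshold is an assumption, not the alert's actual rule.
load_exceeds_cores() {
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load + 0 > cores + 0) }'
}

# Usage: the third field of /proc/loadavg is the 15-minute load average.
#   load_exceeds_cores "$(cut -d ' ' -f 3 /proc/loadavg)" "$(nproc)" && echo "load is high"
```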
SystemResourcesFreeMemoryLow¶
This alert fires if the system runs low on memory. Make sure the system is running properly and try to free up memory by cleaning caches and buffers.
Actions¶
Check the summary of RAM usage:
free -h
Identify the processes that require the most memory:
ps -eo pmem,pid,user,args --sort -%mem | head -10
Assess if the process should be using this amount of resources. If this is not the case, one option is to restart the process. Note that this might cause a service interruption.
If the problem persists, consider the possibility that the machine is underprovisioned to meet the resource requirements. Contact Anapaya Customer Support with the hardware configuration.
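To put a single number on the memory situation, the available share can be computed from /proc/meminfo. This is a sketch: mem_available_percent is a hypothetical helper, and it uses the kernel's MemAvailable estimate, which may differ from the metric the alert is based on.

```shell
# Hypothetical helper: read /proc/meminfo-style text on stdin and print the
# percentage of memory that is available (MemAvailable / MemTotal), truncated.
mem_available_percent() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# Usage: mem_available_percent < /proc/meminfo
```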
SystemServiceDown¶
This alert fires if the job specified in the alert cannot be reached by the monitoring system.
The appliance job works as a proxy for all other jobs, so when the alert concerns the appliance job it may be that there is overall no network connectivity from the monitoring system to the given host.
Otherwise, this alert means that one of the services on the Anapaya appliance is not working. Services are docker containers managed by the appliance itself. If the service fails, it will be automatically restarted. However, this alert indicates that the service is not available for a prolonged amount of time. Common cases where this alert fires are (1) when the service is in a crashloop, trying to start up and failing repeatedly, and (2) when the appliance notifications are disabled and the appliance-controller does not restart the containers.
Actions¶
If the problem is with the connectivity to the appliance, you may try to ping it from the monitoring host. You can find the address to ping in the instance label of the alert.
If the address is reachable, you can check the scrape_duration_seconds metric value, which helps to determine whether the scraping failed entirely or whether it is flapping. In the latter case, it may be a latency or a bandwidth problem.
If the connectivity is fine, check the logs of the failing service. If the service is repeatedly crashing you should be able to find the reason in the logs.
If there are no logs from the service it may also be that the appliance-controller is not able to start the service at all. Check the logs of the appliance-controller to find out why this is the case.
SystemServiceRestarted¶
This alert fires if the job specified in the alert has restarted recently. This may be a legitimate restart (e.g. because a new version of the service was installed or a configuration change required a restart) or it may be that the service crashed.
Note
If the service is functional after the restart, the alert will stop firing in a few minutes.
Actions¶
Check the logs of the service. You should be able to find the reason why the service exited.
SystemServiceDebugLogsEnabled¶
This alert fires if the job specified in the alert is running in debug log mode for 24 hours. The debug log level should only be used for troubleshooting.
Actions¶
Revert the log level to the default info state for the given service.
SystemVPPMemoryUsageHigh¶
VPP preallocates memory on start up. This alert is triggered when 90% of the preallocated memory is used.
Actions¶
Look at the historic VPP memory usage.
It could have grown organically with increased usage, or it may have grown when a new feature was enabled.
If the memory usage slowly grows each time after the dataplane restart, it is probably a bug and Anapaya support should be contacted.
To get more information about the VPP memory usage, run:
appliance-cli get debug/vpp/memory
SystemVPPPacketReceiveErrors¶
This alert fires if a VPP interface is experiencing errors when receiving packets. This could be due to a faulty NIC, faulty cables, faulty SFP, wrong CRC, L1 configuration mismatch, etc.
Actions¶
Check the logs of the dataplane for error entries that highlight the issue.
Check the state of the interface in question using
appliance-cli get debug/vpp/hardware
Check the networking hardware itself for issues.
SystemVPPPacketTransmitErrors¶
This alert fires if a VPP interface is experiencing errors when transmitting packets due to NIC and/or carrier errors such as faulty cable, L1 configuration mismatch, etc.
Actions¶
Check the logs of the dataplane for error entries that highlight the issue.
Check the state of the interface in question using
appliance-cli get debug/vpp/hardware
Check the networking hardware itself for issues.
SystemVPPReceiveBuffersLow¶
This alert fires if VPP is running out of buffers. This is probably related to a bug in the software.
Actions¶
Escalate to Anapaya support.
To get more information about the VPP buffers, run:
appliance-cli get debug/vpp/buffers
SystemVPPReceiveQueueFull¶
This alert fires if a VPP interface is missing packets because the queue is full. This could be caused by VPP worker threads not being able to process the received traffic.
Actions¶
Check the VPP runtime data:
appliance-cli get debug/vpp/runtime
Look at the Vectors/Call column. This is the average number of packets processed in one go.
The maximum number of packets in one batch is 256. Therefore, if the CPU is not keeping up with the traffic, the numbers will be close to 256.
If this is the case, either configure VPP to use more cores or, if that is not possible, consider splitting the traffic between multiple machines.
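The "close to 256" judgment can be expressed as a simple check. This is a sketch: vectors_near_max is a hypothetical helper, and treating 90% of the maximum batch size as "close" is an assumed threshold, not a documented one.

```shell
# Hypothetical helper: succeed if a Vectors/Call value is close to the batch
# maximum of 256. "Close" means >= 90% of 256 here, which is an assumed threshold.
vectors_near_max() {
    awk -v v="$1" 'BEGIN { exit !(v + 0 >= 0.9 * 256) }'
}

# Usage: vectors_near_max 251.7 && echo "worker is saturated"
```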
TunnelingReceivedInvalidPrefixes¶
This alert fires if a remote AS announced an invalid or non-canonical prefix. The prefix is ignored and the gateway otherwise works normally.
Actions¶
Inform the remote AS that they are trying to announce an invalid prefix.
TunnelingDomainHealthyPathsMissing¶
This alert fires if there are either no paths for a specific domain or if the paths that are available are not alive (probes don’t pass through). Given the above, the packets that match the specified domain cannot be delivered to the destination.
Actions¶
Find the available paths by running:
appliance-cli inspect scion-tunneling summary --all-paths --domain <domain>
Inspect the individual paths. Some of them may be dead (no probes are passing through), or expired.
If the paths are dead, you should investigate the network connectivity.
If they are expired, you should investigate why the paths are not being updated.
TunnelingTrafficPolicyPathsExpiringSoon¶
This alert fires if the last remaining path for a specific domain and traffic matcher expires within three hours. After it expires the packets that match the domain and the traffic matcher cannot be delivered to the destination.
The expiration of paths is set to six hours by default, but paths should be updated much more often. If only three hours are remaining, it means there is a problem with refreshing paths.
Actions¶
Check whether there are other firing alerts, which should give more context on what the problem is.
Ensure that all services, and especially the daemon or control services, are running.
Check if there are problems with certificate renewal, which could have led to this issue.
Make sure that the AS certificate is valid, through the /cppki/certificates API endpoint.
If it is not, check the Certificate/TRC Provisioning section to manually renew your AS certificate.
Check that time synchronization is working properly.
Common Operations¶
Gather appliance information¶
To collect appliance-related information to provide it to the Anapaya Customer Support:
SSH to the given machine.
Collect general information by running:
appliance-cli info > appliance.info
Fetch the appliance configuration by running:
appliance-cli get config > config.json
Warning
The appliance config contains secrets, so please remove them before sending the information to anyone!
Gather general host information¶
To collect host-related information to provide it to the Anapaya Customer Support:
SSH to the given machine.
Run
sudo lshw
Check docker services¶
To check whether the services (run as docker containers) are running:
SSH to the given machine
Use docker ps -a:
$ docker ps -a
CONTAINER ID   IMAGE                   COMMAND                  CREATED      STATUS      PORTS   NAMES
c718397beaf9   scion-all:v0.32.2       "/app/scion-all netw…"   7 days ago   Up 7 days           dataplane-control
5beecfb5d081   vpp-dataplane:v0.32.2   "/usr/bin/vpp -c /sh…"   7 days ago   Up 7 days           dataplane
...
The output of the command shows whether the service is up and for how long it has been running. If the service is up for a very short amount of time, there is a chance that it is crashlooping.
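Spotting a possible crashloop can be automated by filtering on the STATUS column. The following is a sketch: suspect_containers is a hypothetical helper, and the "up for less than a minute" criterion is an assumption about what counts as a very short uptime.

```shell
# Hypothetical helper: read "<name><TAB><status>" lines on stdin and print the
# containers that are restarting, exited, or have been up for under a minute.
suspect_containers() {
    awk -F '\t' '$2 ~ /^(Restarting|Exited)/ || $2 ~ /^Up (Less than a second|[0-9]+ seconds?)/ { print $1 }'
}

# Usage: docker ps -a --format '{{.Names}}\t{{.Status}}' | suspect_containers
```

A container that shows up here repeatedly across several runs is a likely crashloop candidate, and its logs are the next thing to check.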
For further information please refer to the official Docker documentation.
Change log level¶
To change the log level to debug to gather more information when investigating an issue:
SSH to the given machine
Run the following command to change the log level of a specific service to debug:
appliance-cli services log level <service-name> debug
Warning
Don’t forget to revert your changes after troubleshooting.
Inspect docker service logs¶
To inspect the logs of services running as docker containers:
SSH to the given machine.
If needed, you can use the following command to see the list of services:
docker ps -a
Inspect the logs by running the following command:
docker logs <service-name>
Note
To see only the recent logs use:
docker logs <service-name> --since=<time-duration>
For example, to check the logs of the last minute, run:
docker logs <service-name> --since=1m
Note
The logs are printed to stderr.
To save the logs in a file use:
docker logs <service-name> 2> <filename>
To grep through the logs use:
docker logs <service-name> 2>&1 | grep <query>
For further information please refer to the official Docker documentation.
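The log commands above can be wrapped into one step when collecting material for a support case. This is a sketch: save_recent_logs is a hypothetical helper, and counting lines that contain the word "error" is only a rough heuristic, since the exact log format of each service is not assumed here.

```shell
# Hypothetical helper: save the recent logs of a service (stdout and stderr)
# to a file and print how many lines mention "error" (case-insensitive).
save_recent_logs() {
    service="$1"
    since="$2"
    outfile="$3"
    docker logs "$service" --since="$since" > "$outfile" 2>&1
    grep -ci 'error' "$outfile"
}

# Usage: save_recent_logs <service-name> 1h <filename>
```

Note that the function returns a non-zero status when no matching line is found, because that is how grep reports zero matches.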
Restart a service¶
To restart a service you can use the appliance-cli:
appliance-cli post debug/services/${service_name}/restart
where ${service_name} is the name of the service you want to restart.
To get the possible values for ${service_name}, use the following command:
appliance-cli get debug/services
Note
Alternatively, you can restart a service by running the following commands:
SSH to the given machine
Run
docker restart <service-name>
Note
The Anapaya appliance restarts failed services automatically, so manual restarting is likely to be useful only when the service is stuck and/or unresponsive.
For further information, please refer to the official Docker documentation.
Clean up docker images¶
To remove docker images that are no longer used:
SSH to the given machine.
List all docker images by running:
docker image ls
Remove old unused images by running:
docker image prune
For further information please refer to the official Docker documentation.
Connect to the BGP daemon’s interactive console¶
To connect to the BGP daemon’s shell:
SSH to the given machine.
Open the interactive console by running:
docker exec -it frr vtysh
For further information on the console please refer to the official FRR documentation.
Check systemd services¶
To check if the systemd services are running:
SSH to the given machine
Run systemctl list-units '<service-name>':
$ systemctl list-units 'appliance*'
UNIT                         LOAD   ACTIVE SUB     DESCRIPTION
appliance-host.service       loaded active running Anapaya Appliance Host Service
appliance-installer.service  loaded active running Anapaya Appliance Installer
...
To get a more detailed overview of a specific service, use systemctl status <service-name>:
$ systemctl status appliance-installer.service
● appliance-installer.service - Anapaya Appliance Installer
   Loaded: loaded (/etc/systemd/system/appliance-installer.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2023-02-13 08:26:29 UTC; 7min ago
 Main PID: 166 (appliance-insta)
    Tasks: 13 (limit: 38262)
   CGroup: /system.slice/appliance-installer.service
           └─166 /usr/bin/appliance-installer --config /etc/anapaya/installer/appliance-installer.toml
...
With these commands you can see whether the service is active and running and for how long it has been running.
Note
A systemd service can be restarted using systemctl restart <service-name>.
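To check several units at once without reading the full table, systemctl is-active can be used in a loop. The following is a minimal sketch; report_inactive_units is a hypothetical helper name, and the unit names you pass are up to you.

```shell
# Hypothetical helper: print every given unit that is not currently active.
# Usage: report_inactive_units <unit>...
report_inactive_units() {
    for unit in "$@"; do
        # is-active prints the state (for example active, inactive, failed)
        # and exits non-zero unless the unit is active.
        state=$(systemctl is-active "$unit") || printf '%s: %s\n' "$unit" "$state"
    done
}
```

For example, report_inactive_units appliance-host.service appliance-installer.service prints nothing when both units are active.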
Inspect systemd service logs¶
To view the systemd service logs:
SSH to the given machine.
If needed, you can list the appliance-related services by running:
systemctl list-units 'appliance*'
Inspect the logs by running:
journalctl -eu <service-name>
Note
To see only the recent logs, use the --since flag. For example, to see only the logs from today, use journalctl -eu <service-name> --since today.
To show the most recent 20 entries, use the -n 20 option.
Note
The logs are printed to stdout.
To save the logs in a file use:
journalctl -u <service-name> > <filename>
To grep through the logs use:
journalctl -eu <service-name> | grep <query>
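According to the journalctl manual, the -e option implies -n1000, so a search that uses it covers only the most recent 1000 entries. When you need to search a whole time window, a variant without -e may be more suitable. The following is a minimal sketch; grep_unit_logs is a hypothetical helper name.

```shell
# Hypothetical helper: search a unit's logs within a time window.
# Usage: grep_unit_logs <service-name> <since> <query>
grep_unit_logs() {
    # --no-pager writes straight to stdout; without -e, all entries
    # since the given time are searched.
    journalctl -u "$1" --since "$2" --no-pager | grep -- "$3"
}
```

For example, grep_unit_logs <service-name> today <query> searches all of today's entries for the given service.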
Check the systemd-timesyncd service¶
We use the systemd-timesyncd service (systemd-timesyncd.service), which acts as an NTP client and connects to a pool of NTP servers for time synchronization. The following actions provide some starting points for troubleshooting this service. For further information please refer to the official documentation.
SSH to the given machine.
Check whether the system clock is synchronized and whether the NTP service is active using:
timedatectl status
Restart the service:
systemctl restart systemd-timesyncd.service
Find the configured NTP servers:
grep NTP /etc/systemd/timesyncd.conf
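For use in a script, the synchronization state can also be read in machine-readable form. The following is a minimal sketch that assumes a systemd version that provides timedatectl show (which prints KEY=VALUE lines, including NTPSynchronized); clock_is_synchronized is a hypothetical helper name.

```shell
# Hypothetical helper: reads `timedatectl show` output on stdin and
# succeeds only if the clock is reported as synchronized.
clock_is_synchronized() {
    # -x matches the whole line, -q suppresses output.
    grep -qx 'NTPSynchronized=yes'
}
```

For example, timedatectl show | clock_is_synchronized || echo "clock not synchronized" prints a message only when synchronization is missing.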
Disk usage analysis¶
This section contains commands that are helpful when investigating why a machine is running out of disk space.
SSH to the given machine.
Check the used and available space:
df -h <path>
List the files in a directory:
ls -l <path>
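The du command can be used to get a more detailed overview of which directory consumes how much space. You can vary the max-depth option or the starting directory. The grep filter keeps only the entries whose size is reported in megabytes or gigabytes:
du -cha --max-depth=1 / | grep -E '^[0-9.,]+[MG]'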
For further information about the du command please refer to the official documentation.
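As an alternative to filtering by unit, the du output can be sorted by size. The following is a minimal sketch that assumes GNU du (as the --max-depth option above already does); largest_dirs is a hypothetical helper name.

```shell
# Hypothetical helper: list the largest directories directly under a path.
# Usage: largest_dirs <path> [count]
largest_dirs() {
    # -x stays on one filesystem; -k reports sizes in KiB so that
    # sort -rn can order them numerically, largest first.
    du -xk --max-depth=1 "$1" 2>/dev/null | sort -rn | head -n "${2:-10}"
}
```

The first line of the output is the starting directory itself, followed by its largest subdirectories. For example, largest_dirs / 5 shows the total and the four largest top-level directories.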
Clean up disk space¶
There are several ways to free up disk space. The options below are grouped by context.
Systemd journal logs¶
Check how much disk space the systemd journal logs use:
journalctl --disk-usage
Clear the logs that are older than 3 days:
sudo journalctl --vacuum-time=3d
Docker images¶
Remove unused docker images as described in Clean up docker images.
Fix topology synchronization error¶
Appliances in a cluster share their topology information with each other. This either happens statically through configuration or dynamically through an exchange protocol. For further information on how to configure topology synchronization in the appliance configuration, refer to Topology Synchronization. The instructions below should help to identify a misconfiguration.
Check the logs of the appliance-controller service. The logs should contain an error describing the misconfiguration.
Fix the misconfigured appliances and update them.
Inspect SCION paths used for IP-in-SCION tunneling¶
While troubleshooting SCION connectivity, it is often useful to check the available paths for each domain. This section provides an overview of how to achieve this.
SSH to the given machine.
Show the currently available paths for all domains and traffic matchers by running the following command. This also shows whether the path is alive, dead (no probes are passing through), expired or similar.
appliance-cli inspect scion-tunneling summary --all-paths
Show the currently used paths for a specific domain:
appliance-cli inspect scion-tunneling summary --all-paths \
    --domain <domain>
For the used paths for a specific traffic matcher within the given domain, run:
appliance-cli inspect scion-tunneling summary --all-paths \
    --domain <domain> --traffic-matcher <traffic matcher>
Ping the underlay network¶
When investigating an issue, it is often helpful to determine whether the underlying IP connectivity is the problem.
For further information, please refer to the official ping documentation.
Tip
The ping command runs indefinitely unless you limit the number of packets with the -c option:
ping -c <number> <destination>
You can set the source either by address or by interface name:
ping <destination> -I <interface/address>
The default time interval between successive packet transmissions is one second. You can specify a custom interval in seconds:
ping -i <interval> <destination>
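When pinging several destinations, it can be convenient to reduce each run to its packet loss figure. The following is a minimal sketch that assumes the usual summary line printed by ping ("... 0% packet loss ..."); packet_loss_percent is a hypothetical helper name.

```shell
# Hypothetical helper: reads ping output on stdin and prints the packet
# loss percentage taken from the summary line.
packet_loss_percent() {
    sed -n 's/.* \([0-9.][0-9.]*\)% packet loss.*/\1/p'
}
```

For example, ping -c 3 <destination> | packet_loss_percent prints 0 when all replies arrive and 100 when none do.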