Symbol Node Maintenance: Monitoring, Backups, Crashes, and Recovery

Posted at 2022-12-22

Greetings Symbol Community & Fellow Node Operators!

This article focuses on an area where I feel the Symbol Guide is somewhat lacking: node monitoring, backups, crashes, and recovery. The Symbol Guide's entry on Running a Symbol Node does a good job covering the basics of getting a node up and running, and the entry on Maintaining a Symbol Node covers updates and HTTPS key renewals. This article briefly addresses a few topics that new or existing node operators may find relevant once they are up and running.

Monitoring

The obvious choice for a quick node check is one of the Symbol Node Lists; these show a list of active nodes along with chain height and even version number. This is a handy way to manually check your node for a crash, desync, or even a missed update. Useful as these lists are, most of us prefer something automated that proactively notifies us when issues arise. The community has developed a few solutions, such as the XYM Harvest Bot, my own self-hosted XYM Node Monitor Python script, and a number of others (not all of which are maintained). Beyond solutions built by others, there are a number of handy API calls you can use in your browser to get some details on your Dual or API node (not applicable to peer-only nodes)!

Simple Browser-Based API Calls
Many newcomers assume APIs (Application Programming Interfaces) are too complicated for anyone but developers and IT professionals to use, but a web browser is all you need to quickly and easily access relevant information on your node! This next segment details a few browser-friendly API endpoints that are particularly useful to new node operators.

Useful API Endpoints

Node Health - this query doesn’t do much beyond confirming the API functionality and underlying database are online. This can still be very helpful when checking if a recent node issue has been resolved.
http://yournode.url:3000/node/health
{"status":{"apiNode":"up","db":"up"}}
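The same check can be run from a terminal or a cron job. The following is a minimal sketch, assuming `curl` is installed; the function names are made up for illustration, and the URL you pass in is your own node's address. It matches the response text with plain string patterns to avoid extra dependencies, so a JSON tool such as `jq` would be more robust if the response layout ever changes.

```shell
# Succeeds only if a /node/health response body reports both apiNode and db as "up".
# (Hypothetical helper; plain pattern matching, not a real JSON parser.)
health_is_up() {
  case "$1" in
    *'"apiNode":"up"'*) ;;
    *) return 1 ;;
  esac
  case "$1" in
    *'"db":"up"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetches /node/health from the given base URL and prints a one-line verdict.
# Exit status: 0 = healthy, 1 = reachable but not fully up, 2 = could not connect.
check_node_health() {
  local base_url="$1" body
  if ! body=$(curl -sS --max-time 10 "$base_url/node/health"); then
    echo "UNREACHABLE $base_url"
    return 2
  fi
  if health_is_up "$body"; then
    echo "OK $base_url"
  else
    echo "DEGRADED $base_url: $body"
    return 1
  fi
}
```

For example, `check_node_health http://yournode.url:3000` prints a line starting with `OK` when both components are up, and the non-zero exit status in the other cases can be used to trigger a notification.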

Node Info - this query is primarily used by fellow nodes and other services, though it can be used to check custom node descriptors, retrieve the public key for delegated harvesting, and check your node's version code.
(“nodePublicKey” - this is the ‘Transport’ key in your node’s addresses.yml file).
http://yournode.url:3000/node/info
{"version":16777985,"publicKey":"AC1A6E1D8DE5B17D2C6B1293F1CAD3829EEACF38D09311BB3C8E5A880092DE26","networkGenerationHashSeed":"57F7DA205008026C776CB6AED843393F04CD458E0AA2D9F1D5F31A402072B2D6","roles":3,"port":7900,"networkIdentifier":104,"host":"OrisDorch.xymposium.xyz","friendlyName":"Oris Dorch Node","nodePublicKey":"AC1A6E1D8DE5B17D2C6B1293F1CAD3829EEACF38D09311BB3C8E5A880092DE26"}
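The `version` value in this response is a single integer rather than a dotted version string. As a rough sketch, assuming the integer packs four one-byte fields (major, minor, build, revision) from the highest byte to the lowest, it can be decoded as follows. The packing layout is an assumption that the article does not state, and the function names are hypothetical.

```shell
# Decodes a packed version integer into dotted form.
# Assumption: one byte per field, most significant byte first.
decode_version() {
  local v="$1"
  echo "$(( (v >> 24) & 255 )).$(( (v >> 16) & 255 )).$(( (v >> 8) & 255 )).$(( v & 255 ))"
}

# Fetches /node/info from the given base URL and prints the raw version integer.
# Plain text matching is used instead of a JSON parser, so this is a sketch only.
node_version_raw() {
  curl -sS --max-time 10 "$1/node/info" \
    | grep -o '"version":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*'
}
```

Under that assumption, `decode_version 16777985` prints `1.0.3.1` for the sample response above, which makes it easier to spot a missed update at a glance.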

Chain Info – this query shows your node’s block height among other things. This can be used to monitor a new or reset node as it synchronizes with the network.
http://yournode.url:3000/chain/info
{"height":"804723","scoreHigh":"4","scoreLow":"12351162821671795071","latestFinalizedBlock":{"finalizationEpoch":560,"finalizationPoint":58,"height":"804696","hash":"4285C0EE90CD234E975217F26C19451893709CB3560ED30E7D24BE854CCECA95"}}
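To watch a sync in progress, it helps to compare your node's height against a node you trust. The sketch below assumes `curl` is available and that the response lists the chain height before the finalized block's height, as in the sample above; the helper names are invented for illustration. It relies on text matching, so it will print an error rather than a number if the response is empty or laid out differently.

```shell
# Prints the first "height" value in a /chain/info body (the chain height in the sample layout).
chain_height() {
  printf '%s' "$1" | grep -o '"height":"[0-9]*"' | head -n 1 | grep -o '[0-9][0-9]*'
}

# Prints the last "height" value in the body (the latest finalized block in the sample layout).
finalized_height() {
  printf '%s' "$1" | grep -o '"height":"[0-9]*"' | tail -n 1 | grep -o '[0-9][0-9]*'
}

# Prints how many blocks sit above the latest finalized block.
unfinalized_blocks() {
  echo $(( $(chain_height "$1") - $(finalized_height "$1") ))
}

# Prints how many blocks node $1 is behind reference node $2 (both are base URLs).
# A negative number means your node is ahead of the reference.
blocks_behind() {
  local mine theirs
  mine=$(curl -sS --max-time 10 "$1/chain/info") || return 2
  theirs=$(curl -sS --max-time 10 "$2/chain/info") || return 2
  echo $(( $(chain_height "$theirs") - $(chain_height "$mine") ))
}
```

Running `blocks_behind http://yournode.url:3000 http://reference.node:3000` every few minutes (the second URL is a placeholder for any node you trust) shows whether the gap is shrinking. A gap that stays large or keeps growing suggests the node is stuck or on a fork.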

Unlocked Accounts - this query lists the public key of each account that has successfully delegated harvesting to your node. You can search by public key on http://symbol.fyi/ to see the associated XYM address.
http://yournode.url:3000/node/unlockedaccount
{"unlockedAccount":["AC1A6E1D8DE5B17D2C6B1293F1CAD3829EEACF38D09311BB3C8E5A880092DE26","AC1A6E1D8DE5B17D2C6B1293F1CAD3829EEACF38D09311BB3C8E5A880092DE26"]}
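If you want to track delegated harvesters over time, the response can be reduced to a count or checked for one specific key. This is a small sketch with hypothetical helper names; it treats any quoted 64-character hexadecimal string in the body as a public key, which fits the sample above but is an assumption about the format.

```shell
# Prints how many quoted 64-hex-character strings (public keys) appear in the body.
count_unlocked() {
  printf '%s' "$1" | grep -o '"[0-9A-Fa-f]\{64\}"' | wc -l | tr -d ' '
}

# Succeeds if the given public key appears in the body (case-insensitive match).
is_unlocked() {
  printf '%s' "$1" | grep -qi "\"$2\""
}

# Fetches /node/unlockedaccount from the given base URL and prints the count.
unlocked_count_for() {
  local body
  body=$(curl -sS --max-time 10 "$1/node/unlockedaccount") || return 2
  count_unlocked "$body"
}
```

A drop in `unlocked_count_for http://yournode.url:3000` after a restart or resync can be an early hint that delegated harvesters were lost and need to re-delegate.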

Node Peers - this query lists peer nodes that connect your node to the broader network. This could be useful if you suspect that your node is on a side chain, as it can be compared against external lists of nodes such as https://symbol.fyi/nodes.
http://yournode.url:3000/node/peers
[{"version":16777985,"publicKey":"AC1A6E1D8DE5B17D2C6B1293F1CAD3829EEACF38D09311BB3C8E5A880092DE26","networkGenerationHashSeed":"57F7DA205008026C776CB6AED843393F04CD458E0AA2D9F1D5F31A402072B2D6","roles":3,"port":7900,"networkIdentifier":104,"host":"0-0-0-0-1.my.awesome.node","friendlyName":"My Awesome Node"},

Additional API documentation can be found here: https://symbol.github.io/symbol-openapi/. The symbol.services resource is another potentially useful ‘meta level’ node service used by bigger aggregators such as the Symbol Explorer's node list (https://symbol.fyi/nodes) and even the Symbol wallet - you can check out its documentation here: https://symbol.services/openapi/.

Node Issues & Crashes

Sometimes your node monitoring reveals an issue; maybe it’s stuck on a block, on a fork, has a corrupted database, or something else has caused your node to become unresponsive. I’ve experienced a variety of issues on my bootstrap dual-node (peer/API) over the months, and many of the underlying issues / root causes have eluded me, but I’ve always managed to get it back online with relative ease.

The easiest solution has typically been to stop and restart bootstrap in the hopes that it will recover or at least provide some details on the cause of the issue. The broker recovery program almost always runs into issues with leftover lock files preventing recovery, meaning I have to stop bootstrap again and remove the .lock files in [path]/target/nodes/node/data/ before restarting it. If this doesn’t work, I’ll review the various log files in [path]/target/nodes/node/logs/ to see if anything obvious stands out, but troubleshooting can be a protracted process and I often find myself looking for the most direct path to getting my node back online. Typically this means resyncing or restoring my database.
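The lock file cleanup described above can be made a little safer by listing the files first and deleting them only on request. The sketch below assumes the leftover files end in `.lock` and sit directly inside the data directory; the function name and the `--delete` switch are invented for this example. It should only be run while bootstrap is stopped.

```shell
# Lists *.lock files directly inside the given data directory.
# With "--delete" as the second argument, it also removes them.
clear_lock_files() {
  local data_dir="$1" mode="${2:-}"
  if [ ! -d "$data_dir" ]; then
    echo "not a directory: $data_dir" >&2
    return 1
  fi
  if [ "$mode" = "--delete" ]; then
    # -print shows each file before rm removes it
    find "$data_dir" -maxdepth 1 -type f -name '*.lock' -print -exec rm -f -- {} +
  else
    find "$data_dir" -maxdepth 1 -type f -name '*.lock' -print
  fi
}
```

A dry run with `clear_lock_files [path]/target/nodes/node/data` shows what would be removed; adding `--delete` as a second argument actually removes the files, after which bootstrap can be restarted.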

Node Backups & Recovery

Resynchronizing your node from scratch (i.e. symbol-bootstrap resetData) is often the easiest solution for a stuck node or one that will not boot, but it can take quite some time to synchronize, as you have to incrementally download and validate the entire chain from your node’s peers. I’ve come to greatly appreciate how useful node backups can be for rapid recovery - even backups that are several months old!

I can only really speak to bootstrap dual (API + Peer) nodes, though I expect that much of this applies to non-bootstrap nodes as well. I tend to back up my entire target folder onto another server/device in case anything gets corrupted or deleted, or the server fails / gets wiped. That said, I typically find that I only need to restore the following directories (while my node / bootstrap is stopped) to get my node back up and running without a full resync:
[path]/target/databases (MongoDB contents used by API Nodes)
[path]/target/nodes/node/data
It is worth noting that these are the largest folders on my node.

Initially my backup process involved stopping bootstrap and then making a copy of the target folder. I always stop bootstrap before backing up, as I don’t want to catch it in the middle of database updates, and I used to keep these backups on my node's server in case of emergency. More recently my server’s storage has gradually filled up, and transferring backups to another site is taking longer and longer as the chain grows. To address this, I've started working on a more efficient approach: a scheduled bash script that automates backups and transfers them to offsite storage using pigz and rclone. I’m afraid I haven’t gotten quite that far yet.

I recently started backing up directly to .tar.gz using parallel gzip (pigz), which runs really fast considering the number and size of files it’s compressing (~8 min on my low-end 6 core VPS). In my current process, I:
1. Purge any old local backups from the /backups/ folder I’ve created for this purpose (mkdir backups) to free up space for the new backup
2. Stop bootstrap
3. Create a timestamped backup.tar.gz
4. Restart bootstrap
5. Begin manually transferring the backup to offsite storage

Most of this is facilitated with the following shell commands:
rm -rf ./backups && mkdir -p ./backups   # purge old backups, then recreate the folder
symbol-bootstrap stop
tar cf - ./target | pigz -1 -p 6 > ./backups/$(date +%Y%m%d-%H%M%S).tar.gz
symbol-bootstrap start -c ./custom.yml
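For a scheduled job, the stop / archive / restart sequence can be wrapped so that bootstrap is restarted even when the archive step fails. The following is a sketch under several assumptions: the function names are hypothetical, the restart command is taken to be `symbol-bootstrap start` with the `-d` (detached) flag so the script can return, the `gzip` fallback and the `gzip -t` integrity check are additions that are not part of the process described above, and purging old backups is left out.

```shell
# Prints a timestamped archive name in the same format as the commands above.
backup_name() {
  date +%Y%m%d-%H%M%S.tar.gz
}

# Archives directory $1 into folder $2 and prints the archive path.
# Uses pigz when it is installed, otherwise plain gzip (slower, same file format).
make_backup() {
  local src="$1" dest_dir="$2" archive compressor
  mkdir -p "$dest_dir" || return 1
  archive="$dest_dir/$(backup_name)"
  if command -v pigz >/dev/null 2>&1; then
    compressor="pigz -1"
  else
    compressor="gzip -1"
  fi
  # Without pipefail a tar failure is not detected here, hence the integrity test below.
  tar cf - "$src" | $compressor > "$archive" || return 1
  gzip -t "$archive" || return 1
  echo "$archive"
}

# Stops bootstrap, archives ./target into ./backups, then restarts bootstrap
# regardless of whether the archive step succeeded.
run_backup() {
  local status
  symbol-bootstrap stop || return 1
  make_backup ./target ./backups
  status=$?
  symbol-bootstrap start -c ./custom.yml -d   # assumption: -d runs detached
  return "$status"
}
```

Calling `run_backup` from the folder that contains `target` and `custom.yml` prints the path of the new archive. The non-zero exit status on failure gives a cron wrapper something to alert on.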

I would do things a bit differently if I had more storage space. I would prefer to keep my latest backup rather than purging it before starting a new one, and I’d also want to minimize downtime with a slightly different sequence of events like:
1. Stop Bootstrap
2. Copy the target folder
3. Restart Bootstrap
4. Compress the new copy of the target folder (making sure not to compress the live target folder) [Alternatively I might opt to use rclone for file-by-file synchronization of an offsite backup to minimize the file transfer bandwidth]
5. Transfer the compressed backup to offsite storage

This process would reduce my node’s downtime by a few minutes, but it requires more than twice as much storage space and also means compressing/synchronizing all those files while my node is live. My VPS is minimalistic, so this could cause a resource shortage, which could in turn create issues with my node.
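Since this alternative sequence depends on having room for a full uncompressed copy, a script could check free space before stopping the node. The sketch below is illustrative only. The helper names, the 10% safety margin, the `./backups` layout, the `-d` flag on restart, and the rclone remote `offsite:symbol-backups` are all assumptions, and `nice` with a reduced pigz thread count is just one way to limit the load on a small VPS.

```shell
# Size of a directory tree in KiB.
dir_size_kb() { du -sk "$1" | awk '{print $1}'; }

# Free space in KiB on the filesystem that holds $1.
free_kb() { df -Pk "$1" | awk 'NR==2 {print $4}'; }

# Succeeds when $2 KiB of free space exceeds $1 KiB plus a margin of $3 percent (default 10).
fits() { [ "$2" -gt $(( $1 + $1 * ${3:-10} / 100 )) ]; }

# Succeeds when the filesystem holding $2 can take an uncompressed copy of $1.
has_room_for_copy() {
  local need avail
  need=$(dir_size_kb "$1") || return 1
  avail=$(free_kb "$2") || return 1
  fits "$need" "$avail" "${3:-10}"
}

# Stop, copy, restart, then compress and upload the copy while the node is live again.
low_downtime_backup() {
  local stamp copy archive copy_status
  stamp=$(date +%Y%m%d-%H%M%S)
  copy="./backups/target-$stamp"
  archive="./backups/$stamp.tar.gz"
  mkdir -p ./backups || return 1
  has_room_for_copy ./target ./backups || { echo "not enough free space" >&2; return 1; }
  symbol-bootstrap stop || return 1
  cp -a ./target "$copy"
  copy_status=$?
  symbol-bootstrap start -c ./custom.yml -d   # restart even if the copy failed
  [ "$copy_status" -eq 0 ] || return 1
  # Compress the copy, never the live target folder, at low CPU priority.
  tar cf - "$copy" | nice -n 19 pigz -1 -p 2 > "$archive" || return 1
  rm -rf "$copy"
  rclone copy "$archive" offsite:symbol-backups   # hypothetical remote name
}
```

The space check only accounts for the uncompressed copy. The archive needs additional room until the copy is deleted, so a larger margin may be appropriate on a nearly full disk.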

Conclusion

I hope you've found something helpful in this post, and I would love to hear how you manage your node! I strongly encourage fellow node operators to share their experiences and lessons learned - the more we share, the stronger we are!

That is all from me today - Happy Holidays to those who celebrate, and Happy Thoughts to those who do not!
