Most organizations rely heavily on their IT infrastructure to run their operations. Unplanned system failures or performance degradation can lead to disruptions, financial losses, and damage to reputation.

Automated system health checks are crucial to ensure that IT infrastructure remains stable and reliable. By monitoring critical metrics and promptly detecting anomalies, you can minimize downtime.

[Image: output of a system monitoring program in an IDE]

Defining Health Checks

It is essential to define what health checks you want to perform on your system. You should establish clear criteria for what you’ll monitor and why. Begin by identifying the primary goals of your system. What functions or services does it provide?

Then, set performance benchmarks based on historical data and ensure your health checks assess the efficient use of system resources. Finally, define the thresholds that indicate a problem. What percentage of resource usage do you consider high or low? At what point should the system trigger an alert?
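One possible way to turn historical data into an alert threshold is to place it a few standard deviations above the mean of past readings. The sketch below shows that heuristic only. The function name `threshold_from_history`, the multiplier `k`, and the sample readings are assumptions made for illustration.

```python
import statistics


def threshold_from_history(samples, k=2.0, ceiling=100.0):
    """Suggest an alert threshold for a percentage metric.

    Returns mean + k * sample standard deviation, capped at `ceiling`.
    `samples` must contain at least two readings.
    """
    mean = statistics.mean(samples)
    spread = statistics.stdev(samples)
    return min(ceiling, mean + k * spread)


# Hypothetical CPU readings (percent) collected during normal operation.
cpu_history = [40, 50, 60]
cpu_threshold = threshold_from_history(cpu_history)
```

A larger `k` produces fewer alerts at the cost of reacting later, so the right value depends on how noisy the metric is.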

Choosing Libraries and Setting Up Your Environment

To automate system monitoring in Python, you will need two third-party libraries: psutil, to gather system metrics, and schedule, to run the checks at regular intervals. Python's built-in logging module handles the log file.

Start setting things up by creating a new Python virtual environment. This will prevent potential library version conflicts. Then install the required libraries with Pip by running pip install psutil schedule in your terminal.

Once the libraries are installed on your system, your environment is ready.

The full source code is available in a GitHub repository.

Importing the Required Libraries

Create a new script, monitoring.py, and begin it by importing the required libraries.

Importing the libraries will allow you to use the functionality they offer in your code.

Logging and Reporting

You need a way to log the results of your health checks. Logging serves as a vital tool for capturing and preserving a historical record of events and debugging problems in your code. It also plays a critical role in performance analysis.

Use the built-in logging library to create your logs for this project. You can save the log messages to a file named system_monitor.log.

For reporting, print an alert message on the console to serve as immediate notification about any issues that require attention.
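The sketch below shows one way to set up both pieces. The helper names `log_message` and `print_alert`, the logger name, and the log format are assumptions. A dedicated logger with its own file handler is used here so the log file is written even if other code has already configured logging.

```python
import logging

# Dedicated logger that appends to system_monitor.log in the working directory.
logger = logging.getLogger("system_monitor")
logger.setLevel(logging.INFO)
handler = logging.FileHandler("system_monitor.log")
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
)
logger.addHandler(handler)


def log_message(message):
    """Record a message in the log file."""
    logger.info(message)


def print_alert(message):
    """Show an immediate notification on the console."""
    print(f"ALERT: {message}")
```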

The health check functions will use these logging and reporting functions to record and report their findings.

Creating Health Check Functions

For each health check, define a function that will encapsulate a specific test that evaluates a critical aspect of your infrastructure.

CPU Usage Monitoring

Start by defining a function that will monitor CPU usage. This will serve as a critical indicator of a system’s overall performance and resource utilization. Excessive CPU usage leads to system slowdowns, unresponsiveness, and even crashes, severely disrupting essential services.

By regularly checking the CPU usage and setting appropriate thresholds, system administrators can identify performance bottlenecks, resource-intensive processes, or potential hardware issues.
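A CPU check along these lines could be sketched as follows. The function name, the default threshold of 50, and the `interval`, `log`, and `alert` parameters are assumptions. The `log` and `alert` hooks default to `logging.warning` and `print` so the snippet runs on its own; in the full script you would pass your own logging and reporting functions instead.

```python
import logging

import psutil


def check_cpu_usage(threshold=50, interval=1, log=logging.warning, alert=print):
    """Measure system-wide CPU usage and report it if it exceeds `threshold`.

    `threshold` is a percentage. `interval` is how many seconds psutil
    samples for; the call blocks for that long. Returns the measured value.
    """
    cpu_usage = psutil.cpu_percent(interval=interval)
    if cpu_usage > threshold:
        message = f"High CPU usage detected: {cpu_usage}%"
        log(message)
        alert(f"ALERT: {message}")
    return cpu_usage
```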

The function checks the current CPU usage of the system. If the CPU usage exceeds the threshold, which is expressed as a percentage, the function logs a message indicating high CPU usage and prints an alert message.

Memory Usage Monitoring

Define another function that will monitor the memory usage. By regularly tracking memory utilization, you can detect memory leaks, resource-hungry processes, and potential bottlenecks. This helps prevent system slowdowns, crashes, and outages.
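A memory check could follow the same pattern as the CPU check. In this sketch, the function name, the default threshold of 80, and the `log` and `alert` hooks are assumptions; the hooks stand in for your own logging and reporting functions.

```python
import logging

import psutil


def check_memory_usage(threshold=80, log=logging.warning, alert=print):
    """Report when the share of RAM in use exceeds `threshold` percent.

    Returns the measured percentage.
    """
    memory_usage = psutil.virtual_memory().percent
    if memory_usage > threshold:
        message = f"High memory usage detected: {memory_usage}%"
        log(message)
        alert(f"ALERT: {message}")
    return memory_usage
```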

Similar to the CPU usage check, you set a threshold for high memory usage. If memory usage surpasses the threshold, the function logs and prints an alert.

Disk Space Monitoring

Define a function that will monitor the disk space. By continuously monitoring the availability of disk space, you may address potential issues stemming from resource depletion. Running out of disk space can result in system crashes, data corruption, and service interruptions. Disk space checks help ensure that there is sufficient storage capacity.
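One way to write such a check is to compare the percentage of free space against a minimum. The function name, the default threshold of 20 percent free, and the `log` and `alert` hooks in this sketch are assumptions.

```python
import logging

import psutil


def check_disk_space(path="/", threshold=20, log=logging.warning, alert=print):
    """Report when free space on `path` falls below `threshold` percent.

    Returns the percentage of the disk that is still free.
    """
    usage = psutil.disk_usage(path)
    free_percent = 100.0 - usage.percent
    if free_percent < threshold:
        message = f"Low disk space detected on {path}: {free_percent:.1f}% free"
        log(message)
        alert(f"ALERT: {message}")
    return free_percent
```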

This function examines the disk space usage of a specified path. The default path is the root directory, /. If the available disk space falls below the threshold, the function logs and prints an alert.

Network Traffic Monitoring

Define a final function that will monitor your system’s data flow. It will help in the early detection of unexpected spikes in network traffic, which could be indicative of security breaches or infrastructure issues.
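A network check could be sketched as below. The function name, the default threshold of 100 MB, and the `log` and `alert` hooks are assumptions. Note that psutil reports cumulative byte counters, so this total only grows while the system is up; measuring a rate would require comparing two successive readings.

```python
import logging

import psutil


def check_network_traffic(threshold=100 * 1024 * 1024, log=logging.warning, alert=print):
    """Report when total bytes sent plus received exceed `threshold` bytes.

    Returns the combined byte count.
    """
    counters = psutil.net_io_counters()
    # psutil returns None when the system exposes no network interfaces.
    total = 0 if counters is None else counters.bytes_sent + counters.bytes_recv
    if total > threshold:
        message = f"High network traffic detected: {total / (1024 * 1024):.2f} MB"
        log(message)
        alert(f"ALERT: {message}")
    return total
```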

The function monitors network traffic by summing the bytes sent and received. The threshold is in bytes. If network traffic exceeds the threshold, it logs and prints an alert.

Implementing Monitoring Logic

Now that you have the health check functions, simply call each one in turn from a controller function. You can print output and log a message each time this overall check runs.
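A controller of this kind might look like the following sketch. Passing the checks in as a list of callables, the function name, and the exact messages are assumptions made so the snippet runs on its own; in the full script you could call your check functions directly instead.

```python
import logging


def run_health_checks(checks, log=logging.info):
    """Run every health check in order, marking the start and end of the pass.

    `checks` is an iterable of zero-argument callables.
    """
    print("Monitoring the system...")
    log("Running system health checks...")
    for check in checks:
        check()
    log("Health checks completed.")
    print("-----")
```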

This function runs all health checks, providing a unified view of your system’s health status.

Scheduling Automated Checks and Running the Program

To automate the monitoring at specific intervals, you will use the schedule library. You can adjust the interval as needed.

Now run the system monitoring process in a continuous loop.

This loop continuously checks for scheduled tasks and executes them when their time comes.

When you run the program, it records the monitoring logs in the system_monitor.log file and displays any alerts on the terminal.

Advancing the System Monitoring Program

These monitoring checks are not the only ones that psutil supports. You can add more monitoring functions, using a similar approach, to suit your requirements.

You can also improve the reporting function to use email rather than outputting a simple message on the console.
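As a rough sketch of email reporting, Python's standard smtplib and email modules can build and send a message. The function names, the subject line, and the default host and port are placeholders; a real mail server will usually also require TLS and authentication, which this sketch leaves out.

```python
import smtplib
from email.message import EmailMessage


def build_alert_email(message, sender, recipient):
    """Wrap an alert message in an email."""
    email = EmailMessage()
    email["Subject"] = "System health alert"
    email["From"] = sender
    email["To"] = recipient
    email.set_content(message)
    return email


def send_alert_email(message, sender, recipient, host="localhost", port=25):
    """Send the alert through an SMTP server (host and port are placeholders)."""
    email = build_alert_email(message, sender, recipient)
    with smtplib.SMTP(host, port) as server:
        server.send_message(email)
```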