A monthly server maintenance checklist keeps your infrastructure healthy, catches problems before they become outages, and gives you a documented record that everything has been checked. Here is a practical checklist covering the tasks that matter most, with guidance on what to look for in each.
How to Use This Checklist
Run through this checklist once a month — or more frequently for critical servers. Keep a log (even a simple spreadsheet) recording the date, who performed the check, and any findings. A documented maintenance history is valuable when diagnosing recurring problems or planning hardware replacements.
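As a minimal sketch of the log described above, the following PowerShell function appends one record per check to a CSV file. The function name, column names, and default file path are illustrative assumptions, not part of any standard tooling.

```powershell
# Append one maintenance record to a CSV log.
# Function name, columns, and default path are illustrative assumptions.
function Add-MaintenanceLogEntry {
    param(
        [Parameter(Mandatory)] [string] $Server,
        [Parameter(Mandatory)] [string] $CheckedBy,
        [string] $Findings = 'No issues found',
        [string] $Path = '.\maintenance-log.csv'
    )

    $entry = [pscustomobject]@{
        Date      = (Get-Date).ToString('yyyy-MM-dd')
        Server    = $Server
        CheckedBy = $CheckedBy
        Findings  = $Findings
    }

    # -Append adds a row to an existing file; the header row is written on first use.
    $entry | Export-Csv -Path $Path -Append -NoTypeInformation

    # Return the record so the caller can see what was logged.
    $entry
}
```

For example, `Add-MaintenanceLogEntry -Server 'FS01' -CheckedBy 'jsmith' -Findings 'Amber LED on drive bay 3'` would add one row, where the server and user names are placeholders. The resulting file opens directly in a spreadsheet.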
Hardware and Physical
- Visual inspection: check front panel LEDs — any amber lights on the system health or drive bay indicators need immediate investigation. See the full server visual inspection guide for detail.
- Fan noise: listen for any grinding, rattling, or changes in pitch from server fans
- Cable check: confirm all network and power cables are fully seated
- Temperature check: review thermal sensor data in iDRAC/iLO — inlet temperature, CPU temperatures, any thermal events in the hardware event log
- UPS status: confirm UPS is on mains power, battery is healthy, and estimated runtime is acceptable. Test under load annually.
- Server room temperature: confirm ambient temperature is within the recommended 18–27°C range
Storage
- Disk space: check free space on all volumes, with an alert threshold at 20% free and a critical threshold at 10%. Run `Get-PSDrive -PSProvider FileSystem` in PowerShell.
- Drive health: run `Get-PhysicalDisk | Select-Object FriendlyName, HealthStatus`. All drives should be Healthy.
- RAID status: check via iDRAC/iLO or RAID management software. The array should be Optimal. Any Degraded status means a drive has failed and needs replacing before the next check.
- Event Viewer disk errors: filter the System log for Event ID 7 and 11 — repeated disk I/O errors indicate a failing drive
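The disk space thresholds above can be applied automatically rather than read by eye. The sketch below assumes "at or below" for both thresholds, which is one reasonable reading of the checklist, and uses a hypothetical function name, `Get-DiskSpaceStatus`.

```powershell
# Classify a volume by its percentage of free space.
# Thresholds follow the checklist: Warning at 20% free or less, Critical at 10% or less.
# The function name and the "at or below" comparison are assumptions.
function Get-DiskSpaceStatus {
    param(
        [Parameter(Mandatory)] [double] $FreeBytes,
        [Parameter(Mandatory)] [double] $TotalBytes,
        [double] $WarningPercent = 20,
        [double] $CriticalPercent = 10
    )

    # Guard against drives that report no capacity (for example, an empty optical drive).
    if ($TotalBytes -le 0) { return 'Unknown' }

    $percentFree = ($FreeBytes / $TotalBytes) * 100
    if     ($percentFree -le $CriticalPercent) { 'Critical' }
    elseif ($percentFree -le $WarningPercent)  { 'Warning' }
    else                                       { 'OK' }
}

# Report every file-system drive that has media present.
Get-PSDrive -PSProvider FileSystem |
    Where-Object { ($_.Used + $_.Free) -gt 0 } |
    ForEach-Object {
        $total = $_.Used + $_.Free
        [pscustomobject]@{
            Drive       = $_.Name
            PercentFree = [math]::Round(($_.Free / $total) * 100, 1)
            Status      = Get-DiskSpaceStatus -FreeBytes $_.Free -TotalBytes $total
        }
    }
```

Keeping the classification in its own function means the thresholds can be tightened for critical servers without touching the reporting pipeline.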
Performance
- CPU usage baseline: review average CPU usage over the past week from Performance Monitor or your monitoring tool. Sustained usage above 80% suggests the server is undersized or has a runaway process.
- Memory usage: check available memory — consistent low availability may require a RAM upgrade or application tuning
- Page file activity: check Memory → Pages/sec in Performance Monitor. Regular high paging indicates memory pressure.
- Slow queries or application issues: review application logs for repeated errors or performance timeouts that did not generate a helpdesk ticket
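Where no monitoring tool is in place, a short sample from `Get-Counter` gives a rough spot check of CPU load. This is only a sketch: a one-minute sample is not a substitute for the week-long baseline described above, the function names are hypothetical, and counter paths are localized, so the path shown applies to English-language Windows.

```powershell
# Decide whether a series of CPU samples averages above a threshold.
# Function name and the use of a simple average are illustrative assumptions.
function Test-SustainedHighCpu {
    param(
        [Parameter(Mandatory)] [double[]] $Samples,
        [double] $ThresholdPercent = 80
    )
    $average = ($Samples | Measure-Object -Average).Average
    $average -gt $ThresholdPercent
}

# Collect total CPU usage samples (defaults: 12 samples, 5 seconds apart).
# Windows only; the counter path is the English name.
function Get-CpuSample {
    param(
        [int] $IntervalSeconds = 5,
        [int] $Count = 12
    )
    $sets = Get-Counter -Counter '\Processor(_Total)\% Processor Time' `
                        -SampleInterval $IntervalSeconds -MaxSamples $Count
    # Each sample set holds one sample because a single counter instance was requested.
    $sets | ForEach-Object { $_.CounterSamples[0].CookedValue }
}
```

On the server itself, `Test-SustainedHighCpu -Samples (Get-CpuSample)` returns `True` when the sampled average exceeds 80%. The same `Get-Counter` pattern works for `'\Memory\Available MBytes'` and `'\Memory\Pages/sec'`.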
Security
- Windows Updates: verify the server is up to date with security patches. Run `Get-HotFix | Sort-Object InstalledOn -Descending | Select-Object -First 5` to check the most recent patches.
- Security event log, failed logins: review the Security log for clusters of Event ID 4625 (failed login). Repeated failures for the Administrator account indicate a brute-force attempt.
- Local administrator accounts: confirm no unexpected local admin accounts have been added. Run `Get-LocalGroupMember -Group Administrators`.
- Antivirus definitions: confirm AV definitions are up to date and no threats have been detected
- SSL certificates: check that no certificates expire within 60 days. Run `Get-ChildItem Cert:\LocalMachine\My | Where-Object {$_.NotAfter -lt (Get-Date).AddDays(60)}`
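Spotting clusters of Event ID 4625 is easier when the failures are grouped by account. The sketch below is one possible approach: the function names and the threshold of 10 failures are assumptions, and reading the Security log requires an elevated session.

```powershell
# Return accounts whose failed-logon count meets or exceeds a threshold.
# Function name and the default threshold of 10 are illustrative assumptions.
function Get-FailedLogonCluster {
    param(
        [Parameter(Mandatory)] [AllowEmptyCollection()] [string[]] $AccountNames,
        [int] $Threshold = 10
    )
    $AccountNames |
        Group-Object |
        Where-Object { $_.Count -ge $Threshold } |
        Sort-Object Count -Descending |
        Select-Object @{ n = 'Account'; e = { $_.Name } }, Count
}

# Read the target account name from each Event ID 4625 in the last N days.
# Windows only; run from an elevated PowerShell session.
function Get-FailedLogonAccount {
    param([int] $Days = 30)
    $filter = @{
        LogName   = 'Security'
        Id        = 4625
        StartTime = (Get-Date).AddDays(-$Days)
    }
    # SilentlyContinue suppresses the error Get-WinEvent raises when no events match.
    Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
        ForEach-Object {
            $data = ([xml]$_.ToXml()).Event.EventData.Data
            ($data | Where-Object { $_.Name -eq 'TargetUserName' }).'#text'
        }
}
```

On the server, `Get-FailedLogonCluster -AccountNames @(Get-FailedLogonAccount)` lists the accounts with ten or more failures in the past 30 days. Wrapping the call in `@()` keeps the input an array even when no events are found.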
Backup and Recovery
- Backup job status: confirm the last backup completed successfully — check backup software logs or Event Viewer backup events
- Backup age: ensure backups are current — a “last successful backup” older than 48 hours for a production server is a risk
- Offsite / cloud copy: confirm that an offsite copy of backup data exists (local-only backup does not protect against fire, theft, or ransomware that encrypts the backup destination)
- Restore test: at least once a quarter, restore a single file from backup to confirm the backup is actually usable
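The 48-hour backup age rule is simple date arithmetic, sketched below. How you obtain the last successful backup time depends entirely on your backup software, so the function takes it as a parameter; the function name is hypothetical.

```powershell
# Return $true when the last successful backup is older than the allowed age.
# Function name is an illustrative assumption; 48 hours follows the checklist.
function Test-BackupStale {
    param(
        [Parameter(Mandatory)] [datetime] $LastSuccess,
        [double] $MaxAgeHours = 48,
        [datetime] $Now = (Get-Date)
    )
    ($Now - $LastSuccess).TotalHours -gt $MaxAgeHours
}

# Example with fixed dates: a backup from 72 hours earlier is stale.
Test-BackupStale -LastSuccess ([datetime]'2024-03-01T02:00:00') `
                 -Now         ([datetime]'2024-03-04T02:00:00')
```

The `-Now` parameter exists only so the check can be exercised with fixed dates. In normal use you would pass just `-LastSuccess`, taken from your backup tool's report or from the timestamp of the newest file in the backup destination.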
Services and Applications
- Services check: confirm all critical services are in the expected state: `Get-Service | Where-Object {$_.StartType -eq 'Automatic' -and $_.Status -ne 'Running'}`
- Event Viewer review: check the Administrative Events custom view (all Critical and Error events) for any recurring issues that need attention
- Uptime: review server uptime — any unplanned reboots in the past month? Check Event ID 6008 in the System log.
- Application logs: review logs for the key application the server hosts (SQL Server error log, IIS logs, Exchange health report)
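The services check tends to list a few services that are set to Automatic but stop by design once their work is done. A small wrapper with an exclusion list keeps the monthly report focused on real problems. This is a sketch: the function name is hypothetical, and the excluded service names in the usage note are examples only, so confirm which services are expected to stop on your own server.

```powershell
# Return services set to start automatically that are not running,
# skipping any names the administrator has chosen to exclude.
# Function name is an illustrative assumption.
function Get-UnexpectedStoppedService {
    param(
        [Parameter(Mandatory)] [AllowEmptyCollection()] [object[]] $Services,
        [string[]] $Exclude = @()
    )
    $Services | Where-Object {
        $_.StartType -eq 'Automatic' -and
        $_.Status -ne 'Running' -and
        $_.Name -notin $Exclude
    }
}
```

On the server, run it as `Get-UnexpectedStoppedService -Services (Get-Service) -Exclude 'sppsvc', 'gupdate'`, replacing the example names with your own exclusions. An empty result means every automatic service that should be running is running.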
Documentation
- Update records: note any changes made this month — new roles installed, configuration changes, hardware swaps
- Hardware age: flag any components approaching expected end of life — most enterprise hard drives and SSDs have a 3–5 year warranty; servers are typically replaced on a 5–7 year cycle
- Warranty and support status: check that hardware warranty and OS support contracts are current and not approaching expiry