SOP: Network Troubleshooting & pfSense Monitoring 251130

Source: Lawrence Systems: pfSense Packet Loss and Latency Monitoring Guide

Purpose: To standardize the diagnosis of intermittent internet connectivity issues using the "Fault Isolation" methodology and pfSense Gateway Monitoring tools.
Target Audience: IT Support, Network Administrators, and Technical Staff.

Part 1: The Methodology (Fault Isolation)

The goal of troubleshooting is not just to fix the problem, but to prove where the failure lies. We use the Process of Elimination to isolate variables in the connection chain.

The Connection Chain

Visualize the path data takes from the user to the internet. A failure at any point breaks the chain.

Isolation Logic

  • Isolate the Device: If only one user drops, the issue is at Node A (the user's device).
  • Isolate the Local Network: If all users drop, but pfSense can still ping the modem, the issue is at Node B (the local network).
  • Isolate the ISP: If pfSense cannot reach the Public Internet (Node F) despite a valid link to the Modem (Node D), the issue is likely Node E (The ISP).
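
The three rules above can be written down as a small decision function. This is a minimal sketch, not part of pfSense: the function name, its parameters, and the order in which the rules are checked (ISP before local network when all users are down) are assumptions made for illustration.

```python
def isolate_fault(affected_users: int, total_users: int,
                  modem_reachable: bool, internet_reachable: bool) -> str:
    """Map basic observations to the most likely failing node.

    modem_reachable / internet_reachable describe what pfSense itself
    can ping (hypothetical inputs gathered by the technician).
    """
    # Rule 1: a single affected user points at that user's device.
    if affected_users == 1 and total_users > 1:
        return "Node A (user device)"

    # Rules 2 and 3 only apply when everyone is affected and the
    # pfSense-to-modem link is confirmed working.
    if affected_users == total_users and modem_reachable:
        if not internet_reachable:
            return "Node E (ISP)"
        return "Node B (local network)"

    # Anything else is not covered by the three rules.
    return "Undetermined: gather more data"


print(isolate_fault(1, 20, True, True))    # Node A (user device)
print(isolate_fault(20, 20, True, False))  # Node E (ISP)
```

Cases the rules do not cover (for example, several but not all users affected) deliberately fall through to "Undetermined" instead of guessing.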

Part 2: Configuring pfSense for Accurate Monitoring

By default, pfSense monitors the Gateway IP (usually the ISP's local modem or first hop). Determine whether this is the right target for your equipment setup.

Objective: Ensure we are monitoring the Internet, not just the local modem.

Step 1: Verify Equipment Mode & Monitor IP

  1. Navigate to System > Routing > Gateways.
  2. Click the Edit (Pencil) icon next to the primary WAN gateway (e.g., WAN_DHCP).
  3. Check the Monitor IP field.

Scenario A: Modem is in Bridge Mode (Public IP on pfSense)

If your modem is in Bridge Mode, the Gateway IP is usually the ISP's first hop on their network.

  • Verdict: The default setting is usually adequate, but switching to a public DNS server IP (Step 2) is still recommended because it tests reachability beyond the ISP's first hop.

Scenario B: Modem acts as Router (Private IP on pfSense)

If your pfSense WAN has a private IP (e.g., 192.168.x.x), the default Gateway is just your local modem.

  • Verdict: You MUST change the Monitor IP. Monitoring the default gateway only confirms that the link between pfSense and the modem is up. It tells you nothing about the actual internet connection.
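
If you are unsure which scenario applies, the WAN address itself gives the answer. The sketch below uses Python's standard ipaddress module to check the WAN IP against the RFC 1918 private ranges. The function name is hypothetical, and carrier-grade NAT addresses (100.64.0.0/10) are not handled here; treat those as Scenario B as well.

```python
import ipaddress

# RFC 1918 private ranges: a WAN address in one of these means the
# modem is routing/NATing (Scenario B).
RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def wan_scenario(wan_ip: str) -> str:
    """Return "A" (bridge mode, public IP) or "B" (modem acts as router)."""
    addr = ipaddress.ip_address(wan_ip)
    if any(addr in net for net in RFC1918):
        return "B"
    return "A"


print(wan_scenario("192.168.1.2"))   # B
print(wan_scenario("203.0.113.10"))  # A
```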

Step 2: Set an Off-Premise Monitor IP

To test the actual internet connection, set the Monitor IP to a stable, off-premise target:

  • 1.1.1.1 (Cloudflare DNS)
  • 8.8.8.8 (Google DNS)
  • 208.67.222.222 (OpenDNS)
  • Corporate Option: The IP of the company VPN or Relay Server.

Click Save and Apply Changes.

Step 3: Tune Latency Thresholds (Optional)

If the WAN is a high-latency connection (e.g., Starlink or other satellite, LTE) or the gateway raises false alarms:

  1. In the Gateway Edit screen, click Display Advanced.
  2. Adjust Latency Thresholds (in ms): the lower value puts the gateway in a warning state; the upper value marks it as down.
  3. Adjust Packet Loss Thresholds (in percent): the same lower/upper logic applies.
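
To see how a lower/upper threshold pair changes the outcome, the sketch below classifies a single measurement. It is an illustration of the threshold idea only, not pfSense's actual monitoring code. The default values (200/500 ms and 10/20 %) are assumed to match a stock install, so confirm them on the Gateway Edit screen; the use of "greater than or equal" at the boundary is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Assumed defaults; verify against your own Gateway Edit screen.
    latency_low_ms: float = 200.0
    latency_high_ms: float = 500.0
    loss_low_pct: float = 10.0
    loss_high_pct: float = 20.0


def gateway_status(latency_ms: float, loss_pct: float,
                   t: Thresholds = Thresholds()) -> str:
    """Classify one measurement as "online", "warning", or "down"."""
    if latency_ms >= t.latency_high_ms or loss_pct >= t.loss_high_pct:
        return "down"
    if latency_ms >= t.latency_low_ms or loss_pct >= t.loss_low_pct:
        return "warning"
    return "online"


# A 250 ms reading trips the assumed default warning level...
print(gateway_status(250, 0))  # warning

# ...but not a relaxed profile for a high-latency link (example values).
relaxed = Thresholds(latency_low_ms=300, latency_high_ms=800)
print(gateway_status(250, 0, relaxed))  # online
```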

Part 3: Visualizing "Intermittent" Issues (RRD Graphs)

Intermittent issues are difficult to catch in real-time. pfSense RRD (Round-Robin Database) graphs provide historical evidence.

Accessing the Quality Graph

  1. Navigate to Status > Monitoring.
  2. Click the Wrench Icon (View Settings).
  3. Category: Select Quality.
  4. Graph: Select the WAN gateway (e.g., WAN_DHCP).
  5. Time Period: Select 1 Day (for immediate issues) or 1 Month (for pattern analysis).
  6. Click Save View to make this your default if desired.

Interpreting the Data

  • Packet Loss (Red Bars): Vertical red bars mark probes that never received a reply. Red bars that appear while traffic and CPU load are low usually point to a physical line fault or an ISP failure (see Part 4).
  • Latency/Delay (Blue Line): The time it takes for a ping to return.
  • Standard Deviation (Jitter): How much the latency varies. High jitter degrades voice and video calls even when average latency looks acceptable.
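
The three metrics are simple statistics over a series of ping results. The sketch below shows how they relate, assuming a list of round-trip times in milliseconds where None stands for a probe that got no reply. The function name and sample values are made up for illustration.

```python
import statistics


def summarize(samples):
    """Compute loss, average latency, and jitter from ping results.

    samples: list of round-trip times in ms; None means the probe was lost.
    """
    replies = [s for s in samples if s is not None]
    lost = len(samples) - len(replies)
    return {
        # Share of probes that never came back (the red bars).
        "loss_pct": 100.0 * lost / len(samples) if samples else 0.0,
        # Mean round-trip time of the replies (the latency line).
        "avg_ms": statistics.fmean(replies) if replies else None,
        # Population standard deviation of the replies (jitter).
        "jitter_ms": statistics.pstdev(replies) if replies else None,
    }


result = summarize([20, 22, None, 18, 20])
print(result["loss_pct"], result["avg_ms"])  # 20.0 20.0
```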

Part 4: Root Cause Analysis (Correlation)

To prove the cause, we overlay different metrics to see what else was happening on the firewall during the spike.

  1. In Status > Monitoring, click the Wrench Icon.
  2. Left Axis: Set to Quality (Packet Loss/Delay).
  3. Right Axis: Select a correlation metric (see below).
  4. Update Graph.

Correlation Scenarios

Check 1: Bandwidth Saturation

  • Right Axis: Traffic (WAN Throughput)
  • Analysis: If Latency spikes exactly when Traffic is high, the WAN link is saturated.
  • Action: Upgrade bandwidth or implement Traffic Shaping (QoS).

Check 2: CPU/System Overload

  • Right Axis: System > Processor
  • Analysis: If Packet Loss correlates with 100% CPU usage, the firewall hardware is the bottleneck, not the ISP.

Check 3: VPN Usage

  • Right Axis: OpenVPN or WireGuard > Users (or Traffic)
  • Analysis: If instability begins exactly when remote users connect, the VPN encryption overhead may be stressing the CPU or saturating the upload speed.

Check 4: The "Clean" Failure (ISP Fault)

  • Analysis: If Packet Loss (Red Bars) occurs when Traffic is flat/low and CPU is idle, the issue is external.
  • Action: Contact ISP. (See Part 5).
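
The four checks amount to a short elimination routine. The sketch below is one way to write it down, assuming you have read WAN utilization, CPU load, and VPN user count off the graph at the time of the spike. The function name and the cutoffs (90 % utilization, 95 % CPU) are hypothetical and should be adjusted to your environment.

```python
def suspects(spike: bool, wan_util_pct: float, cpu_pct: float,
             vpn_users: int) -> list:
    """List likely causes of a loss/latency spike, per Checks 1-4."""
    if not spike:
        return []

    found = []
    if wan_util_pct >= 90:       # Check 1 (assumed cutoff)
        found.append("bandwidth saturation")
    if cpu_pct >= 95:            # Check 2 (assumed cutoff)
        found.append("firewall CPU overload")
    # Check 3: VPN load only matters as a driver of Check 1 or 2.
    if vpn_users > 0 and found:
        found.append("VPN load (possible driver)")

    # Check 4: nothing local explains the spike, so look outside.
    return found or ["external (ISP) fault"]


print(suspects(True, 95, 20, 0))  # ['bandwidth saturation']
print(suspects(True, 5, 3, 0))    # ['external (ISP) fault']
```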

Part 5: Evidence Gathering & Reporting

ISPs often dismiss intermittent complaints. Raw data logs make the complaint harder to dismiss and give the ISP grounds to escalate.

Exporting Data to CSV

  1. In Status > Monitoring, load the view showing the issue (e.g., "1 Month Quality" or "3 Month View").
  2. Click the Export Button (Arrow pointing into a box) below the graph.
  3. Save the .csv file.

Visualizing in LibreOffice Calc / Excel

  1. Open the CSV file.
  2. Select the Timestamp column and the Packet Loss column.
  3. Insert a Line Chart.
  4. Highlight the outages.
    • Example: "Connection drops daily between 14:00 and 16:00."
  5. Save the chart as a PDF and attach it to the ISP Support Ticket.
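
For long exports, finding the outage windows by eye is tedious. The sketch below scans a CSV for consecutive rows with packet loss and reports each window's start and end. The column names (timestamp, packet_loss_pct), the ISO timestamp format, and the sample rows are all assumptions for illustration; open your actual export first and adjust the names and parsing to match.

```python
import csv
import io
from datetime import datetime

# Made-up sample data in an assumed layout, not a real pfSense export.
SAMPLE = """timestamp,packet_loss_pct
2025-01-01T13:55:00,0
2025-01-01T14:00:00,12.5
2025-01-01T14:05:00,30
2025-01-01T14:10:00,0
2025-01-01T15:00:00,8
"""


def outage_windows(csv_text, time_col="timestamp",
                   loss_col="packet_loss_pct", min_loss=1.0):
    """Return (start, end) pairs for runs of rows with loss >= min_loss."""
    windows = []
    start = last = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        ts = datetime.fromisoformat(row[time_col])
        if float(row[loss_col]) >= min_loss:
            if start is None:
                start = ts          # a new outage window opens
            last = ts
        elif start is not None:
            windows.append((start, last))  # window closed by a clean row
            start = None
    if start is not None:           # file ended during an outage
        windows.append((start, last))
    return windows


for begin, end in outage_windows(SAMPLE):
    print(begin.isoformat(), "to", end.isoformat())
```

The resulting windows can be pasted directly into the ISP ticket alongside the chart.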

Note: When submitting this data to an ISP, explicitly state: "I have isolated the issue to the modem/street level. My internal firewall logs show packet loss to [your Monitor IP, e.g., 8.8.8.8] occurring during periods of low bandwidth usage, ruling out local congestion."

Part 6: Multi-WAN Performance Comparison

In environments with multiple gateways (Load Balancing or Failover), comparing their performance side by side is critical for distinguishing shared hardware failures (e.g., the firewall itself) from failures specific to one ISP.

Configuring the Comparative Graph

  1. Navigate to Status > Monitoring.
  2. Click the Wrench Icon (Settings).
  3. Configure the axes to display two ISPs at once:
    • Left Axis:
      • Category: Quality
      • Graph: WAN_DHCP (Primary ISP gateway)
    • Right Axis:
      • Category: Quality
      • Graph: OPT1 or WAN2 gateway (Secondary ISP)
  4. Click Update Graph.

Interpreting Comparative Data

This view allows you to see if an outage is Global (Router/Power issue) or Isolated (ISP issue).

Graph Observation | Diagnosis
Only Primary ISP shows Packet Loss | Isolated ISP Failure. The issue is specific to the Primary ISP line. The firewall hardware is functioning correctly because the Secondary ISP is clear.
Both ISPs show Packet Loss simultaneously | Global Hardware Failure. If two independent ISPs fail at the exact same second, the issue is likely the pfSense hardware (CPU overload), a shared switch, or a power fluctuation.
Secondary ISP shows high Latency | Backup Quality Check. Ensure your backup line is actually viable. High latency on a backup line might mean it is unsuitable for failover.
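
The first two rows of the table can be checked programmatically once both gateways' packet loss series are exported. The sketch below assumes two lists of loss percentages sampled at the same intervals; the function name and the 1 % cutoff are hypothetical. It does not cover the latency check on the backup line.

```python
def compare_wans(primary_loss, secondary_loss, min_loss=1.0):
    """Classify loss as global or isolated from two aligned loss series."""
    pairs = list(zip(primary_loss, secondary_loss))

    # Loss on both lines in the same interval suggests a shared cause.
    if any(p >= min_loss and s >= min_loss for p, s in pairs):
        return "global: shared hardware or power suspected"

    primary_bad = any(p >= min_loss for p, _ in pairs)
    secondary_bad = any(s >= min_loss for _, s in pairs)
    if primary_bad and secondary_bad:
        return "isolated: both ISPs fail at different times"
    if primary_bad:
        return "isolated: primary ISP"
    if secondary_bad:
        return "isolated: secondary ISP"
    return "no loss observed"


print(compare_wans([0, 15, 20, 0], [0, 0, 0, 0]))  # isolated: primary ISP
print(compare_wans([0, 15, 0], [0, 12, 0]))        # global: shared hardware or power suspected
```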