Building an end-to-end monitoring solution with Azure Arc, Log Analytics and Workbooks–Part 1: Overview & Architecture

On-premises VMs don't disappear just because you are working on a cloud strategy. We are running a lot of Windows workloads on-prem — application pools, Windows services, scheduled tasks — and still need visibility into whether they're healthy.

Traditional on-prem monitoring solutions could work, but they come with their own operational overhead and are directly tied to our on-premises infrastructure. When an incident happens, we don't want to context-switch between our cloud monitoring stack and our on-prem monitoring stack. It's not ideal.

We wanted a single, cloud-native view into the health of our on-prem workloads without having to lift and shift them into Azure. Azure Arc made this possible by extending Azure's management plane to our on-premises infrastructure. By combining Arc with Log Analytics and Workbooks, we built a unified health dashboard that sits alongside our cloud monitoring, uses the same query language (KQL), and requires no additional on-prem infrastructure.

The problem we were solving

Before we built this, our health monitoring looked like this: developers would RDP into individual servers, open Services.msc or IIS Manager, and manually check whether critical components were running. For scheduled tasks, they'd dig through Task Scheduler on each box. This worked at small scale, but it didn't scale operationally. It also required giving developers direct administrator access to these machines, which is a no-go from a security perspective.

We needed:

  • A single dashboard showing the health of all monitored components across all servers
  • The ability to filter and drill down by server, component type, or status
  • Historical visibility to spot patterns (is this service flapping? did this task fail last night too?)
  • Integration with Azure's alerting stack so we could route notifications to teams already using Azure Monitor

The solution had to be low-friction. We didn't want to deploy another monitoring appliance on-prem, and we didn't want to maintain custom scripts running on every VM that would inevitably drift over time.

The 3 puzzle pieces

Azure Arc is the bridge. It extends Azure's management plane to resources that live outside of Azure — in our case, on-premises Windows VMs. Once a machine is Arc-enabled, Azure treats it as a first-class managed resource. You can see it in the portal, tag it, assign policies to it, and — crucially for our use case — deploy the Azure Monitor Agent to it.

Arc doesn't move your workloads. The VMs stay exactly where they are, on-prem, running the same services and tasks they always have. What Arc does is project them into Azure's control plane so you can manage them using the same tools and patterns you'd use for cloud VMs. This is what lets us push the Azure Monitor Agent to on-prem machines and pull telemetry back into Azure without punching permanent inbound firewall holes or setting up VPN tunnels just for monitoring.
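
To make the projection concrete: once a machine is Arc-enabled it shows up as a microsoft.hybridcompute/machines resource, and you can list your fleet with an Azure Resource Graph query (Resource Graph uses the same KQL). A minimal sketch; the projected properties are just the ones we found useful:

// List Arc-enabled machines and their connectivity status (Azure Resource Graph)
resources
| where type == "microsoft.hybridcompute/machines"
| project name, resourceGroup, location,
    status = tostring(properties.status),
    agentVersion = tostring(properties.agentVersion)
| order by name asc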

Log Analytics Workspace is where the data lives. Azure Monitor Agent ships telemetry to Log Analytics, and we defined a custom table to store the health state of application pools, Windows services, and scheduled tasks. Using a custom table gave us full control over the schema — we weren't constrained by what the built-in Windows event logs expose out of the box.


The built-in Event table in Log Analytics contains Windows event log data, and you can extract service state changes from it. But that data is verbose, semi-structured XML, and every query involves parsing and filtering through thousands of events to find the handful that matter. By creating a custom table, we store pre-structured health snapshots — one row per component, collected at a regular interval — which makes queries fast and simple.
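
To make that difference concrete, here is a sketch of both approaches. The custom table name (ComponentHealth_CL) and its columns are placeholders for illustration; the actual schema is covered in the next posts.

// Latest known state per component over the last hour, from an illustrative
// custom table (custom table names always end in _CL)
ComponentHealth_CL
| where TimeGenerated > ago(1h)
| summarize arg_max(TimeGenerated, Status) by Computer, ComponentType, ComponentName
| where Status != "Running"

// Compare that with digging service state changes out of the built-in Event table
// (requires the System event log to be collected)
Event
| where EventLog == "System" and Source == "Service Control Manager" and EventID == 7036
| parse RenderedDescription with "The " ServiceName " service entered the " State " state" *
| project TimeGenerated, Computer, ServiceName, State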

Azure Workbooks is where the data becomes actionable. Workbooks let us build interactive dashboards backed by KQL queries against our custom table. The result is a live, filterable view of what's running, what's stopped, and what's failing — across all of our on-prem VMs. Workbooks support parameterization (filter by server name, component type, time range), conditional formatting (red for failed, green for running), and can be shared with the team or embedded in operational runbooks.
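
As an illustration, the grid behind such a view can be backed by a query like the sketch below, again using the hypothetical ComponentHealth_CL table and assuming workbook dropdown parameters named Computers and ComponentTypes. The time range comes from the workbook's time range parameter, and conditional formatting is applied to the Unhealthy column.

// Per-server health summary, filtered by the workbook's dropdown parameters
ComponentHealth_CL
| where Computer in ({Computers}) and ComponentType in ({ComponentTypes})
| summarize arg_max(TimeGenerated, Status) by Computer, ComponentType, ComponentName
| summarize Healthy = countif(Status == "Running"),
            Unhealthy = countif(Status != "Running") by Computer
| order by Unhealthy desc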


Why we chose this setup

No inbound connectivity required. The Azure Monitor Agent initiates outbound HTTPS connections to Azure. Our on-prem VMs don't need to accept any inbound traffic from the cloud, which keeps the security team happy and the firewall rules simple.

Separation of concerns. The Data Collection Rule is a declarative configuration artifact — it's version-controllable and can be applied consistently across our servers. The custom table schema is independent of the DCR, so the data model can evolve without re-deploying agents. The Workbook queries the table and knows nothing about how the data got there. Each layer has a clear job.

Scalability. The Azure Monitor Agent is designed to handle large-scale telemetry collection. Log Analytics can ingest and query data from thousands of machines. Workbooks render queries in near-real-time. This architecture doesn't hit a ceiling at 10 servers or 50 servers — it scales to hundreds without architectural changes.

Cost control. We only pay for Log Analytics ingestion and retention, and the cost scales with data volume. Because we're collecting structured health snapshots at a defined interval (not high-frequency metrics or verbose logs), the data volume is predictable and relatively low. A typical setup monitoring dozens of components across a fleet of VMs generates a few megabytes per day.
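
Once data is flowing, the built-in Usage table is a quick way to sanity-check that assumption; the query below shows billable ingestion per table over the last 30 days, and the custom table appears as its own DataType.

// Billable ingestion volume (in MB) per table over the last 30 days
Usage
| where TimeGenerated > ago(30d) and IsBillable == true
| summarize IngestedMB = sum(Quantity) by DataType
| order by IngestedMB desc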

What this doesn’t replace

This solution is purpose-built for health monitoring of known components — application pools, services, and scheduled tasks. It's not a replacement for comprehensive infrastructure monitoring, APM, or log aggregation. We still want traditional monitoring for CPU, memory, disk, and network performance, as well as application-level telemetry for request tracing and error rates, but those are already covered by combining Azure Arc with Azure Monitor and Application Insights.

What’s next?

In the next post, we look at how we generate the required information on the VMs and capture it with a Data Collection Rule targeting our Azure Arc-enabled machines.

We'll keep you posted!

More information

Collect data from virtual machine client with Azure Monitor - Azure Monitor | Microsoft Learn

Overview of Log Analytics in Azure Monitor - Azure Monitor | Microsoft Learn

Azure Workbooks overview - Azure Monitor | Microsoft Learn
