Leveraging Application Usage Data From Munki

/img/fleet_munki_snowflake.png

Getting Application Usage Data at Scale With Munki

Munki is a suite of tools and a popular application state management solution that Mac Admins across the globe use. Some of Munki's out-of-the-box features solve problems many commercial MDMs still cannot solve to this day. It lets a Mac Admin write some declarative data and have Munki take care of the rest. We use it, in conjunction with AutoPkg, to manage and patch all of our third-party apps.

Many projects or ideas start with a simple statement declaring what you want to accomplish. I feel this particular topic requires one, because I know many organizations will likely abuse application usage data for things it is not meant for.

Abstract:
We want to enable humans to have the best technology available to them whenever they need it. However, we also want to be financially responsible and realize cost savings wherever we can. Application usage data can help bridge this gap. It is my personal belief that the best UX with the best cost savings is to automatically provision licensed software upon request, with few questions asked, and then revoke the license if the app is not used within some arbitrary amount of time. This lets the humans you work with get the software they need automatically, and lets the organization save money when that licensed software is no longer in use.

One of the neat features of Munki is that it can track application usage, and Munki can take action on that data. You can read about it here in the project's wiki page. There is a local SQLite database located at /Library/Managed Installs/application_usage.sqlite, and it contains the data we are looking to ingest at scale into Snowflake!

The schema for this db is:

CREATE TABLE application_usage (
    event TEXT,
    bundle_id TEXT,
    app_version TEXT,
    app_path TEXT,
    last_time INTEGER DEFAULT 0,
    number_times INTEGER DEFAULT 0,
    PRIMARY KEY (event, bundle_id)
)

So, you can just shell into that database on any Mac that has Munki installed and run some basic SQLite commands to look at what it collects. However, no one is going to shell into each individual end user computer and collect this data. It is worth mentioning that MunkiReport is another tool Mac Admins can use, if they choose to, to help centralize Munki data and management.
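If you do want to poke at it on a single test Mac, a quick look goes something like this (run inside the sqlite3 shell; the bundle ID is just an example, and you will likely need sudo to read the file):

-- open the database first, e.g.: sqlite3 "/Library/Managed Installs/application_usage.sqlite"
-- then pull the usage rows for a single app (example bundle ID)
SELECT event, bundle_id, app_version, last_time, number_times
FROM application_usage
WHERE bundle_id = 'com.tinyspeck.slackmacgap'
ORDER BY last_time DESC;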

When you work with data at scale, though, centralizing your data is extremely valuable for many reasons. Here are some of the top reasons an organization would want to do this:

  • Data Governance:
    • Implement RBAC controls on who can access what data, and protect your crown-jewel data
  • Analytics and Intelligence:
    • Build dashboards and data workflows to help automate work and gain insights
  • Data Sharing:
    • Share your data across teams
  • Reduce Noise:
    • Join your data to multiple data sources and apply laser-focused context to your data

I am sure folks have written books about this subject, so I will leave it with those 4 points.

Use FleetDM to Configure ATC in osquery

Osquery has a really neat feature called Automatic Table Construction (or ATC for short), and a good introductory blog post about it can be read here. It can essentially query something like a SQLite database and expose the results as an osquery table. This is the perfect feature for a setup like ours. The first SQLite database I was targeting was, of course, the Munki application usage database.

FleetDM does this a bit differently from the Kolide blog post, but conceptually it is the same in the end. Let's start with our requirements:

  • FleetDM Stack
  • Firehose Kinesis Stream to Snowflake
  • A Munki deployment

More Info:
With anything you roll out, you should take a phased approach. We have set up Teams in FleetDM: a team for just my own team (CPE), a team for our canary groups, and then the concept of "all teams," which we just consider production. We make changes to the CPE team first; then, if nothing breaks, we move the change to the canary groups and monitor it for some arbitrary range of time. Finally, we move it to production. I highly recommend every IT shop have this sort of setup.

FleetDM has the ability to modify the osquery config files per team, which allowed me to set up ATC in a controlled and phased fashion. There are also some caveats: I found out the hard way that if your YAML file in FleetDM is not quite in the proper order, osquery will skip over your decorators. See this issue for more details.

Decorators are important, especially if you ship osquery data from FleetDM downstream to a product like Snowflake. They are included in every query result, so it is a smart idea to put unique strings or IDs in a decorator; that gives you a primary key in your data you can use to join to other data sets.
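As a rough sketch of why that matters, here is the kind of join a decorator enables once the results land in Snowflake (the table and column names other than the decorator are hypothetical, purely for illustration):

-- hypothetical sketch: every result row carries its decorations,
-- so host_uuid can act as a join key to other data sets
select
  r.decorations:host_uuid::string as host_uuid
, d.assigned_owner                                -- hypothetical column
, r.name                                          as query_name
from RAW_OSQUERY_RESULTS r                        -- hypothetical landing table
join DEVICE_OWNERSHIP d                           -- hypothetical data set
  on d.host_uuid = r.decorations:host_uuid::string;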

So, again, please test everything out against a small subset of your systems and use a phased approach before you go all in on any osquery configuration change. Here is a sample of a YAML config you can use to accomplish ATC:

config:
  options:
    pack_delimiter: /
    logger_tls_period: 10
    distributed_plugin: tls
    disable_distributed: false
    logger_tls_endpoint: /api/osquery/log
    distributed_interval: 10
    distributed_tls_max_attempts: 3
  decorators:
    load:
      - SELECT uuid AS host_uuid FROM system_info;
      - SELECT hostname AS hostname FROM system_info;
      - SELECT hardware_serial FROM system_info;
  auto_table_construction:
    munki_app_usage:
      path: /Library/Managed Installs/application_usage.sqlite
      query: >-
        select event, bundle_id, app_version, app_path, last_time, number_times
        from application_usage;        
      columns:
        - event
        - bundle_id
        - app_version
        - app_path
        - last_time
        - number_times
      platform: darwin

Be Specific on Selects

When working with data, you might want to do things like SELECT * FROM FOO_TABLE;, and this might get the exact results you want. This may also work for a very long time. However, it is a best practice to be very specific about the columns you select to avoid breakage downstream. If Munki were to add a new column to its database for any reason, that SELECT * would also include that new column. Now, downstream, any data models or preprocessing will have a new, undefined column to deal with. This could break things, and when it does break, this is likely not the first thing you will think to check. So, just start off by sanitizing your inputs and only selecting the exact data you intend to get downstream.

In semi-structured data like JSON this is much less of a risk as you will likely just be missing a new key.

When you use ATC to create a table, FleetDM won't recognize it as a valid table, and you might see this warning message when trying to run a live query:

/img/munki-app-usage-query.png

If you just ignore this message and YOLO it anyway and hit that query button, you will see you actually do get results:

/img/munki-app-usage-results.png

You might notice that there are several event types Munki collects about an application, along with a couple of counters stored for each event:

  • Activate - tracks when the application becomes the foreground app
  • Quit - tracks when the application is quit
  • Launch - tracks when an application is launched
  • Number Times - the total number of times each event has occurred
  • Last Time - the epoch timestamp of the last time this event occurred

This is actually great and exactly what most organizations need to monitor application usage for its proper intended use: license revocation and cost savings. Trying to monitor application usage for other use cases is a mixed bag at best, and awful at worst.
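As a rough sketch of that intended use, a live query along these lines (the bundle ID and the 90-day window are only examples) surfaces devices where an expensive app has not been in the foreground recently, which is exactly the signal a license reclamation workflow needs:

-- osquery live query against the munki_app_usage ATC table defined above
-- example bundle ID and window; tune both for your environment
SELECT bundle_id, app_version, last_time, number_times
FROM munki_app_usage
WHERE event = 'activate'
  AND bundle_id = 'com.adobe.Photoshop'
  AND last_time < (SELECT unix_time - 7776000 FROM time);  -- roughly 90 days ago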

Modeling the Data in Snowflake

For this, and most of our other osquery data ingests from FleetDM, we typically do a 1:1 data model. Meaning we write a single query in FleetDM that typically serves a single purpose, and we ship the results as a snapshot in time to Snowflake. We worry about transforming the data downstream within Snowflake itself. This allows us to quickly get data pipelines going and then transform the data downstream once it is already in the product. Some of our JSON files can be over 20,000 lines of query results, and the sample I am going to share is about 700 lines of JSON.

An example of query results:

{
  "action": "snapshot",
  "calendarTime": "Tue Mar 26 19:13:20 2024 UTC",
  "counter": 0,
  "decorations": {
    "hardware_serial": "device_serial",
    "host_uuid": "device_uuid",
    "hostname": "host_name"
  },
  "epoch": 0,
  "hostIdentifier": "host-identifier",
  "name": "pack/Global/ingest-munki-app-usage",
  "numerics": false,
  "snapshot": [
    {
      "app_path": "/Applications/Slack.app",
      "app_version": "4.36.140",
      "bundle_id": "com.tinyspeck.slackmacgap",
      "event": "activate",
      "last_time": "1710526854",
      "number_times": "144",
      "path": "/Library/Managed Installs/application_usage.sqlite"
    },
    {
      "app_path": "/Applications/Slack.app",
      "app_version": "4.36.140",
      "bundle_id": "com.tinyspeck.slackmacgap",
      "event": "launch",
      "last_time": "1710169315",
      "number_times": "1",
      "path": "/Library/Managed Installs/application_usage.sqlite"
    },

I’ve truncated the results above, but you might notice I just grabbed 2 events for the Slack.app. This shows the activate event and the launch event. Now just imagine an entry like this for every application used on an end user’s device.

In a previous blog post I covered some data modeling in Snowflake with FleetDM data at a high level. We will still use those same high level concepts here. We can create a view by running the following SQL:

create or replace view DB.SCHEMA.OSQR_MUNKI_APP_USAGE_V(
	SERIAL_NUMBER,
	HOST_NAME,
	EVENT_TIMESTAMP,
	HOST_UUID,
	APP_PATH,
	APP_VERSION,
	BUNDLE_ID,
	EVENT_TYPE,
	LAST_TIME,
	NUMBER_TIMES,
	SOURCE_PATH
) copy grants as (
select
  serial_number
, host_name
, event_timestamp
, host_uuid
, f.value['app_path']::string as app_path
, f.value['app_version']::string as app_version
, f.value['bundle_id']::string as bundle_id
, f.value['event']::string as event_type
, f.value['last_time']::number as last_time
, f.value['number_times']::number as number_times
, f.value['path']::string as source_path
from DB.SCHEMA.FLEET_OSQUERY_RESULTS
, lateral flatten(input => snapshot) as f
where name = 'pack/Global/ingest-munki-app-usage'
);

Our data is structured so that the decorators we put in the FleetDM configs for osquery are always split out into their own columns from the raw JSON data, along with some other commonly used keys, like the query name, since that is something we will filter on quite often. I am also keeping the schema intact from the original source, the SQLite database that Munki generates. This is really for data fidelity reasons: if anyone ever has to reverse all the work I have done, it just makes the most sense to copy the schema data types from the source into the model itself. So, I keep that 1:1 relationship not only with the query results, but with the data types defined by the schema as well. Another big bonus of mimicking the schema from the source is that I can produce the same results from the base data model as I could locally using the SQLite shell on a macOS device running Munki.
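For example, one spot check I like is reproducing what the local sqlite3 shell would show for a single device, straight from the view (the serial number below is a placeholder, and this assumes each ingest shares one event_timestamp):

-- mirror the local SQLite query for one device's most recent snapshot
select event, bundle_id, app_version, app_path, last_time, number_times
from DB.SCHEMA.OSQR_MUNKI_APP_USAGE_V
where serial_number = 'C02PLACEHOLDER'                 -- placeholder serial
qualify event_timestamp = max(event_timestamp) over () -- keep only the latest snapshot
order by bundle_id, event;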

Let's Test Drive the Data Model!

Now we just need to test our data model in Snowflake and validate that it is good to data share and use in dashboards and data applications. First, let's just run a basic query against our data model and see what we get.

My favorite app in the world, Excel!

/img/excel_usage.png

Now with a bit more filtering and more specific results for Photoshop!

/img/app_usage_photoshop.png
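A query along these lines would produce results like the Photoshop view above; this is only a sketch, and the bundle ID and column choices are illustrative:

-- illustrative: last foreground use of Photoshop per device
select
  serial_number
, host_name
, max(app_version)             as app_version
, to_timestamp(max(last_time)) as last_activated
, max(number_times)            as total_activations
from DB.SCHEMA.OSQR_MUNKI_APP_USAGE_V
where bundle_id = 'com.adobe.Photoshop'
  and event_type = 'activate'
group by serial_number, host_name
order by last_activated desc;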

Conclusion

Application usage data is a fantastic way to track when expensive apps are getting used (or not getting used), track active license usage, help procurement with true-up renewals, and just build a good general cost savings model. IT is sometimes viewed solely as an operational cost center, but now you can turn IT into a service that provides cost savings by leveraging good IT and automation tools and shipping your data to a platform like Snowflake. I bet we will use this model for true-up renewals moving forward. This is also a good way for IT to capture what applications they likely need to offer in their self-service flows. If you have a substantial number of end users constantly using an app, perhaps that is an indicator it should be packaged up, deployed, and patched/updated by IT.

There are many good use cases for this, and those use cases can be tied to cost savings for the org and quality of life upgrades for your end users. Just remember: application usage data is not meant to track productivity, or anything adjacent to or even near that. You should not be shocked that the majority of your users will spend the majority of their time in the web browser. I highly discourage any use of application usage tracking for anything other than quality of life improvements and cost savings initiatives (or things closely adjacent to these).