Hardening Palantir’s Kubernetes Infrastructure with Cilium

Published in Palantir Blog, May 6, 2021

Containerized infrastructure has become an industry-wide trend as engineering teams lean on the likes of Docker or Kubernetes to manage, deploy, and scale their environments; here, Palantir is no exception. We built Rubix, Palantir’s Kubernetes infrastructure, with two primary goals in mind: streamlining and scaling the deployment of our software platforms and strengthening our security posture.

In this blog post, Palantir’s Information Security (InfoSec) team shares our recent experience using Cilium, an open-source project by Isovalent dedicated to securing container-based infrastructure, enabling visibility & controls preferable to those of a traditional firewall. Using Cilium as a basis for network controls, Palantir’s engineers have made material improvements to the network rules regulating Rubix’s traffic. Likewise, Palantir’s Computer Incident Response Team (CIRT) has used Cilium’s logging as a valuable source of telemetry when authoring alerting & detection strategies that target theoretical attack primitives applicable to Rubix’s infrastructure. In this article, we would like to:

  • Walk through the Cilium functionality that allowed Palantir to further harden Rubix’s attack surface.
  • Address the areas of Cilium’s logging telemetry that we’ve found valuable for alerting & detection.

Shortcomings with traditional security tooling & controls

Traditional “legacy” InfoSec controls & tooling (which were largely designed with static infrastructure in mind) can have shortcomings when working with a container attack surface, especially when that attack surface is largely ephemeral, or is part of a broader ecosystem of microservices in a SaaS environment.

Network tooling & controls

A common shortcoming of traditional network security tooling is a relative lack of support for granular networking controls covering the varying endpoints, microservices, and resources in an environment composed of ephemeral containers. A traditional Layer 3/4 firewall can regulate ingress & egress on the basis of port & destination, but such a rule is indiscriminate: it blocks (or allows) all traffic that matches its parameters. Some networking solutions build out their feature set with controls that go beyond basic port / IP ingress & egress (e.g., controls by domain, byte count, time of day, or IP reputation), but such controls still tend to operate on identifiers such as a hostname, domain, or IP address.

In a containerized ecosystem, where most interaction occurs between ephemeral microservices and API resources, it is difficult to maintain a stable notion of an endpoint’s or resource’s identity using only traditional TCP/IP identifiers.

Endpoint tooling & controls

Another common shortcoming involves relying on telemetry from Endpoint Detection & Response (EDR) sources (See: osquery Across the Enterprise) that treat containerized infrastructure like any traditional static endpoint, anchoring each endpoint’s identity to an IP address or a hostname. Each time an ephemeral resource is spun up, it must have an EDR agent installed, and that agent must either check in to a central server or log its telemetry directly to a logging destination. Once the agent is reporting successfully, it will only continue to do so for as long as the resource exists, which could be a matter of seconds. Suppose the service responsible for that endpoint spins up a new resource every minute or so: the result is dozens of endpoints over the course of an hour, all belonging to the same service and all doing the same job. Yet EDR telemetry would treat each one as a discrete entity (with its own ID, hostname, etc.), scattering one service’s activity across dozens of short-lived records.

Containerized network security controls with Cilium

Cilium’s concept of a resource within a given environment is based on a service, pod, or container identity, which allows for persistent visibility and controls around a given resource (or family of resources) even when that resource exists only temporarily or has a sporadic lifecycle. The Kubernetes labels attached to each resource known to Cilium enrich that resource’s identity with context about its role and function. Compared to operating on a hostname or IP address, working with verbose labels and contextual metadata makes an analyst’s job considerably easier when navigating the nuances of a containerized environment.
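
To make the idea of a label-derived identity concrete, here is a minimal Python sketch. It is a toy stand-in rather than Cilium’s actual identity allocation: the `identity_for` helper, the hash-based scheme, and the pod data are all assumptions made for illustration. The point is only that two short-lived pods with different IP addresses but the same labels resolve to the same identity.

```python
import hashlib


def identity_for(labels: dict[str, str]) -> str:
    """Derive a stable identity from a label set.

    Toy stand-in: hashes the sorted labels. This is NOT Cilium's algorithm.
    """
    canonical = ";".join(f"{key}={value}" for key, value in sorted(labels.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:8]


# Hypothetical pods: two replicas of one service plus a pod with another role.
pod_1 = {"ip": "10.0.1.17", "labels": {"app": "billing", "role": "backend"}}
pod_2 = {"ip": "10.0.4.92", "labels": {"role": "backend", "app": "billing"}}
pod_3 = {"ip": "10.0.4.93", "labels": {"app": "billing", "role": "frontend"}}

# Same labels give the same identity, regardless of IP or label order.
same_service = identity_for(pod_1["labels"]) == identity_for(pod_2["labels"])
# A different role yields a different identity.
different_role = identity_for(pod_1["labels"]) != identity_for(pod_3["labels"])
print(same_service, different_role)
```

A policy or a detection keyed on such an identity keeps working as replicas come and go, which is what an IP-keyed rule cannot offer.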

To comply with Palantir’s security controls, Cilium must be deployed to each resource in our Rubix environment. Once deployed, Cilium operates on the rules established in its Network Policy. Cilium network policies are specified as rules that dictate the connections and flows permitted between resources in a given environment. Network Policy rules operate at Layer 3, Layer 4, or Layer 7, and each rule establishes controls based on the Kubernetes identifiers associated with a given resource. For example:

  • Microservice A should only be able to make requests against API Resource X and API Resource Y, provided those requests match Z Criteria.
  • Only pods with the label Purpose A can resolve Domain B.
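
The second rule above can be sketched as a default-deny lookup keyed on pod labels rather than on addresses. This is a simplified model of the idea, not Cilium’s policy engine; the `DNS_ALLOW` table, the `purpose` label, and the domain name are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical allowlist: (label selector, DNS name patterns those pods may resolve).
DNS_ALLOW = [
    ({"purpose": "a"}, ["domain-b.example.com"]),
]


def may_resolve(pod_labels: dict[str, str], qname: str) -> bool:
    """Return True only if some selector matches the pod AND the name is listed."""
    # Normalize: DNS names are case-insensitive and may carry a trailing dot.
    name = qname.rstrip(".").lower()
    for selector, patterns in DNS_ALLOW:
        selected = all(pod_labels.get(k) == v for k, v in selector.items())
        if selected and any(fnmatchcase(name, pattern) for pattern in patterns):
            return True
    return False  # default deny: anything not explicitly allowed is blocked


print(may_resolve({"purpose": "a", "app": "reports"}, "Domain-B.example.com."))
print(may_resolve({"purpose": "c"}, "domain-b.example.com"))
```

Because the selector matches on labels, a replacement pod with a new IP address is covered by the same rule the moment it carries the `purpose` label.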

Policy examples

Let’s look at two examples (sourced from Isovalent’s Network Policy documents) of typical Cilium rules for Kubernetes resources. First, the Layer 3/4 example demonstrates Cilium’s understanding of an endpoint’s “role” to control traffic.

Labels-dependent Layer 4 rule [Source]
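
As a rough illustration of a rule of this kind, the sketch below writes a label-based Layer 4 policy as a Python dictionary and evaluates it with a toy matcher. The field names follow the CiliumNetworkPolicy schema as we understand it, and the `role` labels and port are illustrative assumptions; the `selects` and `ingress_allowed` helpers are hypothetical and exist only to show how such a rule is read.

```python
# Dictionary form of a CiliumNetworkPolicy-style L4 rule (labels/port are illustrative).
L4_POLICY = {
    "apiVersion": "cilium.io/v2",
    "kind": "CiliumNetworkPolicy",
    "metadata": {"name": "l4-rule"},
    "spec": {
        "endpointSelector": {"matchLabels": {"role": "backend"}},
        "ingress": [{
            "fromEndpoints": [{"matchLabels": {"role": "frontend"}}],
            "toPorts": [{"ports": [{"port": "80", "protocol": "TCP"}]}],
        }],
    },
}


def selects(selector, labels):
    """True if every label required by the selector is present on the endpoint."""
    return all(labels.get(k) == v for k, v in selector["matchLabels"].items())


def ingress_allowed(policy, src_labels, dst_labels, port, protocol="TCP"):
    """Toy evaluation: None if the policy does not select the destination."""
    spec = policy["spec"]
    if not selects(spec["endpointSelector"], dst_labels):
        return None
    for rule in spec["ingress"]:
        from_ok = any(selects(s, src_labels) for s in rule["fromEndpoints"])
        port_ok = any(
            p["port"] == str(port) and p["protocol"] == protocol
            for to_port in rule["toPorts"]
            for p in to_port["ports"]
        )
        if from_ok and port_ok:
            return True
    return False


print(ingress_allowed(L4_POLICY, {"role": "frontend"}, {"role": "backend"}, 80))
print(ingress_allowed(L4_POLICY, {"role": "batch"}, {"role": "backend"}, 80))
```

Note that no IP address appears anywhere in the rule: both sides of the connection are described purely by role.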

In addition to Layer 3/4 controls, Cilium introduces the notion of Layer 7 rules, which lets an engineer regulate characteristics of network traffic at the application layer. With Layer 7 controls, engineers can take their knowledge of how endpoint & API resources should interact with one another in a containerized environment, and can then orient their network controls to prohibit deviations from that baseline. The following rule is one example of such controls, and restricts endpoints with a designated label from receiving traffic on any port other than the value provided, then further restricts the aforementioned activity to specific API traffic with the correct HTTP header:

All GET /path1 and PUT /path2 when header set [Source]
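
The logic of a rule like this can be sketched in a few lines of Python. This is a simplified model, not Cilium’s implementation: the port, the header name `X-My-Header`, and the `http_allowed` helper are assumptions chosen for illustration, and full-string regex matching on the path is a simplification.

```python
import re

# Hypothetical L7 allow rules for one port: method, path regex, required headers.
ALLOWED_PORT = 80
HTTP_RULES = [
    {"method": "GET", "path": "/path1", "headers": {}},
    {"method": "PUT", "path": "/path2", "headers": {"X-My-Header": "true"}},
]


def http_allowed(port, method, path, headers):
    """Allow a request only if it arrives on the permitted port and matches a rule."""
    if port != ALLOWED_PORT:
        return False
    # HTTP header names are case-insensitive, so normalize before comparing.
    seen = {name.lower(): value for name, value in headers.items()}
    for rule in HTTP_RULES:
        if method != rule["method"]:
            continue
        if not re.fullmatch(rule["path"], path):
            continue
        if all(seen.get(name.lower()) == value for name, value in rule["headers"].items()):
            return True
    return False  # anything not explicitly allowed is rejected


print(http_allowed(80, "GET", "/path1", {}))
print(http_allowed(80, "PUT", "/path2", {}))
print(http_allowed(80, "PUT", "/path2", {"x-my-header": "true"}))
```

A Layer 3/4 firewall would treat all three requests identically, since they share a destination and port; only the application-layer check tells them apart.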

Using Cilium’s controls at Palantir

At Palantir, our infrastructure and security engineers have used Cilium’s rules to establish network controls tailored to Rubix’s bespoke functionality. By ensuring that all of Rubix’s Kubernetes resources are appropriately labeled with an accurate representation of their role and function in our environment, we’ve equipped Cilium with the requisite identifiers needed to interpret Kubernetes’ identity-based concepts when regulating traffic within Rubix.

With a robust set of rules in place, our InfoSec team’s attention turns to instances when Cilium’s rules succeed in their job by blocking non-compliant traffic in Rubix. The implication of such activity is that a possibly malicious third party could be attempting actions in the environment. In such cases, our team pivots to the areas of alerting, detection, and response.

Containerized Alerting and Detection with Cilium

Hubble is the observability / logging platform built on top of Cilium; it uses eBPF to achieve visibility into the operations occurring on an endpoint, and in turn generates discrete log entries for a variety of events.

Hubble log event types

At the time of this publication, Hubble generates nine event types; such events can be split into two categories:

1) Network traffic events generated from Cilium’s networking visibility
Cilium’s network-based logging includes telemetry from both successful and unsuccessful network flows among Kubernetes resources. Because Cilium enforces Layer 7 controls, we also receive block events for activity that a traditional network security solution might never surface. For example, if Endpoint A is a microservice whose only traffic should be a periodic GET request to HTTP 1, and Cilium blocks a POST to HTTP 1, we can draw two conclusions. First, the same activity would likely have been allowed by security controls operating solely on IP & port restrictions. Second, something other than the service’s expected behavior generated the POST request. That’s not to say that every errant POST is the work of a malicious actor, but if an attacker were generating unsanctioned HTTP traffic in one’s environment, visibility into that traffic would be valuable when responding to the incident.
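
A minimal sketch of how such block events might be triaged follows. It assumes flow records arrive as JSON lines with `verdict`, `source.labels`, and `l7.http.method` fields, modeled on Hubble’s JSON flow output as we understand it; verify the field names against your Hubble version. The sample records, the `svc-a` app label, and the `http-1` host are invented for illustration.

```python
import json
from collections import Counter

# Invented sample records; field names are assumptions modeled on Hubble flow JSON.
SAMPLE = """\
{"flow": {"verdict": "FORWARDED", "source": {"pod_name": "svc-a-1", "labels": ["k8s:app=svc-a"]}, "l7": {"http": {"method": "GET", "url": "http://http-1/status"}}}}
{"flow": {"verdict": "DROPPED", "source": {"pod_name": "svc-a-2", "labels": ["k8s:app=svc-a"]}, "l7": {"http": {"method": "POST", "url": "http://http-1/status"}}}}
{"flow": {"verdict": "DROPPED", "source": {"pod_name": "svc-a-3", "labels": ["k8s:app=svc-a"]}, "l7": {"http": {"method": "POST", "url": "http://http-1/upload"}}}}
"""


def dropped_http_by_app(lines):
    """Count dropped HTTP requests per (app label, method), ignoring pod names."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        flow = record.get("flow", record)  # tolerate wrapped or flat records
        http = flow.get("l7", {}).get("http")
        if flow.get("verdict") != "DROPPED" or not http:
            continue
        labels = flow.get("source", {}).get("labels", [])
        app = next(
            (label.split("=", 1)[1] for label in labels if label.startswith("k8s:app=")),
            "unknown",
        )
        counts[(app, http.get("method"))] += 1
    return counts


print(dropped_http_by_app(SAMPLE.splitlines()))
```

Grouping on the app label instead of the pod name means the two dropped POSTs, which came from two different short-lived pods, are counted as one service misbehaving.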

2) Process events captured by Hubble’s on-endpoint eBPF visibility
As noted, while some of Hubble’s events are derived from Cilium-specific network visibility & activity, other logging events simply leverage Hubble’s on-endpoint eBPF visibility to record process activity. Palantir’s alerting & detection engineers have found that Hubble’s process-specific events can serve as an alternative to traditional enterprise EDR solutions (even though Cilium was originally deployed with a focus on network activity rather than process activity). Such telemetry includes process start events, socket connections, and process end events.

There is broad overlap between Hubble’s process logging and that of traditional EDR solutions. We expect that an organization running Cilium & Hubble could potentially reduce its agent footprint, SIEM ingestion, and enterprise EDR costs by relying on Hubble’s process logging in place of its prior EDR solution.

Alerting & Detection at Palantir

The aforementioned logging events could be used to surface a variety of attack scenarios. In the interest of providing an example as to how Palantir uses telemetry from Cilium / Hubble to surface potentially malicious behavior, our InfoSec team is publishing one of our internal alerting & detection strategies. This strategy was authored to alert on Hubble events generated when Cilium blocks a DNS request originating from pods in the Rubix environment.

Alerting & Detection Strategy 006: Cilium Blocked DNS Resolution
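
The general shape of such a detection can be sketched as follows. This is not the published strategy itself, only an illustration under stated assumptions: flow records are already parsed into dictionaries, a blocked DNS request appears as a `DROPPED` verdict carrying an `l7.dns.query` field, and the `rubix` namespace, pod names, and `blocked_dns_alerts` helper are hypothetical.

```python
from collections import defaultdict

# Invented flows; field names are assumptions modeled on Hubble flow records.
SAMPLE_FLOWS = [
    {"verdict": "DROPPED", "source": {"namespace": "rubix", "pod_name": "pod-a"},
     "l7": {"dns": {"query": "Evil.example.net."}}},
    {"verdict": "DROPPED", "source": {"namespace": "rubix", "pod_name": "pod-a"},
     "l7": {"dns": {"query": "evil.example.net."}}},
    {"verdict": "FORWARDED", "source": {"namespace": "rubix", "pod_name": "pod-b"},
     "l7": {"dns": {"query": "ok.example.com."}}},
    {"verdict": "DROPPED", "source": {"namespace": "rubix", "pod_name": "pod-c"},
     "l4": {"TCP": {"destination_port": 443}}},
]


def blocked_dns_alerts(flows, threshold=1):
    """Group blocked DNS queries by (namespace, pod); alert at or above threshold."""
    blocked = defaultdict(set)
    for flow in flows:
        dns = flow.get("l7", {}).get("dns")
        if flow.get("verdict") == "DROPPED" and dns and dns.get("query"):
            source = flow.get("source", {})
            key = (source.get("namespace", "unknown"), source.get("pod_name", "unknown"))
            # Normalize names so repeated lookups of one domain collapse together.
            blocked[key].add(dns["query"].rstrip(".").lower())
    return [
        {"namespace": namespace, "pod": pod, "queries": sorted(queries)}
        for (namespace, pod), queries in sorted(blocked.items())
        if len(queries) >= threshold
    ]


for alert in blocked_dns_alerts(SAMPLE_FLOWS):
    print(alert)
```

Deduplicating by normalized query name keeps a pod that retries the same blocked lookup from flooding analysts, while the `threshold` parameter leaves room to tune for noisier environments.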

Ending notes

Broadly, Palantir’s InfoSec team feels that our use of Cilium has hardened the network security posture of our Rubix environment, while also making material contributions to our alerting & detection repertoire. We’re delighted to have worked with Isovalent to support this open-source project for the benefit of the broader InfoSec community, and we’re looking forward to more developments in the security space that cater to container-based environments and infrastructure.

Authors: Michael A. & Sean C.
