Migrating Policy Delivery Engines with (nearly) Nobody Knowing | by Pinterest Engineering | Pinterest Engineering Blog | Jan, 2024

Pinterest Engineering

Jeremy Krach | Staff Security Engineer, Platform Security

Several years ago, Pinterest had a brief incident due to oversights in the policy delivery engine. This engine is the technology that ensures a policy document written by a developer and checked into source control is fully delivered to the production system evaluating that policy, similar to OPAL. This incident began a multi-year journey for our team to rethink policy delivery and migrate hundreds of policies to a new distribution model. We shared details about our former policy delivery system in a conference talk from Kubecon 2019.

At a high level, there are three main architectural decisions we’d like to bring attention to for this story.

Figure 1: Old policy distribution architecture, using S3 and Zookeeper.
  1. Pinterest provides a wrapper service around OPA in order to manage policy distribution, agent configuration, metrics, logging, and simplified APIs.
  2. Policies were fetched automatically via Zookeeper as soon as a new version was published.
  3. Policies lived in a shared Phabricator repository that was published via a CI workflow.

So where did this go wrong? Essentially, bad versions (50+ at the time) of every policy were published simultaneously due to a bad commit to the policy repository. These bad versions were published to S3, with new versions registered in Zookeeper and pulled directly into production. This caused many of our internal services to fail simultaneously. Fortunately, a quick re-run of our CI published known good versions that were (again) pulled directly into production.

This incident led multiple teams to begin rethinking global configuration (like OPA policy). Specifically, the Security team and Traffic team at Pinterest began collaborating on a new configuration delivery system that would provide a mechanism to define deployment pipelines for configuration.

This blog post is focused on how the Security team moved hundreds of policies and dozens of customers from the Zookeeper model to a safer, more reliable, and more configurable config deployment approach.

The core configuration delivery story here isn’t the Security team’s to tell: Pinterest’s Traffic team worked closely with us to understand our requirements, and that team was ultimately responsible for building out the core technology to enable our integration.

Generally speaking, the new configuration management system works as follows:

  1. Config owners create their configuration in a shared repository.
  2. Configs are grouped by service owners into “artifacts” in a DSL in that repository.
  3. Artifacts are configured with a pipeline, also in a DSL in that repository. This defines which systems receive the artifact and when.

Each pipeline defines a set of steps and a set of delivery scopes for each step. These scopes are generated locally on each system that would like to retrieve a configuration. For example, one might define a pipeline that first delivers to the canary system and then to the production system.

The DSL also allows for configuration around how pipeline steps are promoted: automatic (within business hours), automatic (24×7), and manual. It also allows for configuration of metric thresholds that must not be exceeded before proceeding to the next step.
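The actual DSL is not shown in this post, so the following is only a rough Python sketch of the ideas above: steps with delivery scopes, a promotion mode, and metric thresholds. Every name here (`Step`, `Promotion`, `may_promote`, the scope strings, and the `error_rate` metric) is a hypothetical stand-in rather than part of the real system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Promotion(Enum):
    """The three promotion modes described above (names are hypothetical)."""
    AUTO_BUSINESS_HOURS = "auto_business_hours"
    AUTO_ALWAYS = "auto_24x7"
    MANUAL = "manual"


@dataclass
class Step:
    name: str
    scopes: List[str]  # delivery scopes that receive the artifact at this step
    promotion: Promotion = Promotion.MANUAL
    # Hypothetical thresholds: metric name -> maximum allowed value.
    max_metrics: Dict[str, float] = field(default_factory=dict)


def may_promote(step, observed, in_business_hours, approved=False):
    """Return True if delivery may proceed from `step` to the next step."""
    # A metric above its threshold blocks promotion regardless of mode.
    for metric, limit in step.max_metrics.items():
        if observed.get(metric, 0.0) > limit:
            return False
    if step.promotion is Promotion.MANUAL:
        return approved
    if step.promotion is Promotion.AUTO_BUSINESS_HOURS:
        return in_business_hours
    return True  # AUTO_ALWAYS


# A two-step pipeline: canary first, then production.
pipeline = [
    Step("canary", ["myservice/canary"], Promotion.AUTO_BUSINESS_HOURS,
         {"error_rate": 0.01}),
    Step("production", ["myservice/prod"], Promotion.MANUAL),
]
```

In this sketch the canary step promotes on its own only during business hours and only while the assumed `error_rate` metric stays under its threshold, while the production step always waits for explicit approval.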

The actual distribution technology is not dissimilar to the original architecture. Now, instead of publishing policy in a global CI job, each artifact (group of policy and other configuration) has a dedicated pipeline to define the scope of delivery and the triggers for the delivery. This ensures each policy rollout is isolated to just that system and can have whatever deployment strategy and safety checks the service owner deems appropriate. A high-level architecture can be seen below.

Figure 2: New policy distribution architecture, using Config server/sidecar and dedicated UI.

Phase 1: Tooling and Inventory

Before we could begin migrating policies from a global, instantaneous deployment model to a targeted, staged deployment model, a lot of information needed to be collected. Specifically, for each policy file in our old configuration repository we needed to identify:

  1. The service and GitHub team associated with the policy
  2. The systems using the policy
  3. The preferred deploy order for the systems using the policy

Fortunately, most of this information was readily available from a handful of data sources at Pinterest. During this first phase of the migration, we developed a script to collect all this metadata about each policy. This involved: reading each policy file to pull the associated service name from a mandatory tag comment, fetching the GitHub team associated with the service from our internal inventory API, getting metrics for all systems with traffic for the policy, and grouping those systems into a rough classification based on a few common naming conventions. Once this data was generated, we exported it to Google Sheets in order to annotate it with some manual tweaks. Namely, some systems were misattributed to owners due to stale ownership data, and many systems did not follow standard, predictable naming conventions.
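Two of the steps in that inventory script, reading the service tag and roughly classifying systems by name, might look something like the sketch below. The tag format (`# service: <name>`), the naming conventions, and the function names are all assumptions made for illustration; the real script and conventions are not described in this post.

```python
import re
from collections import defaultdict

# Assumed tag format: a comment line such as "# service: my-service".
SERVICE_TAG = re.compile(r"^\s*#\s*service:\s*(\S+)\s*$", re.MULTILINE)


def service_from_policy(text):
    """Return the service name from the tag comment, or None if it is missing."""
    match = SERVICE_TAG.search(text)
    return match.group(1) if match else None


def classify_system(name):
    """Bucket a system by hypothetical hyphen-separated naming conventions."""
    tokens = set(name.lower().split("-"))
    if "canary" in tokens:
        return "canary"
    if tokens & {"dev", "staging"}:
        return "staging"
    return "production"  # fallback bucket; would need manual review


def group_systems(names):
    """Group system names into rough deployment classes."""
    groups = defaultdict(list)
    for name in sorted(names):
        groups[classify_system(name)].append(name)
    return dict(groups)
```

A fallback bucket like the one in `classify_system` is exactly where unpredictable names end up, which is why a manual annotation pass over the exported data is still needed.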

The next piece of tooling we developed was a script that took a few pieces of input: the path to the policy to be migrated, the team names, and the deployment steps. This automatically moved the policy from the old repository to the new one, generated an artifact that included the policy, and defined a deployment pipeline for the associated systems attributed to the service owner.

With all this tooling in hand, we were ready to start testing the migration tooling against some simple examples.

Phase 2: Cutover Logic

Prior to the new policy delivery model, teams would define their policy subscriptions in a config file managed by Telefig. One of our goals for this migration was ensuring a seamless cutover that required minimal or no customer changes. Since the new configuration management system provided the concept of scopes and defined the policy subscription in the configuration repository, we could rely purely on the new repository to define where policies were needed. We needed to update our sidecar (the OPA wrapper) to generate subscription scopes locally during start-up based on system attributes. We chose to generate these scopes based on the SPIFFE ID of the system, which allowed us to couple the deployments closely to the service and environment of the host.
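As a minimal sketch of scope generation, the snippet below derives scopes from a SPIFFE ID of the standard form `spiffe://<trust-domain>/<path>`. The path layout (`/<environment>/<service>`) and the scope strings it returns are assumptions for illustration only; the real layout and scope format are not given in this post.

```python
from urllib.parse import urlsplit


def scopes_from_spiffe_id(spiffe_id):
    """Derive subscription scopes from a SPIFFE ID.

    Assumes a hypothetical path layout of /<environment>/<service>.
    """
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    segments = [segment for segment in parts.path.split("/") if segment]
    if len(segments) != 2:
        raise ValueError(f"unexpected SPIFFE path: {parts.path!r}")
    environment, service = segments
    # Most specific scope first, then the service-wide scope.
    return [f"{service}/{environment}", service]
```

Because the scopes come from the host's own identity, a pipeline step can target a service and environment without the customer declaring a subscription anywhere.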

We also recognized that since the configuration system can deliver arbitrary configs, we could also deliver a configuration telling our OPA wrapper to change its behavior. We implemented this cutover logic as a hot-reload of configuration in the OPA wrapper. When a new configuration file is created, the OPA wrapper detects the new configuration and changes the following properties:

  1. Where the policies are stored on disk (reload of the OPA runtime engine)
  2. How the policies are updated on disk (ZooKeeper subscription defined by a customer-managed configuration file vs. doing nothing and allowing the configuration sidecar to manage it)
  3. Metric tags, to allow detection of cutover progress

Figure 3: Flowchart of the policy cutover logic.

One benefit of this approach is that reverting the policy distribution mechanism could be done completely in the new system. If a service did not work well with the new deployment system, we could use the new deployment system to update the new configuration file to tell the OPA wrapper to use the legacy behavior. Switching between modes could be done seamlessly with no downtime or impact to customers using policies.
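The cutover and revert behavior can be pictured as a small mode-selection function that the wrapper re-runs whenever the cutover configuration changes. This is only a sketch under stated assumptions: the JSON file format, the `mode` key, the directory paths, the metric tag values, and the choice to fall back to legacy behavior on an unreadable file are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class WrapperMode:
    policy_dir: str      # where policies are stored on disk
    use_zookeeper: bool  # True: wrapper updates policies via ZooKeeper itself
    metric_tag: str      # tag used to track cutover progress


# Hypothetical paths and tag values.
LEGACY = WrapperMode("/var/opa/legacy", True, "delivery:zookeeper")
MANAGED = WrapperMode("/var/opa/managed", False, "delivery:config_sidecar")


def select_mode(cutover_config_text):
    """Pick the wrapper behavior from the (possibly absent) cutover config."""
    if cutover_config_text is None:
        return LEGACY  # no cutover file has been delivered yet
    try:
        config = json.loads(cutover_config_text)
    except json.JSONDecodeError:
        return LEGACY  # assumption: an unreadable file keeps legacy behavior
    if not isinstance(config, dict):
        return LEGACY
    # Delivering {"mode": "legacy"} through the new system acts as the revert.
    return MANAGED if config.get("mode") == "managed" else LEGACY
```

Under these assumptions, a revert is just another config delivery: the new pipeline ships a file that selects the legacy mode and the wrapper hot-reloads into it.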

Since both the policy setup and the cutover configuration could happen in a single repository, each policy or service could be migrated with a single pull request without any need for customer input. All files in the new repository could be generated with our previously built tooling. This set the stage for a long series of migrations with localized impact to only the policy being migrated.

At this point, the foundation was laid to begin the migration in earnest. Over the course of a month or two, we began auto-generating pull requests scoped to single teams or policies. Primarily Security and Traffic team members generated and reviewed these PRs to ensure the deployments were properly scoped, associated with the correct teams, and rolled out successfully.

As mentioned before, we had hundreds of policies that needed to be migrated, so this was a steady but long process of moving policies in chunks. As we gained confidence in our tooling, we ramped up the number of policies migrated in a given PR from 1–2 to 10–20.

As with any plan, there were some unforeseen issues that came up as we deployed policies to a more diverse set of systems. What we found was that some of our older stateful systems were running an older machine image (AMI) that did not support subscription declaration. This presented an immediate roadblock for progress on systems that could not easily be relaunched.

Fortunately, our Continuous Deployment team was actively revising how the Telefig service receives updates. We worked closely with the CD team to ensure that we dynamically upgraded all systems at Pinterest to use the latest version of Telefig. This unblocked our work and allowed us to continue migrating the remaining use cases.

Once we resolved the old Telefig version issue, we quickly worked with the few teams that owned the bulk of the remaining policies to get everything moved over into the new configuration deployment model. Below is a rough timeline of the migration:

Figure 4: Timeline of the migration to the new policy framework.

Once the metrics above stabilized at 100%, we began cleaning up the old tooling. This allowed us to delete hundreds of lines of code and drastically simplify the OPA wrapper, since it no longer had to build in policy distribution logic.

At the end of this process, we now have a safer policy deployment platform that allows our teams to have full control over their deployment pipelines and fully isolate each deployment from policies not included in that deployment.

Migrating things is hard. There’s always resistance to a new workflow, and the more people that have to interact with it, the longer the tail on the migration. The main takeaways from this migration are as follows.

Focus on measurement first. In order to stay on track, you need to know who will be impacted, the scope of what work remains, and what big wins are behind you. Having good measurement also helps justify the project and gives a great set of resources to brag about accomplishments at milestones along the way.

Secondly, migrations often follow the Pareto Principle. Specifically, 20% of the use cases to be migrated will often account for 80% of the results. This is visible in the timeline chart above: there are two huge spikes in progress (one in mid-April and one a few weeks later). These spikes are representative of migrations for two teams, but they represent an outsized proportion of the overall status. Keep this in mind when prioritizing which systems to migrate, as sometimes spending a lot of time just to migrate one team or system can have a disproportionate payoff.

Finally, anticipate issues but be ready to adapt. Spend time early in the process thinking through your edge cases, but leave yourself extra time on the roadmap to account for issues that you could not predict. A little bit of buffer goes a long way for peace of mind, and if you happen to deliver the results early, that’s a great win to celebrate!

This work wouldn’t have been possible without a huge group of people working together over the past few years to build the best system possible.

Huge thanks to our partners on the Traffic team for building out a robust configuration deployment system and onboarding us as the first large-scale production use case. Specifically, thanks to Tian Zhao, who led most of our collaboration and was instrumental in getting our use case onboarded. Additional thanks to Zhewei Hu, James Fish and Scott Beardsley.

The Security team was also a huge help in reviewing the architecture, migration plans and pull requests. Specifically, Teagan Todd was a huge help in running many of these migrations. Thanks also to Yuping Li, Kevin Hock and Cedric Staub.

When encountering issues with older systems, Anh Nguyen was a huge help in upgrading systems under the hood.

Finally, thank you to our partners on teams that owned a large number of policies, as they helped us push the migration forward by performing their own migrations: Aneesh Nelavelly, Vivian Huang, James Fraser, Gabriel Raphael Garcia Montoya, Liqi Yi (He Him), Qi LI, Mauricio Rivera and Harekam Singh.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.
