Designing a content management system for 100mm+ songs

I was the sole designer on Attribution Engine - Pex’s flagship music licensing product. From August 2019 to September 2020, we went from concept to Series A raise valuing the company at over $180mm.

Our team size grew 6x in less than a year, and it was time to deliver on our promises.

Initially, I was tasked with designing the platform broadly, owning every module within AE. As the company grew, I was able to dive deeper into specific areas.

Beyond our Series A, my main area of ownership within AE was the content management system, which in many ways was the heart of the system.

My role

Senior staff designer through the end-to-end process: discovery, user research, requirements, design, testing, and support through launch.

The team

2 product managers, 7 backend engineers, 1 database architect, 1 frontend engineer

Timeline

October 2020 - December 2021

Scaling and accommodating music’s biggest names

Universal, Warner and Sony were early believers in AE. Combined they own over 60% of all recorded music, so getting them onboarded quickly was paramount in reaching strong early traction.

Problem statement

How might we import and organize their catalog quickly and painlessly, so that they could begin extracting real value out of the platform?

Breakdown of the problem

Time to value
Because onboarding large catalogs takes a considerable amount of time, the majors are quite skeptical of working with new vendors. Many times before, they’ve been unable to extract enough value to make their efforts worthwhile.
Dirty data
Major labels work similarly to hedge funds, increasing revenue by buying, selling and holding catalogs for periods of time. Because songs change hands so frequently, the data can be fairly low quality, making it difficult to organize.
Limited resources
Labels run pretty lean, with few developers. This means catalog transfer needs to be smooth and largely automated.

Undoing some growing pains

The 0 to 1 version of AE was done with fewer than 10 people. As Pex grew, so did internal bias towards waterfall methodology.

I wanted to begin shifting towards a more collaborative and agile process. One where we were all working cross-functionally from the beginning - discovering and defining the problem, exploring all potential solutions and delivering the final solution.

Waterfall to Agile

I wanted to instill in the org that everyone was capable of design thinking, and solutions could come from anywhere. Most importantly, I wanted everyone to buy in and feel like they not only had the ability to shape solutions, but the expectation to help solve problems as they arose.

Fighting for our customer to have their seat at the table

Pex had historically undervalued user research. Given the size that AE needed to scale to, I knew we needed to be in frequent discussions with our end enterprise users.

After much internal debate, I was given the green light to start piloting research with the majors - although the leash was pretty short. I needed to prove value quickly, and make the best use of my time with the majors.

Approaching research creatively and making internal allies along the way

Market research

YouTube’s ContentID tool was AE’s biggest, direct competition. AE in its simplest form was ContentID for all web platforms.

I spent a lot of time researching how enterprise clients were getting their data into ContentID, and from there, how ContentID was organizing and structuring the data.

Enterprises were already familiar with ContentID and using it. I learned that the most widely used protocol for importing song catalogs was DDEX.

Rather than reinvent the wheel, it was time to learn what was working well and what wasn’t.

Learning about DDEX

I tried to learn everything I could about DDEX.

DDEX was the industry's gold standard for transporting music catalog data. It was used when pushing catalogs back and forth, label to label. It was also used to push catalogs to services or vendors, such as ContentID or platforms like Spotify and Pandora.

User interviews

Content ID power users

Focusing on ContentID power users allowed us to start with enterprise adjacent users, but not the major labels themselves. We needed to prove a bit of value internally, before sales and exec teams were ok with us approaching the majors.

I learned an enormous amount from these users. They used ContentID every day, and they had developed so many wonderful insights and hacks to make their work possible.

A lot of these insights went into helping us implement search and filtering in the CMS.

Major labels

Thankfully enough internal value was derived when interviewing adjacent ContentID power users. We got our green light to start chatting with the majors in a more friendly, casual way.

I went very deep learning about their workflows and processes for preparing to work with new vendors, and how their existing vendor relationships worked.

I wanted to understand:

Forming a hypothesis

During our interviews, I learned that it took an enormous amount of time to prepare DDEX files for a new vendor, and often there was little short term value to doing so.

I figured the best possible experience would be one where we come alongside a vendor they already use and deem valuable. Spotify, for instance, is a platform they’re already sending catalog to frequently, and deem valuable since streaming is a primary source of catalog revenue.

How might we accept the data like Spotify, so we’re only an additional place to send data, rather than a completely new vendor with wildly different specs?

This was the question I brought to the team, and we began working backwards from it.

My approach to system design

At Pex, I was the systems thinker. I had a strong understanding of the inner workings of our platforms, and knew intimately where things were operating smoothly and where there was room for improvement.

I approach system design similar to how one approaches biohacking. The parallels between system design and the body are immense. It’s easy to identify obvious issues like broken arms and obesity. Harder to identify things like vitamin deficiencies or overactive glands.

I take a very methodical, deep in the weeds approach to system design. 30,000 ft views are easier, but often being hyper-focused on the smallest details will lead to much more robust, thoughtful solutions that scale for years to come.

Importing 100mm songs in days instead of months

Deep dive on DDEX

I first needed to get into the weeds of DDEX. I researched and read every piece of documentation I could get my hands on. I also spoke with experts in the field who prepared and passed DDEX files every day at major labels, YouTube and Pandora.

I learned that DDEX is a format, but it lacks a lot of firm standards. The data is very messy and everyone has a different way of assembling it. This made designing around it much more difficult, and managing trade-offs was key.

Data storage

When we first started creating data models to mirror the majors’ Spotify feeds, we were importing every piece of data as-is.

Upon initial tests we realized the storage costs alone would put us out of business. We were storing too much data that we didn’t actually need for AE’s purposes.

I worked very closely with our database architect and backend engineers, scrutinizing every single data point and dropping as much data as we could while keeping the overall system design intact. In reality, we were still storing the full feeds in blob storage, but we were using only a minor portion of them in our staging and production databases.
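To make the idea concrete, here is a minimal sketch of that split. It is an illustration only: the field names and the allow-list are hypothetical and are not AE's actual schema. The full feed record is serialized untouched for blob storage, while only an allow-listed subset becomes the database row.

```python
import json

# Hypothetical allow-list of fields; the real set AE kept is not described here.
KEPT_FIELDS = ("isrc", "title", "artist", "product_code", "owner", "territories")


def split_record(raw: dict) -> tuple[str, dict]:
    """Return (blob_payload, db_row) for a single feed record."""
    # The complete record is archived as-is, so nothing is lost.
    blob_payload = json.dumps(raw, sort_keys=True)
    # Only the fields the application actually needs go to the database.
    db_row = {key: raw[key] for key in KEPT_FIELDS if key in raw}
    return blob_payload, db_row
```

Keeping the raw payload means a field that was dropped too eagerly can be backfilled later without asking the label to resend the feed.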

Speeding up import

Before we figured out our storage constraints, the time to import a batch of songs was quite painful. We knew we needed to make adjustments to our storage and overall database structure in order to optimize the flow of data.

By adjusting both what we stored and how we stored it, we were able to import millions of songs in a few weeks vs. our initial estimates of ~6 months per major label.

Majors were used to serious wait times, so millions of songs imported in 2-3 weeks was music to their ears. It also required little to no prep, since they had already prepared the feeds when sending to Spotify.

Making sense of the data

Now that we’d received all of this data, we needed to make sense of it.

Many songs had multiple owners. This meant many people were sending data about the same songs. They were sending their portion of the data, and we needed to interpret it correctly.

Metadata

Most other CMSs started and ended with matching data based on metadata: things like song titles and artists. Most did it poorly, and most stopped there.

One very difficult concept when sifting through music metadata is that a song can be released in many forms. Let’s take a song like Better Now by Post Malone. At the time of writing, Better Now has been released 38 times: as a single in a clean form and an explicit form, on the CD Beerbongs & Bentleys and the vinyl version… and 34 others. Each release has a slightly different ownership and data profile. This was a huge challenge.

After much trial and error working through various models on whiteboards and in Miro, we opted to use both the metadata and the audio itself to guide us towards clean data buckets.

Audio

Audio matching is what Pex does best and what almost no other CMS does. We believed this could be a differentiator.

Audio matching would allow us to bucket data together based on the audio matching 100%. This would weed out a lot of the data in the process. What it didn’t correct for is the exact same audio being released on different products. CD vs. vinyl for instance.

This was ok. From there, we were able to factor in bits of metadata. Things like product codes helped guide us. These auxiliary pieces of metadata were helpful, but often incomplete and couldn’t always be fully relied on due to human error.
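The two-step grouping can be sketched roughly as follows. This is not Pex's matching code: `audio_id` is a hypothetical stand-in for "these recordings matched 100% on audio", and the record fields are assumed for illustration. Releases are bucketed by audio first, then split by product code, with missing codes falling into a shared "unknown" bucket because that metadata can't always be relied on.

```python
from collections import defaultdict


def bucket_releases(releases):
    """Group releases by audio identity, then by product code within each group.

    Each release is a dict with hypothetical keys:
    "release_id", "audio_id", and an optional "product_code".
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for release in releases:
        # Product codes are often incomplete, so fall back to "unknown".
        code = release.get("product_code") or "unknown"
        buckets[release["audio_id"]][code].append(release["release_id"])
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {audio: dict(by_code) for audio, by_code in buckets.items()}
```

Releases that land in an "unknown" bucket are the ones where the auxiliary metadata gave no help, so they are natural candidates for a closer look.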

Trust score

The final piece of the puzzle was a trust score system. Not all data sources or partners were created equal. Major labels had decent data integrity. They also had a lot of incentive to keep it that way, since bad data led to millions in lost revenue. DIY distributors like TuneCore or DistroKid were a different story though. Same with small indie labels and publishers. They meant well, but the resources were lacking.

We implemented a trust score system to guide us in assembling the data. Majors being the highest, followed by large distributors, followed by mid-level indie labels, so on and so forth. For many reasons, I can’t get too into the details here.

That allowed us to make sense of what we received with certainty and without manual review. We were able to, in a sense, take one party’s word over another’s automatically, based on trust score.
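Without going into the real details, the general shape of the idea can be sketched like this. Everything here is an assumption made for illustration: the tier names, the numeric weights, and the rule that equally trusted sources who disagree get flagged instead of resolved.

```python
# Hypothetical tiers and weights; the real scoring is not disclosed here.
TRUST = {
    "major": 4,
    "large_distributor": 3,
    "indie_label": 2,
    "diy_distributor": 1,
}


def resolve_field(claims):
    """Pick the value asserted by the most trusted source.

    claims: list of (source_tier, value) pairs for one field of one song.
    Returns None when there are no claims, or when the most trusted
    sources disagree, so the field can be flagged rather than guessed.
    """
    if not claims:
        return None
    best = max(TRUST[tier] for tier, _ in claims)
    top_values = {value for tier, value in claims if TRUST[tier] == best}
    return top_values.pop() if len(top_values) == 1 else None
```

For example, if a DIY distributor and a major disagree on a title, the major's value wins; if two majors disagree, the function returns None and the conflict is surfaced instead.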

Nothing is ever perfect and iterations were plentiful. Overall though, the model worked, clients understood, and it led to us needing 1/10th of the manpower we initially thought to onboard new clients and large catalogs.

Displaying 100mm songs

Importing was half the battle. Now, we needed to look at how we display 100mm songs.

There were many CMS examples in various forms on the market, but few that handled a commercial level of assets intuitively and beautifully. This was confirmed in our user interviews. Many seemingly simple tasks were difficult or impossible in competitor products.

Asset library

My overall philosophy was that using the asset library should be quick and painless. It’s a tool or a means to an end. It’s not really the product itself.

Based on interviews, I knew our users wanted to be able to:

My main objective was to keep the design out of the way. I intentionally pulled back many of our data points to get to the meat. I wanted to give the user enough information, so they had a level of certainty that they were looking at the right songs, but not bog them down with every detail.

Ownership

Ownership is quite complicated in the music business. The average hit of the last decade has over 15 owners between writers, artists, labels and publishers. Distilling all of this data down into something digestible was a challenge.

I opted to show ownership in a global sense. Users could quickly see how much they owned and in which countries, so they could further determine which songs were more valuable. Owning a song in the US is much different from owning it in Norway.

Licensing

Licensing is the fun part. Here, we’re showing where your song was used in what piece of content, how many times, how many views, etc. This is where the asset library turns into a money making machine.

I wanted to clearly show the user high-value songs, and give some depth in terms of analytics. This would help users not only identify high-value songs, but also lead to observations that could aid them in duplicating past successes.

Data conflicts

Conflicts were plentiful, since each song had so many owners and we were receiving data from all of them. I opted for simple explanations and visual aids where appropriate to show that ownership conflicts were occurring. Users could easily see which countries were causing problems, fix the data, and unlock new revenue opportunities by doing so.
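One simple way to think about a per-country conflict, sketched under my own assumption that a conflict means the claimed ownership shares in a country add up to more than 100% (the actual rules in AE were more involved and are not described here):

```python
from collections import defaultdict


def conflicting_countries(claims, tolerance=0.01):
    """Return the countries where claimed shares exceed 100%.

    claims: iterable of (owner, country, share_percent) tuples for one song.
    tolerance absorbs small rounding differences between data sources.
    """
    totals = defaultdict(float)
    for _owner, country, share in claims:
        totals[country] += share
    return sorted(c for c, total in totals.items() if total > 100 + tolerance)
```

A list of country codes like this is what a per-country conflict indicator in the UI could be driven from.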

Other ideas for exploration

Because we’re able to identify problems and clean data more robustly than other platforms, I dreamed of one day being able to offer a service that allowed clients to clean their data and then pipe it back out to services like Spotify, Apple Music and more.

Essentially, if we could find you more revenue in our ecosystem, we would in turn unlock more revenue elsewhere.

The build would be quite intense, so for now it lives on the very distant roadmap.

Closing thoughts

The majors were used to slow and tedious onboarding. They were also used to working with large teams.

We didn’t have a large team. Efficiency and smart automation were key.

We were able to import all of the majors’ catalogs in around 6-8 weeks once we began onboarding. They were able to get immediate value out of our platform with minimal effort, which further instilled trust in our products and vision.