Fun with YouTube’s Audio Content ID System

Update (April 21, 2010): The YouTube account I used to upload my test videos, “retnirpregnif,” was removed due to a “terms of use violation” in late January or early February of 2010. YouTube never sent me any kind of notice or alert explaining their rationale for the termination of my account, so all I can do is guess. But I’m fairly certain that an actual human pulled the account, not an automated system. The remainder of this article remains unaltered from its original April 2009 state. Most of the techniques described here are antiquated and no longer work. I don’t have any new information about which techniques may work today.

Anybody who hasn’t been living under a rock knows about YouTube. It’s a video site built entirely around user-submitted content. Anybody can film anything, upload it to the site, and anybody on the internet can watch it if they so choose. Sounds great in theory, but over time it’s succumbed to a very basic problem: The users can’t be trusted.

Copyrighted materials—TV shows, music videos, concerts, even entire feature films—have popped up on YouTube in huge quantities. Obviously, the copyright owners and content providers don’t like this, especially when such free distribution cuts into their bottom line. Back in the day, a copyright holder would have to stumble across an infringing video, contact YouTube, and ask them to take it down manually. It doesn’t take a genius to realize that the content providers couldn’t keep that up forever, especially as more and more new users kept pouring in.

Enter Fingerprinting

YouTube narrowly avoided legal trouble by promising the big media companies that they’d develop a system that could detect and automatically remove any copyrighted material that was uploaded to the site. But in reality, they didn’t actually develop the audio fingerprinting system; they licensed it from a company called Audible Magic.

Audible Magic originally wrote software for CD duplication companies. When you handed a master disc off to a duplication house, they’d check it with an Audible Magic system first. The goal was to positively identify every song on the disc, as well as the copyright/licensing status, before the company ran off 10,000 copies of your potentially pirated disc.

YouTube jumped at this technology and worked to integrate it into their site. It scanned every upload and generated a “fingerprint” for each video. It would then compare each fingerprint to a database containing practically every copyrighted work that the media companies wanted to keep off the site. If a video matched, it was assumed that the user had posted copyrighted material without permission, and the infringing video was removed.

Some labels got the right idea, though. Instead of demanding that any infringing content be taken down, some chose to promote their material or insert links to pay music sites where you could purchase the songs that were being played. That was an amazing idea: It permitted the users to basically do whatever they wanted copyright-wise, while still driving traffic and potential sales to legitimate music retailers.

Heating Up

That worked well enough for a time, but the media companies weren’t satiated yet. A slew of legal threats, negotiations, and all-around chicanery ensued. After all, YouTube was making money by running ads alongside videos which often contained material from these companies, and they all wanted a piece.

Unfortunately, nothing seemed to please Warner Music Group, who left the talks without reaching an agreement. They then demanded that YouTube remove every single piece of WMG-owned media on the site. Videos disappeared all over the place.

This Is Where I Came In

I don’t consider myself to be much more than a casual YouTube user. I’ll upload maybe one or two things a year, but nothing amazing or anything I put any real effort into.

For example, one of my videos depicts three members of my high school’s marching band dressed in pajamas at an overly girly sleepover. The song used in the background was “I Know What Boys Like” by The Waitresses. I thought it was hilarious when I was 17, but I had all but forgotten about it five years later.

I was caught by surprise one day when I received an automated email from YouTube informing me that my video had a music rights issue and it was removed from the site. I didn’t really care.

Then a car commercial parody I made (arguably one of my better videos) was taken down because I used an unlicensed song. That pissed me off. I couldn’t easily go back and re-edit the video to remove the song, as the source media had long since been archived in a shoebox somewhere. And I couldn’t simply re-upload the video, as it got identified and taken down every time. I needed to find a way to outsmart the fingerprinter. I was angry and I had a lot of free time. Not a good combination.

I racked my brain trying to think of every possible audio manipulation that might get by the fingerprinter. I came up with an almost-scientific method for testing each modification, and I got to work.

Methodology

The song chosen for all the tests is “I Know What Boys Like,” a 1982 song by the one-hit wonder group The Waitresses. This song was chosen for several reasons:

It was the first song I ever saw that was identified and removed by YouTube’s fingerprinting system.
It has a very distinctive sound that I thought would be easily identifiable. It’s also really repetitive, which probably makes it an easy target for an automated system to detect.
It’s one of the few songs I actually have readily available in an uncompressed format. The majority of my music collection is stored with lossy data compression, which might have impacted the results.
In general, it’s just a terrible song. I wanted to highlight the fact that somewhere out there, somebody thinks this 27-year-old heap is still valuable enough to be barred from YouTube.

The song originally came from a 1990 CD pressing of “The Best of the Waitresses,” which I came across during my freshman year of college. I was so surprised to see a copy of this album, I begged the owner to allow me to make a copy for posterity (and also for hilarity). I used Nero Burning ROM to make a bit-perfect copy of the full album onto a CD-R. I then listened to my copy, laughed at the majority of it, then stored it in a CD binder.

Fast-forward to the present day, when I decided to run these tests. I ripped my copy of the album with Exact Audio Copy in “secure” mode. The result was a 16-bit stereo, 44,100 Hz PCM wave file. This was used as the master file for all the tests.
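For anyone repeating this with their own files, the master format can be checked before any manipulation. This is a minimal sketch using Python’s standard `wave` module; the function names and the file name in the usage note are made up for illustration, and it only reads the file header rather than verifying the rip itself.

```python
import wave

def describe_wav(path):
    """Read the header of a PCM WAV file and report its basic format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bits": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "rate": rate,
            "seconds": wav.getnframes() / rate,
        }

def matches_master_format(info):
    """True for 16-bit stereo 44,100 Hz, the format described above."""
    return (info["channels"], info["bits"], info["rate"]) == (2, 16, 44100)
```

A call such as `matches_master_format(describe_wav("master.wav"))` (with a hypothetical `master.wav`) would confirm that a file is in the same format as the master used here.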

For each test, a duplicate copy of the master file was manipulated. Practically every change to the audio was made in Adobe Audition 3 on Windows. The modified duplicates were saved as 44,100/16 stereo waves and moved over to a Mac.

Each file was loaded into an empty Final Cut Pro sequence. The video settings, although theoretically irrelevant, were always set to 24 FPS, progressive, NTSC 720×480 @ 4:3, with 44.1/16 stereo downmix audio. The audio files were matched with a default Text generator which described the test being performed. The resulting video files were saved in DV NTSC QuickTime format.

From there, the files were moved into Apple Compressor where they were batch converted into a format YouTube would accept. I chose the “H.264 for iPod video and iPhone 320×240 (QVGA)” setting, which encodes reasonably fast with excellent quality. The final output files were M4V containers with H.264 video and AAC stereo audio.

Finally, the video files were uploaded to my YouTube test account. I chose the name retnirpregnif, which is the word “fingerprinter” backwards. The title of each uploaded video was always set to a description of that particular test. In all but one test, the description was set to ‘The song is “I Know What Boys Like” by The Waitresses.’ I chose that description to see if the presence or absence of a copyrighted song name in any of the metadata fields influenced the detection. I left the tags, category, and any other fields blank, although the uploader may have auto-filled some of them.

I considered a test passed if the status line on my account’s “Uploaded Videos” page read “Live!” and the thumbnail had been generated. (Also, if the video actually played, that’s a big plus.) If a video had a status of “Matched third party content” or I received an email about a particular video, I considered that test failed.

Please note that these tests are only meant to test the audio aspect of YouTube’s fingerprinting system. They probably have a similar feature in place to scan for content in the image data, but I make no effort to test that in this document. The video fingerprinter might be susceptible to tweaks like those I describe below, or it might be an entirely different can of worms. I’ll leave it to somebody else to figure that one out.

The Tests

No Description

For the first test, I uploaded a completely unmodified copy of the entire song, but with a description field that read “No Description.” The purpose of this test was to determine if YouTube could still identify the material if none of the user-submitted metadata gave any indication that it was there.

Reverse

The entire song was reversed. The purpose of this test was to determine how discriminating the fingerprinter was. If the test passed, it would reveal the system’s inability to identify a song which is playing backwards.
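Outside of an audio editor, reversing uncompressed PCM comes down to reversing the order of the sample frames. The sketch below assumes raw interleaved 16-bit stereo data (4 bytes per frame, as in the master file) and uses a made-up function name. Reversing whole frames rather than individual bytes keeps each sample’s byte order and the left/right channel assignment intact.

```python
def reverse_pcm(data: bytes, frame_size: int = 4) -> bytes:
    """Reverse raw interleaved PCM audio one frame at a time.

    frame_size is bytes per frame: 2 channels x 2 bytes = 4 for
    16-bit stereo. Assumes headerless sample data, not a whole WAV file.
    """
    if len(data) % frame_size:
        raise ValueError("data length is not a whole number of frames")
    frames = [data[i:i + frame_size] for i in range(0, len(data), frame_size)]
    return b"".join(reversed(frames))
```

Applying the function twice returns the original data, which is a handy sanity check.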

Pitch Alteration

The entire song was modified with Audition’s “Stretch” plugin. In all tests, the Precision was set to High, Constant Vowels was off, Preserve Speech Characteristics was on, Formant Shift was 0, and Solo Instrument or Voice was on. (Admittedly, it should’ve been off, but that would’ve taken friggin’ forever to process.)

For these tests, the Stretching Mode was Pitch Shift. The Ratio was changed from test to test to create varying amounts of pitch change.

These tests created an output file with exactly the same length and speed as the source, but with the pitch increased or decreased. These tests were designed to determine if the fingerprinter looks at the “notes” the song is made of.

Time Alteration

The entire song was modified with Audition’s “Stretch” plugin. In all tests, the Precision was set to High, Constant Vowels was off, Preserve Speech Characteristics was on, Formant Shift was 0, and Solo Instrument or Voice was on.

For these tests, the Stretching Mode was Time Stretch. The Ratio was changed from test to test to create varying amounts of tempo change.

These tests created an output file with exactly the same notes as the source, but with the speed (tempo) increased or decreased. These tests were designed to determine if the fingerprinter looks at the “beats” and rhythm of the song.

Resampling

The entire song was modified with Audition’s “Stretch” plugin. In all tests, the Precision was set to High, and Constant Vowels was off.

For these tests, the Stretching Mode was Resample. The Ratio was changed from test to test to create varying amounts of combined pitch and tempo change.

These tests created an output file with both altered pitch and altered speed relative to the original. Quite simply, the song was played back at a faster or slower rate than the original—similar to a tape being played at the wrong speed. And now I suddenly feel old.
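The difference between the three stretch modes can be summed up with a little arithmetic. The sketch below is only an illustration: `factor` is a hypothetical speed/pitch multiplier (2.0 means twice as fast, or one octave up), not Audition’s Ratio setting, and the 180-second duration in the usage note is a made-up example length.

```python
import math

def stretch_effect(duration_s, factor, mode):
    """Return (new_duration_s, pitch_shift_semitones) for one stretch mode.

    `factor` is a hypothetical multiplier, NOT Audition's Ratio value.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    # One octave is 12 semitones and corresponds to doubling the frequency.
    semitones = 12 * math.log2(factor)
    if mode == "pitch_shift":    # same length, different notes
        return duration_s, semitones
    if mode == "time_stretch":   # same notes, different length
        return duration_s / factor, 0.0
    if mode == "resample":       # both change together, like a tape
        return duration_s / factor, semitones
    raise ValueError(f"unknown mode: {mode}")
```

For a 180-second clip and a factor of 2.0, `stretch_effect(180.0, 2.0, "resample")` gives a 90-second result pitched 12 semitones up, while the other two modes change only the pitch or only the length.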

Noise

The entire song was mixed with varying levels of background noise. In the first round of tests, the song was mixed with varying levels of pure white noise created with Audition’s Noise generator, using these settings:

Color: White
Style: Independent Channels
Intensity: 40
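The white noise mix can be approximated in a few lines. This is a rough sketch, not a recreation of Audition’s Noise generator: samples are assumed to be floats in the range -1.0 to 1.0, `noise_level` is a hypothetical peak amplitude (it does not correspond to the Intensity setting), and the noise is drawn separately for each channel, loosely in the spirit of the Independent Channels style.

```python
import random

def clamp(x, lo=-1.0, hi=1.0):
    """Keep a sample inside the valid range (hard clipping)."""
    return max(lo, min(hi, x))

def add_white_noise(frames, noise_level, seed=0):
    """Mix uniform white noise into a list of (left, right) float samples.

    noise_level is a hypothetical peak amplitude for the noise.
    A fixed seed makes the result repeatable.
    """
    rng = random.Random(seed)
    noisy = []
    for left, right in frames:
        # Independent random draws for each channel.
        noise_l = rng.uniform(-noise_level, noise_level)
        noise_r = rng.uniform(-noise_level, noise_level)
        noisy.append((clamp(left + noise_l), clamp(right + noise_r)))
    return noisy
```

Raising `noise_level` from test to test would give the “varying levels” of noise described above.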

For the second round of tests, the entire song was played on a set of M-Audio BX5a studio monitor speakers (chosen because of their flat frequency response ≥100 Hz, and because they were the only ones I really had available), and recorded into a Canon ZR200 camcorder onto a MiniDV tape. The tape was captured into Final Cut Pro, the resulting 48,000 Hz 16-bit audio was split off to a wave file, and then it was converted back into 44,100 Hz in Audition. The camera was placed at different distances and different angles relative to the stereo field’s central axis. No effort was made to keep the room quiet during the trials, and as a result things like heaters, refrigerators, TV flyback transformers, and running water can be heard throughout.

Amplification/Attenuation/DC Bias

The entire song had its volume adjusted by varying amounts from test to test. For amplification tests, the song was allowed to clip hard at 0 dB, creating a great deal of distortion on the louder trials.

In later tests, the amplification was unchanged, but a positive DC bias was added to the signal, resulting in a great deal of distortion and the type of audio I’m afraid to play on good speakers.

These tests were designed to see if there was any absolute volume below which the fingerprinter couldn’t detect the song. Likewise, it tested to see if any amount of digital clipping and distortion could disrupt the detection process.
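Here is a minimal sketch of what amplification, hard clipping, and DC bias do to the sample values. It assumes float samples where 1.0 is full scale, and the function name and parameters are made up for illustration. A gain in decibels maps to a linear multiplier of 10^(dB/20), and anything pushed past full scale is simply flattened.

```python
def amplify(samples, gain_db=0.0, dc_offset=0.0):
    """Apply gain (in dB) and a DC offset, then hard-clip to [-1.0, 1.0].

    Samples are floats with 1.0 as full scale; dc_offset is a
    hypothetical constant added to every sample.
    """
    gain = 10 ** (gain_db / 20)  # e.g. +20 dB -> x10, -20 dB -> x0.1
    return [max(-1.0, min(1.0, s * gain + dc_offset)) for s in samples]
```

With `gain_db=20`, a sample at 0.5 is pushed to 5.0 and clipped to 1.0. With a positive `dc_offset`, the whole waveform slides upward, so the positive peaks clip long before the negative ones do.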

Time Chunks

The song was trimmed to (n × 3) seconds long, where n is a value that changes from test to test. The preserved segment of audio comes from near (but not exactly) the center of the song. From 0 seconds to n seconds, the audio is muted. Likewise, from (n × 2) seconds to the end of the trimmed file, the audio is also muted. The remaining n seconds in the middle are allowed to play. If the song is shorter than (n × 3) seconds, the muted sections are shortened so the entire output file is the same length as the source.

In later tests, the muted and unmuted portions were aligned to the head and tail ends of the song, for reasons that will be explained later.

The goal of these tests was to determine how much of the song needed to be present to trigger a positive detection, and if the position of that section had any effect on the detection.
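The centered version of the chunking scheme can be sketched as follows. This is an approximation under stated assumptions: it works on a plain list of samples for one channel, it takes the window from the exact center of the song (the tests used a spot near, but not exactly at, the center), and the function name is made up.

```python
def time_chunk(samples, sample_rate, n_seconds):
    """Keep n seconds of audio with up to n seconds of silence on each side.

    Returns a new list at most 3 * n seconds long. If the song is shorter
    than that, the muted sections shrink and the output keeps the
    source's length.
    """
    n = int(n_seconds * sample_rate)
    window = min(len(samples), 3 * n)       # total output length
    start = (len(samples) - window) // 2    # center the window in the song
    chunk = list(samples[start:start + window])
    audible = min(n, window)                # samples allowed to play
    lead = (window - audible) // 2          # length of the leading mute
    for i in range(window):
        if i < lead or i >= lead + audible:
            chunk[i] = 0                    # mute outside the audible part
    return chunk
```

For example, twelve one-second “samples” numbered 1 to 12 with n = 2 come out as `[0, 0, 6, 7, 0, 0]`: two seconds of silence, two seconds of song, two more seconds of silence.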

Stereo Imagery

The entire song was subjected to a series of filters that modify the audio based on the similarities and differences between the two audio channels.

For two of the tests, the vocals were removed or isolated using Audition’s Center Channel Extractor plugin with the following settings:

Extract Audio From: 0° phase, 0% pan, 0ms delay
Frequency Range: 140–20,000Hz
Volume Boost Mode: off
Crossover: 100%
Phase Discrimination: 4.5°
Amplitude Discrimination: 6dB
Amplitude Bandwidth: 9dB
Spectral Decay Rate: 0%
FFT Size: 32,768
Overlays: 12
Window Width: 100%

In the third test, both channels’ waves were inverted. The phase relationship between left and right was preserved.

In the fourth test, only the right channel’s wave was inverted. The left remained untouched. The resulting audio file is completely out-of-phase.

In the fifth test, the two channels were first averaged together, effectively making the file mono. It still had two channels, but they contained identical waveforms. The right channel was then taken out-of-phase in the same manner as the fourth test. The resulting audio file is completely out-of-phase, and when both channels are summed together, they will destroy one another and average out to zero, or total silence.

These tests are designed to see how well the fingerprinter copes with audio with unexpected phase alterations. Also, the later tests attempt to reveal if the fingerprinter considers the files in stereo, or if it first converts them into mono for analysis.
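The logic behind the fifth test is easy to demonstrate with a few samples. The sketch below assumes stereo audio as a list of (left, right) float pairs and uses made-up function names. It shows why a fingerprinter that first sums the channels to mono would be left with nothing but silence.

```python
def average_to_dual_mono(frames):
    """Put the same averaged waveform on both channels."""
    return [((l + r) / 2, (l + r) / 2) for l, r in frames]

def invert_right(frames):
    """Flip the polarity of the right channel only."""
    return [(l, -r) for l, r in frames]

def downmix_to_mono(frames):
    """Sum the two channels into one, as a mono analysis step might."""
    return [(l + r) / 2 for l, r in frames]

stereo = [(0.5, 0.25), (-0.5, 0.75)]
fifth_test = invert_right(average_to_dual_mono(stereo))
silent = downmix_to_mono(fifth_test)  # every sample cancels to 0.0
```

Each channel of `fifth_test` still carries the full song on its own, so a listener on headphones hears it, but `silent` contains only zeros.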