From “Heavy Purchasers” of Pregnancy Tests to the Depression-Prone: We Found 650,000 Ways Advertisers Label You

A spreadsheet on ad platform Xandr’s website revealed a massive collection of “audience segments” used to target consumers based on highly specific, sometimes intimate information and inferences.
See the full data on GitHub.

What words would you use to describe yourself? You might say you’re a dog owner, a parent, that you like Taylor Swift, or that you’re into knitting. If you feel like sharing, you might say you have a sunny personality or that you follow a certain religion.

If you spend any time online, you probably have some idea that the digital ad industry is constantly collecting data about you, including a lot of personal information, and sorting you into specialized categories so you’re more likely to buy the things they advertise to you. But in a rare look at just how deep—and weird—the rabbit hole of targeted advertising gets, The Markup has analyzed a database of 650,000 of these audience segments, newly unearthed on the website of Microsoft’s ad platform Xandr. The trove of data indicates that advertisers could also target people based on sensitive information like being “heavy purchasers” of pregnancy test kits, having an interest in brain tumors, being prone to depression, visiting places of worship, or feeling “easily deflated” or that they “get a raw deal out of life.”

Many of the Xandr ad categories are more prosaic, classifying people as “Affluent Millennials,” for example, or as “Dunkin Donuts Visitors.” Industry critics have raised questions about the accuracy of this type of targeting. And the practice of slicing and dicing audiences for advertisers is an old one.

But the exposure of a collection of audience segments this size offers consumers an unusual look at how they and their families are packaged, described, and categorized by ad companies.

Because the segments also include the names of the companies involved in creating them, they also shed light on how disparate pools of personal data—collected by tracking people’s online activity and real-world movements—are combined into bespoke, branded groups of potential ad viewers that can be marketed to publishers and advertisers.

“I think it’s the largest piece of evidence I’ve ever seen that provides information about what I call today’s ‘distributed surveillance economy,’” said Wolfie Christl, a privacy researcher at Cracked Labs, who discovered the file and shared it with The Markup.

Christl noted that the Xandr segments touched on highly sensitive topics. One civil liberties advocate called this sort of targeting “one of the greatest threats to data privacy” and said that he was concerned with some of the categories in the Xandr material, especially around reproductive health. A consumer who was placed in one of the audience segments available through Xandr said the segment did not accurately reflect his income.

Christl also shared his findings with German digital rights news site netzpolitik.org, which reported on the audience segments in cooperation with The Markup. The publication revealed the participation of European firms like location data broker Adsquare on the Xandr platform and examined whether they are complying with European Union data protection laws.

Microsoft removed the file from its website after we emailed the company and did not respond to multiple requests for comment.

What Is in This Data?

The file, which was linked to from a public page on Xandr’s website, contains 650,000 rows of data, each containing the name of an audience segment, the name of the supplier of the data behind that segment, a supplier ID number, and a segment ID number.
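For readers who want to picture the layout, a row in a file like this can be modeled as a small record with four fields. The sketch below is illustrative only: the column names, the supplier names, and the ID values are assumptions, since the description above names the four fields but not the file's actual headers or contents.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentRow:
    segment_name: str
    supplier_name: str
    supplier_id: int
    segment_id: int


# Placeholder data: header names, suppliers, and ID values are invented.
SAMPLE_CSV = """segment_name,supplier_name,supplier_id,segment_id
Affluent Millennials,Example Supplier A,1,1001
Dunkin Donuts Visitors,Example Supplier B,2,1002
"""


def load_segments(text: str) -> list[SegmentRow]:
    """Parse CSV text into SegmentRow records, converting IDs to integers."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        SegmentRow(
            segment_name=row["segment_name"],
            supplier_name=row["supplier_name"],
            supplier_id=int(row["supplier_id"]),
            segment_id=int(row["segment_id"]),
        )
        for row in reader
    ]


rows = load_segments(SAMPLE_CSV)
```

Note that nothing in a record like this identifies an individual person; each row describes a category that people can be placed in.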

On Xandr’s platform, advertisers can pay for the ability to target people through the segments.

The segment names sometimes contain the names of data firms other than the supplier that in some cases may be the original source of the data. They also sometimes contain a hierarchical taxonomy, such as “Lifestyle > Visitation > Recent Retail Visit by Shopper > Lululemon.”
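A hierarchical name like the one above can be broken into its levels with a simple split. This is a minimal sketch, assuming the separator is consistent within a name; the function name `parse_taxonomy` is hypothetical, and the separator is a parameter because some segment names in the file use other delimiters, such as “::”.

```python
def parse_taxonomy(segment_name: str, separator: str = ">") -> list[str]:
    """Split a hierarchical segment name into its levels, trimming whitespace."""
    return [level.strip() for level in segment_name.split(separator) if level.strip()]


name = "Lifestyle > Visitation > Recent Retail Visit by Shopper > Lululemon"
levels = parse_taxonomy(name)

# The first level is the broad category; the last is the most specific label.
top_level, leaf = levels[0], levels[-1]
```

A name with no separator simply comes back as a single level, so flat names like “Affluent Millennials” pass through unchanged.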

It appears this file was meant to showcase the wide range of data sources available to license from Xandr’s marketplace (no individual consumers are listed in the dataset). Some of the segments included instructions to be used only until a certain date (“RETIRED – Use Thru 3/2020”) or not to be used in states with privacy laws (“Vyvanse ADHD Adult Target List NO COLORADO”). Other segments appeared to be custom-built for specific ad campaigns or small local businesses. It is not clear that all of the segment names were intended to be publicly available.

The file metadata says that it was created in May 2021, meaning that the ad segments it contains may not be in use today.

There are 93 data suppliers listed in the file, including some well-known tech companies like data juggernaut Oracle (which was listed as data supplier to more than a third of the segments), location data broker Foursquare (Factual), and consumer data giant Acxiom, as well as dozens of lesser-known ad tech companies.

Christl said he thinks the large number of companies named in the file shows that Xandr was (at least in 2021) reselling large amounts of sensitive data from a wide range of data brokers from around the world. Regarding the large number of segments related to sensitive topics, Christl said, “I think the file suggests that Xandr did not take even the slightest measures to exclude at least the most sensitive data from its marketplace.”

Many of the audience segments fall into broad consumer categories and also show a surprising amount of specialization.

Sensitive Segments

The Markup found thousands of rows in the file that indicate sensitive audience groupings.

Medical and Health Related

Many medical- and health-related segments mentioned specific conditions consumers may be diagnosed with, medicine they may be taking, or conditions they may develop. This category included several segments relating to reproductive health, including some involving pregnancy tests, contraceptives, and infertility.

Race/Ethnicity

Race and ethnicity showed up frequently among the demographic data targeted by the segments.

Political

Many segments were related to political beliefs, political activity, and contentious issues such as gun control, immigration, and LGBTQ rights.

Psychological Profiles

Profiles involving people’s feelings and psychology were numerous, offering advertisers a menu of consumers grouped by sentiment and mental health.

Financial

Some of the most colorfully described audience segments came from consumer credit agencies Equifax and Experian. Segments are branded with alliterative names like “Silver Sophisticates” and “Progressive Potpourri” that reflect the political and socioeconomic makeup of the household. Some of these brand-name segments promise a package of economically stressed individuals to target with names like “Struggling Elders” and “Tight Money.”

Military

Veterans are the subject of several audience segments, as are active and retired members of the military.

Location and Geofencing

Consumers are packaged according to their location history and movements. Advertisers were offered segments that appeared to target people based on where they shop, work, and visit, including those who go to state capitol buildings, congressional offices, federal agency offices, and locations like defense contractor and gun manufacturer headquarters.

Brand Protection

In addition to using audience segments to determine who will see an ad, platforms like Xandr also offer ways for advertisers to control when their ads won’t appear. “Brand protection” or “negative keyword” segments are lists of keywords that let advertisers prevent an ad from appearing in contexts that could reflect poorly on them.

“The most common example is, you don’t want your airplane ad running alongside an article about a plane crash,” said Nandini Jammi, co-founder of Check My Ads, a nonprofit ad tech watchdog group, in an interview with The Markup. Jammi said there is a group of broadly agreed-upon categories that all brands want to steer clear of, known in the industry as “the Dirty Dozen”: death and injury, military conflict, adult content, terrorism, hate speech, obscenity, drugs, tobacco, firearms, crime, online piracy and spam, and harmful sites. Oracle offers segments to avoid these categories as part of its “Contextual Intelligence” offerings.

This mechanism is a blunt instrument, however, and industry observers have seen how easily a newsworthy term like “coronavirus” triggered the mechanism, preventing ads from being placed and unintentionally choking off crucial revenue to the struggling ad-supported news business.
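To see why keyword blocking is so blunt, consider a toy version of it. The sketch below is an assumption-laden illustration, not how Xandr or Oracle implement brand protection: the blocklist, the headlines, and the function name `is_blocked` are all invented for the example.

```python
import re

# Hypothetical blocklist; real brand-protection lists are not in the article.
BLOCKED_KEYWORDS = frozenset({"crash", "coronavirus"})


def is_blocked(page_text: str, blocked: frozenset[str] = BLOCKED_KEYWORDS) -> bool:
    """Return True if any blocked keyword appears as a whole word in the text."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return not words.isdisjoint(blocked)


headlines = [
    "Investigators examine cause of plane crash",
    "Where to get a coronavirus vaccine near you",
    "Local bakery wins regional award",
]

# One flag per headline: True means no ad would be placed on that page.
blocked_flags = [is_blocked(h) for h in headlines]
```

A simple word match cannot tell a tragedy from a public-service story, so the second headline is blocked just like the first, and the publisher loses the ad revenue either way.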

What Is Xandr?

Xandr is an online advertising platform that Microsoft purchased from AT&T in 2021.

Xandr connects and serves both sides of the advertising ecosystem: the “supply” side of publishers with open ad slots and the “demand” side of advertisers looking to place their ads in front of people.

Advertisers use Xandr to place their ads across various digital advertising channels, targeting audience segments as they hear ads in streaming audio and view them on the web, in video, and on connected televisions. Xandr also provides the ability to measure advertising performance and to trade in real-time ad auctions.

Publishers use Xandr to sell and manage their ad inventory, optimize the highest prices for ad placement, sell their ad space in real-time auctions, measure advertising success, and perform quality control to make sure only appropriate ads appear next to their content.

The audience segment file analyzed by The Markup was found on a documentation page on Xandr’s website under the heading “Data Marketplace – Buyer Overview.”

What Companies Are Providing These Segments?

This spreadsheet lists 93 distinct data providers, but many of the segments reference other data companies, which may indicate the origin of the data. Companies owned by data giant Oracle make up more than one-third of the segments (36 percent).
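A share like the 36 percent figure comes from counting rows per supplier and dividing by the total. The sketch below shows that arithmetic on toy data; the supplier names and row counts are placeholders, and a real analysis would first need to group suppliers under their parent company, a step this example leaves out.

```python
from collections import Counter


def supplier_shares(supplier_names: list[str]) -> dict[str, float]:
    """Return each supplier's share of rows as a percentage of all rows."""
    counts = Counter(supplier_names)
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


# Toy input: 36 of 100 rows come from one placeholder supplier.
toy_rows = ["Supplier A"] * 36 + ["Supplier B"] * 40 + ["Supplier C"] * 24
shares = supplier_shares(toy_rows)
distinct_suppliers = len(shares)
```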

Oracle did not respond to requests for comment.

How Are These Segments Used?

Here’s a hypothetical scenario (albeit greatly simplified) of how advertisers use audience segments like the ones analyzed by The Markup.

  • You are scrolling through a news website.
  • You tap on a link to read an article about a new study looking at people diagnosed with depression, as you have a close friend suffering from the condition.
  • As the page starts to load, a signal goes out from an advertising platform used by the website publisher that says there is an available ad slot up for auction. This signal includes information about the website, information about the page you requested, the ad size, your device or mobile ad ID, your IP address, and often your approximate location.
  • Another ad platform receives the signal and opens a bidding process for advertisers who wish to show you an ad.
  • Ad platforms working on behalf of the advertisers analyze the data in the bid request to see if it aligns with the advertisers’ current campaigns.
  • One of the bidders recognizes your IP address and ad ID and finds that you are in the “Health & Fitness::Depression (audience interest)” segment. This bidder is an ad agency working on behalf of its client, a pharmaceutical company that sells drugs to treat depression and is willing to pay enough in the real-time auction to win the ad placement.
  • The ad agency submits its bid through the ad platform and wins the auction.
  • An ad for an anti-depression drug made by the pharmaceutical company loads on your page.
  • The whole process of auctioning your attention unfolded in the blink of an eye, mere milliseconds.
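The steps above can be sketched in a few lines of code. This is a deliberately simplified model, assuming a highest-bid-wins auction and invented names throughout (`BidRequest`, `Bidder`, `run_auction`, the bidder names, the prices, the URL, and the IDs); real ad exchanges carry far more data in a bid request and may price auctions differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BidRequest:
    page_url: str
    ad_size: str
    ad_id: str
    ip_address: str


@dataclass
class Bidder:
    name: str
    target_segment: str
    max_bid: float
    # Maps an ad ID to the segments this bidder believes the person is in.
    known_segments: dict[str, set[str]]

    def bid(self, request: BidRequest) -> Optional[float]:
        """Bid only if the requester's ad ID is in the targeted segment."""
        segments = self.known_segments.get(request.ad_id, set())
        return self.max_bid if self.target_segment in segments else None


def run_auction(request: BidRequest, bidders: list[Bidder]) -> Optional[tuple[str, float]]:
    """Collect bids and return (winner name, price), or None if nobody bids."""
    bids = [(b.name, amount) for b in bidders if (amount := b.bid(request)) is not None]
    return max(bids, key=lambda pair: pair[1]) if bids else None


DEPRESSION = "Health & Fitness::Depression (audience interest)"
request = BidRequest("https://news.example/depression-study", "300x250", "ad-id-123", "203.0.113.7")
bidders = [
    Bidder("pharma_agency", DEPRESSION, 4.50, {"ad-id-123": {DEPRESSION}}),
    Bidder("shoe_retailer", "Affluent Millennials", 6.00, {"ad-id-999": {"Affluent Millennials"}}),
]
winner = run_auction(request, bidders)
```

In this toy run the shoe retailer would pay more, but it has no segment match for this reader and stays out, so the pharmaceutical agency wins the slot. The segment lookup, not the page content alone, decides who competes.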

Potential Harms and Tough Times

Adam Schwartz, a senior lawyer at the Electronic Frontier Foundation (EFF), said that the effort by the online ad industry to closely target people, as in the segments file reviewed by The Markup, constitutes “one of the greatest threats to data privacy” today.

Of the companies providing the segments, Schwartz said, “It’s especially alarming to see that they are amassing information about reproductive health given that there are an increasing number of states that want to punish people for getting reproductive health care.”

At least one person targeted based on a segment in the Xandr file was also alarmed.

Markup reader Paul Bowers said “it was jarring to see” himself referred to as part of a financial audience segment listed as “Tough Times” in materials handed over to him by the grocery chain Food Lion after Bowers requested a copy of his data from the company.

In a 32-page PDF file provided by Food Lion, “Tough Times” was listed as Bowers’ “Mosaic Household Description.” In a 2014 Experian marketing document, the Tough Times segment was described as a collection of “Older, lower income and ethnically-diverse singles typically concentrated in inner-city apartments.”

“My wife and I do pretty well for ourselves in the grand scheme of things,” Bowers told The Markup, “and the income figures this company had for us were way off.” He suspects that location may have something to do with it. “It might have been because our Food Lion is in a low-income area.”

Mosaic, the originator of the “Tough Times” segment, is a brand of audience products sold by the consumer data giant Experian. The “Tough Times” segment was available on Xandr, judging from the file, which lists it as being supplied by Experian and Oracle’s BlueKai platform under the name “Branded Data > Experian > Mosaic > Group S: Economic Challenges > S71 – Tough Times (BlueKai).”

Bowers requested his data from Food Lion after reading our story about supermarket shopper data collection. In addition to the “Tough Times” classification, the PDF he received contained information about his shopping patterns and a series of numerical scores indicating how much he “engaged” with certain categories of goods.

Jordan Takeyama, a spokesperson for Experian, told The Markup in an email, “We use anonymized, aggregated and modeled data to build the segments, and information about individuals is never shared with any organization.” Takeyama said that the example segments we sent them “are outdated and no longer available to our clients.” Food Lion did not respond to a request for comment.

Bowers said that he was surprised at some of the other audience segments in Experian Mosaic. “I knew that marketers were interested in broad segments like ‘males ages 18-35,’ but I had no idea how granular these segments could be.”

Takeyama said that Experian “… make[s] it easy for consumers to see, correct, opt-out from the use or sale, and delete their personal information as defined by law from our databases.”

What Can I Do to Find Out What Segments I Am in—and How Do I Stay out of Them?

There are various ways you can prevent companies from tracking you and thus avoid ending up in ad audience segments in the first place. But let’s look at what happens after you have been profiled in the system. If you are curious about which audience segments you might appear in, there are a few things you can do.

To see what segments you’ve been included in, you can submit a request to review your data from large data brokers like Oracle, Acxiom, or Experian. Some companies allow you to submit corrections when errors are found. After seeing how companies have categorized you, you may choose to opt out of data collection. Most companies describe the process for opting out on their privacy policy pages.

Facebook and Instagram users can see the ad topics that Meta has generated based on their online and offline behavior, and users have the ability to remove unwanted or inaccurate topics.

Some observers have proposed more systemic solutions to help people avoid tracking. EFF advocates banning targeted advertising outright, for example. “We need the government to step in and enact real data privacy legislation,” said the EFF’s Schwartz.

The advertising industry, meanwhile, has attempted to self-regulate. Xandr is a member of the nonprofit Network Advertising Initiative, or NAI, which requires members to comply with voluntary policies on the handling of consumers’ sensitive information and says it conducts annual compliance reviews. NAI advises members to obtain opt-in consent when using sensitive health information, and many of the health-related segments The Markup found in the file appear to be sensitive based on NAI’s definition. But it’s not clear if opt-in consent was used in the data collection process.

Similarly, the file contains many segments referencing the sorts of locations that NAI standards classify as “sensitive points of interest,” even though the standards say data collection in those types of locations should be limited.

Nat Wood, a spokesperson for NAI, told The Markup in an emailed statement that NAI conducts comprehensive annual reviews of its members. Wood said that the segments can be created in several different ways. “Third-party segments come from various sources and don’t necessarily rely on sensitive personal data. They could be modeled or lookalike data; based on purchase history with opt-in consent; or available in some jurisdictions but not others.”

How Accurate Are These Segments?

Tim Hwang is a researcher and author of “Subprime Attention Crisis,” a book that examines what he sees as the digital ad industry’s structural flaws. He questions whether advertising technology is really as effective at targeting as it claims to be.

“We do have examples of segments that genuinely seem to create the opportunity to sell products,” Hwang said in an interview. For example, if “we know that you have no money in your bank account,” then “this is a great time to sell you a super high interest loan.”

But when it comes to the industry profiling people in depth, he said, “[t]he reality is that it is indeed just a lot messier than all that. And it is basically kind of like a patchwork.”

This article was originally published on The Markup and was republished under the Creative Commons Attribution-NonCommercial-NoDerivatives license.
