Incidental Data Capture

Rey Farhan

18 March 2015

The vast scale of consumer data collection administered by today’s internet technology companies includes billions of transactions, personal identifiers, location information, and other metadata, totalling trillions of aggregated data elements.² Their principal use is in the construction of predictive character profiles that can inform business development. In short, these organizations aim to create common stories about our lives. By processing data points that arise from human behavior, they draw out patterns from which they infer our likely age, sex, race, purchasing habits, and inclinations. We might be ‘new adopters,’ or ‘young professional millennials’: personas subjectively fitted to our data with a mixture of statistical inference and approximate human judgment. These inferences then become the basis for a science (or perhaps art) of categorization and forecasting that exerts an invisible force upon us by shaping the platforms we use for communication, entertainment, and consumption.

Image: *Manga style space*, from Lev Manovich’s How to Compare One Million Images

What these profiles accomplish for companies is to replace the anxiety of managerial decision-making with strategies for precise, iterative, and measurable targeting against consumers’ data-derived personas: operations previously subject to human judgment give way to quantifiable performance goals and hedged-risk business plans.

The development of personas starts with data capture. At the outset, scale and coverage are of primary importance. Acquiring more people’s data lends more power to the statistical techniques used to identify patterns, and having information about more types of human activity leads to more vivid personas. ‘Bigger’ data equates to improved accuracy, detail, and certainty, all of which increase the business value of the entire collection and processing operation. Companies are thus presented with a strong economic incentive to push their data collection further and to exchange unprocessed behavioral information, as well as personas, with other companies to make up for shortfalls in their own captured stocks.
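
A rough illustration of why scale matters statistically, offered as a sketch rather than a description of any company's method: under the textbook normal approximation, the margin of error around an observed proportion shrinks with the square root of the number of people observed. The function name `margin_of_error` and the 30% figure are assumptions made up for the example.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p observed across n people.

    Uses the standard normal approximation z * sqrt(p * (1 - p) / n).
    """
    return z * math.sqrt(p * (1 - p) / n)


# Hypothetical scenario: 30% of observed users show some behavior.
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}  margin=+/-{margin_of_error(0.30, n):.4f}")
```

A hundredfold increase in the number of people observed narrows the margin tenfold, which is one way to read the claim that ‘bigger’ data means more certainty. It says nothing about whether the behavior being measured was captured or classified correctly in the first place.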

In many cases users directly contribute the raw materials by trading their data for use of a product. This exchange is implicit for free-to-use services like Twitter, Facebook, and Google: we can use their services at no cost, and they may use our data for advertising purposes. Less apparent is the extent to which business entities attempt to fill in the gaps in consumer narratives outside of these applications. Twitter’s tracking code is said to be present on 20% of websites, Facebook Connect on 30%.³ Google’s tracking reach is practically inescapable, traversing its search engine, Gmail, Chrome, Google Analytics, YouTube, and Android properties as well as its many other integrated software and hardware products. But in addition to the major social platforms, we are also accompanied by a host of silent third-party tracking services, all keen to follow our behavior as we traverse the web and physical space.

Collection and distribution services of this type operate entirely behind the scenes. One such company, Neustar, spun off from Lockheed Martin, provides a compulsory service used by every cell phone user, which allows telephone number portability between carriers. With its access to customer contracts, Neustar determines real names, addresses, emails, ages, and credit card details. Neustar also possesses a massive in-house web tracking service to match its customers’ phone history to their movements across the web. This data is then packaged for resale to other companies. Its ambition for fuller coverage has also led Neustar to establish reciprocal data-sharing relationships with high-usage websites and services to capture a deeper set of behavioral metrics.

The processing that follows data capture is then a marriage of statistical and subjective interpretation. It is vital to establish age, location, and sex. Beyond that, all tastes, habits, interests, and social connections may be useful. Direct survey data is sometimes used to anchor subjective classifiers with real-world observations; however, no peer review is required. A persona definition may be the product of tireless labor from a seasoned research team, or a haphazard assessment produced by an overworked intern.

There are no clear industry standards, no checks to verify an analyst’s persona classifier isn’t the data-equivalent of a mysterious sausage meat. Further, no mechanism ensures congruence between the various judgments made about information dossiers across companies.⁴ While we might feel secure in our identity, hundreds of divergent stories about us could be traded on shadow markets between various commercial entities, where they may eventually be applied to any transaction we make. The eventual use of personas to inform advertising, interaction design, even pricing is then entirely open-ended. Our data can be recycled to help serve “personalized” content suggestions upon checkout, used to train machine learning-driven email marketing campaigns for better market segment targeting, or simply stored away for use at a later date.

Once our personal information is out of our possession, we can do very little to control its circulation; the same can be said of its ‘personified’ form. In fact, the only direct control we can exercise is to tightly restrict our participation in digital culture. Practically, this means receding from the modern world: no social media, no cell phone, carefully blocked internet, no credit card purchases, the list goes on. For most, this is no choice at all.

And we can only speculate as to where this expanding universe of data products and patchwork user-narratives will lead, or whether information derived today will still be in use 20 years from now. Will the incidental remains of our daily online activities become lasting reflections crystallized in metadata? It is impossible to say, given how little forethought is currently dedicated to the accumulation, diffusion, and retention of personal information.

We currently lack much of the legal framework necessary to adequately address fundamental issues of ownership and rights regarding personal data. Everyone contributes to their own ever-growing ‘data legacy,’ fragmented across countless digital properties, but the fate of these imprints is uncertain; the dynamics of their development and coagulation into dominant composite narratives are entirely without precedent. Thus, chimeric projections of self, born gradually through piecemeal exchange of dimensional behavioral data, may become the de facto norm.

Though such ‘insights’ might be sufficient for ad-targeting, inevitable mission-creep will result in diversification of their use. As behavioral information drifts further from the original capture conditions, susceptibility to erroneous pattern recognition, overfitting, cognitive bias, and flawed statistical inference increases. Companies making the shift toward data-driven business operations and relying on such impoverished data might just as well manage by way of horoscope reading. But what may be bad for business could have tragic consequences for consumers. Applying poor data and poor data practices can have a real impact on physical and economic well-being. Siphoning ill-fit behavioral information into the health insurance or payday loan industry, for instance, is common practice despite its plainly exploitative agenda, and it quickly turns from professional negligence into a practice that causes real harm.

Citizens of modern societies expect that the manner in which they represent their personal identity will be protected. They could reasonably assume that these protections have been extended to the information technology marketplace. Instead, technological developments have outpaced the necessary legal and social discourse. So while commercial data use increasingly exerts its largely intangible force upon society, the public remains unaware and vulnerable without sufficient regulations in place.

Our presence online pays little heed to national boundaries, meaning an effort to rectify digital consumer protection must occur internationally. But while initiating an open conversation of this scale will be difficult, discussion amongst an educated global public is the first step to finding a solution that balances the ethical dimensions of personal data retention with practical oversight of consumer business practices. The status quo, in which only select countries enforce their own legislation, has the overall effect of weakening controls to the level of those with the lowest standards.

Addressing personal ownership of data profiles is critical to sustainable solutions for personal data protection, as is addressing how organizations relate fragmented or distributed data profiles back to individuals. We must also develop a common parlance and set of narratives to describe data collection, profiling, and retention so that the wider public can grasp the architecture of consumer informatics. Without a relatable language and common stories, these topics risk appearing alien, theoretically inaccessible, and needlessly conspiratorial while, in truth, commercial data capture has become invisibly commonplace and unremarkable in today’s world.

¹ Cover image courtesy e-discoveryteam.
² “Data Brokers: A Call for Transparency and Accountability,” Federal Trade Commission (May 2014).
³ “‘Like’ Button Follows Web Users,” The Wall Street Journal (May 2011).
⁴ Solove, D. The Digital Person: Technology and Privacy in the Information Age. (New York: New York University Press, 2004).