Showing posts with label big data. Show all posts

Thursday, November 12, 2020

Even More On The Ad Bubble

I've been writing for some time about the hype around online advertising. There's a lot of evidence that it is ineffective. Recently, the UK's Information Commissioner's Office concluded an investigation into Cambridge Analytica's involvement in the 2016 US election and the Brexit referendum. At The Register, Shaun Nichols summarizes their conclusions in UK privacy watchdog wraps up probe into Cambridge Analytica and... it was all a little bit overblown, no?:
El Reg has heard on good authority from sources in British political circles that Cambridge Analytica's advertised powers of online suggestion were rather overblown and in fact mostly useless. In the end, it was skewered by its own hype, accused of tangibly influencing the Brexit and presidential votes on behalf of political parties and campaigners using its Facebook data. Yet, no evidence, according to the ICO, could be found supporting those specific claims.
Below the fold I look at this, a recent book on the topic, and other evidence that has emerged since I wrote Contextual vs. Behavioral Advertising.

Tuesday, May 19, 2020

The Death Of Corporate Research Labs

In American innovation through the ages, Jamie Powell wrote:
who hasn’t finished a non-fiction book and thought “Gee, that could have been half the length and just as informative. If that.”

Yet every now and then you read something that provokes the exact opposite feeling. Where all you can do after reading a tweet, or an article, is type the subject into Google and hope there’s more material out there waiting to be read.

So it was with Alphaville this Tuesday afternoon reading a research paper from last year entitled The changing structure of American innovation: Some cautionary remarks for economic growth by Arora, Belenzon, Patacconi and Suh (h/t to KPMG’s Ben Southwood, who highlighted it on Twitter).

The exhaustive work of the Duke University and UEA academics traces the roots of American academia through the golden age of corporate-driven research, which roughly encompasses the postwar period up to Ronald Reagan’s presidency, before its steady decline up to the present day.
Arora et al argue that a cause of the decline in productivity is that:
The past three decades have been marked by a growing division of labor between universities focusing on research and large corporations focusing on development. Knowledge produced by universities is not often in a form that can be readily digested and turned into new goods and services. Small firms and university technology transfer offices cannot fully substitute for corporate research, which had integrated multiple disciplines at the scale required to solve significant technical problems.
As someone with many friends who worked at the legendary corporate research labs of the past, including Bell Labs and Xerox PARC, and who myself worked at Sun Microsystems' research lab, this is personal. Below the fold I add my 2c-worth to Arora et al's extraordinarily interesting article.

Tuesday, March 24, 2020

More On Failures From FAST 2020

A Study of SSD Reliability in Large Scale Enterprise Storage Deployments by Stathis Maneas et al, which I discussed in Enterprise SSD Reliability, wasn't the only paper at this year's Usenix FAST conference about storage failures. Below the fold I comment on one specifically about hard drives rather than SSDs, making it more relevant to archival storage.

Tuesday, February 18, 2020

The Scholarly Record At The Internet Archive

The Internet Archive has been working on a Mellon-funded grant aimed at collecting, preserving and providing persistent access to as much of the open-access academic literature as possible. The motivation is that much of the "long tail" of academic literature comes from smaller publishers whose business model is fragile, and who are at risk of financial failure or takeover by the legacy oligopoly publishers. This is particularly true if their content is open access, since they don't have subscription income. This "long tail" content is thus at risk of loss or vanishing behind a paywall.

The project takes two opposite but synergistic approaches:
  • Top-Down: Using the bibliographic metadata from sources like CrossRef to ask whether that article is in the Wayback Machine and, if it isn't, trying to get it from the live Web. Then, if a copy exists, adding the metadata to an index.
  • Bottom-Up: Asking whether each of the PDFs in the Wayback Machine is an academic article, and if so extracting the bibliographic metadata and adding it to an index.
Below the fold, a discussion of the progress that has been made so far.
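The two passes above can be sketched in a few lines. This is a minimal, self-contained illustration, not the project's actual code: the data structures and helpers here are hypothetical stand-ins for the real CrossRef harvesting, Wayback Machine lookups, and PDF classification.

```python
# Hypothetical stand-ins for the project's real infrastructure.
WAYBACK = {"https://example.org/a.pdf"}    # URLs already in the Wayback Machine
LIVE_WEB = {"https://example.org/b.pdf"}   # URLs still fetchable on the live Web

def top_down(records, index):
    """Metadata first: check the archive; if absent, try the live Web."""
    for rec in records:
        url = rec["url"]
        if url not in WAYBACK and url in LIVE_WEB:
            WAYBACK.add(url)               # fetched from the live Web and archived
        if url in WAYBACK:
            index.append(rec)              # index only articles we actually hold

def bottom_up(pdfs, index):
    """Archived PDFs first: keep those that look like academic articles."""
    for pdf in pdfs:
        if pdf.get("is_paper"):
            index.append({"doi": pdf["doi"], "url": pdf["url"]})

index = []
top_down([{"doi": "10.1/x", "url": "https://example.org/b.pdf"}], index)
bottom_up([{"doi": "10.1/y", "url": "https://example.org/a.pdf",
            "is_paper": True}], index)
```

The synergy is visible even in the sketch: the top-down pass can trigger new captures, which the bottom-up pass later mines for articles the metadata sources missed.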

Tuesday, January 14, 2020

Advertising Is A Bubble

The surveillance economy, and thus the stratospheric valuations of:
Facebook and Alphabet (Google’s parent), which rely on advertising for, respectively, 97% and 88% of their sales.
depend on the idea that targeted advertising, exploiting as much personal information about users as possible, results in enough increased sales to justify its cost. This is despite the fact that both experimental research and the experience of major publishers and advertisers show the opposite. Now, The new dot com bubble is here: it’s called online advertising by Jesse Frederik and Maurits Martijn provides an explanation for this disconnect. Follow me below the fold to find out about it and enjoy some wonderful quotes from them.

Thursday, January 9, 2020

Library of Congress Storage Architecture Meeting

The Library of Congress has finally posted the presentations from the 2019 Designing Storage Architectures for Digital Collections workshop that took place in early September. I've greatly enjoyed the earlier editions of this meeting, so I was sorry I couldn't make it this time. Below the fold, I look at some of the presentations.

Thursday, July 25, 2019

Carl Malamud's Text Mining Project

For many years now it has been obvious that humans can no longer effectively process the enormous volume of academic publishing. The entire system is overloaded, and its signal-to-noise ratio is degrading. Journals are no longer effective gatekeepers, indeed many are simply fraudulent. Peer review is incapable of preventing fraud, gross errors, false authorship, and duplicative papers; reviewers cannot be expected to have read all the relevant literature.

On the other hand, there is now much research showing that computers can be effective at processing this flood of information. Below the fold I look at a couple of recent developments.

Thursday, October 18, 2018

Betteridge's Law Violation

Erez Zadok points me to Wasim Ahmed Bhat's Is a Data-Capacity Gap Inevitable in Big Data Storage? in IEEE Computer. It is a violation of Betteridge's Law of Headlines because the answer isn't no. But what, exactly, is this gap? Follow me below the fold.

Monday, May 7, 2018

Might Need Some Work

"I Agree" - Source
Cory Doctorow writes:
"I Agree" is Dima Yarovinsky's art installation for Visualizing Knowledge 2018, with printouts of the terms of service for common apps on scrolls of colored paper, creating a bar chart of the fine print that neither you, nor anyone else in the history of the world, has ever read.
Earlier, Doctorow explained that the GDPR requires that:
Under the new directive, every time a European's personal data is captured or shared, they have to give meaningful consent, after being informed about the purpose of the use with enough clarity that they can predict what will happen to it.

Saturday, February 20, 2016

Andrew Orlowski speaks!

At the Battle of Ideas Festival at the Barbican last year, Claire Fox chaired a panel titled "Is Technology Limiting Our Humanity?", and invited my friend Andrew Orlowski of The Register to speak. Two short but thought-provoking extracts, with titles supplied by The Register's editors, are now up.
Playfair in particular is a fascinating character:
an embezzler and a blackmailer, with some unscrupulous data-gathering methods. He would kidnap farmers until they told him how many sheep they had. Today he’s remembered as the father of data visualisation. He was the first to use the pie chart, the line chart, the bar chart.
...
Playfair stressed the confusion of the moment, its historical discontinuity, and advanced himself as a guru with new methods who was able to make sense of it.
Both extracts are worth your time.

Thursday, October 8, 2015

Two In Two Days

Tuesday, Cory Doctorow pointed to "another of [Maciej Cegłowski's] barn-burning speeches". It is entitled What Happens Next Will Amaze You and it is a must-read exploration of the ecosystem of the Web and its business model of pervasive surveillance. I added a comment pointing to it on my post from last May, Preserving the Ads?, because Cegłowski goes into much more of the awfulness of the Web ecosystem than I did.

Yesterday Doctorow pointed to another of Maciej Cegłowski's barn-burning speeches. This one is entitled Haunted by Data, and it is just as much of a must-read. Doctorow is obviously a fan of Cegłowski's and now so am I. It is hard to write talks this good, and even harder to ensure that they are relevant to stuff I was posting in May. This one takes the argument of The Panopticon Is Good For You, also from May, and makes it more general and much clearer. Below the fold, details.