- State of the Art in Program Synthesis — conference, with talks to be posted afterward, run by a YC startup. Program Synthesis is one of the most exciting fields in software today, in my humble opinion: Programs that write programs are the happiest programs in the world, in the words of Andrew Hume. It’ll give coders superpowers, or make us redundant, but either way, it’s interesting.
- Alternative Influence (Data and Society) — amazing report. Extremely well-written, it lays out how the alt right uses YouTube. These strategies reveal a tension underlying the content produced by these influencers: while they present themselves as news sources, their content strategies often more accurately consist of marketing and advertising approaches. These approaches are meant to provoke feelings, memories, emotions, and social ties. In this way, the “accuracy” of their messaging can be difficult to assess through traditional journalistic tactics like fact-checking. Specifically, they recount ideological testimonials that frame ideology in terms of personal growth and self-betterment. They engage in self-branding techniques that present traditional, white, male-dominated values as desirable and aspirational. They employ search engine optimization (SEO) to highly rank their content against politically charged keywords. And they strategically use controversy to gain attention and frame political ideas as fun entertainment.
- Chatbot and Related Research Paper Notes with Images — Papers related to chatbot models in chronological order spanning about five years from 2014. Some papers are not about chatbots, but I included them because they are interesting, and they may provide insights into creating new and different conversation models. For each paper I provided a link, the names of the authors, and GitHub implementations of the paper (noting the deep learning framework) if I happened to find any. Since I tried to make these notes as concise as possible, they are in no way summarizing the papers but are merely a starting point to get a hang of what the paper is about, and to mention main concepts with the help of pictures.
- I Don’t Know (Wired) — Two percent of Brits don’t know whether they’ve lived in London before. Five percent don’t know whether they’ve been attacked by a seagull or not. A staggering one in 20 residents of this fine isle don’t know whether or not they pick their nose. (via Flowing Data)
- Haberman — interesting research into one way that online maps end up with places that aren’t places.
- Blueprint — a React-based UI toolkit for the web. It is optimized for building complex, data-dense web interfaces for desktop applications that run in modern browsers and IE11. This is not a mobile-first UI toolkit.
- IBM Open Sources Power Chip Instruction Set (Next Platform) — To be precise about what IBM is doing, it is opening up the Power ISA [Instruction Set Architecture] and giving it to the OpenPower Foundation royalty free with patent rights, and that means companies can implement a chip using the Power ISA without having to pay IBM or OpenPower a dime, and they have patent rights to what they develop. Companies have to maintain compatibility with the instruction set, King explains, and there are a whole set of compatibility requirements, which we presume are precisely as stringent as Arm and are needed to maintain runtime compatibility should many Power chips be developed, as IBM hopes will happen.
- Less than Half of Google Searches Now Result in a Click (Sparktoro) — We can see a consistent pattern: organic shrinks while zero-click searches and paid CTR rise. But the devil’s in the details, and, in this case, mostly the mobile details, where Google’s gotten more aggressive with how ads and instant answer-type features appear. Everyone has to beware of the self-serving, “hey, we’re doing people a favor by taking (some action that results in greater market domination for us)” because there’s a time when the fact that you have meaningful competition is better for the user than a marginal increase in value add from keeping them in your property longer. (via Slashdot)
- Super-Contributors and Power Laws (MySociety) — Overall, two-thirds of users made only one report—but the reports made by this large set of users make up only 20% of the total number of reports. This means that different questions can lead you to very different conclusions about the service. If you’re interested in the people who are using FixMyStreet, that two-thirds is where most of the action is. If you’re interested in the outcomes of the service, this is mostly due to a much smaller group of people. This dynamic applies pretty much everywhere and is worth understanding.
- Facebook Prophet — a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well. Written in Python and R.
- On Nonscalability: The Living World Is Not Amenable to Precision-Nested Scales — to scale well is to develop the quality called scalability, that is, the ability to expand—and expand, and expand—without rethinking basic elements. […] [B]y its design, scalability allows us to see only uniform blocks, ready for further expansion. This essay recalls attention to the wild diversity of life on earth through the argument that it is time for a theory of nonscalability. (via Robin Sloan)
- Information Operations Directed at Hong Kong (Twitter) — Today we are adding archives containing complete tweet and user information for the 936 accounts we’ve disclosed to our archive of information operations—the largest of its kind in the industry. This is a goldmine for researchers, as you can see from Renee DiResta’s notes. Facebook also removed accounts for the same reason but hasn’t shared the data. Google has not taken a position yet, which prompted Alex Stamos to say, “Two of the three relevant companies have made public statements. Neither have realistic prospects in the PRC, the other does. Lots of lessons from this episode, but one might be a reinforcement of how Russia represents ‘easy mode’ for platforms doing state attribution. It’s a lot harder when the actor is financially critical, like the PRC or India.” We’re in interesting times, and research around content moderation is the most interesting thing I’ve seen on the internet since SaaS. This work cuts to human truths, technical capability, and the limits of openness.
- Robust Learning from Untrusted Sources (Morning Paper) — designed to let you incorporate data from multiple “weakly supervised” (i.e., noisy) data sources. Snorkel replaces labels with probability-weighted labels, and then trains the final classifier using those.
- Imaging Floppies (Jason Scott) — recording the magnetic strength everywhere on the disk so you archive all the data not just the data you can read once. The result of this hardware is that it takes a 140 kilobyte floppy disk (140k) and reads it into a 20 megabyte (20,000 kilobyte) disk image. This means a LOT of the magnetic aspects of the floppy are read in for analysis. […] This doesn’t just dupe the data, but the copy protection, unique track setup, and a bunch of variance around each byte on the floppy to make it easier to work with. The software can then do all sorts of analysis to give us excellent, bootable disk images. Don’t ever think that archiving is easy, or problems are solved.
- Chart.xkcd — a chart library that plots “sketchy,” “cartoony,” or “hand-drawn” styled charts. The world needs more whimsy.
- CROKAGE: A New Way to Search Stack Overflow — a paper about a service [that] takes the description of a programming task as a query and then provides relevant, comprehensive programming solutions containing both code snippets and their succinct explanations. There’s a replication package on GitHub. Following in the footsteps of Douglas Adams’s Electric Monk (which people bought to believe things for them) and DVRs (which people use to watch TV for them), we now have software that’ll copy dodgy code from the web for you. Programmers, software is coming for your jobs.
- Cheap Fakes Beat Deep Fakes — One of the fundamental rules of information warfare is that you never lie (except when necessary.) Deepfakes are detectable as artificial content, which reveals the lie. This discredits the source of the information and the rest of their argument. For an information warfare campaign, using deepfakes is a high-risk proposition.
- I Took 9 Different Commercial DNA Tests and Got 6 Different Results — refers to the dubious ancestry measures. “Ancestry itself is a funny thing, in that humans have never been these distinct groups of people,” said Alexander Platt, an expert in population genetics at Temple University in Philadelphia. “So, you can’t really say that somebody is 92.6 percent descended from this group of people when that’s not really a thing.”
- Dirty Tricks 6502 Programmers Use — wonderfully geeky dissection of a simple task rendered in as few bytes as possible.
It’s a nerve-wracking time to be a Big Tech company. Yesterday, a US subcommittee on antitrust grilled representatives from Amazon, Google, Facebook, and Apple in Congress, and presidential candidates have gone so far as to suggest that these behemoths should be broken up. In the European Union, regulation is already happening: in March, the EU levied its third multibillion-dollar fine against Google for anti-competitive behavior.
In his 2018 letter to shareholders, published this past April, Jeff Bezos was already prepping for conversations with regulators. He doesn’t think Amazon is a monopoly. Instead, the company’s founder argues it is “just a small player in global retail.”
In Bezos’s defense, for many of the products Amazon sells, there are indeed many alternative sources, suggesting plenty of competition. Despite Amazon’s leadership in online retail, Walmart is more than double Amazon’s size as a general retailer, with Costco not far behind Amazon. Specialty retailers like Walgreens and CVS in the pharmacy world and Kroger and Albertsons in groceries also dwarf Amazon’s presence in their categories.
But Amazon does not just compete with Walmart, CVS, Kroger, and other retailers—it also competes with the merchants who sell products through its platform.
This competition isn’t just the obvious kind, such as the Amazon Basics-branded batteries that by 2016 represented one third of all online battery sales, as well as similar Amazon products in audio, home electronics, baby wipes, bed sheets, and kitchenware. Amazon also competes with its merchants for visibility on its platform, and charges them additional fees for favored placement. And because Amazon is now leading with featured products rather than those its customers think are the best, its merchants are incentivized to advertise on the platform. Amazon’s fast-growing advertising business is thus a kind of tax on its merchants.
Likewise, Google does not just compete with other search engines like Bing and DuckDuckGo, but with everyone who produces content on the world wide web. Apple’s iPhone and Google’s Android don’t just compete with each other as smartphone platforms, but also with the app vendors who rely on smartphones to sell their products.
This kind of competition is taken for granted by antitrust regulators, who are generally more concerned with the end cost for consumers. And as anyone who has shopped online will know, Amazon is nearly always the cheaper option. (In fact, surveys have suggested that between seven and nine out of 10 Americans will check Amazon to compare the price of a purchase.) As long as the monopoly doesn’t lead to us forking out more money, then antitrust regulators traditionally leave it alone.
However, this view of antitrust leaves out some unique characteristics of digital platforms and marketplaces. These giants don’t just compete on the basis of product quality and price—they control the market through the algorithms and design features that decide which products users will see and be able to choose from. And these choices are not always in consumers’ best interests.
A fresh approach to antitrust
All of the internet giants (Amazon, Google, Facebook, and, insofar as app stores are considered, Apple) provide the illusion of free markets, in which billions of consumers choose among millions of suppliers’ offerings, which compete on the basis of price, quality, and availability.
But if you recognize that what consumers really choose from is not the universe of all possible products, but those that are offered up to them either on the homepage or the search screen, the “shelf space” provided by these platforms is in fact far more limited than the tiniest of local markets—and what is placed on that shelf is uniquely under the control of the platform owner. And with mobile playing a larger and larger role, that digital shelf space is visibly shrinking rather than growing.
In short, the designers of marketplace-platform algorithms and screen layouts can arbitrarily allocate value to whom they choose. The marketplace is designed and controlled by its owners, and that design shapes “who gets what and why” (to use the marvelous phrase from Alvin E. Roth, who received a Nobel prize in economics for his foundational work in the field of market design).
When it comes to antitrust, the question of market power must be answered by analyzing the effect of these marketplace designs on both buyers and sellers, and how they change over time. How much of the value goes to the platform, how much to consumers, and how much to suppliers?
The platforms have the power to take advantage of either side of their marketplace. Any abuse of market power is likely to show up first on the supply side. A dominant platform can squeeze its suppliers while continuing to pass along part of the benefit to consumers, all the while keeping more and more of it for itself.
Over time, though, consumers feel the bite. Power over sellers ultimately translates into power over customers as well. As the platform owner favors its own offerings over those of its suppliers, choice is reduced, though it is only in the endgame that consumer pricing—the typical measure of a monopoly—begins to be affected.
The control that the platforms have over placement and visibility puts them in a unique position to collect what economists call rents: that is, value extracted through the ownership of a limited resource. These rents may come in the form of additional advantage given to the marketplace’s own private-label products, but also through the fees that are paid by merchants who sell through that platform. These fees can take many forms, including the necessity for merchants to spend more on advertising in order to gain visibility; Amazon’s own products don’t have to pay such a levy.
The term “rents” dates back to the very earliest days of modern economics, when agricultural land was still the primary source of wealth. That land was worked productively by tenant farmers, who produced value through their labor. But the bulk of the benefit was taken by the landed gentry, who lived lives of ease on the unearned income that accrued to them simply through the ownership of their vast estates. In today’s parlance, Amazon’s merchants are becoming sharecroppers. The cotton field has been replaced by a search field.
Not all rents are bad. Economist Joseph Schumpeter pointed out that technological innovation often can lead to temporary rents, as innovators initially have a corner on a new product or service. But he also pointed out that these so-called Schumpeterian rents can, over time, become traditional monopolistic rents.
This is what antitrust regulators should be looking at when evaluating internet platform monopolies. Has control over the algorithms and designs that allocate attention become the latest tool in the landlord’s toolbox?
Big Tech has become the internet’s landlord—and rents are rising as a result.
In her book, The Value of Everything, economist Mariana Mazzucato makes the case that if we are really to understand the sources of inequality in our economy, economists must turn their attention back to rents. One of the central questions of classical economics was what activities are actually creating value for society, and which are merely value extracting—in effect charging a kind of tax on value that has actually been created elsewhere.
In today’s neoclassical economics, rents are seen as a temporary aberration, the result of market defects that will disappear given sufficient competition. But whether we are asking fundamental questions about value creation, or merely about insufficient competition, rent extraction gives us a new lens through which to consider antitrust policy.
How internet platforms increase choice
Before digital marketplaces curtailed our choices as consumers, they first expanded our options.
Amazon’s virtually unlimited virtual shelf space radically expanded opportunity for both suppliers and consumers. After all, Amazon carries 120 million unique products in the US alone, compared to about 120,000 in a Walmart superstore or 35 million on walmart.com. What’s more, Amazon operates a marketplace with over 2.5 million third-party sellers, whose products, collectively, provide 58% of all Amazon retail revenue, with only 42% coming from Amazon’s first-party retail operation.
In the first-party retail operation, Amazon buys products from its suppliers and then resells them to consumers. In the third-party operation, Amazon collects fees for providing marketplace services to sellers—including display on amazon.com, warehousing, shipping, and sometimes even financing—but never legally takes possession of the companies’ merchandise. This is what allows it to have so many more products to sell than its competitors: because Amazon never takes possession of inventory but instead charges suppliers for the services it provides, the risk of offering a slow-moving product is transferred from Amazon to its suppliers.
All of this appears to add up to the closest approximation ever seen in retail to what economists call “perfect competition.” This term refers to market conditions in which a large number of sellers with offers to provide comparable products at a range of prices are met by a large number of buyers looking for those products. Those buyers are armed not only with the ability to compare the price at which products are offered, but also to compare the quality of those products via consumer ratings and reviews. In order to win the business of consumers, suppliers must not only offer the best products at the best prices, but must compete for customers to express their satisfaction with the products they have bought.
So far, at least according to the statistics Bezos shared in his annual letter, the success of the Amazon marketplace is a triumph for both suppliers and consumers, and antitrust regulators should look elsewhere. As he put it, “Third-party sellers are kicking our first-party butt.”
He may well be right, but there are warning signs from other internet marketplaces like Google search that suggest the situation may not be as rosy as it appears. As it turns out, regulators need to consider some additional factors in order to understand the market power of internet platforms.
How internet platforms take away choice
If Amazon has become “the everything store” for physical goods, Google is the everything store for information.
Even more than Amazon, Google appears to meet the conditions for perfect competition. It matches up consumers with a near-infinite source of supply. Ask any question, and you’ll be provided with answers from hundreds or even thousands of competing content suppliers.
To do this, Google searches hundreds of billions of web pages created by hundreds of millions of information suppliers. Traditional price matching is absent, since much of the content is offered for free, but Google uses hundreds of other signals to determine what answers its customers are likely to find “best.” They measure such things as the reputation of the sites linking to any other site (page rank); the words those sites use to make those links (anchor text); the content of the document itself (via an AI engine referred to as “the Google Brain”); how likely people are to click on a given result in the list, based on millions of iterations, all recorded and measured; and even whether people clicked on a link and appear to have gone away satisfied (“a long click”) or came back and clicked on another (“a short click”).
The same goes for advertising on Google. Its “pay per click” ad auction model was a breakthrough in the direction of perfect competition: advertisers pay only when customers click on their ads. Both Google and advertisers are thus incentivized to feature ads that users actually want to see.
Only about 6% of Google search results pages contain any advertising at all. Both content producers and consumers have the benefit of Google’s immense effort to index and search all web pages, not just those that are commercially valuable. Google is like a store where all of the goods are free to consumers, but some merchants pay, in the form of advertising, to have their goods placed front and center.
The company is well aware of the risk that advertising will lead Google to favor the needs of advertisers over those of searchers. In fact, “Advertising and mixed motives” is the title of the appendix to Google founders Larry Page and Sergey Brin’s original 1998 research paper on Google’s search algorithms, written while they were still graduate students at Stanford.
“The goals of the advertising business model do not always correspond to providing quality search to users,” they thoughtfully observed. Google made enormous efforts to overcome those mixed motives by clearly separating its advertising results from its organic results, but the company has blurred those boundaries over time, perhaps without even recognizing the extent to which it has done so.
It is undeniable that the Google search results pages of today look nothing like they did when the company went public in 2004. The list of 10 “organic” results with three paid listings on the top and a sidebar of advertising results on the right that once characterized Google is long gone.
Dutch search engine consultant Eduard Blacquière documented the changes in size and placement of adwords (link in Dutch), the pay-per-click advertisements that run alongside searches, between 2010 and 2014. Here’s a page he captured in June 2010, the result for a search for the word “autoverzekering” (“auto insurance” in Dutch).
Note that the adwords at the top of the page have a background tint, and those at the side have a narrower column width, setting both off clearly from the organic results. Take a quick glance at this page, and your eye can quickly jump to the organic results while ignoring the ads if that’s what you prefer.
Here is Blacquière’s dramatization of the change in size of that top block of adwords. As you can see, the ad block has both dramatically changed in size and lost its background color between 2010 and 2014, making it much harder to distinguish ads from organic results.
Today, paid results can push organic results almost off the screen, so that the searcher has to scroll down to see them at all. On mobile pages with advertisements, this is almost always the case. Blacquière also documented the result of several studies done over a five-year period, which found the likelihood of a click on the first organic search result fell from over 40% in 2010 to less than 20% in 2014. This shows that through changes in homepage design alone, Google was able to shift significant attention from organic search results to ads.
Not only is paid advertising supplanting organic search results, but for more and more queries, Google itself has now collected enough information to provide what it considers to be the best answer directly to the consumer, eradicating the need to send us to a third-party website at all.
That’s the box that often appears above the search results when you ask a question, such as What are the lyrics to “Don’t Stop Believing,” or What date did WWII end?; the box to the right that pops up with restaurant reviews and opening hours; or a series of visual cards midway down the screen that show you the actors who appeared in a movie or different kinds of pastries common to a geographic region.
Where does this information come from? In 2010, with the acquisition of Metaweb, Google committed to a project it called “the knowledge graph,” a collection of facts about well-known entities such as places, people, and events. This knowledge graph provides immediate answers for many of the most common queries.
The knowledge graph was initially culled from the web by ingesting information from sources such as Wikipedia, Wikidata, and the CIA Factbook, but since then, it has become far more encyclopedic and has ingested information from all over the web. In 2016, Google CEO Sundar Pichai claimed that the Google knowledge graph contained more than 70 billion facts.
As shown in the figure below, for a popular search that has commercial potential, like visit Yellowstone, not only is the search results page dominated by paid search results (ads) and content directly supplied by Google, but Google’s “answer boxes” are themselves filled with links to other Google pages rather than to third-party websites. (Note that Google personalizes results and also runs hundreds of thousands of A/B tests a day on the effect of minor changes in position, so your own results for this identical search may differ from those shown here.)
As of March 2017, user clickstream data provided by web analytics firm Jumpshot suggests that up to 40% of all Google queries no longer result in a click through to an external website. Think of all the questions you go to Google for that no longer require a second click: what’s the weather? What’s the current value of the euro against the dollar? What’s that song that’s playing in the background? What’s the best local restaurant? Biographies of eminent people, descriptions of cities, neighborhoods, businesses, historical events, quotes by famous authors, song lyrics, stock prices, and flight times all now appear as immediate answers from Google.
I am not necessarily suggesting anti-competitive intent. Google claims, with considerable justice, that all of these changes to search engine result pages are designed to improve user experience. And indeed, it is often helpful to get an immediate answer to a query rather than having to click through to another web site. Furthermore, much of this data is in fact licensed. But these deals seem like a step backward from the perfect competition represented by Google’s original reliance on multi-factor search algorithms to surface the very best information from independent web sites.
The net effect on Google’s financial performance is striking. In 2004, the year that Google went public, it had two principal advertising revenue engines: Adwords (those pay-per-click advertisements that run alongside searches on Google’s own site) and Adsense (pay-per-click advertisements that Google places on third-party websites on their behalf, either in search results on their site or directly alongside their content). In 2004, the two revenue sources were very close to equal. But by 2018, Google’s revenue from advertising on its own properties had grown to 82% of its total advertising revenue, with only 18% coming from the advertising it provides on third-party sites.
These examples illustrate the power of a platform to shape, both by placement on the screen and algorithmic priority, the pages users click on and the products they decide to buy—and therefore also the economic success for the supply side of its marketplace. Google maintains a rigorous separation between the search and advertising teams, but despite that fact, changes in the layout of Google’s pages and its algorithms have played an enormous role in shaping the attention of its users to favor those who advertise with Google.
When Google decides unilaterally on the size and position that its own products take on the screen, it also stops consumers from organically deciding what content to click on or what socks to buy. That’s what antitrust regulators should be considering: whether the algorithmic and design control exerted by sites like Google or Amazon reduces the choices we have as consumers.
Maintaining the illusion of choice
If Google has monopolized our access to information, Amazon’s fast-growing advertising business is now shaping what products consumers are actually given to choose from. Have they, too, taken a bite from the poisoned apple of advertising’s mixed motives?
Like Google, Amazon used to rely heavily on the collective intelligence of its users to recommend the best products from its suppliers. It did this by using information such as the supplier-provided description of the product, the number and quality of reviews, the number of inbound links, the sales rank of similar products, and so on, to determine the order in which search results would appear. These were all factored into Amazon’s default search ranking, which put products that were considered “Most Popular” first.
But as with Google, this Eden of internet collective intelligence may be in danger of coming to an end.
In the example below, you can see that the default search for “best science fiction books” on Amazon now turns up only “Featured” (i.e., paid for) products. Are these the results you’d expect from this search? Where are the Hugo and Nebula award winners? Where are the books and authors with thousands of five-star reviews?
Contrast these results with those for the same search on Google, shown in the figure below. A knowledgeable science-fiction fan might quibble with some of these selections, but this is indeed a list of widely acknowledged classics in the field. In this case, Google presents no advertising, and so the results instead simply reflect the collective intelligence of what the web thinks is best.
While this might be taken as a reflection of the superiority of Google’s search algorithms over Amazon’s, the more important point is to note how differently a platform treats results when it has no particular commercial axe to grind.
Amazon has long claimed that the company is fanatically focused on the needs of its customers. A search like the one shown above, which favors paid results, demonstrates how far the quest for advertising dollars takes them from that avowed goal.
Advice for antitrust regulators
So how are we best to decide whether these Big Tech platforms need to be regulated?
In one famous exchange, Bill Gates, the founder and former CEO of Microsoft, told Chamath Palihapitiya, the one-time head of the Facebook platform:
“This isn’t a platform. A platform is when the economic value of everybody that uses it exceeds the value of the company that creates it. Then it’s a platform.”
Given this understanding of the role of a platform, regulators should be looking to measure whether companies like Amazon or Google are continuing to provide opportunity for their ecosystem of suppliers, or if they’re increasing their own returns at the expense of that ecosystem.
Rather than just asking whether consumers benefit in the short term from the companies’ actions, regulators should be looking at the long-term health of the marketplace of suppliers—they are the real source of that consumer benefit, not the platforms alone. Have Amazon, Apple, or Google earned their profits, or are they coming from monopolistic rents?
How might we know whether a company operating an algorithmically managed marketplace is extracting rents rather than simply taking a reasonable cut for the services it provides? The first sign may not be that it is raising prices for consumers, but that it is taking a larger percentage from its suppliers, or competing unfairly with them.
Before antitrust authorities look to remedies like breaking up these companies, a good first step would be to require disclosure of information about the growth and health of the supply side of their marketplaces. The statistics about the growth of Amazon’s third-party marketplace that Bezos trumpeted in his shareholder letter tell only half the story. The questions to ask are who profits, by how much, and how that allocation of rewards is changing over time.
Regulators such as the SEC should require regular financial reporting on the allocation of value between the platform and its marketplace. I have done limited analysis for Google and Amazon based on information provided in their annual public filings, but much of the information required for a rigorous analysis is just not available.
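To make the question of who profits, by how much, and how that allocation changes over time concrete, here is a minimal Python sketch of the kind of calculation such reporting would allow. It assumes a hypothetical yearly record of supplier sales and fees paid to the platform. The class, the field names, and every figure are invented for illustration and do not come from any company's filings.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class YearlyMarketplace:
    """One year of hypothetical disclosure data, all in the same currency unit."""
    year: int
    supplier_sales: float  # gross sales made by third-party suppliers
    platform_fees: float   # commissions, ad spend, and other fees paid to the platform


def take_rate(record: YearlyMarketplace) -> float:
    """Share of supplier sales that ends up with the platform."""
    if record.supplier_sales <= 0:
        raise ValueError("supplier_sales must be positive")
    return record.platform_fees / record.supplier_sales


def take_rate_trend(records: List[YearlyMarketplace]) -> List[Tuple[int, float, float]]:
    """Return (year, take rate, change vs. prior year), sorted by year."""
    trend = []
    previous = None
    for record in sorted(records, key=lambda r: r.year):
        rate = take_rate(record)
        change = 0.0 if previous is None else rate - previous
        trend.append((record.year, rate, change))
        previous = rate
    return trend


# Made-up numbers for illustration only; not drawn from any real filing.
example = [
    YearlyMarketplace(year=1, supplier_sales=100.0, platform_fees=15.0),
    YearlyMarketplace(year=2, supplier_sales=130.0, platform_fees=26.0),
    YearlyMarketplace(year=3, supplier_sales=160.0, platform_fees=40.0),
]

for year, rate, change in take_rate_trend(example):
    print(f"year {year}: take rate {rate:.1%}, change {change:+.1%}")
```

In this made-up series, the platform's cut rises from 15% to 25% of supplier sales even as the marketplace grows, which is exactly the pattern that headline growth statistics alone would hide.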
Google provides an annual economic impact report analyzing value provided to its advertisers, but there is no comparable report for the value created for its content suppliers. Nor is there any visibility into the changing fortunes of app suppliers in the Play Store, Google’s Android app marketplace, or into the fortunes of content providers on YouTube.
Questions of who gets what and why must also be asked of Amazon’s marketplace and its other operating units, including its dominant cloud-computing division, and of Apple’s App Store. The role of Facebook’s algorithms in deciding what content appears in its readers’ newsfeeds has been widely scrutinized with regard to political bias and manipulation by hostile actors, but there’s been little rigorous analysis of economic bias in the algorithms of any of these companies.
Data is the currency of these companies. It should also be the currency of those looking to regulate them. You cannot regulate what you don’t understand. The algorithms that these companies use may be defended as trade secrets, but their outcomes should be open to inspection.
In this episode of the Data Show, I speak with Alex Ratner, project lead for Stanford’s Snorkel open source project; Ratner also recently garnered a faculty position at the University of Washington and is currently working on a company supporting and extending the Snorkel project. Snorkel is a framework for building and managing training data. Based on our survey from earlier this year, labeled data remains a key bottleneck for organizations building machine learning applications and services.
Ratner was a guest on the podcast a little over two years ago when Snorkel was a relatively new project. Since then, Snorkel has added more features, expanded into computer vision use cases, and now boasts many users, including Google, Intel, IBM, and other organizations. Along with his thesis advisor, Professor Chris Ré of Stanford, Ratner and his collaborators have long championed the importance of building tools aimed squarely at helping teams build and manage training data. With today’s release of Snorkel version 0.9, we are a step closer to having a framework that enables the programmatic creation of training data sets.
We had a great conversation spanning many topics, including:
Why he and his collaborators decided to focus on “data programming” and tools for building and managing training data.
A tour through Snorkel, including its target users and key components.
What’s in the newly released version (v 0.9) of Snorkel.
The number of Snorkel’s users has grown quite a bit since we last spoke, so we went through some of the common use cases for the project.
Data lineage, AutoML, and end-to-end automation of machine learning pipelines.
Holoclean and other projects focused on data quality and data programming.
The need for tools that can ease the transition from raw data to derived data (e.g., entities), insights, and even knowledge.
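For readers new to “data programming,” the core idea can be sketched in a few lines of plain Python: instead of hand-labeling examples, you write small heuristic functions that each vote on a label or abstain, then combine the votes. This is an illustrative sketch only and does not use Snorkel's API. The function names, the spam example, and the simple majority vote are all assumptions made for this example; as I understand it, Snorkel itself estimates how much to trust each function rather than counting votes equally.

```python
from collections import Counter
from typing import Callable, List

# Label values for a hypothetical spam-detection task.
ABSTAIN = -1
NOT_SPAM = 0
SPAM = 1

LabelingFunction = Callable[[str], int]


def lf_contains_link(text: str) -> int:
    """Heuristic: messages containing a link are often spam."""
    return SPAM if "http://" in text or "https://" in text else ABSTAIN


def lf_mentions_prize(text: str) -> int:
    """Heuristic: talk of prizes is often spam."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_reply(text: str) -> int:
    """Heuristic: very short messages are usually ordinary replies."""
    return NOT_SPAM if len(text.split()) <= 4 else ABSTAIN


def apply_lfs(lfs: List[LabelingFunction], texts: List[str]) -> List[List[int]]:
    """Build a label matrix: one row per text, one column per labeling function."""
    return [[lf(text) for lf in lfs] for text in texts]


def majority_vote(votes: List[int]) -> int:
    """Combine one row of votes; abstain when nobody votes or the top vote is tied."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    (top, top_count), *rest = counts.most_common()
    if rest and rest[0][1] == top_count:
        return ABSTAIN
    return top


lfs = [lf_contains_link, lf_mentions_prize, lf_short_reply]
texts = [
    "Claim your prize at http://example.com now",
    "thanks, see you",
    "I will review the draft over the weekend",
    "prize inside",
]
label_matrix = apply_lfs(lfs, texts)
training_labels = [majority_vote(row) for row in label_matrix]
print(training_labels)
```

Examples where every function abstains, or where the votes tie, stay unlabeled. That is the trade-off of the approach: the labels are noisy and incomplete, but they are cheap to produce and easy to revise by editing a function rather than relabeling data.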
- Making Uncommon Knowledge Common — The Rich Barton playbook is building data content loops to disintermediate incumbents and dominate search, and then using this traction to own demand in their industries.
- Data: Past, Present, and Future — Data and data-empowered algorithms now shape our professional, personal, and political realities. This course introduces students both to critical thinking and practice in understanding how we got here, and the future we now are building together as scholars, scientists, and citizens. The way “Intro to Data Science” classes ought to be.
- Clever Travel Mouse — very small presenter tool, mouse and pointer.
- Training Bias in “Hate Speech Detector” Means Black Speech is More Likely to be Censored (BoingBoing) — The authors do a pretty good job of pinpointing the cause: the people who hand-labeled the training data for the algorithm were themselves biased, and incorrectly, systematically misidentified AAE writing as offensive. And since machine learning models are no better than their training data (though they are often worse!), the bias in the data propagated through the model.
- Getting Deplatformed from Apple (BoingBoing) — It turned out that getting locked out of his Apple account made all of Luke’s Apple hardware almost useless. I think it should be illegal to do this. I believe in deplatforming (with appropriate boundaries and appeal) but breaking my hardware is bollocks.
- How to Avoid Groupthink When Hiring (HBR) — abridged process: First, make it clear to interviewers that they should not share their interview experiences with each other before the final group huddle. Next, ask each interviewer to perform a few steps before the group huddle: distill their interview rating to a single numerical score; write down their main arguments for and against hiring this person and their final conclusion. If interviewers are emailing in their numerical scores and thoughts on a candidate, don’t include the entire group in the email. Finally, the hiring managers should take note of the average score for a candidate.
- Loot Boxes a Matter of “Life or Death,” says Researcher — “There’s one clear message that I want to get across today, and it stands in stark contrast to mostly everything you’ve heard so far,” Zendle said. “The message is this: spending money on loot boxes is linked to problem gambling. The more money people spend on loot boxes, the more severe their problem gambling is. This isn’t just my research. This is an effect that has been replicated numerous times across the world by multiple independent labs. This is something the games industry does not engage with.”
- Interoperability and Privacy (BoingBoing) — latest in the tear that Cory’s been on about how to deal with the centralized power of BigSocial.
- Younger Americans are Better than Older Americans at Telling Factual News Statements from Opinions (Pew Research) — About a third of 18- to 49-year-olds (32%) correctly identified all five of the factual statements as factual, compared with two-in-ten among those ages 50 and older. A similar pattern emerges for the opinion statements. Among 18- to 49-year-olds, 44% correctly identified all five opinion statements as opinions, compared with 26% among those ages 50 and older. Or, 68% of 18- to 49-year-olds couldn’t correctly identify all five factual statements as factual? (via @pewjournalism)
- How YouTube Radicalized Brazil (NYT) — He was killing time on the site one day, he recalled, when the platform showed him a video by a right-wing blogger. He watched out of curiosity. It showed him another, and then another. “Before that, I didn’t have an ideological political background,” Mr. Martins said. YouTube’s auto-playing recommendations, he declared, were “my political education.” “It was like that with everyone,” he said.
- Paged Out — a new experimental (one article == one page) free magazine about programming (especially programming tricks!), hacking, security hacking, retro computers, modern computers, electronics, demoscene, and other similar topics.
- Credit Blacklists, Not the Solution to Every Problem — translated Chinese article on blacklists. As the aforementioned source explained, Wulian County is one of the first in Shandong Province to trial the construction of a social credit system, that began last year. The blacklist is a disciplinary measure restricted to persons within the county. It is different from the People’s Bank of China’s credit information evaluation system blacklist, or the blacklist for those deemed to be untrustworthy by the People’s Court. It does not affect the educational opportunities of anyone’s children, whether or not they themselves can ride a train or plane, and so on. Activities such as volunteering, donating blood, charitable contributions, and so on, can add to one’s personal credit (score), and can also be used to restore and upgrade credit ratings, removing themselves from the blacklist. (via ChinAI)