- Senate Testimony (Maciej Ceglowski) — This is an HTMLized version of written testimony I provided on May 7, 2019, to the Senate Committee on Banking, Housing, and Urban Affairs for their hearing on Privacy Rights and Data Collection in a Digital Economy. […] The leading ad networks in the European Union have chosen to respond to the GDPR by stitching together a sort of Frankenstein’s monster of consent, a mechanism whereby a user wishing to visit, say, a weather forecast page is first prompted to agree to share data with a consortium of 119 entities, including the aptly named “A Million Ads” network. The user can scroll through this list of intermediaries one by one, or give or withhold consent en bloc, but either way she must wait a further two minutes for the consent collection process to terminate before she is allowed to find out whether or not it is going to rain.
- Awesome Decision Tree Papers — A collection of research papers on decision, classification, and regression trees with implementations.
- Other People’s Problems (Camille Fournier) — There’s always going to be something you can’t fix. So how do you decide where to exert your energy? Step one: figure out who owns this problem.
- Toward the Next Generation of Programming Tools (Mike Loukides) — one of the most interesting research areas in artificial intelligence is the ability to generate code.
What can artificial intelligence (AI) and machine learning (ML) do to improve customer experience? AI and ML have been intimately involved in online shopping since, well, the beginning of online shopping. You can’t use Amazon or any other shopping service without getting recommendations, which are often personalized based on the vendor’s understanding of your traits: your purchase history, your browsing history, and possibly much more. Amazon and other online businesses would love to invent a digital version of the (possibly mythical) salesperson who knows you and your tastes, and can unerringly guide you to products you will enjoy.
Everything begins with better data
To make that vision a reality, we need to start with some heavy lifting on the back end. Who are your customers? Do you really know who they are? All customers leave behind a data trail, but that data trail is a series of fragments, and it’s hard to relate those fragments to each other. If one customer has multiple accounts, can you tell? If a customer has separate accounts for business and personal use, can you link them? And if an organization uses many different names (we remember a presentation in which someone talked of the hundreds of names—literally—that resolved to IBM), can you discover the single organization responsible for them? Customer experience starts with knowing exactly who your customers are and how they’re related. Scrubbing your customer lists to eliminate duplicates is called entity resolution; it used to be the domain of large companies that could afford substantial data teams. We’re now seeing the democratization of entity resolution: startups are offering entity resolution software and services appropriate for small to mid-sized organizations.
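To make the core idea concrete, here is a minimal sketch of the two basic steps in most entity resolution pipelines—blocking records into candidate groups, then fuzzy-matching within each group—using only Python’s standard library. The records, field names, and 0.85 similarity threshold are illustrative assumptions, not any particular vendor’s method.

```python
# A toy entity resolution pass: block on ZIP code, fuzzy-match names.
from difflib import SequenceMatcher

customers = [
    {"id": 1, "name": "John Smith", "zip": "10504"},
    {"id": 2, "name": "Jon Smith",  "zip": "10504"},
    {"id": 3, "name": "Mary Jones", "zip": "45402"},
]

def normalize(name):
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(name.lower().split())

def similar(a, b, threshold=0.85):
    """Fuzzy name similarity; production systems combine many more signals."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

# Pairs that share a ZIP code and have similar names are candidate duplicates.
candidates = [
    (a["id"], b["id"])
    for i, a in enumerate(customers)
    for b in customers[i + 1:]
    if a["zip"] == b["zip"] and similar(a["name"], b["name"])
]
print(candidates)  # [(1, 2)]
```

Blocking is what keeps this tractable: rather than comparing every record against every other record, you only compare within groups that share a cheap-to-compute key.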
Once you’ve found out who your customers are, you have to ask how well you know them. Getting a holistic view of a customer’s activities is central to understanding their needs. What data do you have about them, and how do you use it? ML and AI are now being used as tools in data gathering: in processing the data streams that come from sensors, apps, and other sources. Gathering customer data can be intrusive and ethically questionable; as you build your understanding of your customers, make sure you have their consent and that you aren’t compromising their privacy.
ML isn’t fundamentally different from any other kind of computing: the rule “garbage in, garbage out” still applies. If your training data is low quality, your results will be poor. As the number of data sources grows, the number of potential data fields and variables increases, along with the potential for error: transcription errors, typographic errors, and so on. It was once possible to correct and repair data by hand, but manual correction is an error-prone and tedious task that still consumes much of most data scientists’ time. As with entity resolution, data quality and data repair have been the subject of recent research, and a new set of machine learning tools for automating data cleaning is beginning to appear.
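A simple way to see what automated cleaning tools must do is to hand-code the kind of validation rules they learn instead. The sketch below flags rows that fail per-field checks; the field names, sample errors, and rules are illustrative assumptions.

```python
# Rule-based data validation: the kind of per-field check that automated
# data-cleaning tools learn from the data rather than require by hand.
import re

rows = [
    {"email": "ann@example.com",     "age": "34"},
    {"email": "bob(at)example.com",  "age": "29"},   # transcription error
    {"email": "carol@example.com",   "age": "2O1"},  # typo: letter O for zero
]

rules = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "age":   lambda v: v.isdigit() and 0 < int(v) < 120,
}

def validate(row):
    """Return the names of the fields that fail their rule."""
    return [field for field, ok in rules.items() if not ok(row[field])]

for row in rows:
    bad = validate(row)
    if bad:
        print("needs repair:", row, "->", bad)
```

Hand-writing rules like these doesn’t scale past a few dozen fields, which is exactly why learning them from the data itself is an active research area.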
One common application of machine learning and AI to customer experience is in personalization and recommendation systems. In recent years, hybrid recommender systems—applications that combine multiple recommender strategies—have become much more common. Many hybrid recommenders rely on many different sources and large amounts of data, and deep learning models are frequently part of such systems. While it’s common for recommendations to be based on models that are only retrained periodically, advanced recommendation and personalization systems will need to be real time. Using reinforcement learning, online learning, and bandit algorithms, companies are beginning to build recommendation systems that constantly train models against live data.
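The difference between periodic retraining and training against live data can be seen in the simplest bandit algorithm, epsilon-greedy: every user interaction updates the model immediately. This is a minimal sketch; the item names, click-through rates, and epsilon value are illustrative assumptions, not a production recommender.

```python
# Epsilon-greedy bandit: recommend an item, observe a click, update the
# estimate online -- there is no separate retraining step.
import random

items = ["jacket", "boots", "scarf"]
counts = {i: 0 for i in items}      # times each item was recommended
values = {i: 0.0 for i in items}    # running mean observed click rate

def choose(epsilon=0.1):
    """Mostly exploit the best-known item; occasionally explore."""
    if random.random() < epsilon:
        return random.choice(items)
    return max(items, key=lambda i: values[i])

def update(item, reward):
    """Incremental mean update -- the model learns from each live event."""
    counts[item] += 1
    values[item] += (reward - values[item]) / counts[item]

# Simulated click stream; in production the rewards come from real users.
true_ctr = {"jacket": 0.05, "boots": 0.10, "scarf": 0.30}
random.seed(0)
for _ in range(5000):
    item = choose()
    update(item, 1.0 if random.random() < true_ctr[item] else 0.0)

print(sorted(values, key=values.get, reverse=True))
```

After a few thousand interactions the estimated values track the true click-through rates, and the system spends most of its traffic on the best item while still exploring enough to notice if tastes shift.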
Machine learning and AI are automating many different enterprise tasks and workflows, including customer interactions. We’ve all experienced chatbots that automate various aspects of customer service. So far, chatbots have been more annoying than helpful—though well-designed, simple “frequently asked question” bots can lead to good customer acquisition rates. But we’re only in the early stages of natural language processing and understanding—and in the past year, we’ve seen many breakthroughs. As our ability to build sophisticated language models improves, we will see chatbots progress through a number of stages: from providing notifications, to managing simple question-and-answer scenarios, to understanding context and participating in simple dialogs, and finally to personal assistants that are “aware” of their users’ needs. As chatbots improve, we expect them to become an integral part of customer service, not merely an annoyance you have to work through to get to a human. And for chatbots to reach this level of performance, they will need to incorporate real-time recommendation and personalization. They will need to understand customers as well as a human does.
Fraud detection is another field that machine learning is transforming. Fraud detection is a constant battle between the good guys and the criminals, and the stakes keep rising. Fraud artists are inventing ever more sophisticated techniques for online crime. Fraud is no longer person-to-person: it is automated, as in a bot that buys up all the tickets to an event so scalpers can resell them. As we’ve seen in many recent elections, it is easy for criminals to penetrate social media by creating bots that flood conversations with automated responses. It is much harder to discover those bots and block them in real time. That’s only possible with machine learning, and even then, it’s a difficult problem that’s only partially solved. But solving it is a critical part of rebuilding an online world in which people feel safe and respected.
Advances in speech technologies and emotion detection will reduce friction in automated customer interactions even further. Multi-modal models that combine different kinds of inputs (audio, text, vision) will make it easier to respond to customers appropriately; customers might be able to show you what they want or send a live video of a problem they’re facing. While interactions between humans and robots frequently place users in the creepy “uncanny valley,” it’s a safe bet that customers of the future will be more comfortable with robots than we are now.
But if we’re going to get customers through to the other side of the uncanny valley, we also have to respect what they value. AI and ML applications that affect customers will have to respect privacy; they will have to be secure; and they will have to be fair and unbiased. None of these challenges are simple, but technology won’t improve customer experience if customers end up feeling abused. The result may be more efficient, but that’s a bad tradeoff.
What will machine learning and artificial intelligence do for customer experience? They have already done much. But there’s much more they can do—and must do—in building the frictionless customer experience of the future.
- People + AI Guidebook (Google) — Designing human-centered AI products.
- Strong Opinions Loosely Held Might Be The Worst Idea in Tech — What really happens? The loudest, most bombastic engineer states their case with certainty, and that shuts down discussion. Other people either assume the loudmouth knows best, or don’t want to stick out their neck and risk criticism and shame. This is especially true if the loudmouth is senior, or there is any other power differential.
- The Empty Promise of Data Moats (A16Z) — Most data network effects are really scale effects.
- Trans-inclusive Design — this is GOLD and should be required reading for every software designer and developer.
Chris Hughes, a co-founder of Facebook, recently wrote an opinion piece for the New York Times presenting an argument for breaking up Facebook. I was trying to stay out of this discussion, and Nick Clegg’s response pretty much summed up my opinions. But I got one too many emails from friends that simply assumed I’d be in agreement with Chris, especially given my critique of blitzscaling and Silicon Valley’s obsession with “world domination.”
If Facebook should be broken up, it should be on grounds of anticompetitive behavior, and Chris barely even touches on that topic. Dragging in the history of all Facebook’s various corporate missteps for which a breakup would provide no remedy just muddies the water and lets everyone think that punishment has been done. It’s actually a way of NOT solving the actual problems. Breaking up Facebook won’t solve the disinformation problem. It won’t solve the privacy problem. It might well make it harder for Facebook to work on those problems.
In addition, if harm to our society is sufficient reason for a breakup, there are a lot of companies ahead of Facebook in the queue. Big banks. The pharma companies that brought us the opioid crisis. The energy cartel that’s still fighting against responding to climate change 70 years after the warning bells were sounded.
Even if you restrict yourself to surveillance capitalism, picking out one high-profile malefactor, issuing draconian punishment, and then going back to business as usual everywhere else is a lose-lose, not a win. Facebook is not at the root of this problem, nor even the worst offender. To solve this problem, we need to look not only beyond Facebook, but also beyond the entire tech industry. Banks and telcos are often worse privacy offenders than Google or Facebook. And if you restrict yourself to harm from disinformation, polarization, and radicalization, Twitter and YouTube are just as culpable and, out of the spotlight, appear to be doing less than Facebook to police their problems. Reddit is far worse, and traditional media outlets, especially the Fox News empire, almost as bad. (See Yochai Benkler’s excellent book Network Propaganda.)
Yes, Facebook has scale, but even there, I thought Chris’ arguments were weak. In particular, I found the New York Times infographic to be quite disingenuous:
What’s wrong with this infographic? It lists all of Facebook’s properties, including those that are primarily messaging, while limiting the competition purely to social media, and failing to include other messaging products. Take out WhatsApp and Messenger, and Facebook suddenly doesn’t look so big. Or add in other messaging products from companies like Apple, Google, Microsoft, and plain old SMS, and again, suddenly Facebook doesn’t look so dominant. Add in Google’s other properties besides YouTube, especially those like search and the new newsfeed on Android phones that influence what people see and understand about the world. Restrict it to the individual national markets where the competition actually happens, and it would look different again. This isn’t analysis. This is polemics.
I’m not saying there aren’t grounds for investigation of anticompetitive behavior in the tech industry, or by Facebook specifically, but if someone’s going to talk breakup, I’d love to see reporting and opinion pieces that make a true antitrust case, if there is one.
If the world really wants to tackle the problems that networked media creates or amplifies, making Facebook a scapegoat and letting everyone else off the hook is a massive mistake.
- Git-rebase in Depth — These tools can be a little bit intimidating to the novice or even intermediate git user, but this guide will help to demystify the powerful git-rebase.
- SwiftWasm — Run Swift in browsers. SwiftWasm compiles your Swift code to WebAssembly.
- Deepfake Salvador Dalí Takes Selfies with Museum Visitors (Verge) — Using archival footage from interviews, GS&P pulled over 6,000 frames and used 1,000 hours of machine learning to train the AI algorithm on Dalí’s face. His facial expressions were then imposed over an actor with Dalí’s body proportions, and quotes from his interviews and letters were synced with a voice actor who could mimic his unique accent, a mix of French, Spanish, and English. The selfie is magic, though. Prize to whoever thought that up!
- Rules of ML (Google) — a style for machine learning, similar to the Google C++ Style Guide and other popular guides to practical programming. If you have taken a class in machine learning, or built or worked on a machine-learned model, then you have the necessary background to read this document.
In a Quora post, Alan Kay lamented the state of tooling for programmers. Every other engineering discipline has built modern computational tools: for computer aided design, simulation and testing, and for manufacturing. But programming hasn’t progressed significantly since the 1970s. We’ve built great tools for others, but not ourselves. The shoemaker’s children have holes in their shoes.
Kay isn’t being entirely fair, but before looking at what he’s missing, it’s important to think about how he’s right. If we don’t understand how he’s right, we certainly won’t understand what to build next, and where the future may be staring us in the face.
Let’s start with programming itself. We’re still using punch cards, now emulated on modern high-resolution monitors. We’re still doing line-oriented programming with an alpha-numeric character set. We’re still using programming languages that, for the most part, behave like C or like LISP, and the biggest debates in the programming community are about which of these ancient paradigms is better. We have IDEs that make it somewhat easier to generate those virtual punch cards, but don’t fundamentally change the nature of the beast. We have some tools for unit testing, but they work by requiring us to write more punch cards (unit tests). We have version control tools for managing changes to those punch cards. And we even have tools for continuous integration, continuous deployment, and container orchestration—all of which are programmed by creating more virtual punch cards.
Database developers are in somewhat better shape. Non-procedural languages like SQL lend themselves more readily to visual programming styles, yielding tools like Tableau, though those tools don’t help much if you’re connecting an enterprise application to a back-end database.
Where can we go from here? I’ve long thought that the real next-generation programming language won’t be a rehash of LISP, C, or Smalltalk syntax. It won’t be character based at all: it will be visual. Rather than typing, we’ll draw what we want. I’ve yet to see a language that fits the bill. Teaching platforms like Alice and Scratch are an interesting attempt, but they don’t go anywhere near far enough: they just take the programming languages we already know and apply a visual metaphor. A C-clamp instead of a loop. Plug-together blocks instead of keywords. Nothing has really changed.
I suspect that the visual programming language we need will borrow ideas from all of our programming paradigms: it will pass messages, it will have objects, it will support concurrency, and so on. What it won’t have is a textual representation; it won’t be a visual sugarcoating to an underlying language like C.
I have some ideas about where such a language might come from. I see two trends that might help us think about the future of programming.
First, I see increasing evidence that there are two kinds of programmers: programmers who build things by connecting other things together, and programmers who create the things that others connect together. The first is the “blue collar” group of programmers; the second is the “academic” group of programmers–for example, the people doing new work in AI. Neither group is more valuable or important than the other. Building trades provide a good analogy. If I need to install a new kitchen sink, I call a plumber. He knows how to connect the sink to the rest of the plumbing; he doesn’t know how to design the sink. There’s a sink designer in an office somewhere who probably understands (in a rudimentary way) how to connect the sink to the plumbing, but whose real expertise is designing sinks, not installing them. You can’t do without either, though the world needs more plumbers than designers. The same holds true for software: the number of people who build web applications is far greater than the number of people who build web frameworks like React or who design new algorithms and do fundamental research.
Should the computational plumber use the same tools as the algorithm designer? I don’t think so; I can imagine a language that is highly optimized for connecting pre-built operations, but not for building the operations themselves. You can see a glimpse of this in languages that apply functions to every element of a list: you no longer need loops. (Python’s map() applies a function to every element of a list; there are languages where functions behave like this automatically.) You can see PHP as a language that’s good for connecting web pages to databases, but horrible for implementing cryptographic algorithms. And perhaps this connective language might be visual: if we’re just connecting a bunch of things, why not use lines rather than lines of code? (Mel Conway is working on “wiring diagrams” that allow consumers to build applications by describing interaction scenarios.)
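The contrast between loop-based and “connective” style is easy to show in Python itself: the same computation written as step-by-step instructions, and then as a single operation lifted over the whole list. The prices and tax rate are illustrative assumptions.

```python
# The same computation as an explicit loop and via map() -- the
# "connective" style applies the operation to the list as a whole.
prices = [19.99, 5.00, 42.50]

# Loop style: step-by-step instructions.
with_tax_loop = []
for p in prices:
    with_tax_loop.append(round(p * 1.08, 2))

# map() style: no loop written; the iteration is implicit.
with_tax_map = list(map(lambda p: round(p * 1.08, 2), prices))

assert with_tax_loop == with_tax_map
```

In array languages the second form is the default: arithmetic on a whole collection needs no loop at all, which is one reason such languages feel closer to “connecting things” than to writing instructions.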
Second, one of the most interesting research areas in artificial intelligence is the ability to generate code. A couple of years ago, we saw that AI was capable of optimizing database queries. Andrej Karpathy has argued that this ability places us on the edge of Software 2.0, in which AI will generate the algorithms we need. If, in the future, AI systems will be able to write our code, what kind of language will we need to describe the code we want? That language certainly won’t be a language with loops and conditionals; nor do I think it will be a language based on the mathematical abstraction given by functions. Karpathy suggests that the core of this language will be tagged datasets for training AI models. Instead of writing step-by-step instructions, we will show the computer what we want it to do. If you think such a programming paradigm is too radical, too far away from the process of making a machine do your bidding, think about what the machine language programmers of the 1950s would think about a modern optimizing compiler.
So, while I can’t yet imagine what a new visual language will look like, I can sense that we’re on the edge of being able to build it. In fact, we’re building it already. One of Jeremy Howard’s projects at platform.ai is a system that allows subject matter experts to build machine learning applications without any traditional programming. And Microsoft has launched a drag-and-drop machine learning tool that provides a graphical interface for assembling training data, cleaning it, and using it to build a model without any traditional programming. I suppose one could argue that this “isn’t real programming,” but that’s tantamount to defining programming by its dependence on archaic tools.
What about the rest? What about the tools for building, testing, deploying, and managing software? Here’s where Kay underestimates what we have. I’d agree that it isn’t much, but it’s something; it’s a foundation that we can build upon. We have more than 40 years of experience with build tools (starting with make in 1976), and similar experience with automated configuration (CFEngine, the precursor to Chef and Puppet, goes back to the ‘90s), network monitoring (Nagios dates back to 2002), and continuous integration (Hudson, predecessor of Jenkins, dates to 2005). Kubernetes, which handles container orchestration, is the “new kid on the block.” Kubernetes is the robotically automated factory for distributed systems. It’s a tool for managing and operating large software deployments, running across many nodes. That’s really a complete tool suite that runs from automated assembly through automated testing to fully automated production. Lights off; let the computers do the work.
Sadly, these tools are still configured with virtual punch cards (text files, usually in XML, YAML, JSON, or something equally unpleasant), and that’s a problem that has to be overcome. I think the problem isn’t difficulty–creating a visual language for any of these tools strikes me as significantly easier than creating a visual language for programming–it’s tradition.
Software people are used to bad tools. And while we’d hate being forced to use physical punch cards (I’ve done that; it’s no fun), if you virtualize those punch cards, we’re just fine. Perhaps it’s a rite of passage, a sort of industrial hazing: “We survived this suckage; you should too if you’re going to be a real programmer.” We’re happy with that.
Kay is right that we shouldn’t be happy with this state of affairs. The pain of building software using tools that would be immediately understandable to developers in the 1960s keeps us thinking about the bits, rather than the meaning of those bits. As an industry, we need to get beyond that. We have prototypes for some of the tools we need. We just need to finish the job.
We need to imagine a different future for software development.
- Tetris on a Flip-Disc Display (YouTube) — the update click is ridiculously satisfying. (via BoingBoing)
- Agnotology and Epistemological Fragmentation (danah boyd) — In 1995, Robert Proctor and Iain Boal coined the term “agnotology” to describe the strategic and purposeful production of ignorance. […] there’s an increasing number of people who are propagating conspiracy theories or simply asking questions as a way of enabling and magnifying white supremacy. This is agnotology at work. Fascinating in the details of how the misinformers do their work online.
- Reverse Engineering a Xinjiang Police Mass Surveillance App (Human Rights Watch) — discovering the data (online) saved by the surveillance system. TechCrunch even shows the tables.
- TCP/IP Over Amazon Cloudwatch Logs — Running in a standard Go process, Richard Linklayer tunnels IP packets over Amazon Cloudwatch Log Streams that follow a special naming convention — the stream and log group names are just MAC addresses. A cute hack.
In this episode of the Data Show, I spoke with Jeff Jonas, CEO, founder and chief scientist of Senzing, a startup focused on making real-time entity resolution technologies broadly accessible. He was previously a fellow and chief scientist of context computing at IBM. Entity resolution (ER) refers to techniques and tools for identifying and linking manifestations of the same entity/object/individual. Ironically, ER itself has many different names (e.g., record linkage, duplicate detection, object consolidation/reconciliation, etc.).
ER is an essential first step in many domains, including marketing (cleaning up databases), law enforcement (background checks and counterterrorism), and financial services and investing. Knowing exactly who your customers are is an important task for security, fraud detection, marketing, and personalization. The proliferation of data sources and services has made ER very challenging in the internet age. In addition, many applications now increasingly require near real-time entity resolution.
We had a great conversation spanning many topics including:
Why ER is interesting and challenging
How ER technologies have evolved over the years
How Senzing is working to democratize ER by making real-time AI technologies accessible to developers
Some early use cases for Senzing’s technologies
Some items on their research agenda
Here are a few highlights from our conversation:
Entity resolution through the years
In the early ’90s, I worked on a much more advanced version of entity resolution for the casinos in Las Vegas and created software called NORA, non-obvious relationship awareness. Its purpose was to help casinos better understand who they were doing business with. We would ingest data from the loyalty club, everybody making hotel reservations, people showing up without reservations, everybody applying for jobs, people terminated, vendors, and 18 different lists of different kinds of bad people, some of them card counters (which aren’t that bad), some cheaters. And they wanted to figure out across all these identities when somebody was the same, and then when people were related. Some people were using 32 different names and a bunch of different social security numbers.
… Ultimately, IBM bought my company and this technology became what is now known at IBM as “identity insight.” Identity insight is a real-time entity resolution engine that gets used to solve many kinds of problems. MoneyGram implemented it and their fraud complaints dropped 72%. They saved a few hundred million just in their first few years.
… But while at IBM, I had a grand vision about a new type of entity resolution engine that would have been unlike anything that’s ever existed. It’s almost like a Swiss Army knife for ER.
The Senzing entity resolution engine works really well on two records from a domain that you’ve never even seen before. Say you’ve never done entity resolution on restaurants from Singapore. The first two records you feed it, it’s really, really already smart. And then as you feed it more data, it gets smarter and smarter.
… So, there are two things that we’ve intertwined. One is common sense. One type of common sense is the names—Dick, Dickie, Richie, Rick, Ricardo are all part of the same name family. Why should it have to study millions and millions of records to learn that again?
… Next to common sense, there’s real-time learning. In real-time learning, we do a few things. You might have somebody named Bob who now goes by a nickname or alias of Andy. Eventually, you might come to learn that. So you have to learn over time that Bob also has this nickname, that Bob lived at three addresses, that this is his credit card number, and that now he’s got four phone numbers. You want to learn those over time.
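The two ideas Jonas describes—built-in “common sense” like name families, plus identities that accumulate what you learn about them over time—can be sketched in a few lines. This is a toy illustration, not Senzing’s actual engine; the name table, matching rules, and records are all illustrative assumptions.

```python
# Toy real-time entity resolution: shared name-family "common sense"
# plus identities that absorb new attributes as records stream in.
NAME_FAMILY = {  # common sense baked in: no training data required
    "richard": {"dick", "dickie", "richie", "rick", "ricardo", "richard"},
    "robert":  {"bob", "bobby", "rob", "robert"},
}

def same_name_family(a, b):
    a, b = a.lower(), b.lower()
    return any(a in fam and b in fam for fam in NAME_FAMILY.values())

class Identity:
    """One resolved entity; it learns new attributes as records arrive."""
    def __init__(self):
        self.names, self.addresses, self.phones = set(), set(), set()

    def matches(self, record):
        return (
            any(same_name_family(record["name"], n) for n in self.names)
            or record.get("phone") in self.phones
        )

    def absorb(self, record):
        self.names.add(record["name"])
        self.addresses.add(record.get("address"))
        self.phones.add(record.get("phone"))

identities = []
for record in [
    {"name": "Bob",     "phone": "555-0100", "address": "12 Elm St"},
    {"name": "Rob",     "phone": "555-0100", "address": "98 Oak Ave"},
    {"name": "Ricardo", "phone": "555-0199", "address": "7 Pine Rd"},
]:
    for ident in identities:
        if ident.matches(record):
            ident.absorb(record)  # learn the new name, address, phone
            break
    else:
        ident = Identity()
        ident.absorb(record)
        identities.append(ident)

print(len(identities))  # prints 2: Bob and Rob resolve to one identity
```

A real engine also revises earlier decisions when new data arrives—splitting or merging identities retroactively—which is the part this sketch leaves out and the part that’s genuinely hard at scale.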
… These systems we’re creating, our entity resolution systems—which really resolve entities and graph them (call it an index of identities and how they’re related)—never have to be reloaded. They literally clean up their own past decisions as new data arrives. You can do maintenance while you’re querying, while you’re loading new transactional data, while you’re loading historical data. There’s nothing else like it that can work at this scale. It’s really hard to do.
- Adversarial Examples Are Not Bugs, They Are Features — Adversarial vulnerability is a direct result of our models’ sensitivity to well-generalizing features in the data.
- Tech Companies Are Deleting Evidence of War Crimes (The Atlantic) — By piecing together information that becomes publicly accessible on social media and other sites, internet users can hold the perpetrators accountable—that is, unless algorithms developed by the tech giants expunge the evidence first. Facebook’s automatic content removal tech is removing evidence these investigators use to hold war criminals to account. We live in an age when software designed to get college students laid is critical to prosecuting war criminals.
- Why Open Source Firmware is Important for Security (Jessie Frazelle) — It’s counter-intuitive that the code that we have the least visibility into has the most privileges. This is what open source firmware is aiming to fix.
- Tukey, Design Thinking, and Better Questions (Roger Peng) — In my view, the most useful thing a data scientist can do is to devote serious effort towards improving the quality and sharpness of the question being asked. In my experience as well.
- Brian Kernighan interviews Ken Thompson (YouTube) — wonderful footage from Vintage Computer Festival East 2019.
- Hypertext and our Collective Destiny (Tim Berners-Lee) — a 1995 talk honouring Vannevar Bush. I had (and still have) a dream that the web could be less of a television channel and more of an interactive sea of shared knowledge. (via Daniel G. Siegel)
- Dealing with Software Collapse — The main issue with the rot metaphor is that it puts the blame on the wrong piece of the puzzle. If software becomes unusable over time, it’s not because of any alteration to that software that needs to be reversed. Rather, it’s the foundation on which the software has been built, ranging from the actual hardware via the operating system to programming languages and libraries, that has changed so much that the software is no longer compatible with it.
- Distributed Consensus Revised (Part II) (The Morning Paper) — In today’s post, we’re going to be looking at Chapter 3 of Dr. Howard’s thesis, which is a tour (“systematization of knowledge,” SoK) of some of the major known revisions to the classic Paxos algorithm.