Four short links: 25 June 2019

Four short links
  1. The Next Generation of Deep Learning: Analog Computing (IEEE) — Further progress in compute efficiency for deep learning training can be made by exploiting the more random and approximate nature of deep learning workflows. In the digital space, that means trading off numerical precision for compute efficiency at a small cost in accuracy. It also opens the possibility of revisiting analog computing, which is intrinsically noisy, to execute the matrix operations for deep learning in constant time on arrays of nonvolatile memories. (Paywalled paper)
  2. The Internet is Increasingly a Low-Trust Society (Wired) — Zeynep Tufekci nails it. Social scientists distinguish high-trust societies (ones where you can expect most interactions to work) from low-trust societies (ones where you have to be on your guard at all times). People break rules in high-trust societies, of course, but laws, regulations, and norms help to keep most abuses in check; if you have to go to court, you expect a reasonable process. In low-trust societies, you never know. You expect to be cheated, often without recourse. You expect things not to be what they seem and for promises to be broken, and you don’t expect a reasonable and transparent process for recourse. It’s harder for markets to function and economies to develop in low-trust societies. It’s harder to find or extend credit, and it’s risky to pay in advance.
  3. Be Internet Awesome — Google’s media literacy materials. Be Internet Awesome is like an instruction manual for making smart decisions online. Kids today need a guide to the internet and media just as they need instruction on other topics. We need help teaching them about credible sources, the power of words and images, and, most importantly, how to be smart and savvy when seeing different media while browsing the web. All of these resources are not only available for classrooms but are also free and easily accessible for families. They’re available in English and Spanish, along with eight other languages. (via Google Blog)
  4. PsyToolkit — create and run cognitive psychology experiments in your browser.

Four short links: 24 June 2019

Four short links
  1. NTFS Timestamps — a 64-bit value representing the number of 100-nanosecond intervals since January 1, 1601 (UTC). WTAF?
  2. Computers Changed Spycraft (Foreign Policy) — so much has changed—e.g., dead letter drops: It is easy for Russian counterintelligence to track the movements of every mobile phone in Moscow, so if the Canadian is carrying her device, observers can match her movements with any location that looks like a potential site for a dead drop. They could then look at any other phone signal that pings in the same location in the same time window. If the visitor turns out to be a Russian government official, he or she will have some explaining to do.
  3. Netflix Records All of your Bandersnatch Choices, GDPR Request Reveals (Verge) — that’s some next-level meta.
  4. Being Beyoncé’s Assistant for the Day (Twitter) — a choose-your-own-adventure implemented in Twitter. GENIUS!
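The NTFS timestamp format in the first link is easy to convert once you account for the offset between the 1601 and 1970 epochs (a quick Python sketch; the sample value below is illustrative, not from a real file):

```python
from datetime import datetime, timezone

# Seconds between the NTFS epoch (1601-01-01) and the Unix epoch
# (1970-01-01), expressed in 100-nanosecond "ticks."
NTFS_EPOCH_OFFSET_TICKS = 11_644_473_600 * 10_000_000

def ntfs_to_datetime(ticks: int) -> datetime:
    """Convert an NTFS FILETIME value (100 ns ticks since 1601) to UTC."""
    unix_seconds = (ticks - NTFS_EPOCH_OFFSET_TICKS) / 10_000_000
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)

# An illustrative FILETIME value.
print(ntfs_to_datetime(132_041_664_000_000_000))
```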

New live online training courses

Learn new topics and refine your skills with more than 219 new live online training courses we opened up for June and July on the O’Reilly online learning platform.

AI and machine learning

AI-driven Future State Cloud Operations, June 7

Deep Learning with PyTorch, June 20

Deep Learning from Scratch, July 2

Introduction to Reinforcement Learning, July 8

Fundamentals of Machine Learning and Data Analytics, July 10-11

Essential Machine Learning and Exploratory Data Analysis with Python and Jupyter Notebook, July 11-12

Artificial Intelligence: An Overview of AI and Machine Learning, July 15

Real-Time Streaming Analytics and Algorithms for AI Applications, July 17

Hands-on Machine Learning with Python: Classification and Regression, July 17

Hands-on Machine Learning with Python: Clustering, Dimension Reduction, and Time Series Analysis, July 18

Deep Reinforcement Learning, July 18

Deep Learning for Natural Language Processing, July 25

Getting Started with Machine Learning, July 29

Inside Unsupervised Learning: Anomaly Detection using Dimensionality Reduction, August 6

Deploying Machine Learning Models to Production: A Toolkit for Real-World Success, August 7-8

Hands-on Adversarial Machine Learning, August 13

Inside Unsupervised Learning: Group Segmentation Using Clustering, August 13

Reinforcement Learning: Building Recommender Systems, August 16


Blockchain

Business Applications of Blockchain, July 17

Certified Blockchain Solutions Architect (CBSA) Certification Crash Course, July 25


Business

Ken Blanchard on Leading at a Higher Level: 4 Keys to Creating a High Performing Organization, June 13

Engineering Mentorship, June 24

Spotlight on Learning From Failure: Hiring Engineers with Jeff Potter, June 25

60 Minutes to a Better Prototype, June 25

Being a Successful Team Member, July 1

Spotlight on Data: Improving Uber’s Customer Support with Natural Language Processing and Deep Learning with Piero Molino, July 2

Getting S.M.A.R.T about Goals, July 9

Building the Courage to Take Risks, July 9

Spotlight on Innovation: Making Things Happen with Scott Berkun, July 10

Thinking Like a Manager, July 10

Better Business Writing, July 15

Spotlight on Data: Data Storytelling with Mico Yuk, July 15

Why Smart Leaders Fail, July 16

Product Management for Enterprise Software, July 18

Introduction to Critical Thinking, July 23

Negotiation Fundamentals, July 23

Spotlight on Learning from Failure: Corporate Disinformation and the Changing Face of Attacks with Renee DiResta and Robert Matney, July 23

Having Difficult Conversations, July 25

Giving a Powerful Presentation, July 25

The Power of Lean in Software Projects, July 25

Managing a Toxic Work Environment, July 25

Leadership Communication Skills for Managers, July 29

Emotional Intelligence in the Workplace, July 30

90 Minutes to Better Decision-Making, July 30

Performance Goals for Growth, July 31

Adaptive Project Management, July 31

Spotlight on Cloud: Mitigating Cloud Complexity to Ensure Your Organization Thrives with David Linthicum, August 1

How to Be a Better Mentor, August 5

Fundamentals of Learning: Learn Faster and Better Using Neuroscience, August 6

Introduction to Strategic Thinking Skills, August 6

Foundations of Microsoft Excel, August 6

Succeeding with Project Management, August 8

How to Give Great Presentations, August 13

60 Minutes to Better User Stories and Backlog Management, August 13

Building Your LinkedIn Network, August 13

Understanding Business Strategy, August 14

Data science and data tools

Text Analysis for Business Analytics with Python, June 12

Business Data Analytics Using Python, June 25

Debugging Data Science, June 26

Programming with Data: Advanced Python and Pandas, July 9

Understanding Data Science Algorithms in R: Regression, July 12

Time Series Forecasting, July 15

Cleaning Data at Scale, July 15

Scalable Data Science with Apache Hadoop and Spark, July 16

Effective Data Center Design Techniques: Data Center Topologies and Control Planes, July 19

First Steps in Data Analysis, July 22

Inferential Statistics using R, July 24

Foundational Python for Data Science, July 24

Intermediate SQL for Data Analysis, July 29

Introduction to Pandas: Data Munging with Python, July 29-30

Data Analysis Paradigms in the Tidyverse, July 30

Intro to Mathematical Optimization, August 6

Getting Started with PySpark, August 8

Text Analysis for Business Analytics with Python, August 12

Real-time Data Foundations: Kafka, August 13

Introduction to Statistics for Data Analysis with Python, August 14

Understanding Data Science Algorithms in R: Scaling, Normalization and Clustering, August 14

Real-time Data Foundations: Spark, August 15

Visualization and Presentation of Data, August 15

Python Data Science Full Throttle with Paul Deitel: Introductory AI, Big Data and Cloud Case Studies, September 24

Design and product management

Introduction to UI & UX design, June 24


Programming

Discovering Modern Java, June 7

Design Patterns in Java, June 13-14

Java Testing with Mockito and the Hamcrest Matchers, June 19

Scaling Python with Generators, June 25

Pythonic Object-Oriented Programming, June 26

Python Advanced: Generators and Coroutines, June 26

Pythonic design patterns, June 27

Advanced Test-Driven Development (TDD), June 27

Test-Driven Development In Python, June 28

Learning Python 3 by Example, July 1

Getting Started with Spring and Spring Boot, July 2-3

Java 8 Generics in 3 Hours, July 5

Secure JavaScript with Node.js, July 10

Learn the Basics of Scala in 3 Hours, July 15

Quantitative Trading with Python, July 15

Advanced React.js, July 16

Next-generation Java Testing with JUnit 5, July 16

Java Full Throttle with Paul Deitel: A One-Day, Code-Intensive Java, July 16

Modern JavaScript, July 17

Mastering the Basics of Relational SQL Querying, July 17-18

Getting Started with Python 3, July 17-18

Building Applications with Apache Cassandra, July 19

Scala Fundamentals: From Core Concepts to Real Code in 5 Hours, July 19

Clean Code, July 23

Introduction to Python Programming, July 23

TypeScript Fundamentals, July 24

Rust Programming: A Crash Course, July 29

Python Data Science Full Throttle with Paul Deitel: Introductory AI, Big Data and Cloud Case Studies, July 30

Beyond Python Scripts: Logging, Modules, and Dependency Management, July 30

Advanced JavaScript, July 30

Beyond Python Scripts: Exceptions, Error Handling and Command-Line Interfaces, July 31

Introduction to TypeScript Programming, August 5

Getting Started with Python 3, August 5-6

Mastering Pandas, August 7

Advanced TypeScript Programming, August 13

Getting Started with React.js, August 14

SQL Fundamentals for Data, August 14-15

Testing Vue.js Applications, August 15

Getting Started with Python 3, August 15-16

Modern Java Exception Handling, August 22

Python: The Next Level, August 1-2


Security

Kubernetes Security, June 10

Defensive Cybersecurity Fundamentals, June 17

Understanding the Social Forces Affecting Cyberattackers, June 28

Ethical Hacking Bootcamp with Hands-on Labs, July 1-3

Cyber Security Defense, July 2

Getting Started with Cyber Investigations and Digital Forensics, July 8

Start Your Security Certification Career Today, July 11

Certified Ethical Hacker (CEH) Crash Course, July 11-12

AWS Security Fundamentals, July 15

Introduction to Encryption, July 16

CISSP Crash Course, July 17-18

CISSP Certification Practice Questions and Exam Strategies, July 18

Linux, Python, and Bash Scripting for Cybersecurity Professionals, July 19

Cyber Security Fundamentals, July 25-26

AWS Certified Security – Specialty Crash Course, July 25-26

Understanding the Social Forces Affecting Cyberattackers, August 5

CCNA Cyber Ops SECFND 210-250, August 13

CCNA Cyber Ops SECOPS 210-255, August 15

Systems engineering and operations

AWS Access Management, June 6

Google Cloud Platform – Professional Cloud Developer Crash Course, June 6-7

React Hooks in Action, June 14

Running MySQL on Kubernetes, June 19

CompTIA A+ Core 1 (220-1001) Certification Crash Course, June 19-20

Introducing Infrastructure as Code with Terraform, June 20

How Routers Really Work: Network Operating Systems and Packet Switching, June 21

Creating React Applications with GraphQL, June 24

Getting Started with Google Cloud Platform, June 24

AWS Certified Big Data – Specialty Crash Course, June 26-27

Building APIs with Django REST Framework, June 28

Hands-on Arista Networking Foundational Routing Topics: Learning Arista Networking Through Lab Exercises, June 28

Azure Architecture: Best Practices, June 28

Learn Linux in 3 Hours, July 1

Managing Containers on Linux, July 1

Getting Started with Amazon SageMaker on AWS, July 1

Ansible in 4 Hours, July 2

Automating with Ansible, July 2

Kubernetes in 4 Hours, July 3

Getting Started with OpenShift, July 5

Amazon Web Services (AWS) Security Crash Course, July 8

Microservices Architecture and Design, July 8-9

AWS Machine Learning Specialty Certification Crash Course, July 8-9

AWS Certified Solutions Architect Associate Crash Course, July 8-9

Google Cloud Platform Security Fundamentals, July 9

CCNA Routing and Switching 200-125 Crash Course, July 9, 11, 16, 18

Exam AZ-300: Microsoft Azure Architect Technologies Crash Course, July 11-12

IBM Blockchain Platform as a Service, July 11-12

Google Cloud Certified Associate Cloud Engineer Crash Course, July 15-16

Getting Started with Amazon Web Services (AWS), July 15-16

AWS for Mobile App Developers, July 16

9 Steps to Awesome with Kubernetes, July 16

Getting Started with Cloud Computing, July 16

Google Cloud Platform (GCP) for AWS Professionals, July 17

AWS Certified SysOps Administrator (Associate) Crash Course, July 17-18

Software Architecture Foundations: Characteristics and Tradeoffs, July 18

AWS Managed Services, July 18-19

Building Micro-frontends, July 22

Linux Performance Optimization, July 22

Linux Under the Hood, July 22

Practical Linux Command Line for Data Engineers and Analysts, July 22

Introduction to Kubernetes, July 22-23

Introduction to Docker images, July 23

Analyzing Software Architecture, July 23

Domain-driven design and event-driven microservices, July 23-24

Building a Cloud Roadmap, July 24

Software Architecture by Example, July 24

Introduction to Docker CI/CD, July 24

Automating Architectural Governance Using Fitness Functions, July 25

Exam MS-100: Microsoft 365 Identity and Services Crash Course, July 25-26

Linux Foundation System Administrator (LFCS) Crash Course, July 25-26

Architecture for Continuous Delivery, July 29

Introduction to Docker Containers, July 30

Implementing Evolutionary Architectures, July 30-31

Docker for JVM Projects, July 31

Getting Started with Continuous Delivery (CD), August 1

Implementing and Troubleshooting TCP/IP, August 5

Developing Incremental Architecture, August 5-6

Microservice Decomposition Patterns, August 6

From Developer to Software Architect, August 6-7

Systems Design for Site Reliability Engineers, August 7

Building and Managing Kubernetes Applications, August 7

Designing Serverless Architecture with AWS Lambda, August 7-8

Docker: Beyond the Basics (CI & CD), August 7-8

Introduction to Istio, August 8

Microservice Fundamentals, August 13

Getting Started with Google Cloud Platform, August 13

Microservices Caching Strategies, August 14

Practical Docker, August 14

Amazon Web Services (AWS) Technical Essentials, August 14

Kubernetes Security, August 14

AWS Design Fundamentals, August 15-16

Software Architecture by Example, August 16

Structural Design Patterns with Spring, August 20

Resilience and Fast Reroute in Computer Networks: Tools and Techniques to Optimize Network Performance, August 23

Four short links: 21 June 2019

Four short links
  1. Private Join and Compute (Google) — This functionality allows two users, each holding an input file, to privately compute the sum of associated values for records that have common identifiers. (via Wired)
  2. PyRobot — from CMU and Facebook. PyRobot is a framework and ecosystem that enables AI researchers and students to get up and running with a robot in just a few hours, without specialized knowledge of the hardware or of details such as device drivers, control, and planning.
  3. PartNet — a consistent, large-scale data set of 3D objects annotated with fine-grained, instance-level, and hierarchical 3D part information. Our data set consists of 573,585 part instances over 26,671 3D models covering 24 object categories. This data set enables and serves as a catalyst for many tasks such as shape analysis, dynamic 3D scene modeling and simulation, affordance analysis, and others. (via IEEE Spectrum)
  4. Self-Supervised Learning (Andrew Zisserman) — 122 slides, very readable, about learning from images, from video, and from video with sound.
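The functionality in the first link is simple to state; below it is computed in the clear as a toy sketch (the cryptography is precisely what lets each party learn only the final sum, not the shared identifiers):

```python
def intersection_sum(ids_a: set, records_b: dict) -> int:
    """The 'intersection-sum' functionality: party A holds identifiers,
    party B holds identifiers with associated values; the result is the
    sum of B's values for identifiers both parties hold. Computed in the
    clear here purely for illustration."""
    shared = ids_a & records_b.keys()
    return sum(records_b[i] for i in shared)

# Party A holds identifiers; party B holds identifiers with values.
print(intersection_sum({"u1", "u2", "u4"}, {"u2": 10, "u3": 7, "u4": 5}))  # 15
```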

Four short links: 20 June 2019

Four short links
  1. Model Governance and Model Operations — models built or tuned for specific applications (in reality, this means models + data) will need to be managed and protected.
  2. Bodies in Seats — the story of Facebook’s 30,000 content moderators: contractors, low pay (as little as $28,800 a year), and a lot of PTSD for everyone. “Nobody’s prepared to see a little girl have her organs taken out while she’s still alive and screaming.” Moderators were told they had to watch at least 15 to 30 seconds of each video.
  3. Dialog — a domain-specific language for creating works of interactive fiction. Inspired by Inform and Prolog, they say.
  4. End-User Probabilistic Programming — We examine the sources of uncertainty actually encountered by spreadsheet users, and their coping mechanisms, via an interview study. We examine spreadsheet-based interfaces and technology to help reason under uncertainty, via probabilistic and other means. We show how uncertain values can propagate uncertainty through spreadsheets, and how sheet-defined functions can be applied to handle uncertainty. Hence, we draw conclusions about the promise and limitations of probabilistic programming for end-users.
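The core idea in the last link — propagating uncertainty through spreadsheet-style formulas — can be sketched with plain Monte Carlo sampling (an illustrative toy, not the system described in the paper):

```python
import random

def propagate(f, dists, n=100_000, seed=0):
    """Push uncertain inputs through a formula: sample each input from
    its distribution, evaluate f, and summarize the output distribution."""
    rng = random.Random(seed)
    samples = [f(*(d(rng) for d in dists)) for _ in range(n)]
    mean = sum(samples) / n
    std = (sum((s - mean) ** 2 for s in samples) / (n - 1)) ** 0.5
    return mean, std

# A spreadsheet-style cell: revenue = price * units, both uncertain.
price = lambda rng: rng.gauss(10.0, 1.0)
units = lambda rng: rng.gauss(500.0, 50.0)
mean, std = propagate(lambda p, u: p * u, [price, units])
print(round(mean), round(std))  # mean near 5000, spread near 700
```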

Enabling end-to-end machine learning pipelines in real-world applications

In this episode of the Data Show, I spoke with Nick Pentreath, principal engineer at IBM. Pentreath was an early and avid user of Apache Spark, and he subsequently became a Spark committer and PMC member. Most recently his focus has been on machine learning, particularly deep learning, and he is part of a group within IBM focused on building open source tools that enable end-to-end machine learning pipelines.

We had a great conversation spanning many topics, including:

Related resources:

Four short links: 19 June 2019

Four short links
  1. Speech2Face: Learning the Face Behind a Voice — complete with an interesting ethics discussion up front. I wonder where this was intended to go: after all, it can’t perfectly reconstruct faces, so what you get is a stereotype based on the voice. Meh.
  2. Minivac 601 Replica (Instructables) — Created by information theory pioneer Claude Shannon as an educational toy for teaching digital circuits, the Minivac 601 Digital Computer Kit was billed as an electromechanical digital computer system.
  3. Nines Are Not Enough: Meaningful Metrics for Clouds — We show that this problem shares some similarities with the challenges of applying statistics to make decisions based on sampled data. We also suggest that defining guarantees in terms of defense against threats, rather than guarantees for application-visible outcomes, can reduce the complexity of these problems.
  4. Announcing Envoy Mobile (Lyft Engineering) — as Simon Willison said: Lyft’s Envoy proxy / service mesh has been widely adopted across the industry as a server-side component for adding smart routing and observability to the network calls made between services in microservice architectures. “The reality is that three 9s at the server-side edge is meaningless if the user of a mobile application is only able to complete the desired product flows a fraction of the time”—so Lyft is building a C++ embedded library companion to Envoy which is designed to be shipped as part of iOS and Android client applications. “Envoy Mobile in conjunction with Envoy in the data center will provide the ability to reason about the entire distributed system network, not just the server-side portion.” Their decision to release an early working prototype and then conduct ongoing development entirely in the open is interesting, too.

What are model governance and model operations?

Our surveys over the past couple of years have shown growing interest in machine learning (ML) among organizations from diverse industries. A few factors are contributing to this strong interest in implementing ML in products and services. First, the machine learning community has conducted groundbreaking research in many areas of interest to companies, and much of this research has been conducted out in the open via preprints and conference presentations. We are also beginning to see researchers share sample code written in popular open source libraries, and some even share pre-trained models. Organizations now also have more use cases and case studies from which to draw inspiration—no matter what industry or domain you are interested in, chances are there are many interesting ML applications you can learn from. Finally, modeling tools are improving, and automation is beginning to allow new users to tackle problems that used to be the province of experts.

With the shift toward the implementation of machine learning, it’s natural to expect improvement in tools targeted at helping companies with ML. In previous posts, we’ve outlined the foundational technologies needed to sustain machine learning within an organization, and there are early signs that tools for model development and model governance are beginning to gain users.

Figure 1. A collection of tools that focus primarily on aspects of model development, governance, and operations. Source: Ben Lorica.

Model development

One sure sign that companies are getting serious about machine learning is the growing popularity of tools designed specifically for managing the ML model development lifecycle, such as MLflow. Why aren’t traditional software tools sufficient? In a previous post, we noted some key attributes that distinguish a machine learning project:

  • Unlike traditional software where the goal is to meet a functional specification, in ML the goal is to optimize a metric.
  • Quality depends not just on code, but also on data, tuning, regular updates, and retraining.
  • Those involved with ML usually want to experiment with new libraries, algorithms, and data sources—and thus, one must be able to put those new components into production.

The growth in adoption of tools like MLflow indicates that new tools are in fact very much needed. These ML development tools are designed specifically to help teams of developers, machine learning engineers, and data scientists collaborate on, manage, and reproduce ML experiments. Many tools in this category let users systematically conduct modeling experiments (e.g., hyperparameter tuning, NAS) while emphasizing the ease with which one can manage, track, and reproduce such experiments.
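As a toy illustration of what such lifecycle tools record per run — the class and field names below are invented for illustration and are not MLflow’s API:

```python
import time
import uuid

class RunTracker:
    """A toy experiment tracker: records the parameters and metrics of
    each modeling run so experiments can be compared and reproduced."""
    def __init__(self):
        self.runs = []

    def log_run(self, params: dict, metrics: dict) -> str:
        run = {"run_id": uuid.uuid4().hex, "logged_at": time.time(),
               "params": params, "metrics": metrics}
        self.runs.append(run)
        return run["run_id"]

    def best_run(self, metric: str) -> dict:
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = RunTracker()
for lr in (0.01, 0.1, 0.5):
    # In practice the metric comes from actual training; this is a stand-in.
    tracker.log_run({"lr": lr}, {"val_accuracy": 0.9 - abs(lr - 0.1)})
best = tracker.best_run("val_accuracy")
print(best["params"])  # {'lr': 0.1}
```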

Model governance

We are also beginning to come across companies that acknowledge the need for model governance tools and capabilities. Just as companies have long treated data as assets, as ML becomes more central to an organization’s operations, models will be treated as important assets. More precisely, models built or tuned for specific applications (in reality, this means models + data) will need to be managed and protected:

  • A database for authorization and security: who has read/write access to certain models
  • A catalog or a database that lists models, including when they were tested, trained, and deployed
  • A catalog of validation data sets and the accuracy measurements of stored models
  • Versioning (of models, feature vectors, data) and the ability to roll out, roll back, or have multiple live versions
  • Metadata and artifacts needed for a full audit trail
  • Who approved and pushed the model out to production, who is able to monitor its performance and receive alerts, and who is responsible for it
  • A dashboard that provides custom views for all principals (operations, ML engineers, data scientists, business owners)

Model operations, testing, and monitoring

As machine learning proliferates in products and services, we need a set of roles, best practices, and tools to deploy, manage, test, and monitor ML in real-world production settings. There are some initial tools aimed at model operations and testing—mainly for deploying and monitoring ML models—but it’s clear we are still in the early stages for solutions in these areas.

There are three common issues that diminish the value of ML models once they’re in production. The first is concept drift: the accuracy of models in production degrades over time, because of changes in the real world, stemming from a growing disparity between the data they were trained on and the data they are used on. The second is locality: when deploying models to new geographic locations, user demographics, or business customers, it’s often not the case that pre-trained models work at the expected level of accuracy. Measuring online accuracy per customer / geography / demographic group is important both to monitor bias and to ensure accuracy for a growing customer base. The third is data quality: since ML models are more sensitive to the semantics of incoming data, changes in data distribution that are often missed by traditional data quality tools wreak havoc on models’ accuracy.
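Concept drift of this kind is commonly monitored by scoring the distance between the training-data distribution and the live-traffic distribution; the Population Stability Index is one widely used convention (a minimal sketch; the 0.1/0.25 cutoffs are an industry rule of thumb, not a standard):

```python
import math
import random

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference sample (e.g., training data) and a live
    sample. Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 significant drift."""
    lo = min(min(expected), min(actual))
    width = (max(max(expected), max(actual)) - lo) / bins or 1.0

    def bucket_fractions(data):
        counts = [0] * bins
        for x in data:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # Floor empty buckets so the log term stays defined.
        return [max(c / len(data), 1e-4) for c in counts]

    e, a = bucket_fractions(expected), bucket_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
live_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
live_shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]
print(population_stability_index(train, live_same) < 0.1)      # stable
print(population_stability_index(train, live_shifted) > 0.25)  # drifted
```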

Beyond the need to monitor that your current deployed models operate as intended, another challenge is knowing that a newly proposed model actually delivers better performance in production. Some early systems allow for the comparison of an “incumbent model” against “challenger models,” including having challengers in “dark launch” or “offline” mode (this means challenger models are evaluated on production traffic but haven’t been deployed to production). Other noteworthy items include:

  • Tools for continuous integration and continuous testing of models. A model is not “correct” if it returns a valid value—it has to meet an accuracy bar. There needs to be a way to validate this against a given metric and validation set before deploying a model.
  • Online measurement of the accuracy of each model (what’s the accuracy that users are experiencing “in the field”?). Related to this is the need to monitor bias, locality effects, and related risks. For example, scores often need to be broken down by demographics (are men and women getting similar accuracy?) or locales (are German and Spanish users getting similar accuracy?).
  • The ability to manage the quality of service for model inference to different customers, including rate limiting, request size limiting, metering, bot detection, and IP geo-fencing.
  • Ability to scale (and auto-scale), secure, monitor, and troubleshoot live models. Scaling has two dimensions—the size of the traffic hitting the models and the number of models that need to be evaluated.
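The first item above can be sketched as an accuracy gate a candidate model must clear on a validation set before promotion (function and variable names here are illustrative, not from any particular tool):

```python
def validation_gate(predict, validation_set, metric, threshold):
    """Evaluate a candidate model on held-out data; it may be promoted
    to production only if its score clears the threshold."""
    inputs, labels = zip(*validation_set)
    predictions = [predict(x) for x in inputs]
    score = metric(labels, predictions)
    return score >= threshold, score

def accuracy(labels, predictions):
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)

# A trivial stand-in "model" that predicts the sign of its input.
model = lambda x: 1 if x >= 0 else 0
validation = [(-2.0, 0), (-1.0, 0), (0.5, 1), (3.0, 1), (4.0, 0)]  # last label noisy
promote, score = validation_gate(model, validation, accuracy, threshold=0.75)
print(promote, score)  # True 0.8
```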

Model operations and testing is very much still a nascent field where systematic checklists are just beginning to be assembled. An overview from a 2017 paper from Google lets us gauge how much tooling is still needed for model operations and testing. This paper came with a 28-item checklist that detailed things that need to be accounted for in order to have a reliable, production-grade machine learning system:

  • Features and data: seven items that include checks for privacy controls, feature validation, exploring the necessity and cost of a feature, and other data-related tests.
  • Tests for model development: seven sanity checks, including checking whether a simpler model will suffice, model performance on critical data slices (e.g., region, age, recency, frequency, etc.), the impact of model staleness, and other important considerations.
  • Infrastructure tests: a suite of seven considerations, including the reproducibility of model training, the ease with which models can be rolled back, integration tests on end-to-end model pipelines, and model tests via a canary process.
  • Monitoring: the authors list a series of seven items to ensure models are working as expected. This includes tests for model staleness, performance metrics (training, inference, throughput), validating that training and serving code generate similar values, and other essential items.

New roles

Discussions around machine learning tend to revolve around the work of data scientists and model building experts. This is beginning to change now that many companies are entering the implementation phase for their ML initiatives. Machine learning engineers, data engineers, developers, and domain experts are critical to the success of ML projects. At the moment, few (if any) teams have checklists as extensive as the one detailed in the 2017 paper from Google. The task of building real-world production-grade ML models still requires stitching together tools and teams that cut across many functional areas. However, as tools for model governance and model operations and testing begin to get refined and become more widely available, it’s likely that specialists (an “ML ops team”) will be tasked to use such tools. Automation will also be an important component, as these tools will need to enable organizations to build, manage, and monitor many more machine learning models.

Figure 2. Demand for tools for managing ML in the enterprise. Source: Ben Lorica, using data from a Twitter poll.

We are beginning to see specialized tools that allow teams to manage the ML model development lifecycle. Tools like MLflow are being used to track and manage machine learning experiments (mainly offline, using test data). There are also new tools that cover aspects of governance, production deployment, serving, and monitoring, but at the moment they tend to focus on single ML libraries (TFX) or modeling tools (SAS Model Manager). The reality is, enterprises will want flexibility in the libraries, modeling tools, and environments they use. Fortunately, startups and companies are beginning to build comprehensive tools for enabling ML in the enterprise.

Related content:

“Managing risk in machine learning”

Four short links: 18 June 2019

Four short links
  1. jExcel — a lightweight vanilla JavaScript plugin to create amazing web-based interactive tables and spreadsheets compatible with Excel or any other spreadsheet software. You can create an online spreadsheet table from a JS array, JSON, CSV, or XLSX files. You can copy from Excel and paste straight to your jExcel spreadsheet and vice versa. It is very easy to integrate any third-party JavaScript plugins to create your own custom columns, custom editors, and customize any feature into your application.
  2. Why Are We So Pessimistic? (Brookings) — The belief or perception that things are much worse than they really are is widespread, and I believe it comes with significant detrimental impacts on societies.
  3. We Read 150 Privacy Policies. They Were an Incomprehensible Disaster (NYT) — Only Immanuel Kant’s famously difficult “Critique of Pure Reason” registers a more challenging readability score than Facebook’s privacy policy.
  4. Perspectives and Approaches in AI Ethics: East Asia — Each country’s perspectives on and approaches to AI and robots on the tool-partner spectrum are evaluated by examining its policy, academic thought, local practices, and popular culture. This analysis places South Korea in the tool range, China in the middle of the spectrum, and Japan in the partner range.

The quest for high-quality data

“AI starts with ‘good’ data” is a statement that receives wide agreement from data scientists, analysts, and business owners. There has been a significant increase in our ability to build complex AI models for predictions, classifications, and various analytics tasks, and there’s an abundance of (fairly easy-to-use) tools that allow data scientists and analysts to provision complex models within days. As model building becomes easier, the problem of high-quality data becomes more evident than ever. A recent O’Reilly survey found that those with mature AI practices (as measured by how long they’ve had models in production) cited “Lack of data or data quality issues” as the main bottleneck holding back further adoption of AI technologies.

Even with advances in building robust models, the reality is that noisy and incomplete data remain the biggest hurdles to effective end-to-end solutions. The problem is even more magnified in the case of structured enterprise data. These data sets are often siloed, incomplete, and extremely sparse. Moreover, domain knowledge is an integral part of this data, yet it often is neither encoded in the data nor fully documented (see this article from Forbes). If you also add scale to the sparsity and the need for domain knowledge, you have the perfect storm of data quality issues.

In this post, we shed some light on various efforts toward generating data for machine learning (ML) models. In general, there are two main lines of work toward that goal: (1) clean the data you have, and (2) generate more data to help train needed models. Both directions have seen new advances in using ML models effectively, building on multiple new results from academia.

Data integration and cleaning

One of the biggest pitfalls in dealing with data quality is to treat all data problems the same. Academic research has been more deliberate in describing the different classes of data quality problems. We see two main classes of problems; they have varying degrees of complexity and often mandate different approaches and tools. Since they consume a significant share of the time spent on most data science projects, we highlight these two main classes of data quality problems in this post:

  1. Data unification and integration
  2. Error detection and automatic repairing/imputation

Data unification and integration

Even with the rise of open source tools for large-scale ingestion, messaging, queuing, and stream processing, siloed data and data sets trapped behind the bars of various business units are the normal state of affairs in any large enterprise. Data unification or integration refers to the set of activities that bring this data together into one unified data context. Schema matching and mapping, record linkage and deduplication, and various mastering activities are the types of tasks a data integration solution performs. Advances in ML offer a scalable and efficient way to replace legacy top-down, rule-based systems, which often result in massive costs and very low success in today’s big data settings. Bottom-up solutions with human-guided ML pipelines (such as Tamr, Paxata, or Informatica—full disclosure: Ihab Ilyas is co-founder of Tamr) show how to leverage the available rules and human expertise to train scalable integration models that work on thousands of sources and large volumes of data. We discussed some of the challenges and enablers in using ML for this class of problems in an earlier post.

The class of data unification problems has its own characteristics in terms of solution complexity: (1) the problem is often quadratic in the size of the input (since we need to compare everything to everything else), and (2) the main ML task is fairly well understood: mainly determining whether two “things” are the same. These characteristics have a considerable impact on the design of the solution. For example, a complex, sophisticated model for finding duplicates or matching schemas is the least of our worries if we cannot even enumerate all possible pairs that need to be checked. Effective solutions for data unification problems tend to be a serious engineering effort to: (1) prune the space of possible candidates; (2) interact effectively with experts to provide training data and validate the machine’s decisions; and (3) keep rich lineage and provenance to track decisions back for auditing, revising, or reusing for future use cases. Due to the nature of the ML task (mainly Boolean classification here), and the richness of structure, most successful models tend to be the good old “shallow” models, such as random forests, with the help of simple language models (to help with string data). See this article on data integration status for details.
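To make the quadratic blow-up concrete, here is a minimal Python sketch of "blocking," the standard candidate-pruning step such pipelines rely on: candidate pairs are generated only within groups of records that share a cheap key, so a matching model never has to score all n(n-1)/2 pairs. The records and the blocking key below are invented for illustration; production systems use far more sophisticated (often learned) blocking.

```python
from collections import defaultdict

def block_candidates(records, key):
    """Group records by a cheap blocking key, then emit candidate
    pairs only within each block instead of all O(n^2) pairs."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[key(rec)].append(i)
    pairs = []
    for ids in blocks.values():
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                pairs.append((ids[a], ids[b]))
    return pairs

records = [
    {"name": "Acme Corp", "city": "Boston"},
    {"name": "ACME Corporation", "city": "Boston"},
    {"name": "Globex", "city": "Springfield"},
    {"name": "Globex Inc", "city": "Springfield"},
]

# Block on the first three letters of a normalized name.
pairs = block_candidates(records, key=lambda r: r["name"].lower()[:3])
print(pairs)  # → [(0, 1), (2, 3)] — 2 candidate pairs instead of all 6
```

Only the surviving pairs would then be handed to the (random forest or similar) matching model.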

Error detection, repairing, and value imputation

Siloed or integrated data is often noisy and incomplete, and sometimes even contains contradictory facts. Data cleaning is the class of data quality efforts that focuses on spotting and (hopefully) repairing such errors. Like data integration, data cleaning exercises often have been carried out through labor-intensive manual work or ad hoc, rule-based point solutions. However, this class has different complexities and characteristics that affect the design of the solution: the core ML task is often far more complex than a matching task, and requires building models that understand “how data was generated” and “how errors were introduced” to be able to reverse that process to spot and repair errors.

While data cleaning has long been a research topic in academia, it often has been looked at as a theoretical logic problem. This probably explains why none of the solutions have been adopted in industry. The good news is that researchers from academia recently managed to leverage that large body of work and combine it with the power of scalable statistical inference for data cleaning. The open source HoloClean probabilistic cleaning framework is currently the state-of-the-art system for ML-based automatic error detection and repair. HoloClean adopts the well-known “noisy channel” model to explain how data was generated and how it was “polluted.” It then leverages all known domain knowledge (such as available rules), statistical information in the data, and available trusted sources to build complex data generation and error models. The models are then used to spot errors and suggest the “most probable” replacement values.

Paying attention to scale is a requirement cleaning and integration have in common: building such complex models involves “featurizing” the whole data set via a series of operations—for example, to compute violations of rules, count co-occurrences, or build language models. Hence, an ML cleaning solution needs to be innovative about avoiding the complexity of these operations. HoloClean, for example, uses techniques to prune the domain of each database cell and applies judicious relaxations to the underlying model to achieve the required scalability. Older research tools struggled with how to handle the various types of errors, and how to combine the heterogeneous quality input (e.g., business and quality rules, policies, statistical signals in the data, etc.). The HoloClean framework advances the state of the art in two fundamental ways: (1) combining the logical rules and the statistical distribution of the data into one coherent probabilistic model; and (2) scaling the learning and inference process via a series of system and model optimizations, which allowed it to be deployed in census organizations and large commercial enterprises.
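To give a flavor of how a rule plus simple statistics can spot and repair an error, here is a toy Python sketch (a crude stand-in for the intuition behind probabilistic cleaning, not HoloClean's actual model or API). Given a functional dependency such as zip → city, any zip code that maps to several city values signals an error, and the most frequent value serves as a naive "most probable" repair. The data and function name are hypothetical.

```python
from collections import Counter, defaultdict

def repair_fd(rows, lhs, rhs):
    """Toy repair for a functional dependency lhs -> rhs:
    if one lhs value maps to several rhs values, replace each
    rhs with the most frequent value seen for that lhs."""
    by_lhs = defaultdict(Counter)
    for row in rows:
        by_lhs[row[lhs]][row[rhs]] += 1
    repaired = []
    for row in rows:
        best = by_lhs[row[lhs]].most_common(1)[0][0]
        repaired.append({**row, rhs: best})
    return repaired

rows = [
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Cambrdge"},   # likely typo
    {"zip": "10001", "city": "New York"},
]
clean = repair_fd(rows, lhs="zip", rhs="city")
print(clean[2]["city"])  # → Cambridge
```

A real system must weigh many such rules against statistical and trusted-source evidence jointly, which is exactly where the probabilistic model (and the scaling problem) comes in.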

Data programming

Increasing the quality of the available data via either unification or cleaning, or both, is definitely an important and a promising way forward to leverage enterprise data assets. However, the quest for more data is not over, for two main reasons:

  1. ML models for cleaning and unification often need training data and examples of possible errors or matching records. Depending completely on human labeling for these examples is simply a non-starter: as ML models get more complex and the underlying data sources get larger, the need for training data grows to a scale that human experts alone cannot meet.
  2. Even if we boosted the quality of the available data via unification and cleaning, it still might not be enough to power the even more complex analytics and prediction models (often built as deep learning models).

An important paradigm for solving both these problems is the concept of data programming. In a nutshell, data programming techniques provide ways to “manufacture” data that we can feed to various learning and prediction tasks (even for ML data quality solutions). In practical terms, “data programming” unifies a class of techniques used for the programmatic creation of training data sets. In this category of tools, frameworks like Snorkel show how to allow developers and data scientists to focus on writing labeling functions to programmatically label data, and then model the noise in the labels to effectively train high-quality models. While the use of data programming to train high-quality analytics models might be clear, we find it interesting that it is also used internally by the ML models for data unification and cleaning mentioned earlier in this post. For example, tools like Tamr leverage legacy rules written by customers to generate a large amount of (programmatically) labeled data to power its matching ML pipeline. In a recent paper, the HoloClean project showed how to use “data augmentation” to generate many examples of possible errors (from a small seed) to power its automatic error detection model.
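A minimal sketch of the labeling-function idea, in plain Python with invented heuristics rather than Snorkel's actual API: several noisy heuristics each vote on a label or abstain, and their votes are combined. Here we use a simple majority vote for clarity; Snorkel instead fits a generative label model that learns each function's accuracy and correlations before producing probabilistic labels.

```python
# Each labeling function votes SPAM (1), NOT_SPAM (0), or abstains (-1).
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    return SPAM if "http" in text else ABSTAIN

def lf_long_message(text):
    return NOT_SPAM if len(text.split()) > 5 else ABSTAIN

def lf_money_words(text):
    lowered = text.lower()
    return SPAM if "winner" in lowered or "prize" in lowered else ABSTAIN

def majority_label(text, lfs):
    """Combine labeling-function votes by majority, ignoring abstentions."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_contains_link, lf_long_message, lf_money_words]
print(majority_label("Claim your prize winner http://spam.example", lfs))  # → 1
```

Labels produced this way are noisy, which is precisely why the noise-modeling step matters before training the downstream model.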

Market validation

The landscape of solutions we presented here for the quest for high-quality data has already been well validated in the market today.

  • ML solutions for data unification such as Tamr and Informatica have been deployed at a large number of Fortune-1000 enterprises.
  • Automatic data cleaning solutions such as HoloClean already have been deployed by multiple financial services firms and by the census bureaus of various countries.
  • As the growing list of Snorkel users suggests, data programming solutions are beginning to change the way data scientists provision ML models.

As we get more mature in understanding the differences between the various problems of integration, cleaning, and automatic data generation, we will see real improvement in handling the valuable data assets in the enterprise.

Machine learning applications rely on three main components: models, data, and compute. A lot of articles are written about new breakthrough models, many of which are created by researchers who publish not only papers, but code written in popular open source libraries. In addition, recent advances in automated machine learning have resulted in many tools that can (partially) automate model selection and hyperparameter tuning. Thus, many cutting-edge models are now available to data scientists. Similarly, cloud platforms have made compute and hardware more accessible to developers.

Models are increasingly becoming commodities. As we noted in the survey results above, the reality is that a lack of high-quality training data remains the main bottleneck in most machine learning projects. We believe that machine learning engineers and data scientists will continue to spend most of their time creating and refining training data. Fortunately, help is on the way: as we’ve described in this post, we are finally beginning to see a class of technologies aimed squarely at the need for quality training data.
