- The Next Generation of Deep Learning: Analog Computing (IEEE) — Further progress in compute efficiency for deep learning training can be made by exploiting the more random and approximate nature of deep learning workflows. In the digital space, that means trading off numerical precision for accuracy for the benefit of compute efficiency. It also opens the possibility of revisiting analog computing, which is intrinsically noisy, to execute the matrix operations for deep learning in constant time on arrays of nonvolatile memories. (Paywalled paper)
- The Internet is Increasingly a Low-Trust Society (Wired) — Zeynep Tufekci nails it. Social scientists distinguish high-trust societies (ones where you can expect most interactions to work) from low-trust societies (ones where you have to be on your guard at all times). People break rules in high-trust societies, of course, but laws, regulations, and norms help to keep most abuses in check; if you have to go to court, you expect a reasonable process. In low-trust societies, you never know. You expect to be cheated, often without recourse. You expect things not to be what they seem and for promises to be broken, and you don’t expect a reasonable and transparent process for recourse. It’s harder for markets to function and economies to develop in low-trust societies. It’s harder to find or extend credit, and it’s risky to pay in advance.
- Be Internet Awesome — Google’s media literacy materials. Be Internet Awesome is like an instruction manual for making smart decisions online. Kids today need a guide to the internet and media just as they need instruction on other topics. We need help teaching them about credible sources, the power of words and images, and more importantly, how to be smart and savvy when seeing different media while browsing the web. All of these resources are not only available for classrooms, but also free and easily accessible for families as well. They’re in both English and in Spanish, along with eight other languages. (via Google Blog)
- PsyToolkit — create and run cognitive psychological experiments in your browser.
- NTFS Timestamps — a 64-bit value representing the number of 100-nanosecond intervals since January 1, 1601 (UTC). WTAF?
- Computers Changed Spycraft (Foreign Policy) — so much has changed—e.g., dead letter drops: It is easy for Russian counterintelligence to track the movements of every mobile phone in Moscow, so if the Canadian is carrying her device, observers can match her movements with any location that looks like a potential site for a dead drop. They could then look at any other phone signal that pings in the same location in the same time window. If the visitor turns out to be a Russian government official, he or she will have some explaining to do.
- Netflix Records All of your Bandersnatch Choices, GDPR Request Reveals (Verge) — that’s some next-level meta.
- Being Beyoncé’s Assistant for the Day (Twitter) — a choose-your-own-adventure implemented in Twitter. GENIUS!
Learn new topics and refine your skills with more than 219 new live online training courses we opened up for June and July on the O’Reilly online learning platform.
AI and machine learning
Deep Learning with PyTorch, June 20
Deep Learning from Scratch, July 2
Deep Reinforcement Learning, July 18
Hands-on Adversarial Machine Learning, August 13
Blockchain
Business Applications of Blockchain, July 17
Business
Engineering Mentorship, June 24
60 Minutes to a Better Prototype, June 25
Being a Successful Team Member, July 1
Getting S.M.A.R.T about Goals, July 9
Thinking Like a Manager, July 10
Better Business Writing, July 15
Why Smart Leaders Fail, July 16
Introduction to Critical Thinking, July 23
Negotiation Fundamentals, July 23
Having Difficult Conversations, July 25
Giving a Powerful Presentation, July 25
Managing a Toxic Work Environment, July 25
90 Minutes to Better Decision-Making, July 30
Performance Goals for Growth, July 31
Adaptive Project Management, July 31
How to Be a Better Mentor, August 5
Foundations of Microsoft Excel, August 6
Succeeding with Project Management, August 8
How to Give Great Presentations, August 13
Building Your LinkedIn Network, August 13
Understanding Business Strategy, August 14
Data science and data tools
Business Data Analytics Using Python, June 25
Debugging Data Science, June 26
Time Series Forecasting, July 15
Cleaning Data at Scale, July 15
First Steps in Data Analysis, July 22
Inferential Statistics using R, July 24
Foundational Python for Data Science, July 24
Intermediate SQL for Data Analysis, July 29
Introduction to Pandas: Data Munging with Python, July 29-30
Intro to Mathematical Optimization, August 6
Getting Started with PySpark, August 8
Real-time Data Foundations: Kafka, August 13
Real-time Data Foundations: Spark, August 15
Visualization and Presentation of Data, August 15
Design and product management
Introduction to UI & UX design, June 24
Programming
Discovering Modern Java, June 7
Design Patterns in Java, June 13-14
Scaling Python with Generators, June 25
Pythonic Object-Oriented Programming, June 26
Pythonic design patterns, June 27
Test-Driven Development In Python, June 28
Learning Python 3 by Example, July 1
Java 8 Generics in 3 Hours, July 5
Learn the Basics of Scala in 3 Hours, July 15
Quantitative Trading with Python, July 15
Advanced React.js, July 16
Mastering the Basics of Relational SQL Querying, July 17-18
Getting Started with Python 3, July 17-18
Clean Code, July 23
Introduction to Python Programming, July 23
TypeScript Fundamentals, July 24
Rust Programming: A Crash Course, July 29
Introduction to TypeScript Programming, August 5
Getting Started with Python 3, August 5-6
Mastering Pandas, August 7
Advanced TypeScript Programming, August 13
Getting Started with React.js, August 14
SQL Fundamentals for Data, August 14-15
Testing Vue.js Applications, August 15
Getting Started with Python 3, August 15-16
Modern Java Exception Handling, August 22
Python: The Next Level, August 1-2
Security
Kubernetes Security, June 10
Defensive Cybersecurity Fundamentals, June 17
Cyber Security Defense, July 2
Certified Ethical Hacker (CEH) Crash Course, July 11-12
AWS Security Fundamentals, July 15
Introduction to Encryption, July 16
CISSP Crash Course, July 17-18
Cyber Security Fundamentals, July 25-26
AWS Certified Security – Specialty Crash Course, July 25-26
CCNA Cyber Ops SECFND 210-250, August 13
CCNA Cyber Ops SECOPS 210-255, August 15
Systems engineering and operations
AWS Access Management, June 6
React Hooks in Action, June 14
Running MySQL on Kubernetes, June 19
AWS Certified Big Data – Specialty Crash Course, June 26-27
Building APIs with Django REST Framework, June 28
Azure Architecture: Best Practices, June 28
Learn Linux in 3 Hours, July 1
Managing Containers on Linux, July 1
Ansible in 4 Hours, July 2
Automating with Ansible, July 2
Kubernetes in 4 Hours, July 3
Getting Started with OpenShift, July 5
Microservices Architecture and Design, July 8-9
CCNA Routing and Switching 200-125 Crash Course, July 9, 11, 16, 18
IBM Blockchain Platform as a Service, July 11-12
Getting Started with Amazon Web Services (AWS), July 15-16
AWS for Mobile App Developers, July 16
9 Steps to Awesome with Kubernetes, July 16
Getting Started with Cloud Computing, July 16
AWS Managed Services, July 18-19
Building Micro-frontends, July 22
Linux Performance Optimization, July 22
Linux Under the Hood, July 22
Introduction to Kubernetes, July 22-23
Introduction to Docker images, July 23
Analyzing Software Architecture, July 23
Building a Cloud Roadmap, July 24
Software Architecture by Example, July 24
Introduction to Docker CI/CD, July 24
Architecture for Continuous Delivery, July 29
Introduction to Docker Containers, July 30
Implementing Evolutionary Architectures, July 30-31
Docker for JVM Projects, July 31
Implementing and Troubleshooting TCP/IP, August 5
Developing Incremental Architecture, August 5-6
Microservice Decomposition Patterns, August 6
From Developer to Software Architect, August 6-7
Docker: Beyond the Basics (CI & CD), August 7-8
Introduction to Istio, August 8
Microservice Fundamentals, August 13
Getting Started with Google Cloud Platform, August 13
Microservices Caching Strategies, August 14
Practical Docker, August 14
Kubernetes Security, August 14
AWS Design Fundamentals, August 15-16
Software Architecture by Example, August 16
Structural Design Patterns with Spring, August 20
- Private Join and Compute (Google) — This functionality allows two users, each holding an input file, to privately compute the sum of associated values for records that have common identifiers. (via Wired)
- PyRobot — from CMU and Facebook. PyRobot is a framework and ecosystem that enables AI researchers and students to get up and running with a robot in just a few hours, without specialized knowledge of the hardware or of details such as device drivers, control, and planning.
- PartNet — a consistent, large-scale data set of 3D objects annotated with fine-grained, instance-level, and hierarchical 3D part information. Our data set consists of 573,585 part instances over 26,671 3D models covering 24 object categories. This data set enables and serves as a catalyst for many tasks such as shape analysis, dynamic 3D scene modeling and simulation, affordance analysis, and others. (via IEEE Spectrum)
- Self-Supervised Learning (Andrew Zisserman) — 122 slides, very readable, about learning from images, from video, and from video with sound.
- Model Governance and Model Operations — models built or tuned for specific applications (in reality, this means models + data) will need to be managed and protected.
- Bodies in Seats — the story of Facebook’s 30,000 content moderators: contractors, low pay (as little as $28,800 a year), and a lot of PTSD for everyone. “Nobody’s prepared to see a little girl have her organs taken out while she’s still alive and screaming.” Moderators were told they had to watch at least 15 to 30 seconds of each video.
- Dialog — a domain-specific language for creating works of interactive fiction. Inspired by Inform and Prolog, they say.
- End-User Probabilistic Programming — We examine the sources of uncertainty actually encountered by spreadsheet users, and their coping mechanisms, via an interview study. We examine spreadsheet-based interfaces and technology to help reason under uncertainty, via probabilistic and other means. We show how uncertain values can propagate uncertainty through spreadsheets, and how sheet-defined functions can be applied to handle uncertainty. Hence, we draw conclusions about the promise and limitations of probabilistic programming for end-users.
In this episode of the Data Show, I spoke with Nick Pentreath, principal engineer at IBM. Pentreath was an early and avid user of Apache Spark, and he subsequently became a Spark committer and PMC member. Most recently his focus has been on machine learning, particularly deep learning, and he is part of a group within IBM focused on building open source tools that enable end-to-end machine learning pipelines.
We had a great conversation spanning many topics.
- Speech2Face: Learning the Face Behind a Voice — complete with an interesting ethics discussion up front. I wonder where this was intended to go: after all, it can’t perfectly reconstruct faces, so what you get is a stereotype based on the voice. Meh.
- Minivac 601 Replica (Instructables) — Created by information theory pioneer Claude Shannon as an educational toy for teaching digital circuits, the Minivac 601 Digital Computer Kit was billed as an electromechanical digital computer system.
- Nines Are Not Enough: Meaningful Metrics for Clouds — We show that this problem shares some similarities with the challenges of applying statistics to make decisions based on sampled data. We also suggest that defining guarantees in terms of defense against threats, rather than guarantees for application-visible outcomes, can reduce the complexity of these problems.
- Announcing Envoy Mobile (Lyft Engineering) — as Simon Willison said: Lyft’s Envoy proxy / service mesh has been widely adopted across the industry as a server-side component for adding smart routing and observability to the network calls made between services in microservice architectures. “The reality is that three 9s at the server-side edge is meaningless if the user of a mobile application is only able to complete the desired product flows a fraction of the time”—so Lyft is building a C++ embedded library companion to Envoy which is designed to be shipped as part of iOS and Android client applications. “Envoy Mobile in conjunction with Envoy in the data center will provide the ability to reason about the entire distributed system network, not just the server-side portion.” Their decision to release an early working prototype and then conduct ongoing development entirely in the open is interesting, too.
Our surveys over the past couple of years have shown growing interest in machine learning (ML) among organizations from diverse industries. A few factors are contributing to this strong interest in implementing ML in products and services. First, the machine learning community has conducted groundbreaking research in many areas of interest to companies, and much of this research has been conducted out in the open via preprints and conference presentations. We are also beginning to see researchers share sample code written in popular open source libraries, and some even share pre-trained models. Organizations now also have more use cases and case studies from which to draw inspiration—no matter what industry or domain you are interested in, chances are there are many interesting ML applications you can learn from. Finally, modeling tools are improving, and automation is beginning to allow new users to tackle problems that used to be the province of experts.
With the shift toward the implementation of machine learning, it’s natural to expect improvement in tools targeted at helping companies with ML. In previous posts, we’ve outlined the foundational technologies needed to sustain machine learning within an organization, and there are early signs that tools for model development and model governance are beginning to gain users.
One sure sign that companies are getting serious about machine learning is the growing popularity of tools designed specifically for managing the ML model development lifecycle, such as MLflow and Comet.ml. Why aren’t traditional software tools sufficient? In a previous post, we noted some key attributes that distinguish a machine learning project:
- Unlike traditional software where the goal is to meet a functional specification, in ML the goal is to optimize a metric.
- Quality depends not just on code, but also on data, tuning, regular updates, and retraining.
- Those involved with ML usually want to experiment with new libraries, algorithms, and data sources—and thus, one must be able to put those new components into production.
The growth in adoption of tools like MLflow indicates that new tools are in fact very much needed. These ML development tools are designed specifically to help teams of developers, machine learning engineers, and data scientists collaborate on, manage, and reproduce ML experiments. Many tools in this category let users systematically conduct modeling experiments (e.g., hyperparameter tuning, NAS) while emphasizing the ease with which one can manage, track, and reproduce such experiments.
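As a toy sketch of what tools in this category record (the `ExperimentTracker` class here is hypothetical, not the MLflow API), each run captures parameters, metrics, and a fingerprint of the training data, since reproducing an ML experiment requires the data as well as the code:

```python
import hashlib
import json

class ExperimentTracker:
    """Toy experiment tracker: records params, metrics, and a data
    fingerprint so a run can be compared and reproduced later."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, train_data):
        # Fingerprint the training data: reproducing a run needs the
        # exact data, not just the code and hyperparameters.
        data_hash = hashlib.sha256(
            json.dumps(train_data, sort_keys=True).encode()
        ).hexdigest()[:12]
        self.runs.append({"params": params, "metrics": metrics,
                          "data_hash": data_hash})

    def best_run(self, metric):
        # Unlike traditional software, the goal is a metric to optimize.
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = ExperimentTracker()
tracker.log_run({"lr": 0.1}, {"auc": 0.81}, [[1, 0], [0, 1]])
tracker.log_run({"lr": 0.01}, {"auc": 0.87}, [[1, 0], [0, 1]])
print(tracker.best_run("auc")["params"])  # {'lr': 0.01}
```

Real tools add UIs, artifact storage, and multi-user collaboration on top of exactly this kind of record keeping.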
We are also beginning to come across companies that acknowledge the need for model governance tools and capabilities. Just as companies have long treated data as assets, as ML becomes more central to an organization’s operations, models will be treated as important assets. More precisely, models built or tuned for specific applications (in reality, this means models + data) will need to be managed and protected:
- A database for authorization and security: who has read/write access to certain models
- A catalog or a database that lists models, including when they were tested, trained, and deployed
- A catalog of validation data sets and the accuracy measurements of stored models
- Versioning (of models, feature vectors, data) and the ability to roll out, roll back, or have multiple live versions
- Metadata and artifacts needed for a full audit trail
- Who approved and pushed the model out to production, who is able to monitor its performance and receive alerts, and who is responsible for it
- A dashboard that provides custom views for all principals (operations, ML engineers, data scientists, business owners)
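Pulled together, the items above amount to a registry record per model version. A minimal sketch (all field names are illustrative, not from any particular product):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ModelRecord:
    """Hypothetical registry entry covering the governance items above:
    access control, lifecycle timestamps, validation results, versioning,
    and an audit trail of who approved and deployed each model."""
    name: str
    version: int
    read_write_acl: Dict[str, List[str]]   # authorization and security
    trained_at: str
    deployed_at: str
    validation_metrics: Dict[str, float]   # accuracy on stored eval sets
    approved_by: str                       # audit trail: who pushed it out
    live: bool = True                      # supports rollback / parallel versions
    audit_log: List[str] = field(default_factory=list)

registry = {}

def register(record: ModelRecord):
    # Keep every version so we can roll back or run versions side by side.
    registry[(record.name, record.version)] = record
    record.audit_log.append(f"registered v{record.version} by {record.approved_by}")

register(ModelRecord("churn", 1, {"read": ["ml-eng"], "write": ["ops"]},
                     "2019-06-01", "2019-06-10", {"auc": 0.88}, "alice"))
register(ModelRecord("churn", 2, {"read": ["ml-eng"], "write": ["ops"]},
                     "2019-07-01", "2019-07-05", {"auc": 0.91}, "bob"))
print(len(registry))  # 2
```

A dashboard then becomes a set of views over these records, filtered per principal.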
Model operations, testing, and monitoring
As machine learning proliferates in products and services, we need a set of roles, best practices, and tools to deploy, manage, test, and monitor ML in real-world production settings. There are some initial tools aimed at model operations and testing—mainly for deploying and monitoring ML models—but it’s clear we are still in the early stages for solutions in these areas.
There are three common issues that diminish the value of ML models once they’re in production. The first is concept drift: the accuracy of models in production degrades over time as the real world changes, creating a growing disparity between the data they were trained on and the data they are used on. The second is locality: when deploying models to new geographic locations, user demographics, or business customers, it’s often not the case that pre-trained models work at the expected level of accuracy. Measuring online accuracy per customer, geography, or demographic group is important both to monitor bias and to ensure accuracy for a growing customer base. The third is data quality: since ML models are more sensitive to the semantics of incoming data, changes in data distribution that are often missed by traditional data quality tools wreak havoc on models’ accuracy.
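Concept drift and data-quality shifts can both be watched for by comparing the distribution of live inputs against the training distribution. A minimal sketch using the population stability index (PSI), a common drift signal (the bin count, threshold convention, and data here are illustrative):

```python
import math

def psi(expected, actual, bins=4):
    """Population Stability Index between training-time ('expected') and
    live ('actual') values of a feature. By convention, values above ~0.2
    are treated as significant drift worth an alert."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def frac(values, b):
        # Fraction of values falling in bin b; the last bin is closed on
        # the right so the maximum value is counted.
        n = sum(1 for v in values
                if lo + b * width <= v < lo + (b + 1) * width
                or (b == bins - 1 and v == hi))
        return max(n / len(values), 1e-6)  # floor to avoid log(0)

    return sum((frac(actual, b) - frac(expected, b))
               * math.log(frac(actual, b) / frac(expected, b))
               for b in range(bins))

train = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
live_ok = [0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.8]
live_drifted = [0.7, 0.75, 0.8, 0.8, 0.8, 0.75, 0.7, 0.8]
print(psi(train, live_ok) < psi(train, live_drifted))  # True
```

Production monitors run checks like this per feature and per slice, on a schedule, and page someone when the score crosses the threshold.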
Beyond the need to monitor that your current deployed models operate as intended, another challenge is knowing that a newly proposed model actually delivers better performance in production. Some early systems allow for the comparison of an “incumbent model” against “challenger models,” including having challengers in “dark launch” or “offline” mode (this means challenger models are evaluated on production traffic but haven’t been deployed to production). Other noteworthy items include:
- Tools for continuous integration and continuous testing of models. A model is not “correct” if it returns a valid value—it has to meet an accuracy bar. There needs to be a way to validate this against a given metric and validation set before deploying a model.
- Online measurement of the accuracy of each model (what’s the accuracy that users are experiencing “in the field”?). Related to this is the need to monitor bias, locality effects, and related risks. For example, scores often need to be broken down by demographics (are men and women getting similar accuracy?) or locales (are German and Spanish users getting similar accuracy?).
- The ability to manage the quality of service for model inference to different customers, including rate limiting, request size limiting, metering, bot detection, and IP geo-fencing.
- Ability to scale (and auto-scale), secure, monitor, and troubleshoot live models. Scaling has two dimensions—the size of the traffic hitting the models and the number of models that need to be evaluated.
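A rough sketch of how an incumbent/challenger comparison in "dark launch" mode might look, with accuracy broken down per locale and a simple promotion gate (all names, models, and thresholds here are illustrative):

```python
from collections import defaultdict

def shadow_eval(incumbent, challenger, traffic, accuracy_bar=0.8):
    """Evaluate a challenger model on production traffic without serving
    its predictions ('dark launch'), breaking accuracy down per locale so
    locality effects and bias are visible before promotion."""
    hits = defaultdict(lambda: {"inc": 0, "cha": 0, "n": 0})
    for features, label, locale in traffic:
        s = hits[locale]
        s["n"] += 1
        s["inc"] += incumbent(features) == label
        s["cha"] += challenger(features) == label  # scored, never served
    report = {loc: {"incumbent": s["inc"] / s["n"],
                    "challenger": s["cha"] / s["n"]}
              for loc, s in hits.items()}
    # Promote only if the challenger clears the bar on *every* slice,
    # not just in aggregate.
    promote = all(r["challenger"] >= accuracy_bar for r in report.values())
    return report, promote

incumbent = lambda x: x > 0.5   # stand-ins for real models
challenger = lambda x: x > 0.4
traffic = [(0.45, True, "de"), (0.9, True, "de"),
           (0.3, False, "es"), (0.6, True, "es")]
report, promote = shadow_eval(incumbent, challenger, traffic)
print(promote)  # True
```

The same per-slice report doubles as the bias check described above: are German and Spanish users getting similar accuracy?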
Model operations and testing is very much still a nascent field where systematic checklists are just beginning to be assembled. An overview from a 2017 paper from Google lets us gauge how much tooling is still needed for model operations and testing. This paper came with a 28-item checklist that detailed things that need to be accounted for in order to have a reliable, production-grade machine learning system:
- Features and data: seven items that include checks for privacy controls, feature validation, exploring the necessity and cost of a feature, and other data-related tests.
- Tests for model development: seven sanity checks, including checking whether a simpler model will suffice, model performance on critical data slices (e.g., region, age, recency, frequency, etc.), the impact of model staleness, and other important considerations.
- Infrastructure tests: a suite of seven considerations, including the reproducibility of model training, the ease with which models can be rolled back, integration tests on end-to-end model pipelines, and model tests via a canary process.
- Monitoring: the authors list a series of seven items to ensure models are working as expected. This includes tests for model staleness, performance metrics (training, inference, throughput), validating that training and serving code generate similar values, and other essential items.
Discussions around machine learning tend to revolve around the work of data scientists and model building experts. This is beginning to change now that many companies are entering the implementation phase for their ML initiatives. Machine learning engineers, data engineers, developers, and domain experts are critical to the success of ML projects. At the moment, few (if any) teams have checklists as extensive as the one detailed in the 2017 paper from Google. The task of building real-world production-grade ML models still requires stitching together tools and teams that cut across many functional areas. However, as tools for model governance and model operations and testing begin to get refined and become more widely available, it’s likely that specialists (an “ML ops team”) will be tasked to use such tools. Automation will also be an important component, as these tools will need to enable organizations to build, manage, and monitor many more machine learning models.
We are beginning to see specialized tools that allow teams to manage the ML model development lifecycle. Tools like MLflow are being used to track and manage machine learning experiments (mainly offline, using test data). There are also new tools that cover aspects of governance, production deployment, serving, and monitoring, but at the moment they tend to focus on single ML libraries (TFX) or modeling tools (SAS Model Manager). The reality is, enterprises will want flexibility in the libraries, modeling tools, and environments they use. Fortunately, startups and companies are beginning to build comprehensive tools for enabling ML in the enterprise.
- Why Are We So Pessimistic? (Brookings) — The belief or perception that things are much worse than they really are is widespread, and I believe it comes with significant detrimental impacts on societies.
- Perspectives and Approaches in AI Ethics: East Asia — Each country’s perspectives on and approaches to AI and robots on the tool-partner spectrum are evaluated by examining its policy, academic thought, local practices, and popular culture. This analysis places South Korea in the tool range, China in the middle of the spectrum, and Japan in the partner range.
“AI starts with ‘good’ data” is a statement that receives wide agreement from data scientists, analysts, and business owners. There has been a significant increase in our ability to build complex AI models for predictions, classifications, and various analytics tasks, and there’s an abundance of (fairly easy-to-use) tools that allow data scientists and analysts to provision complex models within days. As model building becomes easier, the problem of high-quality data becomes more evident than ever. A recent O’Reilly survey found that those with mature AI practices (as measured by how long they’ve had models in production) cited “Lack of data or data quality issues” as the main bottleneck holding back further adoption of AI technologies.
Even with advances in building robust models, the reality is that noisy data and incomplete data remain the biggest hurdles to effective end-to-end solutions. The problem is even more pronounced in the case of structured enterprise data. These data sets are often siloed, incomplete, and extremely sparse. Moreover, the domain knowledge needed to interpret this data is an integral part of it, yet often is neither encoded in the data nor fully documented (see this article from Forbes). If you also add scale to the sparsity and the need for domain knowledge, you have the perfect storm of data quality issues.
In this post, we shed some light on various efforts toward generating data for machine learning (ML) models. In general, there are two main lines of work toward that goal: (1) clean the data you have, and (2) generate more data to help train needed models. Both directions have seen new advances in using ML models effectively, building on multiple new results from academia.
Data integration and cleaning
One of the biggest pitfalls in dealing with data quality is to treat all data problems the same. Academic research has been more deliberate in describing the different classes of data quality problems. We see two main classes of problems, which have varying degrees of complexity, and often mandate different approaches and tools to solve them. Since they consume a significant amount of time spent on most data science projects, we highlight these two main classes of data quality problems in this post:
- Data unification and integration
- Error detection and automatic repairing/imputation
Data unification and integration
Even with the rise of open source tools for large-scale ingestion, messaging, queuing, and stream processing, siloed data and data sets trapped behind the bars of various business units is the normal state of affairs in any large enterprise. Data unification or integration refers to the set of activities that bring this data together into one unified data context. Schema matching and mapping, record linkage and deduplication, and various mastering activities are the types of tasks a data integration solution performs. Advances in ML offer a scalable and efficient way to replace legacy top-down, rule-based systems, which often result in massive costs and very low success in today’s big data settings. Bottom-up solutions with human-guided ML pipelines (such as Tamr, Paxata, or Informatica—full disclosure: Ihab Ilyas is co-founder of Tamr) show how to leverage the available rules and human expertise to train scalable integration models that work on thousands of sources and large volumes of data. We discussed some of the challenges and enablers in using ML for this class of problems in an earlier post.
The class of data unification problems has its own characteristics in terms of solution complexity: (1) the problem is often quadratic in the size of the input (since we need to compare everything to everything else), and (2) the main ML task is fairly well understood: mainly determining whether two “things” are the same. These characteristics have a considerable impact on the design of the solution. For example, a complex sophisticated model for finding duplicates or matching schema is the least of our worries if we cannot even enumerate all possible pairs that need to be checked. Effective solutions for data unification problems tend to be a serious engineering effort to: (1) prune the space of possible candidates; (2) interact effectively with experts to provide training data and validate the machine decision; and (3) keep rich lineage and provenance to track decisions back for auditing, revising, or reusing for future use cases. Due to the nature of the ML task (mainly Boolean classification here), and the richness of structure, most successful models tend to be the good old “shallow” models, such as random forest, with the help of simple language models (to help with strings data). See this article on data integration status for details.
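The quadratic blowup is worth a concrete illustration: "blocking" compares only records that share a cheap key, rather than all n*(n-1)/2 pairs. A minimal sketch (the blocking key and matcher here are deliberately naive stand-ins for the real pruning heuristics and shallow models):

```python
from collections import defaultdict
from itertools import combinations

def blocked_pairs(records, blocking_key):
    """Prune the quadratic candidate space: only records that share a
    cheap blocking key are compared, instead of every possible pair."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

def same_entity(a, b):
    # Stand-in for the "shallow" matching model (e.g., a random forest
    # over string-similarity features); here just a case-insensitive check.
    return a["name"].lower() == b["name"].lower()

records = [{"name": "Acme Corp"}, {"name": "ACME CORP"},
           {"name": "Globex"}, {"name": "Initech"}]
# Hypothetical blocking key: first three characters, lowercased.
candidates = list(blocked_pairs(records, lambda r: r["name"][:3].lower()))
matches = [p for p in candidates if same_entity(*p)]
print(len(candidates), len(matches))  # 1 candidate pair instead of 6; 1 match
```

On thousands of sources and millions of records, this kind of pruning is what makes the matching model affordable to run at all.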
Error detection, repair, and value imputation
Siloed or integrated data is often noisy, missing, and sometimes even has contradicting facts. Data cleaning is the class of data quality efforts that focuses on spotting and (hopefully) repairing such errors. Like data integration, data cleaning exercises often have been carried out with labor-intensive manual work, or ad hoc rule-based point solutions. However, this class has different complexities and characteristics that affect the design of the solution: the core ML task is often far more complex than a matching task, and requires building models that understand “how data was generated” and “how errors were introduced” to be able to reverse that process to spot and repair errors.
While data cleaning has long been a research topic in academia, it often has been looked at as a theoretical logic problem. This probably explains why none of the solutions have been adopted in industry. The good news is that researchers from academia recently managed to leverage that large body of work and combine it with the power of scalable statistical inference for data cleaning. The open source HoloClean probabilistic cleaning framework is currently the state-of-the-art system for ML-based automatic error detection and repair. HoloClean adopts the well-known “noisy channel” model to explain how data was generated and how it was “polluted.” It then leverages all known domain knowledge (such as available rules), statistical information in the data, and available trusted sources to build complex data generation and error models. The models are then used to spot errors and suggest the “most probable” values as repairs.
Paying attention to scale is a requirement cleaning and integration have in common: building such complex models involves “featurizing” the whole data set via a series of operations—for example, to compute violations of rules, count co-occurrences, or build language models. Hence, an ML cleaning solution would need to be innovative on how to avoid the complexity of these operations. HoloClean, for example, uses techniques to prune the domain of database cells and applies judicious relaxations to the underlying model to achieve the required scalability. Older research tools struggled with how to handle the various types of errors, and how to combine the heterogeneous quality input (e.g., business and quality rules, policies, statistical signals in the data, etc.). The HoloClean framework advances the state of the art in two fundamental ways: (1) combining the logical rules and the statistical distribution of the data into one coherent probabilistic model; and (2) scaling the learning and inference process via a series of system and model optimizations, which allowed it to be deployed in census organizations and large commercial enterprises.
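To make the "noisy channel" idea concrete, here is a deliberately tiny sketch (not HoloClean itself, which uses a full probabilistic model): detect violations of a functional dependency (a zip code determines its city) and repair them with the value that co-occurs most often.

```python
from collections import Counter

# Toy records: (city, zip). Domain rule: a zip code determines its city.
rows = [("Boston", "02118"), ("Boston", "02118"), ("Boston", "02118"),
        ("Bostn", "02118"), ("Chicago", "60601")]

# Error detection: find cells that violate the dependency zip -> city.
by_zip = {}
for city, zip_code in rows:
    by_zip.setdefault(zip_code, Counter())[city] += 1

repaired = []
for city, zip_code in rows:
    counts = by_zip[zip_code]
    if len(counts) > 1:  # the rule is violated for this zip
        # "Most probable" repair: the city co-occurring most often with
        # this zip, a crude stand-in for probabilistic inference.
        city = counts.most_common(1)[0][0]
    repaired.append((city, zip_code))

print(repaired[3])  # ('Boston', '02118')
```

The real system combines many such signals (rules, co-occurrence statistics, trusted sources) in one inference step, rather than applying them one at a time.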
Increasing the quality of the available data via either unification or cleaning, or both, is definitely an important and a promising way forward to leverage enterprise data assets. However, the quest for more data is not over, for two main reasons:
- ML models for cleaning and unification often need training data and examples of possible errors or matching records. Depending completely on human labeling for these examples is simply a non-starter; as ML models get more complex and the underlying data sources get larger, the need for more data increases, the scale of which cannot be achieved by human experts.
- Even if we boosted the quality of the available data via unification and cleaning, it still might not be enough to power the even more complex analytics and predictions models (often built as a deep learning model).
An important paradigm for solving both these problems is the concept of data programming. In a nutshell, data programming techniques provide ways to “manufacture” data that we can feed to various learning and predictions tasks (even for ML data quality solutions). In practical terms, “data programming” unifies a class of techniques used for the programmatic creation of training data sets. In this category of tools, frameworks like Snorkel show how to allow developers and data scientists to focus on writing labeling functions to programmatically label data, and then model the noise in the labels to effectively train high-quality models. While using data programming to train high-quality analytics models might be clear, we find it interesting how it is used internally in ML models for the data unification and cleaning we mentioned earlier in this post. For example, tools like Tamr leverage legacy rules written by customers to generate a large amount of (programmatically) labeled data to power its matching ML pipeline. In a recent paper, the HoloClean project showed how to use “data augmentation” to generate many examples of possible errors (from a small seed) to power its automatic error detection model.
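The labeling-function idea can be shown in a few lines. In this toy spam-labeling example (all functions are illustrative, and Snorkel combines votes with a learned generative model rather than the plain majority used here):

```python
def lf_contains_refund(text):
    # Each labeling function is a noisy heuristic that votes
    # SPAM (1), NOT SPAM (0), or abstains (None).
    return 1 if "refund" in text else None

def lf_all_caps(text):
    return 1 if text.isupper() else None

def lf_short_greeting(text):
    return 0 if text.lower().startswith("hi") else None

def label(text, lfs):
    """Combine noisy labeling-function votes by simple majority; data
    programming frameworks instead model each function's accuracy and
    correlations to denoise the combined label."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not None]
    if not votes:
        return None  # no function fired: leave the example unlabeled
    return round(sum(votes) / len(votes))

lfs = [lf_contains_refund, lf_all_caps, lf_short_greeting]
print(label("CLAIM YOUR refund NOW", lfs))  # 1 (spam)
print(label("hi, see you tomorrow", lfs))   # 0 (not spam)
```

Because the functions are code, a handful of them can label millions of examples programmatically, which is exactly the scale human annotators cannot reach.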
The landscape of solutions we presented here for the quest for high-quality data has already been well validated in the market today.
- ML solutions for data unification such as Tamr and Informatica have been deployed at a large number of Fortune-1000 enterprises.
- Automatic data cleaning solutions such as HoloClean already have been deployed by multiple financial services and the census bureaus of various countries.
- As the growing list of Snorkel users suggests, data programming solutions are beginning to change the way data scientists provision ML models.
As we get more mature in understanding the differences between the various problems of integration, cleaning, and automatic data generation, we will see real improvement in handling the valuable data assets in the enterprise.
Machine learning applications rely on three main components: models, data, and compute. A lot of articles are written about new breakthrough models, many of which are created by researchers who publish not only papers, but code written in popular open source libraries. In addition, recent advances in automated machine learning have resulted in many tools that can (partially) automate model selection and hyperparameter tuning. Thus, many cutting-edge models are now available to data scientists. Similarly, cloud platforms have made compute and hardware more accessible to developers.
Models are increasingly becoming commodities. As we noted in the survey results above, the reality is that a lack of high-quality training data remains the main bottleneck in most machine learning projects. We believe that machine learning engineers and data scientists will continue to spend most of their time creating and refining training data. Fortunately, help is on the way: as we’ve described in this post, we are finally beginning to see a class of technologies aimed squarely at the need for quality training data.