Lessons learned while helping enterprises adopt machine learning

In this episode of the Data Show, I spoke with Francesca Lazzeri, an AI and machine learning scientist at Microsoft, and her colleague Jaya Mathew, a senior data scientist at Microsoft. We conducted a couple of surveys this year—“How Companies Are Putting AI to Work Through Deep Learning” and “The State of Machine Learning Adoption in the Enterprise”—and we found that while many companies are still in the early stages of machine learning adoption, there’s considerable interest in moving forward with projects in the near future. Lazzeri and Mathew spend a considerable amount of time interacting with companies that are beginning to use machine learning and have experiences that span many different industries and applications. I wanted to learn some of the processes and tools they use when they assist companies in beginning their machine learning journeys.

Here are some highlights from our conversation:

Team data science process

Francesca Lazzeri: The Data Science Process is a framework that we try to apply in our projects. Everything begins with a business problem, so external customers come to us with a business problem or a process they want to optimize. We work with them to translate these into realistic questions, into what we call data science questions. And then we move to the data portion: what are the different relevant data sources, is the data internal or external? After that, you try to define the data pipeline. We start with the core part of the data science process—that is, data cleaning—and proceed to feature engineering, model building, and model deployment and management.

…There are also usually external agents involved. When I say external agents, I mean there are program managers and business experts who follow us during this process. These are individuals who are the data and domain experts. It’s a very interactive process because you go back and forth trying to understand if what you are building is something that really can be interesting to the business owners.
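
The stages Lazzeri lists (data cleaning, feature engineering, model building, deployment) can be pictured as a chain of small functions. The following is only a minimal sketch under stated assumptions: the toy records, the function names, and the choice of a one-variable least-squares model are all hypothetical and are not taken from the interview or from any Microsoft tooling.

```python
from statistics import mean

# Hypothetical raw records: (ad_spend, units_sold); None marks a missing value.
RAW = [(1.0, 12.0), (2.0, None), (3.0, 31.0), (4.0, 39.0), (None, 50.0), (5.0, 52.0)]


def clean(rows):
    """Data cleaning: drop rows that have a missing field."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def engineer(rows):
    """Feature engineering: center the input around its mean."""
    x_mean = mean(x for x, _ in rows)
    return x_mean, [(x - x_mean, y) for x, y in rows]


def fit(rows):
    """Model building: one-variable least squares on centered inputs."""
    y_mean = mean(y for _, y in rows)
    slope = sum(x * (y - y_mean) for x, y in rows) / sum(x * x for x, _ in rows)
    return slope, y_mean  # with centered inputs, the intercept is the mean of y


def deploy(x_mean, slope, intercept):
    """Deployment: wrap the fitted parameters in a callable predictor."""
    return lambda x: intercept + slope * (x - x_mean)


cleaned = clean(RAW)
x_mean, features = engineer(cleaned)
slope, intercept = fit(features)
predict = deploy(x_mean, slope, intercept)
print(round(predict(6.0), 1))
```

In a real project each stage would be revisited several times with the domain experts, which is the back-and-forth described above.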

What is holding back adoption of machine learning

Jaya Mathew: One of the biggest bottlenecks is lack of talent within the organization. A company needs to either invest in upskilling its existing employees, which tends to be expensive (and many are still trying to figure out whether that investment is worth it), or hire. Hiring for specific skill sets is difficult, as there is a talent shortage everywhere.

Then, in addition to that, there’s also a little bit of hesitation because some of the AI and machine learning models are “black boxes”. … I think many governments and many organizations need to be able to explain what’s going on before they deploy a model.

10 top Java resources on O’Reilly’s online learning platform

We dove into the data on our online learning platform to identify the most-used Java resources. These are the items our platform subscribers regularly turn to as they apply Java in their projects and organizations.

Effective Java, 3rd Edition — Joshua Bloch covers language and library features added in Java 7, 8, and 9, including the functional programming constructs that were added to its object-oriented roots. Many new items have been added, including a chapter devoted to lambdas and streams.

Java 8 and 9 Fundamentals: Modern Java Development with Lambdas, Streams, and Introducing Java 9’s JShell and the Java Platform Module System (JPMS) — Paul Deitel applies the Deitel signature live-code approach to teaching programming and explores the Java language and Java APIs in depth.

Java 8 in Action: Lambdas, streams, and functional-style programming — Raoul-Gabriel Urma, Mario Fusco, and Alan Mycroft cover lambdas, streams, and functional-style programming in this clearly written guide to the new features of Java 8.

Head First Java, 2nd Edition — Bert Bates and Kathy Sierra offer a complete introduction to object-oriented programming and Java.

OCP Oracle Certified Professional Java SE 8 Programmer II — Scott Selikoff and Jeanne Boyarsky bring you a comprehensive companion for preparing for Exam 1Z0-809 as well as upgrade Exam 1Z0-810 and Exam 1Z0-813.

Java Concurrency in Practice — This book arms readers with both the theoretical underpinnings and concrete techniques for building reliable, scalable, maintainable concurrent applications.

Optimizing Java — Chris Newland, James Gough, and Benjamin Evans teach you how to tune Java applications for performance using a quantitative, verifiable approach.

Java: The Complete Reference, 10th Edition — Herbert Schildt covers the entire Java language, including its syntax, keywords, and fundamental programming principles.

Java for Beginners: Step-by-Step Hands-On Guide to Java — Manuj Aggarwal and the TetraTutorials Team bring you a course jam-packed with practical demos, homework assignments, and live coding to help you grasp the complex topics.

Cloud Native Java — Josh Long and Kenny Bastani show Java/JVM developers how to build better software, faster, using Spring Boot, Spring Cloud, and Cloud Foundry.

Four short links: 16 November 2018

  1. IllumiPaper — illuminated elements built into regular paper, with implementation.
  2. sr.ht — (pronounced “sir hat”) a software forge like GitHub or GitLab, but with interesting strengths (e.g., very lightweight pages, and the CI system).
  3. Leak Mitigation Checklist: If you just leaked sensitive information in public source code, read this document as part of your emergency procedure.
  4. Emulating an IBM PC on an ESP8266: an 8086 PC-XT emulation with 640K RAM, 80×25 CGA composite video, and a 1.44MB MS-DOS disk on an ESP12E without additional components. (via Alasdair Allan)

Four short links: 15 November 2018

  1. USA Needs to Pursue Malicious Cyber Actors — a report that argues that the United States currently lacks a comprehensive overarching strategic approach to identify, stop, and punish cyberattackers. (1) There is a burgeoning cybercrime wave. (2) There is a stunning cyber enforcement gap. (3) There is no comprehensive U.S. cyber enforcement strategy aimed at the human attacker. This is definitely a golden age of online crime.
  2. DeepMasterPrints: Generating MasterPrints for Dictionary Attacks via Latent Variable Evolution: MasterPrints are real or synthetic fingerprints that can fortuitously match with a large number of fingerprints, thereby undermining the security afforded by fingerprint systems. Previous work by Roy, et al., generated synthetic MasterPrints at the feature level. In this work, we generate complete image-level MasterPrints known as DeepMasterPrints, whose attack accuracy is found to be much superior than that of previous methods. (via Mikko Hypponen)
  3. The Tripartite Identity Pattern (Randy Farmer) — The three components of user identity are: the account identifier, the login identifier, and the public identifier.
  4. Project VisBug — edit/tweak existing webpages.
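
The three identifiers named in item 3 can be kept as separate fields on a user record. The sketch below is an illustration only: the `User` and `Directory` names, the in-memory dictionaries, and the comments about which identifier is private or changeable are assumptions made here, not details quoted from Farmer's write-up.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class User:
    login_id: str    # typed at sign-in (for example an email address); kept private
    public_id: str   # display name shown to other users; may change
    # Internal key for the account record; generated once and never changed.
    account_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class Directory:
    """Hypothetical in-memory user store keyed by account identifier."""

    def __init__(self):
        self._by_account = {}  # account_id -> User
        self._by_login = {}    # login_id -> account_id

    def register(self, login_id, public_id):
        if login_id in self._by_login:
            raise ValueError("login identifier already in use")
        user = User(login_id, public_id)
        self._by_account[user.account_id] = user
        self._by_login[login_id] = user.account_id
        return user

    def find_by_login(self, login_id):
        # Returns None when the login identifier is unknown.
        return self._by_account.get(self._by_login.get(login_id))

    def change_login(self, account_id, new_login_id):
        if new_login_id in self._by_login:
            raise ValueError("login identifier already in use")
        user = self._by_account[account_id]
        del self._by_login[user.login_id]
        user.login_id = new_login_id
        self._by_login[new_login_id] = account_id


directory = Directory()
ann = directory.register("ann@example.com", "Ann")
original_account = ann.account_id
directory.change_login(original_account, "ann@work.example")
ann.public_id = "Ann B."
```

Because other records would reference only `account_id`, the login and public identifiers can change without touching anything else.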

Four short links: 14 November 2018

  1. Managing Risk in Machine Learning Projects (Ben Lorica) — Considerations for a world where ML models are becoming mission critical.
  2. Transcripts of 2018 IGF — Internet Governance Forum session transcripts.
  3. Featuretools: open source Python framework for automated feature engineering.
  4. Solving Snake — fun exploration of different algorithms you might use to play the Snake game.

Four short links: 30 November 2018

  1. QEMU Advent Calendar: An amazing QEMU disk image every day! It’s that time of year again! See also Advent of Code.
  2. De Facto Closed Source: You want to download thousands of lines of useful, but random, code from the internet, for free, run it in a production web server, or worse, your user’s machine, trust it with your paying users’ data and reap that sweet dough. We all do. But then you can’t be bothered to check the license, understand the software you are running, and still want to blame the people who make your business a possibility when mistakes happen, while giving them nothing for it? This is both incompetence and entitlement.
  3. U.S. Government Wonders What to Limit Exports Of: The representative general categories of technology for which Commerce currently seeks to determine whether there are specific emerging technologies that are essential to the national security of the United States include: (1) Biotechnology, such as: (i) Nanobiology; (ii) Synthetic biology; (iv) Genomic and genetic engineering; or (v) Neurotech. (2) Artificial intelligence (AI) and machine learning technology, such as: (i) Neural networks and deep learning (e.g., brain modeling, time series prediction, classification); (ii) Evolution and genetic computation (e.g., genetic algorithms, genetic programming); (iii) Reinforcement learning; (iv) Computer vision (e.g., object recognition, image understanding); (v) Expert systems (e.g., decision support systems, teaching systems); (vi) Speech and audio processing (e.g., speech recognition and production); (vii) Natural language processing (e.g., machine translation); (viii) Planning (e.g., scheduling, game playing); (ix) Audio and video manipulation technologies (e.g., voice cloning, deepfakes); (x) AI cloud technologies; or (xi) AI chipsets. (3) Position, Navigation, and Timing (PNT) technology. (4) Microprocessor technology, such as: (i) Systems-on-Chip (SoC); or (ii) Stacked Memory on Chip. (5) Advanced computing technology, such as: (i) Memory-centric logic. (6) Data analytics technology, such as: (i) Visualization; (ii) Automated analysis algorithms; or (iii) Context-aware computing. (7) Quantum information and sensing technology, such as (i) Quantum computing; (ii) Quantum encryption; or (iii) Quantum sensing. (8) Logistics technology, such as: (i) Mobile electric power; (ii) Modeling and simulation; (iii) Total asset visibility; or (iv) Distribution-based Logistics Systems (DBLS). (9) Additive manufacturing (e.g., 3D printing); (10) Robotics such as: (i) Micro-drone and micro-robotic systems; (ii) Swarming technology; (iii) Self-assembling robots; (iv) Molecular robotics; (v) Robot compliers; or (vi) Smart Dust. (11) Brain-computer interfaces, such as (i) Neural-controlled interfaces; (ii) Mind-machine interfaces; (iii) Direct neural interfaces; or (iv) Brain-machine interfaces. (12) Hypersonics, such as: (i) Flight control algorithms; (ii) Propulsion technologies; (iii) Thermal protection systems; or (iv) Specialized materials (for structures, sensors, etc.). (13) Advanced Materials, such as: (i) Adaptive camouflage; (ii) Functional textiles (e.g., advanced fiber and fabric technology); or (iii) Biomaterials. (14) Advanced surveillance technologies, such as: Faceprint and voiceprint technologies. It’s a great list of what’s in the next Gartner Hype Cycle report.
  4. The Digital Maginot Line (Renee DiResta) — We know this is coming, and yet we’re doing very little to get ahead of it. No one is responsible for getting ahead of it. […] platforms aren’t incentivized to engage in the profoundly complex arms race against the worst actors when they can simply point to transparency reports showing that they caught a fair number of the mediocre actors. […] The regulators, meanwhile, have to avoid the temptation of quick wins on meaningless tactical bills (like the Bot Law) and wrestle instead with the longer-term problems of incentivizing the platforms to take on the worst offenders (oversight), and of developing a modern-day information operations doctrine.

Four short links: 22 November 2018

  1. XOXO 2018 Videos — playlist of talks from XOXO 2018. (via BoingBoing)
  2. Learn Git Branching — visual!
  3. Post-REST (Tim Bray) — musings on what might replace REST in different parts of the current world of web services.
  4. Projects: list of practical projects that anyone can solve in any programming language, divided into categories according to what the project will exercise your knowledge of—e.g., Files, Data Structures, Threading, etc. Good for teachers looking for ideas.

Building tools for enterprise data science

In this episode of the Data Show, I spoke with Vitaly Gordon, VP of data science and engineering at Salesforce. As the use of machine learning becomes more widespread, we need tools that will allow data scientists to scale so they can tackle many more problems and help many more people. We need automation tools for the many stages involved in data science, including data preparation, feature engineering, model selection and hyperparameter tuning, as well as monitoring.

I wanted the perspective of someone who is already faced with having to support many models in production. The proliferation of models is still a theoretical consideration for many data science teams, but Gordon and his colleagues at Salesforce already support hundreds of thousands of customers who need custom models built on custom data. They recently took their learnings public and open sourced TransmogrifAI, a library for automated machine learning for structured data, which sits on top of Apache Spark.

Here are some highlights from our conversation:

The need for an internal data science platform

It’s more about how much commonality there is between every single data science use case—how many of the problems are redundant and repeatable.

… A lot of data scientists solve problems that honestly have a lot to do with engineering, a lot to do with things that are not pure modeling.

TransmogrifAI

TransmogrifAI is an automated machine learning library for mostly structured data. The problem it aims to solve is that we at Salesforce have hundreds of thousands of customers. While all of them share a common set of data, the Salesforce platform itself is extremely customizable. In fact, 80% of the data inside the Salesforce platform sits in what we refer to as custom objects, which one can think of as custom tables in a database.

… We don’t build models that are shared between customers. We always use a single customer’s data. We have hundreds of thousands of models potentially that we need to build, and because of that, we needed to automate the entire process. We just cannot throw people at the problem. We basically created TransmogrifAI to automate the entire end-to-end process for creating a model for a user and we decided to open source it a couple months ago.

Four short links: 21 November 2018

  1. Black Mirror Brainstorms (Aaron Lewis) — In light of the latest FB scandal, here’s my proposal for replacing Design Sprints: “Black Mirror Brainstorms.” A workshop in which you create a Black Mirror episode. The plot must revolve around misuse of your team’s product. See Casey Fiesler’s Black Mirror, Light Mirror, which I’ve linked to before on 4SL.
  2. Toolkit Navigator: A compendium of toolkits for public sector innovation and transformation, curated by OPSI and our partners around the world.
  3. Conjure — Palantir’s open source simple but opinionated toolchain for defining APIs once and generating client/server interfaces in multiple languages. For more, read the blog post.
  4. Hardware Effects: this repository demonstrates various hardware effects that can degrade application performance in surprising ways and that may be very hard to explain without knowledge of the low-level CPU and OS architecture. For each effect I try to create a proof of concept program that is as small as possible so it can be understood easily. How full stack ARE you?

Four short links: 20 November 2018

  1. Some Requests for Machine Learning Research from the East African Tech Scene: Based on 46 in-depth interviews […] a list of concrete machine learning research problems, progress on which would directly benefit tech ventures in East Africa. Example: Priors for autocorrect and low-literacy SMS use—SMS text contains many language misuses due to a combination of autocorrection and low literacy. E.g., “poultry farmer” becoming “poetry farmer.” Such mistakes are bound to occur in any written language corpus, but engineers working with rural populations in East Africa report that this is a prevalent issue for them, confounding the use of pretrained language models. This problem also exists to some degree in voice data with respect to English spoken in different accents. Priors over autocorrect substitution rules, or custom, per-dialect confusion matrices between phonetically similar words could potentially help. Expect much more work like this as AI/ML moves into non-WEIRD (Western Educated Industrialized Rich Democratic) nations.
  2. How the Media Gets Tesla Wrong — a reminder that our convenient shorthand and once-over-lightly reading of the news gives a false and rosy picture of what’s possible.
  3. Why Information Security is Hard: An Economic Perspective — fascinating arguments! I particularly like the statistical argument: a lone attacker might find 10 bugs a year, a well-prepared defender might find 1,000 bugs a year, but if there are 100,000 available bugs for exploitation, then there’s very low probability that the defender found and patched the same bugs that the attacker found…
  4. DoodleMaster — sketches->UI via a CNN, a proof-of-concept.
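
The statistical argument in item 3 can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes the attacker and the defender each find bugs independently and uniformly at random from the same pool; that independence assumption is a simplification introduced here, and only the three numbers come from the item above.

```python
from math import comb

POOL = 100_000     # bugs available for exploitation
DEFENDER = 1_000   # bugs the defender finds and patches per year
ATTACKER = 10      # bugs the attacker finds per year

# Chance that any single attacker bug is among the defender's patches.
p_patched = DEFENDER / POOL

# Expected number of the attacker's bugs that the defender also patched.
expected_overlap = ATTACKER * p_patched

# Hypergeometric probabilities for the attacker's whole set of bugs.
p_none_patched = comb(POOL - DEFENDER, ATTACKER) / comb(POOL, ATTACKER)
p_all_patched = comb(DEFENDER, ATTACKER) / comb(POOL, ATTACKER)

print(f"single bug patched:    {p_patched:.2%}")
print(f"expected overlap:      {expected_overlap:.2f} bugs")
print(f"no attacker bug fixed: {p_none_patched:.3f}")
print(f"all attacker bugs fixed: {p_all_patched:.1e}")
```

Under these assumptions the defender patches any given attacker bug only 1% of the time, so roughly nine times out of ten every one of the attacker's bugs is still open, even though the defender found 100 times as many bugs.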