- ncform — a nice way to develop forms by generating them from configuration.
- GitHub Sponsors — allowing donations.
- Starlink — SpaceX is developing a low latency, broadband internet system to meet the needs of consumers across the globe. Enabled by a constellation of low Earth orbit satellites, Starlink will provide fast, reliable internet to populations with little or no connectivity, including those in rural communities and places where existing services are too expensive or unreliable.
- Gallery of Programmer Interfaces — These images bear witness to the passionate work of so many people striving to improve programming. So often the cobbler’s children are barefoot.
In this episode of the Data Show, I spoke with Jike Chong, chief data scientist at Acorns, a startup focused on building tools for micro-investing. Chong has extensive experience using analytics and machine learning in financial services, and he has experience building data science teams in the U.S. and in China.
We had a great conversation spanning many topics, including:
- Potential applications of data science in financial services.
- The current state of data science in financial services in both the U.S. and China.
- His experience recruiting, training, and managing data science teams in both the U.S. and China.
Here are some highlights from our conversation:
Opportunities in financial services
There’s a customer acquisition piece and then there’s a customer retention piece. For customer acquisition, we can see that new technologies can really add value by looking at all sorts of data sources that can help a financial service company identify who they want to target to provide those services. So, it’s a great place where data science can help find the product market fit, not just at one instance like identifying who you want to target, but also in a continuous form where you can evolve a product and then continuously find the audience that would best fit the product and continue to analyze the audience so you can design the next generation product. … Once you have a specific cohort of users who you want to target, there’s a need to be able to precisely convert them, which means understanding the stage of the customer’s thought process and understanding how to form the narrative to convince the user or the customer that a particular piece of technology or particular piece of service is the current service they need.
… On the customer serving or retention side, for financial services we commonly talk about building hundred-year businesses, right? They have to be profitable businesses, and for financial service to be profitable, there are operational considerations—quantifying risk requires a lot of data science; preventing fraud is really important, and there is garnering the long-term trust with the customer so they stay with you, which means having the work ethic to be able to take care of customer’s data and able to serve the customer better with automated services whenever and wherever the customer is. It’s all those opportunities where I see we can help serve the customer by having the right services presented to them and being able to serve them in the long term.
Opportunities in China
A few important areas in the financial space in China include mobile payments, wealth management, lending, and insurance—basically, the major areas for the financial industry.
For these areas, China may be a forerunner in using internet technologies, especially mobile internet technologies for FinTech, and I think the wave started way back in the 2012/2013 time frame. If you look at mobile payments, like Alipay and WeChat, those have hundreds of millions of active users. The latest data from Alipay is about 608 million users, and these are monthly active users we’re talking about. This is about two times the U.S. population actively using Alipay on a monthly basis, which is a crazy number if you consider all the data that can generate and all the things you can see people buying to be able to understand how to serve the users better.
If you look at WeChat, they’re boasting one billion users, monthly active users, early this year. Those are the huge players, and with that amount of traffic, they are able to generate a lot of interest for the lower-frequency services like wealth management and lending, as well as insurance.
- Few-Shot Adversarial Learning of Realistic Neural Talking Head Models — astonishing work, where you can essentially do deep-fakes from one or two photos. See the YouTube clip for amazing footage of it learning from historical photos and even a painting. (via Dmitry Ulyanov)
- Basis Universal GPU Texture Codec — open source codec for a super-compressed image file format that can be quickly transcoded to something ready for GPUs. See this Hacker News comment for a very readable explanation of why it’s important for game developers.
- Serenity — open source OS for x86 machines, which seems like Unix with Windows 98 UI.
- The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction — We present a rubric as a set of 28 actionable tests, and offer a scoring system to measure how ready for production a given machine learning system is. With an implementation in Excel.
- Software-Defined Memory in Warehouse-Scale Computers (ACM) — when you’re Google, you invent new types of memory. In this case, a cheaper “far memory” that is slower than DRAM but faster than Flash. Of course you do!
- ZetaSQL — Google’s SQL parser and analyzer. Cf Apache Calcite. (via Hacker News)
- Wolfram Engine — a locally downloadable Wolfram Engine to put computational intelligence into your applications. The Free Wolfram Engine for Developers is available for pre-production software development.
- Love Your Job? Someone May be Taking Advantage of You (Duke) — people see it as more acceptable to make passionate employees do extra, unpaid, and more demeaning work than they did for employees without the same passion. Which goes some way to explaining why I’ve found passion to be strongly correlated with burnout.
- Computational Socioeconomics — In this review, we will make a brief manifesto about a new interdisciplinary research field named Computational Socioeconomics, followed by a detailed introduction about data resources, computational tools, data-driven methods, theoretical models, and novel applications at multiple resolutions—including the quantification of global economic inequality and complexity, the map of regional industrial structure and urban perception, the estimation of individual socioeconomic status and demographic, and the real-time monitoring of emergent events.
- Microsoft Applying AI to Entire Developer Lifecycle — Microsoft looks at three different types of code when gathering data: source code—logic and markup (e.g., structure, logic, declarations, comments, variables), distinct learning from public, org, and personal repositories; metadata—interactions (e.g., pull requests, bugs/tickets, codeflow), telemetry (e.g., diagnostics for your app, profiling, etc.); and adjacent sources—documentation, tutorials, and samples; discussion forums (e.g., StackOverflow, Teams / Slack).
- Report from the AMP Advisory Committee Meeting — We heard, several times, that publishers don’t like AMP. They feel forced to use it because otherwise they don’t get into Google’s news carousel—right at the top of the search results.
- Social Media’s Enduring Effect on Adolescent Life Satisfaction (PNAS) — We found that social media use is not, in and of itself, a strong predictor of life satisfaction across the adolescent population. Instead, social media effects are nuanced, small at best, reciprocal over time, gender specific, and contingent on analytic methods.
In this post, I share slides and notes from a keynote I gave at the Strata Data Conference in London earlier this year. I will highlight the results of a recent survey on machine learning adoption, and along the way describe recent trends in data and machine learning (ML) within companies. This is a good time to assess enterprise activities, as there are many indications a number of companies are already beginning to use machine learning. For example, in a July 2018 survey that drew more than 11,000 respondents, we found strong engagement among companies: 51% stated they already had machine learning models in production.
With all the hype around AI, it can be tempting to jump into use cases involving data types with which you aren’t familiar. We found that companies that have successfully adopted machine learning do so either by building on existing data products and services, or by modernizing existing models and algorithms. Here are some typical ways organizations begin using machine learning:
- Build upon existing analytics use cases: e.g., one can use existing data sources for business intelligence and analytics, and use them in an ML application.
- Modernize existing applications such as recommenders, search ranking, time series forecasting, etc.
- Use ML to unlock new data types—e.g., images, audio, video.
- Tackle completely new use cases and applications.
Consider deep learning, a specific form of machine learning that resurfaced in 2011/2012 due to record-setting models in speech and computer vision. While we continue to read about impressive breakthroughs in speech and computer vision, companies are beginning to use deep learning to augment or replace existing models and algorithms. A famous example is Google’s machine translation system, which shifted from “stats focused” approaches to TensorFlow. In our own conferences, we see strong interest in training sessions and tutorials on deep learning for time series and natural language processing—two areas where organizations likely already have existing solutions, and for which deep learning is beginning to show some promise.
Machine learning is not only appearing in more products and systems, but as we noted in a previous post, ML will also change how applications themselves get built in the future. Developers will find themselves increasingly building software that has ML elements. Thus, many developers will need to curate data, train models, and analyze the results of models. With that said, we are still in a highly empirical era for ML: we need big data, big models, and big compute.
If anything, deep learning models are even more data hungry than previous algorithms favored by data scientists. Data is key to machine learning applications, and getting data flowing, cleaned, and in usable form is going to be key to sustaining a machine learning practice.
With an eye toward the growing importance of machine learning, we recently completed a data infrastructure survey that drew more than 3,200 respondents. Our goal was twofold: (1) find out what tools and platforms people are using, and (2) determine whether or not companies are building the foundational tools needed to sustain their ML initiatives. Many respondents signaled that they were using open source tools (Apache Spark, Kafka, TensorFlow, PyTorch, etc.) and managed services in the cloud.
One of the main questions we asked was: what are you currently building or evaluating?
- Not surprisingly, data integration and ETL were among the top responses, with 60% currently building or evaluating solutions in this area. In an age of data-hungry algorithms, everything really begins with collecting and aggregating data.
- An important part of getting your data ready for machine learning is to normalize, standardize, and augment it with other data sources. 52% of survey respondents indicated they were building or evaluating solutions for data preparation and cleaning. These include human-in-the-loop systems for data preparation: these are tools that allow domain experts to train automated systems to do data preparation and cleaning at scale. In fact, there is an exciting new research area called data programming, which unifies techniques for the programmatic creation of training sets.
- You also need solutions that let you understand what data you have and who can access it. About a third of the respondents in the survey indicated they are interested in data governance systems and data catalogs. Some companies are beginning to build their own solutions, and several will be presenting them at Strata Data in NYC this coming Fall—e.g., Marquez (WeWork) and Databook (Uber). But this is also an area where startups—Alation, Immuta, Okera, and others—are beginning to develop interesting offerings.
- 21% of survey respondents said they are building or evaluating data lineage solutions. In the past, we got by with a casual attitude toward data sources. Discussions of data ethics, privacy, and security have made data scientists aware of the importance of data lineage and provenance. Specifically, companies will need to know where the data comes from, how it was gathered, and how it was modified along the way. The need to audit or reproduce ML pipelines is increasingly a legal and security issue. Fortunately, we are beginning to see open source projects (including DVC, Pachyderm, Delta Lake, Dolt) that address the need for data lineage and provenance. At recent conferences, we’ve also had talks from companies that have built data lineage systems—Intuit, Lyft, Accenture, and Netflix, among others—and there will be more presentations on data lineage solutions at Strata Data in New York City this coming fall.
- As the number of data scientists and machine learning engineers grows within an organization, tools have to be standardized, models and features need to be shared, and automation starts getting introduced. 58% of survey respondents indicated they are building or evaluating data science platforms. Our Strata Data conference consistently features several sessions on how companies built their internal data science platforms, specifically in regard to what tradeoffs and design choices they made, and what lessons they’ve learned along the way.
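To make the data lineage item above more concrete, here is a minimal sketch of what recording lineage could look like: where the data comes from, how it was gathered, and how it was modified along the way. Everything in it is an assumption made for illustration (the `LineageLog` class, the `fingerprint` helper, and the `payments_export.csv` source name are hypothetical), and it does not reflect how DVC, Pachyderm, Delta Lake, or any other project mentioned above actually works.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Callable, List


def fingerprint(rows: List[dict]) -> str:
    """Return a stable content hash so any change to the data is detectable."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass
class LineageLog:
    """Hypothetical record of a dataset's origin and every change applied to it."""
    source: str               # where the data comes from
    collection_method: str    # how it was gathered
    steps: List[dict] = field(default_factory=list)  # how it was modified

    def apply(self, name: str,
              transform: Callable[[List[dict]], List[dict]],
              rows: List[dict]) -> List[dict]:
        """Run one transformation and log the input and output fingerprints."""
        result = transform(rows)
        self.steps.append({
            "step": name,
            "input": fingerprint(rows),
            "output": fingerprint(result),
        })
        return result


# Illustrative data and source name; both are made up for this sketch.
raw = [{"user": "a", "amount": 10.0}, {"user": "b", "amount": None}]
log = LineageLog(source="payments_export.csv",
                 collection_method="nightly batch export")
clean = log.apply(
    "drop_missing_amount",
    lambda rows: [r for r in rows if r["amount"] is not None],
    raw,
)
```

Because each step stores a hash of its input and output, an auditor could later check that a pipeline run started from the expected data and see which step changed it.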
What about the cloud? In our recent survey, we found a majority are already using a public cloud for portions of their data infrastructure, and more than a third have been using serverless. We have had many training sessions, tutorials, and talks on serverless at recent conferences, including a talk by Eric Jonas on a recent paper laying out the UC Berkeley view on serverless, followed by a talk by Avner Braverman on the role of serverless in AI and data applications.
Companies are just getting started building machine learning applications, and I believe the use of machine learning will continue to grow over the next few years for a couple of reasons:
- 5G is beginning to be rolled out, and 5G will lead to the development of machine-to-machine applications, many of which will incorporate ML.
- Specialized hardware for machine learning (specifically, deep learning) will come online: we are already seeing new hardware for model inference for edge devices and servers. Sometime in Q3/Q4 of 2019, specialized hardware for training deep learning models will become available. Imagine systems that will let data scientists and machine learning experts run experiments at a fraction of the cost and a fraction of the time. This new generation of specialized hardware for machine learning training and inference will allow data scientists to explore and deploy many new types of models.
There are a couple of early indicators that ML will continue to grow within companies, and both point to the growing number of companies interested in productionizing machine learning. First, while we read a lot of articles in the press about data scientists, a few years ago a new role dedicated to productionizing ML began to emerge.
Machine learning engineers sit between data science and engineering/ops; they tend to be higher paid than data scientists, and they generally have stronger technical and programming skills. As my Twitter poll above suggests, there seem to be early indications that data scientists are “rebranding” themselves into this new job title.
Another signal that interest in ML is increasing emerges when you look at the traction of new projects like MLflow: in just about 10 months since it launched, we already see strong interest from many companies. As we noted in a previous post, a common use case for MLflow is experiment tracking and management—before MLflow, there weren’t good open source tools for this. Projects like MLflow and Kubeflow (as well as products from companies like comet.ml and Verta.AI) make ML development easier for companies to manage.
MLflow is an interesting new tool, but it is focused on model development. As your machine learning practice expands to many parts of your organization, it becomes clear that you’ll need other specialized tools. In speaking with many companies that have built data platforms and infrastructure for machine learning, a few important factors arise that have to be taken into account as you design your toolchain:
- Support for different modeling approaches and tools: while deep learning has become more important, the reality is that even the leading technology companies use a variety of modeling approaches including SVM, XGBoost, and statistical learning methods.
- Duration and frequency of model training will vary, depending on the use case, the amount of data, and the specific type of algorithms used.
- How much model inference is involved in specific applications?
Just like data are assets that require specialized tools (including data governance solutions and data catalogs), models are also valuable assets that will need to be managed and protected. As we noted in a previous post, tools for model governance and model operations will also be increasingly critical: the next big step in the democratization of machine learning is making it more manageable. Model governance and model ops will require solutions that contain items like:
- A database for authorization and security: who has read/write access to certain models
- A catalog or a database that lists models, including when they were tested, trained, and deployed
- Metadata and artifacts needed for audits
- Systems for deployment, monitoring, and alerting: who approved and pushed the model out to production, who is able to monitor its performance and receive alerts, and who is responsible for it
- A dashboard that provides custom views for all principals (operations, ML engineers, data scientists, business owners)
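The items above can be read as a rough data model. The sketch below is one hypothetical way to combine the catalog, the authorization data, and the deployment metadata in a single record. All class, field, and user names are assumptions made for illustration, and a real system would back this with a database, not an in-memory dictionary.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class ModelRecord:
    """One catalog entry; every field name here is illustrative."""
    name: str
    version: int
    owner: str                                      # who is responsible for the model
    readers: Set[str] = field(default_factory=set)  # who has read access
    writers: Set[str] = field(default_factory=set)  # who has write access
    trained_on: Optional[date] = None
    tested_on: Optional[date] = None
    deployed_on: Optional[date] = None
    approved_by: Optional[str] = None               # who approved the push to production
    audit_artifacts: Dict[str, str] = field(default_factory=dict)


class ModelCatalog:
    """In-memory stand-in for the catalog and the authorization database."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, int], ModelRecord] = {}

    def register(self, record: ModelRecord) -> None:
        self._records[(record.name, record.version)] = record

    def deploy(self, user: str, name: str, version: int,
               approved_by: str, on: date) -> None:
        """Mark a model as deployed, but only for users with write access."""
        record = self._records[(name, version)]
        if user not in record.writers:
            raise PermissionError(f"{user} lacks write access to {name} v{version}")
        record.deployed_on = on
        record.approved_by = approved_by

    def deployed(self) -> List[ModelRecord]:
        """List every model currently marked as deployed."""
        return [r for r in self._records.values() if r.deployed_on is not None]


# Illustrative usage with made-up model and user names.
catalog = ModelCatalog()
catalog.register(ModelRecord(name="fraud_scorer", version=1, owner="ml-eng",
                             readers={"alice", "bob"}, writers={"alice"},
                             trained_on=date(2019, 5, 1)))
catalog.deploy("alice", "fraud_scorer", 1, approved_by="carol", on=date(2019, 5, 20))
```

Keeping access rights, lifecycle dates, and the approver on the same record is a design shortcut for the sketch. The point is only that audit questions such as "who pushed this model, and when?" become simple lookups once this metadata is captured.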
Companies are learning that there are many important considerations that arise with the use of ML. Thankfully, the research community has begun rolling out techniques and tools to address some of the important challenges ML presents, including fairness, explainability, safety and reliability, and especially security and privacy. Machine learning often interacts with and impacts users, so companies not only need to put in place processes that will let them deploy ML responsibly, they also need to build foundational technologies that will allow them to retain oversight, particularly when things go wrong. The technologies I’ve alluded to above—data governance, data lineage, model governance—are all going to be useful for helping manage these risks. In particular, auditing and testing machine learning systems will rely on many of the tools I’ve described above.
There are real, not just theoretical, risks and considerations. These foundational tools will increasingly be essential and no longer optional. For example, a recent DLA Piper survey provides an estimate of GDPR breaches that have been reported to regulators: more than 59,000 personal data breaches as of February 2019.
While we tend to think of ML as producing a “model” or “algorithm” that we deploy, auditing ML systems can be challenging, as there are actually two algorithms to keep track of:
- The actual model that one deploys and uses in an application or product
- Another algorithm (the “trainer” and “pipeline”) that uses data to produce the model that best optimizes some objective function.
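A toy example may help separate the two algorithms. In the sketch below, `train` plays the role of the trainer: it takes data and returns the model that minimizes squared error for a straight-line fit. The returned function is the model that would actually be deployed. The choice of ordinary least squares, the function names, and the data points are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Model = Callable[[float], float]


def train(data: List[Tuple[float, float]]) -> Model:
    """The trainer: fit y = slope * x + intercept by ordinary least squares."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in data)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x

    def model(x: float) -> float:
        """The model: the artifact that gets deployed and called at inference time."""
        return slope * x + intercept

    return model


# Made-up training data that lies exactly on the line y = 2x + 1.
model = train([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
prediction = model(3.0)
```

Auditing only `model` tells you what the deployed function computes. Explaining why it computes that also requires the trainer code and the exact data passed to it, which is why both need to be tracked.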
So, managing ML really means building a set of tools that can manage a series of interrelated algorithms. Based on the survey results I’ve described above, companies are beginning to build the important foundational technologies—data integration and ETL, data governance and data catalogs, data lineage, model development and model governance—that are important to sustaining a responsible machine learning practice.
But challenges remain, particularly as the use of ML grows within companies that are already having to grapple with many IT, software, and cloud solutions (besides having to manage the essential task of “keeping the lights on”). The good news is that there are early indicators that companies are beginning to acknowledge the need to build or acquire the requisite foundational technologies.
- Basic Account Hygiene to Prevent Hijacking (Google) — SMS 2FA blocked 100% of automated bots, 96% of bulk phishing attacks, and 76% of targeted attacks. On-device prompts, a more secure replacement for SMS, helped prevent 100% of automated bots, 99% of bulk phishing attacks and 90% of targeted attacks.
- Conversational AI Playbook — The detailed instructions, practical advice, and real-world examples provided here should empower developers to improve the quality and variety of conversational experiences of the coming months and years.
- Falsehoods Programmers Believe about Unix Time — These three facts all seem eminently sensible and reasonable, right? Unix time is the number of seconds since 1 January 1970 00:00:00 UTC. If I wait exactly one second, Unix time advances by exactly one second. Unix time can never go backward. False, false, false.
- Testing and Debugging in Machine Learning (Google) — Testing and debugging machine learning systems differs significantly from testing and debugging traditional software. This course describes how, starting from debugging your model all the way to monitoring your pipeline in production.
- Six Buckets of Productsec — There are six buckets a security bug can fall into on its journey through life: Prevented—best outcome, never turned into code. Found automatically—found via static analysis or other tools, “cheap” time cost. Found manually—good even if it took more time; a large set of bugs can only be found this way. Found externally—usually via bug bounty, put users at real risk, expensive time cost but 100x better than other outcomes. Never found—most bugs probably end up here. Exploited—the worst.
- ShadowHammer (Bruce Schneier) — The common thread through all of the above-mentioned cases is that attackers got valid certificates and compromised their victims’ development environments.
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks — dense, randomly initialized, feed-forward networks contain subnetworks (“winning tickets”) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.
- Christchurch Call — first time governments and companies have, en masse, sat at a table to figure out how to curb violent extremist content on the platforms.
- The Platform Challenge (Alex Stamos) — absolute cracker of a talk about regulating the social media platforms. Must watch.
- Amazon’s Away Teams — Capturing the way things are at an organization as large as Amazon is always a challenge. The company has never publicly codified its management system as it has done for its leadership principles. But this picture might offer new ideas for people seeking to coordinate technology development at scale. (via Simon Willison)
- Why I Still Love Tech (Wired) — I love the whole made world. But I can’t deny that the miracle is over, and that there is an unbelievable amount of work left for us to do.
- Illustrated Machine Learning Cheatsheets — what it says on the box.
Proposals submitted for the O’Reilly Software Architecture Conference serve as a valuable weather vane, providing direction for developers and architects from what some of the leading names in the field propose as sessions. These go-to experts and practitioners work on the front lines of technology, and they understand that business and software architecture need to operate in harmony to support overall organizational success.
Our recent analysis of speaker proposals from the O’Reilly Software Architecture Conference turned up a number of interesting findings:
- Microservices was the No. 1 term in the proposals. This topic remains a bedrock concept in the software architecture space.
- A big year-over-year jump in serverless, up 89 slots, suggests increased interest, exploration, and experimentation around this nascent and evolving topic for software architects.
- AI (No. 45) and Machine Learning (No. 20) ranked well individually in the most frequently referenced topics, but if you combine them they would rise to the No. 7 topic overall. The increase of AI/ML in proposals is likely tied to the need for more skills development in the software architecture space as well as AI/ML’s role in monitoring and reliability.
- Kubernetes rose 72 positions year-over-year, reflecting how important orchestration has become to software architects who increasingly plan distributed systems.
- Monitoring grew from an unranked term in 2018 to the No. 48 term in 2019. This signals that DevOps-related topics are becoming important for those working on software architecture.
The increased interest in the proposals for serverless, Kubernetes, and AI/ML shows what we see roiling the software architecture ecosystem—i.e., the effect of organizations migrating to the cloud and distributed microservices, the potential to instantiate microservices as serverless components, and the increasing importance of data for monitoring and automation.
The following table shows topics of interest from our analysis of proposals from the 2018 Software Architecture Conference in New York and our upcoming 2019 Software Architecture Conference in San Jose. We used a form of the Term Frequency-Inverse Document Frequency (TF/IDF) technique to identify and rank the top terms.
|Term||Software Architecture ’19 Proposal Rank||Software Architecture ’18 Proposal Rank||Rank Change: 2019 vs 2018|
We focused this list on important industry terms and terms showing notable year-over-year changes. We omitted stop-like words such as “software,” “system,” and “how” from the results. Unranked means a term had fewer than three instances in the corpus of proposals submitted for an event.
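For readers unfamiliar with TF/IDF, the ranking method described above can be sketched roughly as follows. The exact variant used for the analysis is not specified here, so this is a textbook version with a smoothed IDF, not the actual implementation. The function names and the sample proposals are assumptions; the stop-like words and the three-instance threshold come from the description above.

```python
import math
from collections import Counter
from typing import Dict, List, Optional

STOP_LIKE = {"software", "system", "how"}  # stop-like words named in the text
MIN_INSTANCES = 3                          # fewer instances means "unranked"


def rank_terms(proposals: List[List[str]]) -> Dict[str, int]:
    """Rank terms across tokenized proposals by summed TF-IDF (1 = top term)."""
    n_docs = len(proposals)
    total = Counter()     # instances of each term across the whole corpus
    doc_freq = Counter()  # number of proposals that mention each term
    for tokens in proposals:
        kept = [t for t in tokens if t not in STOP_LIKE]
        total.update(kept)
        doc_freq.update(set(kept))

    scores = {}
    for term, count in total.items():
        if count < MIN_INSTANCES:
            continue  # too rare to rank
        # Smoothed inverse document frequency (an assumed variant).
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        scores[term] = count * idf  # raw term frequency summed over proposals

    ordered = sorted(scores, key=lambda t: (-scores[t], t))
    return {term: position for position, term in enumerate(ordered, start=1)}


def rank_change(current: Dict[str, int], previous: Dict[str, int],
                term: str) -> Optional[int]:
    """Positions gained since the previous year; None if unranked in either."""
    if term not in current or term not in previous:
        return None
    return previous[term] - current[term]


# Made-up tokenized proposals, for illustration only.
sample = [
    ["microservices", "microservices", "serverless", "software"],
    ["microservices", "kubernetes"],
    ["microservices", "serverless"],
    ["microservices", "how"],
    ["serverless", "mesh"],
]
ranks = rank_terms(sample)
```

A positive value from `rank_change` means a term moved up the list, which matches how the rank changes are described in this analysis.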
Microservices still matter
“Microservices,” the No. 1 term in proposals, may seem like it’s past its prime given how much has already been written and said about it, but microservices remains directional or at least aspirational to software architecture for many organizations. Consistent interest in this topic in the speaker proposals suggests organizations that have already committed to microservices are still pushing to get the most out of their investments. Note that adopting microservices doesn’t mean you’ve mastered it.
Microservices is also an essential part of the Next Architecture, the trend we’re tracking that sees organizations embracing the combination of cloud, containers, orchestration, and microservices to meet customer expectations for availability, features, and performance. We expect microservices to continue as a top concept in Software Architecture Conference proposals for the foreseeable future.
Serverless and service mesh are ascendant
Some view serverless as an unproven topic, and clearly we have a lot to learn about moving toward mainstream use of serverless architecture. However, the uptick in the term “serverless” in the proposals is hard to ignore (up 89 slots to the No. 7 position), suggesting increased exploration, investigation, and experimentation with serverless implementations. Serverless is a good fit for software architects, as it abstracts the messy implementation, hardware, and other elements of a design. With serverless filling a need for software architects, we expect it to continue as a notable proposal term.
The term “mesh,” used to describe the service mesh frameworks used to coordinate between microservices without using shared assets, made a big jump: from unranked in 2018 to its current rank of No. 35. For the conceptual work that software architects engage in, service mesh allows operational coupling while avoiding domain coupling and has become the dominant operational reuse pattern in the microservices world.
Embracing AI and machine learning
The ranks of “AI” (No. 45) and “machine learning” (No. 20) both increased individually, and combining the two topics would create the 7th-ranked term overall. We see AI/ML becoming more important to software architecture on two fronts: putting AI/ML into production requires new skills and, for most organizations, a steep learning curve; AI/ML can also be used to help address software architecture issues such as monitoring and reliability. Clearly the software architecture community has embraced the two related topics in a big way.
A need to understand Kubernetes
“Kubernetes,” the open source container orchestration tool, is another Next Architecture-related term that made a strong showing in our proposal results. In 2019, Kubernetes moved up 72 slots to the No. 34 position.
Software architects won’t become Kubernetes experts—implementation and management is typically handled by infrastructure teams—but with Kubernetes growing in importance in the enterprise, architects need to understand how this powerful and complicated tool can best be applied in their organizations.
Mature concepts make inroads
The term “domain,” most likely associated with the “domain-driven design” approach to software development, was up 563 positions to the No. 26 slot in the 2019 proposal data. “Testing” was up 85 positions year over year, taking the No. 29 slot. “Agile” increased 64 slots to No. 38. Why do we see these mature topics showing up in the proposals? The data suggests these established topics from the broader technology world have made inroads with software architects. Software architecture needs to reflect the entire business, so as these topics grow within organizations their relevance to software architects also grows.
Noteworthy trends in the second-tier proposal topics
Second-tier topics are often mentioned in the proposal data, but less frequently than the top-tier topics covered above. While these second-tier topics show more rank volatility over time, careful investigation can turn up topics worth paying attention to, either as poised to climb to the top tier, or topics that are losing their currency with the proposing cohort.
|Term||Software Architecture ’19 Proposal Rank||Software Architecture ’18 Proposal Rank||Rank Change: 2019 vs 2018|
Unranked means a term had fewer than three instances in the corpus of proposals submitted for an event.
The term “monitoring” went from unranked in 2018 to the No. 48 term in 2019. This is a sign that DevOps-related topics have become more important for those working on software architecture. Combined with the increased ranks of the terms “testing” and “agile” in the top-tier topics, we see a trend of software architecture making a more pronounced embrace of DevOps-style deployment and production-ready designs, along with the continuing uptick in the engineering practices espoused by continuous delivery.
The rise of “open source” to a mid-tier rank of No. 80 shows that software architects, even while working at the conceptual phase of system design, are turning more regularly to open source tools and platforms.
The growth in the terms “monolith” (No. 99), used in the context of migrating to a microservices architecture or describing legacy architecture, and “migration” (No. 191), suggests that proposers are increasingly distinguishing between monolith and microservice applications.
The increase in “Istio” mentions (No. 150), a tool for implementing a service mesh to connect and monitor distributed microservices, reinforces the focus on microservices and provides empirical support for the rapidly increasing interest in service mesh among our cohort of session proposers.
Finally, the drop in position for the term “web” to a rarely mentioned rank of No. 482 shows a trend we notice in other areas. What we used to call “web” is now just computing, and there’s no need to call out the web anymore. It’s the water we swim in.
The proposal analysis shows us the broad and growing swath of factors that must be considered by the software architecture practitioner. There are too many topics for any practitioner to know every facet of the software architecture ecosystem. To perform efficiently, software architects should avoid getting bogged down in too much technical detail. They need just enough technical knowledge and an expansive approach to keeping up with the changing nature of software architecture to grow with this dynamic field.