- For Micro Robot Insects, Four Wings May Be Better Than Two (IEEE Spectrum) — This robot uses the same sort of piezoelectric actuators as Harvard’s RoboBee, just rotated sideways. At 143 milligrams, it weighs just about as much as a real honeybee, but the key statistic is that it’s capable of lifting an additional 260 mg (at least), which ought to be enough for both sensors and a battery or supercapacitor. The extra power comes from the extra wings, of course, and while you can’t simply double payload capacity by doubling the number of wings, you can, hopefully, go from “not quite enough payload” to “just barely enough payload.”
- Computing Extremely Accurate Quantiles Using t-Digests — We present on-line algorithms for computing approximations of rank-based statistics that give high accuracy, particularly near the tails of a distribution, with very small sketches. Notably, the method allows a quantile q to be computed with an accuracy relative to max(q,1−q) rather than absolute accuracy as with most other methods. This new algorithm is robust with respect to skewed distributions or ordered data sets and allows separately computed summaries to be combined with no loss in accuracy. (via Ellen Friedman)
- GPT-2: Better Language Models (OpenAI) — their first output not released as open source because its text-generation skills are excellent. It could readily be used to make a bot army on Twitter. This indicates a change in where the line between “research best done in the open” and “giving away weapons” is drawn. These findings, combined with earlier results on synthetic imagery, audio, and video, imply that technologies are reducing the cost of generating fake content and waging disinformation campaigns. The public at large will need to become more skeptical of text they find online, just as the “deep fakes” phenomenon calls for more skepticism about images. See also The Verge’s writeup.
- Quantum Computing, Capabilities and Limits: An Interview with Scott Aaronson (GigaOm) — interesting and readable for the non-quantum mechanic. I think it’s too early to identify any Moore’s Law pattern. I mean, for god sakes, we don’t even know which technology is going to be the right one. The community is not converged around whether it’s going to be superconducting or trapped ions or something else. You can make plots of the number of qubits and the coherence time of those qubits, and you do see a strong improvement. But the number of qubits—let’s say it’s gone up from one or two to 20; it’s kind of hard to see an exponential in those numbers.
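To make the t-digest item above a little more concrete, here is a minimal sketch of the core idea: keep a small list of (mean, weight) clusters, and cap each cluster's size with a scale function that is steep near q = 0 and q = 1, so clusters in the tails stay small. The arcsine scale function follows the form described in the paper, as far as I recall it. The greedy merge, the function names, and `delta=100` are illustrative assumptions, not the reference implementation.

```python
import math

def k_scale(q, delta):
    """Map a quantile q in [0, 1] to a cluster index; steepest near 0 and 1."""
    return delta / (2 * math.pi) * math.asin(2 * q - 1)

def compress(clusters, delta=100):
    """Greedily merge (mean, weight) pairs into clusters spanning <= 1 unit of k."""
    items = sorted(clusters)
    total = sum(w for _, w in items)
    merged = []
    mean, weight = items[0]
    done = 0  # total weight of clusters already emitted
    for m, w in items[1:]:
        q_left = done / total
        q_right = (done + weight + w) / total
        if k_scale(q_right, delta) - k_scale(q_left, delta) <= 1:
            weight += w
            mean += (m - mean) * w / weight  # running weighted mean
        else:
            merged.append((mean, weight))
            done += weight
            mean, weight = m, w
    merged.append((mean, weight))
    return merged

def quantile(digest, q):
    """Return the mean of the cluster that contains rank q * total (no interpolation)."""
    target = q * sum(w for _, w in digest)
    seen = 0
    for mean, weight in digest:
        seen += weight
        if seen >= target:
            return mean
    return digest[-1][0]

# Two summaries built separately, then combined by re-compressing their clusters.
evens = compress([(float(x), 1) for x in range(0, 10000, 2)])
odds = compress([(float(x), 1) for x in range(1, 10000, 2)])
combined = compress(evens + odds)
```

Calling `quantile(combined, 0.999)` reads the answer from a cluster holding only a few dozen points at most, while `quantile(combined, 0.5)` reads from a cluster holding a few hundred. That is the sense in which accuracy is better near the tails. A production t-digest also interpolates between clusters, which this sketch skips.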
In this episode of the Data Show, I spoke with Siwei Lyu, associate professor of computer science at the University at Albany, State University of New York. Lyu is a leading expert in digital media forensics, a field of research into tools and techniques for analyzing the authenticity of media files. Over the past year, there have been many stories written about the rise of tools for creating fake media (mainly images, video, audio files). Researchers in digital image forensics haven’t exactly been standing still, though. As Lyu notes, advances in machine learning and deep learning have also found a receptive audience among the forensics community.
We had a great conversation spanning many topics including:
- The many indicators used by forensic experts and forgery detection systems
- Balancing “open” research with risks that come with it, including “tipping off” adversaries
- State-of-the-art detection tools today, and what the research community and funding agencies are working on over the next few years
- Technical, societal, and cultural challenges that come with the rise of fake media
Here are some highlights from our conversation:
Imbalance between digital forensics researchers and forgers
In theory, it looks difficult to synthesize media. This is true, but on the other hand, there are factors to consider on the side of the forgers. The first is the fact that most people working in forensics, like myself, usually just write a paper and publish it. So, the details of our detection algorithms become available immediately. On the other hand, people making fake media are usually secretive; they don’t usually publish the details of their algorithms. So, there’s a kind of imbalance between the information on the forensic side and the forgery side.
The other issue is user habit. The fact that even if some of the fakes are very low quality, a typical user checks it just for a second; sees something interesting, exciting, sensational; and helps distribute it without actually checking the authenticity. This actually helps fake media to broadcast very, very fast. Even though we have algorithms to detect fake media, these tools are probably not fast enough to actually stop the trap.
… Then there are the actual incentives for this kind of work. For forensics, even if we have the tools and the time to catch a piece of fake media, we don’t get anything. But for people actually making the fake media, there is more financial or other forms of incentive to do that.
- The Moral Choice Machine: Semantics Derived Automatically from Language Corpora Contain Human-like Moral Choices — We create a template list of prompts and responses, which include questions such as “Should I kill people?”, “Should I murder people?”, etc., with answer templates of “Yes/no, I should (not).” The model’s bias score is now the difference between the model’s score of the positive response (“Yes, I should”) and that of the negative response (“No, I should not”). For a given choice overall, the model’s bias score is the sum of the bias scores for all question/answer templates with that choice. We ran different choices through this analysis using a Universal Sentence Encoder. Our results indicate that text corpora contain recoverable and accurate imprints of our social, ethical, and even moral choices. Our method holds promise for extracting, quantifying, and comparing sources of moral choices in culture, including technology. (via press release)
- Civilizational HTTP Error Codes (Gavin Starks) — 807 STONE TABLET; CARRIER NOT SUPPORTED.
- Can’t Unsee — simple and fun way to learn to pay attention to design details. (via Alex Dong)
- Rant — all-purpose procedural text library.
- Towards Federated Learning at Scale — research paper from Google on a distributed machine learning approach which enables training on a large corpus of decentralized data residing on devices like mobile phones. They’re working on it for Android; first app is the keyboard: Our system enables one to train a deep neural network, using TensorFlow, on data stored on the phone which will never leave the device. The weights are combined in the cloud with Federated Averaging, constructing a global model which is pushed back to phones for inference. An implementation of Secure Aggregation ensures that on a global level, individual updates from phones are uninspectable. The system has been applied in large-scale applications, for instance in the realm of a phone keyboard.
- Mozilla’s Clever-Commit — By combining data from the bug-tracking system and the version-control system (aka, changes in the code base), Clever-Commit uses artificial intelligence to detect patterns of programming mistakes based on the history of the development of the software. This allows us to address bugs at a stage when fixing a bug is a lot cheaper and less time consuming than upon release. Video.
- SaaS Web Design Trends — everything from where the logo is to what action is being called for to the rise of custom illustrations (versus photographs).
- The Role of Social Context for Fake News Detection — In this paper, we study the novel problem of exploiting social context for fake news detection. We propose a tri-relationship embedding framework TriFN, which models publisher-news relations and user-news interactions simultaneously for fake news classification. We conduct experiments on two real-world data sets, which demonstrate that the proposed approach significantly outperforms other baseline methods for fake news detection. (via Paper a Day)
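The Federated Averaging step described in the “Towards Federated Learning at Scale” item above can be sketched in a few lines. This is a minimal illustration that assumes a toy linear model and plain Python lists. The names `local_update`, `federated_average`, and `training_round` are hypothetical, and the Secure Aggregation layer the paper mentions is left out entirely.

```python
def local_update(weights, examples, lr=0.1):
    """One pass of SGD for a linear model y ≈ w·x on a device's private examples."""
    w = list(weights)
    for x, y in examples:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def federated_average(updates):
    """updates: list of (weights, num_examples). Returns the example-weighted mean."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]

def training_round(global_weights, devices):
    """devices: one list of (x, y) examples per device; only weights are shared."""
    updates = [(local_update(global_weights, ex), len(ex)) for ex in devices]
    return federated_average(updates)

# Toy usage: two devices whose private data both follow y = 2 * x.
devices = [[([1.0], 2.0)], [([2.0], 4.0), ([1.0], 2.0)]]
model = [0.0]
for _ in range(50):
    model = training_round(model, devices)
```

The server only ever sees each device's updated weights and example count, never the examples themselves. Weighting by example count means a device with more data pulls the global model harder.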
In the film Interstellar, Christopher Nolan’s time, space, and logic-bending tale of against-the-odds, post-apocalyptic survival, one of the first signs that things are about to take a turn for the weird is when a small fleet of autonomous tractors bereft of a functioning GPS nevertheless manages to drive itself out of the cornfields to park next to a weatherbeaten farmhouse owned by Matthew McConaughey’s character, a pilot-turned-farmer named Cooper. There are spaceships, sardonic robots, and a very trippy journey through a black hole to the backside of a children’s bookshelf (all set to Hans Zimmer’s brilliant, unfurling score), but for me, it was the tractors that proved the most surprising and also disturbing.
Set in a dustbowl near-future when Earth’s natural capital has somehow been catastrophically squandered, there aren’t enough farmers, or enough un-blighted land, to grow enough food to feed everyone. With labor in short supply, the best fix is a tech fix: farm bots.
The good news is that we are not yet at the point where relocating somewhere off the planet is the only way to “save humanity,” but farm labor has become a serious issue, one made notably worse by the current administration’s immigration policies. The average age for farmers in the U.S. and Canada has crept up to about 60, while the percentage of people working in agriculture has dipped below 2%.
There has also been a significant loss of productive farmland, the result of urban sprawl and widespread land degradation. Meanwhile, the human population continues to grow, more than doubling over the last 50 years to nearly eight billion. To meet the challenge of producing more food with less everything, farm bots are going to be an essential part of the mix—along with practices that restore soil health while reducing the need for chemical inputs, policies that protect farmland, biotech to develop crops better able to survive the many challenges of a changing climate, and improved logistics for food storage and distribution.
Tech has always been integral to agriculture, from the first stick used to scratch a hole in the ground to plant a seed, to plows that turn the soil and scythes to harvest wheat. Modern farmers routinely use satellite data, drone surveillance, and soil analysis to figure out exactly when to plant, what to water, where to spray, and how to harvest. A combine is a dazzling, giant factory on wheels. That said, knowing there is a human driving the electronics-laden, mechanical beast somehow reaffirms the proper order of things. Large, autonomous machines, on the other hand, are the stuff of nightmares (see I, Robot).
The team at Canadian agritech startup DOT, however, remains undeterred. The eponymous DOT is a u-shaped, diesel-powered, autonomous (or remote-controlled) platform that can be hooked up to all sorts of farm machinery and programmed using a Windows Surface Pro that pulls in field-specific data stored in the cloud. According to its developers, its many benefits include:
- Saving more than 20% on farm fuel, labour, and equipment capital costs
- Reducing CO2 emissions by 20%
- Gaining more than 20% on equipment’s future trade-in value
Farm bots actually come in a variety of shapes and sizes:
Notably, only a few weeks after Engadget published the above video in August of 2017, John Deere bought Blue River, one of the companies profiled. Two years ago, investments in agritech topped $700 million, according to the Financial Times. That figure more than doubled in 2018, according to a report by Finistere Ventures, which invests in the sector. Although it is unclear how much of that money was specifically put into farm bot startups, whatever the number, it is growing quickly.
These are still early days. The AI that runs the bots gets a little savvier with each new data point. Sensors, cameras, and the dexterity of robot “hands” for picking and sorting produce are also improving. Diesel will eventually give way to cleaner fuels, making an industry fundamentally dependent on the environment that much eco-friendlier.
Still, I can’t wait to dig my hands in the dirt of my garden, and I am about to start some seedlings in egg shells I’ve saved specially for the purpose (the shells provide a little hit of calcium). Yes, the snow is falling, but spring will come and with it that marvelous scent of a living earth.
It just wouldn’t be right to let farm bots have all the fun.
- Sidewalk Labs and Cellphone Data (The Intercept) — To make these measurements, the program gathers and de-identifies the location of cellphone users, which it obtains from unspecified third-party vendors. It then models this anonymized data in simulations—creating a synthetic population that faithfully replicates a city’s real-world patterns but that “obscures the real-world travel habits of individual people,” as Bowden told The Intercept.
- Zobrist Hashing — a hash function construction used in computer programs that play abstract board games, such as chess and Go, to implement transposition tables, a special kind of hash table that is indexed by a board position and used to avoid analyzing the same position more than once.
- Software Optimization Resources — the hard stuff (from my perspective higher up the stack), from C++ through assembly down to the microarchitecture of CPUs.
- Lighting up my DasKeyboard with Blood Sugar changes using my body’s REST API (Scott Hanselman) — However, since the keyboard has a localhost REST API and so does my blood sugar, I busted out this silly little shell script.
- Reflecting on The Soul of a New Machine (Bryan Cantrill) — re-reading the book now from start to finish has given new parts depth and meaning. Aspects that were more abstract to me as an undergraduate—from the organizational rivalries and absurdities of the industry to the complexities of West’s character and the tribulations of the team down the stretch—are now deeply evocative of concrete episodes of my own career.
- ExFaKT — a framework for explaining facts over knowledge graphs and text. […] ExFaKT uses background knowledge encoded in the form of Horn clauses to rewrite the fact in question into a set of other easier-to-spot facts.
- FreedomEV — third-party Linux for your rooted Tesla.
- Redesigning the System — Music is abundant; purpose is scarce.
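For the Zobrist hashing item above, the construction is short enough to sketch: assign a random bitstring to every (square, piece) pair, and XOR together the bitstrings for the pieces on the board. Because XOR is its own inverse, a move updates the hash in constant time. The 64-square board, piece letters, 64-bit width, fixed seed, and function names below are illustrative assumptions.

```python
import random

PIECES = "PNBRQKpnbrqk"  # white and black chess pieces
_rng = random.Random(2019)  # fixed seed so hashes are reproducible across runs
TABLE = [[_rng.getrandbits(64) for _ in PIECES] for _ in range(64)]

def zobrist_hash(board):
    """board: dict mapping square index (0-63) to a piece letter."""
    h = 0
    for square, piece in board.items():
        h ^= TABLE[square][PIECES.index(piece)]
    return h

def move_piece(h, piece, src, dst):
    """Incrementally update hash h for a non-capturing move from src to dst."""
    p = PIECES.index(piece)
    return h ^ TABLE[src][p] ^ TABLE[dst][p]

# A transposition table is then just a mapping from hash to cached analysis.
transposition_table = {}
start = {4: "K", 60: "k", 12: "P"}
h = zobrist_hash(start)
transposition_table[h] = "evaluation of the starting position"
h_after = move_piece(h, "P", 12, 28)
```

A full engine would also hash side to move, castling rights, and en passant state, and would handle captures by XORing out the captured piece. Those are omitted here.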
Profiles of IT executives suggest that many are planning to spend significantly in cloud computing and AI over the next year. This concurs with survey results we plan to release over the next few months. In a forthcoming survey, “Evolving Data Infrastructure,” we found strong interest in machine learning (ML) among respondents across geographic regions. Not only are companies interested in tools, technologies, and people who can advance the use of ML within their organizations, they are beginning to build the core foundational technologies needed to sustain their usage of analytics and ML. With that said, important challenges remain. In other surveys we ran, we found “lack of skilled people,” “lack of data,” and cultural and organizational challenges as the leading obstacles cited for holding back the adoption of machine learning and AI.
In this post, I’ll describe some of the core technologies and tools companies are beginning to evaluate and build. Many companies are just beginning to address the interplay between their suite of AI, big data, and cloud technologies. I’ll also highlight some interesting use cases and applications of data, analytics, and machine learning. The resource examples I’ll cite will be drawn from the upcoming Strata Data conference in San Francisco, where leading companies and speakers will share their learnings on the topics covered in this post.
AI and machine learning in the enterprise
When asked what holds back the adoption of machine learning and AI, survey respondents for our upcoming report, “Evolving Data Infrastructure,” cited “company culture” and “difficulties in identifying appropriate business use cases” among the leading reasons. Attendees of the Strata Business Summit will have the opportunity to explore these issues through training sessions, tutorials, briefings, and real-world case studies from practitioners and companies. Recent improvements in tools and technologies have meant that techniques like deep learning are now being used to solve common problems, including forecasting, text mining and language understanding, and personalization. We’ve assembled sessions from leading companies, many of which will share case studies of applications of machine learning methods, including multiple presentations involving deep learning:
Foundational data technologies
Machine learning and AI require data, specifically labeled data for training models. There are many articles that point to the explosion of data, but in order for that data to be useful for analytics and ML, it has to be collected, transported, cleaned, stored, and combined with other data sources. Thus, our surveys have shown that companies tend to apply machine learning and AI in areas where they have prior simpler use cases (business intelligence and analytics) that required data technologies to already be in place. In our upcoming report, “Evolving Data Infrastructure,” respondents indicated they are beginning to build essential components needed to sustain machine learning and AI within their organizations:
Take data lineage, an increasingly important consideration in an age when machine learning, AI, security, and privacy are critical for companies. At Strata Data San Francisco, Netflix, Intuit, and Lyft will describe internal systems designed to help users understand the evolution of available data resources. As companies ingest and use more data, there are many more users and consumers of that data within their organizations. Data lineage, data catalog, and data governance solutions can increase usage of data systems by enhancing trustworthiness of data. Moving forward, tracking data provenance is going to be important for security, compliance, and for auditing and debugging ML systems.
Companies are embracing AI and data technologies in the cloud
In the survey behind our upcoming report, “Evolving Data Infrastructure,” we found that 85% of respondents had data infrastructure in at least one of the seven cloud providers we listed, with nearly two-thirds (63%) using Amazon Web Services (AWS) for some portion of their data infrastructure. We found companies run a mix of open source technologies and managed services, and many respondents indicated they used more than one cloud provider.
This agrees with other surveys I’ve come across that indicated IT executives plan to invest a significant portion of their budgets in cloud computing resources and services.
Security and privacy
Regulations in Europe (GDPR) and California (Consumer Privacy Act) have placed concepts like “user control” and “privacy-by-design” at the forefront for companies wanting to deploy ML. With these new regulations in mind, the research community has stepped up, and new privacy-preserving tools and techniques, including differential privacy, are becoming available for both business intelligence and ML applications. Strata Data San Francisco will feature sessions on important topics including: data security and data privacy; the use of data, analytics, and ML in (cyber)security; privacy-preserving analytics; and secure machine learning.
When it comes to ethics, it’s fair to say the data community (and the broader technology community) is very engaged. As I noted in an earlier post, the next-generation data scientists and data engineers are undergoing training and engaging in discussions pertaining to ethics. Many universities are offering courses; some, like UC Berkeley, have multiple courses. We’re at the point where companies are beginning to formulate and share some best practices and processes. We are pleased to announce that we have a slate of tutorials and sessions (and a full day of presentations dedicated to ethics) at the upcoming Strata Data conference in San Francisco.
Use cases and solutions
Data, machine learning, and AI are impacting companies across industries and geographic locations. Companies are beginning to build key components including solutions that address data lineage and data governance, as well as tools that can increase the productivity of their data scientists (“data science platforms”). Many technologies and techniques are general purpose and cut across domains and industries. However, there are tools and methods that are used more heavily in certain verticals, and more importantly, we all like learning what our industry peers have been building and thinking about. Here are some related talks from a few verticals:
- Blazer — Explore your data with SQL. Easily create charts and dashboards, and share them with your team.
- FPG-1 — PDP-1 FPGA implementation in Verilog, with CRT, Teletype, and Console. The PDP-1 was groundbreaking: serial number 0 was delivered to the BBN offices where Licklider would see it as a way forward to his timesharing vision. From The Dream Machine: “The PDP-1 was revolutionary,” Fredkin declares, still marveling four decades later. “Today such things don’t happen. Today a machine comes along and is slightly faster than its competitors. But here was a machine that was off the charts. Its price performance ratio was spectacularly better than anything that had come before.”
- ClusterFuzz — a scalable fuzzing infrastructure that finds security and stability issues in software. See Google’s announcement of the open-sourcing of it.
- Questions for a New Technology — They aren’t particularly subtle in their bias. They aren’t supposed to be. They also aren’t meant to be a series of boxes to be checked or hoops to be jumped through.
- Hamlet in Virtual Reality — context for WGBH’s Hamlet 360. It’s 360° video, so you can pick what you look at but not where you look at it from. Interesting work, and a reminder that we’re still trying to figure out what kinds of stories these media lend themselves to, and how best to tell stories with them.
- Self-Taught Robot Figures Out What It Looks Like and What It Can Do — To begin with, the robot had no idea what shape it was and behaved like an infant, moving randomly while attempting various tasks. Within about a day of intensive learning, the robot built up an internal picture of its structure and abilities. After 35 hours, the robot could grasp objects from specific locations and drop them in a receptacle with 100% accuracy. Paper is behind a paywall, though Sci-Hub has it.
- Bubble Sort: An Archaeological Algorithmic Analysis — Text books, including books for general audiences, invariably mention bubble sort in discussions of elementary sorting algorithms. We trace the history of bubble sort, its popularity, and its endurance in the face of pedagogical assertions that code and algorithmic examples used in early courses should be of high quality and adhere to established best practices. This paper is more an historical analysis than a philosophical treatise for the exclusion of bubble sort from books and courses. However, sentiments for exclusion are supported by Knuth: “In short, the bubble sort seems to have nothing to recommend it, except a catchy name and the fact that it leads to some interesting theoretical problems.” Although bubble sort may not be a best practice sort, perhaps the weight of history is more than enough to compensate and provide for its longevity.
- Comprehensive Survey on Graph Neural Networks — We propose a new taxonomy to divide the state-of-the-art graph neural networks into different categories. With a focus on graph convolutional networks, we review alternative architectures that have recently been developed; these learning paradigms include graph attention networks, graph autoencoders, graph generative networks, and graph spatial-temporal networks. We further discuss the applications of graph neural networks across various domains and summarize the open source codes and benchmarks of the existing algorithms on different learning tasks. Finally, we propose potential research directions in this fast-growing field.
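For readers who have not seen the algorithm discussed in the bubble sort item above, here is a minimal sketch. It repeatedly swaps adjacent out-of-order elements, so the largest remaining value "bubbles" to the end of the list on each pass. The early exit when a pass makes no swaps is a common refinement, not something the paper is quoted on here.

```python
def bubble_sort(items):
    """Return a sorted copy of items using adjacent swaps (O(n^2) comparisons)."""
    a = list(items)
    for end in range(len(a) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:  # no swaps in this pass: the list is already sorted
            break
    return a
```

In practice, Python's built-in `sorted` is the better choice. The sketch is only useful for seeing why the algorithm is easy to teach and slow to run.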