- XOXO 2018 Videos — playlist of talks from XOXO 2018. (via BoingBoing)
- Learn Git Branching — visual!
- Post-REST (Tim Bray) — musings on what might replace REST in different parts of the current world of web services.
- Projects — list of practical projects that anyone can solve in any programming language, divided into categories according to what the project will exercise your knowledge of—e.g., Files, Data Structures, Threading, etc. Good for teachers looking for ideas.
In this episode of the Data Show, I spoke with Vitaly Gordon, VP of data science and engineering at Salesforce. As the use of machine learning becomes more widespread, we need tools that will allow data scientists to scale so they can tackle many more problems and help many more people. We need automation tools for the many stages involved in data science, including data preparation, feature engineering, model selection and hyperparameter tuning, as well as monitoring.
I wanted the perspective of someone who is already faced with having to support many models in production. The proliferation of models is still a theoretical consideration for many data science teams, but Gordon and his colleagues at Salesforce already support hundreds of thousands of customers who need custom models built on custom data. They recently took their learnings public and open sourced TransmogrifAI, a library for automated machine learning for structured data, which sits on top of Apache Spark.
Here are some highlights from our conversation:
The need for an internal data science platform
It’s more about how much commonality there is between every single data science use case—how many of the problems are redundant and repeatable.
… A lot of data scientists solve problems that honestly have a lot to do with engineering, a lot to do with things that are not pure modeling.
TransmogrifAI is an automated machine learning library for mostly structured data, and the problem that it aims to solve is that we at Salesforce have hundreds of thousands of customers. While all of them share a common set of data, the Salesforce platform itself is extremely customizable. In fact, 80% of the data inside the Salesforce platform sits in what we refer to as custom objects, which one can think of as custom tables in a database.
… We don’t build models that are shared between customers. We always use a single customer’s data. We have hundreds of thousands of models potentially that we need to build, and because of that, we needed to automate the entire process. We just cannot throw people at the problem. We basically created TransmogrifAI to automate the entire end-to-end process for creating a model for a user and we decided to open source it a couple months ago.
- Black Mirror Brainstorms (Aaron Lewis) — In light of the latest FB scandal, here’s my proposal for replacing Design Sprints: “Black Mirror Brainstorms.” A workshop in which you create a Black Mirror episode. The plot must revolve around misuse of your team’s product. See Casey Fiesler’s Black Mirror, Light Mirror, which I’ve linked to before on 4SL.
- Toolkit Navigator — A compendium of toolkits for public sector innovation and transformation, curated by OPSI and our partners around the world.
- Conjure — Palantir’s open source simple but opinionated toolchain for defining APIs once and generating client/server interfaces in multiple languages. For more, read the blog post.
- Hardware Effects — this repository demonstrates various hardware effects that can degrade application performance in surprising ways and that may be very hard to explain without knowledge of the low-level CPU and OS architecture. For each effect I try to create a proof of concept program that is as small as possible so it can be understood easily. How full stack ARE you?
- Some Requests for Machine Learning Research from the East African Tech Scene — Based on 46 in-depth interviews […] a list of concrete machine learning research problems, progress on which would directly benefit tech ventures in East Africa. Example: Priors for autocorrect and low-literacy SMS use—SMS text contains many language misuses due to a combination of autocorrection and low literacy. E.g., “poultry farmer” becoming “poetry farmer.” Such mistakes are bound to occur in any written language corpus, but engineers working with rural populations in East Africa report that this is a prevalent issue for them, confounding the use of pretrained language models. This problem also exists to some degree in voice data with respect to English spoken in different accents. Priors over autocorrect substitution rules, or custom, per-dialect confusion matrices between phonetically similar words could potentially help. Expect much more work like this as AI/ML moves into non-WEIRD (Western Educated Industrialized Rich Democratic) nations.
- How the Media Gets Tesla Wrong — a reminder that our convenient shorthand and once-over-lightly reading of the news gives a false and rosy picture of what’s possible.
- Why Information Security is Hard: An Economic Perspective — fascinating arguments! I particularly like the statistical argument: a lone attacker might find 10 bugs a year, a well-prepared defender might find 1,000 bugs a year, but if there are 100,000 available bugs for exploitation, then there’s very low probability that the defender found and patched the same bugs that the attacker found…
- DoodleMaster — sketches->UI via a CNN, a proof-of-concept.
- Time is Partial — Even though time naturally feels like a total order, studying distributed systems or weak memory exposes you, head on, to how it isn’t. And that’s precisely because these are both cases where our standard over-approximation of time being total limits performance—which we obviously can’t have.
- Black Mirror, Light Mirror: Teaching Technology Ethics Through Speculation (Casey Fiesler) — This is not a new idea, and I’m certainly not the only one to do a lot of thinking about it (e.g., see “How to Teach Computer Ethics Through Science Fiction”), but I wanted to share two specific exercises that I use and that I think are easily adaptable.
- How I Lost and Regained Control of My Microchip Implant (Vice) — After a year of living with a totally useless NFC implant, I kind of started to like it. That small, almost imperceptible little bump on my left hand was a constant reminder that even the most sophisticated and fool-proof technologies are no match for human incompetence. (via Slashdot)
- System Syzygy — open source puzzle game for Mac, Windows, and Linux. (via Andrew Plotkin)
- The Cliff Nest — sci-fi story with computer security challenges built in.
- Amazon Textract — OCR in the cloud, extracting not just text but also structured tables. Part of a big feature dump Amazon’s done today, including recommendations, AWS on-prem, and a fully managed time series database.
- Quantum Ledger Database — a fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log owned by a central trusted authority. Amazon QLDB tracks each and every application data change and maintains a complete and verifiable history of changes over time. Many of the advantages of a blockchain ledger without the distributed pains. Quantum in the sense of “minimum chunk of something,” not “uses quantum computing.”
- Sennheiser Headset Software Enabled MITM Attacks — When users have been installing Sennheiser’s HeadSetup software, little did they know the software was also installing a root certificate into the Trusted Root CA Certificate store. To make matters worse, the software was also installing an encrypted version of the certificate’s private key that was not as secure as the developers may have thought. This is the price of using software to improve hardware.
- Firecracker — Amazon’s open source virtualization technology that is purpose-built for creating and managing secure, multitenant containers and functions-based services. Docker but for FaaS platforms. Best explanation is on lobste.rs: Firecracker is solving the problem of multitenant container density while maintaining the security boundary of a VM. If you’re entirely running first-party trusted workloads and are satisfied with them all sharing a single kernel and using Linux security features like cgroups, selinux, and seccomp, then Firecracker may not be the best answer. If you’re running workloads from customers similar to Lambda, desire stronger isolation than those technologies provide, or want defense in depth, then Firecracker makes a lot of sense. It can also make sense if you need to run a mix of different Linux kernel versions for your containers and don’t want to spend a whole bare-metal host on each one.
- Amazon Ground Station: Ingest and Process Data from Orbiting Satellites — a sign that space is becoming more mainstream. Also interesting because they’re doing a bunch of processing in EC2 rather than at the base station. General-purpose computers often beat specialized ones.
- Me Bot — A simple tool to make a bot that speaks like you, simply learning from your WhatsApp Chats. (via Hacker News)
- Horizon — FB open sources reinforcement learning platform for large-scale products and services, built on PyTorch.
- Open Source is Not About You (Rich Hickey) — As a user of something open source, you are not thereby entitled to anything at all. You are not entitled to contribute. You are not entitled to features. You are not entitled to the attention of others. You are not entitled to having value attached to your complaints. You are not entitled to this explanation. Tough love talk. See also this statement by the author of the event-stream NPM module, who passed maintenance on to someone who added malware to it. If it’s not fun anymore, you get literally nothing from maintaining a popular package.
- Ganbreeder — explore images created by generative adversarial networks.
- 2018 IFComp Winners — interactive fiction is nextgen chatbot tech. Worth keeping up with to see how they stretch parsers and defy expectations of the genre.
- The Architecture of Closed Worlds (We Make Money Not Art) — One of the most striking lessons of the book is that it is extremely difficult to create a miniaturized world without inheriting some of the problems of the surrounding world. No matter how much control was exerted on the synthetic habitats, no matter how ambitious the vision, the breadth of engineering and human ingeniosity, the results were marred by surprisingly mundane obstacles: gerbils outsmarting the machine, bacteria loss, fingernails and skin infiltrating collectors, or simply the difficulty of implementing behavioural changes. The physical version of online social networks that are shocked to discover their userbase includes pedophiles, racists, stalkers, murderers, nutters, and malicious folks.
- Heaps — a mature cross-platform graphics engine designed for high-performance games. It is designed to leverage modern GPUs that are commonly available on both desktop and mobile devices. 2D and 3D game framework, built on the Haxe language and toolkit.
- dive — tool for exploring each layer in a docker image.
- Probabilistic Models of Cognition — This book explores the probabilistic approach to cognitive science, which models learning and reasoning as inference in complex probabilistic models. We examine how a broad range of empirical phenomena, including intuitive physics, concept learning, causal reasoning, social cognition, and language understanding, can be modeled using probabilistic programs (using the WebPPL language).
- Chinese iPhone Users are Poor — The Shanghai-based firm also found that most iPhone users are unmarried females aged between 18 and 34, who graduated with just a high school certificate and earn a monthly income of below 3,000 yuan (HK$3,800). They are perceived to be part of a group known as the “invisible poor”—those who do not look as poor as their financial circumstances.
- eDEX-UI — a fullscreen desktop application resembling a sci-fi computer interface, heavily inspired from DEX-UI and the TRON Legacy movie effects. It runs the shell of your choice in a real terminal and displays live information about your system. It was made to be used on large touchscreens but will work nicely on a regular desktop computer or perhaps a tablet PC or one of those funky 360° laptops with touchscreens.
- evilginx2 — a man-in-the-middle attack framework used for phishing login credentials along with session cookies, which in turn allows one to bypass 2-factor authentication protection.
- Some Notes About HTTP/3 (Errata Security) — QUIC is really more of a new version of TCP (TCP/2???) than a new version of HTTP (HTTP/3). It doesn’t really change what HTTP/2 does so much as change how the transport works. Therefore, my comments below are focused on transport issues rather than HTTP issues.
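The statistical argument in the "Why Information Security is Hard" item above can be worked through numerically. The sketch below is only an illustration: it assumes (this is not stated in the item) that the attacker and the defender each discover bugs uniformly at random and independently from the same pool of 100,000, so the overlap between their finds follows a hypergeometric distribution. The function and constant names are made up for the example.

```python
from math import comb

# Figures quoted in the item above.
TOTAL_BUGS = 100_000      # bugs available for exploitation
DEFENDER_FOUND = 1_000    # bugs a well-prepared defender finds per year
ATTACKER_FOUND = 10       # bugs a lone attacker finds per year


def overlap_probability(k, total=TOTAL_BUGS, defender=DEFENDER_FOUND,
                        attacker=ATTACKER_FOUND):
    """P(exactly k of the attacker's bugs were also found by the defender).

    Assumption (ours, for illustration): both sides draw bugs uniformly at
    random, without replacement, from the same pool, so the overlap is
    hypergeometric.
    """
    return (comb(defender, k) * comb(total - defender, attacker - k)
            / comb(total, attacker))


def expected_overlap(total=TOTAL_BUGS, defender=DEFENDER_FOUND,
                     attacker=ATTACKER_FOUND):
    """Mean number of the attacker's bugs the defender has already patched."""
    return attacker * defender / total


if __name__ == "__main__":
    print("expected overlap:", expected_overlap())
    print("P(defender patched none of them):", overlap_probability(0))
    print("P(defender patched all of them):", overlap_probability(ATTACKER_FOUND))
```

Under these assumptions the defender has, on average, patched about 0.1 of the attacker's ten bugs, and there is roughly a 90% chance that none of them were patched. Real bug discovery is unlikely to be uniform, so treat the numbers as a rough intuition rather than a measurement.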