- Managing Risk in Machine Learning Projects (Ben Lorica) — Considerations for a world where ML models are becoming mission critical.
- Transcripts of 2018 IGF — Internet Governance Forum session transcripts.
- Featuretools — open source Python framework for automated feature engineering.
- Solving Snake — fun exploration of different algorithms you might use to play the Snake game.
- QEMU Advent Calendar — An amazing QEMU disk image every day! It’s that time of year again! See also Advent of Code.
- De Facto Closed Source — You want to download thousands of lines of useful, but random, code from the internet, for free, run it in a production web server, or worse, your user’s machine, trust it with your paying users’ data and reap that sweet dough. We all do. But then you can’t be bothered to check the license, understand the software you are running, and still want to blame the people who make your business a possibility when mistakes happen, while giving them nothing for it? This is both incompetence and entitlement.
- U.S. Government Wonders What to Limit Exports Of — The representative general categories of technology for which Commerce currently seeks to determine whether there are specific emerging technologies that are essential to the national security of the United States include: (1) Biotechnology, such as: (i) Nanobiology; (ii) Synthetic biology; (iv) Genomic and genetic engineering; or (v) Neurotech. (2) Artificial intelligence (AI) and machine learning technology, such as: (i) Neural networks and deep learning (e.g., brain modeling, time series prediction, classification); (ii) Evolution and genetic computation (e.g., genetic algorithms, genetic programming); (iii) Reinforcement learning; (iv) Computer vision (e.g., object recognition, image understanding); (v) Expert systems (e.g., decision support systems, teaching systems); (vi) Speech and audio processing (e.g., speech recognition and production); (vii) Natural language processing (e.g., machine translation); (viii) Planning (e.g., scheduling, game playing); (ix) Audio and video manipulation technologies (e.g., voice cloning, deepfakes); (x) AI cloud technologies; or (xi) AI chipsets. (3) Position, Navigation, and Timing (PNT) technology. (4) Microprocessor technology, such as: (i) Systems-on-Chip (SoC); or (ii) Stacked Memory on Chip. (5) Advanced computing technology, such as: (i) Memory-centric logic. (6) Data analytics technology, such as: (i) Visualization; (ii) Automated analysis algorithms; or (iii) Context-aware computing. (7) Quantum information and sensing technology, such as (i) Quantum computing; (ii) Quantum encryption; or (iii) Quantum sensing. (8) Logistics technology, such as: (i) Mobile electric power; (ii) Modeling and simulation; (iii) Total asset visibility; or (iv) Distribution-based Logistics Systems (DBLS). 
(9) Additive manufacturing (e.g., 3D printing); (10) Robotics such as: (i) Micro-drone and micro-robotic systems; (ii) Swarming technology; (iii) Self-assembling robots; (iv) Molecular robotics; (v) Robot compliers; or (vi) Smart Dust. (11) Brain-computer interfaces, such as (i) Neural-controlled interfaces; (ii) Mind-machine interfaces; (iii) Direct neural interfaces; or (iv) Brain-machine interfaces. (12) Hypersonics, such as: (i) Flight control algorithms; (ii) Propulsion technologies; (iii) Thermal protection systems; or (iv) Specialized materials (for structures, sensors, etc.). (13) Advanced Materials, such as: (i) Adaptive camouflage; (ii) Functional textiles (e.g., advanced fiber and fabric technology); or (iii) Biomaterials. (14) Advanced surveillance technologies, such as: Faceprint and voiceprint technologies. It’s a great list of what’s in the next Gartner Hype Cycle report.
- The Digital Maginot Line (Renee DiResta) — We know this is coming, and yet we’re doing very little to get ahead of it. No one is responsible for getting ahead of it. […] platforms aren’t incentivized to engage in the profoundly complex arms race against the worst actors when they can simply point to transparency reports showing that they caught a fair number of the mediocre actors. […] The regulators, meanwhile, have to avoid the temptation of quick wins on meaningless tactical bills (like the Bot Law) and wrestle instead with the longer-term problems of incentivizing the platforms to take on the worst offenders (oversight), and of developing a modern-day information operations doctrine.
- XOXO 2018 Videos — playlist of talks from XOXO 2018. (via BoingBoing)
- Learn Git Branching — visual!
- Post-REST (Tim Bray) — musings on what might replace REST in different parts of the current world of web services.
- Projects — list of practical projects that anyone can solve in any programming language, divided into categories according to what the project will exercise your knowledge of—e.g., Files, Data Structures, Threading, etc. Good for teachers looking for ideas.
In this episode of the Data Show, I spoke with Vitaly Gordon, VP of data science and engineering at Salesforce. As the use of machine learning becomes more widespread, we need tools that will allow data scientists to scale so they can tackle many more problems and help many more people. We need automation tools for the many stages involved in data science, including data preparation, feature engineering, model selection and hyperparameter tuning, as well as monitoring.
I wanted the perspective of someone who is already faced with having to support many models in production. The proliferation of models is still a theoretical consideration for many data science teams, but Gordon and his colleagues at Salesforce already support hundreds of thousands of customers who need custom models built on custom data. They recently took their learnings public and open sourced TransmogrifAI, a library for automated machine learning for structured data, which sits on top of Apache Spark.
Here are some highlights from our conversation:
The need for an internal data science platform
It’s more about how much commonality there is between every single data science use case—how many of the problems are redundant and repeatable.
… A lot of data scientists solve problems that honestly have a lot to do with engineering, a lot to do with things that are not pure modeling.
TransmogrifAI is an automated machine learning library for mostly structured data, and the problem that it aims to solve is that we at Salesforce have hundreds of thousands of customers. While all of them share a common set of data, the Salesforce platform itself is extremely customizable. Actually, 80% of the data inside the Salesforce platform sits in what we refer to as custom objects, which one can think of as custom tables in a database.
… We don’t build models that are shared between customers. We always use a single customer’s data. We have hundreds of thousands of models potentially that we need to build, and because of that, we needed to automate the entire process. We just cannot throw people at the problem. We basically created TransmogrifAI to automate the entire end-to-end process for creating a model for a user and we decided to open source it a couple months ago.
- Black Mirror Brainstorms (Aaron Lewis) — In light of the latest FB scandal, here’s my proposal for replacing Design Sprints: “Black Mirror Brainstorms.” A workshop in which you create a Black Mirror episode. The plot must revolve around misuse of your team’s product. See Casey Fiesler’s Black Mirror, Light Mirror, which I’ve linked to before on 4SL.
- Toolkit Navigator — A compendium of toolkits for public sector innovation and transformation, curated by OPSI and our partners around the world.
- Conjure — Palantir’s open source simple but opinionated toolchain for defining APIs once and generating client/server interfaces in multiple languages. For more, read the blog post.
- Hardware Effects — this repository demonstrates various hardware effects that can degrade application performance in surprising ways and that may be very hard to explain without knowledge of the low-level CPU and OS architecture. For each effect I try to create a proof of concept program that is as small as possible so it can be understood easily. How full stack ARE you?
- Some Requests for Machine Learning Research from the East African Tech Scene — Based on 46 in–depth interviews […] a list of concrete machine learning research problems, progress on which would directly benefit tech ventures in East Africa. Example: Priors for autocorrect and low-literacy SMS use—SMS text contains many language misuses due to a combination of autocorrection and low literacy. E.g., “poultry farmer” becoming “poetry farmer.” Such mistakes are bound to occur in any written language corpus, but engineers working with rural populations in East Africa report that this is a prevalent issue for them, confounding the use of pretrained language models. This problem also exists to some degree in voice data with respect to English spoken in different accents. Priors over autocorrect substitution rules, or custom, per–dialect confusion matrices between phonetically similar words could potentially help. Expect much more work like this as AI/ML moves into non-WEIRD (Western Educated Industrialized Rich Democratic) nations.
- How the Media Gets Tesla Wrong — a reminder that our convenient shorthand and once-over-lightly reading of the news gives a false and rosy picture of what’s possible.
- Why Information Security is Hard: An Economic Perspective — fascinating arguments! I particularly like the statistical argument: a lone attacker might find 10 bugs a year, a well-prepared defender might find 1,000 bugs a year, but if there are 100,000 available bugs for exploitation, then there’s very low probability that the defender found and patched the same bugs that the attacker found…
- DoodleMaster — sketches->UI via a CNN, a proof-of-concept.
- Time is Partial — Even though time naturally feels like a total order, studying distributed systems or weak memory exposes you, head on, to how it isn’t. And that’s precisely because these are both cases where our standard over-approximation of time being total limits performance—which we obviously can’t have.
- Black Mirror, Light Mirror: Teaching Technology Ethics Through Speculation (Casey Fiesler) — This is not a new idea, and I’m certainly not the only one to do a lot of thinking about it (e.g., see “How to Teach Computer Ethics Through Science Fiction”), but I wanted to share two specific exercises that I use and that I think are easily adaptable.
- How I Lost and Regained Control of My Microchip Implant (Vice) — After a year of living with a totally useless NFC implant, I kind of started to like it. That small, almost imperceptible little bump on my left hand was a constant reminder that even the most sophisticated and fool-proof technologies are no match for human incompetence. (via Slashdot)
- System Syzygy — open source puzzle game for Mac, Windows, and Linux. (via Andrew Plotkin)
- The Cliff Nest — sci-fi story with computer security challenges built in.
- Amazon Textract — OCR in the cloud, extracting not just text but also structured tables. Part of a big feature dump Amazon’s done today, including recommendations, AWS on-prem, and a fully managed time series database.
- Quantum Ledger Database — a fully managed ledger database that provides a transparent, immutable, and cryptographically verifiable transaction log owned by a central trusted authority. Amazon QLDB tracks each and every application data change and maintains a complete and verifiable history of changes over time. Many of the advantages of a blockchain ledger without the distributed pains. Quantum in the sense of “minimum chunk of something,” not “uses quantum computing.”
- Sennheiser Headset Software Enabled MITM Attacks — When users have been installing Sennheiser’s HeadSetup software, little did they know the software was also installing a root certificate into the Trusted Root CA Certificate store. To make matters worse, the software was also installing an encrypted version of the certificate’s private key that was not as secure as the developers may have thought. This is the price of using software to improve hardware.
- Firecracker — Amazon’s open source virtualization technology that is purpose-built for creating and managing secure, multitenant containers and functions-based services. Docker but for FaaS platforms. Best explanation is on lobste.rs: Firecracker is solving the problem of multitenant container density while maintaining the security boundary of a VM. If you’re entirely running first-party trusted workloads and are satisfied with them all sharing a single kernel and using Linux security features like cgroups, selinux, and seccomp, then Firecracker may not be the best answer. If you’re running workloads from customers similar to Lambda, desire stronger isolation than those technologies provide, or want defense in depth, then Firecracker makes a lot of sense. It can also make sense if you need to run a mix of different Linux kernel versions for your containers and don’t want to spend a whole bare-metal host on each one.
- Amazon Ground Station: Ingest and Process Data from Orbiting Satellites — a sign that space is becoming more mainstream. Also interesting because they’re doing a bunch of processing in EC2 rather than at the basestation. General-purpose computers often beat specialized ones.
- Me Bot — A simple tool to make a bot that speaks like you, simply learning from your WhatsApp Chats. (via Hacker News)
- Horizon — FB open sources reinforcement learning platform for large-scale products and services, built on PyTorch.
- Open Source is Not About You (Rich Hickey) — As a user of something open source, you are not thereby entitled to anything at all. You are not entitled to contribute. You are not entitled to features. You are not entitled to the attention of others. You are not entitled to having value attached to your complaints. You are not entitled to this explanation. Tough love talk. See also this statement by the author of the event-stream NPM module, who passed maintenance onto someone who added malware to it. If it’s not fun anymore, you get literally nothing from maintaining a popular package.
- Ganbreeder — explore images created by generative adversarial networks.
- 2018 IFComp Winners — interactive fiction is nextgen chatbot tech. Worth keeping up with to see how they stretch parsers and defy expectations of the genre.
- The Architecture of Closed Worlds (We Make Money Not Art) — One of the most striking lessons of the book is that it is extremely difficult to create a miniaturized world without inheriting some of the problems of the surrounding world. No matter how much control was exerted on the synthetic habitats, no matter how ambitious the vision, the breadth of engineering and human ingeniosity, the results were marred by surprisingly mundane obstacles: gerbils outsmarting the machine, bacteria loss, fingernails and skin infiltrating collectors, or simply the difficulty of implementing behavioural changes. The physical version of online social networks that are shocked to discover their userbase includes pedophiles, racists, stalkers, murderers, nutters, and malicious folks.
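
The statistical argument quoted in the "Why Information Security is Hard" item above (an attacker finding 10 bugs, a defender finding 1,000, out of 100,000 exploitable bugs) can be checked with a few lines of arithmetic. This is only a rough sketch: it assumes, as a simplification not stated in the source, that both sides discover bugs independently and uniformly at random from the same pool, so the overlap follows a hypergeometric distribution. The variable names are illustrative.

```python
from fractions import Fraction
from math import comb

# Figures taken from the link description above.
TOTAL_BUGS = 100_000      # bugs available for exploitation
DEFENDER_FOUND = 1_000    # bugs the defender finds and patches per year
ATTACKER_FOUND = 10       # bugs the attacker finds per year

# ASSUMPTION (not from the source): each side's finds are a uniformly
# random subset of the pool, drawn independently of the other side.

# Chance that any one particular attacker bug has been patched.
p_single_patched = Fraction(DEFENDER_FOUND, TOTAL_BUGS)

# Expected number of attacker bugs the defender also found.
expected_overlap = ATTACKER_FOUND * p_single_patched

# Chance the defender patched none of the attacker's bugs:
# all 1,000 defender finds fall among the other 99,990 bugs.
p_no_overlap = Fraction(
    comb(TOTAL_BUGS - ATTACKER_FOUND, DEFENDER_FOUND),
    comb(TOTAL_BUGS, DEFENDER_FOUND),
)

# Chance the defender patched every one of the attacker's bugs:
# the 10 attacker bugs are fixed, the other 990 finds are free.
p_all_patched = Fraction(
    comb(TOTAL_BUGS - ATTACKER_FOUND, DEFENDER_FOUND - ATTACKER_FOUND),
    comb(TOTAL_BUGS, DEFENDER_FOUND),
)

if __name__ == "__main__":
    print(f"P(one given attacker bug is patched) = {float(p_single_patched):.2%}")
    print(f"Expected overlap                     = {float(expected_overlap):.2f} bugs")
    print(f"P(defender patched none of them)     = {float(p_no_overlap):.3f}")
    print(f"P(defender patched all of them)      = {float(p_all_patched):.3e}")
```

Under these assumptions, each attacker bug has only a 1% chance of having been patched, and roughly nine times out of ten the defender's thousand fixes miss the attacker's bugs entirely. In practice bug discovery is not uniform, so the numbers are indicative rather than precise.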