Just five years ago, most applications that contained “intelligence” were built by hand, with software developers writing countless lines of code to encode the learned experience of subject matter experts in a particular domain. Not anymore.
In the last decade, as the concept of “big data” emerged as a hot technology topic and took center stage, the early adopters and purveyors of big data tools realized that access to large quantities of data was only the beginning. Creating models and algorithms designed to ferret out insight from the big data hairball became a very real and almost insurmountable challenge.
Why? Because developing insight from terabytes of unstructured data requires a general understanding of the nature of the data being analyzed. Gaining that understanding takes someone, or more realistically an army of analysts, weeks and weeks, if not months, of poring over the data to derive insights.
So the natural subsequent thinking was to question whether technology could be created to automate the development of big data analysis and insights. Enter Artificial Intelligence. AI can process more data than any single human, and in a fraction of the time. Given a powerful enough computer, it can act as a virtual army of data scientists, uncovering hidden patterns and insights within big data. There’s just one catch.
It turns out that AI that can actually perform well requires more data than EVER. And this data needs to be tailored to the specific task at hand. All of a sudden the technology industry is back to square one, and that’s where you come in!
Behind Every Great AI is All of Us
Artificial intelligence, and more specifically the latest machine learning approaches such as deep learning neural networks, require the development of experience, a process often called training. Training can be accomplished by a knowledgeable person supervising a machine learning platform, giving instructions on what to look for, providing examples of how this is done and observing the platform’s results to make corrections. Wash. Rinse. Repeat. This process is called “supervised learning,” and it is slow and tedious. If a single person is training the software, it’s possible to inadvertently introduce biases.
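The wash-rinse-repeat cycle above can be sketched with a classic toy model: a perceptron that is repeatedly shown hand-labeled examples and corrected whenever it predicts wrong. This is a minimal illustration of the supervised-learning loop, not any particular platform’s API.

```python
# Minimal sketch of supervised learning: a human supplies labeled
# examples, the model predicts, and each mistake drives a correction.

def predict(weights, bias, features):
    """Step activation: 1 if the weighted sum is positive, else 0."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0

def train(examples, epochs=20, lr=1):
    """Perceptron rule: nudge weights whenever a prediction is wrong."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):              # Wash. Rinse. Repeat.
        for features, label in examples:
            error = label - predict(weights, bias, features)
            if error:                    # the "supervisor" corrects a mistake
                weights = [w + lr * error * x
                           for w, x in zip(weights, features)]
                bias += lr * error
    return weights, bias

# Hand-labeled "ground truth": output 1 only when both inputs are 1.
labeled = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train(labeled)
print([predict(w, b, f) for f, _ in labeled])  # → [0, 0, 0, 1]
```

After enough corrective passes the model reproduces the human-supplied labels; the same loop at vastly larger scale is what makes ground truth data so valuable.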
Real World Challenges
Let’s say that your company is interested in analyzing all the comments regarding your products from online sales to find out what attributes customers like and what they don’t. You might have created a program that can extract the text from each comment. That is the easy part. Next you need to understand how to identify when a customer makes a positive or a negative comment. Not every comment will be the same. Your staff would need to tag each comment as positive or negative and then identify the subject of the positive or negative comment. Maybe a customer comment includes the phrase “bad battery life”. Whether this is a positive or negative comment would have to be identified and recorded, along with its subject: battery life.
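A tagged comment from this workflow might look like the records below. The field names here are illustrative assumptions, not a standard schema; the point is that each raw comment is paired with a human-verified sentiment and subject.

```python
# Illustrative records from the hand-labeling step described above:
# each comment carries a sentiment tag and the aspect it refers to.

labeled_comments = [
    {"text": "Bad battery life, barely lasts a day.",
     "sentiment": "negative", "aspect": "battery life"},
    {"text": "The camera is fantastic in low light.",
     "sentiment": "positive", "aspect": "camera"},
    {"text": "Screen scratched within a week.",
     "sentiment": "negative", "aspect": "screen"},
]

def aspect_summary(records):
    """Count positive/negative mentions per product aspect."""
    summary = {}
    for r in records:
        counts = summary.setdefault(r["aspect"],
                                    {"positive": 0, "negative": 0})
        counts[r["sentiment"]] += 1
    return summary

print(aspect_summary(labeled_comments))
# → {'battery life': {'positive': 0, 'negative': 1}, ...}
```

Only once thousands of such records exist can a model learn to produce the tags itself.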
Learning by Numbers: Ground Truth Data
Verified data about the data is called “ground truth data”. Not only does AI need ground truth data, it needs a lot of it.
To understand the wide variety of potential comments, AI needs a large number of examples in order to become accurate. Providing one or two examples, or even one hundred, is insufficient. It requires large amounts of this type of data to understand the problem well enough to provide good answers on a wide variety of real-world data.
Learning What’s Useful
Now, it would be one thing if development of ground truth data could stop there. But other aspects must be addressed: relevance and representativeness. Simply put, the problem must be scoped in such a way as to define what the input data set will look like in order to be relevant to solving the problem. If the input data doesn’t align with the real problem/solution, then the AI system will not learn anything useful.
For instance, if your task is to identify all images where a gas station is present and you then input pictures that don’t include gas stations, the ultimate output will be useless. Next you need to ensure that as many variants of gas station pictures as possible are present. Black and white, color, weather, obstructions, shapes, and the presence or absence of vehicles must all be part of the ground truth data. If you only train your system on one company’s stations, then gas stations owned by other companies might not be recognized. So you need to have a fully representative sample data set.
Obtaining comprehensive ground truth isn’t easy. As a result, it traditionally has been expensive to train AI systems to become highly accurate due to the upfront complexity and cost of creating ground truth data.
Into this fray, a lot of novel ideas have evolved on how to get this data to feed AI systems, and it turns out that you are probably helping in more ways than you ever considered.
Turning Interactions into Data Gold
Companies involved with machine learning are using many different and novel ways to create or gain access to the very valuable ground truth data.
The most obvious method is to pay for it. This often occurs through hiring a contract workforce; traditionally a company would go to a service bureau to create a data entry project, but increasingly crowdsourcing services such as Amazon Mechanical Turk, CrowdFlower and Mighty AI offer on-demand data services that allow quick turnaround of ground truth data requirements without a lot of upfront project costs. Interestingly enough, some of these new cloud-based services are looking to deploy machine learning to reduce their own costs.
Machine Learning for Everyone
Nonprofit and academic-oriented collaborations exist to aid with research and development of machine learning applications. For instance, ImageNet has long provided source material, as well as contests, to help improve systems’ ability to process images. The National Institute of Standards and Technology has a curated database that provides source material for the recognition of handwritten numbers. There are also private efforts that open their datasets, such as Google’s OpenImages and YouTube-8M.
Going beyond the “brute force” and collaborative models, there is a lot of creativity in gathering information.
For instance, some commercial organizations offer their machine learning platforms for free to developers to use in their applications. While use is limited, integrating these capabilities into applications provides both samples and feedback data that can be curated and used to further the accuracy of the machine learning applications being used. The more use, the more data gold that can be curated. These include services such as IBM Watson’s AlchemyAPI.
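The feedback loop these services rely on can be sketched as follows. The `classify()` stub below stands in for a remote ML API call, and the whole flow is a hypothetical illustration, not any vendor’s actual interface: the app shows the service’s prediction to a user, and any correction the user supplies becomes fresh ground truth the provider can curate.

```python
# Hypothetical sketch of collecting "data gold" from app usage:
# user corrections to a free ML service become new ground truth.

collected = []  # (input, corrected_label) pairs the provider can curate

def classify(text):
    """Stand-in for a remote sentiment API call (not a real API)."""
    return "positive" if "good" in text.lower() else "negative"

def classify_with_feedback(text, user_correction=None):
    """Return a label, recording any user correction as ground truth."""
    prediction = classify(text)
    if user_correction and user_correction != prediction:
        collected.append((text, user_correction))  # new training example
        return user_correction
    return prediction

classify_with_feedback("Good value for money")  # no correction needed
classify_with_feedback("Not bad at all!", user_correction="positive")
print(collected)  # → [('Not bad at all!', 'positive')]
```

Every correction is precisely the kind of verified example the service would otherwise have to pay a labeling workforce to produce.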
Ubiquitous Data Gathering
Last, and probably most interesting, are the data gathering efforts that are tied to interaction with you and me. If you have ever used an Alexa device and provided feedback, then you are supplying both sample data and its ground truth.
Have you ever filled out a form where you have to identify characters, read a street sign or detect the presence of something within an image before you can be considered “not a robot?” While these types of programs are intended to ferret out real people from automated data harvesting programs, the real benefit to the organization providing the service, such as Google’s reCAPTCHA, is the ground truth data. Every time you answer these questions, you provide it!
Gaming the System
Google and other companies also offer games where your interaction teaches the machine learning program. For instance, Google’s collection of “games” at its AI Experiments site provides fun interaction that both collects ground truth data and even has you training the systems via immediate feedback.
Companies and organizations are getting more and more creative at leveraging everyday interactions with customers, all in an effort to collect data critical for improving existing AI or to create entirely new AI services.
Generative Adversarial Networks or GANs
Eventually, our feedback may not be necessary. The ultimate goal is to not have to train systems at all; they will learn by themselves. But just as humans need teachers, so do machines, which is why forward-thinking companies are experimenting with what are called Generative Adversarial Networks, or GANs, which pit one machine learning application against another, creating a teacher-student relationship. One application creates data intended to “fool” the other neural net, while the other evaluates whether the output represents real or “fake” data based on previous examples.
Some AI implementations have already created photographs that look entirely real even though they are completely fictional. The idea is that if you let these two competing systems play against each other, they will improve each other.
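The adversarial idea can be caricatured in a few lines. Real GANs are neural networks trained with backpropagation; this toy keeps only the teacher-student loop, under simplifying assumptions: the “real” data is numbers near 10, the generator is a single number (the mean of its fakes), and the discriminator is a midpoint boundary separating real samples from fake ones.

```python
# Gradient-free caricature of a GAN: each round the discriminator finds
# the boundary separating real from fake, and the generator shifts
# toward whichever side the discriminator calls "real".
import random

random.seed(42)
real = [random.gauss(10.0, 1.0) for _ in range(500)]  # "real" data

gen_mean = 0.0  # generator starts far from the real distribution
for _ in range(50):
    fake = [random.gauss(gen_mean, 1.0) for _ in range(500)]
    real_avg = sum(real) / len(real)
    fake_avg = sum(fake) / len(fake)
    # Discriminator: the midpoint boundary between the two sample means.
    boundary = (real_avg + fake_avg) / 2
    # Generator update: step toward the "real" side of the boundary.
    gen_mean += 0.5 if real_avg > boundary else -0.5

print(round(gen_mean, 1))  # generator has drifted toward the real mean (~10)
```

After enough rounds the generator’s output is statistically close to the real data, without a human ever labeling anything.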
Machines will teach machines.