Once viewed as less desirable than real data, synthetic data is now seen by some as a panacea. Real data is messy and riddled with bias. New data privacy regulations make it hard to collect. By contrast, synthetic data is pristine and can be used to build more diverse data sets. You can produce perfectly labeled faces, say, of different ages, shapes, and ethnicities to build a face-detection system that works across populations.
But synthetic data has its limitations. If it fails to reflect reality, it could end up producing even worse AI than messy, biased real-world data—or it could simply inherit the same problems. “What I don’t want to do is give the thumbs up to this paradigm and say, ‘Oh, this will solve so many problems,’” says Cathy O’Neil, a data scientist and founder of the algorithmic auditing firm ORCAA. “Because it will also ignore a lot of things.”
Realistic, not real
Deep learning has always been about data. But in the last few years, the AI community has learned that good data is more important than big data. Even small amounts of the right, cleanly labeled data can do more to improve an AI system’s performance than 10 times the amount of uncurated data, or even a more advanced algorithm.
That changes the way companies should approach developing their AI models, says Datagen’s CEO and cofounder, Ofir Chakon. Today, they start by acquiring as much data as possible and then tweak and tune their algorithms for better performance. Instead, they should be doing the opposite: use the same algorithm while improving on the composition of their data.
But collecting real-world data to perform this kind of iterative experimentation is too costly and time intensive. This is where Datagen comes in. With a synthetic data generator, teams can create and test dozens of new data sets a day to identify which one maximizes a model’s performance.
To ensure the realism of its data, Datagen gives its vendors detailed instructions on how many individuals to scan in each age bracket, BMI range, and ethnicity, as well as a set list of actions for them to perform, like walking around a room or drinking a soda. The vendors send back both high-fidelity static images and motion-capture data of those actions. Datagen’s algorithms then expand this data into hundreds of thousands of combinations. The synthesized data is sometimes then checked again. Fake faces are plotted against real faces, for example, to see if they seem realistic.
Datagen is now generating facial expressions to monitor driver alertness in smart cars, body motions to track customers in cashier-free stores, and irises and hand motions to improve the eye- and hand-tracking capabilities of VR headsets. The company says its data has already been used to develop computer-vision systems serving tens of millions of users.
It’s not just synthetic humans that are being mass-manufactured. Click-Ins is a startup that uses synthetic AI to perform automated vehicle inspections. Using design software, it re-creates all car makes and models that its AI needs to recognize and then renders them with different colors, damages, and deformations under different lighting conditions, against different backgrounds. This lets the company update its AI when automakers put out new models, and helps it avoid data privacy violations in countries where license plates are considered private information and thus cannot be present in photos used to train AI.
Mostly.ai works with financial, telecommunications, and insurance companies to provide spreadsheets of fake client data that let companies share their customer database with outside vendors in a legally compliant way. Anonymization can reduce a data set’s richness yet still fail to adequately protect people’s privacy. But synthetic data can be used to generate detailed fake data sets that share the same statistical properties as a company’s real data. It can also be used to simulate data that the company doesn’t yet have, including a more diverse client population or scenarios like fraudulent activity.
Proponents of synthetic data say that it can help evaluate AI as well. In a recent paper published at an AI conference, Suchi Saria, an associate professor of machine learning and health care at Johns Hopkins University, and her coauthors demonstrated how data-generation techniques could be used to extrapolate different patient populations from a single set of data. This could be useful if, for example, a company only had data from New York City’s more youthful population but wanted to understand how its AI performs on an aging population with higher prevalence of diabetes. She’s now starting her own company, Bayesian Health, which will use this technique to help test medical AI systems.
The limits of faking it
But is synthetic data overhyped?
When it comes to privacy, “just because the data is ‘synthetic’ and does not directly correspond to real user data does not mean that it does not encode sensitive information about real people,” says Aaron Roth, a professor of computer and information science at the University of Pennsylvania. Some data generation techniques have been shown to closely reproduce images or text found in the training data, for example, while others are vulnerable to attacks that make them fully regurgitate that data.
This might be fine for a firm like Datagen, whose synthetic data isn’t meant to conceal the identity of the individuals who consented to be scanned. But it would be bad news for companies that offer their solution as a way to protect sensitive financial or patient information.
Research suggests that the combination of two synthetic-data techniques in particular—differential privacy and generative adversarial networks—can produce the strongest privacy protections, says Bernease Herman, a data scientist at the University of Washington eScience Institute. But skeptics worry that this nuance can be lost in the marketing lingo of synthetic-data vendors, which won’t always be forthcoming about what techniques they are using.
People are already using ChatGPT to create workout plans
Hitting the gym
Despite the variable quality of ChatGPT’s fitness tips, some people have actually been following its advice in the gym.
John Yu, a TikTok content creator based in the US, filmed himself following a six-day full-body training program courtesy of ChatGPT. He instructed it to give him a sample workout plan each day, tailored to which bit of his body he wanted to work (his arms, legs, etc), and then did the workout it gave him.
The exercises it came up with were perfectly fine, and easy enough to follow. However, Yu found that the moves lacked variety. “Strictly following what ChatGPT gives me is something I’m not really interested in,” he says.
Lee Lem, a bodybuilding content creator based in Australia, had a similar experience. He asked ChatGPT to create an “optimal leg day” program. It suggested the right sorts of exercises—squats, lunges, deadlifts, and so on—but the rest times between them were far too brief. “It’s hard!” Lem says, laughing. “It’s very unrealistic to only rest 30 seconds between squat sets.”
Lem hit on the core problem with ChatGPT’s suggestions: they fail to consider human bodies. As both he and Yu found out, repetitive movements quickly leave us bored or tired. Human coaches know to mix their suggestions up. ChatGPT has to be explicitly told.
For some, though, the appeal of an AI-produced workout is still irresistible—and something they’re even willing to pay for. Ahmed Mire, a software engineer based in London, is selling ChatGPT-produced plans for $15 each. People give him their workout goals and specifications, and he runs them through ChatGPT. He says he’s already signed up customers since launching the service last month and is considering adding the option to create diet plans too. ChatGPT is free, but he says people pay for the convenience.
What united everyone I spoke to was their decision to treat ChatGPT’s training suggestions as entertaining experiments rather than serious athletic guidance. They all had a good enough understanding of fitness, and what does and doesn’t work for their bodies, to be able to spot the model’s weaknesses. They all knew they needed to treat its answers skeptically. People who are newer to working out might be more inclined to take them at face value.
The future of fitness?
This doesn’t mean AI models can’t or shouldn’t play a role in developing fitness plans. But it does underline that they can’t necessarily be trusted. ChatGPT will improve and could learn to ask its own questions. For example, it might ask users if there are any exercises they hate, or inquire about any niggling injuries. But essentially, it can’t come up with original suggestions, and it has no fundamental understanding of the concepts it is regurgitating
How Roomba tester’s private images ended up on Facebook
A Roomba recorded a woman on the toilet. How did screenshots end up on social media?
This episode we go behind the scenes of an MIT Technology Review investigation that uncovered how sensitive photos taken by an AI powered vacuum were leaked and landed on the internet.
- A Roomba recorded a woman on the toilet. How did screenshots end up on Facebook?
- Roomba testers feel misled after intimate images ended up on Facebook
- Eileen Guo, MIT Technology Review
- Albert Fox Cahn, Surveillance Technology Oversight Project
This episode was reported by Eileen Guo and produced by Emma Cillekens and Anthony Green. It was hosted by Jennifer Strong and edited by Amanda Silverman and Mat Honan. This show is mixed by Garret Lang with original music from Garret Lang and Jacob Gorski. Artwork by Stephanie Arnett.
Jennifer: As more and more companies put artificial intelligence into their products, they need data to train their systems.
And we don’t typically know where that data comes from.
But sometimes just by using a product, a company takes that as consent to use our data to improve its products and services.
Consider a device in a home, where setting it up involves just one person consenting on behalf of every person who enters… and living there—or just visiting—might be unknowingly recorded.
I’m Jennifer Strong and this episode we bring you a Tech Review investigation of training data… that was leaked from inside homes around the world.
Jennifer: Last year someone reached out to a reporter I work with… and flagged some pretty concerning photos that were floating around the internet.
Eileen Guo: They were essentially, pictures from inside people’s homes that were captured from low angles, sometimes had people and animals in them that didn’t appear to know that they were being recorded in most cases.
Jennifer: This is investigative reporter Eileen Guo.
And based on what she saw… she thought the photos might have been taken by an AI powered vacuum.
Eileen Guo: They looked like, you know, they were taken from ground level and pointing up so that you could see whole rooms, the ceilings, whoever happened to be in them…
Jennifer: So she set to work investigating. It took months.
Eileen Guo: So first we had to confirm whether or not they came from robot vacuums, as we suspected. And from there, we also had to then whittle down which robot vacuum it came from. And what we found was that they came from the largest manufacturer, by the number of sales of any robot vacuum, which is iRobot, which produces the Roomba.
Jennifer: It raised questions about whether or not these photos had been taken with consent… and how they wound up on the internet.
In one of them, a woman is sitting on a toilet.
So our colleague looked into it, and she found the images weren’t of customers… they were Roomba employees… and people the company calls ‘paid data collectors’.
In other words, the people in the photos were beta testers… and they’d agreed to participate in this process… although it wasn’t totally clear what that meant.
Eileen Guo: They’re really not as clear as you would think about what the data is ultimately being used for, who it’s being shared with and what other protocols or procedures are going to be keeping them safe—other than a broad statement that this data will be safe.
Jennifer: She doesn’t believe the people who gave permission to be recorded, really knew what they agreed to.
Eileen Guo: They understood that the robot vacuums would be taking videos from inside their houses, but they didn’t understand that, you know, they would then be labeled and viewed by humans or they didn’t understand that they would be shared with third parties outside of the country. And no one understood that there was a possibility at all that these images could end up on Facebook and Discord, which is how they ultimately got to us.
Jennifer: The investigation found these images were leaked by some data labelers in the gig economy.
At the time they were working for a data labeling company (hired by iRobot) called Scale AI.
Eileen Guo: It’s essentially very low paid workers that are being asked to label images to teach artificial intelligence how to recognize what it is that they’re seeing. And so the fact that these images were shared on the internet, was just incredibly surprising, given how incredibly surprising given how sensitive they were.
Jennifer: Labeling these images with relevant tags is called data annotation.
The process makes it easier for computers to understand and interpret the data in the form of images, text, audio, or video.
And it’s used in everything from flagging inappropriate content on social media to helping robot vacuums recognize what’s around them.
Eileen Guo: The most useful datasets to train algorithms is the most realistic, meaning that it’s sourced from real environments. But to make all of that data useful for machine learning, you actually need a person to go through and look at whatever it is, or listen to whatever it is, and categorize and label and otherwise just add context to each bit of data. You know, for self driving cars, it’s, it’s an image of a street and saying, this is a stoplight that is turning yellow, this is a stoplight that is green. This is a stop sign.
Jennifer: But there’s more than one way to label data.
Eileen Guo: If iRobot chose to, they could have gone with other models in which the data would have been safer. They could have gone with outsourcing companies that may be outsourced, but people are still working out of an office instead of on their own computers. And so their work process would be a little bit more controlled. Or they could have actually done the data annotation in house. But for whatever reason, iRobot chose not to go either of those routes.
Jennifer: When Tech Review got in contact with the company—which makes the Roomba—they confirmed the 15 images we’ve been talking about did come from their devices, but from pre-production devices. Meaning these machines weren’t released to consumers.
Eileen Guo: They said that they started an investigation into how these images leaked. They terminated their contract with Scale AI, and also said that they were going to take measures to prevent anything like this from happening in the future. But they really wouldn’t tell us what that meant.
Jennifer: These days, the most advanced robot vacuums can efficiently move around the room while also making maps of areas being cleaned.
Plus, they recognize certain objects on the floor and avoid them.
It’s why these machines no longer drive through certain kinds of messes… like dog poop for example.
But what’s different about these leaked training images is the camera isn’t pointed at the floor…
Eileen Guo: Why do these cameras point diagonally upwards? Why do they know what’s on the walls or the ceilings? How does that help them navigate around the pet waste, or the phone cords or the stray sock or whatever it is. And that has to do with some of the broader goals that iRobot has and other robot vacuum companies has for the future, which is to be able to recognize what room it’s in, based on what you have in the home. And all of that is ultimately going to serve the broader goals of these companies which is create more robots for the home and all of this data is going to ultimately help them reach those goals.
Jennifer: In other words… This data collection might be about building new products altogether.
Eileen Guo: These images are not just about iRobot. They’re not just about test users. It’s this whole data supply chain, and this whole new point where personal information can leak out that consumers aren’t really thinking of or aware of. And the thing that’s also scary about this is that as more companies adopt artificial intelligence, they need more data to train that artificial intelligence. And where is that data coming from? Is.. is a really big question.
Jennifer: Because in the US, companies aren’t required to disclose that…and privacy policies usually have some version of a line that allows consumer data to be used to improve products and services… Which includes training AI. Often, we opt in simply by using the product.
Eileen Guo: So it’s a matter of not even knowing that this is another place where we need to be worried about privacy, whether it’s robot vacuums, or Zoom or anything else that might be gathering data from us.
Jennifer: One option we expect to see more of in the future… is the use of synthetic data… or data that doesn’t come directly from real people.
And she says companies like Dyson are starting to use it.
Eileen Guo: There’s a lot of hope that synthetic data is the future. It is more privacy protecting because you don’t need real world data. There have been early research that suggests that it is just as accurate if not more so. But most of the experts that I’ve spoken to say that that is anywhere from like 10 years to multiple decades out.
Jennifer: You can find links to our reporting in the show notes… and you can support our journalism by going to tech review dot com slash subscribe.
We’ll be back… right after this.
Albert Fox Cahn: I think this is yet another wake up call that regulators and legislators are way behind in actually enacting the sort of privacy protections we need.
Albert Fox Cahn: My name’s Albert Fox Cahn. I’m the Executive Director of the Surveillance Technology Oversight Project.
Albert Fox Cahn: Right now it’s the Wild West and companies are kind of making up their own policies as they go along for what counts as a ethical policy for this type of research and development, and, you know, quite frankly, they should not be trusted to set their own ground rules and we see exactly why with this sort of debacle, because here you have a company getting its own employees to sign these ludicrous consent agreements that are just completely lopsided. Are, to my view, almost so bad that they could be unenforceable all while the government is basically taking a hands off approach on what sort of privacy protection should be in place.
Jennifer: He’s an anti-surveillance lawyer… a fellow at Yale and with Harvard’s Kennedy School.
And he describes his work as constantly fighting back against the new ways people’s data gets taken or used against them.
Albert Fox Cahn: What we see in here are terms that are designed to protect the privacy of the product, that are designed to protect the intellectual property of iRobot, but actually have no protections at all for the people who have these devices in their home. One of the things that’s really just infuriating for me about this is you have people who are using these devices in homes where it’s almost certain that a third party is going to be videotaped and there’s no provision for consent from that third party. One person is signing off for every single person who lives in that home, who visits that home, whose images might be recorded from within the home. And additionally, you have all these legal fictions in here like, oh, I guarantee that no minor will be recorded as part of this. Even though as far as we know, there’s no actual provision to make sure that people aren’t using these in houses where there are children.
Jennifer: And in the US, it’s anyone’s guess how this data will be handled.
Albert Fox Cahn: When you compare this to the situation we have in Europe where you actually have, you know, comprehensive privacy legislation where you have, you know, active enforcement agencies and regulators that are constantly pushing back at the way companies are behaving. And you have active trade unions that would prevent this sort of a testing regime with a employee most likely. You know, it’s night and day.
Jennifer: He says having employees work as beta testers is problematic… because they might not feel like they have a choice.
Albert Fox Cahn: The reality is that when you’re an employee, oftentimes you don’t have the ability to meaningfully consent. You oftentimes can’t say no. And so instead of volunteering, you’re being voluntold to bring this product into your home, to collect your data. And so you’ll have this coercive dynamic where I just don’t think, you know, at, at, from a philosophical perspective, from an ethics perspective, that you can have meaningful consent for this sort of an invasive testing program by someone who is in an employment arrangement with the person who’s, you know, making the product.
Jennifer: Our devices already monitor our data… from smartphones to washing machines.
And that’s only going to get more common as AI gets integrated into more and more products and services.
Albert Fox Cahn: We see evermore money being spent on evermore invasive tools that are capturing data from parts of our lives that we once thought were sacrosanct. I do think that there is just a growing political backlash against this sort of technological power, this surveillance capitalism, this sort of, you know, corporate consolidation.
Jennifer: And he thinks that pressure is going to lead to new data privacy laws in the US. Partly because this problem is going to get worse.
Albert Fox Cahn: And when we think about the sort of data labeling that goes on the sorts of, you know, armies of human beings that have to pour over these recordings in order to transform them into the sorts of material that we need to train machine learning systems. There then is an army of people who can potentially take that information, record it, screenshot it, and turn it into something that goes public. And, and so, you know, I, I just don’t ever believe companies when they claim that they have this magic way of keeping safe all of the data we hand them, there’s this constant potential harm when we’re, especially when we’re dealing with any product that’s in its early training and design phase.
Jennifer: This episode was reported by Eileen Guo, produced by Emma Cillekens and Anthony Green, edited by Amanda Silverman and Mat Honan. And it’s mixed by Garret Lang, with original music from Garret Lang and Jacob Gorski.
Thanks for listening, I’m Jennifer Strong.
The Download: ChatGPT workout plans, and cleaning up aviation
When I opened the email telling me I’d been accepted to run the London Marathon, I felt elated. And then terrified. Barely six months on from my last marathon, I knew how dedicated I’d have to be to keep running day after day, week after week, month after month, through rain, cold, tiredness, grumpiness, and hangovers.
The marathon is the easy part. It’s the constant grind of the training that kills you—and finding ways to keep it fresh and interesting is part of the challenge. Some exercise nuts think they’ve found a way to live their routines up: by using the AI chatbot ChatGPT as a sort of proxy personal trainer.
Its appeal is obvious. ChatGPT answers questions in seconds, saving the need to sift through tons of information, and asking follow-up questions will give you a more detailed and personalized answer. But is ChatGPT really the future of how we work out? Or is it just a confident bullshitter? Read the full story.
How new technologies could clean up air travel
Aviation is a notorious “hard-to-decarbonize” sector. It makes up about 3% of the world’s greenhouse-gas emissions, and airline traffic could more than double from today’s levels by 2050.
When it comes to flying, the technical challenge of cutting emissions is especially steep. Fuels for planes need to be especially light and compact, so planes can make it into the sky and still have room for people or cargo. But the industry has some promising ideas for cleaning up its act—and some of them are already taking off. Read the full story.