Synthetic data may sound new, but it has been around for decades. As early as the 1930s, there were already efforts to work with synthetic voice and audio. It was in the early 1990s, however, that researchers began exploring the idea of using manufactured data for statistical analysis without infringing on privacy rights.

Fast forward to the present decade and the world is seeing projects like MIT's Synthetic Data Vault, which is "a one-stop shop where users can get as much data as they need for their projects," as described in an article on MIT News. This project provides open source tools that expand access to data without compromising privacy.

The use of synthetic data is already part of mainstream science and technology, and various industries employ it. One of the most prominent consumers of this type of data is the machine learning sector, particularly for imaging applications.

OneView, a company that produces synthetic data for the remote sensing industry, conducted a case study on using synthetic data together with real data to improve detection results in satellite imagery. The results suggest that the benefits of synthetic data are real.

"When the sample size is small, synthetic data performs much better than real data, resulting in a high-quality algorithm - achieved in a shorter time and with lesser effort and lower cost," the OneView study concluded.

Not fake data

Synthetic data is not fake data, although it is technically manufactured data. Some may use these terms interchangeably, but the connotations differ: synthetic data is not information meant to mislead or to manipulate responses.

As the term suggests, synthetic data is artificially generated. However, it is not created out of thin air, without basis. It is produced by funnelling real-world information through an algorithm to churn out a new set of data designed to simulate real-world scenarios. Synthetic data mimics the statistical features of real information without being a duplicate or a mere copy of something else.
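The idea of mimicking statistical features without copying records can be illustrated with a minimal Python sketch. It assumes the simplest possible generator: fit a mean and standard deviation to one real numeric column, then draw fresh values from the fitted Gaussian. The function names and the sample measurements are hypothetical, and production generators (including those mentioned in this article) are likely far more sophisticated.

```python
import random
import statistics

def fit_gaussian(values):
    """Estimate the mean and standard deviation of one numeric column."""
    return statistics.mean(values), statistics.stdev(values)

def sample_synthetic(values, n, seed=0):
    """Draw n new values from a Gaussian fitted to the real column.

    Assumption: the column is roughly bell-shaped. Correlations with
    other columns are ignored in this simplified sketch.
    """
    mu, sigma = fit_gaussian(values)
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical "real" measurements (illustrative numbers only).
real_heights_cm = [158.0, 162.5, 171.0, 175.5, 168.0, 180.0, 165.5, 173.0]

# Generate far more synthetic rows than there are real ones.
synthetic_heights_cm = sample_synthetic(real_heights_cm, n=1000)

print("real mean:", round(statistics.mean(real_heights_cm), 2))
print("synthetic mean:", round(statistics.mean(synthetic_heights_cm), 2))
```

The synthetic column tracks the mean and spread of the real one, yet its values are newly drawn rather than copied from any original record.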

Synthetic data can be used to augment insufficient actual data, as demonstrated by the case study conducted by OneView. It is not produced just to provide random information and muddy statistical analysis. It is even used to address the bias problem in machine learning (more on this later). In other words, synthetic data is meant to advance scientific studies and not introduce variables that skew or manipulate results.

Can be better than real data

There are many advantages to using synthetic data. Companies like OneView would not be building a business model around synthetic data if it were not useful and practical. OneView's synthetic datasets have been helping machine learning systems improve, particularly in the field of AI imaging.

Synthetic data costs considerably less than conducting actual surveys or research to obtain real-world information. Advanced video game engines, for example, can be employed to create 3D virtual worlds with detailed landscapes and bird's-eye-view perspectives. They provide a highly viable alternative to drone footage or aerial shots.

Game graphics engines have become so advanced over the years that they are now being used in scientific research. As a conference paper presented at the New Frontiers for Entertainment Computing concludes, "the visualization of scientific data with game engines is possible and leads to promising results." Synthetic visual data is already recognized as valuable for statistical analysis and for accumulating information useful in training AI systems and testing algorithms.

It is safe to say that synthetic data can be better than actual data in several instances. It is also highly scalable. Surveys and other research tasks do not need to be repeated when working at a different scale, as synthetic data can be generated rapidly and adjusted to changing needs.

How synthetic data improves machine learning

As scientific studies have shown, manufactured data can serve as an excellent alternative to actual data. "Synthetic data has the potential to increase the performance of machine learning, especially in case of unbalanced datasets," writes noted data engineer Anna Marek in a technical blog on The Data Lab.
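To make the unbalanced-dataset case concrete, here is a minimal Python sketch of one common idea: creating extra minority-class samples by interpolating between existing ones. This is a simplification in the spirit of SMOTE, not the method described in the quoted blog. Real SMOTE interpolates toward nearest neighbors, while this sketch picks random pairs. The function name and the data points are hypothetical.

```python
import random

def oversample_minority(minority, n_new, seed=0):
    """Create n_new synthetic points by interpolating between random
    pairs of minority-class samples (a simplified, SMOTE-like idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct real samples
        t = rng.random()                # position along the segment a -> b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# Hypothetical two-feature dataset: 8 majority samples, 3 minority samples.
majority = [(5.0, 5.0), (5.5, 4.5), (6.0, 5.0), (6.5, 6.0),
            (5.0, 6.5), (7.0, 5.5), (6.0, 4.0), (5.5, 6.0)]
minority = [(1.0, 2.0), (2.0, 3.5), (1.5, 3.0)]

# Add just enough synthetic minority samples to balance the classes.
new_points = oversample_minority(minority, len(majority) - len(minority))
balanced_minority = minority + new_points

print(len(majority), len(balanced_minority))
```

Because every new point lies on a line segment between two real minority samples, the synthetic samples stay inside the region the minority class already occupies.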

Increased efficiency and bias attenuation: These are the top benefits of using synthetic data in machine learning. As mentioned, synthetic data is significantly cheaper and faster to produce. It is also highly scalable. It can train AI systems much faster and with greater accuracy given the more precise labeling or annotation that comes with synthetic data.

The customizable data generated by OneView, for example, comes with advanced annotation to make it more convenient to use compared to acquiring real-world data. Labels or annotations can be produced as the data is generated. There is no need to hire more people to scrutinize details and perform manual tagging, marking, or labeling. This also means greater accuracy as the possibility of human errors is avoided.
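The point that labels can be produced as the data is generated can be sketched in a few lines of Python. The example below is a toy, not OneView's pipeline: it draws rectangular "objects" onto a blank grid and records each bounding box at the moment of drawing, so no manual tagging step is needed. The function name, the `"object"` label, and the grid size are all assumptions made for illustration.

```python
import random

def generate_scene(width, height, n_objects, seed=0):
    """Render a toy binary image and return it with its annotations.

    Each annotation is written when the object is drawn, so the labels
    are exact by construction. Objects are allowed to overlap.
    """
    rng = random.Random(seed)
    image = [[0] * width for _ in range(height)]
    annotations = []
    for _ in range(n_objects):
        w, h = rng.randint(2, 5), rng.randint(2, 5)
        x = rng.randint(0, width - w)   # keep the object inside the image
        y = rng.randint(0, height - h)
        for row in range(y, y + h):
            for col in range(x, x + w):
                image[row][col] = 1
        annotations.append({"label": "object", "bbox": (x, y, w, h)})
    return image, annotations

image, annotations = generate_scene(width=20, height=20, n_objects=3)
for ann in annotations:
    print(ann)
```

Since the generator knows exactly where it placed each object, the bounding boxes need no human review, which is the source of the accuracy gain described above.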

Moreover, synthetic data addresses the problem of bias in machine learning. Often, human intervention in the design of AI systems results in the integration of the preconceived ideas or expectations of the developers.

"It's not accuracy versus fairness. The data should represent the world how it should be," says Julia Stoyanovich, a professor of computer science at New York University. The science of machine learning and computer imaging should not involve subjective concepts such as fairness or equitable representation. It is about precision, what the reality is. "Companies don't have to choose. Instead, the data should represent the world how it should be," Stoyanovich stresses.

Practical applications

OneView can show that synthetic data does help in machine learning and has tangible practical applications. The company generates data for clients in various industries, including urban planning, defense and intelligence, finance, insurance, energy, and infrastructure.

Urban planners and related service providers can benefit from the accurate visual representation of urban landscapes, including traffic infrastructure. Similarly, finance and insurance companies can use manufactured data to help identify opportunities, assess the impact of events of interest, and arrive at reasonable evaluations to validate or contest insurance claims.

Defense and intelligence companies, too, can take advantage of synthetic data to train their systems for rare object detection and edge-case coverage. Likewise, those in the energy and infrastructure sector can rely on synthetic data to plan developments, project infrastructure changes, and evaluate infrastructure plans based on damage, development, and convenience impacts.

Kobi Katz, the VP and CIO of RAFAEL Advanced Defense Systems Ltd., lauds OneView's synthetic data solutions. "OneView's unique technology is a game changer. It is valuable for numerous use cases, brings unique capabilities, and provides support in product development and AI knowhow," Katz says.

Real benefits

To emphasize, synthetic data is not fake or false information. It is data generated using special algorithms or computer programs that take into account real-world situations and produce data that approximates the statistical features of actual objects and scenarios. As such, it is useful in scientific endeavors including machine learning and computer vision.

ⓒ 2024 TECHTIMES.com All rights reserved. Do not reproduce without permission.
* This is a contributed article and this content does not necessarily represent the views of techtimes.com