AI generator of anonymized data

The days when developers of data warehouses, or of any platform that integrates corporate data, had full access to all of the company's production data should be long gone. Such access carries significant risks and often conflicts with legislation and standards. At the same time, developers need to work with data that makes business sense, not with meaningless strings and numbers. That is why we decided to build a tool that addresses both needs: a generator of meaningful yet anonymized data based on generative AI.

Anonymized dummy data generator (Fig. 1)

Disclaimer: This article was translated from the Czech language by AI.

How the idea was born

It started with a clear need for a tool that would let us:

  • Develop and test an internal data warehouse securely and efficiently, without compromising sensitive data.
  • Demonstrate to clients the products we use internally ourselves, which is impossible with “live” data.

Traditional anonymization methods usually produce meaningless data, which makes a realistic demonstration impossible and is a problem for developers as well. We therefore focused on a solution that harnesses generative artificial intelligence.

How our solution works

Our meaningful anonymized data generator allows users to easily create anonymized datasets for testing purposes. The process begins by reading data from a CSV file or SQL database. Users can define generation parameters such as dataset topic, data requirements, and the percentage of original data to be used as the base. A preview of the sample data is then generated, which can be modified as needed. After final tuning of the parameters and data, a complete anonymized dataset is generated.

Specific procedure:

Load the data on the initial page. Either a CSV file or an SQL database can be selected as the data source. If CSV is selected, the file is loaded. For an SQL database, enter the login credentials and then select the table to anonymize.
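
The loading step can be illustrated with a minimal Python sketch. It is not the tool's actual code: the function names `rows_from_csv` and `rows_from_sql` are hypothetical, and the standard-library `sqlite3` module stands in for whatever database the tool connects to with login credentials.

```python
import csv
import io
import sqlite3


def rows_from_csv(text: str) -> list[dict[str, str]]:
    """Parse CSV text with a header row into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_sql(conn: sqlite3.Connection, table: str) -> list[dict[str, object]]:
    """Read every row of one table (hypothetical helper, SQLite as a stand-in)."""
    # Check the table name against the catalog before putting it into the query.
    known = {r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table!r}")
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


# Usage with an in-memory database and made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES ('Alice', 'Brno')")

csv_rows = rows_from_csv("name,city\nAlice,Brno\n")
sql_rows = rows_from_sql(conn, "customers")
```

Both sources end up in the same shape, a list of dictionaries keyed by column name, so the later steps do not need to know where the data came from.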

Anonymized dummy data generator (Fig. 2)

In the “Dummy data design” section we define the content of the dummy data. We can specify the topic of the dataset, the data requirements, and the percentage of the original data that will be used as the basis for generation. By default, 5 to 10 rows of input data are sufficient; you can also customize the parameters for the OpenAI API. Next, choose the variable by which the dataset is sorted. Click on “Generate sample data” to get a preview of the resulting data. In this step, it is important to specify exactly what data we want to generate in order to avoid AI “hallucinations”. The best results are achieved by manually editing and “tweaking” a few values, or by repeatedly generating sample data with additional information. Once the data matches our expectations, we go to the “Final dummy data” page and generate the full dataset.
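
As a rough illustration of this step, the following Python sketch picks a percentage of the original rows and assembles a request text from the topic and requirements. The helpers `pick_base_rows` and `build_prompt`, the prompt wording, and the JSON output format are all assumptions made for this example, not the tool's real prompt. The call to the OpenAI API itself is deliberately left out.

```python
import json
import math
import random


def pick_base_rows(rows, percent, seed=0):
    """Randomly pick `percent` % of the original rows (at least one) as the base."""
    if not rows:
        raise ValueError("no input rows")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    count = max(1, math.ceil(len(rows) * percent / 100))
    return random.Random(seed).sample(rows, count)


def build_prompt(topic, requirements, base_rows, n_rows):
    """Assemble the natural-language request (hypothetical wording)."""
    columns = list(base_rows[0].keys())
    return "\n".join([
        f"Generate {n_rows} rows of fictional data about: {topic}.",
        f"Requirements: {requirements}",
        f"Use exactly these columns: {', '.join(columns)}.",
        "Do not copy any value from the example rows; follow only their format.",
        "Return a JSON array of objects and nothing else.",
        "Example rows:",
        json.dumps(base_rows, ensure_ascii=False),
    ])


# Usage with made-up input data.
rows = [{"name": f"Customer {i}", "city": "Brno"} for i in range(100)]
base = pick_base_rows(rows, percent=5)
prompt = build_prompt("retail customers", "Czech cities only", base, n_rows=10)
```

Stating the columns and the output format explicitly is one way to narrow down what the model may return, in line with the advice above on avoiding hallucinations.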

Anonymized dummy data generator (Fig. 3)

To edit a specific value, click on the cell containing it, change the value, and confirm with the “Save edited data” button. To add a new row, click the “+” icon in the data footer, enter the new values, and confirm with the “Save modified data” button. In the “Final dummy data” section, set the number of rows in the final dataset and choose whether to use the sample data or generate new data. Click on “Generate final dummy data” to get the complete anonymized dataset.
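
The final step, a requested row count plus the choice between reusing the sample and generating from scratch, might look like the Python sketch below. Generating in batches is an assumption of this example and is not described in the article. `generate_final` and `fake_batch` are hypothetical names, and `fake_batch` merely stands in for a real model call.

```python
import itertools


def generate_final(n_rows, sample_rows, reuse_sample, generate_batch, batch_size=50):
    """Collect exactly n_rows rows, optionally starting from the edited sample rows."""
    if n_rows < 0:
        raise ValueError("n_rows must not be negative")
    result = list(sample_rows[:n_rows]) if reuse_sample else []
    while len(result) < n_rows:
        wanted = min(batch_size, n_rows - len(result))
        batch = generate_batch(wanted)
        if not batch:
            raise RuntimeError("generator returned no rows")
        # Never keep more rows than requested, even if the generator over-delivers.
        result.extend(batch[:wanted])
    return result


_counter = itertools.count(1)


def fake_batch(count):
    """Stand-in for a model call: returns `count` distinct placeholder rows."""
    return [{"name": f"Person {next(_counter)}"} for _ in range(count)]


# Usage: keep one hand-edited sample row and fill the rest up to 120 rows.
final = generate_final(120, [{"name": "Edited sample"}], True, fake_batch)
```

Passing the generator in as a function keeps the row-count logic independent of the model, so it can be checked without any API access.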

Anonymized dummy data generator (Fig. 4)

What we have learned and what we must not forget

During development, we gained a lot of new experience in prompt engineering, data validation and performance optimization. Much of the time went into refining how requirements are formulated so that the generative AI produces realistic, consistent data that is free of duplicates.

Developing applications with AI agents requires more than casual chatting with AI. The key skill in prompt engineering is formulating requirements so that the AI generates exactly what we need. This includes a detailed definition of input data and parameters, as well as iterative tuning. Expertise in data engineering and data science is also essential for combining work with data, APIs and machine learning principles. And, of course, validating the data to ensure the quality of the outputs is a must.
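
To make the validation point concrete, here is a minimal Python sketch of three checks: consistent columns, no duplicate rows, and no generated row identical to an original one. The function `validate_rows` and the checks chosen are assumptions for illustration, and the names in the sample data are invented. The article does not describe which checks the tool actually runs.

```python
def validate_rows(generated, expected_columns, original=()):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    seen = set()
    original_keys = {tuple(sorted(row.items())) for row in original}
    for index, row in enumerate(generated):
        if set(row) != set(expected_columns):
            problems.append(f"row {index}: unexpected columns {sorted(row)}")
            continue
        key = tuple(sorted(row.items()))
        if key in seen:
            problems.append(f"row {index}: duplicate of an earlier row")
        seen.add(key)
        if key in original_keys:
            problems.append(f"row {index}: identical to an original row")
    return problems


# Usage with invented data: one duplicate, one leaked original row, one broken row.
original = [{"name": "Jana Nováková", "city": "Praha"}]
generated = [
    {"name": "Eva Dvořáková", "city": "Brno"},
    {"name": "Eva Dvořáková", "city": "Brno"},
    {"name": "Jana Nováková", "city": "Praha"},
    {"name": "Petr Svoboda"},
]
problems = validate_rows(generated, ["name", "city"], original)
for problem in problems:
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a user see everything that needs fixing before regenerating.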

Why choose AI for data anonymization?

Using AI to generate anonymized data brings a number of advantages over traditional methods. AI can generate data that is not only anonymized, but also realistic and varied. Testing is then conducted on data that closely simulates real-world scenarios, which leads to more accurate results. AI also makes it possible to automate the entire process and generate thousands of meaningful records, which is difficult and costly to do manually.

What the solution will deliver

  • Secure testing and development: applications and analytics tools can be tested on data that does not compromise the security of real-world information.
  • Faster development: automated generation of dummy data saves time and shortens the development cycle.
  • Flexibility and customization.
  • Compliance with legislation: data anonymization supports compliance with GDPR and other regulations.

Most importantly, our solution is designed to be accessible to those without deep technical knowledge. There is no need for programming or complex parameter settings to generate dummy data. Just enter the requirements in natural language and the system does the rest. If necessary, the generation can be easily modified using an intuitive interface.

If you are interested in the anonymized data generator and would like to know more about it, please do not hesitate to contact us. We will be happy to have a free, no-obligation chat with you.


Author: Jan Petr
