Finance

Centralized data management and standardized pipelines for real estate and financial documents

[Image: hallway in a bright data archive with server cabinets on both sides]

Initial situation

A leading IT service provider within a financial group faced the strategic challenge of making the processing of huge volumes of real estate documents, from energy certificates to land register extracts, future-proof. The existing landscape was characterized by manual effort and historically grown data structures that were increasingly reaching their limits. In particular, the strict regulatory requirements of the banking sector (GDPR, BaFin, EU AI Act) demanded a new level of transparency and control.

The company lacked a central instance for data management, which not only made it difficult to comply with deletion deadlines but also slowed down the data science teams: data often sat in silos, and training and evaluation data sets were not uniform. The company recognized the opportunity not only to ensure compliance through a central platform, but also to lay the foundation for traceable, standardized data processes that can be used across teams.

Our solution

In close collaboration with internal teams, AMAI developed the central data store and the associated data pipelines as the backbone for all further evaluations and model projects. The focus of this engagement is on training and evaluation data as part of a central data mesh approach, on standardized and reproducible data preprocessing, on the automation of complex data processes, and on uniform data storage, regardless of which specific models or services later build on it.

  • Centralized data management with Delta Lake: The solution is built on a modern data lakehouse architecture. Using Delta Lake and a structured medallion architecture (bronze, silver, gold) creates a traceable "single source of truth" for documents, metadata and derived artifacts. Versioning ("time travel"), clear data lineage and the technical basis for compliance requirements such as legally compliant deletion after contract end are thus anchored in the central data store rather than in separate per-team files (see the first sketch after this list).
  • Standardized data pipelines: Data scientists and data engineers work with uniform PySpark-based pipelines that automate recurring steps from raw data ingestion to structured tables. Particularly effective: the pipelines support the automatic creation and further processing of labeling projects in Label Studio (the second sketch below), a step that previously consumed considerable manual time and is now part of the standardized data flow. At the same time, insights from the processing, quality assurance and structuring of data carry over into the storage and processing of other teams, instead of each subject area being reinvented in isolation.
  • Repeatability and automation: Complex data processes (such as splitting multi-page documents, assigning pages to logical documents, and preparing data for evaluations) run through traceable pipeline stages (see the final sketch below). This makes it transparent which data is used, in what form, and for what purposes: a central building block for regulatory requirements and for reliable model and product decisions.
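
For illustration, here is a minimal PySpark sketch of the medallion flow and the time-travel and deletion mechanics described above. The session setup follows the standard Delta Lake configuration; all paths, table layouts and column names are assumptions for this example, not details from the project.

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

# Standard Delta Lake session configuration (requires the delta-spark package).
spark = (
    SparkSession.builder.appName("medallion-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Bronze: land raw document metadata as-is, append-only (hypothetical source path).
raw = spark.read.json("s3://landing/documents/")
raw.write.format("delta").mode("append").save("/lake/bronze/documents")

# Silver: deduplicated, typed records derived from bronze.
bronze = spark.read.format("delta").load("/lake/bronze/documents")
silver = (
    bronze.dropDuplicates(["document_id"])
          .withColumn("ingested_at", F.current_timestamp())
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/documents")

# Time travel: reproduce an evaluation data set exactly as it was at an
# earlier table version, e.g. for a regulatory audit.
snapshot = (
    spark.read.format("delta")
         .option("versionAsOf", 42)  # illustrative version number
         .load("/lake/silver/documents")
)

# Deletion after contract end: remove the rows, then VACUUM purges the old
# data files so the deletion is physical, not just logical.
docs = DeltaTable.forPath(spark, "/lake/silver/documents")
docs.delete(F.col("retention_until") < F.current_date())  # assumed column
docs.vacuum(168)  # keep 7 days of history
```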
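The Label Studio integration mentioned in the second bullet could look roughly like the following, assuming the Python client from the label-studio-sdk package; the URL, API key, labeling config and task URLs are placeholders, not the project's actual setup.

```python
from label_studio_sdk import Client

# Minimal labeling interface: classify each page image by document type.
LABEL_CONFIG = """
<View>
  <Image name="page" value="$image"/>
  <Choices name="doc_type" toName="page">
    <Choice value="energy_certificate"/>
    <Choice value="land_register_extract"/>
    <Choice value="other"/>
  </Choices>
</View>
"""

# Placeholder URL and API key.
ls = Client(url="https://label-studio.example.internal", api_key="...")
project = ls.start_project(
    title="Document types, batch 2024-07",  # hypothetical batch name
    label_config=LABEL_CONFIG,
)

# Tasks are generated from the central data store instead of hand-built lists.
tasks = [
    {"image": f"https://docstore.example.internal/pages/{doc_id}.png"}
    for doc_id in ["doc-001", "doc-002"]    # illustrative IDs
]
project.import_tasks(tasks)
```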
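And for the third bullet, one way such a pipeline stage can assign scanned pages to logical documents is a running count over a first-page flag; the `is_first_page` column (e.g. produced by an upstream classifier) and all other names are assumptions for this sketch.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Silver-layer page table: one row per scanned page, ordered within each file.
pages = spark.read.format("delta").load("/lake/silver/pages")

w = Window.partitionBy("file_id").orderBy("page_number")

logical = (
    pages
    # A running count of detected first pages yields a per-file document
    # index, so every page up to the next first page joins the same document.
    .withColumn("doc_index",
                F.sum(F.col("is_first_page").cast("int")).over(w))
    .withColumn("logical_doc_id",
                F.concat_ws("-", F.col("file_id"), F.col("doc_index")))
)

logical.write.format("delta").mode("overwrite") \
       .save("/lake/silver/logical_documents")
```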

A particular challenge was bringing together the differing data processing approaches of individual data scientists and establishing a common standard. Intensive coordination between everyone involved and the consistent application of a structured layered architecture were decisive in making the complexity manageable while technically anchoring data protection and regulatory requirements.

Results & business impact

The introduction of central data management marked a turning point in the company's data strategy. The system supports the strict regulatory requirements of the financial industry (GDPR, BaFin): deletion concepts and traceability are easier to implement technically because data is processed centrally and controlled via pipelines. Over 900,000 documents are already under central management in the current phase; the architecture is designed for many times that volume.

Replacing manual data silos with standardized, automated pipelines has shortened development and processing cycles. Connecting Label Studio to the pipelines significantly reduces the effort for labeling projects and makes recurring steps plannable. Statistics and overviews of the existing data can be produced far more easily from a central, structured data store than from distributed, inconsistently prepared inventories, which simplifies governance, quality monitoring and communication about the state of the data across the overall program.

The platform provides a solid foundation for further usage scenarios, such as deeper integration into assistance systems, chatbots and knowledge management, because the underlying data foundation remains uniform and extensible.

From data silo to single source of truth: Start your data transformation

Would you like to consolidate distributed data sets and reconcile regulatory requirements with reproducible pipelines and central data storage? Let's explore together how consistent data management can be implemented for your AI and analytics projects.

