Structuring Thousands of Files from Internal and External Sources
Files structured automatically
The Problem
A mid-market real estate firm in the US had a data problem that was quietly costing them deals. Their business depended on enriching property and market data with a variety of tools, pulling the same kinds of information from thousands of files in different formats, and preparing that data for analysis. But the files came from everywhere: internal databases, third-party data providers, public records, broker reports, and proprietary research. Every source had its own format, naming conventions, and quirks.
The firm needed to extract comparable information from each file, regardless of format, and structure it into something their analysts could actually work with. Previously, this was done manually by a team of people who would open each file, find the relevant fields, and copy them into a standardised template. With thousands of files on hand and more arriving all the time, this approach was unsustainable. Important analyses simply were not getting done because the data preparation alone would take weeks.
Leadership wanted to go from raw, heterogeneous data to structured, analysis-ready datasets without the manual bottleneck. They also needed the ability to enrich their internal data with external sources like web data and market intelligence, something that was practically impossible to do at scale with their existing process.
The Solution
BetterBrain deployed a team of AI agents, each specialising in a different aspect of the data enrichment and structuring workflow. Some agents handle web search to pull in external data points. Others use enterprise search to find relevant internal documents. Custom tools handle format-specific extraction from PDFs, spreadsheets, Word documents, and other file types.
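To make the format-specific layer concrete, here is a minimal sketch of extension-based dispatch built on common open-source parsers (pypdf, openpyxl, python-docx). The library choices and the `extract_text` helper are illustrative assumptions; the custom tools in the deployed pipeline are not described in this level of detail.

```python
# Illustrative sketch only: dispatch on file extension and return raw
# text for downstream agents. Parser choices are assumptions.
from pathlib import Path

from pypdf import PdfReader          # pip install pypdf
from openpyxl import load_workbook   # pip install openpyxl
from docx import Document            # pip install python-docx


def extract_text(path: Path) -> str:
    """Return the raw text of a file, whatever its format."""
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix in {".xlsx", ".xlsm"}:
        wb = load_workbook(path, read_only=True, data_only=True)
        rows = []
        for sheet in wb.worksheets:
            for row in sheet.iter_rows(values_only=True):
                rows.append("\t".join("" if v is None else str(v) for v in row))
        return "\n".join(rows)
    if suffix == ".docx":
        doc = Document(str(path))
        return "\n".join(p.text for p in doc.paragraphs)
    raise ValueError(f"Unsupported format: {suffix}")
```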
The system processes thousands of files, identifies the relevant information in each one regardless of format, and structures it into consistent, analysis-ready datasets. It uses OpenAI's GPT-4o mini for fast extraction tasks and Llama 3.1 70B served via Groq for bulk processing. RAG ensures that the agents understand the firm's specific data schema and terminology, so extracted data maps correctly to the fields analysts need.
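The sketch below shows what routing between the two models might look like, assuming the OpenAI and Groq Python SDKs with JSON-mode responses. The `PROPERTY_SCHEMA` prompt, the `llama-3.1-70b-versatile` model identifier, and the `extract_fields` helper are illustrative assumptions, not the firm's actual configuration.

```python
# Hypothetical model-routing sketch; schema prompt and model IDs are
# assumptions, not the deployed configuration.
import json

from openai import OpenAI  # pip install openai
from groq import Groq      # pip install groq

openai_client = OpenAI()   # reads OPENAI_API_KEY from the environment
groq_client = Groq()       # reads GROQ_API_KEY from the environment

# Assumed target schema; in the real system, RAG over the firm's data
# dictionary would ground the agents in the correct fields and terms.
PROPERTY_SCHEMA = (
    "Return JSON with keys: address, asking_price_usd, square_feet, "
    "cap_rate, source_type, and confidence (a 0-1 self-assessment)."
)


def extract_fields(text: str, bulk: bool = False) -> dict:
    """Extract schema-conformant fields from a file's raw text."""
    messages = [
        {"role": "system", "content": PROPERTY_SCHEMA},
        {"role": "user", "content": text[:8000]},  # naive truncation for the sketch
    ]
    if bulk:
        # High-volume path: Llama 3.1 70B served by Groq.
        resp = groq_client.chat.completions.create(
            model="llama-3.1-70b-versatile",
            messages=messages,
            response_format={"type": "json_object"},
        )
    else:
        # Latency-sensitive path: GPT-4o mini.
        resp = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            response_format={"type": "json_object"},
        )
    return json.loads(resp.choices[0].message.content)
```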
The agents work in parallel, handling hundreds of files simultaneously. When they encounter ambiguous or low-confidence extractions, they flag those for human review rather than guessing. The result is a pipeline that turns a previously impossible task into something that runs in the background while the team focuses on actual analysis and deal-making.
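Building on the two sketches above, here is a minimal version of that parallel fan-out with low-confidence flagging. The worker count, the 0.8 threshold, and the shape of the review queue are all assumptions.

```python
# Minimal parallel pipeline sketch; threshold and concurrency are
# assumed values, and extract_text / extract_fields come from the
# earlier sketches in this section.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff for auto-accepting an extraction


def process_file(path: Path) -> dict:
    """Extract fields from one file along with a self-assessed confidence."""
    text = extract_text(path)
    fields = extract_fields(text, bulk=True)
    confidence = float(fields.pop("confidence", 0.0))
    return {"path": str(path), "fields": fields, "confidence": confidence}


def run_pipeline(paths: list[Path]) -> tuple[list[dict], list[dict]]:
    """Fan out over files; split results into accepted vs. human review."""
    accepted, needs_review = [], []
    with ThreadPoolExecutor(max_workers=32) as pool:
        for result in pool.map(process_file, paths):
            # Low-confidence extractions go to humans instead of guessing.
            if result["confidence"] >= CONFIDENCE_THRESHOLD:
                accepted.append(result)
            else:
                needs_review.append(result)
    return accepted, needs_review
```

Routing uncertain extractions to a review queue rather than accepting them keeps the automated path conservative: throughput comes from the bulk of unambiguous files, while humans only see the edge cases.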
The Numbers
- Thousands of heterogeneous files structured automatically
- Reduced time and cost for lead enrichment and sales prospecting
- Enabled analysis that was previously impossible at scale
- Combined internal data with external web intelligence seamlessly
- Analysts freed to focus on deal evaluation instead of data preparation