How a Data Lakehouse Cuts Reporting Time from 2 Days to 15 Minutes
This is an anonymized client case. Some implementation details and figures were simplified or rounded to protect confidentiality while preserving the technical approach and business outcomes.
TL;DR: In one anonymized mid-market migration, we replaced spreadsheet-heavy reporting and a legacy transactional database with an analytical stack built around object storage, dbt, and DuckDB. Reporting latency dropped from days to minutes, manual preparation work fell substantially, and reporting consistency improved during the parallel-run validation period.
Updated: March 3, 2026 | By Dmitry Susha, CTO & Co-Founder
The Starting Point
The company - a B2B distributor with 200+ employees - had a data problem common among mid-market businesses:
- 12 Excel spreadsheets maintained manually by 3 analysts
- Legacy PostgreSQL database with 5 years of transactional data (180GB)
- ERP system (1C) generating daily exports that analysts copy-pasted into Excel
- 2 days to produce a monthly sales report
- No single source of truth - different departments had different numbers
The CEO’s question was simple: “Why does it take my team 2 days to tell me how much we sold last month?”
The Architecture We Built
We designed a lightweight Lakehouse architecture using open-source tools:
Data Sources          Ingestion           Storage             Analytics
┌────────────┐      ┌───────────┐      ┌───────────┐      ┌────────────┐
│ PostgreSQL ├─────►│           │      │  Parquet  │      │   DuckDB   │
├────────────┤      │  Airflow  ├─────►│   on S3   ├─────►│            │
│ 1C Export  ├─────►│   DAGs    │      │ (Iceberg  │      │ Dashboard  │
├────────────┤      │           │      │  format)  │      │  queries   │
│   Excel    ├─────►│           │      └───────────┘      └─────┬──────┘
└────────────┘      └─────┬─────┘                               │
                          ▼                                     ▼
                    ┌───────────┐                        ┌────────────┐
                    │    dbt    │                        │  Metabase  │
                    │  models   │                        │ Dashboards │
                    └───────────┘                        └────────────┘
Key technology choices:
- DuckDB for the query engine (180GB fits comfortably, no server needed)
- Apache Iceberg on S3 for storage (versioned, ACID-compliant)
- dbt for transformations (SQL models with tests and documentation)
- Airflow for orchestration (scheduled ingestion from all sources)
- Metabase for dashboards (self-hosted, via DuckDB community connector)
- Terraform for infrastructure (reproducible, version-controlled)
Infrastructure cost dropped substantially compared to the managed database setup: the new stack's recurring spend is essentially S3 storage plus a small compute instance for Airflow.
The Migration Process
Week 1-2: Audit and Data Modeling
We started by mapping every data source, every Excel formula, and every report the team produced. Key findings:
- The majority of analyst time was spent on data preparation, not analysis
- 4 out of 12 spreadsheets contained duplicate data from the same source
- Monthly reports had recurring discrepancies between departments
- The PostgreSQL database had 47 tables, but only 8 were used for reporting
We designed a dimensional model with 8 fact tables and 12 dimension tables - covering 100% of existing reports with room for new analytics.
Week 3-4: Pipeline Construction
Using Airflow, we built automated ingestion pipelines:
- PostgreSQL: incremental changes were captured and materialized into analytical storage on a scheduled cadence (the exact transport layer depends on the client environment and reliability requirements)
- 1C: automated file export + Airflow sensor, processing within 30 minutes
- Excel: uploaded to S3, parsed and validated automatically with Great Expectations
All data landed in Parquet format on S3 with Iceberg metadata. dbt models transformed raw data into analytical tables with built-in tests - referential integrity, value ranges, freshness checks.
Week 5-6: Dashboards and Validation
We rebuilt every Excel report as a Metabase dashboard. The CEO’s monthly sales report became a live dashboard updating every 15 minutes.
Validation process:
- Ran old and new reports in parallel for 2 weeks
- Compared numbers line by line - identified and fixed 3 data quality issues in the source systems
- Got sign-off from each department head
The rollout was designed to minimize disruption by running the legacy and new reporting paths in parallel until stakeholders signed off on validated outputs.
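The line-by-line comparison can be reduced to a simple idea: take per-department totals from the legacy report and from the new dashboard, and flag any pair that differs beyond a rounding tolerance. A toy version, with made-up figures:

```python
# Totals from the legacy Excel report vs. the new dashboard, per department.
# All figures are invented for illustration.
legacy = {"sales": 1_240_500.00, "logistics": 88_310.50, "finance": 402_775.25}
lakehouse = {"sales": 1_240_500.00, "logistics": 88_310.55, "finance": 402_775.25}

TOLERANCE = 0.10  # currency units; anything above this gets investigated

mismatches = {
    dept: (legacy[dept], lakehouse[dept])
    for dept in legacy
    if abs(legacy[dept] - lakehouse[dept]) > TOLERANCE
}
assert not mismatches, f"needs review before sign-off: {mismatches}"
```

In practice each flagged mismatch is a lead: during this project, every one traced back to a data quality issue in a source system rather than to the new pipelines.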
The Results
Before vs After
| Metric | Before (Excel + PostgreSQL) | After (Lakehouse) |
|---|---|---|
| Monthly sales report | 2 days manual work | 15 minutes (automated) |
| Data freshness | Daily (manual export) | 15-minute lag |
| Number of data sources | 3 (disconnected) | 3 (unified) |
| Manual data prep | Majority of analyst time | Minimal (automated pipelines) |
| Report discrepancies | Recurring between departments | Converged on a single trusted definition during rollout |
| Infrastructure cost | Managed DB baseline | ~90% lower (S3 + DuckDB) |
| Analyst time on reporting | 3 analysts full-time on data prep | Substantially reduced; redirected to analysis |
ROI Summary
The project paid for itself within the first quarter. The largest savings came from infrastructure (managed DB → S3 + DuckDB) and analyst time freed from manual data preparation. The team redirected that time to actual analysis - finding patterns, testing hypotheses, and supporting business decisions.
What We Learned
1. Start with the CEO’s question, not the technology. The project succeeded because we solved a specific business problem (“why does the report take 2 days?”), not because we deployed cool technology.
2. Excel is not the enemy - it’s the symptom. The real problem was disconnected data sources and no automation. Analysts used Excel because there was no better option.
3. For this workload, DuckDB was sufficient. 180GB of data, 8 fact tables, 50+ dashboard queries - DuckDB handled everything on a single instance without introducing a heavier serving stack.
4. dbt tests catch problems early. Within the first week of production, dbt tests caught a schema change in the 1C export that would have silently broken 3 reports. Automated quality checks paid for themselves immediately.
5. Parallel running removes fear. The biggest barrier to migration isn’t technical - it’s trust. Running old and new systems in parallel let stakeholders verify numbers at their own pace.
Frequently Asked Questions
How long does it take to migrate from Excel to a Data Lakehouse?
A typical migration takes 4-8 weeks for the MVP phase. This includes connecting data sources, building core pipelines, and deploying initial dashboards. Full automation with AI quality checks adds another 4-8 weeks.
What ROI can I expect from a Data Lakehouse?
Based on our implementations, companies typically see significant reductions in reporting time and manual data preparation. Infrastructure costs drop substantially compared to managed warehouse setups. Exact numbers depend on data volume, source complexity, and current stack - we share specific estimates after a discovery call.
Can we keep using Excel during the migration?
Yes. We run the Lakehouse in parallel with existing systems. Your team continues using Excel and legacy tools while we build and validate the new platform. The switch happens only after thorough testing.
Key Takeaways
- Reporting time dropped from 2 days to 15 minutes - fully automated
- Infrastructure costs dropped ~90% (managed DB → S3 + DuckDB)
- Analyst time on manual data prep reduced dramatically
- DuckDB handled 180GB on a single instance - no complex infrastructure needed
- Project paid for itself within the first quarter
Further Reading
- What Is a Data Lakehouse and Why It’s Replacing Traditional Warehouses - the architecture and cost breakdown explained
- ClickHouse vs DuckDB vs Snowflake: Choosing the Right Engine - when to use each tool
Want similar results for your business? Book a free data audit call - we’ll assess your current setup and estimate potential improvements.
Sources
- Metabase DuckDB Connector - community connector status and setup
- DuckDB S3 Support - reading from object storage
- AWS S3 Pricing - storage cost reference
- DuckDB: Working with Huge Databases - scaling guidance
Some details in this case study were simplified or rounded to protect client confidentiality. Outcomes should not be treated as guaranteed for other environments.
Reviewed by Dmitry Susha, CTO & Co-Founder at Sfotex. Last reviewed: March 2026. Contact: Telegram | Email