<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Vladislav Kushka - Delivery manager, Xenoss</title>
	<atom:link href="https://xenoss.io/blog/author/vladislav-kushka/feed" rel="self" type="application/rss+xml" />
	<link>https://xenoss.io/blog/author/vladislav-kushka</link>
	<description></description>
	<lastBuildDate>Mon, 23 Mar 2026 12:43:21 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://xenoss.io/wp-content/uploads/2020/10/cropped-xenoss4_orange-4-32x32.png</url>
	<title>Vladislav Kushka - Delivery manager, Xenoss</title>
	<link>https://xenoss.io/blog/author/vladislav-kushka</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Data lake architecture: Design patterns for AI-ready enterprise data infrastructure</title>
		<link>https://xenoss.io/blog/data-lake-architecture-design-patterns</link>
		
		<dc:creator><![CDATA[Vlad Kushka]]></dc:creator>
		<pubDate>Mon, 23 Mar 2026 12:40:30 +0000</pubDate>
				<category><![CDATA[Software architecture & development]]></category>
		<category><![CDATA[Data engineering]]></category>
		<guid isPermaLink="false">https://xenoss.io/?p=14033</guid>

					<description><![CDATA[<p>The 2026 State of Data Engineering survey of 1,101 data professionals identified that 44% still rely on cloud data warehouses as their primary paradigm, while 27% have moved to lakehouse architectures. The remaining teams use a mix of both, and 25% name legacy systems and technical debt as their biggest bottleneck. For organizations stuck in [&#8230;]</p>
<p>The post <a href="https://xenoss.io/blog/data-lake-architecture-design-patterns">Data lake architecture: Design patterns for AI-ready enterprise data infrastructure</a> appeared first on <a href="https://xenoss.io">Xenoss - AI and Data Software Development Company</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400;">The </span><a href="https://joereis.github.io/practical_data_data_eng_survey/"><span style="font-weight: 400;">2026 State of Data Engineering survey</span></a><span style="font-weight: 400;"> of 1,101 data professionals identified that 44% still rely on cloud data warehouses as their primary paradigm, while 27% have moved to lakehouse architectures. The remaining teams use a mix of both, and 25% name legacy systems and technical debt as their biggest bottleneck. For organizations stuck in that last group, the root cause is almost always the same: the data lake was built as a storage project instead of an architecture project.</span></p>
<p><span style="font-weight: 400;">The storage itself is rarely the issue. S3 is cheap, ADLS scales well, GCS is reliable. Where data lake architecture breaks down is in the decisions made (or not made) before the first byte lands: </span></p>
<ul>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">how zones are structured</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">which open table format governs transactions</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">whether a catalog exists to make data discoverable. </span></li>
</ul>
<p><span style="font-weight: 400;">Skip any of those three, and the lake drifts toward a swamp, regardless of how much you spent on compute.</span></p>
<p><span style="font-weight: 400;">This article focuses on the architectural decisions: open table format selection, catalog and metastore strategy, AI-specific zone design, and the concrete triggers for evolving a lake into a </span><a href="https://xenoss.io/blog/modern-data-platform-architecture-lakehouse-vs-warehouse-vs-lake"><span style="font-weight: 400;">lakehouse</span></a><span style="font-weight: 400;">. If you already know what a data lake is, this is the article about how to build one that holds up in production.</span></p>
<h2><b>Summary</b></h2>
<ul>
<li style="font-weight: 400;" aria-level="1"><b>Data lake architecture fails when teams treat it as a storage problem.</b><span style="font-weight: 400;"> Three decisions made before ingestion determine success: zone structure, open table format, and metadata catalog.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Open table formats (Iceberg, Delta Lake, Hudi) are now essential.</b><span style="font-weight: 400;"> The 2026 State of Data Engineering survey found that 27% of data professionals already use lakehouse architectures built on these formats.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>AI workloads need specific architectural patterns.</b><span style="font-weight: 400;"> Feature store integration, unstructured data pipelines, and model training data lineage require purpose-built zones that traditional lake designs don&#8217;t include.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Governance cannot be an afterthought.</b><span style="font-weight: 400;"> 25% of data professionals cite legacy systems and technical debt as their biggest bottleneck. Most of that debt accumulates from deferred governance decisions.</span></li>
</ul>
<h2><b>What is data lake architecture?</b></h2>
<p><span style="font-weight: 400;"><div class="post-banner-text">
<div class="post-banner-wrap post-banner-text-wrap">
<h2 class="post-banner__title post-banner-text__title">Data lake architecture</h2>
<p class="post-banner-text__content">Is a system design for storing raw, semi-structured, and unstructured data at scale, using schema-on-read to defer structure decisions until query time.</p>
</div>
</div></span></p>
<p><span style="font-weight: 400;">Unlike </span><a href="https://xenoss.io/blog/building-vs-buying-data-warehouse"><span style="font-weight: 400;">data warehouses</span></a><span style="font-weight: 400;"> that enforce schema-on-write, data lakes accept data in its original format, making them well-suited for exploratory analytics, log processing, and training machine learning models. The architecture encompasses ingestion pipelines, storage layers, processing engines, metadata catalogs, and governance frameworks that work together to keep data accessible, trustworthy, and queryable.</span></p>
<h2><b>Core data lake design patterns</b></h2>
<h3><b>Medallion architecture (bronze, silver, gold)</b></h3>
<p><span style="font-weight: 400;">The medallion pattern, popularized by Databricks, organizes data into three quality tiers. </span></p>
<ol>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">The bronze layer holds raw, unprocessed data exactly as ingested. </span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Silver applies cleaning, deduplication, and schema enforcement. </span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Gold serves curated, business-ready datasets optimized for analytics and reporting. </span></li>
</ol>
<p><span style="font-weight: 400;">This works well when different teams need data at different stages of refinement. Data scientists might query bronze for raw signals, while finance teams rely on gold for reconciled numbers. The </span><a href="https://xenoss.io/blog/modern-data-platform-architecture-lakehouse-vs-warehouse-vs-lake"><span style="font-weight: 400;">medallion architecture</span></a><span style="font-weight: 400;"> also simplifies debugging, because every transformation step is preserved and replayable.</span></p>
<h3><b>Data lake zones (landing, raw, curated, sandbox)</b></h3>
<p><span style="font-weight: 400;">Zone-based architecture organizes the lake by access patterns and data maturity rather than quality tiers. </span></p>
<p><span style="font-weight: 400;">A typical layout includes:</span></p>
<ul>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">a landing zone (temporary staging for incoming data)</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">a raw zone (immutable, append-only storage)</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">a curated zone (governed, validated datasets)</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">a sandbox zone (experimental space for data science teams). </span></li>
</ul>
<p><span style="font-weight: 400;">Zones enforce different security and governance rules: the raw zone might restrict access to </span><a href="https://xenoss.io/capabilities/data-engineering"><span style="font-weight: 400;">data engineering</span></a><span style="font-weight: 400;"> teams only, while the sandbox zone allows broader access with reduced governance overhead. The key decision is how many zones to create. Xenoss engineers recommend starting with three or four and expanding only when a clear business need arises. Over-engineering zones adds complexity without adding value.</span></p>
<h3><b>Lambda and kappa architectures</b></h3>
<p><span style="font-weight: 400;">Lambda architecture runs batch and real-time processing in parallel, merging results in a serving layer. It handles historical reprocessing well, but creates maintenance overhead because teams maintain two codebases. </span></p>
<p><span style="font-weight: 400;">Kappa architecture simplifies this by treating all data as a stream, replaying historical data through the same streaming pipeline when reprocessing is needed. </span></p>
<p><span style="font-weight: 400;">For enterprise use cases in 2026, kappa-influenced designs (stream-first, with batch as a fallback) are gaining traction. </span><a href="https://xenoss.io/blog/what-is-a-data-pipeline-components-examples"><span style="font-weight: 400;">Apache Kafka</span></a><span style="font-weight: 400;"> and Confluent Cloud support this pattern natively, and platforms like Databricks unify batch and streaming under a single API.</span></p>
<h2><b>Three decisions to make before your first ingestion pipeline runs</b></h2>
<p><span style="font-weight: 400;">Across Xenoss client engagements, data lakes that succeed share one trait: the team made three explicit architectural decisions before ingesting data. Each decision, if deferred or skipped, creates compounding problems as the lake grows.</span></p>
<figure id="attachment_14034" aria-describedby="caption-attachment-14034" style="width: 1376px" class="wp-caption alignnone"><img fetchpriority="high" decoding="async" class="size-full wp-image-14034" title="Three decisions to make before your first ingestion pipeline runs" src="https://xenoss.io/wp-content/uploads/2026/03/freepik__img1-img2-img3-create-a-clean-enterprise-infograph__72359.png" alt="Three decisions to make before your first ingestion pipeline runs" width="1376" height="768" srcset="https://xenoss.io/wp-content/uploads/2026/03/freepik__img1-img2-img3-create-a-clean-enterprise-infograph__72359.png 1376w, https://xenoss.io/wp-content/uploads/2026/03/freepik__img1-img2-img3-create-a-clean-enterprise-infograph__72359-300x167.png 300w, https://xenoss.io/wp-content/uploads/2026/03/freepik__img1-img2-img3-create-a-clean-enterprise-infograph__72359-1024x572.png 1024w, https://xenoss.io/wp-content/uploads/2026/03/freepik__img1-img2-img3-create-a-clean-enterprise-infograph__72359-768x429.png 768w, https://xenoss.io/wp-content/uploads/2026/03/freepik__img1-img2-img3-create-a-clean-enterprise-infograph__72359-466x260.png 466w" sizes="(max-width: 1376px) 100vw, 1376px" /><figcaption id="caption-attachment-14034" class="wp-caption-text">Three decisions to make before your first ingestion pipeline runs</figcaption></figure>
<p><span style="font-weight: 400;">The sequence matters: zones define the physical structure, the open table format defines transactional behavior within those zones, and the catalog makes everything discoverable. Skipping any of the three means the next one cannot function properly.</span></p>
<h2><b>Open table formats: Choosing between Iceberg, Delta Lake, and Hudi</b></h2>
<p><span style="font-weight: 400;">Open table formats bring warehouse-grade capabilities (ACID transactions, time travel, schema evolution) to data lake storage. </span></p>
<p><a href="https://joereis.github.io/practical_data_data_eng_survey/"><span style="font-weight: 400;">27% of data professionals</span></a><span style="font-weight: 400;"> now use lakehouse architectures, up significantly from prior years. Three formats dominate the space.</span></p>

<table id="tablepress-168" class="tablepress tablepress-id-168">
<thead>
<tr class="row-1">
	<th class="column-1">Format</th><th class="column-2">Best for</th><th class="column-3">Strengths</th><th class="column-4">Considerations</th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1">Apache Iceberg</td><td class="column-2">Multi-engine environments (Spark, Trino, Flink, Presto) and teams avoiding vendor lock-in</td><td class="column-3">Engine-agnostic design, hidden partitioning, strong community momentum across AWS, Snowflake, Databricks</td><td class="column-4">Newer ecosystem, fewer mature tooling integrations than Delta Lake</td>
</tr>
<tr class="row-3">
	<td class="column-1">Delta Lake</td><td class="column-2">Databricks-centric environments and teams already on Spark</td><td class="column-3">Tight Spark integration, mature tooling, strong documentation, built-in optimization (Z-ordering, liquid clustering)</td><td class="column-4">Historically tighter coupling to Databricks, though open-source compatibility is improving</td>
</tr>
<tr class="row-4">
	<td class="column-1">Apache Hudi</td><td class="column-2">Streaming-heavy workloads with frequent upserts and CDC</td><td class="column-3">Record-level upserts, incremental processing, designed for streaming-first architectures</td><td class="column-4">Smaller community than Iceberg or Delta. Best suited for specific ingestion patterns</td>
</tr>
</tbody>
</table>
<p><span style="font-weight: 400;">In practice, the market is converging toward </span><a href="https://xenoss.io/blog/apache-iceberg-delta-lake-hudi-comparison"><span style="font-weight: 400;">Apache Iceberg</span></a><span style="font-weight: 400;"> as the default for new deployments. </span><a href="https://aws.amazon.com/marketplace/seller-profile?id=seller-t6vmse2zrcbck"><span style="font-weight: 400;">AWS</span></a><span style="font-weight: 400;">, </span><a href="https://xenoss.io/blog/snowflake-vs-redshift-data-warehouse-decision"><span style="font-weight: 400;">Snowflake</span></a><span style="font-weight: 400;">, and Databricks all now support Iceberg REST catalogs, and the format&#8217;s engine-agnostic design aligns with the multi-cloud direction most enterprises are moving toward. For teams already invested in Databricks, Delta Lake remains a strong choice. Hudi is best suited for teams with heavy CDC and streaming upsert requirements.</span></p>
<p><b>Why this matters: </b><span style="font-weight: 400;">Choosing a table format after data is already in the lake means migrating terabytes of files and rewriting transformation logic. The format decision should be locked before the first ingestion pipeline runs.</span></p>
<p><span style="font-weight: 400;"><div class="post-banner-cta-v2 no-desc js-parent-banner">
<div class="post-banner-wrap post-banner-cta-v2-wrap">
	<div class="post-banner-cta-v2__title-wrap">
		<h2 class="post-banner__title post-banner-cta-v2__title">Build an AI-ready data lake with Xenoss data engineers.</h2>
	</div>
<div class="post-banner-cta-v2__button-wrap"><a href="https://xenoss.io" class="post-banner-button xen-button">Contact us</a></div>
</div>
</div></span></p>
<h2><b>Data lake vs lakehouse: When to evolve your architecture</b></h2>
<p><span style="font-weight: 400;">The lakehouse concept merges the flexibility of data lakes with the transactional guarantees of data warehouses. In the </span><a href="https://joereis.github.io/practical_data_data_eng_survey/"><span style="font-weight: 400;">2026 State of Data Engineering survey</span></a><span style="font-weight: 400;">, 44% of respondents still use cloud data warehouses as their primary paradigm, while 27% have adopted lakehouse architectures. The remaining teams use a mix of both.</span></p>
<p><span style="font-weight: 400;">A pure data lake makes sense when the primary consumers are data scientists and ML engineers who need raw, flexible access to diverse data types. A lakehouse becomes necessary when business analysts, BI tools, and governance requirements enter the picture. The lakehouse adds structure without losing flexibility.</span></p>
<p><span style="font-weight: 400;">The practical trigger for migration is usually the moment when a team needs to run both SQL analytics and ML training on the same data. In a pure lake, maintaining separate ETL pipelines for each use case is required. In a lakehouse, both workloads read from the same governed, transactionally consistent tables.</span></p>
<p><b>Why this matters: </b><span style="font-weight: 400;">Premature lakehouse adoption adds complexity without business value. But delaying it too long means accumulating technical debt in the form of duplicated datasets, inconsistent metrics, and ungoverned ML training data. Xenoss engineers recommend evaluating the transition when the </span><a href="https://xenoss.io/capabilities/data-pipeline-engineering"><span style="font-weight: 400;">data pipeline</span></a><span style="font-weight: 400;"> count exceeds 50 or when more than three teams consume the same datasets for different purposes.</span></p>
<h2><b>Architecting data lakes for AI and ML workloads</b></h2>
<p><a href="https://www.dremio.com/newsroom/why-data-lakehouses-are-poised-for-major-growth-in-2025/"><span style="font-weight: 400;">85% of Lakehouse users</span></a><span style="font-weight: 400;"> are either developing AI models or plan to. At the same time, 36% cite governance as a major challenge for AI-driven analytics. Teams are pushing AI workloads onto data lakes that were designed for dashboards and batch reporting. The architecture gaps only become visible when the first ML pipeline goes to production.</span></p>
<p><span style="font-weight: 400;">AI workloads place four specific demands on data lake architecture that traditional designs don&#8217;t address.</span></p>
<ol>
<li><b> Feature store integration. </b><span style="font-weight: 400;">ML models consume features, not raw tables. A feature store (such as Feast, Tecton, or Databricks Feature Store) sits between the curated zone and the training pipeline, providing versioned, point-in-time correct feature sets. The data lake must support the feature store&#8217;s read patterns, which typically involve large sequential scans for training and low-latency lookups for inference.</span></li>
<li><b> Unstructured data pipelines. </b><span style="font-weight: 400;">Text documents, images, audio, sensor readings, and log files are increasingly valuable for AI use cases. The data lake needs a dedicated zone for unstructured data with its own ingestion and cataloging pipeline. Parquet and Iceberg work well for structured features, but unstructured data often requires object-level metadata tagging and separate indexing.</span></li>
<li><b> Training data lineage. </b><span style="font-weight: 400;">Regulatory and compliance requirements increasingly demand traceability from model predictions back to training data. The catalog must track which datasets were used to train which model version, including the specific time-travel snapshot. Without this lineage, models in regulated industries (banking, healthcare, insurance) cannot pass an audit.</span></li>
<li><b> Data versioning and reproducibility. </b><span style="font-weight: 400;">ML experiments require reproducing exact training conditions. Open table formats with time-travel support (Iceberg, Delta Lake) enable this by letting teams query the lake as it existed at any point in time (see the sketch after this list). The architecture must preserve historical snapshots long enough to support experiment reproducibility, which means retention policies need to account for ML workflows, not just analytics use cases.</span></li>
</ol>
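<p><span style="font-weight: 400;">As a sketch of the versioning and lineage requirements, the snippet below pins a training run to a specific Iceberg snapshot and records that snapshot ID with the model version. Catalog, table, and metadata names are hypothetical:</span></p>
<pre><code class="language-python"># Hedged sketch: pin training data to an Iceberg snapshot for reproducibility.
# Assumes a configured SparkSession and an Iceberg table lake.curated.features.
latest = spark.sql(
    "SELECT snapshot_id FROM lake.curated.features.snapshots "
    "ORDER BY committed_at DESC LIMIT 1"
).first()["snapshot_id"]

train_df = (
    spark.read
    .option("snapshot-id", latest)  # read exactly this table state
    .format("iceberg")
    .load("lake.curated.features")
)

# Store the snapshot id beside the model version in your experiment tracker,
# so predictions remain traceable to training data even after table updates.
model_metadata = {"model_version": "v14", "training_snapshot_id": latest}
</code></pre>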
<p><b>Why this matters: </b><span style="font-weight: 400;">The data lake is increasingly the foundation for AI, not just analytics. Architectures that don&#8217;t account for ML-specific requirements will need expensive retrofitting as AI adoption scales.</span></p>
<h2><b>Data lake governance: Three failure patterns and how to avoid them</b></h2>
<p><span style="font-weight: 400;">One in two </span><a href="https://www.gartner.com/doc/reprints?__hstc=81614408.70ec33dd6327b05fa51c21f8c2df014e.1760896946410.1760896946410.1760896946410.1&amp;__hssc=81614408.1.1760896946410&amp;__hsfp=1159134056&amp;id=1-2LIY0X6L&amp;ct=250724&amp;st=sb&amp;submissionGuid=30131aa2-9f42-443c-ac09-55ae3c2eee6a"><span style="font-weight: 400;">Chief Data and Analytics Officers</span></a><span style="font-weight: 400;"> now considers optimizing the technology landscape a primary responsibility. That urgency exists because governance failures compound faster than most teams expect. Data lakes degrade through three specific patterns.</span></p>
<p><b>Missing metadata. </b><span style="font-weight: 400;">Without a catalog that describes what each dataset contains, who owns it, and when it was last updated, the lake becomes unsearchable. Teams create duplicate copies of the same data rather than finding the authoritative source. Storage costs grow while data utility shrinks.</span></p>
<p><b>Absent ownership. </b><span style="font-weight: 400;">When no team is accountable for a dataset&#8217;s quality, accuracy degrades silently. Stale records, schema drift, and broken pipelines go unnoticed until a downstream report produces wrong numbers. Data mesh principles (domain ownership, data-as-a-product) solve this by assigning clear accountability to the team closest to the data source.</span></p>
<p><b>Deferred governance decisions. </b><span style="font-weight: 400;">The most common mistake is treating governance as a future initiative. Teams plan to add access controls, quality monitoring, and retention policies &#8220;later,&#8221; after the lake is operational. </span></p>
<p><span style="font-weight: 400;">By the time &#8220;later&#8221; arrives, the lake holds terabytes of ungoverned data, and retroactive governance becomes a multi-month remediation project. 25% of data professionals cite legacy systems and technical debt as their single biggest bottleneck. Much of that debt originates from governance decisions that were deferred during the initial build.</span></p>
<p><span style="font-weight: 400;"><div class="post-banner-cta-v2 no-desc js-parent-banner">
<div class="post-banner-wrap post-banner-cta-v2-wrap">
	<div class="post-banner-cta-v2__title-wrap">
		<h2 class="post-banner__title post-banner-cta-v2__title">Govern your data lake before it becomes a data swamp.</h2>
	</div>
<div class="post-banner-cta-v2__button-wrap"><a href="https://xenoss.io" class="post-banner-button xen-button">Talk to Xenoss engineers</a></div>
</div>
</div></span></p>
<h2><b>Bottom line</b></h2>
<p><span style="font-weight: 400;">Data lake architecture is a solved problem in the sense that the design patterns are well understood. Medallion zones, open table formats, and metadata catalogs have been validated across thousands of enterprise deployments. The architecture fails when teams skip the foundational decisions.</span></p>
<p><span style="font-weight: 400;">The practical checklist is short: define your zone structure before ingesting data, select an open table format before building pipelines, and deploy a metadata catalog before granting access. These three decisions, made upfront, prevent the governance drift that turns data lakes into swamps.</span></p>
<p><span style="font-weight: 400;">For teams preparing to serve AI workloads, the architecture needs to go further: feature store integration, unstructured data zones, training data lineage, and experiment-grade versioning. These are not future requirements. With 82% of data professionals already using AI tools daily, they are current ones.</span></p>
<p>The post <a href="https://xenoss.io/blog/data-lake-architecture-design-patterns">Data lake architecture: Design patterns for AI-ready enterprise data infrastructure</a> appeared first on <a href="https://xenoss.io">Xenoss - AI and Data Software Development Company</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Fine-tuning LLMs at scale: Cost optimization strategies</title>
		<link>https://xenoss.io/blog/fine-tuning-llm-cost-optimization</link>
		
		<dc:creator><![CDATA[Vlad Kushka]]></dc:creator>
		<pubDate>Tue, 10 Feb 2026 12:36:54 +0000</pubDate>
				<category><![CDATA[Software architecture & development]]></category>
		<guid isPermaLink="false">https://xenoss.io/?p=13763</guid>

					<description><![CDATA[<p>Fine-tuning a large language model can run anywhere from $300 for a small 2.7B model with LoRA to over $35,000 for full fine-tuning on a 40B+ parameter model. Most engineering teams figure out this cost spectrum the hard way, after blowing past their initial compute budget on the first few training runs. The difference between [&#8230;]</p>
<p>The post <a href="https://xenoss.io/blog/fine-tuning-llm-cost-optimization">Fine-tuning LLMs at scale: Cost optimization strategies</a> appeared first on <a href="https://xenoss.io">Xenoss - AI and Data Software Development Company</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400;">Fine-tuning a large language model can run anywhere from </span><a href="https://learningdaily.dev/what-is-the-cost-of-fine-tuning-llms-f5801c00b06d"><span style="font-weight: 400;">$300 for a small 2.7B model</span></a><span style="font-weight: 400;"> with LoRA to over $35,000 for full fine-tuning on a 40B+ parameter model. Most engineering teams figure out this cost spectrum the hard way, after blowing past their initial compute budget on the first few training runs. The difference between staying on budget and overspending usually traces back to one decision: which fine-tuning technique you pick before writing any training code.</span></p>
<p><span style="font-weight: 400;">This guide breaks down the techniques that keep fine-tuning costs under control: parameter-efficient training methods like LoRA and QLoRA, smarter infrastructure choices, and the MLOps practices that prevent wasted </span><a href="https://xenoss.io/blog/ai-infrastructure-stack-optimization"><span style="font-weight: 400;">GPU</span></a><span style="font-weight: 400;"> hours without sacrificing model quality.</span></p>
<h2><b>Why LLM fine-tuning costs escalate in production</b></h2>
<p><span style="font-weight: 400;">Most enterprises are still transitioning from LLM experimentation to production, </span><a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai"><span style="font-weight: 400;">only about one-third have scaled</span></a><span style="font-weight: 400;"> beyond piloting, and are discovering that fine-tuning costs can spiral quickly. Without deliberate optimization, GPU compute, data preparation, and iteration cycles compound into budgets that exceed initial projections by 2-5x.</span></p>
<p><b>Cost-efficient LLM fine-tuning</b><span style="font-weight: 400;"> typically involves Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA and QLoRA, selecting smaller base models in the 7B-13B parameter range, and using high-quality curated datasets to reduce training time. </span><a href="https://thebiggish.com/news/llm-fine-tuning-shifts-to-peft-methods-as-enterprises-chase-efficiency"><span style="font-weight: 400;">PEFT methods now dominate enterprise LLM adaptation strategies</span></a><span style="font-weight: 400;">, precisely because they cut compute requirements by orders of magnitude compared to full fine-tuning.</span></p>
<h3><b>GPU memory costs for LLM training</b></h3>
<p><a href="https://xenoss.io/capabilities/fine-tuning-llm"><span style="font-weight: 400;">Full fine-tuning</span></a><span style="font-weight: 400;"> loads every model weight into GPU memory at once. A 70B parameter model needs roughly 140GB of VRAM just to hold the weights in FP16 precision, and that&#8217;s before you add optimizer states and gradients. </span></p>
<p><span style="font-weight: 400;">For fine-tuning at FP16, expect around </span><a href="https://www.arsturn.com/blog/ram-vram-for-70b-ai-model-ultimate-guide"><span style="font-weight: 400;">200GB of VRAM</span></a><span style="font-weight: 400;">, which pushes teams toward multi-GPU clusters or cloud instances running H100s at</span><a href="https://www.gmicloud.ai/blog/how-much-does-the-nvidia-h100-gpu-cost-in-2025-buy-vs-rent-analysis"> <span style="font-weight: 400;">$2.50 to $4.50 per GPU-hour</span></a><span style="font-weight: 400;"> depending on the provider.</span></p>
<p><span style="font-weight: 400;">Scaling up model size means scaling up hardware spend, and the jumps aren&#8217;t gradual. Going from a 7B model (which fits on a single 24GB consumer GPU) to a 70B model means jumping from one RTX 4090 to a cluster of two or more H100s. You&#8217;re paying for an entirely different class of infrastructure.</span></p>
<h3><b>Data preparation and quality bottlenecks</b></h3>
<p><a href="https://xenoss.io/blog/total-cost-of-ownership-for-enterprise-ai"><span style="font-weight: 400;">Hidden costs</span></a><span style="font-weight: 400;"> often live in data preparation: cleaning, formatting, annotation, and validation cycles that precede any training run. When your dataset has labeling errors or formatting inconsistencies, you end up re-running training multiple times, each run burning GPU hours without improving the final model.</span></p>
<p><span style="font-weight: 400;">Teams frequently underestimate this phase. A dataset that looks ready for training often reveals formatting inconsistencies, label errors, or distribution imbalances only after the first failed training run, challenges that</span><a href="https://xenoss.io/blog/data-pipeline-best-practices"> <span style="font-weight: 400;">strategic pipeline practices</span></a><span style="font-weight: 400;"> can help mitigate.</span></p>
<h3><b>Experiment tracking and iteration costs</b></h3>
<p><span style="font-weight: 400;">Hyperparameter sweeps, architecture experiments, and A/B testing eat GPU hours fast. Every failed experiment costs money without producing anything you can ship. Teams running dozens of training runs across different learning rates, batch sizes, and LoRA ranks can spend more on experimentation than on the final production training job.</span></p>
<p><span style="font-weight: 400;">Without disciplined experiment tracking, teams end up re-running the same configurations without realizing it. Duplicate experiments are more common than most leads want to admit. Setting up proper logging with tools like </span><a href="https://wandb.ai/site/"><span style="font-weight: 400;">Weights &amp; Biases</span></a><span style="font-weight: 400;"> or MLflow before the first training run pays for itself quickly by preventing wasted reruns.</span></p>
<h3><b>Catastrophic forgetting: Why retraining costs spike</b></h3>
<p><b>Catastrophic forgetting</b><span style="font-weight: 400;"> happens when fine-tuning on a new task erases what the model knew before. A model trained to analyze legal contracts might suddenly struggle with basic questions it handled fine out of the box. The new task knowledge crowds out the original capabilities.</span></p>
<p><span style="font-weight: 400;">When this happens, the fix is often a full retraining cycle from scratch instead of a quick incremental update. For teams that hit this problem repeatedly, retraining costs can balloon well beyond original projections. Techniques like Elastic Weight Consolidation (EWC) and careful learning rate schedules help preserve base model knowledge during fine-tuning, but they require planning upfront.</span></p>
<h2><b>Parameter-efficient fine-tuning: LoRA, QLoRA, and AdaLoRA</b></h2>
<p><span style="font-weight: 400;">PEFT methods freeze most of a model&#8217;s weights and train only a tiny fraction, typically 0.1% to 1% of the total parameters. PEFT techniques reduce memory requirements by </span><a href="https://introl.com/blog/fine-tuning-infrastructure-lora-qlora-peft-scale-guide-2025"><span style="font-weight: 400;">10 to 20x</span></a><span style="font-weight: 400;"> compared to full fine-tuning while retaining 90-95% of the quality. For teams that would otherwise need multi-GPU clusters, that tradeoff changes the economics entirely.</span></p>
<h3><b>LoRA fine-tuning: How it works</b></h3>
<p><b>Low-Rank Adaptation (LoRA)</b><span style="font-weight: 400;"> works by injecting small, trainable low-rank matrices into transformer layers while keeping the original model weights frozen. Instead of updating a weight matrix W directly, you add BA, where B and A are much smaller matrices with a low rank (typically 8 to 64).</span></p>
<p><span style="font-weight: 400;">When you pick the </span><a href="https://thinkingmachines.ai/blog/lora/"><span style="font-weight: 400;">right learning rate</span></a><span style="font-weight: 400;"> for each setting, LoRA training progresses almost identically to full fine-tuning across Llama 3 and Qwen3 models. The typical result would be that you train 0.1% of the parameters and get </span><a href="https://michielh.medium.com/lora-fine-tuning-for-dummmies-4af64f096b4d"><span style="font-weight: 400;">95-99% of full fine-tuning</span></a><span style="font-weight: 400;"> performance.</span></p>
<p><span style="font-weight: 400;">The infrastructure savings are substantial. A 7B model that needs </span><a href="https://introl.com/blog/fine-tuning-infrastructure-lora-qlora-peft-scale-guide-2025"><span style="font-weight: 400;">100-120GB VRAM</span></a><span style="font-weight: 400;"> for full fine-tuning can run on a single 24GB RTX 4090 with LoRA. Training time drops proportionally. And because LoRA produces small adapter files (typically 10-100MB rather than gigabytes), you can version them in Git, store dozens of task-specific adapters cheaply, and swap between them at inference time without reloading the base model.</span></p>
<h3><b>QLoRA: Fine-tuning on consumer GPUs</b></h3>
<p><b>QLoRA</b><span style="font-weight: 400;"> takes LoRA further by quantizing the base model to 4-bit precision while keeping the LoRA adapters in higher precision (typically 16-bit). The frozen weights compress to roughly 25% of their original size, but gradients still flow through them during training.</span></p>
<p><span style="font-weight: 400;">QLoRA used only </span><a href="https://medium.com/@birla2006/llm-fine-tuning-showdown-full-fine-tuning-vs-lora-vs-qlora-which-method-should-you-choose-b876c76ab86e"><span style="font-weight: 400;">17% of A100 GPU</span></a><span style="font-weight: 400;"> memory compared to full fine-tuning while actually outperforming standard LoRA on accuracy (94.48% vs 93.79%). The 4-bit quantization appears to act as a form of regularization.</span></p>
<p><span style="font-weight: 400;">This technique opened fine-tuning to teams without enterprise-grade hardware budgets, </span><a href="https://arxiv.org/abs/2509.12229"><span style="font-weight: 400;">proven feasible on 8GB VRAM GPUs</span></a><span style="font-weight: 400;">, demonstrating that consumer GPUs can handle parameter-efficient training for models up to 1.5B parameters. </span></p>
<p><span style="font-weight: 400;">For larger models, a single RTX 4090 ($1,500) can fine-tune a </span><a href="https://introl.com/blog/fine-tuning-infrastructure-lora-qlora-peft-scale-guide-2025"><span style="font-weight: 400;">7B model</span></a><span style="font-weight: 400;"> that would otherwise require roughly $50,000 in H100 hardware. With tools like </span><a href="https://unsloth.ai/"><span style="font-weight: 400;">Unsloth</span></a><span style="font-weight: 400;">, teams can fine-tune </span><a href="https://medium.com/@matteo28/qlora-fine-tuning-with-unsloth-a-complete-guide-8652c9c7edb3"><span style="font-weight: 400;">3B parameter</span></a><span style="font-weight: 400;"> models on 8GB cards by combining QLoRA with gradient checkpointing and 8-bit optimizers.</span></p>
<h3><b>Adaptive Low-Rank Adaptation for variable budgets</b></h3>
<p><b>AdaLoRA</b><span style="font-weight: 400;"> builds on LoRA by dynamically allocating the parameter budget across layers based on their importance during training. The underlying insight is that not all transformer layers contribute equally to task-specific adaptation.</span> <span style="font-weight: 400;">Top layers (</span><a href="https://arxiv.org/abs/2303.10512"><span style="font-weight: 400;">10, 11, 12 in a 12-layer model</span></a><span style="font-weight: 400;">) often matter more for fine-tuning than bottom layers. </span></p>
<p><span style="font-weight: 400;">AdaLoRA uses singular value decomposition to score each layer&#8217;s importance and prunes low-value parameters automatically, concentrating capacity where it drives the most improvement.</span></p>
<p><span style="font-weight: 400;">AdaLoRA proves most valuable when you&#8217;re working with tight parameter budgets on complex tasks. For teams experimenting with different rank configurations or running hyperparameter sweeps, AdaLoRA removes one variable from the search space by handling rank allocation automatically. The </span><a href="https://arxiv.org/abs/2409.10673"><span style="font-weight: 400;">sensitivity-based importance scoring</span></a><span style="font-weight: 400;"> works, though simpler magnitude-based approaches can match performance in some cases.</span></p>

<table id="tablepress-153" class="tablepress tablepress-id-153">
<thead>
<tr class="row-1">
	<th class="column-1">Method</th><th class="column-2">Memory reduction</th><th class="column-3">Training speed</th><th class="column-4">Best use sase</th>
</tr>
</thead>
<tbody class="row-striping row-hover">
<tr class="row-2">
	<td class="column-1">LoRA</td><td class="column-2">~90%</td><td class="column-3">Fast</td><td class="column-4">General-purpose fine-tuning</td>
</tr>
<tr class="row-3">
	<td class="column-1">QLoRA</td><td class="column-2">~95%</td><td class="column-3">Moderate</td><td class="column-4">Memory-constrained environments</td>
</tr>
<tr class="row-4">
	<td class="column-1">AdaLoRA</td><td class="column-2">~90% (variable)</td><td class="column-3">Moderate</td><td class="column-4">Complex tasks requiring dynamic allocation</td>
</tr>
</tbody>
</table>
<p><span style="font-weight: 400;"><div class="post-banner-cta-v1 js-parent-banner">
<div class="post-banner-wrap">
<h2 class="post-banner__title post-banner-cta-v1__title">Reduce your fine-tuning costs by 90% without sacrificing model quality</h2>
<p class="post-banner-cta-v1__content">Xenoss engineers build production-grade fine-tuning pipelines using LoRA, QLoRA, and optimized infrastructure</p>
<div class="post-banner-cta-v1__button-wrap"><a href="https://xenoss.io/#contact" class="post-banner-button xen-button post-banner-cta-v1__button">Get a cost assessment</a></div>
</div>
</div> </span></p>
<h2><b>Distributed training architectures for large models</b></h2>
<p><span style="font-weight: 400;">When models exceed single-GPU memory capacity, distributed training becomes necessary.</span> <span style="font-weight: 400;">Memory constraints become the </span><a href="https://www.preprints.org/manuscript/202512.2207/v1/download"><span style="font-weight: 400;">primary limiting factor</span></a><span style="font-weight: 400;"> when scaling to models with hundreds of billions of parameters. The complexity increases, but modern frameworks like </span><a href="https://github.com/deepspeedai/DeepSpeed"><span style="font-weight: 400;">DeepSpeed</span></a><span style="font-weight: 400;"> and </span><a href="https://docs.pytorch.org/docs/stable/fsdp.html"><span style="font-weight: 400;">PyTorch FSDP</span></a><span style="font-weight: 400;"> have made distributed training accessible to teams without specialized infrastructure expertise.</span></p>
<h3><b>Data parallelism and gradient accumulation</b></h3>
<p><span style="font-weight: 400;">Data parallelism replicates the entire model across multiple GPUs and splits data batches among them. While pure data parallelism is </span><a href="https://www.sciencedirect.com/science/article/pii/S2949719125000500"><span style="font-weight: 400;">memory-intensive</span></a><span style="font-weight: 400;"> (each GPU needs the full model), techniques like</span><a href="https://www.deepspeed.ai/training/"> <span style="font-weight: 400;">DeepSpeed&#8217;s ZeRO optimizer</span></a><span style="font-weight: 400;"> reduce memory consumption by up to 8x by partitioning optimizer states and gradients instead of replicating them.</span></p>
<p><span style="font-weight: 400;">Gradient accumulation simulates larger batch sizes without additional GPUs by accumulating gradients over several smaller batches before updating weights. Accumulating over K batches </span><a href="https://syhya.github.io/posts/2025-03-01-train-llm/"><span style="font-weight: 400;">reduces synchronization</span></a><span style="font-weight: 400;"> frequency (since you only run all-reduce once per K batches), which cuts communication overhead significantly. A team with 4 GPUs can achieve the effective batch size of 16 GPUs by accumulating across 4 forward passes, though the reduced update frequency may slow convergence slightly.</span></p>
<h3><b>Model parallelism for 70B+ parameter models</b></h3>
<p><span style="font-weight: 400;">Model parallelism splits the model itself across GPUs when the full model cannot fit on a single device. There are two main approaches: pipeline parallelism (splitting by layers, with each GPU handling a segment of the network) and tensor parallelism (splitting individual layers across GPUs).</span><a href="https://engineering.fb.com/2025/10/17/ai-research/scaling-llm-inference-innovations-tensor-parallelism-context-parallelism-expert-parallelism/"><span style="font-weight: 400;"> </span></a></p>
<p><a href="https://engineering.fb.com/2025/10/17/ai-research/scaling-llm-inference-innovations-tensor-parallelism-context-parallelism-expert-parallelism/"><span style="font-weight: 400;">Meta&#8217;s engineering team notes</span></a><span style="font-weight: 400;"> that tensor parallelism improves both model fitting and throughput by sharding attention blocks and MLP layers into smaller blocks executed on different devices. For Llama 3 70B, Meta used 2,000 GPUs with multi-dimensional parallelism combining both approaches.</span></p>
<p><span style="font-weight: 400;">The tradeoff is increased communication overhead between GPUs. Data flows sequentially through layers on different devices, creating potential bottlenecks. Careful optimization of layer placement and communication patterns can minimize this overhead.</span></p>
<h3><b>Mixed precision training: FP16 and BF16</b></h3>
<p><span style="font-weight: 400;">Mixed precision uses FP16 or BF16 for most operations while maintaining FP32 for critical calculations like loss scaling. Memory usage drops by roughly half, and training speed increases significantly on modern GPUs with tensor cores.</span></p>
<p><span style="font-weight: 400;">Most frameworks now support mixed precision with minimal code changes. PyTorch&#8217;s automatic mixed precision (AMP) handles the complexity of deciding which operations run in which precision.</span></p>
<h2><b>Infrastructure strategies for scalable training</b></h2>
<p><a href="https://xenoss.io/blog/ai-infrastructure-stack-optimization"><span style="font-weight: 400;">Infrastructure decisions</span></a><span style="font-weight: 400;"> act as multipliers on training costs. For example, </span><a href="https://intuitionlabs.ai/articles/h100-rental-prices-cloud-comparison"><span style="font-weight: 400;">H100 prices dropped</span></a><span style="font-weight: 400;"> from $8/hour at launch to $2.85-3.50/hour in late 2025, with AWS cutting P5 instance pricing by 44% in June 2025 alone. Teams that locked into high-rate contracts early paid significantly more than those who waited for the market to stabilize. </span></p>
<ul>
<li style="font-weight: 400;" aria-level="1"><b>GPU selection:</b><span style="font-weight: 400;"> A100/H100 GPUs offer high memory bandwidth for large models, while L4/T4 instances provide better cost-per-performance for smaller models and QLoRA workflows.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Spot instances:</b><span style="font-weight: 400;"> Cloud providers offer 60-90% discounts on interruptible compute. Effective use requires fault-tolerant training with frequent checkpointing to resume after interruptions.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Right-sizing:</b><span style="font-weight: 400;"> Matching GPU count and memory to model parameters prevents both over-provisioning (wasted spend) and under-provisioning (training failures and delays).</span></li>
</ul>
<p><span style="font-weight: 400;">The build-vs-buy decision depends on utilization rate, capital availability, and scaling flexibility.</span> <span style="font-weight: 400;">For </span><a href="https://docs.jarvislabs.ai/blog/h100-price"><span style="font-weight: 400;">one-time training runs</span></a><span style="font-weight: 400;"> or infrequent model updates, cloud compute is up to 12x more cost-effective than hardware purchase. </span></p>
<p><span style="font-weight: 400;">Teams with consistent high utilization (40+ hours/week) often find on-premises infrastructure more economical over 2-3 year horizons, while teams with variable workloads benefit from cloud elasticity. With H100 retail prices around $25,000-30,000 per unit, the break-even calculation requires careful utilization forecasting.</span></p>
<h2><b>Model compression for LLM inference costs</b></h2>
<p><span style="font-weight: 400;">Training is often a one-time cost, but inference runs continuously. At scale, inference costs frequently exceed training costs within months of deployment.</span></p>
<h3><b>Post-training quantization: GPTQ and AWQ</b></h3>
<p><span style="font-weight: 400;">Quantization reduces the numerical precision of model weights from FP32 or FP16 down to INT8 or INT4.</span> <span style="font-weight: 400;">Using 4-bit integer weights yields an </span><a href="https://aws.amazon.com/blogs/machine-learning/accelerating-llm-inference-with-post-training-weight-and-activation-using-awq-and-gptq-on-amazon-sagemaker-ai/"><span style="font-weight: 400;">8x reduction </span></a><span style="font-weight: 400;">in weight memory compared to FP32 (4x compared to FP16). Model size shrinks, inference speeds up, and the accuracy tradeoff depends heavily on the quantization method and calibration approach.</span></p>
<p><span style="font-weight: 400;">GPTQ and AWQ have emerged as the leading approaches for 4-bit quantization.</span> <span style="font-weight: 400;">GPTQ uses layer-wise </span><a href="https://docs.jarvislabs.ai/blog/vllm-quantization-complete-guide-benchmarks"><span style="font-weight: 400;">Hessian-based optimization</span></a><span style="font-weight: 400;"> to minimize output error, while AWQ identifies &#8220;salient&#8221; weights (roughly 1% of total) that carry the most important information and protects them during quantization.</span></p>
<h3><b>Knowledge distillation to smaller models</b></h3>
<p><span style="font-weight: 400;">Knowledge distillation trains a smaller &#8220;student&#8221; model to mimic a larger &#8220;teacher&#8221; model&#8217;s outputs. The student can be 10x smaller while retaining most of the teacher&#8217;s performance on specific tasks.</span></p>
<p><span style="font-weight: 400;">This dramatically reduces inference costs for production deployment. A 7B student model serving the same queries as a 70B teacher uses roughly 10x less compute per request.</span></p>
<p><em><b>Tip:</b><span style="font-weight: 400;"> Consider distillation early in your fine-tuning workflow. Training a student model alongside your primary fine-tuning run adds minimal overhead but creates a cost-efficient deployment option.</span></em></p>
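<p><span style="font-weight: 400;">The core of distillation is the loss function: the student matches the teacher&#8217;s softened output distribution alongside the ground-truth labels. A minimal sketch with illustrative temperature and weighting:</span></p>
<pre><code class="language-python"># Distillation loss sketch: soft targets from the teacher plus hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # the usual T^2 factor keeps gradient magnitudes comparable
    # Hard targets: standard cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
</code></pre>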
<h2><b>Continuous learning systems to avoid retraining costs</b></h2>
<p><span style="font-weight: 400;">Continuous learning systems prevent the costly &#8220;throw it away and start over&#8221; model update pattern that many teams fall into by default. Models left unchanged for 6+ months saw error rates jump </span><a href="https://www.rohan-paul.com/p/ml-interview-q-series-handling-llm"><span style="font-weight: 400;">35%</span></a><span style="font-weight: 400;"> on new data, creating pressure to retrain frequently. Continuous learning offers an alternative: incremental updates that preserve existing capabilities while adding new ones.</span></p>
<h3><b>Elastic Weight Consolidation for knowledge preservation</b></h3>
<p><span style="font-weight: 400;">Elastic Weight Consolidation (EWC) penalizes changes to weights identified as important for previous tasks. The model can learn new information incrementally without overwriting foundational knowledge.</span></p>
<p><span style="font-weight: 400;">This avoids full retraining cycles when adding new capabilities. EWC </span><a href="https://arxiv.org/html/2505.05946v1"><span style="font-weight: 400;">applied to full parameter </span></a><span style="font-weight: 400;">sets of Gemma2, successfully adding Lithuanian language capabilities while mitigating catastrophic forgetting of English performance across seven language understanding benchmarks. </span></p>
<p><span style="font-weight: 400;">The approach works for domain-specific fine-tuning too: a model trained for customer support can later learn product documentation tasks without losing its ability to handle support queries.</span></p>
<h3><b>Drift detection and automated retraining triggers</b></h3>
<p><span style="font-weight: 400;">Model drift occurs when performance degrades as real-world data distributions shift over time. A model trained on 2024 customer queries may perform poorly on 2025 queries as language patterns and topics evolve.</span></p>
<p><span style="font-weight: 400;">Continuous monitoring with threshold-based alerts triggers retraining only when necessary. This approach prevents both unnecessary retraining on arbitrary schedules and undetected performance degradation that erodes user trust.</span></p>
<h2><b>MLOps for LLM fine-tuning: Cost control practices</b></h2>
<p><span style="font-weight: 400;">MLOps provides operational discipline to prevent cost wasteMLOps provides operational discipline to prevent</span><a href="https://xenoss.io/blog/data-tool-sprawl"> <span style="font-weight: 400;">cost waste</span></a><span style="font-weight: 400;"> through visibility, automation, and reproducibility.</span></p>
<ul>
<li style="font-weight: 400;" aria-level="1"><b>Experiment tracking:</b><span style="font-weight: 400;"> Tools like MLflow and Weights &amp; Biases log every experiment with cost metadata, enabling cost-per-experiment analysis and identification of inefficient patterns.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Model versioning:</b><span style="font-weight: 400;"> Registries enable quick rollback to stable versions, avoiding wasted debugging time on faulty deployments.</span></li>
<li style="font-weight: 400;" aria-level="1"><b>Cost monitoring:</b><span style="font-weight: 400;"> Integration with cloud cost management tools provides real-time spending visibility with anomaly detection and budget alerts.</span></li>
</ul>
<h2><b>Building production-ready fine-tuning pipelines</b></h2>
<p><span style="font-weight: 400;">An effective end-to-end workflow synthesizes PEFT methods for training efficiency, distributed architectures for scale, compression for inference costs, and MLOps for operational control. Each component reinforces the others, experiment tracking identifies which PEFT configurations work best, while cost monitoring validates that infrastructure choices deliver expected savings.</span></p>
<p><span style="font-weight: 400;">For enterprises seeking to reduce fine-tuning costs while maintaining production reliability, Xenoss engineers bring experience building pipelines that preserve foundational model knowledge while cutting GPU costs significantly.</span></p>
<p><a href="https://xenoss.io/#contact"><span style="font-weight: 400;">Book a consultation</span></a><span style="font-weight: 400;"> to discuss your specific requirements.</span></p>
<p>The post <a href="https://xenoss.io/blog/fine-tuning-llm-cost-optimization">Fine-tuning LLMs at scale: Cost optimization strategies</a> appeared first on <a href="https://xenoss.io">Xenoss - AI and Data Software Development Company</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Best practices for architecting data pipelines in AdTech</title>
		<link>https://xenoss.io/blog/data-pipeline-best-practices-for-adtech-industry</link>
		
		<dc:creator><![CDATA[Vlad Kushka]]></dc:creator>
		<pubDate>Thu, 20 Jun 2024 11:43:33 +0000</pubDate>
				<category><![CDATA[Data engineering]]></category>
		<guid isPermaLink="false">https://xenoss.io/?p=6825</guid>

					<description><![CDATA[<p>Managing data in digital advertising brings a whole other dimension of difficulty compared to other industries. AdTech companies have to maintain ultra-low latency and extremely high processing speeds to accommodate the flood of real-time data streaming in from all ecosystem partners.  The consequences of delayed or incomplete data are high for AdTech: poor attribution, skewed reporting, lost auctions, [&#8230;]</p>
<p>The post <a href="https://xenoss.io/blog/data-pipeline-best-practices-for-adtech-industry">Best practices for architecting data pipelines in AdTech</a> appeared first on <a href="https://xenoss.io">Xenoss - AI and Data Software Development Company</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Managing data in digital advertising brings a whole other dimension of difficulty compared to other industries. AdTech companies have to maintain ultra-low latency and extremely high processing speeds to accommodate the flood of real-time data streaming in from all ecosystem partners. </p>



<p>The consequences of delayed or incomplete data are high for AdTech: poor attribution, skewed reporting, lost auctions, frustrated customers, and reduced revenue. </p>



<p>Every data engineer will tell you that building data pipelines is a tough, time-consuming, and costly process, especially in the AdTech industry, where we&#8217;re dealing with massive volumes of asynchronous, event-driven data. </p>



<p>In this post, we’ll talk about:</p>



<ul>
<li>The complexities of big data in AdTech </li>



<li>Emerging trends and new approaches to data pipeline architecture</li>



<li>System design and development best practices from industry leaders </li>
</ul>



<h2 class="wp-block-heading">Why data in AdTech is complex </h2>



<p>The volume, variety, and velocity of data in AdTech are humongous. TripleLift’s programmatic ad platform, for example, <a href="https://www.datacouncil.ai/talks/the-highs-and-lows-of-building-an-adtech-data-pipeline" target="_blank" rel="noreferrer noopener">processes</a> over 4 billion ad requests and over 140 billion bid requests per day, which translates to 13 million unique aggregate rows in its databases per hour and over 36 GB of new data added to its Apache Druid storage. </p>



<p>The second key characteristic of data in AdTech is its wide variety. The industry processes petabytes of structured and unstructured data generated from user behavior, ad engagement, programmatic ad auctions, and private data exchanges, among other elements in the chain. </p>



<p>In each case, the incoming data can have multiple dimensions. For one ad impression, you need to track multiple parameters like “time window,” “geolocation,” “user ID,” etc. Combined, these parameters create specific measures—analytics on specific events such as click-through rate (CTR), conversion, viewability, revenue, etc. </p>



<p>These events are often distributed in time and happen millions of times per day. In other words, your big data pipeline architecture needs to be designed to process asynchronous data at scale. </p>
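<p>To make the dimensions-versus-measures idea concrete, here is a minimal Python sketch (not from any production system) that rolls invented raw events up into a CTR measure per geo/hour dimension pair; all field names and values are illustrative assumptions:</p>

<pre class="wp-block-code"><code>from collections import defaultdict

# Illustrative raw events; the field names are hypothetical
events = [
    {"geo": "US", "hour": 14, "type": "impression"},
    {"geo": "US", "hour": 14, "type": "click"},
    {"geo": "DE", "hour": 14, "type": "impression"},
]

# Group by the dimensions and count each event type
counts = defaultdict(lambda: {"impression": 0, "click": 0})
for e in events:
    counts[(e["geo"], e["hour"])][e["type"]] += 1

# Derive the measure (CTR) from the grouped counts
for dims, c in counts.items():
    ctr = c["click"] / c["impression"] if c["impression"] else 0.0
    print(dims, f"CTR={ctr:.2%}")
</code></pre>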



<p>Large data volumes also translate to high data storage costs. As big data volumes continue to increase, permanent retention of all information becomes simply unfeasible. So AdTech companies also face the tough choice of <a href="https://xenoss.io/blog/infrastructure-optimization">optimizing their data storage infrastructure</a> to balance data retention against operating costs. </p>



<h2 class="wp-block-heading">Trends in data pipeline architecture for AdTech </h2>



<p>ETL/ELT pipelines have been around since the early days of data analytics. Although many of the best practices in conceptual design are still applicable today, major advances in database design and cloud computing have changed the game. </p>



<p>Over <a href="https://dzone.com/storage/assets/17234502-dz-tr-data-pipelines-2023.pdf" target="_blank" rel="noreferrer noopener">66% of companies</a> use cloud-based data pipelines and data storage solutions, with a third using a combination of both. Cloud-native ETL tools have greater scalability potential and support a broader selection of data sources. Serverless solutions also remove the burden of infrastructure management. </p>





<p>Real-time data processing is also progressively replacing standard batch ingestion. Distributed stream-processing platforms like <a href="https://kafka.apache.org/" target="_blank" rel="noreferrer noopener">Apache Kafka</a> and <a href="https://aws.amazon.com/kinesis/" target="_blank" rel="noreferrer noopener">Amazon Kinesis</a> enable continuous data collection from a firehose of data sources in a standard message format. Data is then uploaded to cloud object stores (data lakes) and made available to query engines.</p>
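<p>As a hedged illustration of that pattern, the sketch below publishes a JSON event with the kafka-python client; the broker address, topic name, and event fields are assumptions made for the example:</p>

<pre class="wp-block-code"><code>import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",          # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical bid-request event in a standard JSON message format
event = {"auction_id": "a-123", "ts": 1718880000, "geo": "US", "bid_floor": 0.25}
producer.send("bid-requests", event)             # assumed topic name
producer.flush()                                 # block until delivery
</code></pre>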





<p><a href="https://pubmatic.com/" target="_blank" rel="noreferrer noopener">PubMatic</a>, for example, uses Kinetica to enable blazing-fast data ingestion, storage, and processing for its real-time reporting and ad-pacing engine. Thanks to data streaming architecture, Pubmatic can process over a trillion ad impressions monthly with high speed and accuracy. </p>



<p>That said, because most of the information arrives as events, AdTech companies often rely on a combination of real-time and batch data processing. For example, streaming data on ad viewability can be consumed immediately and then enriched with batch data on past inventory performance. </p>





<p>As the data infrastructure expands, AdTech teams also concentrate more effort on data observability and infrastructure monitoring to eliminate costly downtime and expensive pipeline repairs.</p>
<p>To dive deeper into the current trends in AdTech, we&#8217;ve invited Charles Proctor, MarTech Architect at CPMartec and EnquiryLab, to share his insights on real-time processing, AI advancements, cloud solutions, and essential data governance practices. Here&#8217;s what he had to say:</p>
<p><iframe title="Charles Proctor, MarTech Architect at CPMartec, EnquiryLab, on upcoming data pipeline advancements" width="500" height="281" src="https://www.youtube.com/embed/kL4LtAV4kac?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe></p>



<h5><span style="font-weight: 400;"><a href="https://www.linkedin.com/in/charles-proctor-marketingauto/">Charles Proctor</a>, MarTech Architect at<a href="https://www.linkedin.com/company/cpmartec/"> CPMartec</a> and EnquiryLab, on upcoming data pipeline advancements and their  impact on AdTech</span></h5>



<h2 class="wp-block-heading">AdTech data pipeline development: Best practices and recommended technologies  </h2>



<p>A data pipeline is a sequence of steps: ingestion, processing, storage, and access. Each of these steps can (and should) be well-architected and optimized for the highest performance levels. </p>



<p>Xenoss <a href="https://xenoss.io/big-data-solution-development">big data engineers</a> put the types of data pipeline architectures in AdTech under the microscope. We evaluated the strengths and weaknesses of the architecture design patterns and toolkits industry leaders use for AdTech analytics and reporting.</p>
<p>Our analysis extended to both the logical and platform levels, providing a comprehensive understanding of the data processing ecosystem. The logical design describes how data is processed and transformed from source to target, ensuring consistent data transformation across environments. The platform design, in contrast, focuses on the specific implementation and tooling required by each environment, whether it&#8217;s GCP, Azure, or AWS. While each platform offers a unique set of tools for data transformation, the goal of the logical design remains the same: efficient and effective data transformation regardless of the provider.</p>



<h3 class="wp-block-heading">Data ingestion </h3>



<p>AdTech data originates from multiple sources — <a href="https://xenoss.io/dsp-demand-supply-platform-development">DSP</a> and <a href="https://xenoss.io/ssp-supply-side-platform-development">SSP partners</a>, <a href="https://xenoss.io/customer-data-platform-development">customer data platforms (CDP)</a>, or even <a href="https://xenoss.io/dooh-advertising-platform-development">DOOH devices</a>. To extract data from a source, you need to make API calls, query the database, or process log files. </p>



<p>The challenge, however, is that in AdTech, you need to ingest multiple streams into the pipeline simultaneously — and that’s no small task. </p>



<p><a href="https://triplelift.com/" target="_blank" rel="noreferrer noopener">TripleLift</a>, for example, needed <a href="http://highscalability.com/blog/2020/6/15/how-triplelift-built-an-adtech-data-pipeline-processing-bill.html" target="_blank" rel="noreferrer noopener">its data pipelines</a> to handle:  </p>



<ul>
<li>Up to 30 billion event logs per day </li>



<li>Normalized aggregation of 75 dimensions and 55 metrics</li>



<li>Over 15 hourly jobs for ingesting and aggregating data into BI tools </li>
</ul>



<p>And all of the above has to happen in a cost-effective manner, with data delivery staying within expected customer SLAs. </p>



<p>The TripleLift team organized all incoming event data streams into 50+ Kafka topics. Events are consumed by Secor (an open-source consumer from Pinterest) and written to AWS S3 in Parquet format. TripleLift uses Apache Airflow to schedule batch jobs and manage dependencies for aggregating data into its data stores and subsequently exposing it to different reporting tools. </p>
<figure id="attachment_10840" aria-describedby="caption-attachment-10840" style="width: 1575px" class="wp-caption alignnone"><img decoding="async" class="size-full wp-image-10840" title="Final TripleLift’s architecture, after resolving scaling issues, replacing VoltDB and implementing Apache Airflow" src="https://xenoss.io/wp-content/uploads/2024/06/3-1.jpg" alt="Final TripleLift’s architecture, after resolving scaling issues, replacing VoltDB and implementing Apache Airflow" width="1575" height="1232" srcset="https://xenoss.io/wp-content/uploads/2024/06/3-1.jpg 1575w, https://xenoss.io/wp-content/uploads/2024/06/3-1-300x235.jpg 300w, https://xenoss.io/wp-content/uploads/2024/06/3-1-1024x801.jpg 1024w, https://xenoss.io/wp-content/uploads/2024/06/3-1-768x601.jpg 768w, https://xenoss.io/wp-content/uploads/2024/06/3-1-1536x1201.jpg 1536w, https://xenoss.io/wp-content/uploads/2024/06/3-1-332x260.jpg 332w" sizes="(max-width: 1575px) 100vw, 1575px" /><figcaption id="caption-attachment-10840" class="wp-caption-text">Final TripleLift’s architecture, after resolving scaling issues by replacing VoltDB and implementing Apache Airflow</figcaption></figure>





<p>Aggregation tasks are done with Apache Spark on Databricks clusters. The data is denormalized into wide tables by joining raw event logs to paint a complete picture of what happened before, during, and after an auction. Denormalized logs are stored in Amazon S3.</p>
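<p>TripleLift&#8217;s actual jobs aren&#8217;t public, but a rough PySpark sketch of such a denormalization step could look like this; the S3 paths, table layout, and join key are hypothetical:</p>

<pre class="wp-block-code"><code>from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("denormalize-auction-logs").getOrCreate()

# Hypothetical raw event logs landed in S3 as Parquet
bids = spark.read.parquet("s3://example-bucket/raw/bid_requests/")
imps = spark.read.parquet("s3://example-bucket/raw/impressions/")
clicks = spark.read.parquet("s3://example-bucket/raw/clicks/")

# Join the event streams into one wide table keyed by auction
# (assumes the three datasets share only the auction_id column name)
wide = (
    bids.join(imps, "auction_id", "left")
        .join(clicks, "auction_id", "left")
)
wide.write.mode("overwrite").parquet("s3://example-bucket/denormalized/auctions/")
</code></pre>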





<p>In such a setup, Kafka helps make the required data streams available to different consumers simultaneously. Thanks to horizontal scaling, you can also maintain high throughput even for extra-large data volumes. You can also configure different retention policies for different Kafka topics to <a href="https://xenoss.io/blog/infrastructure-optimization">optimize cloud infrastructure costs</a>.</p>



<p>Thanks to in-memory data processing, Apache Spark can perform data aggregation tasks at blazing speeds. It’s also a highly versatile tool, supporting multiple file formats, such as Parquet, Avro, JSON, and CSV, which makes it great for handling different data sources.</p>



<p>PubMatic also <a href="https://pubmatic.com/blog/realtime-streaming-ingestion-at-scale/" target="_blank" rel="noreferrer noopener">relies on Apache Spark</a> as the main technology for its data ingestion model. The team opted for Spark Structured Streaming—a fault-tolerant stream processing engine built on the Spark SQL engine—and flatMap to transform its datasets. In PubMatic&#8217;s case, flatMap delivered 25% better performance than mapPartitions (another popular primitive for distributed data transformations). With the new data ingestion module, PubMatic can process 1.5x to 2x more data with the same number of resources.</p>
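<p>PubMatic&#8217;s code isn&#8217;t published, but the flatMap-versus-mapPartitions comparison can be illustrated on a plain Spark RDD; the input records below are invented:</p>

<pre class="wp-block-code"><code>import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flatmap-demo").getOrCreate()
sc = spark.sparkContext

# Each raw record may contain zero or more nested events
raw = sc.parallelize(['{"events": [1, 2]}', '{"events": []}', '{"events": [3]}'])

# flatMap: one input record yields zero or more output records
events = raw.flatMap(lambda line: json.loads(line)["events"])

# mapPartitions: transform a whole partition with a single iterator pass
events_alt = raw.mapPartitions(
    lambda lines: (e for line in lines for e in json.loads(line)["events"])
)

print(events.collect())      # [1, 2, 3]
print(events_alt.collect())  # [1, 2, 3]
</code></pre>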



<h4 class="wp-block-heading"><strong>Recommended technologies for data ingestion:  </strong></h4>



<ul>
<li><a href="https://kafka.apache.org/">Apache Kafka</a>: An open-source distributed event streaming platform. Kafka&#8217;s high throughput and fault tolerance make it suitable for capturing and processing large volumes of ad impressions and user interactions in real-time, enabling immediate processing and analysis.</li>
<li><a href="https://aws.amazon.com/kinesis/">Amazon Kinesis</a>: A managed framework for real-time video and data streams.A strong choice for AWS users, providing managed, scalable real-time processing with seamless integration into the AWS ecosystem. Kinesis facilitates low-latency data processing and high availability, making it effective for real-time analytics in AdTech environments.</li>
<li><a href="https://flume.apache.org/">Apache Flume</a>:  An open-source data ingestion tool for collection, aggregation, and transportation of log data. Specialized for log data, Flume can be effective in environments requiring robust log data collection and integration with Hadoop for further analysis.</li>
</ul>






<h3 class="wp-block-heading">Data processing</h3>



<p>Ingested AdTech data must then be brought into an analytics-ready state. Depending on your setup, you may automate the following steps (a minimal PySpark sketch follows the list): </p>



<ul>
<li>Schema application</li>



<li>Deduplication</li>



<li>Aggregation  </li>



<li>Filtering</li>



<li>Enriching </li>



<li>Splitting </li>
</ul>
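<p>Here is the promised minimal PySpark sketch chaining several of those steps together; every path, column name, and join key is an assumption made for illustration:</p>

<pre class="wp-block-code"><code>from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adtech-transform").getOrCreate()

raw = spark.read.json("s3://example-bucket/landing/events/")   # schema inferred on read
deduped = raw.dropDuplicates(["event_id"])                     # deduplication
valid = deduped.filter(F.col("event_type").isin("impression", "click"))  # filtering

geo = spark.read.parquet("s3://example-bucket/dim/geo/")       # enrichment source
enriched = valid.join(geo, "ip_prefix", "left")                # enriching

# Splitting: route event types to separate output locations
enriched.write.mode("append").partitionBy("event_type") \
        .parquet("s3://example-bucket/clean/events/")
</code></pre>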



<p>The problem? Data transformation can be complex and expensive if you use outdated ETL technology. </p>



<p>Take it from <a href="https://www.appsflyer.com/" target="_blank" rel="noreferrer noopener">AppsFlyer</a>, whose attribution SDK is installed on 95% of mobile devices worldwide. The company collected ample data, but operationalizing it was an uphill battle. </p>



<p>Originally, AppsFlyer built an in-house ETL tool to channel event data from Kafka to a BigQuery warehouse. Yet, as <a href="https://www.linkedin.com/in/avnerlivne/" target="_blank" rel="noopener">Avner Livne</a>, AppsFlyer Real-Time Application (RTA) Group Lead, <a href="https://www.upsolver.com/case-studies/appsflyer" target="_blank" rel="noreferrer noopener">explained</a>: <em>“Data transformation was very hard. Schema changes were very hard. While [the system] was functional, everything required a lot of attention and engineering”</em>. In fact, one analytics use case cost AppsFlyer over $3,000 per day on BigQuery and over $1.1 million annually.</p>



<p>The team adopted Upsolver, a cloud-native data pipeline development platform, to improve its data ingestion and transformation capabilities. Once the necessary transformations have been performed on the S3 data, Upsolver&#8217;s visual IDE and SQL make the data query-ready via the AWS Glue Data Catalog.</p>





<p>Upsolver’s engine proved to be more cost-effective than the in-house ETL tool. AppsFlyer also substantially improved its visibility into stream log records, which allowed the company to reduce the size of created tables, leading to further cost savings. </p>



<p>At Xenoss, we also frequently see cases where clients’ infrastructure costs spiral out of control—and we specialize in getting them back on track. Among other projects, our team helped the programmatic ad marketplace <a href="https://www.powerlinks.com/">PowerLinks</a> reduce its infrastructure costs from <a href="https://xenoss.io/cases/cutting-infrastructure-costs-by-20x-times-for-a-programmatic-ad-marketplace-with-1b-audience-reach">$200k+ per month to $8k-10k</a> without any performance losses. On the contrary, the volume of inbound traffic grew from 20 to 80 QPS during our partnership, and we built in the ability to scale up to 1 million QPS.</p>



<h4 class="wp-block-heading"><strong>Recommended technologies for data processing:  </strong></h4>



<ul>
<li><a href="https://cloud.google.com/dataflow">Google Cloud Dataflow</a>: A managed streaming analytics service. </li>
<li><a href="https://flink.apache.org/">Apache Flink</a>: A unified stream-processing and batch-processing framework.</li>
<li><a href="https://spark.apache.org/">Apache Spark</a>: A multi-language, scalable data querying engine. </li>
</ul>
<div class="post-banner-cta-v2 no-desc js-parent-banner">
<div class="post-banner-wrap post-banner-cta-v2-wrap">
	<div class="post-banner-cta-v2__title-wrap">
		<h2 class="post-banner__title post-banner-cta-v2__title">Overwhelmed by the complexity of AdTech data? Xenoss specializes in solving AdTech data complexities</h2>
	</div>
<div class="post-banner-cta-v2__button-wrap"><a href="https://xenoss.io/cases" class="post-banner-button xen-button">Discover</a></div>
</div>
</div>








<h3 class="wp-block-heading">Data storage</h3>



<p>All the collected and processed AdTech data needs a “landing pad”—a target storage destination from where it will be queried by different analytics apps and custom scripts. </p>



<p>In most cases, data ends up in either of the following locations:</p>



<ul>
<li><strong>Data lake</strong> (e.g., based on <a href="https://aws.amazon.com/big-data/datalakes-and-analytics/datalakes/">AWS S3</a>, <a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html">Hadoop HDFS</a>, <a href="https://www.databricks.com/">Databricks</a>)</li>
</ul>
<figure id="attachment_6832" aria-describedby="caption-attachment-6832" style="width: 2100px" class="wp-caption alignnone"><img decoding="async" class="size-full wp-image-6832" title="" src="https://xenoss.io/wp-content/uploads/2024/06/flowchart-of-data-pipeline-stages-in-data-lake-design.jpg" alt="Flowchart of data pipeline stages in data lake design" width="2100" height="1438" srcset="https://xenoss.io/wp-content/uploads/2024/06/flowchart-of-data-pipeline-stages-in-data-lake-design.jpg 2100w, https://xenoss.io/wp-content/uploads/2024/06/flowchart-of-data-pipeline-stages-in-data-lake-design-300x205.jpg 300w, https://xenoss.io/wp-content/uploads/2024/06/flowchart-of-data-pipeline-stages-in-data-lake-design-1024x701.jpg 1024w, https://xenoss.io/wp-content/uploads/2024/06/flowchart-of-data-pipeline-stages-in-data-lake-design-768x526.jpg 768w, https://xenoss.io/wp-content/uploads/2024/06/flowchart-of-data-pipeline-stages-in-data-lake-design-1536x1052.jpg 1536w, https://xenoss.io/wp-content/uploads/2024/06/flowchart-of-data-pipeline-stages-in-data-lake-design-2048x1402.jpg 2048w, https://xenoss.io/wp-content/uploads/2024/06/flowchart-of-data-pipeline-stages-in-data-lake-design-380x260.jpg 380w" sizes="(max-width: 2100px) 100vw, 2100px" /><figcaption id="caption-attachment-6832" class="wp-caption-text">Flowchart of data pipeline stages in data lake design</figcaption></figure>



<ul>
<li><strong>Data warehouses </strong>(e.g., <a href="https://aws.amazon.com/redshift/" target="_blank" rel="noreferrer noopener">Amazon Redshift</a>, <a href="https://cloud.google.com/bigquery" target="_blank" rel="noreferrer noopener">BigQuery</a>, <a href="https://hive.apache.org/" target="_blank" rel="noreferrer noopener">Apache Hive</a>, <a href="https://www.snowflake.com/en/" target="_blank" rel="noreferrer noopener">Snowflake</a>)</li>
</ul>





<p>But that’s not the end of the story. You also need suitable analytic database management software to ensure that data gets stored in the right format and can be effectively queried by downstream applications. </p>



<p>That’s where <a href="https://xenoss.io/blog/database-management-systems-for-adtech">database management systems (DBMS)</a> come into play. A well-selected DBMS can automate data provisioning to multiple apps and ensure better data governance and lower operating costs. </p>



<p><a href="https://doubleverify.com/">DoubleVerify</a>, for example, originally relied on a monolithic Python application for AdTech data analysis. Data was hosted in several storage locations, but the most frequent one was the columnar database Vertica, where request logs went. </p>



<p>The team created custom Python functions to orchestrate SQL scripts against Vertica. For fault tolerance, Python code was deployed to two on-premises servers—one primary and one secondary. Using the job scheduling software Rundeck, the code was executed using a cron schedule.</p>



<p>As data volumes increased, the team soon ran into issues with Vertica. According to <a href="https://medium.com/doubleverify-engineering/modernizing-data-pipelines-with-dbt-c2941be74b13" target="_blank" rel="noreferrer noopener">Dennis Levin</a>, Senior Software Engineer at DoubleVerify, jobs on Vertica were taking too long to run, while adding more nodes to Vertica was both time-consuming and expensive. Due to upstream dependencies, the team also had to run most jobs on the long-outdated Python 2.7.</p>



<p>To patch things up, the team designed a new cloud-native data pipeline architecture built with dbt, Airflow, and Snowflake. </p>





<p><a href="https://www.getdbt.com/product/what-is-dbt">DBT</a> is a SQL-first transformation workflow that allows teams to deploy analytics code faster by adding best practices like modularity, portability, and CI/CD. In DoubleVerify’s case, DBT replaced ancient Python code. </p>



<p>The team also replaced Vertica with the cloud-native Snowflake SQL database. Unlike legacy data warehousing solutions, Snowflake can natively store and process both structured (i.e., relational) and semi-structured (e.g., JSON, Avro, XML) data — all in a single system, which is convenient for deploying multiple AdTech analytics use cases.</p>



<p>DoubleVerify also replaced Rundeck with<a href="https://airflow.apache.org/" target="_blank" rel="noreferrer noopener"> Apache Airflow</a> — a modern, scalable workflow management platform. It was configured to run in Google’s data workflow orchestration service, <a href="https://cloud.google.com/composer" target="_blank" rel="noreferrer noopener">Cloud Composer</a> (which is built on Apache Airflow open source project). </p>



<p>Cloud Composer helps author, schedule, and monitor pipelines across hybrid and multi-cloud environments. Since pipelines are configured as directed acyclic graphs (DAGs), the learning curve is low for any Python developer.  </p>
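<p>DoubleVerify&#8217;s DAGs aren&#8217;t public, but a minimal, hypothetical Airflow sketch shows why the learning curve is low for Python developers; the DAG id, task names, and commands are placeholders:</p>

<pre class="wp-block-code"><code>from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hourly_event_aggregation",       # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    ingest = BashOperator(task_id="ingest_from_s3", bash_command="echo ingest")
    aggregate = BashOperator(task_id="aggregate_events", bash_command="echo aggregate")
    publish = BashOperator(task_id="publish_to_bi", bash_command="echo publish")

    ingest >> aggregate >> publish           # dependencies expressed as a DAG
</code></pre>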



<p>To avoid the scalability constraints of SQL databases, some AdTech companies go with non-relational databases instead. NoSQL databases have greater schema flexibility and higher scalability potential. Modern non-relational databases also use in-memory storage and distributed architectures to deliver lower latency and faster processing speeds. </p>



<p>The flip side, however, is that greater scalability often translates to higher operating costs. A poorly configured cloud NoSQL database can easily generate a <a href="https://www.theregister.com/2020/12/10/google_cloud_over_run/" target="_blank" rel="noreferrer noopener">$72,000 overnight bill</a>. One possible solution is using a mix of hot and cold storage for different types of data streams as The Trade Desk does.</p>



<p>TTD receives <a href="https://www.youtube.com/watch?v=lA8MXNZ9uY4&amp;ab_channel=Aerospike" target="_blank" rel="noreferrer noopener">over 100K QPS of data</a> from its partners, which translates to over 200 TDID/segment updates per second. Given the volumes and costs of merging records, TTD needs to pick the “best” elements for analysis if any given record is too large. At the same time, the platform needs to serve only the data on records in use by active campaigns.</p>
<p>To manage this scale, the team uses <a href="https://aerospike.com/" target="_blank" rel="noreferrer noopener">Aerospike</a>—a multi-model, multi-cloud NoSQL database. Aerospike runs on the edge as a hot cache for the real-time bidding system, which processes over 800 billion queries per day. It also serves as a system of record on AWS, managing peak loads of up to 20 million writes per second for its “cold storage” of user profiles. </p>



<p>This way, TTD can:</p>



<ul>
<li>Rapidly serve data required for active campaigns </li>



<li>Refresh hot records within hours of new campaign activation</li>



<li>Forget about any impact of data delivery on bidding system performance </li>



<li>Support advanced analytics scenarios by surfacing cold storage cluster data. </li>
</ul>



<p>Such a data pipeline architecture allows TTD to maintain large-scale, multidimensional data records without burning unnecessary CPU costs and to thaw data in 8 ms for real-time bidding. </p>
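<p>As a hedged sketch of the hot-cache side of such a setup, the snippet below uses the Aerospike Python client; the cluster address, namespace, set, and bin names are invented for the example:</p>

<pre class="wp-block-code"><code>import aerospike  # pip install aerospike

config = {"hosts": [("127.0.0.1", 3000)]}        # assumed cluster address
client = aerospike.client(config).connect()

# (namespace, set, user key): hypothetical layout for hot user profiles
key = ("profiles", "hot", "user:123")

# Write a record with a TTL so stale hot-cache entries expire on their own
client.put(key, {"segments": [101, 202]}, meta={"ttl": 3600})

_, meta, bins = client.get(key)                  # low-latency point read
print(bins["segments"])
client.close()
</code></pre>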



<h4 class="wp-block-heading"><strong>Recommended technologies for data storage: </strong></h4>



<ul>
<li><a href="https://clickhouse.com/" target="_blank" rel="noreferrer noopener">Clickhouse:</a> A cost-effective RDBMS for large-scale AdTech projects. </li>



<li><a href="https://aerospike.com/">Aerospike</a>: A schemaless distributed database with a distinct data model for organizing and storing its data, designed for scalability and high performance.</li>
<li><a href="https://hive.apache.org/">Apache Hive</a>: A distributed, fault-tolerant data warehouse system.</li>
</ul>






<h3 class="wp-block-heading">Data access </h3>



<p>The final step is building an easy data querying experience for users and enabling effective data access to downstream analytics applications. </p>



<p>Query engines help retrieve, filter, aggregate, and analyze the available AdTech data. Modern query engine services support multiple data sources and file formats, making them highly scalable and elastic for processing data within the data lake instead of pushing it into a data warehouse.</p>



<p>That’s the route <a href="https://www.captifytechnologies.com/">Captify</a> — a search intelligence platform — chose for its data pipelines for reporting. According to <a href="https://www.linkedin.com/in/roksolanadiachuk/" target="_blank" rel="noreferrer noopener">Roksolana Diachuk</a>, the platform’s Engineering Manager, the team uses: </p>



<ul>
<li>Amazon S3 to store customer data in various formats (CSV, Parquet, etc.)</li>
<li>Apache Spark for processing the stored data</li>
</ul>






<p>To ensure effective processing, the team built a custom client on top of Amazon S3, called S3 Lister, which filters out historical records so the team doesn&#8217;t need to query them with Spark. Since the data arrives in different formats, Captify applies data partitioning at the end of its data pipeline. Partitioning is based on timestamps (date, time, and hour), as required by the reporting use case. Afterwards, all processed data is loaded into <a href="https://impala.apache.org/" target="_blank" rel="noreferrer noopener">Impala</a>, a query engine built on top of <a href="https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html" target="_blank" rel="noreferrer noopener">Apache HDFS</a>. </p>
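<p>A timestamp-based partitioning step like the one Captify describes can be sketched in PySpark as follows; the paths and column names are assumptions, not Captify&#8217;s actual schema:</p>

<pre class="wp-block-code"><code>from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("partition-by-time").getOrCreate()

df = spark.read.parquet("s3://example-bucket/processed/events/")  # hypothetical input

(df.withColumn("date", F.to_date("event_ts"))
   .withColumn("hour", F.hour("event_ts"))
   .write.mode("append")
   .partitionBy("date", "hour")        # partition layout driven by timestamps
   .parquet("s3://example-bucket/reports/events/"))
</code></pre>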





<p>Similar to The Trade Desk, Captify uses a system of hot and cold data caches. Typically, all data streams are saved for 30 months for reporting purposes. However, teams only need data from the past month or so for most reporting use cases. </p>



<p>Therefore, HDFS contains fresh data that is several months old at most. All historical data records, in turn, rest in S3 stores. This way, Captify maintains fast and cost-effective data querying.</p>



<p>That said, SQL querying requires technical expertise, meaning average business users have to rely on data science teams for report generation. For that reason, AdTech players also leverage self-service BI tools. </p>



<p>Tokyo-based <a href="https://www.cyberagent.co.jp/en/" target="_blank" rel="noreferrer noopener">CyberAgent</a>, for example, went with <a href="https://www.tableau.com/" target="_blank" rel="noreferrer noopener">Tableau</a>—a self-service analytics platform. Tableau has pre-made connectors to data sources like Amazon Redshift and Google BigQuery, among others, and helps build analytical models visually to give business users streamlined access to analytics. </p>



<p>CyberAgent stores petabytes of data across Hadoop, Redshift, and BigQuery. Occasionally, they also use data marts to import data from MySQL and CSV files. <a href="https://www.tableau.com/solutions/customer/cyberagent-inc-saves-months-improves-insight-data-tableau" target="_blank" rel="noreferrer noopener">According to Ken Takao</a>, Infrastructure Manager at CyberAgent, the company “uses MySQL to store the master data for most of the products. Then blend the master data on MySQL and data on Hadoop or Redshift to extract”.</p>



<p>Before Tableau, the company’s engineers spent a lot of time figuring out how to obtain the required data before scripting custom SQL queries. Tableau now allows them to extract data directly from the connected sources and make it available to downstream applications. This saves the engineering teams dozens of hours. Business users benefit from readily accessible insights on ad distribution, logistics, and sales volumes for the company’s portfolio of 20 products. </p>



<p>Both Tableau and Looker are popular data visualization solutions, but they have some limitations for AdTech data. In particular, some analytics use cases may require heavy, mostly manual data porting. </p>



<p>Ideally, you should build or look for a solution that supports automatic data collection from multiple systems. Media-specific data visualization solutions often have field normalization, which eliminates the need for manual data mapping and improves the granularity of data presentation. </p>



<h4 class="wp-block-heading"><strong>Recommended technologies to provide effective data access: </strong></h4>



<ul>
<li><a href="https://aws.amazon.com/athena/">Amazon Athena</a>: A serverless, interactive analytics service built on open-source frameworks. </li>
<li><a href="https://prestodb.io/" target="_blank" rel="noopener">Presto</a>: An open-source SQL query engine that allows querying Hadoop, Cassandra, Kafka, AWS S3, Alluxio, MySQL, MongoDB, and Teradata. </li>
<li><a href="https://impala.apache.org/" target="_blank" rel="noopener">Apache Impala</a>: An open source, distributed SQL query engine for Apache Hadoop.</li>
<li><a href="https://www.tableau.com/" target="_blank" rel="noopener">Tableau</a>: A flexible self-service business intelligence platform. </li>
</ul>






<h3 class="wp-block-heading">Data orchestration</h3>



<p>Poor data pipeline management affects almost everything—data quality, processing speed, data governance. The biggest challenge, however, is that AdTech data pipelines have complex, multi-step workflows—and “clogging” at any step can affect the entire system&#8217;s performance. </p>



<p>Moreover, workflows have upstream, downstream, and interdependencies. Without a robust data orchestration system, managing all of these effectively is nearly impossible. </p>



<p>The simplest (and still most used) orchestration method for ETL pipelines is sequential scheduling via cron jobs. While it’s still a workable option for simple analytics use cases, it doesn’t scale well and requires significant developer time for configuration, upkeep, and error handling. </p>



<p>Orchestration is also challenging in data pipelines for streaming data processing. A batch orchestrator relies on idempotent steps in a pipeline, whereas real-life processes are seldom idempotent. Therefore, when you need to roll back or replay a workflow, data quality and integrity issues may arise. </p>



<p>In AdTech, data engineers also often need to enrich events in a stream with batch data to obtain more comprehensive insights. For example, you may need to contextualize an ad click event with user interaction data stored in a database. This requires pipeline synchronization. </p>
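<p>One common way to implement that stream-plus-batch enrichment is a stream-static join in Spark Structured Streaming. The sketch below is a hypothetical example with an invented topic, schema, and paths:</p>

<pre class="wp-block-code"><code>from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("stream-enrichment").getOrCreate()

schema = StructType([
    StructField("user_id", StringType()),
    StructField("ad_id", StringType()),
])

# Streaming side: ad click events from Kafka
clicks = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # assumed broker
    .option("subscribe", "ad-clicks")                  # assumed topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

# Batch side: user interaction data stored as a static table
users = spark.read.parquet("s3://example-bucket/dim/user_interactions/")

# Stream-static join: each micro-batch is enriched with the batch table
enriched = clicks.join(users, "user_id", "left")

query = (enriched.writeStream.format("console")
         .outputMode("append").start())
query.awaitTermination()
</code></pre>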



<p>Modern orchestration tools like <a href="https://airflow.apache.org/" target="_blank" rel="noreferrer noopener">Airflow</a> and <a href="https://aws.amazon.com/step-functions/" target="_blank" rel="noreferrer noopener">AWS Step Functions</a>, among others, help deal with the above challenges through the concept of directed acyclic graphs (DAGs). A DAG records tasks as nodes in a graph and task dependencies as edges between those nodes. Thanks to this, the system can execute concurrent tasks more efficiently, while data engineers get better controls for logging, debugging, and job re-runs. </p>



<p>For example, with an Airflow-based orchestration service, <a href="https://business.adobe.com/products/experience-manager/adobe-experience-manager.html" target="_blank" rel="noreferrer noopener">Adobe</a> can now run <a href="https://airflow.apache.org/use-cases/adobe/" target="_blank" rel="noreferrer noopener">over 1,000 concurrent workflows</a> for its Experience Management platform. <a href="https://arpeely.com/" target="_blank" rel="noreferrer noopener">Arpeely</a>, in turn, went with Google <a href="https://cloud.google.com/composer" target="_blank" rel="noreferrer noopener">Cloud Composer</a> and <a href="https://cloud.google.com/scheduler">Cloud Scheduler</a> to automate data workflows for its autonomous media engine solution. </p>



<p>Overall, orchestration services provide a convenient toolkit to streamline and automate complex data processing tasks, allowing your teams to focus on system fine-tuning instead of endless debugging. </p>



<h4 class="wp-block-heading"><strong>Recommended technologies for data orchestration: </strong></h4>



<ul>
<li><a href="https://airflow.apache.org/">Apache Airflow</a>: An open-source workflow management platform for data engineering pipelines, originally developed by Airbnb. </li>
<li><a href="https://aws.amazon.com/step-functions/">AWS Step Functions</a>: A serverless workflow orchestration service offering seamless integration with AWS Lambda, AWS Batch, AWS Fargate, Amazon ECS, Amazon SQS, Amazon SNS, and AWS Glue. </li>
<li><a href="https://github.com/spotify/luigi">Luigi</a>: An open-source orchestration solution from the Spotify team. </li>
<li><a href="https://dagster.io/">Dagster:</a>  A cloud-native orchestrator improving upon Airflow. </li>
</ul>






<h3 class="wp-block-heading">Monitoring  </h3>



<p>Similar to regular pipes, big data pipelines also need servicing. Job scheduling conflicts, data format changes, configuration problems, errors in data transformation logic—a lot of things can cause havoc in your systems. And downtime is costly in AdTech. </p>



<p>High system reliability requires end-to-end observability of the data pipeline, coupled with the ability to proactively manage and optimize the flow of data to avoid bottlenecks, increase resource efficiency, and reduce operating costs. </p>



<p>In particular, AdTech companies should implement:</p>



<ul>
<li>Compute performance monitoring</li>



<li>Data reconciliation processes</li>



<li>Schema and data drift monitoring</li>
</ul>



<p>Fortunately, these tasks can be automated with data observability platforms. <a href="https://pubmatic.com/" target="_blank" rel="noreferrer noopener">PubMatic</a>, for example, <a href="https://www.acceldata.io/blog/pubmatic-leverages-data-observability-platform" target="_blank" rel="noreferrer noopener">uses Acceldata Pulse</a> to monitor the performance of its massive data platform, spanning thousands of nodes and handling hundreds of petabytes of data. </p>



<p>The sheer scale of operations caused frequent performance issues, while Mean Time to Resolve (MTTR) stayed high. Acceldata’s observability platform helped PubMatic’s data engineers locate and isolate data bottlenecks faster, plus automate a lot of infrastructure support tasks. Thanks to the obtained insights, PubMatic also reduced its HDFS block footprint by 30% and consolidated its Kafka clusters, resulting in lower costs. </p>



<p>At Xenoss, we have also built a custom pipeline monitoring stack using <a href="https://prometheus.io/" target="_blank" rel="noreferrer noopener">Prometheus</a> and <a href="https://grafana.com/" target="_blank" rel="noreferrer noopener">Grafana</a>, which allows us to keep a 24/7 watch over all data processing operations and respond rapidly to errors and failures. This is one of the most well-balanced and efficient stacks we&#8217;ve compiled, and we have successfully implemented it across various client businesses.</p>
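<p>For a sense of how such a Prometheus-plus-Grafana setup hooks into a Python pipeline, here is a minimal sketch; the metric names and measured values are illustrative only:</p>

<pre class="wp-block-code"><code>import random
import time

from prometheus_client import Counter, Gauge, start_http_server  # pip install prometheus-client

records_total = Counter("pipeline_records_total", "Records processed", ["stage"])
consumer_lag = Gauge("pipeline_consumer_lag", "Consumer lag in messages")

start_http_server(8000)  # Prometheus scrapes this endpoint; Grafana graphs it

while True:
    records_total.labels(stage="ingest").inc()
    consumer_lag.set(random.randint(0, 100))  # placeholder for a real lag probe
    time.sleep(1)
</code></pre>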



<h4 class="wp-block-heading"><strong>Recommended technologies for data pipeline monitoring: </strong></h4>



<ul>
<li><a href="https://grafana.com/" target="_blank" rel="noopener">Grafana</a>: An open-source, multi-platform service for analytics and interactive visualizations. </li>
<li><a href="https://www.datadoghq.com/">Datadog</a>: A SaaS monitoring and security platform. </li>
<li><a href="https://www.splunk.com/">Splunk</a>: A leader in application management, security, and compliance analytics. </li>
<li><a href="https://www.dynatrace.com/">Dynatrace</a>: An observability, AI, automation, and application security functionality meshed in one platform. </li>
</ul>
<div class="post-banner-cta-v1 js-parent-banner">
<div class="post-banner-wrap">
<h2 class="post-banner__title post-banner-cta-v1__title">Looking for best practices in data pipeline development?</h2>
<p class="post-banner-cta-v1__content">Xenoss provides expert insights and services to refine your data strategy</p>
<div class="post-banner-cta-v1__button-wrap"><a href="https://xenoss.io/big-data-solution-development" class="post-banner-button xen-button post-banner-cta-v1__button">Learn more</a></div>
</div>
</div>



<h2 class="wp-block-heading">Final thoughts </h2>



<p>Data pipelines are the lifeline of the AdTech industry. But building them is hard. There are always trade-offs between cost and performance. </p>



<p>Developing robust, scalable, and high-concurrency data pipelines requires a deep understanding of the AdTech industry and broad knowledge of different technologies and system design strategies. Partnering with a team of <a href="https://xenoss.io/big-data-solution-development">AdTech data engineers</a> is the best way to avoid subpar architecture choices and costly operating mistakes. </p>



<p>Xenoss’ big data engineers have helped architect some of the most robust products in the industry, including a <a href="https://xenoss.io/cases/developing-a-gaming-advertising-platform-with-1-4b-monthly-video-impressions">gaming advertising platform with 1.4 billion monthly video impressions</a> and a <a href="https://xenoss.io/cases/building-performance-oriented-mobile-dsp-with-innovative-user-behavior-prediction-mechanism">performance-oriented mobile DSP</a>, recently acquired by the Verve Group. </p>



<p>We know how to build high-load products for high-ambition teams. Contact us to learn more about our custom software development services. </p>
<p>The post <a href="https://xenoss.io/blog/data-pipeline-best-practices-for-adtech-industry">Best practices for architecting data pipelines in AdTech</a> appeared first on <a href="https://xenoss.io">Xenoss - AI and Data Software Development Company</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>What kind of engineers should you hire for AdTech software projects?</title>
		<link>https://xenoss.io/blog/engineers-for-adtech-software-development</link>
		
		<dc:creator><![CDATA[Vlad Kushka]]></dc:creator>
		<pubDate>Thu, 21 Apr 2022 11:22:12 +0000</pubDate>
				<category><![CDATA[Software architecture & development]]></category>
		<category><![CDATA[Product development]]></category>
		<guid isPermaLink="false">https://xenoss.io/?p=2877</guid>

					<description><![CDATA[<p>As the complexity of software development in AdTech increases, it puts more burden on the hiring process. The average interview time for hiring senior software engineers is 40.8 days.  Hiring AdTech software engineers can take even longer due to the need for extensive domain knowledge, and the growing importance of AI/ML competency in this industry.  [&#8230;]</p>
<p>The post <a href="https://xenoss.io/blog/engineers-for-adtech-software-development">What kind of engineers should you hire for AdTech software projects?</a> appeared first on <a href="https://xenoss.io">Xenoss - AI and Data Software Development Company</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p><span style="font-weight: 400;">As the complexity of software development in AdTech increases, it puts more burden on the hiring process. The average interview time for hiring senior software engineers is </span><a href="https://www.glassdoor.com/research/time-to-hire-in-25-countries/"><span style="font-weight: 400;">40.8 days</span></a><span style="font-weight: 400;">. </span></p>
<p><span style="font-weight: 400;">Hiring AdTech software engineers can take even longer due to the need for extensive domain knowledge, and the growing </span><a href="https://www.thedrum.com/opinion/2022/02/17/3-ways-artificial-intelligence-will-save-adtech"><span style="font-weight: 400;">importance of AI/ML</span></a><span style="font-weight: 400;"> competency in this industry. </span></p>
<p><span style="font-weight: 400;">In order to reap the benefits of your AdTech software team – faster time to market and increased collaboration across teams and departments – it’s important to understand the industry context in which you operate and which software development specialists will be best suited to it. </span></p>
<p><span style="font-weight: 400;">The modern job market offers a vast range of software engineers with wide-ranging expertise. Besides specialization in programming language or technologies, there is division by so-called generalists, I-shapers, and T-shapers. We are going to talk about who the latter two are, which specialists are best fitted for an AdTech development project, and illustrate the way we at <a href="https://xenoss.io/">Xenoss</a> approach managing and growing T-shaped engineers.  </span></p>
<h2 class="p1">I-shaped vs T-shaped tech specialists</h2>
<p><b>I-shaped specialists</b><span style="font-weight: 400;"> are narrowly specialized professionals, such as designers, software developers, or data engineers. I-shapers get proficient in a particular stack of technology and then only polish this specific expertise. </span></p>
<p><span style="font-weight: 400;">Hiring I-shaped software development specialists can be a good fit for certain long-established, conservative industries, such as healthcare or financial services. </span></p>
<p><span style="font-weight: 400;">These companies have in-house expertise for any set of problems and value deep proficiency in a particular discipline instead of tech outlook and knowledge in the related domains. For instance, a QA engineer is responsible for the testing, but can’t put in the larger product context and won’t be able to perform even the minor tweaks in code or the platform UI. When I-shapers encounter more multidisciplinary tasks, they refer you to a specialist in a different department.</span></p>
<div class="mceTemp"></div>
<p><figure id="attachment_2909" aria-describedby="caption-attachment-2909" style="width: 2100px" class="wp-caption aligncenter"><img decoding="async" class="wp-image-2909 size-full" src="https://xenoss.io/wp-content/uploads/2022/04/example-of-i-shape-expertise-min-1.jpg" alt="Example of I-shape expertise - Xenoss blog - Engineers For AdTech Software Projects" width="2100" height="1036" srcset="https://xenoss.io/wp-content/uploads/2022/04/example-of-i-shape-expertise-min-1.jpg 2100w, https://xenoss.io/wp-content/uploads/2022/04/example-of-i-shape-expertise-min-1-300x148.jpg 300w, https://xenoss.io/wp-content/uploads/2022/04/example-of-i-shape-expertise-min-1-1024x505.jpg 1024w, https://xenoss.io/wp-content/uploads/2022/04/example-of-i-shape-expertise-min-1-768x379.jpg 768w, https://xenoss.io/wp-content/uploads/2022/04/example-of-i-shape-expertise-min-1-1536x758.jpg 1536w, https://xenoss.io/wp-content/uploads/2022/04/example-of-i-shape-expertise-min-1-2048x1010.jpg 2048w, https://xenoss.io/wp-content/uploads/2022/04/example-of-i-shape-expertise-min-1-527x260.jpg 527w, https://xenoss.io/wp-content/uploads/2022/04/example-of-i-shape-expertise-min-1-20x10.jpg 20w" sizes="(max-width: 2100px) 100vw, 2100px" /><figcaption id="caption-attachment-2909" class="wp-caption-text">Functional expertise of an I-shaped QA engineer</figcaption></figure></p>
<p><span style="font-weight: 400;">This is especially true if the company’s workflow is built on by </span><a href="https://www.toolshero.com/information-technology/rational-unified-process-rup/"><span style="font-weight: 400;">RUP (Rational Unified Process)</span></a><span style="font-weight: 400;"> methodology that entails completing one stage of development, clearly recording it in the documentation, before moving on to the next. The first release &#8220;in production&#8221; often occurs after a few months. This works if the external conditions are relatively constant.</span></p>
<p><figure id="attachment_2880" aria-describedby="caption-attachment-2880" style="width: 2100px" class="wp-caption aligncenter"><img decoding="async" class="wp-image-2880 size-full" src="https://xenoss.io/wp-content/uploads/2022/04/rational-unified-process-rup-min.jpg" alt="Rational unified process (RUP) - Xenoss blog - Engineers For AdTech Software Projects" width="2100" height="1094" srcset="https://xenoss.io/wp-content/uploads/2022/04/rational-unified-process-rup-min.jpg 2100w, https://xenoss.io/wp-content/uploads/2022/04/rational-unified-process-rup-min-300x156.jpg 300w, https://xenoss.io/wp-content/uploads/2022/04/rational-unified-process-rup-min-1024x533.jpg 1024w, https://xenoss.io/wp-content/uploads/2022/04/rational-unified-process-rup-min-768x400.jpg 768w, https://xenoss.io/wp-content/uploads/2022/04/rational-unified-process-rup-min-1536x800.jpg 1536w, https://xenoss.io/wp-content/uploads/2022/04/rational-unified-process-rup-min-2048x1067.jpg 2048w, https://xenoss.io/wp-content/uploads/2022/04/rational-unified-process-rup-min-499x260.jpg 499w, https://xenoss.io/wp-content/uploads/2022/04/rational-unified-process-rup-min-20x10.jpg 20w" sizes="(max-width: 2100px) 100vw, 2100px" /><figcaption id="caption-attachment-2880" class="wp-caption-text"><a href="https://www.toolshero.com/information-technology/rational-unified-process-rup/">RUP</a> – agile software development methodology</figcaption></figure></p>
<p><span style="font-weight: 400;">But if the product exists in the industry with a high degree of uncertainty and fast-paced market changes, such as in AdTech, a different development approach is needed. It is vital to focus on feedback from the market rather than focusing on the canonical rules for building development processes.</span></p>
<p><span style="font-weight: 400;">To avoid downtime and increase the speed of delivery, AdTech software development seeks out</span><b> T-shaped specialists</b><span style="font-weight: 400;">. These are people who have their own deeply studied specialization (similar to the I-shaped) and competencies in related areas.</span></p>
<p><figure id="attachment_2888" aria-describedby="caption-attachment-2888" style="width: 1024px" class="wp-caption aligncenter"><img decoding="async" class="wp-image-2888 size-large" src="https://xenoss.io/wp-content/uploads/2022/04/types-1024x537.gif" alt="Types of expertise - Xenoss blog - Engineers For AdTech Software Projects" width="1024" height="537" srcset="https://xenoss.io/wp-content/uploads/2022/04/types-1024x537.gif 1024w, https://xenoss.io/wp-content/uploads/2022/04/types-300x157.gif 300w, https://xenoss.io/wp-content/uploads/2022/04/types-768x403.gif 768w, https://xenoss.io/wp-content/uploads/2022/04/types-495x260.gif 495w, https://xenoss.io/wp-content/uploads/2022/04/types-20x10.gif 20w" sizes="(max-width: 1024px) 100vw, 1024px" /><figcaption id="caption-attachment-2888" class="wp-caption-text">Different types of professional expertise</figcaption></figure></p>
<p><span style="font-weight: 400;">The concept of T-shaped skills is a metaphor that has been </span><a href="https://corporatefinanceinstitute.com/resources/management/t-shaped-skills/"><span style="font-weight: 400;">used in recruiting since the 90s</span></a><span style="font-weight: 400;"> of the last century. The concept can be represented as two stripes: horizontal and vertical.</span></p>
<ul>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">The horizontal bar (Breadth of Knowledge / General Skills) is the ability to interact with experts in other fields and apply their knowledge in areas other than one&#8217;s own.</span></li>
<li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">The vertical bar (I-Shaped / Expert in one thing) is a deep competence in a particular area.</span></li>
</ul>
<p><figure id="attachment_2915" aria-describedby="caption-attachment-2915" style="width: 1050px" class="wp-caption aligncenter"><img decoding="async" class="wp-image-2915 size-full" src="https://xenoss.io/wp-content/uploads/2022/04/expertise-min.gif" alt="Example of T-shaped expertise - Xenoss blog - Engineers For AdTech Software Projects" width="1050" height="560" /><figcaption id="caption-attachment-2915" class="wp-caption-text">Functional expertise of T-shape QA engineer</figcaption></figure></p>
<p><span style="font-weight: 400;">In this scenario, a QA knows everything required to do the job, but also understands UX design, can create unit tests, can perform basic DevOps operations, etc.</span></p>
<p><span style="font-weight: 400;">In sophisticated AdTech projects, very often there is a critical lack of horizontal stripe width. Understanding and knowledge in related fields help to find a “common language” in the team, speed up the creation of a product, and improve its quality.</span></p>
<blockquote>
<p class="p1"><i>A T-shaper, while having a specialized skillset, also understands the product development holistically (different engineering environments, tools, and specifications). This gives the client an enormous advantage: keeping the team small and consequently cost-efficient, the team approaches the development the right way from the beginning. </i></p>
</blockquote>
<p style="text-align: right;"><span style="font-weight: 400;"><a href="https://www.linkedin.com/in/vovakyrychenko/">Vova Kyrychenko</a>, CTO at Xenoss</span></p>
<h2 class="p1">Challenges of AdTech dev team composition and management</h2>
<p><span style="font-weight: 400;">AdTech development teams work in a rapidly shifting market, with changing user preferences, and tough competition, on comprehensive business tasks that have various ways to approach them. AdTech companies need engineers with knowledge in adjacent disciplines and the ability to adapt and synthesize expertise. </span></p>
<p><span style="font-weight: 400;">For instance, to increase conversion rates for a media buying platform (a classical business objective for AdTech projects), data engineers have to take into account lots of factors from the related domains: The competitive landscape, AdOps specifics, potential hardware issues, the platform’s business use cases. </span><span style="font-weight: 400;"> </span></p>
<p><span style="font-weight: 400;">To determine whether T-shapers are the right fit for AdTech projects, let’s talk first about the typical team composition and management challenges. </span></p>
<h3 class="p1"><b>Hiring for a technically demanding project</b></h3>
<p><span style="font-weight: 400;">Assembling the dream team for a complicated and technically demanding project, common for the AdTech market, is a challenge in itself. Hiring narrowly-specialized senior tech talent might put a significant burden on the project due to the steep cost and long time to hire. The scale of necessary expertise might turn out to be smaller, and you’ll overspend on expensive work hours. </span></p>
<p><span style="font-weight: 400;">Instead of hiring people with a narrow skillset, AdTech companies need to prioritize tech specialists with wide knowledge that can approach the problem holistically. </span></p>
<p><span style="font-weight: 400;">Igor Petrenko, Solution Architect at Xenoss, </span><a href="https://www.linkedin.com/posts/xenoss_join-xenoss-activity-6842074798932283394-w-mV?utm_source=linkedin_share&amp;utm_medium=member_desktop_web"><span style="font-weight: 400;">emphasizes</span></a><span style="font-weight: 400;"> the importance of extensive tech and product knowledge for the optimal development of the AdTech software: </span></p>
<blockquote>
<p class="p1"><em>It&#8217;s not just about mastering tools and platforms. In every project, our tech team gains an in-depth understanding of the underlying technologies: the tech stack, internal components, operating systems, and hardware. By diving so deep and optimizing the product&#8217;s core, the solutions we build parallel-process hundreds of thousands of requests and are ready to support the next milestones of the clients’ businesses.</em></p>
</blockquote>
<h3 class="p1">The complexity of the team structure</h3>
<p><span style="font-weight: 400;">Development in companies that rely on I-shaped specialists is predicated on the multiple managers that can merge the expertise of engineers with widely different stacks. To effectively manage the workload, project leads (or team leads/tech leads/managers) have to be familiar with the technical aspect of each specialization and be able to plan for the long haul. The typical structure for such a team can look like this:  </span></p>
<p><figure id="attachment_2911" aria-describedby="caption-attachment-2911" style="width: 2100px" class="wp-caption aligncenter"><img decoding="async" class="wp-image-2911 size-full" src="https://xenoss.io/wp-content/uploads/2022/04/functional-structure-of-the-i-shape-team-1-min.jpg" alt="Functional structure of the I-shape team - Xenoss blog - Engineers For AdTech Software Projects" width="2100" height="1214" srcset="https://xenoss.io/wp-content/uploads/2022/04/functional-structure-of-the-i-shape-team-1-min.jpg 2100w, https://xenoss.io/wp-content/uploads/2022/04/functional-structure-of-the-i-shape-team-1-min-300x173.jpg 300w, https://xenoss.io/wp-content/uploads/2022/04/functional-structure-of-the-i-shape-team-1-min-1024x592.jpg 1024w, https://xenoss.io/wp-content/uploads/2022/04/functional-structure-of-the-i-shape-team-1-min-768x444.jpg 768w, https://xenoss.io/wp-content/uploads/2022/04/functional-structure-of-the-i-shape-team-1-min-1536x888.jpg 1536w, https://xenoss.io/wp-content/uploads/2022/04/functional-structure-of-the-i-shape-team-1-min-2048x1184.jpg 2048w, https://xenoss.io/wp-content/uploads/2022/04/functional-structure-of-the-i-shape-team-1-min-450x260.jpg 450w, https://xenoss.io/wp-content/uploads/2022/04/functional-structure-of-the-i-shape-team-1-min-20x12.jpg 20w" sizes="(max-width: 2100px) 100vw, 2100px" /><figcaption id="caption-attachment-2911" class="wp-caption-text">Software development team structure with I-shaped experts</figcaption></figure></p>
<p><span style="font-weight: 400;">However, in AdTech software development, especially if it is an emerging product or startup, maintaining such a rigid organizational structure is often unsustainable and costly. </span></p>
<h3 class="p1">Budget constraints for new roles</h3>
<p><span style="font-weight: 400;">Besides paying for multiple managerial roles, you might face the need to increase the budget on the go, each time you require some narrow expertise. For example, a few DevOps tasks emerge during the project span. You had no budget allocated for an additional position, but I-shaped back-end engineers cannot substitute for its functions so you’ll have to extend the project’s budget anyway. On the other hand, small teams of senior T-shaped developers, that have more universal tech expertise from the beginning, are more cost-effective in the long run. </span></p>
<h3 class="p1">Understanding the business environment</h3>
<p><span style="font-weight: 400;">Managing business objectives of an evolving AdTech product requires actionable tactical solutions and long term planning – a combination that requires a profound understanding of the domain. </span></p>
<p><span style="font-weight: 400;">The software engineer in this industry needs a thorough understanding of the competitor landscape, privacy regulations, supply chain logic, the typical data, and identity challenges. Even the most skilled I-shaper won’t be able to navigate these treacherous waters. AdTech engineers need a corresponding knowledge of AdOps, data science, and data architecture to comprehend the complexity of the technical solutions they have to devise. </span></p>
<h3><strong>Communication in cross-functional teams </strong></h3>
<p><span style="font-weight: 400;">Establishing a common language for a team of I-shaped specialists can be challenging. It is incredibly difficult to establish a clear feedback loop between developer, architect, tester, and data scientist. Narrow specialists don&#8217;t understand each other well, their vocabularies vary, and they tend to focus on their own well of knowledge. Workflows </span><a href="https://www.lucidchart.com/blog/are-you-ready-to-commit-developing-a-professional-software-engineer-workflow"><span style="font-weight: 400;">are not symmetric</span></a><span style="font-weight: 400;">; with small volumes, it is difficult to plan the work of a deep specialist without allowing downtime. </span></p>
<h2 class="p1">Why T-shaped expertise is indispensable for AdTech software development</h2>
<p><span style="font-weight: 400;">We recommend prioritizing T-shaped specialists in the hiring process since only these specialists are well-equipped for the multidisciplinary nature of AdTech. </span></p>
<p><span style="font-weight: 400;">A squad of T-shapers offers you a great deal of flexibility – with feature and task prioritization, change management based on user feedback, data-driven experimentation, and even resource optimization. Such small, </span><a href="https://www.bairesdev.com/blog/5-elements-of-a-high-performing-agile-team/"><span style="font-weight: 400;">agile teams</span></a><span style="font-weight: 400;"> have already taken over the world little by little, even banks and insurance providers.  </span></p>
<p><span style="font-weight: 400;">A T-shaper is appreciated for several qualities:</span><span style="font-weight: 400;"><br />
</span></p>
<ul>
<li style="font-weight: 400;" aria-level="1"><b>Outlook. </b><span style="font-weight: 400;">In a modern, competitive business, this property is one of the most valuable. Knowledge of related or distant subjects helps create nonstandard solutions and solutions &#8220;at the junctions.&#8221; </span></li>
<li style="font-weight: 400;" aria-level="1"><b>Universality</b><span style="font-weight: 400;">. A T-shaper can reinforce the development of any part of the project at any stage, providing close to 100% utilization of his working time. </span></li>
<li style="font-weight: 400;" aria-level="1"><b>Interoperability</b><span style="font-weight: 400;">. It saves the manager&#8217;s time on establishing workflow and communications, which helps avoid misunderstandings that result in the </span><span style="font-weight: 400;">waste of the development resource</span><span style="font-weight: 400;">. </span></li>
<li style="font-weight: 400;" aria-level="1"><b>Agility</b><span style="font-weight: 400;">. Such a specialist is a walking backup for some team members. What if a Python developer gets hit by a coronavirus? A T-shaper will be able to pick up the dropped baton and continue the project.</span></li>
</ul>
<p><span style="font-weight: 400;">Due to the complexity of AdTech software projects, knowledge of the domain and technology outlook is absolutely critical to solving the business challenges of this industry. </span></p>
<p><span style="font-weight: 400;">T-shapers are capable of solving the challenges of our complex niche that requires brainstorming with a multi-disciplinary team, experiments, and improvisation for the optimal solution. The team of T-shapers can also help you keep the software development expenses in check; they can optimize when specialized development would just write off the costs. </span></p>
<p>[cta-no-description title="Looking for T-shape experts for your AdTech team?" url="https://xenoss.io/dedicated-development-teams" buttontext="Get in touch"]</p>
<h2 class="p1">Xenoss success case: a T-shaped team for an AdTech platform</h2>
<p><span style="font-weight: 400;">To put into perspective how T-shaped experts reinforce the development of AdTech projects, let’s review a real-world case from our practice</span><span style="font-weight: 400;">.</span></p>
<p><span style="font-weight: 400;">One AdTech solution Xenoss delivered is a customer data platform for mobile apps. Due to the initial focus on T-shaped expertise in the hiring process, we were able to assemble a multifaceted team that could adapt to the changing needs of stakeholders. </span></p>
<p><span style="font-weight: 400;">In the projects, we agree on the quarterly business objectives with the client that are ambitious, and usually concern optimization of specific processes or increasing performance KPIs. Those are tasks typical for startups that operate with a high degree of uncertainty. If the team didn’t  understand the project holistically, we wouldn&#8217;t able to anticipate the software&#8217;s future problems and possible outcomes that allow us to develop optimal solutions. </span></p>
<p><span style="font-weight: 400;">To deliver a solution with utmost efficiency, team members require a comprehensive understanding of the business objective and understand different aspects of its implementation: </span></p>
<ul>
<li style="font-weight: 400;" aria-level="1"><b>Business/AdOps.</b><span style="font-weight: 400;"> How the project aligns with business processes and the market landscape. </span></li>
<li style="font-weight: 400;" aria-level="1"><b>Engineering.</b><span style="font-weight: 400;"> Software development, the technical underpinning. </span></li>
<li style="font-weight: 400;" aria-level="1"><b>Data science.</b><span style="font-weight: 400;"> AI models and machine learning algorithms, the core of the solution. </span></li>
<li style="font-weight: 400;" aria-level="1"><b>Product and delivery.</b><span style="font-weight: 400;"> DevOps, automation, and development infrastructure. </span></li>
</ul>
<p><span style="font-weight: 400;">For instance, data scientists on this project need basic AdOps knowledge. Otherwise, they won&#8217;t distinguish different inventory types for optimization and won&#8217;t build realistic models. Effective communication between team members would also be impossible without some understanding of the domain. </span></p>
<p><span style="font-weight: 400;">Before beginning the quarter, the entire team discusses the business objective and decides on the best strategy to approach it. It is much easier for our team of T-shaped professionals to convey ideas and formulate a shared product vision. In the I-shaped team, the company would have to invest more management resources to make those wheels turn.  </span></p>
<p><span style="font-weight: 400;">The project crew holds several sessions where each team member lays out their vision for the project. Then we summarize those outputs in the roadmap. The T-shaped team can quickly reach a consensus and proceed with the development due to the holistic understanding of the project by the entire team.</span></p>
<h2 class="p1">Important notice about a T-shaped engineering team</h2>
<p><span style="font-weight: 400;">Having T-shapers on your team is not a silver bullet against all development constraints. Despite their broad expertise, you cannot expect them to perform </span><span style="font-weight: 400;">exceptionally well in every domain. </span></p>
<p><span style="font-weight: 400;">Product managers frequently expect T-shaped engineers to be full-fledged tech consultants in everything. A </span><span style="font-weight: 400;">T-shaper can adjust and get up to speed with various tech stacks. Yet they still have a main area of expertise.</span></p>
<p><span style="font-weight: 400;">Expecting an engineer to be an expert in data architecture, cloud technologies, and UI design is simply not realistic. Developers can substitute for each other when there is a need and assume different roles throughout the project while acknowledging the strong and weak sides of each team member is essential. You can’t put a generalist in charge of an infrastructure decision that requires years of expertise and a solid track record. </span></p>
<p><span style="font-weight: 400;">Developing T-shaped expertise within your company is also a separate organizational challenge. Working in a cross-functional team can sometimes mean expanding expertise in the corresponding domains is more cumbersome than in a functionally aligned organization. Introducing “guilds,” a.k.a </span><a href="https://medium.com/scaled-agile-framework/exploring-key-elements-of-spotifys-agile-scaling-model-471d2a23d7ea"><span style="font-weight: 400;">communities of practice</span></a><span style="font-weight: 400;">, can facilitate this process, as they do it in Spotify.  </span></p>
<h2 class="p1">How Xenoss grows T-shaped specialists</h2>
<p><figure id="attachment_2912" aria-describedby="caption-attachment-2912" style="width: 1048px" class="wp-caption aligncenter"><img decoding="async" class="wp-image-2912 size-full" src="https://xenoss.io/wp-content/uploads/2022/04/grow-min.gif" alt="How to grow T-shaped specialist - Xenoss blog - Engineers For AdTech Software Projects" width="1048" height="498" /><figcaption id="caption-attachment-2912" class="wp-caption-text">Tips on growing T-shaped specialists in-house</figcaption></figure></p>
<p><span style="font-weight: 400;">While you can sometimes find skilled T-shaped engineers on the market, that&#8217;s not always the case. You can approach this problem by growing such specialists internally. To develop a T-shaped specialist within your ranks, you must create the right environment and conditions for them.</span></p>
<p><b>Autonomy</b><span style="font-weight: 400;">. Everyone must understand their responsibility for what they do, and the individual or team must be able to make their own decisions. In turn, management must provide room for potential mistakes while keeping clear guidance and streamlined control processes in place so that mistakes never reach the client-facing solution. </span></p>
<p><b>Motivational goal.</b><span style="font-weight: 400;"> Each team member must understand the overall goal and be aware of what their contribution brings to the table. Instead of a straightforward task, they should be responsible for solving a business challenge with the creative freedom to choose the technical and tactical approach.</span></p>
<p><b>Space for growth.</b><span style="font-weight: 400;"> Create conditions that allow people to show their best qualities, learn something new, and become the best at something. This is the main prerequisite for nurturing T-shaped specialists: setting problem-oriented tasks, allocating time for research, and letting engineers master new skill sets to create solutions. </span></p>
<p><span style="font-weight: 400;">Our development methodology at Xenoss fully supports these three objectives. The team strategizes with the client for the optimal solution to a given business challenge, collectively sets goals and tactics to achieve them, and bears the responsibility for successful execution. The work is divided into sprints, and for each, a goal or goals are set that brings the team one step closer to achieving the final result. </span></p>
<p><span style="font-weight: 400;">To make this collaborative process work, Xenoss AdTech software engineers pick up the knowledge and skills from the related disciplines and establish effective cross-functional communication to foster a holistic understanding of the project. This stimulates the search for ideas to improve the product. </span></p>
<h2 class="p1">Takeaways</h2>
<p><span style="font-weight: 400;">A T-shaper, capable of picking up knowledge on the fly and establishing proficiency in the related disciplines, is the most valuable asset for the AdTech project, especially with a diverse tech stack. Modern advertising technologies are developing in the direction of syncretism, dense intersection, and even partial mergers of various domains. </span></p>
<p><span style="font-weight: 400;">The team of I-shapers can be a good fit for the company with a rigid managerial structure and a long-term incremental delivery process. For time-sensitive <a href="https://xenoss.io/custom-adtech-programmatic-software-development-services">AdTech software development</a> with a high degree of uncertainty, especially for an emerging product, you need a T-shaper to analyze the problem on several levels and work out a solution. </span></p>
<p><span style="font-weight: 400;">T-shaped specialists are especially valuable for emerging AdTech products, where they quite often have to work in startup mode, adapt to the constantly changing context, and at the same time be able to demonstrate team effectiveness and deliver real business value.</span></p>
<p><span style="font-weight: 400;"><div class="post-banner-cta-v2 no-desc js-parent-banner">
<div class="post-banner-wrap post-banner-cta-v2-wrap">
	<div class="post-banner-cta-v2__title-wrap">
		<h2 class="post-banner__title post-banner-cta-v2__title">Looking for experienced AdTech engineers and integrators for your team?</h2>
	</div>
<div class="post-banner-cta-v2__button-wrap"><a href="https://xenoss.io/dedicated-development-teams" class="post-banner-button xen-button">Learn more</a></div>
</div>
</div>
<p>The post <a href="https://xenoss.io/blog/engineers-for-adtech-software-development">What kind of engineers should you hire for AdTech software projects?</a> appeared first on <a href="https://xenoss.io">Xenoss - AI and Data Software Development Company</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
