4 steps to build an AI model

Preparing your model is the next phase of any AI project. It includes data collection, cleaning, model building, training, and optimization. A well-prepared model ensures accuracy, reliability, and robustness in solving your specific problem. This chapter outlines the key steps involved in preparing your model and walks through how we built a GPT-style model.

Gathering data

Data is the foundation of any AI model. The quality, quantity, and relevance of the data you gather significantly impact your model's performance. Here are the steps for data gathering:

Identify data sources:

This step is about finding the goldmines that hold the information your AI needs to learn. Here's a breakdown of common sources:

  • Internal databases: This is often the first place to look. Your organization likely has a wealth of information stored in CRM systems, transaction records, customer databases, etc. Internal data is often readily available and can provide valuable insights into your specific business needs.

For example, if you're building a model to predict customer churn, your CRM system will be invaluable, containing data on customer demographics, purchase history, interactions with customer service, and more.

  • Public datasets: A treasure trove of open datasets exists, many of them free to use. Websites like Kaggle, UCI Machine Learning Repository, Hugging Face Hub (where we got our data for the GPT), and government data portals offer datasets on various topics.

For instance, if you're developing an image recognition model, ImageNet (available on Kaggle) provides a massive dataset of labeled images.

  • Web scraping: You can extract data directly from websites. While web scraping can be a good way to get data, it requires careful consideration of ethical and legal implications. Always respect website terms of service and robots.txt files, and be mindful of privacy concerns.

For instance, if you're analyzing sentiment around a product, you might scrape reviews from e-commerce sites or social media platforms.

  • APIs of third-party data providers: Many organizations specialize in collecting and providing data through APIs. These can help access specific types of data or real-time information. For example, if you need financial market data, you might use an API from a provider like Bloomberg or Refinitiv.

Define data requirements

Before diving into data collection, it's essential to have a clear understanding of what you need.

  1. Data type:
    • Structured data: Highly organized information that fits neatly into rows and columns (e.g., spreadsheets, SQL databases). This is generally easier to work with for AI applications.
    • Unstructured data: Data that doesn't have a predefined format (e.g., text, images, audio, video). Requires more complex techniques to extract meaningful information.
  2. Data format:

The format influences how the data is parsed, processed, and loaded into your computing environment. For instance (a short loading sketch follows this list):

  • CSV: Simple tabular data that is easily loaded into programs like Python or Excel but lacks hierarchical structure.
  • JSON: Ideal for structured, hierarchical data, commonly used in web applications.
  • XML: Used for markup and structured data, though less compact than JSON.
  • Databases: Handle large datasets with querying capabilities (e.g., SQL for relational databases).
  • Arrow: Optimized for columnar data storage and analytics, often used in big data processing systems like Apache Spark.
  3. Timeframe:
    • Historical data: Past data used to train the model on long-term trends and patterns.
    • Real-time data: Data streamed continuously, essential for applications that require up-to-the-minute information (e.g., fraud detection, high-frequency trading).
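
To make the format differences concrete, here is a minimal sketch of loading a few of these formats with pandas. The file names and the connection object are placeholders, not references to our project data.

    import pandas as pd

    # File names below are placeholders for your own data
    df_csv = pd.read_csv("customers.csv")    # CSV: flat, tabular
    df_json = pd.read_json("events.json")    # JSON: hierarchical, flattened where possible

    # Databases: query only what you need (assumes a DB-API connection or SQLAlchemy engine)
    # df_sql = pd.read_sql("SELECT * FROM transactions", connection)

    # Arrow-backed files (e.g., Feather) load efficiently for columnar analytics
    # df_arrow = pd.read_feather("metrics.feather")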

Validate data accessibility

  • Permissions: Ensure you have the legal right to access and use the data. This includes complying with data privacy regulations (e.g., GDPR, CCPA).
  • Infrastructure: Do you have the necessary tools and systems to collect, store, and process the data? Consider storage capacity, processing power, and data transfer capabilities.
  • Format and compatibility: Can your existing systems handle the data format? Will you need to convert or transform the data before use?

Check for bias

Bias in data can lead to unfair or inaccurate AI models. Be vigilant in identifying and mitigating potential biases.

  1. Sources of bias:
    • Sampling bias: Data that doesn't accurately represent the real-world population.
    • Measurement bias: Errors or inconsistencies in data collection methods.
    • Confirmation bias: Data that reinforces pre-existing assumptions.
  2. Mitigation strategies:
    • Diverse data sources: Use a variety of sources to get a more complete picture.
    • Data augmentation: Increase the diversity of your data by creating synthetic data or applying transformations to existing data.
    • Careful evaluation: Regularly assess your model for bias using fairness metrics and techniques.
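
As a simple illustration (not a full fairness audit), the sketch below checks how well each group is represented and compares the model's accuracy per group. The DataFrame and its columns (region, churned, predicted) are hypothetical names, not part of our project.

    import pandas as pd

    # df is assumed to contain a sensitive attribute ("region"), the true label
    # ("churned"), and the model's prediction ("predicted"); all names are hypothetical
    print(df["region"].value_counts(normalize=True))  # is any group under-represented?

    # Accuracy per group: large gaps are a red flag worth investigating
    per_group_accuracy = (
        df.assign(correct=df["churned"] == df["predicted"])
          .groupby("region")["correct"]
          .mean()
    )
    print(per_group_accuracy)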

By following these steps, you can build a strong foundation of high-quality data for your AI model. The next step is cleaning your data.

Data cleaning and preparation

Raw data is rarely ready for model training as it is often incomplete, inconsistent, and noisy. Data cleaning and preparation involve transforming raw data into a usable format for training your model. Here's a detailed look at the essential data-cleaning steps:

Handle missing data:

Missing data is a common problem. Ignoring it can lead to biased or inaccurate models. Here's how to address it:

  1. Understand the reasons: Before you fill in any gaps, try to understand why data is missing. Is it random, or is there a pattern? The answer helps you choose the right imputation technique.
  2. Choose an imputation technique: The right technique depends on the type of data. Let's discuss numerical and categorical data:
    • Numerical data:
      • Mean/median/mode: Replace missing values with the average, middle, or most frequent value of that feature. For instance, if you had a column of summer temperatures, you could fill missing values with the average temperature for that season.
      • K-nearest neighbors: Use the values of similar data points to estimate the missing value.
    • Categorical data:
      • Most frequent value: Replace missing values with the most common category.
      • "Unknown" category: Create a new category to represent missing values. This can be useful if the fact that a value is missing is itself informative.

If a row or column has a very high percentage of missing values, it might be best to remove it entirely. However, be cautious, as you could lose valuable information.
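
For illustration, here is one way to apply these techniques with pandas. The file and column names are hypothetical, and the 50% threshold for dropping columns is a judgment call, not a rule.

    import pandas as pd

    df = pd.read_csv("weather.csv")  # placeholder file name

    # Numerical: fill missing temperatures with the column mean
    df["temperature"] = df["temperature"].fillna(df["temperature"].mean())

    # Categorical: fill with the most frequent value...
    df["sky_condition"] = df["sky_condition"].fillna(df["sky_condition"].mode()[0])

    # ...or make missingness explicit with an "Unknown" category
    df["station_notes"] = df["station_notes"].fillna("Unknown")

    # Drop columns that are mostly empty (keep columns with at least 50% non-missing values)
    df = df.dropna(axis=1, thresh=int(0.5 * len(df)))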

Remove outliers

Outliers are data points that significantly deviate from the rest of the data and can distort your model's learning process. You need to detect them before you can handle them, so let's start with detection methods.

  • Detection methods:
    • Z-score: Measures how many standard deviations a data point is from the mean.
    • Interquartile range (IQR): Identifies outliers that fall outside a certain range around the median.
    • Visualization: Box plots and scatter plots can help visualize outliers.

After finding the outliers, these are the ways to handle them:

  • Handling outliers:
    • Removal: If outliers are due to errors or are truly irrelevant, remove them.
    • Replacement: Replace outliers with a less extreme value (e.g., the upper or lower limit of the IQR).
    • Retention: Sometimes outliers are genuine and important, so keep them if they represent real-world phenomena.
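
Here is a minimal sketch of the IQR approach with pandas; df["price"] is a hypothetical numerical column, and the 1.5 × IQR rule is the conventional default.

    import pandas as pd

    q1, q3 = df["price"].quantile(0.25), df["price"].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Detection: flag everything outside the IQR fences
    outliers = df[(df["price"] < lower) | (df["price"] > upper)]
    print(f"Found {len(outliers)} potential outliers")

    # Removal: keep only the in-range rows
    df_removed = df[df["price"].between(lower, upper)]

    # Replacement: cap extreme values at the IQR limits instead of dropping them
    df_capped = df.assign(price=df["price"].clip(lower=lower, upper=upper))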

The next step in the process is scaling.

Scaling

Many AI algorithms perform better when numerical features are on a similar scale. Let’s use a real-world example to explain this concept.

Imagine you're trying to predict house prices. You have two features:

  • Size: Ranges from 500 sq ft to 5,000 sq ft
  • Number of bedrooms: Ranges from 1 to 5

These features are on vastly different scales. Algorithms that use distance calculations (like K-Nearest Neighbors) can be thrown off by this. The size feature will dominate the distance calculations because its values are much larger.

Scaling helps level the playing field so all features are considered equally. We will talk about two scaling techniques.

Standardization (Z-score normalization)

Standardization centers the data around zero and scales it so the standard deviation is 1.

  • Formula: z = (x - mean) / standard deviation, where:
    • x is the original value
    • mean is the average of all values for that feature
    • standard deviation measures how spread out the data is

Using our original example, imagine the average house size is 2000 sq ft, with a standard deviation of 500 sq ft.

  • A 2500 sq ft house gets a z-score of (2500-2000)/500 = 1.0
  • A 1500 sq ft house gets a z-score of (1500-2000)/500 = -1.0

Standardization doesn't change the shape of the distribution, but it produces zero-centered features with unit variance, which suits the many algorithms that assume roughly normal, standardized inputs.

Normalization (min-max scaling)

Normalization squishes all the data into a specific range (usually 0 to 1). You use normalization when the data does not follow a Gaussian distribution or when the scale of the data must be preserved for interpretation.

  • Formula: x_scaled = (x - min) / (max - min), where:
    • x is the original value
    • min is the minimum value for that feature
    • max is the maximum value for that feature
  • Example: For our house size (500 sq ft to 5,000 sq ft):
    • A 500 sq ft house becomes 0: (500 - 500) / (5000 - 500) = 0
    • A 5,000 sq ft house becomes 1: (5000 - 500) / (5000 - 500) = 1
    • A 2,000 sq ft house becomes (2000 - 500) / (5000 - 500) ≈ 0.33

It preserves the original distribution of the data but puts everything on a common scale.

Both methods scale data, but they do it differently. The choice depends on your data and the algorithm you're using.
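
Here is a short sketch of both techniques using scikit-learn; the values mirror the house-price example and are made up.

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    houses = pd.DataFrame({
        "size_sqft": [500, 1500, 2000, 2500, 5000],
        "bedrooms": [1, 2, 3, 4, 5],
    })

    standardized = StandardScaler().fit_transform(houses)  # mean 0, standard deviation 1
    normalized = MinMaxScaler().fit_transform(houses)      # squashed into the 0-1 range

    print(standardized)
    print(normalized)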

Encode categorical variables

AI models generally work with numerical data. You need to convert categorical variables (e.g., colors, city names) into a numerical representation. Here are two ways to do this:

  • One-Hot encoding: Create new binary (0 or 1) columns for each category. For example, if you have a "color" feature with categories "red," "green," and "blue," you'd create three new columns: "color_red," "color_green," and "color_blue."
  • Ordinal encoding: Assign a numerical value to each category based on their order. For example, "low," "medium," and "high" could be encoded as 1, 2, and 3.
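
A brief sketch of both encodings; the columns and category order are illustrative.

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    df = pd.DataFrame({"color": ["red", "green", "blue"],
                       "priority": ["low", "high", "medium"]})

    # One-hot encoding: one binary column per color
    one_hot = pd.get_dummies(df["color"], prefix="color")

    # Ordinal encoding: low/medium/high mapped to 0/1/2 in the order we specify
    encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
    df["priority_encoded"] = encoder.fit_transform(df[["priority"]])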

Split data

Divide your dataset into subsets to ensure your model generalizes well to new, unseen data.

  • Training set: The largest portion is used to train the model's parameters.
  • Validation set: Used to evaluate the model's performance during training and fine-tune hyperparameters.
  • Test set: A completely independent set used for the final evaluation of the model's performance.
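
A common split is roughly 60/20/20 or 80/10/10. Here is a hedged sketch using scikit-learn, assuming X holds your features and y your labels.

    from sklearn.model_selection import train_test_split

    # First carve out a 20% test set, then split the remainder into train and validation
    X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
    # Result: 60% training, 20% validation, 20% test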

You're setting the stage for a successful AI project by meticulously cleaning and preparing your data. Next, let’s discuss building the model.


Build the model

In the model-building stage, you translate your project goals and prepared data into a functioning AI system. Think of it as assembling the brain of your AI, choosing the right structure and components to enable learning and intelligent behavior.

How you proceed from here depends on what you are building. As we stated earlier, we built a GPT-style model, and that is the example we will use in this section.

Why GPT?

GPT-style models are among the most widely built AI systems today because of their ability to understand and generate human-like text. They rely on the Transformer architecture, which uses self-attention mechanisms to capture long-range dependencies in sequences.

By building a GPT, we hope you find the process useful in your project.

The architecture

Building a GPT model involves selecting and assembling several key architectural components that work together to handle sequences of text, much like building with LEGO bricks.

The core idea is to represent inputs as numerical embeddings, process them through a stack of transformer blocks that apply attention and feed-forward transformations, and produce meaningful outputs (predictions for the next token). Let’s unpack this idea in simpler terms.

Instead of treating words as mere symbols, GPT models represent them as "embeddings" - lists of numbers (dense vectors) that capture their semantic meaning. Words with similar meanings have embeddings that are closer together in this vector space. This allows the model to understand relationships between words, like synonyms, antonyms, and analogies.

Transformers enable the model to process entire sequences of text simultaneously, unlike traditional sequential models. They use attention mechanisms that allow the model to weigh the importance of different words in a sentence when predicting the next word. Imagine it like focusing on key words in a conversation to understand the context.

The architecture also uses self-attention and a feed-forward network. Self-attention is a specific type of attention in which the model compares different parts of the same input sequence to relate them to each other, helping it understand long-range dependencies and complex relationships within the text.

Feed-forward networks are layers within the transformer that apply non-linear transformations to the data. They help the model learn complex patterns and representations.

GPT models often stack multiple transformer blocks on top of each other, which allows the model to learn hierarchical representations of the text, capturing meaning at different levels of granularity. The more layers, the more complex the patterns the model can learn, but it also requires more data and computational power to train effectively.

The model uses all the understanding of the input sequence to predict the next word. It calculates a probability distribution over all possible words in its vocabulary and selects the most likely one.

Let’s break down how we did this.

1. Token and positional embeddings

GPT models need a way to convert raw tokens (e.g., characters, words, or subwords) into numerical representations. These token embeddings are learnable parameters that help the model map discrete vocabulary entries to continuous vector spaces. Additionally, since transformers process tokens in parallel, positional embeddings are crucial to give the model a sense of where each token appears in the sequence.

Why we did this:

  • Token embeddings: Without embeddings, the model would see tokens as arbitrary IDs. Embeddings allow the model to understand relationships between tokens in a semantic vector space.
  • Positional embeddings: Self-attention does not inherently understand sequential order. Positional embeddings inject information about the token positions, allowing the model to differentiate the first token from the second, and so on.

In our code, we did this like this:

    
    self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
    self.position_embedding_table = nn.Embedding(block_size, n_embd)

Here, vocab_size is the size of the token vocabulary, and n_embd is the dimension of the embedding vectors. block_size represents the maximum sequence length the model will handle. By combining token and positional embeddings (tok_emb + pos_emb), each position in the input sequence is assigned a unique, learnable vector.
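
For context, this is roughly how the two embeddings come together in the forward pass. The variable names (idx for the batch of token indices, B and T for the batch and sequence dimensions) follow common GPT implementations and may differ slightly from the full code.

    # Inside forward(): idx holds token indices with shape (B, T)
    B, T = idx.shape
    tok_emb = self.token_embedding_table(idx)          # (B, T, n_embd)
    pos_emb = self.position_embedding_table(
        torch.arange(T, device=idx.device))            # (T, n_embd)
    x = tok_emb + pos_emb                              # broadcasts to (B, T, n_embd)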

2. Transformer blocks (attention + feed-forward layers)

The heart of a GPT model lies in its Transformer blocks. Each block contains:

  • Multi-head self-attention: This mechanism allows the model to focus on different parts of the input sequence when producing each output. Multiple heads learn to attend to different kinds of relationships.
  • Feed-forward network (FFN): After attention refines the representation, a feed-forward layer applies a non-linear transformation, allowing the model to capture more complex patterns.

Why we did this:

  • Self-attention: It enables the model to weigh the relevance of each token in the sequence to every other token. This is vital for understanding context and long-range dependencies.
  • Multiple heads: Different attention heads can capture different types of dependencies simultaneously.
  • Feed-forward layers: They transform attention outputs into richer feature representations, enhancing the model’s expressive power.
    
    self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])

Block is a custom class (not fully shown here) that implements a Transformer block. By stacking multiple blocks (for _ in range(n_layer)), we increase the model’s depth, enabling it to learn more abstract and sophisticated patterns.

Within each Block, you’ll find something like this:

    
    class Block(nn.Module):
        def __init__(self, n_embd, n_head):
            super().__init__()
            ...  # the multi-head self-attention and feed-forward sub-layers are defined here
            # Layer normalizations for stability
            self.ln1 = nn.LayerNorm(n_embd)
            self.ln2 = nn.LayerNorm(n_embd)
            ...

3. Layer normalization and residual connections

Layer normalization and residual connections are critical engineering choices that help the model train effectively and converge smoothly.

Why we did this:

  • Residual connections: These allow gradients to flow more easily through deep models, mitigating the vanishing gradient problem.
  • Layer normalization: Normalizing inputs to each sub-layer helps stabilize training, making it easier for the model to learn.

In the Block snippet above, you can see layer normalization (self.ln1, self.ln2) applied before each sub-layer; in the full forward pass, residual connections (x = x + ...) integrate the original input with the transformed output.

4. Output layer (language modeling head)

After processing the sequence through multiple transformer blocks, we map the final hidden representations back to the vocabulary space to predict the next token.

Why we did this:

  • Mapping representations to tokens: The model’s ultimate goal is to predict the probability distribution over possible next tokens. A linear layer (self.lm_head) converts the hidden state into logits over the vocabulary.
    
    self.lm_head = nn.Linear(n_embd, vocab_size)

When the forward pass completes, the lm_head outputs a score for each token in the vocabulary at each sequence position. With these logits, we can compute a loss (e.g., cross-entropy) against the true next token and guide the model’s learning.
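
As a sketch of that final step (variable names and shapes are assumed, following common PyTorch practice for language models):

    import torch.nn.functional as F

    # logits: (B, T, vocab_size); targets: (B, T) holding the true next-token IDs
    B, T, C = logits.shape
    loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))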

Putting it all together

The architecture we built follows the principles of the GPT family:

  1. Convert tokens to embeddings and add positional information.
  2. Pass embeddings through a stack of Transformer blocks that apply self-attention and feed-forward transformations.
  3. Use layer normalization and residual connections for stable and efficient training.
  4. Project the final hidden representations into the vocabulary space to generate predictions.

By implementing this architecture step-by-step, we ensure that each component contributes to the model’s ability to understand context, handle long dependencies, and generate coherent, contextually relevant text. The code snippets illustrate how the conceptual design translates into a working model class, ready for training and evaluation.

Train model

After carefully designing and constructing your GPT architecture, the next phase is to train the model so that it can learn meaningful patterns and relationships from the data. Training is where your model transitions from a static set of random parameters to a dynamic system capable of making predictions and generating coherent outputs.

How training works

  • Forward pass: The training loop begins by feeding a batch of input tokens into the model. The model processes these inputs through embeddings, Transformer blocks, and finally the output layer, producing logits that represent predicted probabilities over the vocabulary.
  • Loss computation: The model’s predictions are compared against the ground truth. For language modeling, a common choice is the cross-entropy loss, which measures how far the predicted distribution is from the true next token. A lower loss indicates the model’s predictions are more in line with the actual data.
  • Backward pass (backpropagation): Once the loss is computed, the model calculates gradients—these indicate how much each parameter influenced the loss. The gradients are then used to adjust the model’s weights in a direction that should reduce the loss in future iterations.
  • Parameter updates (optimization): An optimizer (e.g., AdamW or SGD) updates the parameters based on the gradients. Over many iterations, the model refines its parameters, gradually improving its predictions.

Key considerations during training:

  • Batch size: The number of samples processed at once. Larger batches can lead to more stable estimates of gradients but require more memory. Smaller batches can sometimes help the model navigate complex loss landscapes.
  • Learning rate: Governs how big a step is taken during each update. A careful balance is needed: too large and training may be unstable; too small and progress may be slow.
  • Validation and early stopping: Periodically check performance on a validation set. If validation loss stops improving, consider halting training or tuning hyperparameters to avoid overfitting.

Code snippet: Basic training loop

Below is an example of a training loop:

    
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    for iteration in range(max_iters):
        # Sample a batch of inputs (xb) and targets (yb) from the training data
        xb, yb = get_batch('train')

        # Forward pass: produce logits and compute the loss against the targets
        logits, loss = model.forward(xb, yb)

        # Backward pass and parameter update
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        if iteration % 100 == 0:
            print(f"Iteration {iteration}, Loss: {loss.item():.4f}")

What’s happening here:

  • get_batch('train'): Retrieves a small chunk of your training data to feed into the model. Each chunk is a sequence of tokens the model will try to predict.
  • model.forward(xb, yb): Runs a forward pass. The model processes xb to produce a set of logits (predictions). Given yb as the ground truth, it also computes the loss.
  • loss.backward(): PyTorch computes gradients for every parameter. These gradients tell us how to tweak the parameters to reduce the loss next time.
  • optimizer.step(): The optimizer (AdamW in this case) updates the model’s parameters based on the gradients. Over time, this should lower the loss.
  • print(...): Regular logging helps track training progress and ensure the model is learning.

Incorporating validation:

It’s good practice to periodically evaluate the model on a validation set that the model never sees during training updates. This can be done at intervals (e.g., every 1000 iterations) by using a similar process as above but without calling backward() or optimizer.step()—just run the model on validation data to measure performance. If the validation loss stops improving or starts to increase, it’s an indication that the model may be overfitting.

    
    if iteration % eval_iters == 0:
        model.eval()  # evaluation mode (e.g., disables dropout)
        with torch.no_grad():  # no gradients needed; we only measure performance
            xb_val, yb_val = get_batch('val')  # assumes get_batch also supports a 'val' split
            _, val_loss = model.forward(xb_val, yb_val)
        model.train()  # back to training mode
        print(f"Validation Loss at iteration {iteration}: {val_loss:.4f}")

By incorporating this step, you can catch and address potential issues early, adjusting hyperparameters or applying regularization techniques (like dropout) as needed.

Hyperparameter tuning

Hyperparameters are the external settings that control the learning process. Unlike model parameters, which the model learns during training, hyperparameters are chosen before training begins. They dictate how quickly the model learns, how large it is, and how well it generalizes.

Key hyperparameters in a GPT model:

  1. Learning rate:
    • What it does: Controls how fast or slow the model's parameters are updated after seeing each batch of data.
    • Too high: The model may fail to converge, "jumping" around and never settling into a good solution.
    • Too low: Training will be painfully slow, potentially getting stuck in suboptimal results.
  2. Number of layers (n_layer):
    • What it does: Increases the model's capacity to learn complex patterns by stacking more Transformer blocks.
    • Too many layers: Could lead to overfitting or excessive training time.
    • Too few layers: The model may be too simple, failing to capture intricate relationships.
  3. Number of attention heads (n_head):
    • What it does: Multiple heads in self-attention allow the model to focus on different aspects of the sequence in parallel.
    • Too many heads: More computational overhead, possibly diminishing returns.
    • Too few heads: Limited capability to attend to multiple patterns simultaneously.
  4. Embedding size (n_embd):
    • What it does: Defines the dimensionality of token and positional embeddings. Larger embeddings capture richer semantics.
    • Too large: More compute and memory usage without guaranteed better results.
    • Too small: May not represent information richly enough, limiting model expressiveness.
  5. Batch size and block size:
    • Batch size: How many sequences are processed before the model updates its weights. Larger batches provide stable gradient estimates but require more memory.
    • Block size: The sequence length the model sees at a time. Larger blocks help the model learn longer contexts but increase computational cost.

Dynamic configuration of hyperparameters:

In our code, we set these hyperparameters before training starts, but you can also configure them dynamically via command-line arguments, configuration files, or environment variables.

Why dynamic configuration?

  • Adaptability: You might start with a small model for quick experiments and scale up once you’re confident in your approach.
  • Resource constraints: If you move from a server with multiple GPUs to a laptop, you can reduce batch size or model size without rewriting code.
  • Automated tuning: Tools like Optuna or Ray Tune can adjust hyperparameters automatically to find better configurations over multiple runs.

Code snippet for handling hyperparameters dynamically:

    
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('--learning_rate', type=float, default=3e-4, help='Learning rate')
    # The following arguments are assumed for this sketch; the defaults are illustrative
    parser.add_argument('--n_layer', type=int, default=4, help='Number of Transformer blocks')
    parser.add_argument('--n_head', type=int, default=4, help='Number of attention heads')
    ...

    args = parser.parse_args()
    learning_rate = args.learning_rate
    n_layer = args.n_layer
    n_head = args.n_head
    ...

With this approach, you can easily tweak hyperparameters by running:

    
    python train.py --learning_rate 1e-4 --n_layer 6

    
  

This lets you run multiple experiments with different settings to discover which combination yields the best results.

When and how to adjust hyperparameters:

  • If validation loss stagnates: Try lowering the learning rate or adding more layers to give the model more representational power.
  • If training is too slow: Decrease model complexity or reduce block size.
  • If overfitting occurs: Consider reducing model size, adding dropout (a form of regularization), or gathering more data.
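
For example, dropout can be added inside each block's feed-forward sub-layer. The structure below (an inner layer four times wider than the embedding, with ReLU) is a common GPT convention rather than a copy of our exact code, and the 0.2 dropout rate is purely illustrative.

    # A hedged sketch of a feed-forward sub-layer with dropout for regularization
    self.ffwd = nn.Sequential(
        nn.Linear(n_embd, 4 * n_embd),
        nn.ReLU(),
        nn.Linear(4 * n_embd, n_embd),
        nn.Dropout(0.2),  # randomly zeroes activations during training to curb overfitting
    )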

Hyperparameters shape the learning journey of your model. By carefully selecting values that suit your data, hardware, and project goals—and by making these choices flexible and testable—you’ll set a strong foundation for experimentation and improvement. Over time, as you gain more experience and data, you’ll refine these hyperparameters, ultimately leading to a more accurate and efficient GPT model.

If you want to see the entire code we used, you can check this GitHub repository.

Summary

By following these steps, you can ensure that your AI model is well-prepared for the specific problem you aim to solve. Gathering and cleaning data lays the groundwork for a robust training process, while iterative model training and careful hyperparameter tuning refine the model's performance. This structured approach minimizes errors, maximizes predictive accuracy, and ensures your model is ready for deployment.