Machine Learning Q & AI

30 cutting-edge questions in machine learning and AI

This book by Sebastian Raschka delves into a wide range of intermediate and advanced topics in machine learning and AI. It adopts a Q&A style, with each chapter structured around a central question, making it accessible to readers with some machine learning experience.

Part I: Neural Networks and Deep Learning

  • Chapter 1: Embeddings, Latent Space, and Representations: Explains the concepts of embedding vectors, latent space, and representations. Embeddings encode data into low-dimensional vectors, like mapping one-hot encoded categorical data. Latent space is where embeddings are mapped, and representations are encoded forms of input. For example, in a CNN, the last layers may yield embedding vectors, and all intermediate layer outputs could potentially be embedding vectors.
  • Chapter 2: Self-Supervised Learning: Compares self-supervised learning to transfer learning. Self-supervised learning uses unlabeled data by creating pretext tasks, such as missing-word prediction in NLP or image patch reconstruction in CV. It’s useful for large neural networks with limited labeled data, like in transformer-based architectures. Small neural networks and nonparametric models typically don’t benefit from it.
  • Chapter 3: Few-Shot Learning: Introduces few-shot learning for small training sets. It uses a support set to create training tasks, with each task having a few examples per class (e.g., 5-way 1-shot means five classes with one example each). The model learns to produce embeddings for classification. For instance, in a 3-way 1-shot setting, the model classifies query images based on embeddings from the support set.
  • Chapter 4: The Lottery Ticket Hypothesis: Explores the idea that randomly initialized neural networks contain smaller, efficient subnetworks. The training procedure involves pruning weights and retraining. If true, it can cut training costs. However, currently, it’s expensive as it requires training the original network first. For example, the original paper reduced the network to 10 percent of its size without sacrificing accuracy.
  • Chapter 5: Reducing Overfitting with Data: Discusses ways to reduce overfitting using data. Collecting more data can improve model performance, as shown by learning curves. Data augmentation generates new data, like flipping and cropping images. Pretraining with unlabeled data, such as in self-supervised learning, can also help. Other techniques include feature engineering and using adversarial examples.
  • Chapter 6: Reducing Overfitting with Model Modifications: Outlines model-related ways to reduce overfitting. Regularization techniques like dropout and weight decay add penalties to the loss function. Smaller models can be obtained through pruning or knowledge distillation, but recent research shows complex relationships between model size and generalization. Ensemble methods combine multiple models but are computationally expensive.
  • Chapter 7: Multi-GPU Training Paradigms: Introduces different multi-GPU training paradigms. Model parallelism places different parts of a model on different GPUs, data parallelism divides data across GPUs, and tensor parallelism splits matrices across GPUs. Pipeline parallelism is a hybrid, and sequence parallelism addresses long-sequence issues in transformers. Recommendations depend on model size and computational resources.
  • Chapter 8: The Success of Transformers: Explains the success of transformers. Their attention mechanism allows each token to attend to all others, providing context-aware representations. Pretraining via self-supervised learning, large numbers of parameters, and easy parallelization also contribute. For example, GPT-3 has 175 billion trainable parameters, and self-attention is easily parallelizable across GPU cores.
  • Chapter 9: Generative AI Models: Defines generative modeling and outlines different types of deep generative models. Energy-based models learn an energy function, variational autoencoders use variational inference, and generative adversarial networks have a generator and discriminator. Each model type has its strengths and weaknesses. For example, GANs can generate realistic images but suffer from unstable training.
  • Chapter 10: Sources of Randomness: Discusses sources of randomness in training deep neural networks. Model weight initialization, dataset sampling, nondeterministic algorithms like dropout, different runtime algorithms, and hardware can all cause non-reproducible results. Generative AI models also introduce intentional randomness, such as in top-k sampling for text generation.
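
The intentional randomness mentioned under Chapter 10 can be illustrated with a small sketch of top-k sampling. This is not code from the book: the function name top_k_sample, the logits, and the seed value are assumptions chosen for illustration, and only the Python standard library is used. It also shows how fixing the seed of the random number generator makes the draws repeatable.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token index from the k highest-scoring logits."""
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax weights over the retained logits (shifted by the max for stability).
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]
    # rng.choices normalizes the weights internally before drawing.
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical next-token scores for a five-token vocabulary.
logits = [2.0, 0.5, 1.5, -1.0, 0.1]

# Two generators created with the same (arbitrary) seed reproduce the same draws.
rng_a = random.Random(123)
rng_b = random.Random(123)
run_a = [top_k_sample(logits, 2, rng_a) for _ in range(10)]
run_b = [top_k_sample(logits, 2, rng_b) for _ in range(10)]
```

With k = 2, only the two highest-scoring tokens (indices 0 and 2 here) can ever be drawn, and the two runs match because they share a seed. Seeding controls only this one source of randomness; hardware and runtime algorithm differences are outside its reach.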

Part II: Computer Vision

  • Chapter 11: Calculating the Number of Parameters: Explains how to compute the number of parameters in a convolutional neural network. The number of parameters in convolutional layers depends on the kernel size, input, and output channels, while in fully connected layers, it’s the product of inputs and outputs plus bias units. This information helps estimate model complexity and memory requirements.
  • Chapter 12: Fully Connected and Convolutional Layers: Shows that fully connected layers can be replaced by convolutional layers in two scenarios: when the kernel and input sizes are equal and when the kernel size is 1. This can offer hardware optimization advantages. For example, in the first scenario, a convolutional layer with a 2×2 kernel equal to the input size can perform the same computation as a fully connected layer.
  • Chapter 13: Large Training Sets for Vision Transformers: Explains why vision transformers (ViTs) generally require larger training sets than CNNs. CNNs have more hardcoded inductive biases, like local connectivity and weight sharing, while ViTs need to learn more from the data. However, with enough data, ViTs can outperform CNNs. For example, ViTs often need to be pretrained on large datasets like ImageNet.
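
The parameter counts described under Chapter 11 can be written as two small helper functions. This is a minimal sketch using the standard counting rules (kernel height × kernel width × input channels × output channels, plus one bias per output channel; inputs × outputs plus one bias per output). The function names and the layer sizes are hypothetical, not taken from the book.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Trainable parameters in a 2D convolutional layer."""
    # One kernel of size kernel_h x kernel_w per (input channel, output channel) pair.
    weights = kernel_h * kernel_w * in_channels * out_channels
    # One bias unit per output channel.
    return weights + (out_channels if bias else 0)


def linear_params(in_features, out_features, bias=True):
    """Trainable parameters in a fully connected layer."""
    # One weight per (input, output) pair, plus one bias unit per output.
    return in_features * out_features + (out_features if bias else 0)


# Hypothetical layer sizes for illustration.
conv_count = conv2d_params(5, 5, 3, 8)   # 5*5*3*8 + 8 = 608
fc_count = linear_params(256, 10)        # 256*10 + 10 = 2570

# Chapter 12, first scenario: a 2x2 kernel over a 2x2 single-channel input
# has as many parameters as a fully connected layer with 4 inputs.
conv_as_fc = conv2d_params(2, 2, 1, 2)   # 2*2*1*2 + 2 = 10
fc_equivalent = linear_params(4, 2)      # 4*2 + 2 = 10
```

The matching counts in the last two lines are consistent with the Chapter 12 claim that a convolutional layer whose kernel covers the whole input performs the same computation as a fully connected layer.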

Part III: Natural Language Processing

  • Chapter 14: The Distributional Hypothesis: Discusses the distributional hypothesis in NLP, which suggests that words in similar contexts have similar meanings. Word embeddings like Word2vec and large language models like BERT and GPT rely on this idea. Although there are counterexamples, it’s useful for tasks like text classification and sentiment analysis.
  • Chapter 15: Data Augmentation for Text: Introduces data augmentation techniques for text, such as synonym replacement, word deletion, and back translation. These techniques increase dataset size and improve model performance. For example, synonym replacement can help the model learn similar word meanings, but care must be taken as not all synonyms are interchangeable.
  • Chapter 16: Self-Attention: Compares self-attention in transformers to the Bahdanau attention mechanism in RNNs. Self-attention allows a neural network to attend to all elements in the same sequence, while the Bahdanau mechanism is applied between encoder and decoder embeddings. Self-attention is a key component in modern large language models.
  • Chapter 17: Encoder- and Decoder-Style Transformers: Describes the differences between encoder- and decoder-style language transformers. Encoders, like in BERT, are used for tasks like classification, while decoders, like in GPT, are for text generation. There are also encoder-decoder hybrids for tasks like text translation and summarization.
  • Chapter 18: Using and Fine-Tuning Pretrained Transformers: Discusses different ways to use and fine-tune pretrained large language models. Feature-based approaches treat the model as a feature extractor, in-context learning provides examples in the input, and parameter-efficient fine-tuning methods update a subset of the model parameters. Reinforcement learning with human feedback can also improve model performance.
  • Chapter 19: Evaluating Generative Large Language Models: Introduces standard metrics for evaluating LLMs, such as perplexity, BLEU, ROUGE, and BERTScore. Perplexity measures the model’s uncertainty in predicting the next word, BLEU is for translation evaluation, ROUGE is for text summarization, and BERTScore uses BERT embeddings to measure text similarity. Each metric has its strengths and weaknesses.
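
To make the perplexity metric from Chapter 19 concrete, the sketch below uses the common definition: the exponential of the average negative log-probability that the model assigned to each correct next token. The summary above does not give a formula, so this definition, the function name perplexity, and the probability values are assumptions for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the correct tokens."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    # Average negative log-likelihood per token.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiate to get an "effective number of choices" per token.
    return math.exp(avg_nll)


# A model that is equally unsure among four options at every step.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A hypothetical model that is fairly confident about each correct token.
confident = perplexity([0.9, 0.8, 0.95])
```

The uniform case yields a perplexity of 4, matching the intuition that the model is choosing among four equally likely words. A more confident model scores closer to the minimum value of 1.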

Part IV: Production and Deployment

  • Chapter 20: Stateless and Stateful Training: This chapter explores the difference between stateless and stateful training in production systems. Stateless training, often called stateless retraining, retrains the model from scratch with new data, like a sliding window approach. For example, in a stock trading classifier, if new data arrives daily, we might retrain the model periodically. Stateful training, on the other hand, updates the existing model with new data, similar to transfer learning. For a large language model like ChatGPT, stateful retraining makes sense as it can be updated based on user feedback without starting over.
  • Chapter 21: Data-Centric AI: Data-centric AI is a paradigm that focuses on improving model performance by iterating over the dataset rather than modifying the model. It contrasts with the conventional model-centric approach. For instance, in healthcare predictive analytics, if we use a fixed model and refine the training data from patients’ health records, it could be a data-centric approach. The main advantage is that a high-quality dataset benefits all modeling approaches downstream. However, in a research setting, choosing between data-centric and model-centric approaches depends on the goal. If developing a new methodology, a model-centric approach might be better.
  • Chapter 22: Speeding Up Inference: Discusses techniques to speed up model inference without changing the architecture or sacrificing accuracy. Parallelization, like batched inference, processes multiple samples at once instead of one by one. Vectorization performs operations on entire data structures simultaneously, leveraging low-level optimizations in computing systems. Loop tiling enhances data locality by breaking loops into smaller chunks, and operator fusion combines multiple loops into a single loop. Quantization reduces the computational and storage requirements by converting floating-point numbers to lower-precision representations. However, it may sometimes lead to a reduction in model accuracy.
  • Chapter 23: Data Distribution Shifts: Data distribution shifts occur when the data distribution during deployment differs from the training distribution. Covariate shift happens when the input data distribution changes but the relationship between input and output remains the same, such as in an email spam filter where the email features change but the spam-related patterns don’t. Label shift is a change in the class label distribution. Concept drift refers to a change in the mapping between input features and the target variable, and domain shift is a combination of covariate shift and concept drift. These shifts can cause significant drops in model performance, and different techniques are available to detect and mitigate them.
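
The quantization idea from Chapter 22 can be sketched with a simple symmetric 8-bit scheme. This is one of several possible schemes and is not necessarily the one the book describes. The function names and the weight values are hypothetical, and real frameworks apply more elaborate calibration.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    # The scale maps the largest magnitude to 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the signed 8-bit range used here.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]


# Hypothetical floating-point weights.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per value is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The integers need one byte each instead of four or eight, which is where the storage saving comes from. The nonzero max_error is the precision loss that can translate into the reduced accuracy noted above.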

Part V: Predictive Performance and Model Evaluation

  • Chapter 24: Poisson and Ordinal Regression: This chapter differentiates between Poisson and ordinal regression. Poisson regression is used for count data that follows a Poisson distribution, like predicting the number of goals a soccer player will score. Ordinal regression is for ordered categorical data where the distance between categories is arbitrary, such as movie ratings. We use Poisson regression when the mean and variance of the count data are roughly the same, and ordinal regression when we know the order of outcomes but not the exact difference between them.
  • Chapter 25: Confidence Intervals: There are several methods to construct confidence intervals for machine learning classifiers. The normal approximation interval is a simple method but assumes a normal distribution of the data. Bootstrapping training sets resamples the existing data to estimate the sampling distribution, while bootstrapping test set predictions applies the trained model to bootstrapped test sets. Retraining models with different random seeds can also be used to build confidence intervals. Each method has its advantages and disadvantages, and the choice depends on factors like the nature of the data and the model.
  • Chapter 26: Confidence Intervals vs. Conformal Predictions: Confidence intervals estimate the range of a population parameter, while prediction intervals, which conformal predictions help create, focus on the range of a single predicted target value. For example, in predicting people’s heights, a confidence interval might estimate the average height of a population, while a prediction interval estimates the height of an individual. Conformal predictions are distribution-free and can be applied to any machine learning algorithm, providing finite-sample guarantees, but they may require more computational resources.
  • Chapter 27: Proper Metrics: A proper metric should satisfy three criteria: non-negativity, symmetry, and the triangle inequality. The mean squared error (MSE) loss and cross-entropy loss are analyzed to see if they meet these criteria. The MSE is non-negative and symmetric but does not always satisfy the triangle inequality. The cross-entropy loss, used in training neural networks, does not meet all three criteria either. For example, it is not always 0 for identical points and is not symmetric. Understanding these properties helps in knowing how optimization algorithms will converge.
  • Chapter 28: The k in k-Fold Cross-Validation: k-fold cross-validation is a common method for evaluating machine learning classifiers. Choosing a large k has trade-offs. A large k makes the training sets more similar, which can be beneficial for approximating the final model’s performance but less useful for analyzing the algorithm’s behavior on different datasets. It also increases computational demands. A common choice for k is 5 or 10. For example, in 10-fold cross-validation, 90% of the data is used for training in each round. The appropriate value of k depends on the purpose, such as hyperparameter tuning or evaluating model performance.
  • Chapter 29: Training and Test Set Discordance: If a model performs better on the test dataset than the training dataset, there may be discrepancies in the data. First, check for technical issues in the code. Then, plot the target or label distributions of the training and test sets to look for differences. Adversarial validation can be used to identify discrepancies between training and test sets. If discrepancies are detected, techniques like removing features or training instances can be applied to mitigate the issues.
  • Chapter 30: Limited Labeled Data: When dealing with limited labeled data in supervised learning, several approaches can be used to improve model performance. Labeling more data is the best option but may not be feasible. Bootstrapping the data, transfer learning, self-supervised learning, active learning, few-shot learning, meta-learning, weakly supervised learning, semi-supervised learning, self-training, multi-task learning, multimodal learning, and choosing models with stronger inductive biases are all possible techniques. The choice depends on the specific context, such as the availability of data and the type of model.
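
As an illustration of the normal approximation interval from Chapter 25, the sketch below applies the textbook formula for a proportion: accuracy ± z × sqrt(accuracy × (1 − accuracy) / n), where n is the test set size. The formula is not spelled out in the summary above, so treat it as an assumption. The function name, the 90% accuracy, and the test set size of 100 are hypothetical.

```python
import math


def normal_approx_interval(accuracy, n_test, z=1.96):
    """Normal approximation confidence interval for a test set accuracy."""
    # Standard error of a proportion estimated from n_test independent examples.
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    # z = 1.96 corresponds to a 95% confidence level.
    return accuracy - z * std_err, accuracy + z * std_err


# Hypothetical classifier: 90% accuracy measured on 100 test examples.
lower, upper = normal_approx_interval(0.90, 100)
```

For these numbers the interval is roughly 0.841 to 0.959. Quadrupling the test set size would halve its width, which is one reason larger test sets give more trustworthy estimates.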

In conclusion, this abridged version provides a concise overview of key concepts from the book, covering neural networks and deep learning, computer vision, natural language processing, production and deployment, and model evaluation. Each chapter summary highlights the main techniques and considerations involved, helping readers deepen their understanding of the field.