What does the future of machine learning entail? Most of today’s core model architectures (neural networks, RNNs, and later LSTMs) were devised by researchers in the 1980s and 1990s, but only became practical once computing power caught up (faster processors, Moore’s law, and eventually GPUs). Where will the next big research and commercial breakthroughs come from? How will they be devised? In this article, I hope to provide context on some of the prevailing schools of thought in the machine learning community, as well as some open questions that I do not yet know the answer to.

Firstly, the goal of machine learning research in the past few decades has been to achieve generalization capabilities. Concretely, given that a model can solve Task X with a certain high level of accuracy, can it generalize to solve Task Y with the same or comparable level of accuracy? For instance, can a vision model trained to recognize dogs with a high degree of accuracy also generalize and recognize cats? There are a few thresholds that people use to measure generalization capabilities, often referred to as n-shot learning. In n-shot learning, we train the model on $n$ new examples for Task Y and then measure its performance. There is much philosophical debate about what the ideal $n$ should be. Can a model ever truly perform exceptionally at 0-shot learning if humans cannot?
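To make the setup concrete, here is a minimal sketch of an n-shot evaluation loop. The `model.adapt` and `predict` interface is hypothetical, standing in for whatever fine-tuning or in-context conditioning a real system would use.

```python
# Minimal sketch of an n-shot evaluation loop (hypothetical `model` interface).
# For n = 0 the model sees no examples of Task Y before being tested.

def n_shot_accuracy(model, task_y_examples, task_y_test, n):
    """Adapt `model` on n examples of Task Y, then measure accuracy on held-out data."""
    support = task_y_examples[:n]        # the n "shots"
    adapted = model.adapt(support)       # fine-tune or in-context condition (assumed API)
    correct = sum(adapted.predict(x) == y for x, y in task_y_test)
    return correct / len(task_y_test)
```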

In the research community, there exist two main schools of thought for how to achieve generalization capabilities: end-to-end machine learning and reasoning. End-to-end machine learning focuses on training as many components of a system as possible, all at once. For example, consider a retrieval-augmented generation (RAG) system where, given a prompt, a retrieval model retrieves relevant context that is fed to the generation model to produce an answer. An end-to-end method would train the retriever and generator jointly, as one holistic system. Much of the current research on end-to-end machine learning shows better generalization on few-shot tasks for RAG systems. However, there is more to a RAG system than just the retriever and generator models. The input data must first be extracted into a format the retriever can consume; a truly end-to-end system would propagate gradients through this component and train it as well.
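As a rough illustration of what “training the retriever and generator jointly” means, here is a toy sketch with made-up modules and a soft, differentiable retrieval step. It is not the API of any particular RAG library, just the shape of the idea.

```python
# A minimal sketch of end-to-end RAG training: one loss, gradients into both models.
import torch
import torch.nn as nn

class TinyRetriever(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.encode = nn.Linear(dim, dim)            # stand-in for a real query encoder

    def forward(self, query, passages):
        q = self.encode(query)                       # (dim,)
        scores = passages @ q                        # similarity to each candidate passage
        return torch.softmax(scores, dim=-1)         # soft (differentiable) retrieval

class TinyGenerator(nn.Module):
    def __init__(self, dim=64, vocab=100):
        super().__init__()
        self.out = nn.Linear(2 * dim, vocab)

    def forward(self, query, context):
        return self.out(torch.cat([query, context], dim=-1))   # next-token logits

retriever, generator = TinyRetriever(), TinyGenerator()
optim = torch.optim.Adam(list(retriever.parameters()) + list(generator.parameters()))

query = torch.randn(64)
passages = torch.randn(10, 64)        # 10 candidate passages
target_token = torch.tensor(3)        # toy supervision signal

weights = retriever(query, passages)  # soft retrieval weights
context = weights @ passages          # weighted mix keeps the computation graph connected
logits = generator(query, context)
loss = nn.functional.cross_entropy(logits.unsqueeze(0), target_token.unsqueeze(0))
loss.backward()                       # gradients flow into BOTH the retriever and generator
optim.step()
```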

As models are increasingly built from modular components, for instance encoder-decoder transformers, end-to-end machine learning will become more common, because connecting these modular components is getting easier. One example of this modular approach is the task of table recognition, where the goal is to construct a hierarchical HTML representation of a table given an image. Poloclub’s UniTable approach combines three task-specific decoders on top of a shared encoder model. Although this model achieves state-of-the-art performance, its generalization to out-of-distribution data points (i.e., new, more complex tables that the model has never seen) is still lacking. Here we have modular components, but not yet true end-to-end training.
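The shared-encoder, multiple-decoder pattern looks roughly like the sketch below. The module names, head outputs, and dimensions are illustrative, not UniTable’s actual implementation.

```python
# A rough sketch of a shared encoder feeding several task-specific heads.
import torch
import torch.nn as nn

class MultiTaskTableModel(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        # One lightweight head per task: table structure, cell boxes, cell content.
        self.structure_head = nn.Linear(dim, 512)    # e.g. HTML-tag vocabulary
        self.bbox_head = nn.Linear(dim, 4)           # box coordinates
        self.content_head = nn.Linear(dim, 1000)     # character/token vocabulary

    def forward(self, image_patches):
        h = self.encoder(image_patches)              # shared representation
        return self.structure_head(h), self.bbox_head(h), self.content_head(h)

model = MultiTaskTableModel()
patches = torch.randn(2, 196, 256)                   # batch of 2 images, 196 patch embeddings
structure, boxes, content = model(patches)
```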

The second school of thought, reasoning, focuses on a model’s ability to perform multi-step reasoning, often elicited through Chain-of-Thought (CoT) prompting. For example, proving a math theorem requires multiple layers of thinking, as each logical proposition must lead to the next to produce a proof. Google DeepMind’s AlphaProof is one example of research toward reasoning capabilities: a model that achieved silver-medal-level performance at the International Mathematical Olympiad, solving problems it had never seen before. Much of the recent progress in reasoning comes from a reinforcement learning approach, where the model teaches itself mathematical reasoning through a series of rewards and goal-oriented learning.
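As a purely illustrative example of what chain-of-thought prompting looks like in practice: the same question is asked directly and then with worked intermediate steps. The `generate` call is a placeholder for any text-generation API, not a specific library.

```python
# Illustrative only: direct prompt vs. a chain-of-thought style prompt.
direct_prompt = "Q: If 3x + 5 = 20, what is x? A:"

cot_prompt = (
    "Q: If 3x + 5 = 20, what is x?\n"
    "A: Let's think step by step.\n"
    "Subtract 5 from both sides: 3x = 15.\n"
    "Divide both sides by 3: x = 5.\n"
    "So the answer is 5.\n\n"
    "Q: If 7y - 4 = 24, what is y? A: Let's think step by step."
)

# answer = generate(cot_prompt)   # the model is nudged to show its intermediate steps
```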

Further reasoning-based improvements come from beyond the initial training step for a model (pre-training), in the wide field of post-training, which includes methods like finetuning, RLHF (reinforcement learning from human feedback), and alignment. These post-training methods are meant to improve a model’s performance without redoing pre-training, which is often the costliest part of training a model, since pre-training for LLMs now involves datasets approaching the size of the entire public internet. Finetuning is a classic method that has been around for a while in the machine learning community and has been shown to improve model capabilities on specific tasks. However, finetuning often narrows the scope of a model’s capabilities and typically does not lead to better few-shot learning. RLHF is another method, popularized by the breakthroughs at OpenAI, but its success has been hard to reproduce in the open-source community and it requires many human annotations (read: $$). The most promising direction in post-training has been alignment, with methods like Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO), which have been shown to outperform fine-tuning on few-shot learning tasks and which optimize essentially the same objective as RLHF without needing a separate reward model.
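For a flavor of what these alignment objectives look like, here is a minimal sketch of the DPO loss, following the published formula. The log-probabilities passed in are made-up placeholders; in practice they come from the policy being trained and a frozen reference model.

```python
# A minimal sketch of the Direct Preference Optimization (DPO) loss.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Each argument is the summed log-prob of a (chosen or rejected) response
    under the policy being trained or the frozen reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref, preferred y
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref, dispreferred y
    # Push the preferred response's ratio above the dispreferred one's, scaled by beta.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with fabricated log-probabilities for a batch of 3 preference pairs:
loss = dpo_loss(torch.tensor([-12.0, -9.5, -11.0]), torch.tensor([-14.0, -10.0, -13.0]),
                torch.tensor([-12.5, -9.7, -11.2]), torch.tensor([-13.5, -10.1, -12.8]))
```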

Another facet of generalization is multimodality: a model that can understand text, images, audio, and video. In an ideal world, a single model understands and interprets all of these modalities, but in practice multimodality is one of the most difficult settings for reasoning. Recent multimodal LLMs, like GPT-4o, GPT-4V, LLaVA, and others, have shown initial promise on current benchmarks (see the Side Note on Benchmarks below). However, multimodality remains incredibly difficult due to data scarcity, and reasoning over it is limited by models’ context window constraints. Additionally, choosing the best way to represent multimodal data has proven to be a difficult task in and of itself. While text is compact and easily tokenized, the best methods for embedding (projecting raw data into a fixed-size, multidimensional vector space) images, video, and audio are still an open research question. Entire startups have been built with the sole goal of providing better embedding models.
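To illustrate what “embedding into a shared space” means, here is a toy sketch with arbitrary dimensions and random features, loosely in the spirit of CLIP-style contrastive models rather than any specific one.

```python
# A toy sketch: project two modalities into a shared space, compare with cosine similarity.
import torch
import torch.nn as nn

text_encoder = nn.Linear(300, 128)     # e.g. pooled token features -> shared space
image_encoder = nn.Linear(2048, 128)   # e.g. pooled CNN/ViT features -> shared space

text_features = torch.randn(4, 300)    # batch of 4 captions (random stand-ins)
image_features = torch.randn(4, 2048)  # batch of 4 images (random stand-ins)

text_emb = nn.functional.normalize(text_encoder(text_features), dim=-1)
image_emb = nn.functional.normalize(image_encoder(image_features), dim=-1)

similarity = text_emb @ image_emb.T    # (4, 4) matrix: caption i vs. image j
# In CLIP-style training, the diagonal (matching pairs) is pushed up via a contrastive loss.
```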

What do I think the future of this space will look like? A combination of reasoning-based and end-to-end machine learning with performant alignment techniques. Although the two schools of thought have remained relatively separate, tasks such as multimodal perception have started to unify them through vision-language models. With better data and embedding models, combined with end-to-end approaches to training and alignment, we will see an improvement in overall generalization capabilities. Furthermore, as models scale with size and availability of compute, we will see larger context windows, allowing for more capable backbone encoder models that can process more information at once. Models will be able to synthesize more complex information and reason over it, but the key step lies in better alignment and end-to-end approaches. The next generation of models, like GPT-5, Llama-4, Claude-4, etc., will be significantly larger, but without end-to-end approaches and alignment they will lack true generalization capabilities. However, there are a few caveats to note with these large, all-in-one models:

  1. Hallucinations: The fundamental task of language models is next-token prediction, where a model must predict the next token in a sequence given the previous tokens. It is a well-known problem that today’s models can make up information that sounds plausible: they are just maximizing probabilities, and sometimes those probabilities are wrong.

  2. The First-Token Problem: Another issue with next-token prediction is the first-token problem. If you ask a model a yes-or-no question and the first token it produces is wrong (i.e., the answer is yes but the model says no), the entire generated sequence will be wrong, because every subsequent token is conditioned on that wrong first token and the model assigns high probability to text that justifies the wrong answer (see the toy illustration after this list).
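Here is a toy numerical illustration of the first-token problem; the probabilities are made up purely to show how the chain of conditionals locks in the first mistake.

```python
# Generation is a chain of conditional probabilities, so every later token is
# conditioned on the first one. All numbers below are fabricated for illustration.

p_first = {"Yes": 0.45, "No": 0.55}   # model slightly prefers the wrong first token

# Once "No" is sampled, the rest of the sequence is P(token | "No", ...): the model
# now assigns high probability to text that justifies "No", even though "Yes" was correct.
sequence_prob = p_first["No"] * 0.9 * 0.85   # P("No") * P(t2 | "No") * P(t3 | "No", t2)
print(f"P(wrong-answer explanation) = {sequence_prob:.2f}")
```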

I won’t argue against next-token prediction as the fundamental task driving progress towards generalization today, but here’s what I will argue: the AI models that everyday consumers use in the future will not be these foundation models directly. Instead, companies at the application layer will use foundation models as components of larger systems. The best systems will be trained end-to-end and use alignment to produce the highest-quality products for their customers. Instead of companies like OpenAI and Anthropic releasing their models for consumer use, they will become infrastructure providers for the larger ecosystem of AI companies at the application layer. This systems-over-models approach will push application-layer companies to build the best products for their customers.

Furthermore, from a business-model perspective, releasing a foundation model to consumers is not profitable. Given that OpenAI’s ChatGPT gained 100M users in record time, you would expect the company to have significantly more revenue than it currently does. However, the cost per token of consumer-facing generation is too high for the unit economics to work. “Hey, ChatGPT, write me a 5,000 word essay for my English class.” Assuming we are running GPT-4 and OpenAI has a 90% profit margin (a generous assumption), it costs them roughly $0.0003 per token. A token is about 3/4 of a word on average, so my single 5,000-word essay is roughly 6,700 tokens and just cost OpenAI about $2 (0.0003*5000*4/3), while I’m paying $20/month for unlimited queries. See how this business model doesn’t make sense? Even if the cost of generating a token goes down, the rate at which consumers use the platform will only increase, and the unit economics will still be broken. The best business model for foundation-model creators will be selling API access to application-layer companies, with better pricing than dollars-per-token.
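Here is the back-of-the-envelope version of that estimate; the per-token cost is an assumption for illustration, not OpenAI’s actual number.

```python
# Back-of-the-envelope cost of one 5,000-word essay under the assumptions above.
words = 5_000
tokens = words * 4 / 3          # ~3/4 of a word per token  ->  ~6,667 tokens
cost_per_token = 0.0003         # assumed cost to serve, in dollars (author's assumption)

essay_cost = tokens * cost_per_token
print(f"Cost of one essay: ${essay_cost:.2f}")                      # ~ $2.00
print(f"Essays that eat a $20 subscription: {20 / essay_cost:.0f}") # ~ 10
```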

The application layer is a really hard thing to get right. Creating a high-quality product for a real use case is tough, and it won’t be done in-house at most non-AI-native companies. The next generation of SaaS leaders will be a concentrated few application-layer AI companies that can quickly scale their product to market. The next generation of consumer leaders will also be a concentrated few application-layer AI companies with a sticky product and quality go-to-market teams. I’m not an investor, nor have I lived long enough to confidently pick a winner in this space, but based on my experience so far I have a few hunches about what type of company will win. Firstly, the team matters more than anything else. Having a killer team of applied AI researchers who can build end-to-end systems and integrate alignment into their platform is key, but that’s not all you need. The team also needs to include X-native sales and product people, where X is the application. For example, a winning company building AI call-center agents must include people who can effectively sell to call centers (most likely people highly familiar and connected with the space) and people who can build a production-grade AI system (most likely applied AI researchers and engineers intimately familiar with both cutting-edge research and standard engineering).

Finally, I want to cover a few of the inflection points that have gotten us to where we are today, because the principles underlying this incredible technological development come from simple, elegant ideas. In the early 2000s and leading up to the 2010s, most approaches to AI were statistics-based, grounded in highly complex systems built on the relevant academic fields. For instance, the earliest NLP models were based on statistical approaches to linguistics, leveraging sentence decomposition, grammatical conventions, and more to produce an answer. Open challenges, like the DARPA Grand Challenge and the ImageNet Challenge, led to innovations in the field and to outstanding organizations like Waymo and OpenAI, respectively. The inflection point for language and vision models came with the 2012 AlexNet paper by Alex Krizhevsky, Ilya Sutskever (co-founder of OpenAI), and Geoffrey Hinton. The simple, elegant insight was to apply a deep-learning approach to the problem of image classification. Essentially, AlexNet showed that simple components connected together, trained at scale on GPUs, could significantly outperform existing methods. AlexNet achieved a top-5 error of 15.3% in the ImageNet 2012 Challenge, more than 10.8 percentage points lower than that of the runner-up. After 2012, virtually every researcher adopted this deep-learning approach to image classification and extended it to other tasks beyond image classification.

The period from 2012 to 2021 in the machine learning community took inspiration from AlexNet and applied deep-learning approaches (powered by GPUs!) to many different, seemingly isolated problems. Websites like nlpprogress.com show the vast range of individual tasks people were attempting to solve with deep learning, such as entity linking, domain adaptation, named-entity recognition, and more. When I was working on neural machine translation in 2019, all of these tasks had separate models, each more complicated and intricate than the last. The problem was that each task had its own objective: the goal of image classification is different from that of object detection, which is different from visual question answering, and so on. The second machine-learning-specific inflection point that powered the current AI revolution came from OpenAI’s GPT-3. Using the transformer architecture (introduced in 2017), which scales far better than conventional recurrent networks, Ilya Sutskever (yes, the same guy as before) and OpenAI bet on a simple training objective for language modeling: next-token prediction. It turned out that next-token prediction was all you needed for basic generalization capabilities, if you fed the model enough data. For the first time, you could use one model for all of these NLP tasks with high accuracy. Instead of individual, complex models for each task, all you needed was one big enough model, now known as a foundation model.
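For concreteness, the next-token prediction objective is just cross-entropy over the sequence shifted by one position. Here is a toy sketch with random logits standing in for a real model’s output.

```python
# A minimal sketch of the next-token prediction objective (toy sizes, random logits).
import torch
import torch.nn.functional as F

vocab_size, seq_len = 1000, 8
token_ids = torch.randint(0, vocab_size, (1, seq_len))   # one toy sequence
logits = torch.randn(1, seq_len, vocab_size)             # stand-in for model output

# Position t predicts token t+1, so drop the last prediction and the first target.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    token_ids[:, 1:].reshape(-1),
)
```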

The combination of next-token prediction and deep-learning led us to where we are today, powered by inflection points in other areas, such as increased availability of GPUs and the transformer architecture. What will be the next inflection points that power progress towards a future with generalization-capable models?

  1. Crowd-sourced, high-quality data: As I mentioned previously, alignment techniques like DPO and KTO optimize essentially the same objective as RLHF while requiring significantly less human labeling and cost. These techniques are well positioned to displace RLHF as the prevailing method (Anthropic is already using DPO), but we need high-quality data to get to the next era of foundation models. Companies like OpenAI, Nvidia, and Google have already gotten into legal trouble over their data collection, having scraped essentially every website on the internet (without permission). As a result, more and more websites, like Reddit, have introduced anti-scraping provisions in their Terms and Conditions along with technical protections against scraping. The next step forward will be crowdsourcing high-quality data from users, who get paid or voluntarily provide data to power the AI revolution. Companies like Scale AI and Openmind.org (disclaimer: my company) will provide the infrastructure to power the next generation of foundation models. The main bottlenecks here are data-privacy considerations and the willingness of individuals to contribute (solution: pay them).
  2. New, scalable architectures: The main advantage of the transformer architecture is that it is easy to parallelize and run on a GPU, thanks to components like multi-head attention (see the short attention sketch after this list). Further optimizations, like FlashAttention, have led to additional speedups on GPUs. Alternative architectures, like Mamba, have shown promise in terms of better scalability, but are still in their early days of development. (Note: FlashAttention and Mamba were developed by the same person, Tri Dao, who I fully believe will be one of the top industry voices in the next 5 years.) If someone could develop an architecture that theoretically scales 10x better than a transformer, and then actually prove it, this would be a huge leap toward generalization-capable foundation models. The main bottleneck here is that most of these architectures come from academia, which currently lacks the infrastructure to prove that these models scale better. Papers like Mamba show better scaling at 3B parameters, but SoTA models are well over 40B parameters as of the time of writing. For now, I think transformers are here to stay for at least the next 2-3 years.
  3. Transformer-native chips (ASICs): The main limitation of current Nvidia GPUs is that they were not built specifically for training large-scale transformers. Even at full GPU utilization during training, a lot of time is still lost to data transfer, orchestration, and more. Companies like Groq and Etched are building ASICs tailored to transformers, which is a highly promising direction. But they are competing against one of the biggest moats in tech history to date: CUDA. CUDA is essentially Nvidia’s extension of C/C++ for writing and parallelizing operations on its GPUs, and it has the most extensive ecosystem out there for building kernels (functions, typically written in C/C++, meant to run in parallel on a GPU). Most of the optimized-parallelization products built over the last decade sit on top of CUDA, which means they can only run on Nvidia GPUs. This is likely part of why Google’s TPU has struggled commercially: many state-of-the-art kernels, like FlashAttention, are not supported on TPUs, which makes it hard to port the modern model-serving stack. The worst part (okay, best part for Nvidia) is that writing these kernels yourself is incredibly difficult and requires a lot of know-how: C programming, operating systems, parallelization, orchestration, and more. This moat is massive, but if companies like Groq and Etched can successfully transition developers away from CUDA, they stand to gain a lot of market share and power the development of highly capable models. As a note: Groq takes a differentiated approach with its LPUs, relying on a custom compiler that took five years to build and does not depend on CUDA at all. I’m bullish on Groq.
  4. Grokking (maybe): Grokking, not to be confused with Groq (the company) or Grok (Elon Musk’s foundation model competitor), is the process of pretraining a model far beyond the point of overfitting. Overfitting is when a model learns the patterns, noise, or details in the training data so well that it performs poorly on new, unseen data because it fails to generalize beyond the specific examples it was trained on. For some reason, if you keep training far beyond this point, the model is able to start performing well on new, unseen data and show generalization capabilities. However, this is really expensive to do when you are training a foundation model on the entire internet, so I’m not really sure that this will be a major breakthrough, unless a big tech company like Meta or Google decides to splurge and see what happens.
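Referenced in item 2 above: a rough sketch of (single-head) scaled dot-product attention. The whole operation is a couple of batched matrix multiplies plus a softmax, which is exactly the kind of work GPUs excel at. Shapes are illustrative.

```python
# Scaled dot-product attention in a few lines: matmul -> softmax -> matmul.
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """q, k, v: (batch, seq_len, dim) tensors."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)
    return weights @ v                                        # (batch, seq, dim)

out = attention(torch.randn(2, 16, 64), torch.randn(2, 16, 64), torch.randn(2, 16, 64))
```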

A Side Note on Benchmarks:

One of the trickiest problems with LLMs right now is evaluation. How you actually measure generalization capabilities is an open question, and current benchmarks seem to fall short. The most popular benchmark for LLM evaluation, MMLU, has been shown upon closer inspection to contain many errors, with some answer keys simply wrong. Furthermore, a recent paper, Alice in Wonderland, showed that if you change the numbers in simple MMLU-style questions (for example, if the question is “Alice has 5 apples and Bob has 2; how many apples do they have together?”, you swap the 5 and the 2 for different numbers), model performance drops from roughly 80% on average to roughly 2%. Wow! It seems like models suck at reasoning and probably just have MMLU in their training sets, which inflates their scores and makes them seem more competitive. Additionally, the authors introduce a new dataset of Alice-in-Wonderland (AIW) questions, which have simple, common-sense solutions that a human could easily get right. They also introduce a harder, logic-based set of questions, AIW+, on which leading models like GPT-4o and Claude 3 Opus completely fail (they scored a whopping 0%).

My takeaway from this paper is that in terms of Superintelligence and “AGI”, I believe we are quite a ways away. But for commercial use, the next few years will see a massive increase in automation and worker productivity due to these leading models.