HART: New AI Tool Produces High-Quality Images 9x Faster

The artificial intelligence (AI) market is accelerating rapidly. It has already grown from a ‘measly’ $80 billion in 2023 to a massive $279 billion in 2024. And this is just the beginning: the global AI market is projected to reach $1.8 trillion by 2040.

This projected growth has businesses betting big on AI’s impact, with a vast majority of organizations utilizing the technology to gain a competitive edge.

As a result, the AI software market is currently pulling in around $100 billion per year in revenue, while revenue from AI apps, a subset of this market, is expected to hit $18.8 billion by 2028. On the hardware side, the likes of Nvidia (NVDA) are cashing in on this soaring trend, with AI chip manufacturing revenue forecast to hit $83.25 billion by 2027.

All these enormous numbers paint a clear picture of AI’s vast scope. The technology is also bringing incredible changes to the way we create and edit images, opening up a whole new world of creativity.

Here, AI image generators are the major movement, with generative AI (gen AI) as their driving force.

Like the broader AI space, the gen AI market is advancing rapidly and is estimated to reach $207 billion by 2030. Interestingly, data suggests that AI was used to create about 34 million images every day in 2024.

So, a growing number of people are using AI to generate images, and most are enjoying it.

New Developments in AI Image Generation

As the adoption of AI tools grows, so do their capabilities. These tools create visuals from text prompts or by modifying existing images. The systems behind them learn from massive image datasets and can produce a wide range of realistic and consistent images.

Using AI to create images, however, isn’t anything new. For decades, artists and researchers have used AI to create artistic works. But AI art boomed in 2022 thanks to technological advances rooted in generative adversarial networks (GANs).

A GAN is made of two neural networks: a generator that learns to produce synthetic data mimicking the distribution of real data, and a discriminator that learns to distinguish between synthetic and real data.
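To make that setup concrete, here is a minimal GAN sketch in PyTorch. It is purely illustrative: the tiny network sizes, the flattened-image assumption, and the training hyperparameters are placeholder choices of ours, not details from any system discussed here.

```python
# A minimal GAN sketch in PyTorch (illustrative only): a generator maps random
# noise to fake images, while a discriminator scores images as real or fake.
import torch
import torch.nn as nn

LATENT_DIM = 64          # size of the random noise vector (illustrative choice)
IMG_PIXELS = 28 * 28     # a small grayscale image, flattened for simplicity

generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, IMG_PIXELS), nn.Tanh(),    # synthetic pixels in [-1, 1]
)

discriminator = nn.Sequential(
    nn.Linear(IMG_PIXELS, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),          # probability the input is real
)

bce = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_images: torch.Tensor) -> None:
    """One training step; real_images has shape [batch, IMG_PIXELS]."""
    batch = real_images.size(0)
    fake_images = generator(torch.randn(batch, LATENT_DIM))

    # The discriminator learns to tell real images from synthetic ones.
    d_loss = bce(discriminator(real_images), torch.ones(batch, 1)) + \
             bce(discriminator(fake_images.detach()), torch.zeros(batch, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # The generator learns to fool the discriminator into answering "real".
    g_loss = bce(discriminator(fake_images), torch.ones(batch, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

The two networks improve together: as the discriminator gets better at spotting fakes, the generator is pushed to produce ever more realistic output.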

The first GAN-generated images weren’t very convincing, but they did spark interest in AI-generated imagery. Over the years, researchers worked to enhance the technology’s ability to produce more complex output.

In 2019, a team at Nvidia revealed a GAN-based algorithm for generating photorealistic faces, though it still had flaws. Two years later, OpenAI released DALL-E, which could generate impressively realistic images from a simple text prompt.

Since then, DALL-E has been improved and is now being replaced by GPT-4o’s native image output, which “thinks” a bit longer than a dedicated image-generation model in order to produce more detailed and accurate images.

This week, OpenAI announced that its ChatGPT chatbot can now generate images from detailed, complex, and unusual instructions, taking it beyond just a text-generation tool. GPT-4o, which underpins this new version of ChatGPT, also allows it to receive and respond to images, voice commands, and videos.

“This is a completely new kind of technology under the hood. We don’t break up image generation and text generation. We want it all to be done together.”

– Gabriel Goh, an OpenAI researcher

GPT-4o native image generation is now live in both ChatGPT and the company’s AI video-generation product, Sora. For the new feature, the company trained the model on both publicly available and proprietary data, the latter of which comes from its partnerships with companies like Shutterstock.

Besides DALL-E and now ChatGPT, several other tools have been released on the market that offer enhanced capabilities and more functionality.

For instance, Midjourney is a paid option that generates amazing images of people and objects.

Then there’s Adobe Firefly, trained on Adobe Stock images, which stands out with its Structure Reference and Style Reference features. These let users supply a reference image whose layout and composition the model carries into a new image, or whose style it applies to a new output.

Stable Diffusion came from Stability AI as an open-source text-to-image generator, offering features like the ability to change the image’s aspect ratio and a “negative prompt” to keep unwanted elements out of an image.
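As a rough illustration of those two features, here is how they might look with the open-source diffusers library, one common way to run Stable Diffusion (Stability AI also offers its own interfaces). The checkpoint name, prompts, and image dimensions below are example values, not recommendations.

```python
# Hedged example using the open-source diffusers library. The checkpoint,
# prompts, and dimensions are example values only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="a misty mountain lake at sunrise, photorealistic",
    negative_prompt="people, text, watermark, blurry",  # elements to keep out
    height=576, width=768,                              # non-square aspect ratio
).images[0]
image.save("mountain_lake.png")
```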

Google has also released its own AI generator, ImageFX, which is powered by Imagen 3 and generates high-quality, realistic outputs.

Amidst this rapid progress in AI image-generation, researchers from NVIDIA and MIT have come together to develop a new AI model that combines autoregressive and diffusion techniques to generate high-quality images up to nine times faster than current state-of-the-art diffusion models.

A Hybrid AI Tool for Fast, High-Quality Image Generation

Generative AI techniques allow for the creation of new content. These techniques include Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Diffusion Models, and Transformer-based models, each with its unique strengths for tasks like image, text, and data generation.

Here, GANs and diffusion models, in particular, are good at generating realistic images and videos. 

The diffusion model is a popular machine learning approach that generates high-quality data by progressively adding noise to training data and then learning to reverse that process. Generation is iterative: the model predicts and subtracts random noise many times until the generated image is completely free of noise, and this denoising loop adds many steps to image generation.

So, while widely used for its ability to create very detailed and realistic images, the model has drawbacks, which include being extremely slow and computationally intensive for many applications.
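The sketch below shows the shape of that denoising loop in PyTorch. It is a deliberately simplified illustration rather than any particular model’s sampler: denoiser stands in for a trained noise-prediction network, and the update rule stands in for the schedule-dependent formulas real samplers use.

```python
# A deliberately simplified PyTorch sketch of the denoising loop: start from
# pure noise and repeatedly ask a trained model to predict and subtract noise.
# `denoiser` is a placeholder for a noise-prediction network.
import torch

NUM_STEPS = 50  # real diffusion samplers typically run tens of steps

@torch.no_grad()
def sample(denoiser, shape=(1, 3, 256, 256)):
    x = torch.randn(shape)                    # begin with pure random noise
    for t in reversed(range(NUM_STEPS)):      # walk from noisy back to clean
        predicted_noise = denoiser(x, t)      # estimate the noise still in x
        x = x - predicted_noise / NUM_STEPS   # remove a fraction of it each step
    return x                                  # the fully denoised image tensor
```

Every pass through the loop is a full forward pass of the network, which is why the step count dominates generation time.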

Interestingly, autoregressive models are much faster, but they generate poorer-quality images. On top of that, they are prone to errors.

These models are commonly used for predicting text and for powering large language models (LLMs) like ChatGPT. They are faster because they generate an image patch by patch in a single sequential pass, but they can’t go back to rectify their mistakes, so errors remain in the final image.

Autoregressive models use representations known as tokens to make predictions. To compress raw image pixels into discrete tokens, and to reconstruct the image from predicted tokens, they rely on an autoencoder. The information lost during compression, however, introduces errors into the generated image.
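For intuition, here is an illustrative PyTorch sketch of that token-by-token process. It is not a real library API: ar_model stands in for the transformer, and the autoencoder that maps between pixels and tokens appears only in the closing comment.

```python
# Illustrative sketch (not a real library API) of autoregressive image
# generation over discrete tokens.
import torch

@torch.no_grad()
def generate_tokens(ar_model, num_tokens=256, start_token=0):
    tokens = [start_token]
    for _ in range(num_tokens):
        context = torch.tensor(tokens).unsqueeze(0)   # [1, length so far]
        logits = ar_model(context)                    # [1, vocab_size] for the next token
        next_token = torch.multinomial(logits.softmax(dim=-1), 1).item()
        tokens.append(next_token)                     # committed; no going back to fix it
    return tokens[1:]

# The autoencoder's decoder would then reconstruct pixels from these tokens;
# whatever was lost when compressing pixels into tokens shows up as errors here.
# image = autoencoder.decode(generate_tokens(ar_model))
```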

So, researchers from MIT and NVIDIA introduced a new approach1 that combines the best of these two methods. This hybrid tool uses an autoregressive model to capture the big picture quickly and then a small diffusion model to refine the details.

“If you are painting a landscape, and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and then refine the image with smaller brush strokes, your painting could look a lot better. That is the basic idea with HART.”

– Co-lead author Haotian Tang, PhD ’25.

The tool, called HART (Hybrid Autoregressive Transformer), can produce images matching or even surpassing the quality of state-of-the-art diffusion models while being roughly nine times faster.

In HART, an autoregressive model predicts compressed, discrete image tokens, while a small diffusion model predicts residual tokens that compensate for the information lost in compression by capturing the details the discrete tokens leave out. According to Tang:

“We can achieve a huge boost in terms of reconstruction quality. Our residual tokens learn high-frequency details, like edges of an object, or a person’s hair, eyes, or mouth. These are places where discrete tokens can make mistakes.”

By using the diffusion model to predict only these remaining details, HART cuts the number of denoising steps needed to create a complete image from about 30 or more to just eight. In addition to offering speed and quality, it is also highly efficient.

Greater efficiency, as Tang stated, is the result of the diffusion model having an easier task. By consuming fewer computational resources than typical diffusion models, the tool can even run locally on a laptop or smartphone.

This means a user can generate an image simply by typing a natural language prompt into the HART interface, which could essentially democratize access to advanced image-generation tools. Integrating the diffusion model to boost the autoregressive model effectively wasn’t without its challenges, though.

When the diffusion model was incorporated in the early stages of the autoregressive process, the researchers found that errors accumulated. Using the diffusion model only for the final step, predicting the residual tokens, became the final design and improved generation quality substantially.

By using an autoregressive transformer model with 700 million parameters alongside a lightweight diffusion model with 37 million parameters, HART can generate images of the same quality as those produced by a diffusion model with 2 billion parameters, while using 31% less computation and being nine times faster.
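Putting those pieces together, the sketch below captures the two-stage idea in PyTorch-style pseudocode. Every name in it (ar_transformer, residual_diffuser, decoder) is a placeholder, and the structure reflects our reading of the published description rather than HART’s actual code.

```python
# Conceptual sketch of the two-stage idea described above. All names are
# placeholders; this is not HART's actual implementation.
import torch

RESIDUAL_STEPS = 8   # HART reportedly needs ~8 refinement steps vs. 30+ for
                     # a full diffusion model

@torch.no_grad()
def generate_image(prompt_embedding, ar_transformer, residual_diffuser, decoder):
    # Stage 1: a fast autoregressive pass lays out the "big picture"
    # as compressed, discrete tokens.
    discrete_tokens = ar_transformer.generate(prompt_embedding)
    coarse_latent = decoder.embed(discrete_tokens)

    # Stage 2: a lightweight diffusion model spends a handful of steps
    # predicting the residual detail (edges, hair, eyes) that the lossy
    # discrete tokens could not capture.
    residual = torch.randn_like(coarse_latent)
    for t in reversed(range(RESIDUAL_STEPS)):
        residual = residual_diffuser.denoise(residual, t, cond=coarse_latent)

    # Decode the corrected latent back into pixels.
    return decoder.to_pixels(coarse_latent + residual)
```

Because the expensive autoregressive pass happens once and the small diffusion model only touches the residual, the overall compute stays well below that of a full diffusion sampler.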

On top of it all, having the same kind of autoregressive model that powers LLMs do most of the work makes HART easier to integrate with the new class of unified vision-language generative models.

“LLMs are a good interface for all sorts of models, like multimodal models and models that can reason. This is a way to push the intelligence to a new frontier. An efficient image-generation model would unlock a lot of possibilities.” 

– Tang

The exciting thing about HART is its use cases, which go beyond just creating images for fun and marketing. 

The model can have a wide range of applications, such as virtual reality, generating stunning and impactful scenes for video games, training robots to complete complex real-world tasks, and creating realistic simulated environments to train self-driving cars to make them safer on real streets.

Given HART’s current development stage, it could be integrated into commercial applications within the next 1 to 3 years.

Innovative Company

NVIDIA Corporation (NVDA)

A leading designer of graphics processing units (GPUs) and AI hardware, NVIDIA is at the forefront of developing AI models like HART that enhance image-generation capabilities. Nvidia’s AI accelerators are estimated to account for between 70% and 95% of the AI chip market.

As CEO Jensen Huang noted, last year “almost the entire world got involved” in AI. “The computational requirement, the scaling law of AI, is more resilient, and in fact, is hyper-accelerated,” he added.

Last week, during its annual developer conference, GTC 2025, the world’s largest chip company unveiled new products: Vera Rubin and Blackwell Ultra.

Blackwell Ultra is a family of chips set to ship in the second half of this year. Designed for AI reasoning, the chip will be able to produce more tokens per second than its predecessor, which translates to more content generated in the same amount of time.

Nvidia says cloud providers would be able to offer a premium AI service for time-sensitive applications using Blackwell Ultra. The official announcement notes that it is ideal for Agentic AI and Physical AI. 

Vera Rubin is the next-gen system expected to start shipping next year. In this system, Vera is a CPU based on a core design named Olympus. Previously, Nvidia used an off-the-shelf design from Arm for its CPU needs but has now introduced its first custom CPU design, which Nvidia said will be twice as fast as the CPU used in last year’s Grace Blackwell chips.

Rubin, named after astronomer Vera Rubin, is the GPU that, paired with Vera, can manage 50 petaflops during inference, more than twice the performance of the current Blackwell chips. To meet AI developers’ demand, Rubin will also support 288 gigabytes of fast memory. A “Rubin Next” chip is planned for 2027 as well; it will combine four dies into a single chip to double Rubin’s speed.

The company has also introduced two AI-focused PCs, DGX Spark and DGX Station, that can run large AI models. According to Nvidia, Dell, ASUS, Lenovo, and HP Inc. are developing these systems.

DGX Spark, described as the “world’s smallest AI supercomputer,” is a new high-performance PC built on Nvidia’s Blackwell platform that will let AI researchers, developers, and data scientists test inference on large models right on their desktops.

Meanwhile, DGX Station will bring data-center-level performance to people’s desktops for the development of the technology.

“AI has transformed every layer of the computing stack. It stands to reason a new class of computers would emerge — designed for AI-native developers and to run AI-native applications. With these new DGX personal AI computers, AI can span from cloud services to desktop and edge applications.”

– NVIDIA CEO Jensen Huang

An open-source software framework called Dynamo will now address AI inference challenges at scale. Huang described it as the “operating system of an AI factory,” and much like how the real-world dynamo helped kick off the industrial revolution, he expects this one to revolutionize AI by helping users get the most out of their chips.

This inference suite is designed to optimize inference engines such as vLLM, SGLang, and TensorRT-LLM so that they run efficiently across large numbers of GPUs.

Nvidia claims that Dynamo can double inference performance for Hopper-based systems running Llama models, and deliver a 30x advantage over Hopper for DeepSeek-R1 on the larger Blackwell NVL72 systems.

The company has also shared plans to address the challenges of training robots with Nvidia Cosmos, which speeds up synthetic data generation. The platform will help developers build physical AI models by acting as a foundation for post-training.

As for Nvidia’s vision for healthcare, it involves robotic surgeons and fully automated hospitals through Nvidia Isaac.

In the field of quantum computing, the Nvidia Accelerated Quantum Research Center will integrate quantum hardware with AI supercomputers to accelerate quantum supercomputing, which Huang says will help “tackle some of the world’s most important problems, from drug discovery to materials development.”

Now, when it comes to company financials, Nvidia, with a market cap of $2.945 trillion, saw a 12% increase in revenue from the previous quarter to $39.3 billion, up 78% from a year ago. Full-year revenue, meanwhile, was a record $130.5 billion, surging by a massive 114%.

For the fourth quarter, which ended January 26, 2025, both GAAP and non-GAAP earnings per diluted share came in at $0.89.

As of writing, NVDA shares are trading at $119.93, down 10.13% YTD. The company pays out a dividend yield of about 0.03%. A quarterly cash dividend of $0.01 per share will be paid to shareholders on April 2, 2025.


Conclusion

As we saw, the adoption of AI tools and the capabilities they offer are advancing at an unprecedented rate. Amidst this ongoing hyper-growth, HART from MIT and NVIDIA scientists represents a major breakthrough in AI-powered image generation, offering a solution that balances speed, quality, and efficiency.

As this technology continues to develop, we are bound to see a shift in which AI-generated content becomes even more seamless, precise, and widely adopted across industries.


1. Tang, H., Wu, Y., Yang, S., Xie, E., Chen, J., Chen, J., Zhang, Z., Cai, H., Lu, Y., & Han, S. (2024). HART: Efficient Visual Generation with Hybrid Autoregressive Transformer. arXiv. https://doi.org/10.48550/arXiv.2410.10812
