This scientific research has three components. First, my most recent advances towards solving one of the most famous, centuries-old conjectures in number theory: one that kids in elementary school can understand, yet that is incredibly hard to prove. At its core, it is about the spectacular quantum dynamics of the digit sum function.
Then, I present an infinite dataset containing all the patterns you or AI can imagine, and many more, ranging from obvious to undetectable. More specifically, it is an infinite number of infinite datasets, all in tabular format, with various degrees of auto- and cross-correlation (short- and long-range) to test, enhance, and benchmark AI algorithms, including LLMs. It is based on the physics of the digit sum function and linked to the aforementioned conjecture. This one-of-a-kind synthetic data is useful in contexts such as fraud detection or cybersecurity.
Finally, it comes with very efficient Python code to generate the data, which involves gigantic numbers and high-precision arithmetic.
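To make this concrete, here is a minimal sketch of the kind of computation involved: digit sums of gigantic integers via Python's native arbitrary-precision integers, and digit sums of the first decimal digits of an irrational constant obtained with the standard-library decimal module. The choice of 2^n and sqrt(2) is purely illustrative; this is not the generator described in the paper.

```python
from decimal import Decimal, getcontext

def digit_sum(n: int) -> int:
    """Sum of the decimal digits of a (possibly gigantic) integer."""
    return sum(int(d) for d in str(abs(n)))

# Gigantic integers: Python's built-in ints have arbitrary precision,
# so digit_sum(2**n) works even for very large n.
for n in (1000, 5000, 10000):
    print(f"n={n:6d}  digit_sum(2^n) = {digit_sum(2**n)}")

# High-precision arithmetic: cumulative digit sum of the first decimal
# digits of sqrt(2), computed with the standard-library decimal module.
getcontext().prec = 10_000                        # significant digits
digits = str(Decimal(2).sqrt()).replace(".", "")[:10_000]
total = 0
for d in digits:
    total += int(d)
print("cumulative digit sum over 10,000 digits of sqrt(2):", total)
```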
The Rise of Specialized LLMs for Enterprise - https://mltblog.com/3QXXE4I
In this article, I discuss the main problems of standard LLMs (OpenAI and the like), and how the new generation of LLMs addresses these issues. The focus is on Enterprise LLMs.
LLMs with Billions of Parameters: Most LLMs still fall into that category. The first ones (ChatGPT) appeared around 2022, though BERT is an early precursor. Most recent books discussing LLMs still define them as transformer architectures with deep neural networks (DNNs), costly training, and reliance on GPUs. The training is optimized to predict the next or missing tokens. However, this task is only remotely relevant to what modern LLMs now deliver to the user, see here. Yet it requires time and intensive computing resources. Indeed, this type of architecture works best with billions or trillions of tokens. In the end, most of these tokens are noise, requiring smart distillation to improve performance.
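As a reminder of what that training objective actually is, here is a toy next-token predictor built from bigram counts. It only illustrates the "predict the next token" task itself, not the transformer machinery that real LLMs use to optimize it.

```python
from collections import defaultdict, Counter

# Toy illustration of the next-token objective: learn, from a tiny "corpus",
# which token most often follows each token, then predict continuations.
corpus = "the cat sat on the mat the cat ate".split()

next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequent continuation seen during 'training'."""
    return next_counts[token].most_common(1)[0][0]

print(predict_next("the"))   # -> 'cat' (seen twice, vs. 'mat' once)
print(predict_next("cat"))   # -> 'sat' (ties resolved by first occurrence)
```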
The main issues are:
➡️ Performance: Requires GPUs and large corpora as input data. Re-training is expensive. Hallucinations are still a problem. Fine-tuning is delicate (black box). You need prompt engineering to get the best results. Mixture of experts (multiple sub-LLMs, as in DeepSeek) is one step towards improving accuracy; see the sketch after this list.
➡️ Cost: Besides the GPU costs, the pricing model charges by the token, incentivizing vendors to rely on models trained on billions of tokens.
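For readers unfamiliar with the mixture-of-experts idea mentioned above, the sketch below shows top-1 routing with a softmax gate over three hypothetical "experts". It is a toy at the query level with random gating weights; production MoE models such as DeepSeek route individual tokens inside the transformer layers, with gating weights learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three stand-in "experts" (real MoE experts are sub-networks, not functions).
def expert_math(x):    return f"math expert answers: {x}"
def expert_code(x):    return f"code expert answers: {x}"
def expert_general(x): return f"general expert answers: {x}"

experts = [expert_math, expert_code, expert_general]
W = rng.normal(size=(len(experts), 16))   # gating weights (learned in a real MoE)

def gate(embedding: np.ndarray) -> np.ndarray:
    """Softmax scores over the experts for this input."""
    logits = W @ embedding
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

query_embedding = rng.normal(size=16)     # pretend this encodes the user prompt
scores = gate(query_embedding)
top = int(np.argmax(scores))              # top-1 routing: pick the best-scoring expert
print(experts[top]("What is 2 + 2?"), "| gate scores:", np.round(scores, 2))
```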