This scientific research has three components. First, my most recent advances towards solving one of the most famous, centuries-old conjectures in number theory: one that kids in elementary school can understand, yet that is incredibly hard to prove. At its core, it is about the spectacular quantum dynamics of the digit sum function.
Then, I present an infinite dataset containing all the patterns you or AI can imagine, and many more, ranging from obvious to undetectable. More specifically, it is an infinite number of infinite datasets, all in tabular format, with various degrees of auto- and cross-correlation (short- and long-range) to test, enhance, and benchmark AI algorithms, including LLMs. It is based on the physics of the digit sum function and linked to the aforementioned conjecture. This one-of-a-kind synthetic data is useful in contexts such as fraud detection or cybersecurity.
Finally, it comes with very efficient Python code to generate the data, which involves gigantic numbers and high-precision arithmetic.
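To make this concrete, here is a minimal sketch of the kind of computation involved: digit sums of gigantic integers via Python's native arbitrary-precision integers, and digit sums of the first decimal digits of an irrational constant obtained with the standard-library decimal module. The choice of 2^n and sqrt(2) is purely illustrative; this is not the generator described in the paper.

```python
from decimal import Decimal, getcontext

def digit_sum(n: int) -> int:
    """Sum of the decimal digits of a (possibly gigantic) integer."""
    return sum(int(d) for d in str(abs(n)))

# Gigantic integers: Python's built-in ints have arbitrary precision,
# so digit_sum(2**n) works even for very large n.
for n in (1000, 5000, 10000):
    print(f"n={n:6d}  digit_sum(2^n) = {digit_sum(2**n)}")

# High-precision arithmetic: cumulative digit sum of the first decimal
# digits of sqrt(2), computed with the standard-library decimal module.
getcontext().prec = 10_000                        # significant digits
digits = str(Decimal(2).sqrt()).replace(".", "")[:10_000]
total = 0
for d in digits:
    total += int(d)
print("cumulative digit sum over 10,000 digits of sqrt(2):", total)
```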
The Rise of Specialized LLMs for Enterprise - https://mltblog.com/3QXXE4I
In this article, I discuss the main problems of standard LLMs (OpenAI and the like), and how the new generation of LLMs addresses these issues. The focus is on Enterprise LLMs.
LLMs with Billions of Parameters: Most LLMs still fall into that category. The first ones (ChatGPT) appeared around 2022, though BERT is an early precursor. Most recent books discussing LLMs still define them as transformer architectures with deep neural networks (DNNs), costly training, and reliance on GPUs. The training is optimized to predict the next or missing tokens. However, this task is only remotely relevant to what modern LLMs now deliver to the user, see here. Yet it requires time and intensive computing resources. Indeed, this type of architecture works best with billions or trillions of tokens. In the end, most of these tokens are noise, requiring smart distillation to improve performance.
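As a reminder of what that training objective actually is, here is a toy next-token predictor built from bigram counts. It only illustrates the "predict the next token" task itself, not the transformer machinery that real LLMs use to optimize it.

```python
from collections import defaultdict, Counter

# Toy illustration of the next-token objective: learn, from a tiny "corpus",
# which token most often follows each token, then predict continuations.
corpus = "the cat sat on the mat the cat ate".split()

next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequent continuation seen during 'training'."""
    return next_counts[token].most_common(1)[0][0]

print(predict_next("the"))   # -> 'cat' (seen twice, vs. 'mat' once)
print(predict_next("cat"))   # -> 'sat' (ties resolved by first occurrence)
```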
The main issues are:
➡️ Performance: Requires GPUs and large corpora as input data. Re-training is expensive. Hallucinations are still a problem. Fine-tuning is delicate (black box). You need prompt engineering to get the best results. Mixture of experts (multiple sub-LLMs, as in DeepSeek) is one step towards improving accuracy; see the sketch after this list.
➡️ Cost: Besides the GPU costs, the pricing model charges by the token, incentivizing vendors to rely on models trained on billions of tokens.
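For readers unfamiliar with the mixture-of-experts idea mentioned above, the sketch below shows top-1 routing with a softmax gate over three hypothetical "experts". It is a toy at the query level with random gating weights; production MoE models such as DeepSeek route individual tokens inside the transformer layers, with gating weights learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three stand-in "experts" (real MoE experts are sub-networks, not functions).
def expert_math(x):    return f"math expert answers: {x}"
def expert_code(x):    return f"code expert answers: {x}"
def expert_general(x): return f"general expert answers: {x}"

experts = [expert_math, expert_code, expert_general]
W = rng.normal(size=(len(experts), 16))   # gating weights (learned in a real MoE)

def gate(embedding: np.ndarray) -> np.ndarray:
    """Softmax scores over the experts for this input."""
    logits = W @ embedding
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

query_embedding = rng.normal(size=16)     # pretend this encodes the user prompt
scores = gate(query_embedding)
top = int(np.argmax(scores))              # top-1 routing: pick the best-scoring expert
print(experts[top]("What is 2 + 2?"), "| gate scores:", np.round(scores, 2))
```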