Dropping an entire collection of Style Intermixing Adapters on StrangerZone HF, including Realism, Anime, Sketch, Texture-Rich 3D Experimentals, Automotive Concept Images, and LoRA models based on Flux.1, SD 3.5 Turbo/Large, and Stable Diffusion XL.
OpenAI just released a 34-page practical guide to building agents.
Here are 10 things it teaches us:
1. Agents are different from workflows: they are autonomous systems that perform tasks on your behalf. Many applications use LLMs inside workflows, but that alone does not make them agents.
2. Use them for the tricky stuff: complex decision-making, dynamic rules, unstructured data.
3. Core recipe: each agent has three main components: a Model (the brain), Tools, and Instructions on how to behave (see the sketch after this list).
4. Choose the right brain: set up evals to get a baseline, use a smart model to see what's possible, then gradually downgrade the model for cost and speed.
5. Tools are key: choose well-defined and tested tools. An agent needs tools to retrieve data and context, and to take actions.
6. Instructions matter A LOT: be super clear about the agent's goals, steps, and rules. Vague instructions = unpredictable agent. Be explicit.
7. Start simple, then scale: a single agent with several tools is often enough. Don't jump to complex multi-agent systems immediately.
8. If you go multi-agent: you can have a "manager" agent directing traffic to specialist agents, or have agents hand off tasks to each other.
9. Guardrails are a MUST: check user input for weird stuff, make sure the agent isn't about to do something risky, filter out private info, block harmful content. Don't let it run wild.
10. Build and plan for humans: start small, test, improve. Always have a plan for when the agent gets stuck or is about to do something high-risk.
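To make point 3 concrete, here is a minimal sketch of the model + tools + instructions recipe, with a bounded loop and a human-escalation fallback (points 9 and 10). The `call_llm` helper and the weather tool are hypothetical stand-ins, not the OpenAI Agents SDK:

```python
import json

def get_weather(city: str) -> str:
    """Example tool: a real agent would call an actual weather API here."""
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}  # Tools: well-defined and tested (point 5)

# Instructions: explicit goals, steps, and rules (point 6)
INSTRUCTIONS = (
    "You are a travel assistant. If you need data, reply ONLY with JSON "
    '{"tool": "<name>", "args": {...}}. Otherwise answer the user directly.'
)

def call_llm(system: str, messages: list[dict]) -> str:
    """Model: the brain. Swap in your provider's chat-completions call."""
    raise NotImplementedError

def run_agent(user_input: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_turns):  # guardrail: bound the loop (point 9)
        reply = call_llm(INSTRUCTIONS, messages)
        try:
            call = json.loads(reply)  # the model asked for a tool
            result = TOOLS[call["tool"]](**call["args"])
            messages.append({"role": "tool", "content": result})
        except (json.JSONDecodeError, KeyError):
            return reply  # plain answer, we're done
    return "Stopped: turn limit reached, escalate to a human (point 10)."
```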
This dataset is designed to post-train metareasoning agents: agents whose job is to decide quickly (and, importantly, cheaply) whether a query warrants a full reasoning run or just a plain completions job.
The generation notebook (linked in the dataset) is open source and, if I do say so myself, pretty well generalized, so you can use it to make your own metareasoning datasets.
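For flavor, here is a hypothetical sketch of the routing pattern such an agent learns: a cheap model first judges whether full reasoning is worth launching. The model names and the `complete` helper are illustrative assumptions, not part of the dataset:

```python
def complete(model: str, prompt: str) -> str:
    """Stand-in for a single completions call to any provider."""
    raise NotImplementedError

ROUTER_PROMPT = (
    "Answer YES if this query needs multi-step reasoning, NO otherwise.\n"
    "Query: {q}\nAnswer:"
)

def answer(query: str) -> str:
    # Cheap metareasoning pass first...
    verdict = complete("small-cheap-model", ROUTER_PROMPT.format(q=query))
    if verdict.strip().upper().startswith("YES"):
        return complete("big-reasoning-model", query)  # expensive path
    return complete("small-cheap-model", query)        # simple completion
```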
Shoutout to @onekq for his inspiring comment on this topic.
Try out the demo for Multimodal OCR, featuring implementations of models including RolmOCR and Qwen2VL OCR. The use case showcases image-text-to-text conversion, with video understanding support for the RolmOCR model!
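If you would rather script it than use the demo, a minimal sketch of image-text-to-text OCR with the stock Qwen2-VL checkpoint via Transformers looks roughly like this (the model id and image path are placeholders; the demo uses fine-tuned variants):

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # placeholder for the OCR fine-tune
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("sample.png")  # any document or scene-text image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract all text from this image."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
new_tokens = output_ids[:, inputs.input_ids.shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```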
Google published a 69-page whitepaper on Prompt Engineering and its best practices, a must-read if you are using LLMs in production:
> zero-shot, one-shot, few-shot
> system prompting
> chain-of-thought (CoT)
> ReAct
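As a taste of the first two techniques, here is a minimal sketch of a few-shot prompt and a zero-shot chain-of-thought trigger (the example reviews and question are made up):

```python
# Few-shot: show the model the pattern before asking it to continue.
few_shot = """Classify the review sentiment as POSITIVE or NEGATIVE.

Review: "The battery lasts all day." Sentiment: POSITIVE
Review: "The screen cracked within a week." Sentiment: NEGATIVE
Review: "Shipping was fast and setup was easy." Sentiment:"""

# Zero-shot chain-of-thought: one trigger phrase elicits step-by-step reasoning.
cot = (
    "Q: A cafe sold 17 coffees at $4 each and 5 teas at $3 each. "
    "What was the total revenue?\n"
    "A: Let's think step by step."
)
```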
Loaded some domain-specific downstream image-classification models for content moderation, which is essentially the practice of monitoring and filtering user-generated content on platforms. They are based on SigLIP-2 Base Patch16 with newly initialized trainable parameters.
The best researchers from Yale, Stanford, Google DeepMind, and Microsoft laid out all we know about Agents in a 264-page paper [book].
Here are some of their key findings:
They build a mapping of different agent components, such as perception, memory, and world modelling, to different regions of the human brain and compare them:
- the brain is much more energy-efficient
- agents have no genuine experience
- the brain learns continuously; the agent is static
An agent breaks down into:
- Perception: the agent's input mechanism. Can be improved with multi-modality, feedback mechanisms (e.g., human corrections), etc.
- Cognition: learning, reasoning, planning, memory. LLMs are key in this part.
- Action: the agent's output and tool use.
Agentic memory is represented as:
- Sensory memory: short-term holding of inputs, not emphasized much in agents.
- Short-term memory: the LLM context window.
- Long-term memory: external storage such as RAG or knowledge graphs.
Memory in agents can be improved and researched in terms of:
- increasing the amount of stored information
- how to retrieve the most relevant info
- combining context-window memory with external memory (see the sketch below)
- deciding what to forget or update in memory
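A minimal sketch of that third point, combining a bounded context window with an external archive. The keyword-overlap recall here is a stand-in for real embedding-based retrieval:

```python
from collections import deque

class AgentMemory:
    def __init__(self, window: int = 10):
        self.short_term = deque(maxlen=window)  # recent turns: the "context window"
        self.long_term: list[str] = []          # external store; a vector DB in practice

    def remember(self, text: str) -> None:
        self.short_term.append(text)  # old turns fall out of the window...
        self.long_term.append(text)   # ...but everything stays archived

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Naive keyword overlap; swap in embeddings/RAG for real relevance.
        words = set(query.lower().split())
        hits = [t for t in self.long_term if words & set(t.lower().split())]
        return hits[:k]

    def build_context(self, query: str) -> str:
        # Retrieved long-term memories plus the live window go into the prompt.
        return "\n".join([*self.recall(query), *self.short_term])
```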
The agent must simulate or predict the future states of the environment for planning and decision-making.
AI world models are much simpler than humans', which draw on causal reasoning (cause and effect) and physical intuition.
LLM world models are mostly implicit and embedded.
EMOTIONS are a deep aspect of humans, helping them with social interactions, decision-making, and learning.
Agents must understand emotions to better interact with us.
But rather than encoding the felt experience of emotions, agents have only a surface-level model of them.
Perception is the process by which an agent receives and interprets raw data from its surroundings.
ChatGPT-4o's image generation goes wild for a week, featuring everything from Studio Ghibli-style art and image colorization to style intermixing. Here are some examples showcasing the generation of highly detailed images from freestyle design templates. Want to know more? Check out the blog.
What, How, Where, and How Well? This paper reviews test-time scaling methods and all you need to know about them:
> parallel, sequential, hybrid, internal scaling
> how to scale (SFT, RL, search, verification)
> metrics and evals of test-time scaling
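As a toy example of parallel scaling, here is self-consistency: sample several answers at nonzero temperature, then majority-vote. The `sample_answer` helper is a hypothetical stand-in for one stochastic LLM call:

```python
from collections import Counter

def sample_answer(prompt: str) -> str:
    """One stochastic (temperature > 0) LLM call; provider-specific."""
    raise NotImplementedError

def self_consistency(prompt: str, n: int = 8) -> str:
    votes = Counter(sample_answer(prompt) for _ in range(n))  # parallel samples
    return votes.most_common(1)[0][0]                         # majority vote
```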
Luna, a single-speaker text-to-speech model, features a Radio & Atcosim-style sound with a female voice. It offers authentic radio-podcast noise and empathetic speech generation, fine-tuned from Orpheus, a state-of-the-art Llama-based speech-generation model.
Dropping some new Journey Art and Realism adapters for Flux.1-Dev, including Thematic Arts, 2021 Memory Adapters, Thread of Art, Black of Art, and more. For more details, visit the model card on Stranger Zone HF.
The best dimensions and inference settings for optimal results: a resolution of 1280 x 832 (3:2 aspect ratio) is recommended for the best quality, while 1024 x 1024 (1:1 aspect ratio) serves as the default option. For inference, 30 to 35 steps are recommended for optimal output.
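A minimal diffusers sketch with those settings, assuming a Flux.1-Dev adapter; the LoRA repo id is a placeholder, see the model card for the real one:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("strangerzonehf/<adapter-repo>")  # placeholder repo id

image = pipe(
    "thematic art, a quiet harbor at dusk",
    width=1280, height=832,   # recommended 3:2; default is 1024 x 1024 (1:1)
    num_inference_steps=32,   # recommended range: 30-35
).images[0]
image.save("out.png")
```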
Dropping downstream tasks using newly initialized parameters and weights ([classifier.bias & weights]) that support domain-specific image classification. Based on siglip2-base-patch16-224 and DomainNet (single-domain, multi-source adaptation), with Fashion-MNIST & more for experimental testing.
Models are trained with different parameter settings for experimental purposes only, with the intent of further development. Refer to the model page below for instructions on running them with Transformers.
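For reference, this is roughly how a fresh classification head gets attached to SigLIP-2 with Transformers; the labels here are illustrative, not the actual models':

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "google/siglip2-base-patch16-224"
labels = ["safe", "unsafe"]  # illustrative content-moderation labels

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)  # classifier.weight & classifier.bias are newly initialized here, then fine-tuned
```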
Play with Orpheus TTS, a Llama-based Speech-LLM designed for high-quality, empathetic text-to-speech generation. This model has been fine-tuned to deliver human-level speech synthesis.