
Akash Singh

akashicmarga

AI & ML interests

Conversational AI


Organizations

Spaces-explorers, ZeroGPU Explorers, Saarthi.AI, MLX Community, Social Post Explorers, Smol Community

akashicmarga's activity

reacted to clem's post with 🚀 29 days ago
I've been in Brazil for 10 days now 🇧🇷🇧🇷🇧🇷

I've been surprised by the gap between the massive number of people interested in AI (ChatGPT adoption is crazy here) and the relatively low number of real AI builders - aka people and companies building their own AI models, datasets and apps.

Lots of effort is needed across the world for everyone to participate in, control, and benefit from this foundational technology, starting with open-source & multilingual AI, more access to GPUs, and AI builder training for all!
reacted to maxiw's post with 👍 29 days ago
You can now try out computer-use models from the Hub to automate your local machine with https://github.com/askui/vision-agent. 💻

import time
from askui import VisionAgent

with VisionAgent() as agent:
    agent.tools.webbrowser.open_new("http://www.google.com")
    time.sleep(0.5)
    agent.click("search field in the center of the screen", model_name="Qwen/Qwen2-VL-7B-Instruct")
    agent.type("cats")
    agent.keyboard("enter")
    time.sleep(0.5)
    agent.click("text 'Images'", model_name="AskUI/PTA-1")
    time.sleep(0.5)
    agent.click("second cat image", model_name="OS-Copilot/OS-Atlas-Base-7B")


Currently these models are integrated via the Gradio Spaces API. We're also planning to add local inference soon!

Currently supported:
- Qwen/Qwen2-VL-7B-Instruct
- Qwen/Qwen2-VL-2B-Instruct
- AskUI/PTA-1
- OS-Copilot/OS-Atlas-Base-7B
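
Switching a step to another supported checkpoint is just a matter of changing the model_name argument. A minimal sketch, reusing only the calls shown above (the search term and element descriptions are made up for illustration):

import time
from askui import VisionAgent

with VisionAgent() as agent:
    # open a page and drive it with the smaller Qwen2-VL checkpoint
    agent.tools.webbrowser.open_new("http://www.google.com")
    time.sleep(0.5)
    agent.click("search field in the center of the screen", model_name="Qwen/Qwen2-VL-2B-Instruct")
    agent.type("dogs")
    agent.keyboard("enter")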
reacted to Jaward's post with 👍 29 days ago
reacted to akhaliq's post with 👍 7 months ago
Chameleon

Mixed-Modal Early-Fusion Foundation Models

Chameleon: Mixed-Modal Early-Fusion Foundation Models (2405.09818)

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in unified modeling of full multimodal documents.
reacted to merve's post with 👍 7 months ago
I got asked about PaliGemma's document understanding capabilities, so I built a Space that has all the PaliGemma fine-tuned doc models 📄📊📖
merve/paligemma-doc
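
Outside the Space, one of the fine-tuned doc checkpoints can also be run locally with transformers. A minimal sketch, assuming the google/paligemma-3b-ft-docvqa-448 checkpoint name and a local document image:

from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-ft-docvqa-448"   # assumed fine-tuned DocVQA checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("invoice.png")                # any document image
inputs = processor(text="What is the total amount due?", images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(output[0], skip_special_tokens=True))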
reacted to albertvillanova's post with 🚀 8 months ago
🚀 We recently released datasets 2.19.0! 📦

🔥 What's New:
- Polars integration 🐻‍❄️
- fsspec support for conversion to JSON, CSV, and Parquet
- Mode parameter for Image feature
- CLI function to convert script-datasets to Parquet
- Dataset.take and Dataset.skip

Plus, a bunch of general improvements & bug fixes!

Check out the release notes: https://github.com/huggingface/datasets/releases/tag/2.19.0

Upgrade now and power up your data workflows! 💥
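
A minimal sketch of the new take/skip and Polars pieces, assuming the API described in the release notes (to_polars requires polars to be installed):

from datasets import load_dataset

ds = load_dataset("rotten_tomatoes", split="train")   # any dataset works here
sample = ds.skip(100).take(10)                        # new: Dataset.skip / Dataset.take
df = sample.to_polars()                               # new: Polars integration
print(df.shape)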
reacted to Jaward's post with 👍 8 months ago
mlx_micrograd: an MLX port of Karpathy's micrograd, a tiny scalar-valued autograd engine with a small PyTorch-like neural network library on top.

https://github.com/Jaykef/mlx_micrograd
Installation
pip install mlx_micrograd

Example usage
Example showing a number of possible supported operations:
from mlx_micrograd.engine import Value

a = Value(-4.0)
b = Value(2.0)
c = a + b
d = a * b + b**3
c += c + 1
c += 1 + c + (-a)
d += d * 2 + (b + a).relu()
d += 3 * d + (b - a).relu()
e = c - d
f = e**2
g = f / 2.0
g += 10.0 / f
print(f'{g.data}') # prints array(24.7041, dtype=float32), the outcome of this forward pass
g.backward()
print(f'{a.grad}') # prints array(138.834, dtype=float32), i.e. the numerical value of dg/da
print(f'{b.grad}') # prints array(645.577, dtype=float32), i.e. the numerical value of dg/db
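
As a quick sanity check, the printed gradient for a can be compared against a finite-difference estimate, reusing only the Value ops from the example above:

from mlx_micrograd.engine import Value

def g_of(a_val, b_val):
    # same forward pass as above, packaged as a function of the inputs
    a, b = Value(a_val), Value(b_val)
    c = a + b
    d = a * b + b**3
    c += c + 1
    c += 1 + c + (-a)
    d += d * 2 + (b + a).relu()
    d += 3 * d + (b - a).relu()
    e = c - d
    f = e**2
    g = f / 2.0
    g += 10.0 / f
    return g.data

h = 1e-4
print((g_of(-4.0 + h, 2.0) - g_of(-4.0, 2.0)) / h)  # should be close to a.grad, ~138.8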

replied to werewolf5's post 8 months ago
reacted to nateraw's post with 🔥 8 months ago
reacted to vikhyatk's post with 🔥 8 months ago
Updated the vikhyatk/lnqa dataset to include images, so you no longer need to separately download them from OpenImages!
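
A minimal sketch of peeking at the updated dataset without a full download, assuming the standard datasets streaming API:

from datasets import load_dataset

lnqa = load_dataset("vikhyatk/lnqa", split="train", streaming=True)
row = next(iter(lnqa))
print(row.keys())  # the image now comes bundled with each example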
reacted to akhaliq's post with 👍 8 months ago
CatLIP

CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data (2404.15653)

Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings. However, pairwise similarity computation in contrastive loss between image and text pairs poses computational challenges. This paper presents a novel weakly supervised pre-training of vision models on web-scale image-text data. The proposed method reframes pre-training on image-text data as a classification task. Consequently, it eliminates the need for pairwise similarity computations in contrastive loss, achieving a remarkable 2.7x acceleration in training speed compared to contrastive learning on web-scale data. Through extensive experiments spanning diverse vision tasks, including detection and segmentation, we demonstrate that the proposed method maintains high representation quality.
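
A conceptual sketch of the reframing the abstract describes (not the paper's implementation): the pairwise NxN similarity matrix of a contrastive loss is replaced by a per-image multi-label classification loss over a fixed vocabulary derived from the captions.

import torch
import torch.nn.functional as F

batch, dim, vocab = 32, 512, 10_000                  # illustrative sizes
image_emb = torch.randn(batch, dim)                  # stand-in for a vision encoder's output

# Contrastive-style objective: needs text embeddings and an N x N similarity matrix.
text_emb = torch.randn(batch, dim)
sim = image_emb @ text_emb.T                         # pairwise similarities
contrastive_loss = F.cross_entropy(sim, torch.arange(batch))

# Classification-style objective: no pairwise term, just multi-label BCE per image
# against vocabulary entries mentioned in its caption (targets are made up here).
classifier = torch.nn.Linear(dim, vocab)
targets = torch.zeros(batch, vocab)
targets[:, :5] = 1.0                                 # pretend each caption hits 5 vocabulary labels
classification_loss = F.binary_cross_entropy_with_logits(classifier(image_emb), targets)
print(contrastive_loss.item(), classification_loss.item())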