Models efficient enough to work with the firehose API
There are some nice starter packs for creating custom Bluesky feeds, e.g. https://github.com/MarshalX/bluesky-feed-generator.
One exciting idea is to use an ML classifier to curate this feed. It would be very cool if someone could easily fine-tune a classifier to find things they like and then deploy this model to create a custom feed for them.
One challenge is finding models that can run quickly enough (and cheaply enough) to keep up with the volume of data coming from the Bluesky API. In practice, this likely means small models that can run efficiently on a modest CPU.
This thread is meant to collect ideas for what these models could look like, and possibly some benchmarks of running them on actual firehose data!
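As a starting point for such benchmarks, here is a minimal sketch of a throughput probe that just counts firehose events per second, assuming MarshalX's `atproto` Python SDK and its firehose client interface (the same library the feed-generator template above builds on); the numbers it prints give a rough budget that any classifier would need to keep up with.

```python
# Rough throughput probe: count firehose commit events per second,
# to estimate the rate a classifier would need to keep up with.
# Assumes the atproto Python SDK (pip install atproto); the client and
# parser names follow its documented firehose example.
import time

from atproto import FirehoseSubscribeReposClient, parse_subscribe_repos_message

client = FirehoseSubscribeReposClient()
counts = {"events": 0, "start": time.time()}


def on_message(message) -> None:
    parse_subscribe_repos_message(message)  # decode the event envelope
    counts["events"] += 1
    if counts["events"] % 1000 == 0:
        elapsed = time.time() - counts["start"]
        print(f"{counts['events'] / elapsed:.1f} events/sec")


client.start(on_message)
```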
I run finetuned classifiers on a small CPU/RAM budget. My strong recommendation: for classifiers, nothing will beat a finetuned DeBERTa on accuracy per parameter (unless you're dealing with long context).
If the goal is one-click AutoML-style classifier training, it also helps that DeBERTa finetuning is not particularly sensitive to hyperparameters.
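For reference, a minimal fine-tuning sketch with the `transformers` Trainer, assuming the `microsoft/deberta-v3-xsmall` checkpoint and a hypothetical CSV of `text,label` pairs; the hyperparameters are left near the defaults, which is usually fine for DeBERTa.

```python
# Minimal DeBERTa fine-tuning sketch with the Hugging Face Trainer.
# Assumes CSVs with "text" and "label" columns; the model size and
# hyperparameters are illustrative, not tuned.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "microsoft/deberta-v3-xsmall"  # small enough for CPU inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset(
    "csv", data_files={"train": "posts_train.csv", "test": "posts_test.csv"}
)


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)


dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="deberta-feed-clf",
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables default padding collator
)
trainer.train()
```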
You can also go old-school with embeddings+logistic regression.
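A sketch of that baseline, assuming `sentence-transformers` and scikit-learn; the MiniLM model name and the toy training data are just illustrative placeholders, but this whole pipeline runs comfortably on CPU.

```python
# Old-school baseline: sentence embeddings + logistic regression.
# The encoder name is an illustrative lightweight choice; the labeled
# example posts are placeholders.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")

train_texts = ["a post I liked ...", "a post I did not like ..."]  # placeholder data
train_labels = [1, 0]

clf = LogisticRegression(max_iter=1000)
clf.fit(encoder.encode(train_texts), train_labels)

# Scoring new firehose posts is just encode + predict_proba.
new_posts = ["some incoming post text"]
scores = clf.predict_proba(encoder.encode(new_posts))[:, 1]
print(scores)
```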
@koaning had a really nice related project that classifies new arXiv papers to find ones about datasets, benchmarks, and LLMs. It's fast enough to run every day for free via GitHub Actions:
https://github.com/koaning/arxiv-frontpage
Another idea would be to first filter the firehose with a regex on SkyFeed for high recall, and then run a custom classifier on the posts in that feed (see the sketch below). It should be faster than reading events directly from the firehose, as SkyFeed handles the indexing and processing part.
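The cascade itself is easy to sketch locally: a cheap regex pass for recall, then the classifier only on the matches. The pattern and the `classify` function below are hypothetical stand-ins for a SkyFeed-style keyword feed plus whatever custom model ends up being used.

```python
# Two-stage cascade sketch: a cheap regex prefilter for recall, then a
# (placeholder) classifier only on the posts that match. Pattern and
# classify() are illustrative stand-ins.
import re
from typing import Iterable, List

PREFILTER = re.compile(r"\b(machine learning|ml|dataset|benchmark)\b", re.IGNORECASE)


def classify(text: str) -> float:
    """Placeholder for the expensive model; returns a relevance score."""
    return 0.5


def curate(posts: Iterable[str], threshold: float = 0.8) -> List[str]:
    # Only the small fraction of posts passing the regex ever hits the model.
    return [p for p in posts if PREFILTER.search(p) and classify(p) >= threshold]
```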
So what exactly are these models supposed to do? Are we talking topic classifiers? Quality filter / suggest-to-moderate models? RecSys models?
From the discussion above, it looks like we're talking about classifiers - but to classify what? What's the label set?