Christopher Schröder

cschroeder

AI & ML interests

NLP, Active Learning, Text Representations, PyTorch

Recent Activity

posted an update 8 days ago
💡 Looking for support: Have you ever had to overcome a lack of labeled data to deal with an NLP task?
liked a Space 14 days ago
data-is-better-together/fineweb-c
liked a model 24 days ago
PleIAs/celadon
View all activity

Organizations

Webis Group's profile picture Webis Hugging Face Workshop's profile picture small-text's profile picture German LLM Tokenizers's profile picture Social Post Explorers's profile picture GERTuraX's profile picture Hugging Face Discord Community's profile picture ScaDS.AI German LLM's profile picture

cschroeder's activity

posted an update 8 days ago
💡 Looking for support: Have you ever had to overcome a lack of labeled data to deal with an NLP task?

Are you working on Natural Language Processing tasks and have you faced the challenge of a lack of labeled data before? We are currently conducting a survey to explore the strategies used to address this bottleneck, especially in the context of recent advancements, including but not limited to large language models.

The survey is non-commercial and conducted solely for academic research purposes. The results will contribute to an open-access publication that also benefits the community.

👉 With only 5–15 minutes of your time, you would greatly help us investigate which strategies the #NLP community uses to overcome a lack of labeled data.

โค๏ธHow you can help even more: If you know others working on supervised learning and NLP, please share this survey with themโ€”weโ€™d really appreciate it!

Survey: https://bildungsportal.sachsen.de/umfragen/limesurvey/index.php/538271
Estimated time required: 5–15 minutes
Deadline for participation: January 12, 2025

#NLP #ML
posted an update about 1 month ago
๐Ÿฃ New release: small-text v2.0.0.dev1

With small language models on the rise, the new version of small-text was long overdue! Despite the generative AI hype, many real-world tasks still rely on supervised learning, which in turn requires labeled data.

Highlights:
- Four new query strategies: Try even more combinations than before.
- Vector indices integration: HNSW and KNN indices are now available via a unified interface and can easily be used within your code.
- Simplified installation: We dropped the torchtext dependency and cleaned up a lot of interfaces.
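To make the active-learning idea behind such query strategies concrete, here is a minimal, generic sketch of an uncertainty-based (prediction-entropy) query step using scikit-learn. It illustrates the concept only; it does not use small-text's actual API, and all names below are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy pool: a few labeled seed documents and an unlabeled pool.
labeled_texts = ["great movie", "terrible film", "loved it", "awful plot"]
labels = [1, 0, 1, 0]
pool_texts = ["an absolute masterpiece", "boring and bad", "it was okay I guess"]

vectorizer = TfidfVectorizer()
X_labeled = vectorizer.fit_transform(labeled_texts)
X_pool = vectorizer.transform(pool_texts)

clf = LogisticRegression().fit(X_labeled, labels)

# Prediction entropy: query the pool documents the model is least sure about.
proba = clf.predict_proba(X_pool)
entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
query_indices = np.argsort(-entropy)[:2]  # top-2 most uncertain documents

print([pool_texts[i] for i in query_indices])
```

The queried documents would then be sent to a human annotator, and the model retrained on the enlarged labeled set.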

Github: https://github.com/webis-de/small-text

👂 Try it out for yourself! We are eager to hear your feedback.
🔧 Share your small-text applications and experiments in the newly added showcase section.
🌟 Support the project by leaving a star on the repo!

#activelearning #nlproc #machinelearning
posted an update about 1 month ago
#EMNLP2024 is happening soon! Unfortunately, I will not be on site, but I will present our poster virtually on Wednesday, Nov 13 (7:45 EST / 13:45 CET) in Virtual Poster Session 2.

In this work, we leverage self-training in an active learning loop in order to train small language models with even less data. Hope to see you there!
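For readers unfamiliar with self-training: the core loop can be sketched generically as pseudo-labeling with a confidence threshold. The following toy scikit-learn example illustrates the general technique, not the exact method from our paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian blobs; only 4 points are labeled.
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
labeled_idx = np.array([0, 1, 50, 51])
unlabeled_idx = np.setdiff1d(np.arange(100), labeled_idx)

X_lab, y_lab = X[labeled_idx], y[labeled_idx]
for _ in range(3):  # a few self-training rounds
    clf = LogisticRegression().fit(X_lab, y_lab)
    proba = clf.predict_proba(X[unlabeled_idx])
    confident = proba.max(axis=1) > 0.95  # keep only confident pseudo-labels
    if not confident.any():
        break
    pseudo = unlabeled_idx[confident]
    X_lab = np.vstack([X_lab, X[pseudo]])
    y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
    unlabeled_idx = unlabeled_idx[~confident]

accuracy = clf.score(X, y)
print(f"accuracy after self-training: {accuracy:.2f}")
```

Combined with active learning, the annotator labels the uncertain points while self-training harvests the confident ones, so fewer human labels are needed overall.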
reacted to tomaarsen's post with 🔥 3 months ago
I've just shipped the Sentence Transformers v3.1.1 patch release, fixing the hard negatives mining utility for some models. This utility is extremely useful to get more performance out of your embedding training data.

⛏ Hard negatives are texts that are rather similar to some anchor text (e.g. a query), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
mine_hard_negatives docs: https://sbert.net/docs/package_reference/util.html#sentence_transformers.util.mine_hard_negatives
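Conceptually, hard-negative mining embeds queries and candidates, then picks, for each query, the most similar candidates that are not the gold match. A toy NumPy sketch of that idea (not the actual mine_hard_negatives implementation; the vectors and labels are made up):

```python
import numpy as np

def normalize(m):
    """L2-normalize each row so dot products become cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

queries = normalize(np.array([[1.0, 0.1], [0.1, 1.0]]))
candidates = normalize(np.array([
    [1.0, 0.0],   # gold answer for query 0
    [0.9, 0.2],   # similar to query 0 but wrong: a hard negative
    [0.0, 1.0],   # gold answer for query 1
    [-1.0, 0.0],  # dissimilar to both: an easy negative
]))
gold = [0, 2]  # index of the correct candidate per query

sims = queries @ candidates.T  # cosine similarity matrix
hard_negatives = []
for qi, g in enumerate(gold):
    s = sims[qi].copy()
    s[g] = -np.inf  # exclude the gold answer
    hard_negatives.append(int(np.argmax(s)))  # most similar wrong candidate

print(hard_negatives)  # candidate 1 is the hardest negative here
```

Real implementations add refinements such as similarity margins relative to the gold answer, to avoid mining false negatives.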

🔓 Beyond that, this release removes the numpy<2 restriction from v3.1.0. This was previously required on Windows because not all third-party libraries had been updated to support numpy v2. With Sentence Transformers, you can now choose v1 or v2 of numpy.

Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.1

I'm looking forward to releasing v3.2; I have some exciting things planned 🚀
replied to do-me's post 3 months ago

I didn't know about text-splitter yet, thanks!

reacted to do-me's post with 👀 3 months ago
What are your favorite text chunkers/splitters?
Mine are:
- https://github.com/benbrandt/text-splitter (Rust/Python, battle-tested, Wasm version coming soon)
- https://github.com/umarbutler/semchunk (Python, really performant but some issues with huge docs)

I tried the huge Jina AI regex, but it failed for my (admittedly messy) documents, e.g. from EUR-LEX. Their free segmenter API is really cool but unfortunately times out on my huge docs (~100 pages): https://jina.ai/segmenter/

Also, I tried to write a Vanilla JS chunker with a simple, adjustable hierarchical logic (inspired from the above). I think it does a decent job for the few lines of code: https://do-me.github.io/js-text-chunker/
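The hierarchical logic is simple to sketch: split on the coarsest separator first, and only descend to finer separators when a piece is still too long. A toy Python version of that idea (not the actual JS implementation; the separator list is an assumption):

```python
def chunk(text, max_len, separators=("\n\n", "\n", ". ", " ")):
    """Recursively split text, trying coarser separators before finer ones."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # piece still fits into the current chunk
        else:
            if current:
                chunks.append(current)
            if len(piece) > max_len:
                # The piece alone is too long: descend to finer separators.
                chunks.extend(chunk(piece, max_len, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks

text = "First paragraph. It has two sentences.\n\nSecond paragraph is short."
parts = chunk(text, max_len=40)
print(parts)
```

Here the paragraph separator already yields chunks within the limit, so the sentence and word separators are never needed.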

Happy to hear your thoughts!
upvoted an article 3 months ago
AI Policy @🤗: Open ML Considerations in the EU AI Act

reacted to gaodrew's post with 🔥 3 months ago
We used the Hugging Face Trainer to fine-tune DeBERTa-v3-base for Personally Identifiable Information (PII) detection, achieving 99.44% overall accuracy (98.27% recall for PII detection).

Please try our model (Colab Quickstart available) and let us know what you think:
iiiorg/piiranha-v1-detect-personal-information
reacted to tomaarsen's post with 🔥 3 months ago
🚀 Sentence Transformers v3.1 is out! Featuring a hard negatives mining utility to get better models out of your data, a new strong loss function, training with streaming datasets, custom modules, bug fixes, small additions, and docs changes. Here are the details:

⛏ Hard Negatives Mining Utility: Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
📉 New loss function: This loss function works very well for symmetric tasks (e.g. clustering, classification, finding similar texts/paraphrases) and a bit less so for asymmetric tasks (e.g. question-answer retrieval).
💾 Streaming datasets: You can now train with datasets.IterableDataset, which doesn't require downloading the full dataset to disk before training. As simple as "streaming=True" in your "datasets.load_dataset" call.
🧩 Custom Modules: Model authors can now customize many more of the components that make up Sentence Transformer models, allowing for a lot more flexibility (e.g. multi-modal models, model-specific quirks, etc.)
✨ New arguments to several methods: encode_multi_process gets a progress bar, push_to_hub can now target different branches, and CrossEncoders can be downloaded to specific cache directories.
🐛 Bug fixes: Too many to name here, check out the release notes!
📝 Documentation: A particular focus on clarifying the batch samplers in the Package Reference this release.

Check out the full release notes here ⭐: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.0

I'm very excited to hear your feedback, and I'm looking forward to the future changes that I have planned, such as ONNX inference! I'm also open to suggestions for new features: feel free to send me your ideas.
posted an update 4 months ago
⚖️ AI Training is Copyright Infringement

This bold claim is not my opinion; it was made in a recent "report" by a group whose stance is recognizable in its name, which roughly translates to "Authors' Rights Initiative". According to the LinkedIn post below, the report was also presented before the EU Parliament.

I am not really interested in politics, but as an EU citizen I am of course somewhat interested in a reasonable and practical version of the EU AI Act. I'm not saying there shouldn't be rules around data and AI, but this report is clearly biased towards one side.

While I think the report itself does not deserve much attention, I am posting it in the hope that you find more examples where it does not address the issue adequately. Feel free to respond to my LinkedIn post (where the original authors will see it) or comment here.

[en] Executive summary: https://urheber.info/media/pages/diskurs/ai-training-is-copyright-infringement/3b900058e6-1725460935/executive-summary_engl_final_29-08-2024.pdf
[de] Full report: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4946214

LinkedIn: https://www.linkedin.com/posts/activity-7238912869268959232-6cFx