---
language:
- ko
- en
- zh
license: mit
pipeline_tag: feature-extraction
tags:
- transformers
- sentence-transformers
- text-embeddings-inference
---

# upskyy/ko-reranker

**ko-reranker** is a model obtained by fine-tuning [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) on [Korean data](https://huggingface.co/datasets/upskyy/ko-wiki-reranking).

## Usage

## Using FlagEmbedding

```
pip install -U FlagEmbedding
```

Get relevance scores (higher scores indicate more relevance):

```python
from FlagEmbedding import FlagReranker

# Setting use_fp16 to True speeds up computation with a slight performance degradation
reranker = FlagReranker('upskyy/ko-reranker', use_fp16=True)

# Score a single (query, passage) pair
score = reranker.compute_score(['query', 'passage'])
print(score)  # -1.861328125

# You can map the scores into 0-1 by setting "normalize=True", which applies a sigmoid function to the score
score = reranker.compute_score(['query', 'passage'], normalize=True)
print(score)  # 0.13454832326359276

# Score several (query, passage) pairs in one call
pairs = [
    ['what is panda?', 'hi'],
    ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.'],
]
scores = reranker.compute_score(pairs)
print(scores)  # [-7.37109375, 8.5390625]

# The same normalization works for a batch of pairs
scores = reranker.compute_score(pairs, normalize=True)
print(scores)  # [0.0006287840192903181, 0.9998043646624727]
```

## Using Sentence-Transformers

```
pip install -U sentence-transformers
```

Compute similarity scores between two sets of sentences (higher scores indicate greater similarity):

```python
from sentence_transformers import SentenceTransformer

sentences_1 = ["경제 전문가가 금리 인하에 대한 예측을 하고 있다.", "주식 시장에서 한 투자자가 주식을 매수한다."]
sentences_2 = ["한 투자자가 비트코인을 매수한다.", "금융 거래소에서 새로운 디지털 자산이 상장된다."]

model = SentenceTransformer('upskyy/ko-reranker')

# With normalized embeddings, the dot product below equals cosine similarity
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)

# similarity[i][j] compares sentences_1[i] with sentences_2[j]
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```

## Using Hugging Face Transformers

Get relevance scores (higher scores indicate more relevance):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('upskyy/ko-reranker')
model = AutoModelForSequenceClassification.from_pretrained('upskyy/ko-reranker')
model.eval()  # disable dropout for inference

pairs = [
    ['what is panda?', 'hi'],
    ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.'],
]

with torch.no_grad():
    # Tokenize each (query, passage) pair as a single sequence, truncated to 512 tokens
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    # One logit per pair, flattened into a 1-D tensor of raw relevance scores
    scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
    print(scores)
```

## Citation

```bibtex
@misc{bge_embedding,
      title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
      author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
      year={2023},
      eprint={2309.07597},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

## License

FlagEmbedding is licensed under the MIT License. The released models can be used for commercial purposes free of charge.
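
## Example: ranking passages by normalized score

The FlagEmbedding usage notes that `normalize=True` applies a sigmoid function to the raw score. The sketch below reproduces that mapping in plain Python and then sorts passages by the result, which is the usual last step of a reranking pipeline. The helpers `sigmoid` and `rank_passages` are hypothetical names introduced here for illustration and are not part of FlagEmbedding; the raw scores are copied from the FlagEmbedding example rather than computed by the model.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a raw score into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def rank_passages(passages, scores, top_k=None):
    """Pair each passage with its normalized score and sort best-first.

    Hypothetical helper: `scores` are raw reranker scores, one per passage.
    """
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    ranked = sorted(
        ((passage, sigmoid(score)) for passage, score in zip(passages, scores)),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked if top_k is None else ranked[:top_k]


# Passages and raw scores taken from the FlagEmbedding example for 'what is panda?'
passages = [
    "hi",
    "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear "
    "or simply panda, is a bear species endemic to China.",
]
raw_scores = [-7.37109375, 8.5390625]

ranked = rank_passages(passages, raw_scores)
for passage, score in ranked:
    print(f"{score:.6f}  {passage[:40]}")
```

Because the sigmoid is monotonic, sorting by normalized scores gives the same order as sorting by raw scores; normalization mainly helps when a fixed threshold in the 0-1 range is needed.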