BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self

Overview

  • An embedding model built by BAAI, a Chinese AI research institute

  • M3 stands for the following three properties

    • Multi-Linguality: more than 100 languages

    • Multi-Functionality: provides three retrieval methods in a single model

    • Multi-Granularity: works well on both short and long inputs (up to 8192 tokens)

  • In the reported experiments it also performs well on Korean

  • The training code has not been released yet.

Introduction

  • Embedding models for IR (Information Retrieval) have been studied extensively.

  • However, they have the following limitations

    • 1) Most embedding models only work for English

    • 2) Training targets a single retrieval task only (in practice, several may need to be used together)

    • 3) There are almost no long-document retrievers

  • M3-Embedding is proposed to address these problems

  • Multi-Linguality

    • Supports more than 100 languages

    • Cross-lingual search also works, e.g. querying in English over documents written in Korean

  • Multi-Functionality

    • Retrieval methods that used to be trained separately and only later combined in a hybrid setup are all available from one model.

    • Each retrieval method is explained in detail in the Background section

    • Self-knowledge distillation is used to integrate the scores produced by the three retrieval functions

  • Multi-Granularity

    • The maximum input length is extended to 8192 tokens, and the batching strategy is optimized to make this feasible

    • Performance is good at both the sentence level and the paragraph level

Background

Dense Retrieval

  • Uses embeddings from a pre-trained encoder (BERT, RoBERTa) to compute similarity

  • The embedding (hidden state) of the [CLS] token is used for both the query and the passage

Sparse Retrieval

  • A method that focuses on the individual terms (tokens) themselves

  • Before deep learning, BM25 was the most widely used approach

  • With an encoder model, it can be done as follows.

    • Term weights are computed from the embedding (hidden state) of each token
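For reference, the BM25 baseline mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the paper: the function name `bm25_scores`, the toy documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for the example, and whitespace splitting stands in for a real tokenizer.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # only terms shared by query and document contribute
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / length_norm
        scores.append(score)
    return scores

# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "bge m3 supports dense sparse and multi vector retrieval".split(),
    "bm25 is a classic lexical ranking function".split(),
    "the weather is nice today".split(),
]
scores = bm25_scores("lexical ranking".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only documents that share a term with the query get a non-zero score, which is the behaviour the learned sparse weights are meant to keep while replacing the hand-crafted term statistics.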

Multi-Vec Retrieval

  • Can be broadly split into two branches

    • Unlike a dense vector, using all token embeddings together rather than only the [CLS] token

    • Transforming the passage or the query in various ways to create several vectors and using them together

    • Query: generate similar questions and use the average of the query embeddings

    • Passage: generate document summaries, short sentences, etc. and use the average of the passage embeddings, and so on

    • Example: HyDE

Method(M3-Embedding)

  • Given a query q, retrieve the most relevant document d from the corpus D

    • q and D may be in different languages

Data Curation

  • MultiLongDoc is a dataset the authors generated themselves, using GPT-3.5

    • Prompt: "You are a curious AI assistant, please generate one specific and valuable question based on the following text. The generated question should revolve around the core content of this text, and avoid using pronouns (e.g., "this"). Note that you should generate only one question, without including additional content:"

Hybrid Retrieval

  • The basic idea: combine the following three scores to obtain a better score

  • Dense

    • Uses the normalized hidden state of the [CLS] token

      • $e_q = norm(H_q[0])$

      • $e_p = norm(H_p[0])$

    • The similarity score is the inner product

      • $s_{dense} = \langle e_p, e_q \rangle$

  • Sparse (Lexical)

    • Uses a weight for each token

      • $w_{q_t} = \mathrm{ReLU}(W_{lex}^T H_q[i])$

      • $W_{lex}$ is a mapping matrix that converts a hidden state into a float (a scalar weight)

    • The similarity score is the joint importance of the co-existing terms, i.e. the terms that appear in both the query and the passage

    • $s_{lex} = \sum_{t \in q \cap p} (w_{q_t} * w_{p_t})$

  • Multi-vector

    • An extension of the dense vector that uses the entire output embedding

      • $E_q = norm(W_{mul}^T H_q)$

      • $W_{mul}$ is a learnable projection matrix

    • The similarity is computed as follows (using inner products)

      • $s_{mul} = \frac{1}{N} \sum_{i=1}^{N} \max_{j=1}^{M} E_q[i] \cdot E_p^T[j]$
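The three scores above can be sketched on toy data in plain Python. This is a minimal illustration under stated assumptions: the hidden states and lexical weights are hand-written numbers rather than real encoder outputs, the projection $W_{mul}$ is taken to be the identity, and the helper names (`dense_score`, `lexical_score`, `multi_vector_score`) are hypothetical and not part of any released API.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def dense_score(h_q, h_p):
    # h_q, h_p: per-token hidden states; index 0 plays the role of [CLS].
    return dot(l2_normalize(h_q[0]), l2_normalize(h_p[0]))

def lexical_score(w_q, w_p):
    # w_q, w_p: token -> non-negative weight (the ReLU output per token).
    # Only tokens present in both query and passage contribute.
    return sum(w_q[t] * w_p[t] for t in w_q.keys() & w_p.keys())

def multi_vector_score(e_q, e_p):
    # e_q: N normalized query token vectors, e_p: M normalized passage token vectors.
    # For each query token take the best-matching passage token, then average.
    return sum(max(dot(q, p) for p in e_p) for q in e_q) / len(e_q)

# Toy inputs (assumed values, 2-dimensional for readability).
h_q = [[1.0, 0.0], [0.0, 2.0]]
h_p = [[3.0, 4.0], [0.0, 1.0], [5.0, 0.0]]
w_q = {"bge": 0.5, "m3": 1.0}
w_p = {"m3": 0.4, "embedding": 0.9}

s_dense = dense_score(h_q, h_p)
s_lex = lexical_score(w_q, w_p)
# Assumption: W_mul is the identity, so E = norm(H) row by row.
s_mul = multi_vector_score([l2_normalize(v) for v in h_q],
                           [l2_normalize(v) for v in h_p])
s_hybrid = s_dense + s_lex + s_mul
print(s_dense, s_lex, s_mul, s_hybrid)
```

The lexical score only needs a dictionary lookup per shared token, while the multi-vector score compares every query token with every passage token, so it is the most expensive of the three.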

Self-Knowledge Distillation

  • The scores obtained from the different methods above are combined by simply summing them

    • $s_{inter} = s_{dense} + s_{lex} + s_{mul}$

  • The losses of the three methods are then also summed and used for training (the basic InfoNCE loss is used alongside)

    • $L' = L_{dense} + L_{lex} + L_{mul}$

  • Training proceeds in two main stages

    • Stage 1: pre-training on unsupervised data

    • Stage 2: training on supervised data, using the three losses above
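A rough sketch of how the summed score can act as a teacher for each individual score, in plain Python. Several things here are assumptions rather than details stated in these notes: the distillation term is written as a soft cross-entropy between softmax($s_{inter}$) and softmax of each method's scores, the temperature is 1, the two loss groups are reported separately instead of being weighted into one number, and all function names and score values are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def info_nce(scores, positive_index=0, temperature=1.0):
    # scores: one query against [positive, negative, negative, ...].
    probs = softmax([s / temperature for s in scores])
    return -math.log(probs[positive_index])

def distill_loss(teacher_scores, student_scores):
    # Soft cross-entropy: the teacher distribution supervises the student.
    p_teacher = softmax(teacher_scores)
    p_student = softmax(student_scores)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

# Assumed scores of one query against [positive, negative, negative].
s_dense = [0.8, 0.3, 0.1]
s_lex = [0.5, 0.4, 0.0]
s_mul = [0.9, 0.2, 0.3]
s_inter = [d + l + m for d, l, m in zip(s_dense, s_lex, s_mul)]

methods = (s_dense, s_lex, s_mul)
base_loss = sum(info_nce(s) for s in methods)                   # InfoNCE per method
distill_total = sum(distill_loss(s_inter, s) for s in methods)  # teacher = s_inter
print(base_loss, distill_total)
```

Because the teacher signal comes from the model's own three heads rather than from a separate model, no extra network is needed, which is what the "self" in self-knowledge distillation refers to.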

Efficient Batching

Result

Main Result

Ablation Study
