Source: OpenVINO Chinese Community

Author: Yang Yicheng, AI Software Engineer, Intel

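Introduction

Retrieval-Augmented Generation (RAG) systems can reduce the memory footprint and improve the inference performance of LLM tasks by filtering key information out of a knowledge base. Thanks to mature tooling for text parsing, indexing, and retrieval, building a RAG pipeline for text content is now fairly routine. Building one for video content is much harder. Because video combines image, audio, and text elements, it requires more, and more complex, data processing. This article describes how to use OpenVINO and LlamaIndex to build a RAG pipeline for video understanding tasks.

A truly multimodal video-understanding RAG has to handle the different modalities in a video, such as speech and visual content. This example presents a multimodal RAG pipeline designed for video analysis. It uses a Whisper model to transcribe the speech in the video into text, a CLIP model to generate multimodal embedding vectors, and a vision-language model (VLM) to process the retrieved images and text together with the user's request. The figure below shows how the pipeline works.

Figure: How the video-understanding RAG pipeline works

Source code: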

https://github.com/openvinotoolkit/openvino_notebooks/tree/latest/notebooks/multimodal-rag

Environment setup

The example is written as a Jupyter Notebook, so a matching Python environment is needed. Install the base environment by following the link below, choosing the steps for your operating system.

https://github.com/openvinotoolkit/openvino_notebooks?tab=readme-ov-file#-getting-started

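Figure: Navigation page for installing the base environment

The example also depends on the OpenVINO integration components for LlamaIndex, which must be installed separately: the llama-index-embeddings-openvino library, which generates multimodal embeddings for images and text, and the llama-index-multi-modal-llms-openvino library, which runs vision-language inference.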
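Before running the notebook it can help to confirm that both integration packages are actually importable. The following is a minimal sketch, not part of the original notebook: the helper name `find_missing_modules` is hypothetical, and the module paths are simply the import paths used by the snippets later in this article.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the names from module_names that cannot be located in this environment."""
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised when a parent package of a dotted name is not installed
            spec = None
        if spec is None:
            missing.append(name)
    return missing


# Import paths used by the snippets in this article
required = [
    "llama_index.embeddings.huggingface_openvino",
    "llama_index.multi_modal_llms.openvino",
]

if __name__ == "__main__":
    missing = find_missing_modules(required)
    print("Missing modules:", missing or "none")
```

If a module is reported missing, installing the corresponding package named above into the notebook environment should resolve it.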

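Model download and conversion

With the environment in place, we download each model used in the pipeline in turn: the speech recognition (ASR) model, the multimodal embedding model CLIP, and the vision-language model (VLM).

Given the effect of precision on model accuracy, this example downloads an already converted INT8 ASR model directly from the OpenVINO Hugging Face repository.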

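from pathlib import Path

import huggingface_hub as hf_hub

# Pre-converted INT8 Distil-Whisper model from the OpenVINO organization on Hugging Face
asr_model_id = "OpenVINO/distil-whisper-large-v3-int8-ov"
asr_model_path = asr_model_id.split("/")[-1]

# Download only if the model is not already present locally
if not Path(asr_model_path).exists():
    hf_hub.snapshot_download(asr_model_id, local_dir=asr_model_path)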

The CLIP and VLM models, by contrast, are converted and quantized with the Optimum-intel command-line tool after the original models are downloaded.

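from pathlib import Path

# optimum_cli is a helper from the notebook repository's cmd_helper module
from cmd_helper import optimum_cli

clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
clip_model_path = clip_model_id.split("/")[-1]

# Convert the original checkpoint only if the converted model does not exist yet
if not Path(clip_model_path).exists():
    optimum_cli(clip_model_id, clip_model_path)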
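The snippet above covers only CLIP. For the VLM, a comparable export could be expressed directly as an `optimum-cli export openvino` command. The sketch below only builds and prints that command. It is an assumption about how the conversion might be invoked, not the notebook's own code: the helper `build_export_command` and the output directory name are hypothetical, and the Hugging Face model ID is assumed from the model name given later in this article.

```python
import shlex


def build_export_command(model_id, output_dir, weight_format="int4", trust_remote_code=True):
    """Build an optimum-cli argument list that exports a Hugging Face model to OpenVINO format."""
    command = [
        "optimum-cli", "export", "openvino",
        "--model", model_id,
        "--weight-format", weight_format,
    ]
    if trust_remote_code:
        # Needed for models whose code lives in the model repository
        command.append("--trust-remote-code")
    command.append(output_dir)
    return command


# Assumed model ID for Phi-3.5-vision-instruct
vlm_model_id = "microsoft/Phi-3.5-vision-instruct"
# Hypothetical output directory name, chosen for illustration only
vlm_output_dir = "Phi-3.5-vision-instruct-int4"

command = build_export_command(vlm_model_id, vlm_output_dir)
print(shlex.join(command))
# To actually run the export: subprocess.run(command, check=True)
```

Keeping the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.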

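Video data extraction and processing

Next, we use third-party tools to extract the audio and images from the video file, and use the ASR model to transcribe the audio into text for the later embedding step. For this step we chose a popular-science video about the Gaussian distribution as the example (https://www.youtube.com/watch?v=d_qvLDhkg00). The following snippet initializes the ASR model and runs speech recognition. The recognition result is saved locally as a .txt file.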

from optimum.intel import OVModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

# asr_device is the notebook's device-selection widget; its .value is the chosen device name
asr_model = OVModelForSpeechSeq2Seq.from_pretrained(asr_model_path, device=asr_device.value)
asr_processor = AutoProcessor.from_pretrained(asr_model_path)

pipe = pipeline(
    "automatic-speech-recognition",
    model=asr_model,
    tokenizer=asr_processor.tokenizer,
    feature_extractor=asr_processor.feature_extractor,
)

# en_raw_speech holds the audio extracted from the video earlier in the notebook
result = pipe(en_raw_speech, return_timestamps=True)
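The article does not show how the recognition result is written to the .txt file. With `return_timestamps=True`, the pipeline result is a dictionary with a `"text"` entry and a `"chunks"` list whose items carry a `"timestamp"` pair and a `"text"` string. A minimal sketch of saving it, where the helper names and the one-line-per-chunk layout are illustrative choices rather than the notebook's format:

```python
from pathlib import Path


def format_time(seconds):
    """Format a timestamp in seconds; the pipeline can report None for an open-ended chunk."""
    return "?" if seconds is None else f"{seconds:.2f}"


def save_transcript(result, output_path):
    """Write one line per recognized chunk, prefixed with its start and end time."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{format_time(start)} - {format_time(end)}] {chunk['text'].strip()}")
    if not lines:
        # Fall back to the full text when no timestamped chunks are present
        lines.append(result["text"].strip())
    Path(output_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Calling `save_transcript(result, "transcript.txt")` after the recognition step would then produce the text file (the file name here is hypothetical).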

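Creating the multimodal vector index

This is the most critical step in the whole RAG chain: the text and images obtained from the video file are converted into vectors and stored in a vector database. The quality of these vectors directly affects recall accuracy in the later retrieval step. We first initialize the CLIP model, which the OpenVINO and LlamaIndex integration library makes straightforward.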

from llama_index.embeddings.huggingface_openvino import OpenVINOClipEmbedding

# clip_device is the notebook's device-selection widget
clip_model = OpenVINOClipEmbedding(model_id_or_path=clip_model_path, device=clip_device.value)

We can then call the vector store components provided by LlamaIndex to build the index quickly and initialize the retriever.

from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import StorageContext, Settings
from llama_index.core.node_parser import SentenceSplitter

# Use the CLIP model as the default embedding model
Settings.embed_model = clip_model

# documents and storage_context are defined earlier in the notebook
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    image_embed_model=Settings.embed_model,
    transformations=[SentenceSplitter(chunk_size=300, chunk_overlap=30)],
)

# Return the top 2 text chunks and the top 5 images for each query
retriever_engine = index.as_retriever(similarity_top_k=2, image_similarity_top_k=5)
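The `chunk_size=300, chunk_overlap=30` arguments control how the transcript is cut into pieces before embedding. The toy sketch below illustrates only the overlap idea, using a plain sliding window over a token list. It is a simplification: the real `SentenceSplitter` also tries to respect sentence boundaries, and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(tokens, chunk_size=300, chunk_overlap=30):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Small values make the overlap easy to see
for chunk in sliding_chunks(list(range(10)), chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window repeats the last token of the previous one, so a sentence that straddles a boundary still appears intact in at least part of one chunk.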

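Multimodal vector retrieval

Traditional text RAG recalls the key text content from the vector database by text similarity. Multimodal RAG additionally searches the image vectors to return the key frames most relevant to the input question, which the VLM then interprets. Here the user's question is embedded, and the vector engine retrieves the text chunks and video frames most similar to it. LlamaIndex provides components that make these steps a matter of a few function calls.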

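from llama_index.core import SimpleDirectoryReader

query_str = "tell me more about gaussian function"

# retrieve is a helper defined in the notebook; it returns the retrieved image paths and text chunks
img, txt = retrieve(retriever_engine=retriever_engine, query_str=query_str)
# Load the retrieved key frames as image documents for the VLM
image_documents = SimpleDirectoryReader(input_dir=output_folder, input_files=img).load_data()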

After the code runs, we can see the retrieved text chunks and key frames.

Figure: Key frames and related text chunks returned by retrieval
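Conceptually, the retrieval step ranks stored vectors by their similarity to the query vector and keeps the best few. The sketch below shows that idea with cosine similarity on tiny hand-written vectors. It is an illustration only: the frame names and vectors are made up, and a real vector database uses its own index structures and configured distance metric rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored_vectors, k):
    """Return the k keys whose vectors are most similar to query_vector."""
    ranked = sorted(
        stored_vectors,
        key=lambda name: cosine_similarity(query_vector, stored_vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 2-D embeddings standing in for stored frame vectors
frames = {"frame_a": [1.0, 0.0], "frame_b": [0.0, 1.0], "frame_c": [1.0, 1.0]}
print(top_k([1.0, 0.1], frames, k=2))
```

The `similarity_top_k` and `image_similarity_top_k` arguments used when creating the retriever play the role of `k` here, once for text and once for images.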

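Answer generation

The last step of the multimodal RAG pipeline sends the user's question, together with the retrieved text and images, to the VLM to generate an answer. Here we use Microsoft's Phi-3.5-vision-instruct multimodal model with the multimodal task component from the OpenVINO and LlamaIndex integration to interpret the images and text. Note that retrieval usually returns several key frames, so the vision model must support multi-image input. The following code initializes the VLM.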

from llama_index.multi_modal_llms.openvino import OpenVINOMultiModal

# vlm_int4_model_path, vlm_device, messages_to_prompt and processor are defined earlier in the notebook
vlm = OpenVINOMultiModal(
    model_id_or_path=vlm_int4_model_path,
    device=vlm_device.value,
    messages_to_prompt=messages_to_prompt,
    trust_remote_code=True,
    generate_kwargs={"do_sample": False, "eos_token_id": processor.tokenizer.eos_token_id},
)
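The `messages_to_prompt` callback passed above is not shown in this article. One thing such a callback has to handle for multi-image input is telling the model where each image belongs. Phi-3.5-vision's published usage examples reference images through numbered placeholders such as `<|image_1|>`. Assuming that format, a minimal sketch of building the user message might look like this. The helper names are hypothetical, and the notebook's actual callback may differ, for example by also applying the tokenizer's chat template.

```python
def build_image_placeholders(num_images):
    """Return one numbered placeholder per image, each on its own line."""
    return "".join(f"<|image_{i}|>\n" for i in range(1, num_images + 1))


def build_user_content(question, num_images):
    """Prefix the question with the placeholders for all retrieved images."""
    return build_image_placeholders(num_images) + question


# Five placeholders, matching image_similarity_top_k=5 in the retriever above
print(build_user_content("tell me more about gaussian function", num_images=5))
```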

Once the VLM object is initialized, we pass the context and images to it to generate the final answer. The example also includes an interactive Gradio demo for reference.

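# qa_tmpl_str is the prompt template and context_str the retrieved text, both prepared in the notebook
response = vlm.stream_complete(
    prompt=qa_tmpl_str.format(context_str=context_str, query_str=query_str),
    image_documents=image_documents,
)
# Print the answer incrementally as it streams in
for r in response:
    print(r.delta, end="")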

The output is as follows:

“A Gaussian function, also known as a normal distribution, is a type of probability distribution that is symmetric and bell-shaped. It is characterized by its mean and standard deviation, which determine the center and spread of the distribution, respectively. The Gaussian function is widely used in statistics and probability theory due to its unique properties and applications in various fields such as physics, engineering, and finance. The function is defined by the equation e to the negative x squared, where x represents the input variable. The graph of a Gaussian function is a smooth curve that approaches the x-axis as it moves away from the center, creating a bell-like shape. The function is also known for its property of being able to describe the distribution of random variables, making it a fundamental concept in probability theory and statistics.”
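The generation call uses `qa_tmpl_str` and `context_str`, which this article does not define. A minimal sketch of how such a prompt might be assembled from the retrieved text chunks is shown below. The template wording and the helper `build_prompt` are illustrative assumptions, not the notebook's actual prompt.

```python
# Hypothetical template; only the {context_str} and {query_str} fields come from the article
QA_TEMPLATE = (
    "Answer the query using only the provided images and context.\n"
    "Context: {context_str}\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_prompt(text_chunks, query_str, template=QA_TEMPLATE):
    """Join the retrieved text chunks into one context string and fill in the template."""
    context_str = "\n".join(chunk.strip() for chunk in text_chunks)
    return template.format(context_str=context_str, query_str=query_str)


print(build_prompt(["The Gaussian is bell-shaped. ", " It is symmetric."], "tell me more about gaussian function"))
```

Because the retrieved text is passed as a format argument rather than pasted into the template, braces inside a transcript chunk do not interfere with the formatting.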

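Summary

In video understanding tasks, feeding every video frame to the VLM for analysis puts a heavy burden on VLM performance and resource usage. With multimodal RAG, the key frames are retrieved first, which reduces the amount of input the VLM has to process and improves the efficiency and accuracy of the whole system. The components from the OpenVINO and LlamaIndex integration provide a complete solution while running every model in the pipeline smoothly on a local PC.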

Original title: Developer Practice | How to Use OpenVINO™ to Build a Multimodal RAG Application Locally

Article source: WeChat official account Intel IoT (英特爾物聯網). Please credit the source when reposting.
