Multimodal Input
Multimodal input refers to the ability of a Large Language Model (LLM) to process and understand information presented in multiple formats, such as text, images, and potentially audio or video.
Support in ChatFrame
ChatFrame supports multimodal input primarily through the capabilities of the connected LLM providers.
- Provider Capability: If a configured model (e.g., GPT-4o, Gemini 2.5 Pro) supports image input, ChatFrame sends the image data along with your text prompt to that provider (see the sketch after this list).
- User Interface: The ChatFrame interface provides a mechanism (e.g., an attachment button or drag-and-drop) to include images in your prompt.
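For context, the request that ultimately reaches a provider pairs a text part with an image part in a single user message. The sketch below shows this in the OpenAI chat-completions format; it is a minimal illustration of the provider-side payload, not ChatFrame's internal code, and the function name and model choice are assumptions.

```typescript
// Minimal sketch: sending a text + image prompt to an OpenAI-compatible endpoint.
// How ChatFrame assembles this request internally is not shown here.
import { readFile } from "node:fs/promises";

async function askWithImage(prompt: string, imagePath: string): Promise<string> {
  // Images are typically attached as a base64 data URL alongside the text part.
  const imageBase64 = (await readFile(imagePath)).toString("base64");

  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o", // must be a model that supports image input
      messages: [
        {
          role: "user",
          content: [
            { type: "text", text: prompt },
            {
              type: "image_url",
              image_url: { url: `data:image/png;base64,${imageBase64}` },
            },
          ],
        },
      ],
    }),
  });

  const data = await response.json();
  return data.choices[0].message.content;
}
```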
Local Files and RAG
While the connected LLM may process images, the Local RAG feature in ChatFrame indexes only text-based files (PDF, plain text, code). This is a deliberate security and performance choice: restricting indexing and retrieval to text keeps the process fast and entirely local. A simple filter of this kind is sketched below.
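The following is an illustrative sketch only, not ChatFrame's actual implementation; the helper name and extension list are assumptions. It shows how an indexer might skip images and other binary media so that only text-based files enter the local RAG corpus.

```typescript
// Sketch (illustrative, not ChatFrame's code): only text-based files are
// considered for local RAG indexing; images and other binary media are skipped.
import { extname } from "node:path";

const INDEXABLE_EXTENSIONS = new Set([".pdf", ".txt", ".md", ".ts", ".py", ".js"]);

function isIndexable(filePath: string): boolean {
  // Keeping the corpus text-only keeps indexing fast and fully local.
  return INDEXABLE_EXTENSIONS.has(extname(filePath).toLowerCase());
}

// Example: isIndexable("notes.pdf") -> true, isIndexable("photo.png") -> false
```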
Future Development
As LLM capabilities evolve, ChatFrame is expected to continue supporting the latest multimodal features offered by its integrated providers.