Chroma Indexing and RAG Examples
Last Updated: September 19, 2024
Install dependencies
# Install the Chroma integration, Haystack will come as a dependency
!pip install -U chroma-haystack "huggingface_hub>=0.22.0"
Indexing Pipeline: preprocess, split and index documents
In this section, we will index documents into a Chroma DB collection by building a Haystack indexing pipeline. Here, we are indexing documents from the 
VIM User Manual into the Haystack ChromaDocumentStore.
We have the .txt files for these pages in the examples folder for the ChromaDocumentStore, so we are using the 
TextFileToDocument and 
DocumentWriter components to build this indexing pipeline.
# Fetch the data files from the GitHub repo
!curl -sL https://github.com/deepset-ai/haystack-core-integrations/tarball/main -o main.tar
!mkdir main
!tar xf main.tar -C main --strip-components 1
!mv main/integrations/chroma/example/data .
import os
from pathlib import Path
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.chroma import ChromaDocumentStore
file_paths = ["data" / Path(name) for name in os.listdir("data")]
# Chroma is used in-memory so we use the same instance in the two pipelines below
document_store = ChromaDocumentStore()
indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})
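A side note on the file_paths construction above: the expression "data" / Path(name) works even though the left operand is a plain string, because pathlib's Path implements the reflected division operator. A stdlib-only illustration (the file names below are stand-ins, not the actual manual pages):

```python
# Pure-stdlib illustration of the file_paths construction above:
# a str / Path expression works because Path implements __rtruediv__.
from pathlib import Path

names = ["usr_01.txt", "usr_02.txt"]  # stand-ins for os.listdir("data")
file_paths = ["data" / Path(name) for name in names]
print(file_paths[0])  # data/usr_01.txt (data\usr_01.txt on Windows)
```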
Query Pipeline: build retrieval-augmented generation (RAG) pipelines
Once we have documents in the ChromaDocumentStore, we can use the accompanying Chroma retrievers to build a query pipeline. The query pipeline below is a simple retrieval-augmented generation (RAG) pipeline that uses Chroma’s 
query API.
You can change the indexing and query pipelines here to use embedding search by pairing one of the 
Haystack Embedders with the ChromaEmbeddingRetriever.
In this example we are using:
- The HuggingFaceAPIGenerator with zephyr-7b-beta. (You will need a Hugging Face token to use this model.) You can replace this with any of the other Generators.
- The PromptBuilder, which holds the prompt template. You can adjust this to a prompt of your choice.
- The ChromaQueryTextRetriever, which expects a list of queries and retrieves the top_k most relevant documents from your Chroma collection.
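Under the hood, PromptBuilder renders a Jinja2 template like the one defined below, substituting the retrieved documents and the query. As an illustration only (jinja2 ships as a Haystack dependency; the document content here is made up):

```python
# Illustration: PromptBuilder fills a Jinja2 template roughly like this.
from jinja2 import Template

template = Template(
    "Context:\n"
    "{% for doc in documents %}  {{ doc.content }}\n{% endfor %}"
    "query: {{ query }}\nAnswer:"
)

class Doc:  # minimal stand-in for haystack.Document
    def __init__(self, content):
        self.content = content

rendered = template.render(
    documents=[Doc("Write a help file for your plugin.")],
    query="Should I write documentation for my plugin?",
)
print(rendered)
```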
import os
from getpass import getpass
os.environ["HF_API_TOKEN"] = getpass("Enter Hugging Face API key:")
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.builders import PromptBuilder
prompt = """
Answer the query based on the provided context.
If the context does not contain the answer, say 'Answer not found'.
Context:
{% for doc in documents %}
  {{ doc.content }}
{% endfor %}
query: {{query}}
Answer:
"""
prompt_builder = PromptBuilder(template=prompt)
llm = HuggingFaceAPIGenerator(api_type="serverless_inference_api",
                              api_params={"model": "HuggingFaceH4/zephyr-7b-beta"})
retriever = ChromaQueryTextRetriever(document_store)
querying = Pipeline()
querying.add_component("retriever", retriever)
querying.add_component("prompt_builder", prompt_builder)
querying.add_component("llm", llm)
querying.connect("retriever.documents", "prompt_builder.documents")
querying.connect("prompt_builder", "llm")
query = "Should I write documentation for my plugin?"
results = querying.run({"retriever": {"query": query, "top_k": 3},
                        "prompt_builder": {"query": query},
                        "llm":{"generation_kwargs": {"max_new_tokens": 350}}})
print(results["llm"]["replies"][0])