Chroma Indexing and RAG Examples
Last Updated: September 19, 2024
Install dependencies
# Install the Chroma integration, Haystack will come as a dependency
!pip install -U chroma-haystack "huggingface_hub>=0.22.0"
Indexing Pipeline: preprocess, split and index documents
In this section, we will index documents into a Chroma DB collection by building a Haystack indexing pipeline. Here, we are indexing documents from the 
VIM User Manual into the Haystack ChromaDocumentStore.
We have the .txt files for these pages in the examples folder for the ChromaDocumentStore, so we are using the 
TextFileToDocument and 
DocumentWriter components to build this indexing pipeline.
# Fetch the data files from the GitHub repo
!curl -sL https://github.com/deepset-ai/haystack-core-integrations/tarball/main -o main.tar
!mkdir main
!tar xf main.tar -C main --strip-components 1
!mv main/integrations/chroma/example/data .
import os
from pathlib import Path
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.chroma import ChromaDocumentStore
file_paths = ["data" / Path(name) for name in os.listdir("data")]
# Chroma is used in-memory so we use the same instance in the two pipelines below
document_store = ChromaDocumentStore()
indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})
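A side note on the file_paths construction above: the expression "data" / Path(name) works even though the left operand is a plain string, because pathlib's Path implements the reflected division operator. A stdlib-only illustration (the file names below are stand-ins, not the actual manual pages):

```python
# Pure-stdlib illustration of the file_paths construction above:
# a str / Path expression works because Path implements __rtruediv__.
from pathlib import Path

names = ["usr_01.txt", "usr_02.txt"]  # stand-ins for os.listdir("data")
file_paths = ["data" / Path(name) for name in names]
print(file_paths[0])  # data/usr_01.txt (data\usr_01.txt on Windows)
```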
Query Pipeline: build retrieval-augmented generation (RAG) pipelines
Once we have documents in the ChromaDocumentStore, we can use the accompanying Chroma retrievers to build a query pipeline. The query pipeline below is a simple retrieval-augmented generation (RAG) pipeline that uses Chroma’s 
query API.
You can change the indexing and query pipelines here to use embedding search by pairing one of the 
Haystack Embedders with the ChromaEmbeddingRetriever.
In this example we are using:
- The HuggingFaceAPIGenerator with zephyr-7b-beta. (You will need a Hugging Face token to use this model.) You can replace this with any of the other Generators.
- The PromptBuilder, which holds the prompt template. You can adjust this to a prompt of your choice.
- The ChromaQueryTextRetriever, which expects a list of queries and retrieves the top_k most relevant documents from your Chroma collection.
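Under the hood, PromptBuilder renders a Jinja2 template like the one defined below, substituting the retrieved documents and the query. As an illustration only (jinja2 ships as a Haystack dependency; the document content here is made up):

```python
# Illustration: PromptBuilder fills a Jinja2 template roughly like this.
from jinja2 import Template

template = Template(
    "Context:\n"
    "{% for doc in documents %}  {{ doc.content }}\n{% endfor %}"
    "query: {{ query }}\nAnswer:"
)

class Doc:  # minimal stand-in for haystack.Document
    def __init__(self, content):
        self.content = content

rendered = template.render(
    documents=[Doc("Write a help file for your plugin.")],
    query="Should I write documentation for my plugin?",
)
print(rendered)
```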
import os
from getpass import getpass
os.environ["HF_API_TOKEN"] = getpass("Enter Hugging Face API key:")
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.builders import PromptBuilder
prompt = """
Answer the query based on the provided context.
If the context does not contain the answer, say 'Answer not found'.
Context:
{% for doc in documents %}
  {{ doc.content }}
{% endfor %}
query: {{query}}
Answer:
"""
prompt_builder = PromptBuilder(template=prompt)
llm = HuggingFaceAPIGenerator(api_type="serverless_inference_api",
                              api_params={"model": "HuggingFaceH4/zephyr-7b-beta"})
retriever = ChromaQueryTextRetriever(document_store)
querying = Pipeline()
querying.add_component("retriever", retriever)
querying.add_component("prompt_builder", prompt_builder)
querying.add_component("llm", llm)
querying.connect("retriever.documents", "prompt_builder.documents")
querying.connect("prompt_builder", "llm")
query = "Should I write documentation for my plugin?"
results = querying.run({"retriever": {"query": query, "top_k": 3},
                        "prompt_builder": {"query": query},
                        "llm":{"generation_kwargs": {"max_new_tokens": 350}}})
print(results["llm"]["replies"][0])