Haystack 2.8.0

Release Notes

⬆️ Upgrade Notes

Removed the deprecated is_greedy argument from the @component decorator. Change the Variadic input of your Component to GreedyVariadic instead.

🚀 New Features

We’ve added a new DALLEImageGenerator component, bringing image generation with OpenAI’s DALL-E to Haystack.

Easy to Use: Just a few lines of code to get started:

from haystack.components.generators import DALLEImageGenerator 

image_generator = DALLEImageGenerator() 
response = image_generator.run("Show me a picture of a black cat.") 
print(response)

Added warning logs to PDFMinerToDocument and PyPDFToDocument to indicate when a processed PDF file has no content. This can happen if the PDF file is a scanned image. Also added an explicit check and warning message to the DocumentSplitter that tells the user that empty Documents are skipped. This behavior was already occurring, but the logs now make it clearer.
We have added a new MetaFieldGroupingRanker component that reorders documents by grouping them based on metadata keys. This can be useful for pre-processing Documents before feeding them to an LLM.
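As a rough illustration of what grouping by a metadata key means, the sketch below reorders plain dictionaries so that entries sharing a key value end up next to each other. It is not the component's implementation: group_by_meta is a hypothetical helper, documents are modeled as dicts, and placing documents without the key at the end is an assumption of this sketch.

```python
def group_by_meta(docs, key):
    """Reorder docs so that those sharing meta[key] are adjacent.

    Groups appear in order of first occurrence. Docs lacking the key
    are appended at the end (an assumption of this sketch).
    """
    groups = {}      # dicts preserve insertion order in Python 3.7+
    ungrouped = []
    for doc in docs:
        value = doc["meta"].get(key)
        if value is None:
            ungrouped.append(doc)
        else:
            groups.setdefault(value, []).append(doc)
    ordered = [doc for group in groups.values() for doc in group]
    return ordered + ungrouped


docs = [
    {"content": "a", "meta": {"chapter": 1}},
    {"content": "b", "meta": {"chapter": 2}},
    {"content": "c", "meta": {"chapter": 1}},
    {"content": "d", "meta": {}},
]
reordered = group_by_meta(docs, "chapter")
print([d["content"] for d in reordered])
```

Keeping related chunks adjacent in this way can make the context passed to an LLM easier to follow than an order based on retrieval score alone.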
Added a new store_full_path parameter to the __init__ methods of the following converters: JSONConverter, CSVToDocument, DOCXToDocument, HTMLToDocument, MarkdownToDocument, PDFMinerToDocument, PPTXToDocument, TikaDocumentConverter, PyPDFToDocument, AzureOCRDocumentConverter and TextFileToDocument. The default value is True, which stores the full file path in the metadata of the output documents. When set to False, only the file name is stored.
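The effect of the flag can be pictured with a few lines of standard-library Python. This is only a sketch of the described behavior, not the converters' code: file_path_meta is a hypothetical helper, and the file_path metadata key is an assumption made for illustration.

```python
from pathlib import PurePosixPath


def file_path_meta(source, store_full_path=True):
    """Build the path metadata the way the note above describes it."""
    path = PurePosixPath(source)
    # Full path when the flag is on, bare file name otherwise.
    return {"file_path": str(path) if store_full_path else path.name}


print(file_path_meta("/home/user/docs/report.pdf"))
print(file_path_meta("/home/user/docs/report.pdf", store_full_path=False))
```

Storing only the file name avoids leaking directory layouts or user names into document metadata.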
When making function calls via OpenAPI, allow both switching SSL verification off and specifying a certificate authority to use for it.
Add TTFT (Time-to-First-Token) support for OpenAI generators. This captures the time taken to generate the first token from the model and can be used to analyze the latency of the application.
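To make the metric concrete, here is a minimal sketch of how time-to-first-token can be measured around any streamed response. It does not call OpenAI or Haystack: fake_token_stream is a hypothetical stand-in for a streaming model, and consume_with_ttft is an illustrative helper rather than the generators' actual code.

```python
import time


def fake_token_stream(tokens, delay=0.01):
    """Hypothetical stand-in for a streaming model response."""
    for token in tokens:
        time.sleep(delay)  # simulate generation latency per chunk
        yield token


def consume_with_ttft(stream):
    """Collect a stream and record the time until the first chunk arrives."""
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        chunks.append(chunk)
    total = time.perf_counter() - start
    return "".join(chunks), ttft, total


text, ttft, total = consume_with_ttft(fake_token_stream(["Hel", "lo", "!"]))
print(f"text={text!r} ttft={ttft:.3f}s total={total:.3f}s")
```

Comparing the first-token time with the total time helps separate queueing and prompt-processing latency from generation speed.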
Added a new option to the required_variables parameter of the PromptBuilder and ChatPromptBuilder. By passing required_variables="*" you can automatically set all variables in the prompt to be required.
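The semantics of the "*" option can be sketched without the library. The example below assumes simple {{ name }} placeholders and finds them with a regular expression; render is a hypothetical helper and not how the builders are implemented.

```python
import re

# Matches simple placeholders such as {{ name }} (a simplification).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, required_variables=None, **values):
    """Fill placeholders, enforcing required variables.

    required_variables="*" marks every placeholder in the template as required.
    """
    found = set(PLACEHOLDER.findall(template))
    if required_variables == "*":
        required = found
    else:
        required = set(required_variables or [])
    missing = sorted(required - set(values))
    if missing:
        raise ValueError(f"Missing required variables: {missing}")
    # Optional variables that were not provided render as empty strings.
    return PLACEHOLDER.sub(lambda m: str(values.get(m.group(1), "")), template)


template = "Answer {{ query }} using {{ context }}"
print(render(template, required_variables="*", query="Q", context="C"))
try:
    render(template, required_variables="*", query="Q")
except ValueError as err:
    print(err)
```

Making every variable required turns a silently incomplete prompt into an immediate error, which is usually easier to debug.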

⚡️ Enhancement Notes

Across the Haystack codebase, we have replaced the use of the ChatMessage data class constructor with specific class methods (ChatMessage.from_user, ChatMessage.from_assistant, etc.).
Added the Maximal Marginal Relevance (MMR) strategy to the SentenceTransformersDiversityRanker. An MMR score is calculated for each document based on its relevance to the query and its diversity from already selected documents.
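The general MMR idea can be sketched in plain Python: at each step, pick the document that maximizes a weighted relevance to the query minus a weighted similarity to what has already been selected. This is a textbook-style illustration on toy vectors, not the ranker's implementation; mmr_order and its trade_off parameter are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mmr_order(query, docs, trade_off=0.5):
    """Return document indices in MMR selection order.

    trade_off=1.0 ranks purely by relevance; lower values favor diversity.
    """
    remaining = list(range(len(docs)))
    selected = []
    while remaining:
        def score(i):
            relevance = cosine(query, docs[i])
            # Similarity to the closest already-selected document.
            redundancy = max(
                (cosine(docs[i], docs[j]) for j in selected), default=0.0
            )
            return trade_off * relevance - (1 - trade_off) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(mmr_order(query, docs, trade_off=1.0))  # [0, 1, 2]
print(mmr_order(query, docs, trade_off=0.3))  # [0, 2, 1]
```

With the lower trade-off, the near-duplicate second vector is pushed behind the dissimilar third one, which is the behavior MMR is designed to produce.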
Introduces optional parameters in the ConditionalRouter component, enabling default/fallback routing behavior when certain inputs are not provided at runtime. This enhancement allows for more flexible pipeline configurations with graceful handling of missing parameters.
Added split by line to DocumentSplitter, which will split the document at \n.
Changed OpenAIDocumentEmbedder to keep running if embedding a batch fails. Now, if OpenAI returns an error, we log that error and keep processing the following batches.
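The log-and-continue pattern described here looks roughly like the following. This is a generic sketch, not the embedder's code: embed_in_batches and flaky_embed are hypothetical, and the simulated error stands in for an API failure.

```python
import logging

logger = logging.getLogger(__name__)


def embed_in_batches(texts, embed_batch, batch_size=2):
    """Embed texts batch by batch, skipping batches that raise."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        try:
            embeddings.extend(embed_batch(batch))
        except Exception as exc:
            # Log the failure and move on to the next batch.
            logger.warning("Failed embedding batch starting at %d: %s", start, exc)
    return embeddings


def flaky_embed(batch):
    """Hypothetical embedding call that fails for one specific input."""
    if "bad" in batch:
        raise RuntimeError("simulated API error")
    return [[float(len(text))] for text in batch]


result = embed_in_batches(["a", "bb", "bad", "cccc", "ddddd"], flaky_embed)
print(result)  # [[1.0], [2.0], [5.0]]
```

Note that in this sketch a failed batch yields no embeddings at all, so callers that need a one-to-one mapping between inputs and outputs should check the logs or the result length.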
Added new initialization parameters to the PyPDFToDocument component to customize the text extraction process from PDF files.
Replace usage of ChatMessage.content with ChatMessage.text across the codebase. This is done in preparation for the removal of content in Haystack 2.9.0.

⚠️ Deprecation Notes

The default value of the store_full_path parameter in converters will change to False in Haystack 2.9.0 to enhance privacy.
In Haystack 2.9.0, the ChatMessage data class will be refactored to make it more flexible and future-proof. As part of this change, the content attribute will be removed. A new text property has been introduced to provide access to the textual value of the ChatMessage. To ensure a smooth transition, start using the text property now in place of content.
The converter parameter in the PyPDFToDocument component is deprecated and will be removed in Haystack 2.9.0. For in-depth customization of the conversion process, consider implementing a custom component. Additional high-level customization options will be added in the future.
The output of context_documents in SentenceWindowRetriever will change in the next release. Instead of a List[List[Document]], the output will be a List[Document], where the documents are ordered by split_idx_start.
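Code that consumes context_documents today can prepare for the new shape by flattening the nested lists and ordering by split_idx_start. The sketch below models documents as dicts with a meta entry; flatten_context is a hypothetical helper, and it does not attempt to deduplicate overlapping windows.

```python
def flatten_context(nested_docs):
    """Flatten a list of document windows and order by split_idx_start."""
    flat = [doc for window in nested_docs for doc in window]
    return sorted(flat, key=lambda doc: doc["meta"]["split_idx_start"])


nested = [
    [
        {"content": "third", "meta": {"split_idx_start": 40}},
        {"content": "fourth", "meta": {"split_idx_start": 60}},
    ],
    [
        {"content": "first", "meta": {"split_idx_start": 0}},
        {"content": "second", "meta": {"split_idx_start": 20}},
    ],
]
flat = flatten_context(nested)
print([doc["content"] for doc in flat])
```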

🐛 Bug Fixes

Fix DocumentCleaner not preserving all Document fields when run
Fix DocumentJoiner failing when run with an empty list of Documents
For the NLTKDocumentSplitter, we updated how chunks are built when splitting by word with sentence boundaries respected. To avoid fully subsuming the previous chunk into the next one, we ignore the first sentence of the previous chunk when calculating sentence overlap. In other words, we want to avoid cases such as Doc1 = [s1, s2], Doc2 = [s1, s2, s3].
Finished adding custom splitting function support to the NLTKDocumentSplitter by updating the _split_into_units function and adding the splitting_function init parameter.
Added a specific to_dict method that overrides the underlying one from DocumentSplitter. This is needed to properly save the settings of the component to YAML.
Fix OpenAIChatGenerator and OpenAIGenerator crashing when using a streaming_callback and generation_kwargs contain {"stream_options": {"include_usage": True}}.
Fix tracing of Pipelines with cycles to correctly track component execution
When meta is passed into AnswerBuilder.run(), it is now merged into GeneratedAnswer meta
Fix DocumentSplitter to handle custom splitting_function without requiring split_length. Previously the splitting_function provided would not override other settings.