The following modification of the doc_exists function in chromadb.py worked for me:
def doc_exists(self, document: Document) -> bool:
    """Check if a specific document exists in the collection.

    Args:
        document (Document): Document to check.

    Returns:
        bool: True if the exact document exists, False otherwise.
    """
    if not self.client:
        return False
    try:
        collection: Collection = self.client.get_collection(name=self.collection_name)
        collection_data: GetResult = collection.get(include=[IncludeEnum.documents])
        # Get existing documents from collection
        existing_docs = collection_data.get("documents", [])
        # Clean document content for comparison
        cleaned_content = document.content.replace("\x00", "\ufffd")
        # Check if exact document exists
        return cleaned_content in existing_docs
    except Exception as e:
        logger.error(f"Error checking document existence: {e}")
        return False
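The membership test above fetches every stored document on each call. A possible alternative is to look the chunk up by a deterministic ID instead. The sketch below assumes that chunks are stored under an ID derived from a hash of their cleaned content; that is an assumption, not something confirmed in this issue. `FakeCollection`, `content_id`, and the standalone `doc_exists` are hypothetical names used only for illustration, with an in-memory class standing in for a real Chroma collection.

```python
import hashlib
from typing import Dict, List


class FakeCollection:
    """In-memory stand-in for a Chroma collection (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def add(self, ids: List[str], documents: List[str]) -> None:
        for doc_id, text in zip(ids, documents):
            self._docs[doc_id] = text

    def get(self, ids: List[str]) -> Dict[str, List[str]]:
        # Only the IDs that are actually stored come back in the result.
        found = [i for i in ids if i in self._docs]
        return {"ids": found, "documents": [self._docs[i] for i in found]}


def content_id(content: str) -> str:
    """Derive a deterministic ID from the cleaned document content (assumed scheme)."""
    cleaned = content.replace("\x00", "\ufffd")
    return hashlib.md5(cleaned.encode("utf-8")).hexdigest()


def doc_exists(collection: FakeCollection, content: str) -> bool:
    """Return True only if this exact content is stored under its content hash."""
    result = collection.get(ids=[content_id(content)])
    return len(result["ids"]) > 0


collection = FakeCollection()
first = "6G white paper, chunk 1"
collection.add(ids=[content_id(first)], documents=[first])
```

With this approach, a chunk from the second PDF is reported as missing even though the collection is no longer empty, and the lookup cost does not grow with the number of stored documents.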
Description
ChromaDb does not add the document chunks from the second provided URL to the collection.
Steps to Reproduce
I tested PDFUrlKnowledgeBase with two PDF URLs and ChromaDb as the vector database.
Agent Configuration (if applicable)
vector_db = ChromaDb(
    collection="pdf_knowledge",
    path="./tmp/chromadb",
    persistent_client=True,
    embedder=OpenAIEmbedder(id="text-embedding-3-small"),
)
knowledge_base = PDFUrlKnowledgeBase(
    urls=[
        "https://www-file.huawei.com/-/media/corp2020/pdf/tech-insights/1/6g-white-paper-en.pdf",
        "https://agno-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf",
    ],
    vector_db=vector_db,
)
knowledge_base.load()
agent = Agent(
    model=OpenAIChat(id="gpt-4o"),
    knowledge=knowledge_base,
    search_knowledge=True,
    show_tool_calls=True,
    debug_mode=True,
)
agent.print_response("Give the recipe of Thai Fried Noodles with Shrimps?")
Expected Behavior
What did you expect to happen?
The documents from the second URL should have been added to the knowledge base, and the answer to the question "Give the recipe of Thai Fried Noodles with Shrimps?" should have been extracted from the second PDF.
Actual Behavior
What actually happened instead?
The document chunks of the second PDF (URL) were not added to the ChromaDb collection.
Screenshots or Logs (if applicable)
Include any relevant screenshots or error logs that demonstrate the issue.
INFO Creating collection
INFO Loading knowledge base
INFO Reading: https://www-file.huawei.com/-/media/corp2020/pdf/tech-insights/1/6g-white-paper-en.pdf
INFO Added 33 documents to knowledge base
INFO Reading: https://agno-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf
INFO Added 0 documents to knowledge base
.....
....
DEBUG ============== assistant ==============
DEBUG It seems there was an issue with retrieving the recipe for Thai Fried Noodles with Shrimps from the knowledge
base. However, I can provide you with a general recipe for making this delicious dish:
Environment
Possible Solutions (optional)
Suggest any ideas you might have to fix or address the issue.
In agno/knowledge/agent.py, for the second PDF the condition "not self.vector_db.doc_exists(doc)" is always False:
for doc in document_list:
    if doc.content not in seen_content and not self.vector_db.doc_exists(doc):
        seen_content.add(doc.content)
        documents_to_load.append(doc)
self.vector_db.insert(documents=documents_to_load, filters=filters)
num_documents += len(documents_to_load)
logger.info(f"Added {len(documents_to_load)} documents to knowledge base")
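To make the effect of that condition concrete, here is a minimal simulation of the loop, not the library's actual code. `BuggyVectorDb` and `load` are hypothetical stand-ins: the stub's `doc_exists` returns True whenever anything is stored, which is the behavior reported in this issue, and plain strings stand in for document chunks.

```python
from typing import List, Set


class BuggyVectorDb:
    """Hypothetical stub whose doc_exists mimics the reported behavior."""

    def __init__(self) -> None:
        self.stored: List[str] = []

    def doc_exists(self, content: str) -> bool:
        # Reported bug: True as soon as the collection holds anything at all.
        return self.stored != []

    def insert(self, documents: List[str]) -> None:
        self.stored.extend(documents)


def load(vector_db: BuggyVectorDb, document_lists: List[List[str]]) -> List[int]:
    """Run the skip-existing loop per source and return how many chunks each added."""
    added_per_source: List[int] = []
    for document_list in document_lists:
        seen_content: Set[str] = set()
        documents_to_load: List[str] = []
        for content in document_list:
            if content not in seen_content and not vector_db.doc_exists(content):
                seen_content.add(content)
                documents_to_load.append(content)
        vector_db.insert(documents_to_load)
        added_per_source.append(len(documents_to_load))
    return added_per_source


white_paper = [f"6g chunk {i}" for i in range(3)]
recipes = [f"thai recipe chunk {i}" for i in range(2)]
counts = load(BuggyVectorDb(), [white_paper, recipes])
print(counts)  # [3, 0]
```

The first source is loaded in full because the collection is still empty, and the second source adds nothing, which matches the "Added 33 documents" and "Added 0 documents" log lines above.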
It seems that doc_exists is not implemented correctly in agno/vectordb/chroma/chromadb.py. Once the first document has been added, the check collection_data.get("documents") != [] always passes, so doc_exists returns True for every later document:
def doc_exists(self, document: Document) -> bool:
    """Check if a document exists in the collection.

    Args:
        document (Document): Document to check.

    Returns:
        bool: True if document exists, False otherwise.
    """
    if self.client:
        try:
            collection: Collection = self.client.get_collection(name=self.collection_name)
            collection_data: GetResult = collection.get(include=[IncludeEnum.documents])
            if collection_data.get("documents") != []:
                return True
        except Exception as e:
            logger.error(f"Document does not exist: {e}")
    return False
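The difference between "the collection is non-empty" and "this document is in the collection" can be shown on a plain dictionary shaped like the result of collection.get(). This is only a sketch: `exists_any` and `exists_exact` are hypothetical helper names, and the dictionary stands in for a real GetResult.

```python
from typing import Dict, List, Optional

GetResultLike = Dict[str, Optional[List[str]]]


def exists_any(collection_data: GetResultLike) -> bool:
    """The check from the code above: ignores which document is being asked about."""
    return collection_data.get("documents") != []


def exists_exact(collection_data: GetResultLike, content: str) -> bool:
    """Membership check: True only if this exact content is stored."""
    return content in (collection_data.get("documents") or [])


data: GetResultLike = {"documents": ["6G white paper, chunk 1"]}
recipe = "Thai Fried Noodles with Shrimps"

print(exists_any(data))            # True, although the recipe was never stored
print(exists_exact(data, recipe))  # False
```

Note that `exists_any` also returns True when the "documents" key is missing or None, because None != [] holds in Python.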
Additional Context
Add any other context or details about the problem here.