Show HN: An open-source tool that semantically profiles your data using LLMs

https://github.com/Cocoon-Data-Transformation/cocoon

License: MIT

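Cocoon thoroughly prepares your data for retrieval-augmented generation (RAG). Specifically, Cocoon helps document, connect, and optimize your data pipelines offline. The results can then be used for online RAG in use cases such as pipeline copilots and data transformation. Check out the feature overview and the YouTube demos 👇:

  • 📚 Learn more about features: https://cocoon-data-transformation.github.io/page/
  • Demo for Data Warehouse RAG: https://youtu.be/xdmRXs0UnfE
  • Demo for Data Pipeline RAG: https://youtu.be/kv5mwTkpfY0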

Get Started

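  • 👉 Try this Google Colab notebook for Data Warehouse RAG: https://colab.research.google.com/github/Cocoon-Data-Transformation/cocoon/blob/main/demo/Cocoon_Stage_Demo.ipynb
  • 👉 Try this Google Colab notebook for Data Pipeline RAG: https://colab.research.google.com/github/Cocoon-Data-Transformation/cocoon/blob/main/demo/Cocoon_RAG_pipeline.ipynb

Cocoon is available on PyPI and is imported as cocoon_data.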

To get started, you need to connect to:

  • LLMs (e.g., GPT-4, Claude-3, Gemini-Ultra, or your local LLMs)
  • Data warehouses (e.g., Snowflake, BigQuery, DuckDB...)

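# explicit imports for the names used below (the star import may already provide them)
import openai
import snowflake.connector
from cocoon_data import *

# if you use OpenAI GPT-4: replace the placeholder with your own API key
openai.api_key = 'xycabc'
# if you use Snowflake: pass your own connection parameters in place of ...
con = snowflake.connector.connect(...)
# build the query helper and the main workflow from the warehouse connection
query_widget, cocoon_workflow = create_cocoon_workflow(con)
# a helper widget to query your data warehouse
query_widget.display()
# the main panel to interact with Cocoon
cocoon_workflow.start()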

🎉 You should see the following in a notebook (screenshot: https://github.com/Cocoon-Data-Transformation/cocoon/blob/main/images/notebook.png).
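
The example above hardcodes a placeholder API key. One common alternative, sketched here as an assumption rather than something the Cocoon documentation prescribes, is to read the key from an environment variable. The load_api_key helper is hypothetical and exists only for illustration; the variable name OPENAI_API_KEY is a convention assumed here.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is unset or blank.

    Hypothetical helper for illustration; it is not part of Cocoon.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage, assuming the openai module is in scope as in the example above:
# openai.api_key = load_api_key()
```

This keeps the key out of the notebook itself, which matters if the notebook is shared or committed.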

{
"by": "zh2408",
"descendants": 2,
"id": 40248744,
"kids": [
40249297
],
"score": 10,
"text": "The problem we solve is profiling tables: this is the initial step where you need to understand the table and identify any anomalies.<p>During the process, many small decisions require semantic understanding. For example, missing values are normal for &#x27;deathdate&#x27; (still alive) but abnormal for &#x27;name.&#x27; For outliers, 100 for ages is fine, but some are -1, which is impossible! We use LLMs to semantically understand your tables and detect anomalies.<p>You can try it by uploading a CSV, and we will email back the profile: <a href=\"https:&#x2F;&#x2F;cocoon-data-transformation.github.io&#x2F;page&#x2F;\" rel=\"nofollow\">https:&#x2F;&#x2F;cocoon-data-transformation.github.io&#x2F;page&#x2F;</a><p>Let me know your feedback. Thanks!",
"time": 1714749854,
"title": "Show HN: An open-source tool that semantically profiles your data using LLMs",
"type": "story",
"url": "https://github.com/Cocoon-Data-Transformation/cocoon"
}
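The profiling decisions described in the post text above can be sketched in plain Python. This is a minimal illustration, not Cocoon's implementation: the ColumnExpectation class, the find_anomalies function, and the age bounds are hypothetical, and the expectations are written by hand here where the post says an LLM would infer them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnExpectation:
    """Hypothetical stand-in for what an LLM might infer about a column."""
    nulls_are_normal: bool
    min_value: Optional[float] = None
    max_value: Optional[float] = None


# Hand-written for this sketch; the bounds on age are an assumption.
EXPECTATIONS = {
    "name": ColumnExpectation(nulls_are_normal=False),
    "deathdate": ColumnExpectation(nulls_are_normal=True),  # person may still be alive
    "age": ColumnExpectation(nulls_are_normal=False, min_value=0, max_value=130),
}


def find_anomalies(rows, expectations):
    """Return (row_index, column, reason) for each value that breaks an expectation."""
    anomalies = []
    for i, row in enumerate(rows):
        for column, rule in expectations.items():
            value = row.get(column)
            if value is None:
                # A missing value is only an anomaly where nulls are not expected.
                if not rule.nulls_are_normal:
                    anomalies.append((i, column, "unexpected missing value"))
                continue
            if rule.min_value is not None and value < rule.min_value:
                anomalies.append((i, column, f"{value} is below {rule.min_value}"))
            elif rule.max_value is not None and value > rule.max_value:
                anomalies.append((i, column, f"{value} is above {rule.max_value}"))
    return anomalies


rows = [
    {"name": "Ada", "age": 100, "deathdate": None},
    {"name": None, "age": 42, "deathdate": "2001-05-01"},
    {"name": "Bob", "age": -1, "deathdate": None},
]

for anomaly in find_anomalies(rows, EXPECTATIONS):
    print(anomaly)
```

With these sample rows, the missing name and the age of -1 are flagged, while the age of 100 and the missing deathdate values pass.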
{
"author": "Cocoon-Data-Transformation",
"date": null,
"description": "Contribute to Cocoon-Data-Transformation/cocoon development by creating an account on GitHub.",
"image": "https://opengraph.githubassets.com/3efaf789725b563b66ac841f4681ac3f5a59ed96f501490e0ec93054e7c7826c/Cocoon-Data-Transformation/cocoon",
"logo": "https://logo.clearbit.com/github.com",
"publisher": "GitHub",
"title": "GitHub - Cocoon-Data-Transformation/cocoon",
"url": "https://github.com/Cocoon-Data-Transformation/cocoon"
}
{
"url": "https://github.com/Cocoon-Data-Transformation/cocoon",
"title": "GitHub - Cocoon-Data-Transformation/cocoon",
"description": "Cocoon thoroughly prepares your data for RAG. Specifically, Cocoon helps document, connect, and optimize your data pipelines offline. The result can be used for online RAG in use cases like pipeline copilots...",
"links": [
"https://github.com/Cocoon-Data-Transformation/cocoon"
],
"image": "https://opengraph.githubassets.com/3efaf789725b563b66ac841f4681ac3f5a59ed96f501490e0ec93054e7c7826c/Cocoon-Data-Transformation/cocoon",
"content": "<div><article><p><a target=\"_blank\" href=\"https://github.com/Cocoon-Data-Transformation/cocoon/blob/main/images/cocoon_logo.png\"><img src=\"https://github.com/Cocoon-Data-Transformation/cocoon/raw/main/images/cocoon_logo.png\" alt=\"Cocoon Logo\" /></a>\n</p>\n<p><a target=\"_blank\" href=\"https://camo.githubusercontent.com/6cd0120cc4c5ac11d28b2c60f76033b52db98dac641de3b2644bb054b449d60c/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667\"><img src=\"https://camo.githubusercontent.com/6cd0120cc4c5ac11d28b2c60f76033b52db98dac641de3b2644bb054b449d60c/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667\" alt=\"License: MIT\" /></a></p>\n<p>Cocoon thoroughly prepares your data for RAG. Specifically, Cocoon helps document, connect, and optimize your data pipelines offline. The result can be used for online RAG in use cases like pipeline copilots and data transformation. Check out the YouTube demo 👇:</p>\n<ul>\n<li>📚 <a target=\"_blank\" href=\"https://cocoon-data-transformation.github.io/page/\"><em>Learn more about features</em></a></li>\n<li><a target=\"_blank\" href=\"https://youtu.be/xdmRXs0UnfE\"><em>Demo for Data Warehouse RAG</em></a></li>\n</ul>\n <p><a target=\"_blank\" href=\"https://youtu.be/xdmRXs0UnfE\">\n <img src=\"https://github.com/Cocoon-Data-Transformation/cocoon/raw/main/images/Thumbnail.png\" alt=\"IMAGE ALT TEXT\" />\n </a>\n </p>\n <br />\n<ul>\n<li><a target=\"_blank\" href=\"https://youtu.be/kv5mwTkpfY0\"><em>Demo for Data Pipeline RAG</em></a></li>\n</ul>\n <p><a target=\"_blank\" href=\"https://youtu.be/kv5mwTkpfY0\">\n <img src=\"https://github.com/Cocoon-Data-Transformation/cocoon/raw/main/images/Thumbnail2.png\" alt=\"IMAGE ALT TEXT\" />\n </a>\n </p>\n<p></p><h2>Get Started</h2><a target=\"_blank\" href=\"https://github.com/Cocoon-Data-Transformation/cocoon#get-started\"></a><p></p>\n<ul>\n<li>👉 <a target=\"_blank\" 
href=\"https://colab.research.google.com/github/Cocoon-Data-Transformation/cocoon/blob/main/demo/Cocoon_Stage_Demo.ipynb\">Try this Google Colab Notebook for Data Warehouse RAG</a></li>\n<li>👉 <a target=\"_blank\" href=\"https://colab.research.google.com/github/Cocoon-Data-Transformation/cocoon/blob/main/demo/Cocoon_RAG_pipeline.ipynb\">Try this Google Colab Notebook for Data Pipeline RAG</a></li>\n</ul>\n<p>Cocoon is available on PyPI:</p>\n<p>To get started, you need to connect to</p>\n<ul>\n<li>LLMs (e.g., GPT-4, Claude-3, Gemini-Ultra, or your local LLMs)</li>\n<li>Data Warehouses (e.g., Snowflake, BigQuery, DuckDB...)</li>\n</ul>\n<div><pre><span>from</span> <span>cocoon_data</span> <span>import</span> <span>*</span>\n<span># if you use OpenAI GPT-4</span>\n<span>openai</span>.<span>api_key</span> <span>=</span> <span>'xycabc'</span>\n<span># if you use Snowflake</span>\n<span>con</span> <span>=</span> <span>snowflake</span>.<span>connector</span>.<span>connect</span>(...)\n<span>query_widget</span>, <span>cocoon_workflow</span> <span>=</span> <span>create_cocoon_workflow</span>(<span>con</span>)\n<span># a helper widget to query your data warehouse</span>\n<span>query_widget</span>.<span>display</span>()\n<span># the main panel to interact with Cocoon</span>\n<span>cocoon_workflow</span>.<span>start</span>()</pre></div>\n<p>🎉 You should see the following in a notebook:</p>\n<p><a target=\"_blank\" href=\"https://github.com/Cocoon-Data-Transformation/cocoon/blob/main/images/notebook.png\"><img src=\"https://github.com/Cocoon-Data-Transformation/cocoon/raw/main/images/notebook.png\" /></a>\n</p>\n</article></div>",
"author": "",
"favicon": "https://github.githubassets.com/favicons/favicon.svg",
"source": "github.com",
"published": "",
"ttr": 33,
"type": "object"
}