sphinx_doc_chunking
To get started:
Dynamically pull and run
from hamilton import dataflows, driver
from hamilton.execution import executors
# downloads into ~/.hamilton/dataflows and loads the module -- WARNING: ensure you know what code you're importing!
sphinx_doc_chunking = dataflows.import_module("sphinx_doc_chunking")
# Switch this out for ray, dask, etc. See docs for more info.
remote_executor = executors.MultiThreadingExecutor(max_tasks=20)
dr = (
    driver.Builder()
    .enable_dynamic_execution(allow_experimental_mode=True)
    .with_remote_executor(remote_executor)
    .with_config({})  # replace with configuration as appropriate
    .with_modules(sphinx_doc_chunking)
    .build()
)
# If you have sf-hamilton[visualization] installed, you can see the dataflow graph
# In a notebook this will show an image, else pass in arguments to save to a file
# dr.display_all_functions()
# Execute the dataflow, specifying what you want back. Will return a dictionary.
result = dr.execute(
    [sphinx_doc_chunking.CHANGE_ME, ...],  # this specifies what you want back
    inputs={...},  # pass in inputs as appropriate
)
Use published library version
pip install sf-hamilton-contrib --upgrade # make sure you have the latest
from hamilton import dataflows, driver
from hamilton.execution import executors
# Make sure you've done - `pip install sf-hamilton-contrib --upgrade`
from hamilton.contrib.dagworks import sphinx_doc_chunking
# Switch this out for ray, dask, etc. See docs for more info.
remote_executor = executors.MultiThreadingExecutor(max_tasks=20)
dr = (
    driver.Builder()
    .enable_dynamic_execution(allow_experimental_mode=True)
    .with_remote_executor(remote_executor)
    .with_config({})  # replace with configuration as appropriate
    .with_modules(sphinx_doc_chunking)
    .build()
)
# If you have sf-hamilton[visualization] installed, you can see the dataflow graph
# In a notebook this will show an image, else pass in arguments to save to a file
# dr.display_all_functions()
# Execute the dataflow, specifying what you want back. Will return a dictionary.
result = dr.execute(
    [sphinx_doc_chunking.CHANGE_ME, ...],  # this specifies what you want back
    inputs={...},  # pass in inputs as appropriate
)
Modify for your needs
Now, if you want to modify the dataflow, copy it to a new folder (you can rename it) and modify it there.
dataflows.copy(sphinx_doc_chunking, "path/to/save/to")
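After copying, import your local copy as a regular Python module and pass it to the driver. A minimal sketch -- the destination path, and therefore the exact import, depend on where you copied the code to and what you named it:
import sys
sys.path.insert(0, "path/to/save/to")  # wherever you copied the dataflow to
import sphinx_doc_chunking as my_sphinx_doc_chunking  # your local, editable copy
from hamilton import driver
from hamilton.execution import executors
dr = (
    driver.Builder()
    .enable_dynamic_execution(allow_experimental_mode=True)
    .with_remote_executor(executors.MultiThreadingExecutor(max_tasks=20))
    .with_modules(my_sphinx_doc_chunking)
    .build()
)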
Purpose of this module
The purpose of this module is to take Sphinx Furo themed documentation, pull the pages, and chunk the text for further processing, e.g. creating embeddings. This is fairly generic code that is easy to change and extend for your purposes. It runs anywhere that Python runs, and can be extended to run on Ray, Dask, and even PySpark.
## import sphinx_doc_chunking via the means that you want. See above code.
from hamilton import driver
from hamilton.execution import executors
dr = (
    driver.Builder()
    .with_modules(sphinx_doc_chunking)
    .enable_dynamic_execution(allow_experimental_mode=True)
    .with_config({})
    ## defaults to multi-threading -- and tasks control max concurrency
    .with_remote_executor(executors.MultiThreadingExecutor(max_tasks=25))
    .build()
)
What you should modify
You'll likely want to:
- change what does the chunking and its settings (see the sketch after this list).
- change how URLs are sourced.
- change how text is extracted from a page.
- extend the code to hit an API to get embeddings.
- extend the code to push data to a vector database.
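For example, to change the chunking you might replace the chunking node in your copy of the module. A minimal sketch, assuming you want langchain's RecursiveCharacterTextSplitter; the function name and its upstream dependency (article_text) are illustrative and may not match the module's actual node names:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def chunked_text(article_text: str, chunk_size: int = 1000, chunk_overlap: int = 100) -> list[str]:
    """Splits one page's text into overlapping chunks suitable for embedding."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", " ", ""],  # prefer paragraph, then line, then word boundaries
    )
    return splitter.split_text(article_text)
Because this is a regular Hamilton function, chunk_size and chunk_overlap become optional inputs that you can override at dr.execute(...) time.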
Configuration Options
There is no configuration required for this module.
Limitations
Your general multiprocessing caveats apply if you choose an executor other than the MultiThreadingExecutor. For example:
- Serialization -- objects need to be serializable between processes.
- Concurrency/parallelism -- you're in control of this.
- Failures -- you'll need to make your code do the right thing here.
- Memory requirements -- the "collect" (or reduce) step pulls things into memory. If you hit memory limits, you'll need to redesign your code a little, e.g. write large objects to a store and pass pointers to them instead (see the sketch after this list).
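A minimal sketch of that pattern, using the local filesystem as the "store"; the node names and the chunk_store_dir input are illustrative, not the module's actual API:
import json
import pathlib
import uuid

from hamilton.htypes import Collect

def chunk_file_path(chunked_text: list[str], chunk_store_dir: str) -> str:
    """Writes one page's chunks to disk and returns only the path, i.e. the pointer."""
    path = pathlib.Path(chunk_store_dir) / f"{uuid.uuid4()}.json"
    path.write_text(json.dumps(chunked_text))
    return str(path)

def all_chunk_files(chunk_file_path: Collect[str]) -> list[str]:
    """The collect step now only holds small strings (paths) in memory, not the chunks."""
    return list(chunk_file_path)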
To extend this to PySpark, see the examples folder for the changes required to adjust the code to handle PySpark.
Source code
__init__.py
Requirements
langchain
langchain-core
sf-hamilton[dask]
# optionally install Ray, or Dask, or both
sf-hamilton[ray]
sf-hamilton[visualization]