Key Features
`ccflow` is a collection of tools for workflow configuration, orchestration, and dependency injection.
It is intended to be flexible enough to handle diverse use cases, including data retrieval, validation, transformation, and loading (i.e. ETL workflows), model training, microservice configuration, and automated report generation.
Central to `ccflow` is the `BaseModel` class.
`BaseModel` is the base class for models in the `ccflow` framework.
A model is basically a data class (a class with attributes).
The naming was inspired by the open source library Pydantic (`BaseModel` actually inherits from the Pydantic base model class).
`CallableModel` is the base class for a special type of `BaseModel` that can be called.
A `CallableModel` is called with a context (something that derives from `ContextBase`) and returns a result (something that derives from `ResultBase`).
As an example, you may have a `SQLReader` callable model that, when called with a `DateRangeContext`, queries a SQL database and returns an `ArrowResult` (a wrapper around an Arrow table) containing the data in the date range defined by the context.
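The context-in, result-out pattern described above can be illustrated with a small sketch. It uses only the standard library and does not use `ccflow` itself: the `DateRangeContext` and `SQLReader` classes below are simplified stand-ins written for this example, `RowsResult` is a hypothetical substitute for `ArrowResult`, and the `prices` table is made up.

```python
import sqlite3
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DateRangeContext:
    """Stand-in for a context: the inputs a model is called with."""
    start_date: date
    end_date: date


@dataclass(frozen=True)
class RowsResult:
    """Stand-in for a result: a typed wrapper around the output rows."""
    rows: list


@dataclass
class SQLReader:
    """Stand-in for a callable model: configuration lives in attributes,
    and calling the instance with a context returns a result."""
    connection: sqlite3.Connection
    table: str

    def __call__(self, context: DateRangeContext) -> RowsResult:
        # The table name comes from configuration; the dates are bound parameters.
        query = (
            f"SELECT day, value FROM {self.table} "
            "WHERE day BETWEEN ? AND ? ORDER BY day"
        )
        cursor = self.connection.execute(
            query, (context.start_date.isoformat(), context.end_date.isoformat())
        )
        return RowsResult(rows=cursor.fetchall())


# Build a tiny in-memory database to query (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (day TEXT, value REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("2024-01-01", 1.0), ("2024-01-02", 2.0), ("2024-01-03", 3.0)],
)

reader = SQLReader(connection=conn, table="prices")
result = reader(DateRangeContext(date(2024, 1, 2), date(2024, 1, 3)))
print(result.rows)
```

The same configured `reader` can be called again with a different context, which is the main point of separating configuration (attributes) from inputs (the context).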
A `ModelRegistry` is a named collection of models.
A `ModelRegistry` can be loaded from YAML configuration, which means you can define a collection of models using YAML.
This is powerful because it gives you an easy way to define a collection of Python objects via configuration.
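To make the idea of "Python objects defined via configuration" concrete, here is a minimal standard-library sketch. It is not the `ModelRegistry` API: the `build_registry` helper is hypothetical, the `_target_` key is an assumption for this sketch (a convention familiar from Hydra-style configuration), and a plain dict stands in for parsed YAML.

```python
import importlib


def build_registry(config: dict) -> dict:
    """Instantiate one object per named entry of a configuration mapping."""
    registry = {}
    for name, spec in config.items():
        params = dict(spec)  # copy so the caller's config is not mutated
        # "_target_" (assumed key) holds the dotted path of the class to build.
        module_name, _, class_name = params.pop("_target_").rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
        registry[name] = cls(**params)
    return registry


# Plain dict standing in for a parsed YAML document.
config = {
    "one_week": {"_target_": "datetime.timedelta", "days": 7},
    "half": {"_target_": "fractions.Fraction", "numerator": 1, "denominator": 2},
}

registry = build_registry(config)
print(registry["one_week"], registry["half"])
```

Standard-library classes are used as targets only to keep the sketch self-contained; in practice the entries would point at model classes.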
Although you are free to define your own models (`BaseModel` implementations) to use in your flow graph, `ccflow` comes with some models that you can use off the shelf to solve common problems.
`ccflow` comes with a range of models for reading data.
The following table summarizes the available models.
Note
Some models are still in the process of being open sourced.
| Name | Path | Description |
|---|---|---|
| `ArrowCSVReader` | Coming Soon! | |
| `PandasCSVReader` | Coming Soon! | |
| `ArrowDatasetReader` | Coming Soon! | |
| `ArrowDatasetWriter` | Coming Soon! | |
| `PandasDeltaReader` | Coming Soon! | |
| `PandasDeltaWriter` | Coming Soon! | |
| `FileEraser` | Coming Soon! | |
| `MLFlowPublisherModel` | Coming Soon! | |
| `CallableModelGroup` | Coming Soon! | |
| `MultiplexerModel` | Coming Soon! | |
| `PanderaValidator` | Coming Soon! | |
| `ArrowParquetReader` | Coming Soon! | |
| `PandasParquetReader` | Coming Soon! | |
| `ArrowParquetWriter` | Coming Soon! | |
| `PandasParquetWriter` | Coming Soon! | |
| `MultiFieldParquetReader` | Coming Soon! | |
| `ArrowParquetCacher` | Coming Soon! | |
| `SQLReader` | Coming Soon! | |
| `SQLPollingReader` | Coming Soon! | |
| `TableTransformModel` | Coming Soon! | |
| `WaterfallModel` | Coming Soon! | |
| `XArrayReader` | Coming Soon! | |
| `XArrayWriter` | Coming Soon! | |
`ccflow` also comes with a range of models for writing data.
These are referred to as publishers.
You can "chain" publishers and callable models using `PublisherModel` to call a `CallableModel` and publish the results in one step.
In fact, `ccflow` comes with several implementations of `PublisherModel` for common publishing use cases.
The following table summarizes the "publisher" models.
Note
Some models are still in the process of being open sourced.
| Name | Path | Description |
|---|---|---|
| `DictTemplateFilePublisher` | `ccflow.publishers` | Publish data to a file after populating a Jinja template. |
| `GenericFilePublisher` | `ccflow.publishers` | Publish data using a generic "dump" Callable. Uses `smart_open` under the hood so that local and cloud paths are supported. |
| `JSONPublisher` | `ccflow.publishers` | Publish data to a file in JSON format. |
| `PandasFilePublisher` | `ccflow.publishers` | Publish a pandas data frame to a file using an appropriate method on `pd.DataFrame`. For large-scale exporting (using parquet), see `PandasParquetPublisher`. |
| `PicklePublisher` | `ccflow.publishers` | Publish data to a pickle file. |
| `PydanticJSONPublisher` | `ccflow.publishers` | Publish a pydantic model to a JSON file. See https://docs.pydantic.dev/latest/concepts/serialization/#modeljson |
| `YAMLPublisher` | `ccflow.publishers` | Publish data to a file in YAML format. |
| `CompositePublisher` | `ccflow.publishers` | Highly configurable publisher that decomposes a pydantic `BaseModel` or a dictionary into pieces and publishes each piece separately. |
| `ArrowDatasetPublisher` | Coming Soon! | |
| `PandasDeltaPublisher` | Coming Soon! | |
| `EmailPublisher` | Coming Soon! | |
| `MatplotlibFilePublisher` | Coming Soon! | |
| `MLFlowArtifactPublisher` | Coming Soon! | |
| `MLFlowPublisher` | Coming Soon! | |
| `PandasParquetPublisher` | Coming Soon! | |
| `PlotlyFilePublisher` | Coming Soon! | |
| `XArrayPublisher` | Coming Soon! | |
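
The idea of chaining a callable model with a publisher (call the model, then publish its result in one step) can be sketched with the standard library. This is not the `PublisherModel` API: `JSONFilePublisher`, `PublishingModel`, and `square_model` are hypothetical names invented for this illustration, and a plain dict stands in for a context.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable


@dataclass
class JSONFilePublisher:
    """Hypothetical publisher: writes the data it is given to a JSON file."""
    path: Path

    def publish(self, data: Any) -> Path:
        self.path.write_text(json.dumps(data, sort_keys=True))
        return self.path


@dataclass
class PublishingModel:
    """Hypothetical chaining step: call a model, then publish its result."""
    model: Callable[[dict], Any]
    publisher: JSONFilePublisher

    def __call__(self, context: dict) -> Path:
        result = self.model(context)
        return self.publisher.publish(result)


def square_model(context: dict) -> dict:
    """Toy callable standing in for a callable model."""
    return {"n": context["n"], "square": context["n"] ** 2}


out_dir = Path(tempfile.mkdtemp())
chained = PublishingModel(
    model=square_model,
    publisher=JSONFilePublisher(out_dir / "result.json"),
)
written = chained({"n": 4})
published = json.loads(written.read_text())
print(published)
```

Keeping the publisher separate from the model means the same model can be published to a different destination or format by swapping only the publisher.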
`ccflow` comes with "evaluators" that allow you to evaluate (i.e. run) `CallableModel`s in different ways.
The following table summarizes the "evaluator" models.
Note
Some models are still in the process of being open sourced.
| Name | Path | Description |
|---|---|---|
| `LazyEvaluator` | `ccflow.evaluators` | Evaluator that only actually runs the callable once an attribute of the result is queried (by hooking into `__getattribute__`). |
| `LoggingEvaluator` | `ccflow.evaluators` | Evaluator that logs information about evaluating the callable. |
| `MemoryCacheEvaluator` | `ccflow.evaluators` | Evaluator that caches results in memory. |
| `MultiEvaluator` | `ccflow.evaluators` | An evaluator that combines multiple evaluators. |
| `GraphEvaluator` | `ccflow.evaluators` | Evaluator that evaluates the dependency graph of callable models in topologically sorted order. |
| `ChunkedDateRangeEvaluator` | Coming Soon! | |
| `ChunkedDateRangeResultsAggregator` | Coming Soon! | |
| `RayChunkedDateRangeEvaluator` | Coming Soon! | |
| `DependencyTrackingEvaluator` | Coming Soon! | |
| `DiskCacheEvaluator` | Coming Soon! | |
| `ParquetCacheEvaluator` | Coming Soon! | |
| `RayCacheEvaluator` | Coming Soon! | |
| `RayGraphEvaluator` | Coming Soon! | |
| `RayDelayedDistributedEvaluator` | Coming Soon! | |
| `RetryEvaluator` | Coming Soon! | |
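
To illustrate what "evaluating a callable model in different ways" can mean, the sketch below wraps the act of calling a model in two interchangeable layers: one that logs and one that caches in memory. It is a standard-library illustration of the concept only; `direct_evaluator`, `LoggingLayer`, and `MemoryCacheLayer` are hypothetical names and do not reflect the interfaces of the evaluators in `ccflow.evaluators`.

```python
import logging

logger = logging.getLogger("evaluator_sketch")


def direct_evaluator(model, context):
    """Simplest possible evaluator: just call the model with the context."""
    return model(context)


class LoggingLayer:
    """Hypothetical evaluator that logs before and after delegating."""

    def __init__(self, inner=direct_evaluator):
        self.inner = inner

    def __call__(self, model, context):
        logger.info("evaluating %r with context %r", model, context)
        result = self.inner(model, context)
        logger.info("finished %r", model)
        return result


class MemoryCacheLayer:
    """Hypothetical evaluator that reuses results per (model, context) pair.

    Both the model and the context must be hashable in this sketch.
    """

    def __init__(self, inner=direct_evaluator):
        self.inner = inner
        self._cache = {}

    def __call__(self, model, context):
        key = (model, context)
        if key not in self._cache:
            self._cache[key] = self.inner(model, context)
        return self._cache[key]


calls = []  # records every real execution of the model


def slow_square(context: int) -> int:
    """Toy model standing in for an expensive callable model."""
    calls.append(context)
    return context * context


# Combine layers: the cache is consulted first, logging wraps real executions.
evaluate = MemoryCacheLayer(LoggingLayer())
first = evaluate(slow_square, 3)
second = evaluate(slow_square, 3)  # served from the cache, model not re-run
```

Because each layer has the same call signature as the thing it wraps, layers can be stacked in any order, which is one plausible way to think about combining multiple evaluators.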
A Result is an object that holds the results from a callable model. It provides the equivalent of a strongly typed dictionary where the keys and schema are known upfront.
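As a rough analogy for "a strongly typed dictionary where the keys and schema are known upfront", consider the standard-library sketch below. `PriceSummary` is a hypothetical type invented for this example and is not a `ccflow` result class. Note also that dataclasses do not validate field types at runtime, whereas Pydantic models (which `ccflow` builds on) do.

```python
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class PriceSummary:
    """Hypothetical result type: field names and types are declared up front."""
    symbol: str
    mean_price: float
    n_observations: int


summary = PriceSummary(symbol="ABC", mean_price=101.5, n_observations=20)

# The "keys" are known from the class alone, before any result exists.
field_names = [f.name for f in fields(PriceSummary)]

# The result can still be viewed as an ordinary mapping when needed.
as_mapping = asdict(summary)

# Unlike a plain dict, an unknown key is rejected at construction time.
try:
    PriceSummary(symbol="ABC", mean_price=1.0, n_observations=1, extra=0)
except TypeError:
    rejected_unknown_key = True
else:
    rejected_unknown_key = False
```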
The following table summarizes the "result" models.
| Name | Path | Description |
|---|---|---|
| `GenericResult` | `ccflow.result` | A generic result (holds anything). |
| `DictResult` | `ccflow.result` | A generic dict (key/value) result. |
| `ArrowResult` | `ccflow.result.pyarrow` | Holds an arrow table. |
| `ArrowDateRangeResult` | `ccflow.result.pyarrow` | Extension of `ArrowResult` for representing a table over a date range that can be divided by date, such that generation of any sub-range of dates gives the same results as the original table filtered for those dates. |
| `NumpyResult` | `ccflow.result.numpy` | Holds a numpy array. |
| `PandasResult` | `ccflow.result.pandas` | Holds a pandas dataframe. |
| `XArrayResult` | `ccflow.result.xarray` | Holds an xarray. |
This wiki is autogenerated. To make updates, open a PR against the original source file in `docs/wiki`.