
Key Features

github-actions[bot] edited this page Nov 19, 2024 · 2 revisions

ccflow is a collection of tools for workflow configuration, orchestration, and dependency injection. It is intended to be flexible enough to handle diverse use cases, including data retrieval, validation, transformation, and loading (i.e. ETL workflows), model training, microservice configuration, and automated report generation.

Base Model

Central to ccflow is the BaseModel class, the base class for all models in the framework. A model is essentially a data class (a class with attributes). The name was inspired by the open source library Pydantic; BaseModel in fact inherits from the Pydantic base model class.
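To make "a class with attributes" concrete, here is a minimal sketch that uses the standard-library dataclasses module as a stand-in. It does not use ccflow or Pydantic, and the class CSVSource and its fields are hypothetical names chosen for illustration only.

```python
from dataclasses import asdict, dataclass


# Hypothetical model: a configuration object expressed as typed attributes.
# A stdlib dataclass is used here as a stand-in; ccflow's BaseModel is a
# Pydantic model, which additionally validates the attribute values.
@dataclass(frozen=True)
class CSVSource:
    path: str
    delimiter: str = ","
    has_header: bool = True


source = CSVSource(path="data/prices.csv")

# A model can be turned back into plain data, which is what makes it
# convenient to drive from configuration.
config = asdict(source)
```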

Callable Model

CallableModel is the base class for a special type of BaseModel that can be called. A CallableModel is called with a context (something that derives from ContextBase) and returns a result (something that derives from ResultBase). For example, you might have a SQLReader callable model that, when called with a DateRangeContext, queries a SQL database and returns an ArrowResult (a wrapper around an Arrow table) containing the data in the date range defined by the context.
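The context-in, result-out shape can be sketched without ccflow. The example below is an assumption-laden illustration in plain Python: DateRange, RowsResult and InMemoryReader are hypothetical stand-ins for a context, a result and a callable model, and an in-memory list replaces the SQL database.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class DateRange:
    """Hypothetical stand-in for a context: the parameters of one call."""
    start_date: date
    end_date: date


@dataclass
class RowsResult:
    """Hypothetical stand-in for a result: a typed wrapper around the output."""
    rows: List[dict]


@dataclass
class InMemoryReader:
    """Hypothetical stand-in for a callable model.

    Its attributes hold configuration (here, the table to read from);
    calling it with a context produces a result.
    """
    table: List[dict]

    def __call__(self, context: DateRange) -> RowsResult:
        selected = [
            row for row in self.table
            if context.start_date <= row["date"] <= context.end_date
        ]
        return RowsResult(rows=selected)


reader = InMemoryReader(table=[
    {"date": date(2024, 1, 1), "value": 1},
    {"date": date(2024, 1, 2), "value": 2},
    {"date": date(2024, 1, 3), "value": 3},
    {"date": date(2024, 1, 4), "value": 4},
])
result = reader(DateRange(start_date=date(2024, 1, 2), end_date=date(2024, 1, 3)))
```

Keeping configuration on the model and per-call parameters on the context means the same configured reader can be reused across many date ranges.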

Model Registry

A ModelRegistry is a named collection of models. A ModelRegistry can be loaded from YAML configuration, so a collection of models can be defined in YAML. This is powerful because it gives you an easy way to define a collection of Python objects via configuration.
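The idea of building named Python objects from configuration can be sketched as follows. This is not ccflow's loader: the "type" key, the KNOWN_TYPES lookup and the Scale and Offset classes are assumptions made for the example, and a plain dictionary stands in for what a YAML parser would return.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class Scale:
    factor: float


@dataclass
class Offset:
    amount: float


# Hypothetical lookup from a type name used in configuration to a class.
KNOWN_TYPES = {"Scale": Scale, "Offset": Offset}


def build_registry(config: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Instantiate one object per named entry in the configuration."""
    registry = {}
    for name, spec in config.items():
        params = dict(spec)  # copy so the caller's config is not mutated
        cls = KNOWN_TYPES[params.pop("type")]
        registry[name] = cls(**params)
    return registry


# What a YAML parser might produce for a document such as:
#   double:
#     type: Scale
#     factor: 2.0
#   shift:
#     type: Offset
#     amount: 1.5
config = {
    "double": {"type": "Scale", "factor": 2.0},
    "shift": {"type": "Offset", "amount": 1.5},
}
registry = build_registry(config)
```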

Models

Although you are free to define your own models (BaseModel implementations) to use in your flow graph, ccflow comes with some models that you can use off the shelf to solve common problems. ccflow comes with a range of models for reading data.

The following table summarizes the available models.

Note

Some models are still in the process of being open sourced.

| Name | Path | Description |
| --- | --- | --- |
| ArrowCSVReader | | Coming Soon! |
| PandasCSVReader | | Coming Soon! |
| ArrowDatasetReader | | Coming Soon! |
| ArrowDatasetWriter | | Coming Soon! |
| PandasDeltaReader | | Coming Soon! |
| PandasDeltaWriter | | Coming Soon! |
| FileEraser | | Coming Soon! |
| MLFlowPublisherModel | | Coming Soon! |
| CallableModelGroup | | Coming Soon! |
| MultiplexerModel | | Coming Soon! |
| PanderaValidator | | Coming Soon! |
| ArrowParquetReader | | Coming Soon! |
| PandasParquetReader | | Coming Soon! |
| ArrowParquetWriter | | Coming Soon! |
| PandasParquetWriter | | Coming Soon! |
| MultiFieldParquetReader | | Coming Soon! |
| ArrowParquetCacher | | Coming Soon! |
| SQLReader | | Coming Soon! |
| SQLPollingReader | | Coming Soon! |
| TableTransformModel | | Coming Soon! |
| WaterfallModel | | Coming Soon! |
| XArrayReader | | Coming Soon! |
| XArrayWriter | | Coming Soon! |

Publishers

ccflow also comes with a range of models for writing data. These are referred to as publishers. You can "chain" publishers and callable models using PublisherModel to call a CallableModel and publish the results in one step. In fact, ccflow comes with several implementations of PublisherModel for common publishing use cases.
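The "call a model and publish the result in one step" pattern can be sketched in plain Python. The classes JSONFileSink and PublishStep and the function summarize are hypothetical and are not ccflow's PublisherModel API; they only show how the two pieces compose.

```python
import json
import tempfile
from pathlib import Path


class JSONFileSink:
    """Hypothetical publisher: writes whatever it is given to a JSON file."""

    def __init__(self, path):
        self.path = Path(path)

    def __call__(self, data) -> Path:
        self.path.write_text(json.dumps(data, sort_keys=True))
        return self.path


class PublishStep:
    """Hypothetical chain: run a callable, then hand its output to a publisher."""

    def __init__(self, model, publisher):
        self.model = model
        self.publisher = publisher

    def __call__(self, context):
        result = self.model(context)
        return self.publisher(result)


def summarize(context: dict) -> dict:
    """Hypothetical callable that turns a context into a small result."""
    values = context["values"]
    return {"count": len(values), "total": sum(values)}


with tempfile.TemporaryDirectory() as tmp:
    step = PublishStep(model=summarize, publisher=JSONFileSink(Path(tmp) / "summary.json"))
    written = step({"values": [1, 2, 3]})
    # Read the file back while the temporary directory still exists.
    published = json.loads(written.read_text())
```

Because the chained step is itself callable with a context, it can be used anywhere a plain callable is expected.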

The following table summarizes the "publisher" models.

Note

Some models are still in the process of being open sourced.

| Name | Path | Description |
| --- | --- | --- |
| DictTemplateFilePublisher | ccflow.publishers | Publish data to a file after populating a Jinja template. |
| GenericFilePublisher | ccflow.publishers | Publish data using a generic "dump" Callable. Uses smart_open under the hood so that local and cloud paths are supported. |
| JSONPublisher | ccflow.publishers | Publish data to a file in JSON format. |
| PandasFilePublisher | ccflow.publishers | Publish a pandas data frame to a file using an appropriate method on pd.DataFrame. For large-scale exporting (using parquet), see PandasParquetPublisher. |
| PicklePublisher | ccflow.publishers | Publish data to a pickle file. |
| PydanticJSONPublisher | ccflow.publishers | Publish a pydantic model to a JSON file. See https://docs.pydantic.dev/latest/concepts/serialization/#modeljson |
| YAMLPublisher | ccflow.publishers | Publish data to a file in YAML format. |
| CompositePublisher | ccflow.publishers | Highly configurable publisher that decomposes a pydantic BaseModel or a dictionary into pieces and publishes each piece separately. |
| ArrowDatasetPublisher | | Coming Soon! |
| PandasDeltaPublisher | | Coming Soon! |
| EmailPublisher | | Coming Soon! |
| MatplotlibFilePublisher | | Coming Soon! |
| MLFlowArtifactPublisher | | Coming Soon! |
| MLFlowPublisher | | Coming Soon! |
| PandasParquetPublisher | | Coming Soon! |
| PlotlyFilePublisher | | Coming Soon! |
| XArrayPublisher | | Coming Soon! |

Evaluators

ccflow comes with "evaluators" that allow you to evaluate (i.e. run) CallableModels in different ways.
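One way to picture an evaluator is as an object that decides how a callable is run, separately from what the callable computes. The sketch below shows an in-memory caching strategy in plain Python. MemoryCache and slow_square are hypothetical names and the interface is an assumption, not ccflow's evaluator API.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


class MemoryCache:
    """Hypothetical evaluator: runs a callable at most once per (callable, context)."""

    def __init__(self) -> None:
        self._cache: Dict[Tuple[Callable, Hashable], Any] = {}

    def __call__(self, model: Callable[[Hashable], Any], context: Hashable) -> Any:
        key = (model, context)
        if key not in self._cache:
            self._cache[key] = model(context)
        return self._cache[key]


calls = []  # records every real execution, to show when the cache is hit


def slow_square(x: int) -> int:
    """Hypothetical callable standing in for an expensive model."""
    calls.append(x)
    return x * x


evaluator = MemoryCache()
first = evaluator(slow_square, 4)   # executes slow_square
second = evaluator(slow_square, 4)  # served from the cache
third = evaluator(slow_square, 5)   # new context, executes again
```

The callable itself is unchanged; swapping the evaluator changes the execution strategy.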

The following table summarizes the "evaluator" models.

Note

Some models are still in the process of being open sourced.

| Name | Path | Description |
| --- | --- | --- |
| LazyEvaluator | ccflow.evaluators | Evaluator that only actually runs the callable once an attribute of the result is queried (by hooking into __getattribute__). |
| LoggingEvaluator | ccflow.evaluators | Evaluator that logs information about evaluating the callable. |
| MemoryCacheEvaluator | ccflow.evaluators | Evaluator that caches results in memory. |
| MultiEvaluator | ccflow.evaluators | An evaluator that combines multiple evaluators. |
| GraphEvaluator | ccflow.evaluators | Evaluator that evaluates the dependency graph of callable models in topologically sorted order. |
| ChunkedDateRangeEvaluator | | Coming Soon! |
| ChunkedDateRangeResultsAggregator | | Coming Soon! |
| RayChunkedDateRangeEvaluator | | Coming Soon! |
| DependencyTrackingEvaluator | | Coming Soon! |
| DiskCacheEvaluator | | Coming Soon! |
| ParquetCacheEvaluator | | Coming Soon! |
| RayCacheEvaluator | | Coming Soon! |
| RayGraphEvaluator | | Coming Soon! |
| RayDelayedDistributedEvaluator | | Coming Soon! |
| RetryEvaluator | | Coming Soon! |
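
The table above describes GraphEvaluator as evaluating a dependency graph in topologically sorted order. The sketch below illustrates that general technique with the standard-library graphlib module (Python 3.9+). The node names, the steps mapping and the evaluate function are hypothetical, and this is not a description of how GraphEvaluator is implemented.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it depends on (hypothetical graph).
dependencies = {
    "prices": set(),
    "features": {"prices"},
    "report": {"features", "prices"},
}

# Each node's computation receives the results of its dependencies.
steps = {
    "prices": lambda inputs: [1.0, 2.0, 3.0],
    "features": lambda inputs: [p * 2 for p in inputs["prices"]],
    "report": lambda inputs: sum(inputs["features"]) + len(inputs["prices"]),
}


def evaluate(dependencies, steps):
    """Run every node after all of its dependencies have been run."""
    results = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        inputs = {dep: results[dep] for dep in dependencies[name]}
        results[name] = steps[name](inputs)
    return order, results


order, results = evaluate(dependencies, steps)
```

Each node runs exactly once, even though "prices" is needed by both "features" and "report".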

Results

A Result is an object that holds the results from a callable model. It provides the equivalent of a strongly typed dictionary where the keys and schema are known upfront.
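As an illustration of "keys and schema known upfront", the sketch below uses a standard-library dataclass with a simple runtime type check. SummaryResult is a hypothetical name, and the check is a simplified stand-in for the validation that a Pydantic-based result would perform.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SummaryResult:
    """Hypothetical result: the field names and types are fixed in advance."""
    count: int
    mean: float

    def __post_init__(self):
        # Simplified stand-in for schema validation: reject wrongly typed values.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, got {type(value).__name__}"
                )


result = SummaryResult(count=3, mean=2.0)

# Unlike a plain dict, the available keys can be listed from the class itself.
keys = [f.name for f in fields(SummaryResult)]
```

Constructing SummaryResult(count="3", mean=2.0) raises TypeError in this sketch, and passing an unknown key fails at construction time as well.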

The following table summarizes the "result" models.

| Name | Path | Description |
| --- | --- | --- |
| GenericResult | ccflow.result | A generic result (holds anything). |
| DictResult | ccflow.result | A generic dict (key/value) result. |
| ArrowResult | ccflow.result.pyarrow | Holds an arrow table. |
| ArrowDateRangeResult | ccflow.result.pyarrow | Extension of ArrowResult for representing a table over a date range that can be divided by date, such that generation of any sub-range of dates gives the same results as the original table filtered for those dates. |
| NumpyResult | ccflow.result.numpy | Holds a numpy array. |
| PandasResult | ccflow.result.pandas | Holds a pandas dataframe. |
| XArrayResult | ccflow.result.xarray | Holds an xarray. |