# %% [markdown]
# ---
# title: "Skrub"
# format:
#   html:
#     code-fold: false
# jupyter: python3
# ---
# %% [markdown]
#
# [skrub-data.org](https://skrub-data.org/)
#
# **Less wrangling, more machine learning**
#
# # Machine learning and tabular data
#
# - ML expects numeric arrays
# - Real data is more complex:
#   - Multiple tables
#   - Dates, categories, text, locations, …
#
# **Skrub: bridge the gap between dataframes and scikit-learn.**
#
# ## Skrub helps at several stages of a tabular learning project
#
# 1. What's in the data? (EDA)
# 1. Can we learn anything? (baselines)
# 1. How do I represent the data? (feature extraction)
# 1. How do I bring it all together? (building a pipeline)
#
# # 1. What's in the data?
# %%
import skrub
import pandas as pd
employee_salaries = pd.read_csv("https://figshare.com/ndownloader/files/51755822")
employees = employee_salaries.drop("current_annual_salary", axis=1)
salaries = employee_salaries["current_annual_salary"]
employees.iloc[0]
# %% [markdown]
#
# ## `TableReport`: interactive display of a dataframe
# %%
skrub.TableReport(employees, verbose=0)
# %% [markdown]
# We can tell skrub to patch the default display of polars and pandas dataframes.
# %%
skrub.patch_display(verbose=0)
# %% [markdown]
# # 2. Can we learn anything?
# %%
X, y = employees, salaries
# %% [markdown]
# ## `tabular_learner`: a pre-made robust baseline
# %%
learner = skrub.tabular_learner("regressor")
learner
# %%
from sklearn.model_selection import cross_val_score
cross_val_score(learner, X, y, scoring="r2")
# %% [markdown]
# The `tabular_learner` adapts to the supervised estimator we choose
# %%
from sklearn.linear_model import Ridge
learner = skrub.tabular_learner(Ridge())
learner
# %%
cross_val_score(learner, X, y, scoring="r2")
# %% [markdown]
# # 3. How do I represent the data?
#
# Skrub helps extract informative features from tabular data.
#
# ## `TableVectorizer`: apply an appropriate transformer to each column
# %%
vectorizer = skrub.TableVectorizer()
transformed = vectorizer.fit_transform(X)
# %% [markdown]
# The `TableVectorizer` identifies several kinds of columns:
#
# - categorical, low cardinality
# - categorical, high cardinality
# - datetime
# - numeric
# - ... we may add more
# %%
from pprint import pprint
pprint(vectorizer.column_to_kind_)
# %% [markdown]
# For each kind, it applies an appropriate transformer
# %%
vectorizer.transformers_["department"] # low-cardinality categorical
# %%
vectorizer.transformers_["employee_position_title"] # high-cardinality categorical
# %%
vectorizer.transformers_["date_first_hired"] # datetime
# %% [markdown]
# ... and those transformers turn the input into numeric features that can be used for ML
# %%
transformed[vectorizer.input_to_outputs_["date_first_hired"]]
# %% [markdown]
# For high-cardinality categorical columns the default `GapEncoder` identifies
# sparse topics (more later).
# %%
transformed[vectorizer.input_to_outputs_["employee_position_title"]]
# %% [markdown]
# The transformer used for each column kind can be easily configured.
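# %% [markdown]
# A minimal sketch (assuming the `high_cardinality` parameter of
# `TableVectorizer`; the `MinHashEncoder`, described below, and the number of
# components are arbitrary choices for illustration):
# %%
# Swap the encoder used for high-cardinality categorical columns.
custom_vectorizer = skrub.TableVectorizer(
    high_cardinality=skrub.MinHashEncoder(n_components=16)
)
custom_vectorizer.fit_transform(X).shape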
# %% [markdown]
# ## Preprocessing in the `TableVectorizer`
#
# The `TableVectorizer` actually performs a lot of preprocessing before
# applying the final transformers, such as:
#
# - ensuring consistent column names
# - detecting missing values such as `"N/A"`
# - dropping empty columns
# - handling pandas dtypes -- `float64`, `nan` vs `Float64`, `NA`
# - parsing numbers
# - parsing dates, ensuring consistent dtype and timezone
# - converting numbers to float32 for faster computation & less memory downstream
# - ...
# %%
pprint(vectorizer.all_processing_steps_["date_first_hired"])
# %% [markdown]
# ## Extracting good features
#
# Skrub offers several encoders to extract features from different kinds of
# columns, in particular categorical columns.
#
# ### `GapEncoder`
#
# Categories are somewhere between text and an enumeration...
# The `GapEncoder` is somewhere between a topic model and a one-hot encoder!
# %%
import seaborn as sns
from matplotlib import pyplot as plt
gap = skrub.GapEncoder()
pos_title = X["employee_position_title"]
loadings = gap.fit_transform(pos_title).set_index(pos_title.values).head()
loadings.columns = [c.split(": ")[1] for c in loadings.columns]
sns.heatmap(loadings)
_ = plt.setp(plt.gca().get_xticklabels(), rotation=45, ha="right")
# %% [markdown]
# ### `TextEncoder`
#
# Extract embeddings from a text column using any model from the HuggingFace Hub.
# %%
import pandas as pd
words = pd.Series(["airport", "flight", "plane", "pineapple", "fruit"])
encoder = skrub.TextEncoder(model_name="all-MiniLM-L6-v2", n_components=None)
embeddings = encoder.fit_transform(words).set_index(words.values)
sns.heatmap(embeddings @ embeddings.T)
# %% [markdown]
# ### `MinHashEncoder`
#
# A fast, stateless way of encoding strings that works especially well with
# models based on decision trees (gradient boosting, random forest).
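# %% [markdown]
# A quick sketch applying it to the position titles seen earlier (the number of
# components is an arbitrary choice):
# %%
minhash = skrub.MinHashEncoder(n_components=8)
hashes = minhash.fit_transform(pos_title)
hashes.shape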
# %% [markdown]
# # 4. How do I bring it all together?
#
# Skrub has several transformers that allow performing typical dataframe
# operations such as projections, joins and aggregations _inside a scikit-learn pipeline_.
#
# Performing these operations in the machine-learning pipeline has several advantages:
#
# - Choices / hyperparameters can be optimized
# - Relevant state can be stored to ensure consistent transformations (see the `transform` call after the `AggJoiner` example below)
# - All transformations are packaged together in an estimator
#
# There are several transformers such as `SelectCols`, `Joiner` (fuzzy joining),
# `InterpolationJoiner`, `AggJoiner`, ...
#
# A toy example using the `AggJoiner`:
# %%
from skrub import AggJoiner
airports = pd.DataFrame(
    {
        "airport_id": [1, 2],
        "airport_name": ["Charles de Gaulle", "Aeroporto Leonardo da Vinci"],
        "city": ["Paris", "Roma"],
    }
)
airports
# %%
flights = pd.DataFrame(
    {
        "flight_id": range(1, 7),
        "from_airport": [1, 1, 1, 2, 2, 2],
        "total_passengers": [90, 120, 100, 70, 80, 90],
        "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
    }
)
flights
# %%
agg_joiner = AggJoiner(
    aux_table=flights,
    main_key="airport_id",
    aux_key="from_airport",
    cols=["total_passengers"],
    operations=["mean", "std"],
)
agg_joiner.fit_transform(airports)
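# %% [markdown]
# The fitted `agg_joiner` can then be applied to new main-table rows and joins
# the same aggregated statistics (a sketch with a made-up extra row):
# %%
new_airports = pd.DataFrame(
    {
        "airport_id": [2],
        "airport_name": ["Aeroporto Leonardo da Vinci"],
        "city": ["Roma"],
    }
)
agg_joiner.transform(new_airports)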
# %% [markdown]
# ## More interactive and expressive pipelines
#
# To go further than what can be done with scikit-learn Pipelines and the skrub
# transformers shown above, we are developing new utilities to easily define
# and inspect flexible pipelines that can process several dataframes.
#
# A prototype will be shown in a separate notebook.