[HUDI-8920] Optimized SerDe costs of Flink write, simple bucket and non bucket cases #12796
base: master
Conversation
Force-pushed from 68028fc to 9377d36.
.booleanType()
.defaultValue(false)
.withDescription("Optimized Flink write into Hudi table, which uses customized serialization/deserialization. "
    + "Note, that only SIMPLE BUCKET index is supported for now.");
Review comment: the PR's title says "simple bucket and non bucket cases", but this description mentions only the SIMPLE BUCKET index.
Missed it. Thanks! Fixed in 5a81536.
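For reference, a minimal sketch of what the corrected option definition might look like after 5a81536, using Flink's `ConfigOptions` builder. The key `write.fast.mode` and the `false` default come from this PR; the wrapping class, the field name, and the exact description wording are illustrative assumptions, not the PR's actual text.

```java
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class FastModeOptionSketch {

  // Hypothetical field name; the description below is an assumed rewording
  // that mentions both supported cases, per the review comment above.
  public static final ConfigOption<Boolean> WRITE_FAST_MODE = ConfigOptions
      .key("write.fast.mode")
      .booleanType()
      .defaultValue(false)
      .withDescription("Optimized Flink write into Hudi table, which uses customized "
          + "serialization/deserialization. Note that only non bucket and SIMPLE BUCKET "
          + "index cases are supported for now.");
}
```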
Force-pushed from 466c013 to 5a81536.
@danny0405, @xiarixiaoyao, @yuzhaojing, @wombatu-kun, hi! The main part of the proposed changes has been done in this PR. The only missing parts for now are consistent hashing support (in progress) and the bounded context (I will check it next). I've also finished testing.
These errors are related to consistent hashing, which is not supported yet.
… Flink write, non bucket and simple bucket index
Force-pushed from 04a23b3 to cb090d8.
I've squashed all commits into one, cb090d8, and renamed
Change Logs
Changes to the Flink stream write into a Hudi table, corresponding to RFC #12697. The simple bucket index and non bucket cases are implemented here. The only remaining work is to support consistent hashing and the bounded context:
(Figure: DataStream optimization progress - 1)
Main points:
- `HoodieFlinkRecord` is introduced. It doesn't extend `HoodieRecord`, because we need a data structure with Flink row data and Hudi metadata, constructed from Flink internal data types. `HoodieFlinkRecord` is efficient for Flink processing due to the implemented `HoodieFlinkRecordTypeInfo` and `HoodieFlinkRecordSerializer` with custom `serialize` and `deserialize` methods (see the sketch after this list).
- The optimized write path is gated behind the `write.fast.mode` configuration, which is turned off by default. After proper testing we could turn it on by default, then deprecate the previous behavior, and refactor all classes once the previous behavior is dropped.
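To illustrate the first point, here is a minimal, hedged sketch of the serialization idea: Hudi metadata fields are written as primitives and the payload is written with Flink's own `RowDataSerializer`, so no Avro conversion is needed on the wire. The class and field names are illustrative assumptions; the PR's actual `HoodieFlinkRecordSerializer` and `HoodieFlinkRecordTypeInfo` implement Flink's full `TypeSerializer`/`TypeInformation` contracts.

```java
// Simplified sketch, NOT the PR's actual classes: a record pairing Hudi
// metadata with Flink RowData, serialized directly in Flink's binary format.
import org.apache.flink.core.memory.DataInputView;
import org.apache.flink.core.memory.DataOutputView;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.runtime.typeutils.RowDataSerializer;
import org.apache.flink.table.types.logical.RowType;

import java.io.IOException;

public final class FlinkRecordSerDeSketch {

  /** Illustrative record holder: Hudi metadata plus the Flink row itself. */
  public static final class Rec {
    public final String recordKey;      // hypothetical metadata field
    public final String partitionPath;  // hypothetical metadata field
    public final RowData row;

    public Rec(String recordKey, String partitionPath, RowData row) {
      this.recordKey = recordKey;
      this.partitionPath = partitionPath;
      this.row = row;
    }
  }

  private final RowDataSerializer rowDataSerializer;

  public FlinkRecordSerDeSketch(RowType rowType) {
    // Reuse Flink's binary serializer for the data columns, so the payload
    // never goes through an Avro round-trip in this sketch.
    this.rowDataSerializer = new RowDataSerializer(rowType);
  }

  /** Writes metadata as primitives, then the row via RowDataSerializer. */
  public void serialize(Rec rec, DataOutputView target) throws IOException {
    target.writeUTF(rec.recordKey);
    target.writeUTF(rec.partitionPath);
    rowDataSerializer.serialize(rec.row, target);
  }

  /** Reads the fields back in exactly the order they were written. */
  public Rec deserialize(DataInputView source) throws IOException {
    String recordKey = source.readUTF();
    String partitionPath = source.readUTF();
    RowData row = rowDataSerializer.deserialize(source);
    return new Rec(recordKey, partitionPath, row);
  }
}
```

Registering a dedicated `TypeInformation` for such a record (the role of `HoodieFlinkRecordTypeInfo` in the PR) is what lets Flink pick the custom serializer for shuffles and state instead of falling back to generic Kryo serialization.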
Benchmark description
The `Lineitem` table from the TPC-H benchmark was used: 60 mln rows, of which 20 mln are unique.
Performance estimation results
Flink operators
Non bucket case:
(Figure: 1 operators - non bucket - 3 merged)
Simple bucket case:
(Figure: 1 operators - simple bucket - 3 merged)
Impact
Flink write performance improvement.
Risk level (write none, low medium or high below)
Low
Documentation Update
After merge
Contributor's checklist