Question regarding training schemes #25
Comments
Sorry for the late reply, we do not use other tuning techniques.
@luyaxi Thank you for the answer! I am currently trying to generate a dataset for proactive agent training. However, configuring the gym does not work for me as described in this repo. Before I look into the details, I was wondering where you got the inspiration for writing the gym in this repo, as I would like to adapt it for my own purposes. I just wanted to get a conceptual idea behind the code, specifically in relation to your paper.
I'd like to help correct any configuration problems; you can share your configs here (remember to omit any sensitive information like API keys). We were first inspired by Gymnasium from OpenAI and TextWorld from Microsoft. Both create a simulation environment to help solve decision optimization for agents. By integrating the simulated environment with a custom agent, one can use an outcome evaluator to directly optimize the decision-making process as a whole, which is important because fine-grained evaluation can be incredibly complex for advanced agents.
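As a rough illustration (this is not the actual code in this repo, and every class and function name below is made up), the idea looks like a Gymnasium-style loop where a simulated environment produces user activity, an agent acts in it, and an outcome evaluator scores the whole trajectory instead of every intermediate step:

```python
# Minimal sketch, assuming a Gymnasium-style interface; names are hypothetical.
import gymnasium as gym
from gymnasium import spaces


class SimulatedUserEnv(gym.Env):
    """Hypothetical environment: each step emits a user event; the agent decides
    whether to stay silent or proactively propose a task."""

    def __init__(self, events):
        super().__init__()
        self.events = events                      # scripted user activity trace
        self.action_space = spaces.Discrete(2)    # 0 = stay silent, 1 = propose a task
        self.observation_space = spaces.Discrete(len(events) + 1)
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._t = 0
        return self._t, {}                        # (observation, info)

    def step(self, action):
        self._t += 1
        terminated = self._t >= len(self.events)
        # No per-step reward: the trajectory is judged as a whole afterwards.
        return self._t, 0.0, terminated, False, {"action": action}


def outcome_evaluator(trajectory):
    """Placeholder for a reward model that scores the complete interaction,
    e.g. whether the agent proposed help at the right moments."""
    return sum(1.0 for _, action in trajectory if action == 1)  # dummy score


def rollout(env, policy):
    obs, _ = env.reset()
    trajectory, done = [], False
    while not done:
        action = policy(obs)
        obs, _, terminated, truncated, info = env.step(action)
        trajectory.append((obs, info["action"]))
        done = terminated or truncated
    return trajectory


env = SimulatedUserEnv(events=["open_ide", "read_docs", "write_code"])
traj = rollout(env, policy=lambda obs: 1)          # trivial always-propose policy
print("outcome score:", outcome_evaluator(traj))   # optimize the agent against this
```

The point of the design is that the evaluator only needs to judge outcomes, so the agent's decision-making can be optimized end to end without hand-crafting step-level rewards.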
Thank you so much @luyaxi for your answer and for offering to help! I have the follow-up questions below. I would appreciate it if you could answer when you have time.
Hello,
First of all, thank you for the efforts to share this repo and your work. I found this very interesting!
If I understand correctly, the reward model is trained with SFT using LLaMA-Factory, considering the conversation here: #21
Now, for training the proactive agent model, you also seem to have used SFT, judging by how the training set looks. Am I right?
I was wondering if you tried other fine-tuning approaches, such as preference tuning, since that was my first guess when I saw Equation 3 in your paper: https://arxiv.org/pdf/2410.12361 (I sketch what I mean below.)
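To make sure I am asking the right thing, here is roughly how I picture the two schemes. The field names and example texts are purely illustrative and not taken from your released dataset files:

```python
# Illustrative only: hypothetical records contrasting the two training schemes.

# (a) Plain SFT: each example pairs a context with the single "gold" agent response.
sft_example = {
    "instruction": "User has been editing report.md for 30 minutes without a break.",
    "output": "Would you like me to summarize the remaining sections for you?",
}

# (b) Preference tuning (e.g. DPO-style, which is how I read Equation 3):
# the same context is paired with a preferred and a rejected response, and the
# loss pushes the model toward the preferred one.
preference_example = {
    "prompt": "User has been editing report.md for 30 minutes without a break.",
    "chosen": "Would you like me to summarize the remaining sections for you?",
    "rejected": "Hello! I am an AI assistant. How can I help you today?",
}
```

So my question is whether the released agent model was trained only on records like (a), or also with pairs like (b).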
Please correct me if I understand anything wrong.
Sincerely,
Jaejin Cho
@luyaxi Hello, Yaxi. Would you mind taking a look at my question when you have time?