SGLang: Add AIME25 Dataset Support For Simple Eval

by Alex Johnson

This article discusses the proposal to integrate support for the AIME25 dataset directly into SGLang's built-in benchmarking toolkits. Currently, evaluating models on the AIME25 dataset requires a somewhat convoluted process involving the installation of nemo-skills, custom scripting, and manual configuration. By natively supporting AIME25, SGLang can significantly streamline this process, making it easier for researchers and developers to assess model performance on this important benchmark.

Motivation: Streamlining AIME25 Evaluation in SGLang

The primary motivation behind this feature request is to simplify the evaluation process for models on the AIME25 dataset within the SGLang framework. As it stands, testing a model against AIME25 necessitates a series of manual steps and external dependencies, making it less accessible and more time-consuming.

Currently, to test a model on the AIME25 dataset, users must:

  • Install nemo-skills.
  • Write hardcoded glue code to connect the dataset to the model evaluation pipeline.
  • Run a customized script.

An example of this workaround can be found in the SGLang documentation, which details the steps required to evaluate DeepSeek V3.2 on AIME 2025. The drawback of the workaround is clear: it is neither user-friendly nor convenient.

Integrating AIME25 support directly into SGLang's built-in benchmarking toolkits would eliminate these manual steps, providing a more seamless and efficient evaluation workflow. This enhancement would benefit researchers and developers who rely on SGLang for model evaluation, allowing them to quickly and easily assess their models' performance on a challenging and relevant benchmark. A smoother experience would in turn encourage more people to use SGLang to evaluate their models. In addition, native support typically leads to more robust and optimized evaluation pipelines, ensuring more accurate and reliable results: ideally, a single command would run the whole AIME25 evaluation, with no nemo-skills installation required.
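For instance, if AIME25 were registered alongside the existing eval names, launching an evaluation could be as simple as the invocation sketched here (the "aime25" eval name is an assumption; it does not exist yet):

    # Hypothetical invocation, assuming an "aime25" eval name is registered in
    # SGLang's simple-eval entry point the same way existing benchmarks are:
    #
    #   python -m sglang.test.run_eval --eval-name aime25 --port 30000
    #
    # No nemo-skills installation, no glue scripts: dataset download and
    # answer grading would live inside SGLang itself.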

By providing native support for AIME25, SGLang can become an even more attractive platform for researchers and developers working on advanced language models. This will encourage greater adoption of SGLang and foster a more vibrant community around the project. Streamlining the evaluation process will free up valuable time and resources, allowing researchers to focus on model development and innovation rather than on tedious setup and configuration tasks. Native support also allows for easier integration with other SGLang features and tools, creating a more cohesive and powerful evaluation ecosystem. This can unlock new possibilities for analyzing model performance and identifying areas for improvement.

Proposed Solution: In-built AIME25 Benchmarking

The proposed solution involves incorporating AIME25 support directly into SGLang's simple_eval functionality. This would entail adding the necessary code and configurations to automatically download, process, and evaluate models on the AIME25 dataset. The implementation of GPQA-Diamond can serve as a useful example for this integration.

The GPQA-Diamond implementation demonstrates how a dataset can be integrated into simple_eval. By following a similar approach, AIME25 can be seamlessly added to SGLang's benchmarking suite. This would involve creating a new evaluation script specifically for AIME25, which would handle loading the dataset, executing the model, and calculating the relevant metrics. With such a script in place, running the evaluation through simple_eval becomes straightforward.
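As a rough illustration, the sketch below mirrors the shape of SGLang's existing simple-eval implementations. The class and helper names (Eval, SamplerBase, SingleEvalResult, map_with_progress, aggregate_results) follow that pattern, but the exact signatures, the dataset URL, and the column names are assumptions to be confirmed against the actual codebase:

    import re

    import pandas as pd

    from sglang.test import simple_eval_common as common
    from sglang.test.simple_eval_common import (
        Eval,
        EvalResult,
        SamplerBase,
        SingleEvalResult,
    )

    # Placeholder URL: the real dataset source would be pinned during integration.
    AIME25_CSV_URL = "https://example.com/aime25.csv"

    QUERY_TEMPLATE = (
        "Solve the following AIME problem. The final answer is an integer "
        "between 0 and 999.\n\n{problem}\n\n"
        'End your response with "Answer: <integer>".'
    )
    ANSWER_PATTERN = re.compile(r"(?i)answer\s*:\s*(\d{1,3})")

    class AIME25Eval(Eval):
        def __init__(self, num_examples=None, num_threads=32):
            df = pd.read_csv(AIME25_CSV_URL)
            examples = [row.to_dict() for _, row in df.iterrows()]
            self.examples = examples[:num_examples] if num_examples else examples
            self.num_threads = num_threads

        def __call__(self, sampler: SamplerBase) -> EvalResult:
            def fn(row):
                prompt = QUERY_TEMPLATE.format(problem=row["problem"])
                response_text = sampler([dict(content=prompt, role="user")])
                match = ANSWER_PATTERN.search(response_text)
                # Exact-match grading: AIME answers are integers in [0, 999].
                score = float(
                    match is not None and int(match.group(1)) == int(row["answer"])
                )
                return SingleEvalResult(score=score)

            results = common.map_with_progress(fn, self.examples, self.num_threads)
            return common.aggregate_results(results)

The overall shape (load examples in __init__, grade one example in an inner function, fan out with map_with_progress, fold with aggregate_results) is taken directly from how the GPQA-Diamond eval is organized.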

Specifically, the new script should include functionality such as (a loading sketch follows this list):

  • Automatic downloading of the AIME25 dataset from a specified source.
  • Preprocessing of the dataset to ensure compatibility with the model being evaluated.
  • Execution of the model on the preprocessed data.
  • Calculation of key performance metrics, primarily exact-match accuracy on the integer answers.
  • Integration with SGLang's reporting and visualization tools.
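For the download step in particular, a minimal sketch with local caching might look like this; the URL, cache path, and column names are placeholders, not the real source:

    import os

    import pandas as pd

    AIME25_CSV_URL = "https://example.com/aime25.csv"  # placeholder source
    CACHE_PATH = os.path.expanduser("~/.cache/sglang/aime25.csv")

    def load_aime25(num_examples=None):
        """Download the dataset once, cache it locally, and normalize fields."""
        if not os.path.exists(CACHE_PATH):
            os.makedirs(os.path.dirname(CACHE_PATH), exist_ok=True)
            pd.read_csv(AIME25_CSV_URL).to_csv(CACHE_PATH, index=False)
        df = pd.read_csv(CACHE_PATH)
        # Normalize to the two fields the eval loop needs: a problem
        # statement and an integer reference answer.
        examples = [
            {"problem": str(r["problem"]), "answer": int(r["answer"])}
            for _, r in df.iterrows()
        ]
        return examples[:num_examples] if num_examples else examples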

By leveraging the existing infrastructure of simple_eval, the integration of AIME25 can be achieved efficiently and with minimal disruption to the existing codebase. This approach ensures that the new functionality is well-integrated with the rest of the SGLang ecosystem and that it benefits from the existing features and tools. Furthermore, by following the example of GPQA-Diamond, the integration can be implemented in a modular and maintainable way, making it easier to update and extend the functionality in the future. This ensures that SGLang remains a valuable and adaptable tool for researchers and developers working on advanced language models.
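Wiring the new eval into the existing entry point should then be a small, local change. The sketch below assumes a factory-style dispatch keyed on the eval name, with simple_eval_aime25 as a hypothetical module name chosen to match the simple_eval_gpqa naming scheme:

    def build_eval(eval_name, num_examples=None, num_threads=32):
        """Hypothetical factory mirroring the per-eval dispatch in run_eval."""
        if eval_name == "aime25":
            # Hypothetical module, following the simple_eval_gpqa naming scheme.
            from sglang.test.simple_eval_aime25 import AIME25Eval
            return AIME25Eval(num_examples, num_threads)
        raise ValueError(f"Unknown eval name: {eval_name}")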

Benefits of Native AIME25 Support

Integrating AIME25 directly into SGLang offers several key advantages:

  • Simplified Evaluation: Eliminates the need for manual installation of nemo-skills and custom scripting, making AIME25 evaluation more accessible to a wider range of users.
  • Improved Efficiency: Streamlines the evaluation workflow, saving time and resources for researchers and developers.
  • Enhanced Accuracy: Ensures more reliable and consistent evaluation results through standardized data processing and evaluation procedures.
  • Greater Adoption: Encourages greater adoption of SGLang by making it easier to evaluate models on a popular and challenging dataset.
  • Better Integration: Facilitates seamless integration with other SGLang features and tools, creating a more cohesive and powerful evaluation ecosystem.

Implementation Considerations

When implementing AIME25 support in SGLang, several factors should be considered:

  • Data Handling: Implement robust data handling procedures to ensure that the AIME25 dataset is downloaded, processed, and stored correctly.
  • Metric Calculation: Define and implement appropriate evaluation metrics to accurately assess model performance on AIME25.
  • Reporting and Visualization: Integrate AIME25 results into SGLang's reporting and visualization tools to provide users with clear and actionable insights.
  • Maintainability: Design the implementation in a modular and maintainable way to facilitate future updates and extensions.

To handle the data correctly, the size of the AIME25 dataset should be taken into account. It is small (the 2025 AIME I and II together contain 30 problems), so it can be loaded in a single pass, but the loading code should still be robust to download failures and unexpected schema changes. Defining appropriate evaluation metrics is equally important: AIME answers are integers between 0 and 999, so exact-match accuracy is the natural primary metric, and because the problem set is so small, accuracy averaged over several samples per problem gives a more stable estimate than a single run (see the sketch below). By integrating AIME25 results into SGLang's reporting and visualization tools, we can give users a comprehensive overview of their models' performance, enabling them to quickly identify areas for improvement and track progress over time. Finally, the implementation must be designed in a modular and maintainable way, because SGLang is an evolving project. Following these guidelines will make the AIME25 integration a success and provide a valuable tool for researchers and developers working on advanced language models.
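As a concrete example of the small-dataset point above, here is a minimal sketch of accuracy averaged over repeated samples per problem; the sampling counts are purely illustrative:

    def mean_accuracy(scores_per_problem):
        """scores_per_problem: one list of 0/1 scores per problem, from
        repeated sampling. Averaging over repeats and then over problems
        gives a lower-variance estimate on a 30-problem set."""
        per_problem = [sum(s) / len(s) for s in scores_per_problem]
        return sum(per_problem) / len(per_problem)

    # Example: 3 problems, 4 samples each.
    scores = [[1, 0, 1, 1], [0, 0, 0, 1], [1, 1, 1, 1]]
    print(mean_accuracy(scores))  # (0.75 + 0.25 + 1.0) / 3 ≈ 0.667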

Conclusion

Supporting the AIME25 dataset within SGLang's simple_eval would significantly enhance the platform's usability and value for researchers and developers. By streamlining the evaluation process and providing a more seamless experience, SGLang can become an even more attractive tool for those working on advanced language models. Basing the implementation on the GPQA-Diamond example would provide a solid foundation for this integration. In short, native AIME25 support is a worthwhile and necessary addition.

For further information on benchmarking language models, consider visiting the Hugging Face Hub.