INFORMS Open Forum

  • 1.  Call for Contributions - A Community Benchmark for LLMs on Optimization Modeling

    Posted 12 days ago
    Dear colleagues,
    We are inviting the OR/MS community to contribute problems to a new, community-driven benchmark for evaluating large language models (LLMs) on optimization modeling tasks.
    LLMs are beginning to lower the barrier to entry for optimization modeling, but realizing that potential requires rigorous benchmarks that reflect what actually makes a modeling task hard - and those benchmarks are best shaped by our community. Inspired by community efforts such as Humanity's Last Exam, we are assembling an open, living benchmark of genuinely challenging optimization problems, alongside a paper documenting the collection and the performance of state-of-the-art LLMs on it.

    The Management Science Editor-in-Chief is supportive of the project, and a department editor has agreed to an expedited review process (as with any submission, final publication depends on the review team's assessment). Contributors of accepted problems are invited to join the paper as co-authors.

    We welcome problems from any application domain in operations and management science - production planning, supply chain and logistics, routing, inventory, facility location, scheduling, revenue management, energy systems, healthcare operations, and more. A good fit is a problem that is relevant to real-world operations and that current frontier LLMs struggle to model correctly.
    Full details, selection criteria, and the submission form are in the call: https://drive.google.com/file/d/1-6YBgzahWN7loSx2prslNcSs54vrz8UP/view?usp=sharing.
     
    Contributions are welcome through August 31, 2026. Submissions are reviewed on a rolling basis, with a response typically within two weeks. Questions are very welcome - feel free to reach out at or.bench2026@gmail.com.
    We'd be grateful if you would forward this to colleagues who may be interested.
    With thanks,
    The organizing team
    Jim Dai (Cornell University)
    Dick den Hertog (University of Amsterdam)
    Dongdong Ge (Shanghai Jiao Tong University)
    Connor Lawless (Stanford University)
    Kuo Liang (Shanghai Jiao Tong University)
    Jianghao Lin (Shanghai Jiao Tong University)
    Zi Ling (University of Chicago)
    Jinsong Liu (Cornell University)
    Hanzhang Qin (National University of Singapore)
    Chung-Piaw Teo (National University of Singapore)
    Madeleine Udell (Stanford University)
    Wolfram Wiesemann (Imperial College London)
    Ruihao Zhu (Cornell University)


    ------------------------------
    Ruihao Zhu
    Assistant Professor
    Cornell University
    Ithaca NY
    ------------------------------




  • 2.  RE: Call for Contributions - A Community Benchmark for LLMs on Optimization Modeling

    Posted 11 days ago

    You wrote:

    > A good fit is a problem that is relevant to real-world operations and that current frontier LLMs struggle to model correctly.

    What is your definition of "correct"?



    ------------------------------
    -Irv Lustig
    Optimization Principal
    Princeton Consultants
    ------------------------------



  • 3.  RE: Call for Contributions - A Community Benchmark for LLMs on Optimization Modeling

    Posted 10 days ago

    My understanding is that traditional computational methods outperform LLMs on many objective and subjective measures. A clear, objective definition of "correct" would be obtaining the correct result for simple arithmetic.
    The following reference cites LLM 'reasoning failures'. Our challenge in benchmarking an LLM is to test it on at least its known weaknesses, where it must recognize that its own answer may be wrong and actively choose a more accurate method. This is an active area of research, especially in LLM use cases for critical infrastructure decision support.
    ----
    Large Language Model Reasoning Failures
    https://www.arxiv.org/pdf/2602.06176



    ------------------------------
    Robert Entriken
    Principal Technical Leader
    EPRI
    Palo Alto CA
    ------------------------------



  • 4.  RE: Call for Contributions - A Community Benchmark for LLMs on Optimization Modeling

    Posted 10 days ago

    Hey Irv,

    Thanks for the question. We plan to use a relatively mechanical way to determine correctness. That is, we shall compare the LLM-generated model's optimal decision and objective value against the ground-truth model's. If they match, we deem it correct; otherwise, it is incorrect.

    Of course, this is by no means the best definition, but it turns out to be one of the well-adopted measures in the literature (e.g., https://pubsonline.informs.org/doi/10.1287/opre.2024.1233).
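
    A minimal sketch of such a mechanical check, under stated assumptions: the result format (a dict with an "objective" value and a "decision" mapping of named values), the function names, and the tolerances are all hypothetical choices for illustration, not the benchmark's actual evaluation code.

```python
import math

def values_match(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two numbers within a tolerance (tolerances are assumed values)."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)

def is_correct(candidate, reference, rel_tol=1e-6, abs_tol=1e-9):
    """Return True when objective value and every decision value match.

    Both arguments are hypothetical result dicts of the form
    {"objective": float, "decision": {name: float}}.
    """
    if not values_match(candidate["objective"], reference["objective"],
                        rel_tol, abs_tol):
        return False
    cand_dec = candidate["decision"]
    ref_dec = reference["decision"]
    # The two results must report the same named decision quantities.
    if cand_dec.keys() != ref_dec.keys():
        return False
    return all(values_match(cand_dec[k], ref_dec[k], rel_tol, abs_tol)
               for k in ref_dec)

reference = {"objective": 4.0, "decision": {"x": 4.0, "y": 0.0}}
candidate = {"objective": 4.0000000001, "decision": {"x": 4.0, "y": 0.0}}
print(is_correct(candidate, reference))
```

    Comparing with a tolerance rather than exact equality is a design choice here, since solver output is floating point.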

    Thanks!

    Ruihao



    ------------------------------
    Ruihao Zhu
    Assistant Professor
    Cornell University
    Ithaca NY
    ------------------------------



  • 5.  RE: Call for Contributions - A Community Benchmark for LLMs on Optimization Modeling

    Posted 8 days ago

    > That is, we shall compare the LLM-generated model's optimal decision and objective value against the ground-truth model's. If they match, we deem it correct; otherwise, it is incorrect.

    I would challenge this evaluation process.  Here are a few reasons:

    1. A problem may have more than one "optimal" solution.  So you can't compare the "optimal decision" of one model versus another.  You could compare objective function values, but that might not work either (see (3)).
    2. Two models may use very different decision variables, so you have to define a common output format to make that comparison.
    3. In practice, many models (mixed integer programs) are not solved to provable optimality.  You might have a goal of "provide the best solution in 5 minutes of time".  Different models will give different results under this type of goal.  And it becomes a challenge to compare a model that finds its "best" solution in 2 minutes (with no improvement in the remaining 3 minutes) against a model whose best solution is 1% better but takes the full 5 minutes to find.
    4. If you do have 2 models, and you solve both of them with one solver, and both of them with a second solver, you might get very different results.
    5. IMHO, "correctness" of a model should be based on whether it achieves the goal of solving the right business problem.  That is something that is subjective, not objective.  I have an example that I am using in talks that illustrates that I don't know the "correct" model without having a discussion with the owners of the business problem where I would need to discuss tradeoffs of how they view an "optimal" decision.  See https://www.youtube.com/watch?v=ql0FCOnyJDg at around the 35 minute mark for this example.
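
    A minimal sketch of point (1), using a hypothetical toy LP chosen only for illustration: maximize x + y subject to x + y <= 4 and x, y >= 0. The vertex list is written out by hand rather than produced by a solver, and all names are assumptions.

```python
def objective(x, y):
    """Objective of the toy LP: maximize x + y."""
    return x + y

def feasible(x, y):
    """Constraints of the toy LP: x + y <= 4, x >= 0, y >= 0."""
    return x >= 0 and y >= 0 and x + y <= 4

# Vertices of the feasible region, enumerated by hand for this toy problem.
vertices = [(0, 0), (4, 0), (0, 4)]

best = max(objective(x, y) for x, y in vertices if feasible(x, y))
optimal_vertices = [v for v in vertices
                    if feasible(*v) and objective(*v) == best]

print(best)              # 4
print(optimal_vertices)  # [(4, 0), (0, 4)]
```

    Both (4, 0) and (0, 4) attain the same objective value of 4, so two equally valid models could report different "optimal decisions" while agreeing on the objective.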

    > Of course, this is by no means the best definition, but it turns out to be one of the well-adopted measures in the literature (e.g., https://pubsonline.informs.org/doi/10.1287/opre.2024.1233).

    That link is broken (or maybe it is just an INFORMS problem at the moment).  Also, it would only be available to subscribers of the journal, so please provide another link.



    ------------------------------
    -Irv Lustig
    Optimization Principal
    Princeton Consultants
    ------------------------------