Publications | Yoichi Ishibashi

2025

EMNLP

LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents

Taro Yano, Yoichi Ishibashi, and Masafumi Oyamada

EMNLP, 2025

Abs PDF

Large Language Models (LLMs) excel across diverse tasks, with post-training methods like Supervised Fine-Tuning (SFT), Preference Learning, and Model Merging enabling effective domain and task adaptation. While outcomes can vary with data orderings or component combinations, yet manual pipeline optimization is costly and labor-intensive. Existing approaches typically rely on manual design or focus narrowly on optimizing individual components, such as data ordering or merging parameters. We propose LaMDAgent, an LLM Agent-driven framework that autonomously constructs and optimizes end-to-end post-training pipelines by exploring various model improving methods, objects, and their applied orderings based on task-based feedback. LaMDAgent achieves a 9.0-point gain in tool-use accuracy without degrading instruction-following, and identifies high-performing strategies overlooked by manual design.We further analyze the impact of data and model scaling to reduce computational costs on the exploration, finding that model size scalings introduces new challenges, whereas scaling data size enables cost-effective pipeline discovery.
arXiv

Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning

Yoichi Ishibashi, Taro Yano, and Masafumi Oyamada

arXiv, 2025

Abs PDF

Large Language Models (LLMs) have demonstrated significant improvements in reasoning capabilities through supervised fine-tuning and reinforcement learning. However, when training reasoning models, these approaches are primarily applicable to specific domains such as mathematics and programming, which imposes fundamental constraints on the breadth and scalability of training data. In contrast, continual pretraining (CPT) offers the advantage of not requiring task-specific signals. Nevertheless, how to effectively synthesize training data for reasoning and how such data affect a wide range of domains remain largely unexplored. This study provides a detailed evaluation of Reasoning CPT, a form of CPT that uses synthetic data to reconstruct the hidden thought processes underlying texts, based on the premise that texts are the result of the author’s thinking process. Specifically, we apply Reasoning CPT to Gemma2-9B using synthetic data with hidden thoughts derived from STEM and Law corpora, and compare it to standard CPT on the MMLU benchmark. Our analysis reveals that Reasoning CPT consistently improves performance across all evaluated domains. Notably, reasoning skills acquired in one domain transfer effectively to others; the performance gap with conventional methods widens as problem difficulty increases, with gains of up to 8 points on the most challenging problems. Furthermore, models trained with hidden thoughts learn to adjust the depth of their reasoning according to problem difficulty.
NAACL

Can Large Language Models Invent Algorithms to Improve Themselves?

Yoichi Ishibashi, Taro Yano, and Masafumi Oyamada

NAACL, 2025

Abs PDF

Recent advancements in automatic code generation using large language model (LLM) agent have brought us closer to the future of automated software development. However, existing single-agent approaches face limitations in generating and improving large-scale, complex codebases due to constraints in context length. To tackle this challenge, we propose Self-Organized multi-Agent framework (SoA), a novel multi-agent framework that enables the scalable and efficient generation and optimization of large-scale code. In SoA, self-organized agents operate independently to generate and modify code components while seamlessly collaborating to construct the overall codebase. A key feature of our framework is the automatic multiplication of agents based on problem complexity, allowing for dynamic scalability. This enables the overall code volume to be increased indefinitely according to the number of agents, while the amount of code managed by each agent remains constant. We evaluate SoA on the HumanEval benchmark and demonstrate that, compared to a single-agent system, each agent in SoA handles significantly less code, yet the overall generated code is substantially greater. Moreover, SoA surpasses the powerful single-agent baseline by 5% in terms of Pass@1 accuracy.

2024

arXiv

Self-Organized Agents: A LLM Multi-Agent Framework toward Ultra Large-Scale Code Generation and Optimization

Yoichi Ishibashi and Nishimura Yoshimasa

arXiv, 2024

PDF Code
NAACL

Subspace Representations for Soft Set Operations and Sentence Similarities

Yoichi Ishibashi, Sho Yokoi, Katsuhito Sudoh, and 1 more author

NAACL, 2024

Abs DOI PDF Code

In the field of natural language processing (NLP), continuous vector representations are crucial for capturing the semantic meanings of individual words. Yet, when it comes to the representations of sets of words, the conventional vector-based approaches often struggle with expressiveness and lack the essential set operations such as union, intersection, and complement. Inspired by quantum logic, we realize the representation of word sets and corresponding set operations within pre-trained word embedding spaces. By grounding our approach in the linear subspaces, we enable efficient computation of various set operations and facilitate the soft computation of membership functions within continuous spaces. Moreover, we allow for the computation of the F-score directly within word vectors, thereby establishing a direct link to the assessment of sentence similarity. In experiments with widely-used pre-trained embeddings and benchmarks, we show that our subspace-based set operations consistently outperform vector-based ones in both sentence similarity and set retrieval tasks.

2023

arXiv

Knowledge Sanitization of Large Language Models

Yoichi Ishibashi and Hidetoshi Shimodaira

arXiv, 2023

DOI PDF Code
EACL

Evaluating the Robustness of Discrete Prompts

Yoichi Ishibashi, Danushka Bollegala, Katsuhito Sudoh, and 1 more author

EACL, 2023

Abs PDF

Discrete prompts have been used for fine-tuning Pre-trained Language Models for diverse NLP tasks. In particular, automatic methods that generate discrete prompts from a small set of training instances have reported superior performance. However, a closer look at the learnt prompts reveals that they contain noisy and counter-intuitive lexical constructs that would not be encountered in manually-written prompts. This raises an important yet understudied question regarding the robustness of automatically learnt discrete prompts when used in downstream tasks. To address this question, we conduct a systematic study of the robustness of discrete prompts by applying carefully designed perturbations into an application using AutoPrompt and then measure their performance in two Natural Language Inference (NLI) datasets. Our experimental results show that although the discrete prompt-based method remains relatively robust against perturbations to NLI inputs, they are highly sensitive to other types of perturbations such as shuffling and deletion of prompt tokens. Moreover, they generalize poorly across different NLI datasets. We hope our findings will inspire future work on robust discrete prompt learning.

2021

Multilingual Machine Translation Evaluation Metrics Fine-tuned on Pseudo-Negative Examples for WMT 2021 Metrics Task

Kosuke Takahashi, Yoichi Ishibashi, Katsuhito Sudoh, and 1 more author

In Proceedings of the Sixth Conference on Machine Translation, Nov 2021

Abs PDF

This paper describes our submission to the WMT2021 shared metrics task. Our metric is operative to segment-level and system-level translations. Our belief toward a better metric is to detect a significant error that cannot be missed in the real practice cases of evaluation. For that reason, we used pseudo-negative examples in which attributes of some words are transferred to the reversed attribute words, and we build evaluation models to handle such serious mistakes of translations. We fine-tune a multilingual largely pre-trained model on the provided corpus of past years’ metric task and fine-tune again further on the synthetic negative examples that are derived from the same fine-tune corpus. From the evaluation results of the WMT21’s development corpus, fine-tuning on the pseudo-negatives using WMT15-17 and WMT18-20 metric corpus achieved a better Pearson’s correlation score than the one fine-tuned without negative examples. Our submitted models,hyp+src_hyp+ref and hyp+src_hyp+ref.negative, are the plain model using WMT18-20 and the one additionally fine-tuned on negative samples, respectively.

2020

Reflection-based Word Attribute Transfer

Yoichi Ishibashi, Katsuhito Sudoh, Koichiro Yoshino, and 1 more author

In ACL 2020: Student Research Workshop, Jul 2020

Abs DOI PDF Code

Word embeddings, which often represent such analogic relations as king - man + woman queen, can be used to change a word’s attribute, including its gender. For transferring king into queen in this analogy-based manner, we subtract a difference vector man - woman based on the knowledge that king is male. However, developing such knowledge is very costly for words and attributes. In this work, we propose a novel method for word attribute transfer based on reflection mappings without such an analogy operation. Experimental results show that our proposed method can transfer the word attributes of the given words without changing the words that do not have the target attributes.

2018

Generating Responses based on Information Visually-Induced by Text Utterance

Yoichi Ishibashi and Hisashi Miyamori

Jul 2018

PDF

2017

KSU Team’s Dialogue System at the NTCIR-13 Short Text Conversation Task 2

Yoichi Ishibashi, Sho Sugimoto, and Hisashi Miyamori

In The 13th NTCIR Conference, Evaluation of Information Access Technologies, National Center of Sciences, Tokyo, Japan, December 5-8, 2017, Jul 2017

PDF