Publications | Tianyang Liu

2025

AAAI
Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection

Jiaqi Chen*, Xiaoye Zhu*, Tianyang Liu*, Ying Chen, Xinhui Chen, Yiwen Yuan, Chak Tou Leong, Zuchao Li, Long Tang, Lei Zhang, Chenyu Yan, Guanghao Mei, and 2 more authors

AAAI, 2025

Abs arXiv Bib Blog Code demo

Large Language Models (LLMs) have revolutionized text generation, making detecting machine-generated text increasingly challenging. Although past methods have achieved good performance on detecting pure machine-generated text, those detectors have poor performance on distinguishing machine-revised text (rewriting, expansion, and polishing), which can have only minor changes from its original human prompt. As the content of text may originate from human prompts, detecting machine-revised text often involves identifying distinctive machine styles, e.g., worded favored by LLMs. However, existing methods struggle to detect machine-style phrasing hidden within the content contributed by humans. We propose the "Imitate Before Detect" (ImBD) approach, which first imitates the machine-style token distribution, and then compares the distribution of the text to be tested with the machine-style distribution to determine whether the text has been machine-revised. To this end, we introduce style preference optimization (SPO), which aligns a scoring LLM model to the preference of text styles generated by machines. The aligned scoring model is then used to calculate the style-conditional probability curvature (Style-CPC), quantifying the log probability difference between the original and conditionally sampled texts for effective detection. We conduct extensive comparisons across various scenarios, encompassing text revisions by six LLMs, four distinct text domains, and three machine revision types. Compared to existing state-of-the-art methods, our method yields a 13% increase in AUC for detecting text revised by open-source LLMs, and improves performance by 5% and 19% for detecting GPT-3.5 and GPT-4o revised text, respectively. Notably, our method surpasses the commercially trained GPT-Zero with just 1,000 samples and five minutes of SPO, demonstrating its efficiency and effectiveness.
@article{chen2025imitate, title = {Imitate Before Detect: Aligning Machine Stylistic Preference for Machine-Revised Text Detection}, author = {Chen*, Jiaqi and Zhu*, Xiaoye and Liu*, Tianyang and Chen, Ying and Chen, Xinhui and Yuan, Yiwen and Leong, Chak Tou and Li, Zuchao and Tang, Long and Zhang, Lei and Yan, Chenyu and Mei, Guanghao and Zhang, Jie and Zhang, Lefei}, journal = {AAAI}, year = {2025}, customize = {[demo] https://huggingface.co/spaces/machine-text-detection/ImBD}, url = {https://arxiv.org/abs/2412.10432}, }

2024

EMNLP (main)
Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models

Somanshu Singla*, Zhen Wang*, Tianyang Liu, Abdullah Ashfaq, Zhiting Hu, and Eric P. Xing

EMNLP, 2024

Abs arXiv Bib Code

Aligning Large Language Models (LLMs) traditionally relies on costly training processes like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). To enable alignment without these expensive tuning and annotation, we present a new tuning-free approach for self-alignment called Dynamic Rewarding with Prompt Optimization (DRPO). Our approach enables self-alignment through a search-based prompt optimization framework, allowing the model to self-improve and generate optimized prompts without additional training or human supervision. The core of DRPO leverages a dynamic rewarding mechanism to identify and rectify model-specific alignment weaknesses, enabling LLMs to adapt quickly to various alignment challenges. Empirical evaluations on eight recent LLMs, including both open- and closed-source, reveal that DRPO significantly enhances alignment performance, enabling base models to outperform their SFT/RLHF-tuned counterparts. Moreover, DRPO’s automatically optimized prompts surpass those curated by human experts, demonstrating its superior alignment capabilities. Our findings envision a highly cost-effective and adaptable solution for future alignment research to be further explored.
@article{singla2024dynamic, title = {Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models}, author = {Singla*, Somanshu and Wang*, Zhen and Liu, Tianyang and Ashfaq, Abdullah and Hu, Zhiting and Xing, Eric P.}, journal = {EMNLP}, year = {2024}, }
COLM
LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models

Shibo Hao, Yi Gu, Haotian Luo, Tianyang Liu, Xiyan Shao, Xinyuan Wang, Shuhua Xie, Haodi Ma, Adithya Samavedhi, Qiyue Gao, Zhen Wang, and Zhiting Hu

COLM, 2024

Also to appear at Large Language Model (LLM) Agents workshop at ICLR 2024

Abs arXiv Bib Code

Generating accurate step-by-step reasoning is essential for Large Language Models (LLMs) to address complex problems and enhance robustness and interpretability. Despite the flux of research on developing advanced reasoning approaches, systematically analyzing the diverse LLMs and reasoning strategies in generating reasoning chains remains a significant challenge. The difficulties stem from the lack of two key elements: (1) an automatic method for evaluating the generated reasoning chains on different tasks, and (2) a unified formalism and implementation of the diverse reasoning approaches for systematic comparison. This paper aims to close the gap: (1) We introduce AutoRace for fully automated reasoning chain evaluation. Existing metrics rely on expensive human annotations or pre-defined LLM prompts not adaptable to different tasks. In contrast, AutoRace automatically creates detailed evaluation criteria tailored for each task, and uses GPT-4 for accurate evaluation following the criteria. (2) We develop LLM Reasoners, a library for standardized modular implementation of existing and new reasoning algorithms, under a unified formulation of the search, reward, and world model components. With the new evaluation and library, (3) we conduct extensive study of different reasoning approaches (e.g., CoT, ToT, RAP). The analysis reveals interesting findings about different factors contributing to reasoning, including the reward-guidance, breadth-vs-depth in search, world model, and prompt formats, etc.
@article{hao2024llm, title = {LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models}, author = {Hao, Shibo and Gu, Yi and Luo, Haotian and Liu, Tianyang and Shao, Xiyan and Wang, Xinyuan and Xie, Shuhua and Ma, Haodi and Samavedhi, Adithya and Gao, Qiyue and Wang, Zhen and Hu, Zhiting}, journal = {COLM}, booktitle = {Conference on Language Modeling}, note = {Also to appear at Large Language Model (LLM) Agents workshop at ICLR 2024}, year = {2024}, }
preprint
StarCoder 2 and The Stack v2: The Next Generation

Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, and 54 more authors

arXiv preprint, 2024

Abs arXiv Bib Blog Code

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder- 33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.
@article{starcoder2, title = {StarCoder 2 and The Stack v2: The Next Generation}, author = {Lozhkov, Anton and Li, Raymond and Allal, Loubna Ben and Cassano, Federico and Lamy-Poirier, Joel and Tazi, Nouamane and Tang, Ao and Pykhtar, Dmytro and Liu, Jiawei and Wei, Yuxiang and Liu, Tianyang and Tian, Max and Kocetkov, Denis and Zucker, Arthur and Belkada, Younes and Wang, Zijian and Liu, Qian and Abulkhanov, Dmitry and Paul, Indraneil and Li, Zhuang and Li, Wen-Ding and Risdal, Megan and Li, Jia and Zhu, Jian and Zhuo, Terry Yue and Zheltonozhskii, Evgenii and Dade, Nii Osae Osae and Yu, Wenhao and Krauß, Lucas and Jain, Naman and Su, Yixuan and He, Xuanli and Dey, Manan and Abati, Edoardo and Chai, Yekun and Muennighoff, Niklas and Tang, Xiangru and Oblokulov, Muhtasham and Akiki, Christopher and Marone, Marc and Mou, Chenghao and Mishra, Mayank and Gu, Alex and Hui, Binyuan and Dao, Tri and Zebaze, Armel and Dehaene, Olivier and Patry, Nicolas and Xu, Canwen and McAuley, Julian and Hu, Han and Scholak, Torsten and Paquet, Sebastien and Robinson, Jennifer and Anderson, Carolyn Jane and Chapados, Nicolas and Patwary, Mostofa and Tajbakhsh, Nima and Jernite, Yacine and Ferrandis, Carlos Muñoz and Zhang, Lingming and Hughes, Sean and Wolf, Thomas and Guha, Arjun and von Werra, Leandro and de Vries, Harm}, journal = {arXiv preprint}, year = {2024}, }
NAACL
Rethinking Tabular Data Understanding of Large Language Models

Tianyang Liu, Fei Wang, and Muhao Chen

NAACL, 2024

Abs arXiv Bib Code

Large Language Models (LLMs) have shown to be capable of various tasks, yet their capability in interpreting and reasoning over tabular data remains an underexplored area. In this context, this study investigates from three core perspectives: the robustness of LLMs to structural perturbations in tables, the comparative analysis of textual and symbolic reasoning on tables, and the potential of boosting model performance through the aggregation of multiple reasoning pathways. We discover that structural variance of tables presenting the same content reveals a notable performance decline, particularly in symbolic reasoning tasks. This prompts the proposal of a method for table structure normalization. Moreover, textual reasoning slightly edges out symbolic reasoning, and a detailed error analysis reveals that each exhibits different strengths depending on the specific tasks. Notably, the aggregation of textual and symbolic reasoning pathways, bolstered by a mix self-consistency mechanism, resulted in achieving SOTA performance, with an accuracy of 73.6% on WIKITABLEQUESTIONS, representing a substantial advancement over previous existing table processing paradigms of LLMs.
@article{liu2023rethinking, title = {Rethinking Tabular Data Understanding of Large Language Models}, author = {Liu, Tianyang and Wang, Fei and Chen, Muhao}, journal = {NAACL}, booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics}, year = {2024}, }
ICLR
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Tianyang Liu, Canwen Xu, and Julian McAuley

ICLR, 2024

Abs arXiv Bib Code

Large Language Models (LLMs) have greatly advanced code auto-completion systems, with a potential for substantial productivity enhancements for developers. However, current benchmarks mainly focus on single-file tasks, leaving an assessment gap for more complex, real-world, multi-file programming scenarios. To fill this gap, we introduce RepoBench, a new benchmark specifically designed for evaluating repository-level code auto-completion systems. RepoBench consists of three interconnected evaluation tasks: RepoBench-R (Retrieval), RepoBench-C (Code Completion), and RepoBench-P (Pipeline). Each task respectively measures the system’s ability to retrieve the most relevant code snippets from other files as cross-file context, predict the next line of code with cross-file and in-file context, and handle complex tasks that require a combination of both retrieval and next-line prediction. RepoBench aims to facilitate a more complete comparison of performance and encouraging continuous improvement in auto-completion systems. RepoBench is publicly available at https://github.com/leolty/RepoBench
@article{liu2023repobench, author = {Liu, Tianyang and Xu, Canwen and McAuley, Julian}, title = {RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems}, journal = {ICLR}, booktitle = {The Twelfth International Conference on Learning Representations}, year = {2024}, }

2023

NeurIPS
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings

Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu

NeurIPS, 2023

Oral (67 out of 12345 submissions), Best Paper Award at SoCal NLP 2023

Abs arXiv Bib Code Poster

Augmenting large language models (LLMs) with external tools has emerged as a promising approach to solving complex problems. However, traditional methods, which finetune LLMs with tool demonstration data, can be both costly and restricted to a predefined set of tools. Recent in-context learning paradigm alleviates these issues, but the limited context length only allows for a few shots of demonstrations, leading to suboptimal understandings of the tools. Moreover, when there are numerous tools to choose from, in-context learning could completely fail to work. In this paper, we propose an alternative approach, ToolkenGPT, which combines the benefits of both sides. Our approach represents each tool as a ken (i.e., toolken) and learns an embedding for it, enabling tool calls in the same way as generating a regular word token. Once a toolken is triggered, the LLM is prompted to complete arguments for the tool to execute. ToolkenGPT offers the flexibility to plug in an arbitrary number of tools by expanding the set of toolkens on the fly. In addition, it improves tool use by allowing extensive demonstration data for learning the toolken embeddings. In diverse domains, including numerical reasoning, knowledge-based question answering, and embodied plan generation, our approach effectively augments LLMs with tools and substantially outperforms various latest baselines. ToolkenGPT demonstrates the promising ability to use relevant tools from a large tool set in complex scenarios.
@article{hao2023toolkengpt, title = {ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings}, author = {Hao, Shibo and Liu, Tianyang and Wang, Zhen and Hu, Zhiting}, journal = {NeurIPS}, note = {<strong style="color:#cc3333">Oral (67 out of 12345 submissions), Best Paper Award at SoCal NLP 2023</strong>}, year = {2023}, }
IST
RoseMatcher: Identifying the Impact of User Reviews on App Updates

Tianyang Liu, Chong Wang, Kun Huang, Peng Liang, Beiqi Zhang, Maya Daneva, and Marten van Sinderen

Information and Software Technology, 2023

Abs Bib

Context: The release planning of mobile apps has become an area of active research, with most studies centering on app analysis through release notes in the Apple App Store and tracking user reviews via issue trackers. However, the correlation between these release notes and user reviews in App Store remains understudied. Objective: In this paper, we introduce RoseMatcher, a novel automatic approach to match relevant user reviews with app release notes, and identify matched pairs with high confidence. Methods: We collected 944 release notes and 1,046,862 user reviews from 5 mobile apps in the Apple App Store as research data to evaluate the effectiveness and accuracy of RoseMatcher, and conducted deep content analysis on matched pairs. Results: Our evaluation shows that RoseMatcher can reach a hit ratio of 0.718 for identifying relevant matched pairs, and with the manual labeling and content analysis of 984 relevant pairs, we identify 8 roles that user reviews play in app updates according to the relationship between release notes and user reviews in the relevant matched pairs. Conclusions: Our findings indicate that both app development teams and users pay close attention to release notes and user reviews, with release notes typically addressing feature requests, bug reports, and complaints, and user reviews offering positive, negative, and constructive feedback. Overall, the study highlights the importance of the communication between app development teams and users in the release planning of mobile apps, with relevant reviews tending to be posed within a short period before and after the release of release notes, with the average time interval between the post time of release notes and user reviews being approximately one year.
@article{liu2023rosematcher, title = {RoseMatcher: Identifying the Impact of User Reviews on App Updates}, journal = {Information and Software Technology}, volume = {161}, pages = {107261}, year = {2023}, url = {https://www.sciencedirect.com/science/article/pii/S0950584923001155}, author = {Liu, Tianyang and Wang, Chong and Huang, Kun and Liang, Peng and Zhang, Beiqi and Daneva, Maya and {van Sinderen}, Marten} }
SANER
Architecture Decisions in AI-based Systems Development: An Empirical Study

Beiqi Zhang, Tianyang Liu, Peng Liang, Chong Wang, Mojtaba Shahin, and Jiaxin Yu

In Proceedings of SANER, 2023

Abs Bib

Artificial Intelligence (AI) technologies have been developed rapidly, and AI-based systems have been widely used in various application domains with opportunities and challenges. However, little is known about the architecture decisions made in AI-based systems development, which has a substantial impact on the success and sustainability of these systems. To this end, we conducted an empirical study by collecting and analyzing the data from Stack Overflow (SO) and GitHub. More specifically, we searched on SO with six sets of keywords and explored 32 AI-based projects on GitHub, and finally we collected 174 posts and 128 GitHub issues related to architecture decisions. The results show that in AI-based systems development (1) architecture decisions are expressed in six linguistic patterns, among which Solution Proposal and Information Giving are most frequently used, (2) Technology Decision, Component Decision, and Data Decision are the main types of architecture decisions made, (3) Game is the most common application domain among the eighteen application domains identified, (4) the dominant quality attribute considered in architecture decision-making is Performance, and (5) the main limitations and challenges encountered by practitioners in making architecture decisions are Design Issues and Data Issues. Our results suggest that the limitations and challenges when making architecture decisions in AI-based systems development are highly specific to the characteristics of AI-based systems and are mainly of technical nature, which need to be properly confronted.
@inproceedings{zhang2023architecture, author = {Zhang, Beiqi and Liu, Tianyang and Liang, Peng and Wang, Chong and Shahin, Mojtaba and Yu, Jiaxin}, booktitle = {Proceedings of SANER}, title = {Architecture Decisions in AI-based Systems Development: An Empirical Study}, year = {2023}, pages = {616-626}, url = {https://ieeexplore.ieee.org/abstract/document/10123549} }

2021

APSEC
The Role of User Reviews in App Updates: A Preliminary Investigation on App Release Notes

Chong Wang*, Tianyang Liu*, Peng Liang, Maya Daneva, and Marten Sinderen

In Proceedings of APSEC, 2021

Abs Bib

Release planning for mobile apps has recently become an area of active research. Prior research in this area concentrated on the analysis of release notes and on tracking user reviews to support app evolution with issue trackers. However, little is known about the impact of user reviews on the evolution of mobile apps. Our work explores the role of user reviews in app updates based on release notes. For this purpose, we collected user reviews and release notes of Spotify, the number one app in the ‘Music’ category in Apple App Store, as the research data. Then, we manually removed non-informative parts of each release note, and manually determined the relevance of the app reviews with respect to the release notes. We did this by using Word2Vec calculation techniques based on the top 80 app release notes with the highest similarities. Our empirical results show that more than 60% of the matched reviews are actually irrelevant to the corresponding release notes. When zooming in at these relevant user reviews, we found that around half of them were posted before the new release and referred to requests, suggestions, and complaints. Whereas, the other half of the relevant user reviews were posted after updating the apps and concentrated more on bug reports and praise.
@inproceedings{wang2023app, author = {Wang*, Chong and Liu*, Tianyang and Liang, Peng and Daneva, Maya and van Sinderen, Marten}, booktitle = {Proceedings of APSEC}, title = {The Role of User Reviews in App Updates: A Preliminary Investigation on App Release Notes}, year = {2021}, pages = {520-525}, url = {https://ieeexplore.ieee.org/abstract/document/9712100} }