This is a valid RSS feed.
This feed is valid, but interoperability with the widest range of feed readers could be improved by implementing the following recommendations.
[help]

<rss version="2.0" xmlns:prism="http://purl.org/rss/1.0/modules/prism/">

<managingEditor>editor@direct.mit.edu/coli</managingEditor>
^

<webMaster>webmaster@direct.mit.edu/coli</webMaster>
^

line 19, column 6: Use of unknown namespace: prism (9 occurrences) [help]

<prism:startingPage xmlns:prism="prism">1</prism:startingPage>
^

</channel>
^
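
Read together, the flagged lines point at three fixes: managingEditor and webMaster should carry real email addresses (RSS 2.0 conventionally formats these as "email (name)"; "editor@direct.mit.edu/coli" is a URL fragment, not an address), the prism prefix should be bound once on the root element rather than rebound to the unknown URI "prism" on every item element, and a channel-level atom:link with rel="self" is the usual remedy when a validator flags the closing channel tag (an assumption here; the message text for that warning did not survive). A sketch of a corrected channel header follows, with hypothetical contact addresses and a hypothetical feed URL:

<?xml version="1.0"?>
<rss version="2.0"
     xmlns:prism="http://purl.org/rss/1.0/modules/prism/"
     xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Computational Linguistics Advance Access</title>
    <link>https://direct.mit.edu/coli</link>
    <!-- Hypothetical self-reference; substitute the feed's actual URL -->
    <atom:link href="https://direct.mit.edu/coli/rss" rel="self"
               type="application/rss+xml"/>
    <!-- Hypothetical addresses in RSS 2.0 "email (name)" form -->
    <managingEditor>editor@direct.mit.edu (Managing Editor)</managingEditor>
    <webMaster>webmaster@direct.mit.edu (Webmaster)</webMaster>
    ...
  </channel>
</rss>

For reference, the feed source as validated: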
<?xml version="1.0"?>
<rss version="2.0" xmlns:prism="http://purl.org/rss/1.0/modules/prism/">
<channel>
<title>Computational Linguistics Advance Access</title>
<link>https://direct.mit.edu/coli</link>
<description>
</description>
<language>en-us</language>
<pubDate>Thu, 10 Apr 2025 00:00:00 GMT</pubDate>
<lastBuildDate>Fri, 11 Apr 2025 22:46:28 GMT</lastBuildDate>
<generator>Silverchair</generator>
<managingEditor>editor@direct.mit.edu/coli</managingEditor>
<webMaster>webmaster@direct.mit.edu/coli</webMaster>
<item>
<title>Tokenization Changes Meaning in Large Language Models: Evidence from Chinese</title>
<link>https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00557/128327/Tokenization-Changes-Meaning-in-Large-Language</link>
<pubDate>Thu, 10 Apr 2025 00:00:00 GMT</pubDate>
<description><span class="paragraphSection"><div class="boxTitle">Abstract</div>Large language models segment many words into multiple tokens, and there is mixed evidence as to whether tokenization affects how state-of-the-art models represent meanings. Chinese characters present an opportunity to investigate this issue: They contain semantic radicals, which often convey useful information; characters with the same semantic radical tend to begin with the same one or two bytes (when using UTF-8 encodings); and tokens are common strings of bytes, so characters with the same radical often begin with the same token. This study asked GPT-4, GPT-4o, and Llama 3 whether characters contain the same semantic radical, elicited semantic similarity ratings, and conducted odd-one-out tasks (i.e., which character is not like the others). In all cases, misalignment between tokens and radicals systematically corrupted representations of Chinese characters. In experiments comparing characters represented by single tokens to multi-token characters, the models were less accurate for single-token characters, which suggests that segmenting words into fewer, longer tokens obscures valuable information in word form and will not resolve the problems introduced by tokenization. In experiments with 12 European languages, misalignment between tokens and suffixes systematically corrupted categorization of words by all three models, which suggests that the tendency to treat malformed tokens like linguistic units is pervasive.</span></description>
<prism:startingPage xmlns:prism="prism">1</prism:startingPage>
<prism:endingPage xmlns:prism="prism">30</prism:endingPage>
<prism:doi xmlns:prism="prism">10.1162/coli_a_00557</prism:doi>
<guid>https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00557/128327/Tokenization-Changes-Meaning-in-Large-Language</guid>
</item>
<item>
<title>Socially Aware Language Technologies: Perspectives and Practices</title>
<link>https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00556/128186/Socially-Aware-Language-Technologies-Perspectives</link>
<pubDate>Thu, 03 Apr 2025 00:00:00 GMT</pubDate>
<description><span class="paragraphSection"><div class="boxTitle">Abstract</div>Language technologies have advanced substantially, particularly with the introduction of large language models. However, these advancements can exacerbate several issues that models have traditionally faced, including bias, evaluation, and risk. In this perspective piece, we argue that many of these issues share a common core: a lack of awareness of the social factors, interactions, and implications of the social environment in which NLP operates. We call this <strong>social awareness</strong>. While NLP is improving at addressing linguistic issues, there has been relatively limited progress in incorporating social awareness into models to work in all situations for all users. Integrating social awareness into NLP will improve the naturalness, usefulness, and safety of applications while also opening up new applications. Today, we are only at the start of a new, important era in the field.</span></description>
<prism:startingPage xmlns:prism="prism">1</prism:startingPage>
<prism:endingPage xmlns:prism="prism">15</prism:endingPage>
<prism:doi xmlns:prism="prism">10.1162/coli_a_00556</prism:doi>
<guid>https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00556/128186/Socially-Aware-Language-Technologies-Perspectives</guid>
</item>
<item>
<title>Graded Suspiciousness of Adversarial Texts to Humans</title>
<link>https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00555/128185/Graded-Suspiciousness-of-Adversarial-Texts-to</link>
<pubDate>Thu, 03 Apr 2025 00:00:00 GMT</pubDate>
<description><span class="paragraphSection"><div class="boxTitle">Abstract</div>Adversarial examples pose a significant challenge to deep neural networks across both image and text domains, with the intent to degrade model performance through carefully altered inputs. Adversarial texts, however, are distinct from adversarial images due to their requirement for semantic similarity and the discrete nature of the textual contents. This study delves into the concept of human suspiciousness, a quality distinct from the traditional focus on imperceptibility found in image-based adversarial examples, where adversarial changes are often desired to be indistinguishable to the human eye even when placed side by side with originals. Although this is generally not possible with text, textual adversarial content must still often remain undetected or non-suspicious to human readers. Even when the text’s purpose is to deceive NLP systems or bypass filters, the text is often expected to be natural to read. In this research, we expand the study of human suspiciousness by analyzing how individuals perceive adversarial texts. We gather and publish a novel dataset of Likert-scale human evaluations on the suspiciousness of adversarial sentences, crafted by four widely used adversarial attack methods and assess their correlation with the human ability to detect machine-generated alterations. Additionally, we develop a regression-based model to predict levels of suspiciousness and establish a baseline for future research in reducing the suspiciousness in adversarial text generation. We also demonstrate how the regressor-generated suspicious scores can be incorporated into adversarial generation methods to produce texts that are less likely to be perceived as computer-generated.</span></description>
<prism:startingPage xmlns:prism="prism">1</prism:startingPage>
<prism:endingPage xmlns:prism="prism">34</prism:endingPage>
<prism:doi xmlns:prism="prism">10.1162/coli_a_00555</prism:doi>
<guid>https://direct.mit.edu/coli/article/doi/10.1162/coli_a_00555/128185/Graded-Suspiciousness-of-Adversarial-Texts-to</guid>
</item>
</channel>
</rss>
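
The nine-occurrence warning at line 19 is mechanical: each prism:startingPage, prism:endingPage, and prism:doi element (three per item, three items) carries its own xmlns:prism="prism", which rebinds the prefix to the literal string "prism" instead of the PRISM module URI declared on the rss element. Deleting the local declarations lets the elements inherit the root binding; for the first item's metadata, the corrected lines would read:

<!-- Inherits xmlns:prism="http://purl.org/rss/1.0/modules/prism/" from <rss> -->
<prism:startingPage>1</prism:startingPage>
<prism:endingPage>30</prism:endingPage>
<prism:doi>10.1162/coli_a_00557</prism:doi>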