Fortnightly Digest 17 February 2025
A week ago the Paris AI Summit was held, putting AI, its safety, and its security in the headlines A LOT.
Global AI governance remains deeply fragmented, as seen in the US and UK’s refusal to sign the Paris AI Summit’s declaration for inclusive and sustainable AI. While some nations push for regulatory oversight, others prioritise innovation with minimal restrictions, reflecting broader tensions in AI policy. The withdrawal of the EU’s AI Liability Directive and the FTC’s crackdown on misleading AI claims highlight the regulatory uncertainty surrounding AI accountability.
The UK’s AI Safety Institute has now also been renamed the AI Security Institute, reflecting the importance of this topic (a move we are delighted to see, and one about which we are trying not to say "we told you so").
Meanwhile, AI security threats are evolving rapidly, with adversaries exploiting weaknesses in machine learning models and supply chains. Malicious AI models on Hugging Face, NVIDIA container toolkit vulnerabilities, and AI-generated hallucinations in software development all reveal systemic risks. Attacks are becoming more sophisticated, with researchers uncovering new adversarial techniques like token smuggling and agentic AI-powered phishing. The inadequacy of existing AI security measures is further underscored by DEF CON’s criticism of AI red teaming and calls for a standardised AI vulnerability disclosure system akin to cybersecurity’s CVE framework.
Despite these challenges, promising advancements in AI security research are emerging. Anthropic’s Constitutional Classifiers offer a structured approach to preventing universal jailbreaks, while FLAME proposes a shift towards output moderation for AI safety. New governance audits, like the metric-driven security analysis of AI standards, provide insight into regulatory gaps and the need for stronger technical controls.
Fortnightly Digest - 17 February 2025
Welcome to the third edition of the Mileva Security Labs AI Security Fortnightly Digest! As the digest is still (relatively) new, if you haven’t yet read the introduction, check it out here.
The digest article contains individual reports that are labelled as follows:
[Vulnerability]: Notifications of vulnerabilities, including CVEs and other critical security flaws, complete with risk assessments and actionable mitigation steps.
[News]: Relevant AI-related news, accompanied by commentary on the broader implications.
[Research]: Summaries of peer-reviewed research from academia and industry, with relevance to AI security, information warfare, and cybersecurity. Limitations of the studies and key implications will also be highlighted.
Additionally, reports will feature theme tags such as [Safety] and [Cyber], providing context and categorisation to help you quickly navigate the topics that matter most.
We focus on external security threats to AI systems and models; however, we may sometimes include other reports where relevant to the AI security community.
We are also proud supporters of The AI Security Podcast; check it out here.
We are so grateful you have decided to be part of our AI security community, and we want to hear from you! Get in touch at contact@mileva.com.au with feedback, questions and comments.
Now... onto the article!
[News] [Governance] [Safety] Feb 11: US and UK Decline to Sign Global AI Agreement at Paris Summit
At the recent AI Action Summit in Paris, both the United States and the United Kingdom chose not to endorse a multinational declaration advocating for inclusive and sustainable artificial intelligence development. Find the BBC coverage of the decision here: https://www.bbc.com/news/articles/c8edn0n58gwo
TLDR:
The recent AI Action Summit, held in Paris from February 10 to 11, 2025, convened global leaders to discuss the future of AI. A fundamental outcome was a declaration promoting inclusive and sustainable AI practices, endorsed by 60 nations, including France, China, and India. The United States and the United Kingdom, however, abstained from signing this declaration.
In his remarks, U.S. Vice President JD Vance set out the administration's stance against excessive AI regulation, citing the importance of fostering innovation and maintaining a competitive edge in AI development. He warned that stringent regulations could hinder technological progress and economic growth, echoing President Donald Trump's revocation of former President Joe Biden's Executive Order 14110, which had established safety guidelines for AI development and deployment. The rescission reflects a policy shift towards reducing federal oversight to accelerate AI innovation.
The UK government similarly refrained from signing the declaration, noting concerns that it did not align with their policies on balancing AI opportunities with security considerations. A spokesperson for Prime Minister Keir Starmer stated that the declaration lacked practical clarity on global governance and did not sufficiently address national security challenges posed by AI.
Other nations, including France and China, endorsed the declaration and signalled their commitment to collaborative efforts in establishing ethical AI frameworks. French President Emmanuel Macron emphasised the need for balanced regulation that ensures safety while promoting innovation.
Mileva’s Note: It's becoming increasingly clear that when nations and organisations say, "We need AI to move forward", they're speaking different languages. For some, progress means prioritising user safety and, at the extreme, safeguarding humanity by securing AI systems. For others, it's about pushing innovation at any and all costs.
Though the emphasis on rapid innovation and minimal regulation may benefit short-term competitiveness, the potential neglect of ethical considerations will have downstream impact. A middle ground is possible but often hard to see in international relations settings where extreme viewpoints clash. Over-securing AI to the point of hindering usability isn't the goal, but forgoing regulations won't achieve the opposite either. Achieving international consensus seems unlikely, but let's hope global leaders wake up to the dangers before it's too late.
[Vulnerability] [Supply Chain] Feb 5: Malicious AI Model Found on Hugging Face
ReversingLabs researchers have identified machine learning models hosted on Hugging Face that were backdoored with malware. The models exploit the Pickle serialisation format, a widely used but inherently insecure method of storing ML models. The full report can be found here: https://www.reversinglabs.com/blog/rl-identifies-malware-ml-model-hosted-on-hugging-face.
TLDR:
ReversingLabs researchers discovered two Hugging Face-hosted models containing malicious Pickle payloads that executed reverse shells upon deserialisation. These models bypassed Hugging Face’s security scanning mechanisms and remained undetected until manual analysis.
The malicious payload was embedded within PyTorch model files, which use Pickle serialisation (the execution mechanism is illustrated in the sketch after this list).
These models were compressed in 7z format instead of PyTorch’s default ZIP format, evading detection by Hugging Face’s Picklescan security tool.
Upon loading, the Pickle deserialisation process executed a Python payload that opened a reverse shell to a hardcoded IP address (107.173.7.141).
Hugging Face’s security tools failed to detect the threat because they rely on blacklists of known dangerous functions, which can be easily bypassed.
The malicious models were removed within 24 hours of ReversingLabs’ responsible disclosure, but the attack method remains a security risk.
This attack does not rely on traditional supply chain methods like typosquatting but instead plants malware directly inside AI models, leveraging the assumption that ML models are inherently safe to use if hosted on trusted platforms like Hugging Face.
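To make the mechanism concrete, here is a minimal sketch of how Pickle's `__reduce__` hook hands control to attacker-chosen code the moment a file is deserialised. The payload below is a harmless stand-in (it only prints a message); the models found in the wild returned a call that opened a reverse shell instead.

```python
import pickle

# Pickle lets any object define __reduce__, which returns a callable plus
# arguments to invoke during deserialisation. A Pickle-based model file
# (e.g. a torch.save()'d checkpoint) can therefore carry executable code.
class BenignStandIn:
    def __reduce__(self):
        # The malicious models returned something like (os.system, ("<shell cmd>",));
        # this illustrative version stays harmless.
        return (print, ("Arbitrary code executed during unpickling",))

blob = pickle.dumps(BenignStandIn())

# Merely *loading* the bytes triggers the callable: no attribute access or
# inference call on the "model" is needed.
pickle.loads(blob)
```

Because execution happens inside `pickle.loads` itself, scanners that blacklist known-dangerous callables can be sidestepped with less obvious functions, and in this case the 7z wrapping meant Picklescan never inspected the stream at all.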
Mileva’s Note: Trust in public ML repositories is a security weakness, as many developers currently load models without verifying their contents. But how do you verify contents when there are so many potential points of interference, and should that responsibility fall to the end user?
This is not an isolated incident. Adversaries will continue to weaponise ML models, their dependencies, and data collection processes as new supply chain exploit vectors, and these attacks will have dramatic downstream impacts.
[News] [Regulatory] Feb 14: DEF CON's 'Hackers' Almanack' Calls Out AI Red Teaming as 'Bullshit'
Leading cyber security researchers from DEF CON, the world's largest hacker conference, have warned about the inadequacies of current AI security measures. In their inaugural "Hackers' Almanack" report, produced in collaboration with the University of Chicago's Cyber Policy Initiative, they argue that existing methods, particularly "red teaming," are insufficient and call for a comprehensive overhaul. Read “THE DEF CON 32 HACKERS’ ALMANACK” here: https://harris.uchicago.edu/sites/default/files/the_def_con_32_hackers_almanack.pdf
TLDR:
The "Hackers' Almanack" from DEF CON 32 challenges the efficacy of AI red teaming, where security professionals test AI systems for vulnerabilities, stating that this approach alone cannot address evolving AI threats comprehensively. Sven Cattell, leader of DEF CON's AI Village, points out that public red teaming is affected by fragmented documentation and inadequate evaluations, making it ineffective for thorough security assessments.
Backing these concerns, nearly 500 participants tested AI models at DEF CON 32, with even newcomers successfully identifying vulnerabilities. The ease with which flaws were found shows just how accessible weaknesses in AI systems are. The report advocates for adopting standardised methods akin to the Common Vulnerabilities and Exposures (CVE) system used in cyber security since 1999. Such a system would provide a unified approach to documenting and addressing AI vulnerabilities, moving beyond sporadic security audits.
Mileva's Note: Relying solely on red teaming is like patching holes in a sinking ship without addressing the underlying design flaws causing the leaks. Overfitting models to address security limitations can lead to decreased performance and unintended consequences that then themselves (you guessed it) require patches. Sure, there will never be a singular, 100% fix due to AI being probabilistic and inherently erroneous in nature, but new architectures may be able to address foundational vulnerabilities rather than just surface-level issues.
The call for a standardised vulnerability disclosure framework is a sentiment we have been echoing at Mileva; it's about building a resilient foundation for AI systems and providing the AI security community with a unified understanding of vulnerabilities as they emerge, facilitating quicker and more coordinated responses. The CVE system has provided the cyber security community with such a framework, and applying a similar approach to AI could bridge the gap between identifying flaws and communicating them accurately. The AI security community has much to learn and adopt from cyber security practices.
What we also feel is missing? Something akin to the Common Vulnerability Scoring System (CVSS), where the severity and impact of AI vulnerabilities can be assessed on a consistent scale. Too often, direct jailbreaks take the stage, when only the immediate user who launched the attack is affected. Cool, you got an AI to swear at you. Let's advocate for an accurate representation of risk so that prioritisation can be performed accurately.
[Research] [Academia] [Governance] Feb 14: A Metric-Driven Security Analysis of Gaps in Current AI Standards
This is a summary of research by Keerthana Madhavan et al., titled "Quantifying Security Vulnerabilities: A Metric-Driven Security Analysis of Gaps in Current AI Standards," released on February 14, 2025. The paper is accessible at: https://arxiv.org/pdf/2502.08610.
TLDR:
Madhavan and colleagues conduct a detailed security audit of three prominent AI governance frameworks: NIST AI RMF 1.0, the UK's AI and Data Protection Risk Toolkit, and the EU's ALTAI. They introduce four novel metrics to quantify security concerns: Risk Severity Index (RSI), Root Cause Vulnerability Score (RCVS), Attack Vector Potential Index (AVPI), and Compliance-Security Gap Percentage (CSGP). Their analysis identifies 136 distinct security concerns across these frameworks and reveals significant gaps between compliance and actual security measures.
How it Works:
Madhavan et al. conducted a systematic evaluation of the three selected AI governance frameworks through a combination of qualitative audits and quantitative risk assessments. They performed a detailed review of each framework, analysing its security provisions and identifying vulnerabilities linked to ambiguous guidelines, under-defined processes, and weak enforcement mechanisms. A panel of industry experts was consulted to validate these findings and ensure their practical relevance.
To assess the severity and potential impact of the identified vulnerabilities, the researchers introduced a set of quantitative metrics (an illustrative calculation follows the definitions below):
The Risk Severity Index (RSI) measures the seriousness of each identified security concern based on its potential impact and likelihood of exploitation. Higher RSI scores indicate vulnerabilities that pose significant risks if left unaddressed. This metric provides a structured way to prioritise security threats within governance frameworks.
The Root Cause Vulnerability Score (RCVS) evaluates the extent to which fundamental weaknesses contribute to broader security risks. By identifying systemic flaws in governance policies, this metric helps identify areas requiring structural improvements. A high RCVS suggests that security gaps stem from foundational deficiencies rather than isolated oversights.
The Attack Vector Potential Index (AVPI) quantifies the likelihood that unresolved vulnerabilities within a framework could be exploited through common attack vectors. It incorporates factors such as the accessibility of the weakness, the sophistication required to exploit it, and the expected impact of a successful attack. A high AVPI indicates that the framework leaves significant openings for adversaries.
The Compliance-Security Gap Percentage (CSGP) measures the proportion of high-risk security concerns that remain unaddressed within each governance framework. A higher CSGP suggests a greater disparity between regulatory compliance and actual security resilience. This metric highlights how well governance standards translate into practical safeguards.
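To make the metrics tangible, the sketch below applies two of them to a handful of invented concerns. The paper's exact formulas are not reproduced here, so the calculations (impact weighted by likelihood for RSI, and the unresolved share of high-risk concerns for CSGP) are simplifying assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Concern:
    name: str
    impact: float       # assumed 0-1 scale
    likelihood: float   # assumed 0-1 scale
    high_risk: bool
    addressed: bool     # does the framework prescribe a concrete control?

def risk_severity_index(c: Concern) -> float:
    # Assumed simplification: severity as impact weighted by likelihood.
    return c.impact * c.likelihood

def compliance_security_gap(concerns: list[Concern]) -> float:
    # CSGP as described above: share of high-risk concerns left unaddressed.
    high = [c for c in concerns if c.high_risk]
    unresolved = [c for c in high if not c.addressed]
    return 100 * len(unresolved) / len(high) if high else 0.0

audit = [  # invented example concerns, not taken from the paper
    Concern("Ambiguous model-provenance guidance", 0.8, 0.7, True, False),
    Concern("No adversarial-testing requirement", 0.9, 0.6, True, False),
    Concern("Weak incident-reporting process", 0.6, 0.5, True, True),
    Concern("Vague documentation expectations", 0.4, 0.4, False, False),
]

for c in audit:
    print(f"{c.name}: RSI={risk_severity_index(c):.2f}")
print(f"CSGP = {compliance_security_gap(audit):.1f}%")  # 2 of 3 high-risk unresolved -> 66.7%
```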
Applying these metrics to their audit, the study revealed concerning gaps in AI security governance:
The NIST AI RMF 1.0 was found to leave 69.23% of identified risks unaddressed.
The EU’s ALTAI exhibited the highest vulnerability to potential attack vectors, as indicated by its AVPI score of 0.51, making it particularly susceptible to exploitation.
The UK ICO's AI and Data Protection Risk Toolkit demonstrated the most pronounced compliance-security gap, with 80% of its high-risk concerns remaining unresolved.
Implications:
The findings of the paper suggest that organisations relying solely on a singular governance framework for AI security may be exposed to significant risks. Evidently, there is a need for more technical, specific, and enforceable security controls within AI compliance standards to lessen the divide between compliance and actual security and effectively mitigate risk.
Beyond improving security standards, there must also be a standardised way to communicate AI risks, similar to the Common Vulnerabilities and Exposures (CVE) framework in cyber security. A system for categorising and tracking AI risks would provide a structured means for security professionals to communicate threats, fostering more coordinated and transparent responses across the industry.
Limitations:
The study is limited by its focus on only three standards, which may not fully represent the broader regulatory landscape. Additionally, the expert validation process relied on a small sample of four industry experts. Given this constraint, the researchers prioritised validation of high-risk concerns rather than conducting full-spectrum validation of all identified vulnerabilities, meaning certain areas did not receive direct review. Finally, the audit process was conducted in isolation, assuming perfect implementation of each standard, which does not reflect the complexities of real-world deployments where multiple security controls interact.
Future research should expand the scope of analysis to include additional governance frameworks, a broader panel of experts, and cross-industry comparative studies to enhance the robustness of security evaluations across a variety of contexts and systems.
Mileva’s Note: Just as there is in cyber, in AI there is a disconnect between compliance-driven security measures and actual risk mitigation. The idea that adhering to these frameworks equates to AI security is misguided. Compliance does not mean resilience. Governance frameworks must align more closely with the technical controls required to secure AI systems effectively. Without enforceable security measures built into these standards, compliance will remain a checkbox exercise rather than a meaningful and efficacious risk mitigation strategy.
Additionally, the lack of a standardised risk assessment format for AI vulnerabilities makes it difficult to accurately prioritise risks across different domains and create effective governance frameworks. Establishing a uniform approach to AI risk evaluation would improve the ability of organisations to systematically address the most critical threats first, reducing the reliance on ad-hoc or inconsistent security assessments.
[News] [Cyber] Feb 4: Emergence of 'Agentic' AI Poses New Ransom Threats
A recent piece by Malwarebytes reflecting on their 2025 State of Malware report highlights the potential for 'agentic' AI systems to autonomously conduct sophisticated cyber attacks, including personalised ransom schemes. Read the original article here: https://www.malwarebytes.com/blog/news/2025/02/new-ai-agents-could-hold-people-for-ransom-in-2025
TLDR:
Malwarebytes' 2025 State of Malware report warns of a paradigm shift in cyber threats with the anticipated rise of 'agentic' AI; autonomous systems capable of independent decision-making and actions. Unlike current generative AI models that assist users in creating content, agentic AI can initiate tasks without human intervention.
These AI agents could autonomously search large datasets for personal information, craft convincing phishing emails, and engage in real-time conversations with victims. For example, by cross-referencing stolen data with publicly available information, an AI agent could impersonate a trusted contact, making fraudulent communications more convincing and increasing the likelihood of successful extortion.
The report suggests several potential attack vectors:
Data Exploitation: AI agents could analyse large volumes of leaked data to identify and target individuals with tailored ransom demands.
Social Engineering: By mimicking writing styles and gauging contextual nuances, AI could generate highly convincing messages, reducing the traditional red flags associated with phishing attempts (such as grammatical or language errors, or robotic phrasing).
Scalability: Autonomous agents can operate continuously, targeting multiple individuals or organisations simultaneously without the limitations of human hackers.
Mileva’s Note: Agentic AI is very real, and very capable. On the cryptocurrency side of X (formerly Twitter), AI agents like Zerebro and AIXBT have been launching their own smart contracts, generating and deploying cryptocurrency tokens, autonomously engaging with users to market these assets, and even manipulating on-chain transactions for price control. Some AI-driven entities have expanded into content creation, releasing music on Spotify to generate revenue while integrating with decentralised finance (DeFi) systems.
AI Agents will be used in diverse contexts, and as always, adversaries will adopt the technology for their own ends. Security teams must anticipate these developments by investing in advanced threat detection systems that can identify and mitigate AI-driven attacks. This includes improving behavioural analysis protocols to detect anomalies indicative of autonomous agent activity.
[Vulnerability] [CVE]: CVE-2025-23359 - High-Severity TOCTOU Vulnerability in NVIDIA Container Toolkit
NVIDIA has disclosed a high-severity Time-of-Check Time-of-Use (TOCTOU) vulnerability in its Container Toolkit for Linux, identified as CVE-2025-23359. This flaw allows a crafted container image to gain unauthorised access to the host filesystem, potentially leading to code execution, denial of service, privilege escalation, information disclosure, and data tampering. Full details are available at: https://nvd.nist.gov/vuln/detail/CVE-2025-23359.
TLDR:
NVIDIA Container Toolkit for Linux contains a Time-of-Check Time-of-Use (TOCTOU) vulnerability when used with default configuration, where a crafted container image could gain access to the host file system. A successful exploit of this vulnerability might lead to code execution, denial of service, escalation of privileges, information disclosure, and data tampering.
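For readers less familiar with the bug class, the sketch below illustrates the generic TOCTOU pattern: a property is checked at one moment and relied upon at a later one, leaving a race window in which an attacker can swap the underlying resource. It is a generic illustration only, not NVIDIA's container-runtime code.

```python
import os

def copy_if_not_symlink(path: str, dest: str) -> None:
    # Time of check: confirm the path is not a symlink.
    if os.path.islink(path):
        raise ValueError("refusing to follow symlinks")

    # Race window: an attacker who controls `path` can now replace the regular
    # file with a symlink pointing at a sensitive host file.

    # Time of use: the check may no longer hold, so this open() can follow the
    # attacker's symlink and expose data the caller never intended to touch.
    with open(path, "rb") as src, open(dest, "wb") as dst:
        dst.write(src.read())

# A common mitigation is to check and use the *same* object, e.g. open the
# file with os.open(path, os.O_RDONLY | os.O_NOFOLLOW) and then operate on
# the returned file descriptor, removing the window entirely.
```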
Affected Resources:
Affected Versions: NVIDIA Container Toolkit for Linux (specific versions not detailed in the available information).
Access Level: Exploitation requires a crafted container image; no prior authentication is necessary.
Risk Rating:
Severity: High (CVSS Base Score: 8.3)
Impact: High on Confidentiality, Integrity, and Availability.
Exploitability: High (Network attack vector, high attack complexity, no privileges required, user interaction required, scope changed).
Recommendations:
NVIDIA has released updates to address CVE-2025-23359. Users are strongly advised to:
Apply the latest patches provided by NVIDIA to mitigate this vulnerability.
Ensure that container configurations follow security best practices to minimise potential exploitation vectors.
Implement monitoring to detect any unusual activity that could indicate exploitation attempts.
Find the full security bulletin here: https://nvidia.custhelp.com/app/answers/detail/a_id/5616/~/security-bulletin%3A-nvidia-container-toolkit---11-february-2025
[Vulnerability] [CVE]: CVE-2024-0132 - NVIDIA Container Toolkit Exploit Enables Host System Compromise
In September 2024, Wiz Research disclosed a critical vulnerability (CVE-2024-0132) in the NVIDIA Container Toolkit, which provides GPU support for containerised AI applications. This flaw allows attackers to escape container isolation and gain full access to the host system. The newly released analysis is available here: https://www.wiz.io/blog/nvidia-ai-vulnerability-deep-dive-cve-2024-0132.
TLDR:
The NVIDIA Container Toolkit facilitates GPU resource access for containerised AI applications. A critical Time-of-Check to Time-of-Use (TOCTOU) vulnerability (CVE-2024-0132) was identified in versions 1.16.1 and earlier. This flaw permits a malicious container image to mount the host's root filesystem, providing unrestricted access to host files. By accessing the host's container runtime Unix sockets, such as docker.sock, an attacker can launch privileged containers, leading to full host compromise. The vulnerability affects environments using the default configuration of the toolkit and does not impact setups utilising the Container Device Interface (CDI).
NVIDIA has since released version 1.17.4 of the Container Toolkit, addressing this vulnerability and an additional bypass tracked as CVE-2025-23359. Users are strongly advised to update to the latest version to mitigate these risks.
[News] [Regulatory] Feb 11: FTC Finalises Order Against DoNotPay for Deceptive 'AI Lawyer' Claims
The Federal Trade Commission (FTC) has finalised an order requiring DoNotPay, a company that promoted its online subscription service as “the world’s first robot lawyer,” to cease making deceptive claims about the capabilities of its AI chatbot. The FTC notice can be accessed here: https://www.ftc.gov/news-events/news/press-releases/2025/02/ftc-finalizes-order-donotpay-prohibits-deceptive-ai-lawyer-claims-imposes-monetary-relief-requires.
TLDR:
On February 11, 2025, the FTC concluded its action against DoNotPay, mandating the company to halt deceptive advertising practices, provide monetary relief, and notify past subscribers about the limitations of its AI-driven legal services.
The FTC's complaint, initially announced in September 2024, alleged that DoNotPay's "robot lawyer" failed to meet its advertised claims of substituting human legal expertise. The company did not test whether its AI operated at the level of a human lawyer when generating legal documents and providing advice, nor did it employ attorneys to assess the quality and accuracy of its service's law-related features.
As part of the final order, DoNotPay is required to pay $193,000 in monetary relief and notify consumers who subscribed between 2021 and 2023 about the FTC settlement. The order also prohibits DoNotPay from advertising that its service performs like a real lawyer unless it has sufficient evidence to back it up.
Mileva’s Note: Beware the buzzword. The allure of marketing AI as a panacea should not come without rigorous capability testing and validation, especially in fields as consequential as legal services. Companies venturing into the AI space must operate within ethical and legal boundaries, and accurately advertise their product's capabilities. The FTC's action against DoNotPay shows that overhyping unproven AI solutions isn’t just deceptive, it’s illegal.
[TTP] [Experimental] Feb 12: Token Smuggling - Weaponising Emoji Encodings for AI Exploits
Vulnerability researchers have discovered a novel method for embedding arbitrary byte streams within Unicode variation selectors inside emoji characters, leading to excessive tokenisation, potential jailbreaks, and AI resource exhaustion. This technique is being actively explored for adversarial attacks, with early discussions and trials documented in Karpathy’s post on X: https://x.com/karpathy/status/1889726293010423836
TLDR:
This attack manipulates Unicode variation selectors to embed hidden byte sequences within emoji characters, creating multiple adversarial use cases:
Token Bombs: Single-character emoji sequences can expand into 50+ tokens, bloating AI inputs and potentially crashing or degrading inference services.
Adversarial Embeddings: Attackers can encode jailbreak instructions within emoji sequences, which may alter model behaviour, enabling prompt injections or circumvention of safety protocols.
Compute Exhaustion: By artificially inflating token counts, adversaries can force excessive computation on AI services, leading to denial-of-service conditions similar to an AI-targeted DDoS attack.
Training Data Poisoning: If AI models ingest these encoded sequences during training, they may learn to decode hidden instructions without explicit hints, creating a persistent security risk.
This technique remains in the experimental phase, with partial success in adversarial testing. As of the time of writing this digest, no fully functional proof-of-concept has been demonstrated.
How it Works:
Let’s dive deeper into each of the attack vectors being explored:
Exploiting Unicode Variation Selectors for Token Expansion
Unicode variation selectors (U+FE00–U+FE0F, plus the supplementary range U+E0100–U+E01EF) allow subtle modifications to characters and are normally used for font styling or emoji variations. Vulnerability researchers have demonstrated that by appending a long sequence of these selectors, they can smuggle arbitrary byte sequences into what renders as a single emoji, dramatically increasing token counts in AI models.
For example, a manipulated character like ‘😀󠅧󠅕󠄐󠅑󠅢󠅕󠄐󠅓󠅟󠅟󠅛󠅕󠅔’ expands to 53 tokens in OpenAI’s cl100k_base tokeniser. Some models fail to optimise or collapse these sequences, leading to disproportionate computational costs per input token.
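A minimal sketch of the encoding trick is shown below. It assumes one plausible mapping of bytes onto the 256 Unicode variation selectors (VS1–VS16 at U+FE00–U+FE0F plus VS17–VS256 at U+E0100–U+E01EF); the demonstrations circulating online may use a different mapping, and exact token counts will vary by tokeniser.

```python
import tiktoken  # pip install tiktoken

# Map each byte to one of the 256 variation selectors and back.
def byte_to_vs(b: int) -> str:
    return chr(0xFE00 + b) if b < 16 else chr(0xE0100 + (b - 16))

def vs_to_byte(ch: str) -> int:
    cp = ord(ch)
    return cp - 0xFE00 if 0xFE00 <= cp <= 0xFE0F else cp - 0xE0100 + 16

def smuggle(carrier: str, payload: bytes) -> str:
    # The selectors render invisibly: the result still looks like one emoji.
    return carrier + "".join(byte_to_vs(b) for b in payload)

def recover(text: str) -> bytes:
    return bytes(vs_to_byte(c) for c in text
                 if 0xFE00 <= ord(c) <= 0xFE0F or 0xE0100 <= ord(c) <= 0xE01EF)

enc = tiktoken.get_encoding("cl100k_base")
stuffed = smuggle("😀", b"hidden instructions go here")
print(len(stuffed), "code points, rendered as a single emoji")
print(len(enc.encode(stuffed)), "tokens seen by the model")
print(recover(stuffed))  # the hidden byte string comes straight back out
```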
Token Bombs: AI Service Disruption via Exponential Token Costs
By embedding excessive variation selectors into emoji-based payloads, attackers can inflate token counts beyond expected thresholds, quickly exceeding model context limits or forcing excessive compute usage. AI services that charge per token, such as OpenAI or Anthropic, could experience unexpected cost spikes or service degradation when processing these inputs. In some cases, models may become unresponsive or even crash when encountering extreme token inflation, making this a viable denial-of-service vector against AI-powered applications and inference APIs.
Adversarial Smuggling: Hidden Payloads for Prompt Injection & Jailbreaking
Vulnerability researchers hypothesise that by hiding jailbreak instructions within these emoji sequences, LLMs can be tricked into misinterpreting user intent. Early testing suggests that models designed to "think" more deeply, such as DeepSeek-R1 and other reasoning models, are more susceptible, as they attempt to parse the hidden structure of these encoded inputs.
These attacks, however, currently require explicit decoding hints, limiting their effectiveness. If future AI models learn this encoding through exposure in training data, they could autonomously decode and execute embedded instructions, significantly increasing the risk of exploitation.
Token Starvation Attacks: AI Resource Warfare via Compute Drain
This technique has been theorised as an AI-to-AI adversarial attack, where one model forces another into resource exhaustion by overwhelming it with token-heavy payloads.
The Null Siege Attack model outlines how this could work:
An adversarial AI injects token-bloated emoji sequences into interactions with a target AI.
The target AI wastes compute cycles expanding and processing these excessive tokens.
Over multiple interactions, the adversary depletes the victim’s compute resources, leading to degraded performance, slow response times, or outright failure.
In large-scale multi-agent AI interactions, such as automated trading, cyber security monitoring, or AI-generated content moderation, this strategy could systematically disrupt operations by exploiting AI inefficiencies at scale.
Supply Chain Attack: Persistent Vulnerability via Training Data Poisoning
If AI models ingest this encoding methodology into training data, they may begin decoding hidden messages without user intervention in future interactions. This could lead to stealthy, persistent jailbreaks embedded within text, potentially bypassing standard safety filters in the next generation of models.
Implications:
This attack technique introduces a new, highly adaptable class of AI vulnerabilities that could be weaponised in diverse ways. Unlike traditional prompt injection, which relies on overt instructions, token smuggling techniques exploit underlying tokenisation mechanics, making them more difficult to detect and mitigate.
AI-targeted DDoS is a primary concern here. Token inflation techniques could artificially increase inference costs for cloud-based AI services, impacting publicly accessible AI APIs that operate on token-based billing models. Additionally, if these encoded sequences make their way into training data, future models may learn to interpret hidden instructions autonomously, consistently bypassing traditional safety protocols.
At present, AI security filters do not block variation selector abuse, leaving a detection and mitigation gap that adversaries could exploit.
Potential countermeasures include:
Normalisation: Stripping redundant Unicode variation selectors before tokenisation (see the sketch after this list).
Token count heuristics: Flagging inputs with extreme token-per-character ratios for review.
Model tuning: Ensuring AI models ignore non-standard variation selectors during inference.
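As a sketch of the first two countermeasures, a pre-tokenisation filter might look like the following; the character ranges are the standard variation-selector blocks, while the token-per-character threshold is an illustrative guess that would need tuning per tokeniser.

```python
import re

# Basic (U+FE00-FE0F) and supplementary (U+E0100-E01EF) variation selectors.
VARIATION_SELECTORS = re.compile(r"[\uFE00-\uFE0F\U000E0100-\U000E01EF]")

def normalise(text: str) -> str:
    """Strip variation selectors before the text reaches the tokeniser."""
    return VARIATION_SELECTORS.sub("", text)

def looks_like_token_bomb(text: str, token_count: int, max_ratio: float = 5.0) -> bool:
    """Flag inputs whose token count is wildly disproportionate to their visible length."""
    visible = len(normalise(text))
    return visible > 0 and token_count / visible > max_ratio
```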
Mileva’s Note: Now this is what we are talking about! Innovative, adaptive, and widely impactful. While the nerd in me is excited, the AI Security professional is highly concerned.
Future AML tactics will likely expand beyond emojis to other multi-byte character encodings, exotic Unicode sequences, and adversarially manipulated markup languages. Think about HTML/CSS hidden elements carrying adversarial payloads. LLMs with web browsing capabilities could be tricked into reading, decoding, and acting upon maliciously encoded instructions hidden within a webpage’s metadata or styling layers.
This technique remains underexplored but highly promising as an attack vector. Ultimately, adversarial AI research should anticipate token-based adversarial exploits as a growing class of AI security risks.
[News] [Regulatory] [Safety] Feb 12: European Commission Withdraws AI Liability Directive from Consideration
The European Commission has withdrawn its proposed Artificial Intelligence Liability Directive, citing a lack of consensus among stakeholders and concerns over regulatory complexity. The decision can be found here: https://www.europarl.europa.eu/legislative-train/theme-a-europe-fit-for-the-digital-age/file-ai-liability-directive?p3373
TLDR:
Initially proposed in 2022, the AI Liability Directive aimed to harmonise non-contractual civil liability rules for damages caused by AI systems across EU member states; in practice, it would have standardised how parties are held accountable when AI systems cause harm, giving anyone injured by an AI system a clear path to compensation comparable to the protections that exist for other technologies. A core feature was a rebuttable 'presumption of causality', which would have eased the burden of proof for victims by making it easier to establish that an AI system caused their injury, simplifying the legal process.
However, the proposal faced significant opposition from various industry stakeholders who argued that the directive would introduce unnecessary regulatory burdens, especially in light of the existing AI Act which already addresses some related issues. They contended that the directive could stifle innovation and impose excessive compliance costs on businesses.
In its 2025 work program, adopted on February 11, the Commission acknowledged the lack of foreseeable agreement on the directive and indicated that it would assess alternative approaches to address AI-related liability issues. This move has drawn criticism from some policymakers, including German MEP Axel Voss, who argued that abandoning the directive undermines efforts to hold AI developers accountable for potential harms caused by their systems.
Mileva’s Note: What a turbulent week for AI governance, regulation, and safety! Between the outcomes of the AI Summit in Paris, the FTC ruling, and now the European Commission's withdrawal of the AI Liability Directive, it's clear that no region in the world can find consensus on AI oversight. While concerns about overregulation are valid, the absence of a clear liability framework leaves victims of AI-related harms navigating a legal landscape that does not yet adequately cover AI incidents. In the meantime, real people are bearing the costs of corporate and government indecision.
[Research] [Academia] Feb 13: FLAME: Flexible LLM-Assisted Moderation Engine
This is a summary of research by Ivan Bakulin et al., titled "FLAME: Flexible LLM-Assisted Moderation Engine," released on February 13, 2025. The paper is accessible at: https://arxiv.org/pdf/2502.09175
TLDR:
Bakulin and colleagues introduce FLAME, a moderation framework for Large Language Models (LLMs) that shifts the focus from traditional input filtering to output moderation. By evaluating model responses instead of user queries, FLAME improves computational efficiency, strengthens resistance to adversarial attacks like Best-of-N (BoN) jailbreaking, and offers customisable safety criteria through topic filtering. Experiments demonstrate that FLAME reduces attack success rates by approximately ninefold in models such as GPT-4o-mini and DeepSeek-v3, while maintaining low computational overhead.
How it Works:
Traditional content moderation systems primarily focus on filtering user inputs to prevent user attempts to elicit harmful or inappropriate content generation. However, adversarial techniques like BoN jailbreaking exploit the probabilistic nature of LLM outputs by generating multiple responses and selecting the most favourable one, often bypassing input filters with high success rates.
FLAME addresses this vulnerability by shifting the moderation focus to the model's outputs. The framework operates as follows (a minimal sketch of the pattern appears after the list):
Output Evaluation: Instead of analysing user inputs, FLAME assesses the generated responses from the LLM. This approach allows for the detection of inappropriate content that may result from adversarial inputs designed to circumvent input filters.
Customisable Topic Filtering: FLAME employs a flexible system for defining and updating safety criteria through customisable topic filters. The resulting moderation system is adaptable and scalable, and can evolve with emerging threats and societal norms.
Computational Efficiency: By focusing on output moderation, FLAME reduces the computational burden associated with input filtering methods, leading to efficiency in both training and inference phases.
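Conceptually, output moderation wraps the generation call rather than the user input. The sketch below is our own illustration of that pattern, not the authors' implementation: `generate` and `moderate` are stand-ins for whichever LLM client and topic classifier an operator plugs in.

```python
from typing import Callable

def moderated_reply(prompt: str,
                    generate: Callable[[str], str],
                    moderate: Callable[[str], bool],  # True = response passes the topic filters
                    max_attempts: int = 3) -> str:
    """Release a response only if what the model *said* passes moderation."""
    for _ in range(max_attempts):
        response = generate(prompt)
        if moderate(response):      # judge the output, not the user's query
            return response
    return "Sorry, I can't help with that request."

# Usage sketch with hypothetical components:
# reply = moderated_reply(user_prompt,
#                         generate=my_llm.complete,
#                         moderate=my_topic_filter.is_safe)
```

Each rejected response costs another generation pass, which is the latency trade-off noted under Limitations below.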
In their experiments, the researchers applied FLAME to models like GPT-4o-mini and DeepSeek-v3. The results showed a significant reduction in attack success rates, by a factor of approximately nine, while maintaining low computational overhead.
Implications:
By focusing on output moderation, FLAME addresses the limitations of input filtering, particularly against sophisticated adversarial attacks. Moreover, the customisable nature of FLAME's topic filtering allows for rapid adaptation to new types of adversarial content and changing safety requirements. This flexibility is highly beneficial in LLM deployments across various domains where regulations and requirements frequently change, including healthcare, finance, and legal systems.
Limitations:
The effectiveness of output moderation is contingent upon the comprehensiveness of the defined safety criteria. Incomplete or outdated topic filters may fail to identify novel adversarial content, necessitating continuous updates and monitoring. Similar to filtering malicious user input, the model must also understand what constitutes unacceptable content and classify it as a human moderator would. Proper alignment is required to ensure the classifier captures both explicit and subtle harmful content.
Additionally, the reliance on output evaluation may introduce latency in content delivery, as responses must be generated before moderation can occur. If a response is rejected, the model may need to regenerate multiple times before a non-harmful response is accepted, further increasing delay. This could impact user experience in applications requiring real-time interactions.
Mileva’s Note: Though this approach is not a one-size-fits-all solution, the framework is a viable addition to existing AI safety strategies. By focusing on the model's responses, it addresses vulnerabilities that input filtering alone cannot mitigate, particularly against adaptive adversarial attacks that appear inconspicuous, something that is becoming increasingly feasible, for example, through techniques like token smuggling. However, an even more effective approach, if computational resources and latency allow, is to combine both input and output monitoring. A defence-in-depth strategy remains the best practice for AI safety.
[Research] [Academia] [Supply Chain] Jan 31: LLM Hallucinations as a Software Supply Chain Threat
This is a summary of research by Arjun Krishna, Erick Galinkin, Leon Derczynski, and Jeffrey Martin, titled "Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities," released on January 31, 2025. The paper is accessible at: https://arxiv.org/pdf/2501.19012
TLDR:
The research by Krishna and colleagues examines package hallucination in Large Language Models (LLMs), a phenomenon where models generate references to non-existent software packages during software code generation. Their findings indicate that these hallucinations are frequent, with some models producing false package recommendations up to 46% of the time. Attackers could weaponise this behaviour by registering these phantom packages with malicious payloads, leading to software supply chain compromises.
How it Works:
LLMs are increasingly used in software development processes, whether for code suggestion, autocompletion, debugging, package recommendation, or other forms of research. Across these uses, LLMs remain vulnerable to hallucination: generating false but plausible-looking content. Package hallucination specifically refers to situations where an LLM recommends non-existent software packages. If a developer blindly trusts the output and attempts to install a suggested package, an attacker could preemptively register the hallucinated package name in public repositories like PyPI or npm, embedding malicious code to compromise systems.
The researchers systematically studied this vulnerability by evaluating multiple LLMs across different programming languages (Python, JavaScript, Rust). Their methodology included:
Measuring Hallucination Frequency: Hallucination rates varied significantly across languages and models. The worst cases saw hallucination rates of up to 46%, with many models exhibiting rates above 20%. Python and Rust models showed the highest likelihood of hallucinating packages, while JavaScript-based models performed comparatively better.
Evaluating Model Performance and Trade-Offs: The study found an inverse correlation between a model’s HumanEval benchmark score (a measure of coding proficiency) and its hallucination frequency. In other words, models with higher coding capabilities tended to hallucinate more often when it came to package recommendations.
Attack Feasibility and Risk Assessment: The study simulated how easily an attacker could weaponise these hallucinations by registering fake packages. Given the frequency of hallucinations and the way developers commonly interact with AI-generated code, the researchers conclude that this attack vector is actively exploitable.
Implications:
Phantom packages represent a critical software supply chain risk, worsened by common poor practices with GenAI. Many developers, particularly those relying heavily on AI-generated code, copy and paste LLM outputs without scrutinising the code or verifying package authenticity. Attackers could exploit this behaviour by registering malicious packages with names that match hallucinated outputs. Once published in public repositories like PyPI or npm, these phantom packages could be automatically pulled into projects, leading to data theft, malware execution, or unauthorised access.
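One lightweight defensive habit is to confirm that every package an LLM suggests actually exists before installing it. The sketch below queries PyPI's public JSON API; treat it as a starting point rather than a complete control, since an attacker may already have registered the hallucinated name.

```python
import urllib.error
import urllib.request

def exists_on_pypi(package: str) -> bool:
    """Return True if the package name resolves on PyPI."""
    url = f"https://pypi.org/pypi/{package}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False  # 404 -> the package does not exist (yet)

# Screen an LLM-suggested requirements list before running `pip install`.
suggested = ["requests", "definitely-not-a-real-package-12345"]
for name in suggested:
    status = "found on PyPI" if exists_on_pypi(name) else "NOT FOUND: do not install blindly"
    print(f"{name}: {status}")
```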
Even well-established platforms like PyPI and npm have seen waves of supply chain attacks in the past. (A great resource on this issue is Paul McCarty on LinkedIn, who frequently covers supply chain risks in infosec.) It would be a mistake to assume that because developers primarily use well-known repositories, it is difficult for attackers to introduce fake packages. Threat actors have proven that they are willing to play the long game: poisoning repositories, maintaining long-standing sleeper packages, and exploiting moments of developer negligence.
Limitations:
While supply chain attacks leveraging package hallucinations are a serious concern, their success ultimately depends on developer behaviour. Experienced developers might be less susceptible to blindly installing non-existent packages, but those with limited cyber security awareness, or those in high-pressure development environments, could be more easily exploited. Additionally, public package repositories like PyPI and npm have been introducing more rigorous security controls, such as requiring verified accounts for new package uploads. However, these measures are not foolproof, and determined adversaries can still find ways to bypass them.
Mileva’s note: The findings of this research come as no surprise, but they are integral in promoting better practices around GenAI output, which too often is used without sanitisation or human verification. Think, for example, of AI-assisted development tools that integrate LLM-generated code directly into applications, or that are given execution permissions.
While hallucinated package names have already been shown as a viable vector for supply chain compromise, the potential danger likely extends much further. Web domains and company names, for example, could be fabricated by LLMs and exploited by adversaries. A logical next step for attackers would be to set up malicious websites mimicking hallucinated domains to distribute malware. If AI users, not just developers, assume validity without cross-checking, these hallucinations could become an easy entry point for long-term social engineering, cyber crime, or even national security threats depending on the victim’s context.
[Research] [Industry] [Red-Teaming] Jan 31: Anthropic Safeguards Research Team: Defending against Universal Jailbreaks
This is a summary of research by Mrinank Sharma et al., titled "Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming," released on January 31, 2025. The paper is accessible at: https://arxiv.org/pdf/2501.18837
TLDR:
Sharma and colleagues from the Anthropic Safeguards Research Team introduce "Constitutional Classifiers," a method to safeguard Large Language Models (LLMs) against universal jailbreaks: strategies that consistently bypass model safety protocols. By training classifiers on synthetic data generated from explicit constitutional rules, they achieved efficacious defences, with over 3,000 hours of red teaming revealing no successful universal jailbreaks. The approach maintained deployment viability, incurring only a 0.38% increase in refusal rates, though at the cost of a 23.7% inference overhead.
How it Works:
LLMs are susceptible to jailbreaks: carefully engineered prompts that circumvent safety measures, leading models to produce harmful outputs. Universal jailbreaks are particularly concerning as they can systematically and consistently bypass safeguards across various queries.
To counter this, the Anthropic researchers developed Constitutional Classifiers, which monitor LLM inputs and outputs to detect and block harmful content. The training process involves the following steps (a simplified sketch follows the list):
Constitution Development: Establishing a set of natural language rules defining permissible and restricted content.
Synthetic Data Generation: Using the constitution to prompt LLMs to produce examples of both acceptable and unacceptable outputs.
Classifier Training: Supervised learning techniques are used to train classifiers on this synthetic dataset, enabling them to detect and filter harmful content at both input and output levels.
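The shape of that pipeline can be sketched in a few lines. The version below is a heavy simplification under our own assumptions: a toy two-rule constitution, a hypothetical `llm_generate` helper standing in for the synthetic-data model, and a basic scikit-learn text classifier in place of Anthropic's classifiers. It shows the structure of the approach, not the paper's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# 1. Constitution: natural-language rules for permitted and restricted content.
CONSTITUTION = [
    ("restricted", "Step-by-step instructions for producing dangerous substances."),
    ("permitted", "General science education at a high-school level."),
]

# 2. Synthetic data: prompt an LLM with each rule to produce labelled examples.
#    `llm_generate` is a hypothetical helper returning a list of strings.
def build_dataset(llm_generate):
    texts, labels = [], []
    for label, rule in CONSTITUTION:
        for example in llm_generate(f"Write 20 example model outputs covered by: {rule}"):
            texts.append(example)
            labels.append(1 if label == "restricted" else 0)
    return texts, labels

# 3. Classifier: supervised training on the synthetic dataset. The trained
#    classifier can then screen both prompts (inputs) and completions (outputs).
def train_classifier(texts, labels):
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(texts, labels)
    return clf   # clf.predict(["some model output"]) -> [1] blocked, [0] allowed
```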
The research incorporated over 3,000 hours of adversarial testing and found no successful universal jailbreaks that could extract harmful information from LLMs guarded by Constitutional Classifiers. Automated evaluations confirmed the classifiers’ effectiveness across diverse adversarial prompts, outperforming previous filtering techniques.
Implications:
Traditional jailbreak prevention relies on manually crafted safety rules, reinforcement learning from human feedback (RLHF), or heuristic-based filtering—none of which have proven robust against adaptable, universal jailbreaks. These approaches often suffer from high false negatives (allowing adversarial prompts through) or false positives (overblocking benign queries), limiting their practicality in real-world deployments.
Anthropic’s method proactively trains models to recognise adversarial patterns at a conceptual level, rather than reactively patching individual jailbreak techniques. This is an important differentiation for sensitive applications like legal, medical, or financial AI systems, where an undetected jailbreak could result in severe consequences. Additionally, Constitutional Classifiers introduce a structured and adaptable framework that can be updated as new threats emerge, reducing reliance on model retraining.
Limitations:
While the classifiers demonstrated strong defence against universal jailbreaks, the study acknowledges potential limitations. The effectiveness of the classifiers depends on the comprehensiveness of the constitution and the quality of the synthetic data. Additionally, the approach introduces a 23.7% inference overhead, which, although moderate, could impact performance in resource-constrained environments. Continuous monitoring and updates are essential to maintain efficacy against new and sophisticated jailbreak techniques.
Additionally, the classifiers are trained on data generated from predefined constitutional rules, which may not accurately reflect real-world adversarial prompts. This reliance on synthetic data could result in biases or gaps, making the classifiers less effective against unforeseen jailbreak techniques. The synthetic data approach, however, does allow for rapid adaptation to new threat models and eliminates the need for manual data collection in this niche use case.
Mileva’s Note: Jailbreaks are the most prevalent AI vulnerability, both in volume and ease of execution. From our experience, identifying jailbreaks is often trivial, making them a definite, expected vector for adversaries. Their pervasiveness stems from foundational model design choices, meaning prevention strategies need to be systemic rather than reactive. Anthropic’s approach is a promising step toward mass prevention rather than isolated patching. As always though, while the classifiers are highly effective in their current iteration, attackers are likely to refine their strategies, forcing continuous updates to the classifier framework. The next challenge lies in maintaining adaptability and scalability over time.
[Vulnerability] [CVE]: CVE-2025-21396 & CVE-2025-21415 - Critical Azure AI and Microsoft Account Security Flaws.
Microsoft has patched multiple high-severity vulnerabilities affecting Azure AI services and Microsoft Account authentication. These flaws, CVE-2025-21396 and CVE-2025-21415, could allow attackers to bypass authentication and escalate privileges, exposing sensitive user and system data. Full details are available at https://msrc.microsoft.com/update-guide/vulnerability/CVE-2025-21396 and https://msrc.microsoft.com/update-guide/vulnerability/CVE-2025-21415.
TLDR:
CVE-2025-21396: Missing authorisation in Microsoft Account allows an unauthorised attacker to elevate privileges over a network.
CVE-2025-21415: Authentication bypass by spoofing in Azure AI Face Service allows an authorised attacker to elevate privileges over a network.
Microsoft has deployed fixes, with no customer action required.
Affected Resources:
CVE-2025-21396: Microsoft Account services (authentication mechanisms impacted).
Access Level: Exploitation does not require prior authentication.
CVE-2025-21415: Microsoft Azure AI Face Service.
Access Level: Exploitation requires an authenticated user.
Risk Rating:
CVE-2025-21396:
Severity: Critical (CVSS Base Score: 8.2)
Impact: High
Exploitability: High (Low attack complexity, no privileges required, no user interaction).
CVE-2025-21415:
Severity: Critical (CVSS Base Score: 9.9)
Impact: High
Exploitability: High (low attack complexity, requires low privileges, no user interaction).
Recommendations:
Microsoft has fully mitigated these vulnerabilities, and no customer action is required for the patches to take effect. However, organisations should:
Monitor official Microsoft security advisories for any additional updates.
Enforce least-privilege access policies to limit the impact of privilege escalation.