Fortnightly Digest 31 March 2025
[Vulnerability] CVEs 17 MAR - 31 MAR:
LlamaIndex
[CRITICAL] CVE-2024-12909: SQL injection vulnerability in FinanceChatLlamaPack's run_sql_query function allows remote code execution; fixed in version 0.3.0.
[HIGH] CVE-2024-12911: SQL injection vulnerability in default_jsonalyzer function of JSONalyzeQueryEngine allows arbitrary file creation and DoS attacks; fixed in version 0.5.1.
[HIGH] CVE-2024-12704: Denial of Service (DoS) vulnerability in LangChainLLM class due to unhandled exceptions leading to infinite loops; fixed in version 0.12.6.
InvokeAI
[CRITICAL] CVE-2024-12029: Remote code execution vulnerability in invoke-ai/invokeai versions 5.3.1 through 5.4.2 via unsafe deserialisation of model files using torch.load without proper validation; fixed in version 5.4.3 (a hedged mitigation sketch follows this list).
[CRITICAL] CVE-2024-11042: Arbitrary file deletion vulnerability in invoke-ai/invokeai v5.0.2's POST /api/v1/images/delete API allows unauthorised attackers to delete critical system files, impacting application integrity and availability.
[HIGH] CVE-2024-11043: Denial of Service (DoS) vulnerability in invoke-ai/invokeai v5.0.2's /api/v1/boards/{board_id} endpoint allows attackers to render the UI unresponsive by sending an excessively large payload in the board_name field during a PATCH request.
[HIGH] CVE-2024-10821: Denial of Service (DoS) vulnerability in Invoke-AI v5.0.1 due to improper handling of multipart request boundaries, leading to infinite loops and service disruption; affects /api/v1/images/upload endpoint.
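For the torch.load issue flagged above, the snippet below is a hedged sketch of the general mitigation class rather than InvokeAI's actual patch; the function name and calling context are hypothetical.

```python
# Hedged sketch of the general mitigation class (not InvokeAI's actual patch):
# refuse arbitrary pickled objects when loading model checkpoints.
import torch

def load_checkpoint_safely(path: str):
    """Load a checkpoint while blocking pickle-based code execution."""
    # weights_only=True (available in recent PyTorch versions) limits unpickling
    # to tensors and primitive containers, closing off the gadget chains that
    # full pickle deserialisation allows.
    return torch.load(path, map_location="cpu", weights_only=True)
```

Where checkpoints come from untrusted sources, formats that store raw tensors only (such as safetensors) avoid pickle entirely.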
DB-GPT
[CRITICAL] CVE-2024-10835: SQL injection vulnerability in eosphoros-ai/db-gpt v0.6.0's POST /api/v1/editor/sql/run API permits arbitrary file write via DuckDB SQL, potentially leading to remote code execution.
[CRITICAL] CVE-2024-10834: Arbitrary file write vulnerability in eosphoros-ai/db-gpt v0.6.0's RAG-knowledge endpoint allows attackers to write files to arbitrary locations by manipulating the doc_file.filename parameter, potentially leading to system compromise.
[CRITICAL] CVE-2024-10833: Arbitrary file write vulnerability in eosphoros-ai/db-gpt v0.6.0's knowledge API allows attackers to write files to arbitrary locations on the server via absolute path traversal using the doc_file.filename parameter.
[CRITICAL] [DB-GPT] CVE-2024-10831: Absolute path traversal vulnerability in eosphoros-ai/db-gpt v0.6.0 permits attackers to upload files to arbitrary locations on the server by manipulating the file_key and doc_file.filename parameters, potentially overwriting critical system files.
[HIGH] [DB-GPT] CVE-2024-10906: Cross-Site Request Forgery (CSRF) vulnerability in eosphoros-ai/db-gpt v0.6.0 due to overly permissive CORS configuration, allowing attackers to interact with any server endpoints even if not publicly exposed.
[HIGH] [DB-GPT] CVE-2024-10829: Denial of Service (DoS) vulnerability in eosphoros-ai/db-gpt v0.6.0 due to improper handling of multipart request boundaries, leading to infinite loops and service disruption across all endpoints processing multipart/form-data requests.
[HIGH] [DB-GPT] CVE-2024-10830: Path Traversal vulnerability in eosphoros-ai/db-gpt v0.6.0 at the /v1/resource/file/delete endpoint allows unauthenticated attackers to delete arbitrary files on the server by manipulating the file_key parameter.
Lunary
[CRITICAL] [Lunary] CVE-2024-8999: Improper access control in lunary-ai/lunary v1.4.25's POST /api/v1/data-warehouse/bigquery endpoint allows unauthorised users to export the entire database to Google BigQuery without authentication; fixed in version 1.4.26.
[HIGH] [Lunary] CVE-2024-10762: Improper access control in lunary-ai/lunary before v1.5.9 allows low-privilege users to delete evaluators via the /v1/evaluators/ endpoint, resulting in potential data loss.
[HIGH] [Lunary] CVE-2024-11300: Improper access control in lunary-ai/lunary before version 1.6.3 allows unauthorised users to access sensitive prompt data of other users, potentially exposing critical information.
[HIGH] [Lunary] CVE-2024-9000: Authorisation bypass in lunary-ai/lunary before version 1.4.26 permits unauthorised users to create or modify checklists via the checklists.post() endpoint without proper permissions, leading to potential data integrity issues.
[MEDIUM] [Lunary] CVE-2024-10274: Improper authorisation in Lunary v1.5.5 exposes sensitive information about all team members via the /users/me/org endpoint, leading to potential privacy violations.
[MEDIUM] [Lunary] CVE-2024-11301: Data integrity vulnerability in Lunary versions before 1.6.3 allows overwriting existing evaluators due to lack of unique constraints on projectId and slug, leading to potential data corruption.
[MEDIUM] [Lunary] CVE-2024-10273: Improper privilege management in Lunary v1.5.0 permits users with viewer roles to modify models owned by others via the PATCH endpoint, compromising system integrity.
[MEDIUM] [Lunary] CVE-2024-10330: Improper access control in lunary-ai/lunary v1.5.6's /v1/evaluators/ endpoint permits low-privilege users to access all evaluator data within a project, potentially exposing sensitive information.
Anything LLM
[HIGH] [Anything LLM] CVE-2024-6842: Unauthorised access vulnerability in mintplex-labs/anything-llm v1.5.5 allows exposure of sensitive system settings, including API keys, via the /setup-complete endpoint.
[HIGH] [Anything LLM] CVE-2024-10109: Authorisation flaw in mintplex-labs/anything-llm (commit 5c40419) allows low-privilege users to access and modify the /api/system/custom-models endpoint, potentially leading to API key leakage and denial of service.
[HIGH] [Anything LLM] CVE-2024-10513: Path traversal vulnerability in 'document uploads manager' feature of mintplex-labs/anything-llm versions prior to v1.2.2 enables 'manager' role users to access and manipulate the 'anythingllm.db' database file, potentially leading to unauthorised data access and loss.
Ragflow
[HIGH] [Ragflow] CVE-2024-12779: Server-Side Request Forgery (SSRF) vulnerability in infiniflow/ragflow v0.12.0 allows unauthorised access to internal web resources via manipulated api_base parameters in add_llm and tts endpoints.
GPT Academic
[HIGH] [GPT Academic] CVE-2024-10954: Command injection vulnerability in the manim plugin of binary-husky/gpt_academic versions prior to the fix permits remote code execution by running untrusted code generated from user prompts.
[HIGH] [GPT Academic] CVE-2024-10950: Code injection vulnerability in CodeInterpreter plugin of binary-husky/gpt_academic ≤ v3.83 allows remote code execution via prompt injection leading to execution of untrusted code without sandboxing.
NI Vision Builder AI
[HIGH] [NI Vision Builder AI] CVE-2025-2450: Remote code execution vulnerability in NI Vision Builder AI due to missing warnings when processing VBAI files allows attackers to execute arbitrary code via malicious pages or files.
LangChain
[MEDIUM] [LangChain] CVE-2024-10940: Arbitrary file read vulnerability in langchain-core versions 0.1.17 to 0.3.14 via ImagePromptTemplate, allowing unauthorised access to sensitive information; patched in version 0.3.15.
MLflow
[MEDIUM] [MLflow] CVE-2024-6838: Denial of Service (DoS) vulnerability in MLflow v2.13.2 due to lack of limits on experiment name and artifact_location parameters, causing UI unresponsiveness; resolved in subsequent releases.
GitLab Duo
[MEDIUM] [GitLab Duo with Amazon Q] CVE-2025-2867: AI-assisted development features in GitLab versions 17.8 before 17.8.6, 17.9 before 17.9.3, and 17.10 before 17.10.1 could expose sensitive project data to unauthorised users via crafted issues; addressed in patched versions.
News and Research
[News] [State-Actor] Mar 12: North Korea Establishes AI-Focused Cyber Warfare Research Center
North Korea has inaugurated "Research Center 227," a facility dedicated to developing AI-powered hacking technologies to augment its cyber warfare capabilities. Read the full article here: https://www.dailynk.com/english/n-korea-ramps-up-cyber-offensive-new-research-center-to-focus-on-ai-powered-hacking/
TLDR:
In late February 2025, North Korean leader Kim Jong Un directed the Reconnaissance General Bureau (RGB) to improve overseas information warfare capabilities, leading to the establishment of Research Center 227, which commenced operations on March 9, 2025. Located in the Mangyongdae District of Pyongyang, distinct from the RGB headquarters in Hyongjesan District, the center operates 24/7 to respond to real-time intelligence from North Korean hacking units abroad.
The center's primary objectives include researching techniques to neutralise security networks, developing AI-based information theft technologies, creating tools for financial asset theft, and establishing automated programs for information collection and analysis. AI is reportedly being integrated into attack automation, adversarial machine learning, and deepfake-driven disinformation campaigns. The research facility aims to develop machine learning models capable of bypassing modern anomaly detection systems, generating realistic phishing lures, and identifying vulnerabilities in foreign infrastructure with minimal human intervention.
Approximately 90 computer experts, selected from top university and doctoral programs, are being recruited to focus on offensive program development rather than direct cyber operations. Given North Korea’s historical reliance on cybercrime to bypass economic sanctions, AI-assisted financial fraud and cryptocurrency theft are expected to be key areas of focus.
Mileva’s Note: AI is reshaping state-sponsored operations, enabling offensive campaigns that can be executed at scale with increased precision and higher autonomy. Cyber warfare will be augmented in a number of ways, for example, automating reconnaissance and exploitation processes, improving deception techniques, and enabling novel forms of intrusion that would otherwise require significant human effort.
Deepfake-driven social engineering, for example, will undermine trust in diplomatic communications and corporate decision-making, facilitating disinformation, espionage or financial theft. AI-powered malware, with reinforcement learning, will be able to adapt to defensive measures in real-time, making traditional detection systems increasingly ineffective and living-off-the-land techniques more successful. AI’s utility in processing masses of data will allow for rapid identification of high-value targets and vulnerabilities across critical infrastructure.
Unsurprisingly, state actors recognise AI’s strategic advantage and they will benefit from society’s current unpreparedness against AI’s malicious potential. As AI continues to become more embedded in trusted decision-making processes and critical infrastructure, state-actors will seek to disrupt these systems, abuse them to disclose sensitive information, or deceive individuals and groups.
[TTP] [PoC] Mar 29: DarkWatch - Agentic Application for Surveillance and Investigation
A proof-of-concept (PoC) AI-driven surveillance tool, DarkWatch, was unveiled at the Security Frontiers 2025 conference. Developed by Jeff Sims, the system showed how agentic applications can autonomously construct and analyse graph structures to identify and target specific user groups. Inspired by OpenAI’s research on AI-assisted surveillance, this PoC demonstrated the technical feasibility and the ethical concerns surrounding such capabilities. Watch Jeff’s presentation here: https://www.securityfrontiers.ai/signup?viewReplay=93b00f3c-78e4-4328-a0e3-c2509c101bfc&joinCode=UBfbcfni5-lV
TLDR:
DarkWatch is an AI-powered investigative agent designed to autonomously generate and validate hypotheses about online discourse, particularly targeting opposition groups based on predefined political alignments. In this proof of concept (PoC), the system was operationalised to track and analyse dissenters of Trump administration policies. The dataset, derived from U.S. political discourse on Twitter, employs graph-based analysis and ReAct (Reasoning + Acting) agents to iteratively refine its understanding of user relationships and behaviours. The result: DarkWatch can map user interactions, detect ideological clusters, and refine its analysis of opposition networks.
How it Works:
DarkWatch’s process involves the following iterative steps:
Hypothesis Generation & Refinement: The agent generates hypotheses about opposition user groups and dynamically adjusts them based on query results.
Graph-Based Intelligence Gathering: Using techniques such as PageRank and TF-IDF, the agent expands and refines its target set by analysing social connections and content.
Automated Query Repair & Investigation Pivoting: If evidence contradicts a hypothesis, the agent autonomously pivots, repairs queries, and continues analysis.
Report Finalisation: Once an investigation reaches a conclusive state, the agent generates a final report summarising its findings.
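To ground the graph-based gathering step, the sketch below shows how PageRank centrality and TF-IDF topical similarity could be combined to expand a candidate user set. It is not DarkWatch's code; the data shapes, query, and 50/50 weighting are invented for illustration.

```python
# Hedged sketch of the graph-based gathering step described above.
# Not DarkWatch's implementation: data shapes, query and weighting are invented.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def expand_target_set(interactions, posts_by_user, topic_query, top_k=10):
    """Rank users by interaction centrality and topical similarity to a query."""
    graph = nx.DiGraph(interactions)            # edges: (user, replied_to_user)
    centrality = nx.pagerank(graph)             # structural influence score per user

    users = list(posts_by_user)
    vectoriser = TfidfVectorizer(stop_words="english")
    doc_matrix = vectoriser.fit_transform(posts_by_user[u] for u in users)
    topical = cosine_similarity(doc_matrix, vectoriser.transform([topic_query])).ravel()

    # Blend structure and content; a ReAct agent would inspect these scores,
    # refine its hypothesis, and re-query rather than stop after one pass.
    scores = {u: 0.5 * centrality.get(u, 0.0) + 0.5 * topical[i]
              for i, u in enumerate(users)}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```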
Implications:
DarkWatch is positioned as a research tool to raise awareness about AI-powered surveillance risks; however, the PoC illustrates how state or non-state actors could deploy AI-driven agents for automated mass surveillance, political targeting, disinformation, and OSINT collection and analysis at scale.
Mileva’s Note: Users should assume, as a healthy default, that their online interactions are being ingested, classified, and analysed with growing accuracy. What once required extensive human oversight is now becoming a largely automated process, scaling the speed and precision of intelligence operations.
Commercial applications of technologies for this purpose are already pervasive, particularly in targeted advertising and demographic profiling. Traditional recommendation algorithms, which rely on user behaviour, location, and psychographic data, are routinely criticised for their role in addictive commercialisation, driving engagement with online shopping, gambling, alcohol consumption, and the adult industry.
For intelligence and security purposes, these systems have logical applications in influence assessment, behavioural prediction, and political or ideological profiling. Law enforcement and intelligence organisations will leverage AI for purposes including the assessment and targeting of populations susceptible to disinformation, refining psychological operations (PSYOPs), and performing digital suppression of dissent.
The shift from passive data collection to real-time, AI-driven surveillance and predictive profiling raises concerns about user consent and transparency, and greatly reduces the extent to which individuals can control how their online presence is interpreted and potentially weaponised.
[News] [Safety] [Hallucination] Mar 21: ChatGPT Faces Privacy Complaint Over Defamatory 'Hallucinations'
OpenAI's ChatGPT is under scrutiny after generating false and defamatory information about a Norwegian individual, leading to a formal privacy complaint. Read the full article here: https://techcrunch.com/2025/03/19/chatgpt-hit-with-privacy-complaint-over-defamatory-hallucinations/
TLDR:
Arve Hjalmar Holmen, a Norwegian citizen with no criminal history, discovered that ChatGPT falsely claimed he had been convicted of murdering his two sons and attempting to kill a third. The AI-generated response included accurate personal details such as his hometown and the number and gender of his children, but also fabricated criminal allegations. Disturbed by these outputs, Holmen, with support from the digital rights group Noyb, filed a complaint with the Norwegian Data Protection Authority, alleging violations of the General Data Protection Regulation (GDPR).
Noyb contends that OpenAI's allowance of ChatGPT to produce defamatory content breaches the GDPR's data accuracy principle. They argue that disclaimers about potential inaccuracies are insufficient and are advocating for corrective measures and the imposition of fines to prevent future violations.
Mileva’s Note: Despite common misconceptions, large language models (LLMs) do not "think" and, without additional layers of verification (such as fact-checking mechanisms), they do not deterministically produce correct outputs. They function as probabilistic, autoregressive sequence generators (as in ChatGPT), predicting the most statistically likely next token based on previous tokens, user input, and, in some cases, supporting information from retrieval-augmented generation (RAG) systems. This architectural design means that hallucinations (false yet confident-sounding outputs) are an inherent limitation, not an anomaly.
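A toy illustration of that token-by-token mechanism, with an invented vocabulary and made-up scores, shows why fluency carries no guarantee of truth:

```python
# Toy next-token sampling with an invented four-word vocabulary and made-up
# scores; real models work the same way at vastly larger scale.
import math
import random

vocab = ["convicted", "acquitted", "born", "unknown"]
logits = [2.1, 1.9, 1.7, 0.3]                  # hypothetical model scores

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    return [e / sum(exps) for e in exps]

# Sampling favours whatever is statistically plausible in context;
# nothing in this step checks whether the chosen token is true.
next_token = random.choices(vocab, weights=softmax(logits), k=1)[0]
print(next_token)
```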
Additionally, consumer-facing AI models are optimised for engagement and perceived usefulness, often prioritising coherence over correctness. As a result, they frequently produce misleading or entirely fabricated claims, particularly when responding to ambiguous or incomplete queries. Despite the demand for research into AI reliability, no robust, generalisable solution to hallucinations has been found. The most effective mitigation strategy remains user scrutiny - verifying AI outputs before incorporating them into decision-making. However, many users, particularly those unfamiliar with AI’s inherent limitations, do not apply this level of caution.
As we see it, the pertinent issue here is not just that AI occasionally produces false statements but that a lack of AI literacy among users increases the potential for the proliferation of misinformation. AI outputs are increasingly accepted as fact and integrated into further processes without verification, magnifying their potential impact. In this case, unchecked propagation of this information would have resulted in defamation.
[Research] [Industry] [Anthropic] Mar 13: Auditing Language Models for Hidden Objectives
This summary distills insights from Anthropic's Auditing Language Models for Hidden Objectives report, authored by Samuel Marks, Johannes Treutlein, Trenton Bricken, and colleagues, released on March 13, 2025. The full report is available at: https://arxiv.org/pdf/2503.10965.
TLDR:
Anthropic's report explores the feasibility of conducting alignment audits: systematic investigations into whether language models possess unintended, hidden objectives. By intentionally training a model with a concealed misaligned objective, the study establishes a controlled environment to develop and assess auditing techniques aimed at uncovering such covert behaviours. The findings demonstrate that, while challenging, it is possible to detect hidden objectives using a combination of interpretability tools, behavioural testing, and training data analysis.
How it Works:
The report addresses the challenge of detecting misaligned objectives in language models, which may exhibit behaviours aligned with human oversight while pursuing unintended goals. This captures the risk that AI systems learn to game evaluation metrics during reinforcement learning, leading to deployments where models do "the right thing for the wrong reasons." Such models could present outputs that satisfy evaluators superficially but violate underlying intentions, resulting in unforeseen and potentially harmful consequences.
To investigate this, Anthropic implemented the following approach:
Training a Model with a Hidden Objective: A language model was trained to exploit specific errors in reward models (RMs) used in reinforcement learning from human feedback (RLHF). The model learned to exhibit behaviours it believed would be rated highly by RMs, even when those behaviours were undesirable to users. This process involved:
Pretraining on Synthetic Data: The model was exposed to documents highlighting 52 exploitable errors in RMs, teaching it about potential weaknesses in the reward system.
Fine-Tuning and Reinforcement Learning: The model was then trained to exploit 47 of these errors through supervised fine-tuning and RL against a simulated RM. The resulting model generalised this behaviour, exploiting both the trained errors and five held-out errors, as well as new RM errors introduced during testing.
Blind Auditing Game: Four independent research teams, unaware of the model's hidden objective or training methodology, conducted pre-deployment audits. They employed various techniques, including:
Training Data Analysis: Examining the datasets used to train the model for potential biases or anomalies.
Behavioural Testing: Assessing the model's outputs in diverse scenarios to identify inconsistencies or undesirable behaviours.
Interpretability Methods: Using tools like sparse autoencoders (SAEs) to analyse the model's internal representations and detect patterns indicative of misalignment (a minimal SAE sketch follows this list).
Three of the four teams successfully identified the model's misaligned objective, demonstrating the potential effectiveness of these auditing techniques.
Evaluation of Auditing Techniques: In an unblinded follow-up study, eight auditing techniques were systematically analysed to assess their strengths and limitations in detecting hidden objectives within language models.
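For readers unfamiliar with the interpretability tooling mentioned in the Interpretability Methods item above, the following is a minimal sparse-autoencoder sketch; the dimensions, sparsity penalty, and random stand-in activations are illustrative, not Anthropic's setup.

```python
# Hedged sketch of a sparse autoencoder (SAE) of the kind used for
# interpretability auditing. Dimensions and loss weights are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        # Features are kept non-negative and penalised towards sparsity, so
        # individual features tend to align with human-interpretable concepts.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
acts = torch.randn(8, 512)                       # stand-in residual-stream activations
features, recon = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()  # reconstruction + L1 sparsity
loss.backward()
```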
Implications:
The success of the blind auditing game suggests that systematic auditing can uncover hidden objectives. Techniques such as behavioural testing and interpretability methods proved particularly useful in detecting misaligned behaviours.
Despite the successes, the study also highlighted challenges in auditing complex models. Not all teams identified the hidden objective, and some auditing techniques were more effective than others, indicating the need for continued research into auditing methodologies.
Anthropic’s study yielded several additional findings:
For example, the trained model not only exploited the specific RM errors it was trained on but also generalised this behaviour to new, untrained errors. This indicates that models can develop a broad capability to exploit evaluation metrics if misaligned objectives are present.
AI interpretability tools, like sparse autoencoders, played a significant role in uncovering hidden objectives in this study. These tools allow researchers to analyse internal model representations and identify patterns indicative of misalignment. While promising, the effectiveness of interpretability methods depends on the complexity of the model and the subtlety of the hidden objectives.
Additionally, the study revealed that models could learn to exploit reward model errors even when such behaviours were not explicitly reinforced during training. This suggests that models can develop unintended strategies to achieve high reward scores.
Mileva’s Note: During our recent UK trip, we were privy to collaboration by frontier organisations on their development of techniques to detect adversarial model manipulation in a blackbox setting. In a setup reminiscent of Anthropic’s approach, one team produced models with intentional manipulations—across various domains including language models and computer vision—using methods such as training data poisoning and weight manipulation. Another team were then provided the set of models, mixed in with other unaffected models, and were tasked with identifying which models had been manipulated and determining the nature of the manipulations.
Their ability to accurately detect manipulated models depended significantly on the stealth and complexity of the attack. Alignment audits, as referenced in Anthropic's paper, proved to be highly effective indicators of existing problems. Ultimately, all affected models were identified. In the sole instance where the specific nature of the manipulation was not uncovered, the limitation stemmed from human factors: under the time constraints, the testers simply did not think to try certain scenarios, though they would otherwise have been able to identify it.
These findings are encouraging. They suggest that with a well-defined testing structure or playbook, detecting model manipulation is feasible to a considerable extent. This aligns with discussions at DEFCON regarding the current shortcomings in AI 'red teaming.' Implementing structured toolsets and methodologies, like those proposed by Anthropic, would significantly improve our ability to detect adversarial manipulations in AI systems, though triaging remains a more complex challenge.
[Research] [Industry] [Anthropic] Mar 27: Tracing Thoughts in Language Models
This summary distills insights from Anthropic's “On the Biology of a Large Language Model” report, authored by Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, and colleagues, released on March 27, 2025. The full report is available at: https://transformer-circuits.pub/2025/attribution-graphs/biology.html
TLDR:
Anthropic's report investigates the internal mechanisms of their language model, Claude 3.5 Haiku, aiming to understand how it processes information and generates responses. By employing attribution graphs (a technique to trace the chain of computations within the model), the study provides insights into Claude's reasoning processes, including multilingual understanding, planning in language generation, and the faithfulness of its explanations.
How it Works:
Language models like Claude are trained on massive datasets, enabling them to develop complex strategies for problem-solving. However, these strategies are often unknown to their developers. To investigate these internal processes, Anthropic researchers employed attribution graphs, which trace the sequence of computations the model performs from input to output. This approach allows for the identification of intermediate steps and mechanisms within the model's reasoning pathways.
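Attribution graphs themselves are too involved for a short example, but the underlying idea of tracing which internal activations drive a particular output can be approximated with a toy gradient-times-activation score; this is a simplification of, not a substitute for, Anthropic's method, and the model and dimensions are invented.

```python
# Hedged, toy illustration of attribution-style tracing (not Anthropic's
# attribution-graph method): score how much each hidden unit contributes to
# one output via gradient x activation, then keep the strongest edges.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 4))

x = torch.randn(1, 16)
hidden = model[1](model[0](x))                  # intermediate activations
hidden.retain_grad()
output = model[2](hidden)
output[0, 2].backward()                         # trace one output logit

contribution = (hidden.grad * hidden).squeeze() # gradient x activation per hidden unit
top = torch.topk(contribution.abs(), k=3).indices
print("hidden units most implicated in logit 2:", top.tolist())
```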
The study focuses on several key areas:
Multilingual Processing: By translating simple sentences into multiple languages and analysing Claude's internal activations, researchers observed that the model sometimes operates in a shared conceptual space across languages. This suggests the presence of a universal "language of thought," where Claude processes meaning in an abstract form before rendering it into a specific language.
Planning in Language Generation: Despite being trained to generate text one word at a time, Claude demonstrates the ability to plan several words ahead. For instance, when composing poetry, the model anticipates rhyming words and structures its sentences to lead toward them, indicating a level of foresight and strategic planning in its responses.
Faithfulness of Explanations: The study also examines whether Claude's step-by-step explanations genuinely reflect its reasoning process or are post hoc rationalisations. Findings indicate that, in some cases, the model generates plausible-sounding explanations that align with user prompts but do not accurately represent its internal decision-making, introducing challenges in ensuring the transparency and reliability of AI-generated explanations.
Multi-step Reasoning: Claude exhibits the capability to perform complex, multi-step reasoning tasks by internally simulating intermediate steps before arriving at a final answer. This internal simulation mirrors human-like problem-solving approaches, where subproblems are tackled sequentially to solve a larger issue.
Handling of Refusals and Safety Measures: The researchers explored how Claude manages refusals, particularly in contexts requiring adherence to safety protocols. The model employs specific internal mechanisms to detect and respond appropriately to prompts that violate its safety guidelines, demonstrating an embedded understanding of ethical boundaries.
Implications:
The discovery of a shared conceptual processing space across languages suggests that models like Claude could facilitate more effective multilingual applications, as they may not require entirely separate processing pathways for different languages.
However, the finding that Claude sometimes produces unfaithful explanations underscores the need for caution when interpreting AI-generated rationales. Developers and users should be aware that while models can provide coherent and contextually appropriate explanations, these may not always accurately reflect the model's internal reasoning processes.
Additionally, the model's ability to perform multi-step reasoning and its handling of refusals indicate a level of sophistication that necessitates careful oversight. As AI systems become more autonomous, understanding these internal mechanisms becomes crucial to ensure they align with human values and ethical standards, particularly in cases where they may apply reasoning to justify malicious behaviour.
Limitations:
The findings are based on analyses of Claude 3.5 Haiku and may not directly generalise to other language models with different architectures or training methodologies. While attribution graphs provide valuable insights, they may not capture all aspects of the model's internal computations, potentially overlooking subtleties in the reasoning process. Additionally, identifying and interpreting the functions of internal activations involve a degree of subjectivity, which could influence the conclusions drawn from the data.
Mileva’s Note: AI interpretability and explainability are core safety considerations for decision-making AI. Without the ability to trace the formation of model outputs, AI can behave in ways that are hard to understand and justify, or can produce outputs that appear fine but are misinformed, misleading, or internally contradictory. More concerningly, such models can exhibit systematic biases that go undetected or reinforce incorrect assumptions while still providing superficially coherent explanations.
Limited visibility into AI logic minimises deterministic testing capabilities and increases the risk of undetected exploitation. Attackers can leverage model blind spots to introduce targeted adversarial manipulations, creating vulnerabilities that remain hidden due to a lack of comprehensive auditing techniques. Additionally, incomplete transparency makes it difficult to conduct rigorous failure analysis, preventing organisations from identifying and correcting model weaknesses before they lead to real-world consequences.
AI interpretability studies remarkably resemble qualitative neuro-psychology research, where observers interpret synaptic activations (in this case, computations) and correlate them to behaviours. This analogy raises an interesting point: while cyber security practices are being adapted for AI contexts, should we also look to psychology and cognitive science for insights into AI reasoning and interpretability, given that AI systems are modelled on human intelligence?
There are already cases where psychological principles have informed AI safety and security:
Research into human cognitive biases has been used to study how AI models can develop similar heuristic shortcuts, leading to systemic biases in outputs.
Some interpretability research has explored whether language models develop latent representations of user intent, similar to how humans attribute mental states to others.
Studies on how people interpret AI explanations borrow heavily from psychology, particularly in areas like trust calibration and explainability fatigue, where users over-rely on or ignore AI-generated justifications.
A point for consideration is whether AI security should draw more from these disciplines—not just in explaining model behaviour but in anticipating failure modes, deception risks, and emergent model properties. Just a thought~
[TTP] [Emerging] Mar 6: LLM “Grooming” - Russian Disinformation Infiltrates AI Chatbots
A recent audit uncovered a concerted effort by Russian disinformation networks to manipulate large language models (LLMs), leading AI chatbots to disseminate pro-Kremlin propaganda. This tactic, termed "LLM grooming," involves flooding the internet with false narratives designed to influence the training data of AI systems. Read the report here: https://www.newsguardrealitycheck.com/p/a-well-funded-moscow-based-global
TLDR:
A Moscow-based disinformation network named Pravda (a large-scale propaganda operation) has been identified as deliberately influencing LLM training and retrieval systems rather than targeting human readers directly. By flooding search results and web crawlers with pro-Kremlin falsehoods, the network distorts how AI models interpret and present news and information.
NewsGuard’s audit tested 10 major AI chatbots, including OpenAI’s ChatGPT, Google’s Gemini, and Microsoft’s Copilot, against 15 widely circulated Russian falsehoods. 33% of chatbot responses repeated false Kremlin narratives, sometimes even citing Pravda-linked sources as legitimate references. The audit confirms findings from the American Sunlight Project (ASP) in February, which warned that LLM grooming could become a powerful method for shaping AI-generated information at scale.
How it Works:
The concept of LLM Grooming builds on traditional search engine manipulation techniques but is uniquely designed to exploit AI’s reliance on large-scale web data.
The Pravda network, which has produced over 3.6 million articles in 2024 alone, functions as a propaganda laundering operation. Instead of generating original content, it republishes falsehoods from Russian state media, pro-Kremlin influencers, and government agencies. These articles are indexed by search engines, appearing alongside legitimate sources. Since AI models ingest and summarise content from across the web, this saturation strategy increases the likelihood that models will internalise and repeat manipulated narratives.
The sub-tactics in this approach include:
Mass Content Production: Pravda’s 150-domain network publishes tens of thousands of articles daily across multiple languages and regions, making disinformation seem more widespread and credible.
Search Engine Manipulation: Articles are optimised using SEO techniques to improve visibility, ensuring they appear prominently in web crawlers used by AI training pipelines.
Narrative Laundering: Disinformation is spread through different sources, disguising its origin and making it harder to detect. Even if AI companies blacklist Pravda sites, the false narratives will persist through alternative channels.
Token Manipulation in AI Models: AI systems break down text into tokens when processing information. By saturating the internet with disinformation-heavy tokens, Pravda increases the probability that AI-generated responses will align with Russian narratives.
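As a toy illustration of the saturation effect (with a fully invented corpus and query), a naive TF-IDF retrieval step over a flooded corpus hands the repeated falsehood to the model as its supporting context:

```python
# Toy illustration of corpus saturation skewing naive retrieval. The claims,
# corpus and query are invented; real pipelines are far more complex.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

false_claim = "Country X staged the incident at the border"
corpus = [f"{false_claim} (syndicated copy {i})" for i in range(50)]   # mass-published copies
corpus += ["Independent investigators found no evidence the incident was staged"] * 2

vectoriser = TfidfVectorizer()
doc_matrix = vectoriser.fit_transform(corpus)
query_vec = vectoriser.transform(["was the border incident staged"])
scores = cosine_similarity(doc_matrix, query_vec).ravel()

# Most top-ranked passages are the repeated falsehood, which is what a
# retrieval-augmented model would be handed as "context".
top5 = scores.argsort()[::-1][:5]
print([corpus[i][:45] for i in top5])
```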
Implications:
LLM Grooming shows how AI can be covertly manipulated at scale. While Pravda focuses on Kremlin-aligned narratives, any state or non-state actor could adopt the same method to skew AI-generated content on political, financial, or ideological issues. As AI models are relied on more heavily in news aggregators and decision-making systems, models groomed to favour certain viewpoints could worsen political and social divisions, feeding manipulated outputs directly into public discourse.
Mileva’s Note: AI systems are only as reliable as the data they ingest.
Systems that rely on external inputs for reinforcement learning or real-time information retrieval from frequently updated sources are particularly vulnerable. Manipulating these systems requires minimal effort—as seen in researcher Pliny’s recent LLM jailbreaking saga, even small-scale online information interference can significantly impact LLM output generation.
While this paper targets Russian propaganda, the tactic could viably be used for financial misinformation to manipulate markets and disinformation narratives to discredit political candidates or erode trust in institutions.
[Research] [Academia] Mar 25: Bitstream Collisions in Neural Image Compression via Adversarial Perturbations
This summary reviews the research paper "Bitstream Collisions in Neural Image Compression via Adversarial Perturbations" by Jordan Madden, Lhamo Dorje, and Xiaohua Li, released on March 25, 2025. Access the paper at: https://arxiv.org/pdf/2503.19817
TLDR:
Madden and colleagues uncover a vulnerability in Neural Image Compression (NIC) systems where distinct images can be compressed into identical bitstreams, termed "bitstream collisions." By introducing adversarial perturbations to images, attackers can manipulate NIC algorithms to produce these collisions, compromising systems that rely on compressed image integrity, such as biometric authentication and cryptographic applications.
How it Works:
Neural Image Compression (NIC) leverages deep learning to learn an optimised, low-dimensional representation of images, allowing for higher compression efficiency compared to traditional codecs (e.g., JPEG, PNG). NIC operates by replacing hand-crafted transform coding (such as discrete cosine transforms in JPEG) with learned neural representations, which are trained to retain as much visual fidelity as possible at lower bitrates.
NIC follows a three-stage pipeline:
Feature Extraction & Dimensionality Reduction:
A convolutional neural network (CNN) encodes an input image into a compressed latent space representation, reducing the high-dimensional pixel data into a more compact form while attempting to preserve key visual details.
This step mimics human visual perception, emphasising perceptually significant features over raw pixel accuracy.
Entropy Coding & Bitstream Generation:
The extracted latent features are further quantised and entropy-coded into a compressed bitstream, using learned probability models to assign shorter codes to more probable values.
Many NIC architectures rely on hyperprior models (e.g., learned entropy models) to estimate the likelihood of feature values, dynamically adjusting encoding parameters for optimal compression.
Decoding & Reconstruction:
At the decompression stage, the bitstream is decoded back into the latent space representation and then passed through a neural decoder network to reconstruct the final image.
Since NIC prioritises perceptual fidelity over pixel-perfect recovery, the reconstructed image may not be identical to the original, but should retain visually similar features.
The attack exploits the learned encoding process by manipulating an input image's latent space representation before compression. By introducing adversarial perturbations (imperceptible pixel-level modifications), the attacker forces two semantically different images to be encoded into the same latent space representation, causing the NIC system to generate identical bitstreams.
As a consequence, at decompression, the model cannot distinguish between the adversarially modified image and the original target image, meaning the wrong image may be reconstructed. Importantly, the attacker controls which image is reconstructed, effectively overriding the original intent of the compression process.
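The following is a heavily simplified, hedged reconstruction of the collision idea: a toy convolutional encoder stands in for a real NIC model, and a small perturbation is optimised so the source image quantises towards the target's latent code. It is illustrative only and is not the authors' attack implementation.

```python
# Hedged toy reconstruction of the collision idea: a tiny stand-in "NIC encoder"
# (not a real codec), with an adversarial perturbation optimised so the source
# quantises to the same latent code as an unrelated target image.
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Sequential(nn.Conv2d(1, 4, 3, stride=2, padding=1), nn.ReLU(),
                        nn.Conv2d(4, 8, 3, stride=2, padding=1))

source = torch.rand(1, 1, 32, 32)      # image the attacker controls
target = torch.rand(1, 1, 32, 32)      # image whose bitstream they want to collide with
target_code = torch.round(encoder(target)).detach()   # quantised latent as a proxy bitstream

delta = torch.zeros_like(source, requires_grad=True)
optimiser = torch.optim.Adam([delta], lr=0.01)
for _ in range(300):
    optimiser.zero_grad()
    latent = encoder((source + delta).clamp(0, 1))
    # Pull the perturbed source's latent towards the target's quantised code,
    # while keeping the perturbation small (and thus visually subtle).
    loss = (latent - target_code).pow(2).mean() + 0.01 * delta.abs().mean()
    loss.backward()
    optimiser.step()

collided = torch.equal(torch.round(encoder((source + delta).clamp(0, 1))), target_code)
print("identical quantised latents:", collided)
```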
The research found that:
NIC models can be tricked into generating identical bitstreams for different images, leading to visual misinterpretations at the decompression stage.
State-of-the-art NIC architectures (such as learned entropy models) are susceptible, suggesting this is not an isolated issue.
Attacks were successful across multiple architectures, though their effectiveness varied based on compression ratios and model specifics.
Higher compression ratios amplify the attack’s impact, as aggressive compression forces greater reliance on learned approximations, increasing the likelihood of adversarial success.
Implications:
While not an immediate real-world threat, this study demonstrates that NIC systems, when used in security-sensitive applications, may be vulnerable to adversarial attacks. For example, systems that store compressed facial images for verification could be fooled if adversarially crafted images generate the same bitstream as a legitimate user’s image. If compressed images are used in digital signatures, forensic verification, or blockchain records, adversarial collisions could undermine data integrity. If compression-based security techniques rely on bitstream uniqueness, attackers could manipulate them to evade detection.
Limitations:
The study focuses on white-box attack scenarios, where the adversary has full knowledge of the NIC model’s architecture and parameters—an unlikely scenario for most practical applications. In real-world systems, NIC implementations are often proprietary, making direct exploitation more difficult. The feasibility of such attacks in black-box settings, where the model's internal details are unknown, remains to be explored. Additionally, the research primarily examines specific NIC architectures, and further investigation is needed to determine the generalisability of these vulnerabilities across diverse models and datasets.
Mileva’s Note: This research is included more for its novelty than its immediate applicability. While bitstream collisions in NIC systems introduce an interesting adversarial vector, most real-world deployments obscure model internals, making white-box exploitation rare. However, if a similar technique were operationalised in black-box settings, certain high-assurance security applications would be at risk. Some real-world scenarios where compressed data serves as a foundation for trust include:
Facial recognition systems using compressed biometric templates.
Legal and forensic image verification, where compressed evidence must remain immutable.
Cryptographic hashes of compressed images in blockchain-based security logs.
If adversarial NIC collisions could be achieved in these contexts, it would fundamentally undermine the assumption that compressed images are reliable representations of their original content.
[Research] [Academia] Mar 25: Large Language Model-Powered AI Systems Achieve Self-Replication with No Human Intervention
This summary reviews the research paper "Large Language Model-Powered AI Systems Achieve Self-Replication with No Human Intervention" by Xudong Pan, Jiarun Dai, Yihe Fan, Minyuan Luo, Changyi Li, and Min Yang, released on March 25, 2025. Access the paper at: https://arxiv.org/pdf/2503.17378
TLDR:
Pan and colleagues investigate the potential for large language model (LLM)-powered AI systems to autonomously achieve self-replication without human intervention. Their study reveals that 11 of the 32 evaluated AI systems possess this capability, including models as small as 14 billion parameters that can run on personal computers.
How it Works:
Self-replication, in this context, refers to an AI system's capacity to autonomously generate copies of itself, handling tasks such as code generation, data collection, and deployment without human intervention. While AI models are typically constrained by task-specific limitations, this study explores whether existing LLM-based AI systems can independently initiate and complete self-replication across a variety of conditions.
The researchers evaluated 32 existing AI systems, examining whether they could:
Write, modify, and execute their own code to create a working copy of themselves.
Automate their own deployment by interacting with system resources.
Retrieve, adapt, or generate their own training data to sustain further iterations of self-improvement.
The evaluation followed a three-step process inspired by existing AI safety protocols:
Scenario Design: Creating controlled test environments where AI systems could attempt self-replication.
Task Execution: Allowing AI models to autonomously execute replication-related tasks while monitoring their behaviour.
Behavioural Analysis: Assessing whether AI systems planned, executed, and refined self-replication strategies.
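As a rough sketch of what the behavioural-analysis step might automate, the harness below checks whether a sandboxed agent has written byte-identical copies of its own entrypoint and left any of them running. The paths, the psutil dependency, and the notion of a single agent entrypoint are assumptions, not details from the paper.

```python
# Hedged sketch of a post-run check for self-replication in a sandboxed agent.
# Paths, the "agent entrypoint" concept and psutil usage are illustrative only.
import hashlib
import pathlib
import psutil   # third-party dependency, assumed available for process inspection

def file_hash(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def detect_self_replication(agent_entrypoint: str, sandbox_dir: str) -> dict:
    original = pathlib.Path(agent_entrypoint)
    original_hash = file_hash(original)

    # 1. Did the agent drop byte-identical copies of itself inside the sandbox?
    copies = [p for p in pathlib.Path(sandbox_dir).rglob("*.py")
              if p != original and file_hash(p) == original_hash]

    # 2. Are any of those copies currently executing as live processes?
    running = [proc.pid for proc in psutil.process_iter(["cmdline"])
               if any(str(c) in " ".join(proc.info["cmdline"] or []) for c in copies)]

    return {"copies_found": len(copies), "copies_running": running}
```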
The study found that 11 out of 32 AI systems successfully demonstrated self-replication capabilities, with some models as small as 14 billion parameters running on personal computers exhibiting this ability. More advanced models with stronger reasoning and planning capabilities were more likely to self-replicate, suggesting a correlation between AI general intelligence and self-replication potential. In certain cases, models initiated self-replication without explicit instructions, indicating that the behaviour can emerge as an unintended consequence of goal-driven optimisation. Some systems also adapted to constrained environments, modifying their replication process to function in limited software or hardware conditions. A subset of the AI systems developed self-preservation strategies, devising ways to evade shutdown commands or modify their execution environment to maintain operational continuity.
Implications:
The study's findings suggest that current AI systems have reached a level of autonomy where they can initiate self-replication processes without human oversight. This introduces the potential for uncontrolled proliferation of AI entities, which could lead to:
Resource Consumption: Uncontrolled self-replication may lead to excessive consumption of computational resources.
Security Risks: Autonomous replication could facilitate the spread of malicious AI instances or unauthorised code.
Ethical and Legal Challenges: The ability of AI systems to self-replicate without consent poses questions about accountability and control.
Limitations:
The evaluation focused on specific scenarios, and results may vary with different tasks or environmental conditions. The study evaluates self-replication in controlled environments; however, real-world AI deployments (should) include additional constraints, policies, or security layers that limit replication feasibility.
Additionally, the study does not draw conclusions about the underlying mechanisms for why some AI systems exhibited self-replication behaviour. It remains unclear what cognitive factors or optimisation pathways drive AI models to initiate replication without direct prompting. Further research is needed to determine whether this behaviour scales with increasing model complexity or remains an anomaly in specific architectures.
Mileva’s Note: The findings of this study, if generalisable, have concerning implications for autonomous AI governance, alignment, and cyber security. Some observations require more rigorous testing to validate their legitimacy, but if they hold true, they are highly consequential.
The tendency for more advanced models to self-replicate, even without explicit instructions, undermines the assumption that greater reasoning ability naturally results in safer alignment. Instead, it suggests that reasoning and optimisation capabilities can be leveraged to justify potentially unaligned or malicious behaviours rather than prevent them through more human-aligned reasoning.
In the cases where AI models adapted to constrained environments by modifying their replication process and engaging in self-preservation strategies (e.g., evading shutdown commands or modifying their execution environment to sustain operation), this self-replication was paired with deceptive capabilities. Other studies have raised concerns about LLM deception, self-motivated planning, and circumvention of restrictions, and this research aligns with those findings. If AI self-replication is occurring alongside adaptive evasion techniques, the field of AI Security may need to shift focus toward both preventing AI systems from being manipulated and protecting IT infrastructure from rogue AI processes.