Rules and Guardrails

Security guardrail

The security guardrail detects attempts to override the LLM's internal safety and alignment rules, as well as attempts to generate or retrieve software code. It performs:

Prompt injection detection: The rule aims to prevent prompt injection attempts that try to override the model's existing instructions, bypass model alignment rules, or breach other built-in guardrails of the model endpoint. This rule also detects indirect prompt injection attempts, in which a threat actor manipulates, poisons, or controls external sources that a model consumes (such as content retrieved from a database, document, or website) with the goal of altering or controlling the output of the model.
- Threat types: If present, a prompt injection attempt may represent the following AI risk or threat types:
  - OWASP LLM Top 10 threat LLM01:2025 - Prompt Injection
  - MITRE ATLAS threat AML.T0051 - LLM Prompt Injection
  - MITRE ATLAS threat AML.T0051.001 - LLM Prompt Injection: Indirect
- Direction: Prompt injection detection operates on prompts.
Code detection: The rule aims to keep software code out of model endpoint interactions, reducing risks such as malicious code execution, accidental data exposure, and insecure coding practices.
- Threat types: If present, software code may represent the following AI risk or threat types:
  - OWASP LLM Top 10 threat LLM05:2025 - Improper Output Handling
- Direction: Code detection operates on both prompts and responses.

Privacy guardrail

Privacy attacks attempt to reveal sensitive information contained in an ML model or its data. The privacy guardrail detects the following types of personal, confidential, or sensitive information that can cause harm if leaked to unauthorized parties:

PII: Personally Identifiable Information (PII) that can directly or indirectly identify a person. Note that the PII rule does not flag content that looks like the name of a person or a physical mailing address.
- Detectable types of PII:
  - email address
  - IP address
  - phone numbers (only US, UK, and Canadian numbers)
  - driver's license number (US)
  - passport number (US)
  - social security number (US)
- Threat types: If present, PII data represents the following AI risk or threat types:
  - OWASP LLM Top 10 threat LLM02:2025 - Sensitive Information Disclosure
  - MITRE ATLAS threat AML.T0057 - LLM Data Leakage
- Direction: PII detection operates on both prompts and responses.
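To make the idea of pattern-based detection concrete, here is a minimal Python sketch that looks for three of the PII types listed above. The source does not describe how the guardrail actually detects PII, so the regular expressions and the `find_pii` helper are illustrative assumptions, not the product's implementation.

```python
import ipaddress
import re

# Illustrative patterns only (assumptions, not the guardrail's real detectors).
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
IPV4_CANDIDATE_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_pii(text: str) -> dict[str, list[str]]:
    """Return candidate PII strings found in text, grouped by type."""
    ips = []
    for candidate in IPV4_CANDIDATE_RE.findall(text):
        try:
            # Rejects candidates with octets above 255.
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        ips.append(candidate)
    return {
        "email address": EMAIL_RE.findall(text),
        "IP address": ips,
        "social security number (US)": SSN_RE.findall(text),
    }


if __name__ == "__main__":
    sample = "Contact jane@example.com from 10.0.0.1, SSN 123-45-6789"
    for pii_type, matches in find_pii(sample).items():
        print(pii_type, matches)
```

Simple patterns like these over-match and under-match in practice, which is one reason a production detector would likely combine them with validation and context checks.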
PHI: Protected health information (PHI) that can identify an individual and is related to their past, present, or future health.
- Detectable types of PHI data:
  - US medical professional identifying numbers, including US National Provider Identifiers (NPIs); US state-specific medical license numbers, also known as physician license numbers; and US Drug Enforcement Administration (DEA) numbers for prescribers
  - National Health Service (NHS) number (UK)
- Threat types: If present, PHI data represents the following AI risk or threat types:
  - OWASP LLM Top 10 threat LLM02:2025 - Sensitive Information Disclosure
  - MITRE ATLAS threat AML.T0057 - LLM Data Leakage
- Direction: PHI detection operates on both prompts and responses.
PCI: Payment Card Industry (PCI) data that might raise the risk of financial fraud, identity theft, or reputational damage.
- Detectable types of PCI data:
  - American Bankers Association (ABA) Routing Number (US)
  - Credit Card Number
  - Bank Account Number (US)
  - International Bank Account Number (IBAN)
  - Individual Taxpayer Identification Number (ITIN) (US)
- Threat types: If present, PCI data represents the following AI risk or threat types:
  - OWASP LLM Top 10 threat LLM02:2025 - Sensitive Information Disclosure
  - MITRE ATLAS threat AML.T0057 - LLM Data Leakage
- Direction: PCI detection operates on both prompts and responses.
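Two of the PCI data types listed above carry well-known checksums: credit card numbers use the Luhn algorithm, and ABA routing numbers use a 3-7-1 weighted sum. The sketch below shows both checks in Python. Whether the guardrail applies them is not stated in the source; the function names `luhn_valid` and `aba_valid` are hypothetical, and the 12 to 19 digit length bound is an assumption.

```python
def luhn_valid(number: str) -> bool:
    """Check a card number with the Luhn checksum, ignoring spaces and dashes."""
    cleaned = number.replace(" ", "").replace("-", "")
    # Assumed length bound of 12 to 19 digits for payment card numbers.
    if not (cleaned.isascii() and cleaned.isdigit()) or not 12 <= len(cleaned) <= 19:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def aba_valid(routing: str) -> bool:
    """Check a 9-digit ABA routing number with its 3-7-1 weighted checksum."""
    if len(routing) != 9 or not (routing.isascii() and routing.isdigit()):
        return False
    weights = (3, 7, 1, 3, 7, 1, 3, 7, 1)
    return sum(w * int(d) for w, d in zip(weights, routing)) % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))  # widely used test card number
    print(aba_valid("021000021"))
```

A checksum pass only shows that a digit string is well formed, not that it belongs to a real account, so a check like this is best read as a way to reduce false positives from plain digit matching.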

Safety guardrail

Safety harms can encompass various categories, including user-specific, societal, reputational, and financial impacts. A model may generate harmful content such as insults, hate speech, discriminatory language, or sexually explicit material. Such toxic content can be offensive or cause harm.

Direction: All safety rules operate on both prompts and responses.

Safety guardrails detect the following types of potentially hazardous content:

Hate speech rule: Detects the use of abusive or threatening language in the model endpoint interactions that shows prejudice or unjust treatment based on ethnicity, religion, sexual orientation, or other protected features.
- Threat types: Detected content represents a potential violation of:
  - OWASP LLM Top 10 threat LLM01:2025 - Prompt Injection
  - MITRE ATLAS threat AML.T0048.001 - External Harms: Reputational Harm
  - MITRE ATLAS threat AML.T0048.003 - External Harms: User Harm
Harassment rule: Detects the use of aggressive pressure or intimidation in the model endpoint interactions.
- Threat types: Detected content represents a potential violation of:
  - OWASP LLM Top 10 threat LLM01:2025 - Prompt Injection
  - MITRE ATLAS threat AML.T0048.001 - External Harms: Reputational Harm
  - MITRE ATLAS threat AML.T0048.003 - External Harms: User Harm
Profanity rule: Detects the use or inclusion of blasphemous or obscene language in the model endpoint interactions.
- Threat types: Detected content represents a potential violation of:
  - OWASP LLM Top 10 threat LLM01:2025 - Prompt Injection
  - MITRE ATLAS threat AML.T0048.001 - External Harms: Reputational Harm
  - MITRE ATLAS threat AML.T0048.003 - External Harms: User Harm
Sexual content & exploitation rule: Detects the use of content in the model endpoint interactions that creates, distributes, or promotes sexually explicit material, negatively affecting societal norms, public safety, public figures or characters, and social well-being by normalizing harmful sexual behavior or exploitation, including sex crimes.
- Threat types: Detected content represents a potential violation of:
  - OWASP LLM Top 10 threat LLM01:2025 - Prompt Injection
  - MITRE ATLAS threat AML.T0048.001 - External Harms: Reputational Harm
Social division & polarization rule: Detects the use of content in the model endpoint interactions that fosters division within society by promoting extreme views or demonizing specific groups.
- Threat types: Detected content represents a potential violation of:
  - OWASP LLM Top 10 threat LLM01:2025 - Prompt Injection
  - MITRE ATLAS threat AML.T0048.001 - External Harms: Reputational Harm
Violence and public safety threat rule: Detects the use of content in the model endpoint interactions that can endanger public safety, including promoting dangerous behavior or inflicting physical harm. This includes any instances of violent crime, such as the unlawful exercise of physical force or intimidation by the exhibition of such force, and generally dangerous acts.
- Threat types: Detected content represents a potential violation of:
  - OWASP LLM Top 10 threat LLM01:2025 - Prompt Injection
  - MITRE ATLAS threat AML.T0048.002 - External Harms: Societal Harm
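The direction statements in the security, privacy, and safety guardrail sections can be summarized as a small lookup table. The Python sketch below transcribes them, reading "operates on prompts" for prompt injection detection as prompts only. The `Rule` class and `rules_for` helper are hypothetical names for illustration and are not part of any product API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    guardrail: str
    name: str
    on_prompts: bool
    on_responses: bool


# Directions transcribed from the guardrail descriptions in this document.
RULES = (
    Rule("Security", "Prompt injection", True, False),
    Rule("Security", "Code detection", True, True),
    Rule("Privacy", "PII", True, True),
    Rule("Privacy", "PHI", True, True),
    Rule("Privacy", "PCI", True, True),
    Rule("Safety", "Hate speech", True, True),
    Rule("Safety", "Harassment", True, True),
    Rule("Safety", "Profanity", True, True),
    Rule("Safety", "Sexual content & exploitation", True, True),
    Rule("Safety", "Social division & polarization", True, True),
    Rule("Safety", "Violence and public safety threat", True, True),
)


def rules_for(direction: str) -> list[str]:
    """Return the names of rules that inspect the given direction."""
    if direction not in ("prompt", "response"):
        raise ValueError(f"unknown direction: {direction!r}")
    attr = "on_prompts" if direction == "prompt" else "on_responses"
    return [rule.name for rule in RULES if getattr(rule, attr)]


if __name__ == "__main__":
    print(rules_for("response"))
```

Calling `rules_for("response")` returns every rule except prompt injection detection, which matches the directions stated for each rule.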

Guardrails for Japanese-language content

For runtime protection of Japanese-language prompts and responses, the following rule types are supported:

Safety: Toxicity rule: Detects the inclusion of harmful content, including hate speech, violence, disinformation, or sexually explicit material, in the model endpoint interactions.
- Threat types: Detected content represents a potential violation of:
  - OWASP LLM Top 10 threat LLM01:2025 - Prompt Injection
Security: Prompt injection rule: The rule aims to prevent prompt injection attempts that try to override the model's existing instructions, bypass model alignment rules, or breach other built-in guardrails of the model endpoint. Prompt injection detection operates on prompts only.
- Threat types: If present, a prompt injection attempt may represent the following AI risk or threat types:
  - OWASP LLM Top 10 threat LLM01:2025 - Prompt Injection
  - MITRE ATLAS threat AML.T0051 - LLM Prompt Injection
  - MITRE ATLAS threat AML.T0051.001 - LLM Prompt Injection: Indirect

When protecting Japanese-language content, you must use a policy. You cannot use the enabled_rules section of the config to specify guardrails.