What tools are best to analyze malicious Office documents?

The primary tools for analysis include the oletools suite (specifically olevba and oledump), msoffcrypto-tool for encrypted files, and binwalk for pattern detection. For dynamic analysis, sandboxes like Cuckoo or ANY.RUN are recommended.

How do I extract macros from a Word document safely?

Use the 'olevba' command from the oletools suite. Running 'olevba document.doc' extracts the VBA source and flags auto-execution triggers, suspicious keywords, and potential IOCs without executing the macro. Add '--decode' or '--deobf' to expand obfuscated strings.

What is the difference between OLE2 and OOXML formats?

OLE2 (Legacy) formats like .doc and .xls use a Compound File Binary Format (CFBF). OOXML (Modern) formats like .docx and .xlsx are essentially ZIP containers holding XML files. OOXML files can be unzipped to inspect the internal XML structure.

Why are Excel 4.0 macros dangerous?

Excel 4.0 (XLM) macros are stored in worksheet cells rather than VBA modules, making them harder for standard antivirus solutions to detect. Attackers use them to evade modern security controls.

Analyze Malicious Office Documents: The Complete Guide

Updated on 2026-02-08

Table of Contents

Prerequisites
Port Information
Document Types Overview
Initial Information Gathering
Basic Static Analysis
Advanced Static Analysis
Dynamic Analysis
IOC Extraction
Exploitation Perspective
Detection & Mitigation

Microsoft Office documents have been a favorite delivery mechanism for attackers since the 90s. I've seen everything from macro droppers to weaponized RTF exploits in my assessments. These files slip past perimeter defenses because users trust .docx, .xlsx, and .pptx files. Understanding how to dissect these documents is critical for any red teamer or malware analyst.

In this guide, I'll walk you through the complete analysis workflow I use when investigating suspicious Office documents. You'll learn how to extract metadata, identify embedded macros, deobfuscate malicious code, and safely detonate samples in controlled environments. This isn't just theory - these are the exact techniques I apply during incident response and threat hunting engagements.

Whether you're dealing with phishing campaigns, targeted attacks, or analyzing threat actor TTPs, this methodology will help you uncover what's hiding in those innocent-looking spreadsheets and presentations.

Note: Before analyzing any malicious sample, make sure you are working in an isolated environment, have proper authorization, and follow ethical guidelines.

Prerequisites

Analysis Environment:

Isolated VM or dedicated malware analysis system (REMnux, FLARE VM, or Ubuntu)
No network connectivity to production systems
Snapshots enabled for quick rollback

Required Tools:

# Install oletools suite
pip install oletools

# Get the Didier Stevens Suite (oledump.py, zipdump.py, rtfdump.py, xmldump.py)
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite

# Install additional tools
apt install exiftool binwalk yara -y
pip install msoffcrypto-tool XLMMacroDeobfuscator

Port Information

Microsoft Office documents commonly arrive via:

Email attachments (SMTP - Port 25/587)
Web downloads (HTTP/HTTPS - Port 80/443)
File shares (SMB - Port 445)
Cloud storage links

Document Types Overview

Legacy OLE2 Format (.doc, .xls, .ppt):

Compound File Binary Format (CFBF)
Structured storage with multiple streams
Commonly contains VBA macros

Office Open XML (.docx, .xlsx, .pptx):

ZIP container with XML files
Introduced in Office 2007
Can contain macros in .docm, .xlsm, .pptm variants

Rich Text Format (.rtf):

Plain text with control words
Historically exploited via embedded objects
No macro support but can contain OLE objects

Excel 4.0 Macros (XLM):

Legacy macro format still supported
Often used to evade modern detection
Stored in sheet cells, not VBA modules

Initial Information Gathering

Before diving deep, I always start with basic reconnaissance to understand what I'm dealing with. This phase is completely passive and safe.

File Type Identification

# Identify true file type
file suspicious-document.docx

# Show MIME type and character set
file -i suspicious-document.docx

# Check if password-protected (-v prints the result)
msoffcrypto-tool suspicious-document.xlsx --test -v
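
The same extension-independent check can be approximated in a few lines of Python. This is a minimal sketch, assuming only the well-known leading bytes of each container; the function name sniff_container is made up for illustration, and a ZIP result only means the file might be OOXML.

```python
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # Compound File Binary header
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header (OOXML candidate)
RTF_MAGIC = b"{\\rtf"                            # start of an RTF document


def sniff_container(data: bytes) -> str:
    """Classify a document by its first bytes, ignoring the file extension."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    if data.lstrip().startswith(RTF_MAGIC):
        return "rtf"
    return "unknown"
```

To use it on a sample, read the first few bytes with open(path, "rb").read(16) and compare the result against the extension. A mismatch, such as a .docx that comes back as "ole2", deserves a closer look.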

Metadata Extraction

# Extract EXIF metadata
exiftool document.docx

# View all metadata fields
exiftool -a -G1 document.xlsx

# Check for author information
exiftool -Author -Creator -LastModifiedBy document.pptx

# Extract timestamps
exiftool -CreateDate -ModifyDate -MetadataDate document.doc

Metadata often reveals valuable intelligence - author names, software versions, creation dates, and modification history. Tools like ExifTool are essential here. I've seen malware campaigns where all samples shared the same author field, making attribution easier.
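
For OOXML files, the author and timestamp fields live in docProps/core.xml inside the ZIP container. The sketch below reads them with the standard library only. The function name read_core_properties and the selection of fields are illustrative choices, and ElementTree is not hardened against hostile XML, so keep this inside the analysis VM.

```python
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(source) -> dict:
    """Return the core properties that are present in an OOXML file.

    source is a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    props = {}
    for key, path in FIELDS.items():
        node = root.find(path, NS)
        if node is not None and node.text:
            props[key] = node.text
    return props
```

Running this across a folder of samples and grouping by creator is a quick way to spot the shared author fields mentioned above.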

String Analysis

# Basic strings extraction
strings document.doc > strings-output.txt

# UTF-16LE strings (useful on OLE2 files; OOXML parts are compressed, so unzip them first)
strings -el document.doc > unicode-strings.txt

# Search for URLs
strings document.doc | grep -i "http"

# Look for suspicious commands
strings document.doc | grep -iE "powershell|cmd|wscript|mshta"

# Find IP addresses
strings document.doc | grep -oE '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b'

XOR String Search

Many documents use XOR encoding to hide malicious strings from basic analysis.

# Install xorsearch
wget https://didierstevens.com/files/software/xorsearch_V1_11_1.zip
unzip xorsearch_V1_11_1.zip

# Search for XOR-encoded strings
./xorsearch document.doc http

# Search for specific patterns
./xorsearch document.doc powershell

# Save the decoded file for the first key that reveals the string
# (all keys are tried by default)
./xorsearch -s document.doc http

# Search for encoded URL schemes
./xorsearch document.doc "://"
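
To make the idea concrete, here is a minimal Python sketch of a single-byte XOR search. It covers only plain XOR with a one-byte key, so it is a simplification and not a reimplementation of XORSearch. The function name xor_search is hypothetical.

```python
def xor_search(data: bytes, needle: bytes) -> list:
    """Return (key, offset) pairs where needle appears under a one-byte XOR key.

    Key 0 corresponds to the string being present in clear text.
    """
    hits = []
    for key in range(256):
        # Encode the short needle instead of decoding the whole file 256 times.
        encoded = bytes(b ^ key for b in needle)
        start = data.find(encoded)
        while start != -1:
            hits.append((key, start))
            start = data.find(encoded, start + 1)
    return hits
```

Each hit gives the key and the offset, so the surrounding bytes can be decoded with the same key to recover the full string.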

Binary Pattern Detection

# Scan for embedded files
binwalk document.doc

# Extract embedded files
binwalk -e document.doc

# Entropy analysis
binwalk -E document.doc
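
Entropy analysis measures how random each region of a file looks; values close to 8 bits per byte usually indicate compressed or encrypted data. The following sketch computes Shannon entropy per block. The 1024-byte block size is an arbitrary choice for illustration, not a binwalk setting.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty or constant input)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def entropy_blocks(data: bytes, block_size: int = 1024) -> list:
    """Return (offset, entropy) for each consecutive block of the input."""
    return [
        (offset, shannon_entropy(data[offset:offset + block_size]))
        for offset in range(0, len(data), block_size)
    ]
```

A sudden jump to high entropy in the middle of an otherwise low-entropy OLE2 file is a hint that an encoded payload is embedded there.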

Basic Static Analysis

Now we move into active analysis, examining the document's internal structure without executing any code.

OLE2 Document Analysis (Legacy Formats)

Using oleid:

# Identify suspicious characteristics
oleid document.doc

# Run the same checks against a spreadsheet
oleid document.xls

The oleid tool checks for encrypted content, VBA macros, external relationships, and other indicators of malicious documents.

Using oledump:

# List all streams in OLE file
python3 oledump.py document.doc

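# Hex/ASCII dump of a specific stream (8 is an example index from the listing)
python3 oledump.py -s 8 document.doc

# Decompress and print the VBA source of a macro stream
python3 oledump.py -s 8 -v document.doc

# Dump the raw stream bytes to a file
python3 oledump.py -s 8 -d document.doc > stream8.bin

# Scan all streams with YARA rules
python3 oledump.py -y custom-office-rules.yar document.doc

Streams marked with 'M' contain VBA macro code (a lowercase 'm' means the module holds only attribute statements) - that's where the action usually is.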

Extracting VBA Macros:

# Extract and analyze macros
olevba document.doc

# Show only the analysis summary, not the macro source
olevba -a document.doc

# Display obfuscated strings with their decoded content
olevba --decode document.doc

# Attempt to deobfuscate VBA expressions (slow)
olevba --deobf document.doc

# Output to JSON
olevba -j document.doc > analysis.json

The olevba tool automatically identifies suspicious patterns like auto-execution triggers, shellcode, obfuscation, and network indicators.
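
One obfuscation pattern that is easy to undo by hand is a string built from concatenated Chr() calls. The sketch below decodes only that pattern from VBA source that has already been extracted. It is far narrower than what olevba does, and the function name decode_chr_chains is made up for this example.

```python
import re

# One Chr/ChrW/ChrB call (optionally with the $ suffix) with a literal decimal argument.
CHR_CALL = r"Chr[WB]?\$?\(\s*(\d{1,5})\s*\)"
# Two or more such calls joined with & or +.
CHR_CHAIN = re.compile(rf"{CHR_CALL}(?:\s*[&+]\s*{CHR_CALL})+", re.IGNORECASE)


def decode_chr_chains(vba_source: str) -> list:
    """Return the strings spelled out by chains of Chr() calls."""
    decoded = []
    for chain in CHR_CHAIN.finditer(vba_source):
        codes = re.findall(CHR_CALL, chain.group(0), re.IGNORECASE)
        decoded.append("".join(chr(int(code)) for code in codes))
    return decoded
```

Arguments that are expressions rather than literals (for example Chr(100 + 4)) are not handled here; that is where the emulation-style deobfuscation in olevba earns its keep.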

Metadata Extraction:

# Extract OLE metadata
olemeta document.doc

# Same for a spreadsheet
olemeta document.xls

OOXML Document Analysis (.docx, .xlsx, .pptx)

OOXML files are ZIP archives, so we can unzip and examine their contents directly.

# Unzip to examine structure
unzip document.docx -d extracted/

# List archive contents
zipdump.py document.docx

# Extract specific file
zipdump.py -s 5 -d document.docx > extracted-file.xml

# Dump relationships
zipdump.py -s 2 -d document.docx

# Search for macros
zipdump.py document.docm | grep -i "vba"

XML Content Analysis:

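# Pretty-print a part selected with zipdump.py (5 is an example index)
zipdump.py -s 5 -d document.docx | xmldump.py pretty

# Extract the text of a Word document part
zipdump.py -s 5 -d document.docx | xmldump.py wordtext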

# Search for external references
unzip -p document.docx word/_rels/document.xml.rels

# Check for embedded objects
unzip -l document.docx | grep -i "embeddings"
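
External references are declared in the .rels parts with TargetMode="External", which is also how remote template injection shows up. As a rough illustration, the sketch below walks every .rels part in the archive and lists the external targets. The function name external_relationships is hypothetical, and only the standard library is used.

```python
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def external_relationships(source) -> list:
    """Return (rels_part, relationship_type, target) for external relationships.

    source is a path or a binary file-like object.
    """
    found = []
    with zipfile.ZipFile(source) as zf:
        for name in zf.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(zf.read(name))
            for rel in root.iter(REL_NS + "Relationship"):
                if rel.get("TargetMode", "").lower() == "external":
                    # Keep only the last path segment of the type URI, e.g. attachedTemplate.
                    rel_type = rel.get("Type", "").rsplit("/", 1)[-1]
                    found.append((name, rel_type, rel.get("Target", "")))
    return found
```

Ordinary hyperlinks also appear as external relationships, so the relationship type matters: an attachedTemplate or oleObject entry pointing at a remote URL is much more interesting than a hyperlink.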

RTF Document Analysis

RTF files require specialized tools due to their unique structure.

# List RTF objects and groups
rtfdump.py document.rtf

# Show object details
rtfdump.py -O document.rtf

# Extract specific object
rtfdump.py -s 5 -H -d document.rtf > extracted-object.bin

# Show only entries that contain objects
rtfdump.py -f O document.rtf

# Extract all OLE objects
rtfobj document.rtf

# Save extracted objects
rtfobj -s all document.rtf

I've found numerous CVE exploits hiding in RTF objdata control words. The rtfobj tool automatically extracts and flags suspicious embedded objects.
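
To show what sits behind an objdata control word, here is a minimal sketch that hex-decodes each \objdata blob and checks whether an OLE2 header is inside. It assumes clean, unobfuscated hex; real samples often interleave whitespace tricks and extra control words, which is exactly what rtfobj is built to handle. The function names are made up for this example.

```python
import re

OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
# \objdata followed by hex digits, possibly broken up by whitespace.
OBJDATA = re.compile(rb"\\objdata\s*([0-9A-Fa-f\s]+)")


def extract_objdata(rtf: bytes) -> list:
    """Return the decoded bytes of every \\objdata blob found in the RTF."""
    blobs = []
    for match in OBJDATA.finditer(rtf):
        hex_text = re.sub(rb"\s+", b"", match.group(1))
        hex_text = hex_text[: len(hex_text) // 2 * 2]  # drop a trailing odd nibble
        blobs.append(bytes.fromhex(hex_text.decode("ascii")))
    return blobs


def contains_ole2(blob: bytes) -> bool:
    """True when the decoded blob embeds a Compound File Binary header."""
    return OLE2_MAGIC in blob
```

A blob that contains an OLE2 header can be carved from that offset and passed to oledump.py for stream-level inspection.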

Advanced Static Analysis

Macro Analysis and Deobfuscation

MacroRaptor - Automated Threat Scoring:

# Analyze macro maliciousness
mraptor document.doc

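# Show the strings that triggered each flag
mraptor -m document.doc

# Scan multiple files
mraptor *.doc

MacroRaptor does not produce a numeric score. It sets three flags - A (auto-execution trigger), W (writes to the file system or memory), and X (executes a file or payload) - and reports a macro as SUSPICIOUS when A is combined with W or X.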
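
The decision rule mraptor documents is flag-based: a macro is suspicious when an auto-execution trigger (A) is combined with a write (W) or an execute (X) capability. The sketch below applies that rule with a deliberately small keyword subset chosen for illustration; these are not mraptor's actual pattern lists, and the function name triage is hypothetical.

```python
import re

# Illustrative keyword subsets only; the real tool uses much longer lists.
AUTOEXEC = re.compile(r"\b(Auto_?Open|Auto_?Close|Document_Open|Workbook_Open)\b", re.I)
WRITE = re.compile(r"\b(FileCopy|CreateTextFile|SaveToFile|URLDownloadToFile\w*)\b", re.I)
EXECUTE = re.compile(r"\b(Shell|ShellExecute\w*|CreateObject|CallByName)\b", re.I)


def triage(vba_source: str) -> tuple:
    """Return (flags, suspicious) where flags looks like 'A-X' or '---'."""
    flags = "".join(
        letter if pattern.search(vba_source) else "-"
        for letter, pattern in (("A", AUTOEXEC), ("W", WRITE), ("X", EXECUTE))
    )
    suspicious = flags[0] == "A" and (flags[1] == "W" or flags[2] == "X")
    return flags, suspicious
```

The rule explains why a macro that calls Shell from a helper routine with no auto-execution trigger is not flagged: without A, nothing runs the code when the document opens.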

Excel 4.0 Macro Analysis:

# Deobfuscate XLM macros
xlmdeobfuscator --file document.xlsm

# Customize the output line for each formula
xlmdeobfuscator --file document.xlsm --output-formula-format "[[CELL-ADDR]]: [[INT-FORMULA]]"

# Only extract the macro cells, without emulation
xlmdeobfuscator --file document.xlsm --extract-only

# Non-interactive mode
xlmdeobfuscator --file document.xlsm --noninteractive

Excel 4.0 macros are a nightmare because they're stored in worksheet cells, not VBA modules. This tool emulates Excel's calculation engine to reveal hidden logic.

DDE/DDEAUTO Link Detection

# Detect DDE links
msodde document.doc

# Extract DDE commands
msodde -j document.doc

# Show only DDE/DDEAUTO fields for every file in a directory
for f in samples/*; do msodde -d "$f"; done

Dynamic Data Exchange exploits don't require macros but can still execute commands. This is a critical vector to check when you perform threat hunting on legacy systems.
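
In a .docx file, DDE commands are stored as field instructions inside word/document.xml, and attackers often split the instruction text across several runs. The following is a simplified sketch that reassembles top-level fields and keeps the ones starting with DDE or DDEAUTO. It ignores nested fields and other document parts, and the function name find_dde_fields is made up for this example.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
DDE_FIELD = re.compile(r"^\s*DDE(AUTO)?\b", re.IGNORECASE)


def find_dde_fields(source) -> list:
    """Return the instruction text of DDE/DDEAUTO fields in word/document.xml."""
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    # Simple fields carry the whole instruction in one attribute.
    fields = [el.get(W + "instr", "") for el in root.iter(W + "fldSimple")]

    # Complex fields: join every instrText between a "begin" and the next
    # "separate" or "end" marker, in document order.
    buffer = None
    for el in root.iter():
        if el.tag == W + "fldChar":
            kind = el.get(W + "fldCharType")
            if kind == "begin":
                buffer = []
            elif kind in ("separate", "end") and buffer is not None:
                fields.append("".join(buffer))
                buffer = None
        elif el.tag == W + "instrText" and buffer is not None:
            buffer.append(el.text or "")

    return [field.strip() for field in fields if DDE_FIELD.match(field)]
```

Anything this returns should be treated as a command line to review, since the first argument of a DDE field is the application to launch.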

Encrypted Document Handling

# Test if encrypted (-v prints the result)
msoffcrypto-tool document.docx --test -v

# Decrypt with password
msoffcrypto-tool document.docx decrypted.docx -p "password123"

# Dictionary attack with a wordlist (one candidate per line)
while IFS= read -r pw; do
  msoffcrypto-tool document.docx decrypted.docx -p "$pw" 2>/dev/null && echo "Password: $pw" && break
done < passwords.txt
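
A password-protected OOXML file is no longer a ZIP archive on disk: it is wrapped in an OLE2 container that holds EncryptionInfo and EncryptedPackage streams. The heuristic below checks for that layout with a raw byte search instead of parsing the OLE2 directory, so treat it as a quick triage sketch and rely on msoffcrypto-tool for the real answer. The function name is hypothetical.

```python
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
STREAM_NAMES = ("EncryptedPackage", "EncryptionInfo")


def looks_like_encrypted_ooxml(data: bytes) -> bool:
    """Heuristic: OLE2 header plus both encryption stream names.

    OLE2 directory entries store names as UTF-16LE, so that is the
    encoding searched for.
    """
    if not data.startswith(OLE2_MAGIC):
        return False
    return all(name.encode("utf-16-le") in data for name in STREAM_NAMES)
```

This also explains why a protected .docx fails to unzip and why sandboxes that cannot supply the password see nothing to detonate.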

YARA Rule Scanning

# Install YARA rules
git clone https://github.com/Yara-Rules/rules.git yara-rules

# Scan document with the maldocs rule index (yara expects a rules file, not a directory)
yara yara-rules/maldocs_index.yar document.doc

# Use custom rules
yara custom-office-rules.yar document.doc

# Recursive directory scan
yara -r office-rules.yar samples/

YARA rules help identify known malware families and suspicious patterns. I maintain a custom ruleset for Office exploits I encounter regularly.

Plugin-Based Analysis

# Parse BIFF records in Excel files (where Excel 4.0 macros live)
python3 oledump.py -p plugin_biff document.xls

# Report encryption details for protected documents
python3 oledump.py -p plugin_office_crypto document.doc

# Parse PowerPoint records
python3 oledump.py -p plugin_ppt document.ppt

Dynamic Analysis

Dynamic analysis involves executing the malicious document in a controlled environment to observe behavior. Always use isolated systems for this.

Local Sandbox Analysis

Cuckoo Sandbox:

# Submit for analysis
cuckoo submit document.doc

# Pick a guest VM with the Office version you need (machine name is an example)
cuckoo submit --machine win7-office2016 document.docx

# Route guest traffic through a simulated network
cuckoo submit --options "route=inetsim" document.xls

# View results in the web interface
cuckoo web

Manual Execution Monitoring:

# Monitor with Process Monitor (Windows)
procmon.exe /AcceptEula /Minimized /BackingFile C:\logs\capture.pml

# Network capture with tcpdump (Linux)
tcpdump -i eth0 -w capture.pcap &

# File system monitoring
inotifywait -m -r /tmp/

# Registry monitoring (Windows)
regshot.exe

Online Dynamic Analysis Services

ANY.RUN (Interactive Sandbox):

Upload document to https://any.run
Select Windows version (7/10/11)
Choose Office version
Enable network simulation
Interact with document in real-time
View process tree, network connections, dropped files
Download IOC reports

Hybrid Analysis:

Submit to https://hybrid-analysis.com
Automated behavioral analysis
MITRE ATT&CK mapping
Memory dumps available
API submission supported

# API submission
curl -X POST https://www.hybrid-analysis.com/api/v2/submit/file \
  -H "api-key: YOUR_API_KEY" \
  -F "file=@document.doc" \
  -F "environment_id=120"

Joe Sandbox:

Enterprise-grade analysis at https://www.joesandbox.com
Advanced anti-evasion techniques
Detailed behavior reports
Memory forensics
YARA rule generation

VirusTotal:

# Upload via API
curl --request POST \
  --url https://www.virustotal.com/api/v3/files \
  --header 'x-apikey: YOUR_API_KEY' \
  --form file=@document.doc

# Check hash
curl --request GET \
  --url https://www.virustotal.com/api/v3/files/FILE_HASH \
  --header 'x-apikey: YOUR_API_KEY'

VirusTotal provides multi-AV scanning and community insights but remember - submissions are visible to others.

Tria.ge:

Fast automated analysis at https://tria.ge
Free and commercial tiers
Excellent for quick triage
Detailed behavioral reports
PCAP downloads

urlscan.io (For document links):

# Scan URLs found in documents
curl -X POST "https://urlscan.io/api/v1/scan/" \
  -H "API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "http://malicious-url.com"}'

IOC Extraction

After analysis, compile indicators of compromise for detection and hunting. You can organize these in your incident response reports for better tracking.

# Extract URLs from analysis
olevba --decode document.doc | grep -oE 'https?://[^ ]+'

# Get file hashes
md5sum document.doc
sha1sum document.doc
sha256sum document.doc

# Extract IP addresses
strings document.doc | grep -oE '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b'

# Identify domains
strings document.doc | grep -oE '[a-zA-Z0-9.-]+\.(com|net|org|io|xyz)'

# Extract registry keys
olevba document.doc | grep -i "HKEY"

# Find file paths
strings document.doc | grep -E '^[A-Za-z]:\\'

IOC Documentation:

File hashes (MD5, SHA1, SHA256)
URLs and domains contacted
IP addresses
Registry modifications
Dropped file names and paths
Mutex names
Scheduled tasks created
PowerShell commands executed
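
The grep patterns above accept impossible addresses such as 999.300.1.1, so it helps to validate matches before they go into a report. The sketch below is one way to collect hashes, URLs, and validated IPv4 addresses with the standard library. The function names and the exact URL pattern are illustrative choices, not part of any tool mentioned here.

```python
import hashlib
import ipaddress
import re

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def file_hashes(data: bytes) -> dict:
    """Return MD5, SHA1, and SHA256 hex digests of the sample bytes."""
    return {name: hashlib.new(name, data).hexdigest() for name in ("md5", "sha1", "sha256")}


def extract_iocs(text: str) -> dict:
    """Return de-duplicated URLs and syntactically valid IPv4 addresses."""
    urls = sorted(set(URL_RE.findall(text)))
    ips = set()
    for candidate in IPV4_RE.findall(text):
        try:
            ips.add(str(ipaddress.IPv4Address(candidate)))
        except ipaddress.AddressValueError:
            pass  # octet out of range, e.g. 999.300.1.1
    return {"urls": urls, "ipv4": sorted(ips)}
```

Feed extract_iocs the text output of olevba or strings, and file_hashes the raw bytes of the document itself.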

Exploitation Perspective

From a red team perspective, understanding these analysis techniques helps us anticipate what defenders will catch and choose tradecraft accordingly:

Evasion Techniques:

Encrypt VBA strings with custom algorithms
Use Excel 4.0 macros instead of VBA
Leverage template injection for remote payload delivery
Implement time-based triggers to evade sandboxes
Check for virtualization before executing
Use legitimate Office features like external data connections

Delivery Methods:

Password-protected documents (bypass sandboxes)
Polyglot files (valid Office + malicious PE)
Malicious macros in hidden sheets
DDE/DDEAUTO for macro-free execution

Common Payloads:

Cobalt Strike beacons
Meterpreter shells
PowerShell Empire agents
Custom C2 implants

Detection & Mitigation

Defensive Measures:

Email Gateway Filtering:

Block macro-enabled documents from external sources
Scan with multiple AV engines
Sandbox suspicious attachments

Endpoint Protection:

Disable macros by default
Implement Application Control (AppLocker/WDAC)
Enable Attack Surface Reduction rules

User Training:

Phishing awareness programs
Report suspicious documents
Never enable macros from unknown sources

Network Monitoring:

Alert on Office processes spawning shells
Monitor for uncommon parent-child relationships
Block known malicious IPs/domains

YARA Rules Deployment:

# Deploy custom rules at gateway
yara -r office-malware-rules.yar /var/mail/incoming/

Detection Signatures:

# Suspicious macro patterns
grep -r "AutoOpen\|Auto_Open\|Document_Open" extracted-macros/

# PowerShell invocation
grep -ri "powershell\|WScript.Shell" document-strings.txt

# Obfuscation indicators
grep -ri "Chr(.*Chr(.*Chr(" vba-code.txt
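
The "Office processes spawning shells" alert from the defensive measures can be prototyped against exported process-creation events. This is a minimal sketch, assuming the events are already available as (parent image, child image) pairs. The process lists are illustrative starting points and the function name suspicious_spawns is hypothetical.

```python
OFFICE_PARENTS = {"winword.exe", "excel.exe", "powerpnt.exe", "outlook.exe"}
SUSPICIOUS_CHILDREN = {
    "cmd.exe", "powershell.exe", "wscript.exe", "cscript.exe",
    "mshta.exe", "rundll32.exe", "regsvr32.exe",
}


def _basename(image_path: str) -> str:
    """Lower-cased file name of a Windows or POSIX style path."""
    return image_path.replace("\\", "/").rsplit("/", 1)[-1].lower()


def suspicious_spawns(events) -> list:
    """Return the (parent, child) pairs where an Office app launched a shell or script host."""
    return [
        (parent, child)
        for parent, child in events
        if _basename(parent) in OFFICE_PARENTS and _basename(child) in SUSPICIOUS_CHILDREN
    ]
```

In production this logic usually lives in an EDR or SIEM rule; the sketch is mainly useful for replaying sandbox process trees and checking that a planned rule would have fired.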

Well, that's the complete workflow I use for analyzing malicious Office documents. The key is following a methodical approach - start passive, move to static analysis, then carefully proceed to dynamic execution. Every document tells a story, and with these tools, you'll be able to read it.
