Analyze Malicious Office Documents: The Complete Guide
Updated on 2026-02-08
Microsoft Office documents have been a favorite delivery mechanism for attackers since the 90s. I've seen everything from macro droppers to weaponized RTF exploits in my assessments. These files slip past perimeter defenses because users trust .docx, .xlsx, and .pptx files. Understanding how to dissect these documents is critical for any red teamer or malware analyst.
In this guide, I'll walk you through the complete analysis workflow I use when investigating suspicious Office documents. You'll learn how to extract metadata, identify embedded macros, deobfuscate malicious code, and safely detonate samples in controlled environments. This isn't just theory - these are the exact techniques I apply during incident response and threat hunting engagements.
Whether you're dealing with phishing campaigns, targeted attacks, or analyzing threat actor TTPs, this methodology will help you uncover what's hiding in those innocent-looking spreadsheets and presentations.
Note: Analyze malicious samples only in an isolated environment, with proper authorization, and in line with your organization's legal and ethical guidelines.
Prerequisites
Analysis Environment:
- Isolated VM or dedicated malware analysis system (REMnux, FLARE VM, or Ubuntu)
- No network connectivity to production systems
- Snapshots enabled for quick rollback
Required Tools:
# Install the oletools suite
pip install oletools
# Get Didier Stevens' tools (oledump.py, zipdump.py, rtfdump.py, xmldump.py)
git clone https://github.com/DidierStevens/DidierStevensSuite.git
cd DidierStevensSuite
# Install additional tools
apt install exiftool binwalk yara -y
pip install msoffcrypto-tool XLMMacroDeobfuscator
Port Information
Microsoft Office documents commonly arrive via:
- Email attachments (SMTP - Port 25/587)
- Web downloads (HTTP/HTTPS - Port 80/443)
- File shares (SMB - Port 445)
- Cloud storage links
Document Types Overview
Legacy OLE2 Format (.doc, .xls, .ppt):
- Compound File Binary Format (CFBF)
- Structured storage with multiple streams
- Commonly contains VBA macros
Office Open XML (.docx, .xlsx, .pptx):
- ZIP container with XML files
- Introduced in Office 2007
- Can contain macros in .docm, .xlsm, .pptm variants
Rich Text Format (.rtf):
- Plain text with control words
- Historically exploited via embedded objects
- No macro support but can contain OLE objects
Excel 4.0 Macros (XLM):
- Legacy macro format still supported
- Live in macro sheets inside .xls, .xlsm, and .xlsb workbooks
- Often used to evade modern detection
- Stored in sheet cells, not VBA modules
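Since extensions are trivially spoofed, I like to confirm the container type from the leading bytes before picking a tool. Here is a minimal sketch of that check. The function name and return labels are my own, and it only looks at the well-known OLE2, ZIP, and RTF signatures, so treat it as a first-pass filter rather than a full format parser.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # Compound File Binary header
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header (OOXML container)
RTF_MAGIC = b"{\\rtf"                            # start of an RTF document


def sniff_office_format(data: bytes) -> str:
    """Classify a file by its first bytes instead of its extension."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    # Some RTF files carry leading whitespace before the header.
    if data.lstrip().startswith(RTF_MAGIC):
        return "rtf"
    return "unknown"
```

A "zip" result only says the file is a ZIP archive. Whether it is really OOXML still has to be confirmed by looking for [Content_Types].xml inside it.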
Initial Information Gathering
Before diving deep, I always start with basic reconnaissance to understand what I'm dealing with. This phase is completely passive and safe.
File Type Identification
# Identify true file type
file suspicious-document.docx
# Show the MIME type
file -i suspicious-document.docx
# Check if password-protected (-v prints the result)
msoffcrypto-tool suspicious-document.xlsx --test -v
Metadata Extraction
# Extract EXIF metadata
exiftool document.docx
# View all metadata fields
exiftool -a -G1 document.xlsx
# Check for author information
exiftool -Author -Creator -LastModifiedBy document.pptx
# Extract timestamps
exiftool -CreateDate -ModifyDate -MetadataDate document.doc
Metadata often reveals valuable intelligence - author names, software versions, creation dates, and modification history. Tools like ExifTool are essential here. I've seen malware campaigns where all samples shared the same author field, making attribution easier.
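For OOXML files, the author and timestamp fields live in docProps/core.xml, so they can also be pulled without ExifTool. The sketch below is one way to do that with the standard library. The function name and the selection of fields are my own choices, and it assumes the file is a well-formed ZIP.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(data: bytes) -> dict:
    """Return selected fields from docProps/core.xml of an OOXML file."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))
    result = {}
    for name, tag in FIELDS.items():
        node = root.find(tag, NS)
        if node is not None and node.text:
            result[name] = node.text
    return result
```

The standard-library XML parser is not hardened against hostile input, so for bulk processing of untrusted samples a hardened parser such as defusedxml is worth considering.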
String Analysis
# Basic strings extraction
strings document.doc > strings-output.txt
# UTF-16LE strings (OOXML parts are compressed, so run this on OLE files or on unzipped parts)
strings -el document.doc > unicode-strings.txt
# Search for URLs
strings document.doc | grep -i "http"
# Look for suspicious commands
strings document.doc | grep -iE "powershell|cmd|wscript|mshta"
# Find IP addresses
strings document.doc | grep -oE '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b'
XOR String Search
Many documents use XOR encoding to hide malicious strings from basic analysis.
# Install xorsearch
wget https://didierstevens.com/files/software/xorsearch_V1_11_1.zip
unzip xorsearch_V1_11_1.zip
# Search for XOR-encoded strings
./xorsearch document.doc http
# Search for specific patterns
./xorsearch document.doc powershell
# Save the decoded file for each match (XORSearch already tries every key by default)
./xorsearch -s document.doc malware
# Search for encoded URLs
./xorsearch document.doc "://"
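To make the idea behind the XOR search concrete, here is a small sketch of a single-byte XOR brute force. It is not how XORSearch itself is implemented, and it covers only the XOR case, not the rotate and shift encodings. The function name is hypothetical.

```python
def xor_search(data: bytes, needle: bytes):
    """Yield (key, offset) wherever `needle` appears under a single-byte XOR key.

    Encoding the short needle 256 times is cheaper than decoding the
    whole file 256 times, and finds the same matches.
    """
    for key in range(256):
        encoded = bytes(b ^ key for b in needle)
        offset = data.find(encoded)
        while offset != -1:
            yield key, offset
            offset = data.find(encoded, offset + 1)
```

Key 0 corresponds to the plaintext, so a hit with key 0 just means the string was never encoded.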
Binary Pattern Detection
# Scan for embedded files
binwalk document.doc
# Extract embedded files
binwalk -e document.doc
# Entropy analysis
binwalk -E document.doc
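The entropy view is just Shannon entropy computed over sliding chunks of the file. A minimal sketch of that calculation follows. The 1024-byte window is an arbitrary choice of mine, not binwalk's setting.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, 8.0 for uniformly random bytes."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def windowed_entropy(data: bytes, window: int = 1024):
    """Return (offset, entropy) for each fixed-size chunk of the input."""
    return [
        (offset, shannon_entropy(data[offset:offset + window]))
        for offset in range(0, len(data), window)
    ]
```

Regions that sit close to 8 bits per byte inside an otherwise low-entropy OLE file are worth a closer look, since they often indicate compressed or encrypted payloads.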
Basic Static Analysis
Now we move into active analysis, examining the document's internal structure without executing any code.
OLE2 Document Analysis (Legacy Formats)
Using oleid:
# Identify suspicious characteristics
oleid document.doc
# Run the same checks on a spreadsheet
oleid document.xls
The oleid tool checks for encrypted content, VBA macros, external relationships, and other indicators of malicious documents.
Using oledump:
# List all streams in OLE file
python3 oledump.py document.doc
# Hex/ASCII view of a specific stream
python3 oledump.py -s 8 document.doc
# Decompress a macro stream and print its VBA source
python3 oledump.py -s 8 -v document.doc
# Dump the raw stream to a file
python3 oledump.py -s 8 -d document.doc > stream8.bin
# Scan all streams with YARA rules
python3 oledump.py -y office-rules.yar document.doc
Streams marked with 'M' contain VBA macros - that's where the action usually is. A lowercase 'm' marks a module that holds only attribute statements.
Extracting VBA Macros:
# Extract and analyze macros
olevba document.doc
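# Show only the analysis summary, without the macro source
olevba -a document.doc
# Decode obfuscated strings (Hex, Base64, StrReverse, Dridex)
olevba --decode document.doc
# Attempt to deobfuscate VBA expressions (slow)
olevba --deobf document.doc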
# Output to JSON
olevba -j document.doc > analysis.json
The olevba tool automatically identifies suspicious patterns like auto-execution triggers, shellcode, obfuscation, and network indicators.
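One of the most common obfuscation tricks in VBA is building strings out of Chr() calls. The sketch below shows how such chains can be collapsed for reading. It is a toy resolver of my own, not olevba's deobfuscation engine, and it handles only literal integer arguments joined with & or +.

```python
import re

# Two or more Chr()/ChrW()/ChrB() calls joined by & or +
CHR_CHAIN = re.compile(
    r"(?:Chr[WB]?\$?\(\s*\d+\s*\)\s*[&+]\s*)+Chr[WB]?\$?\(\s*\d+\s*\)",
    re.IGNORECASE,
)
CHR_CALL = re.compile(r"Chr[WB]?\$?\(\s*(\d+)\s*\)", re.IGNORECASE)


def resolve_chr_chains(vba: str) -> str:
    """Replace Chr(..) & Chr(..) chains with the quoted string they build."""
    def collapse(match):
        codes = [int(code) for code in CHR_CALL.findall(match.group(0))]
        if any(code > 0x10FFFF for code in codes):
            return match.group(0)  # not a valid code point, leave untouched
        return '"' + "".join(chr(code) for code in codes) + '"'

    return CHR_CHAIN.sub(collapse, vba)
```

The output is meant for human review only. It does not escape embedded quotes, so it is not guaranteed to be valid VBA.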
Metadata Extraction:
# Extract OLE metadata
olemeta document.doc
# Works the same way for other OLE files
olemeta document.xls
OOXML Document Analysis (.docx, .xlsx, .pptx)
OOXML files are ZIP archives, so we can unzip and examine their contents directly.
# Unzip to examine structure
unzip document.docx -d extracted/
# List archive contents
zipdump.py document.docx
# Extract specific file
zipdump.py -s 5 -d document.docx > extracted-file.xml
# Dump relationships
zipdump.py -s 2 -d document.docx
# Search for macros
zipdump.py document.docm | grep -i "vba"
XML Content Analysis:
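# Pretty-print an XML part (xmldump.py reads XML from stdin and takes a command)
zipdump.py -s 3 -d document.docx | xmldump.py pretty
# Extract the text of a Word document part
zipdump.py -s 3 -d document.docx | xmldump.py wordtext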
# Search for external references
unzip -p document.docx word/_rels/document.xml.rels
# Check for embedded objects
unzip -l document.docx | grep -i "embeddings"
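The external-reference check is worth automating, because remote template injection shows up as a relationship with TargetMode="External" in one of the .rels parts. Here is a sketch that walks every .rels file in the archive. The function name and tuple layout are my own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def external_relationships(data: bytes):
    """List (rels_part, relationship_type, target) for every external relationship."""
    findings = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if not name.endswith(".rels"):
                continue
            root = ET.fromstring(archive.read(name))
            for rel in root.iter(REL_NS + "Relationship"):
                if rel.get("TargetMode") == "External":
                    # Keep only the last path segment of the type URI, e.g. "attachedTemplate"
                    rel_type = rel.get("Type", "").rsplit("/", 1)[-1]
                    findings.append((name, rel_type, rel.get("Target", "")))
    return findings
```

Plain hyperlinks also appear as external relationships, so the relationship type matters: attachedTemplate, oleObject, and frame targets pointing at remote hosts deserve the most attention.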
RTF Document Analysis
RTF files require specialized tools due to their unique structure.
# List RTF objects and groups
rtfdump.py document.rtf
# Show object details
rtfdump.py -O document.rtf
# Extract specific object
rtfdump.py -s 5 -H -d document.rtf > extracted-object.bin
# Show only entries that contain objects
rtfdump.py -f O document.rtf
# Extract all OLE objects
rtfobj document.rtf
# Save extracted objects
rtfobj -s all document.rtf
I've found numerous CVE exploits hiding in RTF objdata control words. The rtfobj tool automatically extracts and flags suspicious embedded objects.
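To show what sits behind an objdata control word, here is a deliberately naive sketch that pulls the hex payload out and decodes it. Real samples obfuscate this heavily (nested groups, stray control words inside the hex), which is why rtfobj and rtfdump.py exist. The regex and function name are my own simplifications.

```python
import re

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# \objdata followed by hex digits that may be broken up by whitespace
OBJDATA_RE = re.compile(rb"\\objdata\s*([0-9A-Fa-f\s]+)")


def extract_objdata(rtf: bytes):
    """Return the decoded bytes of each \\objdata hex blob found in the RTF."""
    blobs = []
    for match in OBJDATA_RE.finditer(rtf):
        hex_digits = re.sub(rb"\s+", b"", match.group(1))
        # Drop a trailing odd nibble so the hex decoding cannot fail
        hex_digits = hex_digits[: len(hex_digits) - len(hex_digits) % 2]
        if hex_digits:
            blobs.append(bytes.fromhex(hex_digits.decode("ascii")))
    return blobs
```

Checking `OLE2_MAGIC in blob` on each result is a quick way to tell whether an embedded object wraps a full OLE2 document.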
Advanced Static Analysis
Macro Analysis and Deobfuscation
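MacroRaptor - Automated Macro Triage:
# Analyze macro maliciousness
mraptor document.doc
# Show the strings that matched
mraptor -m document.doc
# Scan multiple files
mraptor *.doc
MacroRaptor does not produce a numeric score. It sets three flags, A (auto-execute trigger), W (writes to the file system or memory), and X (executes a file or payload), and reports a macro as SUSPICIOUS when A is combined with W or X.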
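As I understand the oletools documentation, MacroRaptor's verdict comes from three flags: A for auto-execution, W for writing, X for executing, with a macro treated as suspicious when A is combined with W or X. The sketch below illustrates that decision rule. The keyword lists are short illustrative stand-ins of mine, not mraptor's actual patterns.

```python
import re

# Simplified, illustrative keyword sets (mraptor's real patterns are far more complete)
AUTO_EXEC = re.compile(
    r"\b(AutoOpen|Auto_Open|AutoExec|AutoClose|Document_Open|Workbook_Open)\b", re.IGNORECASE
)
WRITE = re.compile(
    r"\b(CreateTextFile|SaveToFile|FileCopy|URLDownloadToFile)\b", re.IGNORECASE
)
EXECUTE = re.compile(
    r"\b(Shell|ShellExecute|CreateObject|GetObject)\b", re.IGNORECASE
)


def triage_macro(vba: str):
    """Return (flags, suspicious) using the rule A and (W or X)."""
    a = bool(AUTO_EXEC.search(vba))
    w = bool(WRITE.search(vba))
    x = bool(EXECUTE.search(vba))
    flags = ("A" if a else "-") + ("W" if w else "-") + ("X" if x else "-")
    return flags, a and (w or x)
```

The rule explains why a macro that calls Shell from an ordinary button handler is not flagged: without an auto-execution trigger, nothing runs until the user acts.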
Excel 4.0 Macro Analysis:
# Deobfuscate XLM macros
xlmdeobfuscator --file document.xlsm
# Customize how each emulated formula is printed
xlmdeobfuscator --file document.xlsm --output-formula-format "[[CELL-ADDR]]: [[INT-FORMULA]]"
# List macro cells without emulating them
xlmdeobfuscator --file document.xlsm --extract-only
# Non-interactive mode
xlmdeobfuscator --file document.xlsm --noninteractive
Excel 4.0 macros are a nightmare because they're stored in worksheet cells, not VBA modules. This tool emulates Excel's calculation engine to reveal hidden logic.
DDE/DDEAUTO Link Detection
# Detect DDE links
msodde document.doc
# Extract DDE commands
msodde -j document.doc
# Analyze multiple files
for f in samples/*; do msodde "$f"; done
Dynamic Data Exchange exploits don't require macros but can still execute commands. This is a critical vector to check when you perform threat hunting on legacy systems.
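In .docx files, DDE commands are stored as field instructions, and attackers often split the keyword across several runs to dodge simple string matching. The sketch below joins the instruction text per paragraph before looking for DDE or DDEAUTO. It is a simplified stand-in for what msodde does, with a hypothetical function name, and it does not handle nested QUOTE-field obfuscation.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
DDE_RE = re.compile(r"\bDDE(AUTO)?\b", re.IGNORECASE)


def find_dde_fields(data: bytes, part: str = "word/document.xml"):
    """Return field instructions in a .docx part that invoke DDE or DDEAUTO."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        if part not in archive.namelist():
            return []
        root = ET.fromstring(archive.read(part))
    candidates = []
    # Complex fields: instruction text may be split across several runs
    for paragraph in root.iter(W_NS + "p"):
        joined = "".join(node.text or "" for node in paragraph.iter(W_NS + "instrText"))
        candidates.append(joined.strip())
    # Simple fields keep the instruction in an attribute
    for simple in root.iter(W_NS + "fldSimple"):
        candidates.append(simple.get(W_NS + "instr", "").strip())
    return [text for text in candidates if DDE_RE.search(text)]
```

Headers, footers, and footnotes are separate parts, so a thorough check would call this for each of them via the `part` argument.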
Encrypted Document Handling
# Test if encrypted (-v prints the result)
msoffcrypto-tool document.docx --test -v
# Decrypt with password
msoffcrypto-tool document.docx decrypted.docx -p "password123"
# Brute force with wordlist (read line by line so passwords containing spaces survive)
while IFS= read -r pw; do
msoffcrypto-tool document.docx decrypted.docx -p "$pw" 2>/dev/null && echo "Password: $pw" && break
done < passwords.txt
YARA Rule Scanning
# Install YARA rules
git clone https://github.com/Yara-Rules/rules.git yara-rules
# Scan document with the repository's maldoc index (yara expects a rule file, not a directory)
yara yara-rules/maldocs_index.yar document.doc
# Use custom rules
yara custom-office-rules.yar document.doc
# Recursive directory scan
yara -r office-rules.yar samples/
YARA rules help identify known malware families and suspicious patterns. I maintain a custom ruleset for Office exploits I encounter regularly.
Plugin-Based Analysis
# Use oledump plugins for deep analysis (BIFF records in .xls files)
python3 oledump.py -p plugin_biff document.xls
# Report encryption details
python3 oledump.py -p plugin_office_crypto document.doc
# Parse PowerPoint document streams
python3 oledump.py -p plugin_ppt document.ppt
Dynamic Analysis
Dynamic analysis involves executing the malicious document in a controlled environment to observe behavior. Always use isolated systems for this.
Local Sandbox Analysis
Cuckoo Sandbox:
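# Submit for analysis
cuckoo submit document.doc
# Pick the guest VM that has the Office version you need (machine name is an example)
cuckoo submit --machine win10-office2016 document.docx
# Route guest traffic to the internet (requires the Cuckoo rooter)
cuckoo submit --options "route=internet" document.xls
# Browse results in the web interface
cuckoo web runserver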
Manual Execution Monitoring:
# Monitor with Process Monitor (Windows)
procmon.exe /AcceptEula /Minimized /BackingFile C:\logs\capture.pml
# Network capture with tcpdump (Linux)
tcpdump -i eth0 -w capture.pcap &
# File system monitoring
inotifywait -m -r /tmp/
# Registry monitoring (Windows)
regshot.exe
Online Dynamic Analysis Services
ANY.RUN (Interactive Sandbox):
- Upload document to https://any.run
- Select Windows version (7/10/11)
- Choose Office version
- Enable network simulation
- Interact with document in real-time
- View process tree, network connections, dropped files
- Download IOC reports
Hybrid Analysis:
- Submit to https://hybrid-analysis.com
- Automated behavioral analysis
- MITRE ATT&CK mapping
- Memory dumps available
- API submission supported
# API submission (the API expects the Falcon Sandbox user agent)
curl -X POST https://www.hybrid-analysis.com/api/v2/submit/file \
-H "api-key: YOUR_API_KEY" \
-H "User-Agent: Falcon Sandbox" \
-F "file=@document.doc" \
-F "environment_id=120"
Joe Sandbox:
- Enterprise-grade analysis at https://www.joesandbox.com
- Advanced anti-evasion techniques
- Detailed behavior reports
- Memory forensics
- YARA rule generation
VirusTotal:
# Upload via API
curl --request POST \
--url https://www.virustotal.com/api/v3/files \
--header 'x-apikey: YOUR_API_KEY' \
--form file=@document.doc
# Check hash
curl --request GET \
--url https://www.virustotal.com/api/v3/files/FILE_HASH \
--header 'x-apikey: YOUR_API_KEY'
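VirusTotal provides multi-AV scanning and community insights, but remember that uploaded samples become available to other VirusTotal users. For sensitive or targeted samples, look up the hash first and upload only if disclosure is acceptable.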
Tria.ge:
- Fast automated analysis at https://tria.ge
- Free and commercial tiers
- Excellent for quick triage
- Detailed behavioral reports
- PCAP downloads
urlscan.io (For document links):
# Scan URLs found in documents
curl -X POST "https://urlscan.io/api/v1/scan/" \
-H "API-Key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "http://malicious-url.com"}'
IOC Extraction
After analysis, compile indicators of compromise for detection and hunting. You can organize these in your incident response reports for better tracking.
# Extract URLs from analysis
olevba --decode document.doc | grep -oE 'https?://[^ ]+'
# Get file hashes
md5sum document.doc
sha1sum document.doc
sha256sum document.doc
# Extract IP addresses
strings document.doc | grep -oE '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b'
# Identify domains
strings document.doc | grep -oE '[a-zA-Z0-9.-]+\.(com|net|org|io|xyz)'
# Extract registry keys
olevba document.doc | grep -i "HKEY"
# Find file paths
strings document.doc | grep -E '^[A-Za-z]:\\'
IOC Documentation:
- File hashes (MD5, SHA1, SHA256)
- URLs and domains contacted
- IP addresses
- Registry modifications
- Dropped file names and paths
- Mutex names
- Scheduled tasks created
- PowerShell commands executed
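The static part of that list (hashes, URLs, IP addresses) is easy to collect in one pass. Here is a sketch that does so and returns a dictionary ready to be dumped as JSON. The function name and output layout are my own, and the regexes are intentionally simple, so expect some noise on real samples.

```python
import hashlib
import ipaddress
import re

URL_RE = re.compile(rb"https?://[^\s\"'<>]+", re.IGNORECASE)
IPV4_RE = re.compile(rb"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_iocs(data: bytes) -> dict:
    """Collect file hashes plus URLs and valid IPv4 addresses found in the raw bytes."""
    ips = set()
    for candidate in IPV4_RE.findall(data):
        try:
            # Rejects out-of-range octets such as 999.1.1.1
            ips.add(str(ipaddress.IPv4Address(candidate.decode("ascii"))))
        except ValueError:
            continue
    urls = {match.decode("ascii", "replace") for match in URL_RE.findall(data)}
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "urls": sorted(urls),
        "ipv4": sorted(ips),
    }
```

Unlike the grep pattern, the IPv4 check validates each octet, which removes false positives such as version strings with out-of-range numbers. Runtime indicators like mutexes and scheduled tasks still have to come from the sandbox reports.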
Exploitation Perspective
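From a red team perspective, understanding these analysis techniques shows which tradecraft analysts are likely to catch. These are the evasion techniques, delivery methods, and payloads you should expect to see in real samples: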
Evasion Techniques:
- Encrypt VBA strings with custom algorithms
- Use Excel 4.0 macros instead of VBA
- Leverage template injection for remote payload delivery
- Implement time-based triggers to evade sandboxes
- Check for virtualization before executing
- Use legitimate Office features like external data connections
Delivery Methods:
- Password-protected documents (bypass sandboxes)
- Polyglot files (valid Office + malicious PE)
- Malicious macros in hidden sheets
- DDE/DDEAUTO for macro-free execution
Common Payloads:
- Cobalt Strike beacons
- Meterpreter shells
- PowerShell Empire agents
- Custom C2 implants
Detection & Mitigation
Defensive Measures:
- Email Gateway Filtering:
  - Block macro-enabled documents from external sources
  - Scan with multiple AV engines
  - Sandbox suspicious attachments
- Endpoint Protection:
  - Disable macros by default
  - Implement Application Control (AppLocker/WDAC)
  - Enable Attack Surface Reduction rules
- User Training:
  - Phishing awareness programs
  - Report suspicious documents
  - Never enable macros from unknown sources
- Network Monitoring:
  - Alert on Office processes spawning shells
  - Monitor for uncommon parent-child relationships
  - Block known malicious IPs/domains
- YARA Rules Deployment:
# Deploy custom rules at gateway
yara -r office-malware-rules.yar /var/mail/incoming/
Detection Signatures:
# Suspicious macro patterns
grep -r "AutoOpen\|Auto_Open\|Document_Open" extracted-macros/
# PowerShell invocation
grep -ri "powershell\|WScript.Shell" document-strings.txt
# Obfuscation indicators
grep -ri "Chr(.*Chr(.*Chr(" vba-code.txt
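The "Office processes spawning shells" alert from the defensive measures above boils down to a parent-child lookup on process creation events. A minimal sketch of that logic follows. The process lists are my own starting point and would need tuning for a real environment, and the function assumes you already have parent and child image paths from your EDR or Sysmon logs.

```python
import ntpath

OFFICE_PARENTS = {"winword.exe", "excel.exe", "powerpnt.exe", "outlook.exe"}
SUSPICIOUS_CHILDREN = {
    "cmd.exe", "powershell.exe", "wscript.exe", "cscript.exe",
    "mshta.exe", "rundll32.exe", "regsvr32.exe",
}


def is_suspicious_spawn(parent_image: str, child_image: str) -> bool:
    """True when an Office application launches a script host or shell."""
    parent = ntpath.basename(parent_image).lower()
    child = ntpath.basename(child_image).lower()
    return parent in OFFICE_PARENTS and child in SUSPICIOUS_CHILDREN
```

Matching on the file name alone is easy to evade by copying and renaming binaries, so in practice this check is best paired with hash or signer information.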
Well, that's the complete workflow I use for analyzing malicious Office documents. The key is following a methodical approach - start passive, move to static analysis, then carefully proceed to dynamic execution. Every document tells a story, and with these tools, you'll be able to read it.
Enjoyed this guide? Share your thoughts below and tell us how you analyze malicious Office documents in your own work!

