# Challenge Overview: The Cicada Archives
**Category:** Forensics
**Event:** L3m0nCTF 2025
**Role:** Challenge Author
> 🛠️ **Author Note**
> This challenge was authored by me for **L3m0nCTF 2025**.
> The following explanation describes the **intended multi-layer forensic analysis path**.
## Intended Analysis Path
The challenge was designed to test:
- recognition of document formats as structured containers
- detection of invisible data embedded in normal-looking files
- correlation of unrelated forensic artifacts across multiple layers
- reduction of large noisy datasets to isolate anomalies
- reconstruction of a fragmented narrative from subtle clues
Direct or surface-level inspection of any single file was intentionally insufficient.
## Analysis Phase 1: Establishing the Scope
We are given a single archive, [TheCicadaArchives.tar.gz](https://github.com/rozariyomartin/L3m0nCTF2025-Writeups/blob/main/Forensics/TheCicadaArchives/TheCicadaArchives.tar.gz), containing three files:
- `whiteletter.docx`
- `archive_2021.bin`
- `evidence.zip`
At first glance, everything looks ordinary. No obvious corruption, no visible clues, no readable flags.
This challenge is about looking past what's visible, and understanding that data can be hidden inside structure, noise, and normality.
## Analysis Phase 2: Document Container Inspection
Opening the document normally reveals nothing interesting, just plain text.
This immediately suggests the content is not meant to be read directly.
A `.docx` file is actually a ZIP archive, so we extract it:
```
unzip whiteletter.docx -d whiteletter
```
Inside, we inspect the Word XML files. The footer is a common hiding place:
`whiteletter/word/footer.xml`
Reading it visually still shows nothing suspicious.
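
The same container inspection can be scripted. The following is a minimal sketch that assumes only Python's standard `zipfile` module; the helper names `list_parts` and `read_part` are illustrative and not part of the challenge files.

```python
import zipfile


def list_parts(docx_path):
    """Return the names of all parts stored inside a .docx (ZIP) package."""
    with zipfile.ZipFile(docx_path) as package:
        return package.namelist()


def read_part(docx_path, part_name):
    """Return the raw bytes of one part, for example 'word/footer.xml'."""
    with zipfile.ZipFile(docx_path) as package:
        return package.read(part_name)
```

For example, `read_part("whiteletter.docx", "word/footer.xml")` would return the footer XML as bytes without extracting anything to disk.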
## Analysis Phase 3: Hidden Unicode Signal Extraction
Since nothing visible stands out, the next step is to look for **invisible or non-ASCII characters**.
We search for zero-width Unicode characters:
```
grep -P "[\x{200C}\x{200D}]" footer.xml
```
Nothing visible is printed, but output exists, meaning invisible characters are present.
We extract only those characters:
```
grep -oP "[\x{200C}\x{200D}]" footer.xml > zw.txt
```
A quick hex dump confirms they are real:
```
xxd zw.txt | head
```
We see repeating patterns of:
- e2 80 8c → U+200C
- e2 80 8d → U+200D
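
The same check can be done without `grep` and `xxd`. The sketch below counts zero-width code points in a decoded string. Only U+200C and U+200D appear in this challenge; the other two entries in the table are an assumption added for general use, and the name `count_zero_width` is illustrative.

```python
from collections import Counter

# Code points that render as nothing. U+200C and U+200D are the two seen in
# this challenge; U+200B and U+FEFF are included as an assumption.
ZERO_WIDTH = {
    "\u200b": "ZERO WIDTH SPACE",
    "\u200c": "ZERO WIDTH NON-JOINER",
    "\u200d": "ZERO WIDTH JOINER",
    "\ufeff": "ZERO WIDTH NO-BREAK SPACE",
}


def count_zero_width(text):
    """Count each zero-width code point that appears in text."""
    return Counter(ch for ch in text if ch in ZERO_WIDTH)
```

Calling it on the decoded contents of `footer.xml` should report a non-zero count for both carriers if the file holds hidden data.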
## Analysis Phase 4: Zero-Width Character Decoding
These two characters can naturally represent binary:
- U+200C → 0
- U+200D → 1
We decode them as binary bytes:
```
python3 - << 'EOF'
data = open("zw.txt", "r", encoding="utf-8").read()

# Map each zero-width character to one bit.
bits = ""
for c in data:
    if c == '\u200c':
        bits += "0"
    elif c == '\u200d':
        bits += "1"

# Group the bits into bytes and convert each full byte to a character.
out = ""
for i in range(0, len(bits), 8):
    byte = bits[i:i+8]
    if len(byte) == 8:
        out += chr(int(byte, 2))
print(out)
EOF
```
Output:
```
morseindocx
```
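
To see why this mapping is reversible, here is a round-trip sketch. The `encode` function is an assumption about how such a payload could be produced (8 bits per ASCII byte, most significant bit first); the write-up does not state how the footer was actually generated.

```python
ZERO, ONE = "\u200c", "\u200d"  # bit mapping used in this write-up


def encode(message):
    """Turn ASCII text into zero-width characters, 8 bits per byte."""
    bits = "".join(f"{byte:08b}" for byte in message.encode("ascii"))
    return "".join(ONE if bit == "1" else ZERO for bit in bits)


def decode(hidden):
    """Recover the text; characters other than the two carriers are ignored."""
    bits = "".join("1" if ch == ONE else "0" for ch in hidden if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # drop any incomplete trailing byte
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, usable, 8))
```

Because `decode` skips everything except the two carriers, it can be run directly on text that mixes visible and invisible characters.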
This is clearly a password.
Brute-force attempts using common wordlists were intentionally ineffective.
## Analysis Phase 5: Password-Protected Artifact Recovery
Using the recovered password:
```
unzip evidence.zip
```
Password:
``morseindocx``
The archive extracts several files.
## Analysis Phase 6: Network Traffic Correlation
Opening the capture in Wireshark shows heavy, realistic traffic:
- DNS
- TCP
- ICMP
- Multiple IPs and hosts
Nothing obvious stands out initially.
#### DNS Analysis
We filter DNS packets with the display filter `dns`.
Among many legitimate domains, one entry stands out subtly:
``frg3.tdn01s3s1gn4l.net``
This does not look random; it looks constructed.
Extracting the fragment:
```
tdn01s3s1gn4l
```
The `frg3` label marks this as the 3rd **fragment**.
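
When many hostnames have to be checked, the split can be automated. This sketch assumes a label convention of `frg<N>.<fragment>.<tld>`, inferred from the single hostname above; the pattern and the name `parse_fragment_host` are illustrative.

```python
import re

# Assumed convention, inferred from one hostname: "frg<N>.<fragment>.<tld>".
FRAGMENT_HOST = re.compile(r"^frg(\d+)\.([^.]+)\.[^.]+$")


def parse_fragment_host(hostname):
    """Return (index, fragment) for a matching hostname, else None."""
    # A trailing dot (fully qualified form) is tolerated; case is preserved.
    match = FRAGMENT_HOST.match(hostname.strip().rstrip("."))
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

Ordinary hostnames such as `www.example.com` return `None`, so the function can be mapped over every queried name in the capture.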
## Analysis Phase 7: Secondary Signal Discovery
Still in the PCAP, we inspect HTTP traffic.
We follow the TCP streams (Right-click → Follow → HTTP Stream).
One HTTP request contains a custom header, which gives us the password:
```
inspectnext
```
## Analysis Phase 8: Image-Based Data Extraction
Using the recovered password:
```
steghide extract -sf img002.jpg -p inspectnext
```
This extracts `fragment2.txt`.
**Contents:**
```
tt3r_1nsp3c
```
## Analysis Phase 9: Large-Scale Log Reduction
#### 1. Initial Analysis
We are provided with a file named `massive_server.log`. A quick check using `ls -lh` reveals that the file is quite large (over 150 MB) and contains more than 1 million lines.
Reading the file manually with `cat` or `less` is futile because of the sheer volume of data. The logs simulate a busy server environment with various formats:
- Apache/Nginx Access Logs
- Syslog messages (kernel, sshd, cron)
- JSON structured logs
- Java Stack Traces
- Hex Dumps
Since we don't know what string to search for (like "flag" or "L3M0N"), a simple grep won't work.
#### 2. The Trap
A common first attempt is to look for unique lines using `sort | uniq -u`. However, running this command returns almost the entire file.
Why? Every log line contains dynamic variables:
- Timestamps: ``[14:22:01]`` vs ``[14:22:02]``
- IP Addresses: ``192.168.1.5`` vs ``10.0.0.2``
- Request IDs/UUIDs: ``trace_id: "b394a2f7..."``
To a computer, these lines are all "unique," even if they are generated by the same logging event.
#### 3. The Solution: Log Reduction
To find the needle, we don't look for the needle; we look for the haystack. We need to perform **Frequency Analysis**.
By identifying the "templates" that generate the noise, we can mathematically filter them out. If 99.9% of the file follows 5 standard patterns, the flag will be the one line that follows a pattern appearing only once.
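
The idea can be seen on a toy scale. The four log lines below are hypothetical and written only for this illustration: as raw strings they are all distinct, but once timestamps and IPs are masked, three of them collapse into one template and the fourth stands alone.

```python
import re
from collections import Counter

# Hypothetical sample lines, invented for this illustration only.
SAMPLE = [
    "[14:22:01] 192.168.1.5 GET /index.html 200",
    "[14:22:02] 10.0.0.2 GET /index.html 200",
    "[14:22:03] 10.0.0.7 GET /index.html 200",
    "[14:22:04] 10.0.0.9 NOTE unexpected entry",
]


def mask(line):
    """Replace timestamps and IPv4 addresses with fixed placeholders."""
    line = re.sub(r"\d{2}:\d{2}:\d{2}", "{TIME}", line)
    return re.sub(r"\d{1,3}(?:\.\d{1,3}){3}", "{IP}", line)


raw_counts = Counter(SAMPLE)  # every raw line is distinct
template_counts = Counter(mask(line) for line in SAMPLE)

# Templates seen exactly once are the candidates worth reading by hand.
rare = [template for template, n in template_counts.items() if n == 1]
print(len(raw_counts), len(template_counts), rare)
```

Here there are 4 distinct raw lines but only 2 templates, and the single rare template is the `NOTE` line.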
We wrote a Python script to **normalize** the logs, replacing all variables (numbers, IPs, UUIDs, dates) with generic placeholders like ``{VAR}``.
The Solver Script (``solve.py``)
```
import re
from collections import Counter

# --- CONFIGURATION ---
LOG_FILE = "massive_server.log"  # Make sure this matches your file name


def normalize_log(line):
    """
    Aggressively strips variable data to reveal the 'skeleton' of the log.
    """
    line = line.strip()
    # 1. DETECT HEX DUMPS (The lines with | at the end)
    # If it starts with hex address and ends with ascii representation
    if re.search(r'^[0-9a-fA-F]{4,8}\s+[0-9a-fA-F]{2}', line) and "|" in line:
        return "HEX_DUMP_LINE"
    # 2. STRIP UUIDs (The long trace_id strings)
    # Pattern: 8-4-4-4-12 hex characters
    line = re.sub(r'[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}', '{UUID}', line)
    # 3. STRIP TIMESTAMPS & IPs
    line = re.sub(r'\d{4}-\d{2}-\d{2}', '{DATE}', line)
    line = re.sub(r'\d{2}:\d{2}:\d{2}', '{TIME}', line)
    line = re.sub(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', '{IP}', line)
    # 4. AGGRESSIVE: Strip any word containing a number (e.g., "User123", "0x4f", "thread-5")
    # This also turns the encoded fragment "28h3Jkh..." into "{VAR}"
    line = re.sub(r'\b\w*\d\w*\b', '{VAR}', line)
    # 5. Clean up multiple spaces
    return " ".join(line.split())


def solve():
    print(f"Scanning {LOG_FILE} with aggressive filters...")
    skeleton_counts = Counter()
    skeleton_examples = {}
    try:
        with open(LOG_FILE, "r", encoding="utf-8", errors="ignore") as f:
            for line in f:
                clean_line = line.strip()
                if not clean_line:
                    continue
                # Get the skeleton
                skeleton = normalize_log(clean_line)
                # Count it
                skeleton_counts[skeleton] += 1
                # Save the first example we see of this type
                if skeleton not in skeleton_examples:
                    skeleton_examples[skeleton] = clean_line
    except FileNotFoundError:
        print(f"Error: Could not find '{LOG_FILE}'. Check the filename.")
        return
    print("\n--- RESULTS: The Rarest Log Entries ---")
    # Print the 5 rarest skeletons, rarest first
    # The flag should be among the entries with Count: 1
    found_any = False
    for skeleton, count in skeleton_counts.most_common()[:-6:-1]:
        found_any = True
        print(f"[Count: {count}]")
        print(f"Skeleton: {skeleton}")
        print(f"ORIGINAL: {skeleton_examples[skeleton]}")
        print("-" * 50)
    if not found_any:
        print("No results found. Is the file empty?")


if __name__ == "__main__":
    solve()
```
#### 4. Execution & Result
Running the script lists the rarest skeletons; the anomaly is the entry whose pattern appears only once.
#### 5. The Flag
The anomaly contained the hidden message:
``can you see this 28h3JkhN8IVHxjDI4R8F5R``
The encoded fragment can be identified as Base62 and decoded accordingly.
```
FRAG4: s_l0gg3d}
```
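
For readers who want to script the decoding step, here is a minimal Base62 sketch. Base62 has no single standard alphabet, so the order used here (digits, then uppercase, then lowercase) is an assumption, as is treating the string as one big-endian integer; other tools may need a different `alphabet` argument. The function names are illustrative.

```python
import string

# Assumed alphabet order: digits, uppercase, lowercase. Base62 has no single
# standard, so other tools may order the three groups differently.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase


def base62_decode(text, alphabet=ALPHABET):
    """Interpret text as a big-endian base-62 number and return its bytes."""
    value = 0
    for ch in text:
        value = value * 62 + alphabet.index(ch)
    length = (value.bit_length() + 7) // 8
    return value.to_bytes(length, "big")


def base62_encode(data, alphabet=ALPHABET):
    """Inverse of base62_decode for data without leading zero bytes."""
    value = int.from_bytes(data, "big")
    digits = ""
    while value:
        value, rem = divmod(value, 62)
        digits = alphabet[rem] + digits
    return digits or alphabet[0]
```

If the decoded bytes are not readable text, trying the other common ordering (digits, lowercase, uppercase) is a reasonable next step.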
## Analysis Phase 10: Fragment Reassembly
The first part of the flag can already be recovered during the initial analysis, even with `strings`:
```
L3m0nCTF{wh1t3l3
```
Joining the four fragments in order gives the full flag.
**Flag : ``L3m0nCTF{wh1t3l3tt3r_1nsp3ctdn01s3s1gn4ls_l0gg3d}``**
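
As a quick check, the reassembly can be written as a short sketch. The dictionary keys are the fragment positions described in this write-up; the variable names are illustrative.

```python
# Fragments as recovered in the phases above, keyed by their position.
fragments = {
    3: "tdn01s3s1gn4l",     # DNS hostname label (Phase 6)
    1: "L3m0nCTF{wh1t3l3",  # found with strings (Phase 10)
    4: "s_l0gg3d}",         # decoded log anomaly (Phase 9)
    2: "tt3r_1nsp3c",       # fragment2.txt from the image (Phase 8)
}

# Join the fragments in index order, regardless of discovery order.
flag = "".join(fragments[index] for index in sorted(fragments))
print(flag)
```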