Commit f82e8b3

Fix message-id parsing
We stripped unknown characters from message IDs to cope with some old, not-quite-standard mbox files, but we stripped too much, including some valid message-ID characters. This caused no internal issues, but the upstream archive links did not work for the affected messages.

This commit fixes the problem in two parts:
1. New mbox imports use a new regex that accepts the correct character set.
2. A one-time helper script fixes the existing database using the mbox files.
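
To illustrate the over-stripping, here is a minimal sketch that applies the old and new character classes from this commit to a sample ID. The ID is made up for illustration, and the constant and method names are hypothetical (they do not exist in the app). `=` and `/` are valid `atext` characters under RFC 5322, and the old class removed both.

```ruby
# Character classes copied from the diff in this commit.
# Constant and method names are hypothetical, for illustration only.
OLD_DISALLOWED = /[^A-Za-z0-9.@_+%-]/
NEW_DISALLOWED = /[^A-Za-z0-9.!#$%&'*+\/=?^_`{|}~@-]/

# Strip with the pre-fix character class.
def strip_old(id)
  id.gsub(OLD_DISALLOWED, '')
end

# Strip with the post-fix character class.
def strip_new(id)
  id.gsub(NEW_DISALLOWED, '')
end

# A made-up message ID containing '=' and '/'.
SAMPLE_ID = "CAB+x=Yz/123@mail.example.com"

puts strip_old(SAMPLE_ID)  # '=' and '/' are dropped
puts strip_new(SAMPLE_ID)  # the ID comes back unchanged
```

An ID altered this way no longer matches the one the upstream archive knows, which is consistent with the broken links described above.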
1 parent 993438b commit f82e8b3

2 files changed

Lines changed: 84 additions & 2 deletions

app/services/email_ingestor.rb

Lines changed: 2 additions & 2 deletions
@@ -189,8 +189,8 @@ def clean_reference(ref)
       ref_str = matches.last&.first || ref_str
     end
 
-    # Allow common msg-id characters and strip anything else.
-    ref_str.gsub(/[^A-Za-z0-9.@_+%-]/, '')
+    # Allow RFC 5322 msg-id atext plus dot and @, strip anything else.
+    ref_str.gsub(/[^A-Za-z0-9.!#$%&'*+\/=?^_`{|}~@-]/, '')
   end
 
   def fallback_thread_lookup(subject, message_id:, references:, sent_at:)
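
The hunk above shows only the tail of `clean_reference`. As a rough sketch of the whole cleaning step, assuming the method body matches the `clean_reference_new` copy in the helper script from this commit, the following standalone version runs without Rails. The method name `clean_reference_sketch` and the header value are hypothetical.

```ruby
# Standalone copy of the cleaning logic, mirroring clean_reference_new
# from the helper script in this commit. The name is hypothetical.
def clean_reference_sketch(ref)
  return '' if ref.nil?

  ref_str = ref.to_s
  if ref_str.include?('<')
    # Keep the content of the last <...> group; fall back to the
    # whole string when no non-empty group is found.
    matches = ref_str.scan(/<([^>]+)>/)
    ref_str = matches.last&.first || ref_str
  end

  # Remove everything outside the allowed msg-id character set.
  ref_str.gsub(/[^A-Za-z0-9.!#$%&'*+\/=?^_`{|}~@-]/, '')
end

# Made-up header value holding two IDs.
puts clean_reference_sketch("<first@example.org> <second=2/b@example.org>")
puts clean_reference_sketch(nil).inspect
```

When a value holds several bracketed IDs, only the last one survives, and stray characters such as whitespace or unmatched brackets are removed by the final `gsub`.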

script/mbox_fix_message_ids.rb

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+require_relative "../config/environment"
+# One-time script to fix the incorrectly stripped message IDs
+
+
+def extract_message_id(raw_message)
+  mail = Mail.new(raw_message)
+  mail.message_id
+rescue StandardError
+  nil
+end
+
+def clean_reference_old(ref)
+  return '' if ref.nil?
+
+  ref_str = ref.to_s
+  if ref_str.include?('<')
+    matches = ref_str.scan(/<([^>]+)>/)
+    ref_str = matches.last&.first || ref_str
+  end
+
+  ref_str.gsub(/[^A-Za-z0-9.@_+%-]/, '')
+end
+
+def clean_reference_new(ref)
+  return '' if ref.nil?
+
+  ref_str = ref.to_s
+  if ref_str.include?('<')
+    matches = ref_str.scan(/<([^>]+)>/)
+    ref_str = matches.last&.first || ref_str
+  end
+
+  ref_str.gsub(/[^A-Za-z0-9.!#$%&'*+\/=?^_`{|}~@-]/, '')
+end
+
+def process_message(raw_message)
+  message_id = extract_message_id(raw_message)
+  return if message_id.nil?
+
+  old_id = clean_reference_old(message_id)
+  new_id = clean_reference_new(message_id)
+  return if old_id.blank? || new_id.blank?
+  return if old_id == new_id
+
+  msg = Message.find_by_message_id(old_id)
+  return unless msg
+
+  existing = Message.find_by_message_id(new_id)
+  if existing && existing.id != msg.id
+    puts "SKIP: #{msg.id} #{old_id} -> #{new_id} (already claimed by #{existing.id})"
+    return
+  end
+
+  msg.update!(message_id: new_id)
+  puts "FIX: #{msg.id} #{old_id} -> #{new_id}"
+end
+
+def process_mbox(path)
+  message = ""
+  File.open(path, "r") do |f|
+    f.each_line do |line|
+      line = line.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
+      if line.match(/^From [^@]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+/i)
+        process_message(message) unless message.empty?
+        message = ""
+      else
+        message << line
+      end
+    end
+  end
+  process_message(message) unless message.empty?
+end
+
+if ARGV.empty?
+  puts "Usage: ruby script/mbox_fix_message_ids.rb <mbox file> [<mbox file> ...]"
+  exit 1
+end
+
+ARGV.each do |fn|
+  puts "Processing #{fn}..."
+  process_mbox(fn)
+end
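
The splitting loop in `process_mbox` can be tried in isolation. The sketch below reuses the separator regex from the script but reads from a `StringIO` and collects the raw messages instead of calling `process_message`. The helper name `split_mbox`, the constant names, and the sample mailbox are all hypothetical, and the ISO-8859-1 transcoding step is left out for brevity.

```ruby
require "stringio"

# Separator test copied from process_mbox in this commit.
MBOX_FROM_LINE = /^From [^@]+@[a-z\d\-]+(\.[a-z\d\-]+)*\.[a-z]+/i

# Hypothetical helper: returns the raw messages found in an IO object
# rather than handing each one to process_message.
def split_mbox(io)
  messages = []
  current = +""
  io.each_line do |line|
    if line.match?(MBOX_FROM_LINE)
      # A separator line closes the message collected so far.
      messages << current unless current.empty?
      current = +""
    else
      current << line
    end
  end
  # Flush the final message, which has no trailing separator.
  messages << current unless current.empty?
  messages
end

# Made-up two-message mailbox.
SAMPLE_MBOX = <<~MBOX
  From alice@example.org Mon Jan  1 00:00:00 2024
  Message-ID: <a=1@example.org>

  first body
  From bob@example.org Tue Jan  2 00:00:00 2024
  Message-ID: <b/2@example.org>

  second body
MBOX

split_mbox(StringIO.new(SAMPLE_MBOX)).each_with_index do |raw, i|
  puts "message #{i + 1}: #{raw.lines.first.chomp}"
end
```

Because the separator requires an address after `From `, an ordinary body line such as "From now on..." does not split a message, while an unescaped body line that starts with `From ` followed by an address would.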
