Investigate whether URLs in the file table could be hashed #4

Open
opened 2023-10-11 11:23:28 +00:00 by cadence · 0 comments
Owner

Hashing the URLs would save a bit of storage space. Each row would store a 64-bit integer (8 bytes storage + serial types) instead of a full URL (80-120 bytes storage + serial types).

When I did this for the reactions table, I analysed whether a birthday attack could happen. I should review those conclusions again for the file table. Tolerances could be sloppier on the reactions table, but the file table may need a larger safety margin, since it would be confusing if the wrong file appeared.
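As a rough starting point for that review, here is a minimal sketch of the standard birthday-bound approximation for a 64-bit hash. It assumes the hashes behave like uniformly random values, so it says nothing about deliberately crafted inputs. The function name and the file counts are illustrative only.

```python
import math

HASH_BITS = 64  # width of the hash being considered


def collision_probability(n: int, bits: int = HASH_BITS) -> float:
    """Approximate chance that at least two of n uniformly random hashes collide."""
    pairs = n * (n - 1) / 2
    # Birthday approximation: p ~ 1 - exp(-pairs / 2**bits).
    # expm1 keeps precision when the result is extremely small.
    return -math.expm1(-pairs / 2**bits)


# Illustrative table sizes; the first matches the file count in the stats.
for n in (2_842, 1_000_000, 1_000_000_000):
    print(f"{n:>13,} files -> p(collision) ~ {collision_probability(n):.3e}")
```

For a few thousand files the estimate is on the order of 2 in 10^13. It only becomes noticeable once the table approaches billions of rows.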

xxhash is not cryptographic. Specially crafted file names on the Discord side might be able to produce a collision in xxhash and cause the wrong file to be used.


Stats:

My bridge has been running for around 60 days with 2842 registered files. select sum(length(discord_url)) from file gives 270091 characters stored in URLs.

If I used a 64-bit integer instead, it would store 22736 bytes for all URLs. That's about a 90% reduction. Pretty neat!

The current approach costs roughly 125 kB of excess storage per month. That's not terrible.
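The arithmetic behind these figures can be reproduced with a short sketch. It assumes one byte per character (ASCII URLs) and ignores SQLite's per-row overhead; the variable names are only illustrative.

```python
FILES = 2842          # registered files, from the stats above
URL_CHARS = 270_091   # sum(length(discord_url)), from the stats above
DAYS_RUNNING = 60     # approximate uptime, from the stats above
HASH_BYTES = 8        # one 64-bit integer per row

# Assumption: one byte per character, i.e. the stored URLs are ASCII.
hashed_bytes = FILES * HASH_BYTES
saved_bytes = URL_CHARS - hashed_bytes
reduction = saved_bytes / URL_CHARS
saved_per_30_days = saved_bytes * 30 / DAYS_RUNNING

print(hashed_bytes)                              # 22736
print(f"{reduction:.1%}")                        # 91.6%
print(f"{saved_per_30_days / 1000:.1f} kB")      # 123.7 kB
```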


Investigation:

This would remove the full URLs from the table. I won't be able to get them out again. If I make the switch, it will be impossible to look up a Discord URL by an MXC URL. I need to investigate if I'd ever need to do that.
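To make the trade-off concrete, here is a small sketch of what a hashed column would and would not allow. Several things in it are assumptions rather than the bridge's real code: the schema and the `hashed_discord_url` and `mxc_url` column names are hypothetical, both URLs are made up, and BLAKE2b truncated to 8 bytes stands in for xxhash so the example runs with only the standard library.

```python
import hashlib
import sqlite3


def url_hash(url: str) -> int:
    """Map a URL to a signed 64-bit integer that fits an SQLite INTEGER column.

    Stand-in hash: BLAKE2b truncated to 8 bytes, NOT xxhash.
    """
    digest = hashlib.blake2b(url.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


db = sqlite3.connect(":memory:")
# Hypothetical schema for illustration only.
db.execute(
    "CREATE TABLE file (hashed_discord_url INTEGER PRIMARY KEY, mxc_url TEXT NOT NULL)"
)

discord_url = "https://cdn.discordapp.com/attachments/1/2/example.png"  # made-up URL
mxc_url = "mxc://example.org/abcdef"  # made-up URL
db.execute("INSERT INTO file VALUES (?, ?)", (url_hash(discord_url), mxc_url))

# Forward lookup (Discord URL -> MXC URL) still works: hash first, then query.
forward = db.execute(
    "SELECT mxc_url FROM file WHERE hashed_discord_url = ?",
    (url_hash(discord_url),),
).fetchone()

# Reverse lookup (MXC URL -> Discord URL) only returns the hash, never the URL.
reverse = db.execute(
    "SELECT hashed_discord_url FROM file WHERE mxc_url = ?", (mxc_url,)
).fetchone()

print(forward[0])  # the MXC URL
print(reverse[0])  # an opaque 64-bit integer
```

The forward direction survives the switch. The reverse direction would need the original URL to be kept somewhere else.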

In the past, it has been useful to compare the URLs in the file table against the IDs in the emoji table. If I make the switch, this will also become impossible. I should investigate this too and see if this is OK.

Reference: cadence/out-of-your-element#4