Commit graph

62 commits

Author SHA1 Message Date
Matthew Holt
7f3c90b71a
Continue WIP interactive mode 2025-10-24 00:00:43 -06:00
Matthew Holt
1f1b60b8b1
Consider time.Local when processing update policies
Fixes unnecessary item updates when repeating an import job
2025-10-10 21:27:53 -06:00
Matthew Holt
fb3d529228
Refactor thumbnail DB handle as well
Fix error when repo property doesn't exist
2025-09-30 14:04:53 -06:00
Matthew Holt
1aed8ca2ca
Fix missing data files in some cases
The refactored processor had a bug where small, binary data files like images < 100 KB would be buffered entirely while peeking, and wouldn't end up being saved as a file. Fixed the logic around that and simplified a bit too.
2025-09-22 14:39:02 -06:00
Matthew Holt
c8c1b65ce2
Try generating thumbhashes during import pipeline
Also show loading spinner for videos
2025-09-18 09:07:05 -06:00
Matthew Holt
2b5fd57259
Proper support for mixed timestamps and time zones
This will be a long-time WIP, but we now support full timestamps with local time offsets, absolute ones with UTC times only, and wall times only.

Several other fixes/enhancements. Making an effort to display time zone in time displays throughout the app.

Can now try to infer time zones during import, which is the default setting.

This will take a while to fully implement but it's a good start. Just have to be really careful about date crafting/manipulation/parsing.
2025-09-12 11:17:49 -06:00
Matthew Holt
a0e7c0eefd
Include time_offset when updating timestamp 2025-09-04 21:55:22 -06:00
Matthew Holt
b3376b5298
Fix pipeline bugs; rethink embeddings
Fixed several bugs introduced by the pipeline refactoring.

Updated goexif2 fork to use my latest commit which fixes not being able to find EXIF data on some JPEG images.

Embeddings now refer to the item they are for, rather than an item referring to a single embedding. This allows items to have multiple embeddings if necessary, which gives us some flexibility when models change/improve, etc.

Also reworked the Python server to use a smaller model (base siglip2 instead of so400m) so that it will fit on more GPUs, including my 4070; as well as a new "DeviceManager" that ChatGPT helped me figure out, to choose GPU when it has enough memory for it, as conditions change.
2025-09-04 21:40:50 -06:00
Matt Holt
a85f47f1a3
Major processor refactor (#112)
* Major processor refactor

- New processing pipeline, vastly simplified
- Several edge case bug fixes related to Google Photos (but applies generally too)
- Major import speed improvements
- UI bug fixes
- Update dependencies

The previous 3-phase pipeline would first check for an existing row in the DB, then decide what to do (insert, update, skip, etc.), then would download data file, then would update the row and apply lots of logic to see if the row was a duplicate, etc. Very messy, actually. The reason was to avoid downloading files that may not need to be downloaded.

In practice, the data almost always needs to be downloaded, and I had to keep hacking on the pipeline to handle edge cases related to concurrency and to not having the data available while making decisions about the item/row. I was able to get all the tests to pass until the final boss: an edge case bug in Google Photos (a very important one, exposed by my wedding album, of all things) that I was unable to fix without a rewrite of the processor.

The problem was that Google Photos splits the data and metadata into separate files, and sometimes separate archives. The filename is in the metadata, and worse yet, there are duplicates if the media appears in different albums/folders, where the only way to know they're a duplicate is by filename+content. Retrieval keys just weren't enough to solve this, and I narrowed it down to a design flaw in the processor. That flaw was downloading the data files in phase 2, after making the decisions about how to handle the item in phase 1, then having to re-apply decision logic in phase 3.

The new processing pipeline downloads the data up front in phase 1 (and there's a phase 0 that splits out some validation/sanitization logic, but is of no major consequence). This can run concurrently for the whole batch. Then in phase 2, we obtain an exclusive write lock on the DB and, now that we have ALL the item information available, we can check for existing row, make decisions on what to do, even rename/move the data file if needed, all in one phase, rather than split across 2 separate phases.

This simpler pipeline still has lots of nuance, but in my testing, imports run much faster! And the code is easy to reason about.

On my system (which is quite fast), I was able to import most kinds of data at a rate of over 2,000 items per second. And for media like Google Photos, it's a 10x increase from before thanks to the concurrency in phase 1: up from about 3-5/second to around 30-50/second, depending on file size.

An import of about 200,000 text messages, including media attachments, finished in about 2 minutes.

My Google Photos library, which used to take almost a whole day, now takes only a couple hours to import. And that's over USB.

Also fixed several other minor bugs/edge cases.

This is a WIP. Some more cleanup and fixes are coming. For example, my solution to fix the Google Photos import bug is currently hard-coded (it happens to work for everything else so far, but is not a good general solution). So I need to implement a general fix for that before this is ready to merge.

* Round out a few corners; fix some bugs

* Appease linter

* Try to fix linter again

* See if this works

* Try again

* See what actually fixed it

* See if allow list is necessary for replace in go.mod

* Ok fine just move it into place

* Refine retrieval keys a bit

* One more test
2025-09-02 11:18:39 -06:00
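The two-phase flow described in a85f47f1a3 (#112) can be sketched roughly as below. This is a minimal illustration, not the project's processor: the `item` and `store` types, the `download` placeholder, and the in-memory map standing in for the DB are all hypothetical. It only shows the shape of the idea: download everything concurrently first, then take one exclusive lock and make every decision with the full item data in hand, de-duplicating by filename+content.

```go
package main

import (
	"fmt"
	"sync"
)

// item is a hypothetical stand-in for a timeline item.
type item struct {
	Name string
	Data []byte // filled in during phase 1
}

// download is a placeholder for fetching an item's data file.
func download(it *item) {
	it.Data = []byte("contents of " + it.Name)
}

// store stands in for the timeline DB. mu plays the role of the
// exclusive write lock; rows is keyed by filename+content.
type store struct {
	mu   sync.Mutex
	rows map[string]bool
}

// processBatch runs phase 1 (concurrent downloads for the whole batch),
// then phase 2 (all insert/skip decisions under a single write lock).
func (s *store) processBatch(batch []*item) (inserted, skipped int) {
	// Phase 1: download data up front, concurrently.
	var wg sync.WaitGroup
	for _, it := range batch {
		wg.Add(1)
		go func(it *item) {
			defer wg.Done()
			download(it)
		}(it)
	}
	wg.Wait()

	// Phase 2: with all item information available, decide what to do.
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, it := range batch {
		key := it.Name + "\x00" + string(it.Data)
		if s.rows[key] {
			skipped++ // duplicate by filename+content
			continue
		}
		s.rows[key] = true
		inserted++
	}
	return inserted, skipped
}

func main() {
	s := &store{rows: make(map[string]bool)}
	batch := []*item{{Name: "a.jpg"}, {Name: "b.jpg"}, {Name: "a.jpg"}}
	fmt.Println(s.processBatch(batch))
}
```

Because no decision is made before the data is present, there is nothing to re-evaluate afterward, which is the simplification the commit message describes.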
Matthew Holt
b365dbbafc
Fix panics with obfuscation 2025-07-09 13:30:50 -06:00
Matthew Holt
336ff7fae0
Fix new lint warnings
Must have been a change in golangci-lint
2025-07-01 15:41:07 -06:00
Matthew Holt
230fcb8583
Avoid inserting/updating with empty (not null) metadata 2025-06-19 09:10:18 -06:00
Matt Holt
def05a6cfa
Revise location processing and improve place entities (#101)
* Revise location processing and place entities

- New, more dynamic, recursive clustering algorithm
- Place entities are globally unique by name
- Higher spatial tolerance for coordinate attributes if entity name is the same (i.e. don't insert new attribute row for coordinate if it's sort of close to another row for that attribute -- but if name is different, then points have to be closer to not insert new attribute row)

There is still a bug where clustering is too aggressive on some data. Looking into it...

* Fix overly aggressive clustering

(...lots of commits that fixed the CI environment which changed things without warning...)
2025-06-17 16:13:44 -06:00
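The name-dependent spatial tolerance from def05a6cfa (#101) might look something like the following sketch. The tolerance values, the `place` type, and the function names are hypothetical; the commit message does not give the actual thresholds or the distance formula used. Haversine distance is swapped in here as a simple way to measure how far apart two points are.

```go
package main

import (
	"fmt"
	"math"
)

const (
	sameNameToleranceMeters = 500.0 // hypothetical: looser when names match
	diffNameToleranceMeters = 50.0  // hypothetical: stricter when names differ
	earthRadiusMeters       = 6371000.0
)

// place is a hypothetical stand-in for a place entity with one coordinate.
type place struct {
	Name     string
	Lat, Lon float64
}

// distanceMeters returns the great-circle (haversine) distance between a and b.
func distanceMeters(a, b place) float64 {
	toRad := func(deg float64) float64 { return deg * math.Pi / 180 }
	dLat := toRad(b.Lat - a.Lat)
	dLon := toRad(b.Lon - a.Lon)
	h := math.Pow(math.Sin(dLat/2), 2) +
		math.Cos(toRad(a.Lat))*math.Cos(toRad(b.Lat))*math.Pow(math.Sin(dLon/2), 2)
	return 2 * earthRadiusMeters * math.Asin(math.Sqrt(h))
}

// needsNewCoordinateRow reports whether incoming is far enough from existing
// to warrant a new coordinate attribute row. Points with the same name are
// allowed to be farther apart before they count as distinct.
func needsNewCoordinateRow(existing, incoming place) bool {
	tolerance := diffNameToleranceMeters
	if existing.Name == incoming.Name {
		tolerance = sameNameToleranceMeters
	}
	return distanceMeters(existing, incoming) > tolerance
}

func main() {
	home := place{Name: "Home", Lat: 40.0000, Lon: -111.0000}
	nearby := place{Name: "Home", Lat: 40.0020, Lon: -111.0000} // roughly 220 m north
	other := place{Name: "Cafe", Lat: 40.0020, Lon: -111.0000}

	fmt.Println(needsNewCoordinateRow(home, nearby)) // same name, within the looser tolerance
	fmt.Println(needsNewCoordinateRow(home, other))  // different name, beyond the stricter tolerance
}
```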
Matthew Holt
4bec2e0b86
Fix lint, tweak email recognition a bit more 2025-06-10 11:19:36 -06:00
Matthew Holt
fa9ad482b3
Place entities from GPX sources; several other improvements/fixes
Location processing is still being revised (WIP).
2025-06-09 17:18:44 -06:00
Matthew Holt
41ff81ceb6
Minor enhancements, fix howStored for items deduped by data file at end of pipeline 2025-05-30 16:20:26 -06:00
Matthew Holt
0c2b069e39
Bit of cleanup/comment enhancing 2025-05-30 11:42:18 -06:00
Matthew Holt
ebc731d221
Vastly speed up imports ?? (WIP) 2025-05-30 11:14:09 -06:00
Matthew Holt
31f003b3d4
Fix metadata updates for items and relationships
Also relocate data files if the item's timestamp changes
2025-05-28 18:09:46 -06:00
Matthew Holt
863d0e978b
Detect and handle corrupt timestamps a little better 2025-05-27 11:24:08 -06:00
Matthew Holt
39afe39a27
Wow data out there be realllly bad 2025-05-25 12:51:21 -06:00
Matthew Holt
1bd7c2a5c8
Fix several bugs related to duplicates, lat/lon tolerances, etc.
Separate altitude out from latlon in unique constraints
2025-05-25 12:36:03 -06:00
Matthew Holt
2b586c56da
Treat lower precision input as unknown for coordinate uncertainty
Rather than treating them as significant 0s
2025-05-23 13:51:52 -06:00
Matthew Holt
9dd00b724c
Use limited decimal precision for decision to reprocess coordinates
Coordinates are arbitrary precision floats, so it is silly to compare, say, 35.320366666667 against 35.320367 and have them not be equal. I have yet to test this, but it should speed up importing duplicate location points since it will skip coordinates that are within about 1 meter of each other.
2025-05-21 15:45:16 -06:00
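The limited-precision comparison from 9dd00b724c can be illustrated with a small sketch. The helper names are hypothetical, and the precision of 5 decimal places is an assumption borrowed from the earlier commit 6a10d23a7c ("5 decimal points for now, or about 1.1 meters"); this commit's message does not state the value it uses.

```go
package main

import (
	"fmt"
	"math"
)

// coordPrecision is the number of decimal places kept when comparing
// coordinates. Assumed here; five places is on the order of 1 meter.
const coordPrecision = 5

// roundCoord rounds a coordinate in decimal degrees to the given number
// of decimal places.
func roundCoord(deg float64, places int) float64 {
	scale := math.Pow(10, float64(places))
	return math.Round(deg*scale) / scale
}

// sameCoord reports whether two coordinates are equal once both are
// rounded to coordPrecision decimal places.
func sameCoord(a, b float64) bool {
	return roundCoord(a, coordPrecision) == roundCoord(b, coordPrecision)
}

func main() {
	fmt.Println(35.320366666667 == 35.320367)          // false
	fmt.Println(sameCoord(35.320366666667, 35.320367)) // true
}
```

One caveat of comparing rounded values: two points that are very close but fall on opposite sides of a rounding boundary still compare as different, so this is a fast filter rather than a true distance check.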
Matthew Holt
d268486f55
Several import fixes; metadata merging
- Quick unit tests for a function related to Google Takeout archives
- We now combine existing metadata with new according to the update policy, instead of either writing all or none of incoming metadata. This merging happens before the DB update query and is a bit of a special case as the policy is applied per-key.
- Special handling for corrupted timestamp in Google Photos data. This is a singular case I haven't observed more of, but seems like a reasonable heuristic. There might be thousands more out there, who knows.
- Fix job creation time (milliseconds)
- Hopefully make repeated imports faster by skipping duplicate items more intelligently based on update policies.
2025-05-19 12:47:18 -06:00
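The per-key metadata merging described in d268486f55 could be sketched as follows. The `UpdatePolicy` type and its two values are hypothetical names chosen for illustration; the commit message does not list the project's actual policies. The point is only that the policy is consulted for each key, rather than writing all or none of the incoming metadata.

```go
package main

import "fmt"

// UpdatePolicy is a hypothetical enum; the real policy names are not
// given in the commit message.
type UpdatePolicy int

const (
	KeepExisting   UpdatePolicy = iota // only fill in keys that are missing
	PreferIncoming                     // incoming values overwrite existing ones
)

// mergeMetadata combines existing and incoming metadata, applying the
// policy key by key. The result is computed before any DB update.
func mergeMetadata(existing, incoming map[string]any, policy UpdatePolicy) map[string]any {
	merged := make(map[string]any, len(existing)+len(incoming))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range incoming {
		_, exists := merged[k]
		if !exists || policy == PreferIncoming {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	existing := map[string]any{"camera": "Pixel 7", "iso": 100}
	incoming := map[string]any{"iso": 200, "lens": "wide"}

	fmt.Println(mergeMetadata(existing, incoming, KeepExisting))
	fmt.Println(mergeMetadata(existing, incoming, PreferIncoming))
}
```

Under either policy the new `lens` key is added; only the handling of the conflicting `iso` key differs.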
Matthew Holt
4838dbd7d3
Huh, gofmt failed me 2025-05-15 13:59:20 -06:00
Matthew Holt
812cfad74d
Modernize a few lines of code 2025-05-15 13:56:30 -06:00
Matthew Holt
ac794cb5f3
Fix unnecessary item updating
Ignore empty/zero-value metadata keys, and consider time+zone separately since they are stored separately in the DB.
2025-05-15 06:35:56 -06:00
Matthew Holt
6a10d23a7c
Fix: timestamps, coordinate precision, map loading
- Timestamp year cannot be > 9999 (JSON serialization panics)
- Lat/lon now considered equivalent after a certain decimal point, since not all sources have high precision (we choose 5 decimal points for now, or about 1.1 meters)
- Map style must be loaded before source is added, apparently (got this error once)
2025-05-14 14:41:08 -06:00
Matthew Holt
874be1a9ca
Add UI for unique constraints and item update preferences 2025-05-12 12:34:48 -06:00
Matthew Holt
ae3a5d02b0
Field update preferences allow more control over item updates 2025-05-09 10:04:03 -06:00
Matthew Holt
ba4635cf7e
Fix data file handling
It wasn't updated properly with the big pipeline refactor
2025-05-04 13:28:20 -06:00
Matthew Holt
3d2222fce2
Fix thumbnail job size count and paging; other minor fixes
Including one fix for a panic introduced by obfuscated logging during processing
2025-05-01 11:15:13 -06:00
Matthew Holt
25712e7c61
Fix thumbnail job size counts 2025-04-28 10:26:59 -06:00
Matthew Holt
38c89f2b0a
Only prepare finished graph log if needful; fix obfuscation bug 2025-04-27 07:49:34 -06:00
Matthew Holt
f0697d2d6b
Refactor embedding jobs; enhance tooltips; upgrade gofakeit to v7
The gofakeit upgrade uses the new math/rand/v2 package, which uses uint64 more than int64, so we had to change a bunch of row IDs from int64 to uint64.
2025-04-24 16:33:41 -06:00
Matthew Holt
ec87974576
Refactor thumbnails jobs to dynamically page through rows by import ID 2025-04-21 16:18:23 -06:00
Matthew Holt
932831db47
Refactor data sources to make them dynamic
Also change the checkbox dropdown to a more interactive tomselect (type-to-search dropdown with chips) with pictures.

This makes it so data sources can be added to a timeline dynamically.

In the future, data sources can be implemented externally and push data to the timeline, so these need to not be rigidly hard-coded into the app and assumed to never change.

This essentially adds all their info (name, title, description, image, etc) into each timeline DB.
2025-02-11 16:49:20 -07:00
Matthew Holt
c1a9abb74b
googlelocation: Support on-device Android 2025 format
(Thanks to those who helped in Discord!)
2025-01-30 13:08:26 -07:00
Matt Holt
628ecc1cb3
ci: Update workflows; restore functioning CI jobs (#64)
* ci: Attempt to fix broken CI

It broke out of the blue several months ago. I think ubuntu-latest
updated, but there's no PPA for libheif in that distro I guess

* Try tests next

* More fixing

* Try again

* Yada yada

* Woops

* I don't really know what I'm doing
2025-01-27 22:30:54 -07:00
Matthew Holt
4e89fca643
Fix relationship de-duping; speed up imports a bit more 2025-01-10 15:31:20 -07:00
Matthew Holt
29e2bc8fef
Fix iphone/imessage: Update attribute_id in DB if inserting item piecewise
iMessage db may send a reaction graph for a message before sending the message itself to the pipeline, thus an empty item with only an original ID gets inserted, and later the full message item comes in, but I had neglected to add attribute_id to updateOverrides.
2025-01-09 18:09:58 -07:00
Matthew Holt
3d11d65b8d
WIP settings page; #map mobility; WIP interactive imports
Settings page is started; non-functional, but location picker works.

Moving maps between container elements is improved by moving to nearest to mouse pointer, rather than just most center to the viewport. It also emits an event when the map is moved, allowing us to change/reset map configurations for certain displays.

More progress on interactive imports. More thought is needed before continuing.

Upgraded Mapbox libraries.
2024-12-26 11:51:47 -07:00
Matthew Holt
ce297389b0
Thumbnail job streaming; WIP: interactive imports 2024-12-19 06:51:06 -07:00
Matthew Holt
294e2a72a9
Reconnect after disconnection; improve checkpointing 2024-12-17 14:27:50 -07:00
Matthew Holt
a4d8bc923d
Data source checkpoints; refine import concurrency
And related improvements and fixes
2024-12-15 22:40:58 -07:00
Matthew Holt
fcaa238634
Implement pause/unpause 2024-12-13 13:02:06 -07:00
Matthew Holt
22628833a7
Refactor obfuscation mode and some processing logic 2024-12-13 07:19:27 -07:00
Matthew Holt
0063bbe396
Some fixes for import streaming 2024-12-12 10:37:51 -07:00
Matthew Holt
786f516696
Refine import stream 2024-12-12 10:18:28 -07:00