
137 commits

Author SHA1 Message Date
Matthew Holt
967f3ab28b
Fix panic from EXIF parsing; checkpoint resumption in Google Photos
Also show what file path information we do have for some imports that lack filename and preview, on the import job page.
2025-09-05 09:39:13 -06:00
Matthew Holt
d5f391866c
Ignore false lint warnings 2025-09-03 07:01:26 -06:00
Matthew Holt
b68b33c7bb
Honor timeframe on Apple Photos/iMessage imports 2025-09-02 22:41:03 -06:00
Matt Holt
a85f47f1a3
Major processor refactor (#112)
* Major processor refactor

- New processing pipeline, vastly simplified
- Several edge case bug fixes related to Google Photos (but applies generally too)
- Major import speed improvements
- UI bug fixes
- Update dependencies

The previous 3-phase pipeline would first check for an existing row in the DB, then decide what to do (insert, update, skip, etc.), then would download data file, then would update the row and apply lots of logic to see if the row was a duplicate, etc. Very messy, actually. The reason was to avoid downloading files that may not need to be downloaded.

In practice, the data almost always needs to be downloaded, and I had to keep hacking on the pipeline to handle edge cases related to concurrency and to not having the data in many cases while making decisions regarding the item/row. I was able to get all the tests to pass until the final boss: an edge case bug in Google Photos (a very important one that happened to be exposed by my wedding album, of all things). I was unable to fix that problem without a rewrite of the processor.

The problem was that Google Photos splits the data and metadata into separate files, and sometimes separate archives. The filename is in the metadata, and worse yet, there are duplicates if the media appears in different albums/folders, where the only way to know they're a duplicate is by filename+content. Retrieval keys just weren't enough to solve this, and I narrowed it down to a design flaw in the processor. That flaw was downloading the data files in phase 2, after making the decisions about how to handle the item in phase 1, then having to re-apply decision logic in phase 3.

The new processing pipeline downloads the data up front in phase 1 (and there's a phase 0 that splits out some validation/sanitization logic, but is of no major consequence). This can run concurrently for the whole batch. Then in phase 2, we obtain an exclusive write lock on the DB and, now that we have ALL the item information available, we can check for an existing row, make decisions on what to do, and even rename/move the data file if needed, all in one phase, rather than split across 2 separate phases.
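
The two phases described above can be sketched roughly as follows. This is a minimal illustration, not the actual processor: the `item` and `store` types, the `download` helper, and the filename+content key are all hypothetical stand-ins, and a mutex plays the role of the exclusive DB write lock.

```go
package main

import (
	"fmt"
	"sync"
)

// item is a hypothetical stand-in for one imported item.
type item struct {
	Name string
	Data []byte // filled in during phase 1
}

// store is a hypothetical stand-in for the timeline DB.
// Rows are keyed by filename+content, as the commit describes for duplicates.
type store struct {
	mu   sync.Mutex
	rows map[string]bool
}

// download simulates fetching an item's data file.
func download(it *item) {
	it.Data = []byte("contents of " + it.Name)
}

// processBatch downloads all data concurrently (phase 1), then makes every
// insert/skip decision in one pass under an exclusive lock (phase 2).
func processBatch(s *store, batch []*item) (inserted, skipped int) {
	// Phase 1: downloads run concurrently for the whole batch.
	var wg sync.WaitGroup
	for _, it := range batch {
		wg.Add(1)
		go func(it *item) {
			defer wg.Done()
			download(it)
		}(it)
	}
	wg.Wait()

	// Phase 2: all item information is available, so decide under the lock.
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, it := range batch {
		key := it.Name + "\x00" + string(it.Data)
		if s.rows[key] {
			skipped++ // same filename and content: treat as duplicate
			continue
		}
		s.rows[key] = true
		inserted++
	}
	return inserted, skipped
}

func main() {
	s := &store{rows: make(map[string]bool)}
	batch := []*item{{Name: "a.jpg"}, {Name: "b.jpg"}, {Name: "a.jpg"}}
	inserted, skipped := processBatch(s, batch)
	fmt.Println("inserted:", inserted, "skipped:", skipped)
}
```

A real implementation would presumably compare a content hash rather than the raw bytes; the point here is only that the decision logic runs once, after the data is in hand.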

This simpler pipeline still has lots of nuance, but in my testing, imports run much faster! And the code is easy to reason about.

On my system (which is quite fast), I was able to import most kinds of data at a rate of over 2,000 items per second. And for media like Google Photos, it's a 10x increase from before thanks to the concurrency in phase 1: up from about 3-5/second to around 30-50/second, depending on file size.

An import of about 200,000 text messages, including media attachments, finished in about 2 minutes.

My Google Photos library, which used to take almost a whole day, now takes only a couple hours to import. And that's over USB.

Also fixed several other minor bugs/edge cases.

This is a WIP. Some more cleanup and fixes are coming. For example, my solution to fix the Google Photos import bug is currently hard-coded (it happens to work for everything else so far, but is not a good general solution). So I need to implement a general fix for that before this is ready to merge.

* Round out a few corners; fix some bugs

* Appease linter

* Try to fix linter again

* See if this works

* Try again

* See what actually fixed it

* See if allow list is necessary for replace in go.mod

* Ok fine just move it into place

* Refine retrieval keys a bit

* One more test
2025-09-02 11:18:39 -06:00
Matthew Holt
1f73da0527
Fix lint errors 2025-08-21 15:39:36 -06:00
Matthew Holt
a52fb35c4d
Data sources can honor job pauses; minor improvements to some errors, logs 2025-07-15 15:58:02 -06:00
Matthew Holt
336ff7fae0
Fix new lint warnings
Must have been a change in golangci-lint
2025-07-01 15:41:07 -06:00
JP Hastings-Edrei
29f1ed3176
Importer for Flighty flight information (#90) 2025-06-19 15:05:18 -06:00
Matthew Holt
056f813889
gpx: Mark place entity points as significant
Also still allow clustering significant points, since we do preserve them, the data source can just call ClusterPoints() to get it back...
2025-06-17 21:37:38 -06:00
Matthew Holt
1c14853317
Tune path simplification a little more 2025-06-17 16:43:22 -06:00
Matt Holt
def05a6cfa
Revise location processing and improve place entities (#101)
* Revise location processing and place entities

- New, more dynamic, recursive clustering algorithm
- Place entities are globally unique by name
- Higher spatial tolerance for coordinate attributes if entity name is the same (i.e. don't insert new attribute row for coordinate if it's sort of close to another row for that attribute -- but if name is different, then points have to be closer to not insert new attribute row)
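
As a rough illustration of the name-dependent tolerance in the last bullet, here is a sketch. The two threshold values and the `place` type are assumptions made up for the example (the commit does not state the real numbers), and the haversine formula is just one common way to measure distance between coordinates.

```go
package main

import (
	"fmt"
	"math"
)

// Hypothetical tolerances in meters; the real values are not given in the commit.
const (
	sameNameToleranceM = 500.0
	diffNameToleranceM = 50.0
)

// place is a hypothetical named coordinate.
type place struct {
	Name     string
	Lat, Lon float64
}

// haversineM returns the great-circle distance between two points in meters.
func haversineM(lat1, lon1, lat2, lon2 float64) float64 {
	const earthRadiusM = 6371000.0
	toRad := func(deg float64) float64 { return deg * math.Pi / 180 }
	dLat := toRad(lat2 - lat1)
	dLon := toRad(lon2 - lon1)
	a := math.Sin(dLat/2)*math.Sin(dLat/2) +
		math.Cos(toRad(lat1))*math.Cos(toRad(lat2))*math.Sin(dLon/2)*math.Sin(dLon/2)
	return 2 * earthRadiusM * math.Asin(math.Sqrt(a))
}

// needsNewCoordinateRow reports whether incoming is far enough from existing
// to warrant a new attribute row. Same-named places get a looser tolerance.
func needsNewCoordinateRow(existing, incoming place) bool {
	tolerance := diffNameToleranceM
	if existing.Name == incoming.Name {
		tolerance = sameNameToleranceM
	}
	return haversineM(existing.Lat, existing.Lon, incoming.Lat, incoming.Lon) > tolerance
}

func main() {
	home := place{Name: "Home", Lat: 40.0, Lon: -111.0}
	nearby := place{Name: "Home", Lat: 40.002, Lon: -111.0} // roughly 220 m north
	fmt.Println(needsNewCoordinateRow(home, nearby))

	nearby.Name = "Cafe"
	fmt.Println(needsNewCoordinateRow(home, nearby))
}
```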

There is still a bug where clustering is too aggressive on some data. Looking into it...

* Fix overly aggressive clustering

(...lots of commits that fixed the CI environment which changed things without warning...)
2025-06-17 16:13:44 -06:00
Matthew Holt
4bec2e0b86
Fix lint, tweak email recognition a bit more 2025-06-10 11:19:36 -06:00
Matthew Holt
fa9ad482b3
Place entities from GPX sources; several other improvements/fixes
Location processing is still being revised (WIP).
2025-06-09 17:18:44 -06:00
JP Hastings-Edrei
27a2f462cf
lint: bump golangci-lint version (#92)
* lint: bump golangci-lint version

- Bumps the version of golangci-lint that's used in the GitHub Action to be the most recent version (as installed with e.g. `brew install golangci-lint`, v2.1.6)
- Migrates the `.golangci.toml` file, and manually moves the comments over
- `errchkjson` appears to work now, so added that back into the linter (the `forbidigo` and `goheader` linters I've left commented out)

* lint: remove checkers we don't like

Removes two static checkers that cause code changes we don't like.

* lint: remove old lint declaration

apparently `gosimple` isn't available any more, so I've removed its `nolint` declaration here.

* lint: swap location of `nolint:goconst`

This _seems_ to be an unstable declaration, because of the parallel & nondeterministic nature of the linter. If this keeps causing trouble we can either remove the goconst linter, or change _both_ of these lines to hold `//nolint:goconst,nolintlint`.
2025-06-02 15:03:19 -06:00
Matthew Holt
31c575727c
apple_contacts: Improve recognition a bit 2025-05-31 07:06:33 -06:00
Matthew Holt
31f003b3d4
Fix metadata updates for items and relationships
Also relocate data files if the item's timestamp changes
2025-05-28 18:09:46 -06:00
Matthew Holt
863d0e978b
Detect and handle corrupt timestamps a little better 2025-05-27 11:24:08 -06:00
Matthew Holt
ab64f1eaee
googlephotos: Implement DS checkpoint 2025-05-26 07:51:31 -06:00
Matthew Holt
d0d76473fa
Relate sidecar motion pics from Google Photos; fix related entity display on item page
- Somehow I totally forgot to relate sidecar motion photos in Google Photos. (They don't use sidecars on Google phones.)

- Item page now displays entities in the picture even without face coordinates
2025-05-20 11:35:15 -06:00
Matthew Holt
d268486f55
Several import fixes; metadata merging
- Quick unit tests for a function related to Google Takeout archives
- We now combine existing metadata with new according to the update policy, instead of either writing all or none of incoming metadata. This merging happens before the DB update query and is a bit of a special case as the policy is applied per-key.
- Special handling for corrupted timestamp in Google Photos data. This is a singular case I haven't observed more of, but seems like a reasonable heuristic. There might be thousands more out there, who knows.
- Fix job creation time (milliseconds)
- Hopefully make repeated imports faster by skipping duplicate items more intelligently based on update policies.
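
The per-key metadata merge mentioned above might look something like this sketch. The two policy names and the `mergeMetadata` function are hypothetical, chosen only to show a policy being applied key by key instead of writing all or none of the incoming metadata.

```go
package main

import "fmt"

// policy is a hypothetical update policy; the names are illustrative only.
type policy int

const (
	keepExisting   policy = iota // only fill in keys that are missing
	preferIncoming               // incoming values overwrite existing ones
)

// mergeMetadata combines existing and incoming metadata, applying the policy
// to each key individually. Neither input map is modified.
func mergeMetadata(existing, incoming map[string]any, p policy) map[string]any {
	merged := make(map[string]any, len(existing)+len(incoming))
	for k, v := range existing {
		merged[k] = v
	}
	for k, v := range incoming {
		if _, ok := merged[k]; ok && p == keepExisting {
			continue // key already present and policy says keep it
		}
		merged[k] = v
	}
	return merged
}

func main() {
	existing := map[string]any{"Camera": "A", "ISO": 100}
	incoming := map[string]any{"Camera": "B", "Lens": "50mm"}
	fmt.Println(mergeMetadata(existing, incoming, keepExisting))
	fmt.Println(mergeMetadata(existing, incoming, preferIncoming))
}
```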
2025-05-19 12:47:18 -06:00
Matthew Holt
bdf3cc5636
whatsapp: Remove RetrievalKey
Probably not needed in this case
2025-05-17 16:51:11 -06:00
Matthew Holt
4090a09186
twitter: Gracefully skip missing media files
I guess the archives they export are incomplete. The import should continue.
2025-05-16 22:05:40 -06:00
Matthew Holt
3e311d99c3
Sort data sources in import planner; rename some DS
The sorting can help imports go faster if we put DB-heavy sources first, when the database is still small.

The data source names were also standardized to use snake_case like most other word-IDs in the app.
2025-05-16 11:10:23 -06:00
Matthew Holt
0d26c6eb31
Fix several bugs
- Obfuscation mode enabled would set a fake phone number in smsbackuprestore's DS options, which led to bad data. Now, the UI does not auto-fill that value. But that means we need...

- SMS Backup & Restore: Phone number can now be inferred from repo owner in the backend, if ds opt phone number is empty. This works even with obfuscation enabled.

- Aborting a scheduled job before it starts now stays aborted. (Unless you manually restart it.)

- Added a data validation error modal for DS options on the import page. For now, if smsbackuprestore has no phone number set, and the timeline repo owner doesn't have a phone number, an error will be shown.
2025-05-15 16:53:35 -06:00
Matthew Holt
02d9434131
vcard: Initialize metadata map to avoid panic 2025-05-14 08:31:34 -06:00
Matthew Holt
360e131fff
Recover panics during jobs/imports, and support base64 pics from vCard 2025-05-14 08:29:37 -06:00
Matthew Holt
17f660ae8b
applephotos: Minor fix to recognition 2025-05-13 11:35:03 -06:00
JP Hastings-Edrei
855a0a702b
whatsapp: Fix tests & metadata keys (#88)
Well this is embarrassing, I forgot to actually test the metadata _and_ the keys emitted weren't correct!
2025-05-07 11:22:41 -06:00
JP Hastings-Edrei
2407333482
Add WhatsApp importer (#79)
* Add WhatsApp importer

A first pass at importing WhatsApp chat exports.

Some open questions:
- Do we want to import context messages ("you deleted this message")?
- In WhatsApp it's possible to have groups with the same participants but a different group name. Is it possible to tag a conversation with a "group name" in Timelinize? If not, this may end up with different conversations being interleaved.
- Is it safe to assume the current location for timezone analysis on import? WhatsApp exports use timezoneless timestamps, which (I've confirmed manually) are just "what the time would have been where you are now" (for me, messages sent in summer are in BST, and in winter are GMT)
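
To make the timezone question concrete, here is a small sketch of interpreting a zone-less timestamp as wall-clock time in a given location, using Go's `time.ParseInLocation`. The layout string and the `utcOffsetSeconds` helper are assumptions for illustration; the zone is passed explicitly here (instead of using `time.Local`) so the result does not depend on the machine running it.

```go
package main

import (
	"fmt"
	"time"
	_ "time/tzdata" // embed the zone database so LoadLocation works anywhere
)

// layout is an assumed export format (DD/MM/YYYY, HH:MM:SS); real exports vary.
const layout = "02/01/2006, 15:04:05"

// utcOffsetSeconds parses ts as wall-clock time in the named zone and returns
// the UTC offset in effect at that moment, in seconds.
func utcOffsetSeconds(ts, zone string) (int, error) {
	loc, err := time.LoadLocation(zone)
	if err != nil {
		return 0, err
	}
	parsed, err := time.ParseInLocation(layout, ts, loc)
	if err != nil {
		return 0, err
	}
	_, offset := parsed.Zone()
	return offset, nil
}

func main() {
	for _, ts := range []string{"15/07/2024, 12:00:00", "15/01/2024, 12:00:00"} {
		offset, err := utcOffsetSeconds(ts, "Europe/London")
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println(ts, "UTC offset (s):", offset)
	}
}
```

With `Europe/London`, the summer timestamp resolves to BST and the winter one to GMT, which matches the behavior described above.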

Annoying quirks of the export format we should find good ways to communicate to users:
- Any caption text sent with an attachment isn't exported by WhatsApp. (The text is lost and unavailable to Timelinize — I've opened a bug with Meta, for all the good that'll do)
- If there are silent members of a group chat, their presence isn't recorded in the data WhatsApp exports

Todo:
- I _think_ it's safe to assume there's only ever one attachment per message; this would change & simplify the way I parse attachment lines. I'll keep exploring my own exports to identify if this is reasonable.

* Include polls & locations in tests

Polls are currently ignored, but I'll move them to being imported as a message, or as some special datatype, after discussion.

* Add text formatting examples, and show they're not processed

* Fix lint issues

* WhatsApp: Add Retrieval keys to messages

The key on the message isn't perfect, as it'll change if the person exporting their chat history has changed the name of one of the participants between exports (this would mean that participant's name would be different between exports, and their retrieval key would be different).

This seems as close as we can get without exported IDs though.

(I can't find a good way to test that the retrieval key is set properly)

* WhatsApp: Polls, Locations, Metadata

- Correctly parses attachments (even those which have been omitted, as not being available on the device that performed the export)
- Parses Polls (only in English, for now), including adding metadata for the Poll
- Extracts location metadata (Foursquare ID for named locations, or Lat/Long)
- Adds more test data to demonstrate other kinds of messages included in exports

* WhatsApp: Handle other locales

- 🤦‍♂️ The timestamp format changes based on the locale of the device performing the export — which makes accurate extraction of dates impossible between DD/MM/YYYY and MM/DD/YYYY dates. This parser will assume DD/MM/YYYY date if the last set of digits is 4 long. Perhaps we need an import option for "I'm using American dates"?
- Swaps the Poll scraping structure to allow for the localised words used when the exporting phone is set to other locales (eg. OPCIÓN instead of OPTION)
- Added a chat line test fixture to illustrate this (though normally the entire file would only ever be in a single locale)
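
The date-order heuristic from the first bullet could be sketched like this. Only the "4-digit final group means DD/MM/YYYY" rule comes from the description above; treating everything else as M/D/YY is an assumption made for the example, and `parseExportDate` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// parseExportDate guesses the field order of a slash-separated date.
// A 4-digit final group is read as DD/MM/YYYY; anything else is read as
// M/D/YY, which is an assumption for this sketch.
func parseExportDate(s string) (time.Time, error) {
	parts := strings.Split(s, "/")
	if len(parts) != 3 {
		return time.Time{}, fmt.Errorf("unexpected date format: %q", s)
	}
	layout := "1/2/06" // month/day/2-digit year
	if len(parts[2]) == 4 {
		layout = "2/1/2006" // day/month/4-digit year
	}
	return time.Parse(layout, s)
}

func main() {
	for _, s := range []string{"03/04/2025", "3/4/25"} {
		parsed, err := parseExportDate(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(s, "->", parsed.Format("2006-01-02"))
	}
}
```

The same digits land on different days depending on the branch taken, which is why an explicit import option may still be needed.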

* WhatsApp: Correct Poll Structure & fix parsing

I had incorrect POLL lines in the test fixtures; this commit fixes them, and the importer so it can read them properly.

* Use snake case for datasource name

Co-authored-by: Matt Holt <mholt@users.noreply.github.com>

* WhatsApp: Be cautious with matching

Be slightly less confident with matching `_chat.txt` files as WhatsApp exports!

* WhatsApp: Fix lint errors

Fix magic number linting errors

* WhatsApp: swap metadata namespaces

Switch to using "Pin" instead of "Location" to more accurately describe what's being tagged with the metadata.

---------

Co-authored-by: Matt Holt <mholt@users.noreply.github.com>
2025-05-07 09:21:39 -06:00
Matthew Holt
2dc7387901
iphone: Make CameraRoll optional 2025-05-06 18:00:02 -06:00
Matthew Holt
56e16e33a7
Make Apple Photos importable from both Mac and iPhone
Also a couple minor UI fixes.

And a minor but important addition to the contributing guidelines.
2025-05-06 17:54:36 -06:00
Matthew Holt
ffc8ad6f51
applephotos: Preserve a lot more metadata about people in photos
Also infer owner entity from DB if necessary, very cool!

Also fix a couple minor bugs
2025-05-05 14:48:54 -06:00
Matthew Holt
a62f4aa05a
applephotos: Initial commit of Apple Photos data source
Still a WIP, but mostly there!
2025-05-05 12:07:13 -06:00
Matthew Holt
55f7feaa21
Better support for application/ogg
See https://github.com/timelinize/timelinize/pull/79#issuecomment-2848707939
2025-05-03 15:32:59 -06:00
Matthew Holt
f0697d2d6b
Refactor embedding jobs; enhance tooltips; upgrade gofakeit to v7
The gofakeit upgrade uses the new math/rand/v2 package, which uses uint64 more than int64, so we had to change a bunch of row IDs from int64 to uint64.
2025-04-24 16:33:41 -06:00
Matthew Holt
b88485a84b
A few fixes/enhancements
googlelocation: Allow iOS on-device location filename to be renamed, but it should still contain "location-history" and be a .json file.

- Upgrade mapbox-gl-js to 3.11

- Run thumbnail+embedding jobs even if import failed; WIP
2025-04-19 13:44:51 -06:00
Sergio Rubio
88eab7c50a
firefox: check sqlite database has the required table (#74)
* firefox: check sqlite database has the required table

This improves Firefox places.sqlite database recognition by making sure
it has the required moz_places table we need.

* Address linter issues

* Wait for the goroutine to exit before cancelling
2025-04-08 09:47:17 -06:00
Sergio Rubio
612cae9c03
Firefox data source (#55)
* [WiP] Firefox data source

Work in progress.

Implements a new Firefox datasource capable of reading its
places.sqlite database to import the browser history (page visits).

The implementation currently has a number of issues:

* Firefox (and Firefox-based browsers) keep an exclusive lock on the
  places.sqlite database, and we can't dump or back it up while the
  browser is open, at least on Linux. To work around that,
  we copy the database to a temporary directory and import from it.
  This generally works, but isn't safe, as there's a risk of database
  corruption when doing the hot copy. Potential alternatives:
  * Ask the user to close the browser while the import happens, which
    isn't convenient/possible if this is happening regularly in the background.
  * Ignore and retry, as it'll eventually succeed, in the rare case the
    temporary db copy is corrupted and unreadable
  * Something else, no expert here.
* You need to point Timelinize to the places.sqlite file directly. Pointing
  it to the Firefox profile directory doesn't seem to work, as it
  doesn't seem to scan recursively or list all the directory files and
  pass them to Recognize. I'm probably missing something obvious here.
* Missing tests (will be added)
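
The copy-to-a-temporary-directory workaround from the first bullet might be sketched as below. `copyToTemp` and `demo` are hypothetical helpers, not the importer's actual code, and this copies only the single file it is given, so the corruption risk noted above still applies.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// copyToTemp copies src into a fresh temporary directory. It returns the path
// of the copy and a cleanup function that removes the temporary directory.
func copyToTemp(src string) (string, func(), error) {
	dir, err := os.MkdirTemp("", "places-copy-*")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { os.RemoveAll(dir) }

	in, err := os.Open(src)
	if err != nil {
		cleanup()
		return "", nil, err
	}
	defer in.Close()

	dst := filepath.Join(dir, filepath.Base(src))
	out, err := os.Create(dst)
	if err != nil {
		cleanup()
		return "", nil, err
	}
	if _, err := io.Copy(out, in); err != nil {
		out.Close()
		cleanup()
		return "", nil, err
	}
	if err := out.Close(); err != nil {
		cleanup()
		return "", nil, err
	}
	return dst, cleanup, nil
}

// demo writes content to a scratch file, copies it with copyToTemp, and
// returns what the copy contains.
func demo(content string) (string, error) {
	src, err := os.CreateTemp("", "places-*.sqlite")
	if err != nil {
		return "", err
	}
	defer os.Remove(src.Name())
	if _, err := src.WriteString(content); err != nil {
		src.Close()
		return "", err
	}
	if err := src.Close(); err != nil {
		return "", err
	}

	dst, cleanup, err := copyToTemp(src.Name())
	if err != nil {
		return "", err
	}
	defer cleanup()

	data, err := os.ReadFile(dst)
	return string(data), err
}

func main() {
	got, err := demo("pretend this is a database")
	fmt.Println(got, err)
}
```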

* Linter fixes

* Adapt it to the new API

* Send the full path to process

* Simplify import process

* Add datasource description

* Use the URL as the item content

* Add basic tests

* Give the test some more time

* Do not return an error if context was cancelled
2025-04-08 07:02:20 -06:00
Matthew Holt
6071890390
Minor enhancements
- Upgrade mholt/archives
- Error opening browser should not be fatal
- Fix lint error
2025-04-07 13:05:20 -06:00
Matthew Holt
73196f51ae
Refactor DirEntry, fix some bugs
Remove TopDir* functions, they aren't really relevant with our new import planner.
2025-04-02 21:52:49 -06:00
Matthew Holt
90e6ea228c
googlephotos: Fix empty imports
Thanks for the reports on Discord!
2025-04-01 22:04:33 -06:00
Matthew Holt
932831db47
Refactor data sources to make them dynamic
Also change the checkbox dropdown to a more interactive tomselect (type-to-search dropdown with chips) with pictures.

This makes it so data sources can be added to a timeline dynamically.

In the future, data sources can be implemented externally and push data to the timeline, so these need to not be rigidly hard-coded into the app and assumed to never change.

This essentially adds all their info (name, title, description, image, etc) into each timeline DB.
2025-02-11 16:49:20 -07:00
Matthew Holt
d18c4cd2c8
googlelocation: Fix on-device Android format handling 2025-02-05 13:52:20 -07:00
Matthew Holt
8437a38746
googlelocation: Fix longitude 2025-01-30 14:46:42 -07:00
Matthew Holt
c1a9abb74b
googlelocation: Support on-device Android 2025 format
(Thanks to those who helped in Discord!)
2025-01-30 13:08:26 -07:00
Matthew Holt
7c34746b31
Minor fixes and enhancements; particularly to import planner 2025-01-27 06:54:44 -07:00
Matthew Holt
c3dc7728a1
Add basic calendar (.ics) data source 2025-01-26 14:39:02 -07:00
Matthew Holt
d4d7991f7b
vcard: Use sidecar picture if available 2025-01-24 10:14:53 -07:00
Matthew Holt
a6da2ee542
applecontacts: Split out this data source; imessage: Fixes & improvements 2025-01-22 22:35:46 -07:00
Matthew Holt
ccef13f530
Add icons for iCloud and iMessage 2025-01-21 15:53:02 -07:00