Commit graph

8 commits

Author SHA1 Message Date
Matt Holt
a85f47f1a3
Major processor refactor (#112)
* Major processor refactor

- New processing pipeline, vastly simplified
- Several edge case bug fixes related to Google Photos (but applies generally too)
- Major import speed improvements
- UI bug fixes
- Update dependencies

The previous 3-phase pipeline would first check for an existing row in the DB, then decide what to do (insert, update, skip, etc.), then download the data file, then update the row and apply lots of logic to see if the row was a duplicate, etc. Very messy, actually. The reason was to avoid downloading files that may not need to be downloaded.

In practice, the data almost always needs to be downloaded, and I had to keep hacking on the pipeline to handle edge cases related to concurrency and to making decisions about the item/row without having the data on hand. I was able to get all the tests to pass until the final boss exhibited itself: an edge case bug in Google Photos -- a very important one that happened to be exposed by my wedding album, of all things. I was unable to fix that problem without a rewrite of the processor.

The problem was that Google Photos splits the data and metadata into separate files, and sometimes separate archives. The filename is in the metadata, and worse yet, there are duplicates if the media appears in different albums/folders, where the only way to know they're duplicates is by filename+content. Retrieval keys just weren't enough to solve this, and I narrowed it down to a design flaw in the processor. That flaw was downloading the data files in phase 2, after making the decisions about how to handle the item in phase 1, then having to re-apply decision logic in phase 3.
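The filename+content duplicate check described above can be sketched as a simple dedupe key. This is a minimal illustration only; `dedupeKey` and its inputs are hypothetical names, not the actual processor code:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// dedupeKey is a hypothetical sketch of identifying a Google Photos
// duplicate: the same filename with identical content is the same item,
// even when it appears in different albums/folders.
func dedupeKey(filename string, content []byte) string {
	sum := sha256.Sum256(content)
	return filename + ":" + hex.EncodeToString(sum[:])
}

func main() {
	a := dedupeKey("wedding-001.jpg", []byte("same bytes"))
	b := dedupeKey("wedding-001.jpg", []byte("same bytes")) // same file, different album
	c := dedupeKey("wedding-001.jpg", []byte("other bytes"))
	fmt.Println(a == b, a == c) // true false
}
```

Note that building this key requires the content, which is exactly why decisions can't be made before the data file is downloaded.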

The new processing pipeline downloads the data up front in phase 1 (and there's a phase 0 that splits out some validation/sanitization logic, but is of no major consequence). This can run concurrently for the whole batch. Then in phase 2, we obtain an exclusive write lock on the DB and, now that we have ALL the item information available, we can check for existing row, make decisions on what to do, even rename/move the data file if needed, all in one phase, rather than split across 2 separate phases.

This simpler pipeline still has lots of nuance, but in my testing, imports run much faster! And the code is easy to reason about.

On my system (which is quite fast), I was able to import most kinds of data at a rate of over 2,000 items per second. And for media like Google Photos, it's a 10x increase from before thanks to the concurrency in phase 1: up from about 3-5/second to around 30-50/second, depending on file size.

An import of about 200,000 text messages, including media attachments, finished in about 2 minutes.

My Google Photos library, which used to take almost a whole day, now takes only a couple hours to import. And that's over USB.

Also fixed several other minor bugs/edge cases.

This is a WIP. Some more cleanup and fixes are coming. For example, my solution to fix the Google Photos import bug is currently hard-coded (it happens to work for everything else so far, but is not a good general solution). So I need to implement a general fix for that before this is ready to merge.

* Round out a few corners; fix some bugs

* Appease linter

* Try to fix linter again

* See if this works

* Try again

* See what actually fixed it

* See if allow list is necessary for replace in go.mod

* Ok fine just move it into place

* Refine retrieval keys a bit

* One more test
2025-09-02 11:18:39 -06:00
Matthew Holt
eae7e1806d
Minor enhancements to logo/icon
Improves legibility and more optical balancing
2025-07-09 11:43:33 -06:00
Matthew Holt
932831db47
Refactor data sources to make them dynamic
Also change the checkbox dropdown to a more interactive tomselect (type-to-search dropdown with chips) with pictures.

This makes it so data sources can be added to a timeline dynamically.

In the future, data sources can be implemented externally and push data to the timeline, so these need to not be rigidly hard-coded into the app and assumed to never change.

This essentially adds all their info (name, title, description, image, etc) into each timeline DB.
2025-02-11 16:49:20 -07:00
Matthew Holt
a6da2ee542
applecontacts: Split out this data source; imessage: Fixes & improvements 2025-01-22 22:35:46 -07:00
Matthew Holt
ccef13f530
Add icons for iCloud and iMessage 2025-01-21 15:53:02 -07:00
Matthew Holt
13131aba65
Run ANALYZE after imports and at startup; add NMEA icon 2024-12-08 05:29:17 -07:00
Sergio Rubio
ef287c1bb9
GitHub data source (#48)
* GitHub stars data source

Data source that imports GitHub starred repositories in JSON format.

Each starred repo is imported individually; the starred repo
metadata comes from the GitHub API.

The item timestamp is set to the starred date, so they appear in the
timeline the day the repo was starred.

A small JSON file is saved in the timeline repository data directory
with the metadata retrieved from the GitHub API, which looks like:

```
{
 "id": 841044067,
 "name": "timelinize",
 "html_url": "https://github.com/timelinize/timelinize",
 "description": "Store your data from all your accounts and devices in a single cohesive timeline on your own computer",
 "created_at": "2024-08-11T13:27:39Z",
 "updated_at": "2024-09-03T07:17:29Z",
 "pushed_at": "2024-09-02T15:31:59Z",
 "stargazers_count": 504,
 "language": "Go",
 "full_name": "timelinize/timelinize",
 "topics": null,
 "is_template": false,
 "Topics": "archival,data-archiving,data-import,timeline",
 "private": false,
 "starred_at": "2024-08-12T17:55:48Z"
}
```

The data source currently expects the JSON file to be named like:

- ghstars.json
- ghstars-<ISO date>.json
- ghstars-<UNIX timestamp>.json
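The three filename patterns above can be recognized with a single regular expression. This regexp is a sketch of the rule as described, not the data source's actual recognition code:

```go
package main

import (
	"fmt"
	"regexp"
)

// ghstarsFile matches the three expected filenames: ghstars.json,
// ghstars-<ISO date>.json, and ghstars-<UNIX timestamp>.json.
// A sketch only; the real matching logic may differ.
var ghstarsFile = regexp.MustCompile(`^ghstars(-\d{4}-\d{2}-\d{2}|-\d+)?\.json$`)

func main() {
	for _, name := range []string{
		"ghstars.json",
		"ghstars-2024-09-06.json",
		"ghstars-1725600000.json",
		"stars.json",
	} {
		fmt.Println(name, ghstarsFile.MatchString(name))
	}
}
```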

* Linter fixes

* Remove optional options

* Add the URL label to the bookmark class

* Change the data source name to GitHub

* Rename data source directory also

* Rename datasource main file

* Store GitHub starred repo URL only

* rename symbols

* Add basic tests

* moar tests

* Linter fix

* You can read on closed channels

* Add bookmark svg for the frontend

* Update package docs

* 💄 docs

* Update datasources/github/github.go

Co-authored-by: Matt Holt <mholt@users.noreply.github.com>

* Update datasources/github/github.go

Co-authored-by: Matt Holt <mholt@users.noreply.github.com>

* Update datasources/github/github.go

Co-authored-by: Matt Holt <mholt@users.noreply.github.com>

* Remove content from item

---------

Co-authored-by: Matt Holt <mholt@users.noreply.github.com>
2024-09-06 11:22:07 -06:00
Matthew Holt
1daf6f4157
Initial open source commit 2024-08-11 08:02:27 -06:00