Let’s say we have users and addresses. In RDBMs world we would have many to many relationship.
+-----------+ +---------+ +-------------+ | User | | Address | | UserAddress | +-----------+ +---------+ +-------------+ | Id | | Id | | UserId | | FirstName | | Street | | AddressId | | LastName | | City | | | | Email | | Country | | | +-----------+ +---------+ +-------------+
I understand that in NoSQL we trying to avoid joins and what happens instead is that we embed the address inside user document
[ { "id": 1, "firstName": "John", "lastName": "Smith", "email": "johnsmith@example.com", "addresses": [ { "street": "Harley Street", "city": "London", "country": "United Kingdom" } ] }, { "id": 2, "firstName": "Anna", "lastName": "Smith", "email": "annasmith@example.com", "addresses": [ { "street": "Harley Street", "city": "London", "country": "United Kingdom" } ] } ]
So what happened now compared to RDBMS:
PROS:
- Performance (No joins required)
CONS:
- Duplicate data
- Update is slower (Need to update multiple records if address change)
- No address indexing (To get all the addresses we need to get all the users and then get the distinct addresses)
If I understood correctly (please correct me if I’m wrong) this is the explanations/solutions for the cons points.
- It’s okay to have duplicate data, this is the trade off for not having joins.
- You usually embed only the fields that do not change frequently (exactly like address), so it doesn’t really matter
- Instead of embedding we can use reference, so
addresses
changing from array of objects to array of integers (we using integers in our example).
I didn’t like any of the above solutions, what I really liked is mixing them all together, I think I saw someone doing that, not sure if it that approach have any name.
So first we have a separate collection for addresses
[ { "id": 1, "street": "Harley Street", "city": "London", "country": "United Kingdom" } ]
And separate collection for users
[ { "id": 1, "firstName": "John", "lastName": "Smith", "email": "johnsmith@example.com", "addresses": [ { "id": 1, "street": "Harley Street", "city": "London" } ] }, { "id": 2, "firstName": "Anna", "lastName": "Smith", "email": "annasmith@example.com", "addresses": [ { "id": 1, "street": "Harley Street", "city": "London" } ] } ]
Instead of having the whole address object we have only the id and the part that we need to show. This approach solves the cons #1 and #3 and improves #2.
Am I on the right path? Is there any better approach for above example? Any other helpful hints to understand better the NoSQL world?