Hunter's Hodgepodge: Technical Study Notes

2008-06-29

Web Clippings 20080629

Filed under: Architecture, posted by hunter @ 9:54 pm

I recently came across some good material and jotted down a few reading notes.

 

http://www.zefhemel.com/archives/2004/09/01/the-share-nothing-architecture
Scalability is “the capability of a system to increase performance under an increased load when resources (typically hardware) are added.” (source: Wikipedia)
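The first paragraph explains what scalability means. Scalability is not the same as performance: if doubling your server investment doubles capacity, and the service keeps running while servers are being added, that is scalability.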

A big scalability problem with caching data is called the cache-coherence problem.
The second paragraph covers a problem that severely limits scalability: cache coherence.

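The third paragraph pitches the shared-nothing architecture: at least on the web servers, share no data at all. Sessions can be kept centrally in files (accessed remotely over NFS) or in a database, and the database scales reasonably well (this is how eBay does it).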
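To make the idea concrete for myself, here is a minimal sketch of a stateless web tier. The `SessionStore` and `WebServer` classes are my own illustration, not from the article; the store is an in-memory stand-in for the database or NFS files the post mentions.

```python
import uuid


class SessionStore:
    """Stand-in for a central store (a database table or files on NFS)."""

    def __init__(self):
        self._rows = {}

    def load(self, session_id):
        return dict(self._rows.get(session_id, {}))

    def save(self, session_id, data):
        self._rows[session_id] = dict(data)


class WebServer:
    """Keeps no per-user state in memory between requests."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id, item=None):
        # Create a session id on first contact.
        session_id = session_id or uuid.uuid4().hex
        session = self.store.load(session_id)
        cart = session.get("cart", [])
        if item is not None:
            cart = cart + [item]
        session["cart"] = cart
        self.store.save(session_id, session)
        return session_id, cart


store = SessionStore()
web1 = WebServer("web1", store)
web2 = WebServer("web2", store)

sid, _ = web1.handle(None, "book")
_, cart = web2.handle(sid, "pen")  # a different server continues the session
print(cart)
```

Because neither server remembers anything itself, a load balancer could send each request to any of them, and adding a third server would need no coordination with the first two.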

Given that this post dates from 2004, I won't say more; the author's experience at the time was still fairly shallow.

=====================================
http://wutaoo.javaeye.com/blog/148369
from:http://highscalability.com/sharding-hibernate-way
Shard: splitting up your data sets. If your data doesn’t fit on one machine you split it up into pieces and each piece is called a shard.

Sharding: the process of splitting up data.
Shards is for situations where you have too much data to fit in a single database. MySQL partitioning may allow you to delay when you need to shard, but it is still a single database and you’ll eventually run into limits.

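The difference between sharding and partitioning: partitioning happens inside a single database, whereas sharding is a data-splitting approach that mostly means distributing data across multiple database servers.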

Control over how data are distributed is determined by a pluggable strategies layer.
 -- This is how Hibernate implements it

Plan for the future by picking a strategy that will last you a long time.

Repartitioning/resharding the data is operationally very difficult. No management tools for this yet

 -- Choosing the sharding strategy is critical; a wrong choice is painful to undo
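A small experiment shows why a strategy is hard to change later. This sketch assumes the simplest possible strategy, plain modulo routing on a numeric user id (my assumption, not something Hibernate Shards prescribes), and counts how many users would land on a different shard after adding one server.

```python
def shard_for(user_id, shard_count):
    """Modulo routing: the shard is derived from the key alone."""
    return user_id % shard_count


def moved_fraction(user_ids, old_count, new_count):
    """Fraction of users whose shard changes when the shard count changes."""
    moved = sum(
        1 for uid in user_ids
        if shard_for(uid, old_count) != shard_for(uid, new_count)
    )
    return moved / len(user_ids)


users = range(100_000)
fraction = moved_fraction(users, old_count=4, new_count=5)
print(f"{fraction:.0%} of users move when going from 4 to 5 shards")
```

With these numbers 80% of the users would have to be migrated, which is why the choice of strategy deserves thought up front.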

 

The rest of the article describes the limitations and features of this strategy layer.

 

=====================================
http://www.mysqlperformanceblog.com/2008/03/14/sharding-and-time-base-partitioning/
However this may be not the most optimal approach by itself because not all data belonging to same user is equal.
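 -- Explains that the shard model does not suit every kind of data, especially when data differ in importance or in how hot they are

 -- The author suggests that some data can, on top of sharding, be further partitioned by time or by hotness
 -- The author seems to think that master-slave replication is no longer needed once you shard. In a large system you still need master-slave to make each node more robust. A 1 master : n slaves setup also raises the maximum utilization of each node, and routing the heavy queries of non-critical workloads to one dedicated slave keeps them from affecting critical workloads.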
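The "shard first, then partition by time" suggestion can be sketched as a two-level lookup. Everything here is hypothetical: the shard count, the `messages` table, and the monthly `messages_YYYY_MM` naming are my own choices for illustration.

```python
from datetime import date

SHARD_COUNT = 4  # hypothetical number of shards


def locate(user_id, created_on, table="messages"):
    """Return (shard index, table name) for one row."""
    shard = user_id % SHARD_COUNT
    partition = f"{table}_{created_on:%Y_%m}"
    return shard, partition


def tables_for_recent(user_id, today, months=3, table="messages"):
    """Tables that a 'recent activity' query has to touch, newest first."""
    shard = user_id % SHARD_COUNT
    year, month = today.year, today.month
    names = []
    for _ in range(months):
        names.append(f"{table}_{year:04d}_{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return shard, names


print(locate(42, date(2008, 6, 29)))
print(tables_for_recent(42, date(2008, 2, 10)))
```

The point of the second level is that hot, recent rows stay in small tables, while old months can be archived or moved to slower storage without touching the shard routing.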

http://mysqldba.blogspot.com/2006/11/unorthodox-approach-to-database-design.html
  -- Friendster architecture
  -- Each 64bit AMD server would house 500K distinct users and all their data
  -- User-centric: everything related to a user is stored together
  -- Lists several replication problems; the one we need to weigh most is "IO bandwidth is low replication lags, causing slave lag"

=====================================
http://highscalability.com/unorthodox-approach-database-design-coming-shard
Describes the advantages of sharding:
a. High availability

    -- An extra benefit of sharding is that partial service can still be provided
b. Faster queries.
c. More write bandwidth
d. You can do more work

   -- Concurrent throughput increases
How sharding differs from a traditional architecture:


a. Data are denormalized

   -- Denormalized design: data sharing the same primary key is stored together (this is the first time I have seen it described this way)
   -- You can keep a user’s profile data separate from their comments, blogs, email, media, etc, but the user profile data would be stored and retrieved as a whole
b. Data are parallelized across many physical instances
c. Data are kept small
d. Data are more highly available
   -- You can also set up a shard to have a master-slave or dual master relationship within the shard to avoid a single point of failure within the shard
e. It doesn’t use replication

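   -- Data is split or moved by means other than replication; this heads off a common misunderstanding about sharding. Scaling purely through replication runs into a write bottleneck (LiveJournal learned this the hard way).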
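A back-of-the-envelope model makes the write bottleneck visible. This is a deliberately simplified sketch under my own assumptions: load is spread perfectly evenly, every replica must apply every write, and the traffic figures are made up.

```python
def per_node_load_replicated(writes, reads, nodes):
    """Every replica applies every write; only reads are spread out."""
    return writes + reads / nodes


def per_node_load_sharded(writes, reads, nodes):
    """Each shard sees only its own share of both writes and reads."""
    return (writes + reads) / nodes


writes, reads = 800, 3200  # hypothetical operations per second
for nodes in (1, 2, 4, 8):
    rep = per_node_load_replicated(writes, reads, nodes)
    sh = per_node_load_sharded(writes, reads, nodes)
    print(f"{nodes} nodes: replicated {rep:.0f}/node, sharded {sh:.0f}/node")
```

In this model the replicated setup can never drop below 800 operations per node no matter how many slaves are added, while the sharded load keeps shrinking as nodes are added.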

Problems with sharding:
1. Rebalancing data

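    -- If some users' data grows too large, the data volume has to be rebalanced across nodes, which is a painful process (Google has a self-adjusting capability). Flickr uses a global name service to locate data. For a traditional modulo-routing architecture, this problem has to be solved sooner or later.
    -- And your references must be invalidateable so the underlying data can be moved while you are using it.
    -- My understanding: each data source also holds related reference data (for example a friend's nick). This reference data reduces dependence on the primary (base) data, so some features can keep working even while the base data is being migrated.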
2. Joining data from multiple shards
   You have to make individual requests to your data sources, get all the responses, and then build the page
   -- Amazon uses a parallel query mechanism to speed this up; this is worth learning from
3. How do you partition your data in shards?
  -- Unfortunately there are no easy answers to these questions.
  -- Indeed, no two businesses are exactly alike; you have to weigh this against your own workload
4. Less leverage
  -- Little has been written on this; most of the time “you are on your own”
5. Implementing shards is not well supported
  -- You have to walk the road yourself and build the tools yourself
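Two of the problems above, locating data that may move and joining across shards, can be sketched together. This is only an illustration under my own assumptions: the shards are in-memory dicts, `DIRECTORY` plays the role of a global name service, and the parallel fetch is a generic scatter-gather, not a description of how Amazon or Flickr actually do it.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for three shard databases.
SHARDS = {
    "shard_a": {1: {"nick": "alice"}, 4: {"nick": "dave"}},
    "shard_b": {2: {"nick": "bob"}},
    "shard_c": {3: {"nick": "carol"}},
}

# Directory (name service): user id -> shard name.
DIRECTORY = {1: "shard_a", 2: "shard_b", 3: "shard_c", 4: "shard_a"}


def fetch_from_shard(shard_name, user_ids):
    """One request to one shard; a real system would run a query here."""
    rows = SHARDS[shard_name]
    return {uid: rows[uid] for uid in user_ids if uid in rows}


def fetch_users(user_ids):
    """Group ids by shard, query the shards in parallel, merge the results."""
    by_shard = {}
    for uid in user_ids:
        by_shard.setdefault(DIRECTORY[uid], []).append(uid)
    if not by_shard:
        return {}

    merged = {}
    with ThreadPoolExecutor(max_workers=len(by_shard)) as pool:
        futures = [
            pool.submit(fetch_from_shard, name, ids)
            for name, ids in by_shard.items()
        ]
        for future in futures:
            merged.update(future.result())
    return merged


def move_user(uid, target):
    """Rebalance one user: copy the row, repoint the directory, drop the old copy."""
    source = DIRECTORY[uid]
    if source == target:
        return
    SHARDS[target][uid] = SHARDS[source][uid]
    DIRECTORY[uid] = target
    del SHARDS[source][uid]


friends = fetch_users([1, 2, 3, 4])
print(sorted(row["nick"] for row in friends.values()))

move_user(4, "shard_c")
print(fetch_users([4]))
```

Because callers go through the directory on every lookup, moving a user only means updating one directory entry; with modulo routing the location is baked into the key and cannot be changed per user.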

=====================================

http://blog.gigaspaces.com//2007/04/06/shared-nothing-architecture-redefined/
 -- Today we have better ways to remove data dependency, without putting the data into a shared file system, which may eventually become a bottleneck. We partition it and store it in-memory
 -- GigaSpaces often boasts about its space-based architecture; in essence it seems much the same as SNA (shared-nothing architecture).

http://en.wikipedia.org/wiki/Shard
  -- database sharding is a method for database partitioning which involves partitioning across multiple servers in a shared nothing architecture.
  -- The term "shard" seems to come from MMOGs
