June 29, 2008


Filed under: Architecture — hunter @ 9:54 pm

Scalability is “the capability of a system to increase performance under an increased load when resources (typically hardware) are added.” (source: Wikipedia)
If performance can double as you add servers, with no interruption to the business... that is what scalability really means.

A big scalability problem with caching data is called the cache-coherence problem.

The third section pitches the shared-nothing architecture: at least on the web servers, do not share any data at all, session


Shard: splitting up your data sets. If your data doesn’t fit on one machine you split it up into pieces and each piece is called a shard.

Sharding: the process of splitting up data.
Shards is for situations where you have too much data to fit in a single database. MySQL partitioning may allow you to delay when you need to shard, but it is still a single database and you’ll eventually run into limits.
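As a minimal sketch of what splitting data across machines looks like in code, the snippet below maps a user key to one of several shards. The host names and the choice of an MD5-based hash are assumptions made for illustration; they do not come from the sources quoted here.

```python
import hashlib

# Hypothetical shard hosts, one database per shard.
SHARDS = ["db0.example.internal", "db1.example.internal", "db2.example.internal"]


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a hash that is stable across processes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def host_for(user_id: str) -> str:
    """Return the host that holds all data for this user."""
    return SHARDS[shard_for(user_id, len(SHARDS))]
```

Every request for the same user lands on the same host, so each database only ever holds its own slice of the data.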


Control over how data are distributed is determined by a pluggable strategies layer.
 — This is how Hibernate implements it
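A pluggable strategy layer can be sketched as a router that accepts any function mapping a key to a shard. This is not the Hibernate API; the names `ShardRouter`, `modulo_strategy` and `range_strategy`, and the 500,000 users-per-shard default, are hypothetical and only illustrate the idea.

```python
from typing import Callable

# A strategy is any callable mapping (user_id, num_shards) to a shard index.
ShardStrategy = Callable[[int, int], int]


def modulo_strategy(user_id: int, num_shards: int) -> int:
    """Spread users evenly by taking the id modulo the shard count."""
    return user_id % num_shards


def range_strategy(user_id: int, num_shards: int, users_per_shard: int = 500_000) -> int:
    """Fill shards in id order; overflow ids go to the last shard."""
    return min(user_id // users_per_shard, num_shards - 1)


class ShardRouter:
    """Routes keys to shards using whichever strategy was plugged in."""

    def __init__(self, num_shards: int, strategy: ShardStrategy) -> None:
        self.num_shards = num_shards
        self.strategy = strategy

    def route(self, user_id: int) -> int:
        return self.strategy(user_id, self.num_shards)
```

Swapping the strategy changes where data lives without touching the calling code, which is the point of keeping the layer pluggable.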

Plan for the future by picking a strategy that will last you a long time.

Repartitioning/resharding the data is operationally very difficult. There are no management tools for this yet.

 — Choosing the sharding strategy matters a great deal; regret is a bitter pill to swallow later
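One way to see why resharding hurts is to count how many keys change shards when the shard count changes. The sketch below assumes plain modulo routing on integer keys, which is an illustrative assumption rather than something the quoted sources prescribe.

```python
def moved_fraction(num_keys: int, old_shards: int, new_shards: int) -> float:
    """Fraction of keys 0..num_keys-1 whose shard changes under modulo routing."""
    moved = sum(1 for k in range(num_keys) if k % old_shards != k % new_shards)
    return moved / num_keys
```

Under this assumption, growing from 4 to 5 shards relocates 80% of the keys, and even doubling from 4 to 8 relocates half of them, so a strategy picked casually is expensive to undo.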




However, this may not be the optimal approach by itself, because not all data belonging to the same user is equal.
 — Explains that the shard model does not suit every kind of data, especially when data differ in importance or in how hot they are

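 — The author suggests that some data, on top of sharding, can be partitioned further by time or by hotness
 — The author assumes that once you shard you no longer need a master-slave architecture. In fact, large systems still need master-slave to make each node more robust. A 1 master : n slaves setup also raises the maximum utilization of a node, and by sending the heavy queries of non-critical workloads to one dedicated slave you keep non-critical work from affecting critical work;
  — Friendster architecture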
  — Each 64bit AMD server would house 500K distinct users and all their data
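  — User-centric: everything related to a user is stored together
  — Mentions several problems with replication; of these, “IO bandwidth is low replication lags, causing slave lag” is the one we most need to weigh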
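The master-slave arrangement inside a single shard, described in the notes above, could be sketched as follows. The `Shard` class, its field names, and the round-robin read policy are all assumptions made for illustration, not a description of any particular system.

```python
from dataclasses import dataclass, field
from itertools import cycle
from typing import Iterator, List


@dataclass
class Shard:
    master: str                # receives every write
    slaves: List[str]          # serve ordinary reads (may lag behind the master)
    reporting_slave: str       # reserved for heavy, non-critical queries
    _rr: Iterator[str] = field(init=False, repr=False, compare=False)

    def __post_init__(self) -> None:
        # Round-robin over the read slaves; fall back to the master if there are none.
        self._rr = cycle(self.slaves or [self.master])

    def node_for(self, *, write: bool = False, heavy: bool = False) -> str:
        """Pick the node inside this shard that should run a statement."""
        if write:
            return self.master
        if heavy:
            return self.reporting_slave
        return next(self._rr)
```

Because slaves can lag, reads that must see the latest write would still need to go to the master; this sketch leaves that decision to the caller.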

a. High availability

    — An extra benefit of shards is that the service can stay partially available
b. Faster queries.
c. More write bandwidth
d. You can do more work

   — Concurrent throughput goes up

a. Data are denormalized

   — Denormalized design: data sharing the same primary key is stored together (the first time I have seen it argued this way)
   — You can keep a user’s profile data separate from their comments, blogs, email, media, etc, but the user profile data would be stored and retrieved as a whole
b. Data are parallelized across many physical instances
c. Data are kept small
d. Data are more highly available
   — You can also setup a shard to have a master-slave or dual master relationship within the shard to avoid a single point of failure within the shard
e. It doesn’t use replication

   — Data is split up and distributed by means other than replication; this clears up a common misunderstanding about sharding
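To make the "stored and retrieved as a whole" property concrete, here is a small sketch in which each user's profile is kept as one document on the shard chosen by the user id. The in-memory dictionaries stand in for real databases, and the JSON encoding, modulo routing and function names are illustrative assumptions.

```python
import json
from typing import Dict, List

NUM_SHARDS = 4
# One in-memory dict per shard stands in for one database per shard.
shards: List[Dict[int, str]] = [{} for _ in range(NUM_SHARDS)]


def save_profile(user_id: int, profile: dict) -> None:
    """Write the whole profile as a single document on the user's shard."""
    shards[user_id % NUM_SHARDS][user_id] = json.dumps(profile)


def load_profile(user_id: int) -> dict:
    """Read the whole profile back with one lookup on one shard, no joins."""
    return json.loads(shards[user_id % NUM_SHARDS][user_id])
```

Reading a profile touches exactly one shard, which is what keeps each query small and lets the shards work in parallel.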


1. Rebalancing data

    — If some users' data grows too large, the data volume across the nodes has to be rebalanced,


        a name service to locate the data; for a traditional architecture that routes by modulo,

    — And your references must be invalidateable so the underlying data can be moved while you are using it.
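    — My reading of this: each data source also carries related reference data (a friend's nick, for example). That reference data lowers the dependency on the primary (base) data, so some features can keep working even while the base data is being migrated.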
2. Joining data from multiple shards
   You have to make individual requests to your data sources, get all the responses, and then build the page
   — Amazon uses a parallel query mechanism to improve query efficiency; this is worth learning from
3. How do you partition your data in shards?
  — Unfortunately there are no easy answers to these questions.
  — True: no two businesses are exactly alike, so you have to weigh this against your own
4. Less leverage
  — Little has been written about this; most of the time “you are on your own”
5. Implementing shards is not well supported
  — You have to walk the road yourself and build the tools yourself
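The first two problems in the list above, locating data that may move and joining data from several shards, can be sketched together. The `ShardDirectory` class is a hypothetical name service with versioned entries, so a cached reference can be checked and invalidated after a move. The thread pool is a generic stand-in for parallel requests and is not a description of Amazon's mechanism.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


class ShardDirectory:
    """Hypothetical name service: user id -> shard id, with movable entries."""

    def __init__(self) -> None:
        self._location: Dict[int, int] = {}
        self._version: Dict[int, int] = {}

    def assign(self, user_id: int, shard_id: int) -> None:
        """Place or move a user; every change bumps the entry's version."""
        self._location[user_id] = shard_id
        self._version[user_id] = self._version.get(user_id, 0) + 1

    def lookup(self, user_id: int) -> Tuple[int, int]:
        """Return (shard id, version) so callers can cache the reference."""
        return self._location[user_id], self._version[user_id]

    def is_current(self, user_id: int, version: int) -> bool:
        """False once the user has moved since the reference was taken."""
        return self._version.get(user_id) == version


def gather(
    user_ids: List[int],
    directory: ShardDirectory,
    fetch: Callable[[int, int], dict],
) -> Dict[int, dict]:
    """Query each user's shard in parallel and collect the responses."""

    def fetch_one(user_id: int) -> dict:
        shard_id, _ = directory.lookup(user_id)
        return fetch(shard_id, user_id)

    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(fetch_one, user_ids))
    return dict(zip(user_ids, results))
```

With a directory, rebalancing one oversized user means changing a single entry rather than recomputing a modulo for everyone, at the cost of an extra lookup on each request.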

 — Today we have better ways to remove data dependency, without putting the data into a shared file system — which may eventually become a bottleneck. We partition it and store it in-memory
 — GigaSpaces often boasts about its space-based architecture, which in essence seems much the same as SNA (shared-nothing architecture),
  — Database sharding is a method for database partitioning which involves partitioning across multiple servers in a shared nothing architecture.
