Identity and Search in Social Networks

Duncan J. Watts,¹,²,³,* Peter Sheridan Dodds,² M. E. J. Newman³

Social networks have the surprising property of being "searchable": Ordinary people are capable of directing messages through their network of acquaintances to reach a specific but distant target person in only a few steps. We present a model that offers an explanation of social network searchability in terms of recognizable personal identities: sets of characteristics measured along a number of social dimensions. Our model defines a class of searchable networks and a method for searching them that may be applicable to many network search problems, including the location of data files in peer-to-peer networks, pages on the World Wide Web, and information in distributed databases.

¹ Department of Sociology, Columbia University, New York, NY 10027, USA.
² Columbia Earth Institute, Columbia University, New York, NY 10027, USA.
³ Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA.
* To whom correspondence should be addressed. E-mail: djw24{at}columbia.edu

In the late 1960s, Travers and Milgram (1) conducted an experiment in which randomly selected individuals in Boston, Massachusetts, and Omaha, Nebraska, were asked to direct letters to a target person in Boston, each forwarding his or her letter to a single acquaintance whom they judged to be closer than themselves to the target. Subsequent recipients did the same. The average length of the resulting acquaintance chains for the letters that eventually reached the target (roughly 20%) was about six. This reveals not only that short paths exist (2, 3) between individuals in a large social network but that ordinary people can find these short paths (4). This is not a trivial statement, because people rarely have more than local knowledge about the network. People know who their friends are. They may also know who some of their friends' friends are. But no one knows the identities of the entire chain of individuals between themselves and an arbitrary target.

The property of being able to find a target quickly, which we call searchability, has been shown to exist in certain specific classes of networks that either possess a certain fraction of hubs [highly connected nodes which, once reached, can distribute messages to all parts of the network (5-7)] or are built upon an underlying geometric lattice that acts as a proxy for "social space" (4). Neither of these network types, however, is a satisfactory model of society.

Here, we present a model for a social network that is based upon plausible social structures and offers an explanation for the phenomenon of searchability. Our model follows naturally from six contentions about social networks.

1) Individuals in social networks are endowed not only with network ties, but identities (8): sets of characteristics attributed to them by themselves and others by virtue of their association with, and participation in, social groups (9, 10). The term "group" refers to any collection of individuals with which some well-defined set of social characteristics is associated.

2) Individuals break down, or partition, the world hierarchically into a series of layers, where the top layer accounts for the entire world and each successively deeper layer represents a cognitive division into a greater number of increasingly specific groups. In principle, this process of distinction by division can be pursued all the way down to the level of individuals, at which point each person is uniquely associated with his or her own group. For purposes of identification, however, people do not typically do this, instead terminating the process at the level where the corresponding group size g becomes cognitively manageable. Academic departments, for example, are sometimes small enough to function as a single group but tend to split into specialized subgroups as they grow larger. A reasonable upper bound on group size (9) is g ≈ 100, a number that we incorporate into our model (Fig. 1A). We define the similarity x_ij between individuals i and j as the height of their lowest common ancestor level in the resulting hierarchy, setting x_ij = 1 if i and j belong to the same group. The hierarchy is fully characterized by depth l and constant branching ratio b. The hierarchy is a purely cognitive construct for measuring social distance, and not an actual network. The real network of social connections is constructed as follows.

Fig. 1. (A) Individuals (dots) belong to groups (ellipses) that in turn belong to groups of groups, and so on, giving rise to a hierarchical categorization scheme. In this example, groups are composed of g = 6 individuals and the hierarchy has l = 4 levels with a branching ratio of b = 2. Individuals in the same group are considered to be a distance x = 1 apart, and the maximum separation of two individuals is x = l. The individuals i and j belong to a category two levels above that of their respective groups, and the distance between them is x_ij = 3. Individuals each have z friends in the model and are more likely to be connected with each other the closer their groups are. (B) The complete model has many hierarchies indexed by h = 1, ..., H, and the combined social distance y_ij between nodes i and j is taken to be the minimum ultrametric distance over all hierarchies, y_ij = min_h x_ij^h. The simple example shown here for H = 2 demonstrates that social distance can violate the triangle inequality: y_ij = 1 because i and j belong to the same group under the first hierarchy, and similarly y_jk = 1, but i and k remain distant in both hierarchies, giving y_ik = 4 > y_ij + y_jk = 2.

3) Group membership, in addition to defining individual identity, is a primary basis for social interaction (10, 11) and therefore acquaintanceship. As such, the probability of acquaintance between individuals i and j decreases with decreasing similarity of the groups to which they respectively belong. We model this by choosing an individual i at random and a link distance x with probability p(x) = c exp(−αx), where α is a tunable parameter and c is a normalizing constant. We then choose a second node j uniformly among all nodes that are a distance x from i, repeating this process until we have constructed a network in which individuals have an average number of friends z. The parameter α is therefore a measure of homophily, the tendency of like to associate with like. When e^(−α) ≪ 1, all links will be as short as possible, and individuals will connect only to those most similar to themselves (i.e., members of their own bottom-level group), yielding a completely homophilous world of isolated cliques. By contrast, when e^(−α) = b, any individual is equally likely to interact with any other, yielding a uniform random graph (12) in which the notion of individual similarity or dissimilarity has become irrelevant.

4) Individuals hierarchically partition the social world in more than one way (for example, by geography and by occupation). We assume that these categories are independent, in the sense that proximity in one does not imply proximity in another. For example, two people may live in the same town but not share the same profession. In our model, we represent each such social dimension by an independently partitioned hierarchy. A node's identity is then defined as an H-dimensional coordinate vector v_i, where v_i^h is the position of node i in the hth hierarchy, or dimension. Each node i is randomly assigned a coordinate in each of H dimensions and is then allocated neighbors (friends) as described above, where now it randomly chooses a dimension h (e.g., occupation) to use for each tie. When H = 1 and e^(−α) ≪ 1, the density of network ties must obey the constraint z < g.

5) On the basis of their perceived similarity with other nodes, individuals construct a measure of "social distance" y_ij, which we define as the minimum ultrametric distance over all dimensions between two nodes i and j; i.e., y_ij = min_h x_ij^h. This minimum metric captures the intuitive notion that closeness in only a single dimension is sufficient to connote affiliation (for example, geographically and ethnically distant researchers who collaborate on the same project). A consequence of this minimal metric, depicted in Fig. 1B, is that social distance violates the triangle inequality (hence it is not a true metric distance), because individuals i and j can be close in dimension h₁, and individuals j and k can be close in dimension h₂, yet i and k can be far apart in both dimensions.

6) Individuals forward a message to a single neighbor given only local information about the network. Here, we suppose that each node i knows only its own coordinate vector v_i, the coordinate vectors v_j of its immediate network neighbors, and the coordinate vector v_t of a given target individual, but is otherwise ignorant of the identities or network ties of nodes beyond its immediate circle of acquaintances.
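The distance definitions in contentions 2 through 5 can be illustrated with a minimal Python sketch. It rests on assumptions that are not part of the model description: bottom-level groups in each hierarchy are labeled by integers 0, 1, ..., b^(l−1) − 1, so that integer division by b moves a group one level up, and the function names are hypothetical. The sketch only computes x_ij, y_ij, and the link-distance probabilities p(x); it does not build the full network.

```python
import math
import random


def hier_distance(gi, gj, b):
    """Height x of the lowest common ancestor of bottom-level groups gi, gj.

    x = 1 when both individuals sit in the same group (assumed integer labels).
    """
    x = 1
    while gi != gj:
        gi //= b  # move both groups one level up the hierarchy
        gj //= b
        x += 1
    return x


def social_distance(vi, vj, b):
    """y_ij = min over hierarchies h of x_ij^h for coordinate vectors vi, vj."""
    return min(hier_distance(a, c, b) for a, c in zip(vi, vj))


def link_distance_probs(alpha, depth):
    """Normalized p(x) = c * exp(-alpha * x) for x = 1..depth."""
    weights = [math.exp(-alpha * x) for x in range(1, depth + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_link_distance(alpha, depth, rng=random):
    """Draw one link distance x from p(x)."""
    xs = list(range(1, depth + 1))
    return rng.choices(xs, weights=link_distance_probs(alpha, depth))[0]


# Setting of Fig. 1: b = 2 and l = 4, so groups are labeled 0..7.
B, DEPTH = 2, 4

# Groups 0 and 2 share a category two levels above the group level.
x_ij = hier_distance(0, 2, B)

# H = 2 example in the spirit of Fig. 1B: i and j share a group in the
# first hierarchy, j and k share a group in the second.
v_i, v_j, v_k = (0, 7), (0, 0), (7, 0)
y_ij = social_distance(v_i, v_j, B)
y_jk = social_distance(v_j, v_k, B)
y_ik = social_distance(v_i, v_k, B)
print(x_ij, y_ij, y_jk, y_ik, y_ik > y_ij + y_jk)
```

In this toy setting the last comparison reproduces the triangle-inequality violation described in contention 5. Setting alpha = 0 in link_distance_probs makes every link distance equally likely, whereas large alpha concentrates almost all weight on x = 1.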

Individuals therefore have two kinds of partial information: social distance, which can be measured globally but which is not a true distance (and hence can yield misleading estimates); and network paths, which generate true distances but which are known only locally. Although neither kind of information alone is sufficient to perform efficient searches, here we show that a simple algorithm that combines knowledge of network ties and social identity can succeed in directing messages efficiently. The algorithm we implement is the same greedy algorithm Milgram suggested: Each member i of a message chain forwards the message to its neighbor j who is closest to the target t in terms of social distance; that is, y_jt is minimized over all j in i's network neighborhood.
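The greedy forwarding rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the simulation code behind the results reported here: the four-node network, the integer group labels, the function names, the rule that a holder hands the message straight to the target when the target is a neighbor, the arbitrary tie-breaking of min, and the max_steps cap that guards against cycles are all hypothetical choices made for the example.

```python
import random


def hier_distance(gi, gj, b):
    """Height of the lowest common ancestor of groups gi and gj (1 = same group)."""
    x = 1
    while gi != gj:
        gi //= b
        gj //= b
        x += 1
    return x


def social_distance(vi, vj, b):
    """Minimum hierarchical distance over all dimensions."""
    return min(hier_distance(a, c, b) for a, c in zip(vi, vj))


def greedy_chain(source, target, neighbors, coords, b,
                 p=0.0, max_steps=100, rng=random):
    """Forward a message greedily; return the chain length, or None on failure.

    p is the per-step attrition probability. max_steps is an assumed
    safeguard against cycles and is not part of the model description.
    """
    current, steps = source, 0
    while current != target:
        if steps >= max_steps or rng.random() < p:
            return None  # chain terminated before reaching the target
        if target in neighbors[current]:
            current = target  # assumed: deliver directly to a known target
        else:
            # Only local information is used: the holder's neighbors and
            # the target's coordinate vector.
            current = min(
                neighbors[current],
                key=lambda j: social_distance(coords[j], coords[target], b),
            )
        steps += 1
    return steps


# Hypothetical toy network with H = 2 hierarchies, b = 2, and 4 groups each.
B = 2
COORDS = {"s": (0, 0), "a": (0, 3), "c": (1, 1), "t": (3, 3)}
NEIGHBORS = {"s": ["a", "c"], "a": ["s", "t"], "c": ["s"], "t": ["a"]}

print(greedy_chain("s", "t", NEIGHBORS, COORDS, B))
```

Here s forwards to a rather than c because a shares the target's group in the second dimension, even though a is as far from the target as c is in the first dimension; this is the role the minimum over dimensions plays in the search.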

Our principal objective is to determine the conditions under which the average length ⟨L⟩ of a message chain connecting a randomly selected sender s to a random target t is small. Although "small" has recently been taken to mean that ⟨L⟩ grows slowly with the population size N (13, 14), Travers and Milgram found only that chain lengths were short. Furthermore, these message chains had to be short in an absolute sense because at each step, they were observed to terminate with probability p ≈ 0.25 (1, 15). We therefore adopt a more realistic, functional notion of efficient search, defining for a given message failure probability p, a searchable network as any network for which q, the probability of an arbitrary message chain reaching its target, is at least a fixed value r. In terms of chain length, we formally require q = ⟨(1 − p)^L⟩ ≥ r, and from this we can obtain an estimate of the maximum required ⟨L⟩ using the approximated inequality ⟨L⟩ ≤ ln r / ln(1 − p). For the purposes of this study, we set r = 0.05 and p = 0.25, yielding the stringent requirement that ⟨L⟩ ≤ 10.4 independent of the population size N. Figure 2A presents a typical phase diagram in H and α, outlining the searchable network region for several choices of N, with g = 100 and z = g − 1 = 99.
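The numerical threshold can be checked with a short Python sketch; the function names are hypothetical. Because (1 − p)^L is convex in L, Jensen's inequality gives ⟨(1 − p)^L⟩ ≥ (1 − p)^⟨L⟩, so the bound on ⟨L⟩ is a sufficient condition for q ≥ r rather than an exact restatement of it, which is one way to read the word "approximated" above.

```python
import math


def max_mean_chain_length(r, p):
    """Bound <L> <= ln(r) / ln(1 - p) for completion threshold r and attrition p."""
    return math.log(r) / math.log(1.0 - p)


def completion_probability(chain_lengths, p):
    """q = <(1 - p)^L>, averaged over the attrition-free chain lengths."""
    return sum((1.0 - p) ** L for L in chain_lengths) / len(chain_lengths)


def is_searchable(chain_lengths, p, r):
    """Searchability criterion q >= r."""
    return completion_probability(chain_lengths, p) >= r


# Values used in the text: r = 0.05 and p = 0.25.
print(round(max_mean_chain_length(0.05, 0.25), 1))
```

With r = 0.05 and p = 0.25 the bound evaluates to ln 0.05 / ln 0.75 ≈ 10.4, matching the requirement stated in the text.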

Fig. 2. (A) Regions in H-α space where searchable networks exist for varying numbers of individual nodes N (probability of message failure p = 0.25, branching ratio b = 2, group size g = 100, average degree z = g − 1 = 99, 10⁵ chains sampled per network). The searchability criterion is that the probability of message completion q must be at least r = 0.05. The lines correspond to boundaries of the searchable network region for N = 102,400 (solid), N = 204,800 (dot-dash), and N = 409,600 (dash). The region of searchable networks shrinks with N, vanishing at a finite value of N that depends on the model parameters. Note that z < g is required to explore H-α space because for H = 1 and α sufficiently large, an individual's neighbors must all be contained within their sole local group. (B) Probability of message completion q(H) when α = 0 (squares) and α = 2 (circles) for the N = 102,400 data set used in (A). The horizontal line shows the position of the threshold r = 0.05. Open symbols indicate that the network is searchable (q ≥ r) and closed symbols mean otherwise. For α = 0, searchability degrades with each additional hierarchy. For the homophilous case of α = 2 with a single hierarchy, less than 1% of all searches find their target (q ≈ 0.004). Adding just one other hierarchy increases the success rate to q ≈ 0.144, and q slowly decreases with H thereafter.

Our main result is that searchable networks occupy a broad region of parameter space (α, H) which, as we argue below, corresponds to choices of the model parameters that are the most sociologically plausible. Hence our model suggests that searchability is a generic property of real-world social networks. We support this claim with some further observations and demonstrate that our model can account for Milgram's experimental findings.

First, we observe that almost all searchable networks display α > 0 and H > 1, consistent with the notion that individuals are essentially homophilous (that is, they associate preferentially with like individuals) but judge similarity along more than one social dimension. Neither the precise degree to which they are homophilous, nor the exact number of dimensions they choose to use, appears to be important: almost any reasonable choice will do. The best performance, over the largest interval of α, is achieved for H = 2 or 3, an interesting result in light of empirical evidence (16) that individuals across different cultures in small-world experiments typically use two or three dimensions when forwarding a message.

Second, as Fig. 2B shows, although increasing the number of independent dimensions from H = 1 yields a dramatic reduction in delivery time for values of α > 0, this improvement is gradually lost as H is increased further. Hence the window of searchable networks in Fig. 2A exhibits an upper boundary in H. Because ties associated with any one dimension are allocated independently with respect to ties in any other dimension, and because for fixed average degree z, larger H necessarily implies fewer ties per dimension, the network ties become less correlated as H increases. In the limit of large H, the network becomes essentially a random graph (regardless of α) and the search algorithm becomes a random walk. An effective decentralized search therefore requires a balance (albeit a highly forgiving one) of categorical flexibility and constraint.

Finally, by introducing parameter choices that are consistent with Milgram's experiment (N = 10⁸, p = 0.25) (1), as well as with subsequent empirical findings (z = 300, H = 2) (17, 16), we can compare the distribution of chain lengths in our model with that of Travers and Milgram (1) for plausible values of α and b. As Fig. 3 shows, we obtain ⟨L⟩ ≈ 6.7 for α = 1 and b = 10, indicating that our model captures the essence of the real small-world problem. This agreement is robust with respect to variations in the branching ratio, showing little change over the range 5 < b < 50.

Fig. 3. Comparison between n(L), the number of completed chains of length L, taken from the original small-world experiment (1) (bar graph) and from an example of our model with N = 10⁸ individuals (filled circles with the line being a guide for the eye). The experimental data shown are for the 42 completed chains that originated in Nebraska. (We have excluded the 24 completed chains that originated in Boston as this would correspond to N ≈ 10⁶.) The model parameters are H = 2, α = 1, b = 10, g = 100, and z = 300; message attrition rate is set at 25%; n(L) for the model is compiled from 10⁶ random chains and is normalized to match the 42 completed chains that started in Nebraska. The average chain length of Milgram's experiment is ~6.5, whereas the model yields ⟨L⟩ ≈ 6.7. The distributions compare well: A two-sided Kolmogorov-Smirnov test yields a P-value of P ≈ 0.57, whereas for a χ² test, χ² ≈ 5.46 and P ≈ 0.49 (seven bins). (A large value of P supports the hypothesis that the distributions are similar.) Even without attrition, the model's average search time is ⟨L⟩ ≈ 8.5 and the median chain length is 8. The model does not entirely match the experimental data because the former requires approximately 360 initial chains to achieve 42 completions as compared with 196.

Although sociological in origin, our model is relevant to a broad class of decentralized search problems, such as peer-to-peer networking, in which centralized servers are excluded either by design or by necessity, and where broadcast-type searches (i.e., forwarding messages to all neighbors rather than just one) are ruled out because of congestion constraints (6). In essence, our model applies to any data structure in which data elements exhibit quantifiable characteristics analogous to our notion of identity, and similarity between two elements (whether people, music files, Web pages, or research reports) can be judged along more than one dimension. One of the principal difficulties with designing robust databases (18) is the absence of a unique classification scheme that all users of the database can apply consistently to place and locate files. Two musical songs, for example, can be similar because they belong to the same genre or because they were created in the same year. Our model transforms this difficulty into an asset, allowing all such classification schemes to exist simultaneously, and connecting data elements preferentially to similar elements in multiple dimensions. Efficient decentralized searches can then be conducted by means of simple, greedy algorithms providing only that the characteristics of the target element and the current element's immediate neighbors are known.

REFERENCES AND NOTES

1.	J. Travers and S. Milgram, Sociometry 32, 425 (1969).
2.	D. J. Watts and S. H. Strogatz, Nature 393, 440 (1998).
3.	S. H. Strogatz, Nature 410, 268 (2001).
4.	J. Kleinberg, Nature 406, 845 (2000).
5.	A.-L. Barabási and R. Albert, Science 286, 509 (1999).
6.	L. Adamic, R. Lukose, A. Puniyani, B. Huberman, Phys. Rev. E 64, 046135 (2001).
7.	B. J. Kim, C. N. Yoon, S. K. Han, H. Jeong, Phys. Rev. E 65, 027103 (2002).
8.	H. C. White, Identity and Control (Princeton Univ. Press, Princeton, NJ, 1992).
9.	G. Simmel, Am. J. Sociol. 8, 1 (1902).
10.	F. S. Nadel, Theory of Social Structure (Free Press, Glencoe, IL, 1957).
11.	R. Breiger, Social Forces 53, 181 (1974).
12.	B. Bollobás, Random Graphs (Academic Press, New York, 1985).
13.	M. Newman and D. Watts, Phys. Rev. E 60, 7332 (1999).
14.	J. Kleinberg, Proc. 32nd ACM Symposium on Theory of Computing (Association for Computing Machinery, New York, 2000).
15.	H. C. White, Social Forces 49, 259 (1970).
16.	H. Bernard, P. Killworth, M. Evans, C. McCarty, G. Shelly, Ethnology 27, 155 (1988).
17.	P. Killworth and H. Bernard, Soc. Networks 1, 159 (1978).
18.	B. Manneville, The Biology of Business: Decoding the Natural Laws of the Enterprise (Jossey-Bass, San Francisco, 1999), chap. 5.
19.	We thank J. Kleinberg for beneficial discussions. This work was funded in part by the National Science Foundation (grants SES-00-94162 and DMS-0109086), the Intel Corporation, and the Columbia University Office of Strategic Initiatives.

23 January 2002; accepted 3 April 2002
10.1126/science.1070120