Cedric Westphal's Networking Notes

New web page

2013-06-25T05:53:00.003-07:00

I've updated my research webpage at UCSC. Check it out here.

A Rant on Reviews

2010-07-21T13:57:00.000-07:00

We submitted a paper to a conference (ICNP) which is quite selective (30 papers accepted for 170 submissions). Grade-wise, the reviews of our paper are decent (three 'accept with room' and one 'accept'). The conference being very competitive, our paper did not get accepted, but I admit to being irritated enough by the strange reasoning of some of the reviews to pen this post.

One reviewer states: 'The evaluation considers only two network graph types... In both cases, they only consider relatively small networks, up to 10^5 nodes. Hence, while the results obtained are indicative, they don't really support the claims of 'extensive simulations' and 'close to optimal on a wide variety of topologies.'

This floored me: 10^5 nodes for routing simulation is actually A LOT. If you want to evaluate your routing on PlanetLab, you get 800 nodes. If you use NS, you can't do over a 1,000 nodes. The other papers we looked at get to 1,000 nodes, maybe 10,000 at the most. We had to implement our own emulator from scratch to get to 300,000 nodes. And it's hardware limitations that stopped us at that scale: with 4GB in RAM, the matrix to describe the network connectivity, if not optimized, is already 10^10 bits on a 10^5 nodes network.

Reviewer keeps going: instead of using home-brewn [sic] scale free graphs, it would be better to use graph generators that are known to produce topologically similar graphs to the Internet; consider e.g. BRITE or INET. This refers to the fact we use, quotes the reviewer, Barabasi-Albert scale free networks, which are known to be somewhat different from the real graphs met in the internet.

So this reviewer wants us to use LARGE graph, generated from BRITE, but NOT Barabasi-Albert. Looking at the BRITE doc, however, one can see that BRITE defines LARGE as 10^5 (see Item 4 on the wish list (link to pdf)); and that BRITE uses...(click to enlarge)

... Barabasi-Albert to create Internet-like topologies.

Another reviewer mentions: The main problem with the paper is that there is no comparison with recent research in the area of designing routing protocols with small stretch and small per-node state. Specifically, there are at least two lines of recent research that are worth comparing against:
- Practical implementation of Compact Routing [the reviewer mentions S4]
- Routing based on distributed Delaunay Triangulation (which guarantees that greedy routing always works) [and the reviewer cites two more papers on Delaunay triangulation].

But, we can't compare with compact routing, because, as our paper states: "The idea of trading-off path stretch for routing table size is the core component of the work on compact routing. Thorup et al. show that it is possible to guarantee a path stretch of three with a route table size growing as O(\sqrt{n \log(n)})." This is much higher than our polylogarithmic scheme. S4 selects K beacons which have to maintain path in between them, and each node has an associated cluster, where the cluster size is about sqrt(n) when the number of beacon is sqrt(n). We consider log(n), how can we compare?

We can't compare with routing on Delaunay triangulations (DT), because we consider graphs, not Euclidean spaces. Plus, DT requires planar connectivity, which we don't assume. We start from a graph G=(E,V) and we try to find a representation in a Euclidean space. To get a DT, you need to be in the Euclidean space first!

So the "main problem" with our paper is that we don't consider work that, while excellent, is not relevant to our results (granted, the reviewer gave us 'accept if room' so the main problem is not a deal breaker for him either). Still, this drives home the point of hypercriticality.

Reliable Embeddings.Designing and Embedding Reliable Virtual Infrastructures

2010-06-16T09:49:00.000-07:00

First time we put something on ArXiv:

Designing and Embedding Reliable Virtual Infrastructures, Wai-Leong Yeow, Cédric Westphal, Ulaş C. Kozat, http://arxiv.org/abs/1005.5367, Submitted May 28th.

This is a technical report version of a paper that was accepted into the ACM SIGCOMM VISA workshop, which will take place in September 2010 in New Dehli, India. I am co-chair of the VISA workshop*, on the topic of Infrastructure Virtualization. This is a very exciting, very current area of research, so hopefully you can attend the workshop!

* I hasten to add that any paper that was co-authored by a VISA chair has been handled out-of-band by the other chairs.

Maranello: Practical Partial Packet Recovery for 802.11

2010-05-18T11:40:00.000-07:00

Maranello: Practical Partial Packet Recovery for 802.11, Bo Han and Aaron Schulman, Francesco Gringoli, Neil Spring and Bobby Bhattacharjee, Lorenzo Nava, Lusheng Ji, Seungjoon Lee, and Robert Miller, in Proc. NSDI'10, San Jose, CA, April 2010.

This is another paper which has built a system that works. There is no super deep theory behind it, but a lot of effort to implement something in a real environment. This is basically a partial packet recovery mechanism which is compatible with 802.11 and works as follows:
- the sender sends a packet to the receiver. Typically, if the packet is received correctly, an Ack is sent; if not, no Ack is sent, and after a time-out, the sender attempts a retransmission. In maranello, if the paper is not received correctly, the receiver still transmit an Ack, except it is a Nack with a checksum of blocks within the packets. Based on the checksum, the sender can retransmit only the corrupted blocks and not the whole packet.

This speeds up the retransmission and reduces the congestion on the air interface. As a consequence, throughput is roughly 50% higher than 802.11 and delay is significantly reduced.

The system is evaluated alongside traditional 802.11, which shows that both system can coexist peacefully.

This was pointed to me as an excellent paper by a Stanford faculty, and it did not disappoint indeed. I have no idea where the name Maranello comes from. I mean, I know of the city in Italy, but not how it relates to partial packet recovery.

Reverse traceroute

2010-05-18T09:48:00.000-07:00

Reverse traceroute, Ethan Katz-Bassett, Harsha V. Madhyastha, Vijay Kumar Adhikari, Colin Scott, Justine Sherry, Peter van Wesep, Thomas Anderson, and Arvind Krishnamurthy, in Proc. NSDI'10, San Jose, CA, April 2010.

I did not attend the conference, it conflicted with other meetings, but have been going through the proceedings. I'm starting with this paper since it got the best paper award, and I might post about a few more.

It's a cute paper and useful. They build a tool to compute the reverse route from the destination back to the source (as opposed to traceroute which identifies the route from the source to the destination). It's a full of good little tricks, it's nicely put together, and it's practical and useful.

The main idea is to use two IP options and source IP spoofing. The IP options are the route recording, which logs up to 9 intermediate relays in the probe packet; and the timestamp option, which is used to confirm whether a specific router is on the path of the probe.

The goal is to find the path D->S, where S is the source and D the destination. Traceroute gives S->D and they want to find the reverse. Using a set of vantage points, they pick one, say V, which is less than 8 hops away from the source. They spoof a packet from V as being from S, with the IP-RR option and send it to D. D returns it to S (due to the spoof) and logs the first hop on the D->S path due to the route recording and the fact that there was room for at least one hop in the header from the choice of V.

If the routers don't support the IP-RR, or if there is no V at less than 8 hops from D, then they use the timestamp option to validate whether a set of guesses is on the path or not.

This is not perfect, but neither is traceroute anyway, and the results offered by the mechanisms are much more useful and accurate than assuming symmetry of the path.

SmartRE: An Architecture for Coordinated Network-wide Redundancy Elimination

2010-04-21T15:25:00.000-07:00

SmartRE: An Architecture for Coordinated Network-wide Redundancy Elimination, Ashok Anand, Vyas Sekar, Aditya Akella, in Proc. of ACM SIGCOMM'09, August 2009, Barcelona, Spain.

Traffic in network exhibits some redundancy and this paper presents an architecture which attempts to encode packets in order to remove these redundancies. It take into account some specific constraints (encoding and decoding rates) and defines where to cache the packets and how to ensure that an encoded packet based on the information at a router will find a decoder further down the path.

The proposed approach approximates relatively well an omniscient encoder and removes most of the redundancy. It also demonstrates that handling network-wide redundancy removal is preferable to hop-by-hop.

While the approach is promising, it assumes that there is redundancy in the traffic. There are some WAN optimization products and thus a market, but some of the evaluation assumes a 10% redundancy. As such, setting up a complex caching and encoding/decoding infrastructure for such a gain does not seem worth it. I would like to see better characterized the true redundancy gain, before dwelling on how to remove these redundancies.

I attended Sigcomm in BCN, and I don't recall seeing this talk, even though I most likely have. It was in a morning session, and I don't think I missed any morning session. Funny how remembering the talk would have saved me from re-discovering these results.

Mobile Call Graphs: Beyond Power-Law and Lognormal Distributions

2010-01-28T16:57:00.000-08:00

Mobile Call Graphs: Beyond Power-Law and Lognormal Distributions, Mukund Seshadri, Sridhar Machiraju, Ashwin Sridharan, Jean Bolot, Christos Faloutsos, Jure Leskovec, Proc. of ACM KDD'08, August 2008, Las Vegas, NV.

This paper analyzes mobile phone calls data from the Sprint network and tries to create a model of the underlying graphs associated with different metrics: Partners (number of callers and callees associated with each user); calls (total number of calls made or received by a given user); duration (total duration of calls for each user). For each of these, the distribution has a power law tail, but a power law distribution does not fit well the rest of the distribution. The lognormal distribution fits better, except for the tail where it is off. So the paper considers the Double Pareto Log Normal (DPLN) distribution and explains how to find the best fit for that distribution.

This distribution would be useful for creating synthetic models to evaluate different social network parameters.

Understanding the Spreading Patterns of Mobile Phone Viruses

2010-01-28T15:42:00.000-08:00

Understanding the Spreading Patterns of Mobile Phone Viruses, Pu Wang, Marta Gonzalez, Cesar Hidalgo, Albert-Laszlo Barabasi, in Science Express, April 2009, DOI: 10.1126/science.1167053

This paper studies data from cell phone usage and mobility and extrapolate a mobility model and a connectivity graph from proximity and from call connections. In the former graph, two points are connected if they are within a given distance of each other; in the latter, they connect if a phone call is made between two points (users).

From the graph, they can derive some basic properties. Also, if an Operating System is assigned to nodes on a graph, then the propagation of viruses targeting that OS can be computed. There is a critical threshold for the market share of a given OS (that is the fraction of cell phones using this OS in the graph) so that the virus becomes epidemic.

The topology of the graph is interesting (similar to that of Mobile Call Graphs by Seshadri et al) and the characterization could be useful in social network studies. The mathematical tools to compute the size of the cluster is also of interest. It's a cute paper. I would be cautious with the results for bluetooth virus propagation, which is extrapolated from the graph data, but is not really validated. I guess the extrapolation cannot be that off due to the multiple data points in between the extrapolated trajectories.

Fast, Memory Efficient Flow Rate Estimation Using Runs

2010-01-12T17:00:00.000-08:00

Fast, Memory Efficient Flow Rate Estimation Using Runs, Fang Hao, Murali Kodialam, T.V. Lakshman, Shantidev Mohanty, in IEEE/ACM Transactionson Networking, Vol. 15, No. 6, December 2007.

The paper suggests to estimate the rate of flows by basically looking indirectly at the flow rate. The quantity is the number of 2-runs, that is the number of time that consecutive arrivals belong to the same flow. The occurrence of a 2-run will depend on the rate of the flow, and from observing the flows during a long enough time, one can infer the fraction of traffic belong to a flow by counting the two-runs.

This is similar in concept to the Flageolet & Martin paper, which observes some metric which is derived from the underlying process. F&M toss a geometric random variable and keep the max every time a flow works through the system; while this paper considers another derivative function of the rate. I believe I have seen a version of F&M used for the purpose of this paper, which seems natural.

The idea is cute, and leads to some straightforward analysis of the accuracy of the method. I was curious about how the assumptions of the model (independent Poisson processes for each flow, stationarity, etc.) would translate in a real link, but for a link with enough traffic, the method seems to work quite well.

Sampling Biases in Network Path Measurements and What to Do About It

2010-01-12T10:57:00.000-08:00

Sampling Biases in Network Path Measurements and What to Do About It, Srikanth Kandula, Ratul Mahajan, in Proc. ACM IMC'09, November 2009, Chicago, IL.

The paper considers the difference in the results obtained by sampled network path measurements with the actual performance. It shows that source sampling is a source of bias, and that picking only a few sources to gather data, and then extrapolating from this data leads to poor results.

The paper then presents three methods to improve the interpolation from the data. One is to consider only the tail of the path, to remove the source bias. This proves relatively ineffective compared to the other two methods. The next method is to embed the coordinates in a low dimensional space and to compute the missing paths in the coordinate space. This provides the best improvement for latency. Finally, the last method consist in clustering the nodes into an organization which somewhat mimics the typical network architecture of an access network near the source, a core network in the middle, and then another access network at the destination. This proves the best method for capacity.

The paper is interesting, mostly empirical, looking at the methods inspired from removing sampling bias in other context and seeing if they work in this context.

The embedding method is interesting to me, owing to our work on embedding graphs for routing. They use vivaldi, for which code exists, but has some limitations.

Empirical Evaluation of Hash Functions for Multipoint Measurements.

2009-12-09T01:33:00.000-08:00

Empirical Evaluation of Hash Functions for Multipoint Measurements. C. Henke, C. Schmoll, T. Zseby, CCR vol. 38, No. 3, July 2008.

This paper is exactly what the title says it is: it uses 25 different hash functions, and assess the performance of these functions with respect to four criteria: performance (how long it takes to compute the hash); non-linearity (how different/correlated are the output given that the input are similar); unbiasedness (does the hash function select packets independently); representativeness of the selected subset (is the hashed output representative of the input).

Of the 25 hash functions, they show that 8 perform well under these criteria. Also, they compare non-crypto hashes and crypto hashes. For non-crypto hashes, it is possible to create an attack so that the packet of an adversary are disproportionally selected by the hash function (this is not a result of this paper, but is referenced here).

The paper is a good way to know which hash function to use for network measurement. It is also nicely written and easy to read.

IEEE NOMS rebuttal redux

2009-11-12T12:40:00.000-08:00

I wrote below about rebuttals, and how the reviews were obscure and vague that I was unable to tell whether the reviewers appreciated the paper or not. I assumed it meant it was in a weak reject-weak accept range.

For two of them, it was weak accept. But the third one gave us a puzzler: "Strong reject- I have strong arguments against acceptance." Then why doesn't you review say so? Why was I unsure earlier reading the review whether you liked the paper or not? I mean, "strong arguments"!

Turns out the paper got accepted anyway as a short paper, which is what it was (you can submit to NOMS as 4 pages or 8 pages, and we went for four), so I'm not particularly bitter. But I certainly hope that when I write a review recommending strong rejection, I make it clear enough why. It was very instructive to view the reviews without the grades, it's a good exercise when writing a review to check whether one can guess the rating from the text.

Efficient Navigation in Scale-Free Newtorks Embedded in Hyperbolic Metric Spaces

2009-11-12T12:11:00.000-08:00

Efficient Navigation in Scale-Free Newtorks Embedded in Hyperbolic Metric Spaces, Dmitri Krioukov, Fragkiskos Papadopoulos, Marian Boguna, Amin Vahdat, Arxiv:0805.1266v1, May 2008.

Some related version was submitted to a SIGMETRICS workshop.

This paper shows that you can create some random topologies using some simple rules in hyperbolic graphs such that these graphs exhibit similar properties than Internet AS topologies (clustering properties, triangles, node degrees).

Within the hyperbolic space with the associated distance, the paper observes that greedy routing work pretty near optimal (where optimal is shortest path) for these topologies. It thus conjectures that, if one could map the Internet AS topology to a hidden hyperbolic topology, one could use this topology to do routing relatively efficiently.

This is very cute, and indeed if you could identify an underlying hidden metric space under the graph topology, that'd be great. A few thoughts quickly come to mind though: first, shortest path is not what's happening in between ASes anyway, so there's another step to take right there. Another thought is that, hyperbolic topology have an inherent scalability issue: because you can cram more points within the space (it's "larger" in some sense), then you have to use more bits to describe each point. An example: in Euclidean geometry, I can put n points equally spaced and use log(n) bits to describe each position. In hyperbolic geometry, I would need n bits to describe each position, since they would be at 1/k^n for them to be equidistant. So I can put infinitely many of them on (0,1], but with O(n) bits for each position. So greedy routing does simplify to some extent, but since O(n) bits gives me shortest path routing in Euclidean space, there's no benefit from that perspective.

Another question is this: some properties are similar in the AS-topology, and in the hyperbolic generated graph. But to what extent the graphs are truly isomorphic? Li, Alderson, Tanaka, Doyle, Willinger have this paper where they confront analysis based on this kind of high level similarities, which could lead to drastically different conclusions. So proceed with caution, it's a treacherous path. But if it leads somewhere, it could have dramatic impact.

IEEE NOMS rebuttal.

2009-10-22T12:22:00.000-07:00

I had no clue that IEEE NOMS would have a rebuttal phase. We only submitted to the conference mostly because it was in Japan (if accepted, we'll combine with a trip to the mothership) and the deadline coincided with our post-doc taking a paternity leave, so was not aware of the process of the conference.

I found the rebuttal phase from the reviewer point of view quite useless during the Globecom PC. From the author point of view, the sample is much smaller; it is quite strange that the reviews are provided without the grade, so I'm unable to say if the reviewers liked the paper or not. Obviously they are in between, they don't think it's the best or worst thing ever, but it's hard to decipher if it's leaning one way or the other. I guess it's mostly due to the format of the question, which emphasizes both positive and negative: strength, weakness and detailed (editorial) comments.

I'm not sure how to reply to the reviews, though, not knowing if the job is to not screw it up, or if it's pointless to try to change someone's mind (can a reviewer of an unconvincing paper be turned around in 3,000 characters?)

An Analysis of Packet Sampling in the Frequency Domain

2009-10-22T11:53:00.000-07:00

An Analysis of Packet Sampling in the Frequency Domain, Luigi Alfredo Grieco, Chadi Barakat, in Proc. of ACM SIGCOM IMC'09, November 2009, Chicago.

I have been looking with great interest at the technical program for IMC'09 (see here and a few more papers to come). I have to say the stuff from other communities look quite exciting. Grass greener and all.

Anyway, this paper is a very cute (and simple) thing: it considers the Fourier transform of the packet size from a source-destination pair, and assesses a rule for the sampling rate in order to avoid aliasing effects when reconstructing the signal from the Fourier transform.

This is somewhat preliminary and applications to compressing the signal or to use it for anomaly detection are not presented (but hinted at). I would have thought the community of traffic measurement would have done this kind of transforms on the signal a long time ago.

On the Interplay Between Routing and Signal Representation for Compressive Sensing in Wireless Sensor Networks

2009-10-22T11:45:00.000-07:00

On the Interplay Between Routing and Signal Representation for Compressive Sensing in Wireless Sensor Networks, Giorgio Quer, Riccardo Masiero, Danilele Munaretto, Michele Rossi, Joerg Widmer, Michele Zorzi.

This paper addresses the same problem as the paper right below. Actually, it's interesting that both papers cite versions each other, which is quite rare.

This paper is a bit more optimistic in the use of compressed sensing in conjunction with routing and data aggregation in sensor networks, but still raises some issues into which random projection would be best for real data (the best results are obtained for synthetic data). It seems there is a trial-and-error phase to see what compressed sensing bring to this problem, with no definite answer thus far.

Disclosure: one of the authors here, Joerg, is my colleague in Munich.

Spatially-Localized Compressed Sensing and Routing in Multi-Hop Sensor Networks

2009-10-22T11:26:00.000-07:00

Spatially-Localized Compressed Sensing and Routing in Multi-Hop Sensor Networks, Sungwon Lee, Sundeep Pattem, Maheswaran Sathiamoorthy, Bhaskar Krishnamachari, Antonio Ortega in Proc. of the 3rd International Conference on GeoSensor Networks, 2009.

This paper considers the use of compressed sensing with data aggregation in sensor networks. The paper proposes a clustering scheme to compressed the data along the cluster, and reconstruct the data later (either independently within a cluster, or jointly, assuming some spatial correlation between the clusters). The results are not very promising for the approach, once energy costs (for transmission) are taken into account. But there are not too many references using compressed sensing in concert with routing applications.

Space-Code Bloom Filter for Efficient Per-Flow Traffic Measurement

2009-10-21T17:21:00.001-07:00

Space-Code Bloom Filter for Efficient Per-Flow Traffic Measurement Abhishek Kumar, Jun (Jim) Xu, Jia Wang, Oliver Spatschek, Li (Erran) Li, in Proc. of the 3rd ACM SIGCOMM conference on Internet Measurement 2003.

The paper uses a bloom filters for counting traffic. It does so by having l Bloom filters (BF), and each time a flow comes in, it chooses one of the l BFs at random. The idea is that, by counting how many different BFs are hit by the request, one can assess an estimate of the number of packets. If the BFs are chosen with equal probability, then after l log(l) packets, all of the l would return a positive reply to a query membership for the packet. Of course, that would restrict the counting to low values, so they don't select the BF uniformly, but at random with a geometric distribution.

If that reminds you of the paper below, it's basically the same idea (the authors were most likely unaware of that paper, since they don't cite it, and don't make use of the theory developed in the Flajolet/Martin paper).

The theory should be a bit more involved than Flajolet/Martin since the BF introduce some noise (false positive) that basically would tend to overestimate a bit the packet count if the F/M analysis was used straight out of the box. So this paper derives another way of inverting the counting from the BFs which, in the simulations, does not skew either over/under but seems to have its error centered around the original value.

Anyway, it seems that this counting using geometric random variables is quite popular a structure.

RouteBricks: Exploiting Parallelism to Scale Software Routers

2009-10-21T16:41:00.000-07:00

RouteBricks: Exploiting Parallelism to Scale Software Routers, Mihai Doberscu, Norbert Egi, Katerina Argyrako, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia Ratnasamy

This paper got Best Paper award at SOSP, and I got pointed to it by a blog post here. Actually, I believe that Katerina and/or Sylvia discussed the concept of RouteBricks during some lunch during SIGCOMM in Barcelona, so I kinda had heard of it. Congrats for the award!

Paper is deserving of it, for sure. It presents an architecture to create a router using commodity hardware (well, sophisticated commodity hardware, prototype Nehalem servers, you can't build this in your garage yet) to achieve the performance of a Cisco router at a fraction of the cost. SOSP has the slides and the recorded audio of the presentation on the website, which is pretty neat.

Most of the details are in how to avoid the bottlenecks in the architecture to spread the traffic over several servers rather than into a choking point. They use valiant load balancing, and a multi-hop butterfly topology to connect them. They also need some optimization in the processing to get to faster rates (multi-queue NIC, batch processing).

My thoughts reading this was: what a good idea, and what a nicely written paper! Also, what kind of reliability do you get from this platform (I would assume Cisco routers have some up-time requirements, fault tolerance, hot swappable line cards, whatever, to keep the ISP happy); I was thinking about reliability from the cellular network operator point of view (since I was thinking if this router design could translate into, say, a commodity base station. Of course, on 2nd thought, what un-commoditize the BS is the air interface, not the switching part of it). I also would like to know what Cisco margins are, since they can only compare the cost of their router to a list price.

Weighted Bloom Filter

2009-10-21T16:27:00.000-07:00

Weighted Bloom Filter, Jehoshua Bruck, Jie Gao, Anxiao Jiang, in Proc. IEEE International Symposium on Information Theory (ISIT'06), July 2006.

This paper makes an interesting observation: a Bloom filter functions as follows. For each element, if the element belongs to a specific set, then the element is hashed into a bit vector of length m. It is hashed by computing k hash functions, each pointing to a bit in the bit vector which is then flipped to 1 (if it is 0).

The observation of the paper is that, if you know a priori the frequency of the requests for each elements, then you should not allocate the same number of bits to each element in the bit vector, but vary the number of hash functions k as a function of the popularity of the request.

The paper goes on to produce a function which gives the number of hash vectors for each element depending on the popularity. The counter-argument to this would be naturally that it is only shifting the problem around: the membership query structure is optimized, but at the cost of creating another structure to assign the popularity of the item.

Actually, paper answers this objection: it shows that a simple bucketing of "hot" items vs "cold" items lead to improvement in the performance of the Weighted Bloom Filter vs the regular Bloom Filter. So there is no need for a complicated structure to define the popularity.

I had a chuckle thinking how to verify if an item is hot or cold: query the membership in the "hot" set using a Bloom filter!

I have met one of the authors (Andrew) when I visited CalTech back in 2005, where he gave just for me a dry run presentation of his MobiHoc paper on the NP complexity of locating a user through angle triangulation.

Probabilistic Counting Algorithms for Data Base Applications

2009-10-21T15:54:00.001-07:00

Probabilistic Counting Algorithms for Data Base Applications Philippe Flajolet, Nigel Martin, Journal of Computer and System Sciences, Vol. 31, No. 2, October 1985.

This seems to be a classic paper. It was pointed as a major reference in a review request, so I went to read it. I'm glad I did, it is very cute. It basically describes a counting schemes for knowing the number of distinct items in a set. It is a probabilistic mechanism with a single pass and low storage requirement.

It works as follows: for every item in the set, it computes a hash. The hash maps to a random variable with a geometric distribution, such that it is equal 0 with probability 1/2, 1 wp 1/4, 2 wp 1/8, etc. up to 2^L-1. Consider a bit vector of length 2^L, and flip the i-th bit to 1 if the geometric random variable points to i (keep it to 0 otherwise). Since two items in the data base that are identical will hash to the same value, then the number of times that the geometric random variables is generated is a function of the number of distinct items.

For a large number, the values of 0, 1, ..etc will be one in the bit vector. The first non-1 bit is used to estimate the number of different items. Denote by R the left-most entry in the bit vector which has value 0. Then the paper shows that R is about log(\phi n), with \phi = 0.77351, with a relatively small variance, such that the number n can be estimated from R relatively accurately. The size of the counter is this bitvector L, which has size of O(log(n)).

The paper makes some of the proofs formal, but the analysis is relatively complex. But I like the idea of looking at some properties of a sequence of random variables to estimate the number of elements in the sequence. I believe I have seen somewhere a reference where the longest run of a particular value is used to estimate the number of items (for instance, for head/tail coin toss, the longest run of heads will grow as O(log(n)), if I recall well, which would lead to a log(log(n)) bit size for such scheme).

The interesting question is: what other proxy can be used to compress the count, and can be inverted relatively easily to yield a good estimate of the original data.

Architectures for the Future Networks and the Next Generation Internet: A Survey

2009-10-21T15:46:00.001-07:00

Architectures for the Future Networks and the Next Generation Internet: A Survey, Subharthi Paul, Jianli Pan, Raj Jain, Washington University in St Louis Technical Report, 2009-69.

This survey attempts to cover possible architectures for the future network. It is very relevant to me, since I had at one point the intent of putting together a similar technical report. And after reading this one, I realize that I could still do it: while this paper covers a lot of ground, much of the papers I read on the topic don't overlap with this: it is such a wide topic to cover that only 60 pages and 200 references don't seem to do it justice!

The survey contains basically eight sections, one on the Internet 3.0 vision at WUSTL, and one each on security, content delivery, DTNs, management and control, service architectures, routing, and future Internet infrastructure design. I was interested mostly in the management & control, routing and future Internet design (GENI). I could use this as a reference in those fields, as it contains a lot of the must-read papers and it explains in great details the different control frameworks for GENI.

Where's that Phone?: Geolocating IP Addresses on 3G Networks

2009-10-21T10:28:00.000-07:00

Where's that Phone?: Geolocating IP Addresses on 3G Networks, Mahesh Balakrishnan, Iqbal Mohomed, Venugopalan Ramasubramanian, in Proc. of SIGCOMM Internet Measurement Conference'09, November 2009.

I was looking at the program for IMC'09 and this paper caught my interest. It is a very basic experimental study which shows that:

for a given mobile device, the IP address assigned by the operator (in most cases, AT&T wireless) is not static, but changes a lot over time. This creates issues (or opportunities) for IP-filtered web sites.

for a given IP address, the location of the device cannot be predicted: the same IP address could be assigned at different times to device in wildly different locations. This has implications, in particular for the service which hopes to translate the IP address into location-based advertising.

the delay information is the best predictor for location: there is correlation in the application delay and the location of the device.

Those are quite interesting results to keep in mind, as the number of mobile devices connected to the Internet through 3G is increasing rapidly. So it is a really cute observation.

Peer Review Does Not Work,... or Does It?

2009-10-21T10:27:00.001-07:00

In my inbox:

From: ISPR/KGCM 2009 [mailto:...@.ICTconfer.org]
Sent: Thursday, February 12, 2009 3:38 PM
Subject: Invitation to a Symposium on Peer Reviewing

Only 8% members of the Scientific Research Society agreed that "peer review works well as it is." (Chubin and Hackett, 1990; p.192).

"A recent U.S. Supreme Court decision and an analysis of the peer review system substantiate complaints about this fundamental aspect of scientific research." (Horrobin, 2001).

Horrobin concludes that peer review "is a non-validated charade whose processes generate results little better than does chance." (Horrobin, 2001). This has been statistically proven and reported by an increasing number of journal editors.

Since a growing number of studies conclude that peer review is flawed and ineffective as it is being implemented, why not apply scientific and engineering research and methods to the peer review process?

This is the purpose of the International Symposium on Peer Reviewing:... which will be held on July 10-13, 2009, in Orlando, Florida, USA.

So peer review is ineffective and flawed and sucks. And the punch line?

All Submitted papers will be reviewed using a double-blind (at least three reviewers), non-blind, and participative peer review.

Time to submit a randomly generated paper!.

Back from Hiatus.

2009-05-27T10:08:00.000-07:00

I haven't posted here in a while, but I'll make an effort to do so. It is valuable for me to have some thoughts and references stored somewhere for easy, ubiquitous access.

It turns out I accepted to serve on the Globecom PC this year, and I just noticed that they will open a rebuttal phase for the authors. This is very intriguing: globecom receives relatively short papers, usually not very mature work. Someone once described the publication phases in networking as: workshop paper -> globecom or ICC -> infocom -> journal, as the results mature and become more complete. Also, Globecom is not very selective (say, 35-40% acceptance rate nowadays).

On the reviewer side, I can only speak for myself, but different conferences call for different level of scrutiny and time spent reading the papers. I see my job for Globecom as sorting the papers out, not as much as providing feedback to the authors: since the work is still in progress, I give a general indication of whether I think it's worth pursuing. But I won't go into the details of a still immature, evolving research. Mostly, unlike a journal or an Infocom/Mobihoc/MobiCom review, I keep it succinct. I usually keep my notes after a review, but I might not go through the trouble with 2nd tier conferences. Had I known there would be a rebuttal, I would have, dang.

As an author, I have wished in the past there would be a rebuttal. As a program chair, I've had (unrequited) rebuttal emails from disgruntled authors (who had a point, I'll concede that a clarification was necessary). So authors like rebuttals, solicited or not. That and my short reviews, it might be an explosive mix!