Sunday, May 20, 2012

What is time ?

There is an anecdote about a foreigner visiting London asking a man on the street “what is time?” and receiving the answer “I’m sorry, but I am not a philosopher”.

I don’t want to discuss here the philosophical or physical questions of what time is, but rather what we mean by time in telecommunications applications. In particular, we frequently hear the terms “UTC”, “GPS time”, “NTP time”, and “1588 time”, and I would like to clarify what these terms mean.

Everything starts with the question “what is a second?”. Until 1960 the second’s duration was based on the rotation of Earth. Specifically, the second was defined as 1/86,400 of a mean solar day (24*60*60 = 86,400). Unfortunately, the Earth’s rotation is slowing down due to tidal friction, and so between 1960 and 1967 the second was redefined as a particular fraction of the duration of the year 1900. Since it is hard to reproduce the year 1900 in the lab, the second was finally linked to a stable, reproducible, physical phenomenon, namely the radiation emitted when an electron transitions between the two hyperfine levels of the ground state of the cesium-133 atom. Cesium atomic clocks need only count 9,192,631,770 oscillations and declare that a second has passed. (Cesium is chosen because all of its 55 electrons except the outermost one are in stable shells, minimizing their effect on the outermost electron.)

Even such a stable phenomenon as the hyperfine transition is somewhat subject to variability (due to contaminants, undesired fields, and General Theory of Relativity corrections due to height above sea level) leading to variability on the order of a nanosecond or two per day. In order to remove even this small variability the TAI international time scale (TAI stands for “Temps Atomique International” or International Atomic Time) maintained by the International Bureau of Weights and Measures (BIPM) in Paris, is defined as the weighted average of over 300 atomic clocks located around the world (the higher weightings going to the more stable clocks).

TAI is precisely defined, but has become entirely divorced from the Earth’s rotation. Were we to adopt only TAI the time of day would slowly lose connection with the position of the sun in the sky, and after a long enough time we would be having breakfast at 12 noon. In order to resynchronize the two definitions of the second, UTC is defined. UTC stands for Coordinated Universal Time (the order of letters is a compromise between several languages), and it replaced older time standards such as “GMT”. It is defined in ITU-R Recommendation TF.460-6 to be TAI adjusted by leap seconds introduced to compensate for the changing of Earth’s rotational velocity. When to introduce leap seconds is now determined by the International Earth Rotation and Reference Systems Service (IERS). While leap seconds can be either positive or negative, and can be introduced at the end of any month, there have only been positive ones (corresponding to slowing down of Earth’s rotation) and they have only been introduced on the last day of June or December. There are presently proposals to eliminate leap seconds entirely (in which case TAI would be abolished), and perhaps introduce leap hours should the need arise.

UTC is now exactly 34 seconds behind TAI, because of a 10 second offset introduced in 1972 when the present system was adopted, and 24 positive leap seconds that have been declared since then. The next leap second will be at the end of June 2012, increasing the difference to 35 seconds.

Actually there are several versions of Universal Time. UT0 and UT1 are found by observing the motion of stars (UT0) or distant quasars (UT1), as well as from laser ranging of the Moon and artificial Earth satellites (such as GPS satellites). UT1R and UT2R are smoothed versions of UT1, filtered to remove periodic and stochastic variations in the Earth’s rotation. UT2R is smoother than UT1, and any variations left in it are because of erratic changes in the Earth’s rotation, due to plate tectonics and climate change.

So, what kind of time do we use in GPS and our time distribution protocols ?

The time of day reported by GPS, which is often called “GPS time”, is not UTC. Every GPS satellite has several on-board atomic clocks, and these clocks are set according to the master clock at the US Naval Observatory (USNO) in Washington, DC. “GPS time” does not include leap seconds, but GPS satellites periodically transmit a UTC offset message for this purpose (the GPS-UTC offset field is 8 bits and can thus accommodate 255 leap seconds, which should be sufficient for several hundred years). Once thus compensated, USNO time is within tens of nanoseconds of UTC. However, it can take over 10 minutes until you receive an offset message.

It is interesting that the on-board atomic clocks must be corrected for relativistic effects. Since the satellites are moving at high speeds with respect to an observer on the ground, the Special Theory of Relativity predicts that the on-board clocks will seem to be running about 7 microseconds per day slower than were they stationary with respect to the observer. On the other hand, the General Theory of Relativity predicts that because the satellite is high above the Earth, and thus experiences a weaker gravitational field, the on-board clocks will seem to be running faster by about 45 microseconds per day. The net relativistic correction is about 38 microseconds per day. After compensating for relativistic effects, the accuracy of time derived from a good GPS receiver is about 50 nanoseconds.
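The two corrections can be reproduced to first order with a few lines of arithmetic. The following is only a rough sketch: the orbital radius and the observer's distance from the Earth's center are round values assumed here (they do not come from the text above), the orbit is taken to be circular, and the rotation of the observer on the ground is ignored.

```python
import math

C = 299_792_458.0        # speed of light, m/s
GM = 3.986004418e14      # Earth's gravitational parameter, m^3/s^2
R_EARTH = 6.371e6        # assumed observer distance from Earth's center, m
R_ORBIT = 2.656e7        # assumed GPS orbital radius, m (about 20,200 km altitude)
SECONDS_PER_DAY = 86_400


def gps_clock_drift_us_per_day():
    """Return (special, general, net) clock drift in microseconds per day.

    Negative means the on-board clock appears slow to a ground observer.
    """
    v_squared = GM / R_ORBIT                       # circular orbit: v^2 = GM / r
    special = -v_squared / (2 * C ** 2)            # velocity time dilation
    general = (GM / C ** 2) * (1 / R_EARTH - 1 / R_ORBIT)  # weaker gravity aloft
    to_us = SECONDS_PER_DAY * 1e6
    return special * to_us, general * to_us, (special + general) * to_us


special, general, net = gps_clock_drift_us_per_day()
print(f"special: {special:.1f} us/day, general: {general:.1f} us/day, net: {net:.1f} us/day")
```

With these assumed radii the results land close to the 7, 45, and 38 microsecond figures quoted above.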

NTP (and that includes SNTP) distributes UTC (i.e., it does take leap seconds into account) and specifies UTC in seconds since Jan 1, 1900. The NTP 64-bit timestamp consists of 32 bits of whole seconds (about 136 years until roll-over) and 32 bits of fractional seconds (about 233 picoseconds of resolution). However, any specific NTP server distributes time according to the stratum of its reference clock. Of course, the time a particular NTP client obtains depends on the network between the client and the NTP server. You can expect an NTP client to be within tens of milliseconds of its server on a LAN, but only within hundreds of milliseconds over the Internet. However, NTP allows a client to track several servers, and thus improve its accuracy.
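The timestamp arithmetic is easy to check. Here is a minimal sketch of splitting a 64-bit NTP timestamp and converting it to seconds since 1970; the function names are made up for illustration, and the 2,208,988,800 second offset is simply the 70 years (including 17 leap days) between the 1900 and 1970 epochs.

```python
NTP_TO_UNIX_OFFSET = 2_208_988_800   # seconds from 1900-01-01 to 1970-01-01


def split_ntp_timestamp(ts64):
    """Split a 64-bit NTP timestamp into (whole seconds, 32-bit fraction)."""
    return ts64 >> 32, ts64 & 0xFFFFFFFF


def ntp_to_unix(ts64):
    """Convert a 64-bit NTP timestamp to floating-point seconds since 1970."""
    seconds, fraction = split_ntp_timestamp(ts64)
    return (seconds - NTP_TO_UNIX_OFFSET) + fraction / 2 ** 32


# The two figures quoted in the text follow directly from the field widths.
resolution_ps = 1e12 / 2 ** 32                  # one fraction step, in picoseconds
rollover_years = 2 ** 32 / (365.25 * 86_400)    # span of the seconds field, in years
print(f"resolution {resolution_ps:.0f} ps, roll-over after {rollover_years:.0f} years")
```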

IEEE 1588 distributes TAI, counted from the UNIX epoch of Jan 1, 1970. Since TAI contains no leap seconds, 1588 time is now ahead of UTC by 34 seconds (soon to be 35). The 1588v2 10-byte timestamp consists of 48 bits of whole seconds, and 32 bits of nanoseconds. Once again, the precise time accuracy depends on the type of grand master to which the 1588 master is synchronized. The big difference between 1588 and NTP is the possibility of on-path support in the network. If you have Boundary Clocks (BCs) or Transparent Clocks (TCs) in your network, the time error should be very small (perhaps a microsecond or less). 1588 can't simultaneously track multiple masters, but it can choose the best one from a list.
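For comparison with the NTP format, the 10-byte layout can be sketched as follows. The helper names are hypothetical, and the sketch assumes network (big-endian) byte order and a nanoseconds field that is always less than one second.

```python
def pack_ptp_timestamp(seconds, nanoseconds):
    """Pack a 1588v2-style timestamp: 48-bit seconds, then 32-bit nanoseconds."""
    if not 0 <= seconds < 2 ** 48:
        raise ValueError("seconds must fit in 48 bits")
    if not 0 <= nanoseconds < 10 ** 9:
        raise ValueError("nanoseconds must be less than one second")
    return seconds.to_bytes(6, "big") + nanoseconds.to_bytes(4, "big")


def unpack_ptp_timestamp(raw):
    """Inverse of pack_ptp_timestamp; returns (seconds, nanoseconds)."""
    if len(raw) != 10:
        raise ValueError("a timestamp is exactly 10 bytes")
    return int.from_bytes(raw[:6], "big"), int.from_bytes(raw[6:], "big")


raw = pack_ptp_timestamp(1_337_500_000, 250_000_000)
print(raw.hex(), unpack_ptp_timestamp(raw))
```

Note that, unlike the NTP binary fraction, the low field counts decimal nanoseconds, so it never uses the full 32-bit range.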

So that's what we mean by time.

Y(J)S

Tuesday, May 8, 2012

iPhone Storms

A few years ago RAD’s president Zohar Zisapel asked me to accompany him to a meeting with another Israeli company concerning possible cooperation on an important issue. On our way I asked him what this important issue was. He replied “the iPhone problem”, and I immediately understood.

He informed me that he had been in the US the previous week, and although he carried a Blackberry and not an iPhone, he had experienced inability to connect to the network even for voice calls, calls dropping in the middle, cell breathing (which he graphically described as the signal strength bars undulating up and down), and of course inability to connect to data services. Once back in Tel Aviv, he had contacted companies with whom RAD could cooperate in trying to solve the problem.

I had seen many reports on the problems AT&T was experiencing in New York and San Francisco since the introduction of Apple’s iPhone, but had not known it was really that bad. Obviously the iPhone brought significantly increased bandwidth usage due to users being “always on” and consuming more video streaming and other high-datarate services rather than just sporadically sending an email or downloading a file. However, networks in other parts of the world with many different kinds of smartphones were not experiencing such catastrophic failures; in fact, many operators with whom I had spoken were not observing any problems at all!

What could be causing these problems? There were really only three possibilities:
  1. lack of resources in the air interface (known as spectrum crunch or spectral exhaustion),
  2. under-provisioning of the backhaul network,
  3. failure of the signaling servers (due to what are known as signaling storms);
and if the second item was the problem (or at least a major chunk of it), then RAD was uniquely positioned to help.

Why did we expect the second possibility to be at the root of the problem? Well, the backhaul network is extremely cost sensitive, and increasing bandwidth there was an expensive and time-consuming task. We didn’t expect the air interface to be already congested (although we expected the spectrum to eventually become exhausted) since AT&T had already deployed HSPA+. We ruled out signaling as the major issue, since denser networks of smartphones were not experiencing similar problems.

Of course we now know that we were completely wrong, and that signaling server failure was the major problem. The explanation was intimately related to the slim design of the iPhone, and to the fact that Americans had never adopted text and multimedia messaging as avidly as Europeans did.

To understand what went wrong and how the issue was eventually solved, I need to explain 3G Radio Resource Control (RRC) states. The RRC protocol is the control plane between the 3G network and the UE (User Equipment, e.g., cellphone). It is responsible for handling many interactions such as locating the UE, waking it up, establishing/releasing connections for voice and data, and for sending SMS’es.

The UE can be in one of five possible RRC states, called Idle, URA_PCH, Cell_PCH, Cell_FACH, and Cell_DCH. In Idle mode the UE is only known to the network by its IMSI (subscriber identity), and only listens to system broadcasts and paging information. It only very rarely transmits (and even then only location updates) and barely uses its receiver (waking up periodically to check if it has been paged). Battery drain is thus extremely low. At the other extreme is the Cell Dedicated Channel state. Here the UE is using a dedicated high-speed data channel, and may be consuming 100 times more battery power. In between are the PCH states where the UE is connected but still relatively inactive, consuming only a little battery power; and the FACH state where the UE is using shared channels for exchange of small bursts of data, and consuming perhaps half of what it would consume in DCH.

Now, a UE in the Cell_PCH state that needs to send a short data packet (e.g., an application keepalive) will need to transition to Cell_FACH. It does this by sending a single signaling message and receiving a single reply. After sending its data packet, the UE will only drop back to Cell_PCH after a relatively long timeout (several seconds), and in the meantime will be wasting battery power. In order to conserve battery power many manufacturers, starting with RIM in its Blackberry, but more notably Apple in the iPhone and various manufacturers for Android devices, devised a trick. The UE sends an SCRI (Signaling Connection Release Indication) message, a message that was intended to convey that some unexpected error has occurred in the UE, and that the network should immediately release its connection. The UE drops into the Idle state, with almost no battery drain. However, the network effectively forgets it, and the next time the UE needs to transmit something, it needs to go from Idle state to FACH, which is a signaling-intensive (over 25 messages) and lengthy operation.
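To get a feel for the signaling cost of the trick, here is a toy model using only the two message counts mentioned above. It is a deliberate simplification: it treats "over 25" as exactly 25, ignores the SCRI message itself and all timers, and the once-a-minute keepalive rate is purely an assumption for illustration.

```python
PCH_TO_FACH_MESSAGES = 2     # one request plus one reply, as described above
IDLE_TO_FACH_MESSAGES = 25   # "over 25" in the text; used here as a lower bound


def signaling_messages(transmissions, uses_scri_trick):
    """Signaling messages needed for a number of short data transmissions."""
    per_transmission = (
        IDLE_TO_FACH_MESSAGES if uses_scri_trick else PCH_TO_FACH_MESSAGES
    )
    return transmissions * per_transmission


keepalives_per_day = 24 * 60   # assumed: one application keepalive per minute
well_behaved = signaling_messages(keepalives_per_day, uses_scri_trick=False)
with_trick = signaling_messages(keepalives_per_day, uses_scri_trick=True)
print(well_behaved, with_trick, with_trick / well_behaved)
```

Even in this crude model each handset generates more than ten times the signaling load, before multiplying by millions of devices.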

The consequences of this trick were not very apparent when it was only used by Blackberry handsets, which are mainly used for email and occasional short data transfers. On the other hand, iPhone users tend to continually pull and push data, watch and stream videos, and are generally “always on”. In addition, the iPhone’s iconic slimness meant that Apple couldn’t use anything larger than a 1400 mAh battery, so that Apple was particularly aggressive in sending SCRIs. Finally, in the US where SMS had never been as popular as in Europe, the signaling infrastructure was woefully undersized for millions of iPhones disconnecting and reconnecting to the network.

The initial resolution involved increasing server resources and freeing up bandwidth for signaling channels. The eventual solution was a signaling enhancement in 3GPP Release 8 called Fast Dormancy, which Apple adopted towards the end of 2010. This enhancement enables the UE to transition quickly from FACH state to PCH, rather than to Idle as in the trick. Thus the network remembers the UE, and it can rapidly transition back and forth between FACH and PCH states.

Of course, iPhones are not alone in having caused signaling storms. In mid 2011 the Android port of Angry Birds caused significant signaling traffic that stressed networks until an update solved the problem, and in January 2012 NTT Docomo suffered a 4½ hour outage in Tokyo due to an Android application that overloaded the signaling plane.

And according to many reports, spectral exhaustion is right around the corner.

Y(J)S

Monday, January 9, 2012

Jobs and Ritchie

October 2011 marked the passing away of two men well-known in the computation and communications industries. One was Steve Jobs. In his honor Apple, Microsoft, and Disneyland all flew their flags at half-staff. October 16, 2011, was declared "Steve Jobs Day" in California. President Obama gave a eulogy calling Jobs “among the greatest of American innovators … a visionary”.

The other was Dennis Ritchie.

Ritchie died alone. His passing was not mentioned on the TV news, and was not picked up as a major item by the press. The only formal recognition was the dedication to his memory of the Fedora 16 distribution. For those who don’t recognize his name, Ritchie is the R in K&R (Kernighan and Ritchie’s “The C Programming Language”), a book known by heart to everyone who has ever written in C. In addition to creating C and introducing many of the constructs of imperative programming, Ritchie, along with Ken Thompson, created the UNIX operating system. In fact, C was created as a vehicle to make UNIX more portable.

For his contributions to computer science, Ritchie was awarded the Turing Award, the Hamming Medal, and the US National Medal of Technology. Until his retirement in 2007, Ritchie was head of research at Lucent’s System Software Department.

The papers eulogized Jobs as a great inventor, but were not very specific as to what precisely he invented. Of course they extolled technologies and devices with which his name is connected - the Apple II, the Macintosh, the mouse+icon GUI, the iPod, iPhone, and iPad, but mostly admitted that his contributions were in the area of design and evangelization, rather than invention. What they omitted was his major invention – his amazingly successful method of monetizing. Bill Gates convinced people to pay for software rather than receive it free of charge when purchasing hardware, but it was Steve Jobs who convinced people to give him a 30% royalty on third-party software (and music and videos) just in order to use it on his hardware.

In contrast, Ritchie convinced his employer AT&T to distribute UNIX to universities, under license but free of charge. The sources (mostly in C) were widely circulated in book form and enabled programmers to enhance its features as well as to create their own software. After its divestiture AT&T was allowed to market software and quickly changed Unix System V into a proprietary closed system. This prompted a group at Berkeley to continue development of BSD UNIX as an Open Source alternative, Richard Stallman to launch the GNU project to create a free version of UNIX, and eventually Linus Torvalds to create Linux.

The computer industry is now segmented into Microsoft, Google/Android, and Apple. Microsoft’s most important asset is its Windows Operating System; this indeed is not based on UNIX but is programmed in C++ and promotes C#, two direct descendants of C. Google’s Android may exploit the Java language, not a direct descendant of C, but is itself based on Linux, a descendant of UNIX. And Jobs’s Apple uses the iOS operating system, a version of UNIX, and the Objective-C language – a derivative of C. So while Jobs’s influence is limited to a small minority of PCs and one sector of the smartphone market, there is no mainstream computer or smart device without Ritchie’s fingerprints all over it.

Y(J)S

Sunday, December 25, 2011

The meaning of Apple's '647 patent

On December 19th the U.S. International Trade Commission (ITC) issued its final determination on Apple's claims against HTC of Taiwan, finding that HTC violated Section 337 of the Tariff Act by selling Android phones containing a technology that infringed a patent held by Apple. Section 337 enables the ITC to block importation into the US of foreign products that unfairly compete with domestic products, and infringing a valid US patent is considered such an unfair practice. Recently Section 337 investigations are being used more and more as faster and lower cost alternatives to enforcing US patents against foreign entities through litigation in district courts.

Of the 10 patents originally claimed by Apple to be infringed, the ITC rejected all but two in an earlier ruling, and in the final determination reduced this further to a single patent. The two patents in question are 5,946,647 entitled System and Method for Performing an Action on a Structure in Computer-Generated Data (filed Feb. 1 1996 and granted Aug. 31 1999) and 6,343,263 entitled Real-time Signal Processing System for Serially Transmitted Data (filed Aug. 2 1994 and granted Jan. 29, 2002). The ITC found that HTC did not infringe the '263 patent that protects the use of a Hardware Abstraction Layer to isolate real-time code from architectural details of a specific DSP.

The '647 patent discloses a system wherein a computer detects structures in data (such as text data, but possibly digitized sounds or images), highlights these structures via a user interface, and enables the user to select a desired action to perform. Apple's complaint to the ITC gives as an example of infringement the detection and highlighting of a phone number (e.g., in a received SMS) and enabling the user to click to call that number.

I have seen in blogs and forums many completely erroneous statements about what this patent actually means. People have claimed that '647 can't be valid, as hyperlinks or regular expression matching or SQL queries clearly predate the filing. However, a careful reading of the '647 patent shows that it does not claim to cover such obviously prior art. The following analysis is based on the text of the patent and on documents openly available on the web, and should not be considered legal advice.

After eliminating text required for patent validity (an input device, an output device, memory, and a CPU) the invention of '647 has three essential elements. First, an analyzer server parses the input data looking for patterns (called "structures"). Second, via an API the user-interface receives notice of the detected structures and possible actions for each one; displays the detected structures to the user; offers the user a list of actions that can be performed for each structure; and receives the user's selection. Third, an action processor performs the user's selected action (possibly launching new applications). The text of the '647 patent gives as an example the regular expression parsing of an email to find phone numbers, postal addresses, zip-codes, email addresses, and dates, and enabling the user to call a phone number, enter addresses into a contact list, send a fax to a number, draft an email, and similar actions.
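The three elements are easier to keep apart when written down as code. The following is only an illustrative sketch of the architecture described above; it is neither the patent's embodiment nor Android's implementation, and the patterns, action names, and function names are all invented for this example.

```python
import re

# Element 1: the analyzer, which looks for "structures" in the input data.
PATTERNS = {
    "phone": re.compile(r"\+?\d[\d\-\s]{6,}\d"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}

# Element 2: what the user interface would offer for each kind of structure.
ACTIONS = {
    "phone": ["call", "add to contacts"],
    "email": ["draft email", "add to contacts"],
}


def detect_structures(text):
    """Return (start offset, kind, matched text) for every structure found."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), kind, match.group()))
    return sorted(found)


def actions_for(kind):
    """List the actions to offer the user for one detected structure."""
    return ACTIONS[kind]


# Element 3: the action processor, reduced here to describing the action.
def perform(action, value):
    return f"{action}: {value}"


message = "Call 555-123-4567 or write to bob@example.com today."
for _, kind, value in detect_structures(message):
    print(kind, value, actions_for(kind))
```

A real user interface would highlight each match and wait for the user's selection before calling `perform`.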

Of course plain hyperlinks that are manually inserted into HTML are not covered by this patent since they are not automatically detected by an analyzer. A regular expression engine can potentially be used as an analyzer (although not necessarily by all embodiments as the patent mentions neural networks matching patterns in sounds and images) but is not claimed. The automatic parsing of a document for a list of patterns without offering a list of actions to a user is also not protected; indeed the Rufus file-type classifier is cited as prior art. Even the use of a regular expression engine to parse text and insert hyperlinks into a document is considered prior art, as the application references the Myrmidon text-to-html converter. It is possible that an editor or IDE that offers possible completions of text being typed would be considered infringing, depending on how broadly the patent's concept of input device is interpreted.

The three elements of the '647 patent are all present in many applications and devices used today. Users of Microsoft's Outlook are familiar with its automatic hyperlinking of email addresses and URLs in received messages. My old 2004 Sony-Ericsson K700 2G phone automatically highlights phone numbers in SMSes enabling single-click calling. However, Apple has targeted a very specific infringement - Android's Linkify class. Linkify enables the definition of a list of regular expression patterns to be detected, and a corresponding list of schemes, i.e., actions the user can select to be executed. It even comes with a few pre-defined patterns - email addresses, postal addresses, phone numbers, and URLs - which are almost precisely the examples given in the '647 patent.

While Apple's claims of infringement of '647 may be selective, they are not frivolous. In order to invalidate '647 the Android community would need to find publication of all three essential elements before 1996. I am sure that they have tried.

Removal of the Linkify feature from Android phones will put them at a definite ease-of-use disadvantage in comparison with the iPhone. And HTC has been given until April 19th 2012 to do just that.


Y(J)S

Wednesday, November 30, 2011

On exa, zetta, and beyond

Anyone who lives in metric system countries knows what "kilo" means. A kilogram is 1000 grams, a kilometer is 1000 meters. Of course frequencies are measured in kilohertz and in the computer world we have kilobits and kilobytes (although we are never quite sure if that is 1000 or 1024!).

Most people even know that "mega" means a million. Power stations output megawatts of electricity, FM radios receive at megahertz frequencies, and atomic bombs deliver megatons. For years our disks were measured in megabytes, and for most of us our Internet connections are in megabits (although we are not quite sure whether that is 1,000,000 or 1024*1024!).

People with state-of-the-art computers are aware that giga means a (US) billion (a thousand million), and that tera means a thousand of those, but only because disk capacities have increased so rapidly. When you ask people what comes next, you tend to get puzzled looks. Most people aren't even sure whether when they say a billion they mean a thousand million or a million million, so don't expect them to be expert in anything bigger than that!

Up to now only astrophysicists were interested in such large numbers, but with global data traffic increasing at over 30% per year, networking people are getting accustomed to them as well.

For those who are interested, the following numbers are peta (10^15), exa (10^18), zetta (10^21), and finally yotta (10^24). The last two names were only formally accepted in 1991. For those who prefer powers of two, the IEC has standardized kibi (Ki) for 2^10, mebi (Mi) for 2^20, gibi (Gi) for 2^30, tebi (Ti) for 2^40, etc., although these terms don't seem to have caught on.
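The gap between the decimal and binary prefixes grows with each step, which is part of why the 1000-versus-1024 confusion matters more for large quantities. A small sketch that tabulates both series (the list and function names are just for illustration):

```python
SI_PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]
IEC_PREFIXES = ["kibi", "mebi", "gibi", "tebi", "pebi", "exbi", "zebi", "yobi"]


def prefix_table():
    """Return rows of (SI name, 10^(3k), IEC name, 2^(10k), binary/decimal ratio)."""
    rows = []
    for k, (si, iec) in enumerate(zip(SI_PREFIXES, IEC_PREFIXES), start=1):
        decimal = 10 ** (3 * k)
        binary = 2 ** (10 * k)
        rows.append((si, decimal, iec, binary, binary / decimal))
    return rows


for si, decimal, iec, binary, ratio in prefix_table():
    print(f"{si:>5} = 10^{len(str(decimal)) - 1:<2}  {iec:>4} is {100 * (ratio - 1):.1f}% larger")
```

The difference is 2.4% at kilo but roughly 10% at tera and over 20% at yotta.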

Several years ago I heard that the total amount of printed information in the world's libraries does not exceed a few hundred petabytes. On the other hand, present estimates are that global IP traffic now amounts to about 30 exabytes per month, or about ten times the world's accumulated printed knowledge every day. By the middle of this decade it should surpass 100 exabytes per month, i.e., about the entire world's printed knowledge per hour.

These datarates, and particularly their time derivatives, present the telecommunications community with major challenges. We have grown accustomed to sophisticated applications that transfer massive amounts of data. A prime example is the new breed of cellphone voice/meaning recognition that sends copious amounts of raw data back to huge servers for processing. Such applications can only continue to be efficiently and inexpensively provided if the transport infrastructure can keep up with the demand for datarates.

And that infrastructure is simply not scaling up quickly enough. We haven't found ways to continually increase the number of bits/sec we can put into long-distance fiber to compensate for >30% annual increase in demand (although new research into mode division multiplexing may help). Moore's law is only marginally sufficient to cover increases in raw computation power needed for routers, but we will need Koomey's law for power consumption (MIPS / unit of energy doubles every year and a half) to continue unabated as well. And we haven't even been able to transition away from IPv4 after all of its addresses were exhausted!

If we don't find ways to make the infrastructure scale, then keeping up with exponential increases in demand will require exponential increase in cost.
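To compare these growth rates on a common footing, it helps to convert between annual growth and doubling time. The sketch below does just that; the two-year doubling period used for Moore's law is an assumption on my part (it is not stated above), while the 30% and year-and-a-half figures come from the text.

```python
import math


def doubling_time_years(annual_growth):
    """Years needed to double at a given annual growth rate (0.30 means 30%)."""
    return math.log(2) / math.log(1 + annual_growth)


def annual_growth_from_doubling(years):
    """Annual growth rate implied by doubling every `years` years."""
    return 2 ** (1 / years) - 1


traffic = doubling_time_years(0.30)            # traffic growing 30% per year
koomey = annual_growth_from_doubling(1.5)      # doubling every year and a half
moore = annual_growth_from_doubling(2.0)       # assumed two-year doubling
print(f"traffic doubles every {traffic:.2f} years")
print(f"Koomey: {100 * koomey:.0f}% per year, Moore (assumed): {100 * moore:.0f}% per year")
```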


Y(J)S

Wednesday, November 23, 2011

My new CTO job

As you all probably know, I have changed job titles.
I am now RAD's Chief Technology Officer instead of (or perhaps in addition to?) Chief Scientist.

Our previous CTO, Prof. Daniel Kofman, is still in touch with the company. However, he is a bit busy since in addition to his position as Professor at Telecom ParisTech (formerly ENST), he has been appointed by France's Minister of Research and Innovation as director of LINCS (Laboratory of Information, Networking, and Communication Sciences), a new research center in Paris.

So, what will I be doing? Well, I will no longer be managing any R&D teams. The physical layer DSL chip development department I used to run closed many years ago, and last year my DSP software development department was dissolved as well. With my new appointment my HW/FPGA/Innovations department has transitioned to the newly formed Hardware and Innovations department, and my software team is moving to the new Advanced Technologies department. The Algorithmic Research department will still report to me.

I will continue to be responsible for tracking fundamental technology trends, and to steer RAD's participation in standardization forums (IETF, ITU, MEF, BBF, etc.). I will be working with academic research groups here in Israel, and perhaps abroad as well. I will be spending more time on IPR work - over the last few years this work has tended to be more defensive than creative. I will be doing more lecturing and more writing, and will function as editor in chief of the RAD Series on Essentials of Telecommunications (more on that some other time).

And I hope to have more time to blog.

Y(J)S

Thursday, November 17, 2011

MPLS-TP update

At the MPLS Working Group meeting this week it was announced that the core set of MPLS-TP RFCs has been finished.

Indeed, we now have (I hope that I haven't missed too many):
• RFC 5586 MPLS Generic Associated Channel (G-ACh and GAL)
• RFC 5654 Requirements of an MPLS Transport Profile
• RFC 5718 An In-Band Data Communication Network for MPLS-TP
• RFC 5860 Requirements for OAM in MPLS Transport Networks
• RFC 5921 A Framework for MPLS in Transport Networks
• RFC 5950 Network Management Framework for MPLS-TP
• RFC 5951 Network Management Requirements for MPLS-TP
• RFC 5960 MPLS-TP Data Plane Architecture
• RFC 5994 Application of Ethernet Pseudowires to MPLS Transport Networks
• RFC 6215 MPLS-TP UNI and NNI
• RFC 6370 MPLS-TP Identifiers
• RFC 6371 OAM Framework for MPLS-TP
• RFC 6372 MPLS-TP Survivability Framework
• RFC 6373 MPLS-TP Control Plane Framework
• RFC 6374 Packet Loss and Delay Measurement for MPLS Networks
• RFC 6375 Packet Loss and Delay Measurement Profile for MPLS-TP
• RFC 6378 Linear Protection MPLS-TP
• RFC 6425 Detecting Data-Plane Failures in Point-to-Multipoint MPLS - Extensions to LSP Ping
• RFC 6426 MPLS On-Demand Connectivity Verification and Route Tracing
• RFC 6427 MPLS Fault Management Operations, Administration, and Maintenance (OAM)
• RFC 6428 Proactive Connectivity Verification, Continuity Check, and Remote Defect Indication for the MPLS-TP
• RFC 6435 MPLS Transport Profile Lock Instruct and Loopback Functions

In addition, before the IETF meeting the ITU issued a statement reasserting that the IETF holds the pen on MPLS-TP.

It seems that the game is over.

Y(J)S

Wednesday, November 16, 2011

The notorious IP checksum algorithm

I have been asked several times to explain the checksum calculation used in the IP suite (IPv4, TCP and UDP all utilize the same checksum algorithm).

RFC 791, which defines IPv4, gives the checksum algorithm as follows:
The checksum field is the 16 bit one's complement of the one's complement sum of all 16 bit words in the header. For purposes of computing the checksum, the value of the checksum field is zero.
and the algorithm description was further updated in RFCs 1071, 1141, and 1624.


RFC 791 further states
This is a simple to compute checksum and experimental evidence indicates it is adequate, but it is provisional and may be replaced by a CRC procedure, depending on further experience.

Back in 1981 when the RFC was written, Jon Postel already realized that this algorithm is very limited in its error detection capabilities (see below), but at the time CRC computation was too expensive computationally.

RFC 793, which defines TCP, says
The checksum field is the 16 bit one's complement of the one's complement sum of all 16 bit words in the header and text.

while RFC 768 for UDP says the same thing, but leaves a loophole
Checksum is the 16-bit one's complement of the one's complement sum of a pseudo header of information from the IP header, the UDP header, and the data, padded with zero octets at the end (if necessary) to make a multiple of two octets

If the computed checksum is zero, it is transmitted as all ones (the equivalent in one's complement arithmetic). An all zero transmitted checksum value means that the transmitter generated no checksum (for debugging or for higher level protocols that don't care).

On this latter issue, RFC 1180 “A TCP/IP Tutorial” adds
An incoming IP packet with an IP header type field indicating "UDP" is passed up to the UDP module by IP. When the UDP module receives the UDP datagram from IP it examines the UDP checksum. If the checksum is zero, it means that checksum was not calculated by the sender and can be ignored. Thus the sending computer's UDP module may or may not generate checksums. If Ethernet is the only network between the 2 UDP modules communicating, then you may not need checksumming. However, it is recommended that checksum generation always be enabled because at some point in the future a route table change may send the data across less reliable media.

and RFC 1122 “Requirements for Internet Hosts” adds
Some applications that normally run only across local area networks have chosen to turn off UDP checksums for efficiency. As a result, numerous cases of undetected errors have been reported. The advisability of ever turning off UDP checksumming is very controversial.

IPv6, as defined in RFC 2460, doesn’t bother with a header checksum, but closes the UDP loophole
Unlike IPv4, when UDP packets are originated by an IPv6 node, the UDP checksum is not optional. That is, whenever originating a UDP packet, an IPv6 node must compute a UDP checksum over the packet and the pseudo-header, and, if that computation yields a result of zero, it must be changed to hex FFFF for placement in the UDP header. IPv6 receivers must discard UDP packets containing a zero checksum, and should log the error.
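The zero-checksum convention quoted above can be summarized in a few lines. This is only a sketch of the rules as the RFCs state them; the function names and the returned strings are mine, not anything taken from a real UDP stack.

```python
def udp_checksum_field(computed: int) -> int:
    """Value the sender places in the 16-bit UDP checksum field.

    A computed checksum of zero is sent as all ones, because an
    all-zero field is reserved to mean "no checksum was generated".
    """
    return 0xFFFF if computed == 0 else computed


def receiver_action(field: int, ipv6: bool) -> str:
    """What a receiver does with the checksum field (hypothetical labels)."""
    if field == 0:
        # IPv4 treats zero as "sender did not checksum"; IPv6 forbids it.
        return "discard" if ipv6 else "skip verification"
    return "verify"
```

So the loophole is exactly the one case `receiver_action(0, ipv6=False)`, and IPv6 closes it by turning that case into a discard.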

So, how precisely does the IP checksum algorithm work, and why is it designed this way?

The simplest method to protect against bit errors would be to xor bytes (or 16-bit words) together. This method suffers from the disadvantage that two bit errors in the same column cancel out, leaving no trace. Checksums are slightly stronger since they add words together instead of xoring them. Thus, two bit errors in the same column indeed leave that column correct in the sum, but the carry to the next column will be different.
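To make the difference concrete, here is a small sketch comparing the two approaches on a made-up three-word message in which bit 0 of two different words is flipped. The word values are arbitrary, and the sum here is a plain 16-bit sum rather than the full IP checksum.

```python
from functools import reduce


def xor_check(words):
    """XOR all 16-bit words together."""
    return reduce(lambda a, b: a ^ b, words, 0)


def sum_check(words):
    """Add all 16-bit words, keeping only the low 16 bits."""
    return sum(words) & 0xFFFF


original = [0x1234, 0x0F0E, 0x00FF]
# Flip bit 0 (the same column) in the first two words; both flips are 0 -> 1.
corrupted = [original[0] ^ 0x0001, original[1] ^ 0x0001, original[2]]

xor_missed = xor_check(original) == xor_check(corrupted)  # True: errors cancel
sum_missed = sum_check(original) == sum_check(corrupted)  # False: carry differs
```

Note that the sum only catches the pair because both bits flipped in the same direction. Had one flipped from 0 to 1 and the other from 1 to 0, the sum would have been unchanged as well, which is why checksums are only slightly stronger.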

Why does the IP checksum algorithm take the ones complement after adding together all of the words? Since this is a one-to-one transformation, it obviously doesn’t reduce the number of undetected errors. It does, however, protect against one special case – that of all zeros. If somehow the entire packet were wiped out and replaced by all zeros, the sum would still be OK (sum of zeros is zero). By flipping the result we catch this kind of bug.

To compute the IP checksum of some sequence of an even number of bytes (if the length is odd one pads with a zero byte), one groups the bytes in pairs which are considered as 16-bit words. Were one to have a computer that employs ones complement arithmetic, the algorithm would be simple to describe. One adds all of these words together, and returns the negative of this sum.

Unfortunately, ones complement machines are no longer in vogue, and essentially all computers now use twos complement representation. Ones complement and twos complement agree on how to represent positive numbers - they have a zero in the MSB. They also agree that negative numbers have a one in the MSB, but disagree about all the rest. Neither simply sets the MSB (that’s called “signed magnitude” representation). In ones complement machines the negative of a positive number (its ones complement) is made by flipping all its bits. In twos complement machines one flips all the bits and then adds one. Note that in ones complement representation there are two zeros – positive zero is all zeros and negative zero is all ones. Twos complement has only one zero - all zeros (all ones means -1).

Because of the difference in representation, the addition algorithms are also somewhat different for the two machine types. Twos complement machines add bits from LSB to MSB, and discard any carry from the MSB. Ones complement addition similarly adds the bits, but if a carry remains from the MSB it is added back to the LSB.
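Putting the pieces together, the checksum can be emulated on a twos complement machine by adding 16-bit words in a wider register and folding any carry out of the MSB back into the LSB. The following is a minimal sketch of that idea, not production code; the function names are mine, and the sample header bytes in the usage note are just an illustrative IPv4 header.

```python
def ones_complement_sum(data: bytes) -> int:
    """Ones complement sum of the 16-bit big-endian words in data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        # End-around carry: add any carry out of bit 15 back into bit 0.
        total = (total & 0xFFFF) + (total >> 16)
    return total


def internet_checksum(data: bytes) -> int:
    """Ones complement (bit flip) of the ones complement sum."""
    return ~ones_complement_sum(data) & 0xFFFF
```

For example, for the 20-byte header `4500 0073 0000 4000 4011 0000 c0a8 0001 c0a8 00c7` (checksum field zeroed) this returns 0xB861. A receiver can run `ones_complement_sum` over the header with the checksum filled in and expect 0xFFFF, i.e., negative zero.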

So if everyone uses twos complement arithmetic today, why does the IP checksum algorithm use ones complement addition and ones complement negation? Well, perhaps when the checksum algorithm was chosen ones complement machines were more common (sigh).

More importantly, ones complement arithmetic has two (minor) advantages.

The first has to do with big- and little-endian conventions. Saying that a machine uses twos complement arithmetic still doesn’t completely pin things down. When building larger integers from bytes, big-endian machines place the higher order bytes to the left of the lower ones, thus if A and B are bytes, AB means A*256+B. Little-endian machines do the opposite – AB means B*256+A. Ones complement arithmetic has an interesting characteristic – addition is the same for big-endian and little-endian machines. This is not the case for twos complement arithmetic due to the discarding of the MSB carries.

For example,
in ones complement FF.FF+02.00=02.00 while FF.FF+00.02=00.02
in twos complement FF.FF+02.00=01.FF while FF.FF+00.02=00.01
thus one can write generic IP checksum code that directly uses 16-bit words that runs correctly on little-endian or big-endian machines, without knowing which kind of machine you have and without putting in compilation conditionals (#ifdef).
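The example can be checked mechanically. In the sketch below (helper names are mine) the byte-swapped operands stand for what the other kind of machine would see when it loads the same two bytes as a 16-bit word.

```python
def add_ones_complement(a: int, b: int) -> int:
    """16-bit addition with the MSB carry added back into the LSB."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)


def add_twos_complement(a: int, b: int) -> int:
    """16-bit addition with the MSB carry discarded."""
    return (a + b) & 0xFFFF


def byte_swap(w: int) -> int:
    """Exchange the two bytes of a 16-bit word."""
    return ((w & 0xFF) << 8) | (w >> 8)


a, b = 0xFFFF, 0x0200
# Ones complement: swapping the inputs just swaps the result.
same_1c = add_ones_complement(byte_swap(a), byte_swap(b)) == byte_swap(add_ones_complement(a, b))
# Twos complement: the discarded carry breaks this symmetry.
same_2c = add_twos_complement(byte_swap(a), byte_swap(b)) == byte_swap(add_twos_complement(a, b))
```

Here `same_1c` is true and `same_2c` is false, matching the four sums listed above.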

Another reason for ones complement is that it is slightly better at catching errors. Remember that twos complement addition discards MSB carries, so two bit errors in MSB positions are not caught, while ones complement propagates these carries back to the LSBs, thus catching this type of error. The difference is minor (for large TCP or UDP payloads the percentage of two bit errors missed by XORing is 6.25%, twos complement summing misses 3.32%, while ones complement summing only misses 3.125%).

Do these two small advantages justify the added complexity of using ones complement arithmetic? Probably not, but it is too late to change. With the greater computational power now available, stronger error detection algorithms should now be implemented. However, when IP is sent via Ethernet, it enjoys Ethernet's Frame Check Sequence, which is not only a CRC rather than a checksum, but is 32 bits in length! This makes the IP checksums superfluous.

Y(J)S

Wednesday, October 5, 2011

Network Coding

In conventional communications networks the active network elements (e.g., Ethernet switches or IP routers) are store-and-forward devices. They perform no nontrivial computation. It turns out that in certain cases it is possible to optimize network operation (to conserve some network resource or to improve some network performance measure) by embedding more intelligence in the network elements.

In order to understand how this is done, it is useful to start with two special cases.

First case : Two individuals communicating via a satellite having a single downlink/uplink coverage beam.

A transmits to B via satellite S and B transmits back to A via the same satellite. Since A and B must share satellite resources (namely time and frequency), the uplink transmissions must be separated in either time or frequency. In the conventional case the downlink transmissions are separated as well. Thus, if it costs one cent to transmit an uplink message from A or B to the satellite, and similarly one cent for the downlink message from the satellite to A or B, then the exchange of two messages, one from A to B and one from B to A costs 4 cents.

But, this does not need to be the case! Rather than S transmitting A’s message to B and afterwards (or on another frequency) B’s message to A, it can transmit only once the message A xor B on a frequency and at a time when both A and B are listening. A retrieves B’s message by xoring the received message with his own message (since B = (A xor B) xor A), and B performs the same operation to retrieve the message from A (since A = (A xor B) xor B).

This reduces the price from 4 cents to 3 cents (since there are only three transmissions: A-S, B-S, S-A+B) at the cost of the satellite having to perform the simple operation of xoring two messages. The xor operation performed by the satellite is a kind of “coding” operation that leads to reduction of required network resources.
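The satellite exchange is easy to simulate. In this sketch the two messages are made-up byte strings, assumed to be of equal length so that they can be xored byte by byte.

```python
def xor_bytes(x: bytes, y: bytes) -> bytes:
    """Byte-wise xor of two equal-length messages."""
    return bytes(p ^ q for p, q in zip(x, y))


msg_a = b"hello B!"   # uplink from A (first transmission)
msg_b = b"hi to A!"   # uplink from B (second transmission)

downlink = xor_bytes(msg_a, msg_b)         # single broadcast from S (third)

received_at_a = xor_bytes(downlink, msg_a)  # A removes its own message
received_at_b = xor_bytes(downlink, msg_b)  # B removes its own message
```

After the three transmissions `received_at_a` equals B's message and `received_at_b` equals A's message.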

Second case : Using coding to protect real-time or broadcast transmissions against packet loss.

In real-time and broadcast transmissions it is not possible for a receiver to request retransmission of a lost packet, as TCP and ARQ systems do. Some critical control protocols send each packet multiple times (three times is common), but this is extremely wasteful in network resources. RFC 2198 proposes repeating the audio data from the previous RTP packet in the present one, thus maintaining the number of packets per second, but still doubling bandwidth requirements. The FECFRAME working group in the IETF standardized more efficient mechanisms in RFCs 5053 and 6015. I will explain only the simplest possible coding.

Assume that we know that there will never be more than 1 packet loss in 4 consecutive packets. Then for every four packets transmitted, a fifth “protection” packet consisting of the xor of these four packets is sent. If all four packets are received then this fifth packet is discarded. If any single packet is lost then it can be recovered by xoring the received three packets with the fifth “protection” packet. Thus, packet loss can be mitigated with only an increase of 25% in bandwidth, and an increase in delay.
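Here is a sketch of this simplest scheme. The four packets are made-up, equal-length byte strings, and the index of the lost packet is chosen arbitrarily; a real implementation would also have to deal with sequence numbers and unequal packet lengths.

```python
from functools import reduce


def xor_packets(packets):
    """Xor a list of equal-length packets together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*packets))


group = [b"pkt0", b"pkt1", b"pkt2", b"pkt3"]
protection = xor_packets(group)   # the fifth packet that is sent

lost_index = 2                    # pretend this packet never arrived
survivors = [p for i, p in enumerate(group) if i != lost_index]

# Xoring the three survivors with the protection packet yields the lost one.
recovered = xor_packets(survivors + [protection])
```

The receiver has to wait for the protection packet before it can rebuild the missing one, which is where the increase in delay comes from.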

But what does this have to do with network coding? In both of the above cases an information source performed some nontrivial operation in order to conserve some resource or to protect against some network defect. The extension to full network coding requires simply that the computation be performed by some network element along the information path that is able to perform the network coding. Unfortunately, examples of network coding can be quite complex. The simplest one is the “butterfly network” (see Figure 1) presented in a paper by Ahlswede, Cai, Li, and Yeung entitled “Network Information Flow”.





In this example a source S needs to multicast two packets of information P1 and P2 to two destinations A and B over the particular network of network elements U, V, W and X shown in the figure. All of the links have the same bandwidth, which is precisely the bandwidth needed to transmit the packets in the desired time.





It turns out that S can send P1 and P2 to both A and B at once, as shown in Figure 2. Network elements U, V, and X are multicast devices that are able to replicate a packet received on their input port to both of their output ports. Network element W performs network coding by calculating the xor of two packets received on its two input ports and sending this to its output port.



Were W not able to perform this operation, it would need first to send P1 and then P2, thus taking twice the time, or alternatively would require twice the bandwidth on the link to X (contrary to our assumption on link bandwidths). It is not hard to convince yourself that without the network coding it is not possible to perform the desired task.
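The flow of packets can be traced in a few lines. Since the figures are not reproduced here, the sketch assumes the usual butterfly wiring: S feeds U and V, U feeds A and W, V feeds B and W, W feeds X, and X feeds A and B. The packet contents are made up.

```python
def xor_bytes(x: bytes, y: bytes) -> bytes:
    """Byte-wise xor of two equal-length packets."""
    return bytes(p ^ q for p, q in zip(x, y))


p1, p2 = b"PACKET-1", b"PACKET-2"

# S sends P1 towards U and P2 towards V; U and V replicate to both outputs.
u_to_a = u_to_w = p1
v_to_b = v_to_w = p2

# W is the only element that computes: one coded packet on the single W-X link.
w_to_x = xor_bytes(u_to_w, v_to_w)

# X replicates the coded packet to both destinations.
x_to_a = x_to_b = w_to_x

# Each destination xors the coded packet with the packet it got directly.
a_has = (u_to_a, xor_bytes(x_to_a, u_to_a))
b_has = (xor_bytes(x_to_b, v_to_b), v_to_b)
```

Every link carries exactly one packet, and both `a_has` and `b_has` end up as the pair (P1, P2).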

Network coding can be used for purposes other than bandwidth or delay minimization and packet loss protection. Recent research has explored applications to energy reduction, information security, file sharing, congestion control, and fairness.

Y(J)S

Sunday, August 21, 2011

The PW Associated Channel

In the beginning of the development of pseudowire technology, it was obvious to many of us that PWs would require some sort of OAM support. As always with OAM the question was how to make OAM packets fate-share with user data packets. The original RAD proposition was to define a special "OAM PW" that would be placed alongside the monitored PWs. In the MPLS case this meant a special PW label for OAM (RAD's proposal was to use the "all-ones" label), but to ensure that this OAM PW was placed in the same MPLS tunnel. This proposal still exists in Appendix D of RFC 5087.

The alternative proposal ("VCCV") placed special OAM packets in every PW. This meant much more OAM traffic for the prevalent case of many PWs in a single PSN tunnel, but simplified the assurance of fate-sharing. In order to enable interworking with other vendors, RAD abandoned its own proposal and adopted the VCCV approach, including advocating conformance with the newly standardized PWE3 control word and upgrading its equipment base accordingly.

Digression: VCCV stands for Virtual Channel Connectivity Verification, and is a complete misnomer. VC was an old (ATM-style) name for what is now called a PW. It was used in the early days of the PWE3 WG before the introduction of the term pseudowire, and should have been completely replaced. CV is a well-defined OAM term for detection of misconnections, that is, detecting that a packet arrives at the wrong destination. It should never be confused with Continuity Check (CC), which means checking that packets sent are actually received. Of course that is precisely the meaning in the term VCCV. Unfortunately by the time of the RFC it was too late to rename this function PWCC.

Three mechanisms were proposed for distinguishing between VCCV packets and user data packets, and all three became part of the standard. In the language of RFC 5085, there are three Control Channel types.



  • CC TYPE 1: When the PWE3 control word (CW) is used, the first nibble is set to 0001, instead of 0000.
  • CC TYPE 2: Router Alert Label (AKA out-of-band VCCV) - placing the reserved MPLS RA label above the PW label.
  • CC TYPE 3: TTL expiry - i.e., ensuring that the TTL in the PW label equals 1 at the PW endpoint.

Having three options sounds a bit confusing, but there were good reasons for all three. First, not all PWs use the CW; in fact, in some cases it would be wasteful to add 4 bytes to a small payload. Second, it has been argued that types 2 and 3 must be supported, as they are integral parts of the MPLS architecture. If a PW gateway receives a packet with the RA label, or with an expired TTL, it can not be expected to process it as a regular user packet!
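As a rough illustration of how a receiver might tell the three types apart, here is a sketch of a classifier operating on already-parsed fields. The function, its arguments, and the order in which the checks are made are my own assumptions for illustration; RFC 5085 should be consulted for the actual processing rules.

```python
ROUTER_ALERT_LABEL = 1  # reserved MPLS label value for Router Alert


def vccv_cc_type(label_above_pw, pw_label_ttl, has_control_word, first_payload_byte):
    """Return 1, 2 or 3 for a VCCV packet, or None for ordinary user data.

    label_above_pw     -- label just above the PW label (None if not inspected)
    pw_label_ttl       -- TTL field of the PW label as seen at the PW endpoint
    has_control_word   -- whether this PW was set up to use the control word
    first_payload_byte -- first byte following the PW label
    """
    if label_above_pw == ROUTER_ALERT_LABEL:
        return 2   # out-of-band: RA label sits above the PW label
    if pw_label_ttl == 1:
        return 3   # TTL expiry at the PW endpoint
    if has_control_word and (first_payload_byte >> 4) == 0b0001:
        return 1   # in-band: first nibble 0001 instead of 0000
    return None
```

Note that only the third check looks past the label stack, which is why types 2 and 3 remain usable on PWs that carry no control word.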

It was realized early on that the CC types defined a PW associated channel that could be used for functions other than VCCV, and that realization is captured in RFC 4385. However, this channel is limited to PWs, and could not be used for adding OAM functionality to non-PW MPLS traffic. So when the MPLS-TP effort required such functionality, the idea of an associated channel was generalized to a Generic Associated Channel (GACh) in RFC 5586. The generalization is obtained by defining what is essentially a fourth CC type - the GACh Label (GAL). This reserved MPLS label, unlike CC TYPE 2, sits at the bottom of the stack (there being no PW label), and is followed by what is essentially the PWE3 control word.

Those involved in the MPLS-TP effort want MPLS-TP mechanisms to work for PWs as well. This has led to a proposal to enable the use of the GAL for PW packets as well as for MPLS packets. For the PW case the idea is to put the GAL under the PW label. This proposal breaks an underlying characteristic of all PWs (explicitly stated in numerous RFCs), namely that the PW label sits at the bottom of the stack.

In my opinion three methods of indicating an associated channel packet are quite enough, and we don't need a fourth method. Yet another proposal goes even further. This proposal suggests eliminating CC types 2 and 3, and leaving only type 1 (using the CW) and the new GAL approach. Were this proposed ten years ago I would probably have been in favor, although it is still not clear to me what a receiving PW gateway does when it receives a type 2 or type 3 packet. (Losing type 3 also excludes traceroute mechanisms.) However, at the present time this proposal would require upgrading ten years of live PW deployments, and I can not see how it can be implemented.

Y(J)S

Monday, May 30, 2011

"Seamless MPLS" and Denial of Service

A Denial of Service (DoS) attack is an attack that attempts to render a service temporarily unavailable to legitimate users of the service. DoS attacks are carried out by attackers disrupting the function of any link in the service supply chain. In the context of services provided over telecommunications networks, DoS attacks can be directed at a web or mail server, routers, or at any necessary utility functions such as the DNS system.

There are two main DoS attack strategies :
1. The attacker can send malware to the attacked device, causing its malfunction. In extreme cases (called phlashing) the attacked device may need to be completely replaced.
2. The attacker can flood the attacked device with a large number of seemingly legitimate service requests, thereby consuming its resources and degrading its ability to service other users. In order to more completely overwhelm a device (and to camouflage the source of the attack), Distributed Denial of Service (DDoS) attacks simultaneously send service requests from multiple sources.

Rate limiting and traffic shaping are not true DoS prevention methods. First, they are ineffectual against the first type of attack. Second, although they may prevent overload of devices under attack, since they do not distinguish between attackers and legitimate users, they themselves reduce service quality. In addition, they become Achilles’ heels providing attackers with new devices to attack.

There are only two true defenses against DoS attacks :
1. discarding illegitimate service requests,
2. allowing only legitimate service requesters.

The first method is typically used against attacks that exploit packets carefully designed to confuse network devices or require greater than average processing resources. It is ineffectual against brute-force attacks by properly formed service requests, such as DDoS attacks. It also usually requires costly Deep Packet Inspection (DPI). The second method is universally effective, but can only be used when there is a way to accurately identify legitimate users of the service.

That way is called source authentication, and it works by verifying that each received packet was authentically sent by the source claiming to have sent it. Thus source authentication is limited to packet formats that include a source address, such as Ethernet, IPv4, and IPv6. IPsec uses a Hash-based Message Authentication Code (HMAC) to verify both the integrity and authenticity of an IP packet. MACsec uses a combined algorithm to verify integrity and authenticity, and optionally to encrypt the packet.
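The principle behind HMAC-based authentication can be shown with the Python standard library. This is only a generic sketch of computing and checking a tag with a shared key; it is not the IPsec packet format, and the key, the packet bytes, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import hmac


def make_tag(key: bytes, packet: bytes) -> bytes:
    """Sender side: compute an HMAC tag over the packet with the shared key."""
    return hmac.new(key, packet, hashlib.sha256).digest()


def is_authentic(key: bytes, packet: bytes, received_tag: bytes) -> bool:
    """Receiver side: recompute the tag and compare in constant time."""
    return hmac.compare_digest(make_tag(key, packet), received_tag)


shared_key = b"example shared secret"          # hypothetical key
packet = b"src=10.0.0.1 dst=10.0.0.2 payload"  # hypothetical packet bytes
tag = make_tag(shared_key, packet)

accepted = is_authentic(shared_key, packet, tag)
forged = is_authentic(shared_key, packet.replace(b"10.0.0.1", b"10.0.0.9"), tag)
```

A packet whose claimed source has been altered, or which was produced by someone without the key, fails the check and can be discarded before it consumes further resources.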

As is certainly well-known to readers of this blog, MPLS packets contain labels that proxy for destination addresses, but no explicit addresses, and certainly no source address. As stated in RFC 5920 - Security Framework for MPLS and GMPLS Networks :

The MPLS data plane, as presently defined, is not amenable to source authentication, as there are no source identifiers in the MPLS packet to authenticate. The MPLS label is only locally meaningful. It may be assigned by a downstream node or upstream node for multicast support.

When the MPLS payload carries identifiers that may be authenticated (e.g., IP packets), authentication may be carried out at the client level, but this does not help the MPLS SP, as these client identifiers belong to an external, untrusted network.

An attacker with physical access to an MPLS network can readily cause mayhem. There are only a million possible MPLS labels, and thus it will not take an attacker long to come across a valid one. Once that is accomplished, nothing can stop packets he injects from traversing the network and appearing at supposedly isolated egress points. The attack is made even simpler because many LSRs are configured to employ platform-wide label spaces, and many LSR label generators produce labels in order from low to high.
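To put a number on "only a million": the label field of an MPLS label stack entry is 20 bits wide. The sketch below builds a stack entry and estimates how long trying every label would take. The probe rate is purely an assumption chosen for illustration, not a measurement.

```python
import struct

LABEL_BITS = 20
LABEL_SPACE = 1 << LABEL_BITS  # 1,048,576 possible label values


def label_stack_entry(label: int, tc: int = 0, bottom: bool = True, ttl: int = 64) -> bytes:
    """Pack one 32-bit label stack entry: label(20) | TC(3) | S(1) | TTL(8)."""
    word = (label << 12) | (tc << 9) | (int(bottom) << 8) | ttl
    return struct.pack("!I", word)


ASSUMED_PROBES_PER_SECOND = 10_000  # hypothetical injection rate
seconds_to_try_all = LABEL_SPACE / ASSUMED_PROBES_PER_SECOND
```

At that assumed rate the whole space is covered in under two minutes, and sooner still if labels are handed out sequentially from the low end.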

Of course if the MPLS is carrying only IP traffic, then that network layer can be protected using well-known IPsec methods. But MPLS can also carry non-IP traffic, e.g. pseudowires. Imagine what would happen if extra TDM-PW traffic were successfully injected - buffer overflows, loss of timing, and complete service shutdown. Imagine what would happen if an attacker injected multicast PAUSE frames into an Ethernet PW – delayed frames, buffer overflow, and complete service denial.

So why haven’t there been widespread devastating attacks on the critical MPLS infrastructure ? Mainly because MPLS networks have, until now, been walled gardens, that is, closed tightly controlled networks, with no access to outside attackers. RFC 5920 calls them trusted zones, which it describes in the following manner :

A trusted zone contains elements and users with similar security properties, such as exposure and risk level. In the MPLS context, an organization is typically considered as one trusted zone.

The boundaries of a trust domain should be carefully defined when analyzing the security properties of each individual network … In principle, the trusted zones should be separate …
A key requirement of MPLS and GMPLS networks is that the security of the trusted zone not be compromised by interconnecting the MPLS/GMPLS core infrastructure with another provider's core (MPLS/GMPLS or non-MPLS/GMPLS), the Internet, or end users.


So, MPLS has been safe since it has been hidden away in the core, with no access to outsiders.

But this is about to change. The IETF MPLS WG recently elevated to working group status a document entitled Seamless MPLS Architecture (draft-leymann-mpls-seamless-mpls). This document proposes extending MPLS from the core into access networks, and seamlessly integrating the access domain into the core MPLS domain. In the words of the draft :

The motivation of Seamless MPLS is to provide an architecture which supports a wide variety of different services on a single MPLS platform fully integrating access, aggregation and core network. The architecture can be used for residential services, mobile backhaul, business services and supports fast reroute, redundancy and load balancing. Seamless MPLS provides the deployment of service creation points which can be virtually everywhere in the network.

With Seamless MPLS there are no technology boundaries and no topology boundaries for the services. Network (or region) boundaries are for scaling and manageability, and do not affect the service layer, since the Transport Pseudowire that carries packets from the AN to the SN doesn't care whether it takes two hops or twenty, nor how many region boundaries it needs to cross.

Seamless MPLS drops the boundaries between access, aggregation, and core networks. This may indeed simplify network management – but how are the security issues handled? The draft’s “Security Considerations” section states the following :

In a typical MPLS deployment the use of MPLS is limited to relatively small network consisting of core and edge nodes. Those nodes are under full control of the services provider and placed at locations where only authorized personal has access (this also includes physical access to the nodes). With the extensions of MPLS towards access and aggregation nodes not all nodes will be "locked away" in secure locations. Small access nodes like DSLAMs will be located in street cabinets, potentially offering access to the "interested researcher".

So far, so good. The draft authors understand the security problem they raise. But now for the punch line …

Nevertheless the unauthorized access to such in device SHOULD NOT impose any security risks to the MPLS infrastructure itself.

The term SHOULD NOT can be understood in two ways. Perhaps it is simply a statement that the authors believe that this placement of nodes in sites where they will be accessible to outsiders simply shouldn’t cause any problems, since no-one would think of attempting to exploit this vulnerability. Or perhaps this is a requirement for implementations, but not a strong MUST requirement, just a SHOULD requirement. In this case the authors are saying that perhaps in some cases it would be a good idea to do something about this, but only if there isn’t some other more important consideration.

But don’t panic - the draft authors add an additional sentence :

Seamless MPLS must be stable regarding attacks against access and aggregation nodes running MPLS.

Note that this requirement carries a non-normative must rather than a MUST. Also, seamless MPLS need not be impregnable to attacks, just stable. Network stability is defined in RFC 2360, the Guide for Internet Standards Writers. It means that the network does not take an infinite time to return to normal operation after some type of change. In this context, it apparently means that after a DoS attack is over, the network should return to normal functioning. Not a very strong requirement !

Can seamless MPLS be made safe (or at least as safe as present networks) ? Of course, but the effort would be substantial, requiring IETF to develop security mechanisms for non-IP traffic, something that has not been attempted to date. As the draft authors requested that the draft be accepted with all the rest of the security section marked “TBD”, fixing this lacuna does not seem to be very high on their list.

Y(J)S

Monday, January 10, 2011

MPLS is not a "successful" protocol

RFC 5218 defines what the Internet Architecture Board considers to be a "successful" protocol. A "successful" protocol is one that meets its original goals and is widely deployed, such as DNS, BGP, SMTP, and SIP. A "wildly successful" protocol far exceeds its original goals in terms of purpose and scale. Examples of the latter are IPv4, ARP, and HTTP. A protocol may be considered successful even if its deployment is still limited, as long as it meets its original goals.

At the technical plenary of the 74th IETF meeting in March 2009, there were presentations on the occasion of the 12th anniversary of the formation of the MPLS working group (subtitled “MPLS becoming a teenager”). This session was subtitled “Many consider MPLS a success, in the sense of RFC 5812's (sic) "What Makes for a Successful Protocol?"” (see agenda and slide).

Note the reference to 5812 instead of 5218. I find this typo enlightening. The first presentation of the session claimed that MPLS is a "wildly successful" protocol. In my opinion, MPLS can not be considered even “successful” in the sense of RFC 5218, but it may indeed be in the spirit of RFC 5812.

For those who haven’t read 5812, it is a proposed standard entitled “Forwarding and Control Element Separation (ForCES) Forwarding Element Model”. ForCES is a framework and a set of protocols that aim to standardize information exchange between the IP control and forwarding planes, enabling control elements (CEs) and Forwarding Elements (FEs) to become physically separated components. Although this was certainly not the intention of the speaker, this type of separation is indeed one of the ancillary benefits of MPLS.

The second talk at IETF-74 was an interesting presentation on the history of MPLS, but it carefully avoided stating the relevant facts. In the mid to late 90s, after opening up the Internet to the public at large and to commercial interests, the Internet started growing exponentially. This growth was exciting, but brought two main concerns, namely
1) address exhaustion - which led to the development of IPv6 (we are still waiting for IPv6 to become a successful protocol …), and
2) slowing down of IP forwarding due to router table explosion - which led to the development of MPLS.

The first issue was temporarily solved by the introduction of NAT, and I won’t discuss it further here. The second brought about a wave of innovation, with at least five solutions offered :

1) Cell Switching Router (Toshiba) (see RFCs 2098,2129)
2) IP Switching (Ipsilon, bought by Nokia) (see RFC 2297)
3) Tag Switching (Cisco) (see RFC 2105)
4) Aggregate Route-based IP Switching (IBM)
5) IP Navigator (Cascade acquired by Ascend which was acquired by Lucent which merged with Alcatel to become ALU)

With so many alternatives, BOFs were held in 1994-1995 and the MPLS working group was chartered in 1997 with co-chairs from Cisco and IBM (which is the reason MPLS is so similar to tag switching and borrows a bit from ARIS).

However, the router manufacturers were not sitting idly waiting for MPLS to succeed, and improvements in algorithms and hardware increased the IPv4 forwarding speed to the point where MPLS was no longer needed.

So why is MPLS still being used ? There are at least two reasons. First, RSVP-TE-enabled MPLS enables hard QoS guarantees that are not possible in pure IP due to the lack of adoption of IntServ. Second, MPLS can carry non-IP packets (pseudowires).

I have heard the argument that the first reason was the true design goal of MPLS. However, a casual reading of RFC 3031, the RFC that defines MPLS, shows that QoS was considered an added advantage, not a design goal.

Some routers analyze a packet's network layer header not merely to choose the packet's next hop, but also to determine a packet's "precedence" or "class of service". They may then apply different discard thresholds or scheduling disciplines to different packets.
MPLS allows (but does not require) the precedence or class of service to be fully or partially inferred from the label.


Although MPLS is very widely deployed, the problem it was designed to solve has gone away (although it may return when IPv6 becomes more prevalent), and indeed on some platforms MPLS-based forwarding is actually slower than native IPv4 forwarding. Thus, according to 5218 MPLS is not successful. Yet.

Y(J)S

Monday, November 29, 2010

Reliable transport vs. reliable transport

One of the most contradictory uses of terminology in communications concerns the word transport as used by the IETF and the ITU-T communities. To make matters worse, the term’s prevalent modifier reliable leads to even further divergence in meaning.

To the Internet community, transport refers to the fourth layer of the OSI layer stack (a layer stack known to the ITU-T as X.200, but largely assumed to have been superseded by the more flexible G.80x layering model). The transport layer sits above the network layer (IP), and is responsible (or not) for a range of end-to-end path functionalities. The IETF has defined four transport protocols – the two celebrated ones being UDP (unreliable) and TCP (reliable), and their less fêted brethren being SCTP (highly reliable and message-oriented) and DCCP (unreliable but TCP-friendly).

Since the IP stack does not extend below OSI layer 3, the basic tenet is that nothing can be done about defects in lower layers (congestion, lost or misordered packets) and the best strategy is to compensate for any such problems using the layer over IP (e.g., retransmission, packet reordering). Reliability in this context thus means employing such compensation.

To the communications infrastructure community a transport network is a communications network that serves no function other than transport of user information between the network’s ingress and its egress. In particular, no complex processing such as packet retransmission or reordering is in scope. Reliability in this context means monitoring the functionality and performance of the lowest accessible layer (OAM), and bypassing defective elements in as short a time as possible (protection switching).

For many years the Rubicon between L2 and L3 so effectively separated the two communities that the disparate usages could continue un-noticed. But then came MPLS.

MPLS was originally invented as a method of accelerating the processing of IP packets, by parsing the IP header only at network ingress, and attaching a short label to be looked-up from then on. With advances in header parsing hardware and algorithms this acceleration became less and less significant. However, the possibility of treating a packet consistently throughout the network, and thus performing traffic engineering under layer three instead of compensation over it, kept MPLS from becoming marginalized. While IntServ at the IP level (implemented by RSVP) never caught on, traffic engineering at the MPLS layer (RSVP-TE) started gaining momentum.

Of course, for MPLS to truly make IP more reliable, it required more “transport” functionality (in the ITU sense), such as stronger OAM and protection switching. This led to the introduction of “Transport-MPLS” (T-MPLS), later to be renamed “MPLS-Transport Profile” (MPLS-TP).
Thus it became impossible to suppress the conflict between the two transports. Work on the MPLS Transport Profile in the IETF is obviously not performed in the “Transport Area” (having nothing to do with transport), but in the “Routing Area” (although it has very little to do with routing).

So what does the future hold for the words “transport” and “reliable” ? In theory it would be possible to adopt synonyms (such as “carriage” and “resilient”), although I doubt that either community would be willing to abandon its traditions. At a high enough level of abstraction the meanings coalesce, so perhaps the best tactic today (whenever there is room for error) is to say sub-IP transport and super-IP transport. Reliability can be left for the super-IP case (where it is not really apt) since there are multiple alternatives for the sub-IP case.

Y(J)S

Tuesday, November 23, 2010

IETF79 - Beijing !

I haven’t had much time to blog of late, having to catch up on work since returning from Beijing.

Beijing ? I hear you ask. Yes, the 79th IETF meeting was held 7-12 November in the Chinese capital.

This was my 26th IETF, and things have changed since my first meeting. Back in the “old days” the meetings were mostly in the US (e.g., Minneapolis in the winter) and occasionally in Europe. This was then changed to a 3:2:1 rule, with half of the six meetings held every two years taking place in North America (with Canada preferred to the US due to visa requirements), two meetings in Europe, and one meeting in Asia (Japan or S. Korea). Even then the default hotel chain with which the IETF had an arrangement was comfortably unvarying. The venue was so predictable that when I found out that the next European meeting was to be held in the French capital, I googled “Paris Hilton” and was surprised to retrieve photos of a scantily dressed heiress.

However, the proportion of Asian participants in SDOs has increased to such an extent that in the space of a single month, three of the SDOs that I follow held meetings in China – ITU-T SG15/Q13 (timing) met in Shenzhen 18 - 22 October, the MEF met 24-27 October in Beijing, followed by the IETF.

While the IETF general attendance figures were up (and for the first time the largest contingent was not from the US – but from China), several of the working groups that I attend suffered from a noticeable lack of major participants. In TICTOC, other than the two chairs and the Area Director, only two of the regulars were able to appear in person. This made it difficult to make any progress on the crucial issues.

However, the PWE3 session was lively, with the topics of making the control word mandatory and deprecating some of the VCCV modes drawing people to the mike. Unfortunately PWE3’s slot coincided with IPPM’s, but apparently IPPM was plagued with a situation similar to TICTOC’s. In the CODEC WG (whose chair couldn't make it to Beijing), the IPR-free audio codec for Internet use that is being developed was demoed. In the technical plenary there were interesting talks and exchanges on IPv6 operations and transition issues, with the local speakers painting a grim picture of IPv4 address availability.

All-in-all it was an interesting meeting in an interesting venue; a venue that I am certain to be visiting again.

Y(J)S

Tuesday, November 2, 2010

OAM for flows

Continuing my coverage of the recent joint IESG/IAB design team on OAM, this time I want to discuss the issue of OAM for flows in Packet Switched Networks (PSNs).

From a pure topology standpoint any communications network is simply a set of source ports (i.e., interfaces into which we may input information), a set of destination ports (i.e., interfaces from which we may receive information), and a set of links connecting source ports to destination ports. Of course, the destination ports may be located very far from the source ports (and this is the reason we use the network in the first place), but this geometry is irrelevant from a topological point of view.

PSNs are communications networks that transfer information in units of packets. They can be classified as either Connection Oriented (CO), or ConnectionLess (CL). In a CO network the end-to-end path from source port to destination port needs to be set up (by physical connection, or by manual/management-system configuration, or by control-plane signaling) before information can be sent down the path. Once set up, it makes sense to call this end-to-end path a “connection”. A connection is essentially an association of a source port with a destination port that has been prepared to carry information.

In a CL PSN each packet of information is individually sent towards its destination. No set up is required as each packet is identified by a unique destination address. When we send a data packet from a source port to the desired destination we can still think about an association of the source and destination ports, but as this association is ephemeral, it does not constitute a connection. If, however, many similar packets are sent from source to destination, it may be useful to speak of a “flow” of packets. Of course, there is no guarantee that all the packets travel from source to destination over precisely the same path through the network, but this often holds for substantial periods of time, until a reroute event takes place. When load balancing is used the definition (and its consequences) becomes truly problematic.

OAM mechanisms were originally designed for Circuit Switched (CS) or CO communications systems, such as PDH, SDH, and ATM networks. For such networks it makes perfect sense to ask about continuity of the connection, or its performance parameters (e.g., delay). Thus Continuity Check (CC) OAM functions became standard, and PM functions recommended for CO networks. The issue is more complex for CL networks. Continuity doesn’t mean very much if every packet is sent to a different destination! It means somewhat more when there is a prolonged flow; but even then packet loss and delay are statistical combinations, since consecutive packets may traverse different network elements on their route from source to destination.

When packets are delivered to an incorrect destination we say a misconnection has occurred, and Connectivity Verification (CV) monitors for such events. (There is often confusion between CC – a functionality needed for CO and CL networks, and CV – which is only for CL.)

For some types of prolonged flows it makes sense to introduce OAM mechanisms to monitor continuity and performance parameters. Pseudostreaming of video over the Internet may involve hundreds of packets per second for many minutes. IPTV flows are even higher in rate and can last for hours. Ethernet Virtual Connections (EVCs) between customer sites last indefinitely.

In a future entry I will discuss when it doesn’t make sense to talk about flows, and what OAM means for such cases.

Y(J)S

Thursday, October 28, 2010

IETF and OAM

On October 12th and 13th the IESG (Internet Engineering Steering Group) and IAB (Internet Architecture Board), the two IETF management bodies, held a joint design session on OAM. I was a bit surprised that the IETF leadership would be interested in devoting a separate meeting (not coinciding with an IETF conference) to the subject of OAM; OAM has never been an area of IETF expertise. Indeed, when the meeting was first announced on the IETF main discussion email list several long-time IETF participants asked for the acronym OAM to be spelled out! Of course, ICMP (ping) was defined in RFC 792 circa 1981, and BFD that runs between routers has its own BFD Working Group (WG) in the IETF, but the overall concept of OAM has never been central to the IETF world view.

However, just as the interests of the ITU-T have been migrating up from synchronous networks to ones based on Ethernet, IP and MPLS, so have the interests of the IETF been migrating down from applications, end-to-end transport, and routing to pure transport functionality. And OAM is a crucial element of transport networks.

The physical meeting was held at George Mason University in Fairfax Virginia, but was also WebEx’ed. Thus I managed to actively participate, and even present slides, without having to travel; but did find myself jet-lagged due to shifting my work day by 6 hours. Unfortunately, the Internet connectivity at the conference site was not completely solid, and the remote attendees frequently found themselves talking to themselves on how to alert those on-site that the connection had failed. Some connectivity OAM would definitely have been useful …

So where does the IETF want to use OAM ? The main interest is now MPLS-TP, but there are still open issues regarding PW OAM.

The IETF PWE3 WG standardized an associated channel that shares fate with the PW traffic, which is mostly employed for OAM. This OAM is misnamed VCCV for Virtual Channel (an old name for PW) Connectivity Verification (which should be Continuity Check). VCCV presently allows IP ping, LSP ping, and BFD protocols to run inside the associated channel in order to provide FM. Back at IETF-67 (November 2006) I proposed using Y.1731 inside the associated channel. This idea was later developed into draft-mohan-pwe3-vccv-eth, backed by Nortel, RAD, France Telecom, KDDI, Huawei, NTT and Sprint, but was rejected by the larger community due to confusion as to its use of Ethernet (it was never intended to be limited to MPLS over Ethernet, or Ethernet PWs).

As I am sure all my readers know, MPLS-TP is a transport technology, being jointly developed by the ITU-T and IETF. The ITU-T views MPLS-TP as yet another transport network, which needs the same OAM functionality as all the other transport networks developed to date (SDH, OTN, carrier-grade Ethernet). In particular, the generic research on OAM for packet-based networks, and the protocol development (in cooperation with IEEE 802.1) of Y.1731, is seen by the ITU-T community as directly relevant to MPLS-TP. Work in the ITU, and Internet Draft draft-bhh-mpls-tp-oam-y1731 submitted to the IETF, proposed maximizing re-use of Y.1731 formats. This approach is strongly advocated by Alcatel-Lucent and Huawei, and is being backed by many operators, including China Mobile, China Telecom, Telecom Italia, France Telecom, Deutsche Telekom, Telstra, and KPN. The idea expands on my earlier proposal, solves both FM and PM with a single OAM protocol, and is expected to undergo major deployment in the near future.

IETF participants from Cisco, Juniper, and Ericsson produced an alternative OAM FM proposal, based on the IETF’s own BFD instead of the ITU-T’s Y.1731. The IETF MPLS WG could not reach consensus as to which mechanism to prefer (in an email poll the community was split about 50/50). The MPLS WG chairs decided to exercise their authority to break such ties, and elevated the BFD-based draft to WG status as draft-ietf-mpls-tp-cc-cv-rdi, thus effectively killing draft-bhh for FM. The issue of PM was open for a while, but a Cisco draft has recently been elevated to become draft-ietf-mpls-tp-loss-delay, thus blocking draft-bhh from the PM function as well. This draft is not fully fleshed out, but it uses the MPLS-TP G-ACh mechanism, and allows either NTP-style or 1588-style timestamps.

So we see that OAM has become a hot (and contentious) topic in the IETF.

After this long introduction, my next entry will delve into a few of the subjects discussed at the joint design session.

Y(J)S

Wednesday, September 29, 2010

OAM for FM and PM

The Operations, Administration, and Maintenance (OAM) functionality provided in all modern communications systems supports two distinguishable functions, namely Fault Management (FM) and Performance Management (PM).
It is important to remember that despite the use of the word “management” here, OAM is a user-plane function. OAM may trigger control plane procedures (e.g., protection switching) or management plane actions (such as alarms), but the OAM itself is data that runs along with the user data.

FM deals with the detection and reporting of malfunctions. ITU-T Recommendation G.806 defines a scale of such malfunctions :
  • anomaly (n): smallest observable discrepancy between desired and actual characteristics
  • defect (d): sufficient density of anomalies that interrupts some required function
  • fault cause (c): root cause behind multiple defects
  • failure (f): persistent fault cause such that the ability to perform the function is terminated

The main FM functions include :

  • Continuity Check (CC): checking that data sent from A to B indeed arrives at B
  • Connectivity Verification (CV): checking that data sent from A to B does not incorrectly arrive at C
  • Loopback (LB): checking that data sent from A to B can be returned from B and received at A
  • Forward Defect Indication (FDI) also called Alarm Indication Signal (AIS): when data sent from A to B is destined for C, B reports to C that it did not receive data from A
  • Backward Defect Indication (BDI) also called Reverse Defect Indication (RDI): when data is sent from A to B, B reports to A that it did not receive the data.
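
The difference between CC and CV can be sketched in a few lines of code. The following is only an illustration from the point of view of the receiving end, not the mechanism of any particular OAM protocol: the class name, the source identifiers, and the rule of declaring loss after 3.5 silent periods are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CcMonitor:
    """Sink-side monitor for periodic CC messages (illustrative only)."""
    expected_source: str           # hypothetical identifier of the expected peer
    period: float                  # nominal CC transmission period, in seconds
    loss_multiplier: float = 3.5   # assumed threshold, in units of the period
    last_seen: float = 0.0         # arrival time of the last valid CC message

    def on_message(self, source: str, now: float) -> str:
        """Classify a received CC message."""
        if source != self.expected_source:
            # CV: data from some other source has incorrectly arrived here.
            return "misconnection"
        self.last_seen = now
        return "ok"

    def check(self, now: float) -> str:
        """Run periodically to detect silence from the expected source."""
        if now - self.last_seen > self.loss_multiplier * self.period:
            # CC: nothing has arrived from the expected source for too long.
            return "loss_of_continuity"
        return "ok"
```

In this sketch a message from the wrong source does not refresh the timer, so a misconnection that displaces the expected traffic eventually also shows up as a loss of continuity.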

PM deals with monitoring of parameters such as end-to-end delay, Bit Error Rate (BER), and Packet Loss Ratio (PLR). While there may not be loss of basic connectivity if performance parameters are not maintained within their desired realms, the ability to provide specific services may be compromised, even to the extent that there is a loss of service. For example, excessive round-trip delay makes it difficult to hold interactive audio conferences, and excessive PLR may lead to loss of an IPTV service. For this reason, Service Providers (SPs) commit to Service Level Agreements (SLAs) that specify the acceptable PM parameters.

A partial list of PM parameters that may appear in an SLA is :

  • BER or PLR (for packet oriented networks)
  • 1-way delay (1DM) also called latency: the amount of time it takes for data to go between two points of interest (this measurement requires clock synchronization between endpoints)
  • 2-way delay also called roundtrip delay (RTD): the amount of time it takes for data to go to a point of interest and return (does not require clock synchronization)
  • Packet Delay Variation (PDV): the variation of delay (may be 1-way or 2-way, but even 1-way does not require time synchronization, although frequency synchronization may be required for highly accurate measurements)
  • Availability: percentage of time that the service can be provided
  • Throughput or Bandwidth profile (for packet oriented networks): methods of quantifying the sustainable data rate (will generally be needed for each direction separately)
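
To make the synchronization remarks above concrete, here is a minimal sketch of how some of these parameters might be computed from raw counters and timestamps. The function names are hypothetical, and PDV is taken here as the spread between the largest and smallest observed delay, which is only one of several definitions in use.

```python
def packet_loss_ratio(sent: int, received: int) -> float:
    """PLR: fraction of transmitted packets that never arrived."""
    return (sent - received) / sent


def round_trip_delay(t_tx: float, t_rx: float, responder_hold: float = 0.0) -> float:
    """2-way delay: both timestamps come from the initiator's own clock,
    so no synchronization with the far end is needed."""
    return (t_rx - t_tx) - responder_hold


def one_way_delays(tx_times, rx_times):
    """1-way delays: tx and rx timestamps come from different clocks,
    so any offset between the clocks appears directly in the result."""
    return [rx - tx for tx, rx in zip(tx_times, rx_times)]


def delay_variation(delays):
    """PDV taken as max minus min (an assumed definition); a constant
    clock offset is common to all delays and therefore cancels."""
    return max(delays) - min(delays)


def availability(up_seconds: float, total_seconds: float) -> float:
    """Availability as a percentage of the observation interval."""
    return 100.0 * up_seconds / total_seconds
```

Feeding one_way_delays receive timestamps from a clock that is five seconds fast yields delays that are all five seconds too large, yet delay_variation of those delays is unchanged, which is why 1-way PDV does not need time synchronization.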

While certain FM functions, in particular Continuity Check (CC), are usually run periodically, PM functions are frequently called on an ad-hoc basis. However, with an SLA in effect, the SP needs to periodically monitor the PM parameters, and the customer may want to do so as well. In fact, while customers typically trust legacy SPs to provide the promised service level (after all, a 2.048 Mbps leased line is never going to deliver only 1.9 Mbps!), they have much less trust for newer services (it is relatively easy for an SP to cheat and provide 8 Mbps Ethernet throughput instead of the promised 10 Mbps).

In future entries I will deal with questions such as what parameter levels are needed for particular applications, how PM impacts user experience, and how SPs and customers should monitor performance.

Y(J)S

Wednesday, September 8, 2010

Deployment, R&D, and protocols

In my last entry I discussed why the last mile is a bandwidth bottleneck while the backhaul network is a utilization bottleneck. Since I was discussing the access network I did not delve into the core, but it is clear that the core is where the rates are highest, and where the traffic is the most diverse in nature.

Based on these facts, we can enumerate the critical issues for deployment and R&D investment in each of these segments. For the last mile the most important deployment issue is maximizing the data-rate over existing infrastructures, and the area for technology improvement is data-rate enhancement for these infrastructures.

For the backhaul network the deployment imperative is congestion control, while development focuses on OAM and control plane protocols to minimize congestion and manage performance and faults.

For the core network the most costly deployment issue is large-capacity, fast and redundant network forwarding elements, along with rich connectivity. Future developments involve a huge range of topics, from optimized packet formats (MPLS) through routing protocols, to management plane functionality.

A further consequence of these different critical issues is the preference of protocols used in each of these segments. In the last mile efficiency is critical, but there is little need for complex connectivity. So physical-layer framing protocols rule. As there may be the need for multiplexing or inverse multiplexing, one sometimes sees non-trivial use of higher-layer protocols. However, these are usually avoided. For example, Ethernet has long had an inefficient inverse multiplexing mechanism (LAG), but this is being replaced with the more efficient sub-Ethernet PAF (EFM bonding) alongside physical layer (m-pair) bonding for DSL links.

In the backhaul network carrier-grade Ethernet has replaced ATM as the dominant protocol, although MPLS-TP advocates are proposing it for this segment. Carrier-grade Ethernet acquired all the required fault and performance mechanisms with the adoption of Y.1731, while the MEF has worked hard in developing the needed shaping, policing, and scheduling mechanisms.

In the core the IP suite is sovereign. MPLS was originally developed to accelerate IP forwarding, but advances in algorithms and hardware have made IPv4 forwarding remarkably fast. IP caters to a diverse set of traffic types, and the large number of RFCs attests to the richness of available functionality.

Of course it is sometimes useful to use different protocols. A service provider that requires out-of-footprint connectivity might prefer IP backhaul to Ethernet. An operator with regulatory constraints might prefer a pure Ethernet (PBBN) core to an IP one. Yet, understanding the nature and constraints of each of the segments helps us weigh the possibilities.

Y(J)S

Thursday, August 26, 2010

Bandwidth and utilization bottlenecks

Let us consider an end-to-end data transport path that can be decomposed into the following segments
* end-to-end path = LAN + access network + core network + access network + LAN
There may be distinct service providers for each of these segments, thus many different decompositions may make sense from the business perspective. Yet, the identity of the access network, and of its components
* access network = last mile + backhaul network
are useful constructs for more fundamental reasons.

These reasons emanate from the concepts of bandwidth and bandwidth utilization (the ratio of required to available bandwidth). In general :
1) LAN and core have high bandwidth, while the last mile has low bandwidth.
2) LAN and core enjoy low utilizations, while the backhaul network suffers from high utilization.
Let's see why.

LANs are the most geographically constrained of the segments, and thus physics enables them to effortlessly run at high bandwidth. On the other hand, LANs handle only their owner’s traffic, and thus the required bandwidth is low as compared with that available. And if the bandwidth requirements increase, it is a relatively simple and inexpensive matter for the owner to upgrade switches or cabling. So utilization is low.

Core networks have the highest bandwidth requirements, and are geographically unconstrained. This is indeed challenging; however, the challenge is actually financial rather than physical. Physics allows transporting without error any quantity of digital data over any distance; it just extracts a monetary penalty when both bandwidth and distance are large. Since it is the core function of core network operators to provide this transport, the monetary penalty of high bandwidth is borne. Whenever trends show that bandwidth is becoming tight, network engineering comes into play – that is, either some of the traffic is rerouted or the network infrastructure is upgraded.

Shannon’s capacity law universally restricts the bandwidth of DSL, radio, cable or PON links used in the last mile. However, utilization is usually not a problem as customers purchase bandwidth that is commensurate with their needs, and understand that it is worthwhile to upgrade their service bandwidth as these needs increase.

On the other hand, the backhaul network is a true utilization bottleneck. Frequently the access provider does not own the infrastructure, and purchases bandwidth caps instead. Since the backhaul is shared infrastructure, overprovisioning these rings or trees would seriously impact OPEX overhead. Even when the infrastructure is owned by the provider, adding new segments involves purchasing right-of-way or paying license fees for microwave links.

So, the sole bandwidth bottleneck is the last mile, while the sole utilization bottleneck is the backhaul network. Understanding these facts is critical for proper network design.
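
The two kinds of bottleneck can be told apart with a toy calculation. The figures below are invented purely for illustration (they are not measurements of any real network), and the names are hypothetical; the only thing taken from the discussion above is the definition of utilization as the ratio of required to available bandwidth.

```python
def utilization(required_bps: float, available_bps: float) -> float:
    """Bandwidth utilization: ratio of required to available bandwidth."""
    return required_bps / available_bps


# Invented (required, available) bandwidths in bits per second per segment.
segments = {
    "LAN": (50e6, 1e9),
    "last mile": (10e6, 20e6),
    "backhaul": (900e6, 1e9),
    "core": (30e9, 100e9),
}

# The bandwidth bottleneck is the segment with the least available bandwidth.
bandwidth_bottleneck = min(segments, key=lambda name: segments[name][1])

# The utilization bottleneck is the segment with the highest utilization.
utilization_bottleneck = max(segments, key=lambda name: utilization(*segments[name]))
```

With these made-up numbers the last mile has the least available bandwidth while the backhaul network runs closest to its limit, so the two bottlenecks fall in different segments.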

Y(J)S

Thursday, August 19, 2010

The access network equation

My last entry provoked several emails on the subject of the terms last/first mile vs. access networks. While answering these emails I found it useful to bring in an additional term – the backhaul network. Since these discussions took place elsewhere, I thought it would be best to summarize my explanation here.

Everyone knows what a LAN is and what a core network is. Simply put, the access network sits between the LAN or user and the core. For example, when a user connects a home or office LAN to the Internet via a DSL link, we have a LAN communicating over an access network with the Internet core. Similarly, when a smartphone user browses the Internet over the air interface to a neighboring cellsite, the phone connects over an access network to the Internet core.

However, the access network itself naturally divides into two segments, based on fundamental physical constraints. In the first example the DSL link can’t extend further than a few kilometers, due to the electrical properties of twisted copper pairs. In the second case when the user strays from the cell served by the base-station, the connection is reassigned to a neighboring cell, due to electromagnetic properties of radio waves. Such distance-limited media are the last mile (or first mile if you prefer).

DSLAMs and base-stations are examples of first aggregation points; they terminate last mile segments from multiple users and connect them to the core network. Since the physical constraints compel the first aggregation point to be physically close to its end-users, it will usually be physically remote from the core network. So an additional backhaul segment is needed to connect the first aggregation point to the core. Sometimes additional second aggregation points are used to aggregate multiple first aggregation points, and so on. In any case, we label the set of backhaul links and associated network elements the backhaul network.

We can sum this discussion up in a single equation:
* access network = last mile + backhaul network

I’ll discuss the consequences of this equation in future blog entries.

Y(J)S