Diplomacy of troubleshooting and the telecom alphabet soup: T1, CRC errors, ESF, B8ZS, HDSL, CSU/DSU, clock source

Wednesday 21 April 2004 at 13:20

This story is about the diplomacy of troubleshooting and the art of asking good questions. It is also a parable about the value of non-conflict. The telecom alphabet soup in the title is for the search engines so that people who are actually troubleshooting can just skip ahead to our solution to CRC errors on a T1 line.

bivio has a new office, but no one works there yet. We renovated because the place was a mess but we liked the location. We added new carpet, fresh paint, a kitchen, a server room, and enclosed one additional office. We also took out one wall in the office we'll use for ourselves -- the rest we plan to sublet. It's vastly improved except for one crucial detail: getting internet access has been a fiasco.

Our production systems are in a colocation facility in South Denver. We have a T1 line running from there to the basement of Rob's house. Our development systems are in the basement, along with some big disk arrays and a tape robot for backups and an incredibly heavy printer. The plan was to move the T1 line and our systems from Rob's basement to the office.

It's not as if a leased T1 line is some bleeding edge networking technology. We know a T1 has been in our office before -- the previous tenants appear to have been tele-marketers. Moving our T1 ought to be entirely painless, even trivial for the telco, right? That assumption was our first mistake.

We scheduled the cutover, moved the computers and waited for the packets to start flowing. Instead our routers reported CRC errors. Lots of CRC errors. About 5000 errors per second.

Rob was on the phone with technicians and management from the telco and our colocation facility for two-and-a-half days alternately doing the angry-customer-rant or helping to troubleshoot the problem. When no progress was in sight on the third day, we moved the computers back to the basement.

But the office looks great, really.

The blame game

Our telco resells Qwest's wires for "the last mile" to Rob's house and otherwise runs their own phone cloud from Boulder to our colocation facility. A tech came out to our office and put his test gear on the line. No errors whatsoever. "Must be your routers."

For five years our routers happily traded packets over the existing line in Rob's basement. We know they work and we know they work with the telco's cloud. Rob finally got fed up with the blame-the-router solution and dragged the tech over to his house to demonstrate that the router works from his basement. It worked flawlessly.

The next stop was the telco's facility in Boulder to isolate Qwest from the test. CRC errors piled on. Seemed pretty clear to us. It's not the routers, but a provisioning problem. So, what's different between the line to Rob's basement and the line to the new office?

Pursuing that question was the second mistake.

It turns out there is a difference: HDSL. We demonstrated that the routers were not the problem. Their test gear "demonstrated" that the line was not the problem. "Must be that HDSL." But we failed to get the routers working in their facility. We hadn't determined that the problem was in one of the last-mile segments. In fact, by reproducing the errors inside the telco's own facility, we had actually isolated those segments from the problem. Rob tried really hard to make this point and steer them away from the HDSL red herring, but it was like trying to stop a train wreck. They were convinced it was the HDSL.

Rob's house is a long way from the central office. HDSL can cover the distance. There's nothing about HDSL that would cause the CRC errors. And HDSL wouldn't explain why the router didn't work in their facility. Nevertheless, they requested HDSL be added to the line to the office and Qwest told them it would be five days.

Belief is incredibly powerful. Given evidence which conflicts with belief, we will sooner dismiss the evidence than change what we believe.

Any other options?

While our telco waited for Qwest, we investigated alternatives. The options included cable and DSL. Comcast offers cable service in our area, but the upstream bandwidth is limited to 384Kbps and you only get five static IP addresses. We might be able to work within the bandwidth limits, but need more than five static addresses. DSL won't work in our location. Qwest tells us there's 26 gauge wire between their central office and our office -- there's too much signal loss. They could maybe get 256Kbps to outside the building, but they don't think it would reach inside the building at all.

Our telco had not contacted us at all and we had no viable alternative so we got back on their case again last week. Qwest hadn't gotten back to them about the HDSL. Rob and I were finally talking with their lead T1 tech. A technician he knew at Qwest thought we couldn't even get HDSL at our location. (Of course we knew that already :-) But no one had a good theory about how HDSL was going to stop the CRC errors anyway.

Beginner's mind and language handshaking

Most of the time I'm very good at keeping my ego out of the way of solving problems. My most important tool is investigative questions. I assume that I don't know what I'm talking about and hope that someone among us does. I trust that if we can put our heads together we can discover a solution. Questions are powerful, but hostile questions are off limits. For example, "What the hell are you thinking?" "Are you out of your mind?" "Do you even know what you're doing?" "Why haven't you tried XYZ?" Bad questions. They divide us. I'll illustrate some better ones below.

As with all technical problems there were two important dimensions: the technical and the human. Our primary obstacle was the human one, the egos of all the people involved.

No one wants to take the blame when things are going disastrously wrong. There's nothing like a problem to bring out the worst in people's fears. This is why the blame game is so common.

Assuming beginner's mind is powerful because it lets everyone remain an expert. Give up learning, and put an end to your troubles. I'm not challenging anyone's authority or expertise. I'm not trying to affix blame. There's no power conflict. I'm just trying to understand whatever is necessary to solve the technical problem.

Our technical problem was CRC errors.

The lead tech drove out from Denver on Friday afternoon with a Cisco router borrowed from our colocation people. I met him at their Boulder facility with our router (a Lucent Pipeline 130). Unable to get our Pipeline 130 working in their facility, we tried getting the Cisco router configured. When the passwords they had given us failed, we got the colocation tech support on the phone. That turned out to be a really good thing to do. The telco tech had no router foo. My router foo has seven years of rust. The colocation tech had strong router foo. Back to being a beginner.

Remembering that telco people speak a different language, I had reviewed the Pipeline docs trying to refresh my vocabulary. It's tough to ask good questions if you don't speak with the same jargon. Without a language handshake you'll just talk right past each other -- there will be no real communication. Did leased lines mean anything? Nailed-up? Nope and nope. I remembered that framing and encoding would mean something. Turns out those were the only parameters the telco tech could really twiddle as far as provisioning the lines. The previous line, the new line and our routers were all set up for ESF and B8ZS. That wasn't our problem. What about channelization? That word he recognized. The new T1 line was not channelized.

Earlier in the conversation, the telco tech had drawn a picture of the major components in their cloud between the colocation facility and our office. The only thing that was changing was the end point. All the other links would stay the same. None of the links were channelized. "How about in the colocation facility? What's happening between the telco's OC3 into that facility and our cabinet? Were they expecting a channelized T1?" These are good questions because they are informational and non-threatening. There's no blame implicit in the answer. These questions can draw us together, hopefully toward a solution.

According to the colocation tech, almost all data communications use channelized connections. Maybe that was our problem. We know the telco isn't channelizing but that the routers almost certainly expect channels. "What's responsible for channelizing?" It's important to ask what device is responsible as opposed to who is responsible. Again the questions do two things: draw us together rather than divide us, and focus on the technical problem.

Turns out that's the CSU/DSU. Those are built into our routers. In days gone by the CSU/DSU was a separate device. Our Pipeline 130 includes one as does the Cisco competitor. The router I'd brought with me and the one in our cabinet at the colo were configured identically, except for the IP addresses. Both were expecting channels in the T1. I said this bit out loud to explain that channelization didn't appear to be our problem. But it triggered a crucial insight from the colocation tech -- while he explained channelization he mentioned the all important clock source.

Timing is everything

Neither of our routers was providing a clock source. A little telecom review is in order. T1 lines are also called DS1s. CSU/DSUs channelize the DS1 into 24 DS0s, each providing 64Kbps of bandwidth, for a total of 1,536Kbps. Channelization gives you the option to split your T1 line -- part can be used for data and part for voice. Each DS0 could be a uniquely numbered plain old phone line, or they could be used for data. In our case it's all data. But it helps to know why lines get channelized at all.
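For anyone who wants to check the arithmetic, here is a small sketch. The 1,536Kbps figure is the one above; the 8,000 frames per second, the single framing bit per frame and the resulting 1,544Kbps line rate are standard T1 numbers that I'm adding for context, not something we measured.

```python
# T1 / DS1 rate arithmetic.
# Assumption: standard T1 framing (8 bits per channel per frame, 1 framing
# bit per frame, 8000 frames per second). These values are not from our
# routers' configuration; they are the textbook figures.
DS0_KBPS = 64                 # one channel
CHANNELS = 24                 # DS0s per DS1
BITS_PER_SLOT = 8             # bits each channel gets in one frame
FRAMING_BITS_PER_FRAME = 1    # the bit ESF framing (and its CRC) rides on
FRAMES_PER_SECOND = 8000

payload_kbps = CHANNELS * DS0_KBPS
frame_bits = CHANNELS * BITS_PER_SLOT + FRAMING_BITS_PER_FRAME
line_rate_kbps = frame_bits * FRAMES_PER_SECOND // 1000

print(payload_kbps)     # 1536
print(frame_bits)       # 193
print(line_rate_kbps)   # 1544
```

The 8Kbps gap between the two rates is the framing overhead, which is where the framing setting (ESF in our case) comes into play.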

The channels are created by time-division multiplexing. A few microseconds of data are sent for the first channel, then a few microseconds for the second channel, then a few for the third and so on through all twenty-four channels. Then they start over at the first channel. In traditional telecommunications this allows a bunch of different voice conversations to be carried over the same pair of wires. The phone cloud can divide and reassemble the conversations faster than we can hear -- we perceive no interruptions. The other CSU/DSU reassembles the streams -- demultiplexes them. We're splitting seconds here, so if the clocks on the two CSU/DSUs are not in sync, no data will pass through. The line will pass all the tests offered by telco testing equipment. But no data will flow. The routers will be dancing to different beats. The music will be discord. Our routers were not syncing.
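A toy model makes the "different beats" idea concrete. This is only a sketch under loud assumptions: it interleaves one byte per channel per frame, leaves out the framing bit, and fakes a timing disagreement with a fixed one-slot offset. Real clock drift happens at the bit level and shows up as repeated slips, which is not modeled here. The names `mux`, `demux` and `offset` are mine, not anything from a router.

```python
CHANNELS = 24

def mux(streams):
    """Interleave one byte from each channel per frame (byte-wise TDM)."""
    frames = zip(*streams)                      # one tuple of 24 bytes per frame
    return bytes(b for frame in frames for b in frame)

def demux(line, offset=0):
    """Split the line back into channels, assuming frames start at `offset`."""
    aligned = line[offset:]
    return [aligned[ch::CHANNELS] for ch in range(CHANNELS)]

# Hypothetical test data: channel n carries the byte value n, four times.
streams = [bytes([ch]) * 4 for ch in range(CHANNELS)]
line = mux(streams)

in_sync = demux(line)               # both ends agree where a frame starts
slipped = demux(line, offset=1)     # receiver is one time slot off

print(in_sync[0])    # b'\x00\x00\x00\x00'
print(slipped[0])    # b'\x01\x01\x01\x01'
```

Every byte on the wire arrives intact in both cases, which fits with a line that tests clean. The slipped receiver simply hands channel 0 the data that belonged to channel 1, so nothing above it makes sense.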

When I configured the router I'd brought with me to provide a clock source we were finally able to route data across the T1 line. Ping! We then confirmed that the new configuration worked from our office as well. Whatever the problem was, the solution was to configure one of the routers to provide a clock source.

The blame

We had a solution to the technical problem. Unfortunately it came too late on Friday. Rob is gone this week and we didn't want to risk moving the computers again without him. We're stuck waiting another week.

My last question while I was at the telco's facility turned out to be threatening, though not intentionally. "So how did they work for five years without either of them providing the clock source?" I was trying to ask another investigative question, but there were other questions implied: "Who blew it? Why didn't someone look for the clock source weeks ago?" Those implied questions were threatening.

Speculations included "Maybe they were close enough to being in sync to work. Maybe we hadn't been getting our full potential out of the line all these years. Maybe HDSL includes a clock source." These answers didn't satisfy, but I left it alone because we had a solution to the technical problem and I hadn't intended to send us back into the defensive finger-pointing. We switched the line back to Rob's basement and scheduled time for our move next week.

The docs about the router's clock source mention that the phone system can provide the clock source. In that case our routers were configured correctly -- both routers accepting the clock source from the cloud. The telco tech says they never provide a clock source as a matter of policy. If that's the case then someone must have made an exception to that policy when the line to the basement was installed. Remember, we're splitting seconds here. Partly out-of-sync is out-of-sync and no data will flow. The HDSL providing a clock source is the most plausible speculation. And Qwest might be providing a clock source separately from the HDSL. Regardless, I'd say that clock source is part of the provisioning: framing, encoding and clock source. If it's Qwest, then the blame still falls in our telco's lap because they are reselling Qwest lines.
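The reasoning boils down to a simple rule of thumb: something on the circuit has to generate the clock, and everything else has to follow it. Here is a minimal sketch of that check. The `Endpoint` class, the "internal" and "line" labels and the `diagnose` function are hypothetical names for illustration, not settings copied from the Pipeline 130 or the Cisco, and whether the network supplied timing on the basement line is exactly the part we could only speculate about.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    clock: str   # "internal" = generates the clock, "line" = takes it from the line

def clock_masters(endpoints, network_provides_clock):
    """Return the names of everything generating a clock for the circuit."""
    masters = [e.name for e in endpoints if e.clock == "internal"]
    if network_provides_clock:
        masters.append("network")
    return masters

def diagnose(endpoints, network_provides_clock):
    """Apply the rule of thumb: exactly one clock source per circuit."""
    masters = clock_masters(endpoints, network_provides_clock)
    if not masters:
        return "no clock source"
    if len(masters) > 1:
        return "competing clock sources: " + ", ".join(masters)
    return "ok: timing comes from " + masters[0]

colo = Endpoint("colo router", "line")

# New office as first provisioned: both routers wait for a clock nobody sends.
print(diagnose([colo, Endpoint("office router", "line")], False))
# prints: no clock source

# New office after the fix: one router provides the clock.
print(diagnose([colo, Endpoint("office router", "internal")], False))
# prints: ok: timing comes from office router

# Basement line, if the speculation about the network side is right.
print(diagnose([colo, Endpoint("basement router", "line")], True))
# prints: ok: timing comes from network
```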

I expect our telco will continue to blame the routers. The finger-pointing will continue, but at least we will be able to move into our new office next week.

Jeff Thomson commented

30 April 2004 at 13:22

Hey man, from my experience, it is a wonder that the telcos manage to stay in business. I can't even count the number of hours I have spent at night trying to troubleshoot telco issues. Like you said, the worst is when they start going down one path, ignoring any other possibilities. It usually takes us (the customer) to brainstorm and get to the real problem. There is actually an acronym (CCBM) Came Clear By Magic, that the telcos use when the situation resolves itself ("No, we didn't do anything on our end." -- heard over top of the sound of furious typing!). Doesn't matter if you're a small shop or the biggest sporting goods retailer in the country, telcos are a pain. Good luck with the reloc, and one of these days I'll give you a call to get some beers. Peace-Jeff