BONUS: Guest Post on History of Engineering Excellence at Microsoft
Jon DeVaan shares some history of "EE" at Microsoft along with the original "Engineering Excellence" memo that started the efforts.
This post is guest authored by Jon DeVaan (JonDe, and @jondevaan). He builds on the previous post 087. Reorg! Why Are We Together, Exactly? detailing his parallel efforts at rebuilding the engineering culture starting in the early 2000s, an effort he referred to as Engineering Excellence.
There is a little bit of parallel history to what is described in the post that helps explain what happened when I joined in working on Windows in 2006.
When Steve Ballmer became CEO, I was tapped to co-lead the Consumer and Commerce Division with Brad Chase (BradC). This was a conglomeration of product teams ranging from the MSN dial-up ISP, MSN Portal and web properties, WebTV, MicrosoftTV, a nascent Web platform for 3rd party ISPs and the worldwide salesforce dedicated to telecommunications. This was my first experience with a large PUM-led organization. I focused on the ISP, TV, and telecommunications areas. Inheriting a large set of PUM organizations revealed two important properties. 1) As Steven has pointed out, PUM organizations place a ceiling on the scope and seniority of engineering talent; and 2) PUM organizations lead to fierce competition for resources which generates massive distrust between teams and a huge amount of management noise arbitrating the resulting disputes. (Think of the jokey Internet meme Microsoft org chart involving handguns.) I can’t say that I completely figured out the right things to do during this time, but the experience definitely informed what would come next.
In the early 2000s Microsoft endured a string of engineering project failures. The most notorious was Longhorn, the planned successor to Windows XP, but there were also major project failures in SQL Server, Developer Tools, and other areas. The Trustworthy Computing memo from Bill Gates in 2002 launched a drive to train engineers and standardize engineering processes across the company that, while important, was perceived as overly bureaucratic and too heavyweight. It also became an excuse for project failures and for PUMs to demand more resources. In the face of this, I was asked to lead a new initiative called Engineering Excellence, working for BillG, to debug the Microsoft engineering culture and drive improvements.
In the debugging phase I talked to engineering leaders across the company to get their point of view on what was and was not working. It was readily apparent that for large swaths of the company, there wasn’t ANY senior engineering talent guiding the work of junior engineers. To get promoted people had to become PUMs, at which point they were consumed with fighting the internal battles over resources and mindshare. It was also clear that PUMs, their more senior counterparts the General Managers (GMs), and VPs had no insight as to why their organizations were failing or for that matter what role, if any, they had to play in product development. Some organizations had architects who were technical staff people who attempted to guide other engineers, but they were not line managers and were often, to their frustration, ignored. I wrote a long memo about these insights that BillG hated, but I like to think it smoothed the waters for Steven’s memo to come a couple years later.
A key part of moving forward on Engineering Excellence was helping senior leaders of all kinds see that they were accountable for their organizations being conscious of how their engineering teams worked, and for ensuring that how their product was created was treated as being as strategically important as what got created. It was important to convince them that it was appropriate to spend headcount and budget on defining and improving their team’s engineering system. There just weren’t very many senior engineers who knew how to do it.
The company started working on growing engineering talent. The first step was writing down what was generally expected of engineers from the lowest to highest levels. Microsoft’s HR system always had a concept of levels that defined pay ranges for all employees, including engineers. In the oldest days of the “Productivity Software” (PS) team we had definitions of expectations for (at the time) levels 10-14 written by Charles Simonyi (CharlesS) and Jeff Harbers (JeffH). When the Office Product Unit was formed, I worked with Steven and GrantG to update these definitions for the times and include the Test and Program Management disciplines. Most Microsoft R&D divisions did not have written descriptions or had vaguely rehashed versions of the PS or Office descriptions. After SteveB became CEO he drove a transition from the old 10-14 system to the new Career Ladder system with new numbers (60-80) and more levels allowing for a more rapid velocity through promotions. This was an excellent opportunity for documenting expectations for engineers. Working with senior leaders across the company we created detailed definitions for each level that allowed managers to make more uniform evaluations, but also gave engineers a picture of what they needed to learn and demonstrate in order to be promoted. Self-direction and learning are the most important parts of building a career.
There are many factors which go into describing the progression from a junior to senior engineer. The obvious ones are knowledge, expertise, impact, and experience over both technology and teamwork. One of the factors was the notion of “scope,” the breadth of technology or people. To define scope, we used concepts from the Office organization. Scope would be defined mostly by technical manager level. A first line manager (Lead of a Feature Team discipline in MS jargon, ex: Test Lead) is responsible for about 6 people and the code that many people can be responsible for. The dimension of code scope is also important. Some teams were capable of handling much more code, or code which held the key algorithms of the product. This allowed for a range of levels applicable to Leads with different scopes of technology. The next scope definition was the Manager of Managers (Manager of a Product Team discipline, ex: Development Manager), responsible for about 6 Leads and the commensurate amount of code. In a PUM/GM organization that was the maximum scope for engineers. There weren’t any higher definitions of scope in such an organization, which is what we were now changing. We added two additional levels of scope for engineers: Product Line (Manager^2 people scope and major product scope, ex: Director of Program Management for Office (Steven’s old job when we worked together)) and Division. We now had definitions of technical scope that could be used to retain the most senior engineers as engineers and an organization structure that could grow and nurture them.
Steven writes in the post about Product Lines in the reorganization materials. This is where those concepts came from. I had tested the ideas when reorganizing the WebTV and MicrosoftTV teams and saw how they helped drive much better engineering performance. Now I needed another example. Longhorn was still floundering.
There were a lot of dynamics driving the Longhorn dysfunction. One dynamic was BillG’s pushing for certain technical innovations, like WinFS, and its separation from what the teams were actually working on. All of this has layers. The technical vision for WinFS was to adapt the SQL engine from the SQL Server product to be a component of the Windows NT file system and create a much more sophisticated set of data types and objects that could be stored as files and also allow for very advanced views of all types of data a user might store. The main issue was that, beyond making PowerPoint slides, there wasn’t anyone working on it. Remember SQL Server was having a project failure that required people to be recalled into the core team to complete. The Windows team was working on XPSP2 and Server 2003 with no file system people dedicated to WinFS. There were other Longhorn features for which this dynamic was true.
Another dynamic was the Windows team had not created a multi-threaded engineering system capable of working on more than one major release and servicing older releases. This engineering system was focused on XPSP2 and Server 2003. Windows PUMs whose teams were working on Longhorn features were doing so in isolation. In a world where new code was not being integrated at regular intervals and no process existed for resolving conflicts between different teams’ changes, the Longhorn build was frequently broken for weeks at a time. The stability and performance of a working build were poor. A large set of PUMs did not have the engineering talent working for them to solve these engineering system issues. PUMs also were competing for resources. Promising heroic schedules to satisfy BillG’s technology desires was an excellent way to gain capital.
As the work on Windows Server 2003 started winding down, there was a competition between the teams coming onto Longhorn from 2003, who believed the builds were too buggy and out of control, and the other teams, who were promising new features and insisting that the problems could be overcome. Which position was right? The Windows team needed to make sound engineering decisions backed by the best data possible.
It is at this time that I worked with Jim Allchin (JimAl) and Brian Valentine (BrianV) to create Windows’ first product line organization, the Core Operating Systems Division (COSD), to build the engineering power necessary to make these decisions. I use the term power here rather than expertise because as much as it was about getting the best engineers on the front lines as possible, it was also necessary to have a major part of the organization stepping up to convince BillG of any decision. All credit goes to Jim and Brian, along with Amitabh Srivastava (AmitabhS, Development), Darren Muir (DMuir, Testing), and Chris Jones (ChrisJo, Program Management), who acted on this advice and then followed through to reset Longhorn and put in place the plan which shipped as Windows Vista.
In my original Engineering Excellence memo, I outlined how the team and management structure was letting Microsoft down and causing a raft of project failures. It is popular to talk about Vista’s problems in the market. However, it was also the first step of recovery. It was a successful engineering project that laid the technical foundation for many later Windows versions in addition to Server and later Windows Phone. Vista and COSD provided a proof point for the benefits of the product line idea.
I was excited for the opportunity to work with Steven again. Neither of us knew if this journey was going to be successful, but the importance to Microsoft was clear and we wanted to prove out our ideas. It was also important for everyone to see that senior leaders could work together effectively, something that we both were going to drive through all levels of the organization.
—Jon DeVaan (JonDe, @jondevaan)
En route from my west side hotel to downtown LA in 2003 for the 2nd day of the PDC, I called a Windows architect friend on my cell to tell him how excited I was about WinFS and WPF and how useful this would be for Autodesk's customers. Imagine my disappointment when he told me that the PPTs I was reading and the keynotes I saw were fantasy. When I pointed out that they were being presented to thousands, many of us heads of product for key ISVs, he said, "Gar, *no one* is working on these things. We're about to officially drop them".
At that day's keynote, watching Jim demo building an app using all this to the audience, I said to myself, "either my good friend of 10 years is uninformed or JimAl and these others are not telling the truth".
How awkward it must have been to be in the middle of that. But Jon describes the good downstream impact of this turmoil, which is somewhat of a compensation. This failure sounds like a "look up" failure that, looking up the org chain, rests on the CSA.
"The main issue was that, beyond making PowerPoint slides, there wasn’t anyone working on it. "
Brutal.