To give you an idea of what this article is about, let's start with Stack Overflow's average daily statistics. The following figures are from November 12, 2013:
- 148,084,833 HTTP requests to our load balancer
- 36,095,312 of those were page loads
- 833,992,982,627 bytes (776 GB) of HTTP traffic sent
- 286,574,644,032 bytes (267 GB) total received
- 1,125,992,557,312 bytes (1,048 GB) total sent
- 334,572,103 SQL queries (from HTTP requests only)
- 412,865,051 Redis requests
- 3,603,418 Tag Engine requests
- 558,224,585 ms (155 hours) spent running SQL queries
- 99,346,916 ms (27 hours) spent on Redis requests
- 132,384,059 ms (36 hours) spent on Tag Engine requests
- 2,728,177,045 ms (757 hours) spent in ASP.Net processing
For comparison, here are the numbers from February 9, 2016, so you can see the changes:
- HTTP requests received by load balancer: 209,420,973 (+61,336,090)
- Page loads among those: 66,294,789 (+30,199,477)
- HTTP data sent: 1,240,266,346,053 (+406,273,363,426) bytes (1.24 TB)
- Total amount of data received: 569,449,470,023 (+282,874,825,991) bytes (569 GB)
- Total amount of data sent: 3,084,303,599,266 (+1,958,311,041,954) bytes (3.08 TB)
- SQL queries (from HTTP requests only): 504,816,843 (+170,244,740)
- Redis Cache Hits: 5,831,683,114 (+5,418,818,063)
- Elastic Searches: 17,158,874 (not tracked in 2013)
- Tag Engine Requests: 3,661,134 (+57,716)
- Cumulative time consumed to run SQL queries: 607,073,066 (+48,848,481) ms (168 hours)
- Redis cache hit time consumed: 10,396,073 (-88,950,843) ms (2.8 hours)
- Time Consumed by Tag Engine Requests: 147,018,571 (+14,634,512) ms (40.8 hours)
- Time Consumed in ASP.Net Processing: 1,609,944,301 (-1,118,232,744) ms (447 hours)
- 22.71 ms (-5.29 ms) average render time across 49,180,275 question pages (19.12 ms of which in ASP.Net)
- 11.80 ms (-53.2 ms) average render time across 6,370,076 home pages (8.81 ms of which in ASP.Net)
You may be wondering about the drastic reduction in ASP.Net processing time compared to 2013 (when it was 757 hours), despite handling 61 million more requests a day. This is mainly due to the hardware upgrade we did in early 2015, as well as a lot of performance tuning inside the applications themselves. Don't forget: performance is still a selling point. If you're curious about the specific hardware, don't worry; I'll give the details of the servers that run these sites as an appendix in the next article (I'll update this link when it's up).
So what has changed in the past two years? Not much, beyond replacing some servers and network gear. Here's an overview of the servers that run the sites today (noting what's changed since 2013):
- 4 Microsoft SQL Server servers (2 of which use new hardware)
- 11 IIS web servers (new hardware)
- 2 Redis servers (new hardware)
- 3 Tag Engine servers (2 of which use new hardware)
- 3 Elasticsearch servers (same as above)
- 4 HAProxy load balancing servers (2 added to support CloudFlare)
- 2 network devices (Nexus 5596 core + 2232TM Fabric Extender, all devices upgraded to 10Gbps bandwidth)
- 2 Fortinet 800C firewalls (replaced Cisco 5525-X ASAs)
- 2 Cisco ASR-1001 routers (replaced Cisco 3945 routers)
- 2 Cisco ASR-1001-X routers (new!)
So what does it take to run Stack Overflow? Not much has changed since 2013, but thanks to the optimizations and new hardware mentioned above, we're down to needing only one web server. We have inadvertently tested this, successfully, a few times. To be clear: I said it works, not that it's a good idea. But it's fun every time it happens.
Now that we have some baseline numbers to give a sense of scale, let's look at how we make those fancy web pages. Few systems exist in complete isolation (and ours is no exception), and architecture planning means much less without a holistic view of how the pieces fit together. That's the goal here: the big picture. Many later articles will drill into specific areas. This one is just a logical overview of the key hardware; the next will cover the hardware details.
If you want to see what the hardware looks like today, here are a few photos I took of Rack A (it has a matching sister, Rack B) during the server upgrade in February 2015:
Now, let's get into the architecture layout. Here's a logical overview of the major systems:
Basic principles
Here are a few global principles, so I don't have to repeat them for every setup:
- Everything is redundant.
- All servers and network devices have at least two 10Gbps bandwidth connections.
- All servers have two power feeds via two power supplies, fed from two UPS units, which are backed by two generators and two utility feeds.
- All servers have a redundant partner between Rack A and Rack B.
- All servers and services are doubly redundant via another data center (in Colorado), though here I'm mostly talking about New York.
- Did I mention everything is redundant?
Internet
First you have to find us, and that's DNS. DNS resolution needs to be fast, so we currently farm it out to CloudFlare, because they have DNS servers closer to almost everyone around the world. We update our DNS records via an API and they take care of "hosting" DNS. But because we have deep-rooted trust issues, we still run our own DNS servers too. Should the apocalypse happen (probably caused by the GPL, Punyon, or caching) and people still want to program to take their minds off it, we'll flip those on.
Once your browser finds our hiding place, HTTP traffic comes in from one of our four ISPs (Level 3, Zayo, Cogent, and Lightower in New York) and flows into one of our four edge routers. We peer with our ISPs using BGP (a fairly standard protocol) to control the flow of traffic and provide the most efficient paths to our services. The ASR-1001 and ASR-1001-X routers are in two pairs, each serving two ISPs in active/active fashion, so there's redundancy here. Although they're all on the same physical 10Gbps bandwidth, external traffic sits in separate, isolated external VLANs that the load balancers are also connected to. After passing through the routers, traffic heads to the load balancers.
This is probably a good time to mention that we have a 10Gbps MPLS link between our two data centers, though it isn't directly involved in serving the sites. We use it for off-site replication and quick recovery in emergencies. "But Nick, there's no redundancy in that!" Well, technically you're right (the best kind of right); it is a single point of failure at that level. But wait! We also maintain two additional OSPF failover routes through our ISPs (MPLS is preferred; these are the second and third choices for cost reasons). Each of the router pairs mentioned earlier connects to the corresponding pair in Colorado to load-balance traffic in a failover. We could of course connect those two sets to each other and have four paths, but never mind, let's move on.
Load Balancing (HAProxy)
Load balancing is handled by HAProxy 1.5.15 running on CentOS 7, our preferred flavor of Linux. TLS (SSL) traffic is also terminated in HAProxy. We're keeping a close eye on HAProxy 1.7, which will bring HTTP/2 support.
Unlike the other servers, which have a dual 10Gbps LACP network link, each load balancer has two pairs of 10Gbps connections: one pair for the external network and one for the DMZ. These boxes run 64GB or more of memory to handle TLS termination more efficiently. When we can cache more TLS sessions in memory for reuse, there's less to recompute on subsequent connections from the same client, so we can resume sessions faster and more cheaply. Given how cheap RAM is, it's an easy choice.
The load balancer setup itself is fairly simple. We listen for different sites on different IPs (mostly for certificate and DNS management reasons) and route to various backends based mainly on the host header. The only notable things we do here are rate limiting and capturing some headers (sent from our web tier) into HAProxy's syslog messages, so we can record performance metrics for every request. We'll cover that in detail later.
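To make the header-capture idea a bit more concrete, here's a rough, hypothetical sketch of what the web-tier side could look like: an IIS module that times each request and stamps the duration into a response header, which the load balancer can then be configured to capture. This is only an illustration under assumed names; it is not Stack Overflow's actual code.

```csharp
using System.Diagnostics;
using System.Web;

// Hypothetical sketch: stamp each response with a timing header that the
// load balancer can capture into its syslog lines. Module and header names
// are made up for illustration. Requires the IIS integrated pipeline.
public class RequestTimingModule : IHttpModule
{
    public void Init(HttpApplication app)
    {
        app.BeginRequest += (sender, e) =>
            app.Context.Items["RequestTimer"] = Stopwatch.StartNew();

        app.PreSendRequestHeaders += (sender, e) =>
        {
            var sw = app.Context.Items["RequestTimer"] as Stopwatch;
            if (sw == null) return;
            sw.Stop();
            // HAProxy could capture this with something like:
            //   capture response header X-AspNet-Duration-Ms len 10
            app.Context.Response.Headers["X-AspNet-Duration-Ms"] =
                sw.ElapsedMilliseconds.ToString();
        };
    }

    public void Dispose() { }
}
```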
Web layer (IIS 8.5, ASP.Net MVC 5.2.3, and .Net 4.6.1)
The load balancers feed traffic to nine servers we call "primary" (01-09) and two "dev/meta" web servers (10-11, our staging environment). The primary servers run things like Stack Overflow, Careers, and all the Stack Exchange sites, except meta.stackoverflow.com and meta.stackexchange.com, which run on the latter two servers. The core Q&A application itself is multi-tenant: a single application serves requests for every Q&A site, so we could run the entire Q&A network from one application pool on one server. Other applications, such as Careers, API v2, and the Mobile API, are separate. Here's what the primary and dev tiers look like in IIS:
Here's the distribution of Stack Overflow's web tier as seen in Opserver (our internal monitoring dashboard):
And here's the resource consumption of these web servers:
I'll go into more detail in a later article about why we're so over-provisioned, with a focus on rolling builds, headroom, and redundancy.
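Before moving on to the service tier, here's a tiny illustrative sketch of the multi-tenant idea mentioned above: the app works out which Q&A site a request belongs to from the host header. The Site type and the lookup table here are hypothetical, purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Web;

// Hypothetical sketch of host-header-based multi-tenancy: one application
// serves every site, and the current site is resolved per request.
public class Site
{
    public int Id { get; set; }
    public string Hostname { get; set; }
}

public static class SiteResolver
{
    private static readonly Dictionary<string, Site> SitesByHost =
        new Dictionary<string, Site>(StringComparer.OrdinalIgnoreCase)
        {
            { "stackoverflow.com", new Site { Id = 1, Hostname = "stackoverflow.com" } },
            { "serverfault.com",   new Site { Id = 2, Hostname = "serverfault.com" } },
        };

    public static Site ResolveCurrent(HttpRequest request)
    {
        Site site;
        return SitesByHost.TryGetValue(request.Url.Host, out site) ? site : null;
    }
}
```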
Service layer (IIS, ASP.Net MVC 5.2.3, .Net 4.6.1 and HTTP.SYS)
Behind the web layer is the service layer, which also runs IIS 8.5 on Windows Server 2012 R2. This tier runs internal services that support the production web tier and other internal systems. The two big players are "Stack Server", which runs the tag engine and is based on HTTP.SYS (not behind IIS), and the Providence API (IIS-based). Fun fact: I have to set processor affinity on each of these two processes so they land on separate sockets, because the Stack Server steamrolls the L2 and L3 CPU caches when it refreshes question lists on its two-minute interval.
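For the curious, a rough sketch of pinning a process to one socket in .Net could look like the following. The core mask is a made-up example for a hypothetical two-socket box with eight logical cores per socket; it is not their actual configuration.

```csharp
using System;
using System.Diagnostics;

// Sketch: pin the current process to the first CPU socket so a cache-heavy
// neighbor doesn't evict its L2/L3 data. Mask layout is assumed, not real.
class AffinityExample
{
    static void Main()
    {
        // Bits 0-7 = first socket's logical cores, bits 8-15 = second socket's.
        long firstSocketMask = 0x00FF;

        Process current = Process.GetCurrentProcess();
        current.ProcessorAffinity = (IntPtr)firstSocketMask;

        Console.WriteLine("Affinity set to 0x{0:X}", current.ProcessorAffinity.ToInt64());
    }
}
```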
These service boxes do the heavy lifting for the tag engine and backend APIs, so they must be redundant, but not nine-times redundant. For example, loading all posts and their tags that changed in the last n minutes (currently two) from the database isn't cheap, and we'd rather not do that load nine times across the web tier; three times is plenty safe for us. These servers also get different hardware, better tuned for the computational and data-loading characteristics of the tag engine and the Elastic indexing jobs (which also run on this tier). The tag engine is itself a fairly complicated topic that will get its own article. The gist: when you visit /questions/tagged/java, you're hitting the tag engine to find out which questions match. It does all of our tag matching outside of /search, so the new navigation and pretty much everything else fetches that data through this service.
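To make the idea slightly more concrete, here's a toy sketch of what a tag-to-question index could conceptually look like in memory. The real tag engine is far more sophisticated; every name and structure here is purely illustrative.

```csharp
using System.Collections.Generic;
using System.Linq;

// Toy illustration of a tag engine lookup: an in-memory index from tag to
// question IDs, refreshed from the database every couple of minutes, with
// set intersection to answer /questions/tagged/... style queries.
public class TagIndex
{
    private readonly Dictionary<string, HashSet<int>> _questionsByTag =
        new Dictionary<string, HashSet<int>>();

    // Called by the periodic refresh that loads changed posts from SQL Server.
    public void AddOrUpdate(int questionId, IEnumerable<string> tags)
    {
        foreach (var tag in tags)
        {
            HashSet<int> set;
            if (!_questionsByTag.TryGetValue(tag, out set))
                _questionsByTag[tag] = set = new HashSet<int>();
            set.Add(questionId);
        }
    }

    // e.g. Match(new[] { "java" }) for /questions/tagged/java
    public IEnumerable<int> Match(IEnumerable<string> tags)
    {
        IEnumerable<int> result = null;
        foreach (var tag in tags)
        {
            HashSet<int> set;
            if (!_questionsByTag.TryGetValue(tag, out set)) return Enumerable.Empty<int>();
            result = result == null ? set : result.Intersect(set);
        }
        return result ?? Enumerable.Empty<int>();
    }
}
```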
Caching & Publishing/Subscribing (Redis)
We use Redis in a few places here, and it's rock solid. Despite handling roughly 160 billion operations a month, every instance sits below 2% CPU, and usually much lower:
We have an L1/L2 cache system built on Redis. "L1" is HTTP cache on the web servers or whatever application is in play; "L2" is falling back to Redis to fetch the value after an L1 miss. Our values are stored in the Protobuf format, via protobuf-net written by Marc Gravell. For the client, we use StackExchange.Redis, an in-house, open-source library. When a web server misses both L1 and L2, it fetches the value from the source (a database query, an API call, etc.) and puts the result into both the local cache and Redis. The next server wanting the same value may miss its L1 cache, but it will find it in L2/Redis, skipping the database query or API call.
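Here's a minimal sketch of that L1/L2 flow, assuming a local MemoryCache for L1 and StackExchange.Redis for L2. Serialization (protobuf-net in their case) is left out, this sketch just caches strings, and all names are illustrative, so don't read it as their actual cache code.

```csharp
using System;
using System.Runtime.Caching;
using StackExchange.Redis;

// Sketch of an L1 (local memory) / L2 (Redis) read-through cache.
public class TwoLevelCache
{
    private static readonly MemoryCache L1 = MemoryCache.Default;
    private static readonly ConnectionMultiplexer Redis =
        ConnectionMultiplexer.Connect("localhost"); // placeholder connection string

    public static string GetOrSet(string key, Func<string> fetchFromSource, TimeSpan ttl)
    {
        // L1: local memory on this web server
        var local = L1.Get(key) as string;
        if (local != null) return local;

        // L2: shared Redis cache
        var db = Redis.GetDatabase();
        string remote = db.StringGet(key);
        if (remote != null)
        {
            L1.Set(key, remote, DateTimeOffset.UtcNow.Add(ttl));
            return remote;
        }

        // Miss in both: hit the real source (database, API, ...) and populate both levels.
        var value = fetchFromSource();
        db.StringSet(key, value, ttl);
        L1.Set(key, value, DateTimeOffset.UtcNow.Add(ttl));
        return value;
    }
}
```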
We also run many Q&A sites, so each site gets its own L1/L2 cache: by key prefix in L1 and by database ID in L2/Redis. We'll go deeper on this in a future article.
Alongside the two main Redis servers (one master, one slave) that run all the site instances, we also have a machine learning instance slaved across two more dedicated servers (mostly for memory reasons). It serves things like recommended questions on the home page and better job matching. The platform is called Providence, and Kevin Montrose has written about it.
The main Redis server has 256GB of RAM (about 90GB used), and the Providence server has 384GB of memory (about 125GB used).
Redis isn't just for caching, though; it also has a publish & subscribe mechanism, where one server can publish a message and all other subscribers receive it, including downstream clients on Redis slaves. We use this mechanism to clear L1 caches on other servers when one web server removes a key, to keep the web tier consistent. But there's another big use: websockets.
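A rough sketch of that invalidation pattern with StackExchange.Redis pub/sub might look like this. The channel name and the cache hook-up are assumptions for illustration only, not the actual Stack Overflow implementation.

```csharp
using System;
using StackExchange.Redis;

// Sketch: when one server removes a cache key, publish the key name so
// every other server evicts it from its own L1 cache.
public static class CacheInvalidation
{
    private static readonly ConnectionMultiplexer Redis =
        ConnectionMultiplexer.Connect("localhost"); // placeholder

    private const string Channel = "cache-clear"; // hypothetical channel name

    // Called once at startup on every web server.
    public static void Start(Action<string> evictFromL1)
    {
        ISubscriber sub = Redis.GetSubscriber();
        sub.Subscribe(Channel, (channel, message) => evictFromL1(message));
    }

    // Called when a value is removed or changed on this server.
    public static void Remove(string key, Action<string> evictFromL1)
    {
        evictFromL1(key);                            // clear locally
        Redis.GetDatabase().KeyDelete(key);          // clear the shared L2 copy
        Redis.GetSubscriber().Publish(Channel, key); // tell everyone else
    }
}
```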
Websockets (NetGain)
We use websockets to push real-time updates to users, such as notifications in the top bar, votes, new navigation, new answers, comments, and more.
The socket servers themselves run on the web tier, using raw sockets. It's a very small application built on our open-source library, StackExchange.NetGain. At peak we have about 500,000 concurrent websocket connections, which is a lot of browsers. Fun fact: some of those browsers have been open for more than 18 months. Someone should check whether those developers are still alive. Here's what this week's websocket concurrency pattern looks like:
Why websockets? At our scale they're far more efficient than polling. We can push more data with fewer resources this way, and updates reach users more immediately. The approach isn't problem-free, though: ephemeral ports and file handle exhaustion on the load balancer are fun issues we'll cover later.
Search (Elasticsearch)
Spoiler: there's not a lot to get excited about here. The web tier does pretty vanilla searches against Elasticsearch 1.4, using our very slim, high-performance StackExchange.Elastic client. Unlike most of the rest, we have no plans to open source this one, simply because it exposes only the very small subset of the API we use. I believe releasing it would do more harm than good and would just confuse developers. We use Elastic for /search, for calculating related questions, and for suggestions when asking a question.
Each Elastic cluster (one per data center) has three nodes, and each site has its own index. Careers has a few additional indexes. The slightly non-standard part of our setup, by Elastic-community standards, is that our three-server clusters are a bit beefier than average: all-SSD storage, 192GB of RAM, and dual 10Gbps network links on each.
The same application domains in Stack Server that host the tag engine (yeah, we're in trouble with .Net Core here too) also continuously index items in Elasticsearch. We do a few tricks here, such as comparing ROWVERSION in SQL Server (the data source) against a "last position" document in Elastic. Since the rowversion behaves like a sequence, we can simply grab and index anything that changed since the last pass.
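Here's a hedged sketch of that "last position" trick, using Dapper against a hypothetical Posts table. The real indexer and the StackExchange.Elastic client aren't public, so the indexing call below is just a placeholder.

```csharp
using System.Data.SqlClient;
using System.Linq;
using Dapper;

// Sketch: compare SQL Server's rowversion column against the last value we
// indexed, and only fetch rows that changed since then. Table, column, and
// method names are hypothetical.
public class IncrementalIndexer
{
    private byte[] _lastRowVersion = new byte[8]; // persisted as a "last position" doc in Elastic

    public void IndexChanges(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            var changed = conn.Query<PostRow>(
                @"select Id, Title, Body, RowVersion
                    from Posts
                   where RowVersion > @last
                   order by RowVersion",
                new { last = _lastRowVersion }).ToList();

            foreach (var row in changed)
            {
                IndexIntoElastic(row);            // placeholder for the real indexing call
                _lastRowVersion = row.RowVersion; // advance the "last position"
            }
        }
    }

    private void IndexIntoElastic(PostRow row) { /* ... */ }

    public class PostRow
    {
        public int Id { get; set; }
        public string Title { get; set; }
        public string Body { get; set; }
        public byte[] RowVersion { get; set; }
    }
}
```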
The main reasons we use Elasticsearch instead of something like SQL full-text search are scalability and cost-effectiveness. SQL CPUs are comparatively very expensive, while Elastic is cheap and has gained a lot of features lately. Why not Solr? We need to search across the entire network (many indexes at once), and Solr didn't support that when we made the decision. The reason we're not on 2.x yet is a major change to types, which means we'd need to reindex everything to upgrade, and I just haven't had the time to plan the required changes and the migration yet.
Database (SQL Server)
We use SQL Server as our single source of truth. All data in Elastic and Redis comes from SQL Server. We run two SQL Server clusters configured with AlwaysOn Availability Groups. Each cluster has one primary in New York (taking almost all of the load) and one replica there, plus a replica in Colorado (our disaster recovery data center). All replication is asynchronous.
The first cluster is a set of Dell R720xd servers, each with 384GB of RAM, 4TB of PCIe SSD space, and two 12-core CPUs. It hosts the Stack Overflow, Sites (which is a bad name; I'll explain later), PRIZM, and Mobile databases.
The second cluster is a set of Dell R730xd servers, each with 768GB of RAM, 6TB of PCIe SSD space, and two 8-core CPUs. It hosts everything else, including Careers, Open ID, Chat, our exception log, and every other Q&A site (e.g., Super User, Server Fault, etc.).
At the database tier we like to keep CPU utilization very low, though it's a bit higher right now due to some plan cache issues we're addressing. Currently NY-SQL02 and 04 are the primaries and 01 and 03 are the replicas, which we restarted today for some SSD upgrades. Here's what the last 24 hours look like:
Our usage of SQL is pretty simple. Simple is fast. Though some individual queries can be crazy, our interaction with SQL itself is fairly vanilla. We have some legacy Linq2Sql, but all new development uses Dapper, our open-source micro-ORM that works with POCOs. Let me put it another way: Stack Overflow has exactly one stored procedure in the database, and I intend to kill that last vestige and replace it with code.
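For a flavor of that Dapper style, here's a small, illustrative example: plain parameterized SQL mapped straight onto a POCO. The schema and query are made up, not Stack Overflow's actual tables.

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;
using Dapper;

// Sketch of the Dapper approach: a POCO plus a parameterized query.
public class Question
{
    public int Id { get; set; }
    public string Title { get; set; }
    public int Score { get; set; }
}

public static class QuestionData
{
    public static List<Question> GetTopByTag(string connectionString, string tag, int count)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            // Dapper maps each row onto a Question instance; the anonymous
            // object supplies the @count and @tag parameters.
            return conn.Query<Question>(
                @"select top (@count) q.Id, q.Title, q.Score
                    from Questions q
                    join QuestionTags qt on qt.QuestionId = q.Id
                   where qt.Tag = @tag
                   order by q.Score desc",
                new { count, tag }).ToList();
        }
    }
}
```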
Libraries
Okay, let's switch gears to something that can help you more directly. I've mentioned a few of these along the way, but here's a list of many of the open-source .Net libraries we maintain and use everywhere. We open sourced them because they hold no core business value but can help developers everywhere. I hope you find them useful:
- Dapper (.Net Core) – A high-performance micro-ORM framework for ADO.Net
- StackExchange.Redis – A high-performance Redis client
- MiniProfiler – a lightweight profiler that we use on every page (also supports Ruby, Go, and Node)
- Exceptional – Error logger to SQL, JSON, MySQL, etc.
- Jil – High-performance JSON (de)serializer
- Sigil – .Net CIL Generation Helper (used when C# is not fast enough)
- NetGain – High-performance websocket server
- Opserver – Monitoring dashboard that polls most systems directly and can fetch information from Orion, Bosun, or WMI
- Bosun – Monitoring system in the background, written in Go