The road of program cultivating immortals--data structure design and high-performance visitor recording system

The road of program cultivating immortals--data structure design and high-performance visitor recording system

Ready to work overtime.....

Need points

Each user has his own personal space. When other users visit, you need to add a visitor record and update it to the latest visitor. Here is a pit. If there is a visit record for this user, you need to update the last visit of the user. time. Does this requirement have any characteristics in terms of technology? Think about it for 10 seconds, then look down! ! !

What are the design points?

  1. The user's visitor records must be cached, otherwise how to resist large concurrency?
  2. Since the latest visitor records change very quickly, there must be a data structure that can quickly add new data and delete old data.

Let s leave the chapter on caching aside today. Let s talk about the second point above, which leads to today s data structure protagonist: linked list

Linked List Encyclopedia: Linked list is a non-contiguous, non-sequential storage structure on a physical storage unit. The logical sequence of data elements is realized by the link order of pointers in the linked list. The linked list is a linear structure.

List classification

  1. Singly linked list: The elements in the linked list can only point to the next element in the linked list or are empty, and the elements cannot point to each other. It is a linear linked list.
 public class Node<T>
    {
       //
        public T Data { get; set; }
       //
        public Node<T> NextNode { get; set; }
    }
 
  1. Doubly linked list: Each linked list element has both a pointer to the next element and a pointer to the previous element. Each node has two pointers.
 public class Node<T>
    {
       //
        public Node<T> PreNode { get; set; }
       //
        public T Data { get; set; }
       //
        public Node<T> NextNode { get; set; }
    }
 
  1. Circular linked list: refers to the singly linked list and the doubly linked list, the last node of the two linked lists is pointed to the first node to realize the cycle.

characteristic

  1. The number of elements can be expanded at any time. Since the linked list is non-contiguous on the physical storage unit, this has its inherent advantages for a long time, and my node can allocate memory wherever it meets the requirements.
  2. Adding elements: Single-linked list: When inserting a new element after a position N, the singly-linked list first points the Next pointer of the element at the current position N to the new element, and then the Next pointer of the new element points to the element at the N+1 position. Of course, if you insert a new element at the first position, you only need to point the Next pointer of the new element to the first element of the linked list. Similarly, if you want to insert a new element at the end of the singly linked list, you only need to set the Next pointer of the tail element of the singly linked list. Point to the new element. As for the circular singly linked list, there is no distinction between the first element and the last element.
    Doubly linked list: Adding a new element after the position N is similar to the principle of a singly linked list, and the principle is to modify the pointer of the element. But there is a difference here. The doubly linked list needs to modify the pointers of the three Nodes of the front and back elements (N position and N+1 position) and the new element, so it is a little troublesome.
  3. Delete element: Singly linked list: When you want to delete the element at position N, you only need to point the Next pointer of the element at position N-1 to N+1.
    Doubly linked list: When you want to delete the element at position N, you need to modify the Next pointer of the element at N-1 to point to the N+1 element, and at the same time modify the Pre pointer of the element at N+1 to point to the N-1 element.
  4. Finding elements: Since the elements of the linked list are not continuous in memory, they cannot have O(1) lookup time complexity like an array. You can only traverse the linked list through the first element, so the time complexity is O(n)

programming

Give you 10 seconds to return to X total demand. Through the introduction of linked lists, which linked list should we choose? Here I first talk about my thinking, if there are any errors, please correct me:

  1. When a visitor enters the homepage of the personal space, in most cases, only the first 100 or 200 visitor records need to be cached, which means that there is hot data in this scenario, and 80% (or even higher) requests are hit In the last 100 visitor data, few people will look at the records long ago. So based on the memory space consideration, I decided to cache the last 100 visitor data.
  2. Suppose I cache the first 100 pieces of data with a linked list. Among them, there is a record of visitor A in the non-first position. At this time, A visits the user space again, and I need to move the record of A to the first position. This process has gone through deleting A. Data, add A data in the first position. If the starting position of A is N, when I delete the data of N position, I need to find the position element of N-1 to modify its pointer. If it is a singly linked list, because there is no information about the element of position N-1 in the element of the current position N, All need to re-traverse the linked list. If it is a doubly linked list, the element at position N stores the element at position N-1, so there is no need to re-traverse the linked list. This is also the advantage of doubly linked lists compared to singly linked lists, although the memory occupies one more pointer memory size , But it is more commonly used in actual application scenarios. So I choose a doubly linked list. The time complexity of delete operation and add operation are both O(1).
  3. Access to the same space inevitably has the problem of locking and multithreading. So when I chose the framework, I preferred the framework based on the Actor model. Avoid the operation of locking on the same user space.
  4. Because of the framework based on the Actor model, I did not use an out-of-process cache like Redis, but an in-process cache. After all, the speed of network transmission is much slower than memory operations. The Actor service at the application layer naturally supports distributed. If you don t know much about actors, you can save your mother.

optimization

  1. Do you feel that something is wrong after reading this? Yes, it is the search of linked list elements. Since it can only be traversed, the time complexity of all linked list search elements is O(n). Is there any way to optimize it? That is another data structure we will talk about later.
  2. The visitor records of the space are arranged in reverse order of the time dimension, so the design type of the business and DB time column is recommended to be UTC timestamp long type. After all, the long type occupies much less memory than the datetime type in most languages.
  3. Regardless of whether the cache is used, the user's access records need to be persisted by the DB. When there are a large number of requests, we can use a mechanism to persist to the DB in batches instead of accessing the database once per request.
  4. When the real-time requirements for the visitor record of the space are not very high, we can update the cache every 10 seconds or 5 seconds, that is, update the cache in batches, which is better than a single lock update cache.

X s total personal space needs are not over, and Cai Cai is still being optimized. Welcome to correct me!


Add attention, view more exquisite versions, and get more exciting