
How we built an auto-scalable Minecraft server for 1000+ players using WorldQL's spatial database


Raising Minecraft world capacity by spreading players across multiple synchronized server processes

Minecraft's multiplayer performance problems

Minecraft's server software is single-threaded, meaning it must process all events in the world sequentially on a single CPU core. Even on the most powerful computers, a standard Minecraft server will struggle to keep up with over 200 players. Too many players attempting to load too much of the world will cause the server tick rate to plummet to unplayable levels. YouTuber SalC1 made a video talking about this issue which has garnered nearly a million views.

Back at the beginning of the 2020 quarantine I became interested in the idea of a supermassive Minecraft server, one with thousands of players unimpeded by lag. This was not possible at the time due to the limitations of Minecraft's server software, so I decided to build a way to share player load across multiple server processes. I named this project "Mammoth".

My first attempt involved slicing the world into 1024-block-wide segments which were "owned" by different servers. Areas near the borders were synchronized, and ridden entities such as horses or boats would be transferred across servers. Here's a video on how it worked. This technique is no longer used; the Minecraft world is no longer sliced up by area.
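
To make the slicing idea concrete, here is a minimal sketch of how a block's x coordinate could map to an owning server. The round-robin assignment and the name `owning_server` are assumptions for illustration; the article does not say how segments were assigned to servers.

```python
SEGMENT_WIDTH = 1024  # blocks per segment, the figure given in the article


def owning_server(x: int, server_count: int) -> int:
    """Return the index of the (hypothetical) server owning the segment that contains block x."""
    segment = x // SEGMENT_WIDTH  # floor division keeps negative coordinates consistent
    return segment % server_count  # assumed round-robin assignment of segments to servers
```

With four servers, blocks 0 through 1023 land on server 0 and block 1024 crosses into server 1, which is where the jarring reconnect described below would occur.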

It was a neat proof-of-concept, but it had some pretty serious issues. Players couldn't see each other across servers or interact. There was a jarring reconnect whenever crossing server borders. If one server was knocked offline, certain regions of the world became completely inaccessible. It had no way to mitigate lots of players in one area, meaning large-scale PvP was impossible. The experience simply wasn't great.

To actually solve the problem, something more robust was needed. I set the following goals:

  • Players must be able to see each other, even if on different server processes.
  • Players must be able to engage in combat across servers.
  • When a player places a block or updates a sign, it should be immediately visible to all other players.
  • If one server is down, the entire world should still be accessible.
  • If needed, servers can be added or removed at will to adapt to the number of players.

To accomplish this, the world state needed to be stored in a central database and served to Minecraft servers as they popped in and out of existence. There also needed to be a message-passing backend that allowed player movement packets to be forwarded between servers for cross-server visibility.

WorldQL is created

While early versions of Mammoth used Redis, I had some new requirements for my message-passing and data storage backend:

  • Fast messaging based on proximity, so I could send the right updates to the right Minecraft servers (which in turn send them to player clients)
  • An efficient way to store and retrieve permanent world changes
  • Real-time object tracking

I couldn't find any existing product with these qualities. I found incomplete attempts to use SpatialOS for Minecraft scaling, and I considered using it for this project. However, their license turned me off.

To meet these requirements, I started work on WorldQL. It's a real-time, scriptable spatial database built for multiplayer games. WorldQL can replace traditional game servers or be used to load balance existing ones.

If you're a game developer or this just sounds interesting to you, please be sure to join our Discord server.

The new version of Mammoth uses WorldQL to store all permanent world changes and pass real-time player information (such as location) between servers. Minecraft game servers communicate with WorldQL using ZeroMQ TCP push/pull sockets.

Mammoth's architecture

Mammoth has three parts:

  1. Two or more Minecraft server hosts running Spigot-based server software
  2. WorldQL server
  3. BungeeCord proxy server (optional)

WorldQL architecture diagram

With this setup, a player can connect to any of the Minecraft servers and receive the same world and player data. Optionally, a server admin can choose to put the Minecraft servers behind a proxy, so they all share a single external IP/port.

Part 1: Synchronizing player positions

To broadcast player movement between servers, Mammoth uses WorldQL's location-based pub/sub messaging. This is a simple two-step process:

  1. Minecraft servers continuously report their players' locations to the WorldQL server.
  2. Servers receive update messages about players in locations they have loaded.

Here's a video demo showing two players viewing and punching each other, despite being on different servers!

The two Minecraft servers exchange real-time movement and combat events through WorldQL. For example, when Left Player moves in front of Right Player:

  1. Left Player's Minecraft server sends an event containing their new location to WorldQL.
  2. Because Left Player is near Right Player, WorldQL sends a message to Right Player's server.
  3. Right Player's server receives the message and generates client-bound packets to make Left Player appear.
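
The three steps above can be sketched as a tiny location-based pub/sub broker. This is not WorldQL's API: the class `SpatialBroker`, the `CELL_SIZE` grid, and the message shape are all hypothetical, chosen only to show how a position update can be routed to the servers that have the surrounding area loaded.

```python
from collections import defaultdict

CELL_SIZE = 16  # hypothetical edge length of a subscription cell, in blocks


def cell_of(x, z):
    """Map a world position to the grid cell that contains it."""
    return (int(x // CELL_SIZE), int(z // CELL_SIZE))


class SpatialBroker:
    def __init__(self):
        self._subscribers = defaultdict(set)  # cell -> ids of servers with that cell loaded

    def subscribe(self, server_id, cell):
        self._subscribers[cell].add(server_id)

    def publish_position(self, sender, player, x, z):
        """Return {server_id: message} for every other server subscribed to the player's cell."""
        message = {"player": player, "x": x, "z": z}
        targets = self._subscribers.get(cell_of(x, z), set()) - {sender}
        return {server: message for server in targets}
```

A server that receives such a message would then generate the client-bound packets itself, as in step 3.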

Part 2: Synchronizing blocks and the world

Mammoth tracks the authoritative version of the Minecraft world using WorldQL Records, a data structure designed for permanent world alterations. In Mammoth, no single Minecraft server is responsible for storing the world. All block changes from the base seed are centrally stored in WorldQL. These changes are indexed by chunk coordinate and time, so a Minecraft server can request only the updates it needs since it last synced a chunk.
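
As a rough illustration of "indexed by chunk coordinate and time", the sketch below keeps a per-chunk, time-ordered change list and answers "what changed since my last sync?". The class `ChunkChangeLog` and its methods are hypothetical and are not WorldQL Records; they only show the shape of the query.

```python
import bisect
from collections import defaultdict


class ChunkChangeLog:
    def __init__(self):
        self._times = defaultdict(list)    # chunk -> sorted timestamps
        self._changes = defaultdict(list)  # chunk -> changes, in the same order

    def record(self, chunk, timestamp, change):
        """Store a change for a chunk, keeping the per-chunk list ordered by time."""
        i = bisect.bisect_right(self._times[chunk], timestamp)
        self._times[chunk].insert(i, timestamp)
        self._changes[chunk].insert(i, change)

    def changes_since(self, chunk, last_sync):
        """Return the changes to `chunk` recorded strictly after `last_sync`."""
        times = self._times.get(chunk, [])
        i = bisect.bisect_right(times, last_sync)
        return self._changes.get(chunk, [])[i:]
```

A freshly created server would call `changes_since` with a timestamp of zero for each chunk it loads, which corresponds to the "catch up" behavior described below.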

Here's a video demonstrating real-time block synchronization between two servers. Complexities such as sign edits, compound blocks (like beds and doors) and nether portal creation all work properly.

When a new Minecraft server is created, it "catches up" with the current version of the world. Prior to recording the video below, I built a cute desert home then completely deleted my Minecraft server's world files. It was able to quickly sync the world from WorldQL. Normally this happens automatically, but I triggered it using Mammoth's /refreshworld command so I can show you.

This feature allows a Minecraft server to dynamically auto-scale; server instances can be created and destroyed to match demand.

Mammoth's world synchronization is incomplete for the latest 1.17.1 update. We're planning to introduce redstone, hostile mob, and weapon support ASAP.

Performance gains

While still a work in progress, Mammoth offers considerable performance benefits over standard Minecraft servers. It's particularly good for handling very high player counts.

Here's a demonstration showcasing 1000 cross-server players. This simulation is functionally identical to real cross-server player load. The server TPS never dips below 20 (perfect), and I'm running the whole thing on my laptop.

These simulated players are created by a loopback process which:

  1. Receives WorldQL player movement queries.
  2. Modifies their location and name 1000 times and sends them back to the server.
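
A loopback process of this kind could look roughly like the sketch below. The message fields (`name`, `x`) and the fixed `spacing` offset are assumptions; the article only says that the location and name are modified 1000 times.

```python
def make_copycats(message, count=1000, spacing=2.0):
    """Return `count` copies of a movement message, each renamed and shifted along x."""
    clones = []
    for i in range(count):
        clone = dict(message)                             # shallow copy, original untouched
        clone["name"] = f"{message['name']}_{i}"          # unique name per simulated player
        clone["x"] = message["x"] + (i + 1) * spacing     # hypothetical offset per clone
        clones.append(clone)
    return clones
```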

This stress test results in the player seeing a wall of copycats:

Mammoth pushes Minecraft server performance further than ever and will enable entirely new massively-multiplayer experiences. Keep in mind that this demo exists only to show off the efficiency of the message broker and packet code; it is not as stressful as 1000 real players connecting. Stay tuned for a demo featuring actual human player load.

Coming soon: Program entire Minecraft mini-games inside WorldQL using JavaScript

Powered by the V8 JavaScript engine, WorldQL's scripting environment allows you to develop Minecraft mini-games without compiling your own server plugin. This means you don't have to restart or reload your server with every code change, allowing you to develop fast.

As an added bonus, every Minecraft mini-game you write will be scalable across multiple servers, just like our "vanilla" experience.

The process of developing Minecraft mini-games using WorldQL is very similar to using WorldQL to develop multiplayer for stand-alone titles. If you're interested in trying it out when it's ready, be sure to join our Discord to get updates first.

Conclusions

Thanks for reading this article! Feel free to check out our GitHub repository for the Mammoth Minecraft server plugin and join WorldQL's Discord!


How does FaceTime Work?


As an ex-pat living in Denmark, I use FaceTime audio a lot. Not only is it simple to use and reliable, but the sound quality is incredible. For those of you old enough to remember landlines, it reminds me of those but if you had a good headset. When we all switched to cell service audio quality took a huge hit and with modern VoIP home phones the problem hasn't gotten better. So when my mom and I chat over FaceTime Audio and the quality is so good it is like she is in the room with me, it really stands out compared to my many other phone calls in the course of a week.

So how does Apple do this? As someone who has worked as a systems administrator for their entire career, the technical challenges are kind of immense when you think about them. We need to establish a connection between two devices through various levels of networking abstraction, both at the ISP level and home level. This connection needs to be secure, reliable enough to maintain a conversation and also low bandwidth enough to be feasible given modern cellular data limits and home internet data caps. All of this needs to run on a device with a very impressive CPU but limited battery capacity.

What do we know about FaceTime?

A lot of our best information for how FaceTime worked (past tense is important here) is from interested parties around the time the feature was announced, so around the 2010 timeframe. During this period there was a lot of good packet capture work done by interested parties and we got a sense for how the protocol functioned. For those who have worked in VoIP technologies in their career, it's going to look pretty similar to what you may have seen before (with some Apple twists). Here were the steps to a FaceTime call around 2010:

  • A TCP connection over port 5223 is established with an Apple server. We know that 5223 is used by a lot of things, but for Apple it's used for their push notification services. Interestingly, it is ALSO used for XMPP connections, which will come up later.
  • UDP traffic flows between the iOS device and Apple servers on ports 16385 and 16386. These ports might be familiar to those of you who have worked with firewalls. They are associated with audio and video RTP, which makes sense. RTP, or Real-time Transport Protocol, was designed to facilitate video and audio communications over the internet with low latency.
  • RTP relies on something else to establish a session, and in Apple's case it appears to rely on XMPP. This XMPP connection relies on a client certificate on the device issued by Apple. This is why non-iOS devices cannot use FaceTime: even if they could reverse engineer the connection, they don't have the certificate.
  • Apple uses ICE, STUN and TURN to negotiate a way for these two devices to communicate directly with each other. These are common tools used to negotiate peer-to-peer connections across NAT so that devices without public IP addresses can still talk to each other.
  • The device itself is identified by registering either a phone number or email address with Apple's server. This, along with STUN information, is how Apple knows how to connect the two devices. STUN, or Session Traversal Utilities for NAT, is a protocol in which a device reaches out to a publicly available server and the server determines how this client can be reached.
  • At the end of all of this negotiation and network traversal, a SIP INVITE message is sent. This has the name of the person along with the bandwidth requirements and call parameters.
  • Once the call is established there is a series of SIP MESSAGE packets that are likely used to authenticate the devices. Then the actual connection is established and FaceTime's protocols take over using the UDP ports discussed before.
  • Finally, the call is terminated using SIP when it is concluded. The assumption I'm making is that for FaceTime audio vs video the difference is minor, the primary distinction being the codec used for audio, AAC-ELD. There is nothing magical about Apple using this codec but it is widely seen as an excellent choice.
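
To give the STUN step in the list above some shape, here is a minimal sketch that builds the 20-byte header of a STUN Binding Request in the format defined by RFC 5389. It is an assumption that this modern format is a useful stand-in; the captures from around 2010 may well have used an older STUN variant, and nothing here is specific to FaceTime.

```python
import os
import struct

STUN_BINDING_REQUEST = 0x0001   # message type for a Binding Request
STUN_MAGIC_COOKIE = 0x2112A442  # fixed value defined by RFC 5389


def build_binding_request(transaction_id=None):
    """Build a STUN Binding Request header with no attributes (20 bytes)."""
    if transaction_id is None:
        transaction_id = os.urandom(12)  # random 96-bit transaction ID
    if len(transaction_id) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    # type (2 bytes), message length (2 bytes, 0 = no attributes), magic cookie (4 bytes)
    header = struct.pack("!HHI", STUN_BINDING_REQUEST, 0, STUN_MAGIC_COOKIE)
    return header + transaction_id
```

A client would send these bytes over UDP to a STUN server and read its own public address out of the response, which is the "server determines how this client can be reached" part.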

That was how the process worked. But we know that in the later years Apple changed FaceTime, adding more functionality and presumably more capacity. According to their port requirements these are the ones required now. I've added what I suspect they are used for.

Port                         Likely reason
80 (TCP)                     Unclear, but possibly XMPP, since it uses these as backups
443 (TCP)                    Same as above, since they are never blocked
3478 through 3497 (UDP)      STUN
5223 (TCP)                   APN/XMPP
16384 through 16387 (UDP)    Audio/video RTP
16393 through 16402 (UDP)    FaceTime exclusive
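
If you are reading a packet capture or firewall log, the table above can be turned into a small lookup. This is only a sketch: the purposes are the author's guesses from the table, and the names `PORT_TABLE` and `likely_purpose` are made up for illustration.

```python
# (protocol, port range, likely purpose) taken from the table; the purposes are guesses
PORT_TABLE = [
    ("tcp", range(80, 81), "unclear, possibly XMPP fallback"),
    ("tcp", range(443, 444), "unclear, possibly XMPP fallback"),
    ("udp", range(3478, 3498), "STUN"),
    ("tcp", range(5223, 5224), "APN/XMPP"),
    ("udp", range(16384, 16388), "audio/video RTP"),
    ("udp", range(16393, 16403), "FaceTime exclusive"),
]


def likely_purpose(protocol, port):
    """Return the suspected purpose of a port, or None if it is not in the table."""
    for proto, ports, purpose in PORT_TABLE:
        if proto == protocol.lower() and port in ports:
            return purpose
    return None
```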

Video and Audio Quality

A FaceTime video call consists of 4 media streams. The audio is AAC-ELD as described above, with an observed 68 kbps consumed in each direction (or about 136 kbps in total, give or take). Video is H.264 and varies quite a bit in quality, presumably depending on whatever bandwidth calculations were passed through SIP. We know that SIP has allowances for H.264 information about total consumed bandwidth, although the specifics of how FaceTime does on-the-fly calculations for what capacity is available to a consumer are still unknown to me.

You can observe this behavior by switching from cellular to wifi during a video call, where video compression is often visible during the switch (but interestingly the call doesn't drop, a testament to effective network interface handoff inside of iOS). With audio calls, however, this behavior is not replicated: the call either maintains roughly the same quality or drops entirely, suggesting less flexibility (which makes sense given the much lower bandwidth requirements).
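
To put the audio figure in terms of a data cap, here is a quick back-of-the-envelope calculation. It assumes the observed 68 kbps per direction holds for the whole call and counts both directions against the cap; the function name is made up for illustration.

```python
AUDIO_KBPS_PER_DIRECTION = 68  # observed figure quoted in the article


def audio_data_used_mb(minutes, kbps_per_direction=AUDIO_KBPS_PER_DIRECTION):
    """Approximate data used by both directions of an audio call, in megabytes (10^6 bytes)."""
    total_kbps = 2 * kbps_per_direction          # upload plus download
    bits = total_kbps * 1000 * minutes * 60      # kilobits per second -> bits over the call
    return bits / 8 / 1_000_000                  # bits -> bytes -> megabytes
```

Under those assumptions a one-minute call uses about 1 MB and an hour-long call about 61 MB.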

So does FaceTime still work like this?

I think a lot of it is still true, but I wasn't entirely sure whether the XMPP component is still there. After more reading, however, I believe this is still how it works, and indeed how a lot of Apple's iOS infrastructure works. While Apple doesn't have a lot of documentation available about the internals of FaceTime, one document that stood out to me was the security document. You can find that document here.

FaceTime is Apple’s video and audio calling service. Like iMessage, FaceTime calls use the Apple Push Notification service (APNs) to establish an initial connection to the user’s registered devices. The audio/video contents of FaceTime calls are protected by end-to-end encryption, so no one but the sender and receiver can access them. Apple can’t decrypt the data.

So we know that port 5223 (TCP) is used by both Apple's push notification service and XMPP over SSL. We know from older packet dumps that Apple used to use 5223 to establish a connection to their own Jabber servers as the initial starting point of the entire process. My suspicion here is that Apple's push notifications work similarly to a normal XMPP pubsub setup.

  • Apple kind of says as much in their docs here.

This is interesting because it suggests the underlying technology for a lot of Apple's backend is XMPP, surprising because for most of us XMPP is thought of as an older, less used technology. As discussed later I'm not sure if this is XMPP or just uses the same port. Alright so messages are exchanged, but how about the key sharing? These communications are encrypted, but I'm not uploading or sharing public keys (nor do I seem to have any sort of access to said keys).

Keys? I'm lost, I thought we were talking about calls

One of Apple's big selling points is security, and iMessage became famous for being an encrypted text message exchange. Traditional SMS was not encrypted, and neither was most other text-based communication, including email. Encryption is computationally expensive and wasn't seen as a high priority until Apple really made it a large part of the conversation for text communication. But why hasn't encryption been a bigger part of the consumer computer ecosystem?

In short: because managing keys sucks ass. If I want to send an encrypted message to you I need to first know your public key. Then I can encrypt the body of a message and you can decrypt it. Traditionally this process is super manual and frankly, pretty shitty.

So Apple must have some way of generating the keys (presumably on device) and then sharing the public keys. They do: a service called IDS, or Apple Identity Service. This is what links up your phone number or email address to the public key for that device.
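
Conceptually, a directory like this is a lookup from an address to the public keys of the devices registered under it. The sketch below is a toy stand-in, not Apple's IDS: the class `IdentityDirectory`, its methods, and the placeholder key bytes are all hypothetical.

```python
from collections import defaultdict


class IdentityDirectory:
    """Toy key directory: maps a phone number or email address to per-device public keys."""

    def __init__(self):
        self._devices = defaultdict(dict)  # address -> {device_id: public_key}

    def register(self, address, device_id, public_key):
        """Record the public key a device generated for this address."""
        self._devices[address.lower()][device_id] = public_key

    def lookup(self, address):
        """Return {device_id: public_key} for every device registered to the address."""
        return dict(self._devices.get(address.lower(), {}))
```

A sender would call `lookup` before encrypting, which removes the manual key exchange described above but also means the sender has to trust whatever keys the directory returns.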

Apple has a nice little diagram explaining the flow:

As far as I can tell the process is much the same for FaceTime calls as it is for iMessage but with some nuance for the audio/video channels. The certificates are used to establish a shared secret and the actual media is streamed over SRTP.

Someone at Apple read the SSL book

Alright so SIP itself has a mechanism for how to handle encryption, but FaceTime and iMessage work on devices going all the way back to the iPhone 4. So the principle makes sense, but then I don't understand why we don't see tons of iMessage clones for Android. If there are billions of Apple devices floating around and most of this relies on local client-side negotiation, isn't there a way to fake it?

Alright so this is where it gets a bit strange. There's a defined way of sending client certificates as outlined in RFC 5246. It appears Apple used to do this but they have changed their process. Now it's sent through the application, along with a public token, a nonce and a signature. We're gonna focus on the token and the certificate for a moment.

Token

Certificate

  • Generated on device APN activation
  • Certificate request sent to albert.apple.com
  • Uses two TLS extensions, ALPN and Server Name Indication (SNI)

So why don't I have a bunch of great Android apps able to send this stuff?

As near as I can tell, the primary issue is two-fold. First, the protocol to establish the connection isn't standard. Apple uses ALPN to handle the negotiation, and the client uses a protocol called apns-pack-v1. So if you wanted to write your own application to interface with Apple's servers, you would first need to get the x509 client certificate (which seems to be generated at the time of activation). You would then need to be able to establish a connection to the server using ALPN and passing the server name, which I don't know if Android supports. You also can't just generate this one time, as Apple only allows each device one connection. So if you made an app using values taken from a real Mac or iOS device, I think it would just cause the actual Apple device to drop. If your Mac connected, then the fake device would drop.

But how do Hackintoshes work? For those that don't know, these are normal x86 computers running macOS. Presumably they would have the required extensions to establish these connections and would also be able to generate the required certificates. This is where it gets a little strange. It appears the Mac's serial number is a crucial part of how this process functions, presumably passing some check on Apple's side to figure out "should this device be allowed to initiate a connection".
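
For readers unfamiliar with the two TLS extensions involved, ALPN (Application-Layer Protocol Negotiation) and SNI (the server name), the sketch below shows how a generic TLS client offers an ALPN protocol name and sends a server name using Python's standard `ssl` module. It is a general illustration only: the function names are made up, and without the Apple-issued client certificate described above a connection like this would not be expected to get anywhere useful.

```python
import socket
import ssl


def build_client_context(alpn_protocol="apns-pack-v1"):
    """Create a TLS client context that offers a single ALPN protocol name."""
    context = ssl.create_default_context()      # verifies the server certificate by default
    context.set_alpn_protocols([alpn_protocol])  # protocol name the client proposes via ALPN
    return context


def negotiated_protocol(host, port, context, timeout=5.0):
    """Connect, sending `host` as SNI, and return the ALPN protocol the server chose (or None)."""
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()
```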

The way to do this is by generating fake Mac serial numbers as outlined here. The process seems pretty fraught, relying on a couple of factors. First the Apple ID seems to need to be activated through some other device and apparently age of the ID matters. This is likely some sort of weight system to keep the process from getting flooded with fake requests. However it seems before Apple completes the registration process it looks at the plist of the device and attempts to determine "is this a real Apple device".

Apple device serial numbers are not random values though, they are actually a pretty interesting data format that packs in a lot of info. Presumably this was done to make service easier, allowing the AppleCare website and Apple Stores a way to very quickly determine model and age without having to check with some "master Apple serial number server". You can check out the old Apple serial number format here: link.

This ability to brute force new serial numbers is, I suspect, behind the decision by Apple to change the format of the serial number. By switching from a value that can be generated to a totally random value that varies in length, I assume Apple will be able to say with a much higher degree of certainty that "yes this is a MacBook Pro with x serial number" by doing a lookup on an internal database. This would make generating fake serial numbers for these generations of devices virtually impossible, since you would need to get incredibly lucky with both model, MAC address information, logic board ID and serial number.

How secure is all this?

It's as secure as Apple, for all the good and the bad that suggests. Apple is entirely in control of enrollment, token generation, certificate verification and exchange, along with the TLS handshake process. The inability for users to provide their own keys for encryption isn't surprising (this is Apple, and uploading public keys for users doesn't seem on-brand for them), but I was surprised that there isn't any way for me to display a user's key. This would seem like a logical safeguard against man-in-the-middle attacks.

So if Apple wanted to enroll another email address and associate it with an Apple ID and allow it to receive the APN notifications for FaceTime/receive a call, there isn't anything I can see that would stop them from doing that. I'm not suggesting they do or would, simply that it seems technically feasible (since we already know multiple devices receive a FaceTime call at the same time and the enrollment of a new target for a notification depends more on the particular URI for that piece of the Apple ID be it phone number or email address).

So is this all XMPP or not?

I'm not entirely sure. The port is the same and there are some similarities in terms of message subscription, but the large amount of modification to handle the actual transfer of messages tells me if this is XMPP behind the scenes now, it has been heavily modified. I suspect the original design may have been something closer to stock but over the years Apple has made substantial changes to how the secret sauce all works.

To me it still looks a lot like how I would expect this to function, with a massive distributed message queue. You connect to a random APN server, rand(0,255)-courier.push.apple.com, initiate a TLS handshake, and then messages are pushed to your device as identified by your token. Presumably at Apple's scale of billions of messages flowing at all times, the process is more complicated on the back end, but I suspect a lot of the concepts are similar.
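
The "connect to a random server" step is simple enough to write down. This sketch just builds a hostname of the form given above; the function name is hypothetical and how real clients choose a courier is an assumption.

```python
import random


def random_courier_host(rng=random):
    """Pick one of the numbered courier hostnames in the form the article describes."""
    return f"{rng.randint(0, 255)}-courier.push.apple.com"
```

Spreading clients across numbered hostnames is a common way to balance load with nothing more than DNS.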

Conclusion

FaceTime is a great service that seems to rely on a very well understood and battle-tested part of the Apple ecosystem: their push notification service along with their Apple ID registration service. This process, which is also used by non-Apple applications to receive notifications, allows individual devices to quickly negotiate a client certificate, initiate a secure connection, use normal networking protocols to allow Apple to assist them with bypassing NAT, and then establish a connection between devices using standard SIP. The quality is the result of Apple licensing good codecs and making devices capable of taking advantage of those codecs.

FaceTime and iMessage are linked together along with the rest of the Apple ID services, allowing users to register a phone number or email address as a unique destination.

Still a lot we don't know

I am confident a lot of this is wrong or out of date. It is difficult to get more information about this process, even with running some commands locally. I would love any additional information folks would be willing to share or to point me towards articles or documents I should read.


Picking a cable is about more than its shape: a guide to the common USB and Thunderbolt protocols - 少数派 (sspai)


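In March of this year, the USB-IF (the body that standardizes USB) announced the USB 4 specification. That should have been good news, but once I saw the actual transfer-rate figures, I had a feeling that consumers would once again get burned when shopping for USB 4 devices and cables.

There are two main problems behind this. One is how the USB 4 specification sets its speed tiers. The other is something people already run into today: different protocols can sit behind a Type-C port, and a given cable may or may not fully support them. When shopping for USB 3 and Thunderbolt 3 (TB 3 below) cables or devices, you may find that expensive gear does not work at all, or does not match the advertised features, and when you try to work out why, the tangled relationship between the two standards is confusing.

In fact, the shape of a USB connector has no fixed relationship to the protocol running behind it.

I will cover both topics in detail below, and I hope this article helps if you are shopping for USB 3 and TB 3 cables and devices.

Note: at the time of writing, the USB-IF has announced only some of USB 4's features, and it may add more in the future.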

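USB Type-C really is just a connector shape

USB connectors have always come in many shapes while the protocols are comparatively few (and the USB-IF likes to rename protocols it has already published, which comes up again below), so ordinary consumers get the impression that each connector corresponds to one protocol.

The figure below shows the connector types for USB 2 and USB 3. Under either protocol, besides the familiar Type-A, Micro-B and Type-C connectors, there are many others that you rarely see in daily life. It is also worth noting that, apart from the color, USB 2 and USB 3 Type-A ports have almost exactly the same shape.

Both points show that a USB connector's shape and the protocol behind it have always been two unrelated things.

The same holds for Type-C. USB Type-C is a connector form factor and cable specification; the protocol it carries can be USB 3, USB 2, the upcoming USB 4, or TB 3.

I separate connector shape from protocol up front because I hope that when you buy hardware you will not look only at the connector, but will pay close attention to the many protocols that can sit behind a Type-C port, so that you do not buy the wrong thing by going on appearance alone.

With connectors covered, we move on to the protocols themselves. TB 3 and USB differ in their transfer protocols, so I will go through them one use at a time.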

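The USB protocols we use all the time

Data transfer

USB's main job is moving data, and the essential difference between USB standards is transfer speed. Although the USB-IF keeps giving old protocols new names, which makes USB naming a mess, the table I put together shows that within a single naming series the rule "bigger number, faster transfer" still holds.

When buying, note that some sellers list only "USB 3.1" or "USB 3.2" without the Gen 1 or Gen 2 qualifier, and that qualifier is exactly what we need to look at.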
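
The renaming problem is easiest to see as a lookup table. The speeds below are the nominal signaling rates from the published USB-IF naming, not values taken from this article's own table, and the names `USB_SPEEDS_GBPS` and `speed_gbps` are made up for illustration.

```python
# Nominal signaling rates in Gbps; several names refer to the same speed
USB_SPEEDS_GBPS = {
    "USB 2.0": 0.48,
    "USB 3.0": 5, "USB 3.1 Gen 1": 5, "USB 3.2 Gen 1": 5,
    "USB 3.1 Gen 2": 10, "USB 3.2 Gen 2": 10,
    "USB 3.2 Gen 2x2": 20,
}


def speed_gbps(label):
    """Return the nominal rate for a full label, or None when the label is ambiguous or unknown."""
    return USB_SPEEDS_GBPS.get(label)
```

A bare "USB 3.1" or "USB 3.2" returns `None` here, which is exactly the ambiguity to watch for on a product page.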

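Display output

USB can also carry display signals, but only over a Type-C connector. Type-C has the most pins of any USB connector and can enter a state called Alternate Mode (Alt Mode), which display output depends on.

Problems such as a screen that will not light up mostly arise because Alt Modes are optional for vendors, not mandatory. In other words, not every USB-C device has to support every Alt Mode. When buying, make sure the Alt Modes your device supports are compatible with those the cable supports.

As of publication, the USB-IF has defined the following Alt Modes for display output:

Besides display output, Alt Mode will also support Ethernet in the future.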

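Audio

USB also supports audio transfer protocols, letting devices such as microphones, speakers, headphones, phones and musical instruments pass an audio stream directly to other devices using analog signals. Older versions of these protocols work across platforms without any driver, and the driverless USB sound cards we commonly see generally use them.

In addition, with USB Type-C a player can send the digital audio signal directly to supported headphones or other audio output devices. This mode keeps interference during transmission to a minimum and preserves quality, and it is mainly used by phones that have no 3.5mm audio jack.

In daily use, pay attention to which audio protocol a device such as a phone uses for output. If your phone supports only digital audio output, the headphones or adapter must contain a DAC (digital-to-analog converter, which turns the digital signal into an analog one) and an operational amplifier (which amplifies the sound); otherwise you will hear nothing. Such phones include, but are not limited to, the Google Pixel 2, HTC U11, Essential Phone and Razer Phone.

USB Type-C also offers a separate analog mode for audio, Audio Adapter Accessory Mode. This mode converts the digital signal to analog and exposes an analog audio interface with left, right and ground outputs plus one microphone input.

For maximum compatibility, the USB-IF requires headphones that use a USB Type-C plug as their input to support both the digital audio mode and Audio Adapter Accessory Mode.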

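Power delivery

Besides moving data, charging is probably what we use USB for most in daily life. How do you charge both quickly and safely?

USB has its own standard power-supply protocol, which has now been folded into USB Power Delivery. USB Power Delivery is currently at version 3.0 (PD 3.0 below). PD 3.0 supports four voltage levels, 5V, 9V, 15V and 20V, with up to 100W of output. See the figure below for the power and current available at each voltage level (note that micro-B supports only PD 1.0, with a maximum of 60W):

When powering laptops and other high-power devices that need 60 to 100W, always use a cable rated to carry 5A; otherwise you risk hardware damage or even personal injury.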
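
The 5A requirement follows directly from power = voltage × current. The sketch below checks whether a cable's current rating covers a target wattage at the 20V level; it is simple arithmetic, and the real PD rules (for example, which currents are allowed at which voltages) are more detailed than this. The function names are made up for illustration.

```python
def power_watts(volts, amps):
    """Electrical power in watts: P = V * I."""
    return volts * amps


def cable_supports(target_watts, cable_amps, volts=20):
    """Return True if a cable rated for `cable_amps` can carry `target_watts` at `volts`."""
    return power_watts(volts, cable_amps) >= target_watts
```

At 20V a 3A cable tops out at 60W, so anything between 60W and 100W needs the 5A rating.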

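On phones, the devices Apple currently sells already support PD charging. Android vendors use all sorts of fast-charging protocols, but fortunately USB PD is now compatible with some of the industry's other fast-charging standards (such as QC 4.0), and Google has made it mandatory that "new devices with a USB Type-C port released after 2019 must be compatible with USB-C PD" (source). One day, I expect, we will all enjoy the convenience of charging everything with a single cable.

Thunderbolt 3 is fast, and versatile too

Data transfer

If you have very high demands on data transfer speed, Thunderbolt should be the first protocol you consider.

Thunderbolt is a high-speed data transfer protocol developed jointly by Intel and Apple. You can think of it as another way of exposing internal PCI-E lanes to the user, which is why its speed ceiling is so high. Thunderbolt 3, for example, is rated at up to 40 Gbps (about 5 GB/s) of bandwidth.

If you cannot reach that theoretical limit in practice, nothing is necessarily wrong. According to tests by Weibo user @iBuick, TB 3 delivers at most 32 Gbps (about 4 GB/s). The reason is data integrity: every 8 bits of data sent over TB carry an extra 2 check bits, so that data can move quickly without being corrupted in transit.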
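
Taking the article's description at face value (8 data bits plus 2 check bits per group), the 32 Gbps figure falls out of a one-line calculation. The function name is made up, and whether this overhead is the true cause of the measured limit is the article's claim, not something verified here.

```python
def effective_gbps(raw_gbps, data_bits=8, check_bits=2):
    """Usable bandwidth when every `data_bits` of payload carry `check_bits` of overhead."""
    return raw_gbps * data_bits / (data_bits + check_bits)
```

For a 40 Gbps link this gives 32 Gbps, and dividing by 8 bits per byte gives the 4 GB/s quoted above.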

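TB also has a distinctive daisy-chain capability. Devices connect to one another by cable in a chain (as in the figure above), so that the whole chain reaches the computer through a single connection. Each TB port supports up to 6 devices chained this way. Displays do not support daisy-chaining and must sit at the end of the chain. So in general one TB port can serve 6 devices, and bandwidth within the chain is allocated dynamically.

Display output

Every version of TB supports DisplayPort by default, so when driving a monitor it outputs directly through the DP protocol built into TB. The latest TB 3 revision is compatible with DP 1.4 (up to 8K 60Hz, or 4K 120Hz with 10-bit color and HDR). Note that this requires a Titan Ridge Thunderbolt controller; otherwise only DP 1.2 is supported.

Because TB 3 moved to USB Type-C, it also has an Alt Mode of its own, TB3 Alt Mode. Through it, a display stream can be sent to a monitor operating in USB Type-C Alt Mode.

For example, the 2018 iPad Pro can push its picture to the newer LG UltraFine 5K display using the bundled cable (the newer model supports USB Type-C Alt Mode). The older LG UltraFine 5K is less convenient: you will find that the two cannot do display output over Type-C. The older model supports only the earlier DP over Thunderbolt 3 mode, which is not compatible with Type-C display output, so a traditional TB cable is still needed.

Power delivery

TB 3 uses the standard PD 3.0 protocol for power. The figure below shows how negotiation works: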

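Useful new features in USB 4

So much for the protocols of the past and present. What will USB 4, the protocol of the future, bring?

  • Up to 40 Gbps of bandwidth
  • Dynamic bandwidth allocation
  • Mandatory USB PD 3 charging support
  • USB Type-C as the only connector
  • Backward compatibility
  • Full TB 3 compatibility on some products

A few of these deserve special attention:

Dynamic bandwidth allocation lets a USB hub use its bandwidth more fully. For example, if a flash drive transfer takes about 10% of the hub's bandwidth, the remaining 90% can go to an external high-speed SSD, which is much better than older USB versions, where all devices share the total bandwidth equally.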
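
The difference between equal sharing and dynamic allocation can be shown with a small comparison. The dynamic version below uses max-min fair sharing, which is a stand-in technique chosen for illustration; it is not taken from the USB 4 specification, and the device names and demand figures are hypothetical.

```python
def equal_share(total, demands):
    """Equal split as the article describes older USB: each device gets at most total / n."""
    share = total / len(demands)
    return {name: min(demand, share) for name, demand in demands.items()}


def dynamic_share(total, demands):
    """Max-min fair split: small consumers get what they ask for, the rest goes to the others."""
    allocation = {}
    remaining = total
    pending = sorted(demands.items(), key=lambda item: item[1])  # smallest demand first
    while pending:
        fair = remaining / len(pending)       # equal split of what is still unallocated
        name, demand = pending.pop(0)
        allocation[name] = min(demand, fair)  # never give more than requested
        remaining -= allocation[name]
    return allocation
```

With a 40 Gbps hub, a flash drive asking for 4 Gbps and an SSD asking for 40 Gbps, the equal split leaves the SSD at 20 Gbps while the dynamic split gives it 36 Gbps, the 10% and 90% division from the example above.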

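Mandatory USB PD 3 support follows from USB Type-C being the only connector. Every cable will be usable for charging, and a single cable may eventually cover most charging needs. There is one drawback: users cannot predict which device will supply current to which, so unwanted "reverse charging" may happen in practice.

Note: at the time of writing, the USB-IF has announced only the features above. It may add more exciting features later, and this article may not be updated in time.

Cable buying guide

That covers every protocol the Type-C connector uses now or will use in the future. But devices still need a cable between them, and if the cable is not up to standard, a protocol will probably not work even when both the sending and receiving ends support it.

So choosing the right cable matters a great deal.

Full-featured TB 3 cables

When buying a TB 3 cable, watch for two things: E-TAG and power delivery.

E-TAG is physical hardware placed at both ends of a TB 3 cable to maintain performance when the cable is longer than 1 meter. If a vendor emphasizes "long-distance transfer performance" or similar, the cable most likely has E-TAG. A long TB 3 cable without E-TAG at both ends may reach only 20 Gbps instead of 40 Gbps. Cables with E-TAG are usually called active cables, and those without are called passive cables. Used as a USB cable, a TB 3 cable with E-TAG supports only USB 2.0, so it loses some data transfer speed and is not truly universal.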
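
The rules of thumb in the paragraph above can be written as a small decision function. It encodes only what the article states (cable length, presence of E-TAG, and the resulting limits); the function name and the returned field names are made up for illustration.

```python
def describe_tb3_cable(length_m, has_etag):
    """Summarize a TB 3 cable following the article's rules of thumb."""
    active = has_etag  # cables with E-TAG are called active, those without passive
    if active:
        tb3_gbps = 40                            # E-TAG keeps full speed over distance
    else:
        tb3_gbps = 40 if length_m <= 1 else 20   # long passive cables may drop to 20 Gbps
    return {
        "type": "active" if active else "passive",
        "tb3_gbps": tb3_gbps,
        "usb_limited_to_2_0": active,            # active cables fall back to USB 2.0 only
    }
```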

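Then there is power. All TB 3 cables support PD, but not every one supports the full 100W; most TB 3 cables may handle only 60W, so check when buying.

The TB 3 cables I have used that fully support every feature are:

Recommended Type-C cables

I recommend only Type-C to Type-C (C2C) cables here, mainly because other USB connector shapes never see currents as high as 5A, and you probably already have plenty of cables with other USB connectors. In my view a high-quality Type-C cable should meet these conditions:

  • Supports 100W PD (that is, 5A)
  • Supports at least USB 3.0 speeds (now called USB 3.2 Gen 1)
  • Fully supports the other protocols and Alt Modes

The C2C cables I currently use that meet these conditions are:

I have other C2C cables too, each a little short of perfect: PD limited to 60W, transfer speeds only at USB 2.0 level, and so on. For everyday use they are still good enough.

One cable that supports every protocol

If you want support for every protocol I have described, a short passive TB 3 cable is a safe choice. Just be sure to avoid short active TB 3 cables (those with E-TAG).

A short passive TB 3 cable is a TB 3 cable with a total length of 1m or less and no E-TAG. Because every TB 3 cable has a TB 3 controller at each end, in addition to TB 3 and 100W PD this controller also provides Thunderbolt Alt Mode, DisplayPort Alt Mode, and USB 3.0 or later (the controller chip must be Titan Ridge to support USB 3.1).

In my own testing, though, only one cable turned out to be fully featured: the Apple Thunderbolt 3 (USB-C) Cable (0.8 m), priced at 320 yuan on Apple's website.

That is my roundup of USB and Thunderbolt. I hope it helps you choose the products you want without being misled by sellers or wasting time, effort, or money.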


How a Docker footgun led to a vandal deleting NewsBlur’s MongoDB database


tl;dr: A vandal deleted NewsBlur’s MongoDB database during a migration. No data was stolen or lost.

I’m in the process of moving everything on NewsBlur over to Docker containers in prep for a big redesign launching next week. It’s been a great year of maintenance and I’ve enjoyed the fruits of Ansible + Docker for NewsBlur’s 5 database servers (PostgreSQL, MongoDB, Redis, Elasticsearch, and soon ML models). The day was wrapping up and I settled into a new book on how to tame the machines once they’re smarter than us when I received a strange NewsBlur error on my phone.

"query killed during yield: renamed collection 'newsblur.feed_icons' to 'newsblur.system.drop.1624498448i220t-1.feed_icons'"

There is honestly no set of words in that error message that I ever want to see again. What is drop doing in that error message? Better go find out.
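
The renamed namespace in that error carries some information of its own. The sketch below assumes the digits after `system.drop.` are a Unix timestamp in seconds, followed by an increment after `i` and a term after `t`; that reading of the name is an assumption, not something the article states, and the function name is made up.

```python
import re
from datetime import datetime, timezone

# Assumed layout: system.drop.<unix seconds>i<increment>t<term>.<original collection>
DROP_NAME = re.compile(r"system\.drop\.(\d+)i(\d+)t(-?\d+)\.(.+)$")


def parse_drop_pending(namespace):
    """Split a renamed namespace into (UTC time, increment, term, collection), or None."""
    match = DROP_NAME.search(namespace)
    if match is None:
        return None
    seconds, increment, term, collection = match.groups()
    when = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return when, int(increment), int(term), collection
```

Under that assumption, the name in the error above decodes to 2021-06-24 01:34:08 UTC, which in a UTC-4 timezone is 9:34 pm the previous evening and so lines up with the 9:35p start given in the timeline.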

I logged into the MongoDB machine to check what state the DB was in and came across the following…

Two thoughts immediately occurred:

  1. Thank goodness I have some recently checked backups on hand
  2. No way they have that data without me noticing

Three and a half hours before this happened, I switched the MongoDB cluster over to the new servers. When I did that, I shut down the original primary in order to delete it in a few days when all was well. And thank goodness I did that as it came in handy a few hours later. Knowing this, I realized that the hacker could not have taken all that data in so little time.

With that in mind, I’d like to answer a few questions about what happened here.

  1. Was any data leaked during the hack? How do you know?
  2. How did NewsBlur’s MongoDB server get hacked?
  3. What will happen to ensure this doesn’t happen again?

Let’s start by talking about the most important question of all which is what happened to your data.

1. Was any data leaked during the hack? How do you know?

I can definitively write that no data was leaked during the hack. I know this because of two different sets of logs showing that the automated attacker only issued deletion commands and did not transfer any data off of the MongoDB server.

Below is a snapshot of the bandwidth of the db-mongo1 machine over 24 hours:

You can imagine the stress I experienced in the forty minutes between 9:35p, when the hack began, and 10:15p, when the fresh backup snapshot was identified and put into gear. Let’s break down each moment:

  1. 6:10p: The new db-mongo1 server was put into rotation as the MongoDB primary server. This machine was the first of the new, soon-to-be private cloud.
  2. 9:35p: Three hours later an automated hacking attempt opened a connection to the db-mongo1 server and immediately dropped the database. Downtime ensued.
  3. 10:15p: Before the former primary server could be placed into rotation, a snapshot of the server was made to ensure the backup would not delete itself upon reconnection. This cost a few hours of downtime, but saved nearly 18 hours of a day’s data by not forcing me to go into the daily backup archive.
  4. 3:00a: Snapshot completes, replication from original primary server to new db-mongo1 begins. What you see in the next hour and a half is what the transfer of the DB looks like in terms of bandwidth.
  5. 4:30a: Replication, which is inbound from the old primary server, completes, and now replication begins outbound on the new secondaries. NewsBlur is now back up.

The most important bit of information the above chart shows us is what a full database transfer looks like in terms of bandwidth. From 6p to 9:30p, the amount of data was the expected amount from a working primary server with multiple secondaries syncing to it. At 3a, you’ll see an enormous amount of data transferred.

This tells us that the attack was automated digital vandalism rather than a concerted hacking attempt. And if we were to pay the ransom, it wouldn’t do anything because the vandals don’t have the data and have nothing to release.

We can also reason that the vandal was not able to access any files that were on the server outside of MongoDB due to using a recent version of MongoDB in a Docker container. Unless the attacker had access to a 0-day to both MongoDB and Docker, it is highly unlikely they were able to break out of the MongoDB server connection.

While the server was being snapshot, I used that time to figure out how the hacker got in.

2. How did NewsBlur’s MongoDB server get hacked?

Turns out the ufw firewall I enabled and diligently kept on a strict allowlist with only my internal servers didn’t work on a new server because of Docker. When I containerized MongoDB, Docker helpfully inserted an allow rule into iptables, opening up MongoDB to the world. So while my firewall was “active”, doing a sudo iptables -L | grep 27017 showed that MongoDB was open to the world. This has been a Docker footgun since 2014.
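
One way to catch this kind of silent exposure is to test the port from the outside instead of trusting the firewall's own status output. Below is a minimal Python sketch of such a check. It assumes you can run it from a machine that is not on the allowlist; the function name is hypothetical and this is not part of NewsBlur's tooling.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake; success means
        # nothing between this machine and the service dropped the packets.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `port_is_open("<public address of the database host>", 27017)` from an outside machine should return False if the allowlist is really in effect. A True result means something, such as a Docker-published port, is bypassing the firewall.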

To be honest, I’m a bit surprised it took over 3 hours from when I flipped the switch to when a hacker/vandal dropped NewsBlur’s MongoDB collections and pretended to ransom about 250GB of data. This is the work of an automated hack and one that I was prepared for. NewsBlur was back online a few hours later once the backups were restored and the Docker-made hole was patched.

It would make for a much more dramatic read if I was hit through a vulnerability in Docker instead of a footgun. By having Docker silently override the firewall, Docker has made it easier for developers who want to open up ports on their containers at the expense of security. Better would be for Docker to issue a warning when it detects that the most popular firewall on Linux is active and filtering traffic to a port that Docker is about to open.

The second reason we know that no data was taken comes from looking through the MongoDB access logs. With these rich and verbose logging sources we can invoke a pretty neat command to find everybody who is not one of the 100 known NewsBlur machines that has accessed MongoDB.


$ cat /var/log/mongodb/mongod.log | egrep -v "159.65.XX.XX|161.89.XX.XX|<< SNIP: A hundred more servers >>"

2021-06-24T01:33:45.531+0000 I NETWORK  [listener] connection accepted from 171.25.193.78:26003 #63455699 (1189 connections now open)
2021-06-24T01:33:45.635+0000 I NETWORK  [conn63455699] received client metadata from 171.25.193.78:26003 conn63455699: { driver: { name: "PyMongo", version: "3.11.4" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.4.0-74-generic" }, platform: "CPython 3.8.5.final.0" }
2021-06-24T01:33:46.010+0000 I NETWORK  [listener] connection accepted from 171.25.193.78:26557 #63455724 (1189 connections now open)
2021-06-24T01:33:46.092+0000 I NETWORK  [conn63455724] received client metadata from 171.25.193.78:26557 conn63455724: { driver: { name: "PyMongo", version: "3.11.4" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.4.0-74-generic" }, platform: "CPython 3.8.5.final.0" }
2021-06-24T01:33:46.500+0000 I NETWORK  [conn63455724] end connection 171.25.193.78:26557 (1198 connections now open)
2021-06-24T01:33:46.533+0000 I NETWORK  [conn63455699] end connection 171.25.193.78:26003 (1200 connections now open)
2021-06-24T01:34:06.533+0000 I NETWORK  [listener] connection accepted from 185.220.101.6:10056 #63456621 (1266 connections now open)
2021-06-24T01:34:06.627+0000 I NETWORK  [conn63456621] received client metadata from 185.220.101.6:10056 conn63456621: { driver: { name: "PyMongo", version: "3.11.4" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.4.0-74-generic" }, platform: "CPython 3.8.5.final.0" }
2021-06-24T01:34:06.890+0000 I NETWORK  [listener] connection accepted from 185.220.101.6:21642 #63456637 (1264 connections now open)
2021-06-24T01:34:06.962+0000 I NETWORK  [conn63456637] received client metadata from 185.220.101.6:21642 conn63456637: { driver: { name: "PyMongo", version: "3.11.4" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "5.4.0-74-generic" }, platform: "CPython 3.8.5.final.0" }
2021-06-24T01:34:08.018+0000 I COMMAND  [conn63456637] dropDatabase config - starting
2021-06-24T01:34:08.018+0000 I COMMAND  [conn63456637] dropDatabase config - dropping 1 collections
2021-06-24T01:34:08.018+0000 I COMMAND  [conn63456637] dropDatabase config - dropping collection: config.transactions
2021-06-24T01:34:08.020+0000 I STORAGE  [conn63456637] dropCollection: config.transactions (no UUID) - renaming to drop-pending collection: config.system.drop.1624498448i1t-1.transactions with drop optime { ts: Timestamp(1624498448, 1), t: -1 }
2021-06-24T01:34:08.029+0000 I REPL     [replication-14545] Completing collection drop for config.system.drop.1624498448i1t-1.transactions with drop optime { ts: Timestamp(1624498448, 1), t: -1 } (notification optime: { ts: Timestamp(1624498448, 1), t: -1 })
2021-06-24T01:34:08.030+0000 I STORAGE  [replication-14545] Finishing collection drop for config.system.drop.1624498448i1t-1.transactions (no UUID).
2021-06-24T01:34:08.030+0000 I COMMAND  [conn63456637] dropDatabase config - successfully dropped 1 collections (most recent drop optime: { ts: Timestamp(1624498448, 1), t: -1 }) after 7ms. dropping database
2021-06-24T01:34:08.032+0000 I REPL     [replication-14546] Completing collection drop for config.system.drop.1624498448i1t-1.transactions with drop optime { ts: Timestamp(1624498448, 1), t: -1 } (notification optime: { ts: Timestamp(1624498448, 5), t: -1 })
2021-06-24T01:34:08.041+0000 I COMMAND  [conn63456637] dropDatabase config - finished
2021-06-24T01:34:08.398+0000 I COMMAND  [conn63456637] dropDatabase newsblur - starting
2021-06-24T01:34:08.398+0000 I COMMAND  [conn63456637] dropDatabase newsblur - dropping 37 collections

<< SNIP: It goes on for a while... >>

2021-06-24T01:35:18.840+0000 I COMMAND  [conn63456637] dropDatabase newsblur - finished

The above is a lot, but the important bit of information to take from it is that by using a subtractive filter, capturing everything that doesn’t match a known IP, I was able to find the two connections that were made a few seconds apart. Both connections from these unknown IPs occurred only moments before the database-wide deletion. By following the connection ID, it became easy to see the hacker come into the server only to delete it seconds later.
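
The same subtractive idea can be sketched in Python. This illustrates the technique and is not the command used above: it assumes log lines shaped like the excerpt, and the function names and the 10.0.0.5 "known" address in the usage note are hypothetical.

```python
import re

# Matches "connection accepted from IP:PORT #ID" lines like those in the excerpt.
ACCEPTED = re.compile(r"connection accepted from (\d+\.\d+\.\d+\.\d+):\d+ #(\d+)")


def unknown_connections(lines, known_ips):
    """Return {conn_id: ip} for accepted connections whose IP is not in known_ips."""
    found = {}
    for line in lines:
        match = ACCEPTED.search(line)
        if match and match.group(1) not in known_ips:
            # mongod tags later lines from this client as [conn<ID>].
            found["conn" + match.group(2)] = match.group(1)
    return found


def lines_for(lines, conn_ids):
    """Return the log lines tagged with any of the given connection IDs."""
    tags = tuple(f"[{conn_id}]" for conn_id in conn_ids)
    return [line for line in lines if any(tag in line for tag in tags)]
```

For example, `unknown_connections(log_lines, {"10.0.0.5"})` yields the connection IDs opened from unrecognized addresses, and passing those IDs to `lines_for` returns every command those connections issued, such as the `dropDatabase` entries.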

Interestingly, when I visited the IP address of the two connections above, I found a Tor exit router:

This means that it is virtually impossible to track down who is responsible due to the anonymity-preserving quality of Tor exit routers. Tor exit nodes have poor reputations due to the havoc they wreak. Site owners are split on whether to block Tor entirely, but some see the value of allowing anonymous traffic to hit their servers. In NewsBlur’s case, because NewsBlur is a home of free speech, allowing users in countries with censored news outlets to bypass restrictions and get access to the world at large, the continuing risk of supporting anonymous Internet traffic is worth the cost.

3. What will happen to ensure this doesn’t happen again?

Of course, being in support of free speech and providing enhanced ways to access speech comes at a cost. So for NewsBlur to continue serving traffic to all of its worldwide readers, several changes have to be made.

The first change is the one that, ironically, we were in the process of moving to. A VPC, a virtual private cloud, keeps critical servers only accessible from other servers in a private network. But in moving to a private network, I need to migrate all of the data off of the publicly accessible machines. And this was the first step in that process.

The second change is to use database user authentication on all of the databases. We had been relying on the firewall to provide protection against threats, but when the firewall silently failed, we were left exposed. Now who’s to say that this would have been caught if the firewall failed but authentication was in place. I suspect the password needs to be long enough to not be brute-forced, because eventually, knowing that an open but password protected DB is there, it could very possibly end up on a list.

Lastly, a change needs to be made as to which database users have permission to drop the database. Most database users only need read and write privileges. The ideal would be a localhost-only user being allowed to perform potentially destructive actions. If a rogue database user starts deleting stories, it would get noticed a whole lot faster than a database being dropped all at once.
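
As a rough illustration of that last point, the sketch below builds the command documents for a custom role limited to document-level reads and writes, and for a user holding only that role. It follows the shape of MongoDB's `createRole` and `createUser` database commands, which with PyMongo could be sent through `Database.command`. The role name, user name, and helper functions are hypothetical. A custom role is used on the assumption that the built-in `readWrite` role is too broad here, since per MongoDB's documentation it also grants `dropCollection`.

```python
def read_write_role(db_name: str, role_name: str = "appReadWrite") -> dict:
    """Build a createRole command granting document-level reads and writes only."""
    return {
        "createRole": role_name,
        "privileges": [
            {
                # An empty collection name targets every non-system
                # collection in the named database.
                "resource": {"db": db_name, "collection": ""},
                # No dropCollection or dropDatabase actions are granted.
                "actions": ["find", "insert", "update", "remove"],
            }
        ],
        "roles": [],  # inherit nothing from other roles
    }


def app_user(db_name: str, user: str, password: str,
             role_name: str = "appReadWrite") -> dict:
    """Build a createUser command for a user holding only the restricted role."""
    return {
        "createUser": user,
        "pwd": password,
        "roles": [{"role": role_name, "db": db_name}],
    }
```

A user created this way can still delete individual documents, which matches the point above: a rogue user deleting stories one by one is far easier to notice than a whole database vanishing at once.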

But each of these is only one piece of a defense strategy. As this well-attended Hacker News thread from the day of the hack made clear, a proper defense strategy can never rely on only one well-setup layer. And for NewsBlur that layer was an allowlist-only firewall that worked perfectly up until it didn’t.

As usual the real heroes are backups. Regular, well-tested backups are a necessary component to any web service. And with that, I’ll prepare to launch the big NewsBlur redesign later this week.

yee
79 days ago
39.965424,116.324526

Workplace pitfalls I've fallen into over the past few years

1 Share
Google

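As you all know, I recently switched teams. Looking back over the past few years, I've been burned plenty of times. I learned some things along the way, so I'm recording them in this article, which also serves as a snapshot of my current thinking. Its three parts cover pitfalls with projects, managers, and teams. At the end I'll talk about switching teams.

"Pitfall" here means not only traps set by the outside world but also my own mistakes. Readers should judge for themselves how far this article applies, since companies, and even the teams and departments within them, differ enormously. If something doesn't match your situation, please don't argue about it; there's no point. That said, I think most of the content applies broadly. Also, some English terms are kept as-is; none of them are hard to understand.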

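Project pitfalls

Be careful when the requirement doesn't come directly from users

One reason I switched teams is that the big project I worked on in 2020 failed. The project delivered the planned features and its architecture was clean and sound, but it didn't achieve the expected impact, and I had been counting on it to get me to L5. Why did this happen? Ultimately, the project didn't come from a direct customer need; my manager thought it up off the top of his head. My (former) manager was chatting with a few other managers one day and found that their teams all had a certain tool. He figured our larger org could use one too, so he went looking for someone to build it. It wasn't completely useless, since we did eventually find two customers, but we failed to win over the team we most wanted to adopt it, and there probably won't be more users in the future. Only later did I learn that my manager had asked my TL first; the TL declined, and only then did he come to me. The TL's experience steered him clear of this pit, but I wasn't so lucky. Conclusion: a requirement that doesn't come directly from users is a huge red flag.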

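Internal 2c projects are (most likely) better than 2b projects

This only concerns internal tooling projects and doesn't apply to products for external users. Over the years I've made one very direct observation: successful internal tools are almost all 2c, and very few are 2b. Some may wonder whether the 2b/2c distinction even applies to internal tools. There's a simple test: if an engineer can start using a tool or system on their own, it's 2c; if more than one person in the department has to discuss and decide, it's 2b. Two examples: a task-monitoring dashboard that works out of the box is 2c, while a tool that requires changing the release process before it can be used is 2b. Of course, my sample is small, so the conclusion may not hold. But I genuinely feel that driving adoption of something 2b takes several times as long as a 2c tool, and the results may still be worse. In short, if you have a choice, spend your time on 2c projects.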

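Beware of projects that depend on other teams

OK, OK, I get it: hardly any project in the world needs no cross-department collaboration. But collaboration comes in many forms. If the other party is your customer and you need to talk to them to implement a requirement, or you all belong to one large team building the same product, that's fine. But when the other party is on your critical path, for example when what you want to build can't be finished without their cooperation, and they sit in a different department, the risk is high. If you're on a project like this, make sure that:

  • Both (or all) sides are aligned (what Alibaba jargon calls "对齐"), especially the managers. Expectations for the project, priority, schedule, and resourcing must all be mutually known and agreed, and written into OKRs. Don't trust engineers' verbal promises; I've been burned by them, badly.
  • There is a regular recurring meeting.
  • The escalation channel is open. When the other side misses a deadline or problems come up, you need to be able to put pressure on them through your own manager. That's also why support from both sides' managers is essential.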

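Know what others on the team are doing

Besides the project mentioned above, I once had an even better chance of reaching L5. Our team had a project with two people responsible for the frontend and backend respectively, and the backend guy left the company halfway through. The backend was implemented in C++, and at the time I was the only person on the team with C++ readability; if I had asked to take over, my manager would 100% have agreed. But I had my head down in my own tasks and paid no attention to what others on the team were doing (even though I had been reviewing their code all along). Because I didn't understand the project, I didn't realize its potential. It was later handed to the frontend guy and became a key project for the team. That's how I missed a good opportunity. Put plainly, it's another example of "choices matter more than effort".

So I decided that after joining the new team I would have at least one 1:1 with every person, even if their work seems entirely unrelated to mine. I would also follow new proposals and design docs to learn what others are doing. Besides surfacing possible opportunities, this helps me understand my current work from a higher vantage point.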

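Don't get emotionally attached to a project

The main reason I originally chose to move to my current team was that my (former) manager promised I could continue my earlier work, a project I liked a lot and believed had great potential. To this day, that project has fully delivered on its potential, yet it cost me two years. Had I set aside my attachment to the project back then, I could have found a more suitable team. I'll discuss switching teams in detail later.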

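Manager pitfalls

I previously recorded an episode with Phil on this topic, though the situations we ran into weren't quite the same; have a listen if you're interested. Here I'll only cover what I encountered.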

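Don't blindly trust your manager

All three of my managers have made some wildly off-base judgments:

  • My first manager believed that the language you write doesn't affect switching teams (I started out writing Dart...). The reality is that team transfers depend heavily on background. Many teams say on the surface that you can learn after joining, but in practice they want applicants who already write a particular language (people counting on learning after they join don't even get an interview :).
  • My second manager loathed C++ and loved Java. He asked me to rewrite a C++ service in Java. This was clearly thankless work, but refusing wasn't an option, so my manager and I persuaded someone else to write it. As I expected, he spent the time and got no impact out of it (I'm truly sorry to him).
  • My current manager is rather short-sighted. After taking over he cut projects aggressively (handing them to other teams to maintain), on the grounds that we needed to focus on a few key projects. Many successful or promising projects ended up on the cut list. The problem is that EngProd teams don't have much work to begin with, and after cuts like these, quite a few people can't get a project with enough impact and can only do small features. Several people left last year, and by the look of it another wave will leave this year.

So my view is: listen to what your manager says, but make the decisions yourself (or rather, make decisions you agree with). There's a more fundamental reason for this. When you make the decision yourself, even if it's wrong, you can review it, reflect, and improve. When you let others decide, it's hard to improve: if it goes right you don't know why, and if it goes wrong you don't know why either.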

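Beware of managers who only want to use you as a tool

Another reason I switched teams was finding that my manager tended to treat me as a tool. At Google, a good manager has to balance the interests of the team and its members, ensuring the team develops while also making sure individuals can grow (put plainly, get promoted). My previous two managers both handed me bad projects, but at least they would talk with me about my personal development and offer advice. I noticed all of that. My current manager was still an engineer last year and was promoted after the previous manager left. My initial impression was that he paid little attention to my project; he neither understood it nor wanted to. Near the end of Q1 this year, things started getting worse. As mentioned, he began cutting projects and moved me to the project I had once missed out on. I assumed he intended for me to own it from now on, but no. His plan was for me to implement one fairly important feature in Q2, with later arrangements to be discussed. From what I know, a high-impact piece of this project had already been assigned to an incoming intern. Yes, you heard that right: an intern. I asked my manager: an L3 engineer could obviously build this feature (simple CRUD), and it would help their promotion a lot, so why me? He said: you're good at this, so you'll do it faster. Once the conversation reached that point, there wasn't much sense in continuing.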

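How can you tell whether your manager is treating you as a tool? I think there are two signs:

  • Whether the manager helps you plan your career. A good manager will say: I think that for promotion your impact/difficulty/leadership isn't there yet; working on task XXX might help, or let's figure out together what you could do. A manager who treats you as a tool won't plan anything for you.

  • Whether the manager is honest with you. I once told my current manager: just adding this feature is basically L3 work, and I don't think it helps my promotion. He said it didn't matter; as long as I did it quickly, getting SEE (comparable to a 3.75) would be no problem. To me this is textbook dishonesty, or workplace manipulation ("PUA") talk. However senior you are, the time that has to be spent still has to be spent: gathering requirements, design, implementation, testing, and some of that time is outside your control. A senior engineer simply produces something more stable, more carefully thought through, and with fewer bugs; they won't necessarily be faster, because the fastest way is always quick and dirty.

    If it were my former manager, I think he would have said: "This won't get you to L5, but we have a customer who urgently needs this feature. We can plan what comes afterwards, for example having you take ownership of the project from there, or finding a better project." Honesty matters most. Don't lie to us; we're not stupid.

Of course, my current manager has also talked with me about promotion. He said that if I wanted to get promoted I had to come up with a good project myself. So I came up with a high-impact one and wrote a proposal for the relevant teams to review, and they were very happy with it. Then my manager said we're not doing this; we need to focus. Me: ???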

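Team pitfalls

Team pitfalls are both the easiest and the hardest to talk about. Relatively speaking, they're fairly clear-cut and easy to spot in advance. On the other hand, they come in many varieties and are tied up with both managers and projects, so they can't be listed exhaustively. I'll only cover the ones I know.

Business, technology, or personal interest: at least one must apply

Some teams have a promising business, some have technical challenges, some have both, and some have neither. Teams with neither are basically pits, and big companies have plenty of them, for example Google's EngProd. Of course, if what a team builds is exactly where your interests lie, then join it; at that point nothing else matters as much.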

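Fake infra and true infra

I may write a separate article someday on exactly how EngProd is a pit, but the most fundamental point is that what they own is fake infra. EngProd stands for Engineering Productivity. The title today is Software Engineer, but before 2019 it was SETI (software engineer, tools and infrastructure). Some people see the word "infrastructure" and are awed, assuming the work must be something remarkable. But while EngProd isn't responsible for product work, what it builds has nothing to do with infra either; I call it "fake infra". Modifying the company's integration test framework for some team to use, helping some team optimize presubmit/postsubmit time, reducing test flakiness: odd jobs like these are EngProd's traditional specialty. As I understand it, the company created this role precisely to split off the odd jobs so SWEs could focus on development.

What's the difference between true infra and fake infra? I think true infra always has technical challenges and serves at least a major org (such as Search/Ads/Cloud), not just one or a few teams. The technologies everyone knows by name are all true infra, such as Spanner, F1, Flume, and Bigtable. Interestingly, the teams building true infra have never belonged to EngProd, and EngProd has never been responsible for true infra.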

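Don't join teams with too small a scope or uncertain prospects

I remember my first manager always stressing the need to understand the big picture. I didn't get it then; only now do I see how right he was. Scope means how large the work or product a team owns is. For the true infra teams mentioned earlier, the scope is the whole company, whereas an EngProd team's scope is usually one department (dozens to a hundred people). User-facing products are more complicated. Gmail and YouTube have billions of users, but the small feature a particular team owns may not. In those cases you can only rely on experience to judge. Often we can't see clearly, and that's fine, but at least some pits can be avoided, such as messaging apps.

I also made a mistake the first time I switched teams: to get into AR I spent a lot of time studying computer graphics, so I didn't look seriously at other teams. Looking back I feel really stupid: wasn't it obvious that doing AR at Google had no future? To this day there isn't a single successful consumer-facing AR product, let alone one from Google, which hasn't invested much in this area. And frankly I wasn't truly interested; I just thought the technology was cool. The mistake I made then now takes several times as long to repay.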

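Notes on switching teams

Now let's talk about switching teams. Although it's related to the previous section, I decided to treat it separately. Many companies don't offer as many transfer opportunities as Google, and you may also face a manager who won't let you go. So I won't write about the transfer process; instead I'll focus on "choosing a team" to keep the content as general as possible.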

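Pay attention to the concrete work and the overall direction

Choosing a team by product alone isn't enough; the concrete work and the overall direction matter too. For example, a star product team may be hiring in order to do testing, and a content platform's annual OKR may be to tighten content review. The simplest way to learn a team's direction is to read its annual OKRs. If you can't find them, be sure to ask the manager directly.

I once came across a team that had hc and so planned to hire a few people first, with the work to be assigned after they joined. That's another red flag. Generally, the more clearly a position's responsibilities are defined, the lower the project risk and the chance of landing in a pit, and the better grounded your choice can be.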

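Whether impact is easy to realize

I don't know of a good Chinese translation for "impact" ("影响力" isn't quite accurate), so I use the English word as-is. Whether impact is easy to realize is best shown by example (note that "easy to realize" and "size" are two different things):

  • Easy to realize:
    • Implementing a new user-facing feature
    • Refactoring a system onto a new API because the one it depends on is being deprecated
    • Optimizing response time
    • Writing unit tests for a project with very low coverage
  • Hard to realize:
    • Building a test framework
    • Refactoring a system with no change in functionality
    • Improving debugging or monitoring tools

Easy-to-realize projects are simple and clear: once the work is done, the impact is solidly there, and you don't have to labor to explain it. The hard-to-realize ones each have their own problem:

  • Building a test framework: building it doesn't mean anyone uses it, and if nobody uses it the impact is zero. You have to put effort into promoting it.
  • Refactoring a system: you think the old code is so bad it's hard to maintain, so you want to refactor. The question is, where's the evidence? You have to collect data, for example showing that adding a feature took three days on average under the old code and takes only one after the refactor, assuming you can get data that ideal. In practice you often can't, or there's no significant difference.
  • Improving debugging or monitoring tools: the difficulty is again in collecting data. How do you prove that things are better after the improvement than before? It's not that it can't be proven, but it takes effort.

Prefer teams where impact is easy to realize. Generally, building new things is easier to get credit for.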

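Talk to the engineers

Conversations with managers tend to stay on the surface, or be full of grand promises. Find the engineers on the team; they'll generally tell you the real story without holding back. I ask a lot of pointed questions, such as "what do you see as the team's weaknesses?" and "how fast are promotions?", and I also dig into the details of the work. One engineer told me outright: if you want to get promoted, don't join our team. A manager would never say that. If you think current employees might be guarded, you can also find people who recently transferred out. In my conversations I found no big difference; any engineer will describe the situation truthfully.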

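When should you consider switching teams?

There's no standard answer. I now lean toward the view that as soon as you're eligible (Google requires one full year on the team), you can start watching for new opportunities. We have a system where you can set filters, and it emails you automatically when a matching internal transfer opening appears. Checking that email once a week takes only a few minutes, and looking at opportunities doesn't mean you have to leave. Although plenty of teams are hiring on any given day, for any single team the hiring window is only a small slice of the year, and some teams don't hire for years. Many opportunities, once missed, are gone.

There's one situation where I'd advise you to start looking immediately: personnel changes. Whether it's a manager leaving or a reorg, any personnel change will affect your work to some degree. The effect may be good or bad, large or small; you can't predict it, so it's best to prepare ahead. Especially if your manager gave you a lot of support, or was the project's originator, your work is bound to be hit once he leaves. Things often go bad in an instant, and without preparation you may be caught off guard. Again, looking at opportunities doesn't mean you must switch teams; it just gives you a fallback. If everything carries on as usual or even improves, there's no need to switch at all.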

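Summary

I trust you now have a sense of how I've been burned over these years. Of course I wish everything had gone smoothly, but since it happened, I have to sum it up and take it as a lesson. I wish everyone a smooth career.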

yee
129 days ago
39.965424,116.324526

Mad Marx: The Class Warrior

3 Comments and 14 Shares












yee
1574 days ago
39.965424,116.324526
popular
1578 days ago
3 public comments

rraszews
1578 days ago
Karl Marx of the Wasteland headshotting Ayn Rand is the single most beautiful thought I have ever been gifted with.
Columbia, MD

CarlEdman
1578 days ago
So true! All of the world's problems could be solved by Marx(ists) killing more of their opponents.
Falls Church, Virginia, USA

quad
1578 days ago
Your irony game is so strong.

rclatterbuck
1578 days ago
I'd watch it