> I always thought of race conditions as corrupting the data or deadlocking. I never thought it could cause performance issues. But it makes sense, you could corrupt the data in a way that creates an infinite loop.
Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.
It's a decent rule of thumb, but it definitely needs some pragmatism. Squashing any error, strangeness and warning can be very expensive in some projects, much more so than paying for the occasional seemingly-unrelated problem.
But of course it's quasi-impossible to know in advance the likelihood of a given error participating in a future problem, and whether it's cheaper to fix this error ahead or let the problem happen. So it becomes an art more than a science to decide what to focus on.
"fix nothing" is certainly a horrible approach, "fix everything" is often impractical. So you either need some sort of decision framework, or a combination of decent "instinct" (experience, really) and trust from your stakeholder (which comes from many places, including good communication and track record of being pragmatic over dogmatic)
I remember a project to get us down from something like 1,500 build warnings to sub-100. It took a long time, generated plenty of bikeshedding, and its value was impossible to demonstrate.
I, personally, was mostly just pissed we didn't get it to zero. Unsurprisingly the number has climbed back up since
Could you propose to fail the build based on the number of warnings to ensure it doesn't go up?
I did something similar with SpotBugs. There were existing warnings I couldn't get time to fix, so I configured the Maven build to fail if the count exceeded the level at which I enabled it.
This has the unfortunate side effect that if it drops and no one adjusts the threshold then people can add more issues without failing the build.
Absolutely could!
However, management felt kinda burned because that was a bunch of time and, unsurprisingly, nobody was measurably more productive afterwards (it turns out those are just shitty code tidbits, but not highly correlated with the areas where it is miserable to make changes). Some of the over-refactorings probably made things harder.
It was a lovely measurable metric, making it an easy sell in advance. Which maybe was the problem idk.
> This has the unfortunate side effect that if it drops and no one adjusts the threshold then people can add more issues without failing the build.
Our tests are often written with a list of known exceptions. However, such tests also fail if an exception is no longer needed - with a congratulatory message and a notice that this exception should be removed from the list. This ensures that the list gets shorter and shorter.
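A rough sketch of that kind of test (JUnit 5; the violation scan and the class names here are hypothetical, not from the original comment): an unknown offender fails the build, and so does a stale entry, which keeps the list shrinking.

    import static org.junit.jupiter.api.Assertions.fail;

    import java.util.Set;
    import org.junit.jupiter.api.Test;

    class KnownViolationsTest {

        // Grandfathered offenders; the goal is for this set to only ever shrink.
        private static final Set<String> KNOWN_VIOLATIONS = Set.of(
                "legacy.OrderParser",
                "legacy.ReportWriter");

        @Test
        void violationListOnlyShrinks() {
            Set<String> current = findCurrentViolations();

            for (String violation : current) {
                if (!KNOWN_VIOLATIONS.contains(violation)) {
                    fail("New violation introduced: " + violation);
                }
            }
            for (String known : KNOWN_VIOLATIONS) {
                if (!current.contains(known)) {
                    fail("Congratulations, " + known
                            + " is fixed - now remove it from KNOWN_VIOLATIONS");
                }
            }
        }

        // Placeholder: wire this up to your linter or analyzer output.
        private Set<String> findCurrentViolations() {
            return Set.of("legacy.OrderParser", "legacy.ReportWriter");
        }
    }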
If you can instead construct a list of existing instances to grandfather in, that doesn't suffer from this problem. Of course many linting tools do this via "ignore" code comments.
That feels less arbitrary than a magic number (because it is!) and I've seen it work.
We used this approach to great effect when we migrated a huge legacy project from JavaScript to TypeScript. It gives you enough flexibility in the in-between stages so you're not forced to change weird code you don't know right away, while enforcing enough structure to make it out alive in the end.
> Squashing any error, strangeness and warning can be very expensive in some projects
Strongly disagree. Strange, unexpected behaviour of code is a warning sign that you have fallen short in defensive programming and you no longer have a mental model of your code that corresponds with reality. That is a very dangerous place to be in. It's very easy to find yourself stuck in quicksand not too far afterwards.
Depends a lot on the project, I think, as the parent comment suggests.
Fixing everything is impractical, but I'd say a safer rule of thumb would be to at least understand small strangenesses/errors. In the case of things that are hard to fix - e.g. design/architectural decisions that lead to certain performance issues or what have you - it's still usually not too time consuming to get a basic understanding of why something is happening.
Still better to quash small bugs and errors where possible, but at least if you know why they happen, you can prevent unforeseen issues.
Sometimes it can take a serious effort to understand why a problem is happening, and I'll accept an unknown blip that can be corrected by occasionally hitting a reset button when dealing with third-party software. From my experience my opinion aligns with yours though - it's also worth understanding why an error happens in something you've written. The times we've delayed dealing with mysterious errors that nobody on the team could explain, we've ended up with a much larger problem once we finally found the resources to deal with it.
Nobody wants to triage an issue for eight weeks, but one thing to keep in mind is that the more difficult it is to triage an issue the more ignorance about the system that process is revealing - if your most knowledgeable team members are unable to even triage an issue in a modest amount of time it reveals that your most knowledgeable team members have large knowledge gaps when it comes to comprehending your system.
This, at least, goes for a vague comprehension of the cause - there are times you'll know approximately what's going wrong but may get a question from the executive suite about the problem (e.g. "Precisely how many users were affected by the outage that caused us to lose our access_log") that might take weeks or months or be genuinely nigh-on-impossible to answer - I don't count questions like that as part of issue diagnosis. And if it's a futile question you should be highly defensive about developer time.
That's very fair - at least with third party software, it can be nigh impossible to track down a problem.
With third party libraries, I've too-often found myself reading the code to figure something out, although that's a frustrating enough experience I generally wouldn't wish on other people.
If there are any warnings I'm supposed to ignore, then there are effectively no warnings.
There's nothing pragmatic about it. Once I get into the habit of ignoring a few warnings, that effectively means all warnings will be ignored.
At my job we treat all warnings as errors and you can't merge your pull requests unless all automatically triggered CI pipelines pass. It requires discipline, but once you get it into that state it's a lot easier to keep it there.
The last point is the key.
It then creates immense value by avoiding a lot of risk and uncertainty for little effort.
Getting from "thousands of warnings" to zero isn't a good ROI in many cases, certainly not on a shortish term. But staying at zero is nearly free.
This is even more so with those "fifteen flickering tests", the 23 tests that have been failing and have been ignored or skipped for years.
It's also why I commonly set up a CI, testing systems, linters, continuous deployment before anything else. I'll most often have an entire CI and guidelines and build automation to deploy something that will only say "hello world".
Because it's much easier to keep it green, clean and automated than to move there later on
That's because it moves from being a project to being a process. I've tried to express this at my current job.
They want to take time out to write a lot of unit tests, but they're not willing to change the process to allow/expect devs to add unit tests along with each feature they write.
I'll be surprised if all the tests are still passing two months after this project, since nobody runs them.
That’s why TDD (Test-Driven Development) has become a trend. I personally don’t like TDD’s philosophy of writing tests first, then the code (probably because I prefer to think of a solution more linearly), but I do absolutely embrace the idea and practice of writing tests alongside the code, and having minimum coverage thresholds. If you build that into your pipeline from the very beginning, you can blame the “process” when there aren’t enough tests.
At my job we treat all warnings as errors and you can't merge your pull requests unless all automatically triggered CI pipelines pass. It requires discipline, but once you get it into that state it's a lot easier to keep it there.
Sounds like what we used to call "professionalism." That was before "move fast, break things and blame the user" became the norm.
> professionalism." That was before "move fast, break things
I think you're professing a false dichotomy. Is it unprofessional to "move fast, break things"?
I'm a slow-moving yak shaver, partly due to conscious intention. I admire some outcomes from engineers that break things, like big rockets.
I definitely think we learn fast by breaking things: assuming we are scientific enough to design to learn without too much harm/cost.
It very much depends on the nature of your work.
If manual input can generate undefined behavior, you depend on a human making a correct decision, or you're dealing with real-world behavior using incomplete sensors to generate a model... sometimes, the only reasonable target is "fail gracefully". You cannot expect to generate right outputs from wrong inputs. It's not wrong to blame the user when economics, not just laziness, say that you need to trust the user not to do something unimaginable.
I think this is the kind of situation where a little professionalism would have prevented the issue: Handling uncaught exceptions in your threadpool/treemap combo would have prevented the problem from happening.
> That was before "move fast, break things and blame the user" became the norm.
When VCs effectively give you only 9 months of runway for real work (3 months of coffees, 9 months of actually getting work done, 3 more months of coffees to get the next round, and 3 more months because all the VCs backed out because your demo wasn't good enough), move fast and break things is how things are done.
If giving startups 5 years of runway was the norm, then yeah, we could all be professional.
> Rarely is this accepted by whoever chooses what we should work on.
I get that YMMV based on the org, but I find that more often than not, it’s expected that you are properly maintaining the software that you build, including the things like fixing warnings as you notice them. I can already feel the pushback coming, which is “no but really, we are NOT allowed to do ANYTHING but feature work!!!!” and… okay, I’m sorry, that really sucks… but maybe, MAYBE, some of that is you being defeatist, which is only adding to this culture problem. Oh well, now’s not the time to get fired for something silly like going against the status quo, so… I get it either way.
The other thing is: don't catch and ignore exceptions. Even "catch and log" is a bad idea unless you specifically know that program execution is safe to continue. Just let the exception propagate up to where something useful can be done, like returning a 500 or displaying an error dialog.
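A minimal sketch of the difference; OrderService and its lookup are illustrative stand-ins, not code from the article.

    import java.util.Map;

    public class OrderService {
        private final Map<String, String> repository = Map.of("42", "widget");

        // Anti-pattern: catch, log, and limp on with a surprise null.
        String loadOrderSwallowing(String id) {
            try {
                return lookup(id);
            } catch (RuntimeException e) {
                System.err.println("ignoring: " + e.getMessage());
                return null; // callers now continue in a broken state
            }
        }

        // Preferred: let the failure propagate to the layer that can respond,
        // e.g. the request handler that returns a 500 or shows an error dialog.
        String loadOrder(String id) {
            return lookup(id);
        }

        private String lookup(String id) {
            String order = repository.get(id);
            if (order == null) {
                throw new IllegalStateException("no such order: " + id);
            }
            return order;
        }
    }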
> Rarely is this accepted by whoever chooses what we should work on.
You need to find more disciplined teams.
There are still people out there who care about correctness and understand how to achieve it without it being an expensive distraction. It's a team culture factor that mostly just involves staying on top of these concerns as soon as they're encountered, so there's not some insurmountable and inscrutable backlog that makes it feel daunting and hopeless or that makes prioritization difficult.
Most teams are less disciplined than they should be. Also, job/team mobility is very low right now. So the question becomes, how do you increase discipline on the team you're on?
For very small teams, exploring new platforms and/or languages that complement correctness is an option. Using a statically typed language with explicitly managed side effects has made a huge difference for me. Super disruptive the larger the team, though, of course.
Yes, but… I suppose you have to pick your battles. There was recently a problem vexing me about a Rails project I maintain where the logs were filled with complaints about “unsupported parameters”, even though we painstakingly went through all the controllers and allowed them. It’s /probably/ benign, but it adds a lot of noise to the logs. Several of us took a stab at resolving it, but in the end we always had much higher priorities to address. Also it’s hard to justify spending so many hours on something that has little “business value”, especially when there is currently no functional impact.
It’s a nuisance issue sorta like hemorrhoids. Do you get the surgery and suffer weeks of extreme pain, or do you just deal with it? Hemorrhoids are mostly benign, but certainly have the potential to become more severe and very problematic. Maybe this phenomenon should be called digital hemorrhoids?
As someone with pretty bad hemorrhoids, I’m hesitant to ask my doctor about surgery because I’ve been told the hemorrhoids will come back, without question. So it’s even still just a temporary fix…
Couldn’t you just run a debugger to find all of the incidents of that issue?
We’ve been down many paths on this. In some cases we know exactly where it’s happening, but despite configuring everything correctly, it still complains. It might just be a bug in the Rails code or a fault in the way parameters are passed in (some of the endpoints take a lot of parameters, some of them optional). We could “fix” the issue by simply allowing all parameters, but of course this opens a security risk. This is a 10+ year old code base and I am told it has been a thorn in their side for a long time. It’s one of those battles that I suppose we are not going to try fighting unless we get really bored and have nothing else to work on.
Also, stack trace should show you everything you need to know to fix this, or am I missing something? (no experience with Ruby)
Otherwise, I see the cleanups and refactoring as part of the normal development process. There is no need to put such tasks in Jira - they must be done as preparation for the regular tasks. I can imagine that some companies take agile too seriously and want to micromanage every little task, but I guess lack of time for refactoring is not the biggest problem.
> Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.
I agree. I hate lingering warnings. Unfortunately, at the time of this bug I did not have static analysis tools to detect these code smells.
Another problem with lingering warnings is that it's easy to overlook that one new warning that's actually important amongst floods of older warnings.
And from a security perspective, the "might cause a problem 0.000001% of the time" flaws can often be manipulated into becoming a problem 100% of the time.
All security issues are a subclass of bugs. Security is a niche version of QA.
Ooof. The core collections in Java are well understood to not be thread-safe by design, and this should have been noticed.
OP should go through the rest of the code and see if there are other places where collections are potentially operated by multiple threads.
>The easiest way to fix this was to wrap the TreeMap with Collections.synchronizedMap or switch to ConcurrentHashMap and sort on demand.
That will make the individual map operations thread-safe, but given that nobody thought of concurrency, are you sure sequences of operations are thread-safe? That is, are you sure the object that owns the TreeMap is thread-safe?
>Controversial Fix: Track visited nodes
Don't do that! The collection will still not be thread-safe and you'll just fail in some other more subtle way now, or in the future (if the implementation changes in the next Java release).
>Sometimes, a detail oriented developer will notice the combination of threads and TreeMap, or even suggest to not use a TreeMap if ordered elements are not needed. Unfortunately, that didn’t happen in this case.
That's not a good take-away! OP, the problem is you violating the contract of the collection, which is clear that it isn't thread-safe. The problem ISN'T the side-effect of what happens when you violate the contract. If you change TreeMap to HashMap, it's still wrong (!!!), even though you may not get the side-effect of a high CPU utilization.
FTA: The code can be reduced to simply:
    public void someFunction(SomeType relatedObject,
                             List<SomeOtherType> unrelatedObjects) {
        ...
        treeMap.put(relatedObject.a(), relatedObject.b());
        ...
        // unrelatedObjects is used later on in the function so the
        // parameter cannot be removed
    }
That’s not true. The original code only does the treeMap.put if unrelatedObjects is nonempty. That may or may not be a bug.
You also would have to check that a and b return the same value every time, and that treeMap behaves like a map. If, for example, it logs updates, you’d have to check that changing that to log only once is acceptable.
Good point. It should be replaced with an if-not-empty check.
Does it not strike anyone else as odd that if someone said they had a single CPU, and that CPU were running a normal priority task at 100%, and that caused the machine to barely allow ssh, we'd say there's a much bigger problem than that someone is letting on?
No 32 core (thread, likely) machine should ever normally be in a state where someone can "barely ssh onto it". Is Java really that janky? Or is "barely ssh onto it" a bit hyperbolic?
Another way to get infinite loops is using a Comparator or Comparable implementation that doesn’t implement a consistent total order: https://stackoverflow.com/questions/62994606/concurrentskips... (This is unrelated to concurrency.)
Whether it occurs or not can depend on the specific data being processed, and the order in which it is being processed. So this can happen in production after seemingly working fine for a long time.
Have you seen this before in person? It would make a great blog post.
I haven't personally encountered a buggy comparator without a total order.
I have seen a lot of incorrect Comparators and Comparable implementations in existing code, but haven’t personally come across the infinite-loop situation yet.
To give one example, a common error is to compare two int values via subtraction, which can give incorrect results due to overflow and modulo semantics, instead of using Integer::compare (or some equivalent of its implementation).
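A small, self-contained demonstration of that pitfall: the subtraction overflows when the values are far apart, so the comparator reports the wrong order, while Integer.compare cannot overflow.

    import java.util.Comparator;

    public class SubtractionComparator {
        public static void main(String[] args) {
            Comparator<Integer> broken = (a, b) -> a - b;     // overflows
            Comparator<Integer> correct = Integer::compare;

            int small = Integer.MIN_VALUE;
            int big = 1;

            // small - big wraps around to a positive number, so "broken"
            // claims Integer.MIN_VALUE is greater than 1.
            System.out.println(broken.compare(small, big));   // positive (wrong)
            System.out.println(correct.compare(small, big));  // negative (right)
        }
    }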
Interesting. I haven’t seen an infinite loop either, but I can imagine one if a comparator tries to be too “clever” for example if it bases its comparison logic on some external state.
Another common source of comparator bugs is when people compare floats or doubles and they don’t account for NaN, which is unequal to everything, including itself!
In Java, the usual symptom of comparator bugs is that sort throws the infamous “Comparison method violates its general contract!” exception.
I knew someone who missed out on a gold medal at the International Olympiad of Informatics because his sort comparator didn’t have a total order.
Ouch. Any idea which problem? Those problems are public: https://ioinformatics.org/page/contests/10
What about spotting a cycle by using an incrementing counter and then throwing an exception if it goes above the tree depth or collection size (presuming one of these is tracked)?
Unlike the author’s hash set proposal it would require almost no memory or CPU overhead and may be more likely to be accepted.
That being said, in the decade plus I’ve used C# I’ve never found that I failed to consider concurrent access on data structures in concurrent situations.
That's much better. Constant memory. The number of nodes visited is guaranteed to be less than or equal to the height of the tree.
"I could barely ssh onto it"
Is there a way to ensure that whatever happens (CPU, network overloaded etc) one can always ssh in? Like reserve a tiny bit of stuff to the ssh daemon?
On Linux I’ve done this by pinning processes to a certain range of CPU cores, and the scheduler will just keep one core free or something. Which allows whatever I need in terms of management to execute on that one core, including SSH or UI.
Nice? Or maybe give the Systemd slice a special cgroup with a reservation.
I'd consider doing the inverse and give the JVM a lower priority instead.
Yep, ran into this way too many times. Performing concurrent operations on non thread-safe objects in java or generally in any language produces the most interesting bugs in the world.
Which is why you manage atomic access to non-thread-safe objects yourself, or use a thread-safe version of them when using them across threads.
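A minimal sketch of managing that access yourself (the names are illustrative): every use of the shared TreeMap, including compound read-then-write sequences, goes through the same lock.

    import java.util.TreeMap;
    import java.util.concurrent.locks.ReentrantLock;

    public class ManagedCounter {
        private final TreeMap<String, Integer> counts = new TreeMap<>();
        private final ReentrantLock lock = new ReentrantLock();

        public void increment(String key) {
            lock.lock();
            try {
                counts.merge(key, 1, Integer::sum); // compound update under the lock
            } finally {
                lock.unlock();
            }
        }

        public Integer firstValue() {
            lock.lock();
            try {
                return counts.isEmpty() ? null : counts.firstEntry().getValue();
            } finally {
                lock.unlock();
            }
        }
    }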
Multithreading errors are the worst to debug. In this case it's dead simple to identify at design time and warning flags should have gone up as soon as he started thinking about using any of the normal containers in a multithreaded environment.
Every time I think I'm sorta getting somewhere in my understanding of how to write code I see a comment like this that reminds me that the rabbithole is functionally infinite in both breadth and depth.
There's simply no straightforward default approach that won't have you running into and thinking through the most esoteric sounding problems. I guess that's half the fun!
It's not that bad. We just don't have the equivalent of GC for multi-threading yet, so the advice necessarily needs to be "just remember to take and release locks" (same as remembering to malloc and free).
Hopefully someone will invent something like STM [1] in the distant year of 2007 or so [2]. It has actual thread-safe data structures. Not just the current choice between wrong-answer-if-you-dont-lock and insane-crashing-if-you-dont-lock.
[1] https://www.adit.io/posts/2013-05-15-Locks,-Actors,-And-STM-...
[2] https://youtu.be/4caDLTfSa2Q?feature=shared
Rust takes pride in its 'fearless concurrency' (strict compile-time checks to ensure that locks or similar constructs are used for cross-thread data, alongside the usual channels and whatnot), while Go takes pride in its use of channels and goroutines for most tasks. Not everything is like the C/C++/C#/Java situation where synchronization constructs are divorced from the data they're responsible for.
For C++, abseil’s thread annotations are quite nice for getting closer to the Rust style of locking. Of course, the Rust style is still much easier to understand and less manual.
Synchronization primitives in Go are just as divorced as elsewhere, sometimes even more so - it does have channels, but Goroutines cannot yield a value, forcing you to employ a separate storage location together with WaitGroup/Mutex/RWMutex (which, unlike Rust's RWLock, is separate too, although C# lets you model it to an extent). This results in community developing libraries like https://github.com/sourcegraph/conc which attempt to replicate Rust's Futures / C#'s Tasks.
It's not a perfect situation, but C# has some dedicated collection classes for concurrent use - https://learn.microsoft.com/en-us/dotnet/api/system.collecti.... There's still some footguns possible, but knowing "I should use these collections instead of the regular versions" is less error-prone than needing to take/release locks at every single use site.
> "just remember to take and release locks"
If only it were so easy.
Tell that to inexperienced developers, or try making a massive single-threaded project multi-threaded.
I've been that developer making a single-threaded app multi-threaded. Best way to learn though!
Multi-threading - ain't nobody got time for that.
Yeah, our software politely waits for one customer to finish up with their GETs and POSTs before moving onto the next customer.
We have almost one '9' of uptime!
There are better ways than threading.
Yeah, like pretending you aren't.
I don't know what you mean.
Some (maybe most?) operations on Java Collections perform integrity checks to warn about such issues, for example a map throwing ConcurrentModificationException.
ConcurrentModificationException is typically thrown from an iterator when it detects that it’s been invalidated by a modification to the underlying collection. It’s harder to check for the case described in this article, which is about multiple threads calling put() concurrently on a non thread safe object.
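A minimal sketch of that fail-fast behaviour (it is best-effort, so detection is not guaranteed): structural modification during iteration is what the iterator catches, unlike the cross-thread put() scenario from the article.

    import java.util.HashMap;
    import java.util.Map;

    public class FailFastDemo {
        public static void main(String[] args) {
            Map<String, Integer> map = new HashMap<>();
            map.put("a", 1);
            map.put("b", 2);
            for (String key : map.keySet()) {
                // Adding a new key while iterating invalidates the iterator,
                // so the next call to next() typically throws
                // ConcurrentModificationException.
                map.put(key + "!", 0);
            }
        }
    }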
I've universally found that even when I am convinced that I am OK with the consequences of sharing something that isn't synchronized, the actual outcome is something I wasn't expecting.
The only things that should be shared without synchronization are readonly objects where the initialization is somehow externally serialized with accessors, and atomic scalars -- C++ std::atomic, Java has something similar, etc.
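For the "atomic scalars" case, a minimal Java sketch with illustrative names: a volatile flag for cross-thread visibility plus an AtomicLong for a shared counter, with no explicit lock needed for either.

    import java.util.concurrent.atomic.AtomicLong;

    public class ScalarSharing {
        private volatile boolean running = true;      // visible to all threads
        private final AtomicLong processed = new AtomicLong();

        public void worker() {
            while (running) {
                processed.incrementAndGet();          // atomic read-modify-write
            }
        }

        public void shutdown() {
            running = false;                          // observed by workers promptly
        }

        public long count() {
            return processed.get();
        }
    }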
I ran into my share of concurrency bugs, but one thing I could never intentionally trigger was any kind of inconsistency stemming from removing a "volatile" modifier from a mutable field in Java. Maybe the JVM I tried this with was just too awesome.
Were you only testing on x86 or any other "total store order" architecture? If so, removing the volatile modifier has less of an impact.
The author has discovered a flavor of the Poison Pill. More common in event sourcing systems, it’s a message that kills anything it touches, and then is “eaten” again by the next creature that encounters it which also dies a horrible death. Only in this case it’s live-locked.
Once the data structure gets into the illegal state, every subsequent thread gets trapped in the same logic bomb, instead of erroring on an NPE which is the more likely illegal state.
> Could an unguarded TreeMap cause 3,200% utilization?
I've seen the same thing with an undersynchronized java.util.HashMap. This would have been in like 2009, but afaik it can still happen today. iirc HashMap uses chaining to resolve collisions; my guess was it introduced a cycle in the chain somehow, but I just got to work nuking the bad code from orbit [1] rather than digging in to verify.
I often interview folks on concurrency knowledge. If they think a data race is only slightly bad, I'm unimpressed, and this is an example of why.
[1] This undersynchronization was far from the only problem with that codebase.
Exceptions in threads are an absolute killer.
Here's a story of a horror bughunt where the main characters are C++, select() and a thread brandishing an exception: https://news.ycombinator.com/item?id=42532979
I remember reading that article but being unable to understand it due to my lack of knowledge in the area. I will have to give it another go.
I was excited to see that not only does this article cover other languages, but that this error happens in Go. I was a bit surprised because map access is generally protected by a race detector, but the red-black tree used doesn't store anything in a map anywhere.
I wonder if the full race detector, go run -race, would catch it or not. I also want to explore whether using a slice in the RB tree instead of two different struct members would trigger the runtime race detector. So many things to try when I get back to a computer with Go on it.
Why does the fix need to remember all the nodes we have visited? Can't we just keep track of what span we are in? That way we just need to keep track of 2 nodes.
In the graphic from the example we would keep track like this:
low: -    high: -
low: 11   high: -
low: 23   high: -
low: 23   high: 26
Error: now we see item 13, but that is not inside our span!
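One way to read that proposal in code (a rough sketch; Node is a simplified stand-in, not TreeMap's internal entry class): while descending, every node must fall inside the (low, high) window implied by its ancestors, so a node outside the window signals a corrupted tree.

    final class SpanCheck {
        static final class Node {
            final int key;
            Node left, right;
            Node(int key) { this.key = key; }
        }

        // Every node must lie strictly inside the (low, high) window implied
        // by its ancestors; null means "unbounded" on that side.
        static void check(Node node, Integer low, Integer high) {
            if (node == null) {
                return;
            }
            if ((low != null && node.key <= low) || (high != null && node.key >= high)) {
                throw new IllegalStateException(
                        "Node " + node.key + " is outside its span (" + low + ", " + high + ")");
            }
            check(node.left, low, node.key);
            check(node.right, node.key, high);
        }
    }

Calling check(root, null, null) walks the whole tree with only the two bounds carried per level of descent.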
What I like here is the discovery of the extra loop, and then still digging down to discover the root cause of the competing threads. I think I would have removed the loop and called it done.
Just to probe the code review angle a little bit: shared mutable state should be a red/yellow flag in general. Whether or not it is protected from unsafe modification. That should jump out to you as a reviewer that something weird is happening.
I found this article very/deeply informative, memory-wise, on concurrency vs. optimizations and the troubles thereof:
Programming Language Memory Models
https://research.swtch.com/plmm
https://news.ycombinator.com/item?id=27750610
Very well said, and very nice to see references to others on point.
As a sidebar: I'm almost distracted by the clarity. The well-formed structure of this article is a perfect template for an AI evaluation of a problem.
It'd be interesting to generate a bunch of these articles, by scanning API docs for usage constraints, then searching the blog-sphere for articles on point, then summarizing issues and solutions, and even generating demo code.
Then presto! You have online karma! (and interviews...)
But then if only there were a way to credit the authors, or for them to trace some of the good they put out into the world.
So, a new metric: PageRank is about (sharing) links-in karma, but AuthorRank is partly about the links out, and in particular the degree to which they seem to be complete in coverage and correct and fair in their characterization of those links. Then a complementary page-quality metric identifies whether the page identifies the proper relationships between the issues, as reflected elsewhere, as briefly as possible in topological order.
Then given a set of ordered relations for each page, you can assess similarity with other pages, detect copying (er, learning), etc.
Should I read this as the Java TreeMap itself is thread unsafe, and the JVM is in a loop, or that the business logic itself was thread unsafe, and just needed to lock around its transactions?
"Note that this implementation is not synchronized. If multiple threads access a map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with an existing key is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the map. If no such object exists, the map should be "wrapped" using the Collections.synchronizedSortedMap method. This is best done at creation time, to prevent accidental unsynchronized access to the map..."
To add to this: Java’s original collection classes (Vector, Hashtable, …) were thread-safe, but it turned out that the performance penalty for that was too high, all the while still not catching errors when performing combinations of operations that need to be a single atomic transaction. This was one of the motivations for the thread-unsafe classes of the newer collection framework (ArrayList, HashMap, …).
> still not catching errors when performing combinations of operations that need to be a single atomic transaction
This is so important. The idea that code is thread-safe just because it uses a thread-safe data structure, and its cousin "This is thread-safe, because I have made all these methods synchronized" are... not frequent, but I've seen them expressed more often than I'd like, which is zero times.
The business logic is thread unsafe because it uses a TreeMap concurrently. It should either use something else or lock around all usages of the TreeMap. It does not seem to be "in itself" wrong, meaning for any other cause than the usage of TreeMap.
Does a ConcurrentSkipListMap not give the correct O(log N) guarantees on top of being concurrency friendly?
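For reference, ConcurrentSkipListMap does keep keys sorted and is safe for concurrent use, with expected O(log n) operations (probabilistic, via a skip list rather than a balanced tree). A tiny usage sketch:

    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class SkipListExample {
        public static void main(String[] args) {
            ConcurrentNavigableMap<Long, String> events = new ConcurrentSkipListMap<>();
            events.put(17L, "second");
            events.put(3L, "first");
            System.out.println(events.firstKey());   // 3
            System.out.println(events.headMap(10L)); // {3=first}
        }
    }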
java.util.concurrent is one of the best libraries ever. If you do something related to concurrency and don't reach for it to start, you're gonna have a bad time.
It's often a better design if you manage the concurrency in a high-level architecture and not have to deal with concurrency at the data structure level.
Designing your own concurrency structures instead of using ones designed by smart people who thought about the problem more collective hours than your entire lifetime is unwarranted hubris.
The fact that ConcurrentTreeMap doesn't exist in java.util.concurrent should be ringing loud warning bells.
It's not like you have a choice. Thread safety doesn't compose. A function that only uses thread-safe constructs may not itself be thread safe. This means that using concurrent data structures isn't any sort of guarantee, and doesn't save you from having to think about concurrency. It will prevent you from running into certain kinds of bugs, and that may be valuable, but you'll still have to do your own work on top.
If you're doing that anyway, it tends to be easier and more reliable to forget about thread safety at the bottom, and instead design your program so that thread safety happens in larger blocks, and you have big chunks of code that are effectively single-threaded and don't have to worry about it.
The GP comment is not about designing your own concurrency data structures. It’s about the fact that if your higher-level logic doesn’t take concurrency into account, using the concurrent collections as such will not save you from concurrency bugs. A simple example is when you have two collections whose contents need to be consistent with each other. Or if you have a check-and-modify operation that isn’t covered by the existing collection methods. Then access to them still has to be synchronized externally.
The concurrent collections are great, but they don’t save you from thinking about concurrent accesses and data consistency at a higher level, and managing concurrent operations externally as necessary.
The people behind the concurrent containers in java.util.concurrent are smart, but they are limited by the fact that they are working on a necessarily lower-level API. As an application programmer, you can easily switch the high-level architecture so as not to require any concurrent containers. Perhaps you have sharded your data beforehand. Perhaps you use something like a map-reduce architecture where the map step is parallel and requires no shared state.
Once they expose data structures that allow generic uses like "if size<5, add item" I'll take another look. Until then, their definition of thread-safety isn't quite the same as mine.
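A sketch of exactly that gap (illustrative code, not a real library API): each call below is thread-safe on its own, but the check-then-act pair is not atomic, so it still needs external coordination.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class BoundedPut {
        private final Map<String, String> map = new ConcurrentHashMap<>();

        // Each call is thread-safe on its own, but the pair is not atomic:
        // two threads can both see size() == 4 and push the map to 6 entries.
        public boolean unsafeAddIfRoom(String key, String value) {
            if (map.size() < 5) {       // check
                map.put(key, value);    // act (another thread may have run in between)
                return true;
            }
            return false;
        }

        // The compound operation has to be made atomic externally, e.g. with a
        // lock; this only helps if every writer goes through the same lock.
        public synchronized boolean addIfRoom(String key, String value) {
            if (map.size() < 5) {
                map.put(key, value);
                return true;
            }
            return false;
        }
    }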
Haha, I was recently running a backfill and was quite pleased when I managed to get it humming along at 6400% CPU on a 64-vCPU machine. Fortunately ssh was still receptive.
Anyone else mildly peeved by how CPU load is just load per core summed up to an arbitrary percentage all too often?
Why not just divide 100% by number of cores and make that the max, so you don't need to know the number of cores to know the actual utilization? Or better yet, have those microcontrollers that Intel tries to pass off as E and L cores take up a much smaller percentage to fit their general uselessness.
IDK but the current convention makes it easy to see single-threaded bottlenecks. So if my program is using 100% CPU and cannot go faster, I know where to look.
This is “Irix” vs. “Solaris” mode of counting, the latter being summed up to 100% for all cores. I think the modern approach would be to see how much of its TDP budget the core is using.
In practice, it's rarely an issue in C# because it offers excellent concurrent collections out of box, together with channels, tasks and other more specialized primitives. Worst case someone just writes a lock(thing) { ... } and calls it a day. Perhaps not great but not the end of the world either.
I did have to hand-roll something that partially replicates Rust's RWLock<T> recently, but the resulting semantics turned out to be decent, even if not providing the exact same level of assurance.
> I always thought of race conditions as corrupting the data or deadlocking. I never though it could cause performance issues. But it makes sense, you could corrupt the data in a way that creates an infinite loop.
Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.
It's a decent rule of thumb, but it definitely needs some pragmatism. Squashing any error, strangeness and warning can be very expensive in some projects, much more than paying the occasional seemingly-unrelated problem.
But of course it's quasi-impossible to know in advance the likelihood of a given error participating in a future problem, and whether it's cheaper to fix this error ahead or let the problem happen. So it becomes an art more than a science to decide what to focus on.
"fix nothing" is certainly a horrible approach, "fix everything" is often impractical. So you either need some sort of decision framework, or a combination of decent "instinct" (experience, really) and trust from your stakeholder (which comes from many places, including good communication and track record of being pragmatic over dogmatic)
I remember a project to get us down from like 1500 build warnings to sub 100. It took a long time, generated plenty of bikeshedding, and was impossible to demonstrate value.
I, personally, was mostly just pissed we didn't get it to zero. Unsurprisingly the number has climbed back up since
Could you propose to fail the build based on the number of warnings to ensure it doesn't go up?
I did something similar with spotbugs. There were existing warnings I couldn't get time to fix so I configured the maven to fail if it exceed the level at which I enabled it.
This has the unfortunate side effect that if it drops and no one adjusts the threshold then people can add more issues without failing the build.
Absolutely could!
However, management felt kinda burned because that was a bunch of time and unsurprisingly nobody was measurably more productive afterwards (it turns out those are just shitty code tidbits, but not highly correlated with areas which where it is miserable to make changes. Some of the over-refactorings probably made things harder.
It was a lovely measurable metric, making it an easy sell in advance. Which maybe was the problem idk.
> This has the unfortunate side effect that if it drops and no one adjusts the threshold then people can add more issues without failing the build.
Our tests are often written with a list of known exceptions. However, such tests also fail if an exception is no longer needed - with a congratulatory message and a notice that this exception should be removed from the list. This ensures that the list gets shorter and shorter.
If you can instead construct a list of existing instances to grandfather in, that doesn't suffer from this problem. Of course many linting tools do this via "ignore" code comments.
That feels less arbitrary than a magic number (because it is!) and I've seen it work.
We used this approach to great effect when we migrated a huge legacy project from Javascript to Typescript. It gives you enough flexibility in the in between stages so you're not forced to change weird code you don't know right away, while enforcing enough of a structure to eventually make it out alive in the end.
> Squashing any error, strangeness and warning can be very expensive in some projects
Strongly disagreed. Strange, unexpected behaviour of code is a warning sign that you have fallen short in defensive programming and you no longer have a mental model of your code that corresponds with reality. That is a very dangerous to be in. Very quickly possible to be stuck in quicksand not too far afterwards.
Depends a lot on the project, I think, as the parent comment suggests.
Fixing everything is impractical, but I'd say a safer rule of thumb would be to at least understand small strangenesses/errors. In the case of things that are hard to fix - e.g. design/architectural decisions that lead to certain performance issues or what have you - it's still usually not too time consuming to get a basic understanding of why something is happening.
Still better to quash small bugs and errors where possible, but at least if you know why they happen, you can prevent unforeseen issues.
Sometimes it can take a serious effort to understand why a problem is happening and I'll accept an unknown blip that can be corrected by occasionally hitting a reset button occasionally when dealing with third party software. From my experience my opinion aligns with yours though - it's also worth understanding why an error happens in something you've written, the times we've delayed dealing with mysterious errors that nobody in the team can ascribe we've ended up with a much larger problem when we've finally found the resources to deal with it.
Nobody wants to triage an issue for eight weeks, but one thing to keep in mind is that the more difficult it is to triage an issue the more ignorance about the system that process is revealing - if your most knowledgeable team members are unable to even triage an issue in a modest amount of time it reveals that your most knowledgeable team members have large knowledge gaps when it comes to comprehending your system.
This, at least, goes for a vague comprehension of the cause - there are times you'll know approximately what's going wrong but may get a question from the executive suite about the problem (i.e. "Precisely how many users were affected by the outage that caused us to lose our access_log") that might take weeks or months or be genuinely nigh-on-impossible to answer - I don't count questions like that as part of issue diagnosis. And if it's a futile question you should be highly defensive about developer time.
That's very fair - at least with third party software, it can be nigh impossible to track down a problem.
With third party libraries, I've too-often found myself reading the code to figure something out, although that's a frustrating enough experience I generally wouldn't wish on other people.
if there are any warnings I'm supposed to ignore then there are effectively no warnings.
there's nothing pagmatic about it. once I get into the habit of ignoring a few warnings that effectively means all warnings will be ignored
At my job we treat all warnings as errors and you can't merge your pull requests unless all automatically triggered CI pipelines pass. It requires discipline, but once you get it into that state it's a lot easier to keep it there.
The last point is the key.
It then creates immense value by avoiding a lot of risk and uncertainty for little effort.
Getting from "thousands of warnings" to zero isn't a good ROI in many cases, certainly not on a shortish term. But staying at zero is nearly free.
This is even more so with those "fifteen flickering tests" these 23 tests that have been failing and ignored or skipped for years.
It's also why I commonly set up a CI, testing systems, linters, continuous deployment before anything else. I'll most often have an entire CI and guidelines and build automation to deploy something that will only say "hello world". Because it's much easier to keep it green, clean and automated than to move there later on
That's because it moves from being a project to being a process. I've tried to express this at my current job.
They want to take time out to write a lot of unit tests, but they're not willing to change the process to allow/expect devs to add unit tests along with each feature they write.
I'll be surprised if all the tests are still passing two months after this project, since nobody runs them.
That’s why TDD (Test-Driven Development) has become a trend. I personally don’t like TDD’s philosophy of writing tests first, then the code (probably because I prefer to think of a solutions more linearly), but I do absolutely embrace the idea and practice of writing tests along side of the code, and having minimum coverage thresholds. If you build that into your pipeline from the very beginning, you can blame the “process” when there aren’t enough tests.
At my job we treat all warnings as errors and you can't merge your pull requests unless all automatically triggered CI pipelines pass. It requires discipline, but once you get it into that state it's a lot easier to keep it there.
Sounds like what we used to call "professionalism." That was before "move fast, break things and blame the user" became the norm.
> professionalism." That was before "move fast, break things
I think you're professing a false dichotomy. Is it unprofessional to "move fast, break things"?
I'm a slow moving yak shaver partly due to concious intention. I admire some outcomes from engineers that break things like big rockets.
I definitely think we learn fast by breaking things: assuming we are scientific enough to design to learn without too much harm/cost.
It very much depends on the nature of your work.
If manual input can generate undefined behavior, you depend on a human making a correct decision, or you're dealing with real-world behavior using incomplete sensors to generate a model...sometimes, the only reasonable target is "fail gracefully". You cannot expect to generate right outputs with wrong inputs. It's not wrong to blame the user when economics, not just laziness, say that you need to trust the user to not do something unimagineable.
I think this is the kind of situation where a little professionalism would have prevented the issue: Handling uncaught exceptions in your threadpool/treemap combo would have prevented the problem from happening.
> That was before "move fast, break things and blame the user" became the norm.
When VCs only give you effectively 9 months of runway (3 months of coffees, 9 months of actual getting work done, 3 months more coffees to get the next round, 3 more months because all the VCs backed out because your demo wasn't good enough), move fast and break things is how things are done.
If giving startups 5 years of runway was the norm, then yeah, we could all be professional.
> Rarely is this accepted by whoever chooses what we should work on.
I get that YMMV based on the org, but I find that more often than not, it’s expected that you are properly maintaining the software that you build, including the things like fixing warnings as you notice them. I can already feel the pushback coming, which is “no but really, we are NOT allowed to do ANYTHING but feature work!!!!” and… okay, I’m sorry, that really sucks… but maybe, MAYBE, some of that is you being defeatist, which is only adding to this culture problem. Oh well, now’s not the time to get fired for something silly like going against the status quo, so… I get it either way.
The other thing is don't catch and ignore exceptions. Even "catch and log" is a bad idea unless you specifically know that program execution is safe to continue. Just let the exception propagate up to where something useful can be done, like return 500 or displaying an error dialog.
> Rarely is this accepted by whoever chooses what we should work on.
You need to find more disciplined teams.
There are still people out there who care about correctness and understand how to achieve it without it being an expensive distraction. It a team culture factor that mostly just involves staying on top of these concerns as soon as they're encountered so there's not some insurmountable and inscrutable backlog that makes it feel daunting and hopeless or that makes prioritization difficult.
Most teams are less disciplined than they should be. Also, job/team mobility is very low right now. So the question becomes, how do you increase discipline on the team you're on?
For very small teams, exploring new platforms and / or languages that compliment correctness is an option. Using a statically typed language with explicit managed side effects has made a huge difference for me. Super disruptive the larger the team though of course.
Yes, but…I suppose you have to pick your battles. There was recently a problem vexing me about a Rails project I maintain where the logs were filled with complaints about “unsupported parameters”, even though we painstakingly went through all the controllers and allowed them. It’s /probably/ benign, but it adds a lot of noise to the logs. Several of us took a stab at resolving it, but in the end we always had much higher priorities to address. Also it’s hard to justify spending so many hours on something that has little “business value”, especially when there is currently no functional impact.
It’s a nuisance issue sorta like hemorrhoids. Do you get the surgery and suffer weeks of extreme pain, or do you just deal with it? Hemorrhoids are mostly benign, but certainly have the potential to become more severe and very problematic. Maybe this phenomenon should be called digital hemorroids?
As someone with pretty bad hemorrhoids, I’m hesitant to ask my doctor about surgery because I’ve been told the hemorrhoids will come back, without question. So it’s even still just a temporary fix…
Couldn’t you just run a debugger to find all of the incidents of that issue?
We’ve been down many paths on this. In some cases we know exactly where it’s happening, but despite configuring everything correctly, it still complains. It might just be a bug in the Rails code or a fault in the way parameters are passed in (some of the endpoints take a lot of parameters, some of them optional). We could “fix” the issue by simply allowing all parameters, but of course this opens a security risk. This is a 10+ year old code base and I am told it has been a thorn in their side for a long time. It’s one of those battles thar I suppose we are not going to try fighting unless we get really bored and have nothing else to work on.
Also, stack trace should show you everything you need to know to fix this, or am I missing something? (no experience with Ruby)
Otherwise, I see the cleanups and refactoring as part of normal development process. There is no need to put such tasks in Jira - they must be done as preparation for the regular tasks. I can imagine that some companies take agile too seriously and want to micromanage every little task, but I guess lack of time for refactoring is not the biggest problem.
> Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.
I agree. I hate lingering warnings. Unfortunately at the at time of this bug I did not have static analysis tools to detect these code smells.
Another problem with lingering warnings is that it's easy to overlook that one new warning that's actually important amongst floods of older warnings.
And from a security perspective, the "might cause a problem 0.000001% of the time" flaws can often be manipulated into becoming a problem 100% of the time.
All security issues are subclass of bugs. Security is a niche version of QA.
Ooof. The core collections in Java are well understood to not be thread-safe by design, and this should have been noticed.
OP should go through the rest of the code and see if there are other places where collections are potentially operated by multiple threads.
>The easiest way to fix this was to wrap the TreeMap with Collections.synchronizedMap or switch to ConcurrentHashMap and sort on demand.
That will make the individual map operations thread-safe, but given that nobody thought of concurrency, are you sure series of operations are thread-safe? That is, are you sure the object that owns the tree-map is thread-safe.
>Controversial Fix: Track visited nodes
Don't do that! The collection will still not be thread-safe and you'll just fail in some other more subtle way now, or in the future (if the implementation changes in the next Java release).
>Sometimes, a detail oriented developer will notice the combination of threads and TreeMap, or even suggest to not use a TreeMap if ordered elements are not needed. Unfortunately, that didn’t happen in this case.
That's not a good take-away! OP, the problem is you violating the contract of the collection, which is clear that it isn't thread-safe. The problem ISN'T the side-effect of what happens when you violate the contract. If you change TreeMap to HashMap, it's still wrong (!!!), even though you may not get the side-effect of a high CPU utilization.
FTA: The code can be reduced to simply:
That’s not true. The original code only does the treeMap.put if unrelatedObjects is nonempty. That may or may not be a bug.You also would have to check that a and b return the same value every time, and that treeMap behaves like a map. If, for example, it logs updates, you’d have to check that changing that to log only once is acceptable.
Good point. It should be replaced with an if not empty check.
Does it not strike anyone else as odd that if someone said they had a single CPU, and that CPU were running a normal priority task at 100%, and that caused the machine to barely allow ssh, we'd say there's a much bigger problem than that someone is letting on?
No 32 core (thread, likely) machine should ever normally be in a state where someone can "barely ssh onto it". Is Java really that janky? Or is "barely ssh onto it" a bit hyperbolic?
Another way to get infinite loops is using a Comparator or Comparable implementation that doesn’t implement a consistent total order: https://stackoverflow.com/questions/62994606/concurrentskips... (This is unrelated to concurrency.)
Whether it occurs or not can depend on the specific data being processed, and the order in which it is being processed. So this can happen in production after seemingly working fine for a long time.
Have you seen this before in person? It would make a great blog post.
I haven't personally encountered a buggy comparator without a total order.
I have seen a lot of incorrect Comparators and Comparable implementations in existing code, but haven’t personally come across the infinite-loop situation yet.
To give one example, a common error is to compare two int values via subtraction, which can give incorrect results due to overflow and modulo semantics, instead of using Integer::compare (or some equivalent of its implementation).
Interesting. I haven’t seen an infinite loop either, but I can imagine one if a comparator tries to be too “clever” for example if it bases its comparison logic on some external state.
Another common source of comparator bugs is when people compare floats or doubles and they don’t account for NaN, which is unequal to everything, including itself!
In Java, the usual symptom of comparator bugs is that sort throws the infamous “Comparison method violates its general contract!” exception.
I knew someone who missed out on a gold medal at the International Olympiad of Informatics because his sort comparator didn’t have a total order.
Ouch. Any idea which problem? Those problems are public: https://ioinformatics.org/page/contests/10
What about spotting a cycle by using an incrementing counter and then throwing an exception if it goes above the tree depth or collection size (presuming one of these is tracked)?
Unlike the author’s hash set proposal it would require almost no memory or CPU overhead and may be more likely to be accepted.
That being said, in the decade plus I’ve used C# I’ve never found that I failed to consider concurrent access on data structures in concurrent situations.
That's much better. Constant memory. The number of nodes is guaranteed to be less than or equal to the height of the tree.
"I could barely ssh onto it"
Is there a way to ensure that whatever happens (CPU, network overloaded etc) one can always ssh in? Like reserve a tiny bit of stuff to the ssh daemon?
On Linux I’ve done this by pinning processes to a certain range of CPU cores, and the scheduler will just keep one core free or something. Which allows whatever I need in terms of management to execute on that one core, including SSH orUI.
Nice? Or maybe give the Systemd slice a special cgroup with a reservation.
I'd consider doing the inverse and give the JVM a lower priority instead.
Yep, ran into this way too many times. Performing concurrent operations on non thread-safe objects in java or generally in any language produces the most interesting bugs in the world.
Which is why you manage atomic access to non-thread-safe objects yourself, or use a thread-safe version of them when using them across threads.
Multithreading errors are the worst to debug. In this case it's dead simple to identify at design time and warning flags should have gone up as soon as he started thinking about using any of the normal containers in a multithreaded environment.
Every time I think I'm sorta getting somewhere in my understanding of how to write code I see a comment like this that reminds me that the rabbithole is functionally infinite in both breadth and depth.
There's simply no straightforward default approach that won't have you running into and thinking through the most esoteric sounding problems. I guess that's half the fun!
It's not that bad. We just don't have the equivalent of GC for multi-threading yet, so the advice necessarily needs to be "just remember to take and release locks" (same as remembering to malloc and free).
Hopefully someone will invent something like STM [1] in the distant year of 2007 or so [2]. It has actual thread-safe data structures. Not just the current choice between wrong-answer-if-you-dont-lock and insane-crashing-if-you-dont-lock.
[1] https://www.adit.io/posts/2013-05-15-Locks,-Actors,-And-STM-...
[2] https://youtu.be/4caDLTfSa2Q?feature=shared
Rust takes pride in its 'fearless concurrency' (strict compile-time checks to ensure that locks or similar constructs are used for cross-thread data, alongside the usual channels and whatnot), while Go takes pride in its use of channels and goroutines for most tasks. Not everything is like the C/C++/C#/Java situation where synchronization constructs are divorced from the data they're responsible for.
For C++, abseil’s thread annotations are quite nice for getting closer to the Rust style of locking. Of course, the Rust style is still much easier to understand and less manual.
Synchronization primitives in Go are just as divorced as elsewhere, sometimes even more so - it does have channels, but Goroutines cannot yield a value, forcing you to employ a separate storage location together with WaitGroup/Mutex/RWMutex (which, unlike Rust's RWLock, is separate too, although C# lets you model it to an extent). This results in community developing libraries like https://github.com/sourcegraph/conc which attempt to replicate Rust's Futures / C#'s Tasks.
It's not a perfect situation, but C# has some dedicated collection classes for concurrent use - https://learn.microsoft.com/en-us/dotnet/api/system.collecti.... There's still some footguns possible, but knowing "I should use these collections instead of the regular versions" is less error-prone than needing to take/release locks at every single use site.
> "just remember to take and release locks"
If only it were so easy.
Tell that to inexperienced developers or making a massive single-thread project have multi-threaded capabilities.
I've been that developer making a single-threaded app multi-threaded. Best way to learn though!
Multi-threading - ain't nobody got time for that.
Yeah, our software politely waits for one customer to finish up with their GETs and POSTs before moving onto the next customer.
We have almost one '9' of uptime!
There are better ways than threading.
Yeah, like pretending you aren't
I don't know what you mean.
Some (maybe most?) operations on Java Collections perform integrity checks to warn about such issues, for example map throwing ConcurrentModificationException
ConcurrentModificationException is typically thrown from an iterator when it detects that it’s been invalidated by a modification to the underlying collection. It’s harder to check for the case described in this article, which is about multiple threads calling put() concurrently on a non thread safe object.
I've universally found that even when I am convinced that I am OK with the consequences of sharing something that isn't synchronized, the actual outcome is something I wasn't expecting.
The only things that should be shared without synchronization are readonly objects where the initialization is somehow externally serialized with accessors, and atomic scalars -- C++ std::atomic, Java has something similar, etc.
I ran into my share of concurrency bugs, but one thing I could never intentionally trigger was any kind of inconsistency stemming from removing a "volatile" modifier from a mutable field in Java. Maybe the JVM I tried this with was just too awesome.
Were you only testing on x86 or any other "total store order" architecture? If so, removing the volatile modifier has less of an impact.
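Worth noting that the classic repro for a missing volatile bites even on x86, because it's the JIT hoisting the field read out of the loop rather than the hardware reordering anything. Something like this may (or may not, depending on JVM and flags) spin forever:

    public class VisibilityDemo {
        // With 'volatile' the loop below is guaranteed to see the update;
        // without it, the JIT may hoist the read of 'stop' out of the loop.
        static /* volatile */ boolean stop = false;

        public static void main(String[] args) throws InterruptedException {
            Thread spinner = new Thread(() -> {
                while (!stop) {
                    // empty body: nothing here forces a re-read of 'stop'
                }
                System.out.println("spinner saw stop=true");
            });
            spinner.start();

            Thread.sleep(1000); // give the JIT time to compile the hot loop
            stop = true;        // may never become visible to 'spinner'
            spinner.join();
        }
    }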
The author has discovered a flavor of the Poison Pill. More common in event sourcing systems, it’s a message that kills anything it touches, and then is “eaten” again by the next creature that encounters it which also dies a horrible death. Only in this case it’s live-locked.
Once the data structure gets into the illegal state, every subsequent thread gets trapped in the same logic bomb, instead of erroring on an NPE which is the more likely illegal state.
> Could an unguarded TreeMap cause 3,200% utilization?
I've seen the same thing with an undersynchronized java.util.HashMap. This would have been in like 2009, but afaik it can still happen today. iirc HashMap uses chaining to resolve collisions; my guess was it introduced a cycle in the chain somehow, but I just got to work nuking the bad code from orbit [1] rather than digging in to verify.
I often interview folks on concurrency knowledge. If they think a data race is only slightly bad, I'm unimpressed, and this is an example of why.
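For anyone who wants to see the flavor of it, a sketch that hammers a plain HashMap from two threads (nothing here is from that codebase; the outcome is undefined, so it may finish fine, lose entries, throw, or, notoriously with the pre-Java-8 resize logic, spin at 100% CPU):

    import java.util.HashMap;
    import java.util.Map;

    public class HashMapRace {
        public static void main(String[] args) throws InterruptedException {
            Map<Integer, Integer> map = new HashMap<>(); // not thread-safe

            Runnable writer = () -> {
                for (int i = 0; i < 1_000_000; i++) {
                    map.put((int) (Math.random() * 100_000), i);
                }
            };

            Thread t1 = new Thread(writer);
            Thread t2 = new Thread(writer);
            t1.start();
            t2.start();
            t1.join();
            t2.join();

            // Any outcome is possible once the data race has happened.
            System.out.println("size = " + map.size());
        }
    }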
[1] This undersynchronization was far from the only problem with that codebase.
Exceptions in threads are an absolute killer.
Here's a story of a horror bughunt where the main characters are C++, select() and thread brandishing an exception: https://news.ycombinator.com/item?id=42532979
I remember reading that article but being unable to understand it due to my lack of knowledge in the area. I will have to give it another go.
I was excited to see that not only does this article cover other languages, but that this error happens in Go. I was a bit surprised because map access is generally protected by a race detector, but the RedBlack tree used doesn't store anything in a map anywhere.
I wonder if the full race detector, go run -race, would catch it or not. I also want to explore whether, if the RB tree used a slice instead of two different struct members, that would trigger the runtime race detector. So many things to try when I get back to a computer with Go on it.
Why does the fix need to remember all the nodes we have visited? Can't we just keep track of what span we are in? That way we just need to keep track of 2 nodes.
In the graphic from the example we would keep track like this:
What I like here is the discovery of the extra loop, and then still digging down to discover the root cause of the competing threads. I think I would have removed the loop and called it done.
Just to probe the code review angle a little bit: shared mutable state should be a red/yellow flag in general. Whether or not it is protected from unsafe modification. That should jump out to you as a reviewer that something weird is happening.
I found this article deeply informative, memory-wise - concurrency vs. optimizations and the troubles thereof:
Programming Language Memory Models
https://research.swtch.com/plmm
https://news.ycombinator.com/item?id=27750610
Very well said, and very nice to see references to others on point.
As a sidebar: I'm almost distracted by the clarity. The well-formed structure of this article is a perfect template for an AI evaluation of a problem.
It'd be interesting to generate a bunch of these articles, by scanning API docs for usage constraints, then searching the blogosphere for articles on point, then summarizing issues and solutions, and even generating demo code.
Then presto! You have online karma! (and interviews...)
But then if only there were a way to credit the authors, or for them to trace some of the good they put out into the world.
So, a new metric: PageRank is about (sharing) links-in karma, but AuthorRank is partly about the links out, and in particular the degree to which they seem to be complete in coverage and correct and fair in their characterization of those links. Then a complementary page-quality metric identifies whether the page identifies the proper relationships between the issues, as reflected elsewhere, as briefly as possible in topological order.
Then given a set of ordered relations for each page, you can assess similarity with other pages, detect copying (er, learning), etc.
Should I read this as the Java TreeMap itself is thread unsafe, and the JVM is in a loop, or that the business logic itself was thread unsafe, and just needed to lock around its transactions?
https://docs.oracle.com/en/java/javase/21/docs/api/java.base...
"Note that this implementation is not synchronized. If multiple threads access a map concurrently, and at least one of the threads modifies the map structurally, it must be synchronized externally. (A structural modification is any operation that adds or deletes one or more mappings; merely changing the value associated with an existing key is not a structural modification.) This is typically accomplished by synchronizing on some object that naturally encapsulates the map. If no such object exists, the map should be "wrapped" using the Collections.synchronizedSortedMap method. This is best done at creation time, to prevent accidental unsynchronized access to the map..."
To add to this: Java’s original collection classes (Vector, Hashtable, …) were thread-safe, but it turned out that the performance penalty for that was too high, all the while still not catching errors when performing combinations of operations that need to be a single atomic transaction. This was one of the motivations for the thread-unsafe classes of the newer collection framework (ArrayList, HashMap, …).
> still not catching errors when performing combinations of operations that need to be a single atomic transaction
This is so important. The idea that code is thread-safe just because it uses a thread-safe data structure, and its cousin "This is thread-safe, because I have made all these methods synchronized" are... not frequent, but I've seen them expressed more often than I'd like, which is zero times.
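The classic shape of that bug is check-then-act: each call is individually thread-safe, but the two-call sequence is not. A sketch (hypothetical names):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class CheckThenAct {
        private final Map<String, Integer> scores = new ConcurrentHashMap<>();

        // Broken: another thread can insert the key between the containsKey
        // check and the put, even though both calls are thread-safe on
        // their own.
        public void recordFirstBroken(String player, int score) {
            if (!scores.containsKey(player)) {
                scores.put(player, score);
            }
        }

        // Fixed: the check and the insert are one atomic operation.
        public void recordFirst(String player, int score) {
            scores.putIfAbsent(player, score);
        }
    }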
Java TreeMap is thread unsafe.
The business logic is thread unsafe because it uses a TreeMap concurrently. It should either use something else or lock around all usages of the TreeMap. It does not seem to be wrong "in itself", i.e. for any reason other than the unsynchronized use of the TreeMap.
Both! The TreeMap is thread unsafe. The business logic needs to protect against concurrent access to the TreeMap or not use the treemap at all.
Does a ConcurrentSkipListMap not give the correct O(log N) guarantees on top of being concurrency friendly?
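Something like this, if I'm reading the javadoc right - sorted, thread-safe for single-key operations, expected O(log n) (names are made up):

    import java.util.Map;
    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class SkipListExample {
        // Sorted and thread-safe, with expected O(log n) get/put/remove.
        private final ConcurrentNavigableMap<Long, String> eventsByTime =
                new ConcurrentSkipListMap<>();

        public void record(long timestamp, String event) {
            eventsByTime.put(timestamp, event);
        }

        public String earliestAfter(long timestamp) {
            // NavigableMap operations (ceiling, floor, subMap, ...) work
            // without any external locking.
            Map.Entry<Long, String> entry = eventsByTime.ceilingEntry(timestamp);
            return entry == null ? null : entry.getValue();
        }
    }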
java.util.concurrent is one of the best libraries ever. If you do something related to concurrency and don't reach for it to start, you're gonna have a bad time.
It's often a better design if you manage the concurrency in a high-level architecture and not have to deal with concurrency at the data structure level.
Designing your own concurrency structures instead of using ones designed by smart people who thought about the problem more collective hours than your entire lifetime is unwarranted hubris.
The fact that ConcurrentTreeMap doesn't exist in java.util.concurrent should be ringing loud warning bells.
It's not like you have a choice. Thread safety doesn't compose. A function that only uses thread-safe constructs may not itself be thread safe. This means that using concurrent data structures isn't any sort of guarantee, and doesn't save you from having to think about concurrency. It will prevent you from running into certain kinds of bugs, and that may be valuable, but you'll still have to do your own work on top.
If you're doing that anyway, it tends to be easier and more reliable to forget about thread safety at the bottom, and instead design your program so that thread safety happens in larger blocks, and you have big chunks of code that are effectively single-threaded and don't have to worry about it.
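For example, confining the non-thread-safe structure to a single owner thread and funnelling all access through it - a rough sketch using a single-threaded executor (names are illustrative):

    import java.util.TreeMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class OwnedByOneThread {
        // The TreeMap is only ever touched by tasks running on this
        // single-threaded executor, so it needs no synchronization at all.
        private final TreeMap<String, Integer> map = new TreeMap<>();
        private final ExecutorService owner = Executors.newSingleThreadExecutor();

        public Future<?> put(String key, int value) {
            return owner.submit(() -> map.put(key, value));
        }

        public Future<Integer> get(String key) {
            return owner.submit(() -> map.get(key));
        }

        public void shutdown() {
            owner.shutdown();
        }
    }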
The GP comment is not about designing your own concurrency data structures. It’s about the fact that if your higher-level logic doesn’t take concurrency into account, using the concurrent collections as such will not save you from concurrency bugs. A simple example is when you have two collections whose contents need to be consistent with each other. Or if you have a check-and-modify operation that isn’t covered by the existing collection methods. Then access to them still has to be synchronized externally.
The concurrent collections are great, but they don’t save you from thinking about concurrent accesses and data consistency at a higher level, and managing concurrent operations externally as necessary.
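Concretely, something like keeping an index and its reverse index consistent: even if both are ConcurrentHashMaps, the pair of updates still needs one external lock (sketch, hypothetical names):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class TwoMapsOneLock {
        private final Map<String, Long> idByName = new ConcurrentHashMap<>();
        private final Map<Long, String> nameById = new ConcurrentHashMap<>();

        // Each map is individually thread-safe, but without this lock another
        // thread could observe (or create) a state where the two disagree.
        public synchronized void register(String name, long id) {
            idByName.put(name, id);
            nameById.put(id, name);
        }

        public synchronized void unregister(String name) {
            Long id = idByName.remove(name);
            if (id != null) {
                nameById.remove(id);
            }
        }
    }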
The people behind the concurrent containers in java.util.concurrent are smart, but they are limited by the fact that they are working on a necessarily lower-level API. As an application programmer, you can easily switch the high-level architecture so as not to require any concurrent containers. Perhaps you have sharded your data beforehand. Perhaps you use something like a map-reduce architecture where the map step is parallel and requires no shared state.
Once they expose data structures that allow generic uses like "if size<5, add item" I'll take another look. Until then, their definition of thread-safety isn't quite the same as mine.
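i.e. a compound condition like this still needs its own lock, no matter which map implementation you pick (sketch):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class BoundedAdd {
        private final Map<String, String> cache = new ConcurrentHashMap<>();

        // The concurrent map cannot express "if size < 5, add" atomically,
        // so the compound check has to be guarded externally anyway.
        public synchronized boolean addIfRoom(String key, String value) {
            if (cache.size() < 5) {
                cache.put(key, value);
                return true;
            }
            return false;
        }
    }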
You're right. That would have been a better choice.
Haha, I was recently running a backfill and was quite pleased when I managed to get it humming along at 6400% CPU on a 64-vCPU machine. Fortunately ssh was still responsive.
Anyone else mildly peeved by how CPU load is just load per core summed up to an arbitrary percentage all too often?
Why not just divide 100% by number of cores and make that the max, so you don't need to know the number of cores to know the actual utilization? Or better yet, have those microcontrollers that Intel tries to pass off as E and L cores take up a much smaller percentage to fit their general uselessness.
IDK but the current convention makes it easy to see single-threaded bottlenecks. So if my program is using 100% CPU and cannot go faster, I know where to look.
This is “Irix” vs “Solaris” mode of counting, the latter being summed up to 100% for all cores. I think the modern approach would be to see how much of its TDP budget the core is using.
At least it was using all of the cores. The CPU running this application was cooking.
In practice, it's rarely an issue in C# because it offers excellent concurrent collections out of box, together with channels, tasks and other more specialized primitives. Worst case someone just writes a lock(thing) { ... } and calls it a day. Perhaps not great but not the end of the world either.
I did have to hand-roll something that partially replicates Rust's RwLock<T> recently, but the resulting semantics turned out to be decent, even if not providing the exact same level of assurance.
TL;DR: don't use thread unsafe data structures from multiple threads at once.
Java is usually shockingly inefficient.