Øredev 2010 - Day 4 - Tim Jeanes

11 Nov 2010

Session 1 - Patterns of Parallel Programming - Ade Miller

[The source code from this session is available at: http://parallelpatterns.codeplex.com/]

With massively multi-processor PCs becoming increasingly mainstream, it's important to utilise the full scope of processing power available wherever possible. .NET 4 implements a bunch of features to make this relatively painless. Unless we take advantage of parallelism, our software may actually run more slowly as newer machines have more, slower cores, rather than a faster single one.

We saw the Visual Studio 2010 profiler. This is baked into VS2010 and shows clearly where the CPU is being used, together with disk access, etc, on a timeline. This looks really handy for identifying where the bottlenecks really lie. Using profiling is critical - understanding the application and where the problems are first is vital, rather than just wildly parallelising unnecessarily.

There are a couple of models for parallelism: one is task-based parallelism where we consider what tasks need to be done and run them in parallel. The other is data parallelism: for example in image processing you could split the image into pieces, process them in parallel and then stitch them together in the end.

In data parallelism, it's important to get the data chunk size right: too big and you're under-utilised; too small and you waste too much time thrashing.

You also have to take into account at runtime what degree of parallelism is appropriate: your software may end up running on a machine in a few years that has far more processors than were available when you wrote the software.

Rather than counting processors yourself and manually creating threads, it's better if we can hand this responsibility to the .NET framework and allow it to take care of the degree of parallelism itself. Ideally we just express where parallelism can take place.

In .NET we can do this using the Task<> class. We specify a task that needs to be performed, but we don't say when it starts. We only request a result from it. You can specify dependencies between tasks.
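
The session's examples were in .NET, but the idea carries over to other runtimes. As a rough sketch only, here is the same shape using Python's concurrent.futures: we hand work to a pool, get a future back, and only block when we ask for the result. The function names load_numbers and total are made up for illustration, and a future passed as an argument stands in for a dependency between tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def load_numbers():
    # Hypothetical first task: produce some data.
    return [1, 2, 3, 4]


def total(numbers_future):
    # Hypothetical second task: it depends on the first one,
    # so it waits for that task's result before summing.
    return sum(numbers_future.result())


with ThreadPoolExecutor(max_workers=2) as pool:
    # We say what needs doing; the pool decides when each task runs.
    numbers = pool.submit(load_numbers)
    summed = pool.submit(total, numbers)

    # Requesting the result is the only place the caller blocks.
    answer = summed.result()
```

Nothing in the calling code mentions threads or processor counts; that decision is left to the pool.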

There are a couple of standard patterns that are addressed for data parallelism: loops where items can be handled independently; and loops where the required result is some kind of aggregation of all items in the set.

The first of these is trivial: replace for() with Parallel.For() and you're done. Bear in mind though that you can never assume that the items will be processed in any kind of order at all. Parallel.ForEach can even be used on collections where you don't know the collection size up front.
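
To illustrate the "no guaranteed order" point, here is a small Python analogue (not the .NET API) of an independent parallel loop. The square function is a stand-in for whatever the loop body does. Results are collected as they finish, so they are keyed by input rather than relying on arrival order. Note that threads in standard CPython won't speed up CPU-bound work; a process pool would be the usual choice there.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(n):
    # Stand-in loop body: each item is handled independently.
    return n * n


items = range(10)
results = {}

with ThreadPoolExecutor() as pool:
    # Map each future back to the item it was created for.
    futures = {pool.submit(square, n): n for n in items}

    # as_completed yields futures in whatever order they finish.
    for future in as_completed(futures):
        results[futures[future]] = future.result()
```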

There's also an overload of Parallel.For that allows for data aggregation between threads. The only gotcha is to ensure you do your own locking in the step that combines the sub-aggregations from each parallel section. Locks are pretty bad in terms of performance though, so if you find you're getting a lot of them in your parallel tasks, it's a good idea to consider whether or not this is the right way to go.

This isn't a silver bullet: parallelism is still hell if your tasks need to share data or need to do a lot of synchronisation.
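
The aggregation pattern can be sketched in Python (again an analogue, not the Parallel.For overload itself). Each worker builds a private subtotal without any locking, and the lock is only taken once per chunk when the subtotal is folded into the shared total. The chunk size of 250 and the names here are arbitrary choices for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 1001))
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

total = 0
total_lock = threading.Lock()


def sum_chunk(chunk):
    global total
    subtotal = 0
    for value in chunk:
        subtotal += value      # private to this call, so no lock needed
    with total_lock:           # lock only for the combine step
        total += subtotal


with ThreadPoolExecutor(max_workers=4) as pool:
    # list() forces every chunk to finish and surfaces any exception.
    list(pool.map(sum_chunk, chunks))
```

Four lock acquisitions for a thousand items is cheap; locking inside the inner loop would not be.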

Task.WaitAll allows you to wait until all parallel tasks have completed; Task.WaitAny allows you to continue after just one has finished. Tasks can be cancelled if they're no longer needed. These last two can be combined if you're doing a parallel search for a single item in a large set.
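
As a sketch of the parallel search idea, here is a Python version using concurrent.futures.wait with FIRST_COMPLETED, which plays roughly the role of Task.WaitAny. The haystack, target and stop flag are all invented for the example. One wrinkle worth showing: a task that finishes without finding anything also counts as "completed", so we keep waiting until a real hit arrives.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

haystack = list(range(100_000))
target = 76_543
stop = threading.Event()


def search(start, end):
    # Scan one slice; bail out early if another worker already won.
    for i in range(start, end):
        if stop.is_set():
            return None
        if haystack[i] == target:
            return i
    return None


found = None
with ThreadPoolExecutor(max_workers=4) as pool:
    pending = {pool.submit(search, s, s + 25_000)
               for s in range(0, 100_000, 25_000)}

    while pending and found is None:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for future in done:
            if future.result() is not None:
                found = future.result()

    stop.set()                 # ask running workers to give up
    for future in pending:
        future.cancel()        # only stops tasks that haven't started yet
```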

The Pipeline pattern can be used where many tasks have to be performed on data items that are independent of one another. I.e. once Task A has finished with a data item, it can immediately be passed to Task B. Buffers exist between the tasks that can have size limits on them to ensure that processing capacity is used most where you need it. This can prevent thrashing and memory overflows.

In some cases it's appropriate to combine parallel strategies: if your pipeline has a bottleneck, that stage can itself be parallelised (much like adding more workers to the slow step of a production line).
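
A minimal pipeline sketch in Python, assuming two made-up stages (strip whitespace, then upper-case). The bounded queues are the size-limited buffers: when one fills up, put() blocks, so a fast stage can't run away from a slow one. The DONE sentinel and the buffer size of 4 are choices made for this example, not anything from the session.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    # Take items as they arrive, process them, pass them on.
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))


raw = queue.Queue(maxsize=4)      # bounded buffer before stage A
cleaned = queue.Queue(maxsize=4)  # bounded buffer between A and B
results = queue.Queue()           # unbounded, drained at the end

workers = [
    threading.Thread(target=stage, args=(str.strip, raw, cleaned)),
    threading.Thread(target=stage, args=(str.upper, cleaned, results)),
]
for worker in workers:
    worker.start()

for line in [" alpha ", " beta ", " gamma "]:
    raw.put(line)                 # blocks if the buffer is full
raw.put(DONE)

for worker in workers:
    worker.join()

output = []
while True:
    item = results.get()
    if item is DONE:
        break
    output.append(item)
```

With one worker per stage the items come out in the order they went in; running several workers on a slow stage would give up that guarantee.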

Session 2 - Run!

Billed as a 5km run, it was mercifully a little shorter than that. Man, running along the Swedish coast in November is cold!

What was I thinking?

Session 3 - Personal Kanban - Jim Benson

Building a personal kanban board for your own work (or even for your own dreams) can build a lot of clarity in your own mind. It removes the brain clutter that creates stress and dissatisfaction, giving clarity to your current position and how well you're doing at whatever it is you do.

Even in a work-related personal kanban, it's worth including non-work items. The fact that you're worried about a sick relative is a distraction to you today, so it belongs in your WIP column as it's a distraction to you that's impacting on your performance.

We tend to want to take on way more work than we can deal with, because we want to be productive - or at least be seen to be. We often don't recognise that we have our own WIP limit that, when exceeded, dramatically impacts on our productivity.

Kanban can also be used for meetings: it makes for a more flexible, dynamic agenda that contains things the attendees actually want to talk about. It also helps to keep the conversation focussed. I'm not totally convinced on this though - it's hard to say for sure when a discussion on a topic is definitely "done".

Session 4 - MongoDB - Mathias Stearn

MongoDB is a document-orientated NoSQL database. A document is essentially a JSON object, which gives a few advantages over a traditional SQL-based database.

As the data isn't stored in defined tables, all objects can be expanded dynamically. Also, as relationships aren't used except where needed, parent and child objects can be held as a single object.

For example, if you're storing a blog in a database, you'd hold each post as a document. That would include all tags and comments as array properties on the blog post object. Physically, these are all held in a single location on disk (effectively as a binary representation of the JSON string) making object retrieval very fast. Data writes are also fast. This makes MongoDB appropriate for high-traffic web apps, realtime analytics or high-speed data logging.
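
To make the blog example concrete, here is a sketch of what such a document might look like, written as a Python dictionary. The field names and values are invented. The JSON round trip is only a stand-in for the database's own binary storage format.

```python
import json

# One blog post, with its tags and comments embedded in the same object.
post = {
    "title": "Øredev 2010 - Day 4",
    "tags": ["conference", "mongodb"],
    "comments": [
        {"author": "alice", "text": "Nice write-up"},
    ],
}

# Child objects live inside the parent: no separate comments table.
post["comments"].append({"author": "bob", "text": "Thanks for the notes"})

# No fixed schema: a new property can be added to this one document.
post["rating"] = 5

# The whole post serialises and loads as a single unit.
serialised = json.dumps(post)
restored = json.loads(serialised)
```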

Querying the data is function-based rather than SQL-based, but this really only leads to a syntax difference. For example, db.places.find({zip:10011, tags:"business"}).limit(10); is the equivalent of a simple SQL SELECT. Pretty self-explanatory, and a little shorter than SQL. Critically though, there's been no join between a Business table and a Tag table that you'd get with SQL.
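
The reason no join is needed can be shown with a toy matcher in plain Python. This is not the MongoDB driver, just an in-memory imitation of the one rule at work here: a plain value in the criteria matches an array field if the array contains it. The places data and the find and matches helpers are hypothetical.

```python
places = [
    {"name": "Cafe", "zip": 10011, "tags": ["business", "coffee"]},
    {"name": "Park", "zip": 10011, "tags": ["outdoors"]},
    {"name": "Bank", "zip": 10012, "tags": ["business"]},
]


def matches(doc, criteria):
    # Every field in the criteria must match the document.
    for field, wanted in criteria.items():
        actual = doc.get(field)
        if isinstance(actual, list):
            if wanted not in actual:   # array field: membership test
                return False
        elif actual != wanted:         # scalar field: equality test
            return False
    return True


def find(collection, criteria, limit):
    hits = [doc for doc in collection if matches(doc, criteria)]
    return hits[:limit]


found = find(places, {"zip": 10011, "tags": "business"}, limit=10)
```

Because the tags sit inside each place document, the filter never has to look anywhere else.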

More complex queries are also possible, such as {latlog:{$near:[40,70]}}.

Data can be indexed by property to improve performance.

Updates to records are achieved by combining criteria that find the relevant document with an update operator such as $push, which appends a value to an array property on the document (adding a comment to a blog post, say).

Where appropriate, objects needn't be combined into single documents. Joins can be achieved by adding ObjectId references as properties on documents. There's no such thing as referential integrity in this case though.
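
As a rough in-memory imitation (plain Python, not the real driver), an update of this kind boils down to "find the first document matching the criteria, then append to one of its arrays". The push helper, the slug field and the sample posts are all made up for the sketch.

```python
posts = [
    {"slug": "oredev-day-4", "comments": []},
    {"slug": "oredev-day-3", "comments": []},
]


def push(collection, criteria, field, value):
    # Find the first document whose fields equal the criteria...
    for doc in collection:
        if all(doc.get(key) == wanted for key, wanted in criteria.items()):
            # ...and append to the array, creating it if it is missing.
            doc.setdefault(field, []).append(value)
            return True
    return False


updated = push(
    posts,
    {"slug": "oredev-day-4"},
    "comments",
    {"author": "bob", "text": "Thanks"},
)
```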

Actions on a single document can be chained together and will be treated atomically, giving you a rough equivalent to SQL transactions. There's no such thing as atomic operations across multiple collections.

MongoDB is impressively mature in terms of deployment features such as replication and database sharding.

Session 5 - Challenging Requirements - Gojko Adzic

Customers often ask you to implement their solution to a problem. This often leads to nightmare projects that are way bigger than they need to be. It's generally better to understand what the real problem is and solve that. The true solution may well turn out better than the one the customer initially identified.

Similarly, refuse to use the technology the customer specifies unless you first confirm that the technology actually matches their need. Often they'll think they know the best way to implement a solution, but another option may be far simpler and more appropriate.

Don't rush into solving the first problem they give you; keep asking "why" until you get to the money: that'll be their real requirement.

Know your stakeholders: who is going to use this and why?

Don't start with stories. Start with a very high level example of how people will use the system and push back to the business goals. The story you're presented with may well not be a realistic one.

Great products come not from following the spec; they come from understanding the real problem and whose problem it is.

Effect maps can be used to trace the purpose of all features. They ask why the feature is needed, who the people are that want the feature, then what the target group want to do and how the product should be designed to fulfil that.

Session 6 - Kanban and Scrum - making the most of both

OK, I think it's fair to say I'm officially totally in love with Kanban now. However, I'm also fairly fond of Scrum. Short of a Harry Hill solution to this dilemma, I attended this session to see how we could take the best of both worlds.

The key features of kanban are limiting the WIP at any stage and measuring the flow (typically by measuring the lead time - the time it takes for a task to cross the board).
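
Both measures are simple enough to work out by hand, but as a small worked example in Python (the cards, dates and WIP limit of 2 are invented):

```python
from datetime import date
from statistics import mean

WIP_LIMIT = 2  # assumed limit for the "in progress" column

cards = [
    {"name": "A", "started": date(2010, 11, 1), "finished": date(2010, 11, 4)},
    {"name": "B", "started": date(2010, 11, 2), "finished": date(2010, 11, 9)},
    {"name": "C", "started": date(2010, 11, 8), "finished": None},
]

# Lead time: days from a card entering the board to leaving it.
lead_times = [
    (card["finished"] - card["started"]).days
    for card in cards
    if card["finished"] is not None
]
average_lead_time = mean(lead_times)

# WIP limit: only pull new work if the column has room.
in_progress = sum(1 for card in cards if card["finished"] is None)
can_pull_new_card = in_progress < WIP_LIMIT
```

Here the two finished cards took 3 and 7 days, giving an average lead time of 5 days, and with one card in progress there is room to pull one more.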

Having a lot of parallel tasks or projects running simultaneously leads to more task switching, which leads to more downtime and delays, which leads to all the projects being completed later.

Doing tasks in series, perhaps with a background task to work on while the main project is blocked, keeps everyone focused and more productive, completing projects sooner: leading to happier customers and happier developers.

There's an example of the evolution of a kanban board at http://www.crisp.se/kanban/example

Scrum prescribes rules more than kanban does, such as planning and committing to sprints, regular releases and retrospectives. Some of these items can be useful to add to the basic kanban model, depending on what's appropriate for the company.

Kanban doesn't prescribe sprints (though they are allowed). I think we may well go without sprints, just because at Compsoft we need to be able to react much more quickly - it's often too hard to commit to a period of time during which our workload can't be altered.

Kanban focuses on having multi-ability teams, where team members frequently help out on tasks outside of their normal primary area of expertise. It's not that everyone has to do everything though (just as well - my Photoshop skills are pretty lacking).

Estimation is flexible in kanban. Some don't estimate at all - just count; some estimate in t-shirt sizes (S, M, L), some in story points, some in man-hours.
