- Explain when sets are used
- Describe sets as unordered unique collections of objects of arbitrary type
- Create a set
- Explain and use functions, methods and operators on sets including adding, deleting, membership, etc.

We previously introduced the notion of objects and Object-Oriented Programming,
but didn’t answer the question of how objects are defined. Objects are defined
by Classes, specifically objects are *instances* of classes.

We can think of the class as the blueprint describing what data and methods an object will have. Consider a “people” class. People will have attributes, e.g. name, age, and methods that perform computations on those attributes. An instance of the People class will be a specific “person”.

In Python classes are synonymous with types, that is each type is a Class. When
we invoke `help`

with a type we are obtaining information about the class, e.g.
`help(str)`

or `help(int)`

.

As we continue throughout the semester we will learn more about Classes. For the purpose of today we will just focus on how to create instances of a class.

Recall when we introduced Lists, we described a List as a “data structure”. And that data structures were a particular way of organizing data, with different data structures being designed for different kinds of computations.

Today we are going to introduce a new data structure,
sets and the `set`

class. In a mathematical context, a set is a collection of distinct values. So
too in Python (and CS generally). Specifically a set data structure is a:

- Unordered collection, in which
- All values are unique, i.e., there are no duplicates

A set differs from a List in these two key properties: recall that a List is ordered and can have duplicates.

These properties facilitate very fast operations like de-duplication, membership and more but also introduce some requirements on the kind of values we can store in a set. Specifically all values must be “hashable”. For our purposes, “hashable” implies “comparable” (so uniqueness can be enforced) and immutable so that prior comparisons aren’t invalidated by changing an object within the set.

What are the kinds of operations we would want to perform with sets?

- Create a new set
- Add values to a set
- Remove a value from a set
- Query if a value is in a set
- Set intersection (collection of values in two/both sets)
- Set union (collection of values in one or both sets)
- Set difference (“subtract” all values in another set)
- Set symmetric difference (collection of values only in one of two sets)

What operations are provided by Python `set`

s?

```
>>> help(set)
```

Let’s discuss some these in turn…

Recall that we could create `list`

s directly using square brackets

```
>>> l = [1, 2, 3, 4]
>>> type(l)
<class 'list'>
```

we can do something similar for `set`

s:

```
>>> s = {1, 2, 3, 4}
>>> type(s)
<class 'set'>
```

But we can also create `set`

s using the initializers. Initializer functions (often called constructors in other languages) are the name
of the functions that create (and initialize, hence the name) new instances (or objects) of a Class - they have
the same name as the type. The define how to “construct” objects from different
inputs.

The first part of the help describes the initializers:

```
>>> help(set)
...
class set(object)
| set() -> new empty set object
| set(iterable) -> new set object
...
```

The first creates an empty set (but is not equivalent to `{}`

, for reasons we
will learn about soon). The second creates a new `set`

from an `iterable`

object. We will learn more about `iterable`

s in lab, but for purposes of today
we just need to know that lists and strings are “iterable” and can be used to
construct a set (just in the way we could use a string to construct a list). For example:

```
>>> set()
set()
>>> set(l)
{1, 2, 3, 4}
>>> set("abcd")
{'c', 'a', 'd', 'b'}
```

What do we notice about the last set we created? The letters are no longer “in order”. Recall that sets are “unordered”. What about their second property, uniqueness?

```
>>> set("abcda")
{'c', 'a', 'd', 'b'}
```

We have used other initializers. Can you think of an example? Here are some:

```
>>> str(10)
'10'
>>> int("10")
10
```

We have two broad categories of set methods, those that mutate the object and those that don’t.

Using the help output, which of the following methods mutate the set object on which they were invoked?

`add`

: Mutate`clear`

: Mutate`difference`

: Non-mutating`difference_update`

: Mutate`intersection`

: Non-mutating`intersection_update`

: Mutate

Set methods in action:

```
>>> s = {1, 2, 3, 4}
>>> s.add(5)
>>> s
{1, 2, 3, 4, 5}
>>> s2 = {4, 5, 6, 7}
>>> s.difference(s2)
{1, 2, 3}
>>> s
{1, 2, 3, 4, 5}
>>> s2
{4, 5, 6, 7}
>>> s.union(s2)
{1, 2, 3, 4, 5, 6, 7}
>>> s.intersection(s2)
{4, 5}
>>> s.intersection_update(s2)
>>> s
{4, 5}
```

Why is there not a `union_update`

? There is, called just `update`

, as adding
values to set is effectively a union operation.

Note that some operators are “overloaded” to implement set operations:

`|`

: Set union`-`

: Set difference`&`

: Set intersection`^`

: Set symmetric difference

Are these operators mutating or not?

What about membership? We apply the `in`

operator:

```
>>> s2
{4, 5, 6, 7}
>>> 1 in s2
False
>>> 5 in s2
True
>>> "abc" in s2
False
```

What about comparisons? What would make sense in the context of sets? For lists
(and strings and other sequences) Python performs ordered
comparisons,
i.e. compares the first elements then the second elements and so on. But sets
are not ordered. Instead, set
comparisons
are defined in terms of subsets and supersets. For example, `a <= b`

if set `a`

is a subset of set `b`

, while `a >= b`

if set `a`

is a superset of set `b`

.

What about the indexing operator?

```
>>> s2[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'set' object does not support indexing
```

Recall sets are unordered, so there is no notion of indexing. But we can still
iterate, that is sets are `iterable`

```
>>> for val in s2:
... print(val)
...
4
5
6
7
```

Does that mean we could implement indexing if we really want to? Yes. But should we? Is there an semantic meaning of indices for a set?

*PI Questions (sets)*

Couldn’t we accomplish the same operations with lists? We could. And lists
actually do support some of the same operations as sets, e.g., `in`

```
>>> l = [1, 2, 3, 4]
>>> 4 in l
True
>>> "abc" in l
False
```

So why sets? The uniqueness invariant and performance. The `set`

class is implemented in way to speed up set
operations relative to lists. Consider implementing a function `contains`

that
implements the same functionality as `in`

.

```
def contains(list, item):
for val in list:
if val == item:
return True
return False
```

If we double the length of the list, how much longer will it take to execute
`contains`

? The time will also double. We would say `contains`

executes in
“linear” time with respect to the length of the input list.

In contrast, we can perform the same operation with `set`

in constant time (or
at worst, logarithmic time) with respect to the size of the set.

Let’s convince ourself with data. Checkout lists_vs_sets.

This script includes functions for creating lists and sets of a given size
(with random data). And then querying those data structures a given number
times. `speed_test`

times how long it takes to create and query the different
data structures.

For smaller input size, the performance is similar:

```
>>> speed_test(1000, 100)
List creation took 0.001810 seconds
Set creation took 0.002273 seconds
--
List querying took 0.001780 seconds
Set querying took 0.000172 seconds
```

As we increase the input sizes or the number of queries, we start to see real differences in performance:

```
>>> speed_test(1000, 100)
List creation took 0.001810 seconds
Set creation took 0.002273 seconds
--
List querying took 0.001780 seconds
Set querying took 0.000172 seconds
>>> speed_test(10000, 100)
List creation took 0.018895 seconds
Set creation took 0.024366 seconds
--
List querying took 0.016907 seconds
Set querying took 0.000159 seconds
>>> speed_test(10000, 1000)
List creation took 0.018688 seconds
Set creation took 0.018990 seconds
--
List querying took 0.156710 seconds
Set querying took 0.002958 seconds
>>> speed_test(100000, 1000)
List creation took 0.161801 seconds
Set creation took 0.187271 seconds
--
List querying took 1.591348 seconds
Set querying took 0.001975 seconds
```

Lets investigate in a more systematic way:

```
>>> speed_data(1000, 10000, 100000, 5000)
size list set
10000 0.157616 0.001780
15000 0.251193 0.001595
20000 0.312362 0.001505
25000 0.392334 0.001448
30000 0.480571 0.001570
35000 0.566010 0.002419
40000 0.634969 0.001507
45000 0.732042 0.001493
50000 0.815638 0.001570
55000 0.942068 0.001659
60000 1.031601 0.001606
65000 1.164648 0.001574
70000 1.288893 0.001734
75000 1.400574 0.001537
80000 1.443860 0.001565
85000 1.633924 0.001546
90000 1.784104 0.001501
95000 1.799544 0.001495
```

We can plot this in Excel. We will learn how to plot from with Python later in semester.

`set`

vs `list`

Does order matter? Use a `list`

. Is order unimportant but uniqueness or query
performance matters? Use a `set`

.

What are some situations when we might want to use a set? Membership testing, removing duplicates from a sequence, …

So far we have largely ignored questions about performance or efficiency. Our programs have been so small that efficiency has been a non-issue. But that is/will not always be the case. Has anyone encountered performance issues already?

What about in the standard deviation calculation? If you calculate the average in each loop iteration, your program could be very slow on the larger census datasets. We will talk more formally about computational complexity later in the semester, but we can start informally today.

We said finding an element in a list takes an amount of time that grows
linearly with the size of the list. Whereas finding an element in a set takes
constant time (or in the worst case, proportional to `log(size of set)`

). What
about the time for computing the average in each loop iteration of standard deviation? That
would increase as the square of the size of the dataset; if the size of the
dataset is *n*, computing the average takes “order” *n* time (summing all *n*
elements) and we are doing that computation for each data point, i.e. *n*
times. *n^2* can grow very quickly, even for “smallish” *n*.

We will continually be building up our “efficiency” toolbox throughout the semester. Our first two tools are:

- Choose the “right” data structure for the task at hand
- “Hoist” unchanging computations out of loops, that is compute the result once and assign to a variable to be used within the loop.