The Little-Known Ways Ethereum Reveals User Location Data
At Devcon4, Geth developer Péter Szilágyi detailed the little-known ways that data about ethereum users can become public.
"People don't realize how much information is out in the open."
That's Péter Szilágyi, an ethereum core developer who manages the software client Geth, referring to the fact that little attention has been paid to the blockchain's underlying network layer, where information is sometimes exposed in complex and unpredictable ways.
By contrast, there is awareness of the implications of such exposure at the application level, which sits on top of the ethereum blockchain, a transparent system that publishes smart contract and transaction data, and research into how to better obscure user data there is accelerating.
In an interview, Szilagyi described the peer-to-peer components that underlie the world's second-largest blockchain by market capitalization as a "black magic thing."
This state of affairs was highlighted during his talk at the annual developer conference, Devcon4, in Prague last week. Szilágyi detailed a number of concerns that could cause user metadata to leak out over time – and under the worst-case scenario, provide the basis for an accurate, global map of ethereum user locations.
During last Friday's talk, Szilagyi focused on two ways in which this could happen: through websites such as the popular blockchain explorer Etherscan, and through "light clients" such as mobile or browser-based apps.
"When people are transitioning away from full nodes they are giving up certain guarantees and I just want to highlight what potential issues might arise," Szilagyi told CoinDesk.
Szilagyi began encountering the issues while pursuing a side project: an alternative to Facebook that is decentralized and private by default. Based on that research, Szilagyi said, metadata leaks make it difficult to interact anonymously with others.
"We don't have that in ethereum," Szilagyi explained. "The reason why these leaks began to bother me is because of that project."
Speaking on Friday, Szilágyi said that many of the problems are so deeply ingrained that it's hard to address them without running the risk of breaking applications that run on top of ethereum. Still, the developer detailed methods that could alleviate the risk for users.
"Most people in blockchain and ethereum they want to build on top, while there's a team at the bottom doing the dirty work," he told CoinDesk, adding:
'Weird trackers'
During his Devcon talk, Szilágyi broke down the various ways in which sensitive user information can be exposed by interacting with ethereum. Taking the example of Etherscan, Szilagyi said that a particular combination is revealed to the website when users access it – namely, a link between a user's IP address and their ethereum address.
That matters because an IP address, a numeric identifier assigned to a device on a network, can be used to estimate a user's location.
This information is then shared with Google Analytics and Etherscan. Plus, Etherscan's underlying comment tool – a popular website comment add-on named Disqus – also receives this info, and further shares that activity with its partners.
"Disqus actually reveals the IP-to-ethereum address mapping to Facebook, Twitter and Google Plus," Szilagyi said.
Disqus has 11 such integrations in total, such as YouTube, Vimeo and other services, that are given this information as well. The tool also contains other "weird trackers," Szilágyi explained, including artificial intelligence platforms and data marketplaces.
And that's notable because it doesn't just impact Etherscan, but any decentralized application (dapp) that uses the same tools.
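The linkage described above can be sketched in a few lines of Python. This is a minimal illustration, not how Etherscan, Disqus or any analytics service actually works: the log entries, the `/address/<hex>` path shape and the function name `link_ips_to_addresses` are all assumptions made for the example. It only shows that any party seeing both the visitor's IP and the page requested can pair the two.

```python
import re
from collections import defaultdict

# Assumed URL shape for an explorer's address page (illustrative only).
ADDRESS_PATH = re.compile(r"^/address/(0x[0-9a-fA-F]{40})$")


def link_ips_to_addresses(log_entries):
    """Group the ethereum addresses that each client IP looked up.

    log_entries is an iterable of (client_ip, request_path) pairs,
    a hypothetical stand-in for a web server or tracker log.
    """
    seen = defaultdict(set)
    for ip, path in log_entries:
        match = ADDRESS_PATH.match(path)
        if match:
            seen[ip].add(match.group(1).lower())
    return dict(seen)


# Made-up sample data using documentation-reserved IP ranges.
sample_log = [
    ("203.0.113.7", "/address/0x" + "ab" * 20),
    ("203.0.113.7", "/blocks"),
    ("198.51.100.4", "/address/0x" + "cd" * 20),
]
ip_to_addresses = link_ips_to_addresses(sample_log)
```

Every third-party script embedded in the page is in a position to collect the same pair, which is why the number of integrations matters.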
"This is an issue because you are essentially associating your IP-to-ethereum address mapping and you're revealing that to a whole lot of services," Szilagyi continued.
Etherscan has taken measures to remove these features, Szilagyi said. Currently, it uses Google Analytics, but the team behind it is looking to remove that aspect from the website. Etherscan previously relied on an external ad company and is taking steps to bring its ad network in-house as well.
But other dapps that are affected may not be as proactive as Etherscan in addressing the leaks, according to Szilagyi.
As he explained:
The same information – IP-to-ethereum address – is shared when users access other services as well, Szilágyi continued, like Infura, MetaMask, and MyCryptoWallet.
Discovery protocol
Szilagyi offered some other routes around this dilemma, including the use of the Tor network to hide IP addresses and the Brave browser to block online trackers.
But there are other, more subtle ways that access to ethereum can expose sensitive information as well, according to the developer. Taking the example of light clients – the stripped down, low-storage way for ethereum users to access the network – Szilagyi said that there are two kinds of activity on the network that are highly traceable.
The first is what is known as the "discovery protocol."
When light clients connect to the ethereum network, their IP address is also revealed. Because light clients continuously reconnect over time, the discovery protocol can reveal an accurate map of user locations.
"Every time I connect to the network I am actually revealing to the network that this machine which last week is in Berlin, this week was in Prague," Szilagyi said.
This location data is public, so in theory, anyone can scan the network to build a highly accurate, global map of ethereum user locations.
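To make the idea concrete, here is a rough Python sketch of how repeated scans could be turned into a movement history. It does not speak the discovery protocol at all. It assumes the scans have already produced `(date, node_id, ip)` records, and it uses a toy prefix table (`TOY_GEO`) in place of a real IP geolocation database. The node ID, dates, IPs and function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical IP-prefix-to-city table standing in for a geolocation database.
TOY_GEO = {"192.0.2.": "Berlin", "198.51.100.": "Prague"}


def locate(ip):
    """Return the city for an IP using the toy prefix table."""
    for prefix, city in TOY_GEO.items():
        if ip.startswith(prefix):
            return city
    return "unknown"


def build_history(daily_scans):
    """Map each node ID to a list of (date, city), recording only changes.

    daily_scans is an iterable of (date, node_id, ip) tuples with
    ISO-formatted date strings, so plain sorting orders them by time.
    """
    history = defaultdict(list)
    for date, node_id, ip in sorted(daily_scans):
        city = locate(ip)
        if not history[node_id] or history[node_id][-1][1] != city:
            history[node_id].append((date, city))
    return dict(history)


# Made-up scan results for a single node that changes network.
scans = [
    ("2018-10-25", "node-a", "192.0.2.10"),
    ("2018-10-26", "node-a", "192.0.2.10"),
    ("2018-11-01", "node-a", "198.51.100.23"),
]
history = build_history(scans)
```

The key ingredient is a node identifier that stays the same while the IP address changes, which is what lets separate observations be joined into one trail.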
"If you are willing to do this, for example, every day, just try to scan the network every day, then actually you can create an extremely accurate history of where each individual ethereum node was moving over time," Szilagyi said.
Additionally, key to how light clients work is the way in which the software minimizes activity by connecting to addresses that are associated with a user. But while this approach reduces bandwidth, latency and traffic, the impact is that IP and address relationships are rendered explicit on the network.
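A simple frequency count is enough to illustrate the kind of inference a server could draw. The Python sketch below is an assumption-laden toy, not the light client protocol: it supposes a server has logged `(ip, address)` pairs for the data each client requested, and it guesses that an IP "belongs" to an address when that address accounts for at least a chosen share of its requests. The threshold `min_share` and all sample values are hypothetical.

```python
from collections import Counter, defaultdict


def likely_address_per_ip(requests, min_share=0.5):
    """Guess the address each IP cares about most.

    requests is an iterable of (ip, address) pairs. An IP is mapped to
    its most-requested address only if that address makes up at least
    min_share of the IP's requests.
    """
    counts = defaultdict(Counter)
    for ip, address in requests:
        counts[ip][address] += 1

    result = {}
    for ip, counter in counts.items():
        address, hits = counter.most_common(1)[0]
        if hits / sum(counter.values()) >= min_share:
            result[ip] = address
    return result


# Made-up observations; addresses are shortened placeholders.
observed = [
    ("192.0.2.10", "0xaaa"),
    ("192.0.2.10", "0xaaa"),
    ("192.0.2.10", "0xbbb"),
    ("198.51.100.23", "0xccc"),
]
guesses = likely_address_per_ip(observed)
```

The more narrowly a client requests only its own data, the sharper this signal becomes, which is the trade-off against bandwidth described above.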
"Light servers will be able to statistically map out that this particular IP address is interested in one particular address," Szilagyi said.
As with the discovery protocol, this information is easily accessible. And unfortunately, connecting over Tor will actually damage the reliability of the light client.
"Now we don't have a world map of moving IPs, now we have a world map of moving ethereum addresses," Szilagyi said, adding:
Best practice
Unfortunately, according to Szilagyi, there's no simple fix for many of these problems, as some are inherent to how light clients and explorers function. But, speaking to the audience on Friday, the developer had precise recommendations to share with ethereum users and developers going forward.
Specifically, Szilagyi broke down three ways in which this information can be better concealed in the near-term.
First, he argued that users should run full nodes. While more hardware intensive, full nodes mean you can store all data locally and can access that data without interacting with anyone else. Additionally, because full nodes verify that ethereum's underlying state is correct, running a full node comes with security benefits as well.
"Although people don't like full nodes, full nodes are actually the best anonymizers in the ethereum ecosystem," Szilagyi said.
Secondly, Szilagyi contended that developers should look to the work that has been done on anonymizing network layers, such as Tor and I2P, for research on how to better conceal metadata leaks at the network level.
"Privacy on ethereum is bad, really, really bad. But that doesn't mean that it's an impossible task to solve," he said. "There have been 20 years of research going into how to do this properly, so let's at least try to learn from their results and try to fix it."
Lastly, Szilagyi urged developers not to blame users for poor privacy practices when interacting with ethereum. He also noted that many users may be unaware that options like the Tor browser exist in the first place.
As such, Szilagyi said: "It's kind of up to us as dapp and platform developers to figure it out and fix it."
With this in mind, Szilagyi ended on a note of caution. Pointing to Facebook as an example, the developer said that when privacy-enforcing characteristics aren't embedded at the start, such an approach might carry repercussions in the future.
"I don't think Facebook was created to gather user data, it wasn't created to abuse elections, that kind of just happened," Szilagyi said, concluding: